├── .gitignore
├── LICENSE
├── README.md
├── configs
    ├── llava_video
    │   ├── llava-video_lvbench.yaml
    │   ├── llava-video_mlvu.yaml
    │   ├── llava-video_videomme.yaml
    │   ├── retake_llava-video_lvbench.yaml
    │   ├── retake_llava-video_mlvu.yaml
    │   └── retake_llava-video_videomme.yaml
    ├── qwen2_vl
    │   ├── qwen2-vl_lvbench.yaml
    │   ├── qwen2-vl_mlvu.yaml
    │   ├── qwen2-vl_videomme.yaml
    │   ├── retake_qwen2-vl_lvbench.yaml
    │   ├── retake_qwen2-vl_mlvu.yaml
    │   └── retake_qwen2-vl_videomme.yaml
    ├── retake_demo.yaml
    └── retake_demo_npu.yaml
├── demo.py
├── docs
    ├── prepare_lvbench.md
    ├── prepare_mlvu.md
    └── prepare_videomme.md
├── environment.yaml
├── environment_npu.yaml
├── misc
    ├── Q8AZ16uBhr8_resized_fps2_mute.mp4
    └── overview.png
├── retake
    ├── dataset_utils.py
    ├── infer_eval.py
    ├── llava_onevision.py
    ├── longvideo_cache.py
    ├── monkeypatch.py
    ├── qwen2_vl.py
    └── visual_compression.py
└── scripts
    ├── infer_eval_retake.sh
    └── utils
        ├── build_lvbench_dataset.py
        ├── build_mlvu_dataset.py
        ├── build_mlvu_test_dataset.py
        ├── build_videomme_dataset.py
        ├── cal_flops.py
        ├── cal_ttft.py
        ├── convert_llava_video_weights_to_hf.py
        └── frame_extraction.py


/.gitignore:
--------------------------------------------------------------------------------
1 | /dataset
2 | /results
3 | */__pycache__


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 SCZwangxiao
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # [ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding](https://arxiv.org/abs/2412.20504)
  2 | 
  3 | ReTaKe is a novel approach for long video understanding that reduces temporal and knowledge redundancy, enabling MLLMs to process 8x longer video sequences (up to 2048 frames) under the same memory budget.
  4 | 
  5 | ---
  6 | 
  7 | ## 📢 Recent Updates
  8 | - **2025/03/11**: Polish the paper, improve the readability of the methods section, and add more ablation studies and results for LongVideoBench.
  9 | - **2025/02/01**: Support for the latest version of Transformers (v4.48).
 10 | - **2025/01/29**: Added support for LLaVA-Video and LLaVA-OneVision.
 11 | 
 12 | ---
 13 | 
 14 | ## 🚀 Key Contributions
 15 | 
 16 | - **Training-Free Framework**: ReTaKe is the first method to jointly model temporal and knowledge redundancy for long video understanding, reducing the model sequence length to 1/4 of the original with a relative performance loss within 1%.
 17 | 
 18 | - **Novel Techniques**:
 19 |   - **DPSelect**: A keyframe selection method to reduce low-level temporal redundancy.
 20 |   - **PivotKV**: A KV cache compression method to reduce high-level knowledge redundancy in long videos.
 21 | 
 22 | <p align="center">
 23 |   <img src="misc/overview.png" alt="Overview of ReTaKe" width="60%">
 24 | </p>
 25 | 
 26 | ---
 27 | 
 28 | ## ⚙️ Environment Setup
 29 | 
 30 | ### For GPU Users:
 31 | ```bash
 32 | conda env create -f environment.yaml
 33 | ```
 34 | 
 35 | ### For NPU Users:
 36 | ```bash
 37 | conda env create -f environment_npu.yaml
 38 | ```
 39 | 
 40 | ### Additional Dependencies:
 41 | ```bash
 42 | apt-get install ffmpeg  # Required for full functionality; quick demo does not require ffmpeg.
 43 | ```
 44 | 
 45 | ---
 46 | 
 47 | ## 🖥️ Quick Demo
 48 | 
 49 | ### Step 1: Update Configuration
 50 | Modify the `hf_qwen2vl7b_path` in `./demo.py` to point to your local path for `Qwen2-VL-7B-Instruct`.  
 51 | For NPU users, also update `config_path` to `'configs/retake_demo_npu.yaml'`.
 52 | 
 53 | ### Step 2 (Optional for LLaVA-Video): Convert Model
 54 | ```bash
 55 | # Convert LLaVA-Video model into Hugging Face format
 56 | # Ensure the following models are downloaded: Qwen2-7B-Instruct, siglip-so400m-patch14-384, and LLaVAVideoQwen2_7B.
 57 | python scripts/utils/convert_llava_video_weights_to_hf.py \
 58 |   --text_model_id /path_to/Qwen2-7B-Instruct \
 59 |   --vision_model_id /path_to/siglip-so400m-patch14-384 \
 60 |   --output_hub_path /path_to/llava-video-qwen2-7b-hf \
 61 |   --old_state_dict_id /path_to/LLaVAVideoQwen2_7B
 62 | ```
 63 | 
 64 | ### Step 3: Run the Demo
 65 | ```bash
 66 | python demo.py
 67 | ```
 68 | 
 69 | ---
 70 | 
 71 | ## 📊 Reproducing ReTaKe Results
 72 | 
 73 | ### Step 1: Prepare Datasets
 74 | Follow the documentation to prepare the required datasets:
 75 | - [VideoMME](docs/prepare_videomme.md)
 76 | - [MLVU](docs/prepare_mlvu.md)
 77 | - [LVBench](docs/prepare_lvbench.md)
 78 | 
 79 | ### Step 2: Run Inference and Evaluation
 80 | Use the provided script to perform inference and evaluation:
 81 | ```bash
 82 | bash scripts/infer_eval_retake.sh ${YOUR_PATH_TO_Qwen2-VL-7B-Instruct} configs/qwen2_vl/retake_qwen2-vl_videomme.yaml 8
 83 | bash scripts/infer_eval_retake.sh ${YOUR_PATH_TO_Qwen2-VL-7B-Instruct} configs/qwen2_vl/retake_qwen2-vl_mlvu.yaml 8
 84 | bash scripts/infer_eval_retake.sh ${YOUR_PATH_TO_Qwen2-VL-7B-Instruct} configs/qwen2_vl/retake_qwen2-vl_lvbench.yaml 8
 85 | ```
 86 | 
 87 | - Results will be saved in the `./results` directory.
 88 | 
 89 | ---
 90 | 
 91 | ## 📚 Citation
 92 | If you find this work helpful, please consider citing:
 93 | ```bibtex
 94 | @misc{xiao_retake_2024,
 95 |   author       = {Xiao Wang and
 96 |                   Qingyi Si and
 97 |                   Jianlong Wu and
 98 |                   Shiyu Zhu and
 99 |                   Li Cao and
100 |                   Liqiang Nie},
101 |   title        = {{ReTaKe}: {Reducing} {Temporal} and {Knowledge} {Redundancy} for {Long} {Video} {Understanding}},
102 |   year         = {2024},
103 |   note = {arXiv:2412.20504 [cs]}
104 | }
105 | ```
106 | 


--------------------------------------------------------------------------------
/configs/llava_video/llava-video_lvbench.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: llava_video
 3 | method: retake
 4 | attn_implementation: "flash_attention_2"
 5 | 
 6 | ### dataset
 7 | dataset_name: lvbench
 8 | anno_file: dataset/lvbench/lvbench.json
 9 | dataloader_num_workers: 4
10 | 
11 | ### data
12 | sample_fps: 2
13 | max_num_frames: 64
14 | longsize_resolution: 682 # short-side can be 384
15 | 
16 | ### generate
17 | do_sample: false
18 | 
19 | ### output
20 | output_dir: results/llava-video_lvbench_f64_2fps_r682/base
21 | 


--------------------------------------------------------------------------------
/configs/llava_video/llava-video_mlvu.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: llava_video
 3 | method: retake
 4 | attn_implementation: "flash_attention_2"
 5 | 
 6 | ### dataset
 7 | dataset_name: mlvu
 8 | anno_file: dataset/mlvu/mlvu.json
 9 | dataloader_num_workers: 4
10 | 
11 | ### data
12 | sample_fps: 2
13 | max_num_frames: 64
14 | longsize_resolution: 682 # short-side can be 384
15 | 
16 | ### generate
17 | do_sample: false
18 | 
19 | ### output
20 | output_dir: results/llava-video_mlvu_f64_2fps_r682/base
21 | 


--------------------------------------------------------------------------------
/configs/llava_video/llava-video_videomme.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: llava_video
 3 | method: retake
 4 | attn_implementation: "flash_attention_2"
 5 | 
 6 | ### dataset
 7 | dataset_name: videomme
 8 | anno_file: dataset/video_mme/video_mme.json
 9 | dataloader_num_workers: 4
10 | 
11 | ### data
12 | sample_fps: 2
13 | max_num_frames: 64
14 | longsize_resolution: 682 # short-side can be 384
15 | 
16 | ### generate
17 | do_sample: false
18 | 
19 | ### output
20 | output_dir: results/llava-video_video_mme_f64_2fps_r682/base
21 | 


--------------------------------------------------------------------------------
/configs/llava_video/retake_llava-video_lvbench.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: llava_video
 3 | method: retake
 4 | scaling_factor: 4
 5 | attn_implementation: "flash_attention_2"
 6 | longvideo_kwargs: {
 7 |   'frame_chunk_size': 32,
 8 |   'chunked_prefill_frames': 32,
 9 |   # Keyframe compression
10 |   'visual_compression': True,
11 |   'visual_compression_kwargs': {
12 |     'compression_ratio': 1.0,
13 |     'compression_method': 'Keyframe',
14 |     'patch_sync': False,
15 |     'return_keyframe_mask': True
16 |   },
17 |   # KVCache compression
18 |   'kvcache_compression': True,
19 |   'kvcache_compression_kwargs': {
20 |     'dynamic_compression_ratio': True,
21 |     'compression_method': 'pivotkv',
22 |     'pos_embed_reforge': True,
23 |     'max_input_length': 40000
24 |   },
25 | }
26 | 
27 | ### dataset
28 | dataset_name: lvbench
29 | anno_file: dataset/lvbench/lvbench.json
30 | dataloader_num_workers: 4
31 | 
32 | ### data
33 | sample_fps: 2
34 | max_num_frames: 1024
35 | longsize_resolution: 682
36 | 
37 | ### generate
38 | do_sample: false
39 | 
40 | ### output
41 | output_dir: results/llava-video_f1024_2fps_r682/retake_dp1-async_pivot-40k
42 | 


--------------------------------------------------------------------------------
/configs/llava_video/retake_llava-video_mlvu.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: llava_video
 3 | method: retake
 4 | scaling_factor: 4
 5 | attn_implementation: "flash_attention_2"
 6 | longvideo_kwargs: {
 7 |   'frame_chunk_size': 32,
 8 |   'chunked_prefill_frames': 32,
 9 |   # Keyframe compression
10 |   'visual_compression': True,
11 |   'visual_compression_kwargs': {
12 |     'compression_ratio': 1.0,
13 |     'compression_method': 'Keyframe',
14 |     'patch_sync': False,
15 |     'return_keyframe_mask': True
16 |   },
17 |   # KVCache compression
18 |   'kvcache_compression': True,
19 |   'kvcache_compression_kwargs': {
20 |     'dynamic_compression_ratio': True,
21 |     'compression_method': 'pivotkv',
22 |     'pos_embed_reforge': True,
23 |     'max_input_length': 40000
24 |   },
25 | }
26 | 
27 | ### dataset
28 | dataset_name: mlvu
29 | anno_file: dataset/mlvu/mlvu.json
30 | dataloader_num_workers: 4
31 | 
32 | ### data
33 | sample_fps: 2
34 | max_num_frames: 1024
35 | longsize_resolution: 682
36 | 
37 | ### generate
38 | do_sample: false
39 | 
40 | ### output
41 | output_dir: results/llava-video_rope4_mlvu_f1024_2fps_r682/retake_dp1-async_pivot-40k
42 | 


--------------------------------------------------------------------------------
/configs/llava_video/retake_llava-video_videomme.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: llava_video
 3 | method: retake
 4 | scaling_factor: 4
 5 | attn_implementation: "flash_attention_2"
 6 | longvideo_kwargs: {
 7 |   'frame_chunk_size': 32,
 8 |   'chunked_prefill_frames': 32,
 9 |   # Keyframe compression
10 |   'visual_compression': True,
11 |   'visual_compression_kwargs': {
12 |     'compression_ratio': 1.0,
13 |     'compression_method': 'Keyframe',
14 |     'patch_sync': False,
15 |     'return_keyframe_mask': True
16 |   },
17 |   # KVCache compression
18 |   'kvcache_compression': True,
19 |   'kvcache_compression_kwargs': {
20 |     'dynamic_compression_ratio': True,
21 |     'compression_method': 'pivotkv',
22 |     'pos_embed_reforge': True,
23 |     'max_input_length': 40000
24 |   },
25 | }
26 | 
27 | ### dataset
28 | dataset_name: videomme
29 | anno_file: dataset/video_mme/video_mme.json
30 | dataloader_num_workers: 4
31 | 
32 | ### data
33 | sample_fps: 2
34 | max_num_frames: 1024
35 | longsize_resolution: 682
36 | 
37 | ### generate
38 | do_sample: false
39 | 
40 | ### output
41 | output_dir: results/llava-video_rope4_video_mme_f1024_2fps_r682/retake_dp1-async_pivot-40k
42 | 


--------------------------------------------------------------------------------
/configs/qwen2_vl/qwen2-vl_lvbench.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: qwen2_vl
 3 | method: retake
 4 | scaling_factor: 4
 5 | attn_implementation: "flash_attention_2"
 6 | 
 7 | ### dataset
 8 | dataset_name: lvbench
 9 | anno_file: dataset/lvbench/lvbench.json
10 | dataloader_num_workers: 2
11 | 
12 | ### data
13 | sample_fps: 2
14 | max_num_frames: 256
15 | longsize_resolution: 448
16 | 
17 | ### generate
18 | do_sample: false
19 | 
20 | ### output
21 | output_dir: results/qwen2vl_7b_lvbench_f256_2fps_r448/base
22 | 


--------------------------------------------------------------------------------
/configs/qwen2_vl/qwen2-vl_mlvu.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: qwen2_vl
 3 | method: retake
 4 | scaling_factor: 4
 5 | attn_implementation: "flash_attention_2"
 6 | 
 7 | ### dataset
 8 | dataset_name: mlvu
 9 | anno_file: dataset/mlvu/mlvu.json
10 | dataloader_num_workers: 2
11 | 
12 | ### data
13 | sample_fps: 4
14 | max_num_frames: 256
15 | longsize_resolution: 448
16 | 
17 | ### generate
18 | do_sample: false
19 | 
20 | ### output
21 | output_dir: results/qwen2vl_7b_mlvu_f256_4fps_r448/base
22 | 


--------------------------------------------------------------------------------
/configs/qwen2_vl/qwen2-vl_videomme.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: qwen2_vl
 3 | method: retake
 4 | scaling_factor: 4
 5 | attn_implementation: "flash_attention_2"
 6 | 
 7 | ### dataset
 8 | dataset_name: videomme
 9 | anno_file: dataset/video_mme/video_mme.json
10 | dataloader_num_workers: 2
11 | 
12 | ### data
13 | sample_fps: 4
14 | max_num_frames: 256
15 | longsize_resolution: 448
16 | 
17 | ### generate
18 | do_sample: false
19 | 
20 | ### output
21 | output_dir: results/qwen2vl_7b_videomme_f256_4fps_r448/base
22 | 


--------------------------------------------------------------------------------
/configs/qwen2_vl/retake_qwen2-vl_lvbench.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: qwen2_vl
 3 | method: retake
 4 | scaling_factor: 4
 5 | attn_implementation: "flash_attention_2"
 6 | longvideo_kwargs: {
 7 |   'frame_chunk_size': 128,
 8 |   'chunked_prefill_frames': 32,
 9 |   # Keyframe compression
10 |   'visual_compression': True,
11 |   'visual_compression_kwargs': {
12 |     'compression_ratio': 1.0,
13 |     'compression_method': 'Keyframe',
14 |     'patch_sync': False,
15 |     'return_keyframe_mask': True
16 |   },
17 |   # KVCache compression
18 |   'kvcache_compression': True,
19 |   'kvcache_compression_kwargs': {
20 |     'dynamic_compression_ratio': True,
21 |     'compression_method': 'pivotkv',
22 |     'pos_embed_reforge': True,
23 |     'max_input_length': 32000
24 |   },
25 | }
26 | 
27 | ### dataset
28 | dataset_name: lvbench
29 | anno_file: dataset/lvbench/lvbench.json
30 | dataloader_num_workers: 2
31 | 
32 | ### data
33 | sample_fps: 2
34 | max_num_frames: 2048
35 | longsize_resolution: 448
36 | 
37 | ### generate
38 | do_sample: false
39 | 
40 | ### output
41 | output_dir: results/qwen2vl_7b_lvbench_f2048_2fps_r448/retake_dp1-async_pivot-32k
42 | 


--------------------------------------------------------------------------------
/configs/qwen2_vl/retake_qwen2-vl_mlvu.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: qwen2_vl
 3 | method: retake
 4 | scaling_factor: 4
 5 | attn_implementation: "flash_attention_2"
 6 | longvideo_kwargs: {
 7 |   'frame_chunk_size': 128,
 8 |   'chunked_prefill_frames': 32,
 9 |   # Keyframe compression
10 |   'visual_compression': True,
11 |   'visual_compression_kwargs': {
12 |     'compression_ratio': 1.0,
13 |     'compression_method': 'Keyframe',
14 |     'patch_sync': False,
15 |     'return_keyframe_mask': True
16 |   },
17 |   # KVCache compression
18 |   'kvcache_compression': True,
19 |   'kvcache_compression_kwargs': {
20 |     'dynamic_compression_ratio': True,
21 |     'compression_method': 'pivotkv',
22 |     'pos_embed_reforge': True,
23 |     'max_input_length': 32000
24 |   },
25 | }
26 | 
27 | ### dataset
28 | dataset_name: mlvu
29 | anno_file: dataset/mlvu/mlvu.json
30 | dataloader_num_workers: 2
31 | 
32 | ### data
33 | sample_fps: 4
34 | max_num_frames: 2048
35 | longsize_resolution: 448
36 | 
37 | ### generate
38 | do_sample: false
39 | 
40 | ### output
41 | output_dir: results/qwen2vl_7b_mlvu_f2048_4fps_r448/retake_dp1-async_pivot-32k
42 | 


--------------------------------------------------------------------------------
/configs/qwen2_vl/retake_qwen2-vl_videomme.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | model_name: qwen2_vl
 3 | method: retake
 4 | scaling_factor: 4
 5 | attn_implementation: "flash_attention_2"
 6 | longvideo_kwargs: {
 7 |   'frame_chunk_size': 128,
 8 |   'chunked_prefill_frames': 32,
 9 |   # KVCache compression
10 |   'kvcache_compression': True,
11 |   'kvcache_compression_kwargs': {
12 |     'dynamic_compression_ratio': True,
13 |     'compression_method': 'pivotkv',
14 |     'pos_embed_reforge': True,
15 |     'max_input_length': 32000
16 |   },
17 | }
18 | 
19 | 
20 | ### dataset
21 | dataset_name: videomme
22 | anno_file: dataset/video_mme/video_mme.json
23 | dataloader_num_workers: 2
24 | 
25 | ### data
26 | sample_fps: 4
27 | max_num_frames: 2048
28 | longsize_resolution: 448
29 | 
30 | ### generate
31 | do_sample: false
32 | 
33 | ### output
34 | output_dir: results/qwen2vl_7b_video_mme_f2048_4fps_r448/retake_pivot-32k
35 | 


--------------------------------------------------------------------------------
/configs/retake_demo.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | method: retake
 3 | scaling_factor: 4
 4 | attn_implementation: "flash_attention_2"
 5 | longvideo_kwargs: {
 6 |   'frame_chunk_size': 128,
 7 |   'chunked_prefill_frames': 32,
 8 |   # Keyframe compression
 9 |   'visual_compression': True,
10 |   'visual_compression_kwargs': {
11 |     'compression_ratio': 1.0,
12 |     'compression_method': 'Keyframe',
13 |     'patch_sync': False,
14 |     'return_keyframe_mask': True
15 |   },
16 |   # KVCache compression
17 |   'kvcache_compression': True,
18 |   'kvcache_compression_kwargs': {
19 |     'dynamic_compression_ratio': True,
20 |     'compression_method': 'pivotkv',
21 |     'pos_embed_reforge': True,
22 |     'max_input_length': 32000
23 |   },
24 | }
25 | 
26 | ### data
27 | sample_fps: 4
28 | max_num_frames: 2048
29 | longsize_resolution: 448
30 | 
31 | ### generate
32 | do_sample: false


--------------------------------------------------------------------------------
/configs/retake_demo_npu.yaml:
--------------------------------------------------------------------------------
 1 | ### model
 2 | method: retake
 3 | scaling_factor: 4
 4 | attn_implementation: "eager" # NPU does not support sdpa attention now
 5 | longvideo_kwargs: {
 6 |   'frame_chunk_size': 16, # Trade-off beteen peak memory and speed
 7 |   'chunked_prefill_frames': 16, # Trade-off beteen peak memory and speed
 8 |   # Keyframe compression
 9 |   'visual_compression': True,
10 |   'visual_compression_kwargs': {
11 |     'compression_ratio': 1.0,
12 |     'compression_method': 'Keyframe',
13 |     'patch_sync': False,
14 |     'return_keyframe_mask': True
15 |   },
16 |   # KVCache compression
17 |   'kvcache_compression': True,
18 |   'kvcache_compression_kwargs': {
19 |     'dynamic_compression_ratio': True,
20 |     'compression_method': 'pivotkv',
21 |     'pos_embed_reforge': True,
22 |     'max_input_length': 32000
23 |   },
24 | }
25 | 
26 | ### data
27 | sample_fps: 4
28 | max_num_frames: 2048
29 | longsize_resolution: 448
30 | 
31 | ### generate
32 | do_sample: false


--------------------------------------------------------------------------------
/demo.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import cv2
  3 | import math
  4 | import yaml
  5 | from PIL import Image
  6 | from typing import List, Union
  7 | 
  8 | import torch
  9 | import numpy as np
 10 | from torchvision.transforms.functional import pil_to_tensor
 11 | from transformers import AutoProcessor
 12 | 
 13 | import retake
 14 | 
 15 | 
 16 | def get_frame_indices(total_frames, max_num_frames, sample_fps, extraction_fps):
 17 |     # Get number of sampled frames
 18 |     sample_frames = float(total_frames / extraction_fps) * sample_fps
 19 |     sample_frames = min(total_frames, max_num_frames, sample_frames)
 20 |     sample_frames = math.floor(sample_frames)
 21 |     sample_frames = int(sample_frames / 2) * 2
 22 |     # Get sampled frame indices
 23 |     frame_indices = np.linspace(0, total_frames - 1, sample_frames).astype(np.int32)
 24 |     return frame_indices
 25 | 
 26 | 
 27 | def load_specific_frames(cap, frame_indices):
 28 |     # List to store the frames
 29 |     frames = []
 30 |     # Read frames from the video
 31 |     for frame_index in frame_indices:
 32 |         # Set the video position to the desired frame index
 33 |         cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
 34 |         # Read the frame
 35 |         ret, frame = cap.read()
 36 |         # If the frame was read successfully, append it to the list
 37 |         if ret:
 38 |             # Convert the frame from BGR to RGB
 39 |             frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
 40 |             # Create a PIL Image from the frame
 41 |             frame = Image.fromarray(frame_rgb)
 42 |             frames.append(frame)
 43 |         else:
 44 |             ValueError(f"Warning: Could not read frame at index {frame_index}. It may be out of range.")
 45 |     return frames
 46 | 
 47 | 
 48 | def load_video(video_path: str, max_num_frames: int, fps: Union[int, float]=None, frame_extraction_fps: Union[int, float]=None):
 49 |     """Load video frames at fps. If total frames larger than `max_num_frames`, do downsample.
 50 |        If 'fps' is `None`, load uniformly sample `max_num_frames` frames.
 51 | 
 52 |        video_path: Should either be a videofile or a directory of extracted frames.
 53 | 
 54 |        # NOTE: The extract frames must have name pattern of `%06d.(ext)`, or the loaded frame order will be wrong.
 55 |     """
 56 |     if video_path.startswith("file://"):
 57 |         video_path = video_path[7:]
 58 |     if os.path.isdir(video_path): # directory extracted frames
 59 |         assert frame_extraction_fps is not None
 60 |         pass
 61 |     else: # filename of a video
 62 |         # Open the video file
 63 |         cap = cv2.VideoCapture(video_path)
 64 |         if not cap.isOpened():
 65 |             raise ValueError("Error: Could not open video.")
 66 |         total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
 67 |         frame_extraction_fps = cap.get(cv2.CAP_PROP_FPS)
 68 |         # Get indices of sampled frame
 69 |         frame_indices = get_frame_indices(total_frames, max_num_frames, fps, frame_extraction_fps)
 70 |         # Get frames
 71 |         frames = load_specific_frames(cap, frame_indices)
 72 |         # Release the video capture object
 73 |         cap.release()
 74 | 
 75 |     # Convert into RGB format
 76 |     frames = [
 77 |         frame.convert("RGB") if frame.mode != "RGB" else frame
 78 |         for frame in frames
 79 |     ]
 80 | 
 81 |     return frames
 82 | 
 83 | 
 84 | def resize_image_longside(image, image_resolution):
 85 |     r"""
 86 |     Pre-processes a single image.
 87 |     """
 88 |     if max(image.width, image.height) > image_resolution:
 89 |         resize_factor = image_resolution / max(image.width, image.height)
 90 |         width, height = int(image.width * resize_factor), int(image.height * resize_factor)
 91 |         image = image.resize((width, height), resample=Image.NEAREST)
 92 | 
 93 |     return image
 94 | 
 95 | 
 96 | def resize_video_longside(frames: List, video_resolution):
 97 |     """
 98 |     frames: list of PIL images.
 99 |     """
100 |     frames = [
101 |         resize_image_longside(frame, video_resolution)
102 |         for frame in frames
103 |     ]
104 |     return frames
105 | 
106 | 
107 | def load_yaml(file_path):
108 |     with open(file_path, 'r') as file:
109 |         data = yaml.safe_load(file)
110 |     return data
111 | 
112 | 
113 | def fetch_video(video_info, max_num_frames, sample_fps, longsize_resolution):
114 |     frames = load_video(video_info['video'], max_num_frames, sample_fps, video_info.get('frame_extraction_fps', None))
115 |     frames = resize_video_longside(frames, longsize_resolution)
116 |     frames = [pil_to_tensor(frame) for frame in frames]
117 |     return frames
118 | 
119 | 
120 | def load_and_patch_model(model_name, hf_model_path, exp_configs, device):
121 |     model_name = model_name if model_name is not None else exp_configs['model_name']
122 |     model_name = model_name.lower().replace('-', '').replace('_', '')
123 |     if model_name == 'qwen2vl': # QWen2VL
124 |         from transformers import Qwen2VLConfig, Qwen2VLForConditionalGeneration
125 |         from retake.monkeypatch import patch_qwen2vl, patch_qwen2vl_config
126 |         retake.qwen2_vl.DEBUG_MODE = True
127 |         patch_qwen2vl(exp_configs['method']) # Replace some functions of QWen2VL with those from ReTaKe
128 |         qwen2vl_config = Qwen2VLConfig.from_pretrained(hf_model_path)
129 |         qwen2vl_config = patch_qwen2vl_config(qwen2vl_config, exp_configs)
130 |         model = Qwen2VLForConditionalGeneration.from_pretrained(
131 |             hf_model_path,
132 |             config=qwen2vl_config,
133 |             torch_dtype=torch.bfloat16,
134 |             attn_implementation=exp_configs.get('attn_implementation', None),
135 |             device_map=device # "auto"
136 |         ).eval()
137 |         processor = AutoProcessor.from_pretrained(hf_model_path)
138 |     elif model_name in ['llavaonevision', 'llavavideo']: # LLaVA-OneVision, LLaVA-Video
139 |         from transformers import LlavaOnevisionConfig, LlavaOnevisionForConditionalGeneration
140 |         from retake.monkeypatch import patch_llava_onevision, patch_llava_onevision_config
141 |         retake.llava_onevision.DEBUG_MODE = True
142 |         patch_llava_onevision(exp_configs['method']) # Replace some functions of LLaVA-Video with those from ReTaKe
143 |         llava_onevision_config = LlavaOnevisionConfig.from_pretrained(hf_model_path)
144 |         llava_onevision_config = patch_llava_onevision_config(llava_onevision_config, exp_configs)
145 |         processor = AutoProcessor.from_pretrained(hf_model_path) 
146 |         model = LlavaOnevisionForConditionalGeneration.from_pretrained(
147 |             hf_model_path, 
148 |             config=llava_onevision_config,
149 |             torch_dtype=torch.bfloat16,
150 |             attn_implementation=exp_configs.get('attn_implementation', None),
151 |             device_map=device # "auto"
152 |         ) 
153 |     else:
154 |         raise NotImplementedError
155 |     return model, processor
156 | 
157 | 
158 | DEMO_VIDEO = 'misc/Q8AZ16uBhr8_resized_fps2_mute.mp4'
159 | DEMO_QUESTIONS = [
160 |     "As depicted in the video, how is the relationship between the rabbit and human?\nOptions:\nA. Hostile.\nB. Friend.\nC. Cooperator.\nD. No one is correct above.\nAnswer with the option's letter from the given choices directly.",
161 |     "What is the impression of the video?\nOptions:\nA. Sad.\nB. Funny.\nC. Horrible.\nD. Silent.\nAnswer with the option's letter from the given choices directly.",
162 |     "What is the subject of the video?\nOptions:\nA. Rabbit likes to eat carrots.\nB. How to raise a rabbit.\nC. A rabbit gives people trouble.\nD. A rabbit performs for food.\nAnswer with the option's letter from the given choices directly.",
163 | ]
164 | EXPECTED_ANSWERS = ['A', 'B', 'C']
165 | 
166 | 
167 | if __name__ == "__main__":
168 |     #------------------- Modify the following configs ------------------#
169 |     hf_model_path = 'Qwen/Qwen2-VL-7B-Instruct' # TODO: replace to local path if you have trouble downloading huggingface models
170 |     model_name = 'qwen2_vl'
171 |     # hf_model_path = '/path_to/llava-video-qwen2-7b-hf'
172 |     # model_name = 'llava_video'
173 |     # hf_model_path = 'llava-hf/llava-onevision-qwen2-7b-ov-hf'
174 |     # model_name = 'llava_onevision'
175 | 
176 |     # NOTE: for Nvidia GPUs
177 |     config_path = 'configs/retake_demo.yaml'
178 |     device = 'cuda:0'
179 | 
180 |     # NOTE: for NPUs or GPUs without support for FlashAttention
181 |     # config_path = 'configs/retake_demo_npu.yaml'
182 |     # device = 'npu:0'
183 | 
184 |     #------------------------ No need to change ------------------------#
185 |     video_info = {"type": "video", 
186 |                   "video": DEMO_VIDEO, 
187 |                   "fps": 2.0}
188 | 
189 |     exp_configs = load_yaml(config_path)
190 | 
191 |     model, processor = load_and_patch_model(model_name, hf_model_path, exp_configs, device)
192 | 
193 |     # Video
194 |     video = fetch_video(video_info, exp_configs['max_num_frames'], exp_configs['sample_fps'], exp_configs['longsize_resolution'])
195 |     for question, expect_answer in zip(DEMO_QUESTIONS, EXPECTED_ANSWERS):
196 |         conversation = [
197 |             {
198 |                 "role": "user",
199 |                 "content": [
200 |                     {"type": "video"},
201 |                     {"type": "text", "text": question},
202 |                 ],
203 |             }
204 |         ]
205 | 
206 |         # Preprocess the inputs
207 |         text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
208 |         print('Input prompt:\n', text_prompt)
209 | 
210 |         inputs = processor(text=[text_prompt], videos=[video], padding=True, return_tensors="pt")
211 |         inputs = inputs.to(device)
212 |         inputs['pixel_values_videos'] = inputs['pixel_values_videos'].to(torch.bfloat16)
213 | 
214 |         # Inference: Generation of the output
215 |         output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
216 |         generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
217 |         output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
218 |         output_text = output_text[0]
219 |         print('Output text:\n', output_text)
220 |         print('Expected answer:\n', expect_answer)
221 | 


--------------------------------------------------------------------------------
/docs/prepare_lvbench.md:
--------------------------------------------------------------------------------
 1 | ## Prepare LVBench Dataset
 2 | 
 3 | 
 4 | ### Step 1: download LVBench data from [huggingface](https://huggingface.co/datasets/THUDM/LVBench/tree/main)
 5 | ```bash
 6 | git clone https://huggingface.co/datasets/THUDM/LVBench # Contain annotations only
 7 | git clone https://huggingface.co/datasets/AIWinter/LVBench # Contain videos only
 8 | ```
 9 | Move all_files in `AIWinter/LVBench` into `THUDM/LVBench`.
10 | 
11 | Denote the root directory of download LVBench dataset as `lvbench_root`, it should has the following structure:
12 | ```
13 | ${lvbench_root}/
14 | ├── docs/
15 | ├── video_info.meta.jsonl
16 | ├── all_videos_split.zip.001
17 | ├── all_videos_split.zip.002
18 | ├── ...
19 | └── all_videos_split.zip.014
20 | ```
21 | 
22 | 
23 | ### Step 2: Unzip everything
24 | ```bash
25 | cd ${lvbench_root}
26 | cat all_videos_split.zip.* > all_videos.zip
27 | unzip all_videos.zip
28 | ```
29 | 
30 | 
31 | ### Step 3: Extract frames of all videos
32 | ```bash
33 | cd ${retake_repo_root}
34 | python scripts/utils/frame_extraction.py \
35 | --videofile_tpl ${lvbench_root}/all_videos/'*.mp4' \
36 | --results_dir ${lvbench_root}/video_25fps \
37 | --fps 25 \
38 | --num_workers 32
39 | ```
40 | 
41 | 
42 | ### Step 4: Build LVBench dataset
43 | ```bash
44 | cd ${retake_repo_root}
45 | python scripts/utils/build_lvbench_dataset.py --hf_root ${lvbench_root}
46 | ```
47 | Note that you can NOT modify folder `${lvbench_root}/video_25fps` after this step, since the absolute path of extracted frames are written into annotation files `lvbench.json`:
48 | ```
49 | retake_repo_root/
50 | ├── dataset/
51 |     ├── lvbench/
52 |         ├── lvbench.json
53 | ├── ...
54 | ```


--------------------------------------------------------------------------------
/docs/prepare_mlvu.md:
--------------------------------------------------------------------------------
 1 | ## Prepare MLVU Dataset
 2 | 
 3 | 
 4 | ### Step 1: download MLVU dataset from [huggingface](https://huggingface.co/datasets/MLVU/MVLU)
 5 | ```bash
 6 | git clone https://huggingface.co/datasets/MLVU/MVLU
 7 | ```
 8 | 
 9 | Denote the root directory of download MLVU dataset as `mlvu_root`, it should has the following structure:
10 | ```
11 | ${mlvu_root}/
12 | ├── MLVU/
13 |     ├── json
14 |         ...
15 |     ├── video
16 |         ...
17 | ├── figs/
18 | ```
19 | 
20 | 
21 | ### Step 2: Extract frames of all videos
22 | ```bash
23 | cd ${retake_repo_root}
24 | python scripts/utils/frame_extraction.py \
25 | --videofile_tpl ${mlvu_root}/MLVU/video/'*/*.mp4' \
26 | --results_dir ${mlvu_root}/MLVU/video_25fps \
27 | --fps 25 \
28 | --num_workers 32
29 | ```
30 | 
31 | 
32 | ### Step 3: Build MLVU dataset
33 | ```bash
34 | cd ${retake_repo_root}
35 | python scripts/utils/build_mlvu_dataset.py --hf_root ${mlvu_root}
36 | ```
37 | Note that you can NOT modify folder `${mlvu_root}/MLVU/video_25fps` after this step, since the absolute path of extracted frames are written into annotation files `mlvu.json`:
38 | ```
39 | retake_repo_root/
40 | ├── dataset/
41 |     ├── mlvu/
42 |         ├── mlvu.json
43 | ├── ...
44 | ```


--------------------------------------------------------------------------------
/docs/prepare_videomme.md:
--------------------------------------------------------------------------------
 1 | ## Prepare VideoMME Dataset
 2 | 
 3 | 
 4 | ### Step 1: download VideoMME dataset from [huggingface](https://huggingface.co/datasets/lmms-lab/Video-MME)
 5 | ```bash
 6 | git clone https://huggingface.co/datasets/lmms-lab/Video-MME
 7 | ```
 8 | 
 9 | Denote the root directory of download VideoMME dataset as `videomme_root`, it should has the following structure:
10 | ```
11 | ${videomme_root}/
12 | ├── videomme/
13 | ├── subtitle.zip
14 | ├── videos_chunked_01.zip
15 | ├── videos_chunked_02.zip
16 | ├── ...
17 | └── videos_chunked_20.zip
18 | ```
19 | 
20 | 
21 | ### Step 2: Unzip everything
22 | ```bash
23 | cd ${videomme_root}
24 | unzip subtitle.zip
25 | cat videos_chunked_*.zip > videos.zip
26 | unzip videos.zip
27 | ```
28 | 
29 | 
30 | ### Step 3: Extract frames of all videos
31 | ```bash
32 | cd ${retake_repo_root}
33 | python scripts/utils/frame_extraction.py \
34 | --videofile_tpl ${videomme_root}/data/'*.mp4' \
35 | --results_dir ${videomme_root}/data_25fps \
36 | --fps 25 \
37 | --num_workers 32
38 | ```
39 | 
40 | 
41 | ### Step 4: Build VideoMME dataset
42 | ```bash
43 | cd ${retake_repo_root}
44 | python scripts/utils/build_videomme_dataset.py \
45 | --hf_qwen2vl7b_path ${PATH_TO_Qwen2-VL-7B-Instruct} \
46 | --hf_root ${videomme_root}
47 | ```
48 | Note that you can NOT modify folder `${videomme_root}/data_25fps` after this step, since the absolute path of extracted frames are written into annotation files `video_mme.json` and `video_mme_subtitle.json`:
49 | ```
50 | retake_repo_root/
51 | ├── dataset/
52 |     ├── video_mme/
53 |         ├── video_mme_subtitle.json
54 |         ├── video_mme.json
55 | ├── ...
56 | ```


--------------------------------------------------------------------------------
/environment.yaml:
--------------------------------------------------------------------------------
 1 | name: retake
 2 | channels:
 3 |   - defaults
 4 | dependencies:
 5 |   - python==3.11
 6 |   - pip:
 7 |       - torch==2.4.0
 8 |       - torchvision==0.19.0
 9 |       - transformers==4.48
10 |       - accelerate==0.34.2
11 |       - flash-attn==2.6.3
12 |       - av==13.1.0
13 |       - pyyaml==6.0.2
14 |       - opencv-python-headless==4.10.0.84
15 |       - pandas==2.2.3
16 |       - pysubs2==1.7.3
17 |       - pyarrow==17.0.0
18 |       - openai==1.56.0


--------------------------------------------------------------------------------
/environment_npu.yaml:
--------------------------------------------------------------------------------
 1 | name: retake
 2 | channels:
 3 |   - defaults
 4 | dependencies:
 5 |   - python==3.11
 6 |   - pip:
 7 |       - numpy==1.26.4
 8 |       - scipy==1.14.1
 9 |       - torch==2.4.0
10 |       - torch-npu==2.4.0
11 |       - torchvision==0.19.0
12 |       - transformers==4.48
13 |       - accelerate==0.34.2
14 |       - av==13.1.0
15 |       - pyyaml==6.0.2
16 |       - opencv-python-headless==4.10.0.84
17 |       - pandas==2.2.3
18 |       - pysubs2==1.7.3
19 |       - pyarrow==17.0.0
20 |       - openai==1.56.0


--------------------------------------------------------------------------------
/misc/Q8AZ16uBhr8_resized_fps2_mute.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SCZwangxiao/video-ReTaKe/e268c32c242061093d694f0cc219794857e71dd8/misc/Q8AZ16uBhr8_resized_fps2_mute.mp4


--------------------------------------------------------------------------------
/misc/overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SCZwangxiao/video-ReTaKe/e268c32c242061093d694f0cc219794857e71dd8/misc/overview.png


--------------------------------------------------------------------------------
/retake/dataset_utils.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import os.path as osp
  3 | import io
  4 | import re
  5 | import json
  6 | import math
  7 | import base64
  8 | from PIL import Image
  9 | import pandas as pd
 10 | from tqdm import tqdm
 11 | from typing import Optional, List
 12 | 
 13 | try:
 14 |     os.environ['OPENAI_BASE_URL'] = None
 15 |     os.environ['OPENAI_API_KEY'] = None
 16 |     import openai
 17 | except:
 18 |     print("Warning! openai not installed for MLVU evalutation")
 19 | import numpy as np
 20 | 
 21 | 
 22 | class BaseDataset:
 23 |     def __init__(self, 
 24 |                  anno_file: str,
 25 |                  processor_kwargs: str
 26 |                  ) -> None:
 27 |         self.processor_kwargs = processor_kwargs
 28 |         # Load annotations
 29 |         with open(anno_file, 'r') as F:
 30 |             self.annos = json.load(F)
 31 |         # Preprocess meta
 32 |         for anno in self.annos:
 33 |             # NOTE: Pyarrow caching in LLaMA-Factory will raise error
 34 |             # for some complicate json data. So dump to jsons.
 35 |             if type(anno['meta']) == str:
 36 |                 anno['meta'] = json.loads(anno['meta'])
 37 | 
 38 |     @staticmethod
 39 |     def _get_video_sample_extracted_frames(frame_files: List[str], **kwargs) -> int:
 40 |         video_fps = kwargs.get("video_fps")
 41 |         video_maxlen = kwargs.get("video_maxlen")
 42 |         extraction_fps = kwargs.get("video_frame_extraction_fps")
 43 |         total_frames = len(frame_files)
 44 |         sample_frames = float(total_frames / extraction_fps) * video_fps
 45 |         sample_frames = min(total_frames, video_maxlen, sample_frames)
 46 |         sample_frames = math.floor(sample_frames)
 47 |         return int(sample_frames / 2) * 2
 48 | 
 49 |     @staticmethod
 50 |     def _preprocess_image(image, **kwargs):
 51 |         r"""
 52 |         Pre-processes a single image.
 53 |         """
 54 |         image_resolution: int = kwargs.get("image_resolution")
 55 |         if max(image.width, image.height) > image_resolution:
 56 |             resize_factor = image_resolution / max(image.width, image.height)
 57 |             width, height = int(image.width * resize_factor), int(image.height * resize_factor)
 58 |             image = image.resize((width, height), resample=Image.NEAREST)
 59 | 
 60 |         if image.mode != "RGB":
 61 |             image = image.convert("RGB")
 62 | 
 63 |         return image
 64 | 
 65 |     def __len__(self):
 66 |         return len(self.annos)
 67 | 
 68 |     def get_video_message(self, video_root: str):
 69 |         frames = []
 70 |         frame_files = [
 71 |             os.path.join(video_root, file) for file in list(sorted(os.listdir(video_root)))
 72 |         ]
 73 |         total_frames = len(frame_files)
 74 |         sample_frames = self._get_video_sample_extracted_frames(frame_files, **self.processor_kwargs)
 75 |         sample_indices = np.linspace(0, total_frames - 1, sample_frames).astype(np.int32)
 76 |         for frame_idx, frame_file in enumerate(frame_files):
 77 |             if frame_idx in sample_indices:
 78 |                 # NOTE: Load and resize on the fly can creatly RAM cost of dataloader
 79 |                 image = Image.open(frame_file)
 80 |                 resized_image = self._preprocess_image(image, **self.processor_kwargs)
 81 |                 frames.append(resized_image)
 82 | 
 83 |         return frames
 84 | 
 85 |     def __getitem__(self, idx):
 86 |         anno = self.annos[idx]
 87 | 
 88 |         question = anno["messages"][0]["content"].replace('<video>', '')
 89 |         frames = self.get_video_message(anno["videos"][0])
 90 |         meta = anno["meta"]
 91 |         meta['answer'] = anno["messages"][1]["content"]
 92 | 
 93 |         messages = dict(
 94 |             question=question,
 95 |             video=frames
 96 |         )
 97 | 
 98 |         return (idx, messages, meta)
 99 | 
100 | 
101 | def evaluate_mlvu_generation(anno_id, gt_answer, pred_answer, meta, enable_gpt_eval):
102 |     """
103 |     Evaluates question and answer pairs using GPT-4
104 |     Returns a score for correctness.
105 |     # Cost:
106 |     # Before
107 |     # 3,994.95 / ￥4,514.22
108 | 
109 |     # After
110 |     # 4,043.27 / ￥4,514.22
111 |     """
112 |     if enable_gpt_eval and meta['question_type'] in ['Video Summary', 'Sub-Scene Captioning']:
113 |         question = meta['question'].replace('<video>', '')
114 |         pred_answer = meta['original_answer']
115 |         client = openai.OpenAI()
116 |         for _ in range(3):
117 |             try:
118 |                 if meta['question_type'] == 'Video Summary':
119 |                     response = client.chat.completions.create(
120 |                         temperature=0,
121 |                         model="gpt-4-turbo",
122 |                         messages = [
123 |                             {
124 |                                 "role": "system",
125 |                                 "content": 
126 |                                 """
127 |                                     ##TASK DESCRIPTION: 
128 |                                     You are required to evaluate the performance of the respondent in the video summarization task based on the standard answer and the respondent's answer. You should provide two scores. The first is the COMPLETENESS score, which should range from 1 to 5. The second is the RELIABILITY score, which should also range from 1 to 5. Below are the criteria for each scoring category:
129 |                                     ##COMPLETENESS Scoring Criteria:
130 |                                     The completeness score focuses on whether the summary covers all key points and main information from the video. 
131 |                                     Score 1: The summary hardly covers any of the main content or key points of the video.
132 |                                     Score 2: The summary covers some of the main content and key points but misses many.
133 |                                     Score 3: The summary covers most of the main content and key points.
134 |                                     Score 4: The summary is very comprehensive, covering most to nearly all of the main content and key points.
135 |                                     Score 5: The summary completely covers all the main content and key points of the video.
136 |                                     ##RELIABILITY Scoring Criteria:
137 |                                     The reliability score evaluates the correctness and clarity of the video summary. It checks for factual errors, misleading statements, and contradictions with the video content. If the respondent's answer includes details that are not present in the standard answer, as long as these details do not conflict with the correct answer and are reasonable, points should not be deducted.
138 |                                     Score 1: Contains multiple factual errors and contradictions; presentation is confusing.
139 |                                     Score 2: Includes several errors and some contradictions; needs clearer presentation.
140 |                                     Score 3: Generally accurate with minor errors; minimal contradictions; reasonably clear presentation.
141 |                                     Score 4: Very accurate with negligible inaccuracies; no contradictions; clear and fluent presentation.
142 |                                     Score 5: Completely accurate with no errors or contradictions; presentation is clear and easy to understand.
143 |                                     ----
144 |                                     ##INSTRUCTION:
145 |                                     1. Evaluate COMPLETENESS: First, analyze the respondent's answer according to the scoring criteria, then provide an integer score between 1 and 5 based on sufficient evidence. 
146 |                                     2. Evaluate RELIABILITY: First, analyze the respondent's answer according to the scoring criteria, then provide an integer score between 1 and 5 based on sufficient evidence. 
147 |                                     3. Output Scores in JSON Format: Present the scores in JSON format as follows:
148 |                                     {'score_completeness': score_comp, 'score_reliability': score_reli, 'total_score': score_comp + score_reli}
149 |                                 """
150 |                             },
151 |                             {
152 |                                 "role": "user",
153 |                                 "content": f"""
154 |                                     Please score the respondent's answer according to the steps in the Instructions. You must end with a JSON dict to store the scores.
155 |                                     Standard Answer: {gt_answer}
156 |                                     Respondent's Answer: {pred_answer}
157 |                                 """
158 |                             }
159 |                         ]
160 |                     )
161 |                 else: # "Sub-Scene Captioning"
162 |                     scoring_points = meta['scoring_points']
163 | 
164 |                     response = client.chat.completions.create(
165 |                         temperature=0,
166 |                         model="gpt-4-turbo",
167 |                         messages = [
168 |                             {
169 |                                 "role": "system",
170 |                                 "content": 
171 |                                 """
172 |                                     ##TASK DESCRIPTION: 
173 |                                     You are required to evaluate a respondent's answer based on a provided question, some scoring points, and the respondent's answer. You should provide two scores. The first is the accuracy score, which should range from 1 to 5. The second is the relevance score, which should also range from 1 to 5. Below are the criteria for each scoring category.
174 |                                     ##ACCURACY Scoring Criteria: 
175 |                                     Evaluate the respondent's answer against specific scoring points as follows:
176 |                                     Score 1: The response completely misses the scoring point.
177 |                                     Score 3: The response mentions content related to the scoring point but is not entirely correct.
178 |                                     Score 5: The response accurately addresses the scoring point.
179 |                                     Calculate the average score across all scoring points to determine the final accuracy score.
180 |                                     ##RELEVANCE Scoring Criteria:
181 |                                     Assess how the respondent's answer relates to the original question:
182 |                                     Score 1: The response is completely off-topic from the question.
183 |                                     Score 2: The response is partially related to the question but contains a significant amount of irrelevant content.
184 |                                     Score 3: The response primarily addresses the question, but the respondent seems uncertain about their own answer.
185 |                                     Score 4: The response mostly addresses the question and the respondent appears confident in their answer.
186 |                                     Score 5: The response is fully focused on addressing the question with no irrelevant content and demonstrates complete certainty.
187 |                                     ----
188 |                                     ##INSTRUCTION:
189 |                                     1. Evaluate Accuracy: First, assess and score each scoring point based on the respondent's answer. Calculate the average of these scores to establish the final accuracy score. Provide a detailed rationale before assigning your score.
190 |                                     2. Evaluate RELEVANCE: Assess the relevance of the respondent’s answer to the question. Note that when evaluating relevance, the correctness of the answer is not considered; focus solely on how relevant the answer is to the question. Provide a comprehensive rationale before assigning your score.
191 |                                     3. Output Scores in JSON Format: Present the scores in JSON format as follows:
192 |                                     {'score_accuracy': score_acc, 'score_relevance': score_rele, 'total_score': score_acc + score_rele}
193 |                                 """
194 |                             },
195 |                             {
196 |                                 "role": "user",
197 |                                 "content": f"""
198 |                                     Please score the respondent's answer according to the steps in the Instructions. You must end with a JSON dict to store the scores.
199 |                                     Question: {question}
200 |                                     Scoring Points: {scoring_points}
201 |                                     Respondent's Answer: {pred_answer}
202 |                                 """
203 |                             }
204 |                         ]
205 |                     )
206 |                 gpt_message = response.choices[0].message.content
207 |                 # Use regex to extract the JSON part of the string
208 |                 json_match = re.search(r'```json\n(.*?)\n```', gpt_message, re.DOTALL)
209 |                 if json_match:
210 |                     json_str = json_match.group(1)
211 |                     # Parse the JSON string into a Python dictionary
212 |                     json_dict = json.loads(json_str)
213 |                     score = json_dict['total_score']
214 |                     break
215 |                 else:
216 |                     print(f"No JSON found in the string. {anno_id}")
217 |                     score = 0
218 |             except Exception as e:
219 |                 score = 0
220 |                 print(f"Error processing anno_id '{anno_id}': {e}")
221 |     else:
222 |         if gt_answer.lower() == pred_answer.lower():
223 |             score = 1
224 |         else:
225 |             score = 0
226 |         gpt_message = ""
227 | 
228 |     return score, gpt_message
229 | 
230 | 
231 | def eval_videomme_results(anno_id2result, anno_id2meta, **kwargs):
232 |     # Load and merge all result files
233 |     anno_id_list, subfield_list, domain_list, duration_list = [], [], [], []
234 |     gt_answer_list, pred_answer_list, correct_list = [], [], []
235 |     for anno_id in anno_id2result:
236 |         pred_answer = anno_id2result[anno_id]
237 |         meta = anno_id2meta[anno_id]
238 |         gt_answer = meta['answer']
239 | 
240 |         anno_id_list.append(anno_id)
241 |         subfield_list.append(meta['task_type'])
242 |         domain_list.append(meta['domain'])
243 |         duration_list.append(meta['duration'])
244 |         gt_answer_list.append(gt_answer)
245 |         pred_answer_list.append(pred_answer)
246 |         if gt_answer.lower() == pred_answer.lower():
247 |             correct_list.append(1)
248 |         else:
249 |             correct_list.append(0)
250 | 
251 |     infer_result_df = pd.DataFrame({
252 |         'anno_id': anno_id_list,
253 |         'subfield': subfield_list,
254 |         'domain': domain_list,
255 |         'duration': duration_list,
256 |         'gt_answer': gt_answer_list,
257 |         'pred_answer': pred_answer_list,
258 |         'correct': correct_list
259 |     })
260 | 
261 |     # Evaluation
262 |     # Calculate accuracy per subfield
263 |     subfield_accuracy = infer_result_df.groupby('subfield')['correct'].mean()
264 | 
265 |     # Calculate accuracy per duration
266 |     duration_accuracy = infer_result_df.groupby('duration')['correct'].mean()
267 | 
268 |     duration_subfield_accuracy = infer_result_df.groupby(['duration', 'subfield'])['correct'].mean()
269 |     final_df = duration_subfield_accuracy.unstack()
270 | 
271 |     # Overall agregated in duration
272 |     final_df.loc[len(final_df)] = subfield_accuracy
273 |     final_df.index.values[-1] = 'overall'
274 | 
275 |     # Overall agregated in subfield
276 |     duration_accuracy.loc[3] = duration_accuracy.mean() # NOTE: This is correct because they have the same number of samples
277 |     duration_accuracy.index.values[-1] = 'overall'
278 |     final_df.insert(0, 'overall', duration_accuracy)
279 | 
280 |     # Reindex the DataFrame
281 |     new_order = ['short', 'medium', 'long', 'overall']
282 |     eval_result_df = final_df.reindex(new_order)
283 |     eval_result_df *= 100 # to percent
284 |     print(eval_result_df.head())
285 | 
286 |     return eval_result_df, infer_result_df
287 | 
288 | 
289 | def eval_mlvu_results(anno_id2result, anno_id2meta, enable_gpt_eval=False):
290 |     # Load and merge all result files
291 |     anno_id_list, question_type_list = [], []
292 |     gt_answer_list, pred_answer_list, correct_list, gpt_message_list = [], [], [], []
293 |     for anno_id in tqdm(anno_id2result, total=len(anno_id2result)):
294 |         meta = anno_id2meta[anno_id]
295 |         pred_answer = anno_id2result[anno_id]
296 |         gt_answer = meta['answer']
297 | 
298 |         anno_id_list.append(anno_id)
299 |         question_type_list.append(meta['question_type'])
300 |         gt_answer_list.append(gt_answer)
301 |         pred_answer_list.append(pred_answer)
302 |         # if gt_answer.lower() == pred_answer.lower():
303 |         #     correct_list.append(1)
304 |         # else:
305 |         #     correct_list.append(0)
306 |         correct, gpt_message = evaluate_mlvu_generation(anno_id, gt_answer, pred_answer, meta, enable_gpt_eval)
307 |         correct_list.append(correct)
308 |         gpt_message_list.append(gpt_message)
309 |     infer_result_df = pd.DataFrame({
310 |         'anno_id': anno_id_list,
311 |         'question_type': question_type_list,
312 |         'gt_answer': gt_answer_list,
313 |         'pred_answer': pred_answer_list,
314 |         'correct': correct_list,
315 |         'gpt_message': gpt_message_list,
316 |     })
317 | 
318 |     # Judge the MLVU split
319 |     question_types = set(infer_result_df['question_type'].tolist())
320 |     if len(question_types) == 9:
321 |         split = 'dev'
322 |     elif len(question_types) == 11:
323 |         split = 'test'
324 |     else:
325 |         raise NotImplementedError
326 | 
327 |     # Evaluation
328 |     # Calculate accuracy for each 'question_type'
329 |     accuracy_by_question_type = infer_result_df.groupby('question_type')['correct'].mean() * 100
330 |     accuracy_by_question_type = accuracy_by_question_type.reset_index()
331 |     accuracy_by_question_type.rename(columns={'correct': 'Accuracy'}, inplace=True)
332 | 
333 |     # Calculate AVG
334 |     if split == 'dev':
335 |         # TR	AR	NQA	ER	PQA	    AO	AC              SSC VS
336 |         mc_types = ['Topic Reasoning', 'Anomaly Recognition', 'Needle QA', 
337 |                     'Ego Reasoning', 'Plot QA', 'Action Order', 'Action Count']
338 |     else:
339 |         # TR	AR	NQA	ER	PQA	SQA	AO	AC	TQA	M-AVG	SSC	VS	G-Avg
340 |         # SportsQA
341 |         raise NotImplementedError
342 |     mc_rows = accuracy_by_question_type['question_type'].isin(mc_types)
343 |     mc_accuracy = accuracy_by_question_type[mc_rows]['Accuracy'].mean()
344 |     g_types = ['Video Summary', 'Sub-Scene Captioning']
345 |     g_rows = accuracy_by_question_type['question_type'].isin(g_types)
346 |     accuracy_by_question_type.loc[g_rows, 'Accuracy'] = accuracy_by_question_type.loc[g_rows, 'Accuracy'] / 100
347 | 
348 |     g_accuracy = accuracy_by_question_type[g_rows]['Accuracy'].mean()
349 | 
350 |     # Add the overall accuracy to the DataFrame
351 |     overall_df = pd.DataFrame({'question_type': ['M-AVG', 'G-AVG'], 'Accuracy': [mc_accuracy, g_accuracy]})
352 | 
353 |     # Combine the results
354 |     eval_result_df = pd.concat([accuracy_by_question_type, overall_df], ignore_index=True)
355 |     eval_result_df = eval_result_df.set_index('question_type').transpose()
356 | 
357 |     if split == 'dev':
358 |         new_order = ['Topic Reasoning', 'Anomaly Recognition', 
359 |                      'Needle QA', 'Ego Reasoning', 'Plot QA',
360 |                      'Action Order', 'Action Count', 'M-AVG',
361 |                      'Video Summary', 'Sub-Scene Captioning', 'G-AVG']
362 |     else:
363 |         # TR	AR	NQA	ER	PQA	SQA	AO	AC	TQA	M-AVG	SSC	VS	G-Avg
364 |         # SportsQA
365 |         raise NotImplementedError
366 |     eval_result_df = eval_result_df[new_order]
367 | 
368 |     print(eval_result_df.head())
369 | 
370 |     return eval_result_df, infer_result_df
371 | 
372 | 
373 | def eval_lvbench_results(anno_id2result, anno_id2meta, **kwargs):
374 |     type2correct_list = {}
375 |     anno_id_list = []
376 |     question_type_list = []
377 |     gt_answer_list = []
378 |     pred_answer_list = []
379 |     infer_result_correct_list = []
380 |     for anno_id in anno_id2result:
381 |         pred_answer = anno_id2result[anno_id]
382 |         meta = anno_id2meta[anno_id]
383 |         gt_answer = meta['answer']
384 |         if gt_answer.lower() == pred_answer.lower():
385 |             correct = 1
386 |         else:
387 |             correct = 0
388 | 
389 |         anno_id_list.append(anno_id)
390 |         question_type_list.append(json.dumps(meta['question_type']))
391 |         gt_answer_list.append(gt_answer)
392 |         pred_answer_list.append(pred_answer)
393 |         infer_result_correct_list.append(correct)
394 | 
395 |         for question_type in meta['question_type'] + ['overall']:
396 |             correct_list = type2correct_list.get(question_type, [])
397 |             correct_list.append(correct)
398 |             type2correct_list[question_type] = correct_list
399 | 
400 |     infer_result_df = pd.DataFrame({
401 |         'anno_id': anno_id_list,
402 |         'question_type_list': question_type_list,
403 |         'gt_answer': gt_answer_list,
404 |         'pred_answer': pred_answer_list,
405 |         'correct': infer_result_correct_list
406 |     })
407 | 
408 |     for qtype, correct_list in type2correct_list.items():
409 |         type2correct_list[qtype] = [sum(correct_list) / len(correct_list)]
410 |     # type2correct_list['overall'] = sum([v[0] for v in type2correct_list.values()]) / len(type2correct_list)
411 | 
412 |     eval_result_df = pd.DataFrame(type2correct_list)
413 | 
414 |     # Reindex the DataFrame
415 |     new_order = ['entity recognition', 'event understanding', 'key information retrieval', 'temporal grounding', 'reasoning', 'summarization', 'overall']
416 |     eval_result_df = eval_result_df[new_order]
417 |     eval_result_df *= 100 # to percent
418 |     print(eval_result_df.head())
419 | 
420 |     return eval_result_df, infer_result_df
421 | 
422 | 
423 | def get_dataset(dataset_name, anno_file, processor_kwargs):
424 |     if dataset_name.lower() in ['videomme', 'mlvu', 'lvbench']:
425 |         return BaseDataset(anno_file, processor_kwargs)
426 |     else:
427 |         print("Error! Dataset not implemented!", dataset_name)
428 |         raise NotImplementedError
429 | 
430 | 
431 | def get_eval_methods(dataset_name):
432 |     if dataset_name.lower() == 'videomme':
433 |         return eval_videomme_results
434 |     elif dataset_name.lower() == 'mlvu':
435 |         return eval_mlvu_results
436 |     elif dataset_name.lower() == 'lvbench':
437 |         return eval_lvbench_results
438 |     else:
439 |         print("Error! Evaluation method not implemented!", dataset_name)
440 |         raise NotImplementedError


--------------------------------------------------------------------------------
/retake/infer_eval.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import os.path as osp
  3 | import json
  4 | import re
  5 | import yaml
  6 | import datetime
  7 | import argparse
  8 | from tqdm import tqdm
  9 | from torch.utils.data import DataLoader, Subset
 10 | import torch
 11 | import torch.multiprocessing as mp
 12 | import torch.distributed as dist
 13 | from transformers import AutoProcessor
 14 | 
 15 | from retake.monkeypatch import patch_qwen2vl, patch_qwen2vl_config
 16 | from retake.dataset_utils import get_dataset, get_eval_methods
 17 | 
 18 | 
 19 | def load_yaml(file_path):
 20 |     with open(file_path, 'r') as file:
 21 |         data = yaml.safe_load(file)
 22 |     return data
 23 | 
 24 | 
 25 | def trimm_results(s):
 26 |     s = s.strip()
 27 |     answer_prefixes = [
 28 |         "The best answer is",
 29 |         "The correct answer is",
 30 |         "The answer is",
 31 |         "The answer",
 32 |         "The best option is",
 33 |         "The correct option is",
 34 |         "Best answer:",
 35 |         "Best option:",
 36 |     ]
 37 |     for answer_prefix in answer_prefixes:
 38 |         s = s.replace(answer_prefix, "")
 39 | 
 40 |     if len(s.split()) > 10 and not re.search("[ABCDEFG]", s):
 41 |         return ""
 42 | 
 43 |     matches = re.search(r"[ABCDEFG]", s)
 44 |     if matches is None:
 45 |         return ""
 46 |     return matches[0]
 47 | 
 48 | 
 49 | class InferClient:
 50 |     def __init__(self, model_name, hf_model_path, exp_configs, device) -> None:
 51 |         self.device = device
 52 |         self.model_name = model_name if model_name is not None else exp_configs['model_name']
 53 |         self.do_sample = exp_configs.get('do_sample', False)
 54 |         self.load_model(model_name, hf_model_path, exp_configs, device)
 55 | 
 56 |     def load_model(self, model_name, hf_model_path, exp_configs, device):
 57 |         model_name = model_name if model_name is not None else exp_configs['model_name']
 58 |         model_name = model_name.lower().replace('-', '').replace('_', '')
 59 |         if model_name == 'qwen2vl': # QWen2VL
 60 |             from transformers import Qwen2VLConfig, Qwen2VLForConditionalGeneration
 61 |             from retake.monkeypatch import patch_qwen2vl, patch_qwen2vl_config
 62 |             patch_qwen2vl(exp_configs['method']) # Replace some functions of QWen2VL with those from ReTaKe
 63 |             qwen2vl_config = Qwen2VLConfig.from_pretrained(hf_model_path)
 64 |             qwen2vl_config = patch_qwen2vl_config(qwen2vl_config, exp_configs)
 65 |             model = Qwen2VLForConditionalGeneration.from_pretrained(
 66 |                 hf_model_path,
 67 |                 config=qwen2vl_config,
 68 |                 torch_dtype=torch.bfloat16,
 69 |                 attn_implementation=exp_configs.get('attn_implementation', None),
 70 |                 device_map=device # "auto"
 71 |             ).eval()
 72 |             processor = AutoProcessor.from_pretrained(hf_model_path)
 73 |         elif model_name in ['llavaonevision', 'llavavideo']: # LLaVA-OneVision, LLaVA-Video
 74 |             from transformers import LlavaOnevisionConfig, LlavaOnevisionForConditionalGeneration
 75 |             from retake.monkeypatch import patch_llava_onevision, patch_llava_onevision_config
 76 |             patch_llava_onevision(exp_configs['method']) # Replace some functions of LLaVA-Video with those from ReTaKe
 77 |             llava_onevision_config = LlavaOnevisionConfig.from_pretrained(hf_model_path)
 78 |             llava_onevision_config = patch_llava_onevision_config(llava_onevision_config, exp_configs)
 79 |             processor = AutoProcessor.from_pretrained(hf_model_path) 
 80 |             model = LlavaOnevisionForConditionalGeneration.from_pretrained(
 81 |                 hf_model_path, 
 82 |                 config=llava_onevision_config,
 83 |                 torch_dtype=torch.bfloat16,
 84 |                 attn_implementation=exp_configs.get('attn_implementation', None),
 85 |                 device_map=device # "auto"
 86 |             ) 
 87 |         else:
 88 |             raise NotImplementedError
 89 |         self.model = model
 90 |         self.processor = processor
 91 | 
 92 |     def infer(self, message):
 93 |         # Prepare inputs
 94 |         video = message['video']
 95 |         conversation = [
 96 |             {
 97 |                 "role": "user",
 98 |                 "content": [
 99 |                     {"type": "video"},
100 |                     {"type": "text", "text": message['question']},
101 |                 ],
102 |             }
103 |         ]
104 |         text_prompt = self.processor.apply_chat_template(conversation, add_generation_prompt=True)
105 |         inputs = self.processor(text=[text_prompt], videos=[video], padding=True, return_tensors="pt")
106 |         inputs = inputs.to(self.device)
107 |         inputs['pixel_values_videos'] = inputs['pixel_values_videos'].to(torch.bfloat16)
108 | 
109 |         # Inference
110 |         output_ids = self.model.generate(**inputs, do_sample=self.do_sample, max_new_tokens=128)
111 |         generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
112 |         output_text = self.processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
113 |         output_text = output_text[0]
114 | 
115 |         return output_text
116 | 
117 | 
118 | def parse_arguments():
119 |     # Set up argument parser
120 |     parser = argparse.ArgumentParser(description="ReTaKe Evaluation")
121 |     parser.add_argument('--model_name', 
122 |                         type=str, 
123 |                         help="Modelname")
124 |     parser.add_argument('--hf_path',
125 |                         '--hf_qwen2vl7b_path', 
126 |                         type=str, 
127 |                         default='Qwen/Qwen2-VL-7B-Instruct', 
128 |                         help="Path to the huggingface model")
129 |     parser.add_argument('--config_path', 
130 |                         type=str, 
131 |                         default='configs/retake_videomme.yaml', 
132 |                         help="Path to the experimental config")
133 |     parser.add_argument('--video_frame_extraction_fps', 
134 |                         type=int, 
135 |                         default=25, 
136 |                         help="Video frame extraction FPS")
137 |     parser.add_argument('--n_gpus', 
138 |                         type=int, 
139 |                         default=8, 
140 |                         help="Number of GPUs to use")
141 |     parser.add_argument('--timeout', 
142 |                         type=int, 
143 |                         default=30, 
144 |                         help="Timeout for waiting each GPU finish inference (in minutes).")
145 |     args = parser.parse_args()
146 |     return args
147 | 
148 | 
149 | def setup(rank, world_size, timeout):
150 |     os.environ['MASTER_ADDR'] = 'localhost'
151 |     os.environ['MASTER_PORT'] = '12355'
152 |     dist.init_process_group("nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(minutes=timeout))
153 |     torch.cuda.set_device(f'cuda:{rank}')
154 | 
155 | 
156 | def cleanup():
157 |     dist.destroy_process_group()
158 | 
159 | 
160 | def main(rank, world_size, args):
161 |     device = f'cuda:{rank}'
162 |     setup(rank, world_size, args.timeout)
163 | 
164 |     exp_configs = load_yaml(args.config_path)
165 | 
166 |     client = InferClient(args.model_name, args.hf_path, exp_configs, device)
167 | 
168 |     dataset = get_dataset(
169 |         dataset_name=exp_configs['dataset_name'],
170 |         anno_file=exp_configs['anno_file'], 
171 |         processor_kwargs=dict(
172 |             video_fps=exp_configs['sample_fps'],
173 |             video_maxlen=exp_configs['max_num_frames'],
174 |             image_resolution=exp_configs['longsize_resolution'],
175 |             video_frame_extraction_fps=args.video_frame_extraction_fps
176 |         )
177 |     )
178 | 
179 |     # Split dataset into shards
180 |     # Function to create a round-robin shard for a given rank
181 |     indices = [i for i in range(len(dataset)) if i % world_size == rank]
182 |     shard_dataset = Subset(dataset, indices)
183 | 
184 |     dataloader = DataLoader(shard_dataset, batch_size=None, num_workers=exp_configs['dataloader_num_workers'])
185 | 
186 |     # Inference
187 |     anno_id2result = {}
188 |     anno_id2meta = {}
189 |     for sample in tqdm(dataloader, desc=f'rank {rank}'): # disable=rank!=0
190 |         idx, message, meta = sample
191 |         pred_answer = client.infer(message)
192 |         pred_answer = trimm_results(pred_answer)
193 |         anno_id2result[idx] = pred_answer
194 |         anno_id2meta[idx] = meta
195 | 
196 |     # Gather results from all processes
197 |     all_anno_id2result = [None] * world_size
198 |     all_anno_id2meta = [None] * world_size
199 |     dist.barrier()
200 |     dist.all_gather_object(all_anno_id2result, anno_id2result)
201 |     dist.all_gather_object(all_anno_id2meta, anno_id2meta)
202 | 
203 |     if rank == 0:
204 |         # Merge results
205 |         merged_anno_id2result = {k: v for d in all_anno_id2result for k, v in d.items()}
206 |         merged_anno_id2meta = {k: v for d in all_anno_id2meta for k, v in d.items()}
207 | 
208 |         # Evaluate
209 |         eval_func = get_eval_methods(exp_configs['dataset_name'])
210 |         eval_result_df, infer_result_df = eval_func(merged_anno_id2result, merged_anno_id2meta)
211 | 
212 |         # Dump inference & evaluation results
213 |         os.makedirs(exp_configs['output_dir'], exist_ok=True)
214 |         answer_file = os.path.join(exp_configs['output_dir'], "anno_id2result.json")
215 |         infer_res_file = os.path.join(exp_configs['output_dir'], "infer_results.csv")
216 |         eval_res_file = os.path.join(exp_configs['output_dir'], "eval_results.csv")
217 | 
218 |         with open(answer_file, 'w') as F:
219 |             json.dump(merged_anno_id2result, F)
220 |         infer_result_df.to_csv(infer_res_file, index=False)
221 |         eval_result_df.to_csv(eval_res_file, index=True)
222 | 
223 |     cleanup()
224 | 
225 | 
226 | if __name__ == "__main__":
227 |     args = parse_arguments()
228 |     world_size = args.n_gpus
229 |     mp.spawn(main, args=(world_size, args), nprocs=world_size, join=True)
230 | 


--------------------------------------------------------------------------------
/retake/llava_onevision.py:
--------------------------------------------------------------------------------
  1 | import math
  2 | from tqdm import tqdm
  3 | from dataclasses import dataclass
  4 | from typing import Any, Callable, Dict, List, Optional, Tuple, Union
  5 | 
  6 | import torch
  7 | import torch.nn as nn
  8 | import torch.nn.functional as F
  9 | import torch.utils.checkpoint
 10 | from torch.nn import CrossEntropyLoss
 11 | import numpy as np
 12 | 
 13 | from transformers.cache_utils import Cache
 14 | from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
 15 | from transformers.processing_utils import Unpack
 16 | from transformers.utils import (
 17 |     is_flash_attn_2_available,
 18 |     logging,
 19 | )
 20 | from transformers import Qwen2VLConfig
 21 | from transformers.models.llava_onevision.modeling_llava_onevision import (
 22 |     LlavaOnevisionCausalLMOutputWithPast,
 23 |     image_size_to_num_patches,
 24 | )
 25 | from transformers.models.qwen2.modeling_qwen2 import (
 26 |     repeat_kv,
 27 |     eager_attention_forward,
 28 |     Qwen2Attention,
 29 |     Qwen2RotaryEmbedding,
 30 |     apply_rotary_pos_emb,
 31 | )
 32 | 
 33 | from .visual_compression import *
 34 | from .longvideo_cache import *
 35 | 
 36 | if is_flash_attn_2_available():
 37 |     from flash_attn import flash_attn_varlen_func
 38 | 
 39 |     from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
 40 | else:
 41 |     flash_attn_varlen_func = None
 42 | 
 43 | DEBUG_MODE = False
 44 | 
 45 | logger = logging.get_logger(__name__)
 46 | 
 47 | 
 48 | Qwen2Attention_original_init = Qwen2Attention.__init__
 49 | 
 50 | def retake_Qwen2Attention_init(
 51 |     self, 
 52 |     config: Qwen2VLConfig, 
 53 |     layer_idx: Optional[int] = None
 54 | ):
 55 |     Qwen2Attention_original_init(self, config, layer_idx)
 56 |     self.rotary_emb = Qwen2RotaryEmbedding(config=self.config)
 57 | 
 58 | 
 59 | def retake_Qwen2Attention_forward(
 60 |     self,
 61 |     hidden_states: torch.Tensor,
 62 |     position_embeddings: Tuple[torch.Tensor, torch.Tensor],
 63 |     attention_mask: Optional[torch.Tensor],
 64 |     past_key_value: Optional[Cache] = None,
 65 |     cache_position: Optional[torch.LongTensor] = None,
 66 |     **kwargs: Unpack[FlashAttentionKwargs],
 67 | ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
 68 |     input_shape = hidden_states.shape[:-1]
 69 |     hidden_shape = (*input_shape, -1, self.head_dim)
 70 | 
 71 |     query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
 72 |     key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
 73 |     value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
 74 | 
 75 |     # Update position_ids if positional embeddings are reforged
 76 |     position_ids = None
 77 |     if past_key_value is not None and getattr(past_key_value, "pos_embed_reforge", False):
 78 |         # This code reforge the `position_ids` of current chunk, 
 79 |         # the `position_ids` of previous chunks are reforged in KVCache.update()
 80 |         position_ids = kwargs.get('position_ids')
 81 |         prev_tempo_idx = past_key_value.get_prev_temporal_idx(self.layer_idx)
 82 |         cur_tempo_idx = position_ids[0,0]
 83 |         if prev_tempo_idx + 1 != cur_tempo_idx:
 84 |             # print("Warning! Discontinuous positional ids %d (prev) + 1 != %d (cur) at layer %d. Fixed!" % (prev_tempo_idx,  cur_tempo_idx, self.layer_idx))
 85 |             # NOTE: clone `position_ids` to avoid influence of in-place ops in different layers
 86 |             position_ids = position_ids.clone()
 87 |             position_ids[0,:] += prev_tempo_idx + 1 - cur_tempo_idx
 88 |         position_embeddings = None # `position_embeddings` need to be re-calculated
 89 | 
 90 |     if position_embeddings is None:
 91 |         logger.warning_once(
 92 |             "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
 93 |             "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
 94 |             "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
 95 |             "removed and `position_embeddings` will be mandatory."
 96 |         )
 97 |         cos, sin = self.rotary_emb(value_states, position_ids)
 98 |     else:
 99 |         cos, sin = position_embeddings
100 |     query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
101 | 
102 |     if past_key_value is not None:
103 |         # sin and cos are specific to RoPE models; cache_position needed for the static cache
104 |         cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
105 |         # Specific to KVCache compression methods
106 |         cache_kwargs.update({"query_states": query_states, "position_ids": position_ids, "rotary_emb": self.rotary_emb})
107 |         key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
108 | 
109 |     sliding_window = None
110 |     if (
111 |         self.config.use_sliding_window
112 |         and getattr(self.config, "sliding_window", None) is not None
113 |         and self.layer_idx >= self.config.max_window_layers
114 |     ):
115 |         sliding_window = self.config.sliding_window
116 | 
117 |     attention_interface: Callable = eager_attention_forward
118 |     if self.config._attn_implementation != "eager":
119 |         if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
120 |             logger.warning_once(
121 |                 "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
122 |                 'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
123 |             )
124 |         else:
125 |             attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
126 | 
127 |     attn_output, attn_weights = attention_interface(
128 |         self,
129 |         query_states,
130 |         key_states,
131 |         value_states,
132 |         attention_mask,
133 |         dropout=0.0 if not self.training else self.attention_dropout,
134 |         scaling=self.scaling,
135 |         sliding_window=sliding_window,  # main diff with Llama
136 |         **kwargs,
137 |     )
138 | 
139 |     attn_output = attn_output.reshape(*input_shape, -1).contiguous()
140 |     attn_output = self.o_proj(attn_output)
141 |     return attn_output, attn_weights
142 | 
143 | 
144 | def retake_LlavaOnevisionForConditionalGeneration_get_chunk_size(
145 |     self, 
146 |     config, 
147 |     pixel_values_videos
148 | ) -> int:
149 |     # Calculate the number of tokens in each prefill chunk
150 |     chunk_frames = (
151 |         config.longvideo_kwargs.get('chunked_prefill_frames', None) if getattr(config, 'longvideo_kwargs', None) 
152 |         else None
153 |     )
154 |     if chunk_frames is None:
155 |         chunk_prefill_size = None
156 |     else:
157 |         T, _, H, W = pixel_values_videos[0].shape
158 |         H = math.ceil(H // self.config.vision_config.patch_size / self.pool_stride)
159 |         W = math.ceil(W // self.config.vision_config.patch_size / self.pool_stride)
160 |         chunk_prefill_size = min(chunk_frames, T) * H * W
161 |     return chunk_prefill_size
162 | 
163 | 
164 | def retake_LlavaOnevisionForConditionalGeneration_segment_input_ids(
165 |     self,
166 |     input_ids
167 | ):
168 |     """Split video and text segments in the input_ids
169 |     return: list[(s, e, type)], indices of [s, e) are of `type`.
170 |     """
171 |     videomask = (input_ids[0] == self.config.video_token_index)
172 |     # Find the difference between consecutive elements
173 |     diff = torch.diff(videomask.long())
174 |     diff_pos_indices = (torch.where(diff == 1)[0] + 1).cpu().numpy()
175 |     diff_neg_indices = (torch.where(diff == -1)[0] + 1).cpu().numpy()
176 | 
177 |     # True mask
178 |     start_indices_true = diff_pos_indices
179 |     end_indices_true = diff_neg_indices
180 |     if videomask[0] == True: # segment starts at the beginning
181 |         start_indices_true = np.insert(start_indices_true, 0, 0)
182 |     if videomask[-1] == True: # segment ends at the beginning
183 |         end_indices_true = np.append(end_indices_true, len(videomask))
184 | 
185 |     # False mask
186 |     start_indices_flase = diff_neg_indices
187 |     end_indices_flase = diff_pos_indices
188 |     if videomask[0] == False:
189 |         start_indices_flase = np.insert(start_indices_flase, 0, 0)
190 |     if videomask[-1] == False:
191 |         end_indices_flase = np.append(end_indices_flase, len(videomask))
192 | 
193 |     segments = (
194 |         list(zip(start_indices_true, end_indices_true, ['video']*len(end_indices_true))) + 
195 |         list(zip(start_indices_flase, end_indices_flase, ['text']*len(end_indices_flase)))
196 |     )
197 |     segments = sorted(segments, key=lambda x: x[0])
198 |     return segments
199 | 
200 | 
201 | def retake_LlavaOnevisionForConditionalGeneration_compress_video_tokens(
202 |     self, 
203 |     input_ids: torch.LongTensor = None,
204 |     attention_mask: torch.Tensor = None,
205 |     selected_video_feature: torch.Tensor = None,
206 |     position_ids: Optional[torch.LongTensor] = None,
207 |     cache_position: Optional[torch.LongTensor] = None,
208 |     labels: Optional[torch.LongTensor] = None,
209 | ) -> Tuple[torch.LongTensor, torch.Tensor, torch.Tensor, torch.LongTensor]:
210 |     # Parse long video compression configs
211 |     visual_compression = False
212 |     if getattr(self.config, "longvideo_kwargs", None) is not None:
213 |         visual_compression = self.config.longvideo_kwargs.get("visual_compression", False)
214 |         if visual_compression:
215 |             compression_kwargs = self.config.longvideo_kwargs['visual_compression_kwargs']
216 |             compression_ratio = compression_kwargs.get("compression_ratio")
217 |             compression_method = compression_kwargs.get("compression_method")
218 |             patch_sync = compression_kwargs.get("patch_sync")
219 |             return_keyframe_mask = compression_kwargs.get("return_keyframe_mask")
220 | 
221 |     grid_t, grid_hw = selected_video_feature.shape[:2]
222 |     tgt_grid_t = grid_t
223 |     if visual_compression:
224 |         assert labels is None
225 |         assert input_ids.shape[0] == 1, "Currently, only inference are supported"
226 |         video_token_indices = torch.where(input_ids[0] == self.config.video_token_index)[0]
227 |         s_index, e_index = video_token_indices[0], video_token_indices[-1]
228 |         height = width = self.config.vision_config.image_size // self.config.vision_config.patch_size
229 |         grid_hw_after_pool = math.ceil(height / self.pool_stride) * math.ceil(width / self.pool_stride)
230 |         ori_seq_len = input_ids.shape[1]
231 |         tgt_grid_t = max(1, round(compression_ratio * grid_t))
232 | 
233 |         # Compress
234 |         compressed_memory_bank = selected_video_feature.reshape(1, grid_t, grid_hw, -1)
235 |         if compression_method == "MA-LLM":
236 |             compression_size = torch.ones_like(compressed_memory_bank[:,:,:,0])
237 |             while compressed_memory_bank.shape[1] > tgt_grid_t:
238 |                 compressed_memory_bank, compression_size = memory_bank_compress_MALLM(compressed_memory_bank, compression_size, sync=patch_sync)
239 |             keypatches_mask = None
240 |         elif compression_method == "MA-LLM-hard":
241 |             while compressed_memory_bank.shape[1] > tgt_grid_t:
242 |                 compressed_memory_bank = memory_bank_compress_MALLM_hard(compressed_memory_bank, sync=patch_sync)
243 |             keypatches_mask = None
244 |         elif compression_method == "Keyframe":
245 |             compressed_memory_bank, keypatches_mask = memory_bank_compress_keyframe(compressed_memory_bank, tgt_grid_t, 3, sync=patch_sync)
246 |             keypatches_mask = keypatches_mask if return_keyframe_mask else None
247 |         else:
248 |             raise NotImplementedError
249 |         selected_video_feature = compressed_memory_bank[0]
250 |         mem_len_after = tgt_grid_t * grid_hw_after_pool
251 | 
252 |         # Reforge the input
253 |         input_ids = torch.cat([
254 |             input_ids[:, :s_index],
255 |             input_ids[:, s_index:e_index+1][:,:mem_len_after],
256 |             input_ids[:, e_index+1:]
257 |             ],
258 |             dim=1)
259 |         num_token_diff = ori_seq_len - input_ids.shape[1]
260 |         if num_token_diff and attention_mask is not None:
261 |             attention_mask = attention_mask[:, num_token_diff:]
262 |         if num_token_diff and position_ids is not None:
263 |             position_ids = position_ids[:,:-num_token_diff]
264 |         if num_token_diff and cache_position is not None:
265 |             cache_position = cache_position[:-num_token_diff]
266 |     else:
267 |         keypatches_mask = None
268 | 
269 |     return input_ids, attention_mask, selected_video_feature, position_ids, cache_position, tgt_grid_t, keypatches_mask
270 | 
271 | 
272 | def retake_LlavaOnevisionForConditionalGeneration_forge_input_chunks(
273 |     self, 
274 |     ss, 
275 |     ee, 
276 |     modality_segments, 
277 |     position_ids, 
278 |     cache_position, 
279 |     attention_mask, 
280 |     past_key_values, 
281 |     inputs_embeds
282 | ):
283 |     position_ids_chunk = position_ids[:,ss:ee]
284 |     cache_position_chunk = cache_position[:ee]
285 |     attention_mask_chunk = attention_mask[:,:ee] # NOTE: specially from 0 to ee
286 |     inputs_embeds_chunk = inputs_embeds[:,ss:ee]
287 |     prompt_length = None
288 | 
289 |     if getattr(self.config, 'longvideo_kwargs', None) and self.config.longvideo_kwargs.get('kvcache_compression', False):
290 |         compression_kwargs = self.config.longvideo_kwargs['kvcache_compression_kwargs']
291 |         if compression_kwargs.get('prompt_guided_compression', False) and compression_kwargs.get('compression_ratio', 1) < 1.0:
292 |             # Prompt guided KV cache compression
293 |             s_p, e_p, t_p = modality_segments[-1]
294 |             assert t_p == 'text'
295 |             pos_offset = position_ids[0,s_p] - position_ids_chunk[0,-1] - 1 # (bs, seq_len)
296 |             # print('forge_input_chunks() temporal position_ids of prompts', position_ids[0,s_p:e_p])
297 |             position_ids_chunk = torch.cat([position_ids_chunk, position_ids[:,s_p:e_p] - pos_offset], dim=1)
298 |             cache_position_chunk = torch.cat([cache_position_chunk, cache_position[s_p:e_p] - pos_offset], dim=0)
299 |             attention_mask_chunk = torch.cat([attention_mask_chunk, attention_mask[:,s_p:e_p]], dim=1)
300 |             inputs_embeds_chunk = torch.cat([inputs_embeds_chunk, inputs_embeds[:,s_p:e_p]], dim=1)
301 |             prompt_length = e_p - s_p
302 | 
303 |     return position_ids_chunk, cache_position_chunk, attention_mask_chunk, inputs_embeds_chunk, prompt_length
304 | 
305 | 
306 | def retake_LlavaOnevisionForConditionalGeneration_forward(
307 |     self,
308 |     input_ids: torch.LongTensor = None,
309 |     pixel_values: torch.FloatTensor = None,
310 |     image_sizes: Optional[torch.LongTensor] = None,
311 |     pixel_values_videos: torch.FloatTensor = None,
312 |     image_sizes_videos: Optional[torch.LongTensor] = None,
313 |     attention_mask: Optional[torch.Tensor] = None,
314 |     position_ids: Optional[torch.LongTensor] = None,
315 |     past_key_values: Optional[List[torch.FloatTensor]] = None,
316 |     inputs_embeds: Optional[torch.FloatTensor] = None,
317 |     vision_feature_layer: Optional[int] = None,
318 |     vision_feature_select_strategy: Optional[str] = None,
319 |     vision_aspect_ratio: Optional[str] = None,
320 |     labels: Optional[torch.LongTensor] = None,
321 |     use_cache: Optional[bool] = None,
322 |     output_attentions: Optional[bool] = None,
323 |     output_hidden_states: Optional[bool] = None,
324 |     return_dict: Optional[bool] = None,
325 |     cache_position: Optional[torch.LongTensor] = None,
326 |     logits_to_keep: Union[int, torch.Tensor] = 0,
327 | ) -> Union[Tuple, LlavaOnevisionCausalLMOutputWithPast]:
328 |     assert input_ids.shape[0] == 1, "Batch inference of long video is not supported yet!"
329 | 
330 |     self.pool_stride = 2
331 |     if input_ids.shape[1] > 1: # Prefill
332 |         is_prefill = True
333 |         # Calculate chunk size based on inputs
334 |         chunk_size = self.get_chunk_size(self.config, pixel_values_videos)
335 |         # Configuring KV Cache compression kwargs
336 |         if getattr(self.config, 'longvideo_kwargs', None) and self.config.longvideo_kwargs.get('kvcache_compression', False):
337 |             compression_kwargs = self.config.longvideo_kwargs['kvcache_compression_kwargs']
338 |             if compression_kwargs.get('dynamic_compression_ratio', False):
339 |                 # Dynamic compression ratio
340 |                 input_length = input_ids.shape[1]
341 |                 max_input_length = compression_kwargs['max_input_length']
342 |                 if input_length <= max_input_length:
343 |                     compression_kwargs['compression_ratio'] = 1
344 |                 else:
345 |                     compression_kwargs['compression_ratio'] = max_input_length / input_length
346 |         if chunk_size is not None:
347 |             modality_segments = self.segment_input_ids(input_ids)
348 |             # print("modality_segments", modality_segments)
349 |             past_key_values = build_kvcache(self.config)
350 |             use_cache = True
351 |     else:
352 |         is_prefill = False
353 |         chunk_size = None
354 | 
355 |     output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
356 |     output_hidden_states = (
357 |         output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
358 |     )
359 |     return_dict = return_dict if return_dict is not None else self.config.use_return_dict
360 |     vision_feature_layer = (
361 |         vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
362 |     )
363 |     vision_feature_select_strategy = (
364 |         vision_feature_select_strategy
365 |         if vision_feature_select_strategy is not None
366 |         else self.config.vision_feature_select_strategy
367 |     )
368 |     vision_aspect_ratio = (
369 |         vision_aspect_ratio if vision_aspect_ratio is not None else self.config.vision_aspect_ratio
370 |     )
371 | 
372 |     if (input_ids is None) ^ (inputs_embeds is not None):
373 |         raise ValueError(
374 |             "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
375 |         )
376 | 
377 |     if (pixel_values is not None or pixel_values_videos is not None) and inputs_embeds is not None:
378 |         raise ValueError(
379 |             "You cannot specify both pixel_values/pixel_values_videos and inputs_embeds at the same time, and must specify either one"
380 |         )
381 | 
382 |     # Images are processed with Anyres
383 |     if pixel_values is not None:
384 |         image_num_patches = [
385 |             image_size_to_num_patches(
386 |                 image_size=imsize,
387 |                 grid_pinpoints=self.config.image_grid_pinpoints,
388 |                 patch_size=self.config.vision_config.image_size,
389 |             )
390 |             for imsize in image_sizes
391 |         ]
392 | 
393 |         # unpad extra patches and concatenate them
394 |         if pixel_values.dim() == 5:
395 |             _pixel_values_list = [
396 |                 pix_val[:num_patch] for pix_val, num_patch in zip(pixel_values, image_num_patches)
397 |             ]
398 |             # [batch_size*frames*num_patches, num_channels, height, width] where frames=1 for images
399 |             pixel_values = torch.cat(_pixel_values_list, dim=0)
400 |         elif pixel_values.dim() != 4:
401 |             raise ValueError(f"pixel_values of shape {pixel_values.shape}, expect to be of 4 or 5 dimensions")
402 | 
403 |         image_features = self.vision_tower(pixel_values, output_hidden_states=True)
404 |         selected_image_feature = image_features.hidden_states[vision_feature_layer]
405 | 
406 |         if vision_feature_select_strategy == "default":
407 |             selected_image_feature = selected_image_feature[:, 1:]
408 |         elif vision_feature_select_strategy == "full":
409 |             selected_image_feature = selected_image_feature
410 |         image_features = self.multi_modal_projector(selected_image_feature)
411 | 
412 |         image_features = torch.split(image_features, image_num_patches, dim=0)
413 |         image_features, feature_lens = self.pack_image_features(
414 |             image_features,
415 |             image_sizes,
416 |             image_newline=self.image_newline,
417 |             vision_aspect_ratio=vision_aspect_ratio,
418 |         )
419 | 
420 |     # Video are simply embedded and further pooled to decrease seq len
421 |     if pixel_values_videos is not None:
422 |         batch_size, frames, channels, height, width = pixel_values_videos.shape
423 |         pixel_values_videos = pixel_values_videos.view(batch_size * frames, channels, height, width)
424 |         # NOTE: Split video into chunks to avoid OOM due to large activations during visual forward
425 |         # chunk_size can be up to 128 or higher if you have flash attention
426 |         frame_chunk_size = getattr(self.config, 'longvideo_kwargs', {}).get('frame_chunk_size', 1000000000)
427 |         if batch_size * frames < frame_chunk_size:
428 |             video_features = self.vision_tower(pixel_values_videos, output_hidden_states=True)
429 |             selected_video_feature = video_features.hidden_states[vision_feature_layer]
430 |         else:
431 |             selected_video_feature = []
432 |             for i in range(0, batch_size*frames, frame_chunk_size):
433 |                 pixel_values_videos_chunk = pixel_values_videos[i:i+frame_chunk_size]
434 |                 selected_video_feature.append(
435 |                     self.vision_tower(pixel_values_videos_chunk, 
436 |                                       output_hidden_states=True
437 |                     ).hidden_states[vision_feature_layer]
438 |                 )
439 |             selected_video_feature = torch.cat(selected_video_feature)
440 | 
441 |         # Compression video tokens
442 |         input_ids, attention_mask, selected_video_feature, position_ids, cache_position, frames, keypatches_mask = self.compress_video_tokens(
443 |             input_ids=input_ids, 
444 |             attention_mask=attention_mask, 
445 |             selected_video_feature=selected_video_feature, 
446 |             position_ids=position_ids,
447 |             cache_position=cache_position,
448 |             labels=labels,
449 |         )
450 | 
451 |         if vision_feature_select_strategy == "default":
452 |             selected_video_feature = selected_video_feature[:, 1:]
453 |         elif vision_feature_select_strategy == "full":
454 |             selected_video_feature = selected_video_feature
455 |         video_features = self.multi_modal_projector(selected_video_feature)
456 | 
457 |         video_features = self.apply_pooling(video_features)
458 |         video_features = video_features.reshape(batch_size, frames * video_features.shape[1], -1)
459 |         image_newline = self.image_newline[None, None, :].repeat(batch_size, 1, 1).to(video_features.device)
460 |         video_features = torch.cat((video_features, image_newline), dim=1)
461 |         video_features = video_features.flatten(0, 1)
462 | 
463 |     if inputs_embeds is None:
464 |         inputs_embeds = self.get_input_embeddings()(input_ids)
465 | 
466 |     # Concatenate vision and language features
467 |     if pixel_values is not None:
468 |         special_image_mask = (
469 |             (input_ids == self.config.image_token_index)
470 |             .unsqueeze(-1)
471 |             .expand_as(inputs_embeds)
472 |             .to(inputs_embeds.device)
473 |         )
474 |         image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
475 |         inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
476 |     if pixel_values_videos is not None:
477 |         special_video_mask = (
478 |             (input_ids == self.config.video_token_index)
479 |             .unsqueeze(-1)
480 |             .expand_as(inputs_embeds)
481 |             .to(inputs_embeds.device)
482 |         )
483 |         video_features = video_features.to(inputs_embeds.device, inputs_embeds.dtype)
484 |         inputs_embeds = inputs_embeds.masked_scatter(special_video_mask, video_features)
485 |         if keypatches_mask is not None:
486 |             keypatches_mask = torch.zeros_like(input_ids).bool().masked_scatter(special_video_mask[:,:,0], keypatches_mask)
487 | 
488 |     if is_prefill and chunk_size is not None: # Chunked prefill stage
489 |         assert past_key_values is not None
490 |         kvcache_compression = getattr(past_key_values, 'kvcache_compression', False)
491 |         for seg_id, (s, e, dtype) in enumerate(modality_segments):
492 |             if dtype == 'text': # Prefill text without kvcache_compression
493 |                 past_key_values.kvcache_compression = False
494 |                 outputs = self.language_model(
495 |                     attention_mask=attention_mask[:,:e],
496 |                     position_ids=position_ids[:,s:e],
497 |                     past_key_values=past_key_values,
498 |                     inputs_embeds=inputs_embeds[:,s:e],
499 |                     use_cache=True,
500 |                     output_attentions=output_attentions,
501 |                     output_hidden_states=output_hidden_states,
502 |                     return_dict=return_dict,
503 |                     cache_position=cache_position[:e],
504 |                     logits_to_keep=logits_to_keep,
505 |                 )
506 |                 past_key_values = outputs['past_key_values']
507 |             elif dtype == 'video': # Prefill video may with kvcache_compression
508 |                 num_chunks = math.ceil((e - s) / chunk_size)
509 |                 past_key_values.kvcache_compression = kvcache_compression
510 |                 for idx in tqdm(range(num_chunks), total=num_chunks, desc='Prefilling chunk', disable=not DEBUG_MODE):
511 |                     ss = s + idx * chunk_size
512 |                     ee = min(s + (idx + 1) * chunk_size, e)
513 |                     if keypatches_mask is not None:
514 |                         past_key_values.keypatches_mask_chunk = keypatches_mask[0, ss:ee]
515 |                     position_ids_chunk, cache_position_chunk, attention_mask_chunk, inputs_embeds_chunk, prompt_length = self.forge_input_chunks(
516 |                         ss, ee, modality_segments, position_ids, cache_position, attention_mask, past_key_values, inputs_embeds
517 |                     )
518 |                     if hasattr(past_key_values, 'before_forward'):
519 |                         past_key_values.before_forward(prompt_length=prompt_length)
520 |                     outputs = self.language_model(
521 |                         attention_mask=attention_mask_chunk,
522 |                         position_ids=position_ids_chunk,
523 |                         past_key_values=past_key_values,
524 |                         inputs_embeds=inputs_embeds_chunk,
525 |                         use_cache=True,
526 |                         output_attentions=output_attentions,
527 |                         output_hidden_states=output_hidden_states,
528 |                         return_dict=return_dict,
529 |                         cache_position=cache_position_chunk,
530 |                         logits_to_keep=logits_to_keep,
531 |                     )
532 |                     past_key_values = outputs['past_key_values']
533 |                     if hasattr(past_key_values, 'after_forward'):
534 |                         past_key_values.after_forward()
535 |                 past_key_values.keypatches_mask_chunk = None
536 |                 past_key_values.kvcache_compression = False # Turned off for decoding
537 |             else:
538 |                 raise ValueError
539 |     else: # Decode / Standard prefill stage
540 |         outputs = self.language_model(
541 |             attention_mask=attention_mask,
542 |             position_ids=position_ids,
543 |             past_key_values=past_key_values,
544 |             inputs_embeds=inputs_embeds,
545 |             use_cache=use_cache,
546 |             output_attentions=output_attentions,
547 |             output_hidden_states=output_hidden_states,
548 |             return_dict=return_dict,
549 |             cache_position=cache_position,
550 |             logits_to_keep=logits_to_keep,
551 |         )
552 | 
553 |     logits = outputs[0]
554 | 
555 |     loss = None
556 |     if labels is not None:
557 |         # Shift so that tokens < n predict n
558 |         if attention_mask is not None:
559 |             shift_attention_mask = attention_mask[..., 1:]
560 |             shift_logits = logits[..., :-1, :][shift_attention_mask.to(logits.device) != 0].contiguous()
561 |             shift_labels = labels[..., 1:][shift_attention_mask.to(labels.device) != 0].contiguous()
562 |         else:
563 |             shift_logits = logits[..., :-1, :].contiguous()
564 |             shift_labels = labels[..., 1:].contiguous()
565 |         # Flatten the tokens
566 |         loss_fct = nn.CrossEntropyLoss()
567 |         loss = loss_fct(
568 |             shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1).to(shift_logits.device)
569 |         )
570 | 
571 |     if not return_dict:
572 |         output = (logits,) + outputs[1:]
573 |         return (loss,) + output if loss is not None else output
574 | 
575 |     return LlavaOnevisionCausalLMOutputWithPast(
576 |         loss=loss,
577 |         logits=logits,
578 |         past_key_values=outputs.past_key_values,
579 |         hidden_states=outputs.hidden_states,
580 |         attentions=outputs.attentions,
581 |         image_hidden_states=image_features if pixel_values is not None else None,
582 |         video_hidden_states=video_features if pixel_values_videos is not None else None,
583 |     )
584 | 


--------------------------------------------------------------------------------
/retake/longvideo_cache.py:
--------------------------------------------------------------------------------
  1 | import math
  2 | from typing import Optional, List, Tuple, Dict, Any
  3 | 
  4 | import torch
  5 | import torch.nn as nn
  6 | import torch.nn.functional as F
  7 | 
  8 | from transformers.cache_utils import Cache, StaticCache, DynamicCache
  9 | from transformers.utils import logging
 10 | 
 11 | 
 12 | logger = logging.get_logger(__name__)
 13 | 
 14 | 
 15 | # Copied from transformers.models.llama.modeling_llama.repeat_kv
 16 | def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
 17 |     """
 18 |     This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
 19 |     num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
 20 |     """
 21 |     batch, num_key_value_heads, slen, head_dim = hidden_states.shape
 22 |     if n_rep == 1:
 23 |         return hidden_states
 24 |     hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
 25 |     return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
 26 | 
 27 | 
 28 | # Copied from transformers.models.llama.modeling_llama.rotate_half
 29 | def rotate_half(x):
 30 |     """Rotates half the hidden dims of the input."""
 31 |     x1 = x[..., : x.shape[-1] // 2]
 32 |     x2 = x[..., x.shape[-1] // 2 :]
 33 |     return torch.cat((-x2, x1), dim=-1)
 34 | 
 35 | 
 36 | def apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section, unsqueeze_dim=1, reverse=False, attention_scaling=1):
 37 |     """Applies Rotary Position Embedding with Multimodal Sections to the query and key tensors (https://qwenlm.github.io/blog/qwen2-vl/).
 38 | 
 39 |     Explanation:
 40 |         Multimodal 3D rotary position embedding is an extension to 1D rotary position embedding. The input embedding
 41 |         sequence contains vision (images / videos) embedding and text embedding or just contains text embedding. For
 42 |         vision embedding part, we apply rotary position embedding on temporal, height and width dimension seperately.
 43 |         Here we split the channel dimension to 3 chunks for the temporal, height and width rotary position embedding.
 44 |         For text embedding part, we just apply 1D rotary position embedding. The three rotary position index (temporal,
 45 |         height and width) of text embedding is always the same, so the text embedding rotary position embedding has no
 46 |         difference with modern LLMs.
 47 | 
 48 |     Args:
 49 |         q (`torch.Tensor`): The query tensor.
 50 |         k (`torch.Tensor`): The key tensor.
 51 |         cos (`torch.Tensor`): The cosine part of the rotary embedding.
 52 |         sin (`torch.Tensor`): The sine part of the rotary embedding.
 53 |         position_ids (`torch.Tensor`):
 54 |             The position indices of the tokens corresponding to the query and key tensors. For example, this can be
 55 |             used to pass offsetted position ids when working with a KV-cache.
 56 |         mrope_section(`List(int)`):
 57 |             Multimodal rope section is for channel dimension of temporal, height and width in rope calculation.
 58 |         unsqueeze_dim (`int`, *optional*, defaults to 1):
 59 |             The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
 60 |             sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
 61 |             that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
 62 |             k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
 63 |             cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
 64 |             the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
 65 |     Returns:
 66 |         `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
 67 |     """
 68 |     mrope_section = mrope_section * 2
 69 |     cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
 70 |         unsqueeze_dim
 71 |     )
 72 |     sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
 73 |         unsqueeze_dim
 74 |     )
 75 | 
 76 |     if reverse: # Rotate towards the opposite direction
 77 |         q_embed = ((q * cos) - (rotate_half(q) * sin)) / attention_scaling**2
 78 |         k_embed = ((k * cos) - (rotate_half(k) * sin)) / attention_scaling**2
 79 |     else:
 80 |         q_embed = (q * cos) + (rotate_half(q) * sin) if q is not None else None
 81 |         k_embed = (k * cos) + (rotate_half(k) * sin) if k is not None else None
 82 | 
 83 |     return q_embed, k_embed
 84 | 
 85 | 
 86 | def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1, reverse=False, attention_scaling=1):
 87 |     """Applies Rotary Position Embedding to the query and key tensors.
 88 | 
 89 |     Args:
 90 |         q (`torch.Tensor`): The query tensor.
 91 |         k (`torch.Tensor`): The key tensor.
 92 |         cos (`torch.Tensor`): The cosine part of the rotary embedding.
 93 |         sin (`torch.Tensor`): The sine part of the rotary embedding.
 94 |         position_ids (`torch.Tensor`, *optional*):
 95 |             Deprecated and unused.
 96 |         unsqueeze_dim (`int`, *optional*, defaults to 1):
 97 |             The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
 98 |             sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
 99 |             that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
100 |             k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
101 |             cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
102 |             the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
103 |     Returns:
104 |         `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
105 |     """
106 |     cos = cos.unsqueeze(unsqueeze_dim)
107 |     sin = sin.unsqueeze(unsqueeze_dim)
108 | 
109 |     if reverse: # Rotate towards the opposite direction
110 |         q_embed = ((q * cos) - (rotate_half(q) * sin))  / attention_scaling**2
111 |         k_embed = ((k * cos) - (rotate_half(k) * sin))  / attention_scaling**2
112 |     else:
113 |         q_embed = (q * cos) + (rotate_half(q) * sin) if q is not None else None
114 |         k_embed = (k * cos) + (rotate_half(k) * sin) if k is not None else None
115 | 
116 |     return q_embed, k_embed
117 | 
118 | 
119 | class PivotKVCache(DynamicCache):
120 |     def __init__(self, config) -> None:
121 |         super().__init__()
122 |         self.config = config
123 |         if hasattr(config, 'text_config'):
124 |             # LLaVA-OneVision
125 |             llm_config = config.text_config
126 |         else:
127 |             # QWen2VL
128 |             llm_config = config
129 |         self.hidden_size = llm_config.hidden_size
130 |         self.num_hidden_layers = llm_config.num_hidden_layers
131 |         self.num_heads = llm_config.num_attention_heads
132 |         self.head_dim = self.hidden_size // self.num_heads
133 |         self.num_key_value_heads = llm_config.num_key_value_heads
134 |         self.num_key_value_groups = self.num_heads // self.num_key_value_heads
135 | 
136 |         # Patch longvideo kwargs
137 |         kv_compression_kwargs = config.longvideo_kwargs['kvcache_compression_kwargs']
138 |         self.kvcache_compression = True
139 |         self.kv_compression_kwargs = kv_compression_kwargs
140 |         self.compression_ratio = kv_compression_kwargs['compression_ratio']
141 |         self.compression_method = kv_compression_kwargs['compression_method']
142 |         self.pos_embed_reforge = kv_compression_kwargs.get('pos_embed_reforge', False)
143 |         self.position_cache: List[torch.Tensor] = []
144 |         self.num_evicted_tokens: List[int] = []
145 | 
146 |     def before_forward(self, **kwargs):
147 |         pass
148 | 
149 |     def after_forward(self, **kwargs):
150 |         pass
151 | 
152 |     def update_num_evicted_tokens(
153 |         self,
154 |         num_tokens: int,
155 |         layer_idx: int,
156 |     ) -> Tuple[torch.Tensor]:
157 |         """
158 |         Updates the `num_evicted_tokens` with an increment `num_tokens` at layer `layer_idx`.
159 |         If `num_tokens` = 0, this function get the number of evicted tokens in layer `layer_idx`.
160 | 
161 |         Parameters:
162 |             num_tokens (`int`):
163 |                 The number of evicted tokens.
164 |             layer_idx (`int`):
165 |                 The index of the layer to cache the states for.
166 | 
167 |         Return:
168 |             A tuple containing the updated key and value states.
169 |         """
170 |         # Update the cache
171 |         if len(self.num_evicted_tokens) <= layer_idx:
172 |             # There may be skipped layers without eviction, fill them with 0
173 |             for _ in range(len(self.num_evicted_tokens), layer_idx):
174 |                 self.num_evicted_tokens.append(0)
175 |             self.num_evicted_tokens.append(num_tokens)
176 |         else:
177 |             self.num_evicted_tokens[layer_idx] += num_tokens
178 | 
179 |         return self.num_evicted_tokens[layer_idx]
180 | 
181 |     def update_position_ids(
182 |         self,
183 |         position_ids: torch.Tensor,
184 |         layer_idx: int,
185 |     ) -> Tuple[torch.Tensor]:
186 |         """
187 |         Updates the cache with the new `position_ids` for the layer `layer_idx`.
188 | 
189 |         Parameters:
190 |             position_ids (`torch.Tensor`):
191 |                 The new key states to cache.
192 |             layer_idx (`int`):
193 |                 The index of the layer to cache the states for.
194 | 
195 |         Return:
196 |             A tuple containing the updated key and value states.
197 |         """
198 |         # Update the cache
199 |         if len(self.position_cache) <= layer_idx:
200 |             # There may be skipped layers, fill them with empty lists
201 |             for _ in range(len(self.position_cache), layer_idx):
202 |                 self.position_cache.append([])
203 |             self.position_cache.append(position_ids)
204 |         elif len(self.position_cache[layer_idx]) == 0:  # fills previously skipped layers; checking for tensor causes errors
205 |             self.position_cache[layer_idx] = position_ids
206 |         else:
207 |             self.position_cache[layer_idx] = torch.cat([self.position_cache[layer_idx], position_ids], dim=-1)
208 | 
209 |         return self.position_cache[layer_idx]
210 | 
211 |     def get_prev_temporal_idx(self, layer_idx: int) -> torch.LongTensor:
212 |         if len(self.position_cache) <= layer_idx:
213 |             return -1
214 |         cache_layer = self.position_cache[layer_idx]
215 |         return cache_layer[0,0,-1] if cache_layer.ndim == 3 else cache_layer[0,-1]
216 | 
217 |     def update(
218 |         self,
219 |         key_states: torch.Tensor,
220 |         value_states: torch.Tensor,
221 |         layer_idx: int,
222 |         cache_kwargs: Optional[Dict[str, Any]] = None,
223 |     ) -> Tuple[torch.Tensor, torch.Tensor]:
224 |         """
225 |         Input
226 |             query_states: [bsz, num_heads, q_len, d]
227 |             key_states: [bsz, num_key_value_heads, q_len, d]
228 |             position_ids: [3, bsz, q_len] / [bsz, q_len]
229 |         Output
230 |             key_states_output: for calculating self attention
231 |             value_states_output: for calculating self attention
232 |         """
233 |         logger.warning_once("Enable PivotKVCache compression: length after compression %.2f" % (self.compression_ratio))
234 | 
235 |         position_ids = cache_kwargs.pop('position_ids', None)
236 | 
237 |         # 1) Hidden states for the next layer remains uncompressed in current chunked prefill iter
238 |         key_states_output, value_states_output = super().update(key_states, value_states, layer_idx, cache_kwargs)
239 | 
240 |         if self.kvcache_compression: # when prefilling visual tokens
241 |             query_states = cache_kwargs.pop('query_states')
242 |             rotary_emb_fn = cache_kwargs.pop('rotary_emb')
243 |             mrope_section = cache_kwargs.pop('mrope_section', None) # For MRope only
244 |             bsz, num_heads, q_len, head_dim = query_states.shape
245 |             num_key_value_heads, k_len = key_states.shape[1:3]
246 |             assert bsz == 1
247 | 
248 |             if self.pos_embed_reforge:
249 |                 cos, sin = rotary_emb_fn(value_states, position_ids)
250 |                 if mrope_section:
251 |                     query_states, key_states = apply_multimodal_rotary_pos_emb(
252 |                         query_states, key_states, cos, sin, mrope_section, 
253 |                         reverse=True, attention_scaling=rotary_emb_fn.attention_scaling
254 |                     )
255 |                 else:
256 |                     query_states, key_states = apply_rotary_pos_emb(
257 |                         query_states, key_states, cos, sin,
258 |                         reverse=True, attention_scaling=rotary_emb_fn.attention_scaling
259 |                     )
260 |             key_states_repeated = repeat_kv(key_states, self.num_heads // self.num_key_value_heads)
261 | 
262 |             # 2) Evit KV Cache based on query_states
263 |             keep_len = max(1, int(self.compression_ratio * q_len)) # Evict new tokens only
264 |             attn_weights = torch.matmul(query_states, key_states_repeated.transpose(2, 3)) / math.sqrt(self.head_dim)
265 |             attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(
266 |                 query_states.dtype
267 |             ).detach() # [bsz, self.num_heads, q_len, q_len(k)]
268 |             attn_weights = attn_weights[0].sum(1) # [self.num_heads, q_len(k)]
269 |             attn_weights = attn_weights.reshape(self.num_key_value_heads, -1, q_len).mean(1) # [num_key_value_heads, q_len(k)]
270 |             attn_weights = attn_weights.mean(0) # [q_len(k)]
271 | 
272 |             if getattr(self, "keypatches_mask_chunk", None) is not None:
273 |                 keypatches_mask_chunk = self.keypatches_mask_chunk
274 |                 attn_weights.masked_fill_(keypatches_mask_chunk, 1.) # Select key patches first
275 | 
276 |             _, keep_indices = attn_weights.topk(keep_len)
277 |             keep_indices = keep_indices.sort().values # [keep_len]
278 |             keep_indices_kv = keep_indices[None,None,:,None].repeat(bsz, self.num_key_value_heads, 1, self.head_dim) # [bsz, num_key_value_heads, keep_len, head_dim]
279 |             compressed_key_states = torch.gather(input=key_states, dim=2, index=keep_indices_kv)
280 |             compressed_value_states = torch.gather(input=value_states, dim=2, index=keep_indices_kv) # [bsz, num_k_heads, keep_len, head_dim]
281 | 
282 |             # Calculate new postional ids
283 |             if mrope_section:
284 |                 keep_indices_ids = keep_indices[None,None,:].repeat(3, bsz, 1) # [3, bsz, keep_len]
285 |                 compressed_position_ids = torch.gather(input=position_ids, dim=2, index=keep_indices_ids) # [3, bsz, keep_len]
286 |             else:
287 |                 keep_indices_ids = keep_indices[None,:].repeat(bsz, 1) # [bsz, keep_len]
288 |                 compressed_position_ids = torch.gather(input=position_ids, dim=1, index=keep_indices_ids) # [bsz, keep_len]
289 | 
290 |             if self.pos_embed_reforge:
291 |                 assert bsz == 1
292 | 
293 |                 min_temp_id = compressed_position_ids[0].min()
294 |                 comp_ratio = keep_len / k_len # NOTE: avoid truncating issues when calculating keep_len
295 |                 compressed_position_ids[0] = min_temp_id + ((compressed_position_ids[0] - min_temp_id) * comp_ratio).long()
296 | 
297 |                 # Add new rotary embedding
298 |                 cos, sin = rotary_emb_fn(compressed_value_states, compressed_position_ids)
299 |                 if mrope_section:
300 |                     _, compressed_key_states = apply_multimodal_rotary_pos_emb(
301 |                         None, compressed_key_states, cos, sin, mrope_section
302 |                     )
303 |                 else:
304 |                     _, compressed_key_states = apply_rotary_pos_emb(
305 |                         None, compressed_key_states, cos, sin,
306 |                     )
307 | 
308 |             if self.pos_embed_reforge:
309 |                 _ = self.update_position_ids(compressed_position_ids, layer_idx)
310 |             _ = self.update_num_evicted_tokens(k_len - keep_len, layer_idx)
311 | 
312 |             # 3) Update KVCache
313 |             self.key_cache[layer_idx] = torch.cat([
314 |                 key_states_output[...,:-q_len,:], compressed_key_states
315 |             ], dim=2)
316 |             self.value_cache[layer_idx] = torch.cat([
317 |                 value_states_output[...,:-q_len,:], compressed_value_states
318 |             ], dim=2)
319 |         else: # when prefilling textual tokens or decoding / kvcache compression disabled
320 |             if self.pos_embed_reforge:
321 |                 _ = self.update_position_ids(position_ids, layer_idx)
322 | 
323 |         return key_states_output, value_states_output
324 | 
325 | 
326 | def build_kvcache(config):
327 |     if getattr(config, "longvideo_kwargs", None) is None or not config.longvideo_kwargs.get('kvcache_compression', False):
328 |         return DynamicCache()
329 |     else:
330 |         compression_method = config.longvideo_kwargs['kvcache_compression_kwargs']['compression_method']
331 |         if compression_method.lower() == 'pivotkv':
332 |             return PivotKVCache(config)
333 |         else:
334 |             raise NotImplementedError
335 | 


--------------------------------------------------------------------------------
/retake/monkeypatch.py:
--------------------------------------------------------------------------------
 1 | import transformers
 2 | 
 3 | from retake.qwen2_vl import (
 4 |     retake_Qwen2VLAttention_forward,
 5 |     retake_Qwen2VLSdpaAttention_forward,
 6 |     retake_Qwen2VLFlashAttention2_forward,
 7 |     retake_Qwen2VLForConditionalGeneration_compress_video_tokens,
 8 |     retake_Qwen2VLForConditionalGeneration_segment_input_ids,
 9 |     retake_Qwen2VLForConditionalGeneration_get_chunk_size,
10 |     retake_Qwen2VLForConditionalGeneration_forge_input_chunks,
11 |     retake_Qwen2VLForConditionalGeneration_forward,
12 | )
13 | from retake.llava_onevision import (
14 |     retake_Qwen2Attention_init,
15 |     retake_Qwen2Attention_forward,
16 |     retake_LlavaOnevisionForConditionalGeneration_get_chunk_size,
17 |     retake_LlavaOnevisionForConditionalGeneration_segment_input_ids,
18 |     retake_LlavaOnevisionForConditionalGeneration_compress_video_tokens,
19 |     retake_LlavaOnevisionForConditionalGeneration_forge_input_chunks,
20 |     retake_LlavaOnevisionForConditionalGeneration_forward,
21 | )
22 | 
23 | 
24 | def patch_qwen2vl_config(config, exp_configs):
25 |     # Rope Scaling
26 |     if 'scaling_factor' in exp_configs:
27 |         config.rope_scaling.pop('type')
28 |         config.rope_scaling['rope_type'] = 'yarn'
29 |         config.rope_scaling['factor'] = exp_configs['scaling_factor']
30 |         config.rope_scaling['beta_fast'] = 32.0
31 |         config.rope_scaling['beta_slow'] = 1.0
32 |     # ReTaKe
33 |     config.longvideo_kwargs = exp_configs.get('longvideo_kwargs', {})
34 |     return config
35 | 
36 | 
37 | def patch_llava_onevision_config(config, exp_configs):
38 |     # Rope Scaling
39 |     if 'scaling_factor' in exp_configs:
40 |         config.text_config.rope_scaling = {
41 |             'rope_type': 'yarn',
42 |             'factor': exp_configs['scaling_factor'],
43 |             'beta_fast': 32.0,
44 |             'beta_slow': 1.0,
45 |         }
46 |     # ReTaKe
47 |     config.longvideo_kwargs = exp_configs.get('longvideo_kwargs', {})
48 |     return config
49 | 
50 | 
51 | def patch_qwen2vl(method):
52 | 
53 |     if method == "retake":
54 |         print("Using ReTaKe for Qwen2VLForConditionalGeneration!")
55 |         transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLAttention.forward = retake_Qwen2VLAttention_forward
56 |         transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLSdpaAttention.forward = retake_Qwen2VLSdpaAttention_forward
57 |         transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLFlashAttention2.forward = retake_Qwen2VLFlashAttention2_forward
58 |         transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration.compress_video_tokens = retake_Qwen2VLForConditionalGeneration_compress_video_tokens
59 |         transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration.segment_input_ids = retake_Qwen2VLForConditionalGeneration_segment_input_ids
60 |         transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration.get_chunk_size = retake_Qwen2VLForConditionalGeneration_get_chunk_size
61 |         transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration.forge_input_chunks = retake_Qwen2VLForConditionalGeneration_forge_input_chunks
62 |         transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration.forward = retake_Qwen2VLForConditionalGeneration_forward
63 |     else:
64 |         raise NotImplementedError
65 | 
66 | 
67 | def patch_llava_onevision(method):
68 | 
69 |     if method == "retake":
70 |         print("Using ReTaKe for LlavaOnevisionForConditionalGeneration!")
71 |         transformers.models.qwen2.modeling_qwen2.Qwen2Attention.__init__ = retake_Qwen2Attention_init
72 |         transformers.models.qwen2.modeling_qwen2.Qwen2Attention.forward = retake_Qwen2Attention_forward
73 |         transformers.models.llava_onevision.modeling_llava_onevision.LlavaOnevisionForConditionalGeneration.get_chunk_size = retake_LlavaOnevisionForConditionalGeneration_get_chunk_size
74 |         transformers.models.llava_onevision.modeling_llava_onevision.LlavaOnevisionForConditionalGeneration.segment_input_ids = retake_LlavaOnevisionForConditionalGeneration_segment_input_ids
75 |         transformers.models.llava_onevision.modeling_llava_onevision.LlavaOnevisionForConditionalGeneration.compress_video_tokens = retake_LlavaOnevisionForConditionalGeneration_compress_video_tokens
76 |         transformers.models.llava_onevision.modeling_llava_onevision.LlavaOnevisionForConditionalGeneration.forge_input_chunks = retake_LlavaOnevisionForConditionalGeneration_forge_input_chunks
77 |         transformers.models.llava_onevision.modeling_llava_onevision.LlavaOnevisionForConditionalGeneration.forward = retake_LlavaOnevisionForConditionalGeneration_forward
78 |     else:
79 |         raise NotImplementedError
80 | 


--------------------------------------------------------------------------------
/retake/qwen2_vl.py:
--------------------------------------------------------------------------------
  1 | import math
  2 | from tqdm import tqdm
  3 | from dataclasses import dataclass
  4 | from typing import Any, Callable, Dict, List, Optional, Tuple, Union
  5 | 
  6 | import torch
  7 | import torch.nn as nn
  8 | import torch.nn.functional as F
  9 | import torch.utils.checkpoint
 10 | from torch.nn import CrossEntropyLoss
 11 | import numpy as np
 12 | 
 13 | from transformers.cache_utils import Cache
 14 | from transformers.utils import (
 15 |     is_flash_attn_2_available,
 16 |     logging,
 17 | )
 18 | from transformers import Qwen2VLConfig
 19 | from transformers.models.qwen2_vl.modeling_qwen2_vl import (
 20 |     Qwen2VLSdpaAttention,
 21 |     Qwen2VLCausalLMOutputWithPast,
 22 |     Qwen2VLForConditionalGeneration,
 23 |     repeat_kv,
 24 |     apply_multimodal_rotary_pos_emb,
 25 | )
 26 | 
 27 | from .visual_compression import *
 28 | from .longvideo_cache import *
 29 | 
 30 | if is_flash_attn_2_available():
 31 |     from flash_attn import flash_attn_varlen_func
 32 | 
 33 |     from transformers.modeling_flash_attention_utils import _flash_attention_forward
 34 | else:
 35 |     flash_attn_varlen_func = None
 36 | 
 37 | DEBUG_MODE = False
 38 | 
 39 | logger = logging.get_logger(__name__)
 40 | 
 41 | 
 42 | def retake_Qwen2VLAttention_forward(
 43 |     self,
 44 |     hidden_states: torch.Tensor,
 45 |     attention_mask: Optional[torch.Tensor] = None,
 46 |     position_ids: Optional[torch.LongTensor] = None,
 47 |     past_key_value: Optional[Cache] = None,
 48 |     output_attentions: bool = False,
 49 |     use_cache: bool = False,
 50 |     cache_position: Optional[torch.LongTensor] = None,
 51 |     position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # will become mandatory in v4.46
 52 | ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
 53 |     bsz, q_len, _ = hidden_states.size()
 54 | 
 55 |     query_states = self.q_proj(hidden_states)
 56 |     key_states = self.k_proj(hidden_states)
 57 |     value_states = self.v_proj(hidden_states)
 58 | 
 59 |     query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
 60 |     key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
 61 |     value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
 62 | 
 63 |     kv_seq_len = key_states.shape[-2]
 64 |     if past_key_value is not None:
 65 |         kv_seq_len += cache_position[0] + 1
 66 | 
 67 |     # Update position_ids if positional embeddings are reforged
 68 |     if past_key_value is not None and getattr(past_key_value, "pos_embed_reforge", False):
 69 |         prev_tempo_idx = past_key_value.get_prev_temporal_idx(self.layer_idx)
 70 |         if prev_tempo_idx + 1 != position_ids[0,0,0]:
 71 |             assert bsz == 1
 72 |             # print("Discontinuous positional ids %d + 1 != %d at layer %d" % (prev_tempo_idx,  position_ids[0,0,0], self.layer_idx))
 73 |             position_ids[0,0,:] += prev_tempo_idx + 1 - position_ids[0,0,0]
 74 | 
 75 |     # NOTE: Compute position_ids internally to support positional id reforge from KV compression
 76 |     cos, sin = self.rotary_emb(value_states, position_ids)
 77 |     query_states, key_states = apply_multimodal_rotary_pos_emb(
 78 |         query_states, key_states, cos, sin, self.rope_scaling["mrope_section"]
 79 |     )
 80 | 
 81 |     if past_key_value is not None:
 82 |         cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}  # Specific to RoPE models
 83 |         # Specific to KVCache compression methods
 84 |         cache_kwargs.update({"query_states": query_states, "position_ids": position_ids, 
 85 |                              "rotary_emb": self.rotary_emb, "mrope_section": self.rope_scaling["mrope_section"]})
 86 |         key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
 87 | 
 88 |     # repeat k/v heads if n_kv_heads < n_heads
 89 |     key_states = repeat_kv(key_states, self.num_key_value_groups)
 90 |     value_states = repeat_kv(value_states, self.num_key_value_groups)
 91 | 
 92 |     attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
 93 | 
 94 |     if attention_mask is not None:  # no matter the length, we just slice it
 95 |         causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
 96 |         attn_weights = attn_weights + causal_mask
 97 | 
 98 |     # Fix precision issues in Qwen2-VL float16 inference
 99 |     # Replace inf values with zeros in attention weights to prevent NaN propagation
100 |     if query_states.dtype == torch.float16:
101 |         attn_weights = torch.where(torch.isinf(attn_weights), torch.zeros_like(attn_weights), attn_weights)
102 | 
103 |     # upcast attention to fp32
104 |     attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
105 |     attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
106 |     attn_output = torch.matmul(attn_weights, value_states)
107 | 
108 |     if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
109 |         raise ValueError(
110 |             f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
111 |             f" {attn_output.size()}"
112 |         )
113 | 
114 |     attn_output = attn_output.transpose(1, 2).contiguous()
115 |     attn_output = attn_output.reshape(bsz, q_len, -1)
116 | 
117 |     attn_output = self.o_proj(attn_output)
118 | 
119 |     if not output_attentions:
120 |         attn_weights = None
121 | 
122 |     return attn_output, attn_weights, past_key_value
123 | 
124 | 
125 | def retake_Qwen2VLSdpaAttention_forward(
126 |     self,
127 |     hidden_states: torch.Tensor,
128 |     attention_mask: Optional[torch.Tensor] = None,
129 |     position_ids: Optional[torch.LongTensor] = None,
130 |     past_key_value: Optional[Cache] = None,
131 |     output_attentions: bool = False,
132 |     use_cache: bool = False,
133 |     cache_position: Optional[torch.LongTensor] = None,
134 |     position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # will become mandatory in v4.46
135 | ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
136 |     if output_attentions:
137 |         # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
138 |         logger.warning_once(
139 |             "Qwen2VLModel is using Qwen2VLSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
140 |             'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
141 |         )
142 |         return super(Qwen2VLSdpaAttention, self).forward(
143 |             hidden_states=hidden_states,
144 |             attention_mask=attention_mask,
145 |             position_ids=position_ids,
146 |             past_key_value=past_key_value,
147 |             output_attentions=output_attentions,
148 |             use_cache=use_cache,
149 |             cache_position=cache_position,
150 |         )
151 | 
152 |     bsz, q_len, _ = hidden_states.size()
153 | 
154 |     query_states = self.q_proj(hidden_states)
155 |     key_states = self.k_proj(hidden_states)
156 |     value_states = self.v_proj(hidden_states)
157 | 
158 |     query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
159 |     key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
160 |     value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
161 | 
162 |     kv_seq_len = key_states.shape[-2]
163 |     if past_key_value is not None:
164 |         kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
165 | 
166 |     # Update position_ids if positional embeddings are reforged
167 |     if past_key_value is not None and getattr(past_key_value, "pos_embed_reforge", False):
168 |         prev_tempo_idx = past_key_value.get_prev_temporal_idx(self.layer_idx)
169 |         if prev_tempo_idx + 1 != position_ids[0,0,0]:
170 |             assert bsz == 1
171 |             # print("Discontinuous positional ids %d + 1 != %d at layer %d" % (prev_tempo_idx,  position_ids[0,0,0], self.layer_idx))
172 |             position_ids[0,0,:] += prev_tempo_idx + 1 - position_ids[0,0,0]
173 | 
174 |     # NOTE: Compute position_ids internally to support positional id reforge from KV compression
175 |     cos, sin = self.rotary_emb(value_states, position_ids)
176 | 
177 |     query_states, key_states = apply_multimodal_rotary_pos_emb(
178 |         query_states, key_states, cos, sin, self.rope_scaling["mrope_section"]
179 |     )
180 | 
181 |     if past_key_value is not None:
182 |         cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}  # Specific to RoPE models
183 |         # Specific to KVCache compression methods
184 |         cache_kwargs.update({"query_states": query_states, "position_ids": position_ids, 
185 |                              "rotary_emb": self.rotary_emb, "mrope_section": self.rope_scaling["mrope_section"]})
186 |         key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
187 | 
188 |     key_states = repeat_kv(key_states, self.num_key_value_groups)
189 |     value_states = repeat_kv(value_states, self.num_key_value_groups)
190 | 
191 |     causal_mask = attention_mask
192 |     if attention_mask is not None:  # no matter the length, we just slice it
193 |         causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
194 | 
195 |     # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
196 |     # Reference: https://github.com/pytorch/pytorch/issues/112577.
197 |     if query_states.device.type == "cuda" and attention_mask is not None:
198 |         query_states = query_states.contiguous()
199 |         key_states = key_states.contiguous()
200 |         value_states = value_states.contiguous()
201 | 
202 |     # We dispatch to SDPA's Flash Attention or Efficient kernels via this `is_causal` if statement instead of an inline conditional assignment
203 |     # in SDPA to support both torch.compile's dynamic shapes and full graph options. An inline conditional prevents dynamic shapes from compiling.
204 |     # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
205 |     is_causal = True if causal_mask is None and q_len > 1 else False
206 | 
207 |     attn_output = torch.nn.functional.scaled_dot_product_attention(
208 |         query_states,
209 |         key_states,
210 |         value_states,
211 |         attn_mask=causal_mask,
212 |         dropout_p=self.attention_dropout if self.training else 0.0,
213 |         is_causal=is_causal,
214 |     )
215 | 
216 |     attn_output = attn_output.transpose(1, 2).contiguous()
217 |     attn_output = attn_output.view(bsz, q_len, self.hidden_size)
218 | 
219 |     attn_output = self.o_proj(attn_output)
220 | 
221 |     return attn_output, None, past_key_value
222 | 
223 | 
224 | def retake_Qwen2VLFlashAttention2_forward(
225 |     self,
226 |     hidden_states: torch.Tensor,
227 |     attention_mask: Optional[torch.Tensor] = None,
228 |     position_ids: Optional[torch.LongTensor] = None,
229 |     past_key_value: Optional[Cache] = None,
230 |     output_attentions: bool = False,
231 |     use_cache: bool = False,
232 |     cache_position: Optional[torch.LongTensor] = None,
233 |     position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # will become mandatory in v4.46
234 | ):
235 |     bsz, q_len, _ = hidden_states.size()
236 | 
237 |     query_states = self.q_proj(hidden_states)
238 |     key_states = self.k_proj(hidden_states)
239 |     value_states = self.v_proj(hidden_states)
240 | 
241 |     query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
242 |     key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
243 |     value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
244 | 
245 |     kv_seq_len = key_states.shape[-2]
246 |     if past_key_value is not None:
247 |         if self.layer_idx is None:
248 |             raise ValueError(
249 |                 f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
250 |                 "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
251 |                 "with a layer index."
252 |             )
253 |         kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
254 | 
255 |     # Update position_ids if positional embeddings are reforged
256 |     if past_key_value is not None and getattr(past_key_value, "pos_embed_reforge", False):
257 |         prev_tempo_idx = past_key_value.get_prev_temporal_idx(self.layer_idx)
258 |         if prev_tempo_idx + 1 != position_ids[0,0,0]:
259 |             assert bsz == 1
260 |             # print("Discontinuous positional ids %d + 1 != %d at layer %d" % (prev_tempo_idx,  position_ids[0,0,0], self.layer_idx))
261 |             position_ids[0,0,:] += prev_tempo_idx + 1 - position_ids[0,0,0]
262 | 
263 |     # NOTE: Compute position_ids internally to support positional id reforge from KV compression
264 |     cos, sin = self.rotary_emb(value_states, position_ids)
265 | 
266 |     query_states, key_states = apply_multimodal_rotary_pos_emb(
267 |         query_states, key_states, cos, sin, self.rope_scaling["mrope_section"]
268 |     )
269 | 
270 |     if past_key_value is not None:
271 |         # Activate slicing cache only if the config has a value `sliding_windows` attribute
272 |         cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
273 |         if (
274 |             self.config.use_sliding_window
275 |             and getattr(self.config, "sliding_window", None) is not None
276 |             and kv_seq_len > self.config.sliding_window
277 |             and cache_has_contents
278 |         ):
279 |             slicing_tokens = 1 - self.config.sliding_window
280 | 
281 |             past_key = past_key_value[self.layer_idx][0]
282 |             past_value = past_key_value[self.layer_idx][1]
283 | 
284 |             past_key = past_key[:, :, slicing_tokens:, :].contiguous()
285 |             past_value = past_value[:, :, slicing_tokens:, :].contiguous()
286 | 
287 |             if past_key.shape[-2] != self.config.sliding_window - 1:
288 |                 raise ValueError(
289 |                     f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got"
290 |                     f" {past_key.shape}"
291 |                 )
292 | 
293 |             if attention_mask is not None:
294 |                 attention_mask = attention_mask[:, slicing_tokens:]
295 |                 attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
296 | 
297 |         cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}  # Specific to RoPE models
298 |         # Specific to KVCache compression methods
299 |         cache_kwargs.update({"query_states": query_states, "position_ids": position_ids, 
300 |                              "rotary_emb": self.rotary_emb, "mrope_section": self.rope_scaling["mrope_section"]})
301 |         key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
302 | 
303 |     # repeat k/v heads if n_kv_heads < n_heads
304 |     key_states = repeat_kv(key_states, self.num_key_value_groups)
305 |     value_states = repeat_kv(value_states, self.num_key_value_groups)
306 |     dropout_rate = 0.0 if not self.training else self.attention_dropout
307 | 
308 |     # In PEFT, usually we cast the layer norms in float32 for training stability reasons
309 |     # therefore the input hidden states gets silently casted in float32. Hence, we need
310 |     # cast them back in float16 just to be sure everything works as expected.
311 |     input_dtype = query_states.dtype
312 |     if input_dtype == torch.float32:
313 |         if torch.is_autocast_enabled():
314 |             target_dtype = torch.get_autocast_gpu_dtype()
315 |         # Handle the case where the model is quantized
316 |         elif hasattr(self.config, "_pre_quantization_dtype"):
317 |             target_dtype = self.config._pre_quantization_dtype
318 |         else:
319 |             target_dtype = self.q_proj.weight.dtype
320 | 
321 |         logger.warning_once(
322 |             f"The input hidden states seems to be silently casted in float32, this might be related to"
323 |             f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
324 |             f" {target_dtype}."
325 |         )
326 | 
327 |         query_states = query_states.to(target_dtype)
328 |         key_states = key_states.to(target_dtype)
329 |         value_states = value_states.to(target_dtype)
330 | 
331 |     # Reashape to the expected shape for Flash Attention
332 |     query_states = query_states.transpose(1, 2)
333 |     key_states = key_states.transpose(1, 2)
334 |     value_states = value_states.transpose(1, 2)
335 | 
336 |     if (
337 |         self.config.use_sliding_window
338 |         and getattr(self.config, "sliding_window", None) is not None
339 |         and self.layer_idx >= self.config.max_window_layers
340 |     ):
341 |         sliding_window = self.config.sliding_window
342 |     else:
343 |         sliding_window = None
344 | 
345 |     attn_output = _flash_attention_forward(
346 |         query_states,
347 |         key_states,
348 |         value_states,
349 |         attention_mask,
350 |         q_len,
351 |         dropout=dropout_rate,
352 |         sliding_window=sliding_window,
353 |         is_causal=self.is_causal,
354 |         use_top_left_mask=self._flash_attn_uses_top_left_mask,
355 |     )
356 | 
357 |     attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
358 |     attn_output = self.o_proj(attn_output)
359 | 
360 |     if not output_attentions:
361 |         attn_weights = None
362 | 
363 |     return attn_output, attn_weights, past_key_value
364 | 
365 | 
366 | def retake_Qwen2VLForConditionalGeneration_compress_video_tokens(
367 |     self, 
368 |     input_ids: torch.LongTensor = None,
369 |     attention_mask: torch.Tensor = None,
370 |     video_embeds: torch.Tensor = None,
371 |     cache_position: Optional[torch.LongTensor] = None,
372 |     position_ids: Optional[torch.LongTensor] = None,
373 |     labels: Optional[torch.LongTensor] = None,
374 |     video_grid_thw: Optional[torch.LongTensor] = None,
375 | ) -> Tuple[torch.LongTensor, torch.Tensor, torch.Tensor, torch.LongTensor]:
376 |     # Parse long video compression configs
377 |     visual_compression = False
378 |     if getattr(self.config, "longvideo_kwargs", None) is not None:
379 |         visual_compression = self.config.longvideo_kwargs.get("visual_compression", False)
380 |         if visual_compression:
381 |             compression_kwargs = self.config.longvideo_kwargs['visual_compression_kwargs']
382 |             compression_ratio = compression_kwargs.get("compression_ratio")
383 |             compression_method = compression_kwargs.get("compression_method")
384 |             patch_sync = compression_kwargs.get("patch_sync")
385 |             return_keyframe_mask = compression_kwargs.get("return_keyframe_mask")
386 | 
387 |     if visual_compression:
388 |         assert labels is None
389 |         assert video_grid_thw.shape[0] <= 1, "Currently, interleaved videos are not supported"
390 |         assert input_ids.shape[0] == 1, "Currently, only inference are supported"
391 |         video_token_indices = torch.where(input_ids[0] == self.config.video_token_id)[0]
392 |         s_index, e_index = video_token_indices[0], video_token_indices[-1]
393 |         grid_t = video_grid_thw[0][0]
394 |         grid_hw = video_embeds.shape[0] // grid_t
395 |         ori_seq_len = input_ids.shape[1]
396 | 
397 |         tgt_mem_len = max(1, round(compression_ratio * grid_t.item()))
398 |         num_frame_diff = grid_t - tgt_mem_len
399 | 
400 |         # Compress
401 |         compressed_memory_bank = video_embeds.reshape(1, grid_t, grid_hw, -1)
402 |         if compression_method == "MA-LLM":
403 |             compression_size = torch.ones_like(compressed_memory_bank[:,:,:,0])
404 |             while compressed_memory_bank.shape[1] > tgt_mem_len:
405 |                 compressed_memory_bank, compression_size = memory_bank_compress_MALLM(compressed_memory_bank, compression_size, sync=patch_sync)
406 |             keypatches_mask = None
407 |         elif compression_method == "MA-LLM-hard":
408 |             while compressed_memory_bank.shape[1] > tgt_mem_len:
409 |                 compressed_memory_bank = memory_bank_compress_MALLM_hard(compressed_memory_bank, sync=patch_sync)
410 |             keypatches_mask = None
411 |         elif compression_method == "Keyframe":
412 |             compressed_memory_bank, keypatches_mask = memory_bank_compress_keyframe(compressed_memory_bank, tgt_mem_len, 3, sync=patch_sync)
413 |             keypatches_mask = keypatches_mask if return_keyframe_mask else None
414 |         else:
415 |             raise NotImplementedError
416 |         video_embeds = compressed_memory_bank.flatten(1, 2)[0]
417 |         tgt_seq_len = video_embeds.shape[0]
418 | 
419 |         # Reforge the input
420 |         input_ids = torch.cat([
421 |             input_ids[:, :s_index],
422 |             input_ids[:, s_index:e_index+1][:,:tgt_seq_len],
423 |             input_ids[:, e_index+1:]
424 |             ],
425 |             dim=1)
426 |         num_token_diff = ori_seq_len - input_ids.shape[1]
427 |         if num_token_diff and attention_mask is not None:
428 |             attention_mask = attention_mask[:, :-num_token_diff]
429 |         if num_token_diff and cache_position is not None:
430 |             cache_position = cache_position[:-num_token_diff]
431 |         if position_ids is not None:
432 |             position_ids = torch.cat([
433 |                 position_ids[..., :s_index],
434 |                 position_ids[..., s_index:e_index+1][...,:tgt_seq_len],
435 |                 position_ids[..., e_index+1:]
436 |                 ],
437 |             dim=2)
438 |             position_ids[:, :, s_index+tgt_seq_len:] -= num_frame_diff
439 |     else:
440 |         keypatches_mask = None
441 | 
442 |     return input_ids, attention_mask, video_embeds, cache_position, position_ids, labels, keypatches_mask
443 | 
444 | def retake_Qwen2VLForConditionalGeneration_segment_input_ids(self, input_ids):
445 |     """Split video and text segments in the input_ids
446 |     return: list[(s, e, type)], indices of [s, e) are of `type`.
447 |     """
448 |     videomask = (input_ids[0] == self.config.video_token_id)
449 |     # Find the difference between consecutive elements
450 |     diff = torch.diff(videomask.long())
451 |     diff_pos_indices = (torch.where(diff == 1)[0] + 1).cpu().numpy()
452 |     diff_neg_indices = (torch.where(diff == -1)[0] + 1).cpu().numpy()
453 | 
454 |     # True mask
455 |     start_indices_true = diff_pos_indices
456 |     end_indices_true = diff_neg_indices
457 |     if videomask[0] == True: # segment starts at the beginning
458 |         start_indices_true = np.insert(start_indices_true, 0, 0)
459 |     if videomask[-1] == True: # segment ends at the beginning
460 |         end_indices_true = np.append(end_indices_true, len(videomask))
461 | 
462 |     # False mask
463 |     start_indices_flase = diff_neg_indices
464 |     end_indices_flase = diff_pos_indices
465 |     if videomask[0] == False:
466 |         start_indices_flase = np.insert(start_indices_flase, 0, 0)
467 |     if videomask[-1] == False:
468 |         end_indices_flase = np.append(end_indices_flase, len(videomask))
469 | 
470 |     segments = (
471 |         list(zip(start_indices_true, end_indices_true, ['video']*len(end_indices_true))) + 
472 |         list(zip(start_indices_flase, end_indices_flase, ['text']*len(end_indices_flase)))
473 |     )
474 |     segments = sorted(segments, key=lambda x: x[0])
475 |     return segments
476 | 
477 | def retake_Qwen2VLForConditionalGeneration_get_chunk_size(self, config, video_grid_thw) -> int:
478 |     # Calculate the number of tokens in each prefill chunk
479 |     chunk_frames = (
480 |         config.longvideo_kwargs.get('chunked_prefill_frames', None) if getattr(config, 'longvideo_kwargs', None) 
481 |         else None
482 |     )
483 |     if chunk_frames is None:
484 |         chunk_prefill_size = None
485 |     else:
486 |         T, H, W = video_grid_thw[0]
487 |         t_factor = config.vision_config.spatial_merge_size**2 * config.vision_config.temporal_patch_size
488 |         chunk_prefill_size = min(chunk_frames, T) * H * W // t_factor
489 |         chunk_prefill_size = int(chunk_prefill_size.item())
490 |         # Avoid machine error in ceil() when calculating `num_chunks`.
491 |     return chunk_prefill_size
492 | 
493 | def retake_Qwen2VLForConditionalGeneration_forge_input_chunks(self, ss, ee, modality_segments, cache_position, position_ids, attention_mask, past_key_values, inputs_embeds):
494 |     cache_position_chunk = cache_position[:ee]
495 |     position_ids_chunk = position_ids[:,:,ss:ee]
496 |     attention_mask_chunk = attention_mask[:,:ee] # NOTE: specially from 0 to ee
497 |     inputs_embeds_chunk = inputs_embeds[:,ss:ee]
498 |     prompt_length = None
499 | 
500 |     if getattr(self.config, 'longvideo_kwargs', None) and self.config.longvideo_kwargs.get('kvcache_compression', False):
501 |         compression_kwargs = self.config.longvideo_kwargs['kvcache_compression_kwargs']
502 |         if compression_kwargs.get('prompt_guided_compression', False) and compression_kwargs.get('compression_ratio', 1) < 1.0:
503 |             # Prompt guided KV cache compression
504 |             s_p, e_p, t_p = modality_segments[-1]
505 |             # NOTE: make sure to include end-of-vision symbol like '<|vision_end|>'(151653)
506 |             if isinstance(self, Qwen2VLForConditionalGeneration):
507 |                 # QWen2VL automatically satisfies, since only '<|vision_pad|>'(151656) is recognized as vision.
508 |                 pass
509 |             else:
510 |                 raise NotImplementedError
511 |             assert t_p == 'text'
512 |             pos_offset = position_ids[0,0,s_p] - position_ids_chunk[0,0,-1] - 1 # (3, bs, seq_len)
513 |             position_ids_chunk = torch.cat([position_ids_chunk, position_ids[:,:,s_p:e_p] - pos_offset], dim=2)
514 |             attention_mask_chunk = torch.cat([attention_mask_chunk, attention_mask[:,s_p:e_p]], dim=1)
515 |             inputs_embeds_chunk = torch.cat([inputs_embeds_chunk, inputs_embeds[:,s_p:e_p]], dim=1)
516 |             prompt_length = e_p - s_p
517 |             cache_position_chunk = cache_position[:ee+prompt_length]
518 | 
519 |     return cache_position_chunk, position_ids_chunk, attention_mask_chunk, inputs_embeds_chunk, prompt_length
520 | 
521 | 
522 | def retake_Qwen2VLForConditionalGeneration_forward(
523 |     self,
524 |     input_ids: torch.LongTensor = None,
525 |     attention_mask: Optional[torch.Tensor] = None,
526 |     position_ids: Optional[torch.LongTensor] = None,
527 |     past_key_values: Optional[List[torch.FloatTensor]] = None,
528 |     inputs_embeds: Optional[torch.FloatTensor] = None,
529 |     labels: Optional[torch.LongTensor] = None,
530 |     use_cache: Optional[bool] = None,
531 |     output_attentions: Optional[bool] = None,
532 |     output_hidden_states: Optional[bool] = None,
533 |     return_dict: Optional[bool] = None,
534 |     pixel_values: Optional[torch.Tensor] = None,
535 |     pixel_values_videos: Optional[torch.FloatTensor] = None,
536 |     image_grid_thw: Optional[torch.LongTensor] = None,
537 |     video_grid_thw: Optional[torch.LongTensor] = None,
538 |     rope_deltas: Optional[torch.LongTensor] = None,
539 |     cache_position: Optional[torch.LongTensor] = None,
540 | ) -> Union[Tuple, Qwen2VLCausalLMOutputWithPast]:
541 |     assert input_ids.shape[0] == 1, "Batch inference of long video is not supported yet!"
542 | 
543 |     if (cache_position is not None and cache_position[0] == 0): # Prefill
544 |         is_prefill = True
545 |         # Calculate chunk size based on inputs
546 |         chunk_size = self.get_chunk_size(self.config, video_grid_thw)
547 |         # Configuring KV Cache compression kwargs
548 |         if getattr(self.config, 'longvideo_kwargs', None) and self.config.longvideo_kwargs.get('kvcache_compression', False):
549 |             compression_kwargs = self.config.longvideo_kwargs['kvcache_compression_kwargs']
550 |             if compression_kwargs.get('dynamic_compression_ratio', False):
551 |                 # Dynamic compression ratio
552 |                 input_length = input_ids.shape[1]
553 |                 max_input_length = compression_kwargs['max_input_length']
554 |                 if input_length <= max_input_length:
555 |                     compression_kwargs['compression_ratio'] = 1
556 |                 else:
557 |                     compression_kwargs['compression_ratio'] = max_input_length / input_length
558 |         if chunk_size is not None:
559 |             modality_segments = self.segment_input_ids(input_ids)
560 |             past_key_values = build_kvcache(self.config)
561 |             use_cache = True
562 |     else:
563 |         is_prefill = False
564 |         chunk_size = None
565 | 
566 |     output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
567 |     output_hidden_states = (
568 |         output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
569 |     )
570 |     return_dict = return_dict if return_dict is not None else self.config.use_return_dict
571 | 
572 |     # if we get 4D attention mask we cannot calculate rope deltas anymore. TODO @raushan fixme
573 |     if position_ids is None and (attention_mask is None or attention_mask.ndim == 2):
574 |         # calculate RoPE index once per generation in the pre-fill stage only
575 |         if (cache_position is not None and cache_position[0] == 0) or self.rope_deltas is None:
576 |             position_ids, rope_deltas = self.get_rope_index(
577 |                 input_ids, image_grid_thw, video_grid_thw, attention_mask
578 |             )
579 |             self.rope_deltas = rope_deltas
580 |         # then use the prev pre-calculated rope-deltas to get the correct position ids
581 |         else:
582 |             batch_size, seq_length = input_ids.shape
583 |             delta = cache_position[0] + self.rope_deltas if cache_position is not None else 0
584 |             position_ids = torch.arange(seq_length, device=input_ids.device)
585 |             position_ids = position_ids.view(1, -1).expand(batch_size, -1)
586 |             if cache_position is not None:  # otherwise `deltas` is an int `0`
587 |                 delta = delta.repeat_interleave(batch_size // delta.shape[0], dim=0)
588 |             position_ids = position_ids.add(delta)
589 |             position_ids = position_ids.unsqueeze(0).expand(3, -1, -1)
590 | 
591 |     if inputs_embeds is None:
592 |         # Extract visual features
593 |         if pixel_values is not None:
594 |             pixel_values = pixel_values.type(self.visual.get_dtype())
595 |             image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
596 | 
597 |         if pixel_values_videos is not None:
598 |             pixel_values_videos = pixel_values_videos.type(self.visual.get_dtype())
599 |             grid_t, grid_h, grid_w = video_grid_thw[0]
600 |             # NOTE: Split video into chunks to avoid OOM due to large activations during visual forward
601 |             # chunk_size can be up to 128 or higher if you have flash attention
602 |             frame_chunk_size = getattr(self.config, 'longvideo_kwargs', {}).get('frame_chunk_size', 1000000000)
603 |             if grid_t < frame_chunk_size:
604 |                 video_embeds = self.visual(pixel_values_videos, grid_thw=video_grid_thw)
605 |             else:
606 |                 d = pixel_values_videos.shape[-1]
607 |                 pixel_values_videos = pixel_values_videos.reshape(grid_t, grid_h*grid_w, d)
608 |                 video_embeds = []
609 |                 for i in range(0, grid_t, frame_chunk_size):
610 |                     pixel_values_videos_chunk = pixel_values_videos[i:i+frame_chunk_size]
611 |                     grid_t_chunk = pixel_values_videos_chunk.shape[0]
612 |                     video_grid_thw_chunk = video_grid_thw.clone()
613 |                     video_grid_thw_chunk[0,0] = grid_t_chunk
614 |                     video_embeds.append(
615 |                         self.visual(pixel_values_videos_chunk.reshape(-1, d), grid_thw=video_grid_thw_chunk)
616 |                     )
617 |                 video_embeds = torch.cat(video_embeds)
618 |             # Compression video tokens
619 |             input_ids, attention_mask, video_embeds, cache_position, position_ids, labels, keypatches_mask = self.compress_video_tokens(
620 |                 input_ids=input_ids, 
621 |                 attention_mask=attention_mask, 
622 |                 video_embeds=video_embeds, 
623 |                 cache_position=cache_position,
624 |                 position_ids=position_ids,
625 |                 labels=labels,
626 |                 video_grid_thw=video_grid_thw
627 |             )
628 | 
629 |         # Concat visual and textual features
630 |         inputs_embeds = self.model.embed_tokens(input_ids)
631 |         if pixel_values is not None:
632 |             n_image_tokens = (input_ids == self.config.image_token_id).sum().item()
633 |             n_image_features = image_embeds.shape[0]
634 |             if n_image_tokens != n_image_features:
635 |                 raise ValueError(
636 |                     f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
637 |                 )
638 |             image_mask = (
639 |                 (input_ids == self.config.image_token_id)
640 |                 .unsqueeze(-1)
641 |                 .expand_as(inputs_embeds)
642 |                 .to(inputs_embeds.device)
643 |             )
644 |             image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
645 |             inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
646 | 
647 |         if pixel_values_videos is not None:
648 |             n_video_tokens = (input_ids == self.config.video_token_id).sum().item()
649 |             n_video_features = video_embeds.shape[0]
650 |             if n_video_tokens != n_video_features:
651 |                 raise ValueError(
652 |                     f"Video features and video tokens do not match: tokens: {n_video_tokens}, features {n_video_features}"
653 |                 )
654 |             video_mask = (
655 |                 (input_ids == self.config.video_token_id)
656 |                 .unsqueeze(-1)
657 |                 .expand_as(inputs_embeds)
658 |                 .to(inputs_embeds.device)
659 |             )
660 |             video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
661 |             inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
662 |             if keypatches_mask is not None:
663 |                 keypatches_mask = torch.zeros_like(input_ids).bool().masked_scatter(video_mask[:,:,0], keypatches_mask)
664 | 
665 |         if attention_mask is not None:
666 |             attention_mask = attention_mask.to(inputs_embeds.device)
667 |         if position_ids is not None:
668 |             position_ids = position_ids.to(inputs_embeds.device)
669 | 
670 |     if is_prefill and chunk_size is not None: # Chunked prefill stage
671 |         assert past_key_values is not None
672 |         kvcache_compression = getattr(past_key_values, 'kvcache_compression', False)
673 |         for seg_id, (s, e, dtype) in enumerate(modality_segments):
674 |             if dtype == 'text': # Prefill text without kvcache_compression
675 |                 past_key_values.kvcache_compression = False
676 |                 outputs = self.model(
677 |                     input_ids=None,
678 |                     position_ids=position_ids[:,:,s:e],
679 |                     attention_mask=attention_mask[:,:e],
680 |                     past_key_values=past_key_values,
681 |                     inputs_embeds=inputs_embeds[:,s:e],
682 |                     use_cache=True,
683 |                     output_attentions=output_attentions,
684 |                     output_hidden_states=output_hidden_states,
685 |                     return_dict=return_dict,
686 |                     cache_position=cache_position[:e],
687 |                 )
688 |                 past_key_values = outputs['past_key_values']
689 |             elif dtype == 'video': # Prefill video may with kvcache_compression
690 |                 num_chunks = math.ceil((e - s) / chunk_size)
691 |                 past_key_values.kvcache_compression = kvcache_compression
692 |                 for idx in tqdm(range(num_chunks), total=num_chunks, desc='Prefilling chunk', disable=not DEBUG_MODE):
693 |                     ss = s + idx * chunk_size
694 |                     ee = min(s + (idx + 1) * chunk_size, e)
695 |                     if keypatches_mask is not None:
696 |                         past_key_values.keypatches_mask_chunk = keypatches_mask[0,ss:ee]
697 |                     cache_position_chunk, position_ids_chunk, attention_mask_chunk, inputs_embeds_chunk, prompt_length = self.forge_input_chunks(
698 |                         ss, ee, modality_segments, cache_position, position_ids, attention_mask, past_key_values, inputs_embeds
699 |                     )
700 |                     if hasattr(past_key_values, 'before_forward'):
701 |                         past_key_values.before_forward(prompt_length=prompt_length)
702 |                     outputs = self.model(
703 |                         input_ids=None,
704 |                         position_ids=position_ids_chunk,
705 |                         attention_mask=attention_mask_chunk,
706 |                         past_key_values=past_key_values,
707 |                         inputs_embeds=inputs_embeds_chunk,
708 |                         use_cache=True,
709 |                         output_attentions=output_attentions,
710 |                         output_hidden_states=output_hidden_states,
711 |                         return_dict=return_dict,
712 |                         cache_position=cache_position_chunk,
713 |                     )
714 |                     past_key_values = outputs['past_key_values']
715 |                     if hasattr(past_key_values, 'after_forward'):
716 |                         past_key_values.after_forward()
717 |                 past_key_values.keypatches_mask_chunk = None
718 |                 past_key_values.kvcache_compression = False # Turned off for decoding
719 |             else:
720 |                 raise ValueError
721 |     else: # Decode / Standard prefill stage
722 |         outputs = self.model(
723 |             input_ids=None,
724 |             position_ids=position_ids,
725 |             attention_mask=attention_mask,
726 |             past_key_values=past_key_values,
727 |             inputs_embeds=inputs_embeds,
728 |             use_cache=use_cache,
729 |             output_attentions=output_attentions,
730 |             output_hidden_states=output_hidden_states,
731 |             return_dict=return_dict,
732 |             cache_position=cache_position,
733 |         )
734 | 
735 |     hidden_states = outputs[0]
736 |     logits = self.lm_head(hidden_states)
737 | 
738 |     loss = None
739 |     if labels is not None:
740 |         # Upcast to float if we need to compute the loss to avoid potential precision issues
741 |         logits = logits.float()
742 |         # Shift so that tokens < n predict n
743 |         shift_logits = logits[..., :-1, :].contiguous()
744 |         shift_labels = labels[..., 1:].contiguous()
745 |         # Flatten the tokens
746 |         loss_fct = CrossEntropyLoss()
747 |         shift_logits = shift_logits.view(-1, self.config.vocab_size)
748 |         shift_labels = shift_labels.view(-1)
749 |         # Enable model parallelism
750 |         shift_labels = shift_labels.to(shift_logits.device)
751 |         loss = loss_fct(shift_logits, shift_labels)
752 | 
753 |     if not return_dict:
754 |         output = (logits,) + outputs[1:]
755 |         return (loss,) + output if loss is not None else output
756 | 
757 |     return Qwen2VLCausalLMOutputWithPast(
758 |         loss=loss,
759 |         logits=logits,
760 |         past_key_values=outputs.past_key_values,
761 |         hidden_states=outputs.hidden_states,
762 |         attentions=outputs.attentions,
763 |         rope_deltas=self.rope_deltas,
764 |     )
765 | 


--------------------------------------------------------------------------------
/retake/visual_compression.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | import torch.nn.functional as F
  3 | 
  4 | 
  5 | def memory_bank_compress_MALLM(memory_bank: torch.Tensor, compression_size: torch.Tensor, sync: bool=False) -> tuple:
  6 |     """
  7 |     Compresses the memory bank if the current memory bank length is greater than the threshold.
  8 |     Compression_size is the number of frames that are compressed into each position.
  9 |     
 10 |     Args:
 11 |         memory_bank (torch.Tensor): The input memory bank to compress. Shape: (B, T, N, C)
 12 |         compression_size (torch.Tensor): The number of frames to compress into each position. Shape: (B, T, N)
 13 |     
 14 |     Returns:
 15 |         compressed_memory_bank (torch.Tensor): The compressed memory bank. Shape: (B, T-1, N, C)
 16 |         compressed_size (torch.Tensor): The number of frames compressed into each position. Shape: (B, T-1, N)
 17 |     """
 18 |     B, T, N, C = memory_bank.shape
 19 |     # Calculate the cosine similarity between adjacent frames
 20 |     similarity_matrix = F.cosine_similarity(memory_bank[:, :-1, :], memory_bank[:, 1:, :], dim=-1)
 21 |     if sync:
 22 |         similarity_matrix = similarity_matrix.mean(-1, keepdim=True).expand(-1, -1, N)
 23 |     # Select the frame indices with the top-1 similarity 
 24 |     _, max_similarity_indices = torch.max(similarity_matrix, dim=1, keepdim=True)
 25 | 
 26 |     # Calculate source and dst indices for compression
 27 |     src_indices = max_similarity_indices + 1
 28 |     dst_indices = torch.arange(T - 1).to(memory_bank.device)[None, :, None].repeat(B, 1, N)
 29 |     dst_indices[dst_indices > max_similarity_indices] += 1
 30 | 
 31 |     # Gather source and dst memory banks and sizes
 32 |     src_memory_bank = memory_bank.gather(dim=1, index=src_indices.unsqueeze(-1).expand(-1, -1, -1, C))
 33 |     dst_memory_bank = memory_bank.gather(dim=1, index=dst_indices.unsqueeze(-1).expand(-1, -1, -1, C))
 34 |     src_size = compression_size.gather(dim=1, index=src_indices)
 35 |     dst_size = compression_size.gather(dim=1, index=dst_indices)
 36 | 
 37 |     # Multiply the memory banks by their corresponding sizes
 38 |     src_memory_bank *= src_size.unsqueeze(-1)
 39 |     dst_memory_bank *= dst_size.unsqueeze(-1)
 40 | 
 41 |     # Compress the memory bank by adding the source memory bank to the dst memory bank
 42 |     dst_memory_bank.scatter_add_(dim=1, index=max_similarity_indices.unsqueeze(-1).expand(-1, -1, -1, C), src=src_memory_bank)
 43 |     dst_size.scatter_add_(dim=1, index=max_similarity_indices, src=src_size)
 44 | 
 45 |     # Normalize the dst memory bank by its size
 46 |     compressed_memory_bank = dst_memory_bank / dst_size.unsqueeze(-1)
 47 |     return compressed_memory_bank, dst_size
 48 | 
 49 | 
 50 | def memory_bank_compress_MALLM_hard(memory_bank: torch.Tensor, sync: bool=False) -> tuple:
 51 |     """
 52 |     Compresses the memory bank if the current memory bank length is greater than the threshold.
 53 |     Different from MA-LLM, this method replace the tgt features with src ones
 54 |     
 55 |     Args:
 56 |         memory_bank (torch.Tensor): The input memory bank to compress. Shape: (B, T, N, C)
 57 |     
 58 |     Returns:
 59 |         compressed_memory_bank (torch.Tensor): The compressed memory bank. Shape: (B, T-1, N, C)
 60 |     """
 61 |     B, T, N, C = memory_bank.shape
 62 |     # Calculate the cosine similarity between adjacent frames
 63 |     similarity_matrix = F.cosine_similarity(memory_bank[:, :-1, :], memory_bank[:, 1:, :], dim=-1)
 64 |     if sync:
 65 |         similarity_matrix = similarity_matrix.mean(-1, keepdim=True).expand(-1, -1, N)
 66 |     # Select the frame indices with the top-1 similarity 
 67 |     _, max_similarity_indices = torch.max(similarity_matrix, dim=1, keepdim=True)
 68 | 
 69 |     # Calculate source and dst indices for compression
 70 |     src_indices = max_similarity_indices + 1
 71 |     dst_indices = torch.arange(T - 1).to(memory_bank.device)[None, :, None].repeat(B, 1, N)
 72 |     dst_indices[dst_indices > max_similarity_indices] += 1
 73 | 
 74 |     # Gather source and dst memory banks and sizes
 75 |     src_memory_bank = memory_bank.gather(dim=1, index=src_indices.unsqueeze(-1).expand(-1, -1, -1, C))
 76 |     dst_memory_bank = memory_bank.gather(dim=1, index=dst_indices.unsqueeze(-1).expand(-1, -1, -1, C))
 77 | 
 78 |     # Compress the memory bank by adding the source memory bank to the dst memory bank
 79 |     dst_memory_bank.scatter_(dim=1, index=max_similarity_indices.unsqueeze(-1).expand(-1, -1, -1, C), src=src_memory_bank)
 80 | 
 81 |     # Normalize the dst memory bank by its size
 82 |     compressed_memory_bank = dst_memory_bank
 83 |     return compressed_memory_bank
 84 | 
 85 | 
 86 | def memory_bank_compress_keyframe(memory_bank: torch.Tensor, tgt_mem_len: int, window_size: int=3, sync: bool=True) -> tuple:
 87 |     """
 88 |     Compresses the memory bank if the current memory bank length is greater than the threshold.
 89 |     Different from MA-LLM, this method replace the tgt features with src ones
 90 |     
 91 |     Args:
 92 |         memory_bank (torch.Tensor): The input memory bank to compress. Shape: (B, T, N, C)
 93 |     
 94 |     Returns:
 95 |         compressed_memory_bank (torch.Tensor): The compressed memory bank. Shape: (B, T-1, N, C)
 96 |         keypatches_mask (torch.Tensor): The compressed memory bank. Shape: (T-1 * N)
 97 |     """
 98 |     B, T, N, C = memory_bank.shape
 99 |     # Calculate the cosine similarity between adjacent frames
100 |     similarity_matrix = F.cosine_similarity(memory_bank[:, :-1, :], memory_bank[:, 1:, :], dim=-1)
101 |     dis_matrix = 1 - similarity_matrix[0].type(torch.float)
102 |     # dis-similarity of the (i)th frame with the (i-1)th frame
103 |     dis_matrix = torch.cat([
104 |         torch.ones_like(dis_matrix[:1]),
105 |         dis_matrix
106 |     ], dim=0) # [T, N]
107 | 
108 |     if sync:
109 |         # Meanpool over spatial locations
110 |         dis_matrix = dis_matrix.mean(1) # [T]
111 |         keypatches_mask = torch.zeros_like(dis_matrix).bool()
112 | 
113 |         # Argrelmax
114 |         try:
115 |             if torch.npu.is_available():
116 |                 # F.max_pool1d_with_indices is not supported and returns wrong tensor
117 |                 device = dis_matrix.device
118 |                 dis_matrix = dis_matrix.cpu()
119 |         except:
120 |             pass
121 |         window_maxima = F.max_pool1d_with_indices(dis_matrix[None,None,:], window_size, 1, padding=window_size//2)[1].squeeze() # [T]
122 |         candidates = window_maxima.unique()
123 |         peaks = candidates[(window_maxima[candidates]==candidates).nonzero()].squeeze()
124 |         try:
125 |             if torch.npu.is_available():
126 |                 dis_matrix = dis_matrix.to(device)
127 |                 peaks = peaks.to(device)
128 |         except:
129 |             pass
130 | 
131 |         # Fill remaining frames
132 |         keypatches_mask[peaks] = True
133 |         dis_matrix[peaks] += 2 # select from peaks first
134 |         peaks = torch.topk(dis_matrix, k=tgt_mem_len, sorted=False)[1] # [t]
135 |         peaks = peaks.sort()[0]
136 | 
137 |         # Get keyframe memory
138 |         compressed_memory_bank = memory_bank[:,peaks] # [B, t, N, C]
139 |         keypatches_mask = keypatches_mask[peaks] # [t]
140 |         keypatches_mask = keypatches_mask[:, None].repeat(1, N) # [t, N]
141 |     else:
142 |         dis_matrix = dis_matrix.transpose(0, 1) # [N, T]
143 |         keypatches_mask = torch.zeros_like(dis_matrix).bool()
144 |         # Argrelmax
145 |         try:
146 |             if torch.npu.is_available():
147 |                 # F.max_pool1d_with_indices is not supported and returns wrong tensor
148 |                 device = dis_matrix.device
149 |                 dis_matrix = dis_matrix.cpu()
150 |                 keypatches_mask = keypatches_mask.cpu()
151 |         except:
152 |             pass
153 |         window_maxima = F.max_pool1d_with_indices(dis_matrix[:,None,:], window_size, 1, padding=window_size//2)[1].squeeze() # [N, T]
154 |         for p, window_maxima_patch in enumerate(window_maxima):
155 |             candidates_patch = window_maxima_patch.unique()
156 |             peaks_patch = candidates_patch[(window_maxima_patch[candidates_patch]==candidates_patch).nonzero()][:,0]
157 | 
158 |             # Fill remaining frames
159 |             keypatches_mask[p, peaks_patch] = True
160 |             dis_matrix[p, peaks_patch] += 2
161 |         try:
162 |             if torch.npu.is_available():
163 |                 dis_matrix = dis_matrix.to(device)
164 |                 keypatches_mask = keypatches_mask.to(device)
165 |         except:
166 |             pass
167 |         peaks = torch.topk(dis_matrix, k=tgt_mem_len, sorted=False, dim=1)[1] # [N, t]
168 |         peaks = peaks.sort(dim=1)[0]
169 |         peaks = peaks.transpose(0, 1) # [t, N]
170 |         keypatches_mask = keypatches_mask.transpose(0, 1) # [t, N]
171 | 
172 |         # Get keyframe memory
173 |         compressed_memory_bank = memory_bank.gather(dim=1, index=peaks[None,:,:,None].expand(-1, -1, -1, C))
174 |         # [B, t, N, C]
175 |         keypatches_mask = keypatches_mask.gather(dim=0, index=peaks) # [t, N]
176 | 
177 |     return compressed_memory_bank, keypatches_mask.flatten()


--------------------------------------------------------------------------------
/scripts/infer_eval_retake.sh:
--------------------------------------------------------------------------------
1 | ckpt_path=$1
2 | config_path=$2
3 | num_gpus=$3
4 | 
5 | PYTHONPATH=$PYTHONPATH:./ python retake/infer_eval.py \
6 | --hf_qwen2vl7b_path $ckpt_path \
7 | --config_path $config_path \
8 | --video_frame_extraction_fps 25 \
9 | --n_gpus $num_gpus


--------------------------------------------------------------------------------
/scripts/utils/build_lvbench_dataset.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import json
 3 | import math
 4 | import pandas as pd
 5 | import argparse
 6 | import glob
 7 | 
 8 | 
 9 | def main(args):
10 |     video_root = os.path.join(args.hf_root, 'video_25fps')
11 | 
12 |     with open(os.path.join(args.hf_root, "video_info.meta.jsonl")) as F:
13 |         dataset = [json.loads(line) for line in F.readlines()]
14 | 
15 |     question_type_all = set()
16 | 
17 |     data = []
18 |     for video_data in dataset:
19 |         for anno in video_data['qa']:
20 |             question = anno['question'].replace('\n(A)', '\nOptions:\nA.')
21 |             question = question.replace('\n(B)', '\nB.').replace('\n(C)', '\nC.').replace('\n(D)', '\nD.')
22 |             assert '(E)' not in question
23 |             question = f"<video>{question}.\nAnswer with the option's letter from the given choices directly."
24 | 
25 |             d = {
26 |                 "messages": [
27 |                 {
28 |                     "content": question,
29 |                     "role": "user"
30 |                 },
31 |                 {
32 |                     "content": anno['answer'],
33 |                     "role": "assistant"
34 |                 }
35 |                 ],
36 |                 "videos": [
37 |                     os.path.join(video_root, video_data['key'])
38 |                 ],
39 |                 "meta": {
40 |                     "video_id": video_data['key'],
41 |                     "uid": anno['uid'],
42 |                     "video_type": video_data['type'],
43 |                     "question_type": anno['question_type'],
44 |                     "time_reference": anno['time_reference'],
45 |                 }
46 |             }
47 |             d['meta'].update(video_data['video_info'])
48 |             d['meta'] = json.dumps(d['meta'])
49 |             question_type_all.union(set(anno['question_type']))
50 |             data.append(d)
51 | 
52 |     os.makedirs(os.path.join(args.data_root, f"lvbench"), exist_ok=True)
53 |     with open(os.path.join(args.data_root, "lvbench", f"lvbench.json"), 'w') as F:
54 |         json.dump(data, F, indent=2)
55 | 
56 | 
57 | if __name__ == "__main__":
58 |     parser = argparse.ArgumentParser(description="Process videos to extract frames at a specified FPS.")
59 |     parser.add_argument('--hf_root', type=str, required=True)
60 |     parser.add_argument('--data_root', type=str, default="./dataset")
61 | 
62 |     args = parser.parse_args()
63 | 
64 |     main(args)


--------------------------------------------------------------------------------
/scripts/utils/build_mlvu_dataset.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import json
  3 | import glob
  4 | import argparse
  5 | 
  6 | 
  7 | QTYPE_FORMAT_DICT = {
  8 |     # Multiple choice tasks
  9 |     'plotQA': "Plot QA",
 10 |     'findNeedle': "Needle QA",
 11 |     'ego': "Ego Reasoning",
 12 |     'count': "Action Count",
 13 |     'order': "Action Order",
 14 |     'anomaly_reco': "Anomaly Recognition",
 15 |     'topic_reasoning': "Topic Reasoning",
 16 |     # Generation tasks
 17 |     'subPlot': 'Sub-Scene Captioning',
 18 |     'summary': 'Video Summary',
 19 | }
 20 | 
 21 | 
 22 | def main(args):
 23 |     video_root = os.path.join(args.hf_root, 'MLVU/video_25fps')
 24 | 
 25 |     data = []
 26 |     anno_files = glob.glob(os.path.join(args.hf_root, "MLVU/json/*.json"))
 27 |     for anno_file in anno_files:
 28 |         with open(anno_file, 'r') as F:
 29 |             raw_data = json.load(F)
 30 | 
 31 |         if os.path.basename(anno_file) in ['8_sub_scene.json', '9_summary.json']:
 32 |             task_category = 'G'
 33 |         else:
 34 |             task_category = 'M'
 35 | 
 36 |         for sample in raw_data:
 37 |             question = sample['question']
 38 | 
 39 |             if task_category == 'M':
 40 |                 if 'candidates' not in sample:
 41 |                     print("Warning, candidates not found", anno_file)
 42 |                     continue
 43 |                 candidates = sample['candidates']
 44 | 
 45 |                 question = "<video>{question}\nOptions:\nA. {o1}.\nB. {o2}.\nC. {o3}.\nD. {o4}.\nAnswer with the option's letter from the given choices directly.".format(
 46 |                     question=question, o1=candidates[0], o2=candidates[1], o3=candidates[2], o4=candidates[3]
 47 |                 )
 48 | 
 49 |                 answer = None
 50 |                 for a, cand in zip(['A', 'B', 'C', 'D'], candidates):
 51 |                     if cand == sample['answer']:
 52 |                         answer = a
 53 |                         break
 54 |                 if answer is None:
 55 |                     print("Warning! Answer not found for current sample, so it is deleted:", sample)
 56 |                     continue
 57 |                 scoring_points = None
 58 |             else:
 59 |                 question = "<video>{question}".format(question=question)
 60 |                 answer = sample['answer']
 61 |                 scoring_points = sample.get('scoring_points', None)
 62 | 
 63 |             question_type = QTYPE_FORMAT_DICT[sample['question_type']]
 64 | 
 65 |             d = {
 66 |                 "messages": [
 67 |                 {
 68 |                     "content": question,
 69 |                     "role": "user"
 70 |                 },
 71 |                 {
 72 |                     "content": answer,
 73 |                     "role": "assistant"
 74 |                 }
 75 |                 ],
 76 |                 "videos": [
 77 |                     os.path.join(video_root, os.path.splitext(sample["video"])[0])
 78 |                 ],
 79 |                 "meta": {
 80 |                     "video": sample["video"],
 81 |                     "duration": sample["duration"],
 82 |                     "question_type": question_type,
 83 |                 }
 84 |             }
 85 |             if scoring_points is not None:
 86 |                 d['meta']['scoring_points'] = scoring_points
 87 |             data.append(d)
 88 | 
 89 |     os.makedirs(os.path.join(args.data_root, "mlvu"), exist_ok=True)
 90 |     with open(os.path.join(args.data_root, "mlvu", "mlvu.json"), 'w') as F:
 91 |         json.dump(data, F, indent=2)
 92 | 
 93 | 
 94 | if __name__ == "__main__":
 95 |     parser = argparse.ArgumentParser(description="Process videos to extract frames at a specified FPS.")
 96 |     parser.add_argument('--hf_root', type=str, required=True)
 97 |     parser.add_argument('--data_root', type=str, default="./dataset")
 98 | 
 99 |     args = parser.parse_args()
100 | 
101 |     main(args)


--------------------------------------------------------------------------------
/scripts/utils/build_mlvu_test_dataset.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import json
  3 | import glob
  4 | import argparse
  5 | 
  6 | 
  7 | QTYPE_FORMAT_DICT = {
  8 |     # Multiple choice tasks
  9 |     'plotQA': "Plot QA",
 10 |     'findNeedle': "Needle QA",
 11 |     'ego': "Ego Reasoning",
 12 |     'count': "Action Count",
 13 |     'order': "Action Order",
 14 |     'anomaly_reco': "Anomaly Recognition",
 15 |     'topic_reasoning': "Topic Reasoning",
 16 |     # Generation tasks
 17 |     'subPlot': 'Sub-Scene Captioning',
 18 |     'summary': 'Video Summary',
 19 | }
 20 | 
 21 | 
 22 | def main(args):
 23 |     video_root = os.path.join(args.hf_root, 'MLVU/data_25fps')
 24 |     data = []
 25 |     anno_files = glob.glob(os.path.join(args.hf_root, "MLVU/json/*.json"))
 26 |     for anno_file in anno_files:
 27 |         with open(anno_file, 'r') as F:
 28 |             raw_data = json.load(F)
 29 | 
 30 |         if os.path.basename(anno_file) in ['8_sub_scene.json', '9_summary.json']:
 31 |             task_category = 'G'
 32 |         else:
 33 |             task_category = 'M'
 34 | 
 35 |         for sample in raw_data:
 36 |             question = sample['question']
 37 | 
 38 |             if task_category == 'M':
 39 |                 if 'candidates' not in sample:
 40 |                     print("Warning, candidates not found", anno_file)
 41 |                     continue
 42 |                 candidates = sample['candidates']
 43 | 
 44 |                 question = "<video>{question}\nOptions:\nA. {o1}.\nB. {o2}.\nC. {o3}.\nD. {o4}.\nAnswer with the option's letter from the given choices directly.".format(
 45 |                     question=question, o1=candidates[0], o2=candidates[1], o3=candidates[2], o4=candidates[3]
 46 |                 )
 47 | 
 48 |                 answer = None
 49 |                 for a, cand in zip(['A', 'B', 'C', 'D'], candidates):
 50 |                     if cand == sample['answer']:
 51 |                         answer = a
 52 |                         break
 53 |                 if answer is None:
 54 |                     print("Warning! Answer not found!", sample['answer'], candidates)
 55 |                     continue
 56 |                 scoring_points = None
 57 |             else:
 58 |                 question = "<video>{question}".format(question=question)
 59 |                 answer = sample['answer']
 60 |                 scoring_points = sample.get('scoring_points', None)
 61 | 
 62 |             question_type = QTYPE_FORMAT_DICT[sample['question_type']]
 63 | 
 64 |             d = {
 65 |                 "messages": [
 66 |                 {
 67 |                     "content": question,
 68 |                     "role": "user"
 69 |                 },
 70 |                 {
 71 |                     "content": answer,
 72 |                     "role": "assistant"
 73 |                 }
 74 |                 ],
 75 |                 "videos": [
 76 |                     video_root.format(
 77 |                         typename=os.path.splitext(os.path.basename(anno_file))[0], 
 78 |                         videoname=os.path.splitext(sample["video"])[0]
 79 |                     )
 80 |                 ],
 81 |                 "meta": {
 82 |                     "video": sample["video"],
 83 |                     "duration": sample["duration"],
 84 |                     "question_type": question_type,
 85 |                 }
 86 |             }
 87 |             if scoring_points is not None:
 88 |                 d['meta']['scoring_points'] = scoring_points
 89 |             data.append(d)
 90 | 
 91 | 
 92 |     os.makedirs(os.path.join(args.data_root, "mlvu"), exist_ok=True)
 93 |     with open(os.path.join(args.data_root, "mlvu", "mlvu.json"), 'w') as F:
 94 |         json.dump(data, F, indent=2)
 95 | 
 96 | 
 97 | if __name__ == "__main__":
 98 |     parser = argparse.ArgumentParser(description="Process videos to extract frames at a specified FPS.")
 99 |     parser.add_argument('--hf_root', type=str, required=True)
100 |     parser.add_argument('--data_root', type=str, default="./dataset")
101 | 
102 |     args = parser.parse_args()
103 | 
104 |     main(args)


--------------------------------------------------------------------------------
/scripts/utils/build_videomme_dataset.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import json
  3 | import copy
  4 | import argparse
  5 | from tqdm import tqdm
  6 | import pysubs2
  7 | import pandas as pd
  8 | from transformers import AutoTokenizer
  9 | 
 10 | 
 11 | def get_subtitles(srt_path, question, tokenizer, max_tokens):
 12 |     if srt_path and os.path.exists(srt_path):
 13 |         subs = pysubs2.load(srt_path, encoding="utf-8")
 14 |         subtitles = []
 15 |         for sub in subs:
 16 |             sub_text = sub.text.replace("\\N", " ")
 17 |             if sub_text.strip():
 18 |                 subtitles.append(sub_text)
 19 |         subtitles = "\n".join(subtitles)
 20 | 
 21 |         question_tokenized = tokenizer(question).input_ids
 22 |         subtitles_tokenized = tokenizer(subtitles).input_ids
 23 |         if len(question_tokenized) + len(subtitles_tokenized) > max_tokens:
 24 |             cutoff = len(question_tokenized) + len(subtitles_tokenized) - max_tokens
 25 |             subtitles_tokenized = subtitles_tokenized[:-cutoff]
 26 |             subtitles = tokenizer.decode(subtitles_tokenized, skip_special_tokens=True)
 27 |     else:
 28 |         subtitles = ""
 29 | 
 30 |     return subtitles
 31 | 
 32 | 
 33 | def main(args):
 34 |     video_root = os.path.join(args.hf_root, 'data_25fps')
 35 |     data_root = args.data_root
 36 |     if not os.path.exists(data_root):
 37 |         os.makedirs(data_root)
 38 | 
 39 |     srt_root = os.path.join(args.hf_root, "subtitle")
 40 |     annos = pd.read_parquet(
 41 |         os.path.join(args.hf_root, 'videomme', 'test-00000-of-00001.parquet')
 42 |     )
 43 |     tokenizer = AutoTokenizer.from_pretrained(args.hf_qwen2vl7b_path)
 44 | 
 45 | 
 46 |     data = []
 47 |     data_sub = []
 48 |     for idx, row in tqdm(annos.iterrows(), total=len(annos)):
 49 |         question = "<video>%s\nOptions:\n%s\nAnswer with the option's letter from the given choices directly." % (
 50 |             row["question"], '\n'.join(row["options"])
 51 |         )
 52 |         d = {
 53 |             "messages": [
 54 |             {
 55 |                 "content": question,
 56 |                 "role": "user"
 57 |             },
 58 |             {
 59 |                 "content": row["answer"],
 60 |                 "role": "assistant"
 61 |             }
 62 |             ],
 63 |             "videos": [
 64 |                 os.path.join(video_root, row["videoID"])
 65 |             ],
 66 |             "meta": {
 67 |                 "video_id": row["video_id"],
 68 |                 "question_id": row["question_id"],
 69 |                 "duration": row["duration"],
 70 |                 "domain": row["domain"],
 71 |                 "sub_category": row["sub_category"],
 72 |                 "task_type": row["task_type"],
 73 |             }
 74 |         }
 75 |         data.append(d)
 76 | 
 77 |         subtitles = get_subtitles(os.path.join(srt_root, f'{row["videoID"]}.srt'), question, tokenizer, args.max_tokens)
 78 |         if subtitles != "":
 79 |             question = "<video>This video's subtitles are listed below:\n%s\n%s\nOptions:\n%s\nAnswer with the option's letter from the given choices directly." % (
 80 |                 subtitles, row["question"], '\n'.join(row["options"])
 81 |             )
 82 |         d = copy.deepcopy(d)
 83 |         d['messages'][0]['content'] = question
 84 |         data_sub.append(d)
 85 | 
 86 | 
 87 |     os.makedirs(os.path.join(data_root, "video_mme"), exist_ok=True)
 88 |     with open(os.path.join(data_root, "video_mme", "video_mme.json"), 'w') as F:
 89 |         json.dump(data, F, indent=2)
 90 |     with open(os.path.join(data_root, "video_mme", "video_mme_subtitle.json"), 'w') as F:
 91 |         json.dump(data_sub, F, indent=2)
 92 | 
 93 | 
 94 | if __name__ == "__main__":
 95 |     parser = argparse.ArgumentParser(description="Process videos to extract frames at a specified FPS.")
 96 |     parser.add_argument('--hf_qwen2vl7b_path', type=str, required=True)
 97 |     parser.add_argument('--hf_root', type=str, required=True)
 98 |     parser.add_argument('--data_root', type=str, default="./dataset")
 99 |     parser.add_argument('--max_tokens', type=int, default=10000) # cover 90% long videos
100 | 
101 |     args = parser.parse_args()
102 | 
103 |     main(args)


--------------------------------------------------------------------------------
/scripts/utils/cal_flops.py:
--------------------------------------------------------------------------------
  1 | # QWen2-VL
  2 | 
  3 | C3 =  3584 # attention dim	dm	3584
  4 | C4 =  28 # attention head number	head_num	28
  5 | C5 =  18944 # FF hidden layer dim	df	18944
  6 | C6 =  28 # Total Layer	L	28
  7 | C7 =  152064 # vocabulary size	V	152064
  8 | # C8 =  2304 # maximum token number	s	2304
  9 | C9 =  1 # Data format	FP16	1
 10 | C10 =  128 # attention dim for each head	dk	128
 11 | C11 =  7 # attention head number each chip	dhead_num	7
 12 | C12 =  896 # attention dim for each chip	dt	896
 13 | C13 =  4736 # FF hidden layer dim for each chip	dft	4736
 14 | 
 15 | C16 = 1 # batch size	b	1
 16 | C17 = 1 # current token number	n	1
 17 | # C18 = 3456 # KV Cache Size		3456
 18 | 
 19 | C32 = 1000000000
 20 | 
 21 | def calculate_flops_prefill(input_length, kvcache_length):
 22 |     C8 = input_length
 23 |     C18 = kvcache_length
 24 |     layernorm = 5*C16*C8*C12/1000/1000/1000
 25 |     QKV计算 = 2*3*C16*C8*C12*C3/C32
 26 |     RoPE = 6*C16*C8*C12/C32
 27 |     # attention计算 = (2*C16*C8*(C8+C18)*C12 + 3*C16*(C8+C18)*C12*C12 + C16*C8*(C8+C18))/C32
 28 |     attention计算 = (4*C16*C8*(C8+C18)*C12 - 2*C16*C8*C8*C12 + 3*C16*C8*C12*C12 + 2*C16*C8*(C8+C18) - C16*C8*C8)/C32
 29 |     linear计算 = 2*C16*C8*C12*C3/C32
 30 |     layernorm = 5*C16*C8*C12/1000/1000/1000
 31 |     gate门控 = (2*C16*C8*C3*C13 + 4*C16*C8*C13)/1000/1000/1000
 32 |     FF计算1 = 2*C16*C8*C3*C13/1000/1000/1000
 33 |     FF计算2 = 2*C16*C8*C3*C13/1000/1000/1000
 34 |     return sum([
 35 |         layernorm,
 36 |         QKV计算,
 37 |         RoPE,
 38 |         attention计算,
 39 |         linear计算,
 40 |         layernorm,
 41 |         gate门控,
 42 |         FF计算1,
 43 |         FF计算2
 44 |     ])
 45 | 
 46 | def cal_pivotkv_flops(num_frames, chunk_size, visual_compression_ratio=1.0, kvcache_compression_ratio=1.0):
 47 |     tokens_per_chunk = (448/14/2)*(448/14/2)*chunk_size/2/(1280/720) * visual_compression_ratio
 48 |     flops = 0
 49 |     kvcache_size = 0
 50 |     for idx in range(num_frames // chunk_size):
 51 |         flops += calculate_flops_prefill(tokens_per_chunk, kvcache_size)
 52 |         kvcache_size += tokens_per_chunk * kvcache_compression_ratio
 53 |     return flops
 54 | 
 55 | # ax*x + bx
 56 | # a2c*2c + b2c = ac*c + bc + ac*2c + b2c
 57 | # 4ac^2 + 2bc = ac^2 + bc + 2ac^2 + 2bc
 58 | 
 59 | # 9ac^2 + 3bc = ac*c + bc + ac*2c + b2c + ac*3c + b3c
 60 | #             = (1+2+3)ac^2 + (1+2+3)bc
 61 | 
 62 | # k^2ac^2 + kbc = (k^2 + k)ac^2 + kbc - kac^2
 63 | 
 64 | # f(n) = 2ac^2 + bc - ac^2
 65 | 
 66 | 
 67 | print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=1.0, kvcache_compression_ratio=0.25))
 68 | print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=0.8660254037844386, kvcache_compression_ratio=0.28867513459481287))
 69 | print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=0.7071067811865476, kvcache_compression_ratio=0.3535533905932738))
 70 | print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=0.5, kvcache_compression_ratio=0.5))
 71 | print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=0.3535533905932738, kvcache_compression_ratio=0.7071067811865476))
 72 | print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=0.28867513459481287, kvcache_compression_ratio=0.8660254037844386))
 73 | print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=0.25, kvcache_compression_ratio=1.0))
 74 | 
 75 | # print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=1.0, kvcache_compression_ratio=0.1))
 76 | # print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=1.0, kvcache_compression_ratio=0.3))
 77 | # print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=1.0, kvcache_compression_ratio=0.5))
 78 | # print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=1.0, kvcache_compression_ratio=0.7))
 79 | # print(cal_pivotkv_flops(num_frames=1024, chunk_size=32, visual_compression_ratio=1.0, kvcache_compression_ratio=1.0))
 80 | 
 81 | 
 82 | print(cal_pivotkv_flops(num_frames=256, chunk_size=256, visual_compression_ratio=1.0, kvcache_compression_ratio=1.0))
 83 | print(cal_pivotkv_flops(num_frames=256, chunk_size=32, visual_compression_ratio=1.0, kvcache_compression_ratio=1.0))
 84 | print(cal_pivotkv_flops(num_frames=256, chunk_size=32, visual_compression_ratio=1.0, kvcache_compression_ratio=0.5))
 85 | 
 86 | 
 87 | 
 88 | # LLaVA-Video
 89 | 
 90 | C3 =  3584 # attention dim	dm	3584
 91 | C4 =  28 # attention head number	head_num	28
 92 | C5 =  18944 # FF hidden layer dim	df	18944
 93 | C6 =  28 # Total Layer	L	28
 94 | C7 =  152064 # vocabulary size	V	152064
 95 | # C8 =  2304 # maximum token number	s	2304
 96 | C9 =  1 # Data format	FP16	1
 97 | C10 =  128 # attention dim for each head	dk	128
 98 | C12 =  896 # attention dim for each chip	dt	896
 99 | C13 =  4736 # FF hidden layer dim for each chip	dft	4736
100 | 
101 | C16 = 1 # batch size	b	1
102 | C17 = 1 # current token number	n	1
103 | # C18 = 3456 # KV Cache Size		3456
104 | 
105 | C32 = 1000000000
106 | 
107 | def calculate_flops_prefill(input_length, kvcache_length):
108 |     C8 = input_length
109 |     C18 = kvcache_length
110 |     layernorm = 5*C16*C8*C12/1000/1000/1000
111 |     QKV计算 = 2*3*C16*C8*C12*C3/C32
112 |     RoPE = 6*C16*C8*C12/C32
113 |     # attention计算 = (2*C16*C8*(C8+C18)*C12 + 3*C16*(C8+C18)*C12*C12 + C16*C8*(C8+C18))/C32
114 |     attention计算 = (4*C16*C8*(C8+C18)*C12 - 2*C16*C8*C8*C12 + 3*C16*C8*C12*C12 + 2*C16*C8*(C8+C18) - C16*C8*C8)/C32
115 |     linear计算 = 2*C16*C8*C12*C3/C32
116 |     layernorm = 5*C16*C8*C12/1000/1000/1000
117 |     gate门控 = (2*C16*C8*C3*C13 + 4*C16*C8*C13)/1000/1000/1000
118 |     FF计算1 = 2*C16*C8*C3*C13/1000/1000/1000
119 |     FF计算2 = 2*C16*C8*C3*C13/1000/1000/1000
120 |     return sum([
121 |         layernorm,
122 |         QKV计算,
123 |         RoPE,
124 |         attention计算,
125 |         linear计算,
126 |         layernorm,
127 |         gate门控,
128 |         FF计算1,
129 |         FF计算2
130 |     ])
131 | 
132 | def cal_pivotkv_flops(num_frames, chunk_size, visual_compression_ratio=1.0, kvcache_compression_ratio=1.0):
133 |     tokens_per_chunk = (384/14/2)**2*chunk_size * visual_compression_ratio
134 |     flops = 0
135 |     kvcache_size = 0
136 |     for idx in range(num_frames // chunk_size):
137 |         flops += calculate_flops_prefill(tokens_per_chunk, kvcache_size)
138 |         kvcache_size += tokens_per_chunk * kvcache_compression_ratio
139 |     return flops
140 | 
141 | # ax*x + bx
142 | # a2c*2c + b2c = ac*c + bc + ac*2c + b2c
143 | # 4ac^2 + 2bc = ac^2 + bc + 2ac^2 + 2bc
144 | 
145 | # 9ac^2 + 3bc = ac*c + bc + ac*2c + b2c + ac*3c + b3c
146 | #             = (1+2+3)ac^2 + (1+2+3)bc
147 | 
148 | # k^2ac^2 + kbc = (k^2 + k)ac^2 + kbc - kac^2
149 | 
150 | # f(n) = 2ac^2 + bc - ac^2
151 | 
152 | print(cal_pivotkv_flops(num_frames=256, chunk_size=32, visual_compression_ratio=1.0, kvcache_compression_ratio=1.0))
153 | print(cal_pivotkv_flops(num_frames=256, chunk_size=32, visual_compression_ratio=1.0, kvcache_compression_ratio=0.5))
154 | 


--------------------------------------------------------------------------------
/scripts/utils/cal_ttft.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import cv2
  3 | import math
  4 | import yaml
  5 | import time
  6 | from PIL import Image
  7 | from typing import List, Union
  8 | 
  9 | import torch
 10 | import numpy as np
 11 | from torchvision.transforms.functional import pil_to_tensor
 12 | from transformers import AutoProcessor
 13 | 
 14 | import retake
 15 | 
 16 | 
 17 | def get_frame_indices(total_frames, max_num_frames, sample_fps, extraction_fps):
 18 |     # Get number of sampled frames
 19 |     sample_frames = float(total_frames / extraction_fps) * sample_fps
 20 |     sample_frames = min(total_frames, max_num_frames, sample_frames)
 21 |     sample_frames = math.floor(sample_frames)
 22 |     sample_frames = int(sample_frames / 2) * 2
 23 |     # Get sampled frame indices
 24 |     frame_indices = np.linspace(0, total_frames - 1, sample_frames).astype(np.int32)
 25 |     return frame_indices
 26 | 
 27 | 
 28 | def load_specific_frames(cap, frame_indices):
 29 |     # List to store the frames
 30 |     frames = []
 31 |     # Read frames from the video
 32 |     for frame_index in frame_indices:
 33 |         # Set the video position to the desired frame index
 34 |         cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
 35 |         # Read the frame
 36 |         ret, frame = cap.read()
 37 |         # If the frame was read successfully, append it to the list
 38 |         if ret:
 39 |             # Convert the frame from BGR to RGB
 40 |             frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
 41 |             # Create a PIL Image from the frame
 42 |             frame = Image.fromarray(frame_rgb)
 43 |             frames.append(frame)
 44 |         else:
 45 |             ValueError(f"Warning: Could not read frame at index {frame_index}. It may be out of range.")
 46 |     return frames
 47 | 
 48 | 
 49 | def load_video(video_path: str, max_num_frames: int, fps: Union[int, float]=None, frame_extraction_fps: Union[int, float]=None):
 50 |     """Load video frames at fps. If total frames larger than `max_num_frames`, do downsample.
 51 |        If 'fps' is `None`, load uniformly sample `max_num_frames` frames.
 52 | 
 53 |        video_path: Should either be a videofile or a directory of extracted frames.
 54 | 
 55 |        # NOTE: The extract frames must have name pattern of `%06d.(ext)`, or the loaded frame order will be wrong.
 56 |     """
 57 |     if video_path.startswith("file://"):
 58 |         video_path = video_path[7:]
 59 |     if os.path.isdir(video_path): # directory extracted frames
 60 |         assert frame_extraction_fps is not None
 61 |         pass
 62 |     else: # filename of a video
 63 |         # Open the video file
 64 |         cap = cv2.VideoCapture(video_path)
 65 |         if not cap.isOpened():
 66 |             raise ValueError("Error: Could not open video.")
 67 |         total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
 68 |         frame_extraction_fps = cap.get(cv2.CAP_PROP_FPS)
 69 |         # Get indices of sampled frame
 70 |         frame_indices = get_frame_indices(total_frames, max_num_frames, fps, frame_extraction_fps)
 71 |         # Get frames
 72 |         frames = load_specific_frames(cap, frame_indices)
 73 |         # Release the video capture object
 74 |         cap.release()
 75 | 
 76 |     # Convert into RGB format
 77 |     frames = [
 78 |         frame.convert("RGB") if frame.mode != "RGB" else frame
 79 |         for frame in frames
 80 |     ]
 81 | 
 82 |     return frames
 83 | 
 84 | 
 85 | def resize_image_longside(image, image_resolution):
 86 |     r"""
 87 |     Pre-processes a single image.
 88 |     """
 89 |     if max(image.width, image.height) > image_resolution:
 90 |         resize_factor = image_resolution / max(image.width, image.height)
 91 |         width, height = int(image.width * resize_factor), int(image.height * resize_factor)
 92 |         image = image.resize((width, height), resample=Image.NEAREST)
 93 | 
 94 |     return image
 95 | 
 96 | 
 97 | def resize_video_longside(frames: List, video_resolution):
 98 |     """
 99 |     frames: list of PIL images.
100 |     """
101 |     frames = [
102 |         resize_image_longside(frame, video_resolution)
103 |         for frame in frames
104 |     ]
105 |     return frames
106 | 
107 | 
108 | def load_yaml(file_path):
109 |     with open(file_path, 'r') as file:
110 |         data = yaml.safe_load(file)
111 |     return data
112 | 
113 | 
114 | def fetch_video(video_info, max_num_frames, sample_fps, longsize_resolution):
115 |     frames = load_video(video_info['video'], max_num_frames, sample_fps, video_info.get('frame_extraction_fps', None))
116 |     frames = resize_video_longside(frames, longsize_resolution)
117 |     frames = [pil_to_tensor(frame) for frame in frames]
118 |     return frames
119 | 
120 | 
121 | def load_and_patch_model(model_name, hf_model_path, exp_configs, device):
122 |     model_name = model_name if model_name is not None else exp_configs['model_name']
123 |     model_name = model_name.lower().replace('-', '').replace('_', '')
124 |     if model_name == 'qwen2vl':
125 |         from transformers import Qwen2VLConfig, Qwen2VLForConditionalGeneration
126 |         from retake.monkeypatch import patch_qwen2vl, patch_qwen2vl_config
127 |         retake.qwen2_vl.DEBUG_MODE = True
128 |         patch_qwen2vl(exp_configs['method']) # Replace some functions of QWen2VL with those from ReTaKe
129 |         qwen2vl_config = Qwen2VLConfig.from_pretrained(hf_model_path)
130 |         qwen2vl_config = patch_qwen2vl_config(qwen2vl_config, exp_configs)
131 |         model = Qwen2VLForConditionalGeneration.from_pretrained(
132 |             hf_model_path,
133 |             config=qwen2vl_config,
134 |             torch_dtype=torch.bfloat16,
135 |             attn_implementation=exp_configs.get('attn_implementation', None),
136 |             device_map=device # "auto"
137 |         ).eval()
138 |         processor = AutoProcessor.from_pretrained(hf_model_path)
139 |     elif model_name == 'llavavideo':
140 |         from transformers import LlavaOnevisionConfig, LlavaOnevisionForConditionalGeneration
141 |         from retake.monkeypatch import patch_llava_video_config
142 |         llava_video_config = LlavaOnevisionConfig.from_pretrained(hf_model_path)
143 |         llava_video_config = patch_llava_video_config(llava_video_config, exp_configs)
144 |         processor = AutoProcessor.from_pretrained(hf_model_path) 
145 |         model = LlavaOnevisionForConditionalGeneration.from_pretrained(
146 |             hf_model_path, 
147 |             config=llava_video_config,
148 |             torch_dtype=torch.bfloat16,
149 |             attn_implementation=exp_configs.get('attn_implementation', None),
150 |             device_map=device # "auto"
151 |         ) 
152 |         # TODO: Patch LLava-Video with ReTaKe
153 |     else:
154 |         raise NotImplementedError
155 |     return model, processor
156 | 
157 | 
158 | DEMO_VIDEO = 'misc/Q8AZ16uBhr8_resized_fps2_mute.mp4'
159 | DEMO_QUESTION = "Describe the video."
160 | 
161 | 
162 | if __name__ == "__main__":
163 |     #------------------- Modify the following configs ------------------#
164 |     hf_model_path = 'Qwen/Qwen2-VL-7B-Instruct' # TODO: replace to local path if you have trouble downloading huggingface models
165 |     model_name = 'qwen2_vl'
166 |     # hf_model_path = '/path_to/llava-video-qwen2-7b-hf'
167 |     # model_name = 'llava_video'
168 |     # hf_model_path = 'llava-hf/llava-onevision-qwen2-7b-ov-hf'
169 |     # model_name = 'llava_onevision'
170 |     
171 |     # NOTE: for Nvidia GPUs
172 |     config_path = 'configs/retake_demo_tpot.yaml'
173 |     device = 'cuda:0'
174 | 
175 |     # NOTE: for NPUs or GPUs without support for FlashAttention
176 |     # config_path = 'configs/retake_demo_npu.yaml'
177 |     # device = 'npu:0'
178 | 
179 |     #------------------------ No need to change ------------------------#
180 |     exp_configs = load_yaml(config_path)
181 | 
182 |     model, processor = load_and_patch_model(model_name, hf_model_path, exp_configs, device)
183 | 
184 |     # Video
185 |     video_info = {"type": "video", 
186 |                   "video": DEMO_VIDEO, 
187 |                   "fps": 2.0}
188 |     video = fetch_video(video_info, 256, 2, exp_configs['longsize_resolution'])
189 | 
190 |     def cal_total_time(num_samples, max_new_tokens=512):
191 |         total_times = []
192 |         num_new_tokens = []
193 |         conversation = [
194 |             {
195 |                 "role": "user",
196 |                 "content": [
197 |                     {"type": "video"},
198 |                     {"type": "text", "text": DEMO_QUESTION},
199 |                 ],
200 |             }
201 |         ]
202 | 
203 |         # Preprocess the inputs
204 |         text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
205 | 
206 |         inputs = processor(text=[text_prompt], videos=[video], padding=True, return_tensors="pt")
207 |         inputs = inputs.to(device)
208 |         inputs['pixel_values_videos'] = inputs['pixel_values_videos'].to(torch.bfloat16)
209 | 
210 |         # Inference: Generation of the output
211 |         for idx in range(num_samples + 1):
212 |             start_time = time.time()
213 |             output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
214 |             end_time = time.time()
215 | 
216 |             generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
217 |             output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
218 |             output_text = output_text[0]
219 | 
220 |             if idx > 0: # Skip first sample, generally takes longer time
221 |                 total_times.append(end_time - start_time)
222 |                 num_new_tokens.append(generated_ids[0].shape[0])
223 |         print('Output text:\n', output_text)
224 |         return total_times, num_new_tokens
225 | 
226 |     total_times, num_new_tokens = cal_total_time(5, max_new_tokens=1)
227 |     print('total_times', total_times)
228 |     print('num_tokens', num_new_tokens)
229 |     print('Avg. total time:', sum(total_times) / len(total_times))
230 |     ttft = sum(total_times) / len(total_times)
231 | 
232 |     total_times, num_new_tokens = cal_total_time(5, max_new_tokens=512)
233 |     print('total_times', total_times)
234 |     print('num_tokens', num_new_tokens)
235 |     print('Avg. total time:', sum(total_times) / len(total_times))
236 |     tpot = (sum(total_times) / len(total_times) - ttft) / (sum(num_new_tokens) / len(num_new_tokens) - 1)
237 | 
238 |     print(f"Time To First Token (TTFT): {ttft:.6f} seconds")
239 |     print(f"Time Per Output Token (TPOT): {tpot:.6f} seconds")
240 | 


--------------------------------------------------------------------------------
/scripts/utils/convert_llava_video_weights_to_hf.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2023 The HuggingFace Inc. team. All rights reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | import argparse
 15 | import glob
 16 | 
 17 | import torch
 18 | from huggingface_hub import file_exists, hf_hub_download, snapshot_download
 19 | from safetensors import safe_open
 20 | 
 21 | from transformers import (
 22 |     AddedToken,
 23 |     AutoConfig,
 24 |     AutoImageProcessor,
 25 |     AutoTokenizer,
 26 |     LlavaConfig,
 27 |     LlavaForConditionalGeneration,
 28 |     LlavaOnevisionForConditionalGeneration,
 29 |     LlavaProcessor,
 30 |     LlavaOnevisionProcessor,
 31 |     SiglipVisionConfig,
 32 | )
 33 | 
 34 | 
 35 | EPILOG_TXT = """Example:
 36 |     python transformers/src/transformers/models/llava/convert_llava_weights_to_hf.py --text_model_id lmsys/vicuna-7b-v1.5 --vision_model_id openai/clip-vit-large-patch14-336 --output_hub_path org/llava-v1.5-7b-conv --old_state_dict_id liuhaotian/llava-v1.5-7b
 37 | 
 38 | Example for creating the old state dict file with Python:
 39 | 
 40 |     import torch
 41 |     from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM
 42 | 
 43 |     # load model
 44 |     kwargs = {"device_map": "auto", "torch_dtype": torch.float16}
 45 |     model = LlavaLlamaForCausalLM.from_pretrained("liuhaotian/llava-v1.5-7b", low_cpu_mem_usage=True, **kwargs)
 46 | 
 47 |     # load vision tower
 48 |     model.get_vision_tower().load_model()
 49 | 
 50 |     # Save state dict
 51 |     torch.save(model.state_dict(), "tmp/hf_models/llava-v1.5-7b/model_state_dict.bin")
 52 | """
 53 | 
 54 | KEYS_TO_MODIFY_MAPPING = {
 55 |     "model.image_newline": "image_newline",
 56 |     "model.vision_tower.": "",
 57 |     ".vision_resampler": "",  # all lmms-lab models do avg pooling, so no vision_resampler
 58 |     "model.mm_projector": "multi_modal_projector",
 59 |     "model": "model.model",
 60 |     "vision_model.model": "vision_model",
 61 |     "lm_head": "language_model.lm_head",
 62 |     "model.model": "language_model.model",
 63 |     "multi_modal_projector.0": "multi_modal_projector.linear_1",
 64 |     "multi_modal_projector.2": "multi_modal_projector.linear_2",
 65 | }
 66 | 
 67 | 
 68 | def load_original_state_dict(model_id):
 69 |     import os
 70 |     if os.path.exists(model_id):
 71 |         directory_path = model_id
 72 |     else:
 73 |         directory_path = snapshot_download(repo_id=model_id, allow_patterns=["*.safetensors"])
 74 | 
 75 |     original_state_dict = {}
 76 |     for path in glob.glob(f"{directory_path}/*"):
 77 |         if path.endswith(".safetensors"):
 78 |             with safe_open(path, framework="pt", device="cpu") as f:
 79 |                 for key in f.keys():
 80 |                     original_state_dict[key] = f.get_tensor(key)
 81 | 
 82 |     # tied wieghts so lm.head is not saved. Let's clone to load state dict
 83 |     if "lm_head.weight" not in original_state_dict:
 84 |         original_state_dict["lm_head.weight"] = original_state_dict["model.embed_tokens.weight"].clone()
 85 | 
 86 |     # if "model.image_newline" in original_state_dict:
 87 |     #     # not used in the original implementation because "merge_type=flat"
 88 |     #     del original_state_dict["model.image_newline"]
 89 |     return original_state_dict
 90 | 
 91 | 
 92 | # used only for llava-interlave
 93 | # for ex: Qwen/Qwen1.5-0.5B-Chat google/siglip-so400m-patch14-384 lmms-lab/llava-next-interleave-qwen-0.5b
 94 | def convert_state_dict_to_hf(state_dict):
 95 |     new_state_dict = {}
 96 |     for key, value in state_dict.items():
 97 |         if key.endswith(".inv_freq"):
 98 |             continue
 99 |         for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
100 |             if key_to_modify in key:
101 |                 key = key.replace(key_to_modify, new_key)
102 | 
103 |         new_state_dict[key] = value
104 |     return new_state_dict
105 | 
106 | 
107 | def convert_llava_llama_to_hf(text_model_id, vision_model_id, output_hub_path, old_state_dict_id):
108 |     torch.set_default_dtype(torch.float16)
109 |     text_config = AutoConfig.from_pretrained(text_model_id)
110 | 
111 |     tokenizer = AutoTokenizer.from_pretrained(text_model_id)
112 |     tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
113 |     if "Qwen" not in text_model_id:  # qwen already has a pad token
114 |         tokenizer.add_special_tokens({"pad_token": "<pad>"})
115 | 
116 |     # image_processor = AutoImageProcessor.from_pretrained(vision_model_id)
117 |     # processor = LlavaOnevisionProcessor(tokenizer=tokenizer, image_processor=image_processor)
118 |     processor = LlavaOnevisionProcessor.from_pretrained('llava-hf/llava-onevision-qwen2-7b-ov-hf')
119 | 
120 |     if "siglip" in vision_model_id:
121 |         vision_config = SiglipVisionConfig(
122 |             hidden_size=1152,
123 |             image_size=384,
124 |             intermediate_size=4304,
125 |             num_attention_heads=16,
126 |             num_hidden_layers=26,
127 |             patch_size=14,
128 |             vision_use_head=False,
129 |         ).to_dict()
130 |     else:
131 |         vision_config = None
132 | 
133 |     config = LlavaConfig(
134 |         text_config=text_config,
135 |         vision_config=vision_config,
136 |     )
137 | 
138 |     # llms-lab interleeave models do not use any selection startegy except for last hidden state
139 |     if "Qwen" in text_model_id:
140 |         config.image_token_index = 151646
141 |         if "siglip" in vision_model_id:
142 |             config.vision_feature_select_strategy = "full"
143 |             config.vision_feature_layer = -1
144 |     else:
145 |         config.pad_token_id = 32001
146 |         config.image_token_index = 32000
147 | 
148 |     with torch.device("meta"):
149 |         # model = LlavaForConditionalGeneration(config)
150 |         model = LlavaOnevisionForConditionalGeneration(config)
151 | 
152 |     # Some llava variants like microsoft/llava-med-v1.5-mistral-7b use safetensors to store weights
153 |     # if file_exists(old_state_dict_id, "model_state_dict.bin"):
154 |     #     state_dict_path = hf_hub_download(old_state_dict_id, "model_state_dict.bin")
155 |     #     state_dict = torch.load(state_dict_path, map_location="cpu", weights_only=True)
156 |     # else:
157 |     state_dict = load_original_state_dict(old_state_dict_id)
158 | 
159 |     state_dict = convert_state_dict_to_hf(state_dict)
160 |     model.load_state_dict(state_dict, strict=True, assign=True)
161 | 
162 |     pre_expansion_embeddings = model.language_model.model.embed_tokens.weight.data
163 |     mu = torch.mean(pre_expansion_embeddings, dim=0).float()
164 |     n = pre_expansion_embeddings.size()[0]
165 |     sigma = ((pre_expansion_embeddings - mu).T @ (pre_expansion_embeddings - mu)) / n
166 |     dist = torch.distributions.multivariate_normal.MultivariateNormal(mu, covariance_matrix=1e-5 * sigma)
167 | 
168 |     # We add an image token so we resize the model and pad to 64 for performance reasons
169 |     # pad_shape = 64
170 |     # vocab_size = config.text_config.vocab_size
171 |     # model.resize_token_embeddings(config.text_config.vocab_size + 2, pad_shape)
172 |     # model.language_model.model.embed_tokens.weight.data[vocab_size:] = torch.stack(
173 |     #     tuple(
174 |     #         (dist.sample() for _ in range(model.language_model.model.embed_tokens.weight.data[vocab_size:].shape[0]))
175 |     #     ),
176 |     #     dim=0,
177 |     # )
178 |     # print('vocab_size', vocab_size)
179 |     # print('model.language_model.lm_head.weight.data.shape[0]', model.language_model.lm_head.weight.data.shape[0])
180 |     # model.language_model.lm_head.weight.data[vocab_size:] = torch.stack(
181 |     #     tuple((dist.sample() for _ in range(model.language_model.lm_head.weight.data[vocab_size:].shape[0]))),
182 |     #     dim=0,
183 |     # )
184 | 
185 |     # model.push_to_hub(output_hub_path)
186 |     model.save_pretrained(output_hub_path)
187 |     # processor.push_to_hub(output_hub_path)
188 |     processor.save_pretrained(output_hub_path)
189 | 
190 | 
191 | def main():
192 |     parser = argparse.ArgumentParser(
193 |         epilog=EPILOG_TXT,
194 |         formatter_class=argparse.RawDescriptionHelpFormatter,
195 |     )
196 |     parser.add_argument(
197 |         "--text_model_id",
198 |         help="Hub location of the text model",
199 |     )
200 |     parser.add_argument(
201 |         "--vision_model_id",
202 |         help="Hub location of the vision model",
203 |     )
204 |     parser.add_argument(
205 |         "--output_hub_path",
206 |         help="Location on the hub of the converted model",
207 |     )
208 |     parser.add_argument(
209 |         "--old_state_dict_id",
210 |         help="Location on the hub of the raw state dict of the original model. The filename needs to be `model_state_dict.bin`",
211 |     )
212 |     args = parser.parse_args()
213 |     convert_llava_llama_to_hf(args.text_model_id, args.vision_model_id, args.output_hub_path, args.old_state_dict_id)
214 | 
215 | 
216 | if __name__ == "__main__":
217 |     main()


--------------------------------------------------------------------------------
/scripts/utils/frame_extraction.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import glob
 3 | import argparse
 4 | from concurrent.futures import ProcessPoolExecutor
 5 | from tqdm import tqdm
 6 | 
 7 | def process_video(videofile, results_dir, fps, overwrite_existing_results):
 8 |     videoframe_dir = os.path.join(results_dir, os.path.basename(videofile).replace(".mp4", ""))
 9 |     if os.path.exists(videoframe_dir):
10 |         if not overwrite_existing_results:
11 |             print(f"Skipping {videofile} as results already exist.")
12 |             return
13 |         os.system(f'rm -rf {videoframe_dir}')
14 |     try:
15 |         os.makedirs(videoframe_dir)
16 |     except:
17 |         # MLVU has duplicated videos in differnet subclasses
18 |         return
19 |     framefile_tpl = os.path.join(videoframe_dir, "%06d.jpg")
20 |     print(f"{videofile} -> {framefile_tpl}")
21 |     os.system(f'ffmpeg -i {videofile} -vf fps={fps} -vsync vfr {framefile_tpl} -hide_banner -loglevel error')
22 |     # NOTE: You can add `-q:v 1` to achieve the highest frame quality, but it will be slower.
23 |     # We did not add `-q:v 1`. If is useful for those models trained with high quality image only.
24 | 
25 | def main(args):
26 |     videofiles = glob.glob(args.videofile_tpl)
27 | 
28 |     with ProcessPoolExecutor(max_workers=args.num_workers) as executor:
29 |         list(tqdm(executor.map(process_video, videofiles, [args.results_dir]*len(videofiles), 
30 |                               [args.fps]*len(videofiles), [args.overwrite_existing_results]*len(videofiles)), 
31 |                   total=len(videofiles)))
32 | 
33 | if __name__ == "__main__":
34 |     """
35 |     For each video `*/videoname.mp4`, this script creats a directory
36 |     {results_dir}/videoname and puts extracted frames in it.
37 |     """
38 |     parser = argparse.ArgumentParser(description="Process videos to extract frames at a specified FPS.")
39 |     parser.add_argument('--videofile_tpl', type=str, required=True, 
40 |                         help='Template for video files, e.g., /path/to/videos/*.mp4')
41 |     parser.add_argument('--results_dir', type=str, required=True, 
42 |                         help='Directory where extracted frames will be stored')
43 |     parser.add_argument('--num_workers', type=int, default=32, 
44 |                         help='Number of parallel workers for processing videos')
45 |     parser.add_argument('--fps', type=int, default=25, 
46 |                         help='Frames per second for extracted frames')
47 |     parser.add_argument('--overwrite_existing_results', action='store_true', 
48 |                         help='Overwrite existing results if they exist')
49 | 
50 |     args = parser.parse_args()
51 |     main(args)
52 | 


--------------------------------------------------------------------------------