├── .gitignore
├── LICENSE
├── README.md
├── checkpoints
│   └── .gitkeep
├── docs
│   ├── data.md
│   ├── eval.md
│   ├── inference.ipynb
│   ├── inference_for_glm.ipynb
│   ├── offline_demo.md
│   └── train.md
├── images
│   ├── demo.mp4
│   ├── ex.png
│   ├── ex1.png
│   ├── ex2.png
│   ├── ex3.png
│   └── framework.png
├── requirements.txt
├── scripts
│   ├── stage1.sh
│   ├── stage1_glm.sh
│   ├── stage2.sh
│   ├── stage2_glm.sh
│   ├── stage3.sh
│   ├── stage3_glm.sh
│   ├── zero2.json
│   ├── zero3.json
│   └── zero3_offload.json
└── vtimellm
    ├── __init__.py
    ├── constants.py
    ├── conversation.py
    ├── demo_gradio.py
    ├── eval
    │   ├── data_example.json
    │   ├── dvc_eval
    │   │   ├── SODA
    │   │   │   ├── LICENSE
    │   │   │   ├── README.md
    │   │   │   ├── dataset.py
    │   │   │   ├── nlpeval
    │   │   │   │   ├── bert_f_score.py
    │   │   │   │   ├── bert_r_score.py
    │   │   │   │   └── mover.py
    │   │   │   ├── requirements.txt
    │   │   │   ├── soda.py
    │   │   │   └── utils.py
    │   │   ├── __init__.py
    │   │   ├── eval_dvc.py
    │   │   └── eval_soda.py
    │   ├── eval.py
    │   ├── log
    │   │   └── example_log.txt
    │   └── metric.py
    ├── inference.py
    ├── mm_utils.py
    ├── model
    │   ├── __init__.py
    │   ├── builder.py
    │   ├── chatglm
    │   │   ├── __init__.py
    │   │   ├── configuration_chatglm.py
    │   │   ├── modeling_chatglm.py
    │   │   ├── quantization.py
    │   │   └── tokenization_chatglm.py
    │   ├── vtimellm_arch.py
    │   ├── vtimellm_chatglm.py
    │   └── vtimellm_llama.py
    ├── train
    │   ├── dataset.py
    │   ├── llama_flash_attn_monkey_patch.py
    │   ├── train.py
    │   ├── train_mem.py
    │   └── vtimellm_trainer.py
    └── utils.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | **/__pycache__
2 | checkpoints/
3 | .ipynb_checkpoints/
4 | wandb/
5 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # VTimeLLM \[[Paper](https://arxiv.org/pdf/2311.18445.pdf)\]
2 | Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
3 | 
4 | 
5 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=vtimellm-empower-llm-to-grasp-video-moments)
6 | 
7 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/dense-video-captioning-on-activitynet)](https://paperswithcode.com/sota/dense-video-captioning-on-activitynet?p=vtimellm-empower-llm-to-grasp-video-moments)
8 | 
9 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=vtimellm-empower-llm-to-grasp-video-moments)
10 | 
11 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=vtimellm-empower-llm-to-grasp-video-moments)
12 | 
13 | 
14 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=vtimellm-empower-llm-to-grasp-video-moments)
15 | 
16 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=vtimellm-empower-llm-to-grasp-video-moments)
17 | 
18 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vtimellm-empower-llm-to-grasp-video-moments/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=vtimellm-empower-llm-to-grasp-video-moments)
19 | 
20 | ---
21 | 
22 | ## :loudspeaker: Latest Updates
23 | - **Jan-2**: Thanks to [Xiao Xia](https://github.com/Rishubi), [Shengbo Tong](https://github.com/tsb-19) and [Beining Wang](https://github.com/Benson0704), we have refactored the code, which now supports both the LLaMA and ChatGLM3 architectures. We translated the training data into Chinese and fine-tuned a Chinese version based on ChatGLM3-6B.
24 | - **Dec-14**: Released the training code and data. All the resources, including models, datasets and extracted features, are available
25 | [here](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/?p=%2F&mode=list). :fire::fire:
26 | - **Dec-4**: VTimeLLM demo released.
27 | 
28 | ---
29 | 
30 | 
31 | 
32 | ## VTimeLLM Overview :bulb:
33 | 
34 | VTimeLLM is a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries.
35 | 
36 | VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multi-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding and alignment with human intents.
37 | 
38 | ![framework](images/framework.png)
39 | 
40 | 
41 | ---
42 | 
43 | ## Contributions :trophy:
44 | 
45 | - We propose VTimeLLM, the first boundary-aware Video LLM, to the best of our knowledge.
46 | - We propose the boundary-aware three-stage training strategy, which consecutively leverages i) large-scale image-text data for feature alignment, ii) large-scale multi-event video-text data together with temporal-related single-turn and multi-turn QA to enhance awareness of time boundaries, and iii) instruction tuning on a high-quality dialogue dataset for better temporal reasoning ability.
47 | - We conduct extensive experiments to demonstrate that the proposed VTimeLLM significantly outperforms existing Video LLMs on various fine-grained, temporal-related video tasks, showing its superior ability for video understanding and reasoning.
48 | 
49 | 
50 | ---
51 | 
52 | ## Installation :wrench:
53 | 
54 | We recommend setting up a conda environment for the project:
55 | ```shell
56 | conda create --name=vtimellm python=3.10
57 | conda activate vtimellm
58 | 
59 | git clone https://github.com/huangb23/VTimeLLM.git
60 | cd VTimeLLM
61 | pip install -r requirements.txt
62 | ```
63 | For training, install these additional packages:
64 | ```shell
65 | pip install ninja
66 | pip install flash-attn --no-build-isolation
67 | ```
68 | 
69 | 
70 | 
71 | ## Running Demo Offline :cd:
72 | 
73 | To run the demo offline, please refer to the instructions in [offline_demo.md](docs/offline_demo.md).
74 | 
75 | ## Training :train:
76 | 
77 | For training instructions, check out [train.md](docs/train.md).
78 | 
79 | ## Qualitative Analysis :mag:
80 | 
81 | A comprehensive evaluation of VTimeLLM's performance across multiple tasks.
82 | 
83 | 
84 | ### Video Understanding and Conversational Tasks :speech_balloon:
85 | ![0](images/ex.png)
86 | 
87 | ---
88 | 
89 | ### Creative Tasks :paintbrush:
90 | ![1](images/ex1.png)
91 | 
92 | ---
93 | ### Fine-grained Understanding Tasks :globe_with_meridians:
94 | ![2](images/ex2.png)
95 | 
96 | ---
97 | ### Video Reasoning Tasks :question:
98 | ![3](images/ex3.png)
99 | 
100 | ---
101 | 
102 | 
103 | ## Acknowledgements :pray:
104 | 
105 | We are grateful to the following awesome projects that VTimeLLM builds upon:
106 | 
107 | * [LLaVA](https://github.com/haotian-liu/LLaVA): Large Language and Vision Assistant
108 | * [FastChat](https://github.com/lm-sys/FastChat): An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
109 | * [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT): Towards Detailed Video Understanding via Large Vision and Language Models
110 | * [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
111 | * [Vid2seq](https://github.com/google-research/scenic/tree/main/scenic/projects/vid2seq): Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
112 | * [InternVid](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid): A Large-scale Video-Text Dataset
113 | 
114 | 
115 | If you use VTimeLLM in your research or applications, please cite it using this BibTeX:
116 | ```bibtex
117 | @inproceedings{huang2024vtimellm,
118 |   title={{VTimeLLM}: Empower {LLM} to Grasp Video Moments},
119 |   author={Huang, Bin and Wang, Xin and Chen, Hong and Song, Zihan and Zhu, Wenwu},
120 |   booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
121 |   pages={14271--14280},
122 |   year={2024}
123 | }
124 | ```
125 | 
126 | ## License :scroll:
127 | Creative Commons License
128 | 
129 | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License.
130 | 
131 | 
132 | Looking forward to your feedback, contributions, and stars! :star2:
133 | 
--------------------------------------------------------------------------------
/checkpoints/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huangb23/VTimeLLM/c34ae56455c470ecbe002cbc53e30d5e67b03948/checkpoints/.gitkeep
--------------------------------------------------------------------------------
/docs/data.md:
--------------------------------------------------------------------------------
1 | Our dataset can be downloaded from [here](https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/?p=%2Fdata&mode=list). The data is organized in the following JSON format:
2 | 
3 | * source: The original source dataset of the video. This field is ['internvid'](https://huggingface.co/datasets/OpenGVLab/InternVid) in stage 2, and ['anet'](http://activity-net.org/download.html) or ['didemo'](https://drive.google.com/drive/u/0/folders/1_oyJ5rQiZboipbMl6tkhY8v0s9zDkvJc) in stage 3.
4 | * id: The ID of the video in the original dataset.
5 | * conversations: The conversation data used for training. We use special tokens to indicate timestamps; these need to be replaced during training using information from the 'meta' field.
6 | * meta:
7 |     * split: Two numbers denoting the start and end timestamps of the segment extracted from the original video and used as input. If this field does not exist, the entire video is used as input.
8 |     * duration: The length of the input video.
9 |     * token: The timestamps of the special tokens appearing in 'conversations'.
10 | 
11 | Here is an example:
12 | 
13 | ```json
14 | {
15 |     "source": "internvid",
16 |     "id": "3n3oCNerzV0",
17 |     "conversations": [
18 |         {
19 |             "from": "human",
20 |             "value": "