├── .gitignore
├── LICENSE
├── README.md
├── deepperception
│   └── eval
│       ├── eval.sh
│       ├── evaluate.py
│       └── inference.py
├── figs
│   └── header.png
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | deepperception/data
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 Maxy
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
2 | Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Xiaoyi Feng, Maosong Sun
3 |
4 | -----
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 | This is the official repository of **DeepPerception**, an MLLM enhanced with cognitive visual perception capabilities.
13 |
14 | ## Release
15 |
16 | - [x] **`2025.03.18`** 🔥Released the DeepPerception evaluation code and model on [`🤗HuggingFace`](https://huggingface.co/MaxyLee/DeepPerception).
17 | - [x] **`2025.03.18`** 🔥Released the DeepPerception paper on [`📕arXiv`](https://arxiv.org/abs/2503.12797).
18 |
19 | ## Overview
20 |
21 |
22 | 
23 | Figure 1: (a) DeepPerception employs knowledge-driven reasoning to derive answers, while the baseline model directly outputs predictions without cognitive processing. (b) DeepPerception demonstrates superior cognitive visual perception capabilities that cannot be elicited in the foundation model through simplistic zero-shot CoT prompting.
24 |
25 |
26 | #### Abstract
27 |
28 | Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis.
29 |
30 | To bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. To address the challenges of KVG, we propose **DeepPerception**, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework combining supervised fine-tuning for cognitive reasoning scaffolding and reinforcement learning to optimize perception-cognition synergy. To benchmark performance, we introduce KVG-Bench, a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases.
31 |
32 | Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving +8.08% accuracy improvements on KVG-Bench and exhibiting +4.60% superior cross-domain generalization over baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research.
33 |
34 | #### Key Contributions
35 |
36 | - We introduce the task of **Knowledge-intensive Visual Grounding (KVG)** to explore the concept of cognitive visual perception for MLLMs, aiming to integrate their inherent knowledge and reasoning capabilities into visual perception.
37 | - We propose **[DeepPerception](https://huggingface.co/MaxyLee/DeepPerception)**, an MLLM with enhanced cognitive visual perception capabilities. To achieve this, we develop an automated dataset creation pipeline and a two-stage framework integrating supervised cognitive capability enhancement with perception-oriented reinforcement learning.
38 | - We introduce **[KVG-Bench](https://huggingface.co/datasets/MaxyLee/KVG-Bench)**, a manually curated benchmark for the KVG task involving diverse knowledge domains and entities. Experiments on KVG-Bench and other fine-grained visual recognition tasks demonstrate DeepPerception's exceptional cognitive visual perception capabilities and superior cross-domain generalization performance.
39 |
40 | ## Get Started
41 |
42 | ### Contents:
43 |
44 | - [Environment](#environment)
45 | - [Data Preparation](#data-preparation)
46 | - [Checkpoints](#checkpoints)
47 | - [Evaluation](#evaluation)
48 | - [Training](#training)
49 |
50 | ### Environment
51 |
52 | 1. Clone this repository and navigate to the DeepPerception folder:
53 | ```bash
54 | git clone https://github.com/MaxyLee/DeepPerception.git
55 | cd DeepPerception
56 | ```
57 | 2. Install Packages
58 | For evaluation:
59 | ```bash
60 | conda create -n deepperception python=3.9
61 | conda activate deepperception
62 |
63 | pip install -r requirements.txt
64 | ```
65 |
66 | ### Data Preparation
67 |
68 | | Dataset | Links |
69 | |--------- |---------------------------------------|
70 | | KVG-Bench | [`🤗HuggingFace`](https://huggingface.co/datasets/MaxyLee/KVG-Bench) |
71 | | KVG Training | [`🤗HuggingFace`](https://huggingface.co/datasets/MaxyLee/KVG) |
72 | ---
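
For example, you can fetch the data locally with `huggingface_hub` before pointing `eval.sh` at it. This is a minimal sketch, not part of the official pipeline: the repo ids come from the table above, but the local paths and the exact file layout inside each dataset repo are placeholders to adapt.

```python
# Minimal download sketch; local_dir values are placeholders.
from huggingface_hub import snapshot_download

# KVG-Bench (evaluation data); note that deepperception/data is git-ignored.
snapshot_download(
    repo_id="MaxyLee/KVG-Bench",
    repo_type="dataset",
    local_dir="deepperception/data/kvg-bench",
)

# KVG training data (only needed once the training code is released)
snapshot_download(
    repo_id="MaxyLee/KVG",
    repo_type="dataset",
    local_dir="deepperception/data/kvg",
)
```

Keeping `kvg-bench` in the local path is convenient, since `eval.sh` selects the KVG-Bench branch by checking whether the data path contains that substring.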
73 |
74 | ### Checkpoints
75 |
76 | | Model | Links |
77 | |--------- |---------------------------------------|
78 | | DeepPerception | [`🤗HuggingFace`](https://huggingface.co/MaxyLee/DeepPerception) |
79 | | DeepPerception-FGVR | [`🤗HuggingFace`](https://huggingface.co/MaxyLee/DeepPerception-FGVR) |
80 | ---
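
For a quick qualitative check outside `eval.sh`, the sketch below loads a checkpoint the same way `deepperception/eval/inference.py` does (without vLLM) and runs a single grounding query. This is an illustrative sketch only: the checkpoint id comes from the table above, while the image path and the referred object are placeholders.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

ckpt = "MaxyLee/DeepPerception"  # or a local checkpoint path
model = Qwen2VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(ckpt)

# r1-style grounding prompt, mirroring inference.py
query = ("Find <|object_ref_start|>the Airbus A380<|object_ref_end|>. "
         "Output the thinking process in <think> </think> and final answer "
         "(bounding box) in <answer> </answer> tags.")
messages = [{"role": "user", "content": [
    {"type": "image", "image": "path/to/image.jpg"},  # placeholder image
    {"type": "text", "text": query},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, padding=True,
                   return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=1500)
# Decode only the newly generated tokens
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```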
81 |
82 | ### Evaluation
83 |
84 | ```bash
85 | # Evaluate on KVG-Bench
86 | bash eval.sh [CUDA_IDS] [KVG_BENCH_PATH] [CKPT_PATH]
87 | ```
88 | Note: to evaluate the base Qwen2-VL model instead of DeepPerception, use the commented-out command in `eval.sh`, which drops `--prompt r1` (the base model is not trained to emit its thinking process first). See the illustration of the r1 prompt format below.
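
For reference, the snippet below illustrates how `--prompt r1` wraps each query and how the predicted box is extracted during scoring (see `deepperception/eval/inference.py` and `evaluate.py`). The model response here is a made-up example, not real output.

```python
import re

# Prompt template used by inference.py when --prompt r1 is set
GROUNDING_TEMPLATE = ("{Question} Output the thinking process in <think> </think> "
                      "and final answer (bounding box) in <answer> </answer> tags.")

query = "Find <|object_ref_start|>the Boeing 747<|object_ref_end|> in the image."
print(GROUNDING_TEMPLATE.format(Question=query))

# A made-up response in the expected format; evaluate.py extracts the box with
# regexes like the one below and counts a prediction correct when IoU >= 0.5.
response = ("<think>The four-engine jet with the humped upper deck is a 747...</think>"
            "<answer>(112,85),(640,430)</answer>")
box = [int(c) for c in re.search(r"\((\d+),(\d+)\),\((\d+),(\d+)\)", response).groups()]
print(box)  # [112, 85, 640, 430]
```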
89 |
90 | ### Training
91 |
92 | TODO
93 |
94 | ## Citation
95 |
96 | If you find DeepPerception useful for your research or applications, please cite using this BibTeX:
97 |
98 | ```bibtex
99 | @misc{ma2025deepperception,
100 | title={DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding},
101 | author={Xinyu Ma and Ziyang Ding and Zhicong Luo and Chi Chen and Zonghao Guo and Derek F. Wong and Xiaoyi Feng and Maosong Sun},
102 | year={2025},
103 | url={https://arxiv.org/abs/2503.12797},
104 | }
105 | ```
106 |
107 | ## Acknowledgement
108 |
109 | - [Qwen2-VL](https://github.com/QwenLM/Qwen2.5-VL)
110 | - [vLLM](https://github.com/vllm-project/vllm)
111 | - [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
112 | - [R1-V](https://github.com/Deep-Agent/R1-V)
113 |
114 | ## License
115 |
116 | [](https://github.com/twbs/bootstrap/blob/main/LICENSE)
117 | [](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)
118 |
--------------------------------------------------------------------------------
/deepperception/eval/eval.sh:
--------------------------------------------------------------------------------
1 | GPU_IDs=$1
2 | DATA_PATH=$2
3 | CKPT=$3
4 |
5 |
6 | if [[ $DATA_PATH == *"kvg-bench"* ]]; then
7 | # KVG-Bench
8 | OUT_DIR=$CKPT/kvg-bench-eval
9 |
10 | # Evaluate DeepPerception
11 | # To precisely reproduce the KVG-Bench results reported in the paper, strictly adhere to the package versions in requirements.txt and DO NOT use the --vllm flag.
12 |
13 | python evaluate.py \
14 | --data_path $DATA_PATH \
15 | --ckpt_path $CKPT \
16 | --gpu_ids $GPU_IDs \
17 | --output_path $OUT_DIR \
18 | --prompt r1
19 |
20 | # Evaluate Qwen2-VL
21 | # DO NOT use --prompt r1, which requires the model to output its thinking process first
22 |
23 | # python evaluate.py \
24 | # --data_path $DATA_PATH \
25 | # --ckpt_path $CKPT \
26 | # --gpu_ids $GPU_IDs \
27 | # --output_path $OUT_DIR
28 | else
29 | # TODO
30 | # FGVR
31 | OUT_DIR=$CKPT/fgvr-eval
32 |
33 | # Evaluate DeepPerception-FGVR
34 |
35 | python evaluate.py \
36 | --data_path $DATA_PATH \
37 | --ckpt_path $CKPT \
38 | --gpu_ids $GPU_IDs \
39 | --output_path $OUT_DIR \
40 | --vllm \
41 | --prompt r1
42 |
43 | # Evaluate Qwen2-VL
44 |
45 | # python evaluate.py \
46 | # --data_path $DATA_PATH \
47 | # --ckpt_path $CKPT \
48 | # --gpu_ids $GPU_IDs \
49 | # --output_path $OUT_DIR \
50 | # --vllm
51 | fi
--------------------------------------------------------------------------------
/deepperception/eval/evaluate.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import json
4 | import argparse
5 | import subprocess
6 | import time
7 | import torch
8 |
9 | from datasets import load_dataset
10 | from torchvision.ops.boxes import box_area
11 | from multiprocessing import Process
12 | from tqdm import tqdm
13 |
14 |
15 | bbox_patterns = [
16 | re.compile(r'.*?\((\d*?),.*?(\d*?)\),\((\d*?),(\d*?)\)'),
17 | re.compile(r'So the answer is.*?\((\d*?),.*?(\d*?)\),\((\d*?),(\d*?)\)'),
18 | re.compile(r'\((\d*?),.*?(\d*?)\),\((\d*?),(\d*?)\)'),
19 | re.compile(r'\((.*?),.*?(.*?)\).*?\((.*?),.*?(.*?)\)'),
20 | re.compile(r'\[(\d*?), (\d*?), (\d*?), (\d*?)\]'),
21 | re.compile(r'\[(.*?), (.*?), (.*?), (.*?)\]'),
22 | re.compile(r'\((\d*?), (\d*?), (\d*?), (\d*?)\)'),
23 | re.compile(r'\((\d*?), (\d*?)\)\n?.*?\((\d*?), (\d*?)\)')
24 | ]
25 |
26 | REF_PATTERN = re.compile(r'<\|object_ref_start\|>(.*?)<\|object_ref_end\|>')
27 | ANSWER_PATTERN = re.compile(r'<answer>(.*?)</answer>')
28 |
29 | def get_choice(ans):
30 | match = re.findall(ANSWER_PATTERN, ans)
31 | if len(match) > 0:
32 | choice = match[0].strip()
33 | if len(choice) > 1:
34 | choice = choice.split('.')[0]
35 | return choice
36 | else:
37 | return None
38 |
39 | def box_iou(boxes1, boxes2):
40 | area1 = box_area(boxes1)
41 | area2 = box_area(boxes2)
42 |
43 | lt = torch.max(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2]
44 | rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2]
45 |
46 | wh = (rb - lt).clamp(min=0) # [N,M,2]
47 | inter = wh[:, :, 0] * wh[:, :, 1] # [N,M]
48 |
49 | union = area1[:, None] + area2 - inter
50 |
51 | iou = inter / union
52 | return iou, union
53 |
54 | def get_bbox(ans):
55 | for i, pattern in enumerate(bbox_patterns):
56 | predict_bbox = re.findall(pattern, ans)
57 | if len(predict_bbox) != 0:
58 | try:
59 | predict_bbox = (float(predict_bbox[-1][0].replace('[', '').replace('x', '')), float(predict_bbox[-1][1]), float(predict_bbox[-1][2]), float(predict_bbox[-1][3]))
60 | except ValueError:  # fall back to a zero box if the numbers cannot be parsed
61 | predict_bbox = [0, 0, 0, 0]
62 | if sum(predict_bbox) < 4:
63 | predict_bbox = [c*1000 for c in predict_bbox]
64 |
65 | return predict_bbox, i+1
66 |
67 | return (0., 0., 0., 0.), 0
68 |
69 | def calculate_ious(category, results):
70 | ious = []
71 | correct = 0
72 | match_patterns_cnt = [0] * (len(bbox_patterns) + 1)
73 | for r in results:
74 | answer = r['answer']
75 |
76 | predict_bbox, i = get_bbox(answer)
77 | r['pred_bbox'] = predict_bbox
78 | predict_bbox = torch.tensor(predict_bbox, dtype=torch.float32).view(-1, 4)
79 |
80 | max_iou = 0
81 | for gt_bbox in r['gt_bbox']:
82 | target_bbox = torch.tensor(gt_bbox, dtype=torch.float32).view(-1, 4)
83 | iou, _ = box_iou(predict_bbox, target_bbox)
84 | iou = iou.item()
85 | if iou > max_iou:
86 | max_iou = iou
87 |
88 | ious.append(max_iou)
89 | r['iou'] = max_iou
90 | r['match pattern'] = i
91 | match_patterns_cnt[i] += 1
92 | if max_iou >= 0.5:
93 | correct += 1
94 |
95 | metrics = dict()
96 | acc = correct / len(ious)
97 | avg_iou = sum(ious)/len(ious)
98 |
99 | print(category)
100 | print(f'unmatch: {match_patterns_cnt[0]}, ' + ', '.join([f'match {i+1}: {cnt}' for i, cnt in enumerate(match_patterns_cnt[1:])]))
101 | print(f'Acc @ 0.5: {acc}, IoU: {avg_iou}')
102 |
103 | metrics['all'] = {
104 | 'Acc': acc,
105 | 'IoU': avg_iou,
106 | 'Num': len(ious)
107 | }
108 |
109 | return results, metrics
110 |
111 | def eval(task, args, test_data):
112 | output_path = args.output_path
113 |
114 | all_metrics = dict()
115 | if task == 'grounding':
116 | seen_categories = args.seen_categories.split(',')
117 | all_categories = args.all_categories.split(',')
118 |
119 |
120 | results = {d: [] for d in all_categories}
121 |
122 | all_res = []
123 | seen_res = []
124 | unseen_res = []
125 | for data in tqdm(test_data):
126 | with open(f'{output_path}/temp/{data["question_id"]}.json', 'r') as f:
127 | r = json.load(f)
128 | results[data["category"]].append(r)
129 | all_res.append(r)
130 | if data["category"] in seen_categories:
131 | seen_res.append(r)
132 | else:
133 | unseen_res.append(r)
134 |
135 | all_res, metrics = calculate_ious('all', all_res)
136 | all_metrics['all'] = metrics
137 |
138 | seen_res, metrics = calculate_ious('seen domain', seen_res)
139 | all_metrics['seen domain'] = metrics
140 |
141 | unseen_res, metrics = calculate_ious('unseen domain', unseen_res)
142 | all_metrics['unseen domain'] = metrics
143 |
144 | for dataset, res in results.items():
145 | res, metrics = calculate_ious(dataset, res)
146 | all_metrics[dataset] = metrics
147 | with open(f'{args.output_path}/{dataset}.json', 'w') as f:
148 | json.dump(res, f, indent=4)
149 |
150 | with open(f'{args.output_path}/metrics.json', 'w') as f:
151 | json.dump(all_metrics, f)
152 |
153 | elif task == 'classification':
154 | correct = 0
155 | match_cnt = 0
156 | results = []
157 | for data in tqdm(test_data):
158 | with open(f'{output_path}/temp/{data["question_id"]}.json', 'r') as f:
159 | r = json.load(f)
160 | answer = r['answer']
161 | gt = data['messages'][1]['content']
162 |
163 | pred = get_choice(answer)
164 |
165 | if pred:
166 | match_cnt += 1
167 | if gt == pred:
168 | correct += 1
169 | r['correct'] = True
170 | else:
171 | r['correct'] = False
172 | results.append(r)
173 |
174 | acc = correct / len(test_data)
175 | category = test_data[0]['question_id'].split('/')[0]  # e.g. 'dog' for an FGVR split
176 | print(f'Acc ({category}): {acc}')
177 | print(f'Match rate: {match_cnt/len(test_data)}')
178 | all_metrics[category] = {'Acc': acc, 'Match rate': match_cnt / len(test_data)}
179 | with open(f'{args.output_path}/{category}.json', 'w') as f:
180 | json.dump(results, f)
181 | with open(f'{args.output_path}/{category}-metrics.json', 'w') as f:
182 | json.dump(all_metrics, f)
183 |
184 |
185 | def infer(args, json_path, gpu_id):
186 | os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
187 |
188 | if args.vllm:
189 | subprocess.run(["python", 'inference.py',
190 | "--data_path", args.data_path,
191 | "--prompt", str(args.prompt),
192 | "--vllm",
193 | "--id_path", json_path,
194 | "--model_path", args.ckpt_path,
195 | "--output_path", args.output_path])
196 | else:
197 | subprocess.run(["python", 'inference.py',
198 | "--data_path", args.data_path,
199 | "--prompt", str(args.prompt),
200 | "--id_path", json_path,
201 | "--model_path", args.ckpt_path,
202 | "--output_path", args.output_path])
203 |
204 | def launch_subprocesses(args, temp):
205 | processes = []
206 | temp_files = []
207 |
208 | if len(temp) > 0:
209 | if '72B' in args.ckpt_path:
210 | nprocs = args.num_processes
211 | if nprocs == 2:
212 | gpu_ids = ['0,1,2,3', '4,5,6,7']
213 | elif nprocs == 1:
214 | gpu_ids = [args.gpu_ids]
215 | else: # 7B-scale models
216 | gpu_ids = list(map(int, args.gpu_ids.split(',')))
217 | nprocs = len(gpu_ids)
218 |
219 | num_data_per_group = len(temp) // len(gpu_ids)
220 |
221 | for i, gpu_id in enumerate(gpu_ids):
222 | start_idx = i * num_data_per_group
223 | end_idx = start_idx + num_data_per_group if i != (nprocs-1) else None
224 |
225 | timestamp = time.strftime("%Y%m%d%H%M%S")
226 | json_path = f'{args.output_path}/temp/{timestamp}_{gpu_id}.json'
227 | temp_files.append(json_path)
228 | with open(json_path, "w") as f:
229 | json.dump(temp[start_idx:end_idx], f)
230 |
231 | p = Process(target=infer, args=(args, json_path, gpu_id))
232 | processes.append(p)
233 | p.start()
234 |
235 | for p in processes:
236 | p.join()
237 |
238 | for temp_file in temp_files:
239 | os.remove(temp_file)
240 |
241 | def get_data(args):
242 | output_path = args.output_path
243 |
244 | if args.data_path.endswith('.parquet'): # KVG-Bench
245 | task = 'grounding'
246 | all_categories = args.all_categories.split(',')
247 | for c in all_categories:
248 | os.makedirs(f'{output_path}/temp/{c}', exist_ok=True)
249 |
250 | dataset = load_dataset("parquet", data_files={"test": args.data_path})
251 | test_data = dataset['test']
252 | elif args.data_path.endswith('.json'): # FGVR
253 | task = 'classification'
254 | with open(args.data_path, 'r') as f:
255 | test_data = json.load(f)
256 |
257 | category = args.data_path.split('/')[-1].split('.')[0]
258 | os.makedirs(f'{output_path}/temp/{category}', exist_ok=True)
259 | # assign sequential question ids, e.g. 'dog/00042'
260 | for i, d in enumerate(test_data):
261 | d['question_id'] = f'{category}/{str(i).zfill(5)}'
262 | else:
263 | raise ValueError(f'Unsupported file type: {args.data_path}')
264 |
265 | qids = []
266 | for d in test_data:
267 | if not os.path.isfile(f'{output_path}/temp/{d["question_id"]}.json'):
268 | qids.append(d['question_id'])
269 | print(f'# Test data: {len(qids)}')
270 |
271 | return task, test_data, qids
272 |
273 |
274 | def parse_arguments():
275 | parser = argparse.ArgumentParser(description="Process images across multiple GPUs.")
276 | parser.add_argument("--data_path", required=True)
277 | parser.add_argument("--prompt", required=False, default=None)
278 | parser.add_argument("--vllm", action='store_true')
279 | parser.add_argument("--seen_categories", required=False, default='aircraft,car,reptilia,bird,food')
280 | parser.add_argument("--all_categories", required=False, default='aircraft,car,reptilia,bird,food,dog,mollusca,mammal,flower,landmark')
281 | parser.add_argument("--ckpt_path", required=True)
282 | parser.add_argument("--num_processes", type=int, required=False, default=8)
283 | parser.add_argument("--gpu_ids", type=str, required=True, help="Comma-separated GPU IDs.")
284 | parser.add_argument("--output_path", required=True, help="Path to the output dir")
285 |
286 | return parser.parse_args()
287 |
288 |
289 | def main():
290 | args = parse_arguments()
291 |
292 | print(f"Evaluating {args.ckpt_path}. Prompt: {args.prompt}. Results will be saved in {args.output_path}.")
293 |
294 | task, test_data, qids = get_data(args)
295 | launch_subprocesses(args, qids)
296 | eval(task, args, test_data)
297 |
298 |
299 |
300 | if __name__ == "__main__":
301 | main()
302 |
--------------------------------------------------------------------------------
/deepperception/eval/inference.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import io
3 | import os
4 | import re
5 | import json
6 | import time
7 | import torch
8 | import base64
9 |
10 | from PIL import Image
11 | from tqdm import tqdm
12 | from datasets import load_dataset
13 | from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
14 | from qwen_vl_utils import process_vision_info
15 |
16 | PATTERN = re.compile(r'<\|box_start\|>\(([0-9]*?),([0-9]*?)\),\(([0-9]*?),([0-9]*?)\)<\|box_end\|>')
17 | REF_PATTERN = re.compile(r'<\|object_ref_start\|>(.*?)<\|object_ref_end\|>')
18 |
19 | GROUNDING_TEMPLATE = "{Question} Output the thinking process in <think> </think> and final answer (bounding box) in <answer> </answer> tags."
20 | # QUESTION_TEMPLATE = "{Question} Output the thinking process in <think> </think> and final answer (bounding box in (x1,y1),(x2,y2) format) in <answer> </answer> tags."
21 | CLASSIFICATION_TEMPLATE = "{Question} Output the thinking process in <think> </think> and final answer in <answer> </answer> tags."
22 |
23 | def parse_args():
24 | parser = argparse.ArgumentParser(description="Training")
25 | parser.add_argument("--data_path", required=True)
26 | parser.add_argument("--prompt", required=False, default=None)
27 | parser.add_argument("--id_path", required=True)
28 | parser.add_argument("--model_path", required=True, help="Path to qwen.")
29 | parser.add_argument("--output_path", required=True)
30 | parser.add_argument('--vllm', action='store_true')
31 |
32 | parser.add_argument("--batch_size", required=False, type=int, default=1)
33 |
34 | args = parser.parse_args()
35 |
36 | return args
37 |
38 | def inference_classification(model, processor, sampling_params, prompt, query, image):
39 | messages = []
40 |
41 | if prompt == 'r1':
42 | query = CLASSIFICATION_TEMPLATE.format(Question=query)
43 | messages.append({"role": "user", "content": [dict(type='image', image=image), dict(type='text', text=query)]})
44 | else:
45 | messages.append({"role": "user", "content": [dict(type='image', image=image), dict(type='text', text=query)]})
46 |
47 | text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
48 | image_inputs, _ = process_vision_info(messages)
49 | if sampling_params:
50 | llm_inputs = {
51 | "prompt": text,
52 | "multi_modal_data": {
53 | "image": image_inputs
54 | }
55 | }
56 | outputs = model.generate([llm_inputs], sampling_params=sampling_params)
57 | generated_text = outputs[0].outputs[0].text
58 | else:
59 | inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(model.device)
60 |
61 | generated_ids = model.generate(**inputs, max_new_tokens=1500)
62 | generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
63 | response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
64 | generated_text = response[0]
65 |
66 | return {
67 | 'answer': generated_text,
68 | }
69 |
70 | def inference_grounding(model, processor, sampling_params, prompt, query, image_bytes):
71 | encoded_string = 'data:image:base64,' + str(base64.b64encode(image_bytes).decode("utf-8"))
72 | messages = []
73 | cot_response = None
74 | # CoT
75 | if prompt == 'cot-kvg':
76 | match = re.search(REF_PATTERN, query)
77 | ref = match[1]
78 |
79 | cot_text = (
80 | f'Which object in this image is {ref}? Give a detailed and discriminative description of the appearance of it'
81 | )
82 |
83 | messages.append({"role": "user", "content": [dict(type='image_url', image_url=encoded_string), dict(type='text', text=cot_text)]})
84 | text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
85 | image_inputs, _ = process_vision_info(messages)
86 | if sampling_params:
87 | llm_inputs = {
88 | "prompt": text,
89 | "multi_modal_data": {
90 | "image": image_inputs
91 | }
92 | }
93 | outputs = model.generate([llm_inputs], sampling_params=sampling_params)
94 | cot_response = outputs[0].outputs[0].text
95 | else:
96 | inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(model.device)
97 |
98 | generated_ids = model.generate(**inputs, max_new_tokens=1500)
99 | generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
100 | cot_response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
101 |
102 | messages.append({"role": "assistant", "content": cot_response})
103 | grounding_text = f'Based on the description, find and give the bounding box of <|object_ref_start|>{ref}<|object_ref_end|>'
104 | messages.append({"role": "user", "content": [dict(type='text', text=grounding_text)]})
105 | elif prompt == 'cot-normal':
106 | query += ". Let's think step by step"
107 | messages.append({"role": "user", "content": [dict(type='image_url', image_url=encoded_string), dict(type='text', text=query)]})
108 | elif prompt == 'r1':
109 | query = GROUNDING_TEMPLATE.format(Question=query)
110 | messages.append({"role": "user", "content": [dict(type='image_url', image_url=encoded_string), dict(type='text', text=query)]})
111 | else:
112 | messages.append({"role": "user", "content": [dict(type='image_url', image_url=encoded_string), dict(type='text', text=query)]})
113 |
114 | # Grounding
115 | text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
116 | image_inputs, _ = process_vision_info(messages)
117 | if sampling_params:
118 | llm_inputs = {
119 | "prompt": text,
120 | "multi_modal_data": {
121 | "image": image_inputs
122 | }
123 | }
124 | outputs = model.generate([llm_inputs], sampling_params=sampling_params)
125 | generated_text = outputs[0].outputs[0].text
126 | else:
127 | inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(model.device)
128 |
129 | generated_ids = model.generate(**inputs, max_new_tokens=1500)
130 | generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
131 | response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
132 | generated_text = response[0]
133 |
134 | return {
135 | 'cot': cot_response,
136 | 'answer': generated_text,
137 | }
138 |
139 | def infer(model, processor, sampling_params, args):
140 | prompt = args.prompt
141 | output_path = args.output_path
142 |
143 | if args.data_path.endswith('.parquet'): # KVG-Bench
144 | task = 'grounding'
145 | dataset = load_dataset("parquet", data_files={"test": args.data_path})
146 | elif args.data_path.endswith('.json'): # FGVR
147 | task = 'classification'
148 | with open(args.data_path, 'r') as f:
149 | dataset = {'test': json.load(f)}  # wrap in a dict so dataset['test'] below works for both data formats
150 | else:
151 | raise ValueError(f'Unsupported file type: {args.data_path}')
152 |
153 | with open(args.id_path, 'r') as f:
154 | qids = json.load(f)
155 |
156 | test_data = []
157 | for d in dataset['test']:
158 | if d['question_id'] in qids:
159 | test_data.append(d)
160 |
161 | for data in tqdm(test_data):
162 | if task == 'grounding':
163 | query = data['question']
164 | image_bytes = data['image']['bytes']
165 |
166 | gt = data['answer']
167 | match = re.search(PATTERN, gt)
168 | bbox = [[float(match[1]), float(match[2]), float(match[3]), float(match[4])]]
169 |
170 | image = Image.open(io.BytesIO(image_bytes))
171 | w, h = image.size
172 |
173 | out_filename = f"{output_path}/temp/{data['question_id']}.json"
174 |
175 | response = inference_grounding(model, processor, sampling_params, prompt, query, image_bytes)
176 | response['gt_bbox'] = bbox
177 | response['hw'] = (h, w)
178 | elif task == 'classification':
179 | query = data['messages'][0]['content'].replace('<image>', '')
180 | image = data['images'][0]
181 |
182 |
183 | out_filename = f"{output_path}/temp/{data['question_id']}.json"
184 | response = inference_classification(model, processor, sampling_params, prompt, query, image)
185 |
186 |
187 | with open(out_filename, 'w') as f:
188 | json.dump(response, f)
189 |
190 |
191 | def main():
192 | args = parse_args()
193 | os.makedirs(args.output_path, exist_ok=True)
194 |
195 | if args.vllm:
196 | from vllm import LLM, SamplingParams
197 | model = LLM(args.model_path, max_model_len=17920, tensor_parallel_size=1)
198 | sampling_params = SamplingParams(n=1, temperature=0, max_tokens=1536)
199 | else:
200 | model = Qwen2VLForConditionalGeneration.from_pretrained(
201 | args.model_path,
202 | torch_dtype=torch.bfloat16,
203 | attn_implementation="flash_attention_2",
204 | device_map="auto"
205 | )
206 | sampling_params = None
207 |
208 | processor = AutoProcessor.from_pretrained(args.model_path)
209 |
210 | start_time = time.time()
211 |
212 | with torch.no_grad():
213 | infer(model, processor, sampling_params, args)
214 |
215 | end_time = time.time()
216 | elapsed_time = end_time - start_time
217 |
218 | print('\033[92m' + "---- Evaluate Time taken: {} seconds ----".format(elapsed_time) + '\033[0m')
219 |
220 |
221 | if __name__ == "__main__":
222 | main()
--------------------------------------------------------------------------------
/figs/header.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thunlp/DeepPerception/971d92f67a21d9aca53f2f565b6899a6cf11dd5f/figs/header.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | datasets==2.16.1
2 | Pillow==11.1.0
3 | qwen_vl_utils==0.0.10
4 | torch==2.2.2
5 | torchvision==0.17.2
6 | tqdm==4.66.5
7 | transformers==4.45.1
8 |
--------------------------------------------------------------------------------