├── LICENSE
├── README.md
├── demo_models
    └── cogvlm_grounding_pred.json
├── evaluate_pred.py
├── figure
    ├── error_samples.png
    └── examples.png
├── ref_l4
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-311.pyc
    │   ├── dataloader.cpython-311.pyc
    │   ├── evaluation.cpython-311.pyc
    │   └── overlaps.cpython-311.pyc
    ├── dataloader.py
    ├── evaluation.py
    └── overlaps.py
└── setup.py

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2024 JierunChen
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Ref-L4: a New REC Benchmark in the LMM Era
2 | 
3 | This repository contains the data loader and evaluation code for our [Ref-L4](https://huggingface.co/datasets/JierunChen/Ref-L4), a new REC benchmark in the LMM era. Check out our paper [Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models](https://arxiv.org/abs/2406.16866) for more details.
4 | 
5 | ## 🧭 Introduction
6 | 
7 | Referring expression comprehension (REC) is the task of locating a specific region in an image based on a textual description. Current popular REC benchmarks, such as RefCOCO, RefCOCO+, and RefCOCOg, may not fully capture the comprehensive capabilities of large multimodal models (LMMs) due to limitations in:
8 | 
9 | - **Linguistic complexity**: Expressions tend to be overly simple, often consisting of single words like "yellow," "lady," or "far left person."
10 | - **Visual complexity**: The benchmarks are restricted to a limited set of object categories (fewer than 80).
11 | - **Labeling accuracy**: RefCOCO, RefCOCO+, and RefCOCOg have high labeling error rates (14%, 24%, and 5%, respectively).
12 | 
13 | In response, we rigorously exclude the problematic instances and reevaluate LMMs, showing significant accuracy improvements. We also introduce Ref-L4, a new and more comprehensive REC benchmark to evaluate LMMs. Ref-L4 offers:
14 | 
15 | - **Lengthy referring expressions** averaging 24.2 words;
16 | - **Extensive vocabulary** comprising 22,813 unique words;
17 | - **Diverse objects** covering 365 categories with varying instance scales (from 30 to 3,767);
18 | - **Substantial sample size** with 45,341 annotations.
19 | 
20 | ### 😃 Ref-L4 examples
21 | 
22 | ![Ref-L4 examples](./figure/examples.png)
23 | 
24 | ### 😑 Labeling errors in RefCOCO, +, g
25 | 
26 | In the REC task, a referring expression should uniquely describe an instance, which is represented by an accurate bounding box. We have identified and visualized three common types of labeling errors in the RefCOCO, RefCOCO+, and RefCOCOg benchmarks: (a) non-unique referring expressions, which refer to multiple instances within the same image; (b) inaccurate bounding boxes; and (c) misalignment between target instances and their referring expressions, where the referring expressions are either ambiguous or do not refer to any instance in the image.
27 | 
28 | ![Labeling errors in RefCOCO, RefCOCO+, and RefCOCOg](./figure/error_samples.png)
29 | 
30 | ## 🛠️ Installation
31 | 
32 | ```bash
33 | git clone https://github.com/JierunChen/Ref-L4.git
34 | cd Ref-L4
35 | pip install ./
36 | ```
37 | 
38 | ## 🚀 Loading Ref-L4 Dataset
39 | 
40 | ```python
41 | from ref_l4 import RefL4Dataset
42 | ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='all')
43 | print(len(ref_l4_dataset))
44 | # iterate over the dataset
45 | for img_pil, data in ref_l4_dataset:
46 |     print(data)
47 |     break
48 | 
49 | # switch to the 'val' or 'test' split
50 | ref_l4_dataset.change_split('val')
51 | print(len(ref_l4_dataset))
52 | ```
53 | 
54 | The [Ref-L4 dataset](https://huggingface.co/datasets/JierunChen/Ref-L4) can also be downloaded from Hugging Face.
55 | 
56 | 
57 | ## 🗂️ Evaluation
58 | 
59 | After getting model predictions, we can save them as a JSON file, which is a list of dictionaries, each containing three keys: `[id, pred_bbox, format]`. The `id` is the annotation ID, and the `format` specifies the format of `pred_bbox`, which should be either `xyxy` or `xywh`.
60 | A sample prediction file can be found [here](./demo_models/cogvlm_grounding_pred.json). It was obtained by running the CogVLM-Grounding model.
61 | 
62 | Now we can run the following command to evaluate the predictions against the ground truth.
63 | 
64 | ```bash
65 | python evaluate_pred.py --dataset_path JierunChen/Ref-L4 --split all --pred_json_path ./demo_models/cogvlm_grounding_pred.json
66 | ```
67 | 
68 | Here is an example output for the predictions from CogVLM-Grounding on the "all" split:
69 | 
70 | ```txt
71 | Item for split all | Value
72 | --------------------------------------------------------------------------------
73 | Ann-level acc iou 0.5 | 81.69868331091065
74 | Ann-level acc iou 0.75 | 70.76597340155709
75 | Ann-level acc iou 0.9 | 48.351381751615534
76 | Ann-level macc iou 0.5:0.95 | 66.08808804393375
77 | Ann-level accs for copy | 81.7, 70.77, 48.35, 66.09
78 | --------------------------------------------------------------------------------
79 | Small acc iou 0.5 | 75.0561797752809
80 | Small macc iou 0.5:0.95 | 52.853932584269664
81 | Medium acc iou 0.5 | 86.43470790378007
82 | Medium macc iou 0.5:0.95 | 71.31099656357388
83 | Large acc iou 0.5 | 77.90972003774772
84 | Large macc iou 0.5:0.95 | 66.25196602705253
85 | Size level accs for copy | 75.06, 52.85, 86.43, 71.31, 77.91, 66.25
86 | --------------------------------------------------------------------------------
87 | Average class-level acc iou 0.5 | 72.42029130234654
88 | Average class-level macc iou 0.5:0.95 | 52.56151912966669
89 | Avg class-level accs for copy | 72.42, 52.56
90 | ```
91 | 
92 | The output includes three parts:
93 | - The accuracy under various IoU thresholds and the average accuracy over IoU thresholds from 0.5 to 0.95 with a stride of 0.05.
94 | - The accuracy for small, medium, and large objects.
95 | - The average accuracy across different classes.
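For reference, here is a minimal sketch of how such a prediction file can be produced and scored directly in Python. It relies only on the `RefL4Dataset` and `RefL4Evaluator` classes shipped in this repository; `run_my_model` is a hypothetical placeholder for your own REC model and is assumed to return a single `[x1, y1, x2, y2]` box in image coordinates.

```python
import json
from ref_l4 import RefL4Dataset, RefL4Evaluator

ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='all')

predictions = []
for img_pil, data in ref_l4_dataset:
    # run_my_model is a placeholder for your own model; it is assumed to
    # return one predicted box as [x1, y1, x2, y2] in image coordinates.
    pred_bbox = run_my_model(img_pil, data)
    predictions.append({
        "id": data["id"],        # annotation ID from the dataset
        "pred_bbox": pred_bbox,  # predicted bounding box
        "format": "xyxy",        # or "xywh", matching pred_bbox
    })

# Save the predictions for later use with evaluate_pred.py ...
with open("my_model_pred.json", "w") as f:
    json.dump(predictions, f)

# ... or evaluate them directly in Python.
evaluator = RefL4Evaluator(dataset=ref_l4_dataset)
evaluator.evaluate(predictions)
```

Note that `evaluate` expects one prediction for every annotation in the chosen split and raises an error on missing or duplicate IDs.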
96 | 97 | 98 | ## 📖 Dataset License 99 | 100 | The Ref-L4 dataset is released under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license](https://creativecommons.org/licenses/by-nc/4.0/). Please note that the images in the Ref-L4 dataset are derived from the following datasets, each with their respective licenses: 101 | - **RefCOCO**: Licensed under the [Apache-2.0 license](http://www.apache.org/licenses/). 102 | - **RefCOCO+**: Licensed under the [Apache-2.0 license](http://www.apache.org/licenses/). 103 | - **RefCOCOg**: Licensed under the [Creative Commons Attribution 4.0 International (CC BY 4.0) license](https://creativecommons.org/licenses/by/4.0/). 104 | - **COCO 2014**: Licensed under the [Creative Commons Attribution 4.0 International (CC BY 4.0) license](https://creativecommons.org/licenses/by/4.0/). 105 | - **Objects365**: Licensed under the [Creative Commons Attribution 4.0 International (CC BY 4.0) license](http://creativecommons.org/licenses/by/4.0/). 106 | 107 | By using the Ref-L4 dataset, you agree to comply with the licensing terms of these source datasets. 108 | -------------------------------------------------------------------------------- /evaluate_pred.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | from ref_l4 import RefL4Dataset, RefL4Evaluator 4 | 5 | def main(args): 6 | custom_transforms = None 7 | ref_l4_dataset = RefL4Dataset(args.dataset_path, split=args.split, custom_transforms=custom_transforms) 8 | print("Dataset loaded. Length:", len(ref_l4_dataset)) 9 | 10 | evaluator = RefL4Evaluator(dataset=ref_l4_dataset) 11 | 12 | with open(args.pred_json_path, 'r') as f: 13 | predictions = json.load(f) 14 | evaluator.evaluate(predictions) 15 | 16 | if __name__ == '__main__': 17 | parser = argparse.ArgumentParser(description='Evaluate the predictions on Ref-L4 dataset.') 18 | parser.add_argument('--dataset_path', type=str, default="JierunChen/Ref-L4", help='Path to the Ref-L4 dataset.') 19 | parser.add_argument('--split', type=str, default='all', choices=['val', 'test', 'all'], help='Dataset split to use (val, test, all).') 20 | parser.add_argument('--pred_json_path', type=str, required=True, help='Path to the predictions JSON file.') 21 | 22 | args = parser.parse_args() 23 | 24 | main(args) 25 | -------------------------------------------------------------------------------- /figure/error_samples.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/figure/error_samples.png -------------------------------------------------------------------------------- /figure/examples.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/figure/examples.png -------------------------------------------------------------------------------- /ref_l4/__init__.py: -------------------------------------------------------------------------------- 1 | from .dataloader import RefL4Dataset 2 | from .evaluation import RefL4Evaluator 3 | 4 | __all__ = ['RefL4Dataset', 'RefL4Evaluator'] -------------------------------------------------------------------------------- /ref_l4/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/__pycache__/dataloader.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/dataloader.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/__pycache__/evaluation.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/evaluation.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/__pycache__/overlaps.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/overlaps.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/dataloader.py:
--------------------------------------------------------------------------------
1 | from torch.utils.data import Dataset
2 | from datasets import load_dataset, concatenate_datasets
3 | from huggingface_hub import hf_hub_download
4 | from PIL import Image
5 | import tarfile
6 | import io
7 | from tqdm import tqdm
8 | 
9 | class RefL4Dataset(Dataset):
10 |     def __init__(self, dataset_path, split, images_file='images.tar.gz', custom_transforms=None):
11 |         """
12 |         Initialize the RefL4Dataset class.
13 | 
14 |         Parameters:
15 |         - dataset_path (str): Path to a local copy of the dataset, or the Hugging Face dataset repo id (e.g., "JierunChen/Ref-L4").
16 |         - split (str): Dataset split, one of "val", "test", or "all".
17 |         - images_file (str): Name of the tar file containing the images, defaults to 'images.tar.gz'.
18 |         - custom_transforms: Custom image transformations to apply, defaults to None.
19 |         """
20 |         super(RefL4Dataset, self).__init__()
21 |         assert split in ['val', 'test', 'all'], 'split should be val, test or all'
22 |         self.dataset_path=dataset_path
23 |         self.split = split
24 |         self.images_file=images_file
25 |         self.transforms = custom_transforms
26 |         self._load_dataset()
27 | 
28 |     def _load_dataset(self):
29 |         self.dataset = load_dataset(self.dataset_path)
30 |         image_tar_path = f"{self.dataset_path}/{self.images_file}"
31 |         # download the images.tar.gz file from the Hugging Face Hub when given the repo id
32 |         if self.dataset_path == 'JierunChen/Ref-L4':
33 |             image_tar_path = hf_hub_download(repo_id="JierunChen/Ref-L4", filename="images.tar.gz", repo_type="dataset")
34 |             print(f"Downloaded images.tar.gz to {image_tar_path}")
35 | 
36 |         all_splits=concatenate_datasets([self.dataset['val'],self.dataset['test']])
37 |         self.dataset['all']=all_splits
38 |         self.images=self._load_images_from_tar(image_tar_path)
39 | 
40 |     def _load_images_from_tar(self, image_tar_path):
41 |         images = {}
42 |         print(f"Loading images from {image_tar_path}")
43 |         with tarfile.open(image_tar_path, "r:gz") as tar:
44 |             for member in tqdm(tar.getmembers()):
45 |                 if member.isfile() and member.name.endswith(('jpg', 'jpeg', 'png', 'webp')):
46 |                     f = tar.extractfile(member)
47 |                     if f:
48 |                         image = Image.open(io.BytesIO(f.read()))
49 |                         # convert grayscale images to RGB if needed
50 |                         if image.mode != 'RGB':
51 |                             image = image.convert('RGB')
52 |                         # index images by their file name inside the tar
53 | 
54 |                         images[member.name] = image
55 |         return images
56 | 
57 |     def change_split(self, split):
58 |         assert split in ['val', 'test', 'all'], 'split should be val, test or all'
59 |         self.split = split
60 | 
61 |     def __len__(self):
62 |         return len(self.dataset[self.split])
63 | 
64 |     def __getitem__(self, idx):
65 |         """
66 |         Returns:
67 |         - image (PIL.Image.Image or Tensor): The image, transformed if custom_transforms is provided.
68 |         - data (dict): Other sample data.
69 |         """
70 |         data = self.dataset[self.split][idx]
71 |         image = self.images[data['file_name']]
72 |         if self.transforms:
73 |             image = self.transforms(image)
74 |         return image, data
75 | 
76 | # Example usage:
77 | if __name__ == '__main__':
78 |     ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='all')
79 |     print(len(ref_l4_dataset))
80 |     print(ref_l4_dataset[0])
81 |     # switch to the val split
82 |     ref_l4_dataset.change_split('val')
83 |     print(len(ref_l4_dataset))
84 |     print(ref_l4_dataset[0])
85 |     # switch to the test split
86 |     ref_l4_dataset.change_split('test')
87 |     print(len(ref_l4_dataset))
88 |     print(ref_l4_dataset[0])
89 | 
90 |     ref_l4_dataset.change_split('all')
91 |     print(len(ref_l4_dataset))
92 |     print(ref_l4_dataset[0])
93 | 
--------------------------------------------------------------------------------
/ref_l4/evaluation.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2024 Jierun Chen. All rights reserved.
2 | import torch 3 | from statistics import mean 4 | import pandas as pd 5 | import math 6 | from datasets import load_dataset 7 | from .dataloader import RefL4Dataset 8 | from .overlaps import bbox_overlaps 9 | import numpy as np 10 | 11 | mapping_dict = { 12 | "refcoco_1": "o365_1", 13 | "refcoco_10": "o365_41", 14 | "refcoco_11": "o365_177", 15 | "refcoco_13": "o365_128", 16 | "refcoco_14": "o365_250", 17 | "refcoco_15": "o365_25", 18 | "refcoco_16": "o365_56", 19 | "refcoco_17": "o365_140", 20 | "refcoco_18": "o365_93", 21 | "refcoco_19": "o365_79", 22 | "refcoco_2": "o365_47", 23 | "refcoco_20": "o365_100", 24 | "refcoco_21": "o365_97", 25 | "refcoco_22": "o365_145", 26 | "refcoco_23": "o365_296", 27 | "refcoco_24": "o365_179", 28 | "refcoco_25": "o365_181", 29 | "refcoco_27": "o365_39", 30 | "refcoco_28": "o365_40", 31 | "refcoco_3": "o365_6", 32 | "refcoco_31": "o365_13", 33 | "refcoco_32": "o365_44", 34 | "refcoco_33": "o365_194", 35 | "refcoco_34": "o365_220", 36 | "refcoco_35": "o365_119", 37 | "refcoco_36": "o365_174", 38 | "refcoco_37": "o365_100000", # This is a placeholder, the actual value is not provided 39 | "refcoco_38": "o365_155", 40 | "refcoco_39": "o365_138", 41 | "refcoco_4": "o365_59", 42 | "refcoco_40": "o365_114", 43 | "refcoco_41": "o365_146", 44 | "refcoco_42": "o365_147", 45 | "refcoco_43": "o365_205", 46 | "refcoco_44": "o365_9", 47 | "refcoco_46": "o365_36", 48 | "refcoco_47": "o365_11", 49 | "refcoco_48": "o365_89", 50 | "refcoco_49": "o365_85", 51 | "refcoco_5": "o365_115", 52 | "refcoco_50": "o365_94", 53 | "refcoco_51": "o365_26", 54 | "refcoco_52": "o365_113", 55 | "refcoco_53": "o365_83", 56 | "refcoco_54": "o365_266", 57 | "refcoco_55": "o365_104", 58 | "refcoco_56": "o365_142", 59 | "refcoco_57": "o365_153", 60 | "refcoco_58": "o365_235", 61 | "refcoco_59": "o365_144", 62 | "refcoco_6": "o365_56", 63 | "refcoco_60": "o365_151", 64 | "refcoco_61": "o365_98", 65 | "refcoco_62": "o365_3", 66 | "refcoco_63": "o365_51", 67 | "refcoco_64": "o365_26", 68 | "refcoco_65": "o365_76", 69 | "refcoco_67": "o365_98", 70 | "refcoco_7": "o365_117", 71 | "refcoco_70": "o365_154", 72 | "refcoco_72": "o365_37", 73 | "refcoco_73": "o365_74", 74 | "refcoco_74": "o365_116", 75 | "refcoco_75": "o365_133", 76 | "refcoco_76": "o365_107", 77 | "refcoco_77": "o365_62", 78 | "refcoco_78": "o365_164", 79 | "refcoco_79": "o365_135", 80 | "refcoco_8": "o365_66", 81 | "refcoco_80": "o365_278", 82 | "refcoco_81": "o365_82", 83 | "refcoco_82": "o365_134", 84 | "refcoco_84": "o365_19", 85 | "refcoco_85": "o365_95", 86 | "refcoco_86": "o365_31", 87 | "refcoco_87": "o365_170", 88 | "refcoco_88": "o365_70", 89 | "refcoco_89": "o365_328", 90 | "refcoco_9": "o365_22", 91 | "refcoco_90": "o365_227", 92 | } 93 | 94 | 95 | class RefL4Evaluator: 96 | def __init__( 97 | self, 98 | dataset: RefL4Dataset, 99 | split=None, 100 | ann_level_acc_ths=[0.5, 0.75, 0.9], 101 | ann_level_macc_ths=[i/100 for i in range(50,100,5)], 102 | size_level_acc_ths=[0.5, ], 103 | size_level_macc_ths=[i/100 for i in range(50,100,5)], 104 | small_size_th=128, 105 | large_size_th=256, 106 | avg_cls_level_acc_ths=[0.5, ], 107 | avg_cls_level_macc_ths=[i/100 for i in range(50,100,5)], 108 | ) -> None: 109 | ''' 110 | dataset (RefL4Dataset): The RefL4Dataset dataset for evaluation. 111 | split (str): The split of the dataset to evaluate. If None, use the dataset's split. Default is None. 112 | ann_level_acc_ths (List[float]): The thresholds to evaluate the annotation level accuracy. Default is [0.5, 0.75, 0.9]. 
113 | ann_level_macc_ths (List[float]): The thresholds to evaluate the annotation level mAcc. Default is [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]. 114 | size_level_acc_ths (List[float]): The thresholds to evaluate the size level accuracy. Default is [0.5, ]. 115 | size_level_macc_ths (List[float]): The thresholds to evaluate the size level mAcc. Default is [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]. 116 | small_size_th (int): The threshold to define the small size bbox. Default is 128. 117 | large_size_th (int): The threshold to define the large size bbox. Default is 256. 118 | avg_cls_level_acc_ths (List[float]): The thresholds to evaluate the average of all class level accuracy. Default is [0.5, ]. 119 | avg_cls_level_macc_ths (List[float]): The thresholds to evaluate the average of all class level mAcc. Default is [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]. 120 | ''' 121 | self.dataset = dataset.dataset 122 | self.split=split 123 | if split is None: 124 | self.split=dataset.split 125 | 126 | self.ann_level_acc_ths = ann_level_acc_ths 127 | self.ann_level_macc_ths = ann_level_macc_ths 128 | self.size_level_acc_ths = size_level_acc_ths 129 | self.size_level_macc_ths = size_level_macc_ths 130 | self.avg_cls_level_acc_ths = avg_cls_level_acc_ths 131 | self.avg_cls_level_macc_ths = avg_cls_level_macc_ths 132 | 133 | self.small_size_th=small_size_th 134 | self.large_size_th=large_size_th 135 | 136 | @staticmethod 137 | def calculate_iou_acc(bboxes_1, bboxes_2, thresh=0.5): 138 | """ 139 | bboxes_1 (torch.Tensor, numpy.Array): shape=[N,4], format=[x1, y1, x2, y2] 140 | bboxes_2 (torch.Tensor, numpy.Array): shape=[N,4], format=[x1, y1, x2, y2] 141 | calculate the iou and acc of the pred_bboxes and gt_bboxes, 142 | if iou(pred_bboxes[i],gt_bboxes[i])>0.5, then acc+=1 143 | all pred_bboxes_i and gt_bboxes_i are one to one assigned. 144 | 145 | """ 146 | iou=bbox_overlaps(bboxes_1,bboxes_2,mode='iou', is_aligned=True) 147 | if(type(thresh) is not list): 148 | thresh=[thresh] 149 | accs=dict() 150 | for t in thresh: 151 | accs[t]=(iou>t).sum().item()/len(iou) 152 | return iou,accs 153 | 154 | def evaluate(self, predictions, save_file=None): 155 | """ 156 | Evaluate given dataset and predictions. 157 | 158 | Parameters: 159 | - predictions (List(Dict)): The predictions to evaluate. 160 | Each item in the list is a dict, containing the keys: 'pred_bbox', 'id' and 'format', 161 | where 'id' is the annotation id, 'format' is the bbox format 'xyxy' or 'xywh'. 162 | e.g.: 163 | [ 164 | { 165 | 'pred_bbox': [x1, y1, x2, y2], 166 | 'id': '000000', 167 | 'format': 'xyxy' 168 | }, 169 | ... 170 | ] 171 | - save_file (str): The file to save the evaluation results to. 
172 | """ 173 | if(len(predictions)==0): 174 | print("Warning: No predictions found.") 175 | return dict() 176 | 177 | # convert predictions to a dict, key is the id, raise error if there are duplicate ids 178 | predictions_dict={pred['id']:pred for pred in predictions} 179 | if(len(predictions)!=len(predictions_dict)): 180 | raise ValueError("Duplicate ids found in the predictions.") 181 | 182 | gt_bboxes = [] 183 | pred_bboxes = [] 184 | dataset_split = self.dataset[self.split] 185 | 186 | for idx, gt_data in enumerate(dataset_split): 187 | gt_bbox = gt_data['bbox'] 188 | gt_bboxes.append([gt_bbox[0], gt_bbox[1], gt_bbox[0]+gt_bbox[2], gt_bbox[1]+gt_bbox[3]]) 189 | 190 | # raise error if the id canot be found in the predictions 191 | if gt_data['id'] not in predictions_dict: 192 | raise ValueError(f"Id {gt_data['id']} not found in the predictions.") 193 | 194 | pred_bbox = predictions_dict[gt_data['id']]['pred_bbox'] 195 | if predictions_dict[gt_data['id']]['format'] == 'xywh': 196 | pred_bbox = [pred_bbox[0], pred_bbox[1], pred_bbox[0]+pred_bbox[2], pred_bbox[1]+pred_bbox[3]] 197 | pred_bboxes.append(pred_bbox) 198 | 199 | 200 | # calculate_iou 201 | iou = bbox_overlaps(torch.tensor(gt_bboxes), torch.tensor(pred_bboxes), mode='iou', is_aligned=True).numpy() 202 | 203 | acc_all = dict() 204 | 205 | # Annotation level evaluation 206 | for th in self.ann_level_acc_ths: 207 | acc = (iou > th).sum().item() / len(iou) 208 | key = f"Ann-level acc iou {th}" 209 | acc_all[key] = acc * 100 210 | 211 | macc = [] 212 | for th in self.ann_level_macc_ths: 213 | acc = (iou > th).sum().item() / len(iou) 214 | macc.append(acc) 215 | key = f"Ann-level macc iou {self.ann_level_macc_ths[0]}:{self.ann_level_macc_ths[-1]}" 216 | acc_all[key] = mean(macc) * 100 217 | 218 | # get the acc for copy, so that we can output the results as a table, round to 2 decimal places 219 | # acc_all['Ann-level accs for copy'] = [acc_all[key] for key in acc_all] 220 | acc_all['Ann-level accs for copy'] = [ 221 | round(acc_all[key], 2) for key in acc_all 222 | ] 223 | 224 | # Size evaluation 225 | small_size_list = [] 226 | medium_size_list = [] 227 | large_size_list = [] 228 | for idx, gt_data in enumerate(dataset_split): 229 | gt_bbox = gt_data['bbox'] 230 | obj_size = math.sqrt(gt_bbox[2]*gt_bbox[3]) 231 | iou_item = iou[idx] 232 | 233 | # gather the small, medium and large size bboxes 234 | if obj_size < self.small_size_th: 235 | small_size_list.append(iou_item) 236 | elif obj_size <= self.large_size_th: 237 | medium_size_list.append(iou_item) 238 | else: 239 | large_size_list.append(iou_item) 240 | 241 | # small size evaluation 242 | for th in self.size_level_acc_ths: 243 | acc_small = sum(i > th for i in small_size_list) / len(small_size_list) * 100 244 | key = f"Small acc iou {th}" 245 | acc_all[key] = acc_small 246 | 247 | macc_small = [] 248 | for th in self.size_level_macc_ths: 249 | small_size_list = np.array(small_size_list) 250 | acc_small = (small_size_list > th).sum().item() / len(small_size_list) * 100 251 | macc_small.append(acc_small) 252 | key = f"Small macc iou {self.size_level_macc_ths[0]}:{self.size_level_macc_ths[-1]}" 253 | acc_all[key] = mean(macc_small) 254 | 255 | # medium size evaluation 256 | for th in self.size_level_acc_ths: 257 | acc_medium = sum(i > th for i in medium_size_list) / len(medium_size_list) * 100 258 | key = f"Medium acc iou {th}" 259 | acc_all[key] = acc_medium 260 | 261 | macc_medium = [] 262 | for th in self.size_level_macc_ths: 263 | medium_size_list = np.array(medium_size_list) 
264 | acc_medium = (medium_size_list > th).sum().item() / len(medium_size_list) * 100 265 | macc_medium.append(acc_medium) 266 | key = f"Medium macc iou {self.size_level_macc_ths[0]}:{self.size_level_macc_ths[-1]}" 267 | acc_all[key] = mean(macc_medium) 268 | 269 | # large size evaluation 270 | for th in self.size_level_acc_ths: 271 | acc_large = sum(i > th for i in large_size_list) / len(large_size_list) * 100 272 | key = f"Large acc iou {th}" 273 | acc_all[key] = acc_large 274 | 275 | macc_large = [] 276 | for th in self.size_level_macc_ths: 277 | large_size_list = np.array(large_size_list) 278 | acc_large = (large_size_list > th).sum().item() / len(large_size_list) * 100 279 | macc_large.append(acc_large) 280 | key = f"Large macc iou {self.size_level_macc_ths[0]}:{self.size_level_macc_ths[-1]}" 281 | acc_all[key] = mean(macc_large) 282 | 283 | # get the size-level acc for copy, so that we can output the results as a table, round to 2 decimal places 284 | acc_all['Size level accs for copy'] = [ 285 | round(acc_all[key], 2) for key in acc_all 286 | if 'Small' in key or 'Medium' in key or 'Large' in key 287 | ] 288 | 289 | # Average class-level evaluation 290 | iou_avg_cls_level_acc_ths = dict() 291 | for idx, gt_data in enumerate(dataset_split): 292 | iou_item = iou[idx] 293 | if gt_data['ori_category_id'] in mapping_dict: 294 | ori_category_id = mapping_dict[gt_data['ori_category_id']] 295 | else: 296 | ori_category_id = gt_data['ori_category_id'] 297 | 298 | if ori_category_id not in iou_avg_cls_level_acc_ths: 299 | iou_avg_cls_level_acc_ths[ori_category_id] = [] 300 | iou_avg_cls_level_acc_ths[ori_category_id].append(iou_item) 301 | 302 | for th in self.avg_cls_level_acc_ths: 303 | acc_list = [] 304 | for key in iou_avg_cls_level_acc_ths: 305 | iou_array = np.array(iou_avg_cls_level_acc_ths[key]) 306 | acc = (iou_array > th).sum().item() / len(iou_array) * 100 307 | acc_list.append(acc) 308 | key = f"Average class-level acc iou {th}" 309 | acc_all[key] = mean(acc_list) 310 | 311 | # macc 312 | macc_list = [] 313 | for th in self.avg_cls_level_macc_ths: 314 | acc_list = [] 315 | for key in iou_avg_cls_level_acc_ths: 316 | iou_array = np.array(iou_avg_cls_level_acc_ths[key]) 317 | acc = (iou_array > th).sum().item() / len(iou_avg_cls_level_acc_ths[key]) * 100 318 | acc_list.append(acc) 319 | macc_list.append(mean(acc_list)) 320 | key = f"Average class-level macc iou {self.avg_cls_level_macc_ths[0]}:{self.avg_cls_level_macc_ths[-1]}" 321 | acc_all[key] = mean(macc_list) 322 | 323 | # get the avg_cls-level acc for copy, so that we can output the results as a table, round to 2 decimal places 324 | acc_all['Avg class-level accs for copy'] = [ 325 | round(acc_all[key], 2) for key in acc_all 326 | if 'Average class-level' in key 327 | ] 328 | 329 | 330 | # Output as table 331 | table = [] 332 | table.append([f"Item for split {self.split}", "Value"]) 333 | for k, v in acc_all.items(): 334 | if isinstance(v, list): 335 | table.append([k, ", ".join(map(str, v))]) 336 | else: 337 | table.append([k, v]) 338 | 339 | # Define where to add horizontal lines 340 | horizontal_lines = {1, 6, 13} # After header, IoU, and Subject evaluations 341 | 342 | # Print table with selective horizontal lines 343 | max_len = max(len(row[0]) for row in table) 344 | for i, row in enumerate(table): 345 | if i in horizontal_lines: 346 | print('-' * (max_len + 3 + max(len(str(r[1])) for r in table))) 347 | print(f"{row[0].ljust(max_len)} | {row[1]}") 348 | 349 | if(save_file is not None): 350 | acc_all.pop('Ann-level accs 
for copy') 351 | acc_all.pop('Size level accs for copy') 352 | acc_all.pop('Avg class-level accs for copy') 353 | df=pd.DataFrame(acc_all, index=[0]) 354 | df.to_csv(save_file) 355 | 356 | return acc_all 357 | 358 | # Example usage: 359 | if __name__ == '__main__': 360 | from dataloader import RefL4Dataset 361 | import json 362 | custom_transforms = None 363 | ref_l4_dataset = RefL4Dataset('/Users/jchen12/Documents/misc/Ref-L4', split='all', custom_transforms=custom_transforms) 364 | print("Dataset loaded. Length:", len(ref_l4_dataset)) 365 | 366 | evaluator = RefL4Evaluator(dataset=ref_l4_dataset) 367 | 368 | pred_json_path = "/Users/jchen12/Documents/misc/cogvlm_grounding_pred.json" 369 | with open(pred_json_path, 'r') as f: 370 | predictions = json.load(f) 371 | evaluator.evaluate(predictions) -------------------------------------------------------------------------------- /ref_l4/overlaps.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) OpenMMLab. All rights reserved. 2 | import torch 3 | 4 | 5 | def fp16_clamp(x, min=None, max=None): 6 | if not x.is_cuda and x.dtype == torch.float16: 7 | # clamp for cpu float16, tensor fp16 has no clamp implementation 8 | return x.float().clamp(min, max).half() 9 | 10 | return x.clamp(min, max) 11 | 12 | 13 | def bbox_overlaps(bboxes1, bboxes2, mode='iou', is_aligned=False, eps=1e-6): 14 | """Calculate overlap between two set of bboxes. 15 | 16 | FP16 Contributed by https://github.com/open-mmlab/mmdetection/pull/4889 17 | Note: 18 | Assume bboxes1 is M x 4, bboxes2 is N x 4, when mode is 'iou', 19 | there are some new generated variable when calculating IOU 20 | using bbox_overlaps function: 21 | 22 | 1) is_aligned is False 23 | area1: M x 1 24 | area2: N x 1 25 | lt: M x N x 2 26 | rb: M x N x 2 27 | wh: M x N x 2 28 | overlap: M x N x 1 29 | union: M x N x 1 30 | ious: M x N x 1 31 | 32 | Total memory: 33 | S = (9 x N x M + N + M) * 4 Byte, 34 | 35 | When using FP16, we can reduce: 36 | R = (9 x N x M + N + M) * 4 / 2 Byte 37 | R large than (N + M) * 4 * 2 is always true when N and M >= 1. 38 | Obviously, N + M <= N * M < 3 * N * M, when N >=2 and M >=2, 39 | N + 1 < 3 * N, when N or M is 1. 40 | 41 | Given M = 40 (ground truth), N = 400000 (three anchor boxes 42 | in per grid, FPN, R-CNNs), 43 | R = 275 MB (one times) 44 | 45 | A special case (dense detection), M = 512 (ground truth), 46 | R = 3516 MB = 3.43 GB 47 | 48 | When the batch size is B, reduce: 49 | B x R 50 | 51 | Therefore, CUDA memory runs out frequently. 52 | 53 | Experiments on GeForce RTX 2080Ti (11019 MiB): 54 | 55 | | dtype | M | N | Use | Real | Ideal | 56 | |:----:|:----:|:----:|:----:|:----:|:----:| 57 | | FP32 | 512 | 400000 | 8020 MiB | -- | -- | 58 | | FP16 | 512 | 400000 | 4504 MiB | 3516 MiB | 3516 MiB | 59 | | FP32 | 40 | 400000 | 1540 MiB | -- | -- | 60 | | FP16 | 40 | 400000 | 1264 MiB | 276MiB | 275 MiB | 61 | 62 | 2) is_aligned is True 63 | area1: N x 1 64 | area2: N x 1 65 | lt: N x 2 66 | rb: N x 2 67 | wh: N x 2 68 | overlap: N x 1 69 | union: N x 1 70 | ious: N x 1 71 | 72 | Total memory: 73 | S = 11 x N * 4 Byte 74 | 75 | When using FP16, we can reduce: 76 | R = 11 x N * 4 / 2 Byte 77 | 78 | So do the 'giou' (large than 'iou'). 79 | 80 | Time-wise, FP16 is generally faster than FP32. 81 | 82 | When gpu_assign_thr is not -1, it takes more time on cpu 83 | but not reduce memory. 84 | There, we can reduce half the memory and keep the speed. 
85 | 86 | If ``is_aligned`` is ``False``, then calculate the overlaps between each 87 | bbox of bboxes1 and bboxes2, otherwise the overlaps between each aligned 88 | pair of bboxes1 and bboxes2. 89 | 90 | Args: 91 | bboxes1 (Tensor): shape (B, m, 4) in format or empty. 92 | bboxes2 (Tensor): shape (B, n, 4) in format or empty. 93 | B indicates the batch dim, in shape (B1, B2, ..., Bn). 94 | If ``is_aligned`` is ``True``, then m and n must be equal. 95 | mode (str): "iou" (intersection over union), "iof" (intersection over 96 | foreground) or "giou" (generalized intersection over union). 97 | Default "iou". 98 | is_aligned (bool, optional): If True, then m and n must be equal. 99 | Default False. 100 | eps (float, optional): A value added to the denominator for numerical 101 | stability. Default 1e-6. 102 | 103 | Returns: 104 | Tensor: shape (m, n) if ``is_aligned`` is False else shape (m,) 105 | 106 | Example: 107 | >>> bboxes1 = torch.FloatTensor([ 108 | >>> [0, 0, 10, 10], 109 | >>> [10, 10, 20, 20], 110 | >>> [32, 32, 38, 42], 111 | >>> ]) 112 | >>> bboxes2 = torch.FloatTensor([ 113 | >>> [0, 0, 10, 20], 114 | >>> [0, 10, 10, 19], 115 | >>> [10, 10, 20, 20], 116 | >>> ]) 117 | >>> overlaps = bbox_overlaps(bboxes1, bboxes2) 118 | >>> assert overlaps.shape == (3, 3) 119 | >>> overlaps = bbox_overlaps(bboxes1, bboxes2, is_aligned=True) 120 | >>> assert overlaps.shape == (3, ) 121 | 122 | Example: 123 | >>> empty = torch.empty(0, 4) 124 | >>> nonempty = torch.FloatTensor([[0, 0, 10, 9]]) 125 | >>> assert tuple(bbox_overlaps(empty, nonempty).shape) == (0, 1) 126 | >>> assert tuple(bbox_overlaps(nonempty, empty).shape) == (1, 0) 127 | >>> assert tuple(bbox_overlaps(empty, empty).shape) == (0, 0) 128 | """ 129 | 130 | assert mode in ['iou', 'iof', 'giou'], f'Unsupported mode {mode}' 131 | # Either the boxes are empty or the length of boxes' last dimension is 4 132 | assert (bboxes1.size(-1) == 4 or bboxes1.size(0) == 0) 133 | assert (bboxes2.size(-1) == 4 or bboxes2.size(0) == 0) 134 | 135 | # Batch dim must be the same 136 | # Batch dim: (B1, B2, ... 
Bn)
137 |     assert bboxes1.shape[:-2] == bboxes2.shape[:-2]
138 |     batch_shape = bboxes1.shape[:-2]
139 | 
140 |     rows = bboxes1.size(-2)
141 |     cols = bboxes2.size(-2)
142 |     if is_aligned:
143 |         assert rows == cols
144 | 
145 |     if rows * cols == 0:
146 |         if is_aligned:
147 |             return bboxes1.new(batch_shape + (rows, ))
148 |         else:
149 |             return bboxes1.new(batch_shape + (rows, cols))
150 | 
151 |     area1 = (bboxes1[..., 2] - bboxes1[..., 0]) * (
152 |         bboxes1[..., 3] - bboxes1[..., 1])
153 |     area2 = (bboxes2[..., 2] - bboxes2[..., 0]) * (
154 |         bboxes2[..., 3] - bboxes2[..., 1])
155 | 
156 |     if is_aligned:
157 |         lt = torch.max(bboxes1[..., :2], bboxes2[..., :2])  # [B, rows, 2]
158 |         rb = torch.min(bboxes1[..., 2:], bboxes2[..., 2:])  # [B, rows, 2]
159 | 
160 |         wh = fp16_clamp(rb - lt, min=0)
161 |         overlap = wh[..., 0] * wh[..., 1]
162 | 
163 |         if mode in ['iou', 'giou']:
164 |             union = area1 + area2 - overlap
165 |         else:
166 |             union = area1
167 |         if mode == 'giou':
168 |             enclosed_lt = torch.min(bboxes1[..., :2], bboxes2[..., :2])
169 |             enclosed_rb = torch.max(bboxes1[..., 2:], bboxes2[..., 2:])
170 |     else:
171 |         lt = torch.max(bboxes1[..., :, None, :2],
172 |                        bboxes2[..., None, :, :2])  # [B, rows, cols, 2]
173 |         rb = torch.min(bboxes1[..., :, None, 2:],
174 |                        bboxes2[..., None, :, 2:])  # [B, rows, cols, 2]
175 | 
176 |         wh = fp16_clamp(rb - lt, min=0)
177 |         overlap = wh[..., 0] * wh[..., 1]
178 | 
179 |         if mode in ['iou', 'giou']:
180 |             union = area1[..., None] + area2[..., None, :] - overlap
181 |         else:
182 |             union = area1[..., None]
183 |         if mode == 'giou':
184 |             enclosed_lt = torch.min(bboxes1[..., :, None, :2],
185 |                                     bboxes2[..., None, :, :2])
186 |             enclosed_rb = torch.max(bboxes1[..., :, None, 2:],
187 |                                     bboxes2[..., None, :, 2:])
188 | 
189 |     eps = union.new_tensor([eps])
190 |     union = torch.max(union, eps)
191 |     ious = overlap / union
192 |     if mode in ['iou', 'iof']:
193 |         return ious
194 |     # calculate gious
195 |     enclose_wh = fp16_clamp(enclosed_rb - enclosed_lt, min=0)
196 |     enclose_area = enclose_wh[..., 0] * enclose_wh[..., 1]
197 |     enclose_area = torch.max(enclose_area, eps)
198 |     gious = ious - (enclose_area - union) / enclose_area
199 |     return gious
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 | 
3 | setup(
4 |     name='ref_l4',
5 |     version='1.1',
6 |     packages=find_packages(),
7 |     py_modules=['dataloader', 'evaluation', 'overlaps'],
8 |     install_requires=[
9 |         # Dependencies required by the ref_l4 package
10 |         # (numpy, tqdm, and huggingface_hub are imported by the dataloader and evaluator)
11 |         'torch', 'datasets', 'pillow', 'pandas', 'numpy', 'tqdm', 'huggingface_hub'
12 |     ],
13 |     entry_points={
14 |         'console_scripts': [
15 |             # Add any command line scripts here
16 |             # e.g., 'mycommand = mypackage.module:function'
17 |         ],
18 |     },
19 |     author='Jierun Chen',
20 |     description='Evaluation code for Ref-L4, a new REC benchmark in the LMM era.',
21 |     long_description=open('README.md').read(),
22 |     long_description_content_type='text/markdown',
23 |     url='https://github.com/JierunChen/Ref-L4',
24 |     classifiers=[
25 |         'Programming Language :: Python :: 3',
26 |         'License :: OSI Approved :: MIT License',
27 |         'Operating System :: OS Independent',
28 |     ],
29 |     license='MIT',
30 |     python_requires='>=3.6',
31 | )
--------------------------------------------------------------------------------