├── LICENSE
├── README.md
├── demo_models
└── cogvlm_grounding_pred.json
├── evaluate_pred.py
├── figure
├── error_samples.png
└── examples.png
├── ref_l4
├── __init__.py
├── __pycache__
│ ├── __init__.cpython-311.pyc
│ ├── dataloader.cpython-311.pyc
│ ├── evaluation.cpython-311.pyc
│ └── overlaps.cpython-311.pyc
├── dataloader.py
├── evaluation.py
└── overlaps.py
└── setup.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 JierunChen
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Ref-L4: a New REC Benchmark in the LMM Era
2 |
3 | This repository contains the data loader and evaluation code for our [Ref-L4](https://huggingface.co/datasets/JierunChen/Ref-L4), a new REC benchmark in the LMM era. Check out our paper [Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models](https://arxiv.org/abs/2406.16866) for more details.
4 |
5 | ## 🧭 Introduction
6 |
7 | Referring expression comprehension (REC) is the task of locating a specific region in an image based on a textual description. Current popular REC benchmarks, such as RefCOCO, RefCOCO+, and RefCOCOg, may not fully capture the comprehensive capabilities of large multimodal models (LMMs) due to limitations in:
8 |
9 | - **Linguistic complexity**: Expressions tend to be overly simple, often consisting of single words like "yellow," "lady," or "far left person."
10 | - **Visual complexity**: The benchmarks are restricted to a limited set of object categories (fewer than 80).
11 | - **Labeling accuracy**: RefCOCO, RefCOCO+, and RefCOCOg have high labeling error rates (14%, 24%, and 5%, respectively).
12 |
13 | In response, we rigorously exclude the problematic instances and re-evaluate LMMs, observing significant accuracy improvements. We also introduce Ref-L4, a new and more comprehensive REC benchmark for evaluating LMMs. Ref-L4 offers:
14 |
15 | - **Lengthy referring expressions** averaging 24.2 words;
16 | - **Extensive vocabulary** comprising 22,813 unique words;
17 | - **Diverse objects** covering 365 categories with varying instance scales (from 30 to 3,767);
18 | - **Substantial sample size** with 45,341 annotations.
19 |
20 | ### 😃 Ref-L4 examples
21 |
22 | ![Ref-L4 examples](./figure/examples.png)
23 |
24 | ### 😑 Labeling errors in RefCOCO, +, g
25 |
26 | In the REC task, a referring expression should uniquely describe an instance, which is represented by an accurate bounding box. We have identified and visualized three common types of labeling errors in the RefCOCO, RefCOCO+, and RefCOCOg benchmarks: (a) non-unique referring expressions, which refer to multiple instances within the same image; (b) inaccurate bounding boxes; and (c) misalignment between target instances and their referring expressions, where the referring expressions are either ambiguous or do not refer to any instance in the image.
27 |
28 | ![Labeling errors in RefCOCO, RefCOCO+, and RefCOCOg](./figure/error_samples.png)
29 |
30 | ## 🛠️ Installation
31 |
32 | ```bash
33 | git clone https://github.com/JierunChen/Ref-L4.git
34 | cd Ref-L4
35 | pip install ./
36 | ```
37 |
38 | ## 🚀 Loading Ref-L4 Dataset
39 |
40 | ```python
41 | from ref_l4 import RefL4Dataset
42 | ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='all')
43 | print(len(ref_l4_dataset))
44 | # iterate over the dataset
45 | for img_pil, data in ref_l4_dataset:
46 | print(data)
47 | break
48 |
49 | # switch to the 'val' or 'test' split
50 | ref_l4_dataset.change_split('val')
51 | print(len(ref_l4_dataset))
52 | ```
53 |
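The loader also accepts a `custom_transforms` callable, which is applied to each PIL image before it is returned. Below is a minimal sketch, assuming torchvision is available (it is not a declared dependency of this package) and using an arbitrary input resolution:

```python
from torchvision import transforms
from ref_l4 import RefL4Dataset

# any callable that maps a PIL image to whatever your model expects
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),  # arbitrary size, adjust to your model
    transforms.ToTensor(),
])

ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='val', custom_transforms=preprocess)
img_tensor, data = ref_l4_dataset[0]
print(img_tensor.shape, data['id'])
```
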
54 | The [Ref-L4 dataset](https://huggingface.co/datasets/JierunChen/Ref-L4) can also be downloaded from Hugging Face.
55 |
56 |
57 | ## 🗂️ Evaluation
58 |
59 | After obtaining model predictions, save them as a JSON file containing a list of dictionaries, each with three keys: `id`, `pred_bbox`, and `format`. The `id` is the annotation ID, and `format` specifies the format of `pred_bbox`, which should be either `xyxy` or `xywh`.
60 | A sample prediction file, produced by the CogVLM-Grounding model, can be found [here](./demo_models/cogvlm_grounding_pred.json).
61 |
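For illustration, such a file can be written as follows (a minimal sketch; the `id` values and boxes below are placeholders, while real IDs come from the `id` field of each annotation):

```python
import json

# each entry: annotation id, predicted box, and the box format ('xyxy' or 'xywh')
predictions = [
    {"id": "some_annotation_id", "pred_bbox": [10.0, 20.0, 110.0, 220.0], "format": "xyxy"},
    {"id": "another_annotation_id", "pred_bbox": [30.0, 40.0, 80.0, 180.0], "format": "xywh"},
]

with open("my_model_pred.json", "w") as f:
    json.dump(predictions, f)
```
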
62 | Now we can run the following command to evaluate the predictions against the ground truth:
63 |
64 | ```bash
65 | python evaluate_pred.py --dataset_path JierunChen/Ref-L4 --split all --pred_json_path ./demo_models/cogvlm_grounding_pred.json
66 | ```
67 |
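The same evaluation can also be run programmatically, and `save_file` optionally writes the metrics to a CSV (a sketch mirroring `evaluate_pred.py`; the output file name is arbitrary):

```python
import json
from ref_l4 import RefL4Dataset, RefL4Evaluator

dataset = RefL4Dataset("JierunChen/Ref-L4", split="all")
evaluator = RefL4Evaluator(dataset=dataset)

with open("./demo_models/cogvlm_grounding_pred.json") as f:
    predictions = json.load(f)

# prints the metric table and returns the metrics as a dict
metrics = evaluator.evaluate(predictions, save_file="cogvlm_grounding_metrics.csv")
```
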
68 | Here is an example output for the predictions from CogVLM-Grounding on the "all" split:
69 |
70 | ```txt
71 | Item for split all | Value
72 | --------------------------------------------------------------------------------
73 | Ann-level acc iou 0.5 | 81.69868331091065
74 | Ann-level acc iou 0.75 | 70.76597340155709
75 | Ann-level acc iou 0.9 | 48.351381751615534
76 | Ann-level macc iou 0.5:0.95 | 66.08808804393375
77 | Ann-level accs for copy | 81.7, 70.77, 48.35, 66.09
78 | --------------------------------------------------------------------------------
79 | Small acc iou 0.5 | 75.0561797752809
80 | Small macc iou 0.5:0.95 | 52.853932584269664
81 | Medium acc iou 0.5 | 86.43470790378007
82 | Medium macc iou 0.5:0.95 | 71.31099656357388
83 | Large acc iou 0.5 | 77.90972003774772
84 | Large macc iou 0.5:0.95 | 66.25196602705253
85 | Size level accs for copy | 75.06, 52.85, 86.43, 71.31, 77.91, 66.25
86 | --------------------------------------------------------------------------------
87 | Average class-level acc iou 0.5 | 72.42029130234654
88 | Average class-level macc iou 0.5:0.95 | 52.56151912966669
89 | Avg class-level accs for copy | 72.42, 52.56
90 | ```
91 |
92 | The output includes three parts:
93 | - The annotation-level accuracy under various IoU thresholds, plus the mean accuracy (mAcc) averaged over IoU thresholds from 0.5 to 0.95 with a stride of 0.05 (see the sketch after this list).
94 | - The accuracy for small, medium, and large objects, where object size is the square root of the box area and the small/large cut-offs default to 128 and 256 pixels.
95 | - The accuracy averaged across object categories.
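
For reference, each mean accuracy (mAcc) above is simply the per-threshold accuracy averaged over IoU thresholds 0.5 to 0.95 in steps of 0.05. A minimal sketch of that computation, assuming `iou` is an array of per-annotation IoUs as computed in `ref_l4/evaluation.py`:

```python
import numpy as np

def mean_accuracy(iou, thresholds=None):
    """Average accuracy over IoU thresholds 0.50:0.95 with a stride of 0.05."""
    iou = np.asarray(iou)
    if thresholds is None:
        thresholds = [t / 100 for t in range(50, 100, 5)]
    # accuracy at one threshold: fraction of predictions whose IoU exceeds it
    accs = [(iou > t).mean() * 100 for t in thresholds]
    return sum(accs) / len(accs)
```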
96 |
97 |
98 | ## 📖 Dataset License
99 |
100 | The Ref-L4 dataset is released under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license](https://creativecommons.org/licenses/by-nc/4.0/). Please note that the images in the Ref-L4 dataset are derived from the following datasets, each with their respective licenses:
101 | - **RefCOCO**: Licensed under the [Apache-2.0 license](http://www.apache.org/licenses/).
102 | - **RefCOCO+**: Licensed under the [Apache-2.0 license](http://www.apache.org/licenses/).
103 | - **RefCOCOg**: Licensed under the [Creative Commons Attribution 4.0 International (CC BY 4.0) license](https://creativecommons.org/licenses/by/4.0/).
104 | - **COCO 2014**: Licensed under the [Creative Commons Attribution 4.0 International (CC BY 4.0) license](https://creativecommons.org/licenses/by/4.0/).
105 | - **Objects365**: Licensed under the [Creative Commons Attribution 4.0 International (CC BY 4.0) license](http://creativecommons.org/licenses/by/4.0/).
106 |
107 | By using the Ref-L4 dataset, you agree to comply with the licensing terms of these source datasets.
108 |
--------------------------------------------------------------------------------
/evaluate_pred.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import json
3 | from ref_l4 import RefL4Dataset, RefL4Evaluator
4 |
5 | def main(args):
6 | custom_transforms = None
7 | ref_l4_dataset = RefL4Dataset(args.dataset_path, split=args.split, custom_transforms=custom_transforms)
8 | print("Dataset loaded. Length:", len(ref_l4_dataset))
9 |
10 | evaluator = RefL4Evaluator(dataset=ref_l4_dataset)
11 |
12 | with open(args.pred_json_path, 'r') as f:
13 | predictions = json.load(f)
14 | evaluator.evaluate(predictions)
15 |
16 | if __name__ == '__main__':
17 | parser = argparse.ArgumentParser(description='Evaluate the predictions on Ref-L4 dataset.')
18 | parser.add_argument('--dataset_path', type=str, default="JierunChen/Ref-L4", help='Path to the Ref-L4 dataset.')
19 | parser.add_argument('--split', type=str, default='all', choices=['val', 'test', 'all'], help='Dataset split to use (val, test, all).')
20 | parser.add_argument('--pred_json_path', type=str, required=True, help='Path to the predictions JSON file.')
21 |
22 | args = parser.parse_args()
23 |
24 | main(args)
25 |
--------------------------------------------------------------------------------
/figure/error_samples.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/figure/error_samples.png
--------------------------------------------------------------------------------
/figure/examples.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/figure/examples.png
--------------------------------------------------------------------------------
/ref_l4/__init__.py:
--------------------------------------------------------------------------------
1 | from .dataloader import RefL4Dataset
2 | from .evaluation import RefL4Evaluator
3 |
4 | __all__ = ['RefL4Dataset', 'RefL4Evaluator']
--------------------------------------------------------------------------------
/ref_l4/__pycache__/__init__.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/__pycache__/dataloader.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/dataloader.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/__pycache__/evaluation.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/evaluation.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/__pycache__/overlaps.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/overlaps.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/dataloader.py:
--------------------------------------------------------------------------------
1 | from torch.utils.data import Dataset
2 | from datasets import load_dataset, concatenate_datasets
3 | from huggingface_hub import hf_hub_download
4 | from PIL import Image
5 | import tarfile
6 | import io
7 | from tqdm import tqdm
8 |
9 | class RefL4Dataset(Dataset):
10 | def __init__(self, dataset_path, split, images_file='images.tar.gz', custom_transforms=None):
11 | """
12 | Initialize the RefL4Dataset class.
13 |
14 | Parameters:
15 |         - dataset_path (str): Local path to the dataset directory or the Hugging Face dataset ID.
16 |         - split (str): Dataset split, one of "val", "test", or "all".
17 |         - images_file (str): Name of the tar file containing images, defaults to 'images.tar.gz'.
18 |         - custom_transforms: Optional callable applied to each image, defaults to None.
19 | """
20 | super(RefL4Dataset, self).__init__()
21 | assert split in ['val', 'test', 'all'], 'split should be val, test or all'
22 | self.dataset_path=dataset_path
23 | self.split = split
24 | self.images_file=images_file
25 | self.transforms = custom_transforms
26 | self._load_dataset()
27 |
28 | def _load_dataset(self):
29 | self.dataset = load_dataset(self.dataset_path)
30 | image_tar_path = f"{self.dataset_path}/{self.images_file}"
31 | # manually download the images.tar.gz file
32 | if self.dataset_path == 'JierunChen/Ref-L4':
33 | image_tar_path = hf_hub_download(repo_id="JierunChen/Ref-L4", filename="images.tar.gz", repo_type="dataset")
34 | print(f"Downloaded images.tar.gz to {image_tar_path}")
35 |
36 | all_splits=concatenate_datasets([self.dataset['val'],self.dataset['test']])
37 | self.dataset['all']=all_splits
38 | self.images=self._load_images_from_tar(image_tar_path)
39 |
40 | def _load_images_from_tar(self, image_tar_path):
41 | images = {}
42 | print(f"Loading images from {image_tar_path}")
43 | with tarfile.open(image_tar_path, "r:gz") as tar:
44 | for member in tqdm(tar.getmembers()):
45 | if member.isfile() and member.name.endswith(('jpg', 'jpeg', 'png', 'webp')):
46 | f = tar.extractfile(member)
47 | if f:
48 | image = Image.open(io.BytesIO(f.read()))
49 |                         # convert grayscale images to RGB if needed
50 |                         if image.mode != 'RGB':
51 |                             image = image.convert('RGB')
52 | 
53 |                         # index the image by its path within the archive
54 |                         images[member.name] = image
55 | return images
56 |
57 | def change_split(self, split):
58 | assert split in ['val', 'test', 'all'], 'split should be val, test or all'
59 | self.split = split
60 |
61 | def __len__(self):
62 | return len(self.dataset[self.split])
63 |
64 | def __getitem__(self, idx):
65 | """
66 | Returns:
67 |         - image (PIL.Image.Image or Tensor): The image, transformed if custom_transforms is provided.
68 | - data (dict): Other sample data.
69 | """
70 | data = self.dataset[self.split][idx]
71 | image = self.images[data['file_name']]
72 | if self.transforms:
73 | image = self.transforms(image)
74 | return image, data
75 |
76 | # Example usage:
77 | if __name__ == '__main__':
78 | ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='all')
79 | print(len(ref_l4_dataset))
80 | print(ref_l4_dataset[0])
81 |     # switch to the val split
82 | ref_l4_dataset.change_split('val')
83 | print(len(ref_l4_dataset))
84 | print(ref_l4_dataset[0])
85 |     # switch to the test split
86 | ref_l4_dataset.change_split('test')
87 | print(len(ref_l4_dataset))
88 | print(ref_l4_dataset[0])
89 |
90 | ref_l4_dataset.change_split('all')
91 | print(len(ref_l4_dataset))
92 | print(ref_l4_dataset[0])
93 |
--------------------------------------------------------------------------------
/ref_l4/evaluation.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2024 Jierun Chen. All rights reserved.
2 | import torch
3 | from statistics import mean
4 | import pandas as pd
5 | import math
6 | from datasets import load_dataset
7 | from .dataloader import RefL4Dataset
8 | from .overlaps import bbox_overlaps
9 | import numpy as np
10 |
11 | mapping_dict = {
12 | "refcoco_1": "o365_1",
13 | "refcoco_10": "o365_41",
14 | "refcoco_11": "o365_177",
15 | "refcoco_13": "o365_128",
16 | "refcoco_14": "o365_250",
17 | "refcoco_15": "o365_25",
18 | "refcoco_16": "o365_56",
19 | "refcoco_17": "o365_140",
20 | "refcoco_18": "o365_93",
21 | "refcoco_19": "o365_79",
22 | "refcoco_2": "o365_47",
23 | "refcoco_20": "o365_100",
24 | "refcoco_21": "o365_97",
25 | "refcoco_22": "o365_145",
26 | "refcoco_23": "o365_296",
27 | "refcoco_24": "o365_179",
28 | "refcoco_25": "o365_181",
29 | "refcoco_27": "o365_39",
30 | "refcoco_28": "o365_40",
31 | "refcoco_3": "o365_6",
32 | "refcoco_31": "o365_13",
33 | "refcoco_32": "o365_44",
34 | "refcoco_33": "o365_194",
35 | "refcoco_34": "o365_220",
36 | "refcoco_35": "o365_119",
37 | "refcoco_36": "o365_174",
38 | "refcoco_37": "o365_100000", # This is a placeholder, the actual value is not provided
39 | "refcoco_38": "o365_155",
40 | "refcoco_39": "o365_138",
41 | "refcoco_4": "o365_59",
42 | "refcoco_40": "o365_114",
43 | "refcoco_41": "o365_146",
44 | "refcoco_42": "o365_147",
45 | "refcoco_43": "o365_205",
46 | "refcoco_44": "o365_9",
47 | "refcoco_46": "o365_36",
48 | "refcoco_47": "o365_11",
49 | "refcoco_48": "o365_89",
50 | "refcoco_49": "o365_85",
51 | "refcoco_5": "o365_115",
52 | "refcoco_50": "o365_94",
53 | "refcoco_51": "o365_26",
54 | "refcoco_52": "o365_113",
55 | "refcoco_53": "o365_83",
56 | "refcoco_54": "o365_266",
57 | "refcoco_55": "o365_104",
58 | "refcoco_56": "o365_142",
59 | "refcoco_57": "o365_153",
60 | "refcoco_58": "o365_235",
61 | "refcoco_59": "o365_144",
62 | "refcoco_6": "o365_56",
63 | "refcoco_60": "o365_151",
64 | "refcoco_61": "o365_98",
65 | "refcoco_62": "o365_3",
66 | "refcoco_63": "o365_51",
67 | "refcoco_64": "o365_26",
68 | "refcoco_65": "o365_76",
69 | "refcoco_67": "o365_98",
70 | "refcoco_7": "o365_117",
71 | "refcoco_70": "o365_154",
72 | "refcoco_72": "o365_37",
73 | "refcoco_73": "o365_74",
74 | "refcoco_74": "o365_116",
75 | "refcoco_75": "o365_133",
76 | "refcoco_76": "o365_107",
77 | "refcoco_77": "o365_62",
78 | "refcoco_78": "o365_164",
79 | "refcoco_79": "o365_135",
80 | "refcoco_8": "o365_66",
81 | "refcoco_80": "o365_278",
82 | "refcoco_81": "o365_82",
83 | "refcoco_82": "o365_134",
84 | "refcoco_84": "o365_19",
85 | "refcoco_85": "o365_95",
86 | "refcoco_86": "o365_31",
87 | "refcoco_87": "o365_170",
88 | "refcoco_88": "o365_70",
89 | "refcoco_89": "o365_328",
90 | "refcoco_9": "o365_22",
91 | "refcoco_90": "o365_227",
92 | }
93 |
94 |
95 | class RefL4Evaluator:
96 | def __init__(
97 | self,
98 | dataset: RefL4Dataset,
99 | split=None,
100 | ann_level_acc_ths=[0.5, 0.75, 0.9],
101 | ann_level_macc_ths=[i/100 for i in range(50,100,5)],
102 | size_level_acc_ths=[0.5, ],
103 | size_level_macc_ths=[i/100 for i in range(50,100,5)],
104 | small_size_th=128,
105 | large_size_th=256,
106 | avg_cls_level_acc_ths=[0.5, ],
107 | avg_cls_level_macc_ths=[i/100 for i in range(50,100,5)],
108 | ) -> None:
109 | '''
110 | dataset (RefL4Dataset): The RefL4Dataset dataset for evaluation.
111 | split (str): The split of the dataset to evaluate. If None, use the dataset's split. Default is None.
112 | ann_level_acc_ths (List[float]): The thresholds to evaluate the annotation level accuracy. Default is [0.5, 0.75, 0.9].
113 | ann_level_macc_ths (List[float]): The thresholds to evaluate the annotation level mAcc. Default is [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95].
114 | size_level_acc_ths (List[float]): The thresholds to evaluate the size level accuracy. Default is [0.5, ].
115 | size_level_macc_ths (List[float]): The thresholds to evaluate the size level mAcc. Default is [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95].
116 | small_size_th (int): The threshold to define the small size bbox. Default is 128.
117 | large_size_th (int): The threshold to define the large size bbox. Default is 256.
118 | avg_cls_level_acc_ths (List[float]): The thresholds to evaluate the average of all class level accuracy. Default is [0.5, ].
119 | avg_cls_level_macc_ths (List[float]): The thresholds to evaluate the average of all class level mAcc. Default is [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95].
120 | '''
121 | self.dataset = dataset.dataset
122 | self.split=split
123 | if split is None:
124 | self.split=dataset.split
125 |
126 | self.ann_level_acc_ths = ann_level_acc_ths
127 | self.ann_level_macc_ths = ann_level_macc_ths
128 | self.size_level_acc_ths = size_level_acc_ths
129 | self.size_level_macc_ths = size_level_macc_ths
130 | self.avg_cls_level_acc_ths = avg_cls_level_acc_ths
131 | self.avg_cls_level_macc_ths = avg_cls_level_macc_ths
132 |
133 | self.small_size_th=small_size_th
134 | self.large_size_th=large_size_th
135 |
136 | @staticmethod
137 | def calculate_iou_acc(bboxes_1, bboxes_2, thresh=0.5):
138 | """
139 |         bboxes_1 (torch.Tensor): shape=[N,4], format=[x1, y1, x2, y2]
140 |         bboxes_2 (torch.Tensor): shape=[N,4], format=[x1, y1, x2, y2]
141 |         Calculate the IoU and accuracy between the two aligned sets of boxes:
142 |         a pair counts as correct if iou(bboxes_1[i], bboxes_2[i]) > thresh,
143 |         where bboxes_1[i] and bboxes_2[i] are matched one to one.
144 |
145 | """
146 | iou=bbox_overlaps(bboxes_1,bboxes_2,mode='iou', is_aligned=True)
147 | if(type(thresh) is not list):
148 | thresh=[thresh]
149 | accs=dict()
150 | for t in thresh:
151 | accs[t]=(iou>t).sum().item()/len(iou)
152 | return iou,accs
153 |
154 | def evaluate(self, predictions, save_file=None):
155 | """
156 | Evaluate given dataset and predictions.
157 |
158 | Parameters:
159 | - predictions (List(Dict)): The predictions to evaluate.
160 | Each item in the list is a dict, containing the keys: 'pred_bbox', 'id' and 'format',
161 | where 'id' is the annotation id, 'format' is the bbox format 'xyxy' or 'xywh'.
162 | e.g.:
163 | [
164 | {
165 | 'pred_bbox': [x1, y1, x2, y2],
166 | 'id': '000000',
167 | 'format': 'xyxy'
168 | },
169 | ...
170 | ]
171 | - save_file (str): The file to save the evaluation results to.
172 | """
173 | if(len(predictions)==0):
174 | print("Warning: No predictions found.")
175 | return dict()
176 |
177 | # convert predictions to a dict, key is the id, raise error if there are duplicate ids
178 | predictions_dict={pred['id']:pred for pred in predictions}
179 | if(len(predictions)!=len(predictions_dict)):
180 | raise ValueError("Duplicate ids found in the predictions.")
181 |
182 | gt_bboxes = []
183 | pred_bboxes = []
184 | dataset_split = self.dataset[self.split]
185 |
186 | for idx, gt_data in enumerate(dataset_split):
187 | gt_bbox = gt_data['bbox']
188 | gt_bboxes.append([gt_bbox[0], gt_bbox[1], gt_bbox[0]+gt_bbox[2], gt_bbox[1]+gt_bbox[3]])
189 |
190 |             # raise an error if the id cannot be found in the predictions
191 | if gt_data['id'] not in predictions_dict:
192 | raise ValueError(f"Id {gt_data['id']} not found in the predictions.")
193 |
194 | pred_bbox = predictions_dict[gt_data['id']]['pred_bbox']
195 | if predictions_dict[gt_data['id']]['format'] == 'xywh':
196 | pred_bbox = [pred_bbox[0], pred_bbox[1], pred_bbox[0]+pred_bbox[2], pred_bbox[1]+pred_bbox[3]]
197 | pred_bboxes.append(pred_bbox)
198 |
199 |
200 | # calculate_iou
201 | iou = bbox_overlaps(torch.tensor(gt_bboxes), torch.tensor(pred_bboxes), mode='iou', is_aligned=True).numpy()
202 |
203 | acc_all = dict()
204 |
205 | # Annotation level evaluation
206 | for th in self.ann_level_acc_ths:
207 | acc = (iou > th).sum().item() / len(iou)
208 | key = f"Ann-level acc iou {th}"
209 | acc_all[key] = acc * 100
210 |
211 | macc = []
212 | for th in self.ann_level_macc_ths:
213 | acc = (iou > th).sum().item() / len(iou)
214 | macc.append(acc)
215 | key = f"Ann-level macc iou {self.ann_level_macc_ths[0]}:{self.ann_level_macc_ths[-1]}"
216 | acc_all[key] = mean(macc) * 100
217 |
218 | # get the acc for copy, so that we can output the results as a table, round to 2 decimal places
220 | acc_all['Ann-level accs for copy'] = [
221 | round(acc_all[key], 2) for key in acc_all
222 | ]
223 |
224 | # Size evaluation
225 | small_size_list = []
226 | medium_size_list = []
227 | large_size_list = []
228 | for idx, gt_data in enumerate(dataset_split):
229 | gt_bbox = gt_data['bbox']
230 | obj_size = math.sqrt(gt_bbox[2]*gt_bbox[3])
231 | iou_item = iou[idx]
232 |
233 | # gather the small, medium and large size bboxes
234 | if obj_size < self.small_size_th:
235 | small_size_list.append(iou_item)
236 | elif obj_size <= self.large_size_th:
237 | medium_size_list.append(iou_item)
238 | else:
239 | large_size_list.append(iou_item)
240 |
241 | # small size evaluation
242 | for th in self.size_level_acc_ths:
243 | acc_small = sum(i > th for i in small_size_list) / len(small_size_list) * 100
244 | key = f"Small acc iou {th}"
245 | acc_all[key] = acc_small
246 |
247 | macc_small = []
248 | for th in self.size_level_macc_ths:
249 | small_size_list = np.array(small_size_list)
250 | acc_small = (small_size_list > th).sum().item() / len(small_size_list) * 100
251 | macc_small.append(acc_small)
252 | key = f"Small macc iou {self.size_level_macc_ths[0]}:{self.size_level_macc_ths[-1]}"
253 | acc_all[key] = mean(macc_small)
254 |
255 | # medium size evaluation
256 | for th in self.size_level_acc_ths:
257 | acc_medium = sum(i > th for i in medium_size_list) / len(medium_size_list) * 100
258 | key = f"Medium acc iou {th}"
259 | acc_all[key] = acc_medium
260 |
261 | macc_medium = []
262 | for th in self.size_level_macc_ths:
263 | medium_size_list = np.array(medium_size_list)
264 | acc_medium = (medium_size_list > th).sum().item() / len(medium_size_list) * 100
265 | macc_medium.append(acc_medium)
266 | key = f"Medium macc iou {self.size_level_macc_ths[0]}:{self.size_level_macc_ths[-1]}"
267 | acc_all[key] = mean(macc_medium)
268 |
269 | # large size evaluation
270 | for th in self.size_level_acc_ths:
271 | acc_large = sum(i > th for i in large_size_list) / len(large_size_list) * 100
272 | key = f"Large acc iou {th}"
273 | acc_all[key] = acc_large
274 |
275 | macc_large = []
276 | for th in self.size_level_macc_ths:
277 | large_size_list = np.array(large_size_list)
278 | acc_large = (large_size_list > th).sum().item() / len(large_size_list) * 100
279 | macc_large.append(acc_large)
280 | key = f"Large macc iou {self.size_level_macc_ths[0]}:{self.size_level_macc_ths[-1]}"
281 | acc_all[key] = mean(macc_large)
282 |
283 | # get the size-level acc for copy, so that we can output the results as a table, round to 2 decimal places
284 | acc_all['Size level accs for copy'] = [
285 | round(acc_all[key], 2) for key in acc_all
286 | if 'Small' in key or 'Medium' in key or 'Large' in key
287 | ]
288 |
289 | # Average class-level evaluation
290 | iou_avg_cls_level_acc_ths = dict()
291 | for idx, gt_data in enumerate(dataset_split):
292 | iou_item = iou[idx]
293 | if gt_data['ori_category_id'] in mapping_dict:
294 | ori_category_id = mapping_dict[gt_data['ori_category_id']]
295 | else:
296 | ori_category_id = gt_data['ori_category_id']
297 |
298 | if ori_category_id not in iou_avg_cls_level_acc_ths:
299 | iou_avg_cls_level_acc_ths[ori_category_id] = []
300 | iou_avg_cls_level_acc_ths[ori_category_id].append(iou_item)
301 |
302 | for th in self.avg_cls_level_acc_ths:
303 | acc_list = []
304 | for key in iou_avg_cls_level_acc_ths:
305 | iou_array = np.array(iou_avg_cls_level_acc_ths[key])
306 | acc = (iou_array > th).sum().item() / len(iou_array) * 100
307 | acc_list.append(acc)
308 | key = f"Average class-level acc iou {th}"
309 | acc_all[key] = mean(acc_list)
310 |
311 | # macc
312 | macc_list = []
313 | for th in self.avg_cls_level_macc_ths:
314 | acc_list = []
315 | for key in iou_avg_cls_level_acc_ths:
316 | iou_array = np.array(iou_avg_cls_level_acc_ths[key])
317 | acc = (iou_array > th).sum().item() / len(iou_avg_cls_level_acc_ths[key]) * 100
318 | acc_list.append(acc)
319 | macc_list.append(mean(acc_list))
320 | key = f"Average class-level macc iou {self.avg_cls_level_macc_ths[0]}:{self.avg_cls_level_macc_ths[-1]}"
321 | acc_all[key] = mean(macc_list)
322 |
323 | # get the avg_cls-level acc for copy, so that we can output the results as a table, round to 2 decimal places
324 | acc_all['Avg class-level accs for copy'] = [
325 | round(acc_all[key], 2) for key in acc_all
326 | if 'Average class-level' in key
327 | ]
328 |
329 |
330 | # Output as table
331 | table = []
332 | table.append([f"Item for split {self.split}", "Value"])
333 | for k, v in acc_all.items():
334 | if isinstance(v, list):
335 | table.append([k, ", ".join(map(str, v))])
336 | else:
337 | table.append([k, v])
338 |
339 | # Define where to add horizontal lines
340 |         horizontal_lines = {1, 6, 13}  # after the header, the annotation-level block, and the size-level block
341 |
342 | # Print table with selective horizontal lines
343 | max_len = max(len(row[0]) for row in table)
344 | for i, row in enumerate(table):
345 | if i in horizontal_lines:
346 | print('-' * (max_len + 3 + max(len(str(r[1])) for r in table)))
347 | print(f"{row[0].ljust(max_len)} | {row[1]}")
348 |
349 | if(save_file is not None):
350 | acc_all.pop('Ann-level accs for copy')
351 | acc_all.pop('Size level accs for copy')
352 | acc_all.pop('Avg class-level accs for copy')
353 | df=pd.DataFrame(acc_all, index=[0])
354 | df.to_csv(save_file)
355 |
356 | return acc_all
357 |
358 | # Example usage:
359 | if __name__ == '__main__':
360 | from dataloader import RefL4Dataset
361 | import json
362 | custom_transforms = None
363 |     ref_l4_dataset = RefL4Dataset('JierunChen/Ref-L4', split='all', custom_transforms=custom_transforms)
364 | print("Dataset loaded. Length:", len(ref_l4_dataset))
365 |
366 | evaluator = RefL4Evaluator(dataset=ref_l4_dataset)
367 |
368 |     pred_json_path = "./demo_models/cogvlm_grounding_pred.json"
369 | with open(pred_json_path, 'r') as f:
370 | predictions = json.load(f)
371 | evaluator.evaluate(predictions)
--------------------------------------------------------------------------------
/ref_l4/overlaps.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) OpenMMLab. All rights reserved.
2 | import torch
3 |
4 |
5 | def fp16_clamp(x, min=None, max=None):
6 | if not x.is_cuda and x.dtype == torch.float16:
7 | # clamp for cpu float16, tensor fp16 has no clamp implementation
8 | return x.float().clamp(min, max).half()
9 |
10 | return x.clamp(min, max)
11 |
12 |
13 | def bbox_overlaps(bboxes1, bboxes2, mode='iou', is_aligned=False, eps=1e-6):
14 | """Calculate overlap between two set of bboxes.
15 |
16 | FP16 Contributed by https://github.com/open-mmlab/mmdetection/pull/4889
17 | Note:
18 | Assume bboxes1 is M x 4, bboxes2 is N x 4, when mode is 'iou',
19 | there are some new generated variable when calculating IOU
20 | using bbox_overlaps function:
21 |
22 | 1) is_aligned is False
23 | area1: M x 1
24 | area2: N x 1
25 | lt: M x N x 2
26 | rb: M x N x 2
27 | wh: M x N x 2
28 | overlap: M x N x 1
29 | union: M x N x 1
30 | ious: M x N x 1
31 |
32 | Total memory:
33 | S = (9 x N x M + N + M) * 4 Byte,
34 |
35 | When using FP16, we can reduce:
36 | R = (9 x N x M + N + M) * 4 / 2 Byte
37 | R large than (N + M) * 4 * 2 is always true when N and M >= 1.
38 | Obviously, N + M <= N * M < 3 * N * M, when N >=2 and M >=2,
39 | N + 1 < 3 * N, when N or M is 1.
40 |
41 | Given M = 40 (ground truth), N = 400000 (three anchor boxes
42 | in per grid, FPN, R-CNNs),
43 | R = 275 MB (one times)
44 |
45 | A special case (dense detection), M = 512 (ground truth),
46 | R = 3516 MB = 3.43 GB
47 |
48 | When the batch size is B, reduce:
49 | B x R
50 |
51 | Therefore, CUDA memory runs out frequently.
52 |
53 | Experiments on GeForce RTX 2080Ti (11019 MiB):
54 |
55 | | dtype | M | N | Use | Real | Ideal |
56 | |:----:|:----:|:----:|:----:|:----:|:----:|
57 | | FP32 | 512 | 400000 | 8020 MiB | -- | -- |
58 | | FP16 | 512 | 400000 | 4504 MiB | 3516 MiB | 3516 MiB |
59 | | FP32 | 40 | 400000 | 1540 MiB | -- | -- |
60 | | FP16 | 40 | 400000 | 1264 MiB | 276MiB | 275 MiB |
61 |
62 | 2) is_aligned is True
63 | area1: N x 1
64 | area2: N x 1
65 | lt: N x 2
66 | rb: N x 2
67 | wh: N x 2
68 | overlap: N x 1
69 | union: N x 1
70 | ious: N x 1
71 |
72 | Total memory:
73 | S = 11 x N * 4 Byte
74 |
75 | When using FP16, we can reduce:
76 | R = 11 x N * 4 / 2 Byte
77 |
78 | So do the 'giou' (large than 'iou').
79 |
80 | Time-wise, FP16 is generally faster than FP32.
81 |
82 | When gpu_assign_thr is not -1, it takes more time on cpu
83 | but not reduce memory.
84 | There, we can reduce half the memory and keep the speed.
85 |
86 | If ``is_aligned`` is ``False``, then calculate the overlaps between each
87 | bbox of bboxes1 and bboxes2, otherwise the overlaps between each aligned
88 | pair of bboxes1 and bboxes2.
89 |
90 | Args:
91 |         bboxes1 (Tensor): shape (B, m, 4) in <x1, y1, x2, y2> format or empty.
92 |         bboxes2 (Tensor): shape (B, n, 4) in <x1, y1, x2, y2> format or empty.
93 | B indicates the batch dim, in shape (B1, B2, ..., Bn).
94 | If ``is_aligned`` is ``True``, then m and n must be equal.
95 | mode (str): "iou" (intersection over union), "iof" (intersection over
96 | foreground) or "giou" (generalized intersection over union).
97 | Default "iou".
98 | is_aligned (bool, optional): If True, then m and n must be equal.
99 | Default False.
100 | eps (float, optional): A value added to the denominator for numerical
101 | stability. Default 1e-6.
102 |
103 | Returns:
104 | Tensor: shape (m, n) if ``is_aligned`` is False else shape (m,)
105 |
106 | Example:
107 | >>> bboxes1 = torch.FloatTensor([
108 | >>> [0, 0, 10, 10],
109 | >>> [10, 10, 20, 20],
110 | >>> [32, 32, 38, 42],
111 | >>> ])
112 | >>> bboxes2 = torch.FloatTensor([
113 | >>> [0, 0, 10, 20],
114 | >>> [0, 10, 10, 19],
115 | >>> [10, 10, 20, 20],
116 | >>> ])
117 | >>> overlaps = bbox_overlaps(bboxes1, bboxes2)
118 | >>> assert overlaps.shape == (3, 3)
119 | >>> overlaps = bbox_overlaps(bboxes1, bboxes2, is_aligned=True)
120 | >>> assert overlaps.shape == (3, )
121 |
122 | Example:
123 | >>> empty = torch.empty(0, 4)
124 | >>> nonempty = torch.FloatTensor([[0, 0, 10, 9]])
125 | >>> assert tuple(bbox_overlaps(empty, nonempty).shape) == (0, 1)
126 | >>> assert tuple(bbox_overlaps(nonempty, empty).shape) == (1, 0)
127 | >>> assert tuple(bbox_overlaps(empty, empty).shape) == (0, 0)
128 | """
129 |
130 | assert mode in ['iou', 'iof', 'giou'], f'Unsupported mode {mode}'
131 | # Either the boxes are empty or the length of boxes' last dimension is 4
132 | assert (bboxes1.size(-1) == 4 or bboxes1.size(0) == 0)
133 | assert (bboxes2.size(-1) == 4 or bboxes2.size(0) == 0)
134 |
135 | # Batch dim must be the same
136 | # Batch dim: (B1, B2, ... Bn)
137 | assert bboxes1.shape[:-2] == bboxes2.shape[:-2]
138 | batch_shape = bboxes1.shape[:-2]
139 |
140 | rows = bboxes1.size(-2)
141 | cols = bboxes2.size(-2)
142 | if is_aligned:
143 | assert rows == cols
144 |
145 | if rows * cols == 0:
146 | if is_aligned:
147 | return bboxes1.new(batch_shape + (rows, ))
148 | else:
149 | return bboxes1.new(batch_shape + (rows, cols))
150 |
151 | area1 = (bboxes1[..., 2] - bboxes1[..., 0]) * (
152 | bboxes1[..., 3] - bboxes1[..., 1])
153 | area2 = (bboxes2[..., 2] - bboxes2[..., 0]) * (
154 | bboxes2[..., 3] - bboxes2[..., 1])
155 |
156 | if is_aligned:
157 | lt = torch.max(bboxes1[..., :2], bboxes2[..., :2]) # [B, rows, 2]
158 | rb = torch.min(bboxes1[..., 2:], bboxes2[..., 2:]) # [B, rows, 2]
159 |
160 | wh = fp16_clamp(rb - lt, min=0)
161 | overlap = wh[..., 0] * wh[..., 1]
162 |
163 | if mode in ['iou', 'giou']:
164 | union = area1 + area2 - overlap
165 | else:
166 | union = area1
167 | if mode == 'giou':
168 | enclosed_lt = torch.min(bboxes1[..., :2], bboxes2[..., :2])
169 | enclosed_rb = torch.max(bboxes1[..., 2:], bboxes2[..., 2:])
170 | else:
171 | lt = torch.max(bboxes1[..., :, None, :2],
172 | bboxes2[..., None, :, :2]) # [B, rows, cols, 2]
173 | rb = torch.min(bboxes1[..., :, None, 2:],
174 | bboxes2[..., None, :, 2:]) # [B, rows, cols, 2]
175 |
176 | wh = fp16_clamp(rb - lt, min=0)
177 | overlap = wh[..., 0] * wh[..., 1]
178 |
179 | if mode in ['iou', 'giou']:
180 | union = area1[..., None] + area2[..., None, :] - overlap
181 | else:
182 | union = area1[..., None]
183 | if mode == 'giou':
184 | enclosed_lt = torch.min(bboxes1[..., :, None, :2],
185 | bboxes2[..., None, :, :2])
186 | enclosed_rb = torch.max(bboxes1[..., :, None, 2:],
187 | bboxes2[..., None, :, 2:])
188 |
189 | eps = union.new_tensor([eps])
190 | union = torch.max(union, eps)
191 | ious = overlap / union
192 | if mode in ['iou', 'iof']:
193 | return ious
194 | # calculate gious
195 | enclose_wh = fp16_clamp(enclosed_rb - enclosed_lt, min=0)
196 | enclose_area = enclose_wh[..., 0] * enclose_wh[..., 1]
197 | enclose_area = torch.max(enclose_area, eps)
198 | gious = ious - (enclose_area - union) / enclose_area
199 | return gious
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 | setup(
4 | name='ref_l4',
5 | version='1.1',
6 | packages=find_packages(),
8 | install_requires=[
9 |         'torch', 'datasets', 'pillow', 'pandas', 'numpy', 'tqdm', 'huggingface_hub'
12 | ],
13 | entry_points={
14 | 'console_scripts': [
15 | # Add any command line scripts here
16 | # e.g., 'mycommand = mypackage.module:function'
17 | ],
18 | },
19 | author='Jierun Chen',
20 | description='Evaluation code for Ref-L4, a new REC benchmark in the LMM era.',
21 | long_description=open('README.md').read(),
22 | long_description_content_type='text/markdown',
23 |     url='https://github.com/JierunChen/Ref-L4',
24 | classifiers=[
25 | 'Programming Language :: Python :: 3',
26 | 'License :: OSI Approved :: MIT License',
27 | 'Operating System :: OS Independent',
28 | ],
29 | license='MIT',
30 | python_requires='>=3.6',
31 | )
--------------------------------------------------------------------------------