├── LICENSE
├── README.md
├── demo_models
    └── cogvlm_grounding_pred.json
├── evaluate_pred.py
├── figure
    ├── error_samples.png
    └── examples.png
├── ref_l4
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-311.pyc
    │   ├── dataloader.cpython-311.pyc
    │   ├── evaluation.cpython-311.pyc
    │   └── overlaps.cpython-311.pyc
    ├── dataloader.py
    ├── evaluation.py
    └── overlaps.py
└── setup.py

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2024 JierunChen
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Ref-L4: a New REC Benchmark in the LMM Era
2 | 
3 | This repository contains the data loader and evaluation code for our [Ref-L4](https://huggingface.co/datasets/JierunChen/Ref-L4), a new REC benchmark in the LMM era. Check out our paper [Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models](https://arxiv.org/abs/2406.16866) for more details.
4 | 
5 | ## 🧭 Introduction
6 | 
7 | Referring expression comprehension (REC) is the task of locating a specific region in an image based on a textual description. Current popular REC benchmarks, such as RefCOCO, RefCOCO+, and RefCOCOg, may not fully capture the comprehensive capabilities of large multimodal models (LMMs) due to limitations in:
8 | 
9 | - **Linguistic complexity**: Expressions tend to be overly simple, often consisting of single words like "yellow," "lady," or "far left person."
10 | - **Visual complexity**: The benchmarks are restricted to a limited set of object categories (fewer than 80).
11 | - **Labeling accuracy**: RefCOCO, RefCOCO+, and RefCOCOg have high labeling error rates (14%, 24%, and 5%, respectively).
12 | 
13 | In response, we rigorously exclude the problematic instances and reevaluate LMMs, showing significant accuracy improvements. We also introduce Ref-L4, a new and more comprehensive REC benchmark to evaluate LMMs. Ref-L4 offers:
14 | 
15 | - **Lengthy referring expressions** averaging 24.2 words;
16 | - **Extensive vocabulary** comprising 22,813 unique words;
17 | - **Diverse objects** covering 365 categories with varying instance scales (from 30 to 3,767);
18 | - **Substantial sample size** with 45,341 annotations.
19 | 
20 | ### 😃 Ref-L4 examples
21 | 
22 | ![Ref-L4 examples](./figure/examples.png)
23 | 
24 | ### 😑 Labeling errors in RefCOCO, +, g
25 | 
26 | In the REC task, a referring expression should uniquely describe an instance, which is represented by an accurate bounding box. We have identified and visualized three common types of labeling errors in the RefCOCO, RefCOCO+, and RefCOCOg benchmarks: (a) non-unique referring expressions, which refer to multiple instances within the same image; (b) inaccurate bounding boxes; and (c) misalignment between target instances and their referring expressions, where the referring expressions are either ambiguous or do not refer to any instance in the image.
27 | 
28 | ![Labeling errors in RefCOCO, RefCOCO+, and RefCOCOg](./figure/error_samples.png)
29 | 
30 | ## 🛠️ Installation
31 | 
32 | ```bash
33 | git clone https://github.com/JierunChen/Ref-L4.git
34 | cd Ref-L4
35 | pip install ./
36 | ```
37 | 
38 | ## 🚀 Loading Ref-L4 Dataset
39 | 
40 | ```python
41 | from ref_l4 import RefL4Dataset
42 | ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='all')
43 | print(len(ref_l4_dataset))
44 | # iterate over the dataset
45 | for img_pil, data in ref_l4_dataset:
46 |     print(data)
47 |     break
48 | 
49 | # switch to the 'val' or 'test' split
50 | ref_l4_dataset.change_split('val')
51 | print(len(ref_l4_dataset))
52 | ```
53 | 
54 | The [Ref-L4 dataset](https://huggingface.co/datasets/JierunChen/Ref-L4) can also be downloaded from Hugging Face.
55 | 
56 | 
57 | ## 🗂️ Evaluation
58 | 
59 | After getting model predictions, we can save them as a JSON file, which is a list of dictionaries, each containing three keys: `[id, pred_bbox, format]`. The `id` is the annotation ID, and the `format` specifies the format of `pred_bbox`, which should be either `xyxy` or `xywh`.
60 | A sample prediction file can be found [here](./demo_models/cogvlm_grounding_pred.json). It was obtained by running the CogVLM-Grounding model.
61 | 
62 | Now we can run the following command to evaluate the predictions against the ground truth.
63 | 
64 | ```bash
65 | python evaluate_pred.py --dataset_path JierunChen/Ref-L4 --split all --pred_json_path ./demo_models/cogvlm_grounding_pred.json
66 | ```
67 | 
68 | Here is an example output for the predictions from CogVLM-Grounding on the "all" split:
69 | 
70 | ```txt
71 | Item for split all | Value
72 | --------------------------------------------------------------------------------
73 | Ann-level acc iou 0.5 | 81.69868331091065
74 | Ann-level acc iou 0.75 | 70.76597340155709
75 | Ann-level acc iou 0.9 | 48.351381751615534
76 | Ann-level macc iou 0.5:0.95 | 66.08808804393375
77 | Ann-level accs for copy | 81.7, 70.77, 48.35, 66.09
78 | --------------------------------------------------------------------------------
79 | Small acc iou 0.5 | 75.0561797752809
80 | Small macc iou 0.5:0.95 | 52.853932584269664
81 | Medium acc iou 0.5 | 86.43470790378007
82 | Medium macc iou 0.5:0.95 | 71.31099656357388
83 | Large acc iou 0.5 | 77.90972003774772
84 | Large macc iou 0.5:0.95 | 66.25196602705253
85 | Size level accs for copy | 75.06, 52.85, 86.43, 71.31, 77.91, 66.25
86 | --------------------------------------------------------------------------------
87 | Average class-level acc iou 0.5 | 72.42029130234654
88 | Average class-level macc iou 0.5:0.95 | 52.56151912966669
89 | Avg class-level accs for copy | 72.42, 52.56
90 | ```
91 | 
92 | The output includes three parts:
93 | - The accuracy under various IoU thresholds and the average accuracy over IoU thresholds from 0.5 to 0.95 with a stride of 0.05.
94 | - The accuracy for small, medium, and large objects.
95 | - The average accuracy across different classes.
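For reference, here is a minimal sketch of how such a prediction file can be produced and scored directly in Python. It relies only on the `RefL4Dataset` and `RefL4Evaluator` classes shipped in this repository; `run_my_model` is a hypothetical placeholder for your own REC model and is assumed to return a single `[x1, y1, x2, y2]` box in image coordinates.

```python
import json
from ref_l4 import RefL4Dataset, RefL4Evaluator

ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='all')

predictions = []
for img_pil, data in ref_l4_dataset:
    # run_my_model is a placeholder for your own model; it is assumed to
    # return one predicted box as [x1, y1, x2, y2] in image coordinates.
    pred_bbox = run_my_model(img_pil, data)
    predictions.append({
        "id": data["id"],        # annotation ID from the dataset
        "pred_bbox": pred_bbox,  # predicted bounding box
        "format": "xyxy",        # or "xywh", matching pred_bbox
    })

# Save the predictions for later use with evaluate_pred.py ...
with open("my_model_pred.json", "w") as f:
    json.dump(predictions, f)

# ... or evaluate them directly in Python.
evaluator = RefL4Evaluator(dataset=ref_l4_dataset)
evaluator.evaluate(predictions)
```

Note that `evaluate` expects one prediction for every annotation in the chosen split and raises an error on missing or duplicate IDs.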
96 | 97 | 98 | ## 📖 Dataset License 99 | 100 | The Ref-L4 dataset is released under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license](https://creativecommons.org/licenses/by-nc/4.0/). Please note that the images in the Ref-L4 dataset are derived from the following datasets, each with their respective licenses: 101 | - **RefCOCO**: Licensed under the [Apache-2.0 license](http://www.apache.org/licenses/). 102 | - **RefCOCO+**: Licensed under the [Apache-2.0 license](http://www.apache.org/licenses/). 103 | - **RefCOCOg**: Licensed under the [Creative Commons Attribution 4.0 International (CC BY 4.0) license](https://creativecommons.org/licenses/by/4.0/). 104 | - **COCO 2014**: Licensed under the [Creative Commons Attribution 4.0 International (CC BY 4.0) license](https://creativecommons.org/licenses/by/4.0/). 105 | - **Objects365**: Licensed under the [Creative Commons Attribution 4.0 International (CC BY 4.0) license](http://creativecommons.org/licenses/by/4.0/). 106 | 107 | By using the Ref-L4 dataset, you agree to comply with the licensing terms of these source datasets. 108 | -------------------------------------------------------------------------------- /evaluate_pred.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | from ref_l4 import RefL4Dataset, RefL4Evaluator 4 | 5 | def main(args): 6 | custom_transforms = None 7 | ref_l4_dataset = RefL4Dataset(args.dataset_path, split=args.split, custom_transforms=custom_transforms) 8 | print("Dataset loaded. Length:", len(ref_l4_dataset)) 9 | 10 | evaluator = RefL4Evaluator(dataset=ref_l4_dataset) 11 | 12 | with open(args.pred_json_path, 'r') as f: 13 | predictions = json.load(f) 14 | evaluator.evaluate(predictions) 15 | 16 | if __name__ == '__main__': 17 | parser = argparse.ArgumentParser(description='Evaluate the predictions on Ref-L4 dataset.') 18 | parser.add_argument('--dataset_path', type=str, default="JierunChen/Ref-L4", help='Path to the Ref-L4 dataset.') 19 | parser.add_argument('--split', type=str, default='all', choices=['val', 'test', 'all'], help='Dataset split to use (val, test, all).') 20 | parser.add_argument('--pred_json_path', type=str, required=True, help='Path to the predictions JSON file.') 21 | 22 | args = parser.parse_args() 23 | 24 | main(args) 25 | -------------------------------------------------------------------------------- /figure/error_samples.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/figure/error_samples.png -------------------------------------------------------------------------------- /figure/examples.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/figure/examples.png -------------------------------------------------------------------------------- /ref_l4/__init__.py: -------------------------------------------------------------------------------- 1 | from .dataloader import RefL4Dataset 2 | from .evaluation import RefL4Evaluator 3 | 4 | __all__ = ['RefL4Dataset', 'RefL4Evaluator'] -------------------------------------------------------------------------------- /ref_l4/__pycache__/__init__.cpython-311.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/__init__.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/__pycache__/dataloader.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/dataloader.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/__pycache__/evaluation.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/evaluation.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/__pycache__/overlaps.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JierunChen/Ref-L4/bae942c852b52c5be1d8214c978e984294793b91/ref_l4/__pycache__/overlaps.cpython-311.pyc
--------------------------------------------------------------------------------
/ref_l4/dataloader.py:
--------------------------------------------------------------------------------
1 | from torch.utils.data import Dataset
2 | from datasets import load_dataset, concatenate_datasets
3 | from huggingface_hub import hf_hub_download
4 | from PIL import Image
5 | import tarfile
6 | import io
7 | from tqdm import tqdm
8 | 
9 | class RefL4Dataset(Dataset):
10 |     def __init__(self, dataset_path, split, images_file='images.tar.gz', custom_transforms=None):
11 |         """
12 |         Initialize the RefL4Dataset class.
13 | 
14 |         Parameters:
15 |         - dataset_path (str): Path to a local copy of the dataset, or the Hugging Face dataset repo id (e.g., "JierunChen/Ref-L4").
16 |         - split (str): Dataset split, one of "val", "test", or "all".
17 |         - images_file (str): Name of the tar file containing the images, defaults to 'images.tar.gz'.
18 |         - custom_transforms: Custom image transformations to apply, defaults to None.
19 |         """
20 |         super(RefL4Dataset, self).__init__()
21 |         assert split in ['val', 'test', 'all'], 'split should be val, test or all'
22 |         self.dataset_path=dataset_path
23 |         self.split = split
24 |         self.images_file=images_file
25 |         self.transforms = custom_transforms
26 |         self._load_dataset()
27 | 
28 |     def _load_dataset(self):
29 |         self.dataset = load_dataset(self.dataset_path)
30 |         image_tar_path = f"{self.dataset_path}/{self.images_file}"
31 |         # download the images.tar.gz file from the Hugging Face Hub when given the repo id
32 |         if self.dataset_path == 'JierunChen/Ref-L4':
33 |             image_tar_path = hf_hub_download(repo_id="JierunChen/Ref-L4", filename="images.tar.gz", repo_type="dataset")
34 |             print(f"Downloaded images.tar.gz to {image_tar_path}")
35 | 
36 |         all_splits=concatenate_datasets([self.dataset['val'],self.dataset['test']])
37 |         self.dataset['all']=all_splits
38 |         self.images=self._load_images_from_tar(image_tar_path)
39 | 
40 |     def _load_images_from_tar(self, image_tar_path):
41 |         images = {}
42 |         print(f"Loading images from {image_tar_path}")
43 |         with tarfile.open(image_tar_path, "r:gz") as tar:
44 |             for member in tqdm(tar.getmembers()):
45 |                 if member.isfile() and member.name.endswith(('jpg', 'jpeg', 'png', 'webp')):
46 |                     f = tar.extractfile(member)
47 |                     if f:
48 |                         image = Image.open(io.BytesIO(f.read()))
49 |                         # convert grayscale images to RGB if needed
50 |                         if image.mode != 'RGB':
51 |                             image = image.convert('RGB')
52 |                         # index images by their file name inside the tar
53 | 
54 |                         images[member.name] = image
55 |         return images
56 | 
57 |     def change_split(self, split):
58 |         assert split in ['val', 'test', 'all'], 'split should be val, test or all'
59 |         self.split = split
60 | 
61 |     def __len__(self):
62 |         return len(self.dataset[self.split])
63 | 
64 |     def __getitem__(self, idx):
65 |         """
66 |         Returns:
67 |         - image (PIL.Image.Image or Tensor): The image, transformed if custom_transforms is provided.
68 |         - data (dict): Other sample data.
69 |         """
70 |         data = self.dataset[self.split][idx]
71 |         image = self.images[data['file_name']]
72 |         if self.transforms:
73 |             image = self.transforms(image)
74 |         return image, data
75 | 
76 | # Example usage:
77 | if __name__ == '__main__':
78 |     ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='all')
79 |     print(len(ref_l4_dataset))
80 |     print(ref_l4_dataset[0])
81 |     # switch to the val split
82 |     ref_l4_dataset.change_split('val')
83 |     print(len(ref_l4_dataset))
84 |     print(ref_l4_dataset[0])
85 |     # switch to the test split
86 |     ref_l4_dataset.change_split('test')
87 |     print(len(ref_l4_dataset))
88 |     print(ref_l4_dataset[0])
89 | 
90 |     ref_l4_dataset.change_split('all')
91 |     print(len(ref_l4_dataset))
92 |     print(ref_l4_dataset[0])
93 | 
--------------------------------------------------------------------------------
/ref_l4/evaluation.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2024 Jierun Chen. All rights reserved.
2 | import torch 3 | from statistics import mean 4 | import pandas as pd 5 | import math 6 | from datasets import load_dataset 7 | from .dataloader import RefL4Dataset 8 | from .overlaps import bbox_overlaps 9 | import numpy as np 10 | 11 | mapping_dict = { 12 | "refcoco_1": "o365_1", 13 | "refcoco_10": "o365_41", 14 | "refcoco_11": "o365_177", 15 | "refcoco_13": "o365_128", 16 | "refcoco_14": "o365_250", 17 | "refcoco_15": "o365_25", 18 | "refcoco_16": "o365_56", 19 | "refcoco_17": "o365_140", 20 | "refcoco_18": "o365_93", 21 | "refcoco_19": "o365_79", 22 | "refcoco_2": "o365_47", 23 | "refcoco_20": "o365_100", 24 | "refcoco_21": "o365_97", 25 | "refcoco_22": "o365_145", 26 | "refcoco_23": "o365_296", 27 | "refcoco_24": "o365_179", 28 | "refcoco_25": "o365_181", 29 | "refcoco_27": "o365_39", 30 | "refcoco_28": "o365_40", 31 | "refcoco_3": "o365_6", 32 | "refcoco_31": "o365_13", 33 | "refcoco_32": "o365_44", 34 | "refcoco_33": "o365_194", 35 | "refcoco_34": "o365_220", 36 | "refcoco_35": "o365_119", 37 | "refcoco_36": "o365_174", 38 | "refcoco_37": "o365_100000", # This is a placeholder, the actual value is not provided 39 | "refcoco_38": "o365_155", 40 | "refcoco_39": "o365_138", 41 | "refcoco_4": "o365_59", 42 | "refcoco_40": "o365_114", 43 | "refcoco_41": "o365_146", 44 | "refcoco_42": "o365_147", 45 | "refcoco_43": "o365_205", 46 | "refcoco_44": "o365_9", 47 | "refcoco_46": "o365_36", 48 | "refcoco_47": "o365_11", 49 | "refcoco_48": "o365_89", 50 | "refcoco_49": "o365_85", 51 | "refcoco_5": "o365_115", 52 | "refcoco_50": "o365_94", 53 | "refcoco_51": "o365_26", 54 | "refcoco_52": "o365_113", 55 | "refcoco_53": "o365_83", 56 | "refcoco_54": "o365_266", 57 | "refcoco_55": "o365_104", 58 | "refcoco_56": "o365_142", 59 | "refcoco_57": "o365_153", 60 | "refcoco_58": "o365_235", 61 | "refcoco_59": "o365_144", 62 | "refcoco_6": "o365_56", 63 | "refcoco_60": "o365_151", 64 | "refcoco_61": "o365_98", 65 | "refcoco_62": "o365_3", 66 | "refcoco_63": "o365_51", 67 | "refcoco_64": "o365_26", 68 | "refcoco_65": "o365_76", 69 | "refcoco_67": "o365_98", 70 | "refcoco_7": "o365_117", 71 | "refcoco_70": "o365_154", 72 | "refcoco_72": "o365_37", 73 | "refcoco_73": "o365_74", 74 | "refcoco_74": "o365_116", 75 | "refcoco_75": "o365_133", 76 | "refcoco_76": "o365_107", 77 | "refcoco_77": "o365_62", 78 | "refcoco_78": "o365_164", 79 | "refcoco_79": "o365_135", 80 | "refcoco_8": "o365_66", 81 | "refcoco_80": "o365_278", 82 | "refcoco_81": "o365_82", 83 | "refcoco_82": "o365_134", 84 | "refcoco_84": "o365_19", 85 | "refcoco_85": "o365_95", 86 | "refcoco_86": "o365_31", 87 | "refcoco_87": "o365_170", 88 | "refcoco_88": "o365_70", 89 | "refcoco_89": "o365_328", 90 | "refcoco_9": "o365_22", 91 | "refcoco_90": "o365_227", 92 | } 93 | 94 | 95 | class RefL4Evaluator: 96 | def __init__( 97 | self, 98 | dataset: RefL4Dataset, 99 | split=None, 100 | ann_level_acc_ths=[0.5, 0.75, 0.9], 101 | ann_level_macc_ths=[i/100 for i in range(50,100,5)], 102 | size_level_acc_ths=[0.5, ], 103 | size_level_macc_ths=[i/100 for i in range(50,100,5)], 104 | small_size_th=128, 105 | large_size_th=256, 106 | avg_cls_level_acc_ths=[0.5, ], 107 | avg_cls_level_macc_ths=[i/100 for i in range(50,100,5)], 108 | ) -> None: 109 | ''' 110 | dataset (RefL4Dataset): The RefL4Dataset dataset for evaluation. 111 | split (str): The split of the dataset to evaluate. If None, use the dataset's split. Default is None. 112 | ann_level_acc_ths (List[float]): The thresholds to evaluate the annotation level accuracy. Default is [0.5, 0.75, 0.9]. 
113 | ann_level_macc_ths (List[float]): The thresholds to evaluate the annotation level mAcc. Default is [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]. 114 | size_level_acc_ths (List[float]): The thresholds to evaluate the size level accuracy. Default is [0.5, ]. 115 | size_level_macc_ths (List[float]): The thresholds to evaluate the size level mAcc. Default is [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]. 116 | small_size_th (int): The threshold to define the small size bbox. Default is 128. 117 | large_size_th (int): The threshold to define the large size bbox. Default is 256. 118 | avg_cls_level_acc_ths (List[float]): The thresholds to evaluate the average of all class level accuracy. Default is [0.5, ]. 119 | avg_cls_level_macc_ths (List[float]): The thresholds to evaluate the average of all class level mAcc. Default is [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]. 120 | ''' 121 | self.dataset = dataset.dataset 122 | self.split=split 123 | if split is None: 124 | self.split=dataset.split 125 | 126 | self.ann_level_acc_ths = ann_level_acc_ths 127 | self.ann_level_macc_ths = ann_level_macc_ths 128 | self.size_level_acc_ths = size_level_acc_ths 129 | self.size_level_macc_ths = size_level_macc_ths 130 | self.avg_cls_level_acc_ths = avg_cls_level_acc_ths 131 | self.avg_cls_level_macc_ths = avg_cls_level_macc_ths 132 | 133 | self.small_size_th=small_size_th 134 | self.large_size_th=large_size_th 135 | 136 | @staticmethod 137 | def calculate_iou_acc(bboxes_1, bboxes_2, thresh=0.5): 138 | """ 139 | bboxes_1 (torch.Tensor, numpy.Array): shape=[N,4], format=[x1, y1, x2, y2] 140 | bboxes_2 (torch.Tensor, numpy.Array): shape=[N,4], format=[x1, y1, x2, y2] 141 | calculate the iou and acc of the pred_bboxes and gt_bboxes, 142 | if iou(pred_bboxes[i],gt_bboxes[i])>0.5, then acc+=1 143 | all pred_bboxes_i and gt_bboxes_i are one to one assigned. 144 | 145 | """ 146 | iou=bbox_overlaps(bboxes_1,bboxes_2,mode='iou', is_aligned=True) 147 | if(type(thresh) is not list): 148 | thresh=[thresh] 149 | accs=dict() 150 | for t in thresh: 151 | accs[t]=(iou>t).sum().item()/len(iou) 152 | return iou,accs 153 | 154 | def evaluate(self, predictions, save_file=None): 155 | """ 156 | Evaluate given dataset and predictions. 157 | 158 | Parameters: 159 | - predictions (List(Dict)): The predictions to evaluate. 160 | Each item in the list is a dict, containing the keys: 'pred_bbox', 'id' and 'format', 161 | where 'id' is the annotation id, 'format' is the bbox format 'xyxy' or 'xywh'. 162 | e.g.: 163 | [ 164 | { 165 | 'pred_bbox': [x1, y1, x2, y2], 166 | 'id': '000000', 167 | 'format': 'xyxy' 168 | }, 169 | ... 170 | ] 171 | - save_file (str): The file to save the evaluation results to. 
172 | """ 173 | if(len(predictions)==0): 174 | print("Warning: No predictions found.") 175 | return dict() 176 | 177 | # convert predictions to a dict, key is the id, raise error if there are duplicate ids 178 | predictions_dict={pred['id']:pred for pred in predictions} 179 | if(len(predictions)!=len(predictions_dict)): 180 | raise ValueError("Duplicate ids found in the predictions.") 181 | 182 | gt_bboxes = [] 183 | pred_bboxes = [] 184 | dataset_split = self.dataset[self.split] 185 | 186 | for idx, gt_data in enumerate(dataset_split): 187 | gt_bbox = gt_data['bbox'] 188 | gt_bboxes.append([gt_bbox[0], gt_bbox[1], gt_bbox[0]+gt_bbox[2], gt_bbox[1]+gt_bbox[3]]) 189 | 190 | # raise error if the id canot be found in the predictions 191 | if gt_data['id'] not in predictions_dict: 192 | raise ValueError(f"Id {gt_data['id']} not found in the predictions.") 193 | 194 | pred_bbox = predictions_dict[gt_data['id']]['pred_bbox'] 195 | if predictions_dict[gt_data['id']]['format'] == 'xywh': 196 | pred_bbox = [pred_bbox[0], pred_bbox[1], pred_bbox[0]+pred_bbox[2], pred_bbox[1]+pred_bbox[3]] 197 | pred_bboxes.append(pred_bbox) 198 | 199 | 200 | # calculate_iou 201 | iou = bbox_overlaps(torch.tensor(gt_bboxes), torch.tensor(pred_bboxes), mode='iou', is_aligned=True).numpy() 202 | 203 | acc_all = dict() 204 | 205 | # Annotation level evaluation 206 | for th in self.ann_level_acc_ths: 207 | acc = (iou > th).sum().item() / len(iou) 208 | key = f"Ann-level acc iou {th}" 209 | acc_all[key] = acc * 100 210 | 211 | macc = [] 212 | for th in self.ann_level_macc_ths: 213 | acc = (iou > th).sum().item() / len(iou) 214 | macc.append(acc) 215 | key = f"Ann-level macc iou {self.ann_level_macc_ths[0]}:{self.ann_level_macc_ths[-1]}" 216 | acc_all[key] = mean(macc) * 100 217 | 218 | # get the acc for copy, so that we can output the results as a table, round to 2 decimal places 219 | # acc_all['Ann-level accs for copy'] = [acc_all[key] for key in acc_all] 220 | acc_all['Ann-level accs for copy'] = [ 221 | round(acc_all[key], 2) for key in acc_all 222 | ] 223 | 224 | # Size evaluation 225 | small_size_list = [] 226 | medium_size_list = [] 227 | large_size_list = [] 228 | for idx, gt_data in enumerate(dataset_split): 229 | gt_bbox = gt_data['bbox'] 230 | obj_size = math.sqrt(gt_bbox[2]*gt_bbox[3]) 231 | iou_item = iou[idx] 232 | 233 | # gather the small, medium and large size bboxes 234 | if obj_size < self.small_size_th: 235 | small_size_list.append(iou_item) 236 | elif obj_size <= self.large_size_th: 237 | medium_size_list.append(iou_item) 238 | else: 239 | large_size_list.append(iou_item) 240 | 241 | # small size evaluation 242 | for th in self.size_level_acc_ths: 243 | acc_small = sum(i > th for i in small_size_list) / len(small_size_list) * 100 244 | key = f"Small acc iou {th}" 245 | acc_all[key] = acc_small 246 | 247 | macc_small = [] 248 | for th in self.size_level_macc_ths: 249 | small_size_list = np.array(small_size_list) 250 | acc_small = (small_size_list > th).sum().item() / len(small_size_list) * 100 251 | macc_small.append(acc_small) 252 | key = f"Small macc iou {self.size_level_macc_ths[0]}:{self.size_level_macc_ths[-1]}" 253 | acc_all[key] = mean(macc_small) 254 | 255 | # medium size evaluation 256 | for th in self.size_level_acc_ths: 257 | acc_medium = sum(i > th for i in medium_size_list) / len(medium_size_list) * 100 258 | key = f"Medium acc iou {th}" 259 | acc_all[key] = acc_medium 260 | 261 | macc_medium = [] 262 | for th in self.size_level_macc_ths: 263 | medium_size_list = np.array(medium_size_list) 
264 | acc_medium = (medium_size_list > th).sum().item() / len(medium_size_list) * 100 265 | macc_medium.append(acc_medium) 266 | key = f"Medium macc iou {self.size_level_macc_ths[0]}:{self.size_level_macc_ths[-1]}" 267 | acc_all[key] = mean(macc_medium) 268 | 269 | # large size evaluation 270 | for th in self.size_level_acc_ths: 271 | acc_large = sum(i > th for i in large_size_list) / len(large_size_list) * 100 272 | key = f"Large acc iou {th}" 273 | acc_all[key] = acc_large 274 | 275 | macc_large = [] 276 | for th in self.size_level_macc_ths: 277 | large_size_list = np.array(large_size_list) 278 | acc_large = (large_size_list > th).sum().item() / len(large_size_list) * 100 279 | macc_large.append(acc_large) 280 | key = f"Large macc iou {self.size_level_macc_ths[0]}:{self.size_level_macc_ths[-1]}" 281 | acc_all[key] = mean(macc_large) 282 | 283 | # get the size-level acc for copy, so that we can output the results as a table, round to 2 decimal places 284 | acc_all['Size level accs for copy'] = [ 285 | round(acc_all[key], 2) for key in acc_all 286 | if 'Small' in key or 'Medium' in key or 'Large' in key 287 | ] 288 | 289 | # Average class-level evaluation 290 | iou_avg_cls_level_acc_ths = dict() 291 | for idx, gt_data in enumerate(dataset_split): 292 | iou_item = iou[idx] 293 | if gt_data['ori_category_id'] in mapping_dict: 294 | ori_category_id = mapping_dict[gt_data['ori_category_id']] 295 | else: 296 | ori_category_id = gt_data['ori_category_id'] 297 | 298 | if ori_category_id not in iou_avg_cls_level_acc_ths: 299 | iou_avg_cls_level_acc_ths[ori_category_id] = [] 300 | iou_avg_cls_level_acc_ths[ori_category_id].append(iou_item) 301 | 302 | for th in self.avg_cls_level_acc_ths: 303 | acc_list = [] 304 | for key in iou_avg_cls_level_acc_ths: 305 | iou_array = np.array(iou_avg_cls_level_acc_ths[key]) 306 | acc = (iou_array > th).sum().item() / len(iou_array) * 100 307 | acc_list.append(acc) 308 | key = f"Average class-level acc iou {th}" 309 | acc_all[key] = mean(acc_list) 310 | 311 | # macc 312 | macc_list = [] 313 | for th in self.avg_cls_level_macc_ths: 314 | acc_list = [] 315 | for key in iou_avg_cls_level_acc_ths: 316 | iou_array = np.array(iou_avg_cls_level_acc_ths[key]) 317 | acc = (iou_array > th).sum().item() / len(iou_avg_cls_level_acc_ths[key]) * 100 318 | acc_list.append(acc) 319 | macc_list.append(mean(acc_list)) 320 | key = f"Average class-level macc iou {self.avg_cls_level_macc_ths[0]}:{self.avg_cls_level_macc_ths[-1]}" 321 | acc_all[key] = mean(macc_list) 322 | 323 | # get the avg_cls-level acc for copy, so that we can output the results as a table, round to 2 decimal places 324 | acc_all['Avg class-level accs for copy'] = [ 325 | round(acc_all[key], 2) for key in acc_all 326 | if 'Average class-level' in key 327 | ] 328 | 329 | 330 | # Output as table 331 | table = [] 332 | table.append([f"Item for split {self.split}", "Value"]) 333 | for k, v in acc_all.items(): 334 | if isinstance(v, list): 335 | table.append([k, ", ".join(map(str, v))]) 336 | else: 337 | table.append([k, v]) 338 | 339 | # Define where to add horizontal lines 340 | horizontal_lines = {1, 6, 13} # After header, IoU, and Subject evaluations 341 | 342 | # Print table with selective horizontal lines 343 | max_len = max(len(row[0]) for row in table) 344 | for i, row in enumerate(table): 345 | if i in horizontal_lines: 346 | print('-' * (max_len + 3 + max(len(str(r[1])) for r in table))) 347 | print(f"{row[0].ljust(max_len)} | {row[1]}") 348 | 349 | if(save_file is not None): 350 | acc_all.pop('Ann-level accs 
for copy') 351 | acc_all.pop('Size level accs for copy') 352 | acc_all.pop('Avg class-level accs for copy') 353 | df=pd.DataFrame(acc_all, index=[0]) 354 | df.to_csv(save_file) 355 | 356 | return acc_all 357 | 358 | # Example usage: 359 | if __name__ == '__main__': 360 | from dataloader import RefL4Dataset 361 | import json 362 | custom_transforms = None 363 | ref_l4_dataset = RefL4Dataset('/Users/jchen12/Documents/misc/Ref-L4', split='all', custom_transforms=custom_transforms) 364 | print("Dataset loaded. Length:", len(ref_l4_dataset)) 365 | 366 | evaluator = RefL4Evaluator(dataset=ref_l4_dataset) 367 | 368 | pred_json_path = "/Users/jchen12/Documents/misc/cogvlm_grounding_pred.json" 369 | with open(pred_json_path, 'r') as f: 370 | predictions = json.load(f) 371 | evaluator.evaluate(predictions) -------------------------------------------------------------------------------- /ref_l4/overlaps.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) OpenMMLab. All rights reserved. 2 | import torch 3 | 4 | 5 | def fp16_clamp(x, min=None, max=None): 6 | if not x.is_cuda and x.dtype == torch.float16: 7 | # clamp for cpu float16, tensor fp16 has no clamp implementation 8 | return x.float().clamp(min, max).half() 9 | 10 | return x.clamp(min, max) 11 | 12 | 13 | def bbox_overlaps(bboxes1, bboxes2, mode='iou', is_aligned=False, eps=1e-6): 14 | """Calculate overlap between two set of bboxes. 15 | 16 | FP16 Contributed by https://github.com/open-mmlab/mmdetection/pull/4889 17 | Note: 18 | Assume bboxes1 is M x 4, bboxes2 is N x 4, when mode is 'iou', 19 | there are some new generated variable when calculating IOU 20 | using bbox_overlaps function: 21 | 22 | 1) is_aligned is False 23 | area1: M x 1 24 | area2: N x 1 25 | lt: M x N x 2 26 | rb: M x N x 2 27 | wh: M x N x 2 28 | overlap: M x N x 1 29 | union: M x N x 1 30 | ious: M x N x 1 31 | 32 | Total memory: 33 | S = (9 x N x M + N + M) * 4 Byte, 34 | 35 | When using FP16, we can reduce: 36 | R = (9 x N x M + N + M) * 4 / 2 Byte 37 | R large than (N + M) * 4 * 2 is always true when N and M >= 1. 38 | Obviously, N + M <= N * M < 3 * N * M, when N >=2 and M >=2, 39 | N + 1 < 3 * N, when N or M is 1. 40 | 41 | Given M = 40 (ground truth), N = 400000 (three anchor boxes 42 | in per grid, FPN, R-CNNs), 43 | R = 275 MB (one times) 44 | 45 | A special case (dense detection), M = 512 (ground truth), 46 | R = 3516 MB = 3.43 GB 47 | 48 | When the batch size is B, reduce: 49 | B x R 50 | 51 | Therefore, CUDA memory runs out frequently. 52 | 53 | Experiments on GeForce RTX 2080Ti (11019 MiB): 54 | 55 | | dtype | M | N | Use | Real | Ideal | 56 | |:----:|:----:|:----:|:----:|:----:|:----:| 57 | | FP32 | 512 | 400000 | 8020 MiB | -- | -- | 58 | | FP16 | 512 | 400000 | 4504 MiB | 3516 MiB | 3516 MiB | 59 | | FP32 | 40 | 400000 | 1540 MiB | -- | -- | 60 | | FP16 | 40 | 400000 | 1264 MiB | 276MiB | 275 MiB | 61 | 62 | 2) is_aligned is True 63 | area1: N x 1 64 | area2: N x 1 65 | lt: N x 2 66 | rb: N x 2 67 | wh: N x 2 68 | overlap: N x 1 69 | union: N x 1 70 | ious: N x 1 71 | 72 | Total memory: 73 | S = 11 x N * 4 Byte 74 | 75 | When using FP16, we can reduce: 76 | R = 11 x N * 4 / 2 Byte 77 | 78 | So do the 'giou' (large than 'iou'). 79 | 80 | Time-wise, FP16 is generally faster than FP32. 81 | 82 | When gpu_assign_thr is not -1, it takes more time on cpu 83 | but not reduce memory. 84 | There, we can reduce half the memory and keep the speed. 
85 | 86 | If ``is_aligned`` is ``False``, then calculate the overlaps between each 87 | bbox of bboxes1 and bboxes2, otherwise the overlaps between each aligned 88 | pair of bboxes1 and bboxes2. 89 | 90 | Args: 91 | bboxes1 (Tensor): shape (B, m, 4) in format or empty. 92 | bboxes2 (Tensor): shape (B, n, 4) in format or empty. 93 | B indicates the batch dim, in shape (B1, B2, ..., Bn). 94 | If ``is_aligned`` is ``True``, then m and n must be equal. 95 | mode (str): "iou" (intersection over union), "iof" (intersection over 96 | foreground) or "giou" (generalized intersection over union). 97 | Default "iou". 98 | is_aligned (bool, optional): If True, then m and n must be equal. 99 | Default False. 100 | eps (float, optional): A value added to the denominator for numerical 101 | stability. Default 1e-6. 102 | 103 | Returns: 104 | Tensor: shape (m, n) if ``is_aligned`` is False else shape (m,) 105 | 106 | Example: 107 | >>> bboxes1 = torch.FloatTensor([ 108 | >>> [0, 0, 10, 10], 109 | >>> [10, 10, 20, 20], 110 | >>> [32, 32, 38, 42], 111 | >>> ]) 112 | >>> bboxes2 = torch.FloatTensor([ 113 | >>> [0, 0, 10, 20], 114 | >>> [0, 10, 10, 19], 115 | >>> [10, 10, 20, 20], 116 | >>> ]) 117 | >>> overlaps = bbox_overlaps(bboxes1, bboxes2) 118 | >>> assert overlaps.shape == (3, 3) 119 | >>> overlaps = bbox_overlaps(bboxes1, bboxes2, is_aligned=True) 120 | >>> assert overlaps.shape == (3, ) 121 | 122 | Example: 123 | >>> empty = torch.empty(0, 4) 124 | >>> nonempty = torch.FloatTensor([[0, 0, 10, 9]]) 125 | >>> assert tuple(bbox_overlaps(empty, nonempty).shape) == (0, 1) 126 | >>> assert tuple(bbox_overlaps(nonempty, empty).shape) == (1, 0) 127 | >>> assert tuple(bbox_overlaps(empty, empty).shape) == (0, 0) 128 | """ 129 | 130 | assert mode in ['iou', 'iof', 'giou'], f'Unsupported mode {mode}' 131 | # Either the boxes are empty or the length of boxes' last dimension is 4 132 | assert (bboxes1.size(-1) == 4 or bboxes1.size(0) == 0) 133 | assert (bboxes2.size(-1) == 4 or bboxes2.size(0) == 0) 134 | 135 | # Batch dim must be the same 136 | # Batch dim: (B1, B2, ... 
Bn)
137 |     assert bboxes1.shape[:-2] == bboxes2.shape[:-2]
138 |     batch_shape = bboxes1.shape[:-2]
139 | 
140 |     rows = bboxes1.size(-2)
141 |     cols = bboxes2.size(-2)
142 |     if is_aligned:
143 |         assert rows == cols
144 | 
145 |     if rows * cols == 0:
146 |         if is_aligned:
147 |             return bboxes1.new(batch_shape + (rows, ))
148 |         else:
149 |             return bboxes1.new(batch_shape + (rows, cols))
150 | 
151 |     area1 = (bboxes1[..., 2] - bboxes1[..., 0]) * (
152 |         bboxes1[..., 3] - bboxes1[..., 1])
153 |     area2 = (bboxes2[..., 2] - bboxes2[..., 0]) * (
154 |         bboxes2[..., 3] - bboxes2[..., 1])
155 | 
156 |     if is_aligned:
157 |         lt = torch.max(bboxes1[..., :2], bboxes2[..., :2])  # [B, rows, 2]
158 |         rb = torch.min(bboxes1[..., 2:], bboxes2[..., 2:])  # [B, rows, 2]
159 | 
160 |         wh = fp16_clamp(rb - lt, min=0)
161 |         overlap = wh[..., 0] * wh[..., 1]
162 | 
163 |         if mode in ['iou', 'giou']:
164 |             union = area1 + area2 - overlap
165 |         else:
166 |             union = area1
167 |         if mode == 'giou':
168 |             enclosed_lt = torch.min(bboxes1[..., :2], bboxes2[..., :2])
169 |             enclosed_rb = torch.max(bboxes1[..., 2:], bboxes2[..., 2:])
170 |     else:
171 |         lt = torch.max(bboxes1[..., :, None, :2],
172 |                        bboxes2[..., None, :, :2])  # [B, rows, cols, 2]
173 |         rb = torch.min(bboxes1[..., :, None, 2:],
174 |                        bboxes2[..., None, :, 2:])  # [B, rows, cols, 2]
175 | 
176 |         wh = fp16_clamp(rb - lt, min=0)
177 |         overlap = wh[..., 0] * wh[..., 1]
178 | 
179 |         if mode in ['iou', 'giou']:
180 |             union = area1[..., None] + area2[..., None, :] - overlap
181 |         else:
182 |             union = area1[..., None]
183 |         if mode == 'giou':
184 |             enclosed_lt = torch.min(bboxes1[..., :, None, :2],
185 |                                     bboxes2[..., None, :, :2])
186 |             enclosed_rb = torch.max(bboxes1[..., :, None, 2:],
187 |                                     bboxes2[..., None, :, 2:])
188 | 
189 |     eps = union.new_tensor([eps])
190 |     union = torch.max(union, eps)
191 |     ious = overlap / union
192 |     if mode in ['iou', 'iof']:
193 |         return ious
194 |     # calculate gious
195 |     enclose_wh = fp16_clamp(enclosed_rb - enclosed_lt, min=0)
196 |     enclose_area = enclose_wh[..., 0] * enclose_wh[..., 1]
197 |     enclose_area = torch.max(enclose_area, eps)
198 |     gious = ious - (enclose_area - union) / enclose_area
199 |     return gious
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 | 
3 | setup(
4 |     name='ref_l4',
5 |     version='1.1',
6 |     packages=find_packages(),
7 |     py_modules=['dataloader', 'evaluation', 'overlaps'],
8 |     install_requires=[
9 |         # Dependencies required by the ref_l4 package
10 |         # (numpy, tqdm, and huggingface_hub are imported by the dataloader and evaluator)
11 |         'torch', 'datasets', 'pillow', 'pandas', 'numpy', 'tqdm', 'huggingface_hub'
12 |     ],
13 |     entry_points={
14 |         'console_scripts': [
15 |             # Add any command line scripts here
16 |             # e.g., 'mycommand = mypackage.module:function'
17 |         ],
18 |     },
19 |     author='Jierun Chen',
20 |     description='Evaluation code for Ref-L4, a new REC benchmark in the LMM era.',
21 |     long_description=open('README.md').read(),
22 |     long_description_content_type='text/markdown',
23 |     url='https://github.com/JierunChen/Ref-L4',
24 |     classifiers=[
25 |         'Programming Language :: Python :: 3',
26 |         'License :: OSI Approved :: MIT License',
27 |         'Operating System :: OS Independent',
28 |     ],
29 |     license='MIT',
30 |     python_requires='>=3.6',
31 | )
--------------------------------------------------------------------------------