├── figures
│   └── soft_tifa.png
├── CONTRIBUTING.md
├── LICENSE
├── CODE_OF_CONDUCT.md
├── soft_tifa_analysis.py
├── README.md
└── evaluation.py

--------------------------------------------------------------------------------
/figures/soft_tifa.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/facebookresearch/GenEval2/HEAD/figures/soft_tifa.png

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing to GenEval2

Thank you for your interest in contributing to GenEval2!

## How to Contribute

1. **Fork the repository** and create your branch from `main`.
2. **Make your changes** and ensure they follow the existing code style.
3. **Test your changes** to make sure they work correctly.
4. **Submit a pull request** with a clear description of your changes.

## Reporting Issues

If you find a bug or have a feature request, please open an issue on GitHub with:
- A clear description of the problem or suggestion
- Steps to reproduce (for bugs)
- Expected vs actual behavior

## Code Style

- Follow PEP 8 guidelines for Python code
- Add docstrings for new functions
- Keep code readable and well-documented

## Questions?

Feel free to open an issue for any questions about contributing.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)

Copyright (c) 2024 Meta Platforms, Inc. and affiliates

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0
International License. To view a copy of this license, visit:
https://creativecommons.org/licenses/by-nc/4.0/

You are free to:

- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material

Under the following terms:

- Attribution — You must give appropriate credit, provide a link to the license,
  and indicate if changes were made. You may do so in any reasonable manner, but
  not in any way that suggests the licensor endorses you or your use.

- NonCommercial — You may not use the material for commercial purposes.

- No additional restrictions — You may not apply legal terms or technological
  measures that legally restrict others from doing anything the license permits.

Notices:

You do not have to comply with the license for elements of the material in the
public domain or where your use is permitted by an applicable exception or
limitation.

No warranties are given. The license may not give you all of the permissions
necessary for your intended use. For example, other rights such as publicity,
privacy, or moral rights may limit how you use the material.

--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to a positive environment:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior:

* The use of sexualized language or imagery and unwelcome sexual attention
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information without explicit permission
* Other conduct which could reasonably be considered inappropriate

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the project maintainers. All complaints will be reviewed and investigated promptly and fairly.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 2.1.

--------------------------------------------------------------------------------
/soft_tifa_analysis.py:
--------------------------------------------------------------------------------
# Copyright: Meta Platforms, Inc. and affiliates

import json
import argparse
from scipy.stats import gmean


# Per-skill analysis (Soft-TIFA AM)
def per_skill_analysis(all_score_lists, all_skill_lists):
    object_score, object_total = 0, 0
    count_score, count_total = 0, 0
    position_score, position_total = 0, 0
    verb_score, verb_total = 0, 0
    attribute_score, attribute_total = 0, 0

    for score_list, skill_list in zip(all_score_lists, all_skill_lists):
        for score, skill in zip(score_list, skill_list):
            if skill == 'object':
                object_score += score
                object_total += 1
            elif skill == 'count':
                count_score += score
                count_total += 1
            elif skill == 'position':
                position_score += score
                position_total += 1
            elif skill == 'verb':
                verb_score += score
                verb_total += 1
            elif skill == 'attribute':
                attribute_score += score
                attribute_total += 1
            else:
                raise Exception("Unrecognized skill")

    # Note that we say "accuracy", but it's an estimate.
    object_accuracy = 100 * object_score / object_total
    attribute_accuracy = 100 * attribute_score / attribute_total
    count_accuracy = 100 * count_score / count_total
    position_accuracy = 100 * position_score / position_total
    verb_accuracy = 100 * verb_score / verb_total
    return (object_accuracy, attribute_accuracy, count_accuracy,
            position_accuracy, verb_accuracy)


# Per-atomicity analysis (Soft-TIFA GM)
def per_atomicity_analysis(all_score_lists, atomicity_list):
    # Benchmark prompts have atomicity (atom count) between 3 and 10.
    all_atomicity_dict = {k: {'correct': 0, 'total': 0} for k in range(3, 11)}

    for score_list, atomicity in zip(all_score_lists, atomicity_list):
        all_atomicity_dict[atomicity]['correct'] += gmean(score_list)
        all_atomicity_dict[atomicity]['total'] += 1

    # Here, too, "accuracy" is an estimate.
    for atomicity in all_atomicity_dict:
        all_atomicity_dict[atomicity]['accuracy'] = (
            100 * all_atomicity_dict[atomicity]['correct']
            / all_atomicity_dict[atomicity]['total'])
    return all_atomicity_dict


def main():
    parser = argparse.ArgumentParser(description="Analyze T2I Performance")
    parser.add_argument("--benchmark_data", type=str, required=False,
                        default='./geneval2_data.jsonl',
                        help="File with benchmark data")
    parser.add_argument("--score_data", type=str, required=True,
                        help="File with lists of scores per prompt")
    args = parser.parse_args()

    with open(args.benchmark_data) as f:
        benchmark_data = [json.loads(line) for line in f]
    with open(args.score_data) as f:
        all_score_lists = json.load(f)

    atomicity_list = [b['atom_count'] for b in benchmark_data]
    all_skill_lists = [b['skills'] for b in benchmark_data]

    # Check that the lists all line up:
    for score_list, skill_list in zip(all_score_lists, all_skill_lists):
        assert len(score_list) == len(skill_list)

    print("Per Atom Type Analysis (Soft-TIFA AM)")
    (object_accuracy, attribute_accuracy, count_accuracy,
     position_accuracy, verb_accuracy) = per_skill_analysis(
        all_score_lists, all_skill_lists)
    print("Object:")
    print(round(object_accuracy, 2))
    print("Attribute:")
    print(round(attribute_accuracy, 2))
    print("Count:")
    print(round(count_accuracy, 2))
    print("Position:")
    print(round(position_accuracy, 2))
    print("Verb:")
    print(round(verb_accuracy, 2))
    print()

    print("Per Atomicity Analysis (Soft-TIFA GM)")
    all_atomicity_dict = per_atomicity_analysis(all_score_lists, atomicity_list)
    for atomicity in all_atomicity_dict:
        print("Results for Atomicity={}".format(atomicity))
        print(round(all_atomicity_dict[atomicity]['accuracy'], 2))


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

**GenEval 2** is a Text-to-Image (T2I) benchmark of 800 prompts with improved coverage of primitive visual concepts (objects, attributes, relations, counting) and higher degrees of compositionality than existing benchmarks.

GenEval 2 is evaluated with **Soft-TIFA**, a VQA-based evaluation method that combines judgments for visual primitives. It is better aligned with human judgment, and less likely to drift from that alignment over time, than other evaluation methods.

This repository contains the GenEval 2 benchmark data, as well as the Soft-TIFA evaluation script.

![Soft-TIFA overview](figures/soft_tifa.png)

## Benchmark Structure
The benchmark data is contained in `geneval2_data.jsonl`, where each line is a dictionary describing one prompt:
- `prompt`: the prompt itself
- `atom_count`: the compositionality, a.k.a. atomicity, of the prompt
- `vqa_list`: a list of VQA question-answer pairs covering each atom in the prompt
- `skills`: a list of the skill associated with each VQA pair

For example:
```
{
  "prompt": "four white bicycles in front of three plastic cows",
  "atom_count": 7,
  "vqa_list": [["How many bicycles are in the image?", "four"], ["Are the bicycles white?", "Yes"], ["Are there any bicycles in the image?", "Yes"], ["Are the bicycles in front of the cows?", "Yes"], ["How many cows are in the image?", "three"], ["Are the cows plastic?", "Yes"], ["Are there any cows in the image?", "Yes"]],
  "skills": ["count", "attribute", "object", "position", "count", "attribute", "object"]
}
```

Note: We do not consider "and" or "a" to contribute to prompt compositionality (i.e., atomicity). However, we still evaluate "a", in case the T2I model generated more than one of the required object; its question is therefore included in the VQA list but not counted in `atom_count`.
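
The records are plain JSON lines, so they can be loaded and sanity-checked with a few lines of Python (a minimal sketch; it assumes you run it from the repository root):
```
import json
from collections import Counter

with open("geneval2_data.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records))                               # number of prompts (800)
print(Counter(r["atom_count"] for r in records))  # prompts per atomicity
print(Counter(s for r in records for s in r["skills"]))  # atoms per skill

# Every VQA pair has exactly one skill label; atom_count can be smaller than
# len(vqa_list), since "a"-questions are evaluated but not counted as atoms.
assert all(len(r["vqa_list"]) == len(r["skills"]) for r in records)
```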

## Installation
```
git clone https://github.com/facebookresearch/GenEval2
cd GenEval2
conda create --name geneval2 python=3.10
conda activate geneval2
pip install torch transformers==4.57.0 pillow tqdm scipy
```
Python 3.10 or newer is required, since the evaluation script uses `match` statements.

## Image Generation
Use any T2I model to generate an image for each prompt in GenEval 2. Then create a dictionary whose keys are the prompts and whose values are the filepaths of the corresponding generated images; this mapping is the input to our evaluation script. A minimal sketch is shown below.
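
For instance, here is a sketch using Hugging Face `diffusers` with Stable Diffusion XL as a stand-in generator (the model choice and output paths are illustrative assumptions; substitute your own T2I model):
```
import os
import json
import torch
from diffusers import StableDiffusionXLPipeline  # stand-in T2I model (assumption)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

with open("geneval2_data.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f]

os.makedirs("images", exist_ok=True)
image_paths = {}
for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]
    path = "images/{:04d}.png".format(i)
    image.save(path)
    image_paths[prompt] = path

# This prompt-to-filepath mapping is what --image_filepath_data expects.
with open("your_image_paths.json", "w") as f:
    json.dump(image_paths, f)
```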

## Evaluation
Soft-TIFA uses a VQA model to query the generated image with each question in the prompt's associated list. Each question receives a soft score: the probability the VQA model assigns to the correct answer when given the image. To obtain an _atom-level_ estimate of model performance, the arithmetic mean of the soft scores per prompt is calculated (Soft-TIFA AM); to obtain a _prompt-level_ estimate, the geometric mean of the soft scores per prompt is calculated (Soft-TIFA GM). Finally, these per-prompt scores are averaged over the benchmark.

For comparison, we also provide code for two other T2I evaluation methods: VQAScore (Lin et al., 2024) and TIFA (Hu et al., 2023).

### Running the evaluation script
```
python evaluation.py \
    --benchmark_data geneval2_data.jsonl \
    --image_filepath_data your_image_paths.json \
    --method soft_tifa_gm \
    --output_file score_lists.json
```
Where:
- `--benchmark_data`: Path to the GenEval 2 data provided in this repository
- `--image_filepath_data`: Path to the JSON file mapping prompts to image filepaths, as generated in the previous step
- `--method`: Evaluation method (`vqascore`, `tifa`, `soft_tifa_am`, or `soft_tifa_gm`)
- `--output_file`: Path to save the per-atom output scores

## Analysis
GenEval 2 supports detailed analyses at the prompt and atom level. Each prompt is annotated with its atomicity and a list of skills, one per atom, allowing per-skill analysis at the atom level and per-atomicity analysis at the prompt level. The former is estimated with Soft-TIFA AM and the latter with Soft-TIFA GM, which are suited to atom- and prompt-level estimation respectively.
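
As a concrete example of the two aggregations, the following computes both statistics for one prompt's soft scores (the scores are made up; `scipy.stats.gmean` is the same function the analysis script uses):
```
from scipy.stats import gmean

soft_scores = [0.9, 0.8, 0.95]  # hypothetical per-atom soft scores for one prompt

am = sum(soft_scores) / len(soft_scores)  # Soft-TIFA AM (atom-level estimate)
gm = gmean(soft_scores)                   # Soft-TIFA GM (prompt-level estimate)

print(round(am, 3))  # 0.883
print(round(gm, 3))  # 0.881
```
Because the geometric mean collapses toward zero when any single atom fails, Soft-TIFA GM acts as a strict prompt-level score, while Soft-TIFA AM gives partial credit per atom.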

### Running the analysis script
```
python soft_tifa_analysis.py \
    --benchmark_data geneval2_data.jsonl \
    --score_data score_lists.json
```
Where:
- `--benchmark_data`: Path to the GenEval 2 data provided in this repository
- `--score_data`: Path to the per-atom scores generated in the evaluation step

## License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) - see the [LICENSE](LICENSE) file for details.

## Citation

If you use GenEval 2 in your research, please consider citing our work.

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

--------------------------------------------------------------------------------
/evaluation.py:
--------------------------------------------------------------------------------
# Copyright: Meta Platforms, Inc. and affiliates

import json
import torch
import argparse
from tqdm import tqdm
from scipy.stats import gmean
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration


print("Loading Qwen")
qwen_processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

qwen_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    dtype="auto",
    device_map="auto"
)


def return_numeric_string(number):
    """Map a spelled-out number to its numeral, or 'other' if unrecognized."""
    match number:
        case 'one':
            return '1'
        case 'two':
            return '2'
        case 'three':
            return '3'
        case 'four':
            return '4'
        case 'five':
            return '5'
        case 'six':
            return '6'
        case 'seven':
            return '7'
        case 'eight':
            return '8'
        case 'nine':
            return '9'
        case 'ten':
            return '10'
    return 'other'


def construct_message_with_image(prompt, image_filepath):
    """
    Constructs the message structure with image.
    """
    return [
        {"role": "user", "content": [
            {
                "type": "image",
                "image": image_filepath,
            },
            {
                "type": "text",
                "text": prompt
            },
        ]}
    ]


def send_message_with_image(prompt, image_filepath, answer_list=None):
    messages = construct_message_with_image(prompt, image_filepath)
    inputs = qwen_processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt"
    )
    inputs = inputs.to(qwen_model.device)
    outputs = qwen_model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True
    )
    scores = outputs.scores[0]
    probs = torch.nn.functional.softmax(scores, dim=-1)

    if answer_list:
        # Soft score: total probability mass on the acceptable answer tokens.
        lm_prob = 0
        for answer in answer_list:
            ans_token_id = qwen_processor.tokenizer.encode(answer)[0]
            lm_prob += probs[0, ans_token_id].item()
    else:
        lm_prob = None

    argmax_token = qwen_processor.batch_decode([torch.argmax(probs)])[0]
    pred = argmax_token
    return pred, lm_prob


def vqa_score(prompt, image_filepath):
    message_prompt = 'Does this image show "{}"? Answer the question with Yes or No.'.format(prompt)
    pred, ans_prob = send_message_with_image(message_prompt, image_filepath,
                                             answer_list=['Yes', 'yes', ' yes', ' Yes'])
    return ans_prob


def tifa(vqa_list, image_filepath):
    score = 0
    score_list = []
    for question, answer in vqa_list:
        if question.startswith("How many"):
            answer_list = [answer, answer.capitalize(), ' ' + answer,
                           ' ' + answer.capitalize(), return_numeric_string(answer),
                           ' ' + return_numeric_string(answer)]
        else:
            answer_list = ['Yes', 'yes', ' yes', ' Yes']
        pred, ans_prob = send_message_with_image(
            '{} Answer in one word.'.format(question), image_filepath,
            answer_list=answer_list)
        # Hard score: 1 if the model's top token matches an acceptable answer.
        if pred.lower() in answer_list:
            score += 1
            score_list.append(1)
        else:
            score_list.append(0)
    score = score / len(vqa_list)
    return score, score_list


def soft_tifa(vqa_list, image_filepath):
    score = 0
    score_list = []
    for question, answer in vqa_list:
        if question.startswith("How many"):
            answer_list = [answer, answer.capitalize(), ' ' + answer,
                           ' ' + answer.capitalize(), return_numeric_string(answer),
                           ' ' + return_numeric_string(answer)]
        else:
            answer_list = ['Yes', 'yes', ' yes', ' Yes']
        pred, ans_prob = send_message_with_image(
            '{} Answer in one word.'.format(question), image_filepath,
            answer_list=answer_list)
        score += ans_prob
        score_list.append(ans_prob)
    score = score / len(vqa_list)
    return score, score_list


def main():
    parser = argparse.ArgumentParser(description="Evaluate T2I Images")
    parser.add_argument("--benchmark_data", type=str, required=True,
                        help="File with benchmark data")
    parser.add_argument("--image_filepath_data", type=str, required=True,
                        help="File with prompts and image filepaths")
    parser.add_argument("--method", type=str, required=True,
                        choices=["vqascore", "tifa", "soft_tifa_am", "soft_tifa_gm"],
                        help="Method name")
    parser.add_argument("--output_file", type=str, required=True,
                        help="Output filepath name")
    args = parser.parse_args()

    with open(args.benchmark_data) as f:
        benchmark_data = [json.loads(line) for line in f]
    # File with prompts and image filepaths as a dictionary:
    # {prompt: image_filepath}
    with open(args.image_filepath_data) as f:
        image_data = json.load(f)
    all_score_lists = []

    for d in tqdm(benchmark_data):
        if d['prompt'] not in image_data:
            raise Exception("Missing filepath for the prompt: {}".format(d['prompt']))
        image_filepath = image_data[d['prompt']]
        if args.method == 'vqascore':
            score = vqa_score(d['prompt'], image_filepath)
            score_list = [score]
        elif args.method == 'tifa':
            score, score_list = tifa(d['vqa_list'], image_filepath)
        elif args.method in ('soft_tifa_am', 'soft_tifa_gm'):
            score, score_list = soft_tifa(d['vqa_list'], image_filepath)
        else:
            raise NotImplementedError
        all_score_lists.append(score_list)

    # Save scores for later analysis
    with open(args.output_file, 'w') as f:
        json.dump(all_score_lists, f)

    # Calculating total score
    if args.method == 'soft_tifa_gm':
        per_prompt_scores = [gmean(s) for s in all_score_lists]
    else:
        per_prompt_scores = [sum(s) / len(s) for s in all_score_lists]

    total_score = 100 * sum(per_prompt_scores) / len(per_prompt_scores)
    print("Score: {}".format(total_score))


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------