├── figures
│   └── soft_tifa.png
├── CONTRIBUTING.md
├── LICENSE
├── CODE_OF_CONDUCT.md
├── soft_tifa_analysis.py
├── README.md
└── evaluation.py

--------------------------------------------------------------------------------
/figures/soft_tifa.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/facebookresearch/GenEval2/HEAD/figures/soft_tifa.png

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing to GenEval2

Thank you for your interest in contributing to GenEval2!

## How to Contribute

1. **Fork the repository** and create your branch from `main`.
2. **Make your changes** and ensure they follow the existing code style.
3. **Test your changes** to make sure they work correctly.
4. **Submit a pull request** with a clear description of your changes.

## Reporting Issues

If you find a bug or have a feature request, please open an issue on GitHub with:
- A clear description of the problem or suggestion
- Steps to reproduce (for bugs)
- Expected vs actual behavior

## Code Style

- Follow PEP 8 guidelines for Python code
- Add docstrings for new functions
- Keep code readable and well-documented

## Questions?

Feel free to open an issue for any questions about contributing.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)

Copyright (c) 2024 Meta Platforms, Inc. and affiliates

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0
International License. To view a copy of this license, visit:
https://creativecommons.org/licenses/by-nc/4.0/

You are free to:

- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material

Under the following terms:

- Attribution — You must give appropriate credit, provide a link to the license,
  and indicate if changes were made. You may do so in any reasonable manner, but
  not in any way that suggests the licensor endorses you or your use.

- NonCommercial — You may not use the material for commercial purposes.

- No additional restrictions — You may not apply legal terms or technological
  measures that legally restrict others from doing anything the license permits.

Notices:

You do not have to comply with the license for elements of the material in the
public domain or where your use is permitted by an applicable exception or
limitation.

No warranties are given. The license may not give you all of the permissions
necessary for your intended use. For example, other rights such as publicity,
privacy, or moral rights may limit how you use the material.

--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to a positive environment:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior:

* The use of sexualized language or imagery and unwelcome sexual attention
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information without explicit permission
* Other conduct which could reasonably be considered inappropriate

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the project maintainers. All complaints will be reviewed and investigated promptly and fairly.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 2.1.

--------------------------------------------------------------------------------
/soft_tifa_analysis.py:
--------------------------------------------------------------------------------
# Copyright: Meta Platforms, Inc. and affiliates

import json
import argparse
from scipy.stats import gmean


# Per-skill analysis (Soft-TIFA AM)
def per_skill_analysis(all_score_lists, all_skill_lists):
    object_score, object_total = 0, 0
    count_score, count_total = 0, 0
    position_score, position_total = 0, 0
    verb_score, verb_total = 0, 0
    attribute_score, attribute_total = 0, 0

    for score_list, skill_list in zip(all_score_lists, all_skill_lists):
        for score, skill in zip(score_list, skill_list):
            if skill == 'object':
                object_score += score
                object_total += 1
            elif skill == 'count':
                count_score += score
                count_total += 1
            elif skill == 'position':
                position_score += score
                position_total += 1
            elif skill == 'verb':
                verb_score += score
                verb_total += 1
            elif skill == 'attribute':
                attribute_score += score
                attribute_total += 1
            else:
                raise Exception("Unrecognized skill")

    # Note that we say "accuracy", but it's an estimate.
    object_accuracy = 100 * object_score / object_total
    attribute_accuracy = 100 * attribute_score / attribute_total
    count_accuracy = 100 * count_score / count_total
    position_accuracy = 100 * position_score / position_total
    verb_accuracy = 100 * verb_score / verb_total
    return (object_accuracy, attribute_accuracy, count_accuracy,
            position_accuracy, verb_accuracy)


# Per-atomicity analysis (Soft-TIFA GM)
def per_atomicity_analysis(all_score_lists, atomicity_list):
    # Benchmark prompts have atomicity (atom count) between 3 and 10.
    all_atomicity_dict = {k: {'correct': 0, 'total': 0} for k in range(3, 11)}

    for score_list, atomicity in zip(all_score_lists, atomicity_list):
        all_atomicity_dict[atomicity]['correct'] += gmean(score_list)
        all_atomicity_dict[atomicity]['total'] += 1

    # Here, too, "accuracy" is an estimate.
    for atomicity in all_atomicity_dict:
        all_atomicity_dict[atomicity]['accuracy'] = (
            100 * all_atomicity_dict[atomicity]['correct']
            / all_atomicity_dict[atomicity]['total'])
    return all_atomicity_dict


def main():
    parser = argparse.ArgumentParser(description="Analyze T2I Performance")
    parser.add_argument("--benchmark_data", type=str, required=False,
                        default='./geneval2_data.jsonl',
                        help="File with benchmark data")
    parser.add_argument("--score_data", type=str, required=True,
                        help="File with lists of scores per prompt")
    args = parser.parse_args()

    with open(args.benchmark_data) as f:
        benchmark_data = [json.loads(line) for line in f]
    with open(args.score_data) as f:
        all_score_lists = json.load(f)

    atomicity_list = [b['atom_count'] for b in benchmark_data]
    all_skill_lists = [b['skills'] for b in benchmark_data]

    # Check that the lists all line up:
    for score_list, skill_list in zip(all_score_lists, all_skill_lists):
        assert len(score_list) == len(skill_list)

    print("Per Atom Type Analysis (Soft-TIFA AM)")
    (object_accuracy, attribute_accuracy, count_accuracy,
     position_accuracy, verb_accuracy) = per_skill_analysis(
        all_score_lists, all_skill_lists)
    print("Object:")
    print(round(object_accuracy, 2))
    print("Attribute:")
    print(round(attribute_accuracy, 2))
    print("Count:")
    print(round(count_accuracy, 2))
    print("Position:")
    print(round(position_accuracy, 2))
    print("Verb:")
    print(round(verb_accuracy, 2))
    print()

    print("Per Atomicity Analysis (Soft-TIFA GM)")
    all_atomicity_dict = per_atomicity_analysis(all_score_lists, atomicity_list)
    for atomicity in all_atomicity_dict:
        print("Results for Atomicity={}".format(atomicity))
        print(round(all_atomicity_dict[atomicity]['accuracy'], 2))


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

**GenEval 2** is a Text-to-Image (T2I) benchmark of 800 prompts with improved coverage of primitive visual concepts (objects, attributes, relations, counting) and higher degrees of compositionality than existing benchmarks.

GenEval 2 is evaluated with **Soft-TIFA**, a VQA-based evaluation method that combines judgments for visual primitives. It is better aligned with human judgment, and less likely to drift from that alignment over time, than other evaluation methods.

This repository contains the GenEval 2 benchmark data, as well as the Soft-TIFA evaluation script.

![Soft-TIFA overview](figures/soft_tifa.png)

## Benchmark Structure
The benchmark data is contained in `geneval2_data.jsonl`, where each line is a dictionary describing one prompt:
- `prompt`: the prompt itself
- `atom_count`: the compositionality, a.k.a. atomicity, of the prompt
- `vqa_list`: a list of VQA question-answer pairs covering each atom in the prompt
- `skills`: a list of the skill associated with each VQA pair

For example:
```
{
  "prompt": "four white bicycles in front of three plastic cows",
  "atom_count": 7,
  "vqa_list": [["How many bicycles are in the image?", "four"], ["Are the bicycles white?", "Yes"], ["Are there any bicycles in the image?", "Yes"], ["Are the bicycles in front of the cows?", "Yes"], ["How many cows are in the image?", "three"], ["Are the cows plastic?", "Yes"], ["Are there any cows in the image?", "Yes"]],
  "skills": ["count", "attribute", "object", "position", "count", "attribute", "object"]
}
```

Note: We do not consider "and" or "a" to contribute to prompt compositionality (i.e., atomicity). However, we still evaluate "a", in case the T2I model generated more than one of the required object; its question is therefore included in the VQA list but not counted in `atom_count`.
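
The records are plain JSON lines, so they can be loaded and sanity-checked with a few lines of Python (a minimal sketch; it assumes you run it from the repository root):
```
import json
from collections import Counter

with open("geneval2_data.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records))                               # number of prompts (800)
print(Counter(r["atom_count"] for r in records))  # prompts per atomicity
print(Counter(s for r in records for s in r["skills"]))  # atoms per skill

# Every VQA pair has exactly one skill label; atom_count can be smaller than
# len(vqa_list), since "a"-questions are evaluated but not counted as atoms.
assert all(len(r["vqa_list"]) == len(r["skills"]) for r in records)
```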

## Installation
```
git clone https://github.com/facebookresearch/GenEval2
cd GenEval2
conda create --name geneval2 python=3.10
conda activate geneval2
pip install torch transformers==4.57.0 pillow tqdm scipy
```
Python 3.10 or newer is required, since the evaluation script uses `match` statements.

## Image Generation
Use any T2I model to generate an image for each prompt in GenEval 2. Then create a dictionary whose keys are the prompts and whose values are the filepaths of the corresponding generated images; this mapping is the input to our evaluation script. A minimal sketch is shown below.
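
For instance, here is a sketch using Hugging Face `diffusers` with Stable Diffusion XL as a stand-in generator (the model choice and output paths are illustrative assumptions; substitute your own T2I model):
```
import os
import json
import torch
from diffusers import StableDiffusionXLPipeline  # stand-in T2I model (assumption)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

with open("geneval2_data.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f]

os.makedirs("images", exist_ok=True)
image_paths = {}
for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]
    path = "images/{:04d}.png".format(i)
    image.save(path)
    image_paths[prompt] = path

# This prompt-to-filepath mapping is what --image_filepath_data expects.
with open("your_image_paths.json", "w") as f:
    json.dump(image_paths, f)
```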

## Evaluation
Soft-TIFA uses a VQA model to query the generated image with each question in the prompt's associated list. Each question receives a soft score: the probability the VQA model assigns to the correct answer when given the image. To obtain an _atom-level_ estimate of model performance, the arithmetic mean of the soft scores per prompt is calculated (Soft-TIFA AM); to obtain a _prompt-level_ estimate, the geometric mean of the soft scores per prompt is calculated (Soft-TIFA GM). Finally, these per-prompt scores are averaged over the benchmark.

For comparison, we also provide code for two other T2I evaluation methods: VQAScore (Lin et al., 2024) and TIFA (Hu et al., 2023).

### Running the evaluation script
```
python evaluation.py \
    --benchmark_data geneval2_data.jsonl \
    --image_filepath_data your_image_paths.json \
    --method soft_tifa_gm \
    --output_file score_lists.json
```
Where:
- `--benchmark_data`: Path to the GenEval 2 data provided in this repository
- `--image_filepath_data`: Path to the JSON file mapping prompts to image filepaths, as generated in the previous step
- `--method`: Evaluation method (`vqascore`, `tifa`, `soft_tifa_am`, or `soft_tifa_gm`)
- `--output_file`: Path to save the per-atom output scores

## Analysis
GenEval 2 supports detailed analyses at the prompt and atom level. Each prompt is annotated with its atomicity and a list of skills, one per atom, allowing per-skill analysis at the atom level and per-atomicity analysis at the prompt level. The former is estimated with Soft-TIFA AM and the latter with Soft-TIFA GM, which are suited to atom- and prompt-level estimation respectively.
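
As a concrete example of the two aggregations, the following computes both statistics for one prompt's soft scores (the scores are made up; `scipy.stats.gmean` is the same function the analysis script uses):
```
from scipy.stats import gmean

soft_scores = [0.9, 0.8, 0.95]  # hypothetical per-atom soft scores for one prompt

am = sum(soft_scores) / len(soft_scores)  # Soft-TIFA AM (atom-level estimate)
gm = gmean(soft_scores)                   # Soft-TIFA GM (prompt-level estimate)

print(round(am, 3))  # 0.883
print(round(gm, 3))  # 0.881
```
Because the geometric mean collapses toward zero when any single atom fails, Soft-TIFA GM acts as a strict prompt-level score, while Soft-TIFA AM gives partial credit per atom.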

### Running the analysis script
```
python soft_tifa_analysis.py \
    --benchmark_data geneval2_data.jsonl \
    --score_data score_lists.json
```
Where:
- `--benchmark_data`: Path to the GenEval 2 data provided in this repository
- `--score_data`: Path to the per-atom scores generated in the evaluation step

## License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) - see the [LICENSE](LICENSE) file for details.

## Citation

If you use GenEval 2 in your research, please consider citing our work.

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

--------------------------------------------------------------------------------
/evaluation.py:
--------------------------------------------------------------------------------
# Copyright: Meta Platforms, Inc. and affiliates

import json
import torch
import argparse
from tqdm import tqdm
from scipy.stats import gmean
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration


print("Loading Qwen")
qwen_processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

qwen_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    dtype="auto",
    device_map="auto"
)


def return_numeric_string(number):
    """Map a spelled-out number to its numeral, or 'other' if unrecognized."""
    match number:
        case 'one':
            return '1'
        case 'two':
            return '2'
        case 'three':
            return '3'
        case 'four':
            return '4'
        case 'five':
            return '5'
        case 'six':
            return '6'
        case 'seven':
            return '7'
        case 'eight':
            return '8'
        case 'nine':
            return '9'
        case 'ten':
            return '10'
    return 'other'


def construct_message_with_image(prompt, image_filepath):
    """
    Constructs the message structure with image.
    """
    return [
        {"role": "user", "content": [
            {
                "type": "image",
                "image": image_filepath,
            },
            {
                "type": "text",
                "text": prompt
            },
        ]}
    ]


def send_message_with_image(prompt, image_filepath, answer_list=None):
    messages = construct_message_with_image(prompt, image_filepath)
    inputs = qwen_processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt"
    )
    inputs = inputs.to(qwen_model.device)
    outputs = qwen_model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True
    )
    scores = outputs.scores[0]
    probs = torch.nn.functional.softmax(scores, dim=-1)

    if answer_list:
        # Soft score: total probability mass on the acceptable answer tokens.
        lm_prob = 0
        for answer in answer_list:
            ans_token_id = qwen_processor.tokenizer.encode(answer)[0]
            lm_prob += probs[0, ans_token_id].item()
    else:
        lm_prob = None

    argmax_token = qwen_processor.batch_decode([torch.argmax(probs)])[0]
    pred = argmax_token
    return pred, lm_prob


def vqa_score(prompt, image_filepath):
    message_prompt = 'Does this image show "{}"? Answer the question with Yes or No.'.format(prompt)
    pred, ans_prob = send_message_with_image(message_prompt, image_filepath,
                                             answer_list=['Yes', 'yes', ' yes', ' Yes'])
    return ans_prob


def tifa(vqa_list, image_filepath):
    score = 0
    score_list = []
    for question, answer in vqa_list:
        if question.startswith("How many"):
            answer_list = [answer, answer.capitalize(), ' ' + answer,
                           ' ' + answer.capitalize(), return_numeric_string(answer),
                           ' ' + return_numeric_string(answer)]
        else:
            answer_list = ['Yes', 'yes', ' yes', ' Yes']
        pred, ans_prob = send_message_with_image(
            '{} Answer in one word.'.format(question), image_filepath,
            answer_list=answer_list)
        # Hard score: 1 if the model's top token matches an acceptable answer.
        if pred.lower() in answer_list:
            score += 1
            score_list.append(1)
        else:
            score_list.append(0)
    score = score / len(vqa_list)
    return score, score_list


def soft_tifa(vqa_list, image_filepath):
    score = 0
    score_list = []
    for question, answer in vqa_list:
        if question.startswith("How many"):
            answer_list = [answer, answer.capitalize(), ' ' + answer,
                           ' ' + answer.capitalize(), return_numeric_string(answer),
                           ' ' + return_numeric_string(answer)]
        else:
            answer_list = ['Yes', 'yes', ' yes', ' Yes']
        pred, ans_prob = send_message_with_image(
            '{} Answer in one word.'.format(question), image_filepath,
            answer_list=answer_list)
        score += ans_prob
        score_list.append(ans_prob)
    score = score / len(vqa_list)
    return score, score_list


def main():
    parser = argparse.ArgumentParser(description="Evaluate T2I Images")
    parser.add_argument("--benchmark_data", type=str, required=True,
                        help="File with benchmark data")
    parser.add_argument("--image_filepath_data", type=str, required=True,
                        help="File with prompts and image filepaths")
    parser.add_argument("--method", type=str, required=True,
                        choices=["vqascore", "tifa", "soft_tifa_am", "soft_tifa_gm"],
                        help="Method name")
    parser.add_argument("--output_file", type=str, required=True,
                        help="Output filepath name")
    args = parser.parse_args()

    with open(args.benchmark_data) as f:
        benchmark_data = [json.loads(line) for line in f]
    # File with prompts and image filepaths as a dictionary:
    # {prompt: image_filepath}
    with open(args.image_filepath_data) as f:
        image_data = json.load(f)
    all_score_lists = []

    for d in tqdm(benchmark_data):
        if d['prompt'] not in image_data:
            raise Exception("Missing filepath for the prompt: {}".format(d['prompt']))
        image_filepath = image_data[d['prompt']]
        if args.method == 'vqascore':
            score = vqa_score(d['prompt'], image_filepath)
            score_list = [score]
        elif args.method == 'tifa':
            score, score_list = tifa(d['vqa_list'], image_filepath)
        elif args.method in ('soft_tifa_am', 'soft_tifa_gm'):
            score, score_list = soft_tifa(d['vqa_list'], image_filepath)
        else:
            raise NotImplementedError
        all_score_lists.append(score_list)

    # Save scores for later analysis
    with open(args.output_file, 'w') as f:
        json.dump(all_score_lists, f)

    # Calculating total score
    if args.method == 'soft_tifa_gm':
        per_prompt_scores = [gmean(s) for s in all_score_lists]
    else:
        per_prompt_scores = [sum(s) / len(s) for s in all_score_lists]

    total_score = 100 * sum(per_prompt_scores) / len(per_prompt_scores)
    print("Score: {}".format(total_score))


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------