├── mixeval_x
│   ├── utils
│   │   ├── __init__.py
│   │   ├── count_token.py
│   │   ├── gemini_api_script.py
│   │   ├── common_utils.py
│   │   ├── judge_freeform_parser.py
│   │   ├── judge_multichoice_parser.py
│   │   ├── judge_text2action.py
│   │   ├── judge_image2action.py
│   │   ├── metric_utils.py
│   │   └── judge_text2image.py
│   ├── examples
│   │   └── model_response_dir
│   │       ├── audio2text
│   │       │   └── gemini_1.5_pro
│   │       │       └── gemini_1.5_pro_ff.jsonl
│   │       ├── audio2text_hard
│   │       │   └── gemini_1.5_pro
│   │       │       └── gemini_1.5_pro_ff.jsonl
│   │       ├── image2text
│   │       │   └── gemini_1.5
│   │       │       ├── gemini_1.5_ff.jsonl
│   │       │       └── gemini_1.5_mp.jsonl
│   │       ├── image2text_hard
│   │       │   └── gemini_1.5
│   │       │       ├── gemini_1.5_ff.jsonl
│   │       │       └── gemini_1.5_mp.jsonl
│   │       ├── video2text
│   │       │   └── gemini_1_5_pro
│   │       │       ├── gemini_1_5_pro_ff.jsonl
│   │       │       └── gemini_1_5_pro_mp.jsonl
│   │       ├── video2text_hard
│   │       │   └── gemini_1_5_pro
│   │       │       ├── gemini_1_5_pro_ff.jsonl
│   │       │       └── gemini_1_5_pro_mp.jsonl
│   │       ├── text2image
│   │       │   └── dall-e-3
│   │       │       └── dall-e-3_t2i.jsonl
│   │       ├── image2action
│   │       │   └── gemini_1.5
│   │       │       └── gemini_1.5_i2a.jsonl
│   │       └── text2action
│   │           └── gemini_15_pro
│   │               └── gemini_15_pro_t2a.jsonl
│   ├── prompts
│   │   ├── evaluation_prompts.py
│   │   ├── evaluation_prompts_image2text.py
│   │   └── judge_prompts.py
│   ├── compute_metrics_mmg_agent.py
│   └── compute_metrics_mmu.py
├── setup.sh
├── resources
│   └── imgs
│       ├── header.png
│       └── mixeval-x.jpg
├── setup.py
├── .gitignore
├── docs
│   └── model_output_formats.md
└── README.md
/mixeval_x/utils/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/setup.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | pip install -e .
--------------------------------------------------------------------------------
/resources/imgs/header.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JinjieNi/MixEval-X/HEAD/resources/imgs/header.png
--------------------------------------------------------------------------------
/resources/imgs/mixeval-x.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JinjieNi/MixEval-X/HEAD/resources/imgs/mixeval-x.jpg
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/audio2text/gemini_1.5_pro/gemini_1.5_pro_ff.jsonl:
--------------------------------------------------------------------------------
1 | {"problem_type": "free-form", "audio_id": "1.wav", "prompt": "Is the frog making noise?", "target": ["yes"], "benchmark_name": "Clotho-AQA", "response": "Yes.", "id": "1"}
2 | {"problem_type": "free-form", "audio_id": "2.wav", "prompt": "Is someone blowing a whistle?", "target": ["no"], "benchmark_name": "Clotho-AQA", "response": "Yes", "id": "2"}
3 | {"problem_type": "free-form", "audio_id": "3.wav", "prompt": "What did you listen to after listening to the door closing?", "target": ["crowd babbling"], "benchmark_name": "DAQA", "response": "Silence", "id": "3"}
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/audio2text_hard/gemini_1.5_pro/gemini_1.5_pro_ff.jsonl:
--------------------------------------------------------------------------------
1 | {"problem_type": "free-form", "audio_id": "1.wav", "prompt": "Is the frog making noise?", "target": ["yes"], "benchmark_name": "Clotho-AQA", "response": "Yes.", "id": "1"}
2 | {"problem_type": "free-form", "audio_id": "2.wav", "prompt": "Is someone blowing a whistle?", "target": ["no"], "benchmark_name": "Clotho-AQA", "response": "Yes", "id": "2"}
3 | {"problem_type": "free-form", "audio_id": "3.wav", "prompt": "What did you listen to after listening to the door closing?", "target": ["crowd babbling"], "benchmark_name": "DAQA", "response": "Silence", "id": "3"}
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/image2text/gemini_1.5/gemini_1.5_ff.jsonl:
--------------------------------------------------------------------------------
1 | {"problem_type": "free-form", "image_id": "1.jpg", "prompt": "Where are the people that are standing looking at?", "target": ["away"], "benchmark_name": "GQA", "response": "The kite", "id": "1"}
2 | {"problem_type": "free-form", "image_id": "2.jpg", "prompt": "is this an online museum?", "target": ["yes", "yes", "yes", "no", "yes", "yes", "yes", "the louvre", "yes", "yes"], "benchmark_name": "TextVQA", "response": "Not likely.", "id": "2"}
3 | {"problem_type": "free-form", "image_id": "3.jpg", "prompt": "On which side of the photo is the silk kite?", "target": ["right"], "benchmark_name": "GQA", "response": "The left side", "id": "3"}
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/image2text_hard/gemini_1.5/gemini_1.5_ff.jsonl:
--------------------------------------------------------------------------------
1 | {"problem_type": "free-form", "image_id": "1.jpg", "prompt": "Where are the people that are standing looking at?", "target": ["away"], "benchmark_name": "GQA", "response": "The kite", "id": "1"}
2 | {"problem_type": "free-form", "image_id": "2.jpg", "prompt": "is this an online museum?", "target": ["yes", "yes", "yes", "no", "yes", "yes", "yes", "the louvre", "yes", "yes"], "benchmark_name": "TextVQA", "response": "Not likely.", "id": "2"}
3 | {"problem_type": "free-form", "image_id": "3.jpg", "prompt": "On which side of the photo is the silk kite?", "target": ["right"], "benchmark_name": "GQA", "response": "The left side", "id": "3"}
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/video2text/gemini_1_5_pro/gemini_1_5_pro_ff.jsonl:
--------------------------------------------------------------------------------
1 | {"problem_type": "free-form", "video_id": "1.mp4", "prompt": "is playing football difficult", "target": ["yes"], "benchmark_name": "ActivityNet-QA", "response": "Yes.", "id": "1"}
2 | {"problem_type": "free-form", "video_id": "2.mp4", "prompt": "What is the idea of the advertisement?", "target": ["steam clean", "steam clean", "cleaning", "car", "steam clean"], "benchmark_name": "IVQA", "response": "Car steam cleaning. \n", "id": "2"}
3 | {"problem_type": "free-form", "video_id": "3.mp4", "prompt": "What does the location of video look like?", "target": ["labotory", "laboratory", "laboratory", "laboratory", "laboratory"], "benchmark_name": "IVQA", "response": "Laboratory", "id": "3"}
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/video2text_hard/gemini_1_5_pro/gemini_1_5_pro_ff.jsonl:
--------------------------------------------------------------------------------
1 | {"problem_type": "free-form", "video_id": "1.mp4", "prompt": "is playing football difficult", "target": ["yes"], "benchmark_name": "ActivityNet-QA", "response": "Yes.", "id": "1"}
2 | {"problem_type": "free-form", "video_id": "2.mp4", "prompt": "What is the idea of the advertisement?", "target": ["steam clean", "steam clean", "cleaning", "car", "steam clean"], "benchmark_name": "IVQA", "response": "Car steam cleaning. \n", "id": "2"}
3 | {"problem_type": "free-form", "video_id": "3.mp4", "prompt": "What does the location of video look like?", "target": ["labotory", "laboratory", "laboratory", "laboratory", "laboratory"], "benchmark_name": "IVQA", "response": "Laboratory", "id": "3"}
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/image2text/gemini_1.5/gemini_1.5_mp.jsonl:
--------------------------------------------------------------------------------
1 | {"problem_type": "single-choice", "image_id": "1.jpg", "prompt": "Is the man's face clearly visible in the image?", "options": ["Yes", "No"], "target": [1], "benchmark_name": "Q-Bench", "response": "B."}
2 | {"problem_type": "single-choice", "image_id": "2.jpg", "prompt": "Are person1 and person2 the same person ?", "options": ["Yes they are related .", "person1 and person2 are not on the same side .", "Yes , they are students in the same class at school .", "Yes , person1 and person2 are the same ."], "target": [3], "benchmark_name": "SEED2", "response": "D."}
3 | {"problem_type": "single-choice", "image_id": "3.jpg", "prompt": "Where are the cast of the television show located in the image?", "options": ["In the foreground", "In the background", "In the center", "At the edges"], "target": [2], "benchmark_name": "SEED", "response": "A"}
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/image2text_hard/gemini_1.5/gemini_1.5_mp.jsonl:
--------------------------------------------------------------------------------
1 | {"problem_type": "single-choice", "image_id": "1.jpg", "prompt": "Is the man's face clearly visible in the image?", "options": ["Yes", "No"], "target": [1], "benchmark_name": "Q-Bench", "response": "B."}
2 | {"problem_type": "single-choice", "image_id": "2.jpg", "prompt": "Are person1 and person2 the same person ?", "options": ["Yes they are related .", "person1 and person2 are not on the same side .", "Yes , they are students in the same class at school .", "Yes , person1 and person2 are the same ."], "target": [3], "benchmark_name": "SEED2", "response": "D."}
3 | {"problem_type": "single-choice", "image_id": "3.jpg", "prompt": "Where are the cast of the television show located in the image?", "options": ["In the foreground", "In the background", "In the center", "At the edges"], "target": [2], "benchmark_name": "SEED", "response": "A"}
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 | setup(
4 | name='mixeval_x',
5 | version='1.0.0',
6 | author='MixEval-X team',
7 | author_email='jinjieni@nus.edu.sg',
8 | packages=find_packages(),
9 | install_requires=[
10 | 'tiktoken==0.6.0',
11 | 'pandas==2.2.2',
12 | 'openai==1.30.5',
13 | 'httpx==0.27.0',
14 | 'nltk==3.8.1',
15 | 'numpy==1.26.3',
16 | 'tqdm==4.66.4',
17 | 'protobuf==4.25.3',
18 | 'python-dotenv==1.0.1',
19 | 'anthropic==0.28.0',
20 | 'google-generativeai==0.5.4',
21 | 'google-cloud-aiplatform==1.53.0',
22 | 'Pillow'
23 | ],
24 | package_data={
25 | },
26 | entry_points={
27 | },
28 | url='https://mixeval-x.github.io/',
29 | license='License :: OSI Approved :: MIT License',
30 | description='A real-world any-to-any benchmark and eval suite for models with diverse input and output modalities.',
31 | long_description=open('README.md').read(),
32 | long_description_content_type='text/markdown',
33 | )
34 |
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/video2text/gemini_1_5_pro/gemini_1_5_pro_mp.jsonl:
--------------------------------------------------------------------------------
1 | {"problem_type": "single-choice", "video_id": "1.mp4", "prompt": "How did Eddie know Pam and Justin before Justin was killed?", "options": ["They were part of the theater company", "They were high school friends", "They went to college together", "They were cousins", "They were siblings"], "target": [0], "benchmark_name": "TVQA", "response": "A\n", "id": "1"}
2 | {"problem_type": "single-choice", "video_id": "2.mp4", "prompt": "Who did Ryan say died in a single vehicle car crash after he reads the state trooper's report?", "options": [" The mother of the victim, Anya Mishkin.", "The victim, Anya Mishkin.", "The witness of the car accident.", "The best friend of the victim, Anya Mishkin.", "The state trooper's officer partner."], "target": [0], "benchmark_name": "TVQA", "response": "A. The mother of", "id": "2"}
3 | {"problem_type": "single-choice", "video_id": "3.mp4", "prompt": "What are the people doing when they gather around the conference table? ", "options": ["They pound on the table and clap.", "They eat cake. ", "They talk on cell phones.", "They type on lap tops. ", "They sit quietly."], "target": [0], "benchmark_name": "TVQA", "response": "A. They pound on", "id": "3"}
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/video2text_hard/gemini_1_5_pro/gemini_1_5_pro_mp.jsonl:
--------------------------------------------------------------------------------
1 | {"problem_type": "single-choice", "video_id": "1.mp4", "prompt": "How did Eddie know Pam and Justin before Justin was killed?", "options": ["They were part of the theater company", "They were high school friends", "They went to college together", "They were cousins", "They were siblings"], "target": [0], "benchmark_name": "TVQA", "response": "A\n", "id": "1"}
2 | {"problem_type": "single-choice", "video_id": "2.mp4", "prompt": "Who did Ryan say died in a single vehicle car crash after he reads the state trooper's report?", "options": [" The mother of the victim, Anya Mishkin.", "The victim, Anya Mishkin.", "The witness of the car accident.", "The best friend of the victim, Anya Mishkin.", "The state trooper's officer partner."], "target": [0], "benchmark_name": "TVQA", "response": "A. The mother of", "id": "2"}
3 | {"problem_type": "single-choice", "video_id": "3.mp4", "prompt": "What are the people doing when they gather around the conference table? ", "options": ["They pound on the table and clap.", "They eat cake. ", "They talk on cell phones.", "They type on lap tops. ", "They sit quietly."], "target": [0], "benchmark_name": "TVQA", "response": "A. They pound on", "id": "3"}
--------------------------------------------------------------------------------
/mixeval_x/prompts/evaluation_prompts.py:
--------------------------------------------------------------------------------
1 | import json
2 |
3 | # Besides the prompt, the input to the model also requires an input field.
4 | # for text2text, the input field is the text specified in 'context';
5 | # for image2text, the input field is the image specified in 'image_id';
6 | # for audio2text, the input field is the audio specified in 'audio_id';
7 | # for video2text, the input field is the video specified in 'video_id'.
8 |
9 |
10 | MULTI_CHOICE_PROMPT = "Answer with the option letter from the given choices directly."
11 | FREE_FORM_PROMPT = "Answer the question using a single word or phrase."
12 |
13 | def parse_options(options):
14 | option_letters = [chr(ord("A") + i) for i in range(len(options))]
15 | choices_str = "\n".join([f"{option_letter}. {option}" for option_letter, option in zip(option_letters, options)])
16 | return choices_str
17 |
18 | def construct_prompt_multichoice(entry):
19 | prompt = entry["prompt"]
20 | parsed_options = parse_options(entry["options"])
21 | prompt = f"{prompt}\n{parsed_options}\n{MULTI_CHOICE_PROMPT}"
22 | return prompt
23 |
24 | def construct_prompt_freeform(entry):
25 | prompt = entry["prompt"]
26 | prompt = f"{prompt}\n{FREE_FORM_PROMPT}"
27 | return prompt
28 |
29 |
30 | if __name__ == "__main__":
31 |
32 | task_file_path = "data/tasks.json"
33 | with open(task_file_path, "r") as f:
34 | data = json.load(f)
35 | for id, entry in data.items():
36 | if "options" in entry:
37 | prompt = construct_prompt_multichoice(entry)
38 | else:
39 | prompt = construct_prompt_freeform(entry)
40 | print(prompt)
41 |
--------------------------------------------------------------------------------
/mixeval_x/prompts/evaluation_prompts_image2text.py:
--------------------------------------------------------------------------------
1 | import json
2 |
3 | # Besides the prompt, the input to the model also requires an input field.
4 | # for text2text, the input field is the text specified in 'context';
5 | # for image2text, the input field is the image specified in 'image_id';
6 | # for audio2text, the input field is the audio specified in 'audio_id';
7 | # for video2text, the input field is the video specified in 'video_id'.
8 |
9 |
10 | MULTI_CHOICE_PROMPT = "Answer with the option letter from the given choices directly."
11 | FREE_FORM_PROMPT = "Answer the question using a single word or phrase."
12 | FREE_FORM_PROMPT_MMVET = "Answer the question using a single word, phrase, or sentence."
13 |
14 | def parse_options(options):
15 | option_letters = [chr(ord("A") + i) for i in range(len(options))]
16 | choices_str = "\n".join([f"{option_letter}. {option}" for option_letter, option in zip(option_letters, options)])
17 | return choices_str
18 |
19 | def construct_prompt_multichoice(entry):
20 | prompt = entry["prompt"]
21 | parsed_options = parse_options(entry["options"])
22 | prompt = f"{prompt}\n{parsed_options}\n{MULTI_CHOICE_PROMPT}"
23 | return prompt
24 |
25 | def construct_prompt_freeform(entry):
26 | prompt = entry["prompt"]
27 | if entry["benchmark_name"] == "MM-Vet":
28 | prompt = f"{prompt}\n{FREE_FORM_PROMPT_MMVET}"
29 | else:
30 | prompt = f"{prompt}\n{FREE_FORM_PROMPT}"
31 | return prompt
32 |
33 |
34 | if __name__ == "__main__":
35 |
36 | task_file_path = "data/tasks.json"
37 | with open(task_file_path, "r") as f:
38 | data = json.load(f)
39 | for id, entry in data.items():
40 | if "options" in entry:
41 | prompt = construct_prompt_multichoice(entry)
42 | else:
43 | prompt = construct_prompt_freeform(entry)
44 | print(prompt)
45 |
--------------------------------------------------------------------------------
/mixeval_x/utils/count_token.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import os
3 | import json
4 |
5 | import tiktoken
6 |
7 | from mixeval_x.prompts.evaluation_prompts import (
8 | construct_prompt_multichoice,
9 | construct_prompt_freeform,
10 | )
11 |
12 | def parse_args():
13 | parser = argparse.ArgumentParser()
14 | parser.add_argument(
15 | "--split",
16 | type=str,
17 | choices=["close_freeform", "close_multichoice", "open", "all"],
18 | default="all",
19 | help="Split to evaluate."
20 | )
21 | return parser.parse_args()
22 |
23 | def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
24 | """Returns the number of tokens used by a list of messages."""
25 | try:
26 | encoding = tiktoken.encoding_for_model(model)
27 | except KeyError:
28 | encoding = tiktoken.get_encoding("cl100k_base")
29 | if model == "gpt-3.5-turbo-0613": # note: future models may deviate from this
30 | num_tokens = 0
31 | for message in messages:
32 | num_tokens += 4 # every message follows {role/name}\n{content}\n
33 | for key, value in message.items():
34 | num_tokens += len(encoding.encode(value))
35 | if key == "name": # if there's a name, the role is omitted
36 | num_tokens += -1 # role is always required and always 1 token
37 | num_tokens += 2 # every reply is primed with assistant
38 | return num_tokens
39 | else:
40 | raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.
41 | See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
42 |
43 | def count_all_tokens_to_filter(args):
44 | number_tokens = 0
45 |
46 | if args.split == "all":
47 | splits = ["close_freeform", "close_multichoice", "open"]
48 | else:
49 | splits = [args.split]
50 |
51 | for split in splits:
52 |         if split == "close_freeform":
53 |             data_path = "mixeval_x/data/text2text/text2text_closeended/free-form.json"
54 |         elif split == "close_multichoice":
55 |             data_path = "mixeval_x/data/text2text/text2text_closeended/multiple-choice.json"
56 |         elif split == "open":
57 |             data_path = "mixeval_x/data/text2text/text2text_openended.json"
58 | with open(data_path, "r") as f:
59 | data = json.load(f)
60 | for id, d in data.items():
61 | if split == "close_multichoice":
62 | formated_input = construct_prompt_multichoice(d)
63 | number_tokens += num_tokens_from_messages([{"content": formated_input}])
64 | elif split == "close_freeform":
65 | formated_input = construct_prompt_freeform(d)
66 | number_tokens += num_tokens_from_messages([{"content": formated_input}])
67 | else:
68 | formated_input = '\n'.join(d["turns"])
69 | number_tokens += num_tokens_from_messages([{"content": formated_input}]) + 1500
70 |
71 | print(f"Total number of tokens: {number_tokens}")
72 |
73 |
74 |
75 | if __name__ == '__main__':
76 | args = parse_args()
77 | count_all_tokens_to_filter(args)
78 |
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/text2image/dall-e-3/dall-e-3_t2i.jsonl:
--------------------------------------------------------------------------------
1 | {"first_turn_user_prompt": "Design a vibrant cityscape banner featuring the iconic Sydney Opera House and Harbour Bridge under a dazzling sunset, with the City of Sydney\u2019s logo prominently displayed in the foreground, ensuring it catches the eye against the vivid backdrop.", "first_turn_caption": "A vibrant cityscape banner featuring the iconic Sydney Opera House and Harbour Bridge under a dazzling sunset, with the City of Sydney\u2019s logo prominently displayed in the foreground, ensuring it catches the eye against the vivid backdrop.", "second_turn_user_prompt": " Remove the City of Sydney\u2019s logo.", "second_turn_caption": "A vibrant cityscape banner featuring the iconic Sydney Opera House and Harbour Bridge under a dazzling sunset, ensuring it catches the eye against the vivid backdrop.", "gen_1st_turn": "/home/nus-njj/projects/InstructioninWild/cache/MixEval-X/text2image/model_outputs/dall-e-3/1/1.jpg", "gen_2nd_turn": "/home/nus-njj/projects/InstructioninWild/cache/MixEval-X/text2image/model_outputs/dall-e-3/1/2.jpg", "id": "1"}
2 | {"first_turn_user_prompt": "Illustrate an informative and detailed chart showcasing various honor societies along with their corresponding cord colors, each clearly labeled for easy identification. Please include a diverse range of societies to reflect a broad spectrum of academic achievements.", "first_turn_caption": "An informative and detailed chart showcasing various honor societies along with their corresponding cord colors, each clearly labeled for easy identification, reflecting a broad spectrum of academic achievements.", "second_turn_user_prompt": " Change the chart to pie chart format.", "second_turn_caption": "An informative and detailed pie chart showcasing various honor societies along with their corresponding cord colors, each clearly labeled for easy identification, reflecting a broad spectrum of academic achievements.", "gen_1st_turn": "/home/nus-njj/projects/InstructioninWild/cache/MixEval-X/text2image/model_outputs/dall-e-3/2/1.jpg", "gen_2nd_turn": "/home/nus-njj/projects/InstructioninWild/cache/MixEval-X/text2image/model_outputs/dall-e-3/2/2.jpg", "id": "2"}
3 | {"first_turn_user_prompt": "Depict a bustling cityscape transitioning from rain to sunshine, where the detailed weather parameters such as temperature gauges and wind speed indicators are artistically incorporated into the scene, capturing the dynamic change in weather conditions throughout the day.", "first_turn_caption": "A bustling cityscape transitioning from rain to sunshine, where detailed weather parameters such as temperature gauges and wind speed indicators are artistically incorporated into the scene, capturing the dynamic change in weather conditions throughout the day.", "second_turn_user_prompt": " Change the weather from rain to snow.", "second_turn_caption": "A bustling cityscape transitioning from rain to snow, where detailed weather parameters such as temperature gauges and wind speed indicators are artistically incorporated into the scene, capturing the dynamic change in weather conditions throughout the day.", "gen_1st_turn": "/home/nus-njj/projects/InstructioninWild/cache/MixEval-X/text2image/model_outputs/dall-e-3/3/1.jpg", "gen_2nd_turn": "/home/nus-njj/projects/InstructioninWild/cache/MixEval-X/text2image/model_outputs/dall-e-3/3/2.jpg", "id": "3"}
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/image2action/gemini_1.5/gemini_1.5_i2a.jsonl:
--------------------------------------------------------------------------------
1 | {"image_id": "7.jpg", "task description": "Paint an object with blue color to resemble a blueberry.", "allowed actions": "[PickupObject], [PutObject], [Navigation], [OpenContainer], [CloseContainer], [SelectPaintTool], [MixColor], [ApplyPaint], [CleanBrush], [InspectObject], [DryObject]", "already executed steps": "[Navigation] , [SelectPaintTool] , [MixColor] , [ApplyPaint] ", "target": "[InspectObject] , [DryObject] ", "response": "My Planning: \n[InspectObject] , \n[ApplyPaint] , \n[DryObject] ,\n[CleanBrush] \n", "id": "1"}
2 | {"image_id": "18.jpg", "task description": "Analyze the displayed X-ray image by identifying and logging the visible features using the provided software.", "allowed actions": "[Navigate], [Click], [ReadText], [InputText], [OpenMenu], [SelectOption], [LoadImage], [AnalyzeImage], [LogFeature], [SaveData], [ToggleSwitch], [CloseApplication]", "already executed steps": "[Navigate] , [ReadText] , [OpenMenu] , [SelectOption] , [LoadImage] , [AnalyzeImage] ", "target": "[LogFeature] , [SaveData] , [CloseApplication] ", "response": "[Navigate] , [Click] , [LogFeature] , [Click] , [LogFeature] , [Click] , [LogFeature] , ... [Navigate] , [OpenMenu] , [SelectOption] , [CloseApplication] \n\n**Explanation:**\n\n1. **Feature Identification and Logging:** The planning assumes the software highlights identifiable features after analysis. I will navigate to each feature, click on it, and accurately describe it using the [LogFeature] action. This process will be repeated for all visible features.\n2. **Saving and Exiting:** Finally, I'll save the logged data and close the application.\n\n**Important Notes:**\n\n* **Invisible Features:** The prompt mentions potential invisible usable objects. In this context, there might be tools within the software (e.g., zoom, contrast adjustment) that could reveal additional features. If such tools are standard in X-ray analysis software, it would be logical to incorporate their use in the planning.\n* **Feature Specificity:** The exact features and descriptions would depend on the specific X-ray image and the analysis software's capabilities. \n* **Real-World Adaptation:** As a real-world agent, I would adapt my plan based on the software's specific interface and features. I would also seek clarification or assistance if I encounter any unforeseen issues. \n", "id": "2"}
3 | {"image_id": "5.jpg", "task description": "Prepare the AED for use by attaching the electrode pads onto the patient's chest and connect the pads to the AED.", "allowed actions": "[Navigate], [UnpackObject], [IdentifyObject], [PickupObject], [PlaceObject], [ConnectObject], [ActivateObject], [MonitorVitalSigns], [Listen], [Speak], [ReadText]", "already executed steps": "[Navigate] , [IdentifyObject] , [UnpackObject] , [PickupObject] , [PlaceObject] ", "target": "[ConnectObject] , [ActivateObject] , [MonitorVitalSigns], [ReadText] , [Listen]", "response": "Your Planning: \n[ConnectObject] , [ActivateObject] , [MonitorVitalSigns] , [Listen] \n", "id": "3"}
--------------------------------------------------------------------------------
/mixeval_x/utils/gemini_api_script.py:
--------------------------------------------------------------------------------
1 | from tqdm import tqdm
2 | import time
3 | import random
4 | import os
5 | from dotenv import load_dotenv
6 |
7 | from concurrent.futures import ThreadPoolExecutor
8 |
9 | import google.generativeai as genai
10 |
11 |
12 | ########################Gemini########################
13 | class GeminiJudgeText2Image:
14 | def __init__(self, args):
15 | self.args = args
16 | self.JUDGE = "gemini-1.5-pro-latest" # "gemini-1.5-flash-latest"
17 | self.FIX_INTERVAL_SECOND = 0
18 | self.MAX_RETRY_NUM = 999999
19 | self.MAX_NEW_TOKENS = 999
20 |
21 | self.FORMAT_MAXRETRY = 10
22 |
23 | load_dotenv()
24 | genai.configure(api_key=os.getenv('k_g')) # set the api key here
25 | self.model = genai.GenerativeModel(self.JUDGE)
26 |
27 | self.safety_settings={
28 | 'harm_category_harassment':'block_none',
29 | 'harm_category_hate_speech': 'block_none',
30 | 'harm_category_sexually_explicit': 'block_none',
31 | 'harm_category_dangerous_content': 'block_none'
32 | }
33 |
34 |     def format_prompts(self, inputs):
35 |         # TODO: implement the judge-input formatting logic here (left as a
36 |         # placeholder in this template); for now, pass the inputs through unchanged.
37 |         formated = inputs
38 |         return formated
39 |
40 | def _GPT_decode(self, inputs):
41 | completion = self.model.generate_content(
42 | self.format_prompts(inputs),
43 | generation_config=genai.types.GenerationConfig(
44 | candidate_count=1,
45 | max_output_tokens=self.MAX_NEW_TOKENS,
46 | ),
47 | safety_settings=self.safety_settings,
48 | )
49 | time.sleep(self.FIX_INTERVAL_SECOND)
50 | return completion.text
51 |
52 |
53 | def GPT_decode(self, inputs):
54 | delay = 1
55 | blocked = 0
56 | for i in range(self.MAX_RETRY_NUM):
57 | try:
58 | response_content = self._GPT_decode(inputs)
59 | return response_content
60 | except Exception as e:
61 | if 'quick accessor' in str(e) or 'block' in str(e):
62 | print("Content blocked, retrying ...")
63 | blocked += 1
64 | if blocked > 10:
65 | print("Blocked for too many times, using 'Response not available "
66 | "due to content restrictions.' as response, exiting...")
67 | return 'Response not available due to content restrictions.'
68 | elif 'quota' in str(e).lower() or 'limit' in str(e).lower():
69 | exponential_base = 2
70 | delay *= exponential_base * (1 + random.random())
71 | print(f"Error, retrying after {round(delay, 2)} seconds, {i+1}-th retry...")
72 | print(e)
73 | time.sleep(delay)
74 | continue
75 | else:
76 | print(f"Error in decode, retrying...")
77 | print(e)
78 | time.sleep(10)
79 | continue
80 | print(f"Failed after {self.MAX_RETRY_NUM} retries.")
81 | return 'Error'
82 |
83 |
84 | def annotate_p(self, task):
85 | # the annotation logic here
86 |
87 | pass
88 |
89 |
90 | def annotate_parallel(self, tasks):
91 | print(f"Parsing in parallel, in total {self.args.api_parallel_num} threads.")
92 | results = []
93 | with ThreadPoolExecutor(self.args.api_parallel_num) as executor:
94 | for entry in tqdm(
95 | executor.map(self.annotate_p, tasks), total=len(tasks)
96 | ):
97 | results.append(entry)
98 | if None in results:
99 | raise ValueError("Some entries are not annotated due to errors in annotate_p, please inspect and retry.")
100 | return results
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.egg-info
2 | .env
3 | .vscode
4 | *.DS_Store
5 | resampled_output/
6 | misc/
7 | plot/
8 | mixeval_x/data/
9 | post_process/
10 | unit_tests/
11 | eval_scripts/
12 | /mix_eval/
13 |
14 | # Byte-compiled / optimized / DLL files
15 | __pycache__/
16 | *.py[cod]
17 | *$py.class
18 |
19 | # C extensions
20 | *.so
21 |
22 | # Distribution / packaging
23 | .Python
24 | build/
25 | develop-eggs/
26 | dist/
27 | downloads/
28 | eggs/
29 | .eggs/
30 | lib/
31 | lib64/
32 | parts/
33 | sdist/
34 | var/
35 | wheels/
36 | share/python-wheels/
37 | *.egg-info/
38 | .installed.cfg
39 | *.egg
40 | MANIFEST
41 |
42 | # PyInstaller
43 | # Usually these files are written by a python script from a template
44 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
45 | *.manifest
46 | *.spec
47 |
48 | # Installer logs
49 | pip-log.txt
50 | pip-delete-this-directory.txt
51 |
52 | # Unit test / coverage reports
53 | htmlcov/
54 | .tox/
55 | .nox/
56 | .coverage
57 | .coverage.*
58 | .cache
59 | nosetests.xml
60 | coverage.xml
61 | *.cover
62 | *.py,cover
63 | .hypothesis/
64 | .pytest_cache/
65 | cover/
66 |
67 | # Translations
68 | *.mo
69 | *.pot
70 |
71 | # Django stuff:
72 | *.log
73 | local_settings.py
74 | db.sqlite3
75 | db.sqlite3-journal
76 |
77 | # Flask stuff:
78 | instance/
79 | .webassets-cache
80 |
81 | # Scrapy stuff:
82 | .scrapy
83 |
84 | # Sphinx documentation
85 | docs/_build/
86 |
87 | # PyBuilder
88 | .pybuilder/
89 | target/
90 |
91 | # Jupyter Notebook
92 | .ipynb_checkpoints
93 |
94 | # IPython
95 | profile_default/
96 | ipython_config.py
97 |
98 | # pyenv
99 | # For a library or package, you might want to ignore these files since the code is
100 | # intended to run in multiple environments; otherwise, check them in:
101 | # .python-version
102 |
103 | # pipenv
104 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
105 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
106 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
107 | # install all needed dependencies.
108 | #Pipfile.lock
109 |
110 | # poetry
111 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
112 | # This is especially recommended for binary packages to ensure reproducibility, and is more
113 | # commonly ignored for libraries.
114 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
115 | #poetry.lock
116 |
117 | # pdm
118 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
119 | #pdm.lock
120 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
121 | # in version control.
122 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
123 | .pdm.toml
124 | .pdm-python
125 | .pdm-build/
126 |
127 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
128 | __pypackages__/
129 |
130 | # Celery stuff
131 | celerybeat-schedule
132 | celerybeat.pid
133 |
134 | # SageMath parsed files
135 | *.sage.py
136 |
137 | # Environments
138 | .env
139 | .venv
140 | env/
141 | venv/
142 | ENV/
143 | env.bak/
144 | venv.bak/
145 |
146 | # Spyder project settings
147 | .spyderproject
148 | .spyproject
149 |
150 | # Rope project settings
151 | .ropeproject
152 |
153 | # mkdocs documentation
154 | /site
155 |
156 | # mypy
157 | .mypy_cache/
158 | .dmypy.json
159 | dmypy.json
160 |
161 | # Pyre type checker
162 | .pyre/
163 |
164 | # pytype static type analyzer
165 | .pytype/
166 |
167 | # Cython debug symbols
168 | cython_debug/
169 |
170 | # PyCharm
171 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
172 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
173 | # and can be added to the global gitignore or merged into this file. For a more nuclear
174 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
175 | #.idea/
176 |
--------------------------------------------------------------------------------
/mixeval_x/utils/common_utils.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | # import torch
4 | import numpy as np
5 | import random
6 | import re
7 |
8 | import tiktoken
9 |
10 | def set_seed(seed: int = 42) -> None:
11 | np.random.seed(seed)
12 | random.seed(seed)
13 | # torch.manual_seed(seed)
14 | # torch.cuda.manual_seed(seed)
15 | # # When running on the CuDNN backend, two further options must be set
16 | # torch.backends.cudnn.deterministic = True
17 | # torch.backends.cudnn.benchmark = False
18 | # Set a fixed value for the hash seed
19 | os.environ["PYTHONHASHSEED"] = str(seed)
20 | # print(f"Random seed set as {seed}")
21 |
22 | def cache_status(args, status):
23 | status_path = os.path.join(
24 | args.output_dir,
25 | args.model_name,
26 | args.benchmark,
27 | args.version,
28 | f'status_{args.split}.json'
29 | )
30 | with open(status_path, 'w') as f:
31 | arg_dict = args.__dict__
32 | status_dict = {
33 | 'args': arg_dict,
34 | 'status': status
35 | }
36 | json.dump(status_dict, f, indent=4)
37 |
38 | def read_status(args):
39 | status_path = os.path.join(
40 | args.output_dir,
41 | args.model_name,
42 | args.benchmark,
43 | args.version,
44 | f'status_{args.split}.json'
45 | )
46 | with open(status_path, 'r') as f:
47 | return json.load(f)
48 |
49 | def dict_equal(dict1, dict2, keys_to_ignore=['resume', 'compute_score_from_judged_file', 'inference_only']):
50 | modified_dict1 = dict1.copy()
51 | modified_dict2 = dict2.copy()
52 | for key in keys_to_ignore:
53 | modified_dict1.pop(key, None)
54 | modified_dict2.pop(key, None)
55 | return modified_dict1 == modified_dict2
56 |
57 | def log_error(message, path):
58 | with open(path, 'a') as f:
59 | f.write(f"{message}\n")
60 |
61 | def get_gpu_memory(max_gpus=None):
62 | """Get available memory for each GPU."""
63 | import torch
64 |
65 | gpu_memory = []
66 | num_gpus = (
67 | torch.cuda.device_count()
68 | if max_gpus is None
69 | else min(max_gpus, torch.cuda.device_count())
70 | )
71 |
72 | for gpu_id in range(num_gpus):
73 | with torch.cuda.device(gpu_id):
74 | device = torch.cuda.current_device()
75 | gpu_properties = torch.cuda.get_device_properties(device)
76 | total_memory = gpu_properties.total_memory / (1024**3)
77 | allocated_memory = torch.cuda.memory_allocated() / (1024**3)
78 | available_memory = total_memory - allocated_memory
79 | gpu_memory.append(available_memory)
80 | return gpu_memory
81 |
82 | def is_meaningful(s):
83 | # Check for alphabetic characters (including other languages) and numeric digits
84 | if re.search(r'[\u0030-\u0039\u0041-\u005A\u0061-\u007A\u00C0-\u00FF\u0100-\u017F\u0180-\u024F\u0370-\u03FF\u0400-\u04FF\u0500-\u052F\u2C00-\u2C5F\uA640-\uA69F\uAC00-\uD7AF\u4E00-\u9FFF]', s):
85 | return True
86 | else:
87 | return False
88 |
89 | def extract_basemodel_response_3e(response):
90 | _response = response.split('\n\n\nQuestion')[0]
91 | if is_meaningful(_response): # non-trival response
92 | return _response
93 | else:
94 | return response
95 |
96 | def extract_basemodel_response_2e(response):
97 | _response = response.split('\n\n')[0]
98 | if is_meaningful(_response): # non-trival response
99 | return _response
100 | else:
101 | return response
102 |
103 | def num_tokens_from_message(message, model="gpt-3.5-turbo-0613"):
104 | try:
105 | encoding = tiktoken.encoding_for_model(model)
106 | except KeyError:
107 | encoding = tiktoken.get_encoding("cl100k_base")
108 | if model == "gpt-3.5-turbo-0613": # note: future models may deviate from this
109 | num_tokens = len(encoding.encode(message))
110 | return num_tokens
111 | else:
112 | raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.
113 | See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
114 |
115 |
116 |
117 |
--------------------------------------------------------------------------------
/mixeval_x/utils/judge_freeform_parser.py:
--------------------------------------------------------------------------------
1 | from tqdm import tqdm
2 | import time
3 | import random
4 | import os
5 | from dotenv import load_dotenv
6 |
7 | from concurrent.futures import ThreadPoolExecutor
8 | from openai import OpenAI
9 | from openai._exceptions import RateLimitError, BadRequestError
10 | from httpx import Timeout
11 |
12 | from mixeval_x.prompts.judge_prompts import (
13 | image2text_gpt_judge_for_closeended_freeform,
14 | audio2text_gpt_judge_for_closeended_freeform,
15 | video2text_gpt_judge_for_closeended_freeform,
16 | )
17 | from mixeval_x.utils.common_utils import extract_basemodel_response_3e
18 |
19 | ########################ChatGPT########################
20 | class ChatGPTJudgeCloseendFreeform:
21 | def __init__(self, args):
22 | self.args = args
23 | self.JUDGE = args.freeform_judge
24 | self.FIX_INTERVAL_SECOND = 0
25 | self.MAX_RETRY_NUM = 99
26 | self.MAX_NEW_TOKENS = 999
27 |
28 | load_dotenv()
29 | self.client = OpenAI(
30 | api_key=os.getenv('MODEL_PARSER_API'),
31 | timeout=Timeout(timeout=60.0, connect=5.0)
32 | )
33 |
34 | def format_prompts(self, inputs):
35 | prompt, gold_ans, response = inputs
36 | gold_ans = '; '.join([f" {ans}" for i, ans in enumerate(gold_ans)])
37 | if self.args.benchmark in ['image2text', 'image2text_hard']:
38 | formated = image2text_gpt_judge_for_closeended_freeform(prompt, gold_ans, response)
39 | elif self.args.benchmark in ['audio2text', 'audio2text_hard']:
40 | formated = audio2text_gpt_judge_for_closeended_freeform(prompt, gold_ans, response)
41 | elif self.args.benchmark in ['video2text', 'video2text_hard']:
42 | formated = video2text_gpt_judge_for_closeended_freeform(prompt, gold_ans, response)
43 | else:
44 | raise ValueError(f"Benchmark {self.args.benchmark} not supported in the closeended parser.")
45 | return formated
46 |
47 | def _GPT_decode(self, inputs):
48 | completion = self.client.chat.completions.create(
49 | model=self.JUDGE,
50 | response_format={ "type": 'text'},
51 | max_tokens=self.MAX_NEW_TOKENS,
52 | messages=self.format_prompts(inputs),
53 | )
54 | time.sleep(self.FIX_INTERVAL_SECOND)
55 | return completion
56 |
57 |
58 | def GPT_decode(self, inputs):
59 | delay = 1
60 | blocked = 0
61 | for i in range(self.MAX_RETRY_NUM):
62 | try:
63 | completion = self._GPT_decode(inputs)
64 | return completion
65 | except RateLimitError as e:
66 | exponential_base = 2
67 | delay *= exponential_base * (1 + random.random())
68 | print(f"RateLimitError, retrying after {round(delay, 2)} seconds, {i+1}-th retry...")
69 | print(e)
70 | time.sleep(delay)
71 | continue
72 | except BadRequestError as e:
73 | blocked += 1
74 | if blocked >= 10:
75 | print("Blocked too many times, skipping...")
76 | return 'Blocked'
77 | print(f"Input is blocked, retrying...")
78 | print(e)
79 | time.sleep(1)
80 | continue
81 | except Exception as e:
82 | print(f"Error in GPT_decode, retrying...")
83 | print(e)
84 | time.sleep(1)
85 | continue
86 | print(f"Failed after {self.MAX_RETRY_NUM} retries.")
87 | return 'Error'
88 |
89 |
90 | def annotate_p(self, task):
91 | prompt = task['prompt']
92 | gold_ans = task['target']
93 | response = task['response']
94 |
95 | if hasattr(self.args, 'model_type'):
96 | if self.args.model_type == 'BaseModel':
97 | response = extract_basemodel_response_3e(response)
98 | task['response_extracted'] = response
99 | elif self.args.model_type == 'ChatModel':
100 | pass
101 | elif self.args.model_type == 'APIModelBase':
102 | pass
103 | else:
104 | raise ValueError(f"Model type {self.args.model_type} not supported.")
105 |
106 | if not isinstance(gold_ans, list):
107 | print(f"Invalid target: {gold_ans}")
108 | return None
109 |
110 | inputs = (prompt, gold_ans, response)
111 |
112 | completion = self.GPT_decode(inputs)
113 | if completion == 'Error':
114 | print(f"Error in GPT_decode, the entry {task} will be retried later...")
115 | task['judge_response'] = None
116 | return task
117 | elif completion == 'Blocked':
118 | print(f"{task}: \n\nBlocked, the entry treated as bad entry.")
119 | task['judge_response'] = '[[0.0]]'
120 | return task
121 | annotation = completion.choices[0].message.content
122 | task['judge_response'] = annotation
123 | return task
124 |
125 |
126 | def annotate_parallel(self, tasks):
127 | print(f"Parsing in parallel, in total {self.args.api_parallel_num} threads.")
128 | results = []
129 | with ThreadPoolExecutor(self.args.api_parallel_num) as executor:
130 | for entry in tqdm(
131 | executor.map(self.annotate_p, tasks), total=len(tasks)
132 | ):
133 | results.append(entry)
134 | if None in results:
135 | raise ValueError("Some entries are not annotated due to errors in annotate_p, please inspect and retry.")
136 | return results
137 |
138 | ########################Claude 3########################
139 | class ClaudeJudgeCloseendFreeform:
140 | def __init__(self):
141 | raise NotImplementedError
142 |
143 |
144 | ########################Gemini########################
145 | class GeminiJudgeCloseendFreeform:
146 | def __init__(self):
147 | raise NotImplementedError
--------------------------------------------------------------------------------
/mixeval_x/utils/judge_multichoice_parser.py:
--------------------------------------------------------------------------------
1 | from tqdm import tqdm
2 | import time
3 | import random
4 | import os
5 | from dotenv import load_dotenv
6 |
7 | from concurrent.futures import ThreadPoolExecutor
8 | from openai import OpenAI
9 | from openai._exceptions import RateLimitError, BadRequestError
10 | from httpx import Timeout
11 |
12 | from mixeval_x.prompts.judge_prompts import (
13 | image2text_gpt_judge_for_closeended_multiplechoice,
14 | video2text_gpt_judge_for_closeended_multiplechoice,
15 | )
16 | from mixeval_x.utils.common_utils import extract_basemodel_response_2e
17 |
18 | ########################ChatGPT########################
19 | class ChatGPTJudgeCloseendMultichoice:
20 | def __init__(self, args):
21 | self.args = args
22 |
23 | self.JUDGE = args.multichoice_judge
24 | self.FIX_INTERVAL_SECOND = 0
25 | self.MAX_RETRY_NUM = 99
26 | self.MAX_NEW_TOKENS = 999
27 |
28 | load_dotenv()
29 | self.client = OpenAI(
30 | api_key=os.getenv('MODEL_PARSER_API'),
31 | timeout=Timeout(timeout=20.0, connect=5.0)
32 | )
33 |
34 | def format_prompts(self, inputs):
35 | prompt, options, response = inputs
36 | option_letters = [chr(ord("A") + i) for i in range(len(options))]
37 | options = "\n".join([f"{option_letter}. {option}" for option_letter, option in zip(option_letters, options)])
38 | if self.args.benchmark in ['image2text', 'image2text_hard']:
39 | formated = image2text_gpt_judge_for_closeended_multiplechoice(prompt, options, response)
40 | elif self.args.benchmark in ['video2text', 'video2text_hard']:
41 | formated = video2text_gpt_judge_for_closeended_multiplechoice(prompt, options, response)
42 | else:
43 | raise ValueError(f"Benchmark {self.args.benchmark} not supported in the closeended parser.")
44 | return formated
45 |
46 | def _GPT_decode(self, inputs):
47 | completion = self.client.chat.completions.create(
48 | model=self.JUDGE,
49 | response_format={ "type": 'text'},
50 | max_tokens=self.MAX_NEW_TOKENS,
51 | messages=self.format_prompts(inputs),
52 | )
53 | time.sleep(self.FIX_INTERVAL_SECOND)
54 | return completion
55 |
56 |
57 | def GPT_decode(self, inputs):
58 | delay = 1
59 | blocked = 0
60 | for i in range(self.MAX_RETRY_NUM):
61 | try:
62 | completion = self._GPT_decode(inputs)
63 | return completion
64 | except RateLimitError as e:
65 | exponential_base = 2
66 | delay *= exponential_base * (1 + random.random())
67 | print(f"RateLimitError, retrying after {round(delay, 2)} seconds, {i+1}-th retry...")
68 | print(e)
69 | time.sleep(delay)
70 | continue
71 | except BadRequestError as e:
72 | blocked += 1
73 | if blocked >= 10:
74 | print("Blocked too many times, skipping...")
75 | return 'Blocked'
76 | print(f"Input is blocked, retrying...")
77 | print(e)
78 | time.sleep(1)
79 | continue
80 | except Exception as e:
81 | print(f"Error in GPT_decode, retrying...")
82 | print(e)
83 | time.sleep(1)
84 | continue
85 | print(f"Failed after {self.MAX_RETRY_NUM} retries.")
86 | return 'Error'
87 |
88 |
89 | def annotate_p(self, task):
90 | prompt = task['prompt']
91 | options = task['options']
92 | response = task['response']
93 |
94 | if hasattr(self.args, 'model_type'):
95 | if self.args.model_type == 'BaseModel':
96 | response = extract_basemodel_response_2e(response)
97 | task['response_extracted'] = response
98 | elif self.args.model_type == 'ChatModel':
99 | pass
100 | elif self.args.model_type == 'APIModelBase':
101 | pass
102 | else:
103 | raise ValueError(f"Model type {self.args.model_type} not supported.")
104 |
105 | if not isinstance(options, list):
106 | print(f"Invalid target: {options}")
107 | return None
108 |
109 | inputs = (prompt, options, response)
110 |
111 | completion = self.GPT_decode(inputs)
112 | if completion == 'Error':
113 | print(f"Error in GPT_decode, the entry {task} will be retried later...")
114 | task['judge_response'] = None
115 | return task
116 | elif completion == 'Blocked':
117 | print(f"{task}: \n\nBlocked, the entry treated as bad entry. Randomly assigning a choice.")
118 | options = task['options']
119 | option_letters = [chr(ord("A") + i) for i in range(len(options))]
120 | task['judge_response'] = f"[[{random.choice(option_letters)}]]"
121 | return task
122 | annotation = completion.choices[0].message.content
123 | task['judge_response'] = annotation
124 | return task
125 |
126 |
127 | def annotate_parallel(self, tasks):
128 | print(f"Parsing in parallel, in total {self.args.api_parallel_num} threads.")
129 | results = []
130 | with ThreadPoolExecutor(self.args.api_parallel_num) as executor:
131 | for entry in tqdm(
132 | executor.map(self.annotate_p, tasks), total=len(tasks)
133 | ):
134 | results.append(entry)
135 | if None in results:
136 | raise ValueError("Some entries are not annotated due to errors in annotate_p, please inspect and retry.")
137 | return results
138 |
139 | ########################Claude 3########################
140 | class ClaudeJudgeCloseendMultichoice:
141 | def __init__(self):
142 | raise NotImplementedError
143 |
144 |
145 | ########################Gemini########################
146 | class GeminiJudgeCloseendMultichoice:
147 | def __init__(self):
148 | raise NotImplementedError
--------------------------------------------------------------------------------
/mixeval_x/utils/judge_text2action.py:
--------------------------------------------------------------------------------
1 | from tqdm import tqdm
2 | import time
3 | import random
4 | import os
5 | from dotenv import load_dotenv
6 | import base64
7 | import re
8 | import ast
9 |
10 | from concurrent.futures import ThreadPoolExecutor
11 | from openai import OpenAI
12 | from openai._exceptions import RateLimitError, BadRequestError
13 | from httpx import Timeout
14 |
15 | from mixeval_x.prompts.judge_prompts import (
16 | text2action_gpt_judge,
17 | )
18 |
19 | ########################ChatGPT########################
20 | class ChatGPTJudgeText2Action:
21 | def __init__(self, args):
22 | self.args = args
23 | self.JUDGE = args.judge_model
24 | self.FIX_INTERVAL_SECOND = 0
25 | self.MAX_RETRY_NUM = 999999
26 | self.MAX_NEW_TOKENS = 999
27 |
28 | self.FORMAT_MAXRETRY = 10
29 |
30 | load_dotenv()
31 | self.client = OpenAI(
32 | api_key=os.getenv('MODEL_PARSER_API'),
33 | timeout=Timeout(timeout=60.0, connect=5.0)
34 | )
35 |
36 | @staticmethod
37 | def get_score_from_judge(judge_response):
38 | """
39 | Get the score from the judge response.
40 | """
41 |         one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
42 |         one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")
43 |
44 | match = re.search(one_score_pattern, judge_response)
45 | if not match:
46 | match = re.search(one_score_pattern_backup, judge_response)
47 |
48 | if match:
49 | rating = ast.literal_eval(match.groups()[0])
50 | else:
51 | rating = -1
52 |
53 | return float(rating)
54 |
55 | def format_prompts(self, inputs, mode, history=[]):
56 |
57 | if mode == 'turn_1':
58 | formated = text2action_gpt_judge(*inputs)
59 | elif mode == 'append_message' and history:
60 | history.append(inputs)
61 | formated = history
62 | else:
63 | raise ValueError(f"Invalid mode: {mode}.")
64 |
65 | return formated
66 |
67 | def _GPT_decode(self, inputs, mode, history=[]):
68 | completion = self.client.chat.completions.create(
69 | model=self.JUDGE,
70 | response_format={ "type": 'text'},
71 | max_tokens=self.MAX_NEW_TOKENS,
72 | messages=self.format_prompts(inputs, mode, history),
73 | )
74 | time.sleep(self.FIX_INTERVAL_SECOND)
75 | return completion
76 |
77 | def GPT_decode(self, inputs, mode, history=[]):
78 | delay = 1
79 | blocked = 0
80 | for i in range(self.MAX_RETRY_NUM):
81 | try:
82 | completion = self._GPT_decode(inputs, mode, history)
83 | return completion
84 | except RateLimitError as e:
85 | exponential_base = 2
86 | delay *= exponential_base * (1 + random.random())
87 | print(f"RateLimitError, retrying after {round(delay, 2)} seconds, {i+1}-th retry...")
88 | print(e)
89 | time.sleep(delay)
90 | continue
91 | except BadRequestError as e:
92 | blocked += 1
93 | if blocked >= 10:
94 | print("Blocked too many times, skipping...")
95 | return 'Blocked'
96 | print(f"Input is blocked, retrying...")
97 | print(e)
98 | time.sleep(1)
99 | continue
100 | except Exception as e:
101 | print(f"Error in GPT_decode, retrying...")
102 | print(e)
103 | time.sleep(1)
104 | continue
105 | print(f"Failed after {self.MAX_RETRY_NUM} retries.")
106 | return 'Error'
107 |
108 | def annotate_p(self, task):
109 | task_description = task['task description']
110 | allowed_actions = task['allowed actions']
111 | visible_objects = task['visible objects']
112 | already_executed_steps = task['already executed steps']
113 | target = task['target']
114 | model_response = task['response']
115 |
116 | # first turn
117 | inputs = (task_description, allowed_actions, visible_objects, already_executed_steps, target, model_response)
118 |
119 | completion = self.GPT_decode(inputs, 'turn_1')
120 | if completion == 'Error':
121 | print(f"Error in GPT_decode, the entry treated as bad entry.")
122 | task['judge_response'] = '[[0.0]]'
123 | elif completion == 'Blocked':
124 | print(f"{task}: \n\nBlocked, the entry treated as bad entry.")
125 | task['judge_response'] = '[[0.0]]'
126 | else:
127 | annotation = completion.choices[0].message.content
128 | task['judge_response'] = annotation
129 | for i in range(self.FORMAT_MAXRETRY):
130 | if self.get_score_from_judge(annotation) != -1:
131 | task['judge_score'] = self.get_score_from_judge(annotation)
132 | break
133 | else:
134 | print(f"No score found in the response, retrying...")
135 | history = text2action_gpt_judge(*inputs)
136 | completion = self.GPT_decode({"role": "user", "content": "Continue your judgment and finish by outputting a final score with the above-mentioned format."}, 'append_message', history)
137 | annotation = completion.choices[0].message.content
138 |
139 | if self.get_score_from_judge(annotation) != -1:
140 | task['judge_score'] = self.get_score_from_judge(annotation)
141 | else:
142 | task['judge_score'] = None
143 | print(f"No score found in the response, please inspect and retry.")
144 |
145 |
146 | return task
147 |
148 | def annotate_parallel(self, tasks):
149 | print(f"Parsing in parallel, in total {self.args.api_parallel_num} threads.")
150 | results = []
151 | with ThreadPoolExecutor(self.args.api_parallel_num) as executor:
152 | for entry in tqdm(
153 | executor.map(self.annotate_p, tasks), total=len(tasks)
154 | ):
155 | results.append(entry)
156 | if None in results:
157 | raise ValueError("Some entries are not annotated due to errors in annotate_p, please inspect and retry.")
158 | return results
159 |
160 | ########################Claude 3########################
161 | class ClaudeJudgeText2Action:
162 | def __init__(self):
163 | raise NotImplementedError
164 |
165 |
166 | ########################Gemini########################
167 | class GeminiJudgeText2Action:
168 | def __init__(self):
169 | raise NotImplementedError
--------------------------------------------------------------------------------
/docs/model_output_formats.md:
--------------------------------------------------------------------------------
1 | # Model Output Formats
2 |
3 | As illustrated in the README.md, you can run our provided grading commands once you have prepared your model output files as specified in this document. For example, below is the grading command for the Image2Text benchmark:
4 |
5 | ```
6 | python -m mixeval_x.compute_metrics_mmu \
7 | --benchmark image2text \
8 | --model_response_dir THE_PATH_TO_MODEL_OUTPUT_FOLDER \
9 | --models_to_eval \
10 | gemini_1_5_pro \
11 | gemini_1_5_flash
12 | ```
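
For instance (assuming you run it from the repository root), the bundled example outputs under `mixeval_x/examples/model_response_dir` could be graded along these lines:

```
python -m mixeval_x.compute_metrics_mmu \
    --benchmark image2text \
    --model_response_dir mixeval_x/examples/model_response_dir \
    --models_to_eval \
    gemini_1.5
```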
13 |
14 | Here, `--model_response_dir` specifies the grandparent directory of your model directory, and `--benchmark` specifies the parent directory name of your model directory. Your model directory contains the model output files, and its name is the model name you pass to `--models_to_eval`. In general, your prepared model output files should be organized like this:
15 |
16 | ```
17 | └── model_response_dir
18 | │
19 | └── benchmark
20 | │
21 | └── your_model_name
22 | │
23 | ├── file1.jsonl
24 | │
25 | └── file2.jsonl
26 | │
27 | └── ...
28 | ```
29 |
30 | > The difference between model input (the benchmark data) and output (the model response file) is just the `"response"` field, i.e., each entry in your output file should keep all key-value pairs (including the 'id') of the input entry, with an additional `"response"` field representing the model's output.
31 |
32 | > The scores for each model will be saved in its model directory; the scores for all graded models will be saved in the benchmark directory.
33 |
34 | 🚨 **We show example model output structures and files in `mixeval_x/examples/`**.
35 |
36 | The exact output structures and data formats are detailed below.
37 |
38 |
39 |
40 | ## MMU Tasks (Image2Text, Video2Text, and Audio2Text)
41 |
42 | ### Structure
43 | ```
44 | └── model_response_dir
45 | │
46 | └── benchmark
47 | │
48 | └── your_model_name
49 | │
50 | ├── your_model_name_ff.jsonl
51 | │
52 | └── your_model_name_mp.jsonl
53 | ```
54 | > Note that the Audio2Text benchmark doesn't have a multiple-choice subset; therefore, the "`_mp`" file is not required.
55 |
56 | ### Output Data Format
57 |
58 | Free-form (`your_model_name_ff.jsonl`):
59 | ```
60 | [
61 | {
62 | "id": "1",
63 | "problem_type": "free-form",
64 | "image_id": "1.jpg",
65 | "prompt": "Where are the people that are standing looking at?",
66 | "target": ["away"],
67 | "benchmark_name": "GQA",
68 | "response": "The kite"
69 | },
70 | ...
71 | ]
72 | ```
73 |
74 | Multiple-choice (`your_model_name_mp.jsonl`):
75 | ```
76 | [
77 | {
78 | "problem_type": "single-choice",
79 | "image_id": "1.jpg",
80 | "prompt": "Is the man's face clearly visible in the image?",
81 | "options": ["Yes", "No"],
82 | "target": [1],
83 | "benchmark_name": "Q-Bench",
84 | "response": "B."
85 | },
86 | ...
87 | ]
88 | ```
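
To make this concrete, below is a minimal, unofficial sketch of how the free-form output file could be produced. The `generate_response` function, the commented `load_benchmark_entries` helper, and the directory/model names are placeholders for your own code; only the added `"response"` field and the file layout follow the specification above. The multiple-choice file (`your_model_name_mp.jsonl`) can be produced in exactly the same way.

```
import json
import os

def write_freeform_outputs(benchmark_entries, model_response_dir, benchmark, model_name, generate_response):
    """Write <model_name>_ff.jsonl under <model_response_dir>/<benchmark>/<model_name>/."""
    model_dir = os.path.join(model_response_dir, benchmark, model_name)
    os.makedirs(model_dir, exist_ok=True)
    out_path = os.path.join(model_dir, f"{model_name}_ff.jsonl")
    with open(out_path, "w") as f:
        for entry in benchmark_entries:
            out_entry = dict(entry)  # keep every key-value pair of the input entry, including 'id'
            out_entry["response"] = generate_response(entry)  # add the model's output as a string
            f.write(json.dumps(out_entry) + "\n")

# Hypothetical usage:
# entries = load_benchmark_entries("image2text", split="free-form")
# write_freeform_outputs(entries, "model_response_dir", "image2text", "your_model_name", your_model_fn)
```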
89 |
90 | ## Agent Tasks
91 |
92 | ### Text2Action
93 |
94 | Structure:
95 | ```
96 | └── model_response_dir
97 | │
98 | └── benchmark
99 | |
100 | └── your_model_name
101 | │
102 | └── your_model_name_t2a.jsonl
103 | ```
104 |
105 | Output Data Format (`your_model_name_t2a.jsonl`):
106 | ```
107 | [
108 | {
109 | "id": "1",
110 | "task description": "Take a photo of the Oso Nabukete cave entrance.",
111 | "allowed actions": "[Navigation], [Photograph], [InteractWithObject], [PickupObject], [PutObject], [UseObject], [ChargeDevice], [CheckBattery], [AdjustSettings], [Wait], [Speak], [Listen]",
112 | "visible objects": ", , , , , , , , , , , ", "already executed steps": "[Navigation] , [PickupObject] , [CheckBattery] ",
113 | "target": "[AdjustSettings] , [Photograph] ",
114 | "response": "[AdjustSettings] , [ChargeDevice] , [PickupObject] , [UseObject] , [Photograph] "
115 | },
116 | ...
117 | ]
118 | ```
119 |
120 | ### Image2Action
121 | Structure:
122 | ```
123 | └── model_response_dir
124 | │
125 | └── benchmark
126 | |
127 | └── your_model_name
128 | │
129 | └── your_model_name_i2a.jsonl
130 | ```
131 |
132 | Output Data Format (`your_model_name_i2a.jsonl`):
133 | ```
134 | [
135 | {
136 | "id": "1",
137 | "image_id": "7.jpg",
138 | "task description": "Paint an object with blue color to resemble a blueberry.",
139 | "allowed actions": "[PickupObject], [PutObject], [Navigation], [OpenContainer], [CloseContainer], [SelectPaintTool], [MixColor], [ApplyPaint], [CleanBrush], [InspectObject], [DryObject]",
140 | "already executed steps": "[Navigation] , [SelectPaintTool] , [MixColor] , [ApplyPaint] ",
141 | "target": "[InspectObject] , [DryObject] ",
142 | "response": "[InspectObject] , [DryObject] , [CleanBrush] "
143 | },
144 | ...
145 | ]
146 | ```
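
Before grading, it can help to sanity-check these files. Below is a small, unofficial helper (not part of the grading code); the required-key sets are simply read off the Text2Action and Image2Action examples above, and the file name in the commented usage line is illustrative.

```
import json

# Field names as shown in the examples above.
REQUIRED_KEYS = {
    "t2a": {"id", "task description", "allowed actions", "visible objects",
            "already executed steps", "target", "response"},
    "i2a": {"id", "image_id", "task description", "allowed actions",
            "already executed steps", "target", "response"},
}

def check_agent_outputs(jsonl_path, subset):
    """Print entries of a _t2a/_i2a jsonl file that are missing expected fields."""
    required = REQUIRED_KEYS[subset]
    with open(jsonl_path) as f:
        for line_no, line in enumerate(f, start=1):
            entry = json.loads(line)
            missing = required - entry.keys()
            if missing:
                print(f"Line {line_no}: missing fields {sorted(missing)}")

# check_agent_outputs("your_model_name_t2a.jsonl", "t2a")
```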
147 |
148 | ## MMG Tasks (only Text2Image is supported)
149 |
150 | ### Text2Image
151 | Structure:
152 | ```
153 | └── model_response_dir
154 | │
155 | └── benchmark
156 | |
157 | └── your_model_name
158 | │
159 | └── your_model_name_t2i.jsonl
160 | ```
161 |
162 | Output Data Format (`your_model_name_t2i.jsonl`):
163 | ```
164 | [
165 | {
166 | "id": "1",
167 | "first_turn_user_prompt": "Design a vibrant cityscape banner featuring the iconic Sydney Opera House and Harbour Bridge under a dazzling sunset, with the City of Sydney\u2019s logo prominently displayed in the foreground, ensuring it catches the eye against the vivid backdrop.",
168 | "first_turn_caption": "A vibrant cityscape banner featuring the iconic Sydney Opera House and Harbour Bridge under a dazzling sunset, with the City of Sydney\u2019s logo prominently displayed in the foreground, ensuring it catches the eye against the vivid backdrop.",
169 | "second_turn_user_prompt": " Remove the City of Sydney\u2019s logo.",
170 | "second_turn_caption": "A vibrant cityscape banner featuring the iconic Sydney Opera House and Harbour Bridge under a dazzling sunset, ensuring it catches the eye against the vivid backdrop.",
171 | "gen_1st_turn": "THE DIR PATH/1.jpg",
172 |     "gen_2nd_turn": "THE DIR PATH/2.jpg"
173 | },
174 | ...
175 | ]
176 | ```
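
Analogous to the other tasks, each entry keeps the benchmark fields (prompts and captions) and adds the paths of the two generated images. Below is a minimal, unofficial sketch of assembling the file, assuming the images have already been generated and saved under a naming scheme of your choice (the `<id>_1.jpg` / `<id>_2.jpg` pattern below is illustrative):

```
import json
import os

def write_t2i_outputs(benchmark_entries, model_response_dir, model_name, image_dir):
    """Record first- and second-turn image paths in <model_name>_t2i.jsonl."""
    model_dir = os.path.join(model_response_dir, "text2image", model_name)
    os.makedirs(model_dir, exist_ok=True)
    out_path = os.path.join(model_dir, f"{model_name}_t2i.jsonl")
    with open(out_path, "w") as f:
        for entry in benchmark_entries:
            out_entry = dict(entry)  # keep the prompt/caption fields and the 'id'
            # assumes images were saved as <image_dir>/<id>_1.jpg and <image_dir>/<id>_2.jpg
            out_entry["gen_1st_turn"] = os.path.join(image_dir, f"{entry['id']}_1.jpg")
            out_entry["gen_2nd_turn"] = os.path.join(image_dir, f"{entry['id']}_2.jpg")
            f.write(json.dumps(out_entry) + "\n")
```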
177 |
178 |
179 |
--------------------------------------------------------------------------------
/mixeval_x/utils/judge_image2action.py:
--------------------------------------------------------------------------------
1 | from tqdm import tqdm
2 | import time
3 | import random
4 | import os
5 | from dotenv import load_dotenv
6 | import base64
7 | import re
8 | import ast
9 |
10 | from concurrent.futures import ThreadPoolExecutor
11 | from openai import OpenAI
12 | from openai._exceptions import RateLimitError, BadRequestError
13 | from httpx import Timeout
14 |
15 | from mixeval_x.prompts.judge_prompts import (
16 | image2action_gpt_judge,
17 | )
18 |
19 | ########################ChatGPT########################
20 | class ChatGPTJudgeImage2Action:
21 | def __init__(self, args):
22 | self.args = args
23 | self.JUDGE = args.judge_model
24 | self.FIX_INTERVAL_SECOND = 0
25 | self.MAX_RETRY_NUM = 999999
26 | self.MAX_NEW_TOKENS = 999
27 |
28 | self.FORMAT_MAXRETRY = 10
29 |
30 | image2action_example_image_path = os.path.join(args.image2action_image_dir, 'example_1.jpg')
31 | if not os.path.exists(image2action_example_image_path):
32 | raise ValueError(f"Example image: 'example_1.jpg' for image2action grading not found in the image directory: {args.image2action_image_dir}. Please use the official image directory for Image2Action.")
33 | self.example_image = self.encode_image(image2action_example_image_path)
34 |
35 | load_dotenv()
36 | self.client = OpenAI(
37 | api_key=os.getenv('MODEL_PARSER_API'),
38 | timeout=Timeout(timeout=60.0, connect=5.0)
39 | )
40 |
41 | @staticmethod
42 | def encode_image(image_path):
43 | if not os.path.exists(image_path):
44 | image_path = image_path.replace('.jpg', '.JPG')
45 | with open(image_path, "rb") as image_file:
46 | return base64.b64encode(image_file.read()).decode('utf-8')
47 |
48 | @staticmethod
49 | def get_score_from_judge(judge_response):
50 | """
51 | Get the score from the judge response.
52 | """
53 |         one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
54 |         one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")
55 |
56 | match = re.search(one_score_pattern, judge_response)
57 | if not match:
58 | match = re.search(one_score_pattern_backup, judge_response)
59 |
60 | if match:
61 | rating = ast.literal_eval(match.groups()[0])
62 | else:
63 | rating = -1
64 |
65 | return float(rating)
66 |
67 | def format_prompts(self, inputs, mode, history=[]):
68 |
69 | if mode == 'turn_1':
70 | formated = image2action_gpt_judge(*inputs)
71 | elif mode == 'append_message' and history:
72 | history.append(inputs)
73 | formated = history
74 | else:
75 | raise ValueError(f"Invalid mode: {mode}.")
76 |
77 | return formated
78 |
79 | def _GPT_decode(self, inputs, mode, history=[]):
80 | completion = self.client.chat.completions.create(
81 | model=self.JUDGE,
82 | response_format={ "type": 'text'},
83 | max_tokens=self.MAX_NEW_TOKENS,
84 | messages=self.format_prompts(inputs, mode, history),
85 | )
86 | time.sleep(self.FIX_INTERVAL_SECOND)
87 | return completion
88 |
89 | def GPT_decode(self, inputs, mode, history=[]):
90 | delay = 1
91 | blocked = 0
92 | for i in range(self.MAX_RETRY_NUM):
93 | try:
94 | completion = self._GPT_decode(inputs, mode, history)
95 | return completion
96 | except RateLimitError as e:
97 | exponential_base = 2
98 | delay *= exponential_base * (1 + random.random())
99 | print(f"RateLimitError, retrying after {round(delay, 2)} seconds, {i+1}-th retry...")
100 | print(e)
101 | time.sleep(delay)
102 | continue
103 | except BadRequestError as e:
104 | blocked += 1
105 | if blocked >= 10:
106 | print("Blocked too many times, skipping...")
107 | return 'Blocked'
108 | print(f"Input is blocked, retrying...")
109 | print(e)
110 | time.sleep(1)
111 | continue
112 | except Exception as e:
113 | print(f"Error in GPT_decode, retrying...")
114 | print(e)
115 | time.sleep(1)
116 | continue
117 | print(f"Failed after {self.MAX_RETRY_NUM} retries.")
118 | return 'Error'
119 |
120 | def annotate_p(self, task):
121 | task_description = task['task description']
122 | allowed_actions = task['allowed actions']
123 | image_id = task['image_id']
124 | already_executed_steps = task['already executed steps']
125 | target = task['target']
126 | model_response = task['response']
127 | image = self.encode_image(os.path.join(self.args.image2action_image_dir, image_id))
128 |
129 | # first turn
130 | inputs = (self.example_image, task_description, allowed_actions, image, already_executed_steps, target, model_response)
131 |
132 | completion = self.GPT_decode(inputs, 'turn_1')
133 | if completion == 'Error':
134 | print(f"Error in GPT_decode, the entry treated as bad entry.")
135 | task['judge_response'] = '[[0.0]]'
136 | elif completion == 'Blocked':
137 | print(f"{task}: \n\nBlocked, the entry treated as bad entry.")
138 | task['judge_response'] = '[[0.0]]'
139 | else:
140 | annotation = completion.choices[0].message.content
141 | task['judge_response'] = annotation
142 | for i in range(self.FORMAT_MAXRETRY):
143 | if self.get_score_from_judge(annotation) != -1:
144 | task['judge_score'] = self.get_score_from_judge(annotation)
145 | break
146 | else:
147 | print(f"No score found in the response, retrying...")
148 | history = image2action_gpt_judge(*inputs)
149 | completion = self.GPT_decode({"role": "user", "content": "Continue your judgment and finish by outputting a final score with the above-mentioned format."}, 'append_message', history)
150 | annotation = completion.choices[0].message.content
151 |
152 | if self.get_score_from_judge(annotation) != -1:
153 | task['judge_score'] = self.get_score_from_judge(annotation)
154 | else:
155 | task['judge_score'] = None
156 | print(f"No score found in the response, please inspect and retry.")
157 |
158 |
159 | return task
160 |
161 | def annotate_parallel(self, tasks):
162 |         print(f"Parsing in parallel with {self.args.api_parallel_num} threads.")
163 | results = []
164 | with ThreadPoolExecutor(self.args.api_parallel_num) as executor:
165 | for entry in tqdm(
166 | executor.map(self.annotate_p, tasks), total=len(tasks)
167 | ):
168 | results.append(entry)
169 | if None in results:
170 | raise ValueError("Some entries are not annotated due to errors in annotate_p, please inspect and retry.")
171 | return results
172 |
173 | ########################Claude 3########################
174 | class ClaudeJudgeImage2Action:
175 | def __init__(self):
176 | raise NotImplementedError
177 |
178 |
179 | ########################Gemini########################
180 | class GeminiJudgeImage2Action:
181 | def __init__(self):
182 | raise NotImplementedError
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | 🚀 Project Page | 🤗 HF Dataset | 📜 arXiv | 🏆 Leaderboard | 📝 blog | 🤗 HF Paper | 𝕏 Twitter
5 |
6 |
7 | ---
8 | [](https://x.com/NiJinjie)
9 | 
10 | 
11 |
12 |
13 |
14 |
15 |
16 |
17 | *MixEval-X encompasses eight input-output modality combinations and can be further extended. Its data points reflect real-world task distributions. The last grid presents the scores of frontier organizations’ flagship models on MixEval-X, normalized to a 0-100 scale, with MMG tasks using win rates instead of Elo. Section C of the paper presents example data samples and model responses.*
18 |
19 |
20 |
21 | # ⚡ News
22 |
23 | [2025-02-11] MixEval-X has been accepted to ICLR 2025 as a Spotlight!
24 |
25 | [2024-12-10] [LMMs-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main) now supports running model evaluations on MixEval-X. You can run MixEval-X MMU tasks for your own models with one command [here](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/lmms_eval/tasks/mix_evals)! (feel free to use your own code if you think it's easier)
26 |
27 | [2024-10-20] MixEval-X is released! Check out the Paper and Leaderboard to learn more about this real-world any-to-any benchmark! 🌟
28 |
29 |
30 |
31 | # MixEval-X
32 |
33 | **MixEval-X** is the **first** any-to-any, real-world benchmark featuring **diverse input-output modalities**, **real-world task distributions**, **consistent high standards across modalities**, and **dynamism**. It achieves up to **0.98** correlation with arena-like multi-modal evaluations while being way more efficient.
34 |
35 | > See the [project page](https://mixeval-x.github.io/) and [paper](https://arxiv.org/abs/2410.13754) for more details.
36 |
37 |
38 |
39 | # This Repo
40 | This repo contains the grading code for MixEval-X. Once you have prepared your model outputs according to the required format, you will be able to get the final scores in just a few steps.
41 |
42 | The MixEval-X data can be downloaded from [Hugging Face](https://huggingface.co/datasets/MixEval/MixEval-X).
43 |
44 | **Feel free to use your own grading code, as long as it's fair.**
45 |
46 |
47 |
48 | # Quick Start
49 | **(Step 1)** Clone repo and setup the environment:
50 | ```
51 | git clone https://github.com/Psycoy/MixEval-X.git
52 | cd MixEval-X
53 | conda create -n MixEval-X python=3.11 --yes
54 | conda activate MixEval-X
55 | bash setup.sh
56 |
57 | # setup done
58 | ```
59 |
60 | **(Step 2)** Set up the OpenAI API key for the model parser. Create a `.env` file under the root dir (`MixEval-X/`) and add the line below to it:
61 | ```
62 | MODEL_PARSER_API=
63 | ```
64 | > The values in the [Leaderboard](https://mixeval-x.github.io/#leaderboard) use `GPT-3.5-Turbo-0125` as the default model parser for MMU tasks and `gpt-4o-2024-08-06` as the default model parser for agent tasks.
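
For reference, the judge/parser classes load this key with `python-dotenv`, roughly as in `mixeval_x/utils/judge_image2action.py` (a minimal sketch, nothing you need to run yourself):

```
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads MODEL_PARSER_API from the .env file at the repo root
client = OpenAI(api_key=os.getenv('MODEL_PARSER_API'))
```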
65 |
66 |
67 | **(Step 3)** Prepare the model outputs on your own side as specified [here](docs/model_output_formats.md), then use the commands below to compute the results. That's all!
68 |
69 | Image2Text
70 | ```
71 | # Normal Version
72 | python -m mixeval_x.compute_metrics_mmu \
73 | --benchmark image2text \
74 | --model_response_dir THE_PATH_TO_MODEL_OUTPUT_FOLDER \
75 | --models_to_eval \
76 | gemini_1_5_pro \
77 | gemini_1_5_flash
78 |
79 | # Hard Version
80 | python -m mixeval_x.compute_metrics_mmu \
81 | --benchmark image2text_hard \
82 | --model_response_dir THE_PATH_TO_MODEL_OUTPUT_FOLDER \
83 | --models_to_eval \
84 | gemini_1_5_pro \
85 | gemini_1_5_flash
86 | ```
87 |
88 | Video2Text
89 | ```
90 | # Normal Version
91 | python -m mixeval_x.compute_metrics_mmu \
92 | --benchmark video2text \
93 | --model_response_dir THE_PATH_TO_MODEL_OUTPUT_FOLDER \
94 | --models_to_eval \
95 | gemini_1_5_pro \
96 | gemini_1_5_flash
97 |
98 | # Hard Version
99 | python -m mixeval_x.compute_metrics_mmu \
100 | --benchmark video2text_hard \
101 | --model_response_dir THE_PATH_TO_MODEL_OUTPUT_FOLDER \
102 | --models_to_eval \
103 | gemini_1_5_pro \
104 | gemini_1_5_flash
105 | ```
106 |
107 | Audio2Text
108 | ```
109 | # Normal Version
110 | python -m mixeval_x.compute_metrics_mmu \
111 | --benchmark audio2text \
112 | --model_response_dir THE_PATH_TO_MODEL_OUTPUT_FOLDER \
113 | --models_to_eval \
114 | gemini_1_5_pro \
115 | gemini_1_5_flash
116 |
117 | # Hard Version
118 | python -m mixeval_x.compute_metrics_mmu \
119 | --benchmark audio2text_hard \
120 | --model_response_dir THE_PATH_TO_MODEL_OUTPUT_FOLDER \
121 | --models_to_eval \
122 | gemini_1_5_pro \
123 | gemini_1_5_flash
124 | ```
125 |
126 | Text2Action
127 | ```
128 | python -m mixeval_x.compute_metrics_mmg_agent \
129 | --benchmark text2action \
130 | --model_response_dir THE_PATH_TO_MODEL_OUTPUT_FOLDER \
131 | --judge_model "gpt-4o-2024-08-06" \
132 | --models_to_eval \
133 | gemini_1_5_pro \
134 | gemini_1_5_flash
135 | ```
136 |
137 | Image2Action
138 | ```
139 | python -m mixeval_x.compute_metrics_mmg_agent \
140 | --benchmark image2action \
141 | --judge_model "gpt-4o-2024-08-06" \
142 | --model_response_dir THE_PATH_TO_MODEL_OUTPUT_FOLDER \
143 | --image2action_image_dir THE_PATH_TO_IMAGE2ACTION_INPUT_IMAGES \
144 | --models_to_eval \
145 | gemini_1_5_pro \
146 | gemini_1_5_flash
147 | ```
148 |
149 | Text2Image
150 | ```
151 | python -m mixeval_x.compute_metrics_mmg_agent \
152 | --benchmark text2image \
153 | --judge_model "gpt-4o-2024-08-06" \
154 | --model_response_dir THE_PATH_TO_MODEL_OUTPUT_FOLDER \
155 | --models_to_eval \
156 | gemini_1_5_pro \
157 | gemini_1_5_flash
158 | ```
159 |
160 | > The MMG results (Text2Image, Text2Video, and Text2Audio) in the [Leaderboard](https://mixeval-x.github.io/#leaderboard) were graded by Amazon Mechanical Turk workers. Text2Image also supports model parsing (see Section 4.2 of the paper). However, Text2Video and Text2Audio lack capable model judges, so their grading is not implemented. **You should hire human evaluators to grade these two subsets.**
161 |
162 |
163 |
164 | # Why Use MixEval-X Benchmarks?
165 |
166 | 🥇 It extends all the benefits of MixEval to multi-modal evaluations, including a comprehensive and less biased query distribution; fair grading (except for open-ended tasks); dynamism; accurate model ranking; fast, cost-effective, and reproducible execution; and a challenging nature.
167 |
168 | 🥇 It establishes unified, high standards across modalities and communities. For single-modality models, it ensures their evaluation keeps up with state-of-the-art standards; for multi-modality models, it ensures consistent, high-standard evaluations across modalities, preventing any single modality from becoming a bottleneck.
169 |
170 | 🥇 Beyond model evaluation, MixEval-X benchmarks different organizations (as shown in the first Figure) with balanced dimensions (modalities), unlocking a new level of evaluation.
171 |
172 |
173 |
174 | # Acknowledgement
175 |
176 | We thank Graham Neubig, Yuntian Deng, Jason Phang, Piotr Nawrot, Luca Soldaini, Guanzhi Wang, Qinghong Lin for insightful discussions and pointers; we thank Fanyi Pu for his great effort in integrating the evaluation protocols.
177 |
178 |
179 |
180 | # 🦾 Contribute
181 |
182 | Feel free to hit the ⭐**star** button or 🦾**contribute**! We review new issues and PRs regularly and will acknowledge your contributions!
183 |
184 |
185 |
186 | # 📑 Citation
187 |
188 | If you find this repository useful, please consider 📑**citing**:
189 |
190 | ```
191 | @article{ni2024mixevalx,
192 | title={MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures},
193 | author={Ni, Jinjie and Song, Yifan and Ghosal, Deepanway and Li, Bo and Zhang, David Junhao and Yue, Xiang and Xue, Fuzhao and Zheng, Zian and Zhang, Kaichen and Shah, Mahir and Jain, Kabir and You, Yang and Shieh, Michael},
194 | journal={arXiv preprint arXiv:2410.13754},
195 | year={2024}
196 | }
197 |
198 | @article{ni2024mixeval,
199 | title={MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures},
200 | author={Ni, Jinjie and Xue, Fuzhao and Yue, Xiang and Deng, Yuntian and Shah, Mahir and Jain, Kabir and Neubig, Graham and You, Yang},
201 | journal={arXiv preprint arXiv:2406.06565},
202 | year={2024}
203 | }
204 | ```
--------------------------------------------------------------------------------
/mixeval_x/examples/model_response_dir/text2action/gemini_15_pro/gemini_15_pro_t2a.jsonl:
--------------------------------------------------------------------------------
1 | {"task description": "Take a photo of the Oso Nabukete cave entrance.", "allowed actions": "[Navigation], [Photograph], [InteractWithObject], [PickupObject], [PutObject], [UseObject], [ChargeDevice], [CheckBattery], [AdjustSettings], [Wait], [Speak], [Listen]", "visible objects": ", , , , , , , , , , , ", "already executed steps": "[Navigation] , [PickupObject] , [CheckBattery] ", "target": "[AdjustSettings] , [Photograph] ", "formated_input": "You are a real-world agent, and you will plan action-object sequences for the real-world tasks. You will be provided with 'Task Description', 'Allowed Actions', 'Visible Objects', and 'Already Executed Action-Object Sequences'. The 'Task Description' is a user instruction that instructs you to complete the task. The 'Allowed Actions' is a list of actions that are allowed to be used by you to complete the task. The 'visible objects' is a list of objects that are assumed to be visible to you when you are completing the task. The 'Already Executed Action-Object Sequences' is a list of action-object sequences that are assumed to have been completed by you at the moment. You need to plan the remaining action-object sequences to complete the task.\n\nBelow is a simplified example:\n\n**Start of Example**\nTask Description: Put a heated egg in the sink.\nAllowed Actions: [OpenObject], [CloseObject], [PickupObject], [PutObject], [ToggleObjectOn], [ToggleObjectOff], [SliceObject], [Navigation]\nVisible Objects: , , , , , , , , , \nAlready Executed Action-Object Sequences: [Navigation] , [OpenObject] , [PickupObject] , [CloseObject] , [Navigation] , [PutObject] \nYour Planning: [ToggleObjectOn] , [ToggleObjectOff] , [PickupObject] , [Navigation] , [PutObject] \n**End of Example**\n\nWith the above description and example, plan the remaining action-object sequences for the below task:\n\nTask Description: Take a photo of the Oso Nabukete cave entrance.\nAllowed Actions: [Navigation], [Photograph], [InteractWithObject], [PickupObject], [PutObject], [UseObject], [ChargeDevice], [CheckBattery], [AdjustSettings], [Wait], [Speak], [Listen]\nVisible Objects: , , , , , , , , , , , \nAlready Executed Action-Object Sequences: [Navigation] , [PickupObject] , [CheckBattery] \nYour Planning: \n", "id": "1", "response": "Your Planning: \n[AdjustSettings] , [UseObject] \n"}
2 | {"task description": "Disable Google Analytics tracking on the website.", "allowed actions": "[Navigate], [Click], [ReadText], [Scroll], [OpenMenu], [CloseMenu], [ToggleSwitch], [ConfirmAction], [CancelAction], [OpenPrivacySettings], [CloseApplication]", "visible objects": ", , , , , , , , , , , ", "already executed steps": "[Navigate] , [Navigate] , [Scroll] , [Click] , [Navigate] , [ToggleSwitch] ", "target": "[Click] , [ConfirmAction] ", "formated_input": "You are a real-world agent, and you will plan action-object sequences for the real-world tasks. You will be provided with 'Task Description', 'Allowed Actions', 'Visible Objects', and 'Already Executed Action-Object Sequences'. The 'Task Description' is a user instruction that instructs you to complete the task. The 'Allowed Actions' is a list of actions that are allowed to be used by you to complete the task. The 'visible objects' is a list of objects that are assumed to be visible to you when you are completing the task. The 'Already Executed Action-Object Sequences' is a list of action-object sequences that are assumed to have been completed by you at the moment. You need to plan the remaining action-object sequences to complete the task.\n\nBelow is a simplified example:\n\n**Start of Example**\nTask Description: Put a heated egg in the sink.\nAllowed Actions: [OpenObject], [CloseObject], [PickupObject], [PutObject], [ToggleObjectOn], [ToggleObjectOff], [SliceObject], [Navigation]\nVisible Objects: , , , , , , , , , \nAlready Executed Action-Object Sequences: [Navigation] , [OpenObject] , [PickupObject] , [CloseObject] , [Navigation] , [PutObject] \nYour Planning: [ToggleObjectOn] , [ToggleObjectOff] , [PickupObject] , [Navigation] , [PutObject] \n**End of Example**\n\nWith the above description and example, plan the remaining action-object sequences for the below task:\n\nTask Description: Disable Google Analytics tracking on the website.\nAllowed Actions: [Navigate], [Click], [ReadText], [Scroll], [OpenMenu], [CloseMenu], [ToggleSwitch], [ConfirmAction], [CancelAction], [OpenPrivacySettings], [CloseApplication]\nVisible Objects: , , , , , , , , , , , \nAlready Executed Action-Object Sequences: [Navigate] , [Navigate] , [Scroll] , [Click] , [Navigate] , [ToggleSwitch] \nYour Planning: \n", "id": "2", "response": "Your Planning: \n\n1. **[Click] ** - This assumes there's a button to save the changes made to the privacy settings. \n2. **[ConfirmAction] ** (If applicable) - Some websites might have a confirmation popup to ensure you want to disable tracking. This step addresses that possibility. \n3. **[Click] ** (If applicable) - Closes any remaining confirmation popup. \n\n**Important Considerations:**\n\n* **Website Specifics:** Website layouts and procedures for managing privacy settings vary significantly. This plan provides a general framework but may need adjustments based on the actual website.\n* **Missing Information:** The presence and naming of buttons like \"save changes\" or confirmation popups are assumptions. Real-world execution would require adapting to the specific elements present on the webpage. \n"}
3 | {"task description": "Navigate to Settings & Privacy on the phone to delete already stored data using the Off-Facebook Activity feature.", "allowed actions": "[Navigation], [Click], [Swipe], [Scroll], [ReadText], [OpenApplication], [CloseApplication], [OpenMenu], [SelectOption], [ConfirmAction], [ToggleSwitch]", "visible objects": ", , , , , , , , , , , , , , ", "already executed steps": "[Navigation] , [Swipe] , [Click] , [Scroll] , [Click] , [Click] ", "target": "[Click] , [Click] , [CloseApplication] ", "formated_input": "You are a real-world agent, and you will plan action-object sequences for the real-world tasks. You will be provided with 'Task Description', 'Allowed Actions', 'Visible Objects', and 'Already Executed Action-Object Sequences'. The 'Task Description' is a user instruction that instructs you to complete the task. The 'Allowed Actions' is a list of actions that are allowed to be used by you to complete the task. The 'visible objects' is a list of objects that are assumed to be visible to you when you are completing the task. The 'Already Executed Action-Object Sequences' is a list of action-object sequences that are assumed to have been completed by you at the moment. You need to plan the remaining action-object sequences to complete the task.\n\nBelow is a simplified example:\n\n**Start of Example**\nTask Description: Put a heated egg in the sink.\nAllowed Actions: [OpenObject], [CloseObject], [PickupObject], [PutObject], [ToggleObjectOn], [ToggleObjectOff], [SliceObject], [Navigation]\nVisible Objects: , , , , , , , , , \nAlready Executed Action-Object Sequences: [Navigation] , [OpenObject] , [PickupObject] , [CloseObject] , [Navigation] , [PutObject] \nYour Planning: [ToggleObjectOn] , [ToggleObjectOff] , [PickupObject] , [Navigation] , [PutObject] \n**End of Example**\n\nWith the above description and example, plan the remaining action-object sequences for the below task:\n\nTask Description: Navigate to Settings & Privacy on the phone to delete already stored data using the Off-Facebook Activity feature.\nAllowed Actions: [Navigation], [Click], [Swipe], [Scroll], [ReadText], [OpenApplication], [CloseApplication], [OpenMenu], [SelectOption], [ConfirmAction], [ToggleSwitch]\nVisible Objects: , , , , , , , , , , , , , , \nAlready Executed Action-Object Sequences: [Navigation] , [Swipe] , [Click] , [Scroll] , [Click] , [Click] \nYour Planning: \n", "id": "3", "response": "Here's a possible plan, assuming the interface is structured in a common way:\n\n1. **[Scroll]** (Search for options related to stored data) \n2. **[Click]** (Or a similarly named button for deleting data)\n3. **[ReadText]** (Make sure it's the right action)\n4. **[Click]** (Proceed with deleting the data)\n5. **[Click]** (Optional: Navigate back to the previous menu) \n\n**Important Considerations:**\n\n* **Interface Variations:** The exact names of buttons and options can vary significantly between phone models and operating system versions. Adapt the plan based on what you see on the actual phone.\n* **Multiple Confirmation Prompts:** There might be multiple confirmation steps involved. \n* **Data Type:** The task doesn't specify what type of data needs to be deleted. The steps might need adjustments depending on the specific data to be removed. \n"}
--------------------------------------------------------------------------------
/mixeval_x/compute_metrics_mmg_agent.py:
--------------------------------------------------------------------------------
1 | import json
2 | import argparse
3 | import os
4 | from tqdm import tqdm
5 | import time
6 | import warnings
7 | warnings.simplefilter("ignore", category=DeprecationWarning)
8 | warnings.simplefilter("ignore", category=FutureWarning)
9 |
10 | from mixeval_x.utils.common_utils import set_seed
11 | from mixeval_x.utils.metric_utils import (
12 | eval_text2image,
13 | eval_text2action,
14 | eval_image2action
15 | )
16 |
17 | def parse_args():
18 | parser = argparse.ArgumentParser()
19 | parser.add_argument(
20 | "--benchmark",
21 | type=str,
22 | choices=["text2image",
23 | "text2video",
24 | "text2audio",
25 | "text2action",
26 | "image2action"
27 | ],
28 | required=True,
29 | help="Benchmark to evaluate."
30 | )
31 | parser.add_argument(
32 | "--model_response_dir",
33 | type=str,
34 | default="mixeval_x/data/image2text/close_ended/model_responses/",
35 | help="Path to model responses."
36 | )
37 | parser.add_argument(
38 | "--image2action_image_dir",
39 | type=str,
40 | default="mixeval_x/data/image2action/image2action_closeended/images/",
41 | help="Path to the images of image2action data."
42 | )
43 | parser.add_argument(
44 | "--judge_model",
45 | type=str,
46 | default="gpt-4o",
47 |         help="Judge model for text2image, text2action, and image2action score computation."
48 | )
49 | parser.add_argument(
50 | "--models_to_eval",
51 | nargs='+',
52 | default=None,
53 | help="Models to evaluate."
54 | )
55 | parser.add_argument(
56 | "--models_to_ignore",
57 | nargs='+',
58 | default=None,
59 |         help="Models that will be ignored."
60 | )
61 | parser.add_argument(
62 | "--api_parallel_num",
63 | type=int,
64 | default=100,
65 |         help="Number of parallel threads for calling the model parser API when using model parsing. "
66 |         "If you hit rate limit errors frequently, try reducing this number."
67 | )
68 | parser.add_argument(
69 | "--compute_score_from_judged_file",
70 | action="store_true",
71 |         help="Whether to compute the score directly from the judged file. "
72 |         "This saves budget for models that have been judged before. "
73 |         "It also makes it easy to do analyses without running judgements again."
74 | )
75 | parser.add_argument(
76 | "--verbose",
77 | action="store_true",
78 | help="Print verbose information."
79 | )
80 | return parser.parse_args()
81 |
82 |
83 | def compute_metric_text2image(args):
84 |
85 | score_dict = {}
86 | if args.models_to_eval is not None:
87 | models = args.models_to_eval
88 |
89 | else:
90 | if os.path.exists(args.model_response_dir):
91 | models = os.listdir(args.model_response_dir)
92 |
93 | for model in models:
94 | print(f"Parsing model: {model}")
95 |
96 | if args.models_to_ignore is not None and model in args.models_to_ignore:
97 | print(f"Model {model} is ignored for text2image.")
98 | continue
99 |
100 | if args.compute_score_from_judged_file:
101 | results = []
102 | judge_file = os.path.join(
103 | args.model_response_dir,
104 | args.benchmark,
105 | model,
106 | f"judge_results_t2i_model_judge_{args.judge_model}.jsonl"
107 | )
108 | with open(judge_file, "r") as f:
109 | for line in f:
110 | judge_dict = json.loads(line)
111 | results.append(judge_dict)
112 | else:
113 | ans_file = os.path.join(
114 | args.model_response_dir,
115 | args.benchmark,
116 | model,
117 | f"{model}_t2i.jsonl"
118 | )
119 | tasks = []
120 | with open(ans_file, "r") as f:
121 | for line in f:
122 | ans_dict = json.loads(line)
123 | tasks.append(ans_dict)
124 | results = eval_text2image(args, tasks)
125 |
126 | score_dict_model = {}
127 | for judge_dict in results:
128 | judge_score = judge_dict["judge_score_1st_turn"]
129 | if '1st_turn' not in score_dict_model:
130 | score_dict_model['1st_turn'] = []
131 | score_dict_model['1st_turn'].append(judge_score)
132 | judge_score = judge_dict["judge_score_2nd_turn"]
133 | if '2nd_turn' not in score_dict_model:
134 | score_dict_model['2nd_turn'] = []
135 | score_dict_model['2nd_turn'].append(judge_score)
136 |
137 | for key, value in score_dict_model.items():
138 | score_dict_model[key] = round(sum(value)/len(value), 3)
139 | score_dict[model] = score_dict_model
140 |
141 | with open(os.path.join(
142 | args.model_response_dir,
143 | args.benchmark,
144 | model,
145 | f"judge_results_t2i_model_judge_{args.judge_model}.jsonl"), "w") as f:
146 | for case in results:
147 | f.write(json.dumps(case) + "\n")
148 |
149 | if not args.compute_score_from_judged_file:
150 | print("Sleep 20 seconds to avoid ratelimit error ... ")
151 | time.sleep(20)
152 |
153 | if args.verbose:
154 | print(f"[Text2Image]")
155 | for model, score in score_dict.items():
156 | print(f"{model}: {json.dumps(score, indent=4)}")
157 |
158 | # write score_dict
159 | score_dict = dict(sorted(score_dict.items(), key=lambda x: x[1]['1st_turn'], reverse=True))
160 | with open(os.path.join(
161 | args.model_response_dir,
162 | args.benchmark,
163 | "score_t2i.json"), "w") as f:
164 | f.write(json.dumps(score_dict, indent=4) + "\n")
165 |
166 | return score_dict
167 |
168 | def compute_metric_text2action(args):
169 |
170 | score_dict = {}
171 | if args.models_to_eval is not None:
172 | models = args.models_to_eval
173 |
174 | else:
175 | if os.path.exists(args.model_response_dir):
176 | models = os.listdir(args.model_response_dir)
177 |
178 | for model in models:
179 | print(f"Parsing model: {model}")
180 |
181 | if args.models_to_ignore is not None and model in args.models_to_ignore:
182 | print(f"Model {model} is ignored for text2action.")
183 | continue
184 |
185 | if args.compute_score_from_judged_file:
186 | results = []
187 | judge_file = os.path.join(
188 | args.model_response_dir,
189 | args.benchmark,
190 | model,
191 | f"judge_results_t2a_model_judge_{args.judge_model}.jsonl"
192 | )
193 | with open(judge_file, "r") as f:
194 | for line in f:
195 | judge_dict = json.loads(line)
196 | results.append(judge_dict)
197 | else:
198 | ans_file = os.path.join(
199 | args.model_response_dir,
200 | args.benchmark,
201 | model,
202 | f"{model}_t2a.jsonl"
203 | )
204 | tasks = []
205 | with open(ans_file, "r") as f:
206 | for line in f:
207 | ans_dict = json.loads(line)
208 | tasks.append(ans_dict)
209 | results = eval_text2action(args, tasks)
210 |
211 | score_dict_model = {}
212 | for judge_dict in results:
213 | judge_score = judge_dict["judge_score"]
214 | if 'overall' not in score_dict_model:
215 | score_dict_model['overall'] = []
216 | score_dict_model['overall'].append(judge_score)
217 |
218 | for key, value in score_dict_model.items():
219 | score_dict_model[key] = round(sum(value)/len(value), 3)
220 | score_dict[model] = score_dict_model
221 |
222 | with open(os.path.join(
223 | args.model_response_dir,
224 | args.benchmark,
225 | model,
226 | f"judge_results_t2a_model_judge_{args.judge_model}.jsonl"), "w") as f:
227 | for case in results:
228 | f.write(json.dumps(case) + "\n")
229 |
230 | if not args.compute_score_from_judged_file:
231 | print("Sleep 20 seconds to avoid ratelimit error ... ")
232 | time.sleep(20)
233 |
234 | if args.verbose:
235 | print(f"[Text2Action]")
236 | for model, score in score_dict.items():
237 | print(f"{model}: {json.dumps(score, indent=4)}")
238 |
239 | # sort and write score_dict
240 | score_dict = dict(sorted(score_dict.items(), key=lambda x: x[1]['overall'], reverse=True))
241 | with open(os.path.join(
242 | args.model_response_dir,
243 | args.benchmark,
244 | "score_t2a.json"), "w") as f:
245 | f.write(json.dumps(score_dict, indent=4) + "\n")
246 |
247 | return score_dict
248 |
249 |
250 | def compute_metric_image2action(args):
251 |
252 | score_dict = {}
253 | if args.models_to_eval is not None:
254 | models = args.models_to_eval
255 |
256 | else:
257 | if os.path.exists(args.model_response_dir):
258 | models = os.listdir(args.model_response_dir)
259 |
260 | for model in models:
261 | print(f"Parsing model: {model}")
262 |
263 | if args.models_to_ignore is not None and model in args.models_to_ignore:
264 | print(f"Model {model} is ignored for image2action.")
265 | continue
266 |
267 | if args.compute_score_from_judged_file:
268 | results = []
269 | judge_file = os.path.join(
270 | args.model_response_dir,
271 | args.benchmark,
272 | model,
273 | f"judge_results_i2a_model_judge_{args.judge_model}.jsonl"
274 | )
275 | with open(judge_file, "r") as f:
276 | for line in f:
277 | judge_dict = json.loads(line)
278 | results.append(judge_dict)
279 | else:
280 | ans_file = os.path.join(
281 | args.model_response_dir,
282 | args.benchmark,
283 | model,
284 | f"{model}_i2a.jsonl"
285 | )
286 | tasks = []
287 | with open(ans_file, "r") as f:
288 | for line in f:
289 | ans_dict = json.loads(line)
290 | tasks.append(ans_dict)
291 | results = eval_image2action(args, tasks)
292 |
293 | score_dict_model = {}
294 | for judge_dict in results:
295 | judge_score = judge_dict["judge_score"]
296 | if 'overall' not in score_dict_model:
297 | score_dict_model['overall'] = []
298 | score_dict_model['overall'].append(judge_score)
299 |
300 | for key, value in score_dict_model.items():
301 | score_dict_model[key] = round(sum(value)/len(value), 3)
302 | score_dict[model] = score_dict_model
303 |
304 | with open(os.path.join(
305 | args.model_response_dir,
306 | args.benchmark,
307 | model,
308 | f"judge_results_i2a_model_judge_{args.judge_model}.jsonl"), "w") as f:
309 | for case in results:
310 | f.write(json.dumps(case) + "\n")
311 |
312 | if not args.compute_score_from_judged_file:
313 | print("Sleep 20 seconds to avoid ratelimit error ... ")
314 | time.sleep(20)
315 |
316 | if args.verbose:
317 | print(f"[Image2Action]")
318 | for model, score in score_dict.items():
319 | print(f"{model}: {json.dumps(score, indent=4)}")
320 |
321 | # sort and write score_dict
322 | score_dict = dict(sorted(score_dict.items(), key=lambda x: x[1]['overall'], reverse=True))
323 | with open(os.path.join(
324 | args.model_response_dir,
325 | args.benchmark,
326 | "score_i2a.json"), "w") as f:
327 | f.write(json.dumps(score_dict, indent=4) + "\n")
328 |
329 | return score_dict
330 |
331 | def compute_metric(args):
332 | if args.benchmark == "text2image":
333 | compute_metric_text2image(args)
334 | elif args.benchmark == "text2video":
335 | raise NotImplementedError("Benchmark not implemented yet.")
336 | elif args.benchmark == "text2audio":
337 | raise NotImplementedError("Benchmark not implemented yet.")
338 | elif args.benchmark == "text2action":
339 | compute_metric_text2action(args)
340 | elif args.benchmark == "image2action":
341 | compute_metric_image2action(args)
342 | else:
343 | raise ValueError(f"Invalid benchmark: {args.benchmark}, please choose from "
344 | "text2image, text2video, text2audio, text2action, image2action.")
345 |
346 |
347 | if __name__ == '__main__':
348 | set_seed()
349 | args = parse_args()
350 | compute_metric(args)
--------------------------------------------------------------------------------
/mixeval_x/compute_metrics_mmu.py:
--------------------------------------------------------------------------------
1 | import json
2 | import argparse
3 | import os
4 | from tqdm import tqdm
5 | import time
6 | import warnings
7 | warnings.simplefilter("ignore", category=DeprecationWarning)
8 | warnings.simplefilter("ignore", category=FutureWarning)
9 |
10 | from mixeval_x.utils.common_utils import set_seed
11 | from mixeval_x.utils.metric_utils import (
12 | parse_multi_choice_response_rule,
13 | parse_multi_choice_response_model,
14 | eval_multi_choice,
15 | eval_freeform_model,
16 | parse_freeform_response_rule,
17 | eval_freeform_rule,
18 | )
19 |
20 | def parse_args():
21 | parser = argparse.ArgumentParser()
22 | parser.add_argument(
23 | "--benchmark",
24 | type=str,
25 | choices=["image2text",
26 | "video2text",
27 | "audio2text",
28 | "image2text_hard",
29 | "video2text_hard",
30 | "audio2text_hard"
31 | ],
32 | required=True,
33 | help="Benchmark to evaluate."
34 | )
35 | parser.add_argument(
36 | "--model_response_dir",
37 | type=str,
38 | default="mixeval_x/data/image2text/close_ended/model_responses/",
39 | help="Path to model responses."
40 | )
41 | parser.add_argument(
42 | "--multichoice_judge",
43 | type=str,
44 | default="gpt-3.5-turbo-0125",
45 | help="Judge model for multiple-choice score computation."
46 | )
47 | parser.add_argument(
48 | "--freeform_judge",
49 | type=str,
50 | default="gpt-3.5-turbo-0125",
51 | help="Judge model for freeform score computation."
52 | )
53 | parser.add_argument(
54 | "--models_to_eval",
55 | nargs='+',
56 | default=None,
57 | help="Models to evaluate."
58 | )
59 | parser.add_argument(
60 | "--models_to_ignore_ff",
61 | nargs='+',
62 | default=None,
63 | help="Models that would be ignored for free-form."
64 | )
65 | parser.add_argument(
66 | "--models_to_ignore_mp",
67 | nargs='+',
68 | default=None,
69 | help="Models that would be ignored for multiple-choice."
70 | )
71 | parser.add_argument(
72 | "--free_form_parser",
73 | type=str,
74 | default="model",
75 | choices=["model", "rule"],
76 | help="Parser for freeform responses, either model parser or rule-based parser.")
77 | parser.add_argument(
78 | "--multi_choice_parser",
79 | type=str,
80 | default="model",
81 | choices=["model", "rule"],
82 | help="Parser for multiple-choice responses, either model parser or rule-based parser."
83 | )
84 | parser.add_argument(
85 | "--api_parallel_num",
86 | type=int,
87 | default=100,
88 |         help="Number of parallel threads for calling the model parser API when using model parsing. "
89 |         "If you hit rate limit errors frequently, try reducing this number."
90 | )
91 | parser.add_argument(
92 | "--compute_score_from_judged_file",
93 | action="store_true",
94 |         help="Whether to compute the score directly from the judged file. "
95 |         "This saves budget for models that have been judged before. "
96 |         "It also makes it easy to do analyses without running judgements again."
97 | )
98 | parser.add_argument(
99 | "--verbose",
100 | action="store_true",
101 | help="Print verbose information."
102 | )
103 | return parser.parse_args()
104 |
105 |
106 | def compute_metric_closeended_freeform_modelparse(args):
107 |
108 | score_dict = {}
109 | if args.models_to_eval is not None:
110 | models = args.models_to_eval
111 |
112 | else:
113 | if os.path.exists(args.model_response_dir):
114 | models = os.listdir(args.model_response_dir)
115 |
116 | for model in models:
117 | print(f"Parsing model: {model}")
118 |
119 | if args.models_to_ignore_ff is not None and model in args.models_to_ignore_ff:
120 | print(f"Model {model} is ignored for free-form.")
121 | continue
122 |
123 | if args.compute_score_from_judged_file:
124 | results = []
125 | judge_file = os.path.join(
126 | args.model_response_dir,
127 | args.benchmark,
128 | model,
129 | f"judge_results_ff_model_judge_{args.freeform_judge}.jsonl"
130 | )
131 | with open(judge_file, "r") as f:
132 | for line in f:
133 | judge_dict = json.loads(line)
134 | results.append(judge_dict)
135 | else:
136 | ans_file = os.path.join(
137 | args.model_response_dir,
138 | args.benchmark,
139 | model,
140 | f"{model}_ff.jsonl"
141 | )
142 | tasks = []
143 | with open(ans_file, "r") as f:
144 | for line in f:
145 | ans_dict = json.loads(line)
146 | tasks.append(ans_dict)
147 | results = eval_freeform_model(args, tasks)
148 |
149 | score_dict_model = {}
150 | for judge_dict in results:
151 | judge_score = judge_dict["judge_score"]
152 | if 'overall' not in score_dict_model:
153 | score_dict_model['overall'] = []
154 | score_dict_model['overall'].append(judge_score)
155 | if judge_dict['benchmark_name'] not in score_dict_model:
156 | score_dict_model[judge_dict['benchmark_name']] = []
157 | score_dict_model[judge_dict['benchmark_name']].append(judge_score)
158 |
159 | for key, value in score_dict_model.items():
160 | score_dict_model[key] = round(sum(value)/len(value), 3)
161 | score_dict[model] = score_dict_model
162 |
163 | with open(os.path.join(
164 | args.model_response_dir,
165 | args.benchmark,
166 | model,
167 | f"judge_results_ff_model_judge_{args.freeform_judge}.jsonl"), "w") as f:
168 | for case in results:
169 | f.write(json.dumps(case) + "\n")
170 |
171 | if not args.compute_score_from_judged_file:
172 | print("Sleep 60 seconds to avoid ratelimit error ... ")
173 | time.sleep(60)
174 |
175 | if args.verbose:
176 | print(f"[Close-ended Free-form Model Parser]")
177 | for model, score in score_dict.items():
178 | print(f"{model}: {json.dumps(score, indent=4)}")
179 |
180 | # write score_dict
181 | score_dict = dict(sorted(score_dict.items(), key=lambda x: x[1]['overall'], reverse=True))
182 | with open(os.path.join(
183 | args.model_response_dir,
184 | args.benchmark,
185 | "score_ff.json"), "w") as f:
186 | f.write(json.dumps(score_dict, indent=4) + "\n")
187 |
188 | # print(f"Number of ff entries: {len(results)}")
189 | return score_dict, len(results)
190 |
191 | def compute_metric_closeended_multichoice_modelparse(args):
192 |
193 | score_dict = {}
194 | if args.models_to_eval is not None:
195 | models = args.models_to_eval
196 |
197 | else:
198 | if os.path.exists(args.model_response_dir):
199 | models = os.listdir(args.model_response_dir)
200 |
201 | for model in models:
202 | print(f"Parsing model: {model}")
203 |
204 | if args.models_to_ignore_mp is not None and model in args.models_to_ignore_mp:
205 | print(f"Model {model} is ignored for multiple-choice.")
206 | continue
207 |
208 | if args.compute_score_from_judged_file:
209 | results = []
210 | judge_file = os.path.join(
211 | args.model_response_dir,
212 | args.benchmark,
213 | model,
214 | f"judge_results_mp_model_judge_{args.multichoice_judge}.jsonl"
215 | )
216 | with open(judge_file, "r") as f:
217 | for line in f:
218 | judge_dict = json.loads(line)
219 | results.append(judge_dict)
220 | else:
221 | ans_file = os.path.join(
222 | args.model_response_dir,
223 | args.benchmark,
224 | model,
225 | f"{model}_mp.jsonl"
226 | )
227 | with open(ans_file, "r") as f:
228 | ans_dicts = []
229 | for line in f:
230 | ans_dict = json.loads(line)
231 | ans_dicts.append(ans_dict)
232 |
233 | results = parse_multi_choice_response_model(args, ans_dicts)
234 |
235 | score_dict_model = {}
236 | for judge_dict in results:
237 | options = judge_dict["options"]
238 | target = judge_dict["target"]
239 | assert isinstance(target, list) and len(target) == 1, \
240 | f"Invalid target: {target}"
241 | all_choices = [chr(ord("A") + i) for i in range(len(options))]
242 | model_choice = judge_dict['judge_option']
243 | target_id = all_choices[target[0]]
244 | judge_score = 1 if eval_multi_choice(target_id, model_choice) else 0
245 |
246 | # add score
247 | if 'overall' not in score_dict_model:
248 | score_dict_model['overall'] = []
249 | score_dict_model['overall'].append(judge_score)
250 | if judge_dict['benchmark_name'] not in score_dict_model:
251 | score_dict_model[judge_dict['benchmark_name']] = []
252 | score_dict_model[judge_dict['benchmark_name']].append(judge_score)
253 |
254 | for key, value in score_dict_model.items():
255 | score_dict_model[key] = round(sum(value)/len(value), 3)
256 | score_dict[model] = score_dict_model
257 |
258 | with open(os.path.join(
259 | args.model_response_dir,
260 | args.benchmark,
261 | model,
262 | f"judge_results_mp_model_judge_{args.multichoice_judge}.jsonl"
263 | ), "w") as f:
264 | for case in results:
265 | f.write(json.dumps(case) + "\n")
266 |
267 | if not args.compute_score_from_judged_file:
268 | print("Sleep 60 seconds to avoid ratelimit error ... ")
269 | time.sleep(60)
270 |
271 | if args.verbose:
272 | print(f"[Close-ended Multiple-choice Model Parser]")
273 | for model, score in score_dict.items():
274 | print(f"{model}: {json.dumps(score, indent=4)}")
275 |
276 | # write score_dict
277 | score_dict = dict(sorted(score_dict.items(), key=lambda x: x[1]['overall'], reverse=True))
278 | with open(os.path.join(
279 | args.model_response_dir,
280 | args.benchmark,
281 | "score_mp.json"), "w") as f:
282 | f.write(json.dumps(score_dict, indent=4) + "\n")
283 |
284 | # print(f"Number of mp entries: {len(results)}")
285 | return score_dict, len(results)
286 |
287 | def compute_metric_closeended_freeform(args):
288 | return compute_metric_closeended_freeform_modelparse(args)
289 |
290 | def compute_metric_closeended_multichoice(args):
291 | return compute_metric_closeended_multichoice_modelparse(args)
292 |
293 | def compute_metric_closeended(args):
294 | if "audio" not in args.benchmark:
295 | score_dict_ff, ff_num = compute_metric_closeended_freeform(args)
296 | score_dict_mp, mp_num = compute_metric_closeended_multichoice(args)
297 |
298 | models_ff = set(score_dict_ff.keys())
299 | models_mp = set(score_dict_mp.keys())
300 | common_models = models_ff.intersection(models_mp)
301 | missing_models = models_ff.union(models_mp) - common_models
302 | if missing_models:
303 | print(f"Something went wrong when computing the free-form or multiple-choice "
304 | f"split of these models: \n{missing_models}\n\nA possible reason may be that they lack a model answer file. "
305 | "Skipping them...")
306 |
307 | score_dict = {}
308 | for model in common_models:
309 | score_dir = os.path.join(
310 | args.model_response_dir,
311 | args.benchmark,
312 | model,
313 | )
314 | score_dict_model = {
315 | "overall score (final score)": (score_dict_ff[model]['overall']*ff_num + score_dict_mp[model]['overall']*mp_num) / (ff_num + mp_num),
316 | **{f"{k} (free-form)":v for k, v in score_dict_ff[model].items() if k != "overall"},
317 | **{f"{k} (multiple-choice)":v for k, v in score_dict_mp[model].items() if k != "overall"},
318 | }
319 | score_dict[model] = score_dict_model
320 | with open(os.path.join(score_dir, "score.json"), "w") as f:
321 | f.write(json.dumps(score_dict_model, indent=4) + "\n")
322 |
323 | # sort and write score_dict
324 | score_dict = dict(sorted(score_dict.items(), key=lambda x: x[1]['overall score (final score)'], reverse=True))
325 | with open(os.path.join(args.model_response_dir, args.benchmark, "score.json"), "w") as f:
326 | f.write(json.dumps(score_dict, indent=4) + "\n")
327 | else:
328 | # only takes the freeform score
329 |         score_dict_ff, _ = compute_metric_closeended_freeform(args)  # the helper returns (score_dict, num_entries)
330 | models_ff = set(score_dict_ff.keys())
331 | for model in models_ff:
332 | score_dir = os.path.join(
333 | args.model_response_dir,
334 | args.benchmark,
335 | model,
336 | )
337 | score_dict_model = {
338 | "overall score (final score)": score_dict_ff[model]['overall'],
339 | **{k:v for k, v in score_dict_ff[model].items() if k != "overall"},
340 | }
341 | with open(os.path.join(score_dir, "score.json"), "w") as f:
342 | f.write(json.dumps(score_dict_model, indent=4) + "\n")
343 |
344 | # sort and write score_dict
345 | score_dict_ff = dict(sorted(score_dict_ff.items(), key=lambda x: x[1]['overall'], reverse=True))
346 | with open(os.path.join(args.model_response_dir, args.benchmark, "score.json"), "w") as f:
347 | f.write(json.dumps(score_dict_ff, indent=4) + "\n")
348 |
349 |
350 |
351 | def compute_metric(args):
352 | if args.benchmark in ["image2text", "video2text", "audio2text", "image2text_hard", "video2text_hard", "audio2text_hard"]:
353 | compute_metric_closeended(args)
354 | else:
355 | raise ValueError(f"Invalid benchmark: {args.benchmark}, please choose from "
356 | f"['image2text', 'video2text', 'audio2text', 'image2text_hard', 'video2text_hard', 'audio2text_hard']")
357 |
358 |
359 | if __name__ == '__main__':
360 | set_seed()
361 | args = parse_args()
362 | compute_metric(args)
--------------------------------------------------------------------------------
/mixeval_x/utils/metric_utils.py:
--------------------------------------------------------------------------------
1 | import re
2 | import random
3 | random.seed(42)
4 | import numpy as np
5 | import ast
6 |
7 | from mixeval_x.utils.judge_freeform_parser import (
8 | ChatGPTJudgeCloseendFreeform,
9 | ClaudeJudgeCloseendFreeform,
10 | GeminiJudgeCloseendFreeform
11 | )
12 | from mixeval_x.utils.judge_multichoice_parser import (
13 | ChatGPTJudgeCloseendMultichoice,
14 | ClaudeJudgeCloseendMultichoice,
15 | GeminiJudgeCloseendMultichoice
16 | )
17 | from mixeval_x.utils.judge_text2image import (
18 | ChatGPTJudgeText2Image,
19 | ClaudeJudgeText2Image,
20 | GeminiJudgeText2Image
21 | )
22 | from mixeval_x.utils.judge_text2action import (
23 | ChatGPTJudgeText2Action,
24 | ClaudeJudgeText2Action,
25 | GeminiJudgeText2Action
26 | )
27 | from mixeval_x.utils.judge_image2action import (
28 | ChatGPTJudgeImage2Action,
29 | ClaudeJudgeImage2Action,
30 | GeminiJudgeImage2Action
31 | )
32 | from mixeval_x.utils.common_utils import (
33 | extract_basemodel_response_3e,
34 | extract_basemodel_response_2e,
35 | )
36 |
37 | def find_all_sub(s, sub):
38 | """Find all occurrences of a substring in a string using regular expressions."""
39 | pattern = re.escape(sub) # Escape the substring to handle special regex characters
40 | matches = [match.start() for match in re.finditer(pattern, s)]
41 | return matches
42 |
43 | def parse_multi_choice_response_rule(args, response, all_choices, index2ans):
44 | """
45 | Parse the prediction from the generated response.
46 | Return the predicted index e.g., A, B, C, D.
47 | """
48 | GET_FIRST = True # get the first match or last
49 |
50 | if hasattr(args, 'model_type'):
51 | if args.model_type == 'BaseModel':
52 | response = extract_basemodel_response_2e(response)
53 | elif args.model_type == 'ChatModel':
54 | pass
55 | elif args.model_type == 'APIModelBase':
56 | pass
57 | else:
58 | raise ValueError(f"Model type {args.model_type} not supported.")
59 |
60 | for char in [',', '.', '!', '?', ';', ':', "'"]:
61 | response = response.strip(char)
62 | response = " " + response + " " # add space to avoid partial match
63 |
64 | candidates = []
65 | patterns = [
66 | lambda choice: f" {choice} ", # e.g., A B C D
67 | lambda choice: f" {choice}\n", # e.g., A\n B\n C\n D\n
68 | lambda choice: f"\n{choice} ", # e.g., \nA B C D
69 | lambda choice: f"\n{choice}\n", # e.g., \nA\n B\n C\n D\n
70 | lambda choice: f" {choice}. ", # e.g., A. B. C. D.
71 | lambda choice: f" {choice}.\n", # e.g., A.\n B.\n C.\n D.\n
72 | lambda choice: f"\n{choice}. ", # e.g., \nA. \nB. \nC. \nD.
73 | lambda choice: f"\n{choice}.\n", # e.g., \nA.\n \nB.\n \nC.\n \nD.\n
74 | lambda choice: f"({choice})", # e.g., (A) (B) (C) (D)
75 | lambda choice: f"**{choice} ", # e.g., **A **B **C **D
76 | lambda choice: f" {choice}**", # e.g., A** B** C** D**
77 | lambda choice: f"**{choice}. ", # e.g., **A. **B. **C. **D.
78 | lambda choice: f" {choice}.**", # e.g., A.** B.** C.** D.**
79 | ]
80 |
81 | for choice in all_choices:
82 | for pattern in patterns:
83 | ids = find_all_sub(response, pattern(choice))
84 | for id in ids:
85 | candidates.append((choice, id))
86 |
87 |     # if none of the above yields candidates and the response is longer than 5 tokens, try matching the option contents (index2ans) instead
88 | if len(candidates) == 0 and len(response.split()) > 5:
89 | for index, ans in index2ans.items():
90 | ids = find_all_sub(response.lower(), ans.lower())
91 | for id in ids:
92 | candidates.append((index, id))
93 |
94 | if len(candidates) == 0: # still not get answer, randomly choose one.
95 | pred_index = random.choice(all_choices)
96 | elif len(candidates) > 1:
97 | start_indexes = [candidate[1] for candidate in candidates]
98 | if GET_FIRST:
99 | pred_index = candidates[np.argmin(start_indexes)][0]
100 | else:
101 | pred_index = candidates[np.argmax(start_indexes)][0]
102 | else: # if only one candidate, use it.
103 | pred_index = candidates[0][0]
104 | return pred_index
105 |
106 | def get_option_from_judge(judge_response):
107 | """
108 | Get the option from the judge response.
109 | """
110 |     double_brackets_pattern = re.compile(r"\[\[([A-Z])\]\]")
111 |     single_brackets_pattern = re.compile(r"\[([A-Z])\]")
112 |
113 | match = re.search(double_brackets_pattern, judge_response)
114 | if not match:
115 | match = re.search(single_brackets_pattern, judge_response)
116 |
117 | if match:
118 | option = match.groups()[0]
119 | else:
120 | option = -1
121 |
122 | return str(option)
123 |
124 | def is_option(variable):
125 | return isinstance(variable, str) and len(variable) == 1 and variable.isupper()
126 |
127 | def parse_multi_choice_response_model(args, tasks):
128 | tasks_remained = tasks
129 | tasks_judged = []
130 | model_judge = ChatGPTJudgeCloseendMultichoice(args)
131 |
132 | MAX_RETRY_NUM = 10
133 | for _ in range(MAX_RETRY_NUM):
134 | tasks_judged_p = model_judge.annotate_parallel(tasks_remained)
135 |         # retry failed cases whose "judge_response" is None or contains no valid option
136 | tasks_remained = []
137 | for task in tasks_judged_p:
138 | if task['judge_response'] is not None and is_option(get_option_from_judge(task['judge_response'])):
139 | task['judge_option'] = get_option_from_judge(task['judge_response'])
140 | tasks_judged.append(task)
141 | else:
142 | tasks_remained.append(task)
143 |
144 | if len(tasks_remained) == 0:
145 | break
146 | else:
147 | print(f"Still {len(tasks_remained)} tasks remained to be judged. Retry...")
148 |
149 | if len(tasks_remained) > 0:
150 | print(f"Max retry number {MAX_RETRY_NUM} reached, while some tasks are still not judged. "
151 | "Randomly assign the options for them.\n"
152 | "This is expected during parsing. "
153 | "The main cause may be that the evaluated model's response does not contain a valid answer.")
154 | # randomly assign the option for each entry
155 | for task in tasks_remained:
156 | options = task['options']
157 | option_letters = [chr(ord("A") + i) for i in range(len(options))]
158 | task['judge_option'] = random.choice(option_letters)
159 | tasks_judged.append(task)
160 |
161 | assert len(tasks_judged) == len(tasks), \
162 | "The number of tasks judged is not equal to the number of input tasks."
163 |
164 | return tasks_judged
165 |
166 | def check_is_number(string):
167 | """
168 |     Check if the given string is a number.
169 | """
170 | try:
171 | float(string.replace(',', ''))
172 | return True
173 | except ValueError:
174 | # check if there's comma inside
175 | return False
176 |
177 | def normalize_str(string):
178 | """
179 | Normalize the str to lower case and make them float numbers if possible.
180 | """
181 |     # decide whether the string is numeric
182 |
183 |     # if it is a number, convert it to a rounded float
184 | string = string.strip()
185 |
186 | is_number = check_is_number(string)
187 |
188 | if is_number:
189 | string = string.replace(',', '')
190 | string = float(string)
191 | # leave 2 decimal
192 | string = round(string, 2)
193 | return [string]
194 | else: # it's likely to be a string
195 | # lower it
196 | string = string.lower()
197 | if len(string) == 1:
198 | return [" " + string, string + " "] # avoid trivial matches
199 | return [string]
200 |
201 | def extract_numbers(string):
202 | """
203 |     Extract all forms of numbers from a string with regex.
204 | """
205 | # Pattern for numbers with commas
206 | pattern_commas = r'-?\b\d{1,3}(?:,\d{3})+\b'
207 | # Pattern for scientific notation
208 | pattern_scientific = r'-?\d+(?:\.\d+)?[eE][+-]?\d+'
209 | # Pattern for simple numbers without commas
210 | pattern_simple = r'-?(?:\d+\.\d+|\.\d+|\d+\b)(?![eE][+-]?\d+)(?![,\d])'
211 |
212 | # Extract numbers with commas
213 | numbers_with_commas = re.findall(pattern_commas, string)
214 | # Extract numbers in scientific notation
215 | numbers_scientific = re.findall(pattern_scientific, string)
216 | # Extract simple numbers without commas
217 | numbers_simple = re.findall(pattern_simple, string)
218 |
219 | # Combine all extracted numbers
220 | all_numbers = numbers_with_commas + numbers_scientific + numbers_simple
221 | return all_numbers
222 |
223 | def parse_freeform_response_rule(args, response):
224 | """
225 | Parse the prediction from the generated response.
226 | Return a list of predicted strings or numbers.
227 | """
228 | if hasattr(args, 'model_type'):
229 | if args.model_type == 'BaseModel':
230 | response = extract_basemodel_response_3e(response)
231 | elif args.model_type == 'ChatModel':
232 | pass
233 | elif args.model_type == 'APIModelBase':
234 | pass
235 | else:
236 | raise ValueError(f"Model type {args.model_type} not supported.")
237 |
238 | # content = content.strip("\n").strip(".").strip(" ")
239 | def get_key_subresponses(response):
240 | key_responses = []
241 | response = response.strip().strip(".").lower()
242 | sub_responses = re.split(r'\.\s(?=[A-Z])|\n', response)
243 | indicators_of_keys = ['could be ', 'so ', 'is ',
244 | 'thus ', 'therefore ', 'final ', 'answer ', 'result ']
245 | key_responses = []
246 | for index, resp in enumerate(sub_responses):
247 |             # if it is the last one, also accept an equation (the entire response can be just one sentence with an equation)
248 | if index == len(sub_responses) - 1:
249 | indicators_of_keys.extend(['='])
250 | shortest_key_response = None # the shortest response that may contain the answer (tail part of the response)
251 | for indicator in indicators_of_keys:
252 | if indicator in resp:
253 | if not shortest_key_response:
254 | shortest_key_response = resp.split(indicator)[-1].strip()
255 | else:
256 | if len(resp.split(indicator)[-1].strip()) < len(shortest_key_response):
257 | shortest_key_response = resp.split(indicator)[-1].strip()
258 |
259 | if shortest_key_response:
260 | # and it's not trivial
261 | if shortest_key_response.strip() not in [":", ",", ".", "!", "?", ";", ":", "'"]:
262 | key_responses.append(shortest_key_response)
263 |         if len(key_responses) == 0: # did not find any
264 | return [response]
265 | return key_responses
266 | # pdb.set_trace()
267 |
268 | key_responses = get_key_subresponses(response)
269 |
270 | pred_list = key_responses.copy() # keep the original string response
271 | for resp in key_responses:
272 | pred_list.extend(extract_numbers(resp))
273 |
274 | tmp_pred_list = []
275 | for i in range(len(pred_list)):
276 | tmp_pred_list.extend(normalize_str(pred_list[i]))
277 | pred_list = tmp_pred_list
278 |
279 | # remove duplicates
280 | pred_list = list(set(pred_list))
281 |
282 | return pred_list
283 |
284 | def get_score_from_judge(judge_response):
285 | """
286 | Get the score from the judge response.
287 | """
288 |     one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
289 |     one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")
290 |
291 | match = re.search(one_score_pattern, judge_response)
292 | if not match:
293 | match = re.search(one_score_pattern_backup, judge_response)
294 |
295 | if match:
296 | rating = ast.literal_eval(match.groups()[0])
297 | else:
298 | rating = -1
299 |
300 | return float(rating)
301 |
302 | def parse_freeform_response_model(args, tasks):
303 | tasks_remained = tasks
304 | tasks_judged = []
305 | model_judge = ChatGPTJudgeCloseendFreeform(args)
306 |
307 | MAX_RETRY_NUM = 10
308 | for _ in range(MAX_RETRY_NUM):
309 | tasks_judged_p = model_judge.annotate_parallel(tasks_remained)
310 |         # retry failed cases whose "judge_response" is None or contains no valid score
311 | tasks_remained = []
312 | for task in tasks_judged_p:
313 | if (task['judge_response'] is not None
314 | and 0 <= get_score_from_judge(task['judge_response']) <= 1):
315 | task['judge_score'] = get_score_from_judge(task['judge_response'])
316 | tasks_judged.append(task)
317 | else:
318 | tasks_remained.append(task)
319 |
320 | if len(tasks_remained) == 0:
321 | break
322 | else:
323 | print(f"Still {len(tasks_remained)} tasks remained to be judged. Retry...")
324 |
325 | if len(tasks_remained) > 0:
326 | print(f"Max retry number {MAX_RETRY_NUM} reached, "
327 | "while some tasks are still not judged. "
328 | "Randomly assign the scores for them.\n"
329 | "This is expected during parsing. "
330 | "The main cause may be that the evaluated model's response does not contain a valid answer.")
331 | # randomly assign the score for each entry
332 | for task in tasks_remained:
333 | task['judge_score'] = round(random.random(), 1)
334 | tasks_judged.append(task)
335 |
336 | assert len(tasks_judged) == len(tasks), \
337 | "The number of tasks judged is not equal to the number of input tasks."
338 |
339 | return tasks_judged
340 |
341 |
342 | def eval_multi_choice(gold_i, pred_i):
343 | """
344 | Evaluate a multiple choice instance.
345 | """
346 | correct = False
347 |     # only if they are exactly the same do we consider it correct
348 | if isinstance(gold_i, list):
349 | for answer in gold_i:
350 | if answer == pred_i:
351 | correct = True
352 | break
353 | else: # gold_i is a string
354 | if gold_i == pred_i:
355 | correct = True
356 | return correct
357 |
358 | def eval_freeform_rule(gold_i, pred_i):
359 | """
360 | Evaluate an open question instance
361 | """
362 | correct = False
363 | if isinstance(gold_i, list):
364 | # use float to avoid trivial matches
365 | norm_answers = []
366 | for answer in gold_i:
367 | norm_answers.extend(normalize_str(answer))
368 | else:
369 | norm_answers = normalize_str(gold_i)
370 | for pred in pred_i: # pred is already normalized in parse response phase
371 | if isinstance(pred, str): # if it's a string, then find if ans in the pred_i
372 | for norm_ans in norm_answers:
373 | # only see if the string answer in the string pred
374 | if isinstance(norm_ans, str) and norm_ans in pred:
375 | if not correct:
376 | correct = True
377 | break
378 | else: # it's a float number
379 | if pred in norm_answers:
380 | if not correct:
381 | correct = True
382 | break
383 | return correct
384 |
385 | def eval_freeform_model(args, tasks):
386 | return parse_freeform_response_model(args, tasks)
387 |
388 |
389 | def eval_text2image(args, tasks):
390 | if 'gpt' in args.judge_model.lower():
391 | model_judge = ChatGPTJudgeText2Image(args)
392 | elif 'claude' in args.judge_model.lower():
393 | model_judge = ClaudeJudgeText2Image(args)
394 | elif 'gemini' in args.judge_model.lower():
395 | model_judge = GeminiJudgeText2Image(args)
396 | else:
397 | raise ValueError(f"Text2Image judge {args.judge_model} not supported.")
398 |
399 | tasks_judged_p = model_judge.annotate_parallel(tasks)
400 |
401 | assert len(tasks_judged_p) == len(tasks), \
402 | f"The number of tasks judged is not equal to the number of input tasks. {len(tasks_judged_p)} vs {len(tasks)}"
403 |
404 | return tasks_judged_p
405 |
406 | def eval_text2action(args, tasks):
407 | if 'gpt' in args.judge_model.lower():
408 | model_judge = ChatGPTJudgeText2Action(args)
409 |     elif 'claude' in args.judge_model.lower():
410 | model_judge = ClaudeJudgeText2Action(args)
411 |     elif 'gemini' in args.judge_model.lower():
412 | model_judge = GeminiJudgeText2Action(args)
413 | else:
414 | raise ValueError(f"Text2Action judge {args.judge_model} not supported.")
415 |
416 | tasks_judged_p = model_judge.annotate_parallel(tasks)
417 |
418 | assert len(tasks_judged_p) == len(tasks), \
419 | f"The number of tasks judged is not equal to the number of input tasks. {len(tasks_judged_p)} vs {len(tasks)}"
420 |
421 | return tasks_judged_p
422 |
423 | def eval_image2action(args, tasks):
424 | if 'gpt' in args.judge_model.lower():
425 | model_judge = ChatGPTJudgeImage2Action(args)
426 |     elif 'claude' in args.judge_model.lower():
427 | model_judge = ClaudeJudgeImage2Action(args)
428 |     elif 'gemini' in args.judge_model.lower():
429 | model_judge = GeminiJudgeImage2Action(args)
430 | else:
431 | raise ValueError(f"Image2Action judge {args.judge_model} not supported.")
432 |
433 | tasks_judged_p = model_judge.annotate_parallel(tasks)
434 |
435 | assert len(tasks_judged_p) == len(tasks), \
436 | f"The number of tasks judged is not equal to the number of input tasks. {len(tasks_judged_p)} vs {len(tasks)}"
437 |
438 | return tasks_judged_p
439 |
440 | if __name__ == '__main__':
441 | response = "Sandy will have more tokens than any sibling by 1/8 million."
442 |     preds = parse_freeform_response_rule(None, response)
443 | print(preds)
444 |
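445 |     # A minimal sanity-check sketch for the rule-based multiple-choice parser and the
446 |     # judge-output extractors above; the response text, choices, and answer mapping are
447 |     # illustrative placeholders, and None is passed for args because no model_type
448 |     # handling is needed here.
449 |     mc_response = "After comparing the options, the best answer is (B)."
450 |     all_choices = ["A", "B", "C", "D"]
451 |     index2ans = {"A": "a dog", "B": "a cat", "C": "a bird", "D": "a fish"}
452 |     print(parse_multi_choice_response_rule(None, mc_response, all_choices, index2ans))  # expected: B
453 |     print(get_option_from_judge("The option chosen by the model: [[B]]"))  # expected: B
454 |     print(get_score_from_judge("The correctness score: [[0.7]]"))  # expected: 0.7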
--------------------------------------------------------------------------------
/mixeval_x/utils/judge_text2image.py:
--------------------------------------------------------------------------------
1 | from tqdm import tqdm
2 | import time
3 | import random
4 | import os
5 | from dotenv import load_dotenv
6 | import base64
7 | import PIL.Image
8 | from PIL import Image
9 | import re
10 | import ast
11 |
12 | from concurrent.futures import ThreadPoolExecutor
13 | from httpx import Timeout
14 |
15 | from openai import OpenAI
16 | from openai._exceptions import RateLimitError as RateLimitError_openai, BadRequestError as BadRequestError_openai
17 |
18 | import anthropic
19 | from anthropic._exceptions import RateLimitError as RateLimitError_anthropic
20 |
21 | import google.generativeai as genai
22 |
23 | from mixeval_x.prompts.judge_prompts import (
24 | text2image_gpt_judge_turn1,
25 | text2image_gpt_judge_turn2,
26 | text2image_claude_judge_turn1,
27 | text2image_claude_judge_turn2,
28 | text2image_gemini_judge_turn1,
29 | text2image_gemini_judge_turn2,
30 | )
31 |
32 | ########################ChatGPT########################
33 | class ChatGPTJudgeText2Image:
34 | def __init__(self, args):
35 | self.args = args
36 | self.JUDGE = args.judge_model
37 | self.FIX_INTERVAL_SECOND = 0
38 | self.MAX_RETRY_NUM = 999999
39 | self.MAX_NEW_TOKENS = 999
40 |
41 | self.FORMAT_MAXRETRY = 10
42 |
43 | load_dotenv()
44 | self.client = OpenAI(
45 | api_key=os.getenv('MODEL_PARSER_API'),
46 | timeout=Timeout(timeout=60.0, connect=5.0)
47 | )
48 |
49 | @staticmethod
50 | def encode_image(image_path):
51 | with open(image_path, "rb") as image_file:
52 | return base64.b64encode(image_file.read()).decode('utf-8')
53 |
54 | @staticmethod
55 | def get_score_from_judge(judge_response):
56 | """
57 | Get the score from the judge response.
58 | """
59 |         one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
60 |         one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")
61 |
62 | match = re.search(one_score_pattern, judge_response)
63 | if not match:
64 | match = re.search(one_score_pattern_backup, judge_response)
65 |
66 | if match:
67 | rating = ast.literal_eval(match.groups()[0])
68 | else:
69 | rating = -1
70 |
71 | return float(rating)
72 |
73 | def format_prompts(self, inputs, mode, history=[]):
74 |
75 | if mode == 'turn_1':
76 | formated = text2image_gpt_judge_turn1(*inputs)
77 | elif mode == 'turn_2':
78 | formated = text2image_gpt_judge_turn2(*inputs)
79 | elif mode == 'append_message' and history:
80 | history.append(inputs)
81 | formated = history
82 | else:
83 | raise ValueError(f"Invalid mode: {mode}.")
84 |
85 | return formated
86 |
87 | def _GPT_decode(self, inputs, mode, history=[]):
88 | completion = self.client.chat.completions.create(
89 | model=self.JUDGE,
90 | response_format={ "type": 'text'},
91 | max_tokens=self.MAX_NEW_TOKENS,
92 | messages=self.format_prompts(inputs, mode, history),
93 | )
94 | time.sleep(self.FIX_INTERVAL_SECOND)
95 | return completion
96 |
97 |
98 | def GPT_decode(self, inputs, mode, history=[]):
99 | delay = 1
100 | blocked = 0
101 | for i in range(self.MAX_RETRY_NUM):
102 | try:
103 | completion = self._GPT_decode(inputs, mode, history)
104 | return completion
105 | except RateLimitError_openai as e:
106 | exponential_base = 2
107 | delay *= exponential_base * (1 + random.random())
108 | print(f"RateLimitError, retrying after {round(delay, 2)} seconds, {i+1}-th retry...")
109 | print(e)
110 | time.sleep(delay)
111 | continue
112 | except BadRequestError_openai as e:
113 | blocked += 1
114 | if blocked >= 10:
115 | print("Blocked too many times, skipping...")
116 | return 'Blocked'
117 | print(f"Input is blocked, retrying...")
118 | print(e)
119 | time.sleep(1)
120 | continue
121 | except Exception as e:
122 | print(f"Error in GPT_decode, retrying...")
123 | print(e)
124 | time.sleep(1)
125 | continue
126 | print(f"Failed after {self.MAX_RETRY_NUM} retries.")
127 | return 'Error'
128 |
129 |
130 | def annotate_p(self, task):
131 | prompt1 = task['first_turn_user_prompt']
132 | prompt2 = task['second_turn_user_prompt']
133 |
134 | if not os.path.exists(task['gen_1st_turn']):
135 | task['judge_response_1st_turn'] = 'Image not found, treated as a neutral entry.'
136 | task['judge_score_1st_turn'] = 5.0
137 | task['judge_response_2nd_turn'] = 'Image not found, treated as a neutral entry.'
138 | task['judge_score_2nd_turn'] = 5.0
139 | print(f"Image 1 not found, treated as a neutral entry.")
140 | return task
141 | else:
142 | image1 = self.encode_image(task['gen_1st_turn'])
143 |
144 | # first turn
145 | inputs = (prompt1, image1)
146 |
147 | completion = self.GPT_decode(inputs, 'turn_1')
148 | if completion == 'Error':
149 | print(f"Error in GPT_decode, the entry treated as bad entry.")
150 | task['judge_response_1st_turn'] = '[[0.0]]'
151 | elif completion == 'Blocked':
152 | print(f"{task}: \n\nBlocked, the entry treated as bad entry.")
153 | task['judge_response_1st_turn'] = '[[0.0]]'
154 | else:
155 | annotation = completion.choices[0].message.content
156 | task['judge_response_1st_turn'] = annotation
157 | for i in range(self.FORMAT_MAXRETRY):
158 | if self.get_score_from_judge(annotation) != -1:
159 | task['judge_score_1st_turn'] = self.get_score_from_judge(annotation)
160 | break
161 | else:
162 | print(f"No score found in the response, retrying...")
163 | history = text2image_gpt_judge_turn1(*inputs)
164 | history.append({"role": "assistant", "content": annotation})
165 | completion = self.GPT_decode({"role": "user", "content": "Continue your judgment and finish by outputting a final rating with the above-mentioned format, i.e., the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\". Your rating: "}, 'append_message', history)
166 | annotation = completion.choices[0].message.content
167 | history.append({"role": "assistant", "content": annotation})
168 |
169 | if self.get_score_from_judge(annotation) != -1:
170 | task['judge_score_1st_turn'] = self.get_score_from_judge(annotation)
171 | else:
172 | task['judge_score_1st_turn'] = 5.0
173 | print(f"No score found in the response, please inspect and retry.")
174 |
175 | if not os.path.exists(task['gen_2nd_turn']):
176 | task['judge_response_2nd_turn'] = 'Image not found, treated as a neutral entry.'
177 | task['judge_score_2nd_turn'] = 5.0
178 | print(f"Image 2 not found, treated as a neutral entry.")
179 | return task
180 | else:
181 | image2 = self.encode_image(task['gen_2nd_turn'])
182 | # second turn
183 | inputs = (image1, prompt2, image2)
184 | completion = self.GPT_decode(inputs, 'turn_2')
185 | if completion == 'Error':
186 | print(f"Error in GPT_decode, the entry treated as bad entry.")
187 | task['judge_response_2nd_turn'] = '[[0.0]]'
188 | elif completion == 'Blocked':
189 | print(f"{task}: \n\nBlocked, the entry treated as bad entry.")
190 | task['judge_response_2nd_turn'] = '[[0.0]]'
191 | else:
192 | annotation = completion.choices[0].message.content
193 | task['judge_response_2nd_turn'] = annotation
194 | for i in range(self.FORMAT_MAXRETRY):
195 | if self.get_score_from_judge(annotation) != -1:
196 | task['judge_score_2nd_turn'] = self.get_score_from_judge(annotation)
197 | break
198 | else:
199 | print(f"No score found in the response, retrying...")
200 |                             history = text2image_gpt_judge_turn2(*inputs)
201 | history.append({"role": "assistant", "content": annotation})
202 | completion = self.GPT_decode({"role": "user", "content": "Continue your judgment and finish by outputting a final rating with the above-mentioned format, i.e., the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\". Your rating: "}, 'append_message', history)
203 | annotation = completion.choices[0].message.content
204 | history.append({"role": "assistant", "content": annotation})
205 |
206 | if self.get_score_from_judge(annotation) != -1:
207 | task['judge_score_2nd_turn'] = self.get_score_from_judge(annotation)
208 | else:
209 | task['judge_score_2nd_turn'] = 5.0
210 | print(f"No score found in the response, please inspect and retry.")
211 |
212 |
213 | return task
214 |
215 |
216 | def annotate_parallel(self, tasks):
217 | print(f"Parsing in parallel, in total {self.args.api_parallel_num} threads.")
218 | results = []
219 | with ThreadPoolExecutor(self.args.api_parallel_num) as executor:
220 | for entry in tqdm(
221 | executor.map(self.annotate_p, tasks), total=len(tasks)
222 | ):
223 | results.append(entry)
224 | if None in results:
225 | raise ValueError("Some entries are not annotated due to errors in annotate_p, please inspect and retry.")
226 | return results
227 |
228 | ########################Claude 3########################
229 | class ClaudeJudgeText2Image:
230 | def __init__(self, args):
231 | self.args = args
232 | self.JUDGE = args.judge_model
233 | self.FIX_INTERVAL_SECOND = 0
234 | self.MAX_RETRY_NUM = 999999
235 | self.MAX_NEW_TOKENS = 999
236 |
237 | self.FORMAT_MAXRETRY = 10
238 |
239 | load_dotenv()
240 | self.client = anthropic.Anthropic(
241 | api_key=os.getenv('k_ant'),
242 | timeout=Timeout(timeout=20.0, connect=5.0)
243 | )
244 |
245 | @staticmethod
246 | def encode_image(image_path):
247 |
248 | def convert_image_to_jpeg(image_path, output_path):
249 | # Open the image and convert it to JPEG to ensure format compliance
250 | img = Image.open(image_path)
251 | img.convert("RGB").save(output_path, format="JPEG")
252 | convert_image_to_jpeg(image_path, image_path)
253 |
254 | with open(image_path, "rb") as image_file:
255 | return base64.b64encode(image_file.read()).decode('utf-8')
256 |
257 | @staticmethod
258 | def get_score_from_judge(judge_response):
259 | """
260 | Get the score from the judge response.
261 | """
262 |         one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
263 |         one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")
264 |
265 | match = re.search(one_score_pattern, judge_response)
266 | if not match:
267 | match = re.search(one_score_pattern_backup, judge_response)
268 |
269 | if match:
270 | rating = ast.literal_eval(match.groups()[0])
271 | else:
272 | rating = -1
273 |
274 | return float(rating)
275 |
276 | def format_prompts(self, inputs, mode, history=[]):
277 |
278 | if mode == 'turn_1':
279 | formated = text2image_claude_judge_turn1(*inputs)
280 | elif mode == 'turn_2':
281 | formated = text2image_claude_judge_turn2(*inputs)
282 | elif mode == 'append_message' and history:
283 | history.append(inputs)
284 | formated = history
285 | else:
286 | raise ValueError(f"Invalid mode: {mode}.")
287 |
288 | return formated
289 |
290 | def _GPT_decode(self, inputs, mode, history=[]):
291 | completion = self.client.messages.create(
292 | model=self.JUDGE,
293 | max_tokens=self.MAX_NEW_TOKENS,
294 | messages=self.format_prompts(inputs, mode, history),
295 | )
296 | time.sleep(self.FIX_INTERVAL_SECOND)
297 | return completion
298 |
299 |
300 | def GPT_decode(self, inputs, mode, history=[]):
301 | delay = 1
302 | for i in range(self.MAX_RETRY_NUM):
303 | try:
304 | completion = self._GPT_decode(inputs, mode, history)
305 | return completion
306 | except RateLimitError_anthropic as e:
307 | exponential_base = 2
308 | delay *= exponential_base * (1 + random.random())
309 | print(f"RateLimitError, retrying after {round(delay, 2)} seconds, {i+1}-th retry...")
310 | print(e)
311 | time.sleep(delay)
312 | continue
313 | except Exception as e:
314 | print(f"Error in decode, retrying...")
315 | print(e)
316 | time.sleep(1)
317 | continue
318 | print(f"Failed after {self.MAX_RETRY_NUM} retries.")
319 | return 'Error'
320 |
321 |
322 | def annotate_p(self, task):
323 | prompt1 = task['first_turn_user_prompt']
324 | prompt2 = task['second_turn_user_prompt']
325 |
326 | if not os.path.exists(task['gen_1st_turn']):
327 | task['judge_response_1st_turn'] = 'Image not found, treated as a neutral entry.'
328 | task['judge_score_1st_turn'] = 5.0
329 | task['judge_response_2nd_turn'] = 'Image not found, treated as a neutral entry.'
330 | task['judge_score_2nd_turn'] = 5.0
331 | print(f"Image 1 not found, treated as a neutral entry.")
332 | return task
333 | else:
334 | image1 = self.encode_image(task['gen_1st_turn'])
335 |
336 | # first turn
337 | inputs = (prompt1, image1)
338 |
339 | completion = self.GPT_decode(inputs, 'turn_1')
340 | if completion == 'Error':
341 | print(f"Error in GPT_decode, the entry treated as bad entry.")
342 | task['judge_response_1st_turn'] = '[[0.0]]'
343 | elif completion == 'Blocked':
344 | print(f"{task}: \n\nBlocked, the entry treated as bad entry.")
345 | task['judge_response_1st_turn'] = '[[0.0]]'
346 | else:
347 | annotation = completion.content[0].text
348 | task['judge_response_1st_turn'] = annotation
349 | for i in range(self.FORMAT_MAXRETRY):
350 | if self.get_score_from_judge(annotation) != -1:
351 | task['judge_score_1st_turn'] = self.get_score_from_judge(annotation)
352 | break
353 | else:
354 | print(f"No score found in the response, retrying...")
355 | history = text2image_claude_judge_turn1(*inputs)
356 | history.append({"role": "assistant", "content": annotation})
357 | completion = self.GPT_decode({"role": "user", "content": "Continue your judgment and finish by outputting a final rating with the above-mentioned format, i.e., the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\". Your rating: "}, 'append_message', history)
358 | annotation = completion.content[0].text
359 | history.append({"role": "assistant", "content": annotation})
360 |
361 | if self.get_score_from_judge(annotation) != -1:
362 | task['judge_score_1st_turn'] = self.get_score_from_judge(annotation)
363 | else:
364 | task['judge_score_1st_turn'] = 5.0
365 | print(f"No score found in the response, please inspect and retry.")
366 |
367 | if not os.path.exists(task['gen_2nd_turn']):
368 | task['judge_response_2nd_turn'] = 'Image not found, treated as a neutral entry.'
369 | task['judge_score_2nd_turn'] = 5.0
370 | print(f"Image 2 not found, treated as a neutral entry.")
371 | return task
372 | else:
373 | image2 = self.encode_image(task['gen_2nd_turn'])
374 | # second turn
375 | inputs = (image1, prompt2, image2)
376 | completion = self.GPT_decode(inputs, 'turn_2')
377 | if completion == 'Error':
378 | print(f"Error in GPT_decode, the entry treated as bad entry.")
379 | task['judge_response_2nd_turn'] = '[[0.0]]'
380 | elif completion == 'Blocked':
381 | print(f"{task}: \n\nBlocked, the entry treated as bad entry.")
382 | task['judge_response_2nd_turn'] = '[[0.0]]'
383 | else:
384 | annotation = completion.content[0].text
385 | task['judge_response_2nd_turn'] = annotation
386 | for i in range(self.FORMAT_MAXRETRY):
387 | if self.get_score_from_judge(annotation) != -1:
388 | task['judge_score_2nd_turn'] = self.get_score_from_judge(annotation)
389 | break
390 | else:
391 | print(f"No score found in the response, retrying...")
392 | history = text2image_claude_judge_turn2(*inputs)
393 | history.append({"role": "assistant", "content": annotation})
394 | completion = self.GPT_decode({"role": "user", "content": "Continue your judgment and finish by outputting a final rating with the above-mentioned format, i.e., the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\". Your rating: "}, 'append_message', history)
395 | annotation = completion.content[0].text
396 | history.append({"role": "assistant", "content": annotation})
397 |
398 | if self.get_score_from_judge(annotation) != -1:
399 | task['judge_score_2nd_turn'] = self.get_score_from_judge(annotation)
400 | else:
401 | task['judge_score_2nd_turn'] = 5.0
402 | print(f"No score found in the response, please inspect and retry.")
403 |
404 |
405 | return task
406 |
407 |
408 | def annotate_parallel(self, tasks):
409 | print(f"Parsing in parallel, in total {self.args.api_parallel_num} threads.")
410 | results = []
411 | with ThreadPoolExecutor(self.args.api_parallel_num) as executor:
412 | for entry in tqdm(
413 | executor.map(self.annotate_p, tasks), total=len(tasks)
414 | ):
415 | results.append(entry)
416 | if None in results:
417 | raise ValueError("Some entries are not annotated due to errors in annotate_p, please inspect and retry.")
418 | return results
419 |
420 |
421 | ########################Gemini########################
422 | class GeminiJudgeText2Image:
423 | def __init__(self, args):
424 | self.args = args
425 | self.JUDGE = args.judge_model
426 | self.FIX_INTERVAL_SECOND = 0
427 | self.MAX_RETRY_NUM = 999999
428 | self.MAX_NEW_TOKENS = 999
429 |
430 | self.FORMAT_MAXRETRY = 10
431 |
432 | load_dotenv()
433 | genai.configure(api_key=os.getenv('k_g'))
434 | self.model = genai.GenerativeModel(self.JUDGE)
435 |
436 | self.safety_settings={
437 | 'harm_category_harassment':'block_none',
438 | 'harm_category_hate_speech': 'block_none',
439 | 'harm_category_sexually_explicit': 'block_none',
440 | 'harm_category_dangerous_content': 'block_none'
441 | }
442 |
443 | @staticmethod
444 | def encode_image(image_path):
445 | return PIL.Image.open(image_path)
446 |
447 | @staticmethod
448 | def get_score_from_judge(judge_response):
449 | """
450 | Get the score from the judge response.
451 | """
452 |         one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
453 |         one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")
454 |
455 | match = re.search(one_score_pattern, judge_response)
456 | if not match:
457 | match = re.search(one_score_pattern_backup, judge_response)
458 |
459 | if match:
460 | rating = ast.literal_eval(match.groups()[0])
461 | else:
462 | rating = -1
463 |
464 | return float(rating)
465 |
466 | def format_prompts(self, inputs, mode, history=[]):
467 |
468 | if mode == 'turn_1':
469 | formated = text2image_gemini_judge_turn1(*inputs)
470 | elif mode == 'turn_2':
471 | formated = text2image_gemini_judge_turn2(*inputs)
472 | elif mode == 'append_message' and history:
473 | history.append(inputs)
474 | formated = history
475 | else:
476 | raise ValueError(f"Invalid mode: {mode}.")
477 |
478 | return formated
479 |
480 | def _GPT_decode(self, inputs, mode, history=[]):
481 | completion = self.model.generate_content(
482 | self.format_prompts(inputs, mode, history),
483 | generation_config=genai.types.GenerationConfig(
484 | candidate_count=1,
485 | max_output_tokens=self.MAX_NEW_TOKENS,
486 | ),
487 | safety_settings=self.safety_settings,
488 | )
489 | time.sleep(self.FIX_INTERVAL_SECOND)
490 | return completion.text
491 |
492 |
493 | def GPT_decode(self, inputs, mode, history=[]):
494 | delay = 1
495 | blocked = 0
496 | for i in range(self.MAX_RETRY_NUM):
497 | try:
498 | response_content = self._GPT_decode(inputs, mode, history)
499 | return response_content
500 | except Exception as e:
501 | if 'quick accessor' in str(e) or 'block' in str(e):
502 | print("Content blocked, retrying ...")
503 | blocked += 1
504 | if blocked > 10:
505 | print("Blocked for too many times, using 'Response not available "
506 | "due to content restrictions.' as response, exiting...")
507 | return 'Response not available due to content restrictions.'
508 | elif 'quota' in str(e).lower() or 'limit' in str(e).lower():
509 | exponential_base = 2
510 | delay *= exponential_base * (1 + random.random())
511 | print(f"Error, retrying after {round(delay, 2)} seconds, {i+1}-th retry...")
512 | print(e)
513 | time.sleep(delay)
514 | continue
515 | else:
516 | print(f"Error in decode, retrying...")
517 | print(e)
518 | time.sleep(10)
519 | continue
520 | print(f"Failed after {self.MAX_RETRY_NUM} retries.")
521 | return 'Error'
522 |
523 |
524 | def annotate_p(self, task):
525 | prompt1 = task['first_turn_user_prompt']
526 | prompt2 = task['second_turn_user_prompt']
527 |
528 | if not os.path.exists(task['gen_1st_turn']):
529 | task['judge_response_1st_turn'] = 'Image not found, treated as a neutral entry.'
530 | task['judge_score_1st_turn'] = 5.0
531 | task['judge_response_2nd_turn'] = 'Image not found, treated as a neutral entry.'
532 | task['judge_score_2nd_turn'] = 5.0
533 | print(f"Image 1 not found, treated as a neutral entry.")
534 | return task
535 | else:
536 | image1 = self.encode_image(task['gen_1st_turn'])
537 |
538 | # first turn
539 | inputs = (prompt1, image1)
540 |
541 | completion = self.GPT_decode(inputs, 'turn_1')
542 | if completion == 'Error':
543 | print(f"Error in GPT_decode, the entry treated as bad entry.")
544 | task['judge_response_1st_turn'] = '[[0.0]]'
545 | elif completion == 'Blocked':
546 | print(f"{task}: \n\nBlocked, the entry treated as bad entry.")
547 | task['judge_response_1st_turn'] = '[[0.0]]'
548 | else:
549 | annotation = completion
550 | task['judge_response_1st_turn'] = annotation
551 | for i in range(self.FORMAT_MAXRETRY):
552 | if self.get_score_from_judge(annotation) != -1:
553 | task['judge_score_1st_turn'] = self.get_score_from_judge(annotation)
554 | break
555 | else:
556 | print(f"No score found in the response, retrying...")
557 | history = text2image_gemini_judge_turn1(*inputs)
558 | history.append(annotation)
559 | completion = self.GPT_decode("Continue your judgment and finish by outputting a final rating with the above-mentioned format, i.e., the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\". Your rating: ", 'append_message', history)
560 | annotation = completion
561 | history.append(annotation)
562 |
563 | if self.get_score_from_judge(annotation) != -1:
564 | task['judge_score_1st_turn'] = self.get_score_from_judge(annotation)
565 | else:
566 | task['judge_score_1st_turn'] = 5.0
567 | print(f"No score found in the response, please inspect and retry.")
568 |
569 | if not os.path.exists(task['gen_2nd_turn']):
570 | task['judge_response_2nd_turn'] = 'Image not found, treated as a neutral entry.'
571 | task['judge_score_2nd_turn'] = 5.0
572 | print(f"Image 2 not found, treated as a neutral entry.")
573 | return task
574 | else:
575 | image2 = self.encode_image(task['gen_2nd_turn'])
576 | # second turn
577 | inputs = (image1, prompt2, image2)
578 | completion = self.GPT_decode(inputs, 'turn_2')
579 | if completion == 'Error':
580 | print(f"Error in GPT_decode, the entry treated as bad entry.")
581 | task['judge_response_2nd_turn'] = '[[0.0]]'
582 | elif completion == 'Blocked':
583 | print(f"{task}: \n\nBlocked, the entry treated as bad entry.")
584 | task['judge_response_2nd_turn'] = '[[0.0]]'
585 | else:
586 | annotation = completion
587 | task['judge_response_2nd_turn'] = annotation
588 | for i in range(self.FORMAT_MAXRETRY):
589 | if self.get_score_from_judge(annotation) != -1:
590 | task['judge_score_2nd_turn'] = self.get_score_from_judge(annotation)
591 | break
592 | else:
593 | print(f"No score found in the response, retrying...")
594 | history = text2image_gemini_judge_turn2(*inputs)
595 | history.append(annotation)
596 | completion = self.GPT_decode("Continue your judgment and finish by outputting a final rating with the above-mentioned format, i.e., the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\". Your rating: ",'append_message', history)
597 | annotation = completion
598 | history.append(annotation)
599 |
600 | if self.get_score_from_judge(annotation) != -1:
601 | task['judge_score_2nd_turn'] = self.get_score_from_judge(annotation)
602 | else:
603 | task['judge_score_2nd_turn'] = 5.0
604 | print(f"No score found in the response, please inspect and retry.")
605 |
606 |
607 | return task
608 |
609 |
610 | def annotate_parallel(self, tasks):
611 | print(f"Parsing in parallel, in total {self.args.api_parallel_num} threads.")
612 | results = []
613 | with ThreadPoolExecutor(self.args.api_parallel_num) as executor:
614 | for entry in tqdm(
615 | executor.map(self.annotate_p, tasks), total=len(tasks)
616 | ):
617 | results.append(entry)
618 | if None in results:
619 | raise ValueError("Some entries are not annotated due to errors in annotate_p, please inspect and retry.")
620 | return results
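621 |
622 |
623 | if __name__ == '__main__':
624 |     # A minimal driving sketch: the judge model name, thread count, prompts, and image
625 |     # paths below are illustrative placeholders, and running it assumes a .env file
626 |     # providing MODEL_PARSER_API plus the generated images on disk.
627 |     from types import SimpleNamespace
628 |
629 |     args = SimpleNamespace(judge_model='gpt-4o', api_parallel_num=4)
630 |     judge = ChatGPTJudgeText2Image(args)
631 |     tasks = [{
632 |         'first_turn_user_prompt': 'A red bicycle leaning against a brick wall.',
633 |         'second_turn_user_prompt': 'Make the bicycle blue.',
634 |         'gen_1st_turn': 'model_outputs/1_turn1.jpg',
635 |         'gen_2nd_turn': 'model_outputs/1_turn2.jpg',
636 |     }]
637 |     tasks_judged = judge.annotate_parallel(tasks)
638 |     print(tasks_judged[0].get('judge_score_1st_turn'), tasks_judged[0].get('judge_score_2nd_turn'))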
--------------------------------------------------------------------------------
/mixeval_x/prompts/judge_prompts.py:
--------------------------------------------------------------------------------
1 | ############Judge Prompts for Close-ended Free-form Parser############
2 | image2text_gpt_judge_for_closeended_freeform = lambda prompt, gold_ans, response: [
3 | {"role": "system", "content": f"In this task, I want you to act as a judge."},
4 | {"role": "user", "content": f'''You will be provided with a question, its golden answer(s), and the model's answer, while the context of the question, which is one or more images, is not given here. Your task is to judge how correct the model's answer is based on the golden answer(s), without seeing the input images of the question, and then give a correctness score. The correctness score should be one of the below numbers: 0.0 (totally wrong), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0 (totally right). Your should first briefly give your reasoning process regarding how the model's answer conforms to or contradicts the golden answer(s), and then give the correctness score. The correctness score must strictly follow this format: \"[[score]]\", e.g., \"The correctness score: [[0.5]]\". Below are some examples.
5 |
6 | Example 1:
7 | Question: what is this advertising?
8 | Golden Answer(s): garden annual; seeds; seeds; seeds; seeds; seeds; seeds; seeds; seeds; cole's garden annual
9 | Model's Answer: Seed
10 | Your Judgment: The golden answers consistently mention "seeds" suggesting an advertisement for a seed catalog. The model's answer, "Seed", aligns exactly with this description. The Correctness Score: [[1.0]]
11 |
12 | Example 2:
13 | Question: Who is making a face?
14 | Golden Answer: child
15 | Model's Answer: A man.
16 | Your Judgment: The golden answer specifies a "child" making a face, but the model answered "A man", which is incorrect as it refers to a different age group. The Correctness Score: [[0.0]]
17 |
18 | Example 3:
19 | Question: what road is to the right?
20 | Golden Answer: troublesome valley rd; troublesome valley rd.; troublesome valley; troublesome valley road; valley road; troublesome valley; troublesome valley road; troublesome valley ; troublesome valley rd; troublesome valley rd.
21 | Model's Answer: troublesome road
22 | Your Judgment: The golden answers all specify the name of the road as "troublesome valley rd" or variations of this phrase with consistent reference to "troublesome valley." The model's answer, "troublesome road," captures the "troublesome" aspect but omits the critical "valley" part of the name, which is crucial for full accuracy. Thus, the model's answer partially matches the golden answer but lacks complete specificity. The Correctness Score: [[0.6]]
23 |
24 | Note that each one of the golden answers is considered correct. Thus if the model's answer matches any one of the golden answers, it should be considered correct. Judge the below case, give the brief reasoning process and the correctness score.
25 |
26 | Question: {prompt}
27 | Golden Answer(s): {gold_ans}
28 | Model's Answer: {response}
29 | Your Judgment:
30 | '''}
31 | ]
32 |
33 | video2text_gpt_judge_for_closeended_freeform = lambda prompt, gold_ans, response: [
34 | {"role": "system", "content": f"In this task, I want you to act as a judge."},
35 | {"role": "user", "content": f'''You will be provided with a question, its golden answer(s), and the model's answer, while the context of the question, which is one or more videos, is not given here. Your task is to judge how correct the model's answer is based on the golden answer(s), without seeing the input videos of the question, and then give a correctness score. The correctness score should be one of the below numbers: 0.0 (totally wrong), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0 (totally right). Your should first briefly give your reasoning process regarding how the model's answer conforms to or contradicts the golden answer(s), and then give the correctness score. The correctness score must strictly follow this format: \"[[score]]\", e.g., \"The correctness score: [[0.5]]\". Below are some examples.
36 |
37 | Example 1:
38 | Question: what does this video want to express
39 | Golden Answer(s): introduce method of playing
40 | Model's Answer: Volleyball serve \n
41 | Your Judgment: The model's answer "Volleyball serve" suggests a specific action, which may be part of what the video demonstrates. However, it misses the broader educational intent implied by the golden answer "introduce method of playing". Therefore, the answer is partially correct. The Correctness Score: [[0.5]]
42 |
43 | Example 2:
44 | Question: who do two other boys with surprised looks assist up?
45 | Golden Answer(s): boy
46 | Model's Answer: Boy.
47 | Your Judgment: The model's answer "Boy." precisely matches the golden answer which states the two other boys assist a "boy". The Correctness Score: [[1.0]]
48 |
49 | Example 3:
50 | Question: what did the lady do at the end of the video after their performance
51 | Golden Answer(s): picks up her phone
52 | Model's Answer: Nothing.
53 | Your Judgment: The model's answer "Nothing." directly contradicts the golden answer which states that the lady "picks up her phone" at the end of the video after their performance. Since the model's response completely misses the specific action described in the golden answer, it is incorrect. The Correctness Score: [[0.0]]
54 |
55 | Note that each one of the golden answers is considered correct. Thus if the model's answer matches any one of the golden answers, it should be considered correct. Judge the below case, give the brief reasoning process and the correctness score.
56 |
57 | Question: {prompt}
58 | Golden Answer(s): {gold_ans}
59 | Model's Answer: {response}
60 | Your Judgment:
61 | '''}
62 | ]
63 |
64 | audio2text_gpt_judge_for_closeended_freeform = lambda prompt, gold_ans, response: [
65 | {"role": "system", "content": f"In this task, I want you to act as a judge."},
66 | {"role": "user", "content": f'''You will be provided with a question, its golden answer(s), and the model's answer, while the context of the question, which is one or more audios, is not given here. Your task is to judge how correct the model's answer is based on the golden answer(s), without seeing the input audios of the question, and then give a correctness score. The correctness score should be one of the below numbers: 0.0 (totally wrong), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0 (totally right). Your should first briefly give your reasoning process regarding how the model's answer conforms to or contradicts the golden answer(s), and then give the correctness score. The correctness score must strictly follow this format: \"[[score]]\", e.g., \"The correctness score: [[0.5]]\". Below are some examples.
67 |
68 | Example 1:
69 | Question: Are the people isolated?
70 | Golden Answer(s): no
71 | Model's Answer: yes
72 | Your Judgment: The model's answer contradicts the golden answer directly. The question asked if the people are isolated, to which the golden answer is "no," indicating that the people are not isolated. However, the model's answer is "yes," implying that the people are isolated. The correctness score: [[0.0]]
73 |
74 | Example 2:
75 | Question: Who is speaking?
76 | Golden Answer(s): man
77 | Model's Answer: men
78 | Your Judgment: The model's answer is almost correct but imprecise. The question asked about the identity of the speaker, to which the golden answer specifies a singular "man." However, the model's answer is "men," which suggests multiple individuals rather than one. This small pluralization error suggests a misunderstanding of the query about the exact number of people speaking. The correctness score: [[0.6]]
79 |
80 | Example 3:
81 | Question: What did you hear after the door slamming?
82 | Golden Answer(s): dog making noise
83 | Model's Answer: dog
84 | Your Judgment: The model's answer "dog" matches the golden answer's essential element, "dog making noise," by correctly identifying the dog. Although it omits "making noise," it captures the key information needed. The correctness score: [[1.0]]
85 |
86 | Note that each one of the golden answers is considered correct. Thus if the model's answer matches any one of the golden answers, it should be considered correct. Judge the below case, give the brief reasoning process and the correctness score.
87 |
88 | Question: {prompt}
89 | Golden Answer(s): {gold_ans}
90 | Model's Answer: {response}
91 | Your Judgment:
92 | '''}
93 | ]
94 |
95 | ############Judge Prompts for Close-ended Multiple-choice Parser############
96 | image2text_gpt_judge_for_closeended_multiplechoice = lambda prompt, options, response: [
97 | {"role": "system", "content": f"In this task, I want you to act as an option extractor."},
98 | {"role": "user", "content": f'''You will be provided with a multiple-choice question, its options, and the model's answer, while the context of the question, which is one or more images, is not given here. Your task is to extract or judge which option is chosen by the model based on its response, without seeing the context of the question. The extracted option should be one of the provided option letters. Your should first briefly give your reasoning process, and then give the extracted option letter. The extracted option must strictly follow this format: \"[[option letter]]\", e.g., \"The option chosen by the model: [[A]]\".
99 | Below are some examples.
100 |
101 | Example 1:
102 | Question: Where are the cast of the television show located in the image?
103 | Options:
104 | A. In the foreground
105 | B. In the background
106 | C. In the center
107 | D. At the edges
108 | Model's Answer: C. In the center
109 | Your Judgment: The model's answer clearly states "C. In the center", indicating that the correct option, according to the model, is in the center. The option chosen by the model: [[C]].
110 |
111 | Example 2:
112 | Question: on the left was painted during the
113 | Options:
114 | A. first or second century C. E.
115 | B. sixth or seventh century C. E.
116 | C. tenth or eleventh century C.E.
117 | D. fourteenth or fifteenth century C. E.
118 | Model's Answer: The correct answer is option D, the fourteenth or fifteenth century C.E.
119 | Your Judgment: The model's response specifies "option D, the fourteenth or fifteenth century C.E." directly as the correct answer. The option chosen by the model: [[D]].
120 |
121 | Example 3:
122 | Question: what does the diagram show's you information about
123 | Options:
124 | A. Photosynthesis
125 | B. The plant getting fed
126 | C. A picture of the plant
127 | D. What happens to a plant daily
128 | Model's Answer: The diagram shows the process of photosynthesis, which is the process by which plants convert sunlight, carbon dioxide, and water into oxygen and glucose.
129 | Your Judgment: The model's answer mentions "the process of photosynthesis," which directly corresponds to option A, "Photosynthesis". Therefore, the correct option according to the model is photosynthesis. The option chosen by the model: [[A]].
130 |
131 | Give the brief reasoning process and the extracted option for the below case:
132 |
133 | Question: {prompt}
134 | Options:
135 | {options}
136 | Model's Answer: {response}
137 | Your Judgment:
138 | '''}
139 | ]
140 |
141 | video2text_gpt_judge_for_closeended_multiplechoice = lambda prompt, options, response: [
142 | {"role": "system", "content": f"In this task, I want you to act as an option extractor."},
143 | {"role": "user", "content": f'''You will be provided with a multiple-choice question, its options, and the model's answer, while the context of the question, which is one or more videos, is not given here. Your task is to extract or judge which option is chosen by the model based on its response, without seeing the context of the question. The extracted option should be one of the provided option letters. Your should first briefly give your reasoning process, and then give the extracted option letter. The extracted option must strictly follow this format: \"[[option letter]]\", e.g., \"The option chosen by the model: [[A]]\".
144 | Below are some examples.
145 |
146 | Example 1:
147 | Question: What did he do to the car?
148 | Options:
149 | A. Paint the car
150 | B. Put plastic over the car
151 | C. Put metal over the car
152 | D. Cut the car
153 | Model's Answer: put plastic over the car.
154 | Your Judgment: The model's response directly aligns with option B, which is "Put plastic over the car." The response given is a paraphrase of this option without deviating in meaning. The option chosen by the model: [[B]]
155 |
156 | Example 2:
157 | Question: How did Eddie know Pam and Justin before Justin was killed?
158 | Options:
159 | A. They were part of the theater company
160 | B. They were high school friends
161 | C. They went to college together
162 | D. They were cousins
163 | E. They were siblings
164 | Model's Answer: A.
165 | Your Judgment: The model's answer directly provides the option letter "A." The option chosen by the model: [[A]]
166 |
167 | Example 3:
168 | Question: why do the people move in the same manner
169 | Options:
170 | A. uniform
171 | B. dancing with the baby
172 | C. exercising together
173 | D. stay together
174 | E. singing and dancing
175 | Model's Answer: sing and dance
176 | Your Judgment: The model's response "sing and dance" closely aligns with option E, which is "singing and dancing." The response provided is a direct paraphrase of this option, modifying only slightly the form of the words (from gerund to infinitive) but maintaining the same core activities described in the option. The option chosen by the model: [[E]]
177 |
178 | When you think that the model's answer does not match any of the given options, please choose the option that is the closest to the model's answer.
179 | Give the brief reasoning process and the extracted option for the below case.
180 |
181 | Question: {prompt}
182 | Options:
183 | {options}
184 | Model's Answer: {response}
185 | Your Judgment:
186 | '''}
187 | ]
188 |
189 |
190 | text2image_gpt_judge_turn1 = lambda prompt1, image1: [
191 | {"role": "system", "content": f"In this task, you will act as an impartial judge for image generation tasks."},
192 | {"role": "user", "content": [{'type': 'text', 'text': f'''
193 | Please act as an impartial judge and evaluate the quality of an image generated by an AI assistant given the provided user prompt or caption.
194 |
195 | You must first analyze the generated image based on the provided prompt carefully. After providing your analysis, you must give the final score on a scale of 1 to 10 (1 means extremely bad and 10 means extremely good), and the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n
196 |
197 | You should consider the following core aspects when analyzing the image:
198 | 1. **Alignment**: Assess how accurately the image reflects the given prompt. Check if all elements and requirements are correctly represented.
199 | 2. **Realism**: Judge if the image looks realistic and natural.
200 | 3. **Quality**: Identify if there's any flaw in the image, such as distortion, blurriness, or illogical representation of facial features, limbs, fingers, objects, or text. In addition, evaluate the overall quality of the image.
201 |
202 | Analyze and judge the below case:
203 |
204 | Generation Prompt/Caption: {prompt1}
205 | Generated Image:
206 | '''
207 | },
208 | {
209 | "type": "image_url",
210 | "image_url": {
211 | "url": f"data:image/jpg;base64,{image1}"
212 | }
213 | },
214 | {
215 | "type": "text",
216 | "text": f'''
217 | Your Analysis and Judgment:
218 | '''
219 | }
220 | ]}
221 | ]
222 |
223 | text2image_gpt_judge_turn2 = lambda image1, prompt1, image2: [
224 | {"role": "system", "content": f"In this task, you will act as an impartial judge for an image editing task."},
225 | {"role": "user", "content": [{'type': 'text', 'text': f'''
226 | You will be provided with an image to edit, the user prompt to edit the image, and the edited image. Your task is to evaluate the quality of the edited image based on the given information.
227 |
228 | You must first analyze the edited image based on the provided editing prompt and the image to edit carefully. After providing your analysis, you must give the final score on a scale of 1 to 10 (1 means extremely bad and 10 means extremely good), and the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n
229 |
230 | You should consider the following core aspects when analyzing the image:
231 | 1. **Alignment**: Assess how accurately the edited image reflects the changes indicated in the given editing prompt. Check if all elements and requirements are correctly met.
232 | 2. **Consistency**: Evaluate if the edited image is consistent with the original image in terms of details, style, color, overall appearance, etc.
233 | 3. **Realism**: Judge if the edited image looks realistic and natural after the editing process.
234 | 4. **Quality**: Identify if there's any flaw in the edited image, such as distortion, blurriness, or illogical representation of facial features, limbs, fingers, objects, or text. In addition, evaluate the overall quality of the image.
235 |
236 |
237 | Analyze and judge the below case:
238 |
239 | Editing Prompt/Caption: {prompt1}
240 | The Image to Edit:
241 | '''},
242 | {
243 | "type": "image_url",
244 | "image_url": {
245 | "url": f"data:image/jpg;base64,{image1}"
246 | }
247 | },
248 | {
249 | "type": "text",
250 | "text": f'''
251 | The Edited Image:
252 | '''
253 | },
254 | {
255 | "type": "image_url",
256 | "image_url": {
257 | "url": f"data:image/jpg;base64,{image2}"
258 | }
259 | },
260 | {
261 | "type": "text",
262 | "text": f'''
263 | Your Analysis and Judgment:
264 | '''
265 | }
266 | ]}
267 | ]
268 |
269 | text2image_claude_judge_turn1 = lambda prompt1, image1: [
270 | {"role": "user", "content": [{'type': 'text', 'text': f'''In this task, you will act as an impartial judge for image generation tasks.
271 |
272 | Please act as an impartial judge and evaluate the quality of an image generated by an AI assistant given the provided user prompt or caption.
273 |
274 | You must first analyze the generated image based on the provided prompt carefully. After providing your analysis, you must give the final score on a scale of 1 to 10 (1 means extremely bad and 10 means extremely good), and the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n
275 |
276 | You should consider the following core aspects when analyzing the image:
277 | 1. **Alignment**: Assess how accurately the image reflects the given prompt. Check if all elements and requirements are correctly represented.
278 | 2. **Realism**: Judge if the image looks realistic and natural.
279 | 3. **Quality**: Identify if there's any flaw in the image, such as distortion, blurriness, or illogical representation of facial features, limbs, fingers, objects, or text. In addition, evaluate the overall quality of the image.
280 |
281 | Analyze and judge the below case:
282 |
283 | Generation Prompt/Caption: {prompt1}
284 | Generated Image:
285 | '''
286 | },
287 | {
288 | "type": "image",
289 | "source": {
290 | "type": "base64",
291 | "media_type": "image/jpeg",
292 | "data": image1
293 | }
294 | },
295 | {
296 | "type": "text",
297 | "text": f'''
298 | Your Analysis and Judgment:
299 | '''
300 | }
301 | ]}
302 | ]
303 |
304 | text2image_claude_judge_turn2 = lambda image1, prompt1, image2: [
305 | {"role": "user", "content": [{'type': 'text', 'text': f'''In this task, you will act as an impartial judge for an image editing task.
306 |
307 | You will be provided with an image to edit, the user prompt to edit the image, and the edited image. Your task is to evaluate the quality of the edited image based on the given information.
308 |
309 | You must first analyze the edited image based on the provided editing prompt and the image to edit carefully. After providing your analysis, you must give the final score on a scale of 1 to 10 (1 means extremely bad and 10 means extremely good), and the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n
310 |
311 | You should consider the following core aspects when analyzing the image:
312 | 1. **Alignment**: Assess how accurately the edited image reflects the changes indicated in the given editing prompt. Check if all elements and requirements are correctly met.
313 | 2. **Consistency**: Evaluate if the edited image is consistent with the original image in terms of details, style, color, overall appearance, etc.
314 | 3. **Realism**: Judge if the edited image looks realistic and natural after the editing process.
315 | 4. **Quality**: Identify if there's any flaw in the edited image, such as distortion, blurriness, or illogical representation of facial features, limbs, fingers, objects, or text. In addition, evaluate the overall quality of the image.
316 |
317 |
318 | Analyze and judge the below case:
319 |
320 | Editing Prompt/Caption: {prompt1}
321 | The Image to Edit:
322 | '''},
323 | {
324 | "type": "image",
325 | "source": {
326 | "type": "base64",
327 | "media_type": "image/jpeg",
328 | "data": image1
329 | }
330 | },
331 | {
332 | "type": "text",
333 | "text": f'''
334 | The Edited Image:
335 | '''
336 | },
337 | {
338 | "type": "image",
339 | "source": {
340 | "type": "base64",
341 | "media_type": "image/jpeg",
342 | "data": image2
343 | }
344 | },
345 | {
346 | "type": "text",
347 | "text": f'''
348 | Your Analysis and Judgment:
349 | '''
350 | }
351 | ]}
352 | ]
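
# A minimal usage sketch (an assumption, not part of this module): the message list returned
# by either Claude builder can be sent with the official `anthropic` client, e.g.
#
#   import anthropic
#   client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment
#   messages = text2image_claude_judge_turn1(prompt1, image1_b64)  # image1_b64: base64-encoded JPEG string
#   resp = client.messages.create(
#       model="claude-3-5-sonnet-20240620",  # hypothetical choice of judge model
#       max_tokens=1024,
#       messages=messages,
#   )
#   judgment_text = resp.content[0].text  # the "Rating: [[x]]" verdict is parsed from this string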
353 |
354 |
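# The Gemini builders below return a flat list of parts (prompt text, image object, trailing
# text) rather than role-tagged messages, matching the interleaved text/image input style
# accepted by the Gemini SDK's generate_content call.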
355 | text2image_gemini_judge_turn1 = lambda prompt1, image1: [
356 | f'''In this task, you will act as an impartial judge for image generation tasks.
357 |
358 | Please act as an impartial judge and evaluate the quality of an image generated by an AI assistant given the provided user prompt or caption.
359 |
360 | You must first analyze the generated image based on the provided prompt carefully. After providing your analysis, you must give the final score on a scale of 1 to 10 (1 means extremely bad and 10 means extremely good), and the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n
361 |
362 | You should consider the following core aspects when analyzing the image:
363 | 1. **Alignment**: Assess how accurately the image reflects the given prompt. Check if all elements and requirements are correctly represented.
364 | 2. **Realism**: Judge if the image looks realistic and natural.
365 | 3. **Quality**: Identify if there's any flaw in the image, such as distortion, blurriness, or illogical representation of facial features, limbs, fingers, objects, or text. In addition, evaluate the overall quality of the image.
366 |
367 | Analyze and judge the below case:
368 |
369 | Generation Prompt/Caption: {prompt1}
370 | Generated Image:
371 | ''',
372 | image1,
373 | f'''
374 | Your Analysis and Judgment:
375 | '''
376 | ]
377 |
378 | text2image_gemini_judge_turn2 = lambda image1, prompt1, image2: [
379 | f'''In this task, you will act as an impartial judge for an image editing task.
380 |
381 | You will be provided with an image to edit, the user prompt to edit the image, and the edited image. Your task is to evaluate the quality of the edited image based on the given information.
382 |
383 | You must first analyze the edited image based on the provided editing prompt and the image to edit carefully. After providing your analysis, you must give the final score on a scale of 1 to 10 (1 means extremely bad and 10 means extremely good), and the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n
384 |
385 | You should consider the following core aspects when analyzing the image:
386 | 1. **Alignment**: Assess how accurately the edited image reflects the changes indicated in the given editing prompt. Check if all elements and requirements are correctly met.
387 | 2. **Consistency**: Evaluate if the edited image is consistent with the original image in terms of details, style, color, overall appearance, etc.
388 | 3. **Realism**: Judge if the edited image looks realistic and natural after the editing process.
389 | 4. **Quality**: Identify if there's any flaw in the edited image, such as distortion, blurriness, or illogical representation of facial features, limbs, fingers, objects, or text. In addition, evaluate the overall quality of the image.
390 |
391 |
392 | Analyze and judge the below case:
393 |
394 | Editing Prompt/Caption: {prompt1}
395 | The Image to Edit:
396 | ''',
397 | image1,
398 | f'''
399 | The Edited Image:
400 | ''',
401 | image2,
402 | f'''
403 | Your Analysis and Judgment:
404 | '''
405 | ]
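
# A minimal usage sketch (an assumption, not part of this module): with the
# `google-generativeai` SDK, the part list can be passed directly, e.g.
#
#   import google.generativeai as genai
#   genai.configure(api_key="...")                     # assumes a valid API key
#   model = genai.GenerativeModel("gemini-1.5-pro")    # hypothetical choice of judge model
#   parts = text2image_gemini_judge_turn1(prompt1, image1)  # image1: e.g. a PIL.Image
#   judgment_text = model.generate_content(parts).text      # parse "Rating: [[x]]" from this string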
406 |
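# GPT-format judges for the action-planning tasks: each builder returns OpenAI
# chat-completions style messages (a system turn plus a user turn) whose reply is
# expected to end with a verdict in the "Rating: [[x]]" format.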
407 | text2action_gpt_judge = lambda task_description, allowed_actions, visible_objects, already_executed_steps, target, model_response : [
408 | {"role": "system", "content": f"In this task, you will act as an impartial judge for a real-world planning task."},
409 | {"role": "user", "content": f'''Your job is to evaluate the quality of the action-object sequences planned by an AI assistant for a real-world task. You will be provided with the Task Description, Allowed Actions, Visible Objects, Already Executed Action-Object Sequences, the target, and the model's response. The 'Task Description' is a user instruction that instructs the AI assistant, which is being evaluated, to complete the task. The 'Allowed Actions' is a list of actions that are allowed to be used by the AI assistant to complete the task. The 'Visible Objects' is a list of objects that are assumed to be visible to the AI assistant when it's completing the task. The 'Already Executed Action-Object Sequences' is a list of action-object sequences that are assumed to have been completed by the AI assistant at the moment of starting the planning. The 'Reference Answer' is an example action-object sequence output for your reference, which is annotated by a human and may not be the only correct answer. The 'Model Response' is the output of the AI assistant you are evaluating.
410 |
411 | Your task is to analyze the model's response and evaluate how well it plans given the above-mentioned information and the reference answer. After providing your analysis, you must give the final score on a scale of 1 to 10 (1 means extremely bad and 10 means extremely good), and the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n
412 |
413 | Below is a simplified example of how to judge the model's response:
414 |
415 | **Start of Example**
416 | Task Description: Put a heated egg in the sink.
417 | Allowed Actions: [OpenObject], [CloseObject], [PickupObject], [PutObject], [ToggleObjectOn], [ToggleObjectOff], [SliceObject], [Navigation]
418 | Visible Objects: , , , , , , , , ,
419 | Already Executed Action-Object Sequences: [Navigation] <Fridge>, [OpenObject] <Fridge>, [PickupObject] <Egg>, [CloseObject] <Fridge>, [Navigation] <Microwave>, [PutObject] <Microwave>
420 | Reference Answer: [ToggleObjectOn] <Microwave>, [ToggleObjectOff] <Microwave>, [PickupObject] <Egg>, [Navigation] <Sink>, [PutObject] <Sink>
421 | Model Response: [PickupObject] <Egg>, [Navigation] <Sink>, [PutObject] <Sink>
422 | Your Analysis and Judgment: The model's response omits crucial steps for heating the egg, assuming it is already heated without evidence from prior actions. It correctly performs the transport and placement of the egg, using appropriate actions and objects. However, by neglecting the heating process essential to the task description, the response is incomplete. My Final Rating: [[3]].
423 | **End of Example**
424 |
425 | With the above description and example, analyze and judge the below case:
426 |
427 | Task Description: {task_description}
428 | Allowed Actions: {allowed_actions}
429 | Visible Objects: {visible_objects}
430 | Already Executed Action-Object Sequences: {already_executed_steps}
431 | Reference Answer: {target}
432 | Model Response: {model_response}
433 | Your Analysis and Judgment:
434 | '''}
435 | ]
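
# A minimal usage sketch (an assumption, not part of this module): the messages can be sent
# with the official `openai` client, e.g.
#
#   from openai import OpenAI
#   client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
#   messages = text2action_gpt_judge(task, actions, objects, done_steps, target, response)  # placeholder args
#   out = client.chat.completions.create(model="gpt-4o", messages=messages)  # hypothetical judge model
#   judgment_text = out.choices[0].message.content  # parse "Rating: [[x]]" from this string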
436 |
437 |
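# image2action variant: same judging rubric as text2action, but the user turn is a mixed
# content list so the visible scene can be supplied as base64-encoded images via
# OpenAI-style "image_url" data-URL blocks.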
438 | image2action_gpt_judge = lambda example_image, task_description, allowed_actions, image, already_executed_steps, target, model_response : [
439 | {"role": "system", "content": f"In this task, you will act as an impartial judge for a real-world planning task."},
440 |     {"role": "user", "content": [{'type': 'text', 'text': f'''Your job is to evaluate the quality of the action-object sequences planned by an AI assistant with visual perception for a real-world task. You will be provided with the Task Description, Allowed Actions, Visible Objects, Already Executed Action-Object Sequences, the target, and the model's response. The 'Task Description' is a user instruction that instructs the AI assistant, which is being evaluated, to complete the task. The 'Allowed Actions' is a list of actions that are allowed to be used by the AI assistant to complete the task. The 'Visible Objects' is a list of objects that are assumed to be visible to the AI assistant when it's completing the task. Note that some invisible objects may still be usable to the AI assistant, but their existence must be consistent with common sense. The 'Already Executed Action-Object Sequences' is a list of action-object sequences that are assumed to have been completed by the AI assistant at the moment of starting the planning. The 'Reference Answer' is an example action-object sequence output for your reference, which is annotated by a human and may not be the only correct answer. The 'Model Response' is the output of the AI assistant you are evaluating.
441 |
442 | Your task is to analyze the model's response and evaluate how well it plans given the above-mentioned information and the reference answer. After providing your analysis, you must give the final score on a scale of 1 to 10 (1 means extremely bad and 10 means extremely good), and the rating must strictly follow this format: \"[[rating]]\", for example: \"Rating: [[5]]\".
443 |
444 | Below is a simplified example of how to judge the model's response:
445 |
446 | **Start of Example**
447 | Task Description: Get the egg from the fridge, and put the heated egg in the sink.
448 | Allowed Actions: [OpenObject], [CloseObject], [PickupObject], [PutObject], [ToggleObjectOn], [ToggleObjectOff], [SliceObject], [Navigation]
449 | Visible Objects: '''},
450 | {
451 | "type": "image_url",
452 | "image_url": {
453 |             "url": f"data:image/jpeg;base64,{example_image}"
454 | }
455 | },
456 | {'type': 'text', 'text': f'''
457 | Already Executed Action-Object Sequences: [Navigation] <Fridge>, [OpenObject] <Fridge>, [PickupObject] <Egg>