├── .gitignore ├── README.md ├── dataset_sheet.pdf ├── plot2code ├── __init__.py ├── eval │ ├── __int__.py │ ├── combine_evaluation_results.py │ ├── gpt4v_evaluate_pairs.py │ ├── gpt4v_evaluations_score.py │ └── text_match_score.py ├── execute_generated_code.py ├── gpt4v_generate_code.py ├── llm_generate_code.py └── utils.py ├── requirements.txt └── scripts ├── evaluate-instruct.sh ├── evaluate.sh └── generate_code.sh /.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | generated_results/ 3 | evaluation_results -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Plot2Code Benchmark 2 | 3 | The Plot2Code benchmark is now open-sourced at [huggingface (ARC Lab)](https://huggingface.co/TencentARC) and [GitHub](https://github.com/TencentARC/Plot2Code). More information can be found in our [paper](https://arxiv.org/abs/2405.07990). 4 | 5 | 6 | This repository contains the code for an evaluation pipeline that generates Python code from reference plots, executes the generated code to draw plots, and then calculates various evaluation metrics to assess the quality of the generated code. 7 | 8 | ## Why do we need [Plot2Code](https://huggingface.co/datasets/TencentARC/Plot2Code)? 9 | * 🧐 While MLLMs have demonstrated potential in visual contexts, their capabilities in visual coding tasks have not been thoroughly evaluated. Plot2Code offers a platform for comprehensive assessment of these models. 10 | 11 | * 🤗 To let users gauge how proficient AI assistants are at generating code that renders a plot matching a given reference plot, we initiated the Plot2Code project. This keeps the evaluation relevant to real-world applications. 12 | 13 | * 💻 Plot2Code accommodates all modalities (text and images) for both input and output, facilitating an exploration of the influence of each modality. 14 | 15 | ## Supported Tasks 16 | 17 | Plot2Code is primarily designed as a benchmark for code generation from scientific plots. Specifically, it supports the following settings: 18 | 19 | * Text2Image: We provide instructions to the assistant, requesting it to generate pyplot code and subsequently render the plots. 20 | * Image2Image: Referred to as the Direct Asking setting in our paper, we input the reference plot directly and ask the assistant to generate pyplot code to render similar plots. 21 | * I+T 2 Image: Combining both instructions and reference plots as input, this is called the Conditional Asking setting in our paper. 22 | 23 | By employing these settings, we can investigate the impact of each input modality on the quality of the final rendered plots. 24 | 25 | ## Requirements 26 | 27 | - NumPy 28 | - Matplotlib==3.8.4 29 | - Pillow 30 | - Levenshtein 31 | - openai>1.12.0 32 | 33 | You can install the required packages using the following command: 34 | 35 | ```bash 36 | pip install -r requirements.txt 37 | ``` 38 | 39 | ## How to Download 40 | You can use the following commands to download the dataset: 41 | ```shell 42 | git lfs install 43 | mkdir data 44 | cd data 45 | git clone https://huggingface.co/datasets/TencentARC/Plot2Code 46 | ``` 47 | 48 | ## Usage 49 | 50 | 1. Generate code from reference plots. Add --instruct for the conditional setting. 
51 | ```bash 52 | export OPENAI_API_KEY=[API_KEY] 53 | export OPENAI_API_BASE=[API_BASE] 54 | 55 | # GPT-4V generate code (direct asking) 56 | python -m plot2code.gpt4v_generate_code --prompt_strategy default 57 | 58 | # GPT-4V generate code (conditional asking) 59 | python -m plot2code.gpt4v_generate_code --prompt_strategy default --instruct 60 | 61 | # GPT-4V generate code (conditional asking with CoT) 62 | python -m plot2code.gpt4v_generate_code --prompt_strategy CoT --instruct 63 | ``` 64 | 2. Execute the generated code to render the plots, substituting the model name and prompt strategy used in step 1 for $model_name and $prompt_strategy. 65 | ```bash 66 | python -m plot2code.execute_generated_code --model_name "$model_name" --prompt_strategy $prompt_strategy 67 | 68 | ``` 69 | 3. Evaluate the similarity between the generated plots and the ground truth plots. 70 | 71 | ```bash 72 | echo "Calculating text match score..." 73 | python -m plot2code.eval.text_match_score --model_name "$model_name" --prompt_strategy $prompt_strategy 74 | 75 | echo "Calculating gpt-4v evaluation score..." 76 | python -m plot2code.eval.gpt4v_evaluations_score --model_name "$model_name" --prompt_strategy $prompt_strategy 77 | 78 | echo "Combining evaluation results..." 79 | python -m plot2code.eval.combine_evaluation_results --model_name "$model_name" --prompt_strategy $prompt_strategy 80 | ``` 81 | 82 | See [scripts](scripts) for more details. 83 | 84 | ## News 85 | * 🔥[2024/08] We further updated the Plotly plot-code pairs for Python and R with instructions for evaluation!🔥 86 | * 🔥[2024/05] We open-sourced the [Plot2Code benchmark](https://huggingface.co/datasets/TencentARC/Plot2Code). 87 | Stay tuned for this project! 😆 88 | 89 | ## License 90 | 91 | In this study, we crawled every website link listed in the Matplotlib gallery and Plotly documentation to collect data for our analysis. Both the Matplotlib and Plotly libraries are distributed under permissive open-source licenses. We have taken the following steps to ensure compliance with the respective license terms: 92 | 93 | 1. Acknowledgment of Licenses: We acknowledge that the Matplotlib library and its gallery are distributed under the BSD 3-Clause License, and the Plotly library and its documentation are distributed under the MIT License. 94 | 2. Retention of Copyright Notices: We have retained all copyright notices and license information from the original Matplotlib gallery content and Plotly documentation, as required by their respective licenses. 95 | 3. Usage and Distribution: Our use of the Matplotlib gallery and Plotly documentation content is solely for academic and research purposes. We have not modified the original content from the Matplotlib gallery or Plotly documentation, and any distribution of our work will include proper attribution to the Matplotlib and Plotly projects. 96 | 97 | By adhering to these guidelines, we ensure that our use of the Matplotlib and Plotly content is fully compliant with their respective licenses. 98 | 99 | This project is open-sourced under the [Apache-2.0 license](https://www.apache.org/licenses/LICENSE-2.0). The evaluation code and datasets are fully open for academic research and can be used for commercial purposes with official written permission. Check our [dataset sheet](dataset_sheet.pdf) for more information. 100 | 101 | ## Citation 102 | The code and model in this repository are mostly developed for or derived from the paper below. Please cite it if you find the repository helpful. 
103 | ``` 104 | @misc{wu2024plot2code, 105 | title={Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots}, 106 | author={Chengyue Wu and Yixiao Ge and Qiushan Guo and Jiahao Wang and Zhixuan Liang and Zeyu Lu and Ying Shan and Ping Luo}, 107 | year={2024}, 108 | eprint={2405.07990}, 109 | archivePrefix={arXiv}, 110 | primaryClass={cs.CL} 111 | } 112 | ``` 113 | -------------------------------------------------------------------------------- /dataset_sheet.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TencentARC/Plot2Code/12bd95a7e04dfcc2eb664cbeb739d924c090c503/dataset_sheet.pdf -------------------------------------------------------------------------------- /plot2code/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TencentARC/Plot2Code/12bd95a7e04dfcc2eb664cbeb739d924c090c503/plot2code/__init__.py -------------------------------------------------------------------------------- /plot2code/eval/__int__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TencentARC/Plot2Code/12bd95a7e04dfcc2eb664cbeb739d924c090c503/plot2code/eval/__int__.py -------------------------------------------------------------------------------- /plot2code/eval/combine_evaluation_results.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import numpy as np 4 | from ..utils import get_parser, get_eval_path, get_save_path 5 | 6 | parser = get_parser() 7 | args = parser.parse_args() 8 | # Function to read the JSONL file and load the content into a list 9 | def read_jsonl_file(jsonl_file): 10 | content_list = [] 11 | with open(jsonl_file, "r") as file: 12 | for line in file: 13 | content_list.append(json.loads(line)) 14 | return content_list 15 | 16 | eval_dir = get_eval_path(args) 17 | generated_code_file = get_save_path(args) 18 | 19 | text_match_score_results_file = os.path.join(eval_dir, args.text_match_score_results) 20 | gpt4v_evaluation_results_file = os.path.join(eval_dir, args.gpt4_vision_evaluation_results) 21 | final_score_file = os.path.join(eval_dir, args.final_score_results) 22 | 23 | # Read the JSONL files 24 | generated_code_results = read_jsonl_file(generated_code_file) 25 | text_match_score_results = read_jsonl_file(text_match_score_results_file) 26 | evaluations_results = read_jsonl_file(gpt4v_evaluation_results_file) 27 | # Initialize the list to store the final results 28 | final_results = [] 29 | 30 | # Iterate over the items in the lists 31 | for text_match, evaluation in zip(text_match_score_results, evaluations_results): 32 | # Create a new dictionary to store the final result for the current item 33 | final_result = {} 34 | 35 | # Add the evaluation results to the final result 36 | final_result.update(text_match) 37 | final_result.update(evaluation) 38 | 39 | # Append the final result to the list of final results 40 | final_results.append(final_result) 41 | 42 | # Write the final results to a new JSONL file 43 | with open(final_score_file, "w") as file: 44 | for final_result in final_results: 45 | file.write(json.dumps(final_result) + "\n") 46 | 47 | # Calculate the average of each evaluation metric 48 | average_text_match_score = np.mean([result['text_match_score'] for result in final_results]) 49 | average_evaluation_score = 
np.mean([result['rating'] for result in final_results if result['rating'] is not None]) 50 | 51 | file.write(f'Code pass rate: {(len(final_results)) / len(generated_code_results)}\n') 52 | file.write(f"Average text match score: {average_text_match_score}\n") 53 | file.write(f"Average gpt-4v evaluation score: {average_evaluation_score}\n") 54 | 55 | print(f'Code pass rate: {(len(final_results)) / len(generated_code_results)}\n') 56 | print(f"Average text match score: {average_text_match_score}\n") 57 | print(f"Average gpt-4v evaluation score: {average_evaluation_score}\n") -------------------------------------------------------------------------------- /plot2code/eval/gpt4v_evaluate_pairs.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from openai import OpenAI 4 | from tqdm import tqdm 5 | from PIL import Image 6 | import numpy as np 7 | from ..utils import get_parser, get_save_path, get_eval_path, get_api_response, encode_image, read_jsonl_file 8 | import time 9 | 10 | parser = get_parser() 11 | parser.add_argument("--test_model_name", type=str, default="test_model") 12 | parser.add_argument("--test_prompt_strategy", type=str, default=None) 13 | 14 | args = parser.parse_args() 15 | 16 | client = OpenAI( 17 | api_key=os.getenv("OPENAI_API_KEY"), 18 | base_url=os.getenv("OPENAI_API_BASE"), 19 | ) 20 | 21 | 22 | # Read the JSONL file 23 | generated_code_file = get_save_path(args) 24 | content_list = read_jsonl_file(generated_code_file) 25 | 26 | 27 | if args.test_prompt_strategy is not None: 28 | test_file = generated_code_file.replace(args.prompt_strategy, args.test_prompt_strategy) 29 | else: 30 | test_file = generated_code_file.replace(args.model_name, args.test_model_name) 31 | 32 | compared_content_list = read_jsonl_file(test_file) 33 | 34 | def extract_non_empty_img(content_list): 35 | non_empty_img = {} 36 | for item in content_list: 37 | test_image_path = item['generated_image_path'] 38 | img = Image.open(test_image_path) 39 | img_np = np.array(img) 40 | 41 | if np.all(img_np == 255): 42 | continue # Skip this iteration 43 | 44 | idx = int(item['generated_image_path'].rstrip('.png').split('test_image_')[-1]) 45 | non_empty_img[idx] = test_image_path 46 | 47 | return non_empty_img 48 | 49 | def extract_ground_truth_img(content_list): 50 | ground_truth_img = {} 51 | for item in content_list: 52 | ground_truth_path = item['ground_truth_path'] 53 | 54 | idx = int(item['ground_truth_path'].rstrip('.png').split('ground_truth_image_')[-1]) 55 | ground_truth_img[idx] = ground_truth_path 56 | 57 | return ground_truth_img 58 | 59 | 60 | compare_prompt = "Please act as an impartial judge and evaluate the quality of the generated images provided by two AI assistants given the ground truth image displayed below. " + \ 61 | "You should choose the assistant that generate the more similar image. Your evaluation should consider factors such as the overall appearance, colors, shapes, positions, and other visual elements of the images." 62 | 63 | output_prompt = "Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any biases and ensure that the order in which the responses were presented does not influence your decision. " + \ 64 | "Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: \"[[A]]\" if assistant A is better, \"[[B]]\" if assistant B is better, and \"[[C]]\" for a tie." 
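# The two prompts above ask the judge to pick whichever generated plot is closer to the
# ground-truth plot. compare_image() below sends the ground-truth image and both candidate
# images to GPT-4V in a single request and parses the verdict by searching the reply for the
# literal [[A]], [[B]] or [[C]] markers; any other reply is recorded as "error".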
65 | # Function to evaluate the similarity between two images 66 | def compare_image(gt_path, image_path1, image_path2): 67 | gt_image = encode_image(gt_path) 68 | base64_image1 = encode_image(image_path1) 69 | base64_image2 = encode_image(image_path2) 70 | 71 | messages=[ 72 | { 73 | "role": "system", 74 | "content": [ 75 | { 76 | "type": "text", 77 | "text": "You are a helpful assistant." 78 | }, 79 | ] 80 | }, 81 | { 82 | "role": "user", 83 | "content": [ 84 | { 85 | "type": "text", 86 | "text": compare_prompt 87 | }, 88 | { 89 | "type": "text", 90 | "text": "Here is the ground truth image." 91 | }, 92 | { 93 | "type": "image_url", 94 | "image_url": { 95 | "url": f"data:image/png;base64,{gt_image}" 96 | }, 97 | }, 98 | { 99 | "type": "text", 100 | "text": "Here is the image generated by the assistant A." 101 | }, 102 | { 103 | "type": "image_url", 104 | "image_url": { 105 | "url": f"data:image/png;base64,{base64_image1}" 106 | }, 107 | }, 108 | { 109 | "type": "text", 110 | "text": "Here is the image generated by the assistant B." 111 | }, 112 | { 113 | "type": "image_url", 114 | "image_url": { 115 | "url": f"data:image/png;base64,{base64_image2}" 116 | }, 117 | }, 118 | { 119 | "type": "text", 120 | "text": output_prompt 121 | }, 122 | ], 123 | } 124 | ] 125 | response = get_api_response(client, messages, args, model_name='gpt-4-vision-preview') 126 | judgement = response.choices[0].message.content.strip() 127 | if "[[A]]" in judgement: 128 | winner = "A" 129 | elif "[[B]]" in judgement: 130 | winner = "B" 131 | elif "[[C]]" in judgement: 132 | winner = "tie" 133 | else: 134 | winner = "error" 135 | return winner, judgement 136 | 137 | 138 | 139 | img_dict_1 = extract_non_empty_img(content_list) 140 | img_dict_2 = extract_non_empty_img(compared_content_list) 141 | gt_dict = extract_ground_truth_img(content_list) 142 | 143 | img_idx_set1 = set(img_dict_1.keys()) 144 | img_idx_set2 = set(img_dict_2.keys()) 145 | 146 | common_img_idx = img_idx_set1.intersection(img_idx_set2) 147 | 148 | eval_dir = get_eval_path(args) 149 | 150 | if args.test_prompt_strategy is not None: 151 | args.test_model_name = args.test_model_name + "_" + args.test_prompt_strategy 152 | 153 | pair_compared_result_file = os.path.join(eval_dir, args.test_model_name + '_compared_results.jsonl') 154 | 155 | previous_results = None 156 | evaluated_idx = None 157 | 158 | if os.path.exists(pair_compared_result_file) and os.path.getsize(pair_compared_result_file) > 0: 159 | previous_results = read_jsonl_file(pair_compared_result_file) 160 | evaluated_idx = [item['question_id'] for item in previous_results] 161 | 162 | with open(pair_compared_result_file, "w") as results_file: 163 | 164 | if previous_results is not None: 165 | for result in previous_results: 166 | results_file.write(json.dumps(result) + "\n") 167 | results_file.flush() 168 | 169 | for idx in tqdm(common_img_idx): 170 | 171 | if evaluated_idx is not None and idx in evaluated_idx: 172 | continue 173 | 174 | image_path1 = img_dict_1[idx] 175 | image_path2 = img_dict_2[idx] 176 | gt_path = gt_dict[idx] 177 | 178 | round1_winner, round1_judgement = compare_image(gt_path, image_path1, image_path2) 179 | round2_winner, round2_judgement = compare_image(gt_path, image_path2, image_path1) 180 | 181 | round1_map = {"A": args.model_name, "B": args.test_model_name} 182 | round2_map = {"A": args.test_model_name, "B": args.model_name} 183 | round1_winner = round1_map.get(round1_winner, round1_winner) 184 | round2_winner = round2_map.get(round2_winner, round2_winner) 185 | 
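        # Each pair is judged twice with the presentation order swapped (round 1 shows
        # args.model_name's plot as assistant A, round 2 shows it as assistant B) to reduce
        # position bias; the tally at the bottom of this file counts a win only when the
        # same model wins both rounds and treats everything else as a tie.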
186 | result = { 187 | "question_id": idx, 188 | "model_1": args.model_name, 189 | "model_2": args.test_model_name, 190 | "round1_winner": round1_winner, 191 | "round2_winner": round2_winner, 192 | "round1_judgement": round1_judgement, 193 | "round2_judgement": round2_judgement, 194 | "tstamp": time.time(), 195 | } 196 | 197 | results_file.write(json.dumps(result) + "\n") 198 | results_file.flush() 199 | 200 | results = read_jsonl_file(pair_compared_result_file) 201 | print(f"Total number of compared pairs: {len(results)}") 202 | 203 | win_cnt = 0 204 | loss_cnt = 0 205 | tie_cnt = 0 206 | 207 | for item in results: 208 | if item["round1_winner"] == "tie" or item["round2_winner"] == "tie" or item["round1_winner"] != item["round2_winner"]: 209 | tie_cnt += 1 210 | elif item["round1_winner"] == args.model_name: 211 | win_cnt += 1 212 | else: 213 | loss_cnt += 1 214 | 215 | print(f"Win: {win_cnt}, Loss: {loss_cnt}, Tie: {tie_cnt}") 216 | print(f"Win Ratio: {win_cnt / len(results)}") 217 | print(f"Tie Ratio: {tie_cnt / len(results)}") 218 | print(f"Loss Ratio: {loss_cnt / len(results)}") 219 | -------------------------------------------------------------------------------- /plot2code/eval/gpt4v_evaluations_score.py: -------------------------------------------------------------------------------- 1 | import base64 2 | import os 3 | import openai 4 | import json 5 | import re 6 | from openai import OpenAI 7 | from tqdm import tqdm 8 | from PIL import Image 9 | import numpy as np 10 | from ..utils import get_parser, get_save_path, get_eval_path, get_api_response, read_jsonl_file, encode_image 11 | 12 | parser = get_parser() 13 | args = parser.parse_args() 14 | 15 | client = OpenAI( 16 | api_key=os.getenv("OPENAI_API_KEY"), 17 | base_url=os.getenv("OPENAI_API_BASE"), 18 | ) 19 | 20 | 21 | # Function to evaluate the similarity between two images 22 | def evaluate_image_similarity(image_path1, image_path2): 23 | base64_image1 = encode_image(image_path1) 24 | base64_image2 = encode_image(image_path2) 25 | 26 | messages=[ 27 | { 28 | "role": "system", 29 | "content": [ 30 | { 31 | "type": "text", 32 | "text": "You are a helpful assistant." 33 | }, 34 | ] 35 | }, 36 | { 37 | "role": "user", 38 | "content": [ 39 | { 40 | "type": "text", 41 | "text": "Please evaluate the similarity between a reference image created using matplotlib and an image generated by code provided by an AI assistant. Consider factors such as the overall appearance, colors, shapes, positions, and other visual elements of the images. Begin your evaluation by providing a short explanation. Be as objective as possible. 
After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".", 42 | }, 43 | { 44 | "type": "image_url", 45 | "image_url": { 46 | "url": f"data:image/png;base64,{base64_image1}" 47 | }, 48 | }, 49 | { 50 | "type": "image_url", 51 | "image_url": { 52 | "url": f"data:image/png;base64,{base64_image2}" 53 | }, 54 | }, 55 | ], 56 | } 57 | ] 58 | response = get_api_response(client, messages, args, model_name='gpt-4-vision-preview') 59 | 60 | return response.choices[0].message.content.strip() 61 | 62 | # Read the JSONL file 63 | generated_code_file = get_save_path(args) 64 | 65 | evaluation_file = os.path.join(get_eval_path(args), args.gpt4_vision_evaluation_results) 66 | 67 | items = read_jsonl_file(generated_code_file) 68 | 69 | # Evaluate the similarity between ground truth images and test images 70 | results = [] 71 | total_rating = 0 72 | 73 | 74 | # check if the evaluation file is empty. If it is not empty, read the file and skip the evaluation for the images that have already been evaluated 75 | if os.path.exists(evaluation_file) and os.path.getsize(evaluation_file) > 0: 76 | previous_results = read_jsonl_file(evaluation_file) 77 | evaluated_ground_truth_paths = [item['ground_truth_path'] for item in previous_results] 78 | total_rating = sum([item['rating'] for item in previous_results if item['rating'] is not None]) 79 | 80 | # Save the evaluation results to a new JSONL file 81 | with open(evaluation_file, "w") as jsonl_file: 82 | 83 | if total_rating > 0: 84 | # Write the previous results to the evaluation file 85 | for result in previous_results: 86 | jsonl_file.write(json.dumps(result) + "\n") 87 | jsonl_file.flush() 88 | else: 89 | previous_results = [] 90 | evaluated_ground_truth_paths = [] 91 | 92 | # Wrap the loop with tqdm to show the progress bar 93 | for item in tqdm(items, desc="Evaluating image similarity"): 94 | ground_truth_path = item['ground_truth_path'] 95 | 96 | if ground_truth_path in evaluated_ground_truth_paths: 97 | continue # Skip this iteration 98 | 99 | test_image_path = item['generated_image_path'] 100 | img = Image.open(test_image_path) 101 | img_np = np.array(img) 102 | 103 | # Check if test_image is all white 104 | if np.all(img_np == 255): 105 | print(f"Skipping all white image: {test_image_path}") 106 | continue # Skip this iteration 107 | 108 | evaluation = evaluate_image_similarity(ground_truth_path, test_image_path) 109 | 110 | # Extract the rating and add it to the total_rating 111 | rating_match = re.search(r'Rating: \[\[(\d+)\]\]', evaluation) 112 | if rating_match: 113 | rating = int(rating_match.group(1)) 114 | total_rating += rating 115 | else: 116 | rating = None 117 | 118 | # Update the results dictionary with the new key for the rating 119 | result = {'ground_truth_path': ground_truth_path, 'test_image_path': test_image_path, 'evaluation': evaluation, 'rating': rating} 120 | previous_results.append(result) 121 | 122 | jsonl_file.write(json.dumps(result) + "\n") 123 | jsonl_file.flush() 124 | 125 | # Calculate the average rating 126 | average_evaluation_score = np.mean([result['rating'] for result in previous_results if result['rating'] is not None]) 127 | print(f"Average Rating: {average_evaluation_score:.2f}") -------------------------------------------------------------------------------- /plot2code/eval/text_match_score.py: -------------------------------------------------------------------------------- 1 | 
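# text_match_score.py computes the text-match score for each generated plot: the ground-truth
# code and the generated code are both exec'd to rebuild their figures, every visible
# matplotlib Text artist is collected from each figure, and a ground-truth text counts as
# matched only when the generated figure contains an exact copy (Levenshtein distance 0) at a
# sufficiently similar position after rescaling by the figure-size ratio. The final score is
# matched / (matched + texts left unmatched in either figure).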
import os 2 | import matplotlib.pyplot as plt 3 | import matplotlib 4 | import json 5 | from tqdm import tqdm 6 | from PIL import Image 7 | import numpy as np 8 | import Levenshtein 9 | from ..utils import get_parser, get_save_path, get_eval_path, read_jsonl_file 10 | from matplotlib.pyplot import * 11 | 12 | parser = get_parser() 13 | args = parser.parse_args() 14 | 15 | # Read the JSONL file 16 | generated_code_file = get_save_path(args) 17 | content_list = read_jsonl_file(generated_code_file) 18 | 19 | 20 | def position_similarity(pos1, pos2, size_ratio): 21 | pos2_adjusted = pos2 * size_ratio 22 | 23 | position_difference = pos1 - pos2_adjusted 24 | 25 | # calculate the absolute distance 26 | absolute_distance = np.sqrt(np.sum(position_difference ** 2)) 27 | 28 | # convert the absolute distance to similarity 29 | distance_similarity = np.exp(-absolute_distance / 100) 30 | 31 | return distance_similarity 32 | 33 | 34 | def extract_texts(component): 35 | texts = [] 36 | positions = [] 37 | if isinstance(component, matplotlib.text.Text) and component.get_visible(): 38 | text = component.get_text().strip().lower() 39 | position = component.get_position() 40 | if text: # only extract non-empty text 41 | texts.append(text) 42 | positions.append(np.array(position)) 43 | 44 | for child in component.get_children(): 45 | child_texts, child_positions = extract_texts(child) 46 | texts.extend(child_texts) 47 | positions.extend(child_positions) 48 | 49 | return texts, positions 50 | 51 | def match_texts(texts1, texts2, positions1, positions2, size_ratio): 52 | matched = 0 53 | unmatched1 = len(texts1) 54 | unmatched2 = len(texts2) 55 | 56 | for text1, pos1 in zip(texts1, positions1): 57 | min_distance = float('inf') 58 | best_match_index = None 59 | for i, (text2, pos2) in enumerate(zip(texts2, positions2)): 60 | try: 61 | distance = Levenshtein.distance(text1, text2) 62 | position_sim = position_similarity(pos1, pos2, size_ratio) 63 | 64 | if distance < min_distance and position_sim > 0.8: 65 | min_distance = distance 66 | best_match_index = i 67 | except: 68 | pass 69 | 70 | if min_distance <= 0: # the maximux edit distance allowed 71 | matched += 1 72 | texts2.pop(best_match_index) 73 | positions2.pop(best_match_index) 74 | unmatched2 -= 1 75 | unmatched1 -= 1 76 | 77 | total_pairs = matched + unmatched1 + unmatched2 78 | if total_pairs == 0: 79 | return 1 80 | match_score = matched / total_pairs 81 | return match_score 82 | 83 | eval_dir = get_eval_path(args) 84 | text_match_score_file = os.path.join(eval_dir, args.text_match_score_results) 85 | 86 | # Open the JSONL file in write mode 87 | with open(text_match_score_file, "w") as jsonl_file: 88 | for item in tqdm(content_list, desc="Evaluating text image similarity"): 89 | ground_truth_path = item['ground_truth_path'] 90 | test_image_path = item['generated_image_path'] 91 | code = item['code'] 92 | ground_truth_code = item['ground_truth_code'] 93 | # Execute the code and save the images 94 | 95 | img = Image.open(test_image_path) 96 | img_np = np.array(img) 97 | 98 | # Check if test_image is all white 99 | if np.all(img_np == 255): 100 | print(f"Skipping all white image: {test_image_path}") 101 | continue # Skip this iteration 102 | 103 | exec(ground_truth_code) 104 | fig1 = plt.gcf() 105 | fig1.savefig('gt_img.png') 106 | plt.close() 107 | matplotlib.rcdefaults() 108 | plt.cla() 109 | plt.clf() 110 | plt.close("all") 111 | 112 | exec(code) 113 | fig2 = plt.gcf() 114 | fig2.savefig('test_img.png') 115 | plt.close() 116 | matplotlib.rcdefaults() 
117 | plt.cla() 118 | plt.clf() 119 | plt.close("all") 120 | 121 | # Extract texts and positions from the figures 122 | texts1, positions1 = extract_texts(fig1) 123 | texts2, positions2 = extract_texts(fig2) 124 | # Calculate the size ratio 125 | fig1_size = np.array(fig1.get_size_inches()) 126 | fig2_size = np.array(fig2.get_size_inches()) 127 | size_ratio = fig1_size / fig2_size 128 | 129 | # Calculate the match score 130 | match_score = match_texts(texts1, texts2, positions1, positions2, size_ratio) 131 | 132 | # Append the generated image path, ground truth code and match score to the JSONL file 133 | jsonl_file.write(json.dumps({'ground_truth_path': ground_truth_path, 'test_image_path': test_image_path, 'text_match_score': match_score}) + "\n") 134 | jsonl_file.flush() 135 | -------------------------------------------------------------------------------- /plot2code/execute_generated_code.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import matplotlib.pyplot as plt 4 | from PIL import Image 5 | import numpy as np 6 | from matplotlib.pyplot import * 7 | from .utils import get_parser, read_jsonl_file, get_save_path, get_img_path 8 | from tqdm import tqdm 9 | import matplotlib 10 | parser = get_parser() 11 | 12 | args = parser.parse_args() 13 | 14 | import multiprocessing 15 | 16 | def execute_code(code, success_flag, image_path): 17 | try: 18 | exec(code) 19 | fig = plt.gcf() 20 | # Save the generated image with the same size as the ground truth image 21 | fig.savefig(image_path) 22 | success_flag.value = True 23 | 24 | except Exception as e: 25 | # print(e) 26 | pass 27 | 28 | def execute_code_and_save_image(code, ground_truth_path, image_path): 29 | # Get the ground truth image size 30 | ground_truth_image = Image.open(ground_truth_path) 31 | ground_truth_size = ground_truth_image.size 32 | 33 | success_flag = multiprocessing.Value("b", False) 34 | 35 | # create a process to execute the code 36 | code_process = multiprocessing.Process(target=execute_code, args=(code, success_flag, image_path)) 37 | code_process.start() 38 | 39 | # set a timer of 30 seconds 40 | code_process.join(30) 41 | 42 | # if the process is still running, terminate it 43 | if code_process.is_alive(): 44 | code_process.terminate() 45 | code_process.join() 46 | 47 | if success_flag.value: 48 | generated_image = Image.open(image_path) 49 | else: 50 | # Create a white image and save it 51 | generated_image = Image.fromarray(np.full((ground_truth_size), 255, dtype=np.uint8)) 52 | 53 | # resized_image = generated_image.resize(ground_truth_size)s 54 | generated_image.save(image_path) 55 | 56 | def main(): 57 | # Read the ground truth code JSONL file 58 | ground_truth_code_list = read_jsonl_file(args.ground_truth_code_file) 59 | 60 | # Read the generated code JSONL file 61 | generated_code_file = get_save_path(args) 62 | code_list = read_jsonl_file(generated_code_file) 63 | 64 | # Create the test_images folder if it doesn't exist 65 | test_images_folder = get_img_path(args) 66 | data = [] 67 | for item in tqdm(code_list): 68 | # Execute the code and save the images to the test_images folder 69 | code = item['code'] 70 | ground_truth_path = item['ground_truth_path'] 71 | idx = int(ground_truth_path.rstrip('.png').split('ground_truth_image_')[-1]) 72 | image_path = os.path.join(test_images_folder, f"test_image_{idx}.png") 73 | execute_code_and_save_image(code, ground_truth_path, image_path) 74 | plt.close() 75 | matplotlib.rcdefaults() 76 | plt.cla() 
77 | plt.clf() 78 | plt.close("all") 79 | # Get the ground truth code for the current item 80 | ground_truth_code = ground_truth_code_list[idx]['code'] 81 | 82 | data.append({'ground_truth_code': ground_truth_code, 'code': code, 'ground_truth_path': ground_truth_path, 'generated_image_path': image_path}) 83 | 84 | 85 | # Open the JSONL file in write mode 86 | with open(generated_code_file, "w") as jsonl_file: 87 | # Append the generated image path to the JSONL file 88 | for item in data: 89 | jsonl_file.write(json.dumps(item) + "\n") 90 | jsonl_file.flush() 91 | 92 | if __name__ == '__main__': 93 | main() -------------------------------------------------------------------------------- /plot2code/gpt4v_generate_code.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from openai import OpenAI 3 | import os 4 | import json 5 | from tqdm import tqdm 6 | from .utils import get_parser, get_save_path, direct_prompt, read_jsonl_file, get_api_response, extract_code, CoT_prompt, encode_image 7 | # Function to encode the image 8 | 9 | parser = get_parser() 10 | 11 | args = parser.parse_args() 12 | # Directory containing your images 13 | image_directory = args.image_directory 14 | 15 | client = OpenAI( 16 | api_key=os.getenv("OPENAI_API_KEY"), 17 | base_url=os.getenv("OPENAI_API_BASE"), 18 | ) 19 | 20 | # Get current execute path 21 | current_path = sys.path[0] 22 | 23 | save_path = get_save_path(args) 24 | 25 | previous_results = None 26 | previous_filename = [] 27 | 28 | # check if the save_path is empty. if it is not empty, load the already generated results 29 | if os.path.exists(save_path) and os.path.getsize(save_path) > 0: 30 | previous_results = read_jsonl_file(save_path) 31 | previous_filename = [result['question_id'] for result in previous_results] 32 | 33 | if args.instruct: 34 | instructions = read_jsonl_file('data/ground_truth_code_with_instruction.jsonl') 35 | 36 | # Iterate through all the PNG files in the ground_truth folder 37 | # Open the JSONL file 38 | if __name__ == '__main__': 39 | 40 | with open(save_path, "w") as jsonl_file: 41 | 42 | # write the previous results to the file 43 | if previous_results is not None: 44 | for result in previous_results: 45 | jsonl_file.write(json.dumps(result) + '\n') 46 | jsonl_file.flush() 47 | 48 | for filename in tqdm(os.listdir(image_directory)): 49 | 50 | if filename in previous_filename: 51 | continue 52 | 53 | if filename.endswith(".png"): 54 | image_path = os.path.join(current_path, image_directory, filename) 55 | 56 | # Getting the base64 string 57 | base64_image = encode_image(image_path) 58 | 59 | if args.instruct: 60 | prompt = instructions[int(filename.rstrip('.png').lstrip('ground_truth_image_'))]['instruction'] + '\n' + direct_prompt 61 | else: 62 | prompt = direct_prompt 63 | 64 | messages = [ 65 | { 66 | "role": "system", 67 | "content": "You are a helpful assistant." 
68 | }, 69 | { 70 | "role": "user", 71 | "content": [ 72 | { 73 | "type": "text", 74 | "text": prompt 75 | }, 76 | { 77 | "type": "image_url", 78 | "image_url": { 79 | "url": f"data:image/jpeg;base64,{base64_image}" if 'gpt' in args.model_name else image_path 80 | }, 81 | }, 82 | ], 83 | } 84 | ] 85 | 86 | if args.prompt_strategy == 'CoT': 87 | messages.append( 88 | { 89 | "role": "assistant", 90 | "content": CoT_prompt 91 | } 92 | ) 93 | response = get_api_response(client, messages, args) 94 | elif args.prompt_strategy == 'Plan-and-Solve': 95 | step1_prompt = 'Let us first describe the plot and make a detailed plan step by step.' 96 | step2_prompt = ' Based on the above description, now we are prepared to generate the code. The generated code is surrounded by ```python and ``` to make it easier to be extracted by regular expressions. Therefore, the code is:' 97 | 98 | messages.append( 99 | { 100 | "role": "assistant", 101 | "content": step1_prompt 102 | } 103 | ) 104 | response = get_api_response(client, messages, args) 105 | tmp_result = extract_code(response.choices[0].message.content.strip()) 106 | if tmp_result != response.choices[0].message.content.strip(): 107 | print('step1') 108 | jsonl_file.write(json.dumps({'code': tmp_result, 'question_id': filename, 'ground_truth_path': image_path}) + "\n") 109 | jsonl_file.flush() 110 | continue 111 | print('steps2') 112 | messages.append( 113 | { 114 | "role": "assistant", 115 | "content": response.choices[0].message.content.strip() + step2_prompt 116 | } 117 | ) 118 | response = get_api_response(client, messages, args) 119 | else: 120 | response = get_api_response(client, messages, args) 121 | 122 | generated_code = extract_code(response.choices[0].message.content.strip()) 123 | 124 | jsonl_file.write(json.dumps({'code': generated_code, 'question_id': filename, 'ground_truth_path': image_path}) + "\n") 125 | jsonl_file.flush() -------------------------------------------------------------------------------- /plot2code/llm_generate_code.py: -------------------------------------------------------------------------------- 1 | import base64 2 | import sys 3 | from openai import OpenAI 4 | import os 5 | import json 6 | from tqdm import tqdm 7 | import time 8 | from .utils import get_parser, get_save_path, direct_prompt, read_jsonl_file, get_api_response, extract_code, encode_image 9 | # Function to encode the 10 | 11 | parser = get_parser() 12 | 13 | args = parser.parse_args() 14 | # Directory containing your images 15 | image_directory = args.image_directory 16 | 17 | client = OpenAI( 18 | api_key=os.getenv("OPENAI_API_KEY"), 19 | base_url=os.getenv("OPENAI_API_BASE"), 20 | ) 21 | 22 | # Get current execute path 23 | current_path = sys.path[0] 24 | 25 | save_path = get_save_path(args) 26 | 27 | previous_results = None 28 | previous_filename = [] 29 | 30 | # check if the save_path is empty. 
if it is not empty, load the already generated results 31 | if os.path.exists(save_path) and os.path.getsize(save_path) > 0: 32 | previous_results = read_jsonl_file(save_path) 33 | previous_filename = [result['question_id'] for result in previous_results] 34 | 35 | if args.instruct: 36 | instructions = read_jsonl_file('data/ground_truth_code_with_instruction.jsonl') 37 | 38 | # Iterate through all the PNG files in the ground_truth folder 39 | # Open the JSONL file 40 | if __name__ == '__main__': 41 | 42 | with open(save_path, "w") as jsonl_file: 43 | 44 | # write the previous results to the file 45 | if previous_results is not None: 46 | for result in previous_results: 47 | jsonl_file.write(json.dumps(result) + '\n') 48 | jsonl_file.flush() 49 | 50 | for filename in tqdm(os.listdir(image_directory)): 51 | 52 | if filename in previous_filename: 53 | continue 54 | 55 | if filename.endswith(".png"): 56 | image_path = os.path.join(current_path, image_directory, filename) 57 | 58 | # Getting the base64 string 59 | base64_image = encode_image(image_path) 60 | 61 | if args.instruct: 62 | prompt = instructions[int(filename.rstrip('.png').lstrip('ground_truth_image_'))]['instruction'] + '\n' + direct_prompt 63 | else: 64 | prompt = direct_prompt 65 | 66 | messages = [ 67 | { 68 | "role": "system", 69 | "content": "You are a helpful assistant." 70 | }, 71 | { 72 | "role": "user", 73 | "content": [ 74 | { 75 | "type": "text", 76 | "text": prompt 77 | }, 78 | ], 79 | } 80 | ] 81 | 82 | 83 | response = get_api_response(client, messages, args) 84 | 85 | generated_code = extract_code(response.choices[0].message.content.strip()) 86 | 87 | jsonl_file.write(json.dumps({'code': generated_code, 'question_id': filename, 'ground_truth_path': image_path}) + "\n") 88 | jsonl_file.flush() -------------------------------------------------------------------------------- /plot2code/utils.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import base64 3 | import os 4 | import json 5 | import re 6 | import time 7 | 8 | # Generate the argparse 9 | def get_parser(): 10 | parser = argparse.ArgumentParser(description='Generate code from images using GPT-4 Vision') 11 | parser.add_argument('--image_directory', type=str, default='data/Plot2Code/test', help='Directory containing the images') 12 | parser.add_argument('--output_file', type=str, default='generated_code.jsonl', help='Output file to store the generated code') 13 | parser.add_argument('--model_name', type=str, default='gpt-4-vision-preview', help='Model name to use for generating the code') 14 | parser.add_argument('--model_path', type=str, default='/group/40034/chengyuewu/deepseek-vl-7b-chat', help='Model path to use for generating the code') 15 | parser.add_argument('--max_tokens', type=int, default=1024, help='Maximum tokens to use for generating the code') 16 | parser.add_argument('--temperature', type=int, default=0, help='Temperature to use for generating the code') 17 | parser.add_argument('--save_dir', type=str, default='generated_results', help='Directory to save the generated code') 18 | parser.add_argument('--prompt_strategy', type=str, default='default', help='Prompt strategy to use for generating the code') 19 | parser.add_argument('--ground_truth_code_file', type=str, default='data/Plot2Code/test/metadata.jsonl', help='ground truth code file') 20 | parser.add_argument('--max_retries', type=int, default=5, help='the maximum number of retries') 21 | parser.add_argument('--eval_dir', 
default='evaluation_results', type=str, help='Directory to save the evaluation results') 22 | parser.add_argument("--text_match_score_results", type=str, default="text_match_score.jsonl", help="Path to the JSONL file containing the text match scores") 23 | parser.add_argument("--gpt4-vision-evaluation-results", type=str, default="gpt_4v_evaluation_results.jsonl", help="Path to the JSONL file containing the GPT-4 Vision evaluation results") 24 | parser.add_argument('--final_score_results', type=str, default='final_score_results.jsonl', help='Output file to store the final score results') 25 | parser.add_argument('--instruct', action='store_true', help='Whether to use instruction or not') 26 | return parser 27 | 28 | 29 | def encode_image(image_path): 30 | with open(image_path, "rb") as image_file: 31 | return base64.b64encode(image_file.read()).decode('utf-8') 32 | 33 | 34 | def get_save_path(args): 35 | if args.instruct: 36 | save_path = os.path.join(args.save_dir, args.model_name, 'instruct', args.prompt_strategy, args.output_file) 37 | else: 38 | save_path = os.path.join(args.save_dir, args.model_name, 'direct', args.prompt_strategy, args.output_file) 39 | os.makedirs(os.path.dirname(save_path), exist_ok=True) 40 | return save_path 41 | 42 | def get_img_path(args): 43 | if args.instruct: 44 | save_path = os.path.join(args.save_dir, args.model_name, 'instruct', args.prompt_strategy, "generated_images") 45 | else: 46 | save_path = os.path.join(args.save_dir, args.model_name, 'direct', args.prompt_strategy, "generated_images") 47 | os.makedirs(save_path, exist_ok=True) 48 | return save_path 49 | 50 | def get_eval_path(args): 51 | if args.instruct: 52 | save_path = os.path.join(args.eval_dir, args.model_name, 'instruct', args.prompt_strategy) 53 | else: 54 | save_path = os.path.join(args.eval_dir, args.model_name, 'direct', args.prompt_strategy) 55 | os.makedirs(save_path, exist_ok=True) 56 | return save_path 57 | 58 | def read_jsonl_file(file_path): 59 | with open(file_path, 'r') as json_file: 60 | return [json.loads(line) for line in json_file] 61 | 62 | def extract_code(response_str): 63 | matches = re.findall(r'```python(.*?)```', response_str, re.DOTALL) 64 | if matches: 65 | return "\n".join(match.strip() for match in matches) 66 | else: 67 | return response_str 68 | 69 | def get_api_response(client, messages, args, model_name=None): 70 | retry_cnt = 0 71 | 72 | while retry_cnt < args.max_retries: 73 | try: 74 | response = client.chat.completions.create( 75 | model=args.model_name if model_name is None else model_name, 76 | messages=messages, 77 | max_tokens=args.max_tokens, 78 | n=1, 79 | temperature=args.temperature, 80 | ) 81 | break 82 | except Exception as e: 83 | backoff = 2 ** retry_cnt 84 | time.sleep(backoff) 85 | retry_cnt += 1 86 | print(f"Retry count: {retry_cnt}") 87 | return response 88 | 89 | direct_prompt = "You are a helpful assistant that can generate Python code using matplotlib." + \ 90 | "Generate the matplotlib code to create a plot that looks like the given image, as similar as possible." + \ 91 | "The generated code should be surrounded by ```python and ```\n" 92 | 93 | CoT_prompt = "Let us think step by step." 
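# ---------------------------------------------------------------------------
# Minimal illustrative self-check for extract_code(); the sample reply below is
# hypothetical and only meant to show that the fenced python block is pulled out
# of the surrounding prose while replies without a fence are returned unchanged.
# Run `python -m plot2code.utils` to try it.
if __name__ == "__main__":
    _sample_reply = (
        "Sure, here is the code you asked for:\n"
        "```python\n"
        "import matplotlib.pyplot as plt\n"
        "plt.plot([1, 2, 3], [1, 4, 9])\n"
        "plt.title('demo')\n"
        "```\n"
        "Let me know if you need anything else."
    )
    print(extract_code(_sample_reply))           # only the three pyplot lines
    print(extract_code("no fenced code here"))   # returned unchanged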
94 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | matplotlib==3.8.4 2 | Levenshtein 3 | NumPy 4 | Pillow 5 | openai -------------------------------------------------------------------------------- /scripts/evaluate-instruct.sh: -------------------------------------------------------------------------------- 1 | # #!/bin/bash 2 | 3 | model_name=$1 4 | prompt_strategy=$2 5 | 6 | echo "Executing generated code to draw images..." 7 | python -m plot2code.execute_generated_code --model_name "$model_name" --prompt_strategy $prompt_strategy --instruct 8 | 9 | echo "Calculating text match score..." 10 | python -m plot2code.eval.text_match_score --model_name "$model_name" --prompt_strategy $prompt_strategy --instruct 11 | 12 | echo "Calculating gpt-4v evaluation score..." 13 | python -m plot2code.eval.gpt4v_evaluations_score --model_name "$model_name" --prompt_strategy $prompt_strategy --instruct 14 | 15 | echo "Combining evaluation results..." 16 | python -m plot2code.eval.combine_evaluation_results --model_name "$model_name" --prompt_strategy $prompt_strategy --instruct 17 | 18 | echo "Done!" -------------------------------------------------------------------------------- /scripts/evaluate.sh: -------------------------------------------------------------------------------- 1 | # #!/bin/bash 2 | 3 | model_name=$1 4 | prompt_strategy=$2 5 | 6 | echo "Executing generated code to draw images..." 7 | python -m plot2code.execute_generated_code --model_name "$model_name" --prompt_strategy $prompt_strategy 8 | 9 | echo "Calculating text match score..." 10 | python -m plot2code.eval.text_match_score --model_name "$model_name" --prompt_strategy $prompt_strategy 11 | 12 | echo "Calculating gpt-4v evaluation score..." 13 | python -m plot2code.eval.gpt4v_evaluations_score --model_name "$model_name" --prompt_strategy $prompt_strategy 14 | 15 | echo "Combining evaluation results..." 16 | python -m plot2code.eval.combine_evaluation_results --model_name "$model_name" --prompt_strategy $prompt_strategy 17 | 18 | echo "Done!" -------------------------------------------------------------------------------- /scripts/generate_code.sh: -------------------------------------------------------------------------------- 1 | # GPT-4V generate code (direct asking) 2 | python -m plot2code.gpt4v_generate_code --prompt_strategy default 3 | 4 | # GPT-4V generate code (conditional asking) 5 | python -m plot2code.gpt4v_generate_code --prompt_strategy default --instruct 6 | 7 | # GPT-4V generate code (conditional asking with CoT) 8 | python -m plot2code.gpt4v_generate_code --prompt_strategy CoT --instruct --------------------------------------------------------------------------------
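The scripts above cover single-model evaluation only; the repository also contains plot2code/eval/gpt4v_evaluate_pairs.py for pairwise GPT-4V comparison of two result sets. Below is a minimal sketch of how it might be invoked, assuming both result sets have already been generated and executed; the flag names come from the script's argument parser, and your_other_model is a placeholder.

```bash
# Compare two models under the same prompt strategy
python -m plot2code.eval.gpt4v_evaluate_pairs --model_name gpt-4-vision-preview --test_model_name your_other_model --prompt_strategy default

# Compare two prompt strategies for the same model
python -m plot2code.eval.gpt4v_evaluate_pairs --model_name gpt-4-vision-preview --prompt_strategy default --test_prompt_strategy CoT
```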