├── .coveragerc ├── LICENSE ├── README.md ├── croissant.json ├── data ├── leetcode-py-all.jsonl ├── leetcode-py-instrumented.jsonl ├── leetcode-py.jsonl └── tgt_paths.jsonl ├── data_utils.py ├── eval_base.py ├── eval_branchcov.py ├── eval_linecov.py ├── eval_overall.py ├── eval_pathcov.py ├── eval_pathcov_base.py ├── format.py ├── gen_linecov_cot_hf.py ├── gen_linecov_cot_openai.py ├── generate_cov_gemini.py ├── generate_cov_hf.py ├── generate_cov_openai.py ├── generate_pathcov_gemini.py ├── generate_pathcov_hf.py ├── generate_pathcov_openai.py ├── generate_targetcov_gemini.py ├── generate_targetcov_hf.py ├── generate_targetcov_openai.py ├── prompt ├── line_oneshot_gencond.txt ├── line_oneshot_gentest.txt ├── system.txt ├── system_exec.txt ├── template_base.txt ├── template_branch.txt ├── template_line.txt └── template_path.txt ├── prompt_utils.py └── requirements.txt /.coveragerc: -------------------------------------------------------------------------------- 1 | [report] 2 | ;exclude import statements 3 | exclude_also = 4 | import math 5 | import itertools 6 | import bisect 7 | import collections 8 | import string 9 | import heapq 10 | import functools 11 | import sortedcontainers 12 | import operator 13 | from typing import List, Dict, Tuple, Iterator* 14 | from math import sqrt 15 | from bisect import bisect_left* 16 | from dataclasses import dataclass 17 | import * 18 | from * import * -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 LLM4SoftwareTesting 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TestEval 2 | 3 | Dataset and benchmark for paper "TESTEVAL: Benchmarking Large Language Models for Test Case Generation". 
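The dataset files under `data/` are in JSON Lines format (one record per line). As a minimal sketch (assuming the repository root as the working directory), they can be loaded with the `read_jsonl` helper from `data_utils.py`; the record fields are described in the metadata table below.

```python
from data_utils import read_jsonl

# load the 210 programs under test
dataset = read_jsonl('data/leetcode-py.jsonl')

sample = dataset[0]
print(sample['task_title'], sample['difficulty'])  # problem title and difficulty (0=easy ... 2=hard)
print(sample['python_solution'])                   # the program under test
print(sample['target_lines'])                      # target lines for the targeted line coverage task
```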
4 | 5 | ## Data 6 | 7 | ### Dataset description 8 | 9 | | category | data | 10 | |------|------| 11 | | total programs under test | 210 | 12 | | total target lines | 1340 | 13 | | total target branches | 983 | 14 | | total target paths | 854 | 15 | 16 | ### Metadata 17 | 18 | | field name | data type | description | 19 | |------|------|------| 20 | | task_num | int | Problem id in LeetCode | 21 | | task_title | string | LeetCode problem title | 22 | | difficulty | int | LeetCode problem difficulty: from 0 (easy) to 2 (hard) | 23 | | func_name | string | Default function name for the solution | 24 | | description | string | LeetCode problem description | 25 | | python_solution | string | LeetCode problem solution in Python (the program under test) | 26 | | blocks | list | The list for target branches | 27 | | target_lines | list | The list for target lines | 28 | | python_solution_instrumented | string | Add instrumentations to python_solution for recording execution paths | 29 | | sampled_paths | list | The list of target paths, the format is the same as the execution paths collected from python_solution_instrumented | 30 | | sampled_condition_paths | list | The list of target paths, used in prompts | 31 | 32 | ## Quick Start 33 | 34 | Requirements: Python>=3.10 35 | 36 | Install dependencies 37 | 38 | ```bash 39 | pip install -r requirements.txt 40 | ``` 41 | 42 | Create folder to store generated tests 43 | 44 | ```bash 45 | mkdir predictions 46 | ``` 47 | 48 | Set environment variables 49 | 50 | ``` 51 | OPENAI_API_KEY: openai api key 52 | GOOGLE_API_KEY: gemini api key 53 | HUGGINGFACE_TOKEN: huggingface access token 54 | ``` 55 | 56 | ### Run experiments: overall coverage 57 | 58 | ```bash 59 | python generate_cov_{openai/gemini/hf}.py --model {model_name} --num_tests 20 #generate raw test cases 60 | python format.py --mode overall --path {path_to_generated_tests} #reformat test cases 61 | python eval_overall.py --path {path_to_formatted_generated_tests} #evaluate correctness and coverage metrics 62 | ``` 63 | 64 | ### Run experiments: targeted line coverage 65 | 66 | ```bash 67 | python generate_targetcov_{openai/gemini/hf}.py --covmode line --model {model_name} #generate raw test cases 68 | python format.py --mode line --path {path_to_generated_tests} #reformat test cases 69 | python eval_linecov.py --path {path_to_formatted_generated_tests} #evaluate correctness and coverage metrics 70 | ``` 71 | 72 | ### Run experiments: targeted line coverage with two-step reasoning 73 | 74 | ```bash 75 | python gen_linecov_cot_{openai/hf}.py --model {model_name} #generate reasoning steps and raw test cases 76 | python format.py --mode line --path {path_to_generated_tests} #reformat test cases 77 | python eval_linecov.py --path {path_to_formatted_generated_tests} #evaluate correctness and coverage metrics 78 | ``` 79 | 80 | ### Run experiments: targeted branch coverage 81 | 82 | ```bash 83 | python generate_targetcov_{openai/gemini/hf}.py --covmode branch --model {model_name} #generate raw test cases 84 | python format.py --mode branch --path {path_to_generated_tests} #reformat test cases 85 | python eval_branchcov.py --path {path_to_formatted_generated_tests} #evaluate correctness and coverage metrics 86 | ``` 87 | 88 | ### Run experiments: targeted path coverage 89 | 90 | ```bash 91 | python generate_pathcov_{openai/gemini/hf}.py --model {model_name} #generate raw test cases 92 | python format.py --mode overall --path {path_to_generated_tests} #reformat test cases 93 | python eval_pathcov.py 
--path {path_to_formatted_generated_tests} #evaluate correctness and coverage metrics 94 | ``` 95 | 96 | ### Run baselines: targeted line/branch coverage 97 | 98 | ```bash 99 | python eval_base.py --path {path_to_formatted_generated_tests} #evaluate targeted line/branch coverage for baselines: use the test cases generated from the overall coverage task 100 | ``` 101 | 102 | ### Run baselines: targeted path coverage 103 | 104 | ```bash 105 | python eval_pathcov_base.py --path {path_to_formatted_generated_tests} #evaluate targeted path coverage for baselines: use the test cases generated from the overall coverage task 106 | ``` 107 | 108 | ### Run your own pipeline 109 | 110 | We encourage researchers to use their own test case generation pipelines rather than our prompt framework. If you run your own pipeline, your generated test case file should be formatted as follows: 111 | 112 | Overall coverage: 113 | 114 | ``` 115 | {'task_num': LeetCode problem id, 'difficulty': LeetCode problem difficulty, 'func_name': solution function name, 'code': solution code, 'tests': list of generated test cases} 116 | ``` 117 | 118 | Targeted line coverage: 119 | 120 | ``` 121 | {'task_num': LeetCode problem id, 'difficulty': LeetCode problem difficulty, 'func_name': solution function name, 'code': solution code, 'tests': {target line number: test case for target line}} 122 | ``` 123 | 124 | Targeted branch coverage: 125 | 126 | ``` 127 | {'task_num': LeetCode problem id, 'difficulty': LeetCode problem difficulty, 'func_name': solution function name, 'code': solution code, 'tests': [{"start": branch start line, "end": branch end line, "test": test case for target branch}]} 128 | ``` 129 | 130 | Targeted path coverage: 131 | 132 | ``` 133 | {'task_num': LeetCode problem id, 'difficulty': LeetCode problem difficulty, 'func_name': solution function name, 'code': solution code, 'tests': list of generated test cases for each target path} 134 | ``` 135 | 136 | -------------------------------------------------------------------------------- /croissant.json: -------------------------------------------------------------------------------- 1 | { 2 | "@context": { 3 | "@language": "en", 4 | "@vocab": "https://schema.org/", 5 | "citeAs": "cr:citeAs", 6 | "column": "cr:column", 7 | "conformsTo": "dct:conformsTo", 8 | "cr": "http://mlcommons.org/croissant/", 9 | "rai": "http://mlcommons.org/croissant/RAI/", 10 | "data": { 11 | "@id": "cr:data", 12 | "@type": "@json" 13 | }, 14 | "dataType": { 15 | "@id": "cr:dataType", 16 | "@type": "@vocab" 17 | }, 18 | "dct": "http://purl.org/dc/terms/", 19 | "examples": { 20 | "@id": "cr:examples", 21 | "@type": "@json" 22 | }, 23 | "extract": "cr:extract", 24 | "field": "cr:field", 25 | "fileProperty": "cr:fileProperty", 26 | "fileObject": "cr:fileObject", 27 | "fileSet": "cr:fileSet", 28 | "format": "cr:format", 29 | "includes": "cr:includes", 30 | "isLiveDataset": "cr:isLiveDataset", 31 | "jsonPath": "cr:jsonPath", 32 | "key": "cr:key", 33 | "md5": "cr:md5", 34 | "parentField": "cr:parentField", 35 | "path": "cr:path", 36 | "recordSet": "cr:recordSet", 37 | "references": "cr:references", 38 | "regex": "cr:regex", 39 | "repeated": "cr:repeated", 40 | "replace": "cr:replace", 41 | "sc": "https://schema.org/", 42 | "separator": "cr:separator", 43 | "source": "cr:source", 44 | "subField": "cr:subField", 45 | "transform": "cr:transform" 46 | }, 47 | "@type": "sc:Dataset", 48 | "name": "TestEval", 49 | "description": "Dataset for 'TESTEVAL: Benchmarking Large Language Models for Test Case
Generation'", 50 | "conformsTo": "http://mlcommons.org/croissant/1.0", 51 | "citeAs": "@article{TestEval,title={TestEval: Benchmarking Large Language Models for Test Case Generation},author={Wang, Wenhan and Yang, Chenyuan and Wang, Zhijie and Huang, Yuheng and Chu, Zhaoyang and Song, Da and Zhang, Lingming and Chen, An Ran and Ma, Lei}year={2024}}", 52 | "license": "MIT", 53 | "url": "https://github.com/LLM4SoftwareTesting/TestEval", 54 | "distribution": [ 55 | { 56 | "@type": "cr:FileObject", 57 | "@id": "github-repository", 58 | "name": "github-repository", 59 | "description": "TestEval repository.", 60 | "contentUrl": "https://github.com/LLM4SoftwareTesting/TestEval", 61 | "encodingFormat": "git+https", 62 | "sha256": "main" 63 | }, 64 | { 65 | "@type": "cr:FileSet", 66 | "@id": "jsonl-files", 67 | "name": "jsonl-files", 68 | "description": "JSONL files are hosted on the GitHub repository.", 69 | "containedIn": { 70 | "@id": "github-repository" 71 | }, 72 | "encodingFormat": "application/jsonlines", 73 | "includes": "data/*.jsonl" 74 | } 75 | ], 76 | "recordSet": [ 77 | { 78 | "@type": "cr:RecordSet", 79 | "@id": "jsonl", 80 | "name": "jsonl", 81 | "field": [ 82 | { 83 | "@type": "cr:Field", 84 | "@id": "jsonl/task_num", 85 | "name": "task_num", 86 | "description": "Problem id in LeetCode", 87 | "dataType": "sc:Integer", 88 | "source": { 89 | "fileSet": { 90 | "@id": "jsonl-files" 91 | }, 92 | "extract": { 93 | "column": "task_num" 94 | } 95 | } 96 | }, 97 | { 98 | "@type": "cr:Field", 99 | "@id": "jsonl/task_title", 100 | "name": "task_title", 101 | "description": "LeetCode problem title.", 102 | "dataType": "sc:Text", 103 | "source": { 104 | "fileSet": { 105 | "@id": "jsonl-files" 106 | }, 107 | "extract": { 108 | "column": "task_title" 109 | } 110 | } 111 | }, 112 | { 113 | "@type": "cr:Field", 114 | "@id": "jsonl/difficulty", 115 | "name": "difficulty", 116 | "description": "LeetCode problem difficulty: from 0 to 2 (easy, medium, hard).", 117 | "dataType": "sc:Integer", 118 | "source": { 119 | "fileSet": { 120 | "@id": "jsonl-files" 121 | }, 122 | "extract": { 123 | "column": "difficulty" 124 | } 125 | } 126 | }, 127 | { 128 | "@type": "cr:Field", 129 | "@id": "jsonl/func_name", 130 | "name": "func_name", 131 | "description": "Defalt funtion name for the solution.", 132 | "dataType": "sc:Text", 133 | "source": { 134 | "fileSet": { 135 | "@id": "jsonl-files" 136 | }, 137 | "extract": { 138 | "column": "func_name" 139 | } 140 | } 141 | }, 142 | { 143 | "@type": "cr:Field", 144 | "@id": "jsonl/description", 145 | "name": "description", 146 | "description": "LeetCode problem description.", 147 | "dataType": "sc:Text", 148 | "source": { 149 | "fileSet": { 150 | "@id": "jsonl-files" 151 | }, 152 | "extract": { 153 | "column": "description" 154 | } 155 | } 156 | }, 157 | { 158 | "@type": "cr:Field", 159 | "@id": "jsonl/python_solution", 160 | "name": "python_solution", 161 | "description": "LeetCode problem solution in Python. 
The program under test.", 162 | "dataType": "sc:Text", 163 | "source": { 164 | "fileSet": { 165 | "@id": "jsonl-files" 166 | }, 167 | "extract": { 168 | "column": "python_solution" 169 | } 170 | } 171 | }, 172 | { 173 | "@type": "cr:Field", 174 | "@id": "jsonl/blocks", 175 | "name": "blocks", 176 | "description": "The list for target branches.", 177 | "dataType": "sc:Text", 178 | "source": { 179 | "fileSet": { 180 | "@id": "jsonl-files" 181 | }, 182 | "extract": { 183 | "column": "blocks" 184 | } 185 | } 186 | }, 187 | { 188 | "@type": "cr:Field", 189 | "@id": "jsonl/target_lines", 190 | "name": "target_lines", 191 | "description": "The list for target lines.", 192 | "dataType": "sc:Text", 193 | "source": { 194 | "fileSet": { 195 | "@id": "jsonl-files" 196 | }, 197 | "extract": { 198 | "column": "target_lines" 199 | } 200 | } 201 | }, 202 | { 203 | "@type": "cr:Field", 204 | "@id": "jsonl/python_solution_instrumented", 205 | "name": "python_solution_instrumented", 206 | "description": "Add instrumentions to python_solution for recording execution paths.", 207 | "dataType": "sc:Text", 208 | "source": { 209 | "fileSet": { 210 | "@id": "jsonl-files" 211 | }, 212 | "extract": { 213 | "column": "python_solution_instrumented" 214 | } 215 | } 216 | }, 217 | { 218 | "@type": "cr:Field", 219 | "@id": "jsonl/sampled_paths", 220 | "name": "sampled_paths", 221 | "description": "The list of target paths, the format is the same to the execution paths collected from python_solution_instrumented.", 222 | "dataType": "sc:Text", 223 | "source": { 224 | "fileSet": { 225 | "@id": "jsonl-files" 226 | }, 227 | "extract": { 228 | "column": "sampled_paths" 229 | } 230 | } 231 | }, 232 | { 233 | "@type": "cr:Field", 234 | "@id": "jsonl/sampled_condition_paths", 235 | "name": "sampled_condition_paths", 236 | "description": "The list of target paths, used in prompts.", 237 | "dataType": "sc:Text", 238 | "source": { 239 | "fileSet": { 240 | "@id": "jsonl-files" 241 | }, 242 | "extract": { 243 | "column": "sampled_condition_paths" 244 | } 245 | } 246 | } 247 | ] 248 | } 249 | ] 250 | } 251 | -------------------------------------------------------------------------------- /data_utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | 4 | def read_jsonl(path): 5 | data=[] 6 | with open(path,'r') as f: 7 | for line in f: 8 | data.append(json.loads(line)) 9 | return data 10 | 11 | 12 | def write_jsonl(data,path): 13 | with open(path,'w') as f: 14 | for d in data: 15 | f.write(json.dumps(d)+'\n') 16 | 17 | 18 | def add_lineno(code): 19 | """Add line numbers to code.""" 20 | lines=code.split('\n') 21 | new_code='' 22 | for i, line in enumerate(lines): 23 | new_code+=f'{i+1}. 
{line}\n' 24 | return new_code 25 | 26 | 27 | def add_lineno_comment(code,docstring_lines=None): 28 | """Add line numbers to code as comments.""" 29 | lines=code.split('\n') 30 | for i in range(len(lines)-1,-1,-1): 31 | if lines[i]=='': 32 | lines.pop(i) 33 | else: 34 | break 35 | new_code='' 36 | if docstring_lines is None: 37 | for i, line in enumerate(lines): 38 | if i == len(lines) - 1: 39 | new_code+=f'{line} #{i+1}' 40 | else: 41 | new_code+=f'{line} #{i+1}\n' 42 | else: 43 | docstart,docend=docstring_lines 44 | for i, line in enumerate(lines): 45 | if i>=docstart and i<=docend: 46 | new_code+=f'{line}\n' 47 | else: 48 | if i == len(lines) - 1: 49 | new_code+=f'{line} #{i+1}' 50 | else: 51 | new_code+=f'{line} #{i+1}\n' 52 | return new_code 53 | -------------------------------------------------------------------------------- /eval_base.py: -------------------------------------------------------------------------------- 1 | #evaluate target line/branch coverage for baselines 2 | import os 3 | import subprocess 4 | import json 5 | import signal 6 | import random 7 | random.seed(42) 8 | import shutil 9 | import time 10 | import re 11 | from pathlib import Path 12 | from tqdm import tqdm 13 | from argparse import ArgumentParser 14 | from copy import deepcopy 15 | from data_utils import read_jsonl 16 | 17 | 18 | class TimeoutHandler: 19 | def __init__(self, timeout, error_message=None): 20 | self.timeout = timeout 21 | self.error_message = error_message 22 | 23 | def __enter__(self): 24 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 25 | signal.alarm(self.timeout) 26 | 27 | def __exit__(self, type, value, traceback): 28 | signal.alarm(0) 29 | 30 | def raise_timeout(self, *args): 31 | raise TimeoutError(self.error_message) 32 | 33 | 34 | def execute(test_code,timeout=5): 35 | """try to execute test code""" 36 | try: 37 | exec_globals = {} 38 | with TimeoutHandler(timeout): 39 | exec(test_code, globals()) #add globals() to avoid name errors related to import 40 | return True 41 | except AssertionError: #assertionerror is considered as executable 42 | return True 43 | except TimeoutError: 44 | #print("timed out") 45 | return False 46 | except Exception as e: 47 | #print(f"failed: {type(e).__name__}") 48 | return type(e).__name__, e #return error type and error message 49 | 50 | 51 | 52 | 53 | template_data=read_jsonl('data/leetcode-py.jsonl') 54 | def eval_correctness(generated_data): 55 | """Compute syntactical and execution correctness (with coverage).""" 56 | total_cases_line=0 57 | total_cases_branch=0 58 | total_syn_correct=0 59 | total_exec_correct=0 60 | syn_failed=0 61 | total_easy=0 #for evaluating branches with different difficulties 62 | total_medium=0 63 | total_hard=0 64 | cov_easybranch=0 65 | cov_mediumbranch=0 66 | cov_hardbranch=0 67 | 68 | exec_fails=[] 69 | 70 | cov_line_success=0 71 | cov_branch_success=0 72 | remove_pattern=re.compile(r'tmp*') 73 | 74 | 75 | for i, data in tqdm(enumerate(generated_data)): 76 | task_num=data['task_num'] 77 | difficulty=data['difficulty'] 78 | func_name=data['func_name'] 79 | code=data['code'] 80 | generated_tests=data['tests'] 81 | baseline_test=generated_tests[0] #use the first generated test in overall coverage as the baseline 82 | 83 | branches=template_data[i]['blocks'] 84 | for test_branch in branches: 85 | branch_diff=test_branch['difficulty'] 86 | if branch_diff==0: 87 | total_easy+=1 88 | elif branch_diff==1: 89 | total_medium+=1 90 | elif branch_diff==2: 91 | total_hard+=1 92 | 
target_lines=template_data[i]['target_lines'] 93 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 94 | test_import_simple=f'from under_test import Solution\n' 95 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 96 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test and test cases into tmp files 97 | f.write(code) 98 | 99 | passed_tests=[] 100 | 101 | total_cases_line+=len(target_lines) 102 | total_cases_branch+=len(branches) 103 | try: 104 | res=compile(baseline_test,'','exec') #check syntax correctness 105 | total_syn_correct+=1 106 | 107 | test_code=test_import+baseline_test+f'\ntest_{func_name}()' 108 | time.sleep(0.01) 109 | res=execute(test_code) 110 | if res==True: 111 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 112 | print('func under test not called') 113 | exec_fails.append({'task':task_num, 'error':'not called'}) 114 | else: 115 | total_exec_correct+=1 116 | test_code_simple=test_import_simple+baseline_test #write to files for computing coverage 117 | with open(f'tmp_{i}_{difficulty}/test.py','w') as f: 118 | f.write(test_code_simple) 119 | passed_tests.append('test.py') 120 | #print('correct') 121 | else: 122 | exec_fails.append({'task':task_num,'test_line':lineno,'error':res}) 123 | #print(res) 124 | #print(test_code) 125 | except: 126 | syn_failed+=1 127 | pass 128 | 129 | if len(passed_tests)>0: #start measuring coverage 130 | #check if cover the selected line 131 | cov_command_prefix=['pytest', '--cov=under_test', '--cov-branch', '--cov-report=json:coverage.json'] 132 | subprocess.run(f'cp .coveragerc tmp_{i}_{difficulty}/.coveragerc',shell=True) #copy config file to tmp_folder 133 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 134 | cov_command=deepcopy(cov_command_prefix) 135 | cov_command.append('test.py') 136 | 137 | subprocess.run(cov_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 138 | cov_report=json.load(open('coverage.json')) 139 | executed_lines=cov_report['files']['under_test.py']['executed_lines'] 140 | 141 | for lineno in target_lines: #get line coverage 142 | if lineno in executed_lines: 143 | cov_line_success+=1 144 | print(f'covered line {lineno}') 145 | else: 146 | print('line not covered') 147 | pass 148 | for j,test_branch in enumerate(branches): #get branch coverage 149 | startline=test_branch['start'] 150 | endline=test_branch['end'] 151 | 152 | branch_firstline=startline+1 #if this line is covered, this branch is covered 153 | if branch_firstline in executed_lines: #use startline+1 to check whether branch is covered 154 | branch_diff=test_branch['difficulty'] 155 | if branch_diff==0: 156 | cov_easybranch+=1 157 | elif branch_diff==1: 158 | cov_mediumbranch+=1 159 | elif branch_diff==2: 160 | cov_hardbranch+=1 161 | cov_branch_success+=1 162 | print(f'covered branch {startline}-{endline}') 163 | else: 164 | print('branch not covered') 165 | pass 166 | os.chdir('..') #exit tmp_ folder 167 | else: #no test cases passed 168 | pass 169 | 170 | for dirpath, dirnames, filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 171 | # Filter dirnames based on the regex pattern 172 | for dirname in dirnames: 173 | if remove_pattern.match(dirname): 174 | shutil.rmtree(dirname) 175 | 176 | syn_correct=total_syn_correct/len(generated_data) 177 | exec_correct=total_exec_correct/len(generated_data) 178 | 
print(total_syn_correct, total_exec_correct,len(generated_data)) 179 | print(f'Syntax Correctness: {syn_correct}') 180 | print(f'Executable Correctness: {exec_correct}') 181 | 182 | cov_line_rate=cov_line_success/total_cases_line 183 | cov_branch_rate=cov_branch_success/total_cases_branch 184 | easy_covrate=cov_easybranch/total_easy 185 | medium_covrate=cov_mediumbranch/total_medium 186 | hard_covrate=cov_hardbranch/total_hard 187 | print(f'Accuracy in cover selected line: {cov_line_rate}') 188 | print(f'Accuracy in cover selected branch: {cov_branch_rate}') 189 | print(f'Easy branch coverage rate: {easy_covrate}') 190 | print(f'Medium branch coverage rate: {medium_covrate}') 191 | print(f'Hard branch coverage rate: {hard_covrate}') 192 | 193 | return {'syn_correct':syn_correct,'exec_correct':exec_correct, 'cov_line':cov_line_rate, 'cov_branch':cov_branch_rate}, exec_fails 194 | 195 | 196 | def parse_args(): 197 | parser = ArgumentParser() 198 | parser.add_argument("--path", type=str, default='predictions/totalcov_gpt-3.5-turbo.jsonl') 199 | return parser.parse_args() 200 | 201 | 202 | if __name__=='__main__': 203 | args=parse_args() 204 | print(args.path) 205 | output_dir = Path('predictions') 206 | predictions=read_jsonl(output_dir / args.path) 207 | print(len(predictions)) 208 | 209 | eval_correctness(predictions) 210 | -------------------------------------------------------------------------------- /eval_branchcov.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import json 4 | import signal 5 | import random 6 | random.seed(42) 7 | import shutil 8 | import time 9 | import re 10 | from pathlib import Path 11 | from tqdm import tqdm 12 | from argparse import ArgumentParser 13 | from copy import deepcopy 14 | from data_utils import read_jsonl 15 | 16 | class TimeoutHandler: 17 | def __init__(self, timeout, error_message=None): 18 | self.timeout = timeout 19 | self.error_message = error_message 20 | 21 | def __enter__(self): 22 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 23 | signal.alarm(self.timeout) 24 | 25 | def __exit__(self, type, value, traceback): 26 | signal.alarm(0) 27 | 28 | def raise_timeout(self, *args): 29 | raise TimeoutError(self.error_message) 30 | 31 | 32 | def execute(test_code,timeout=5): 33 | """try to execute test code""" 34 | try: 35 | exec_globals = {} 36 | with TimeoutHandler(timeout): 37 | exec(test_code, globals()) #add globals() to avoid name errors related to import 38 | return True 39 | except AssertionError: #assertionerror is considered as executable 40 | return 'assertion error' 41 | except TimeoutError: 42 | #print("timed out") 43 | return False 44 | except Exception as e: 45 | #print(f"failed: {type(e).__name__}") 46 | return type(e).__name__, e #return error type and error message 47 | 48 | 49 | def eval_correctness(generated_data): 50 | """Compute syntactical and execution correctness (with coverage).""" 51 | ref_dataset=read_jsonl('data/leetcode-py.jsonl') 52 | total_cases=0 53 | total_syn_correct=0 54 | total_comp_correct=0 #compilation correctness (todo: maybe need static analysis) 55 | total_exec_correct=0 56 | syn_failed=0 57 | 58 | exec_fails=[] 59 | 60 | cov_branch_success=0 61 | total_easy=0 #for evaluating branches with different difficulties 62 | total_medium=0 63 | total_hard=0 64 | cov_easybranch=0 65 | cov_mediumbranch=0 66 | cov_hardbranch=0 67 | remove_pattern=re.compile(r'tmp*') 68 | 69 | for i, data in 
tqdm(enumerate(generated_data)): 70 | task_num=data['task_num'] 71 | difficulty=data['difficulty'] 72 | func_name=data['func_name'] 73 | code=data['code'] 74 | ref_branches=ref_dataset[i]['blocks'] 75 | test_cases=data['tests'] 76 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 77 | test_import_simple=f'from under_test import Solution\n' 78 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 79 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test and test cases into tmp files 80 | f.write(code) 81 | passed_tests=[] 82 | 83 | for j,test_branch in enumerate(test_cases): 84 | startline=test_branch['start'] 85 | endline=test_branch['end'] 86 | testcase=test_branch['test'] 87 | assert startline==ref_branches[j]['start'] 88 | assert endline==ref_branches[j]['end'] 89 | 90 | total_cases+=1 91 | 92 | branch_diff=ref_branches[j]['difficulty'] 93 | if branch_diff==0: 94 | total_easy+=1 95 | elif branch_diff==1: 96 | total_medium+=1 97 | elif branch_diff==2: 98 | total_hard+=1 99 | 100 | try: 101 | res=compile(testcase,'','exec') #check syntax correctness 102 | total_syn_correct+=1 103 | 104 | test_code=test_import+testcase+f'\ntest_{func_name}()' 105 | time.sleep(0.01) 106 | res=execute(test_code) #check execution correctness 107 | if res!=False: 108 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 109 | print('func under test not called') 110 | exec_fails.append({'task':task_num, 'error':'not called'}) 111 | else: 112 | total_exec_correct+=1 113 | test_code_simple=test_import_simple+testcase #write to files for computing coverage 114 | with open(f'tmp_{i}_{difficulty}/test_{startline}_{endline}.py','w') as f: 115 | f.write(test_code_simple) 116 | passed_tests.append({'start':startline, 'end':endline, 'path':f'test_{startline}_{endline}.py', 'difficulty':branch_diff}) 117 | #print('correct') 118 | else: 119 | exec_fails.append({'task':task_num, 'start':startline, 'end':endline, 'error':res}) 120 | #print(res) 121 | #print(test_code) 122 | except: 123 | syn_failed+=1 124 | pass 125 | 126 | if len(passed_tests)>0: #start measuring coverage 127 | #check if cover the selected line 128 | cov_command_prefix=['pytest', '--cov=under_test', '--cov-branch', '--cov-report=json:coverage.json'] 129 | subprocess.run(f'cp .coveragerc tmp_{i}_{difficulty}/.coveragerc',shell=True) #copy config file to tmp_folder 130 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 131 | for test_branch in passed_tests: 132 | startline=test_branch['start'] 133 | endline=test_branch['end'] 134 | branch_diff=test_branch['difficulty'] 135 | branch_firstline=startline+1 #if this line is covered, this branch is covered 136 | testpath=test_branch['path'] 137 | cov_command=deepcopy(cov_command_prefix) 138 | cov_command.append(testpath) 139 | 140 | subprocess.run(cov_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 141 | cov_report=json.load(open('coverage.json')) 142 | executed_lines=cov_report['files']['under_test.py']['executed_lines'] 143 | 144 | if branch_firstline in executed_lines: #use startline+1 to check whether branch is covered 145 | cov_branch_success+=1 146 | if branch_diff==0: 147 | cov_easybranch+=1 148 | elif branch_diff==1: 149 | cov_mediumbranch+=1 150 | elif branch_diff==2: 151 | cov_hardbranch+=1 152 | print(f'covered branch {startline}-{endline}') 153 | else: 154 | print('not covered') 155 | pass 
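            # Note: a target branch {start, end} is counted as covered when line start+1
            # (the first statement inside the branch body) appears in the executed_lines
            # of coverage.json, which is rewritten by each per-test pytest run above.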
156 | os.chdir('..') #exit tmp_ folder 157 | else: #no test cases passed 158 | pass 159 | 160 | for dirpath, dirnames, filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 161 | # Filter dirnames based on the regex pattern 162 | for dirname in dirnames: 163 | if remove_pattern.match(dirname): 164 | shutil.rmtree(dirname) 165 | 166 | syn_correct=total_syn_correct/total_cases 167 | exec_correct=total_exec_correct/total_cases 168 | print(total_syn_correct, total_exec_correct,total_cases) 169 | print(f'Syntax Correctness: {syn_correct}') 170 | print(f'Executable Correctness: {exec_correct}') 171 | 172 | cov_branch_rate=cov_branch_success/total_cases 173 | cov_branch_rate_exec=cov_branch_success/total_exec_correct 174 | easy_covrate=cov_easybranch/total_easy 175 | medium_covrate=cov_mediumbranch/total_medium 176 | hard_covrate=cov_hardbranch/total_hard 177 | print(f'Accuracy in cover selected branch: {cov_branch_rate}') 178 | print(f'Easy branch coverage rate: {easy_covrate}') 179 | print(f'Medium branch coverage rate: {medium_covrate}') 180 | print(f'Hard branch coverage rate: {hard_covrate}') 181 | 182 | return {'syn_correct':syn_correct,'exec_correct':exec_correct, 'cov_branch':cov_branch_rate}, exec_fails 183 | 184 | 185 | def parse_args(): 186 | parser = ArgumentParser() 187 | parser.add_argument("--path", type=str, default='branchcov_gpt-3.5-turbo.jsonl') 188 | return parser.parse_args() 189 | 190 | 191 | if __name__=='__main__': 192 | args=parse_args() 193 | print('generated answers:', args.path) 194 | output_dir = Path('predictions') 195 | predictions=read_jsonl(output_dir / args.path) 196 | print(len(predictions)) 197 | 198 | eval_correctness(predictions) 199 | -------------------------------------------------------------------------------- /eval_linecov.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import json 4 | import signal 5 | import random 6 | random.seed(42) 7 | import shutil 8 | import time 9 | import re 10 | from pathlib import Path 11 | from tqdm import tqdm 12 | from argparse import ArgumentParser 13 | from copy import deepcopy 14 | from data_utils import read_jsonl 15 | 16 | 17 | class TimeoutHandler: 18 | def __init__(self, timeout, error_message=None): 19 | self.timeout = timeout 20 | self.error_message = error_message 21 | 22 | def __enter__(self): 23 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 24 | signal.alarm(self.timeout) 25 | 26 | def __exit__(self, type, value, traceback): 27 | signal.alarm(0) 28 | 29 | def raise_timeout(self, *args): 30 | raise TimeoutError(self.error_message) 31 | 32 | 33 | def execute(test_code,timeout=5): 34 | """try to execute test code""" 35 | try: 36 | exec_globals = {} 37 | with TimeoutHandler(timeout): 38 | exec(test_code, globals()) #add globals() to avoid name errors related to import 39 | return True 40 | except AssertionError: #assertionerror is considered as executable 41 | return True 42 | except TimeoutError: 43 | #print("timed out") 44 | return False 45 | except Exception as e: 46 | #print(f"failed: {type(e).__name__}") 47 | return type(e).__name__, e #return error type and error message 48 | 49 | 50 | def eval_correctness(generated_data, covmode='branch'): 51 | """Compute syntactical and execution correctness (with coverage).""" 52 | total_cases=0 53 | total_syn_correct=0 54 | total_comp_correct=0 55 | total_exec_correct=0 56 | syn_failed=0 57 | 58 | exec_fails=[] 59 | 60 | 
total_line_cov=0 61 | total_branch_cov=0 62 | cov_line_success=0 63 | remove_pattern=re.compile(r'tmp*') 64 | 65 | for i, data in tqdm(enumerate(generated_data)): 66 | task_num=data['task_num'] 67 | difficulty=data['difficulty'] 68 | func_name=data['func_name'] 69 | code=data['code'] 70 | #code=ADDITIONAL_IMPORTS+code #add possibly missing imports 71 | test_cases=data['tests'] 72 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 73 | test_import_simple=f'from under_test import Solution\n' 74 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 75 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test and test cases into tmp files 76 | f.write(code) 77 | passed_tests={} 78 | 79 | for lineno in test_cases: 80 | testcase=test_cases[lineno] 81 | #testcase=test_cases[fixed_testcase_num] #comparison: use the first test case 82 | lineno=int(lineno) 83 | total_cases+=1 84 | 85 | try: 86 | res=compile(testcase,'','exec') #check syntax correctness 87 | total_syn_correct+=1 88 | 89 | test_code=test_import+testcase+f'\ntest_{func_name}()' 90 | time.sleep(0.01) 91 | res=execute(test_code) 92 | if res==True: 93 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 94 | print('func under test not called') 95 | exec_fails.append({'task':task_num, 'error':'not called'}) 96 | else: 97 | total_exec_correct+=1 98 | test_code_simple=test_import_simple+testcase #write to files for computing coverage 99 | with open(f'tmp_{i}_{difficulty}/test_{lineno}.py','w') as f: 100 | f.write(test_code_simple) 101 | passed_tests[lineno]=f'test_{lineno}.py' 102 | #print('correct') 103 | else: 104 | exec_fails.append({'task':task_num,'test_line':lineno,'error':res}) 105 | #print(res) 106 | #print(test_code) 107 | except: 108 | syn_failed+=1 109 | #print('syntax error') 110 | pass 111 | 112 | if len(passed_tests)>0: #start measuring coverage 113 | #check if cover the selected line 114 | cov_command_prefix=['pytest', '--cov=under_test', '--cov-branch', '--cov-report=json:coverage.json'] 115 | subprocess.run(f'cp .coveragerc tmp_{i}_{difficulty}/.coveragerc',shell=True) #copy config file to tmp_folder 116 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 117 | for lineno in passed_tests: 118 | test=passed_tests[lineno] 119 | cov_command=deepcopy(cov_command_prefix) 120 | cov_command.append(test) 121 | 122 | subprocess.run(cov_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 123 | cov_report=json.load(open('coverage.json')) 124 | executed_lines=cov_report['files']['under_test.py']['executed_lines'] 125 | #missline_lines= 126 | if lineno in executed_lines: 127 | cov_line_success+=1 128 | print(f'covered line {lineno}') 129 | else: 130 | print('not covered') 131 | pass 132 | os.chdir('..') #exit tmp_ folder 133 | else: #no test cases passed 134 | pass 135 | 136 | for dirpath, dirnames, filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 137 | # Filter dirnames based on the regex pattern 138 | for dirname in dirnames: 139 | if remove_pattern.match(dirname): 140 | shutil.rmtree(dirname) 141 | 142 | syn_correct=total_syn_correct/total_cases 143 | exec_correct=total_exec_correct/total_cases 144 | print(total_syn_correct, total_exec_correct,total_cases) 145 | print(f'Syntax Correctness: {syn_correct}') 146 | print(f'Executable Correctness: {exec_correct}') 147 | 148 | 
cov_line_rate=cov_line_success/total_cases 149 | cov_line_rate_exec=cov_line_success/total_exec_correct 150 | print(f'Accuracy in cover selected line: {cov_line_rate}') 151 | 152 | return {'syn_correct':syn_correct,'exec_correct':exec_correct, 'cov_line':cov_line_rate}, exec_fails 153 | 154 | 155 | def parse_args(): 156 | parser = ArgumentParser() 157 | parser.add_argument("--path", type=str, default='linecov_gpt-3.5-turbo.jsonl') 158 | return parser.parse_args() 159 | 160 | 161 | 162 | if __name__=='__main__': 163 | args=parse_args() 164 | print(args.path) 165 | output_dir = Path('predictions') 166 | predictions=read_jsonl(output_dir / args.path) 167 | print(len(predictions)) 168 | 169 | eval_correctness(predictions) 170 | -------------------------------------------------------------------------------- /eval_overall.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import json 4 | import signal 5 | import random 6 | random.seed(42) 7 | import shutil 8 | import time 9 | import re 10 | from pathlib import Path 11 | from tqdm import tqdm 12 | from argparse import ArgumentParser 13 | from copy import deepcopy 14 | from data_utils import read_jsonl 15 | 16 | 17 | class TimeoutHandler: 18 | def __init__(self, timeout, error_message=None): 19 | self.timeout = timeout 20 | self.error_message = error_message 21 | 22 | def __enter__(self): 23 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 24 | signal.alarm(self.timeout) 25 | 26 | def __exit__(self, type, value, traceback): 27 | signal.alarm(0) 28 | 29 | def raise_timeout(self, *args): 30 | raise TimeoutError(self.error_message) 31 | 32 | 33 | def execute(test_code,timeout=5): 34 | """try to execute test code""" 35 | try: 36 | exec_globals = {} 37 | with TimeoutHandler(timeout): 38 | exec(test_code, globals()) 39 | return True 40 | except AssertionError: #assertionerror is considered as executable 41 | return True 42 | except TimeoutError: 43 | #print("timed out") 44 | return False 45 | except Exception as e: 46 | #print(f"failed: {type(e).__name__}") 47 | return type(e).__name__, e #return error type and error message 48 | 49 | 50 | def coverage_at_k_sample(passed_tests, k, cov_command_prefix): 51 | """Compute coverage@k for a single program under test.""" 52 | random.shuffle(passed_tests) 53 | if len(passed_tests)>=k: 54 | #num_splits=math.ceil(len(passed_tests)/k) #round up or down? 
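    # round down: form len(passed_tests)//k disjoint groups of exactly k tests each;
    # leftover tests beyond the last full group are discarded (if fewer than k tests
    # passed in total, they are kept together as a single smaller group below).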
55 | num_splits=len(passed_tests)//k 56 | splited_tests=[passed_tests[i * k : (i + 1) * k] for i in range(num_splits)] 57 | else: #if number of passed tests is less than k, do not split 58 | splited_tests=[passed_tests] 59 | #calculate and average coverages for each group 60 | split_line_covs=[] 61 | split_branch_covs=[] 62 | 63 | for i,test_group in enumerate(splited_tests): 64 | group_line_cov=[] 65 | group_branch_cov=[] 66 | cov_command=deepcopy(cov_command_prefix) 67 | for test in test_group: 68 | cov_command.append(test) 69 | subprocess.run(cov_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 70 | cov_report=json.load(open('coverage.json')) 71 | total_stmt=cov_report['totals']['num_statements'] 72 | covered_stmt=cov_report['totals']['covered_lines'] 73 | line_cov=covered_stmt/total_stmt 74 | total_branch=cov_report['totals']['num_branches'] 75 | covered_branch=cov_report['totals']['covered_branches'] 76 | branch_cov=covered_branch/total_branch 77 | group_line_cov.append(line_cov) 78 | group_branch_cov.append(branch_cov) 79 | 80 | group_avg_line_cov=sum(group_line_cov)/len(group_line_cov) 81 | group_avg_branch_cov=sum(group_branch_cov)/len(group_branch_cov) 82 | split_line_covs.append(group_avg_line_cov) 83 | split_branch_covs.append(group_avg_branch_cov) 84 | 85 | avg_line_cov=sum(split_line_covs)/len(split_line_covs) 86 | avg_branch_cov=sum(split_branch_covs)/len(split_branch_covs) 87 | return {'line_cov':avg_line_cov,'branch_cov':avg_branch_cov} 88 | 89 | 90 | 91 | def check_correctness(generated_data,ks=[1, 2, 5]): 92 | """Compute syntactical and execution correctness (with coverage).""" 93 | total_cases=0 94 | total_syn_correct=0 95 | total_comp_correct=0 96 | total_exec_correct=0 97 | syn_failed=0 98 | 99 | exec_fails=[] 100 | 101 | total_line_cov=0 102 | total_branch_cov=0 103 | line_covs_at_k={f'cov@{k}':[] for k in ks} 104 | branch_covs_at_k={f'cov@{k}':[] for k in ks} 105 | 106 | remove_pattern=re.compile(r'tmp*') 107 | 108 | for i, data in tqdm(enumerate(generated_data)): 109 | task_num=data['task_num'] 110 | difficulty=data['difficulty'] 111 | func_name=data['func_name'] 112 | code=data['code'] 113 | test_cases=data['tests'] 114 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 115 | test_import_simple=f'from under_test import Solution\n' 116 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 117 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test and test cases into tmp files 118 | f.write(code) 119 | passed_tests=[] 120 | 121 | for j, testcase in enumerate(test_cases): 122 | #testcase=textwrap.dedent(testcase) 123 | total_cases+=1 124 | try: 125 | res=compile(testcase,'','exec') #check syntax correctness 126 | total_syn_correct+=1 127 | 128 | test_code=test_import+testcase+f'\ntest_{func_name}()' 129 | time.sleep(0.01) 130 | res=execute(test_code) 131 | if res==True: 132 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 133 | print('func under test not called') 134 | exec_fails.append({'task':task_num,'test_num':j,'error':'not called'}) 135 | else: 136 | total_exec_correct+=1 137 | test_code_simple=test_import_simple+testcase #write to files for computing coverage 138 | with open(f'tmp_{i}_{difficulty}/test_{j}.py','w') as f: 139 | f.write(test_code_simple) 140 | passed_tests.append(f'test_{j}.py') 141 | else: 142 | 
exec_fails.append({'task':task_num,'test_num':j,'error':res}) 143 | #print(res) 144 | #print(test_code) 145 | 146 | except: 147 | syn_failed+=1 148 | #print('syntax error') 149 | #print(testcase) 150 | pass 151 | 152 | if len(passed_tests)>0: #start measuring coverage 153 | #total coverage for all tests 154 | cov_command_prefix=['pytest', '--cov=under_test', '--cov-branch', '--cov-report=json:coverage.json'] 155 | subprocess.run(f'cp .coveragerc tmp_{i}_{difficulty}/.coveragerc',shell=True) #copy config file to tmp_folder 156 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 157 | cov_command=deepcopy(cov_command_prefix) 158 | for test in passed_tests: 159 | cov_command.append(test) 160 | 161 | try: 162 | subprocess.run(cov_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 163 | cov_report=json.load(open('coverage.json')) 164 | total_stmt=cov_report['totals']['num_statements'] 165 | covered_stmt=cov_report['totals']['covered_lines'] 166 | line_cov=covered_stmt/total_stmt 167 | total_branch=cov_report['totals']['num_branches'] 168 | covered_branch=cov_report['totals']['covered_branches'] 169 | branch_cov=covered_branch/total_branch 170 | total_line_cov+=line_cov 171 | total_branch_cov+=branch_cov 172 | #print(f'Line Coverage: {line_cov}, Branch Coverage: {branch_cov}') 173 | except: #unknown pytest error: cannot generate coverage report (AssertionError: Expected current collector to be , but it's ) 174 | print('Failed to generate coverage report') 175 | pass 176 | 177 | #compute coverage@k 178 | for k in ks: 179 | res_at_k=coverage_at_k_sample(passed_tests,k,cov_command_prefix) 180 | line_covs_at_k[f'cov@{k}'].append(res_at_k['line_cov']) 181 | branch_covs_at_k[f'cov@{k}'].append(res_at_k['branch_cov']) 182 | 183 | os.chdir('..') #exit tmp_ folder 184 | else: #no test cases passed 185 | pass 186 | 187 | for dirpath, dirnames, filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 188 | # Filter dirnames based on the regex pattern 189 | for dirname in dirnames: 190 | if remove_pattern.match(dirname): 191 | shutil.rmtree(dirname) 192 | 193 | syn_correct=total_syn_correct/total_cases 194 | exec_correct=total_exec_correct/total_cases 195 | print(f'Syntax Correctness: {syn_correct}') 196 | print(f'Executable Correctness: {exec_correct}') 197 | 198 | #compute average coverage@k 199 | for k in ks: 200 | line_covs_at_k[f'cov@{k}']=sum(line_covs_at_k[f'cov@{k}'])/len(generated_data) 201 | branch_covs_at_k[f'cov@{k}']=sum(branch_covs_at_k[f'cov@{k}'])/len(generated_data) 202 | print(f'line coverage@{k}',line_covs_at_k[f'cov@{k}']) 203 | print(f'branch coverage@{k}',branch_covs_at_k[f'cov@{k}']) 204 | 205 | #compute coverage 206 | avg_line_cov=total_line_cov/len(generated_data) 207 | avg_branch_cov=total_branch_cov/len(generated_data) 208 | print(f'Average Line Coverage: {avg_line_cov}, Average Branch Coverage: {avg_branch_cov}') 209 | return {'syn_correct':syn_correct,'exec_correct':exec_correct}, exec_fails 210 | 211 | 212 | def parse_args(): 213 | parser = ArgumentParser() 214 | parser.add_argument("--path", type=str, default='totalcov_gpt-3.5-turbo.jsonl') 215 | parser.add_argument("--ks", type=int, nargs='+', default=[1, 2, 5]) 216 | return parser.parse_args() 217 | 218 | 219 | if __name__=='__main__': 220 | args=parse_args() 221 | print(args.path) 222 | print(args.ks) 223 | output_dir = Path('predictions') 224 | predictions=read_jsonl(output_dir / args.path) 225 | print(len(predictions)) 226 | 227 | check_correctness(predictions, 
ks=args.ks) 228 | -------------------------------------------------------------------------------- /eval_pathcov.py: -------------------------------------------------------------------------------- 1 | import os 2 | import signal 3 | import random 4 | random.seed(42) 5 | import shutil 6 | import time 7 | import re 8 | from pathlib import Path 9 | from tqdm import tqdm 10 | from argparse import ArgumentParser 11 | from copy import deepcopy 12 | from data_utils import read_jsonl 13 | 14 | class TimeoutHandler: 15 | def __init__(self, timeout, error_message=None): 16 | self.timeout = timeout 17 | self.error_message = error_message 18 | 19 | def __enter__(self): 20 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 21 | signal.alarm(self.timeout) 22 | 23 | def __exit__(self, type, value, traceback): 24 | signal.alarm(0) 25 | 26 | def raise_timeout(self, *args): 27 | raise TimeoutError(self.error_message) 28 | 29 | 30 | def execute(test_code,timeout=5): 31 | """try to execute test code""" 32 | try: 33 | exec_globals = {} 34 | with TimeoutHandler(timeout): 35 | exec(test_code, globals()) #add globals() to avoid name errors related to import 36 | return True 37 | except AssertionError: #assertionerror is considered as executable 38 | return True 39 | except TimeoutError: 40 | #print("timed out") 41 | return False 42 | except Exception as e: 43 | #print(f"failed: {type(e).__name__}") 44 | return type(e).__name__, e #return error type and error message 45 | 46 | 47 | 48 | def parse_args(): 49 | parser = ArgumentParser() 50 | parser.add_argument("--path", type=str, default='pathcov_gpt-3.5-turbo.jsonl') 51 | return parser.parse_args() 52 | 53 | 54 | def match_path(generated_path, ref_path): 55 | """Compute path similarity based on longest common subsequence. 
56 | Return similarity: len(lcs(generated_path, ref_path))/len(ref_path)""" 57 | ref_len=len(ref_path) 58 | generated_len=len(generated_path) 59 | 60 | dp = [[0]*(ref_len+1) for _ in range(generated_len+1)] 61 | max_length = 0 62 | for i in range(1, generated_len+1): 63 | for j in range(1, ref_len+1): 64 | if generated_path[i-1] == ref_path[j-1]: 65 | dp[i][j] = dp[i-1][j-1] + 1 66 | max_length = max(max_length, dp[i][j]) 67 | similarity = max_length / ref_len 68 | return similarity 69 | 70 | 71 | 72 | def eval_correctness(generated_data): 73 | """Compute syntactical and execution correctness (with coverage).""" 74 | ref_dataset=read_jsonl('data/tgt_paths.jsonl') 75 | instrumented_dataset=read_jsonl('data/leetcode-py-instrumented.jsonl') 76 | total_cases=0 77 | total_syn_correct=0 78 | total_exec_correct=0 79 | syn_failed=0 80 | 81 | exec_fails=[] 82 | 83 | total_path_match=0 84 | total_path_similarity=0 #similarity score: based on longest common subsequence 85 | remove_pattern=re.compile(r'tmp*') 86 | 87 | for i, data in tqdm(enumerate(generated_data)): 88 | task_num=data['task_num'] 89 | difficulty=data['difficulty'] 90 | func_name=data['func_name'] 91 | code=data['code'] 92 | task_title=instrumented_dataset[i]['task_title'] 93 | instrumented_code=instrumented_dataset[i]['python_solution_instrumented'] 94 | test_cases=data['tests'] 95 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 96 | test_import_simple=f'from under_test import Solution\n' 97 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 98 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test into tmp files 99 | f.write(instrumented_code) 100 | passed_tests=[] 101 | 102 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 103 | os.makedirs('test_logs',exist_ok=True) 104 | for j, testcase in enumerate(test_cases): 105 | with open(f'test_logs/{task_title}.log', 'w') as f: 106 | f.write('') #add empty log file 107 | total_cases+=1 108 | try: 109 | res=compile(testcase,'','exec') #check syntax correctness 110 | total_syn_correct+=1 111 | 112 | test_code=test_import+testcase+f'\ntest_{func_name}()' 113 | time.sleep(0.01) 114 | res=execute(test_code) 115 | if res==True: 116 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 117 | print('func under test not called') 118 | exec_fails.append({'task':task_num,'test_num':j,'error':'not called'}) 119 | else: #sucussfully execution, start calculating path coverage 120 | total_exec_correct+=1 121 | ref_path=ref_dataset[i]['sampled_paths'][j] 122 | #print(ref_path) 123 | with open(f'test_logs/{task_title}.log') as f: 124 | lines=f.readlines() 125 | generated_path=tuple(lines) 126 | print(len(generated_path)) 127 | path_sim=match_path(generated_path, ref_path) 128 | print(ref_path, path_sim) 129 | if path_sim==1: 130 | total_path_match+=1 131 | total_path_similarity+=path_sim 132 | passed_tests.append({'path': f'test_{j}.py', 'pass': True}) 133 | else: 134 | exec_fails.append({'task':task_num,'test_num':j,'error':res}) 135 | #print(res) 136 | #print(test_code) 137 | passed_tests.append({'path': f'test_{j}.py', 'pass': False}) 138 | except: 139 | syn_failed+=1 140 | print('syntax error') 141 | print(testcase) 142 | passed_tests.append({'path': f'test_{j}.py', 'pass': False}) 143 | pass 144 | #print(passed_tests) 145 | os.chdir('..') #exit tmp_ folder 146 | 147 | for dirpath, dirnames, 
filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 148 | # Filter dirnames based on the regex pattern 149 | for dirname in dirnames: 150 | if remove_pattern.match(dirname): 151 | shutil.rmtree(dirname) 152 | 153 | syn_correct=total_syn_correct/total_cases 154 | exec_correct=total_exec_correct/total_cases 155 | print(f'Syntax Correctness: {syn_correct}') 156 | print(f'Executable Correctness: {exec_correct}') 157 | 158 | path_exactmatch_acc=total_path_match/total_cases 159 | path_similarity_score=total_path_similarity/total_cases 160 | print('path exact match accuracy:', path_exactmatch_acc) 161 | print('path similarity score:', path_similarity_score) 162 | 163 | 164 | 165 | if __name__=='__main__': 166 | args=parse_args() 167 | print(args.path) 168 | output_dir = Path('predictions') 169 | predictions=read_jsonl(output_dir / args.path) 170 | print(len(predictions)) 171 | 172 | eval_correctness(predictions) 173 | -------------------------------------------------------------------------------- /eval_pathcov_base.py: -------------------------------------------------------------------------------- 1 | import os 2 | import signal 3 | import random 4 | random.seed(42) 5 | import shutil 6 | import time 7 | import re 8 | from pathlib import Path 9 | from tqdm import tqdm 10 | from argparse import ArgumentParser 11 | from copy import deepcopy 12 | from data_utils import read_jsonl 13 | 14 | class TimeoutHandler: 15 | def __init__(self, timeout, error_message=None): 16 | self.timeout = timeout 17 | self.error_message = error_message 18 | 19 | def __enter__(self): 20 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 21 | signal.alarm(self.timeout) 22 | 23 | def __exit__(self, type, value, traceback): 24 | signal.alarm(0) 25 | 26 | def raise_timeout(self, *args): 27 | raise TimeoutError(self.error_message) 28 | 29 | 30 | def execute(test_code,timeout=5): 31 | """try to execute test code""" 32 | try: 33 | exec_globals = {} 34 | with TimeoutHandler(timeout): 35 | #with time_limit(timeout): 36 | exec(test_code, globals()) #add globals() to avoid name errors related to import 37 | return True 38 | except AssertionError: #assertionerror is considered as executable 39 | return True 40 | except TimeoutError: 41 | #print("timed out") 42 | return False 43 | except Exception as e: 44 | #print(f"failed: {type(e).__name__}") 45 | return type(e).__name__, e #return error type and error message 46 | 47 | 48 | 49 | def parse_args(): 50 | parser = ArgumentParser() 51 | parser.add_argument("--path", type=str, default='pathcov_gpt-3.5-turboformat.jsonl') 52 | return parser.parse_args() 53 | 54 | 55 | def match_path(generated_path, ref_path): 56 | """Compute path similarity based on longest common subsequence. 
57 | Return similarity: len(lcs(generated_path, ref_path))/len(ref_path)""" 58 | ref_len=len(ref_path) 59 | generated_len=len(generated_path) 60 | 61 | dp = [[0]*(ref_len+1) for _ in range(generated_len+1)] 62 | max_length = 0 63 | for i in range(1, generated_len+1): 64 | for j in range(1, ref_len+1): 65 | if generated_path[i-1] == ref_path[j-1]: 66 | dp[i][j] = dp[i-1][j-1] + 1 67 | max_length = max(max_length, dp[i][j]) 68 | similarity = max_length / ref_len 69 | return similarity 70 | 71 | 72 | 73 | def eval_correctness(generated_data): 74 | """Compute syntactical and execution correctness (with coverage).""" 75 | ref_dataset=read_jsonl('data/tgt_paths.jsonl') 76 | instrumented_dataset=read_jsonl('data/leetcode-py-instrumented.jsonl') 77 | total_cases=0 78 | total_paths=0 79 | total_syn_correct=0 80 | total_exec_correct=0 81 | syn_failed=0 82 | 83 | exec_fails=[] 84 | 85 | total_path_match=0 86 | total_path_similarity=0 #similarity score: based on longest common subsequence 87 | remove_pattern=re.compile(r'tmp*') 88 | 89 | for i, data in tqdm(enumerate(generated_data)): 90 | task_num=data['task_num'] 91 | difficulty=data['difficulty'] 92 | func_name=data['func_name'] 93 | task_title=instrumented_dataset[i]['task_title'] 94 | instrumented_code=instrumented_dataset[i]['python_solution_instrumented'] 95 | test_cases=data['tests'] 96 | baseline_test=test_cases[0] #use the first generated test in overall coverage as the baseline 97 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 98 | test_import_simple=f'from under_test import Solution\n' 99 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 100 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test into tmp files 101 | f.write(instrumented_code) 102 | passed_tests=[] 103 | 104 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 105 | os.makedirs('test_logs',exist_ok=True) 106 | 107 | total_cases+=1 108 | total_paths+=len(ref_dataset[i]['sampled_paths']) 109 | with open(f'test_logs/{task_title}.log', 'w') as f: 110 | f.write('') #add empty log file 111 | try: 112 | res=compile(baseline_test,'','exec') #check syntax correctness 113 | total_syn_correct+=1 114 | 115 | test_code=test_import+baseline_test+f'\ntest_{func_name}()' 116 | time.sleep(0.01) 117 | res=execute(test_code) 118 | if res==True: 119 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 120 | print('func under test not called') 121 | exec_fails.append({'task':task_num,'test_num':j,'error':'not called'}) 122 | else: #sucussfully execution, start calculating path coverage 123 | total_exec_correct+=1 124 | with open(f'test_logs/{task_title}.log') as f: 125 | lines=f.readlines() 126 | print(lines) 127 | generated_path=tuple(lines) 128 | for j in range(len(ref_dataset[i]['sampled_paths'])): 129 | #total_paths+=1 #total paths should be added before execution 130 | ref_path=ref_dataset[i]['sampled_paths'][j] 131 | #print(ref_path) 132 | 133 | path_sim=match_path(generated_path, ref_path) 134 | print(generated_path, ref_path, path_sim) 135 | if path_sim==1: 136 | total_path_match+=1 137 | total_path_similarity+=path_sim 138 | 139 | passed_tests.append({'path': f'test.py', 'pass': True}) 140 | else: 141 | exec_fails.append({'task':task_num,'test_num':j,'error':res}) 142 | #print(res) 143 | #print(test_code) 144 | passed_tests.append({'path': f'test.py', 'pass': False}) 145 | 
except: 146 | syn_failed+=1 147 | #print('syntax error') 148 | passed_tests.append({'path': f'test.py', 'pass': False}) 149 | pass 150 | #print(passed_tests) 151 | os.chdir('..') #exit tmp_ folder 152 | 153 | for dirpath, dirnames, filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 154 | # Filter dirnames based on the regex pattern 155 | for dirname in dirnames: 156 | if remove_pattern.match(dirname): 157 | shutil.rmtree(dirname) 158 | 159 | syn_correct=total_syn_correct/total_cases 160 | exec_correct=total_exec_correct/total_cases 161 | print(f'Syntax Correctness: {syn_correct}') 162 | print(f'Executable Correctness: {exec_correct}') 163 | 164 | print(total_exec_correct,total_cases, total_path_match,total_paths) 165 | path_exactmatch_acc=total_path_match/total_paths 166 | path_similarity_score=total_path_similarity/total_paths 167 | print('path exact match accuracy:', path_exactmatch_acc) 168 | print('path similarity score:', path_similarity_score) 169 | 170 | 171 | 172 | if __name__=='__main__': 173 | args=parse_args() 174 | print(args.path) 175 | output_dir = Path('predictions') 176 | predictions=read_jsonl(output_dir / args.path) 177 | print(len(predictions)) 178 | 179 | eval_correctness(predictions) 180 | -------------------------------------------------------------------------------- /format.py: -------------------------------------------------------------------------------- 1 | import os 2 | import textwrap 3 | import ast 4 | from pathlib import Path 5 | from argparse import ArgumentParser 6 | 7 | from data_utils import read_jsonl, write_jsonl 8 | 9 | 10 | def change_function_name(code, new_name): 11 | try: 12 | # Parse the code into an AST 13 | tree = ast.parse(code) 14 | 15 | # Find the first function definition and change its name 16 | for node in ast.walk(tree): 17 | if isinstance(node, ast.FunctionDef): 18 | if node.name!=new_name: 19 | node.name = new_name 20 | break 21 | else: 22 | break 23 | 24 | # Convert the modified AST back to code 25 | new_code = ast.unparse(tree) 26 | return new_code 27 | except Exception as e: #cannoot parse 28 | return code 29 | 30 | 31 | def reformat_case_byrules(testcase, func_name, lang='python'): 32 | if testcase.startswith(' '): #remove extra indents (encountered in codellama, mistral-7b starts with one space...) 33 | testcase=textwrap.dedent(testcase) 34 | lines=testcase.split('\n') 35 | 36 | if lang=='python': 37 | last_line=lines[-1] #if last line is not complete (due to token limit), remove it 38 | last_line=textwrap.dedent(last_line) 39 | try: 40 | compile(last_line,'','exec') 41 | except: 42 | #print('imcomplete last line, remove it', last_line) 43 | lines=lines[:-1] #last line cannot compile 44 | 45 | testcase='\n'.join(lines) 46 | testcase=change_function_name(testcase, func_name) 47 | return testcase 48 | 49 | 50 | def remove_extra(testcase, func_name, lang='python'): 51 | """Remove extra test inputs and natural language descriptions before and after the test method. 52 | Only keep the contents between def test() and solution.{func_name}""" 53 | lines=testcase.split('\n') 54 | func_startline=0 #the line when test function starts (def test....) 
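    # The two scans below locate the `def test...` header and the first call to
    # `solution.{func_name}`; only that slice of the generated text is kept, so any
    # natural-language preamble or trailing explanation from the model is dropped.
    # Note that the function returns right after this slice is built, so the
    # duplicate-input pruning branch further down is never reached.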
55 | for i in range(len(lines)): 56 | if lines[i].find('def test')>=0: 57 | func_startline=i 58 | break 59 | test_endline=len(lines) 60 | for i in range(len(lines)): 61 | if lines[i].find(f'solution.{func_name}')>=0: #first call to the function under test 62 | test_endline=i+1 63 | break 64 | new_testcase='\n'.join(lines[func_startline:test_endline]) 65 | return new_testcase 66 | 67 | num_inputs=testcase.count(f'solution.{func_name}') 68 | if num_inputs>1: 69 | lines=testcase.split('\n') 70 | new_testcase=[] 71 | for line in lines: 72 | new_testcase.append(line) 73 | if line.find(f'solution.{func_name}')>=0: #discard statements after the first test input 74 | break 75 | new_testcase='\n'.join(new_testcase) 76 | return new_testcase 77 | else: 78 | return testcase 79 | 80 | 81 | def reformat_line(datapath,newpath): 82 | data=read_jsonl(datapath) 83 | formatted_data=[] 84 | for e in data: 85 | code=e['code'] 86 | func_name=e['func_name'] 87 | test_funcname=f'test_{func_name}' 88 | #print(code) 89 | tests=e['tests'] 90 | #formated_tests=[] 91 | for lineno in tests: 92 | testcase=tests[lineno] 93 | print(testcase) 94 | testcase=remove_extra(testcase, func_name) 95 | reformatted_testcase=reformat_case_byrules(testcase, test_funcname, 'python') 96 | #print('------') 97 | print(reformatted_testcase) 98 | print('<---------------------->') 99 | tests[lineno]=reformatted_testcase 100 | e['tests']=tests 101 | 102 | formatted_data.append(e) 103 | write_jsonl(formatted_data, newpath) 104 | 105 | 106 | def reformat_branch(datapath,newpath): 107 | data=read_jsonl(datapath) 108 | formatted_data=[] 109 | for e in data: 110 | code=e['code'] 111 | func_name=e['func_name'] 112 | test_funcname=f'test_{func_name}' 113 | #print(code) 114 | tests=e['tests'] 115 | formated_tests=[] 116 | for branch in tests: 117 | testcase=branch['test'] 118 | print(testcase) 119 | testcase=remove_extra(testcase, func_name) 120 | reformatted_testcase=reformat_case_byrules(testcase, test_funcname, 'python') 121 | #print('------') 122 | print(reformatted_testcase) 123 | print('<---------------------->') 124 | branch['test']=reformatted_testcase 125 | formated_tests.append(branch) 126 | e['tests']=formated_tests 127 | 128 | formatted_data.append(e) 129 | write_jsonl(formatted_data, newpath) 130 | 131 | 132 | def reformat_cov(datapath,newpath): 133 | data=read_jsonl(datapath) 134 | formatted_data=[] 135 | for e in data: 136 | #print(code) 137 | func_name=e['func_name'] 138 | test_funcname=f'test_{func_name}' 139 | formatted_test_cases=[] 140 | testcases=e['tests'] 141 | for testcase in testcases: 142 | print(testcase) 143 | extracted_testcase=remove_extra(testcase, func_name) 144 | #if extracted_testcase!=testcase: 145 | #print(testcase) 146 | #print('----') 147 | #print(extracted_testcase) 148 | reformatted_testcase=reformat_case_byrules(extracted_testcase, test_funcname, 'python') 149 | print('------') 150 | print(reformatted_testcase) 151 | print('<---------------------->') 152 | formatted_test_cases.append(reformatted_testcase) 153 | e['tests']=formatted_test_cases 154 | 155 | formatted_data.append(e) 156 | write_jsonl(formatted_data, newpath) 157 | 158 | 159 | def parse_args(): 160 | parser = ArgumentParser() 161 | parser.add_argument("--path", type=str, default='Linecov_python_gemini-1.0-pro-latest.jsonl') 162 | parser.add_argument("--mode", type=str, default='overall', choices=['line', 'branch', 'overall']) 163 | return parser.parse_args() 164 | 165 | 166 | if __name__=='__main__': 167 | args=parse_args() 168 | print('generated 
answers:', args.path) 169 | print('coverage mode:', args.mode) 170 | output_dir = Path('predictions') 171 | finename,ext=os.path.splitext(args.path) 172 | newpath=f'{finename}format{ext}' 173 | print(newpath) 174 | if args.mode=='line': 175 | print('reformat line coverage') 176 | reformat_line(output_dir / args.path, output_dir / newpath) 177 | elif args.mode=='overall': 178 | print('reformat overall coverage') 179 | reformat_cov(output_dir / args.path, output_dir / newpath) 180 | elif args.mode=='branch': 181 | print('reformat branch coverage') 182 | reformat_branch(output_dir / args.path, output_dir / newpath) 183 | -------------------------------------------------------------------------------- /gen_linecov_cot_hf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import transformers 3 | import torch 4 | import textwrap 5 | from argparse import ArgumentParser 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | 9 | from transformers import LlamaForCausalLM, CodeLlamaTokenizer, AutoTokenizer, AutoModelForCausalLM 10 | from transformers import pipeline 11 | access_token=os.getenv("HUGGINGFACE_TOKEN") 12 | 13 | from data_utils import read_jsonl, write_jsonl, add_lineno, add_lineno_comment 14 | 15 | def parse_args(): 16 | parser = ArgumentParser() 17 | parser.add_argument("--dataset", type=str, default='leetcode') 18 | parser.add_argument("--lang", type=str, default='python', choices=['python', 'java', 'c++']) 19 | parser.add_argument("--model", type=str, default='meta-llama/Meta-Llama-3-8B-Instruct') 20 | parser.add_argument("--covmode", type=str, default='line', choices=['line', 'branch'], help='cover targets at line level or branch level') 21 | parser.add_argument("--max_tokens", type=int, default=1024) 22 | parser.add_argument("--temperature", type=float, default=1e-5) 23 | return parser.parse_args() 24 | 25 | 26 | #models do not support system message 27 | models_nosys=['google/gemma-1.1-7b-it', 28 | 'bigcode/starcoder2-15b-instruct-v0.1', 29 | 'mistralai/Mistral-7B-Instruct-v0.3'] 30 | 31 | 32 | def generate_completion(args, generator, prompt, system_message=''): 33 | if args.model in models_nosys: #models don't support system message 34 | messages=[{"role": "user", "content": system_message+prompt}] 35 | else: 36 | messages=[ 37 | {"role": "system", "content": system_message}, 38 | {"role": "user", "content": prompt}, 39 | ] 40 | prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 41 | 42 | generated=generator(prompt, 43 | max_new_tokens=args.max_tokens, 44 | temperature=args.temperature, 45 | return_full_text=False) 46 | generated_text=generated[0]['generated_text'] 47 | return generated_text 48 | 49 | 50 | if __name__=='__main__': 51 | args=parse_args() 52 | model_abbrv=args.model.split('/')[-1] 53 | print('Model:', model_abbrv) 54 | print('task:', args.covmode) 55 | output_dir = Path('predictions') 56 | 57 | prompt_template_cond=open('prompt/line_oneshot_gencond.txt').read() 58 | prompt_template_test=open('prompt/line_oneshot_gentest.txt').read() 59 | system_template=open('prompt/system_exec.txt').read() 60 | system_message=system_template 61 | 62 | dataset=read_jsonl('data/leetcode-py-all.jsonl') 63 | 64 | model = AutoModelForCausalLM.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 65 | tokenizer = AutoTokenizer.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, 
trust_remote_code=True, device_map='auto') 66 | generator = pipeline("text-generation",model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16, device_map='auto', token=access_token) 67 | 68 | data_size=len(dataset) 69 | #data_size=50 70 | testing_results=[] 71 | for i in tqdm(range(data_size)): 72 | data=dataset[i] 73 | func_name=data['func_name'] 74 | desc=data['description'] 75 | code=data['python_solution'] 76 | difficulty=data['difficulty'] 77 | #code_withlineno=add_lineno(code) 78 | code_withlineno=add_lineno_comment(code) 79 | #print(code_withlineno) 80 | 81 | #generate test case 82 | target_lines=data['target_lines'] 83 | tests={} 84 | conds={} #store generated conditions 85 | print(data['task_num'],target_lines) 86 | 87 | for lineno in target_lines: #line number to be tested 88 | code_lines=code.split('\n') 89 | target_line=code_lines[lineno-1] 90 | target_line_withlineno=f'{lineno}: {target_line}' 91 | 92 | code_input=code_withlineno 93 | line_input=target_line_withlineno 94 | 95 | prompt_cond=prompt_template_cond.format(program=code_input, targetline=lineno) 96 | generated_cond=generate_completion(args,generator,prompt_cond,system_message) 97 | prompt_test=prompt_template_test.format(func_name=func_name, program=code_input, conditions=generated_cond) 98 | generated_test=generate_completion(args,generator,prompt_test,system_message) 99 | print(generated_cond) 100 | print('--------') 101 | print(generated_test) 102 | print('<--------------------------------------->') 103 | tests[lineno]=generated_test 104 | conds[lineno]=generated_cond 105 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests, 'conditions':conds} 106 | 107 | testing_results.append(testing_data) 108 | print('<<<<----------------------------------------->>>>') 109 | write_jsonl(testing_results, output_dir / f'linecov2_{model_abbrv}_temp.jsonl') 110 | write_jsonl(testing_results, output_dir / f'linecov2_{model_abbrv}_1shot.jsonl') 111 | -------------------------------------------------------------------------------- /gen_linecov_cot_openai.py: -------------------------------------------------------------------------------- 1 | import os 2 | from argparse import ArgumentParser 3 | from tqdm import tqdm 4 | import openai 5 | from openai import OpenAI 6 | openai.api_key=os.getenv("OPENAI_API_KEY") 7 | client=OpenAI(api_key=openai.api_key) 8 | from pathlib import Path 9 | from data_utils import read_jsonl, write_jsonl, add_lineno, add_lineno_comment 10 | 11 | 12 | def parse_args(): 13 | parser = ArgumentParser() 14 | parser.add_argument("--dataset", type=str, default='leetcode') 15 | parser.add_argument("--model", type=str, default='gpt-3.5-turbo') 16 | 17 | parser.add_argument("--max_tokens", type=int, default=1024) 18 | parser.add_argument("--temperature", type=float, default=0) 19 | return parser.parse_args() 20 | 21 | 22 | def generate_completion(args,prompt,system_message=''): 23 | response = client.chat.completions.create( 24 | model=args.model, 25 | messages=[ 26 | {"role": "system", "content": system_message}, 27 | {"role": "user", "content": prompt}, 28 | ], 29 | temperature=args.temperature, 30 | max_tokens=args.max_tokens 31 | ) 32 | code_output=response.choices[0].message.content 33 | return code_output 34 | 35 | 36 | def generate_twostep(args,prompt_cond, prompt_test,system_message=''): 37 | messages=[ 38 | {"role": "system", "content": system_message}, 39 | {"role": "user", "content": prompt_cond}, 40 | ] 41 | 
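    # Two chat completions are chained: the first turn asks the model for the
    # conditions needed to reach the target line; its reply is appended to the
    # conversation as an assistant message, and a second turn then asks for a
    # test case that satisfies those conditions.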
response = client.chat.completions.create( 42 | model=args.model, 43 | messages=messages, 44 | temperature=args.temperature, 45 | max_tokens=args.max_tokens 46 | ) 47 | cond=response.choices[0].message.content 48 | print(cond) 49 | print('---------------------------------') 50 | 51 | messages.append({"role": "assistant", "content": cond}) 52 | messages.append({"role": "user", "content": prompt_test}) 53 | response = client.chat.completions.create( 54 | model=args.model, 55 | messages=messages, 56 | temperature=args.temperature, 57 | max_tokens=args.max_tokens 58 | ) 59 | generated_test=response.choices[0].message.content 60 | print(generated_test) 61 | return cond, generated_test 62 | 63 | 64 | if __name__=='__main__': 65 | args=parse_args() 66 | print('Model:', args.model) 67 | output_dir = Path('predictions') 68 | 69 | #two steps reasoning: generate conditions, then generate a test that satisfies the conditions 70 | prompt_template_cond=open('prompt/line_oneshot_gencond.txt').read() 71 | prompt_template_test=open('prompt/line_oneshot_gentest.txt').read() 72 | system_template=open('prompt/system_exec.txt').read() 73 | system_message=system_template 74 | 75 | dataset=read_jsonl('data/leetcode-py-all.jsonl') 76 | 77 | data_size=len(dataset) 78 | #data_size=50 79 | testing_results=[] 80 | for i in tqdm(range(data_size)): 81 | data=dataset[i] 82 | func_name=data['func_name'] 83 | desc=data['description'] 84 | code=data['python_solution'] 85 | difficulty=data['difficulty'] 86 | #code_withlineno=add_lineno(code) 87 | code_withlineno=add_lineno_comment(code) 88 | #print(code_withlineno) 89 | 90 | #generate test case 91 | target_lines=data['target_lines'] 92 | tests={} 93 | conds={} #store generated conditions 94 | print(data['task_num'],target_lines) 95 | 96 | for lineno in target_lines: #line number to be tested 97 | code_lines=code.split('\n') 98 | target_line=code_lines[lineno-1] 99 | target_line_withlineno=f'{lineno}: {target_line}' 100 | 101 | code_input=code_withlineno 102 | line_input=target_line_withlineno 103 | 104 | prompt_cond=prompt_template_cond.format(program=code_input, targetline=lineno) 105 | generated_cond=generate_completion(args,prompt_cond,system_message) 106 | prompt_test=prompt_template_test.format(func_name=func_name, program=code_input, conditions=generated_cond) 107 | generated_test=generate_completion(args,prompt_test,system_message) 108 | print(generated_cond) 109 | print('--------') 110 | print(generated_test) 111 | print('<--------------------------------------->') 112 | tests[lineno]=generated_test 113 | conds[lineno]=generated_cond 114 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests, 'conditions':conds} 115 | 116 | 117 | testing_results.append(testing_data) 118 | print('<<<<----------------------------------------->>>>') 119 | write_jsonl(testing_results, output_dir / f'linecov2_{args.model}_temp.jsonl') 120 | write_jsonl(testing_results, output_dir / f'linecov2_{args.model}_1shot.jsonl') 121 | -------------------------------------------------------------------------------- /generate_cov_gemini.py: -------------------------------------------------------------------------------- 1 | import google.generativeai as genai 2 | from google.generativeai import GenerationConfig 3 | import os 4 | import time 5 | from pathlib import Path 6 | from argparse import ArgumentParser 7 | from tqdm import tqdm 8 | 9 | api_key=os.getenv("GOOGLE_API_KEY") 10 | 11 | from data_utils 
import read_jsonl, write_jsonl, add_lineno 12 | 13 | genai.configure(api_key=api_key) 14 | 15 | def parse_args(): 16 | parser = ArgumentParser() 17 | parser.add_argument("--dataset", type=str, default='leetcode') 18 | parser.add_argument("--lang", type=str, default='python') 19 | parser.add_argument("--model", type=str, default='models/gemini-1.0-pro-latest', choices=['models/gemini-1.0-pro-latest', 'models/gemini-1.5-pro-latest', 'models/gemini-1.5-flash-latest']) 20 | parser.add_argument("--num_tests", type=int, default=20, help='number of tests generated per program') 21 | parser.add_argument("--temperature", type=float, default=0) 22 | parser.add_argument("--max_tokens", type=int, default=256) 23 | return parser.parse_args() 24 | 25 | 26 | def testgeneration_multiround(args, model, prompt): 27 | """generate test cases with multi-round conversation, each time generate one test case""" 28 | template_append="Generate another test method for the function under test. Your answer must be different from previously-generated test cases, and should cover different statements and branches." 29 | generated_tests=[] 30 | 31 | for i in range(args.num_tests): 32 | generated=model.generate_content(prompt, generation_config=generation_config) 33 | if generated.candidates[0].finish_reason==1: #normal stop 34 | generated_test=generated.text 35 | else: #max_token, safety, ... 36 | generated_test='' 37 | print(generated_test) 38 | 39 | test_append=f'''Generated test: 40 | {generated_test}''' 41 | prompt+=test_append 42 | prompt+='\n' 43 | prompt+=template_append 44 | 45 | generated_tests.append(generated_test) 46 | 47 | return generated_tests 48 | 49 | 50 | if __name__=='__main__': 51 | args=parse_args() 52 | print('Model:', args.model) 53 | model_abbrv=args.model.split('/')[-1] 54 | model = genai.GenerativeModel(args.model) 55 | #print(model) 56 | output_dir = Path('predictions') 57 | 58 | dataset=read_jsonl('LC_data/leetcode-bench-py.jsonl') 59 | 60 | prompt_template=open('prompt/template_base.txt').read() 61 | system_template=open('prompt/system.txt').read() 62 | system_message=system_template.format(lang='python') 63 | 64 | generation_config = GenerationConfig( 65 | candidate_count=1, 66 | max_output_tokens=args.max_tokens, 67 | temperature=args.temperature 68 | ) 69 | 70 | data_size=len(dataset) 71 | testing_results=[] 72 | for i in tqdm(range(data_size)): 73 | data=dataset[i] 74 | func_name=data['func_name'] 75 | desc=data['description'] 76 | 77 | code=data['python_solution'] 78 | difficulty=data['difficulty'] 79 | code_withlineno=add_lineno(code) 80 | target_lines=data['target_lines'] 81 | 82 | #generate test cases 83 | prompt=prompt_template.format(lang='python', program=code, description=desc, func_name=func_name) 84 | prompt=system_message+prompt 85 | generated_tests=testgeneration_multiround(args,model,prompt) 86 | 87 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_tests} 88 | testing_results.append(testing_data) 89 | print('<<<<----------------------------------------->>>>') 90 | write_jsonl(testing_results, output_dir / f'totalcov_{model_abbrv}_temp.jsonl') 91 | 92 | write_jsonl(testing_results, output_dir / f'totalcov_{model_abbrv}.jsonl') 93 | -------------------------------------------------------------------------------- /generate_cov_hf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import transformers 3 | import torch 4 | from 
pathlib import Path 5 | from argparse import ArgumentParser 6 | from tqdm import tqdm 7 | 8 | from transformers import LlamaForCausalLM, CodeLlamaTokenizer, AutoTokenizer, AutoModelForCausalLM 9 | from transformers import pipeline 10 | access_token=os.getenv("HUGGINGFACE_TOKEN") 11 | 12 | from data_utils import read_jsonl, write_jsonl, add_lineno 13 | 14 | 15 | def parse_args(): 16 | parser = ArgumentParser() 17 | parser.add_argument("--dataset", type=str, default='leetcode') 18 | parser.add_argument("--model", type=str, default='codellama/CodeLlama-7b-Instruct-hf') 19 | parser.add_argument("--num_tests", type=int, default=10, help='number of tests generated per program') 20 | parser.add_argument("--temperature", type=float, default=1e-5) 21 | parser.add_argument("--max_tokens", type=int, default=256) 22 | return parser.parse_args() 23 | 24 | model_list=['codellama/CodeLlama-7b-Instruct-hf','codellama/CodeLlama-13b-Instruct-hf','codellama/CodeLlama-34b-Instruct-hf', 25 | 'meta-llama/Meta-Llama-3-8B-Instruct', 26 | 'bigcode/starcoder2-15b-instruct-v0.1', 27 | 'google/gemma-1.1-2b-it', 'google/gemma-1.1-7b-it', 28 | 'google/codegemma-7b-it', 29 | 'deepseek-ai/deepseek-coder-1.3b-instruct', 'deepseek-ai/deepseek-coder-6.7b-instruct', 30 | 'deepseek-ai/deepseek-coder-33b-instruct', 31 | 'mistralai/Mistral-7B-Instruct-v0.2', 'mistralai/Mistral-7B-Instruct-v0.3', 32 | 'Qwen/CodeQwen1.5-7B-Chat' 33 | ] 34 | 35 | #models do not support system message 36 | models_nosys=['google/gemma-1.1-7b-it', 37 | 'bigcode/starcoder2-15b-instruct-v0.1', 38 | 'mistralai/Mistral-7B-Instruct-v0.3'] 39 | 40 | 41 | def testgeneration_multiround(args,prompt,system_message=''): 42 | """generate test cases with multi-round conversation, each time generate one test case""" 43 | template_append="Generate another test method for the function under test. Your answer must be different from previously-generated test cases, and should cover different statements and branches."
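    # Multi-round generation: the chat history grows across iterations, so each
    # round the full message list is re-rendered with the tokenizer's chat
    # template, one new test is sampled, and the reply plus the follow-up
    # instruction above are appended before the next round.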
44 | generated_tests=[] 45 | 46 | if args.model in models_nosys: #models don't support system message 47 | messages=[{"role": "user", "content": system_message+prompt}] 48 | else: 49 | messages=[ 50 | {"role": "system", "content": system_message}, 51 | {"role": "user", "content": prompt}, 52 | ] 53 | 54 | for i in range(args.num_tests): 55 | prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 56 | generated=generator(prompt, 57 | max_new_tokens=args.max_tokens, 58 | temperature=args.temperature, 59 | return_full_text=False) 60 | 61 | generated_test=generated[0]['generated_text'] 62 | print(generated_test) 63 | 64 | messages.append({"role": "assistant", "content": generated_test}) 65 | messages.append({"role": "user", "content": template_append}) 66 | 67 | 68 | generated_tests.append(generated_test) 69 | return generated_tests 70 | 71 | 72 | if __name__=='__main__': 73 | args=parse_args() 74 | model_abbrv=args.model.split('/')[-1] 75 | print('Model:', model_abbrv) 76 | output_dir = Path('predictions') 77 | 78 | dataset=read_jsonl('data/leetcode-py.jsonl') 79 | 80 | prompt_template=open('prompt/template_base.txt').read() 81 | system_template=open('prompt/system.txt').read() 82 | system_message=system_template.format(lang='python') 83 | 84 | model = AutoModelForCausalLM.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 85 | tokenizer = AutoTokenizer.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 86 | generator = pipeline("text-generation",model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16, device_map='auto', token=access_token) 87 | 88 | data_size=len(dataset) 89 | testing_results=[] 90 | print('number of samples:',len(dataset)) 91 | 92 | for i in tqdm(range(data_size)): 93 | data=dataset[i] 94 | func_name=data['func_name'] 95 | desc=data['description'] 96 | code=data['python_solution'] 97 | difficulty=data['difficulty'] 98 | code_withlineno=add_lineno(code) 99 | target_lines=data['target_lines'] 100 | 101 | prompt=prompt_template.format(lang='python', program=code, description=desc, func_name=func_name) 102 | generated_tests=testgeneration_multiround(args,prompt,system_message) 103 | 104 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_tests} 105 | testing_results.append(testing_data) 106 | print('<<<<----------------------------------------->>>>') 107 | write_jsonl(testing_results,output_dir / f'totalcov_{model_abbrv}_temp.jsonl') 108 | 109 | write_jsonl(testing_results,output_dir / f'totalcov_{model_abbrv}.jsonl') 110 | -------------------------------------------------------------------------------- /generate_cov_openai.py: -------------------------------------------------------------------------------- 1 | #baseline for targeted line coverage: not providing the target line number 2 | import os 3 | from pathlib import Path 4 | from argparse import ArgumentParser 5 | from tqdm import tqdm 6 | import openai 7 | from openai import OpenAI 8 | openai.api_key=os.getenv("OPENAI_API_KEY") #personal key 9 | client=OpenAI(api_key=openai.api_key) 10 | 11 | from data_utils import read_jsonl, write_jsonl, add_lineno 12 | 13 | 14 | def parse_args(): 15 | parser = ArgumentParser() 16 | parser.add_argument("--dataset", type=str, default='leetcode') 17 | parser.add_argument("--lang", type=str, 
default='python') 18 | parser.add_argument("--model", type=str, default='gpt-3.5-turbo', choices=['gpt-3.5-turbo', 'gpt-4', 'gpt-4-turbo', 'gpt-4o']) 19 | parser.add_argument("--num_tests", type=int, default=10, help='number of tests generated per program') 20 | parser.add_argument("--temperature", type=float, default=0) 21 | parser.add_argument("--max_tokens", type=int, default=256) 22 | return parser.parse_args() 23 | 24 | 25 | def generate_completion(args,prompt,system_message=''): 26 | response = client.chat.completions.create( 27 | model=args.model, 28 | messages=[ 29 | {"role": "system", "content": system_message}, 30 | {"role": "user", "content": prompt}, 31 | ], 32 | temperature=args.temperature, 33 | max_tokens=args.max_tokens 34 | ) 35 | code_output=response.choices[0].message.content 36 | return code_output 37 | 38 | 39 | def testgeneration_multiround(args,prompt,system_message=''): 40 | """generate test cases with multi-round conversation, each time generate one test case""" 41 | template_append="Generate another test method for the function under test. Your answer must be different from previously-generated test cases, and should cover different statements and branches." 42 | generated_tests=[] 43 | messages=[ 44 | {"role": "system", "content": system_message}, 45 | {"role": "user", "content": prompt}, 46 | ] 47 | for i in range(args.num_tests): 48 | response = client.chat.completions.create( 49 | model=args.model, 50 | messages=messages, 51 | temperature=args.temperature, 52 | max_tokens=args.max_tokens 53 | ) 54 | generated_test=response.choices[0].message.content 55 | messages.append({"role": "assistant", "content": generated_test}) 56 | messages.append({"role": "user", "content": template_append}) 57 | 58 | generated_tests.append(generated_test) 59 | print(generated_test) 60 | 61 | return generated_tests 62 | 63 | 64 | lang_exts={'python':'py', 'java':'java', 'c++':'cpp'} 65 | 66 | 67 | if __name__=='__main__': 68 | args=parse_args() 69 | print('Model:', args.model) 70 | output_dir = Path('predictions') 71 | 72 | dataset=read_jsonl('data/leetcode-py.jsonl') 73 | 74 | prompt_template=open('prompt/template_base.txt').read() 75 | system_template=open('prompt/system.txt').read() 76 | system_message=system_template.format(lang='python') 77 | 78 | data_size=len(dataset) 79 | 80 | testing_results=[] 81 | for i in tqdm(range(data_size)): 82 | data=dataset[i] 83 | func_name=data['func_name'] 84 | desc=data['description'] 85 | code=data['python_solution'] 86 | difficulty=data['difficulty'] 87 | code_withlineno=add_lineno(code) 88 | target_lines=data['target_lines'] 89 | 90 | #generate test case 91 | prompt=prompt_template.format(lang='python', program=code, description=desc, func_name=func_name) 92 | generated_tests=testgeneration_multiround(args,prompt,system_message) 93 | 94 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_tests} 95 | testing_results.append(testing_data) 96 | print('<<<<----------------------------------------->>>>') 97 | write_jsonl(testing_results, output_dir / f'totalcov_{args.model}_temp.jsonl') 98 | 99 | write_jsonl(testing_results, output_dir / f'totalcov_{args.model}.jsonl') 100 | -------------------------------------------------------------------------------- /generate_pathcov_gemini.py: -------------------------------------------------------------------------------- 1 | import google.generativeai as genai 2 | from google.generativeai import 
GenerationConfig 3 | 4 | import os 5 | import time 6 | 7 | api_key=os.getenv("GOOGLE_API_KEY") 8 | from pathlib import Path 9 | from argparse import ArgumentParser 10 | from tqdm import tqdm 11 | 12 | from data_utils import read_jsonl, write_jsonl, add_lineno 13 | from prompt_utils import generate_path 14 | 15 | genai.configure(api_key=api_key) 16 | 17 | def parse_args(): 18 | parser = ArgumentParser() 19 | parser.add_argument("--dataset", type=str, default='leetcode') 20 | parser.add_argument("--lang", type=str, default='python') 21 | parser.add_argument("--model", type=str, default='models/gemini-1.5-flash-latest', choices=['models/gemini-1.0-pro-latest', 'models/gemini-1.5-pro-latest', 'models/gemini-1.5-flash-latest']) 22 | parser.add_argument("--temperature", type=float, default=0) 23 | parser.add_argument("--max_tokens", type=int, default=256) 24 | return parser.parse_args() 25 | 26 | 27 | if __name__=='__main__': 28 | args=parse_args() 29 | print('Model:', args.model) 30 | model_abbrv=args.model.split('/')[-1] 31 | model = genai.GenerativeModel(args.model) 32 | output_dir = Path('predictions') 33 | 34 | prompt_template=open('prompt/template_path.txt').read() 35 | system_template=open('prompt/system.txt').read() 36 | system_message=system_template.format(lang='python') 37 | 38 | generation_config = GenerationConfig( 39 | candidate_count=1, 40 | max_output_tokens=args.max_tokens, 41 | temperature=args.temperature 42 | ) 43 | 44 | dataset=read_jsonl('data/leetcode-py-instrumented.jsonl') 45 | path_dataset=read_jsonl('data/tgt_paths.jsonl') 46 | data_size=len(dataset) 47 | testing_results=[] 48 | 49 | for i in tqdm(range(data_size)): 50 | data=dataset[i] 51 | func_name=data['func_name'] 52 | desc=data['description'] 53 | code=data['python_solution'] 54 | difficulty=data['difficulty'] 55 | code_withlineno=add_lineno(code) 56 | log_paths=path_dataset[i]['sampled_paths'] 57 | condition_paths=path_dataset[i]['sampled_condition_paths'] 58 | generated_path_tests=[] 59 | for j in range(len(log_paths)): 60 | log_path=log_paths[j] 61 | condition_path=condition_paths[j] 62 | #print(log_path, condition_path) 63 | 64 | path_prompt=generate_path(condition_path) 65 | prompt=prompt_template.format(func_name=func_name, description=desc, program=code_withlineno, path=path_prompt) 66 | prompt=system_message+prompt 67 | 68 | generated=model.generate_content(prompt, generation_config=generation_config) 69 | if generated.candidates[0].finish_reason==1: #normal stop 70 | generated_test=generated.text 71 | else: #max_token, safety, ... 
72 | generated_test='' 73 | print(generated_test) 74 | generated_path_tests.append(generated_test) 75 | 76 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_path_tests} 77 | testing_results.append(testing_data) 78 | 79 | write_jsonl(testing_results, output_dir / f'pathcov_{model_abbrv}.jsonl') 80 | -------------------------------------------------------------------------------- /generate_pathcov_hf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import transformers 3 | import torch 4 | import textwrap 5 | from pathlib import Path 6 | from argparse import ArgumentParser 7 | from tqdm import tqdm 8 | 9 | from transformers import LlamaForCausalLM, CodeLlamaTokenizer, AutoTokenizer, AutoModelForCausalLM 10 | from transformers import pipeline 11 | access_token=os.getenv("HUGGINGFACE_TOKEN") 12 | 13 | from data_utils import read_jsonl, write_jsonl, add_lineno 14 | from prompt_utils import generate_path 15 | 16 | 17 | def parse_args(): 18 | parser = ArgumentParser() 19 | parser.add_argument("--dataset", type=str, default='leetcode') 20 | parser.add_argument("--lang", type=str, default='python', choices=['python', 'java', 'c++']) 21 | parser.add_argument("--model", type=str, default='codellama/CodeLlama-7b-Instruct-hf') 22 | parser.add_argument("--temperature", type=float, default=1e-5) 23 | parser.add_argument("--max_tokens", type=int, default=256) 24 | return parser.parse_args() 25 | 26 | 27 | model_list=['codellama/CodeLlama-7b-Instruct-hf','codellama/CodeLlama-13b-Instruct-hf','codellama/CodeLlama-34b-Instruct-hf', 28 | 'meta-llama/Meta-Llama-3-8B-Instruct', 29 | 'bigcode/starcoder2-15b-instruct-v0.1', 30 | 'google/gemma-1.1-7b-it' 31 | 'deepseek-ai/deepseek-coder-1.3b-instruct', 'deepseek-ai/deepseek-coder-6.7b-instruct', 32 | 'deepseek-ai/deepseek-coder-33b-instruct', 33 | 'mistralai/Mistral-7B-Instruct-v0.3' 34 | ] 35 | 36 | #models do not support system message 37 | models_nosys=['google/gemma-1.1-7b-it', 38 | 'bigcode/starcoder2-15b-instruct-v0.1', 39 | 'mistralai/Mistral-7B-Instruct-v0.3'] 40 | 41 | 42 | if __name__=='__main__': 43 | args=parse_args() 44 | model_abbrv=args.model.split('/')[-1] 45 | print('Model:', model_abbrv) 46 | output_dir = Path('predictions') 47 | 48 | prompt_template=open('prompt/template_path.txt').read() 49 | system_template=open('prompt/system.txt').read() 50 | system_message=system_template.format(lang='python') 51 | 52 | dataset=read_jsonl('data/leetcode-py-instrumented.jsonl') 53 | path_dataset=read_jsonl('data/tgt_paths.jsonl') 54 | data_size=len(dataset) 55 | testing_results=[] 56 | 57 | model = AutoModelForCausalLM.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 58 | tokenizer = AutoTokenizer.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 59 | generator = pipeline("text-generation",model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16, device_map='auto', token=access_token) 60 | 61 | for i in tqdm(range(data_size)): 62 | data=dataset[i] 63 | func_name=data['func_name'] 64 | desc=data['description'] 65 | code=data['python_solution'] 66 | difficulty=data['difficulty'] 67 | code_withlineno=add_lineno(code) 68 | log_paths=path_dataset[i]['sampled_paths'] 69 | condition_paths=path_dataset[i]['sampled_condition_paths'] 70 | generated_path_tests=[] 71 | for j in 
range(len(log_paths)): 72 | log_path=log_paths[j] 73 | condition_path=condition_paths[j] 74 | #print(log_path, condition_path) 75 | path_prompt=generate_path(condition_path) 76 | 77 | prompt=prompt_template.format(func_name=func_name, description=desc, program=code_withlineno, path=path_prompt) 78 | if args.model in models_nosys: #models don't support system message 79 | messages=[{"role": "user", "content": system_message+prompt}] 80 | else: 81 | messages=[ 82 | {"role": "system", "content": system_message}, 83 | {"role": "user", "content": prompt}, 84 | ] 85 | prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 86 | generated=generator(prompt, 87 | max_new_tokens=args.max_tokens, 88 | temperature=args.temperature, 89 | return_full_text=False) 90 | generated_test=generated[0]['generated_text'] 91 | if generated_test.startswith(' '): #remove extra indents (encountered in codellama) 92 | generated_test=textwrap.dedent(generated_test) 93 | print(generated_test) 94 | generated_path_tests.append(generated_test) 95 | 96 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_path_tests} 97 | testing_results.append(testing_data) 98 | 99 | write_jsonl(testing_results, output_dir / f'pathcov_{model_abbrv}.jsonl') 100 | -------------------------------------------------------------------------------- /generate_pathcov_openai.py: -------------------------------------------------------------------------------- 1 | import os 2 | from argparse import ArgumentParser 3 | from tqdm import tqdm 4 | import openai 5 | from openai import OpenAI 6 | openai.api_key=os.getenv("OPENAI_API_KEY") #personal key 7 | client=OpenAI(api_key=openai.api_key) 8 | from pathlib import Path 9 | from data_utils import read_jsonl, write_jsonl, add_lineno 10 | from prompt_utils import generate_path 11 | 12 | def parse_args(): 13 | parser = ArgumentParser() 14 | parser.add_argument("--dataset", type=str, default='leetcode') 15 | parser.add_argument("--lang", type=str, default='python') 16 | parser.add_argument("--model", type=str, default='gpt-3.5-turbo', choices=['gpt-3.5-turbo', 'gpt-4', 'gpt-4-turbo', 'gpt-4o']) 17 | parser.add_argument("--max_tokens", type=int, default=256) 18 | parser.add_argument("--temperature", type=float, default=0) 19 | return parser.parse_args() 20 | 21 | 22 | def generate_completion(args,prompt,system_message=''): 23 | response = client.chat.completions.create( 24 | model=args.model, 25 | messages=[ 26 | {"role": "system", "content": system_message}, 27 | {"role": "user", "content": prompt}, 28 | ], 29 | temperature=args.temperature, 30 | max_tokens=args.max_tokens 31 | ) 32 | code_output=response.choices[0].message.content 33 | return code_output 34 | 35 | 36 | if __name__=='__main__': 37 | args=parse_args() 38 | print('Model:', args.model) 39 | output_dir = Path('predictions') 40 | 41 | prompt_template=open('prompt/template_path.txt').read() 42 | system_template=open('prompt/system.txt').read() 43 | system_message=system_template.format(lang='python') 44 | 45 | dataset=read_jsonl('data/leetcode-py-instrumented.jsonl') 46 | path_dataset=read_jsonl('data/tgt_paths.jsonl') 47 | data_size=len(dataset) 48 | testing_results=[] 49 | 50 | for i in tqdm(range(data_size)): 51 | data=dataset[i] 52 | func_name=data['func_name'] 53 | desc=data['description'] 54 | code=data['python_solution'] 55 | difficulty=data['difficulty'] 56 | code_withlineno=add_lineno(code) 57 | 
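        # For each sampled target path, the human-readable condition sequence is
        # rendered as "'cond1' -> 'cond2' -> ..." via generate_path() and substituted
        # into the {path} slot of prompt/template_path.txt before querying the model.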
58 | log_paths=path_dataset[i]['sampled_paths'] 59 | condition_paths=path_dataset[i]['sampled_condition_paths'] 60 | generated_path_tests=[] 61 | for j in range(len(log_paths)): 62 | log_path=log_paths[j] 63 | condition_path=condition_paths[j] 64 | #print(log_path, condition_path) 65 | path_prompt=generate_path(condition_path) 66 | 67 | prompt=prompt_template.format(func_name=func_name, description=desc, program=code_withlineno, path=path_prompt) 68 | 69 | generated_test=generate_completion(args,prompt,system_message) 70 | print(generated_test) 71 | generated_path_tests.append(generated_test) 72 | 73 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_path_tests} 74 | testing_results.append(testing_data) 75 | 76 | write_jsonl(testing_results, output_dir / f'pathcov_{args.model}.jsonl') 77 | -------------------------------------------------------------------------------- /generate_targetcov_gemini.py: -------------------------------------------------------------------------------- 1 | import google.generativeai as genai 2 | from google.generativeai import GenerationConfig 3 | from google.api_core.exceptions import InternalServerError, ResourceExhausted 4 | 5 | import os 6 | import time 7 | 8 | api_key=os.getenv("GOOGLE_API_KEY") 9 | from argparse import ArgumentParser 10 | from tqdm import tqdm 11 | from pathlib import Path 12 | 13 | from data_utils import read_jsonl, write_jsonl, add_lineno 14 | 15 | genai.configure(api_key=api_key) 16 | 17 | def parse_args(): 18 | parser = ArgumentParser() 19 | parser.add_argument("--dataset", type=str, default='leetcode') 20 | parser.add_argument("--lang", type=str, default='python') 21 | parser.add_argument("--model", type=str, default='models/gemini-1.0-pro-latest', choices=['models/gemini-1.0-pro-latest', 'models/gemini-1.5-pro-latest']) 22 | parser.add_argument("--covmode", type=str, default='line', choices=['line', 'branch'], help='cover targets at line level or branch level') 23 | parser.add_argument("--max_tokens", type=int, default=256) 24 | parser.add_argument("--temperature", type=float, default=0) 25 | return parser.parse_args() 26 | 27 | 28 | if __name__=='__main__': 29 | args=parse_args() 30 | print('Model:', args.model) 31 | model_abbrv=args.model.split('/')[-1] 32 | model = genai.GenerativeModel(args.model) 33 | print('task:', args.covmode) 34 | output_dir = Path('predictions') 35 | 36 | prompt_template=open('prompt/template_line.txt').read() 37 | prompt_template_branch=open('prompt/template_branch.txt').read() 38 | system_template=open('prompt/system.txt').read() 39 | system_message=system_template.format(lang='python') 40 | 41 | dataset=read_jsonl('data/leetcode-py.jsonl') 42 | 43 | generation_config = GenerationConfig( 44 | candidate_count=1, 45 | max_output_tokens=args.max_tokens, 46 | temperature=args.temperature 47 | ) 48 | 49 | data_size=len(dataset) 50 | testing_results=[] 51 | for i in tqdm(range(data_size)): 52 | data=dataset[i] 53 | func_name=data['func_name'] 54 | desc=data['description'] 55 | code=data['python_solution'] 56 | difficulty=data['difficulty'] 57 | code_withlineno=add_lineno(code) 58 | 59 | #generate test case 60 | if args.covmode=='line': 61 | target_lines=data['target_lines'] 62 | tests={} 63 | print(data['task_num'],target_lines) 64 | for lineno in target_lines: #line number to be tested 65 | code_lines=code.split('\n') 66 | target_line=code_lines[lineno-1] 67 | target_line_withlineno=f'{lineno}: {target_line}' 68 
| 69 | code_input=code_withlineno 70 | line_input=target_line_withlineno 71 | 72 | prompt=prompt_template.format(lang='python', program=code_input, description=desc, func_name=func_name, lineno=line_input) 73 | prompt=system_message+prompt 74 | 75 | generated=model.generate_content(prompt, generation_config=generation_config) 76 | 77 | if generated.candidates[0].finish_reason==1: #normal stop 78 | generated_test=generated.text 79 | else: #max_token, safety, ... 80 | generated_test='' 81 | 82 | print(generated_test) 83 | tests[lineno]=generated_test 84 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests} 85 | 86 | elif args.covmode=='branch': 87 | tests_branch=[] 88 | print(data['task_num']) 89 | branches=data['blocks'] 90 | for branch in branches: 91 | print(branch) 92 | startline=branch['start'] 93 | endline=branch['end'] 94 | 95 | code_input=code_withlineno 96 | split_lines=code_withlineno.split('\n') 97 | target_lines=split_lines[startline-1:endline] 98 | target_branch_withlineno='\n'.join(target_lines) 99 | branch_input="\n'''\n"+target_branch_withlineno+"\n'''" 100 | 101 | prompt=prompt_template_branch.format(lang='python', program=code_input, description=desc, func_name=func_name, branch=branch_input) 102 | prompt=system_message+prompt 103 | generated=model.generate_content(prompt, generation_config=generation_config) 104 | if generated.candidates[0].finish_reason==1: #normal stop 105 | generated_test=generated.text 106 | else: #max_token, safety, ... 107 | generated_test='' 108 | print(generated_test) 109 | generatedtest_branch={'start':startline,'end':endline,'test':generated_test} 110 | tests_branch.append(generatedtest_branch) 111 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests_branch} 112 | 113 | testing_results.append(testing_data) 114 | 115 | print('<<<<----------------------------------------->>>>') 116 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{model_abbrv}_temp.jsonl') 117 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{model_abbrv}.jsonl') 118 | -------------------------------------------------------------------------------- /generate_targetcov_hf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import transformers 3 | import torch 4 | import textwrap 5 | from argparse import ArgumentParser 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | 9 | from transformers import LlamaForCausalLM, CodeLlamaTokenizer, AutoTokenizer, AutoModelForCausalLM 10 | from transformers import pipeline 11 | access_token=os.getenv("HUGGINGFACE_TOKEN") 12 | 13 | from data_utils import read_jsonl, write_jsonl, add_lineno 14 | 15 | def parse_args(): 16 | parser = ArgumentParser() 17 | parser.add_argument("--dataset", type=str, default='leetcode') 18 | parser.add_argument("--lang", type=str, default='python', choices=['python', 'java', 'c++']) 19 | parser.add_argument("--model", type=str, default='meta-llama/Meta-Llama-3-8B-Instruct') 20 | parser.add_argument("--covmode", type=str, default='line', choices=['line', 'branch'], help='cover targets at line level or branch level') 21 | parser.add_argument("--max_tokens", type=int, default=256) 22 | parser.add_argument("--temperature", type=float, default=1e-5) 23 | return parser.parse_args() 24 | 25 | 
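# model_list below appears to be a reference list of the chat checkpoints evaluated
# with this script (it is not read anywhere else in the file); models_nosys marks the
# models whose chat templates do not accept a separate system turn, for which the
# system prompt is prepended to the user message instead.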
model_list=['codellama/CodeLlama-7b-Instruct-hf','codellama/CodeLlama-13b-Instruct-hf','codellama/CodeLlama-34b-Instruct-hf', 26 | 'meta-llama/Meta-Llama-3-8B-Instruct', 27 | 'bigcode/starcoder2-15b-instruct-v0.1', 28 | 'google/gemma-1.1-7b-it' 29 | 'deepseek-ai/deepseek-coder-1.3b-instruct', 'deepseek-ai/deepseek-coder-6.7b-instruct', 30 | 'deepseek-ai/deepseek-coder-33b-instruct', 31 | 'mistralai/Mistral-7B-Instruct-v0.3' 32 | ] 33 | 34 | #models do not support system message 35 | models_nosys=['google/gemma-1.1-7b-it', 36 | 'bigcode/starcoder2-15b-instruct-v0.1', 37 | 'mistralai/Mistral-7B-Instruct-v0.3'] 38 | 39 | 40 | if __name__=='__main__': 41 | args=parse_args() 42 | model_abbrv=args.model.split('/')[-1] 43 | print('Model:', model_abbrv) 44 | print('task:', args.covmode) 45 | output_dir = Path('predictions') 46 | 47 | prompt_template=open('prompt/template_line.txt').read() 48 | prompt_template_branch=open('prompt/template_branch.txt').read() 49 | system_template=open('prompt/system.txt').read() 50 | system_message=system_template.format(lang='python') 51 | 52 | dataset=read_jsonl('data/leetcode-py.jsonl') 53 | 54 | model = AutoModelForCausalLM.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 55 | tokenizer = AutoTokenizer.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 56 | generator = pipeline("text-generation",model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16, device_map='auto', token=access_token) 57 | 58 | data_size=len(dataset) 59 | testing_results=[] 60 | print('number of samples:',len(dataset)) 61 | 62 | for i in tqdm(range(data_size)): 63 | data=dataset[i] 64 | func_name=data['func_name'] 65 | desc=data['description'] 66 | code=data['python_solution'] 67 | difficulty=data['difficulty'] 68 | code_withlineno=add_lineno(code) 69 | 70 | #generate test case 71 | if args.covmode=='line': 72 | target_lines=data['target_lines'] 73 | tests={} 74 | print(data['task_num'],target_lines) 75 | for lineno in target_lines: #line number to be tested 76 | code_lines=code.split('\n') 77 | target_line=code_lines[lineno-1] 78 | target_line_withlineno=f'{lineno}: {target_line}' 79 | 80 | code_input=code_withlineno 81 | line_input=target_line_withlineno 82 | 83 | prompt=prompt_template.format(lang='python', program=code_input, description=desc, func_name=func_name, lineno=line_input) 84 | if args.model in models_nosys: #models don't support system message 85 | messages=[{"role": "user", "content": system_message+prompt}] 86 | else: 87 | messages=[ 88 | {"role": "system", "content": system_message}, 89 | {"role": "user", "content": prompt}, 90 | ] 91 | prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 92 | 93 | generated=generator(prompt, 94 | max_new_tokens=args.max_tokens, 95 | temperature=args.temperature, 96 | return_full_text=False) 97 | generated_test=generated[0]['generated_text'] 98 | if generated_test.startswith(' '): #remove extra indents (encountered in codellama) 99 | generated_test=textwrap.dedent(generated_test) 100 | print(generated_test) 101 | tests[lineno]=generated_test 102 | print('----------') 103 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests} 104 | 105 | elif args.covmode=='branch': 106 | tests_branch=[] 107 | print(data['task_num']) 108 | branches=data['blocks'] 109 | for 
branch in branches: 110 | print(branch) 111 | startline=branch['start'] 112 | endline=branch['end'] 113 | 114 | code_input=code_withlineno 115 | 116 | split_lines=code_withlineno.split('\n') 117 | target_lines=split_lines[startline-1:endline] 118 | target_branch_withlineno='\n'.join(target_lines) 119 | branch_input="\n'''\n"+target_branch_withlineno+"\n'''" 120 | 121 | prompt=prompt_template_branch.format(lang='python', program=code_input, description=desc, func_name=func_name, branch=branch_input) 122 | if args.model in models_nosys: #models don't support system message 123 | messages=[{"role": "user", "content": system_message+prompt}] 124 | else: 125 | messages=[ 126 | {"role": "system", "content": system_message}, 127 | {"role": "user", "content": prompt}, 128 | ] 129 | prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 130 | 131 | generated=generator(prompt, 132 | max_new_tokens=args.max_tokens, 133 | temperature=args.temperature, 134 | return_full_text=False) 135 | generated_test=generated[0]['generated_text'] 136 | if generated_test.startswith(' '): #remove extra indents (encountered in codellama) 137 | generated_test=textwrap.dedent(generated_test) 138 | print(generated_test) 139 | generatedtest_branch={'start':startline,'end':endline,'test':generated_test} 140 | tests_branch.append(generatedtest_branch) 141 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests_branch} 142 | 143 | testing_results.append(testing_data) 144 | print('<<<<----------------------------------------->>>>') 145 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{model_abbrv}_temp.jsonl') 146 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{model_abbrv}.jsonl') 147 | -------------------------------------------------------------------------------- /generate_targetcov_openai.py: -------------------------------------------------------------------------------- 1 | import os 2 | from argparse import ArgumentParser 3 | from tqdm import tqdm 4 | import openai 5 | from openai import OpenAI 6 | openai.api_key=os.getenv("OPENAI_API_KEY") 7 | client=OpenAI(api_key=openai.api_key) 8 | from pathlib import Path 9 | from data_utils import read_jsonl, write_jsonl, add_lineno 10 | 11 | 12 | def parse_args(): 13 | parser = ArgumentParser() 14 | parser.add_argument("--dataset", type=str, default='leetcode') 15 | parser.add_argument("--model", type=str, default='gpt-3.5-turbo', choices=['gpt-3.5-turbo', 'gpt-4', 'gpt-4-turbo', 'gpt-4o']) 16 | parser.add_argument("--covmode", type=str, default='line', choices=['line', 'branch'], help='cover targets at line level or branch level') 17 | parser.add_argument("--max_tokens", type=int, default=256) 18 | parser.add_argument("--temperature", type=float, default=0) 19 | return parser.parse_args() 20 | 21 | 22 | def generate_completion(args,prompt,system_message=''): 23 | response = client.chat.completions.create( 24 | model=args.model, 25 | messages=[ 26 | {"role": "system", "content": system_message}, 27 | {"role": "user", "content": prompt}, 28 | ], 29 | temperature=args.temperature, 30 | max_tokens=args.max_tokens 31 | ) 32 | code_output=response.choices[0].message.content 33 | return code_output 34 | 35 | 36 | if __name__=='__main__': 37 | args=parse_args() 38 | print('Model:', args.model) 39 | print('task:', args.covmode) 40 | output_dir = Path('predictions') 41 | 42 | 
prompt_template=open('prompt/template_line.txt').read() 43 | prompt_template_branch=open('prompt/template_branch.txt').read() 44 | system_template=open('prompt/system.txt').read() 45 | system_message=system_template.format(lang='python') 46 | 47 | dataset=read_jsonl('data/leetcode-py.jsonl') 48 | 49 | data_size=len(dataset) 50 | testing_results=[] 51 | for i in tqdm(range(data_size)): 52 | data=dataset[i] 53 | func_name=data['func_name'] 54 | desc=data['description'] 55 | code=data['python_solution'] 56 | difficulty=data['difficulty'] 57 | code_withlineno=add_lineno(code) 58 | 59 | #generate test case 60 | if args.covmode=='line': 61 | target_lines=data['target_lines'] 62 | tests={} 63 | print(data['task_num'],target_lines) 64 | 65 | for lineno in target_lines: #line number to be tested 66 | code_lines=code.split('\n') 67 | target_line=code_lines[lineno-1] 68 | target_line_withlineno=f'{lineno}: {target_line}' 69 | 70 | code_input=code_withlineno 71 | line_input=target_line_withlineno 72 | 73 | prompt=prompt_template.format(lang='python', program=code_input, description=desc, func_name=func_name, lineno=line_input) 74 | 75 | generated_test=generate_completion(args,prompt,system_message) 76 | print(generated_test) 77 | tests[lineno]=generated_test 78 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests} 79 | 80 | elif args.covmode=='branch': 81 | tests_branch=[] 82 | print(data['task_num']) 83 | branches=data['blocks'] 84 | for branch in branches: 85 | print(branch) 86 | startline=branch['start'] 87 | endline=branch['end'] 88 | 89 | code_input=code_withlineno 90 | 91 | split_lines=code_withlineno.split('\n') 92 | target_lines=split_lines[startline-1:endline] 93 | target_branch_withlineno='\n'.join(target_lines) 94 | branch_input="\n'''\n"+target_branch_withlineno+"\n'''" 95 | 96 | prompt=prompt_template_branch.format(lang='python', program=code_input, description=desc, func_name=func_name, branch=branch_input) 97 | 98 | generated_test=generate_completion(args,prompt,system_message) 99 | print(generated_test) 100 | generatedtest_branch={'start':startline,'end':endline,'test':generated_test} 101 | tests_branch.append(generatedtest_branch) 102 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests_branch} 103 | 104 | testing_results.append(testing_data) 105 | print('<<<<----------------------------------------->>>>') 106 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{args.model}_temp.jsonl') 107 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{args.model}.jsonl') 108 | -------------------------------------------------------------------------------- /prompt/line_oneshot_gencond.txt: -------------------------------------------------------------------------------- 1 | Given a Python code snippet and a target line number, you are asked to generate reasoning steps to satisfy a specific line to be executed. 
2 | [Example] 3 | Given the following code snippet: 4 | ```Python 5 | class Solution: #1 6 | def twoSum(self, nums: List[int], target: int) -> List[int]: #2 7 | numMap = {{}} #3 8 | n = len(nums) #4 9 | #5 10 | for i in range(n): #6 11 | numMap[nums[i]] = i #7 12 | #8 13 | for i in range(n): #9 14 | complement = target - nums[i] #10 15 | if complement in numMap and numMap[complement] != i: #11 16 | return [i, numMap[complement]] #12 17 | #13 18 | return [] #14 19 | ``` 20 | Identify, when executing function twoSum, what conditions need to be satisfied if line 12 is to be executed. 21 | 22 | Answer: 23 | <cond> 24 | Step 1: If line 12 `return [i, numMap[complement]]` is executed, then the `if` condition `(complement in numMap and numMap[complement] != i)` in line 11 should be true. 25 | Step 2: If condition `complement in numMap` is true, at least one `target - nums[i]` in line 10 equals an element in nums, which means there exist two elements in `nums` whose sum is equal to `target`. 26 | Step 3: If condition `numMap[complement] != i` is true, then `numMap[target - nums[i]] != i`, meaning that the index of `target - nums[i]` is not equal to `i`. 27 | Step 4: Generally, to execute line 12, we need to ensure that there exist two different elements in `nums` whose sum is equal to `target`. 28 | <\cond> 29 | [\Example] 30 | 31 | In a similar fashion, identify the conditions that need to be satisfied when line {targetline} is to be executed for the following Python code. 32 | ```Python 33 | {program} 34 | ``` 35 | Surround your answer with <cond> and <\cond>. 36 | -------------------------------------------------------------------------------- /prompt/line_oneshot_gentest.txt: -------------------------------------------------------------------------------- 1 | For the given code snippet and a list of conditions that need to be satisfied, generate a test case that will satisfy these conditions. Here is an example: 2 | [Example] 3 | Code: 4 | ```Python 5 | class Solution: #1 6 | def twoSum(self, nums: List[int], target: int) -> List[int]: #2 7 | numMap = {{}} #3 8 | n = len(nums) #4 9 | #5 10 | for i in range(n): #6 11 | numMap[nums[i]] = i #7 12 | #8 13 | for i in range(n): #9 14 | complement = target - nums[i] #10 15 | if complement in numMap and numMap[complement] != i: #11 16 | return [i, numMap[complement]] #12 17 | #13 18 | return [] #14 19 | ``` 20 | Conditions: 21 | Step 1: If line 12 `return [i, numMap[complement]]` is executed, then the `if` condition `(complement in numMap and numMap[complement] != i)` in line 11 should be true. 22 | Step 2: If condition `complement in numMap` is true, at least one `target - nums[i]` in line 10 equals an element in nums, which means there exist two elements in `nums` whose sum is equal to `target`. 23 | Step 3: If condition `numMap[complement] != i` is true, then `numMap[target - nums[i]] != i`, meaning that the index of `target - nums[i]` is not equal to `i`. 24 | Step 4: Generally, to execute line 12, we need to ensure that there exist two different elements in `nums` whose sum is equal to `target`. 25 | 26 | Generated test case: 27 | ```Python 28 | def test_twoSum(): 29 | solution = Solution() 30 | assert solution.twoSum([2,7,11,15], 9) == [0, 1] 31 | ``` 32 | [\Example] 33 | 34 | In a similar fashion, generate a test case for the following code snippet and conditions. Your test function should be named `test_{func_name}`. 
--------------------------------------------------------------------------------
/prompt/system.txt:
--------------------------------------------------------------------------------
1 | You are a professional who writes {lang} test methods. You always respond with a {lang} function without any natural language descriptions. In particular, your answer should not include any quotation marks or descriptions outside the function definition.
2 | 
--------------------------------------------------------------------------------
/prompt/system_exec.txt:
--------------------------------------------------------------------------------
1 | You are an expert Python programmer who excels at generating test cases and reasoning about program execution behaviors.
2 | 
--------------------------------------------------------------------------------
/prompt/template_base.txt:
--------------------------------------------------------------------------------
1 | Please write a test method for the function '{func_name}' given the following program under test and function description. Your answer should only contain one test input.
2 | 
3 | Program under test:
4 | ----
5 | {program}
6 | ----
7 | 
8 | Function description for '{func_name}':
9 | ----
10 | {description}
11 | ----
12 | 
13 | Your test method should begin with:
14 | def test_{func_name}():
15 |     solution=Solution()
16 | 
--------------------------------------------------------------------------------
/prompt/template_branch.txt:
--------------------------------------------------------------------------------
1 | Please write a test method for the function '{func_name}' given the following program under test and function description. Your answer should only contain one test input.
2 | 
3 | Program under test:
4 | ----
5 | {program}
6 | ----
7 | 
8 | Function description for '{func_name}':
9 | ----
10 | {description}
11 | ----
12 | Your test case must cover the branch {branch}.
13 | Your test method should begin with:
14 | def test_{func_name}():
15 |     solution=Solution()
16 | 
--------------------------------------------------------------------------------
/prompt/template_line.txt:
--------------------------------------------------------------------------------
1 | Please write a test method for the function '{func_name}' given the following program under test and function description. Your answer should only contain one test input.
2 | 
3 | Program under test:
4 | ----
5 | {program}
6 | ----
7 | 
8 | Function description for '{func_name}':
9 | ----
10 | {description}
11 | ----
12 | 
13 | Your test case must cover line {lineno}.
14 | Your test method should begin with:
15 | def test_{func_name}():
16 |     solution=Solution()
17 | 
--------------------------------------------------------------------------------
/prompt/template_path.txt:
--------------------------------------------------------------------------------
1 | Please write a test method for the function '{func_name}' given the following program under test and function description. Your answer should only contain one test input.
2 | 
3 | Program under test:
4 | ----
5 | {program}
6 | ----
7 | 
8 | Function description for '{func_name}':
9 | ----
10 | {description}
11 | ----
12 | 
13 | Your test case must cover the following execution path in function {func_name}. The path is a sequence of branch conditions. When executing your test case, each branch condition in the target execution path must be satisfied sequentially.
14 | Target execution path: {path}
15 | ----
16 | Your test method should begin with:
17 | def test_{func_name}():
18 |     solution=Solution()
19 | 
--------------------------------------------------------------------------------
/prompt_utils.py:
--------------------------------------------------------------------------------
1 | 
2 | def generate_path(pathdata):
3 |     for i in range(len(pathdata)):
4 |         #pathdata[i]=pathdata[i].replace('\n','')
5 |         pathdata[i]=f"'{pathdata[i]}'"
6 |     path_prompt=' -> '.join(pathdata)
7 |     return path_prompt
8 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | openai>=1.17.1
2 | google-generativeai>=0.5.2
3 | transformers>=4.39.3
4 | sortedcontainers
5 | textwrap
6 | pytest-cov>=5.0.0
--------------------------------------------------------------------------------
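For reference, the {path} placeholder in prompt/template_path.txt above is filled with the string produced by generate_path in prompt_utils.py: each branch condition in the sampled path is wrapped in single quotes and the conditions are joined with ' -> '. A small usage sketch follows; the condition strings are made up for illustration, and note that generate_path quotes the elements of its input list in place.

```python
from prompt_utils import generate_path

# Hypothetical branch conditions for one sampled execution path.
sampled_path = [
    'complement in numMap and numMap[complement] != i',
    'not (complement in numMap and numMap[complement] != i)',
]
print(generate_path(sampled_path))
# 'complement in numMap and numMap[complement] != i' -> 'not (complement in numMap and numMap[complement] != i)'
```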