├── .coveragerc ├── LICENSE ├── README.md ├── croissant.json ├── data ├── leetcode-py-all.jsonl ├── leetcode-py-instrumented.jsonl ├── leetcode-py.jsonl └── tgt_paths.jsonl ├── data_utils.py ├── eval_base.py ├── eval_branchcov.py ├── eval_linecov.py ├── eval_overall.py ├── eval_pathcov.py ├── eval_pathcov_base.py ├── format.py ├── gen_linecov_cot_hf.py ├── gen_linecov_cot_openai.py ├── generate_cov_gemini.py ├── generate_cov_hf.py ├── generate_cov_openai.py ├── generate_pathcov_gemini.py ├── generate_pathcov_hf.py ├── generate_pathcov_openai.py ├── generate_targetcov_gemini.py ├── generate_targetcov_hf.py ├── generate_targetcov_openai.py ├── prompt ├── line_oneshot_gencond.txt ├── line_oneshot_gentest.txt ├── system.txt ├── system_exec.txt ├── template_base.txt ├── template_branch.txt ├── template_line.txt └── template_path.txt ├── prompt_utils.py └── requirements.txt /.coveragerc: -------------------------------------------------------------------------------- 1 | [report] 2 | ;exclude import statements 3 | exclude_also = 4 | import math 5 | import itertools 6 | import bisect 7 | import collections 8 | import string 9 | import heapq 10 | import functools 11 | import sortedcontainers 12 | import operator 13 | from typing import List, Dict, Tuple, Iterator* 14 | from math import sqrt 15 | from bisect import bisect_left* 16 | from dataclasses import dataclass 17 | import * 18 | from * import * -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 LLM4SoftwareTesting 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TestEval 2 | 3 | Dataset and benchmark for paper "TESTEVAL: Benchmarking Large Language Models for Test Case Generation". 
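The dataset files under `data/` are in JSON Lines format (one record per line). As a minimal sketch (assuming the repository root as the working directory), they can be loaded with the `read_jsonl` helper from `data_utils.py`; the record fields are described in the metadata table below.

```python
from data_utils import read_jsonl

# load the 210 programs under test
dataset = read_jsonl('data/leetcode-py.jsonl')

sample = dataset[0]
print(sample['task_title'], sample['difficulty'])  # problem title and difficulty (0=easy ... 2=hard)
print(sample['python_solution'])                   # the program under test
print(sample['target_lines'])                      # target lines for the targeted line coverage task
```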
4 | 5 | ## Data 6 | 7 | ### Dataset description 8 | 9 | | category | data | 10 | |------|------| 11 | | total programs under test | 210 | 12 | | total target lines | 1340 | 13 | | total target branches | 983 | 14 | | total target paths | 854 | 15 | 16 | ### Metadata 17 | 18 | | field name | data type | description | 19 | |------|------|------| 20 | | task_num | int | Problem id in LeetCode | 21 | | task_title | string | LeetCode problem title | 22 | | difficulty | int | LeetCode problem difficulty: from 0 (easy) to 2 (hard) | 23 | | func_name | string | Default function name for the solution | 24 | | description | string | LeetCode problem description | 25 | | python_solution | string | LeetCode problem solution in Python (the program under test) | 26 | | blocks | list | The list for target branches | 27 | | target_lines | list | The list for target lines | 28 | | python_solution_instrumented | string | Add instrumentations to python_solution for recording execution paths | 29 | | sampled_paths | list | The list of target paths, the format is the same as the execution paths collected from python_solution_instrumented | 30 | | sampled_condition_paths | list | The list of target paths, used in prompts | 31 | 32 | ## Quick Start 33 | 34 | Requirements: Python>=3.10 35 | 36 | Install dependencies 37 | 38 | ```bash 39 | pip install -r requirements.txt 40 | ``` 41 | 42 | Create folder to store generated tests 43 | 44 | ```bash 45 | mkdir predictions 46 | ``` 47 | 48 | Set environment variables 49 | 50 | ``` 51 | OPENAI_API_KEY: openai api key 52 | GOOGLE_API_KEY: gemini api key 53 | HUGGINGFACE_TOKEN: huggingface access token 54 | ``` 55 | 56 | ### Run experiments: overall coverage 57 | 58 | ```bash 59 | python generate_cov_{openai/gemini/hf}.py --model {model_name} --num_tests 20 #generate raw test cases 60 | python format.py --mode overall --path {path_to_generated_tests} #reformat test cases 61 | python eval_overall.py --path {path_to_formatted_generated_tests} #evaluate correctness and coverage metrics 62 | ``` 63 | 64 | ### Run experiments: targeted line coverage 65 | 66 | ```bash 67 | python generate_targetcov_{openai/gemini/hf}.py --covmode line --model {model_name} #generate raw test cases 68 | python format.py --mode line --path {path_to_generated_tests} #reformat test cases 69 | python eval_linecov.py --path {path_to_formatted_generated_tests} #evaluate correctness and coverage metrics 70 | ``` 71 | 72 | ### Run experiments: targeted line coverage with two-step reasoning 73 | 74 | ```bash 75 | python gen_linecov_cot_{openai/hf}.py --model {model_name} #generate reasoning steps and raw test cases 76 | python format.py --mode line --path {path_to_generated_tests} #reformat test cases 77 | python eval_linecov.py --path {path_to_formatted_generated_tests} #evaluate correctness and coverage metrics 78 | ``` 79 | 80 | ### Run experiments: targeted branch coverage 81 | 82 | ```bash 83 | python generate_targetcov_{openai/gemini/hf}.py --covmode branch --model {model_name} #generate raw test cases 84 | python format.py --mode branch --path {path_to_generated_tests} #reformat test cases 85 | python eval_branchcov.py --path {path_to_formatted_generated_tests} #evaluate correctness and coverage metrics 86 | ``` 87 | 88 | ### Run experiments: targeted path coverage 89 | 90 | ```bash 91 | python generate_pathcov_{openai/gemini/hf}.py --model {model_name} #generate raw test cases 92 | python format.py --mode overall --path {path_to_generated_tests} #reformat test cases 93 | python eval_pathcov.py 
--path {path_to_formatted_generated_tests} #evaluate correctness and coverage metrics 94 | ``` 95 | 96 | ### Run baselines: targeted line/branch coverage 97 | 98 | ```bash 99 | python eval_base.py --path {path_to_formatted_generated_tests} #evaluate targeted line/branch coverage for baselines: use the test cases generated from the overall coverage task 100 | ``` 101 | 102 | ### Run baselines: targeted path coverage 103 | 104 | ```bash 105 | python eval_pathcov_base.py --path {path_to_formatted_generated_tests} #evaluate targeted path coverage for baselines: use the test cases generated from the overall coverage task 106 | ``` 107 | 108 | ### Run your own pipeline 109 | 110 | We encourage researchers to use their own test case generation pipelines rather than our prompt framework. If you run your own pipeline, your generated test case file should be formatted as follows: 111 | 112 | Overall coverage: 113 | 114 | ``` 115 | {'task_num': LeetCode problem id, 'difficulty': LeetCode problem difficulty, 'func_name': solution function name, 'code': solution code, 'tests': list of generated test cases} 116 | ``` 117 | 118 | Targeted line coverage: 119 | 120 | ``` 121 | {'task_num': LeetCode problem id, 'difficulty': LeetCode problem difficulty, 'func_name': solution function name, 'code': solution code, 'tests': {target line number: test case for target line}} 122 | ``` 123 | 124 | Targeted branch coverage: 125 | 126 | ``` 127 | {'task_num': LeetCode problem id, 'difficulty': LeetCode problem difficulty, 'func_name': solution function name, 'code': solution code, 'tests': [{"start": branch start line, "end": branch end line, "test": test case for target branch}]} 128 | ``` 129 | 130 | Targeted path coverage: 131 | 132 | ``` 133 | {'task_num': LeetCode problem id, 'difficulty': LeetCode problem difficulty, 'func_name': solution function name, 'code': solution code, 'tests': list of generated test cases for each target path} 134 | ``` 135 | 136 | -------------------------------------------------------------------------------- /croissant.json: -------------------------------------------------------------------------------- 1 | { 2 | "@context": { 3 | "@language": "en", 4 | "@vocab": "https://schema.org/", 5 | "citeAs": "cr:citeAs", 6 | "column": "cr:column", 7 | "conformsTo": "dct:conformsTo", 8 | "cr": "http://mlcommons.org/croissant/", 9 | "rai": "http://mlcommons.org/croissant/RAI/", 10 | "data": { 11 | "@id": "cr:data", 12 | "@type": "@json" 13 | }, 14 | "dataType": { 15 | "@id": "cr:dataType", 16 | "@type": "@vocab" 17 | }, 18 | "dct": "http://purl.org/dc/terms/", 19 | "examples": { 20 | "@id": "cr:examples", 21 | "@type": "@json" 22 | }, 23 | "extract": "cr:extract", 24 | "field": "cr:field", 25 | "fileProperty": "cr:fileProperty", 26 | "fileObject": "cr:fileObject", 27 | "fileSet": "cr:fileSet", 28 | "format": "cr:format", 29 | "includes": "cr:includes", 30 | "isLiveDataset": "cr:isLiveDataset", 31 | "jsonPath": "cr:jsonPath", 32 | "key": "cr:key", 33 | "md5": "cr:md5", 34 | "parentField": "cr:parentField", 35 | "path": "cr:path", 36 | "recordSet": "cr:recordSet", 37 | "references": "cr:references", 38 | "regex": "cr:regex", 39 | "repeated": "cr:repeated", 40 | "replace": "cr:replace", 41 | "sc": "https://schema.org/", 42 | "separator": "cr:separator", 43 | "source": "cr:source", 44 | "subField": "cr:subField", 45 | "transform": "cr:transform" 46 | }, 47 | "@type": "sc:Dataset", 48 | "name": "TestEval", 49 | "description": "Dataset for 'TESTEVAL: Benchmarking Large Language Models for Test Case
Generation'", 50 | "conformsTo": "http://mlcommons.org/croissant/1.0", 51 | "citeAs": "@article{TestEval,title={TestEval: Benchmarking Large Language Models for Test Case Generation},author={Wang, Wenhan and Yang, Chenyuan and Wang, Zhijie and Huang, Yuheng and Chu, Zhaoyang and Song, Da and Zhang, Lingming and Chen, An Ran and Ma, Lei}year={2024}}", 52 | "license": "MIT", 53 | "url": "https://github.com/LLM4SoftwareTesting/TestEval", 54 | "distribution": [ 55 | { 56 | "@type": "cr:FileObject", 57 | "@id": "github-repository", 58 | "name": "github-repository", 59 | "description": "TestEval repository.", 60 | "contentUrl": "https://github.com/LLM4SoftwareTesting/TestEval", 61 | "encodingFormat": "git+https", 62 | "sha256": "main" 63 | }, 64 | { 65 | "@type": "cr:FileSet", 66 | "@id": "jsonl-files", 67 | "name": "jsonl-files", 68 | "description": "JSONL files are hosted on the GitHub repository.", 69 | "containedIn": { 70 | "@id": "github-repository" 71 | }, 72 | "encodingFormat": "application/jsonlines", 73 | "includes": "data/*.jsonl" 74 | } 75 | ], 76 | "recordSet": [ 77 | { 78 | "@type": "cr:RecordSet", 79 | "@id": "jsonl", 80 | "name": "jsonl", 81 | "field": [ 82 | { 83 | "@type": "cr:Field", 84 | "@id": "jsonl/task_num", 85 | "name": "task_num", 86 | "description": "Problem id in LeetCode", 87 | "dataType": "sc:Integer", 88 | "source": { 89 | "fileSet": { 90 | "@id": "jsonl-files" 91 | }, 92 | "extract": { 93 | "column": "task_num" 94 | } 95 | } 96 | }, 97 | { 98 | "@type": "cr:Field", 99 | "@id": "jsonl/task_title", 100 | "name": "task_title", 101 | "description": "LeetCode problem title.", 102 | "dataType": "sc:Text", 103 | "source": { 104 | "fileSet": { 105 | "@id": "jsonl-files" 106 | }, 107 | "extract": { 108 | "column": "task_title" 109 | } 110 | } 111 | }, 112 | { 113 | "@type": "cr:Field", 114 | "@id": "jsonl/difficulty", 115 | "name": "difficulty", 116 | "description": "LeetCode problem difficulty: from 0 to 2 (easy, medium, hard).", 117 | "dataType": "sc:Integer", 118 | "source": { 119 | "fileSet": { 120 | "@id": "jsonl-files" 121 | }, 122 | "extract": { 123 | "column": "difficulty" 124 | } 125 | } 126 | }, 127 | { 128 | "@type": "cr:Field", 129 | "@id": "jsonl/func_name", 130 | "name": "func_name", 131 | "description": "Defalt funtion name for the solution.", 132 | "dataType": "sc:Text", 133 | "source": { 134 | "fileSet": { 135 | "@id": "jsonl-files" 136 | }, 137 | "extract": { 138 | "column": "func_name" 139 | } 140 | } 141 | }, 142 | { 143 | "@type": "cr:Field", 144 | "@id": "jsonl/description", 145 | "name": "description", 146 | "description": "LeetCode problem description.", 147 | "dataType": "sc:Text", 148 | "source": { 149 | "fileSet": { 150 | "@id": "jsonl-files" 151 | }, 152 | "extract": { 153 | "column": "description" 154 | } 155 | } 156 | }, 157 | { 158 | "@type": "cr:Field", 159 | "@id": "jsonl/python_solution", 160 | "name": "python_solution", 161 | "description": "LeetCode problem solution in Python. 
The program under test.", 162 | "dataType": "sc:Text", 163 | "source": { 164 | "fileSet": { 165 | "@id": "jsonl-files" 166 | }, 167 | "extract": { 168 | "column": "python_solution" 169 | } 170 | } 171 | }, 172 | { 173 | "@type": "cr:Field", 174 | "@id": "jsonl/blocks", 175 | "name": "blocks", 176 | "description": "The list for target branches.", 177 | "dataType": "sc:Text", 178 | "source": { 179 | "fileSet": { 180 | "@id": "jsonl-files" 181 | }, 182 | "extract": { 183 | "column": "blocks" 184 | } 185 | } 186 | }, 187 | { 188 | "@type": "cr:Field", 189 | "@id": "jsonl/target_lines", 190 | "name": "target_lines", 191 | "description": "The list for target lines.", 192 | "dataType": "sc:Text", 193 | "source": { 194 | "fileSet": { 195 | "@id": "jsonl-files" 196 | }, 197 | "extract": { 198 | "column": "target_lines" 199 | } 200 | } 201 | }, 202 | { 203 | "@type": "cr:Field", 204 | "@id": "jsonl/python_solution_instrumented", 205 | "name": "python_solution_instrumented", 206 | "description": "Add instrumentions to python_solution for recording execution paths.", 207 | "dataType": "sc:Text", 208 | "source": { 209 | "fileSet": { 210 | "@id": "jsonl-files" 211 | }, 212 | "extract": { 213 | "column": "python_solution_instrumented" 214 | } 215 | } 216 | }, 217 | { 218 | "@type": "cr:Field", 219 | "@id": "jsonl/sampled_paths", 220 | "name": "sampled_paths", 221 | "description": "The list of target paths, the format is the same to the execution paths collected from python_solution_instrumented.", 222 | "dataType": "sc:Text", 223 | "source": { 224 | "fileSet": { 225 | "@id": "jsonl-files" 226 | }, 227 | "extract": { 228 | "column": "sampled_paths" 229 | } 230 | } 231 | }, 232 | { 233 | "@type": "cr:Field", 234 | "@id": "jsonl/sampled_condition_paths", 235 | "name": "sampled_condition_paths", 236 | "description": "The list of target paths, used in prompts.", 237 | "dataType": "sc:Text", 238 | "source": { 239 | "fileSet": { 240 | "@id": "jsonl-files" 241 | }, 242 | "extract": { 243 | "column": "sampled_condition_paths" 244 | } 245 | } 246 | } 247 | ] 248 | } 249 | ] 250 | } 251 | -------------------------------------------------------------------------------- /data_utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | 4 | def read_jsonl(path): 5 | data=[] 6 | with open(path,'r') as f: 7 | for line in f: 8 | data.append(json.loads(line)) 9 | return data 10 | 11 | 12 | def write_jsonl(data,path): 13 | with open(path,'w') as f: 14 | for d in data: 15 | f.write(json.dumps(d)+'\n') 16 | 17 | 18 | def add_lineno(code): 19 | """Add line numbers to code.""" 20 | lines=code.split('\n') 21 | new_code='' 22 | for i, line in enumerate(lines): 23 | new_code+=f'{i+1}. 
{line}\n' 24 | return new_code 25 | 26 | 27 | def add_lineno_comment(code,docstring_lines=None): 28 | """Add line numbers to code as comments.""" 29 | lines=code.split('\n') 30 | for i in range(len(lines)-1,-1,-1): 31 | if lines[i]=='': 32 | lines.pop(i) 33 | else: 34 | break 35 | new_code='' 36 | if docstring_lines is None: 37 | for i, line in enumerate(lines): 38 | if i == len(lines) - 1: 39 | new_code+=f'{line} #{i+1}' 40 | else: 41 | new_code+=f'{line} #{i+1}\n' 42 | else: 43 | docstart,docend=docstring_lines 44 | for i, line in enumerate(lines): 45 | if i>=docstart and i<=docend: 46 | new_code+=f'{line}\n' 47 | else: 48 | if i == len(lines) - 1: 49 | new_code+=f'{line} #{i+1}' 50 | else: 51 | new_code+=f'{line} #{i+1}\n' 52 | return new_code 53 | -------------------------------------------------------------------------------- /eval_base.py: -------------------------------------------------------------------------------- 1 | #evaluate target line/branch coverage for baselines 2 | import os 3 | import subprocess 4 | import json 5 | import signal 6 | import random 7 | random.seed(42) 8 | import shutil 9 | import time 10 | import re 11 | from pathlib import Path 12 | from tqdm import tqdm 13 | from argparse import ArgumentParser 14 | from copy import deepcopy 15 | from data_utils import read_jsonl 16 | 17 | 18 | class TimeoutHandler: 19 | def __init__(self, timeout, error_message=None): 20 | self.timeout = timeout 21 | self.error_message = error_message 22 | 23 | def __enter__(self): 24 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 25 | signal.alarm(self.timeout) 26 | 27 | def __exit__(self, type, value, traceback): 28 | signal.alarm(0) 29 | 30 | def raise_timeout(self, *args): 31 | raise TimeoutError(self.error_message) 32 | 33 | 34 | def execute(test_code,timeout=5): 35 | """try to execute test code""" 36 | try: 37 | exec_globals = {} 38 | with TimeoutHandler(timeout): 39 | exec(test_code, globals()) #add globals() to avoid name errors related to import 40 | return True 41 | except AssertionError: #assertionerror is considered as executable 42 | return True 43 | except TimeoutError: 44 | #print("timed out") 45 | return False 46 | except Exception as e: 47 | #print(f"failed: {type(e).__name__}") 48 | return type(e).__name__, e #return error type and error message 49 | 50 | 51 | 52 | 53 | template_data=read_jsonl('data/leetcode-py.jsonl') 54 | def eval_correctness(generated_data): 55 | """Compute syntactical and execution correctness (with coverage).""" 56 | total_cases_line=0 57 | total_cases_branch=0 58 | total_syn_correct=0 59 | total_exec_correct=0 60 | syn_failed=0 61 | total_easy=0 #for evaluating branches with different difficulties 62 | total_medium=0 63 | total_hard=0 64 | cov_easybranch=0 65 | cov_mediumbranch=0 66 | cov_hardbranch=0 67 | 68 | exec_fails=[] 69 | 70 | cov_line_success=0 71 | cov_branch_success=0 72 | remove_pattern=re.compile(r'tmp*') 73 | 74 | 75 | for i, data in tqdm(enumerate(generated_data)): 76 | task_num=data['task_num'] 77 | difficulty=data['difficulty'] 78 | func_name=data['func_name'] 79 | code=data['code'] 80 | generated_tests=data['tests'] 81 | baseline_test=generated_tests[0] #use the first generated test in overall coverage as the baseline 82 | 83 | branches=template_data[i]['blocks'] 84 | for test_branch in branches: 85 | branch_diff=test_branch['difficulty'] 86 | if branch_diff==0: 87 | total_easy+=1 88 | elif branch_diff==1: 89 | total_medium+=1 90 | elif branch_diff==2: 91 | total_hard+=1 92 | 
target_lines=template_data[i]['target_lines'] 93 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 94 | test_import_simple=f'from under_test import Solution\n' 95 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 96 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test and test cases into tmp files 97 | f.write(code) 98 | 99 | passed_tests=[] 100 | 101 | total_cases_line+=len(target_lines) 102 | total_cases_branch+=len(branches) 103 | try: 104 | res=compile(baseline_test,'','exec') #check syntax correctness 105 | total_syn_correct+=1 106 | 107 | test_code=test_import+baseline_test+f'\ntest_{func_name}()' 108 | time.sleep(0.01) 109 | res=execute(test_code) 110 | if res==True: 111 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 112 | print('func under test not called') 113 | exec_fails.append({'task':task_num, 'error':'not called'}) 114 | else: 115 | total_exec_correct+=1 116 | test_code_simple=test_import_simple+baseline_test #write to files for computing coverage 117 | with open(f'tmp_{i}_{difficulty}/test.py','w') as f: 118 | f.write(test_code_simple) 119 | passed_tests.append('test.py') 120 | #print('correct') 121 | else: 122 | exec_fails.append({'task':task_num,'test_line':lineno,'error':res}) 123 | #print(res) 124 | #print(test_code) 125 | except: 126 | syn_failed+=1 127 | pass 128 | 129 | if len(passed_tests)>0: #start measuring coverage 130 | #check if cover the selected line 131 | cov_command_prefix=['pytest', '--cov=under_test', '--cov-branch', '--cov-report=json:coverage.json'] 132 | subprocess.run(f'cp .coveragerc tmp_{i}_{difficulty}/.coveragerc',shell=True) #copy config file to tmp_folder 133 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 134 | cov_command=deepcopy(cov_command_prefix) 135 | cov_command.append('test.py') 136 | 137 | subprocess.run(cov_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 138 | cov_report=json.load(open('coverage.json')) 139 | executed_lines=cov_report['files']['under_test.py']['executed_lines'] 140 | 141 | for lineno in target_lines: #get line coverage 142 | if lineno in executed_lines: 143 | cov_line_success+=1 144 | print(f'covered line {lineno}') 145 | else: 146 | print('line not covered') 147 | pass 148 | for j,test_branch in enumerate(branches): #get branch coverage 149 | startline=test_branch['start'] 150 | endline=test_branch['end'] 151 | 152 | branch_firstline=startline+1 #if this line is covered, this branch is covered 153 | if branch_firstline in executed_lines: #use startline+1 to check whether branch is covered 154 | branch_diff=test_branch['difficulty'] 155 | if branch_diff==0: 156 | cov_easybranch+=1 157 | elif branch_diff==1: 158 | cov_mediumbranch+=1 159 | elif branch_diff==2: 160 | cov_hardbranch+=1 161 | cov_branch_success+=1 162 | print(f'covered branch {startline}-{endline}') 163 | else: 164 | print('branch not covered') 165 | pass 166 | os.chdir('..') #exit tmp_ folder 167 | else: #no test cases passed 168 | pass 169 | 170 | for dirpath, dirnames, filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 171 | # Filter dirnames based on the regex pattern 172 | for dirname in dirnames: 173 | if remove_pattern.match(dirname): 174 | shutil.rmtree(dirname) 175 | 176 | syn_correct=total_syn_correct/len(generated_data) 177 | exec_correct=total_exec_correct/len(generated_data) 178 | 
print(total_syn_correct, total_exec_correct,len(generated_data)) 179 | print(f'Syntax Correctness: {syn_correct}') 180 | print(f'Executable Correctness: {exec_correct}') 181 | 182 | cov_line_rate=cov_line_success/total_cases_line 183 | cov_branch_rate=cov_branch_success/total_cases_branch 184 | easy_covrate=cov_easybranch/total_easy 185 | medium_covrate=cov_mediumbranch/total_medium 186 | hard_covrate=cov_hardbranch/total_hard 187 | print(f'Accuracy in cover selected line: {cov_line_rate}') 188 | print(f'Accuracy in cover selected branch: {cov_branch_rate}') 189 | print(f'Easy branch coverage rate: {easy_covrate}') 190 | print(f'Medium branch coverage rate: {medium_covrate}') 191 | print(f'Hard branch coverage rate: {hard_covrate}') 192 | 193 | return {'syn_correct':syn_correct,'exec_correct':exec_correct, 'cov_line':cov_line_rate, 'cov_branch':cov_branch_rate}, exec_fails 194 | 195 | 196 | def parse_args(): 197 | parser = ArgumentParser() 198 | parser.add_argument("--path", type=str, default='predictions/totalcov_gpt-3.5-turbo.jsonl') 199 | return parser.parse_args() 200 | 201 | 202 | if __name__=='__main__': 203 | args=parse_args() 204 | print(args.path) 205 | output_dir = Path('predictions') 206 | predictions=read_jsonl(output_dir / args.path) 207 | print(len(predictions)) 208 | 209 | eval_correctness(predictions) 210 | -------------------------------------------------------------------------------- /eval_branchcov.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import json 4 | import signal 5 | import random 6 | random.seed(42) 7 | import shutil 8 | import time 9 | import re 10 | from pathlib import Path 11 | from tqdm import tqdm 12 | from argparse import ArgumentParser 13 | from copy import deepcopy 14 | from data_utils import read_jsonl 15 | 16 | class TimeoutHandler: 17 | def __init__(self, timeout, error_message=None): 18 | self.timeout = timeout 19 | self.error_message = error_message 20 | 21 | def __enter__(self): 22 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 23 | signal.alarm(self.timeout) 24 | 25 | def __exit__(self, type, value, traceback): 26 | signal.alarm(0) 27 | 28 | def raise_timeout(self, *args): 29 | raise TimeoutError(self.error_message) 30 | 31 | 32 | def execute(test_code,timeout=5): 33 | """try to execute test code""" 34 | try: 35 | exec_globals = {} 36 | with TimeoutHandler(timeout): 37 | exec(test_code, globals()) #add globals() to avoid name errors related to import 38 | return True 39 | except AssertionError: #assertionerror is considered as executable 40 | return 'assertion error' 41 | except TimeoutError: 42 | #print("timed out") 43 | return False 44 | except Exception as e: 45 | #print(f"failed: {type(e).__name__}") 46 | return type(e).__name__, e #return error type and error message 47 | 48 | 49 | def eval_correctness(generated_data): 50 | """Compute syntactical and execution correctness (with coverage).""" 51 | ref_dataset=read_jsonl('data/leetcode-py.jsonl') 52 | total_cases=0 53 | total_syn_correct=0 54 | total_comp_correct=0 #compilation correctness (todo: maybe need static analysis) 55 | total_exec_correct=0 56 | syn_failed=0 57 | 58 | exec_fails=[] 59 | 60 | cov_branch_success=0 61 | total_easy=0 #for evaluating branches with different difficulties 62 | total_medium=0 63 | total_hard=0 64 | cov_easybranch=0 65 | cov_mediumbranch=0 66 | cov_hardbranch=0 67 | remove_pattern=re.compile(r'tmp*') 68 | 69 | for i, data in 
tqdm(enumerate(generated_data)): 70 | task_num=data['task_num'] 71 | difficulty=data['difficulty'] 72 | func_name=data['func_name'] 73 | code=data['code'] 74 | ref_branches=ref_dataset[i]['blocks'] 75 | test_cases=data['tests'] 76 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 77 | test_import_simple=f'from under_test import Solution\n' 78 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 79 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test and test cases into tmp files 80 | f.write(code) 81 | passed_tests=[] 82 | 83 | for j,test_branch in enumerate(test_cases): 84 | startline=test_branch['start'] 85 | endline=test_branch['end'] 86 | testcase=test_branch['test'] 87 | assert startline==ref_branches[j]['start'] 88 | assert endline==ref_branches[j]['end'] 89 | 90 | total_cases+=1 91 | 92 | branch_diff=ref_branches[j]['difficulty'] 93 | if branch_diff==0: 94 | total_easy+=1 95 | elif branch_diff==1: 96 | total_medium+=1 97 | elif branch_diff==2: 98 | total_hard+=1 99 | 100 | try: 101 | res=compile(testcase,'','exec') #check syntax correctness 102 | total_syn_correct+=1 103 | 104 | test_code=test_import+testcase+f'\ntest_{func_name}()' 105 | time.sleep(0.01) 106 | res=execute(test_code) #check execution correctness 107 | if res!=False: 108 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 109 | print('func under test not called') 110 | exec_fails.append({'task':task_num, 'error':'not called'}) 111 | else: 112 | total_exec_correct+=1 113 | test_code_simple=test_import_simple+testcase #write to files for computing coverage 114 | with open(f'tmp_{i}_{difficulty}/test_{startline}_{endline}.py','w') as f: 115 | f.write(test_code_simple) 116 | passed_tests.append({'start':startline, 'end':endline, 'path':f'test_{startline}_{endline}.py', 'difficulty':branch_diff}) 117 | #print('correct') 118 | else: 119 | exec_fails.append({'task':task_num, 'start':startline, 'end':endline, 'error':res}) 120 | #print(res) 121 | #print(test_code) 122 | except: 123 | syn_failed+=1 124 | pass 125 | 126 | if len(passed_tests)>0: #start measuring coverage 127 | #check if cover the selected line 128 | cov_command_prefix=['pytest', '--cov=under_test', '--cov-branch', '--cov-report=json:coverage.json'] 129 | subprocess.run(f'cp .coveragerc tmp_{i}_{difficulty}/.coveragerc',shell=True) #copy config file to tmp_folder 130 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 131 | for test_branch in passed_tests: 132 | startline=test_branch['start'] 133 | endline=test_branch['end'] 134 | branch_diff=test_branch['difficulty'] 135 | branch_firstline=startline+1 #if this line is covered, this branch is covered 136 | testpath=test_branch['path'] 137 | cov_command=deepcopy(cov_command_prefix) 138 | cov_command.append(testpath) 139 | 140 | subprocess.run(cov_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 141 | cov_report=json.load(open('coverage.json')) 142 | executed_lines=cov_report['files']['under_test.py']['executed_lines'] 143 | 144 | if branch_firstline in executed_lines: #use startline+1 to check whether branch is covered 145 | cov_branch_success+=1 146 | if branch_diff==0: 147 | cov_easybranch+=1 148 | elif branch_diff==1: 149 | cov_mediumbranch+=1 150 | elif branch_diff==2: 151 | cov_hardbranch+=1 152 | print(f'covered branch {startline}-{endline}') 153 | else: 154 | print('not covered') 155 | pass 
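            # Note: a target branch {start, end} is counted as covered when line start+1
            # (the first statement inside the branch body) appears in the executed_lines
            # of coverage.json, which is rewritten by each per-test pytest run above.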
156 | os.chdir('..') #exit tmp_ folder 157 | else: #no test cases passed 158 | pass 159 | 160 | for dirpath, dirnames, filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 161 | # Filter dirnames based on the regex pattern 162 | for dirname in dirnames: 163 | if remove_pattern.match(dirname): 164 | shutil.rmtree(dirname) 165 | 166 | syn_correct=total_syn_correct/total_cases 167 | exec_correct=total_exec_correct/total_cases 168 | print(total_syn_correct, total_exec_correct,total_cases) 169 | print(f'Syntax Correctness: {syn_correct}') 170 | print(f'Executable Correctness: {exec_correct}') 171 | 172 | cov_branch_rate=cov_branch_success/total_cases 173 | cov_branch_rate_exec=cov_branch_success/total_exec_correct 174 | easy_covrate=cov_easybranch/total_easy 175 | medium_covrate=cov_mediumbranch/total_medium 176 | hard_covrate=cov_hardbranch/total_hard 177 | print(f'Accuracy in cover selected branch: {cov_branch_rate}') 178 | print(f'Easy branch coverage rate: {easy_covrate}') 179 | print(f'Medium branch coverage rate: {medium_covrate}') 180 | print(f'Hard branch coverage rate: {hard_covrate}') 181 | 182 | return {'syn_correct':syn_correct,'exec_correct':exec_correct, 'cov_branch':cov_branch_rate}, exec_fails 183 | 184 | 185 | def parse_args(): 186 | parser = ArgumentParser() 187 | parser.add_argument("--path", type=str, default='branchcov_gpt-3.5-turbo.jsonl') 188 | return parser.parse_args() 189 | 190 | 191 | if __name__=='__main__': 192 | args=parse_args() 193 | print('generated answers:', args.path) 194 | output_dir = Path('predictions') 195 | predictions=read_jsonl(output_dir / args.path) 196 | print(len(predictions)) 197 | 198 | eval_correctness(predictions) 199 | -------------------------------------------------------------------------------- /eval_linecov.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import json 4 | import signal 5 | import random 6 | random.seed(42) 7 | import shutil 8 | import time 9 | import re 10 | from pathlib import Path 11 | from tqdm import tqdm 12 | from argparse import ArgumentParser 13 | from copy import deepcopy 14 | from data_utils import read_jsonl 15 | 16 | 17 | class TimeoutHandler: 18 | def __init__(self, timeout, error_message=None): 19 | self.timeout = timeout 20 | self.error_message = error_message 21 | 22 | def __enter__(self): 23 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 24 | signal.alarm(self.timeout) 25 | 26 | def __exit__(self, type, value, traceback): 27 | signal.alarm(0) 28 | 29 | def raise_timeout(self, *args): 30 | raise TimeoutError(self.error_message) 31 | 32 | 33 | def execute(test_code,timeout=5): 34 | """try to execute test code""" 35 | try: 36 | exec_globals = {} 37 | with TimeoutHandler(timeout): 38 | exec(test_code, globals()) #add globals() to avoid name errors related to import 39 | return True 40 | except AssertionError: #assertionerror is considered as executable 41 | return True 42 | except TimeoutError: 43 | #print("timed out") 44 | return False 45 | except Exception as e: 46 | #print(f"failed: {type(e).__name__}") 47 | return type(e).__name__, e #return error type and error message 48 | 49 | 50 | def eval_correctness(generated_data, covmode='branch'): 51 | """Compute syntactical and execution correctness (with coverage).""" 52 | total_cases=0 53 | total_syn_correct=0 54 | total_comp_correct=0 55 | total_exec_correct=0 56 | syn_failed=0 57 | 58 | exec_fails=[] 59 | 60 | 
total_line_cov=0 61 | total_branch_cov=0 62 | cov_line_success=0 63 | remove_pattern=re.compile(r'tmp*') 64 | 65 | for i, data in tqdm(enumerate(generated_data)): 66 | task_num=data['task_num'] 67 | difficulty=data['difficulty'] 68 | func_name=data['func_name'] 69 | code=data['code'] 70 | #code=ADDITIONAL_IMPORTS+code #add possibly missing imports 71 | test_cases=data['tests'] 72 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 73 | test_import_simple=f'from under_test import Solution\n' 74 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 75 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test and test cases into tmp files 76 | f.write(code) 77 | passed_tests={} 78 | 79 | for lineno in test_cases: 80 | testcase=test_cases[lineno] 81 | #testcase=test_cases[fixed_testcase_num] #comparison: use the first test case 82 | lineno=int(lineno) 83 | total_cases+=1 84 | 85 | try: 86 | res=compile(testcase,'','exec') #check syntax correctness 87 | total_syn_correct+=1 88 | 89 | test_code=test_import+testcase+f'\ntest_{func_name}()' 90 | time.sleep(0.01) 91 | res=execute(test_code) 92 | if res==True: 93 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 94 | print('func under test not called') 95 | exec_fails.append({'task':task_num, 'error':'not called'}) 96 | else: 97 | total_exec_correct+=1 98 | test_code_simple=test_import_simple+testcase #write to files for computing coverage 99 | with open(f'tmp_{i}_{difficulty}/test_{lineno}.py','w') as f: 100 | f.write(test_code_simple) 101 | passed_tests[lineno]=f'test_{lineno}.py' 102 | #print('correct') 103 | else: 104 | exec_fails.append({'task':task_num,'test_line':lineno,'error':res}) 105 | #print(res) 106 | #print(test_code) 107 | except: 108 | syn_failed+=1 109 | #print('syntax error') 110 | pass 111 | 112 | if len(passed_tests)>0: #start measuring coverage 113 | #check if cover the selected line 114 | cov_command_prefix=['pytest', '--cov=under_test', '--cov-branch', '--cov-report=json:coverage.json'] 115 | subprocess.run(f'cp .coveragerc tmp_{i}_{difficulty}/.coveragerc',shell=True) #copy config file to tmp_folder 116 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 117 | for lineno in passed_tests: 118 | test=passed_tests[lineno] 119 | cov_command=deepcopy(cov_command_prefix) 120 | cov_command.append(test) 121 | 122 | subprocess.run(cov_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 123 | cov_report=json.load(open('coverage.json')) 124 | executed_lines=cov_report['files']['under_test.py']['executed_lines'] 125 | #missline_lines= 126 | if lineno in executed_lines: 127 | cov_line_success+=1 128 | print(f'covered line {lineno}') 129 | else: 130 | print('not covered') 131 | pass 132 | os.chdir('..') #exit tmp_ folder 133 | else: #no test cases passed 134 | pass 135 | 136 | for dirpath, dirnames, filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 137 | # Filter dirnames based on the regex pattern 138 | for dirname in dirnames: 139 | if remove_pattern.match(dirname): 140 | shutil.rmtree(dirname) 141 | 142 | syn_correct=total_syn_correct/total_cases 143 | exec_correct=total_exec_correct/total_cases 144 | print(total_syn_correct, total_exec_correct,total_cases) 145 | print(f'Syntax Correctness: {syn_correct}') 146 | print(f'Executable Correctness: {exec_correct}') 147 | 148 | 
cov_line_rate=cov_line_success/total_cases 149 | cov_line_rate_exec=cov_line_success/total_exec_correct 150 | print(f'Accuracy in cover selected line: {cov_line_rate}') 151 | 152 | return {'syn_correct':syn_correct,'exec_correct':exec_correct, 'cov_line':cov_line_rate}, exec_fails 153 | 154 | 155 | def parse_args(): 156 | parser = ArgumentParser() 157 | parser.add_argument("--path", type=str, default='linecov_gpt-3.5-turbo.jsonl') 158 | return parser.parse_args() 159 | 160 | 161 | 162 | if __name__=='__main__': 163 | args=parse_args() 164 | print(args.path) 165 | output_dir = Path('predictions') 166 | predictions=read_jsonl(output_dir / args.path) 167 | print(len(predictions)) 168 | 169 | eval_correctness(predictions) 170 | -------------------------------------------------------------------------------- /eval_overall.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import json 4 | import signal 5 | import random 6 | random.seed(42) 7 | import shutil 8 | import time 9 | import re 10 | from pathlib import Path 11 | from tqdm import tqdm 12 | from argparse import ArgumentParser 13 | from copy import deepcopy 14 | from data_utils import read_jsonl 15 | 16 | 17 | class TimeoutHandler: 18 | def __init__(self, timeout, error_message=None): 19 | self.timeout = timeout 20 | self.error_message = error_message 21 | 22 | def __enter__(self): 23 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 24 | signal.alarm(self.timeout) 25 | 26 | def __exit__(self, type, value, traceback): 27 | signal.alarm(0) 28 | 29 | def raise_timeout(self, *args): 30 | raise TimeoutError(self.error_message) 31 | 32 | 33 | def execute(test_code,timeout=5): 34 | """try to execute test code""" 35 | try: 36 | exec_globals = {} 37 | with TimeoutHandler(timeout): 38 | exec(test_code, globals()) 39 | return True 40 | except AssertionError: #assertionerror is considered as executable 41 | return True 42 | except TimeoutError: 43 | #print("timed out") 44 | return False 45 | except Exception as e: 46 | #print(f"failed: {type(e).__name__}") 47 | return type(e).__name__, e #return error type and error message 48 | 49 | 50 | def coverage_at_k_sample(passed_tests, k, cov_command_prefix): 51 | """Compute coverage@k for a single program under test.""" 52 | random.shuffle(passed_tests) 53 | if len(passed_tests)>=k: 54 | #num_splits=math.ceil(len(passed_tests)/k) #round up or down? 
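    # round down: form len(passed_tests)//k disjoint groups of exactly k tests each;
    # leftover tests beyond the last full group are discarded (if fewer than k tests
    # passed in total, they are kept together as a single smaller group below).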
55 | num_splits=len(passed_tests)//k 56 | splited_tests=[passed_tests[i * k : (i + 1) * k] for i in range(num_splits)] 57 | else: #if number of passed tests is less than k, do not split 58 | splited_tests=[passed_tests] 59 | #calculate and average coverages for each group 60 | split_line_covs=[] 61 | split_branch_covs=[] 62 | 63 | for i,test_group in enumerate(splited_tests): 64 | group_line_cov=[] 65 | group_branch_cov=[] 66 | cov_command=deepcopy(cov_command_prefix) 67 | for test in test_group: 68 | cov_command.append(test) 69 | subprocess.run(cov_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 70 | cov_report=json.load(open('coverage.json')) 71 | total_stmt=cov_report['totals']['num_statements'] 72 | covered_stmt=cov_report['totals']['covered_lines'] 73 | line_cov=covered_stmt/total_stmt 74 | total_branch=cov_report['totals']['num_branches'] 75 | covered_branch=cov_report['totals']['covered_branches'] 76 | branch_cov=covered_branch/total_branch 77 | group_line_cov.append(line_cov) 78 | group_branch_cov.append(branch_cov) 79 | 80 | group_avg_line_cov=sum(group_line_cov)/len(group_line_cov) 81 | group_avg_branch_cov=sum(group_branch_cov)/len(group_branch_cov) 82 | split_line_covs.append(group_avg_line_cov) 83 | split_branch_covs.append(group_avg_branch_cov) 84 | 85 | avg_line_cov=sum(split_line_covs)/len(split_line_covs) 86 | avg_branch_cov=sum(split_branch_covs)/len(split_branch_covs) 87 | return {'line_cov':avg_line_cov,'branch_cov':avg_branch_cov} 88 | 89 | 90 | 91 | def check_correctness(generated_data,ks=[1, 2, 5]): 92 | """Compute syntactical and execution correctness (with coverage).""" 93 | total_cases=0 94 | total_syn_correct=0 95 | total_comp_correct=0 96 | total_exec_correct=0 97 | syn_failed=0 98 | 99 | exec_fails=[] 100 | 101 | total_line_cov=0 102 | total_branch_cov=0 103 | line_covs_at_k={f'cov@{k}':[] for k in ks} 104 | branch_covs_at_k={f'cov@{k}':[] for k in ks} 105 | 106 | remove_pattern=re.compile(r'tmp*') 107 | 108 | for i, data in tqdm(enumerate(generated_data)): 109 | task_num=data['task_num'] 110 | difficulty=data['difficulty'] 111 | func_name=data['func_name'] 112 | code=data['code'] 113 | test_cases=data['tests'] 114 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 115 | test_import_simple=f'from under_test import Solution\n' 116 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 117 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test and test cases into tmp files 118 | f.write(code) 119 | passed_tests=[] 120 | 121 | for j, testcase in enumerate(test_cases): 122 | #testcase=textwrap.dedent(testcase) 123 | total_cases+=1 124 | try: 125 | res=compile(testcase,'','exec') #check syntax correctness 126 | total_syn_correct+=1 127 | 128 | test_code=test_import+testcase+f'\ntest_{func_name}()' 129 | time.sleep(0.01) 130 | res=execute(test_code) 131 | if res==True: 132 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 133 | print('func under test not called') 134 | exec_fails.append({'task':task_num,'test_num':j,'error':'not called'}) 135 | else: 136 | total_exec_correct+=1 137 | test_code_simple=test_import_simple+testcase #write to files for computing coverage 138 | with open(f'tmp_{i}_{difficulty}/test_{j}.py','w') as f: 139 | f.write(test_code_simple) 140 | passed_tests.append(f'test_{j}.py') 141 | else: 142 | 
exec_fails.append({'task':task_num,'test_num':j,'error':res}) 143 | #print(res) 144 | #print(test_code) 145 | 146 | except: 147 | syn_failed+=1 148 | #print('syntax error') 149 | #print(testcase) 150 | pass 151 | 152 | if len(passed_tests)>0: #start measuring coverage 153 | #total coverage for all tests 154 | cov_command_prefix=['pytest', '--cov=under_test', '--cov-branch', '--cov-report=json:coverage.json'] 155 | subprocess.run(f'cp .coveragerc tmp_{i}_{difficulty}/.coveragerc',shell=True) #copy config file to tmp_folder 156 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 157 | cov_command=deepcopy(cov_command_prefix) 158 | for test in passed_tests: 159 | cov_command.append(test) 160 | 161 | try: 162 | subprocess.run(cov_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 163 | cov_report=json.load(open('coverage.json')) 164 | total_stmt=cov_report['totals']['num_statements'] 165 | covered_stmt=cov_report['totals']['covered_lines'] 166 | line_cov=covered_stmt/total_stmt 167 | total_branch=cov_report['totals']['num_branches'] 168 | covered_branch=cov_report['totals']['covered_branches'] 169 | branch_cov=covered_branch/total_branch 170 | total_line_cov+=line_cov 171 | total_branch_cov+=branch_cov 172 | #print(f'Line Coverage: {line_cov}, Branch Coverage: {branch_cov}') 173 | except: #unknown pytest error: cannot generate coverage report (AssertionError: Expected current collector to be , but it's ) 174 | print('Failed to generate coverage report') 175 | pass 176 | 177 | #compute coverage@k 178 | for k in ks: 179 | res_at_k=coverage_at_k_sample(passed_tests,k,cov_command_prefix) 180 | line_covs_at_k[f'cov@{k}'].append(res_at_k['line_cov']) 181 | branch_covs_at_k[f'cov@{k}'].append(res_at_k['branch_cov']) 182 | 183 | os.chdir('..') #exit tmp_ folder 184 | else: #no test cases passed 185 | pass 186 | 187 | for dirpath, dirnames, filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 188 | # Filter dirnames based on the regex pattern 189 | for dirname in dirnames: 190 | if remove_pattern.match(dirname): 191 | shutil.rmtree(dirname) 192 | 193 | syn_correct=total_syn_correct/total_cases 194 | exec_correct=total_exec_correct/total_cases 195 | print(f'Syntax Correctness: {syn_correct}') 196 | print(f'Executable Correctness: {exec_correct}') 197 | 198 | #compute average coverage@k 199 | for k in ks: 200 | line_covs_at_k[f'cov@{k}']=sum(line_covs_at_k[f'cov@{k}'])/len(generated_data) 201 | branch_covs_at_k[f'cov@{k}']=sum(branch_covs_at_k[f'cov@{k}'])/len(generated_data) 202 | print(f'line coverage@{k}',line_covs_at_k[f'cov@{k}']) 203 | print(f'branch coverage@{k}',branch_covs_at_k[f'cov@{k}']) 204 | 205 | #compute coverage 206 | avg_line_cov=total_line_cov/len(generated_data) 207 | avg_branch_cov=total_branch_cov/len(generated_data) 208 | print(f'Average Line Coverage: {avg_line_cov}, Average Branch Coverage: {avg_branch_cov}') 209 | return {'syn_correct':syn_correct,'exec_correct':exec_correct}, exec_fails 210 | 211 | 212 | def parse_args(): 213 | parser = ArgumentParser() 214 | parser.add_argument("--path", type=str, default='totalcov_gpt-3.5-turbo.jsonl') 215 | parser.add_argument("--ks", type=int, nargs='+', default=[1, 2, 5]) 216 | return parser.parse_args() 217 | 218 | 219 | if __name__=='__main__': 220 | args=parse_args() 221 | print(args.path) 222 | print(args.ks) 223 | output_dir = Path('predictions') 224 | predictions=read_jsonl(output_dir / args.path) 225 | print(len(predictions)) 226 | 227 | check_correctness(predictions, 
ks=args.ks) 228 | -------------------------------------------------------------------------------- /eval_pathcov.py: -------------------------------------------------------------------------------- 1 | import os 2 | import signal 3 | import random 4 | random.seed(42) 5 | import shutil 6 | import time 7 | import re 8 | from pathlib import Path 9 | from tqdm import tqdm 10 | from argparse import ArgumentParser 11 | from copy import deepcopy 12 | from data_utils import read_jsonl 13 | 14 | class TimeoutHandler: 15 | def __init__(self, timeout, error_message=None): 16 | self.timeout = timeout 17 | self.error_message = error_message 18 | 19 | def __enter__(self): 20 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 21 | signal.alarm(self.timeout) 22 | 23 | def __exit__(self, type, value, traceback): 24 | signal.alarm(0) 25 | 26 | def raise_timeout(self, *args): 27 | raise TimeoutError(self.error_message) 28 | 29 | 30 | def execute(test_code,timeout=5): 31 | """try to execute test code""" 32 | try: 33 | exec_globals = {} 34 | with TimeoutHandler(timeout): 35 | exec(test_code, globals()) #add globals() to avoid name errors related to import 36 | return True 37 | except AssertionError: #assertionerror is considered as executable 38 | return True 39 | except TimeoutError: 40 | #print("timed out") 41 | return False 42 | except Exception as e: 43 | #print(f"failed: {type(e).__name__}") 44 | return type(e).__name__, e #return error type and error message 45 | 46 | 47 | 48 | def parse_args(): 49 | parser = ArgumentParser() 50 | parser.add_argument("--path", type=str, default='pathcov_gpt-3.5-turbo.jsonl') 51 | return parser.parse_args() 52 | 53 | 54 | def match_path(generated_path, ref_path): 55 | """Compute path similarity based on longest common subsequence. 
56 | Return similarity: len(lcs(generated_path, ref_path))/len(ref_path)""" 57 | ref_len=len(ref_path) 58 | generated_len=len(generated_path) 59 | 60 | dp = [[0]*(ref_len+1) for _ in range(generated_len+1)] 61 | max_length = 0 62 | for i in range(1, generated_len+1): 63 | for j in range(1, ref_len+1): 64 | if generated_path[i-1] == ref_path[j-1]: 65 | dp[i][j] = dp[i-1][j-1] + 1 66 | max_length = max(max_length, dp[i][j]) 67 | similarity = max_length / ref_len 68 | return similarity 69 | 70 | 71 | 72 | def eval_correctness(generated_data): 73 | """Compute syntactical and execution correctness (with coverage).""" 74 | ref_dataset=read_jsonl('data/tgt_paths.jsonl') 75 | instrumented_dataset=read_jsonl('data/leetcode-py-instrumented.jsonl') 76 | total_cases=0 77 | total_syn_correct=0 78 | total_exec_correct=0 79 | syn_failed=0 80 | 81 | exec_fails=[] 82 | 83 | total_path_match=0 84 | total_path_similarity=0 #similarity score: based on longest common subsequence 85 | remove_pattern=re.compile(r'tmp*') 86 | 87 | for i, data in tqdm(enumerate(generated_data)): 88 | task_num=data['task_num'] 89 | difficulty=data['difficulty'] 90 | func_name=data['func_name'] 91 | code=data['code'] 92 | task_title=instrumented_dataset[i]['task_title'] 93 | instrumented_code=instrumented_dataset[i]['python_solution_instrumented'] 94 | test_cases=data['tests'] 95 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 96 | test_import_simple=f'from under_test import Solution\n' 97 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 98 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test into tmp files 99 | f.write(instrumented_code) 100 | passed_tests=[] 101 | 102 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 103 | os.makedirs('test_logs',exist_ok=True) 104 | for j, testcase in enumerate(test_cases): 105 | with open(f'test_logs/{task_title}.log', 'w') as f: 106 | f.write('') #add empty log file 107 | total_cases+=1 108 | try: 109 | res=compile(testcase,'','exec') #check syntax correctness 110 | total_syn_correct+=1 111 | 112 | test_code=test_import+testcase+f'\ntest_{func_name}()' 113 | time.sleep(0.01) 114 | res=execute(test_code) 115 | if res==True: 116 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 117 | print('func under test not called') 118 | exec_fails.append({'task':task_num,'test_num':j,'error':'not called'}) 119 | else: #sucussfully execution, start calculating path coverage 120 | total_exec_correct+=1 121 | ref_path=ref_dataset[i]['sampled_paths'][j] 122 | #print(ref_path) 123 | with open(f'test_logs/{task_title}.log') as f: 124 | lines=f.readlines() 125 | generated_path=tuple(lines) 126 | print(len(generated_path)) 127 | path_sim=match_path(generated_path, ref_path) 128 | print(ref_path, path_sim) 129 | if path_sim==1: 130 | total_path_match+=1 131 | total_path_similarity+=path_sim 132 | passed_tests.append({'path': f'test_{j}.py', 'pass': True}) 133 | else: 134 | exec_fails.append({'task':task_num,'test_num':j,'error':res}) 135 | #print(res) 136 | #print(test_code) 137 | passed_tests.append({'path': f'test_{j}.py', 'pass': False}) 138 | except: 139 | syn_failed+=1 140 | print('syntax error') 141 | print(testcase) 142 | passed_tests.append({'path': f'test_{j}.py', 'pass': False}) 143 | pass 144 | #print(passed_tests) 145 | os.chdir('..') #exit tmp_ folder 146 | 147 | for dirpath, dirnames, 
filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 148 | # Filter dirnames based on the regex pattern 149 | for dirname in dirnames: 150 | if remove_pattern.match(dirname): 151 | shutil.rmtree(dirname) 152 | 153 | syn_correct=total_syn_correct/total_cases 154 | exec_correct=total_exec_correct/total_cases 155 | print(f'Syntax Correctness: {syn_correct}') 156 | print(f'Executable Correctness: {exec_correct}') 157 | 158 | path_exactmatch_acc=total_path_match/total_cases 159 | path_similarity_score=total_path_similarity/total_cases 160 | print('path exact match accuracy:', path_exactmatch_acc) 161 | print('path similarity score:', path_similarity_score) 162 | 163 | 164 | 165 | if __name__=='__main__': 166 | args=parse_args() 167 | print(args.path) 168 | output_dir = Path('predictions') 169 | predictions=read_jsonl(output_dir / args.path) 170 | print(len(predictions)) 171 | 172 | eval_correctness(predictions) 173 | -------------------------------------------------------------------------------- /eval_pathcov_base.py: -------------------------------------------------------------------------------- 1 | import os 2 | import signal 3 | import random 4 | random.seed(42) 5 | import shutil 6 | import time 7 | import re 8 | from pathlib import Path 9 | from tqdm import tqdm 10 | from argparse import ArgumentParser 11 | from copy import deepcopy 12 | from data_utils import read_jsonl 13 | 14 | class TimeoutHandler: 15 | def __init__(self, timeout, error_message=None): 16 | self.timeout = timeout 17 | self.error_message = error_message 18 | 19 | def __enter__(self): 20 | signal.signal(signal.SIGALRM, self.raise_timeout) #SIGALRM only support unix 21 | signal.alarm(self.timeout) 22 | 23 | def __exit__(self, type, value, traceback): 24 | signal.alarm(0) 25 | 26 | def raise_timeout(self, *args): 27 | raise TimeoutError(self.error_message) 28 | 29 | 30 | def execute(test_code,timeout=5): 31 | """try to execute test code""" 32 | try: 33 | exec_globals = {} 34 | with TimeoutHandler(timeout): 35 | #with time_limit(timeout): 36 | exec(test_code, globals()) #add globals() to avoid name errors related to import 37 | return True 38 | except AssertionError: #assertionerror is considered as executable 39 | return True 40 | except TimeoutError: 41 | #print("timed out") 42 | return False 43 | except Exception as e: 44 | #print(f"failed: {type(e).__name__}") 45 | return type(e).__name__, e #return error type and error message 46 | 47 | 48 | 49 | def parse_args(): 50 | parser = ArgumentParser() 51 | parser.add_argument("--path", type=str, default='pathcov_gpt-3.5-turboformat.jsonl') 52 | return parser.parse_args() 53 | 54 | 55 | def match_path(generated_path, ref_path): 56 | """Compute path similarity based on longest common subsequence. 
57 | Return similarity: len(lcs(generated_path, ref_path))/len(ref_path)""" 58 | ref_len=len(ref_path) 59 | generated_len=len(generated_path) 60 | 61 | dp = [[0]*(ref_len+1) for _ in range(generated_len+1)] 62 | max_length = 0 63 | for i in range(1, generated_len+1): 64 | for j in range(1, ref_len+1): 65 | if generated_path[i-1] == ref_path[j-1]: 66 | dp[i][j] = dp[i-1][j-1] + 1 67 | max_length = max(max_length, dp[i][j]) 68 | similarity = max_length / ref_len 69 | return similarity 70 | 71 | 72 | 73 | def eval_correctness(generated_data): 74 | """Compute syntactical and execution correctness (with coverage).""" 75 | ref_dataset=read_jsonl('data/tgt_paths.jsonl') 76 | instrumented_dataset=read_jsonl('data/leetcode-py-instrumented.jsonl') 77 | total_cases=0 78 | total_paths=0 79 | total_syn_correct=0 80 | total_exec_correct=0 81 | syn_failed=0 82 | 83 | exec_fails=[] 84 | 85 | total_path_match=0 86 | total_path_similarity=0 #similarity score: based on longest common subsequence 87 | remove_pattern=re.compile(r'tmp*') 88 | 89 | for i, data in tqdm(enumerate(generated_data)): 90 | task_num=data['task_num'] 91 | difficulty=data['difficulty'] 92 | func_name=data['func_name'] 93 | task_title=instrumented_dataset[i]['task_title'] 94 | instrumented_code=instrumented_dataset[i]['python_solution_instrumented'] 95 | test_cases=data['tests'] 96 | baseline_test=test_cases[0] #use the first generated test in overall coverage as the baseline 97 | test_import=f'from tmp_{i}_{difficulty}.under_test import Solution\n' 98 | test_import_simple=f'from under_test import Solution\n' 99 | os.makedirs(f'tmp_{i}_{difficulty}',exist_ok=True) #create different tmp folders for different problems to avoid conflicts 100 | with open(f'tmp_{i}_{difficulty}/under_test.py','w') as f: #write program under test into tmp files 101 | f.write(instrumented_code) 102 | passed_tests=[] 103 | 104 | os.chdir(f'tmp_{i}_{difficulty}') #enter tmp_ folder for testing 105 | os.makedirs('test_logs',exist_ok=True) 106 | 107 | total_cases+=1 108 | total_paths+=len(ref_dataset[i]['sampled_paths']) 109 | with open(f'test_logs/{task_title}.log', 'w') as f: 110 | f.write('') #add empty log file 111 | try: 112 | res=compile(baseline_test,'','exec') #check syntax correctness 113 | total_syn_correct+=1 114 | 115 | test_code=test_import+baseline_test+f'\ntest_{func_name}()' 116 | time.sleep(0.01) 117 | res=execute(test_code) 118 | if res==True: 119 | if test_code.find(f'solution.{func_name}')==-1: #if the function under test is not called, also consider as failed 120 | print('func under test not called') 121 | exec_fails.append({'task':task_num,'test_num':j,'error':'not called'}) 122 | else: #sucussfully execution, start calculating path coverage 123 | total_exec_correct+=1 124 | with open(f'test_logs/{task_title}.log') as f: 125 | lines=f.readlines() 126 | print(lines) 127 | generated_path=tuple(lines) 128 | for j in range(len(ref_dataset[i]['sampled_paths'])): 129 | #total_paths+=1 #total paths should be added before execution 130 | ref_path=ref_dataset[i]['sampled_paths'][j] 131 | #print(ref_path) 132 | 133 | path_sim=match_path(generated_path, ref_path) 134 | print(generated_path, ref_path, path_sim) 135 | if path_sim==1: 136 | total_path_match+=1 137 | total_path_similarity+=path_sim 138 | 139 | passed_tests.append({'path': f'test.py', 'pass': True}) 140 | else: 141 | exec_fails.append({'task':task_num,'test_num':j,'error':res}) 142 | #print(res) 143 | #print(test_code) 144 | passed_tests.append({'path': f'test.py', 'pass': False}) 145 | 
except: 146 | syn_failed+=1 147 | #print('syntax error') 148 | passed_tests.append({'path': f'test.py', 'pass': False}) 149 | pass 150 | #print(passed_tests) 151 | os.chdir('..') #exit tmp_ folder 152 | 153 | for dirpath, dirnames, filenames in os.walk('./', topdown=False): #execute() runs too fast, remove dirs at last 154 | # Filter dirnames based on the regex pattern 155 | for dirname in dirnames: 156 | if remove_pattern.match(dirname): 157 | shutil.rmtree(dirname) 158 | 159 | syn_correct=total_syn_correct/total_cases 160 | exec_correct=total_exec_correct/total_cases 161 | print(f'Syntax Correctness: {syn_correct}') 162 | print(f'Executable Correctness: {exec_correct}') 163 | 164 | print(total_exec_correct,total_cases, total_path_match,total_paths) 165 | path_exactmatch_acc=total_path_match/total_paths 166 | path_similarity_score=total_path_similarity/total_paths 167 | print('path exact match accuracy:', path_exactmatch_acc) 168 | print('path similarity score:', path_similarity_score) 169 | 170 | 171 | 172 | if __name__=='__main__': 173 | args=parse_args() 174 | print(args.path) 175 | output_dir = Path('predictions') 176 | predictions=read_jsonl(output_dir / args.path) 177 | print(len(predictions)) 178 | 179 | eval_correctness(predictions) 180 | -------------------------------------------------------------------------------- /format.py: -------------------------------------------------------------------------------- 1 | import os 2 | import textwrap 3 | import ast 4 | from pathlib import Path 5 | from argparse import ArgumentParser 6 | 7 | from data_utils import read_jsonl, write_jsonl 8 | 9 | 10 | def change_function_name(code, new_name): 11 | try: 12 | # Parse the code into an AST 13 | tree = ast.parse(code) 14 | 15 | # Find the first function definition and change its name 16 | for node in ast.walk(tree): 17 | if isinstance(node, ast.FunctionDef): 18 | if node.name!=new_name: 19 | node.name = new_name 20 | break 21 | else: 22 | break 23 | 24 | # Convert the modified AST back to code 25 | new_code = ast.unparse(tree) 26 | return new_code 27 | except Exception as e: #cannoot parse 28 | return code 29 | 30 | 31 | def reformat_case_byrules(testcase, func_name, lang='python'): 32 | if testcase.startswith(' '): #remove extra indents (encountered in codellama, mistral-7b starts with one space...) 33 | testcase=textwrap.dedent(testcase) 34 | lines=testcase.split('\n') 35 | 36 | if lang=='python': 37 | last_line=lines[-1] #if last line is not complete (due to token limit), remove it 38 | last_line=textwrap.dedent(last_line) 39 | try: 40 | compile(last_line,'','exec') 41 | except: 42 | #print('imcomplete last line, remove it', last_line) 43 | lines=lines[:-1] #last line cannot compile 44 | 45 | testcase='\n'.join(lines) 46 | testcase=change_function_name(testcase, func_name) 47 | return testcase 48 | 49 | 50 | def remove_extra(testcase, func_name, lang='python'): 51 | """Remove extra test inputs and natural language descriptions before and after the test method. 52 | Only keep the contents between def test() and solution.{func_name}""" 53 | lines=testcase.split('\n') 54 | func_startline=0 #the line when test function starts (def test....) 
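    # The two scans below locate the `def test...` header and the first call to
    # `solution.{func_name}`; only that slice of the generated text is kept, so any
    # natural-language preamble or trailing explanation from the model is dropped.
    # Note that the function returns right after this slice is built, so the
    # duplicate-input pruning branch further down is never reached.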
55 | for i in range(len(lines)): 56 | if lines[i].find('def test')>=0: 57 | func_startline=i 58 | break 59 | test_endline=len(lines) 60 | for i in range(len(lines)): 61 | if lines[i].find(f'solution.{func_name}')>=0: #first call to the function under test 62 | test_endline=i+1 63 | break 64 | new_testcase='\n'.join(lines[func_startline:test_endline]) 65 | return new_testcase 66 | 67 | num_inputs=testcase.count(f'solution.{func_name}') 68 | if num_inputs>1: 69 | lines=testcase.split('\n') 70 | new_testcase=[] 71 | for line in lines: 72 | new_testcase.append(line) 73 | if line.find(f'solution.{func_name}')>=0: #discard statements after the first test input 74 | break 75 | new_testcase='\n'.join(new_testcase) 76 | return new_testcase 77 | else: 78 | return testcase 79 | 80 | 81 | def reformat_line(datapath,newpath): 82 | data=read_jsonl(datapath) 83 | formatted_data=[] 84 | for e in data: 85 | code=e['code'] 86 | func_name=e['func_name'] 87 | test_funcname=f'test_{func_name}' 88 | #print(code) 89 | tests=e['tests'] 90 | #formated_tests=[] 91 | for lineno in tests: 92 | testcase=tests[lineno] 93 | print(testcase) 94 | testcase=remove_extra(testcase, func_name) 95 | reformatted_testcase=reformat_case_byrules(testcase, test_funcname, 'python') 96 | #print('------') 97 | print(reformatted_testcase) 98 | print('<---------------------->') 99 | tests[lineno]=reformatted_testcase 100 | e['tests']=tests 101 | 102 | formatted_data.append(e) 103 | write_jsonl(formatted_data, newpath) 104 | 105 | 106 | def reformat_branch(datapath,newpath): 107 | data=read_jsonl(datapath) 108 | formatted_data=[] 109 | for e in data: 110 | code=e['code'] 111 | func_name=e['func_name'] 112 | test_funcname=f'test_{func_name}' 113 | #print(code) 114 | tests=e['tests'] 115 | formated_tests=[] 116 | for branch in tests: 117 | testcase=branch['test'] 118 | print(testcase) 119 | testcase=remove_extra(testcase, func_name) 120 | reformatted_testcase=reformat_case_byrules(testcase, test_funcname, 'python') 121 | #print('------') 122 | print(reformatted_testcase) 123 | print('<---------------------->') 124 | branch['test']=reformatted_testcase 125 | formated_tests.append(branch) 126 | e['tests']=formated_tests 127 | 128 | formatted_data.append(e) 129 | write_jsonl(formatted_data, newpath) 130 | 131 | 132 | def reformat_cov(datapath,newpath): 133 | data=read_jsonl(datapath) 134 | formatted_data=[] 135 | for e in data: 136 | #print(code) 137 | func_name=e['func_name'] 138 | test_funcname=f'test_{func_name}' 139 | formatted_test_cases=[] 140 | testcases=e['tests'] 141 | for testcase in testcases: 142 | print(testcase) 143 | extracted_testcase=remove_extra(testcase, func_name) 144 | #if extracted_testcase!=testcase: 145 | #print(testcase) 146 | #print('----') 147 | #print(extracted_testcase) 148 | reformatted_testcase=reformat_case_byrules(extracted_testcase, test_funcname, 'python') 149 | print('------') 150 | print(reformatted_testcase) 151 | print('<---------------------->') 152 | formatted_test_cases.append(reformatted_testcase) 153 | e['tests']=formatted_test_cases 154 | 155 | formatted_data.append(e) 156 | write_jsonl(formatted_data, newpath) 157 | 158 | 159 | def parse_args(): 160 | parser = ArgumentParser() 161 | parser.add_argument("--path", type=str, default='Linecov_python_gemini-1.0-pro-latest.jsonl') 162 | parser.add_argument("--mode", type=str, default='overall', choices=['line', 'branch', 'overall']) 163 | return parser.parse_args() 164 | 165 | 166 | if __name__=='__main__': 167 | args=parse_args() 168 | print('generated 
answers:', args.path) 169 | print('coverage mode:', args.mode) 170 | output_dir = Path('predictions') 171 | finename,ext=os.path.splitext(args.path) 172 | newpath=f'{finename}format{ext}' 173 | print(newpath) 174 | if args.mode=='line': 175 | print('reformat line coverage') 176 | reformat_line(output_dir / args.path, output_dir / newpath) 177 | elif args.mode=='overall': 178 | print('reformat overall coverage') 179 | reformat_cov(output_dir / args.path, output_dir / newpath) 180 | elif args.mode=='branch': 181 | print('reformat branch coverage') 182 | reformat_branch(output_dir / args.path, output_dir / newpath) 183 | -------------------------------------------------------------------------------- /gen_linecov_cot_hf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import transformers 3 | import torch 4 | import textwrap 5 | from argparse import ArgumentParser 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | 9 | from transformers import LlamaForCausalLM, CodeLlamaTokenizer, AutoTokenizer, AutoModelForCausalLM 10 | from transformers import pipeline 11 | access_token=os.getenv("HUGGINGFACE_TOKEN") 12 | 13 | from data_utils import read_jsonl, write_jsonl, add_lineno, add_lineno_comment 14 | 15 | def parse_args(): 16 | parser = ArgumentParser() 17 | parser.add_argument("--dataset", type=str, default='leetcode') 18 | parser.add_argument("--lang", type=str, default='python', choices=['python', 'java', 'c++']) 19 | parser.add_argument("--model", type=str, default='meta-llama/Meta-Llama-3-8B-Instruct') 20 | parser.add_argument("--covmode", type=str, default='line', choices=['line', 'branch'], help='cover targets at line level or branch level') 21 | parser.add_argument("--max_tokens", type=int, default=1024) 22 | parser.add_argument("--temperature", type=float, default=1e-5) 23 | return parser.parse_args() 24 | 25 | 26 | #models do not support system message 27 | models_nosys=['google/gemma-1.1-7b-it', 28 | 'bigcode/starcoder2-15b-instruct-v0.1', 29 | 'mistralai/Mistral-7B-Instruct-v0.3'] 30 | 31 | 32 | def generate_completion(args, generator, prompt, system_message=''): 33 | if args.model in models_nosys: #models don't support system message 34 | messages=[{"role": "user", "content": system_message+prompt}] 35 | else: 36 | messages=[ 37 | {"role": "system", "content": system_message}, 38 | {"role": "user", "content": prompt}, 39 | ] 40 | prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 41 | 42 | generated=generator(prompt, 43 | max_new_tokens=args.max_tokens, 44 | temperature=args.temperature, 45 | return_full_text=False) 46 | generated_text=generated[0]['generated_text'] 47 | return generated_text 48 | 49 | 50 | if __name__=='__main__': 51 | args=parse_args() 52 | model_abbrv=args.model.split('/')[-1] 53 | print('Model:', model_abbrv) 54 | print('task:', args.covmode) 55 | output_dir = Path('predictions') 56 | 57 | prompt_template_cond=open('prompt/line_oneshot_gencond.txt').read() 58 | prompt_template_test=open('prompt/line_oneshot_gentest.txt').read() 59 | system_template=open('prompt/system_exec.txt').read() 60 | system_message=system_template 61 | 62 | dataset=read_jsonl('data/leetcode-py-all.jsonl') 63 | 64 | model = AutoModelForCausalLM.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 65 | tokenizer = AutoTokenizer.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, 
trust_remote_code=True, device_map='auto') 66 | generator = pipeline("text-generation",model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16, device_map='auto', token=access_token) 67 | 68 | data_size=len(dataset) 69 | #data_size=50 70 | testing_results=[] 71 | for i in tqdm(range(data_size)): 72 | data=dataset[i] 73 | func_name=data['func_name'] 74 | desc=data['description'] 75 | code=data['python_solution'] 76 | difficulty=data['difficulty'] 77 | #code_withlineno=add_lineno(code) 78 | code_withlineno=add_lineno_comment(code) 79 | #print(code_withlineno) 80 | 81 | #generate test case 82 | target_lines=data['target_lines'] 83 | tests={} 84 | conds={} #store generated conditions 85 | print(data['task_num'],target_lines) 86 | 87 | for lineno in target_lines: #line number to be tested 88 | code_lines=code.split('\n') 89 | target_line=code_lines[lineno-1] 90 | target_line_withlineno=f'{lineno}: {target_line}' 91 | 92 | code_input=code_withlineno 93 | line_input=target_line_withlineno 94 | 95 | prompt_cond=prompt_template_cond.format(program=code_input, targetline=lineno) 96 | generated_cond=generate_completion(args,generator,prompt_cond,system_message) 97 | prompt_test=prompt_template_test.format(func_name=func_name, program=code_input, conditions=generated_cond) 98 | generated_test=generate_completion(args,generator,prompt_test,system_message) 99 | print(generated_cond) 100 | print('--------') 101 | print(generated_test) 102 | print('<--------------------------------------->') 103 | tests[lineno]=generated_test 104 | conds[lineno]=generated_cond 105 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests, 'conditions':conds} 106 | 107 | testing_results.append(testing_data) 108 | print('<<<<----------------------------------------->>>>') 109 | write_jsonl(testing_results, output_dir / f'linecov2_{model_abbrv}_temp.jsonl') 110 | write_jsonl(testing_results, output_dir / f'linecov2_{model_abbrv}_1shot.jsonl') 111 | -------------------------------------------------------------------------------- /gen_linecov_cot_openai.py: -------------------------------------------------------------------------------- 1 | import os 2 | from argparse import ArgumentParser 3 | from tqdm import tqdm 4 | import openai 5 | from openai import OpenAI 6 | openai.api_key=os.getenv("OPENAI_API_KEY") 7 | client=OpenAI(api_key=openai.api_key) 8 | from pathlib import Path 9 | from data_utils import read_jsonl, write_jsonl, add_lineno, add_lineno_comment 10 | 11 | 12 | def parse_args(): 13 | parser = ArgumentParser() 14 | parser.add_argument("--dataset", type=str, default='leetcode') 15 | parser.add_argument("--model", type=str, default='gpt-3.5-turbo') 16 | 17 | parser.add_argument("--max_tokens", type=int, default=1024) 18 | parser.add_argument("--temperature", type=float, default=0) 19 | return parser.parse_args() 20 | 21 | 22 | def generate_completion(args,prompt,system_message=''): 23 | response = client.chat.completions.create( 24 | model=args.model, 25 | messages=[ 26 | {"role": "system", "content": system_message}, 27 | {"role": "user", "content": prompt}, 28 | ], 29 | temperature=args.temperature, 30 | max_tokens=args.max_tokens 31 | ) 32 | code_output=response.choices[0].message.content 33 | return code_output 34 | 35 | 36 | def generate_twostep(args,prompt_cond, prompt_test,system_message=''): 37 | messages=[ 38 | {"role": "system", "content": system_message}, 39 | {"role": "user", "content": prompt_cond}, 40 | ] 41 | 
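    # Two chat completions are chained: the first turn asks the model for the
    # conditions needed to reach the target line; its reply is appended to the
    # conversation as an assistant message, and a second turn then asks for a
    # test case that satisfies those conditions.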
response = client.chat.completions.create( 42 | model=args.model, 43 | messages=messages, 44 | temperature=args.temperature, 45 | max_tokens=args.max_tokens 46 | ) 47 | cond=response.choices[0].message.content 48 | print(cond) 49 | print('---------------------------------') 50 | 51 | messages.append({"role": "assistant", "content": cond}) 52 | messages.append({"role": "user", "content": prompt_test}) 53 | response = client.chat.completions.create( 54 | model=args.model, 55 | messages=messages, 56 | temperature=args.temperature, 57 | max_tokens=args.max_tokens 58 | ) 59 | generated_test=response.choices[0].message.content 60 | print(generated_test) 61 | return cond, generated_test 62 | 63 | 64 | if __name__=='__main__': 65 | args=parse_args() 66 | print('Model:', args.model) 67 | output_dir = Path('predictions') 68 | 69 | #two steps reasoning: generate conditions, then generate a test that satisfies the conditions 70 | prompt_template_cond=open('prompt/line_oneshot_gencond.txt').read() 71 | prompt_template_test=open('prompt/line_oneshot_gentest.txt').read() 72 | system_template=open('prompt/system_exec.txt').read() 73 | system_message=system_template 74 | 75 | dataset=read_jsonl('data/leetcode-py-all.jsonl') 76 | 77 | data_size=len(dataset) 78 | #data_size=50 79 | testing_results=[] 80 | for i in tqdm(range(data_size)): 81 | data=dataset[i] 82 | func_name=data['func_name'] 83 | desc=data['description'] 84 | code=data['python_solution'] 85 | difficulty=data['difficulty'] 86 | #code_withlineno=add_lineno(code) 87 | code_withlineno=add_lineno_comment(code) 88 | #print(code_withlineno) 89 | 90 | #generate test case 91 | target_lines=data['target_lines'] 92 | tests={} 93 | conds={} #store generated conditions 94 | print(data['task_num'],target_lines) 95 | 96 | for lineno in target_lines: #line number to be tested 97 | code_lines=code.split('\n') 98 | target_line=code_lines[lineno-1] 99 | target_line_withlineno=f'{lineno}: {target_line}' 100 | 101 | code_input=code_withlineno 102 | line_input=target_line_withlineno 103 | 104 | prompt_cond=prompt_template_cond.format(program=code_input, targetline=lineno) 105 | generated_cond=generate_completion(args,prompt_cond,system_message) 106 | prompt_test=prompt_template_test.format(func_name=func_name, program=code_input, conditions=generated_cond) 107 | generated_test=generate_completion(args,prompt_test,system_message) 108 | print(generated_cond) 109 | print('--------') 110 | print(generated_test) 111 | print('<--------------------------------------->') 112 | tests[lineno]=generated_test 113 | conds[lineno]=generated_cond 114 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests, 'conditions':conds} 115 | 116 | 117 | testing_results.append(testing_data) 118 | print('<<<<----------------------------------------->>>>') 119 | write_jsonl(testing_results, output_dir / f'linecov2_{args.model}_temp.jsonl') 120 | write_jsonl(testing_results, output_dir / f'linecov2_{args.model}_1shot.jsonl') 121 | -------------------------------------------------------------------------------- /generate_cov_gemini.py: -------------------------------------------------------------------------------- 1 | import google.generativeai as genai 2 | from google.generativeai import GenerationConfig 3 | import os 4 | import time 5 | from pathlib import Path 6 | from argparse import ArgumentParser 7 | from tqdm import tqdm 8 | 9 | api_key=os.getenv("GOOGLE_API_KEY") 10 | 11 | from data_utils 
import read_jsonl, write_jsonl, add_lineno 12 | 13 | genai.configure(api_key=api_key) 14 | 15 | def parse_args(): 16 | parser = ArgumentParser() 17 | parser.add_argument("--dataset", type=str, default='leetcode') 18 | parser.add_argument("--lang", type=str, default='python') 19 | parser.add_argument("--model", type=str, default='models/gemini-1.0-pro-latest', choices=['models/gemini-1.0-pro-latest', 'models/gemini-1.5-pro-latest', 'models/gemini-1.5-flash-latest']) 20 | parser.add_argument("--num_tests", type=int, default=20, help='number of tests generated per program') 21 | parser.add_argument("--temperature", type=float, default=0) 22 | parser.add_argument("--max_tokens", type=int, default=256) 23 | return parser.parse_args() 24 | 25 | 26 | def testgeneration_multiround(args, model, prompt): 27 | """generate test cases with multi-round conversation, each time generate one test case""" 28 | template_append="Generate another test method for the function under test. Your answer must be different from previously-generated test cases, and should cover different statements and branches." 29 | generated_tests=[] 30 | 31 | for i in range(args.num_tests): 32 | generated=model.generate_content(prompt, generation_config=generation_config) 33 | if generated.candidates[0].finish_reason==1: #normal stop 34 | generated_test=generated.text 35 | else: #max_token, safety, ... 36 | generated_test='' 37 | print(generated_test) 38 | 39 | test_append=f'''Generated test: 40 | {generated_test}''' 41 | prompt+=test_append 42 | prompt+='\n' 43 | prompt+=template_append 44 | 45 | generated_tests.append(generated_test) 46 | 47 | return generated_tests 48 | 49 | 50 | if __name__=='__main__': 51 | args=parse_args() 52 | print('Model:', args.model) 53 | model_abbrv=args.model.split('/')[-1] 54 | model = genai.GenerativeModel(args.model) 55 | #print(model) 56 | output_dir = Path('predictions') 57 | 58 | dataset=read_jsonl('LC_data/leetcode-bench-py.jsonl') 59 | 60 | prompt_template=open('prompt/template_base.txt').read() 61 | system_template=open('prompt/system.txt').read() 62 | system_message=system_template.format(lang='python') 63 | 64 | generation_config = GenerationConfig( 65 | candidate_count=1, 66 | max_output_tokens=args.max_tokens, 67 | temperature=args.temperature 68 | ) 69 | 70 | data_size=len(dataset) 71 | testing_results=[] 72 | for i in tqdm(range(data_size)): 73 | data=dataset[i] 74 | func_name=data['func_name'] 75 | desc=data['description'] 76 | 77 | code=data['python_solution'] 78 | difficulty=data['difficulty'] 79 | code_withlineno=add_lineno(code) 80 | target_lines=data['target_lines'] 81 | 82 | #generate test cases 83 | prompt=prompt_template.format(lang='python', program=code, description=desc, func_name=func_name) 84 | prompt=system_message+prompt 85 | generated_tests=testgeneration_multiround(args,model,prompt) 86 | 87 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_tests} 88 | testing_results.append(testing_data) 89 | print('<<<<----------------------------------------->>>>') 90 | write_jsonl(testing_results, output_dir / f'totalcov_{model_abbrv}_temp.jsonl') 91 | 92 | write_jsonl(testing_results, output_dir / f'totalcov_{model_abbrv}.jsonl') 93 | -------------------------------------------------------------------------------- /generate_cov_hf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import transformers 3 | import torch 4 | from 
pathlib import Path 5 | from argparse import ArgumentParser 6 | from tqdm import tqdm 7 | 8 | from transformers import LlamaForCausalLM, CodeLlamaTokenizer, AutoTokenizer, AutoModelForCausalLM 9 | from transformers import pipeline 10 | access_token=os.getenv("HUGGINGFACE_TOKEN") 11 | 12 | from data_utils import read_jsonl, write_jsonl, add_lineno 13 | 14 | 15 | def parse_args(): 16 | parser = ArgumentParser() 17 | parser.add_argument("--dataset", type=str, default='leetcode') 18 | parser.add_argument("--model", type=str, default='codellama/CodeLlama-7b-Instruct-hf') 19 | parser.add_argument("--num_tests", type=int, default=10, help='number of tests generated per program') 20 | parser.add_argument("--temperature", type=float, default=1e-5) 21 | parser.add_argument("--max_tokens", type=int, default=256) 22 | return parser.parse_args() 23 | 24 | model_list=['codellama/CodeLlama-7b-Instruct-hf','codellama/CodeLlama-13b-Instruct-hf','codellama/CodeLlama-34b-Instruct-hf', 25 | 'meta-llama/Meta-Llama-3-8B-Instruct', 26 | 'bigcode/starcoder2-15b-instruct-v0.1', 27 | 'google/gemma-1.1-2b-it', 'google/gemma-1.1-7b-it', 28 | 'google/codegemma-7b-it', 29 | 'deepseek-ai/deepseek-coder-1.3b-instruct', 'deepseek-ai/deepseek-coder-6.7b-instruct', 30 | 'deepseek-ai/deepseek-coder-33b-instruct', 31 | 'mistralai/Mistral-7B-Instruct-v0.2', 'mistralai/Mistral-7B-Instruct-v0.3', 32 | 'Qwen/CodeQwen1.5-7B-Chat' 33 | ] 34 | 35 | #models do not support system message 36 | models_nosys=['google/gemma-1.1-7b-it', 37 | 'bigcode/starcoder2-15b-instruct-v0.1', 38 | 'mistralai/Mistral-7B-Instruct-v0.3'] 39 | 40 | 41 | def testgeneration_multiround(args,prompt,system_message=''): 42 | """generate test cases with multi-round conversation, each time generate one test case""" 43 | template_append="Generate another test method for the function under test. Your answer must be different from previously-generated test cases, and should cover different statements and branches."
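    # Multi-round generation: the chat history grows across iterations, so each
    # round the full message list is re-rendered with the tokenizer's chat
    # template, one new test is sampled, and the reply plus the follow-up
    # instruction above are appended before the next round.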
44 | generated_tests=[] 45 | 46 | if args.model in models_nosys: #models don't support system message 47 | messages=[{"role": "user", "content": system_message+prompt}] 48 | else: 49 | messages=[ 50 | {"role": "system", "content": system_message}, 51 | {"role": "user", "content": prompt}, 52 | ] 53 | 54 | for i in range(args.num_tests): 55 | prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 56 | generated=generator(prompt, 57 | max_new_tokens=args.max_tokens, 58 | temperature=args.temperature, 59 | return_full_text=False) 60 | 61 | generated_test=generated[0]['generated_text'] 62 | print(generated_test) 63 | 64 | messages.append({"role": "assistant", "content": generated_test}) 65 | messages.append({"role": "user", "content": template_append}) 66 | 67 | 68 | generated_tests.append(generated_test) 69 | return generated_tests 70 | 71 | 72 | if __name__=='__main__': 73 | args=parse_args() 74 | model_abbrv=args.model.split('/')[-1] 75 | print('Model:', model_abbrv) 76 | output_dir = Path('predictions') 77 | 78 | dataset=read_jsonl('data/leetcode-py.jsonl') 79 | 80 | prompt_template=open('prompt/template_base.txt').read() 81 | system_template=open('prompt/system.txt').read() 82 | system_message=system_template.format(lang='python') 83 | 84 | model = AutoModelForCausalLM.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 85 | tokenizer = AutoTokenizer.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 86 | generator = pipeline("text-generation",model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16, device_map='auto', token=access_token) 87 | 88 | data_size=len(dataset) 89 | testing_results=[] 90 | print('number of samples:',len(dataset)) 91 | 92 | for i in tqdm(range(data_size)): 93 | data=dataset[i] 94 | func_name=data['func_name'] 95 | desc=data['description'] 96 | code=data['python_solution'] 97 | difficulty=data['difficulty'] 98 | code_withlineno=add_lineno(code) 99 | target_lines=data['target_lines'] 100 | 101 | prompt=prompt_template.format(lang='python', program=code, description=desc, func_name=func_name) 102 | generated_tests=testgeneration_multiround(args,prompt,system_message) 103 | 104 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_tests} 105 | testing_results.append(testing_data) 106 | print('<<<<----------------------------------------->>>>') 107 | write_jsonl(testing_results,output_dir / f'totalcov_{model_abbrv}_temp.jsonl') 108 | 109 | write_jsonl(testing_results,output_dir / f'totalcov_{model_abbrv}.jsonl') 110 | -------------------------------------------------------------------------------- /generate_cov_openai.py: -------------------------------------------------------------------------------- 1 | #baseline for targeted line coverage: not providing the target line number 2 | import os 3 | from pathlib import Path 4 | from argparse import ArgumentParser 5 | from tqdm import tqdm 6 | import openai 7 | from openai import OpenAI 8 | openai.api_key=os.getenv("OPENAI_API_KEY") #personal key 9 | client=OpenAI(api_key=openai.api_key) 10 | 11 | from data_utils import read_jsonl, write_jsonl, add_lineno 12 | 13 | 14 | def parse_args(): 15 | parser = ArgumentParser() 16 | parser.add_argument("--dataset", type=str, default='leetcode') 17 | parser.add_argument("--lang", type=str, 
default='python') 18 | parser.add_argument("--model", type=str, default='gpt-3.5-turbo', choices=['gpt-3.5-turbo', 'gpt-4', 'gpt-4-turbo', 'gpt-4o']) 19 | parser.add_argument("--num_tests", type=int, default=10, help='number of tests generated per program') 20 | parser.add_argument("--temperature", type=float, default=0) 21 | parser.add_argument("--max_tokens", type=int, default=256) 22 | return parser.parse_args() 23 | 24 | 25 | def generate_completion(args,prompt,system_message=''): 26 | response = client.chat.completions.create( 27 | model=args.model, 28 | messages=[ 29 | {"role": "system", "content": system_message}, 30 | {"role": "user", "content": prompt}, 31 | ], 32 | temperature=args.temperature, 33 | max_tokens=args.max_tokens 34 | ) 35 | code_output=response.choices[0].message.content 36 | return code_output 37 | 38 | 39 | def testgeneration_multiround(args,prompt,system_message=''): 40 | """generate test cases with multi-round conversation, each time generate one test case""" 41 | template_append="Generate another test method for the function under test. Your answer must be different from previously-generated test cases, and should cover different statements and branches." 42 | generated_tests=[] 43 | messages=[ 44 | {"role": "system", "content": system_message}, 45 | {"role": "user", "content": prompt}, 46 | ] 47 | for i in range(args.num_tests): 48 | response = client.chat.completions.create( 49 | model=args.model, 50 | messages=messages, 51 | temperature=args.temperature, 52 | max_tokens=args.max_tokens 53 | ) 54 | generated_test=response.choices[0].message.content 55 | messages.append({"role": "assistant", "content": generated_test}) 56 | messages.append({"role": "user", "content": template_append}) 57 | 58 | generated_tests.append(generated_test) 59 | print(generated_test) 60 | 61 | return generated_tests 62 | 63 | 64 | lang_exts={'python':'py', 'java':'java', 'c++':'cpp'} 65 | 66 | 67 | if __name__=='__main__': 68 | args=parse_args() 69 | print('Model:', args.model) 70 | output_dir = Path('predictions') 71 | 72 | dataset=read_jsonl('data/leetcode-py.jsonl') 73 | 74 | prompt_template=open('prompt/template_base.txt').read() 75 | system_template=open('prompt/system.txt').read() 76 | system_message=system_template.format(lang='python') 77 | 78 | data_size=len(dataset) 79 | 80 | testing_results=[] 81 | for i in tqdm(range(data_size)): 82 | data=dataset[i] 83 | func_name=data['func_name'] 84 | desc=data['description'] 85 | code=data['python_solution'] 86 | difficulty=data['difficulty'] 87 | code_withlineno=add_lineno(code) 88 | target_lines=data['target_lines'] 89 | 90 | #generate test case 91 | prompt=prompt_template.format(lang='python', program=code, description=desc, func_name=func_name) 92 | generated_tests=testgeneration_multiround(args,prompt,system_message) 93 | 94 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_tests} 95 | testing_results.append(testing_data) 96 | print('<<<<----------------------------------------->>>>') 97 | write_jsonl(testing_results, output_dir / f'totalcov_{args.model}_temp.jsonl') 98 | 99 | write_jsonl(testing_results, output_dir / f'totalcov_{args.model}.jsonl') 100 | -------------------------------------------------------------------------------- /generate_pathcov_gemini.py: -------------------------------------------------------------------------------- 1 | import google.generativeai as genai 2 | from google.generativeai import 
GenerationConfig 3 | 4 | import os 5 | import time 6 | 7 | api_key=os.getenv("GOOGLE_API_KEY") 8 | from pathlib import Path 9 | from argparse import ArgumentParser 10 | from tqdm import tqdm 11 | 12 | from data_utils import read_jsonl, write_jsonl, add_lineno 13 | from prompt_utils import generate_path 14 | 15 | genai.configure(api_key=api_key) 16 | 17 | def parse_args(): 18 | parser = ArgumentParser() 19 | parser.add_argument("--dataset", type=str, default='leetcode') 20 | parser.add_argument("--lang", type=str, default='python') 21 | parser.add_argument("--model", type=str, default='models/gemini-1.5-flash-latest', choices=['models/gemini-1.0-pro-latest', 'models/gemini-1.5-pro-latest', 'models/gemini-1.5-flash-latest']) 22 | parser.add_argument("--temperature", type=float, default=0) 23 | parser.add_argument("--max_tokens", type=int, default=256) 24 | return parser.parse_args() 25 | 26 | 27 | if __name__=='__main__': 28 | args=parse_args() 29 | print('Model:', args.model) 30 | model_abbrv=args.model.split('/')[-1] 31 | model = genai.GenerativeModel(args.model) 32 | output_dir = Path('predictions') 33 | 34 | prompt_template=open('prompt/template_path.txt').read() 35 | system_template=open('prompt/system.txt').read() 36 | system_message=system_template.format(lang='python') 37 | 38 | generation_config = GenerationConfig( 39 | candidate_count=1, 40 | max_output_tokens=args.max_tokens, 41 | temperature=args.temperature 42 | ) 43 | 44 | dataset=read_jsonl('data/leetcode-py-instrumented.jsonl') 45 | path_dataset=read_jsonl('data/tgt_paths.jsonl') 46 | data_size=len(dataset) 47 | testing_results=[] 48 | 49 | for i in tqdm(range(data_size)): 50 | data=dataset[i] 51 | func_name=data['func_name'] 52 | desc=data['description'] 53 | code=data['python_solution'] 54 | difficulty=data['difficulty'] 55 | code_withlineno=add_lineno(code) 56 | log_paths=path_dataset[i]['sampled_paths'] 57 | condition_paths=path_dataset[i]['sampled_condition_paths'] 58 | generated_path_tests=[] 59 | for j in range(len(log_paths)): 60 | log_path=log_paths[j] 61 | condition_path=condition_paths[j] 62 | #print(log_path, condition_path) 63 | 64 | path_prompt=generate_path(condition_path) 65 | prompt=prompt_template.format(func_name=func_name, description=desc, program=code_withlineno, path=path_prompt) 66 | prompt=system_message+prompt 67 | 68 | generated=model.generate_content(prompt, generation_config=generation_config) 69 | if generated.candidates[0].finish_reason==1: #normal stop 70 | generated_test=generated.text 71 | else: #max_token, safety, ... 
72 | generated_test='' 73 | print(generated_test) 74 | generated_path_tests.append(generated_test) 75 | 76 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_path_tests} 77 | testing_results.append(testing_data) 78 | 79 | write_jsonl(testing_results, output_dir / f'pathcov_{model_abbrv}.jsonl') 80 | -------------------------------------------------------------------------------- /generate_pathcov_hf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import transformers 3 | import torch 4 | import textwrap 5 | from pathlib import Path 6 | from argparse import ArgumentParser 7 | from tqdm import tqdm 8 | 9 | from transformers import LlamaForCausalLM, CodeLlamaTokenizer, AutoTokenizer, AutoModelForCausalLM 10 | from transformers import pipeline 11 | access_token=os.getenv("HUGGINGFACE_TOKEN") 12 | 13 | from data_utils import read_jsonl, write_jsonl, add_lineno 14 | from prompt_utils import generate_path 15 | 16 | 17 | def parse_args(): 18 | parser = ArgumentParser() 19 | parser.add_argument("--dataset", type=str, default='leetcode') 20 | parser.add_argument("--lang", type=str, default='python', choices=['python', 'java', 'c++']) 21 | parser.add_argument("--model", type=str, default='codellama/CodeLlama-7b-Instruct-hf') 22 | parser.add_argument("--temperature", type=float, default=1e-5) 23 | parser.add_argument("--max_tokens", type=int, default=256) 24 | return parser.parse_args() 25 | 26 | 27 | model_list=['codellama/CodeLlama-7b-Instruct-hf','codellama/CodeLlama-13b-Instruct-hf','codellama/CodeLlama-34b-Instruct-hf', 28 | 'meta-llama/Meta-Llama-3-8B-Instruct', 29 | 'bigcode/starcoder2-15b-instruct-v0.1', 30 | 'google/gemma-1.1-7b-it' 31 | 'deepseek-ai/deepseek-coder-1.3b-instruct', 'deepseek-ai/deepseek-coder-6.7b-instruct', 32 | 'deepseek-ai/deepseek-coder-33b-instruct', 33 | 'mistralai/Mistral-7B-Instruct-v0.3' 34 | ] 35 | 36 | #models do not support system message 37 | models_nosys=['google/gemma-1.1-7b-it', 38 | 'bigcode/starcoder2-15b-instruct-v0.1', 39 | 'mistralai/Mistral-7B-Instruct-v0.3'] 40 | 41 | 42 | if __name__=='__main__': 43 | args=parse_args() 44 | model_abbrv=args.model.split('/')[-1] 45 | print('Model:', model_abbrv) 46 | output_dir = Path('predictions') 47 | 48 | prompt_template=open('prompt/template_path.txt').read() 49 | system_template=open('prompt/system.txt').read() 50 | system_message=system_template.format(lang='python') 51 | 52 | dataset=read_jsonl('data/leetcode-py-instrumented.jsonl') 53 | path_dataset=read_jsonl('data/tgt_paths.jsonl') 54 | data_size=len(dataset) 55 | testing_results=[] 56 | 57 | model = AutoModelForCausalLM.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 58 | tokenizer = AutoTokenizer.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 59 | generator = pipeline("text-generation",model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16, device_map='auto', token=access_token) 60 | 61 | for i in tqdm(range(data_size)): 62 | data=dataset[i] 63 | func_name=data['func_name'] 64 | desc=data['description'] 65 | code=data['python_solution'] 66 | difficulty=data['difficulty'] 67 | code_withlineno=add_lineno(code) 68 | log_paths=path_dataset[i]['sampled_paths'] 69 | condition_paths=path_dataset[i]['sampled_condition_paths'] 70 | generated_path_tests=[] 71 | for j in 
range(len(log_paths)): 72 | log_path=log_paths[j] 73 | condition_path=condition_paths[j] 74 | #print(log_path, condition_path) 75 | path_prompt=generate_path(condition_path) 76 | 77 | prompt=prompt_template.format(func_name=func_name, description=desc, program=code_withlineno, path=path_prompt) 78 | if args.model in models_nosys: #models don't support system message 79 | messages=[{"role": "user", "content": system_message+prompt}] 80 | else: 81 | messages=[ 82 | {"role": "system", "content": system_message}, 83 | {"role": "user", "content": prompt}, 84 | ] 85 | prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 86 | generated=generator(prompt, 87 | max_new_tokens=args.max_tokens, 88 | temperature=args.temperature, 89 | return_full_text=False) 90 | generated_test=generated[0]['generated_text'] 91 | if generated_test.startswith(' '): #remove extra indents (encountered in codellama) 92 | generated_test=textwrap.dedent(generated_test) 93 | print(generated_test) 94 | generated_path_tests.append(generated_test) 95 | 96 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_path_tests} 97 | testing_results.append(testing_data) 98 | 99 | write_jsonl(testing_results, output_dir / f'pathcov_{model_abbrv}.jsonl') 100 | -------------------------------------------------------------------------------- /generate_pathcov_openai.py: -------------------------------------------------------------------------------- 1 | import os 2 | from argparse import ArgumentParser 3 | from tqdm import tqdm 4 | import openai 5 | from openai import OpenAI 6 | openai.api_key=os.getenv("OPENAI_API_KEY") #personal key 7 | client=OpenAI(api_key=openai.api_key) 8 | from pathlib import Path 9 | from data_utils import read_jsonl, write_jsonl, add_lineno 10 | from prompt_utils import generate_path 11 | 12 | def parse_args(): 13 | parser = ArgumentParser() 14 | parser.add_argument("--dataset", type=str, default='leetcode') 15 | parser.add_argument("--lang", type=str, default='python') 16 | parser.add_argument("--model", type=str, default='gpt-3.5-turbo', choices=['gpt-3.5-turbo', 'gpt-4', 'gpt-4-turbo', 'gpt-4o']) 17 | parser.add_argument("--max_tokens", type=int, default=256) 18 | parser.add_argument("--temperature", type=float, default=0) 19 | return parser.parse_args() 20 | 21 | 22 | def generate_completion(args,prompt,system_message=''): 23 | response = client.chat.completions.create( 24 | model=args.model, 25 | messages=[ 26 | {"role": "system", "content": system_message}, 27 | {"role": "user", "content": prompt}, 28 | ], 29 | temperature=args.temperature, 30 | max_tokens=args.max_tokens 31 | ) 32 | code_output=response.choices[0].message.content 33 | return code_output 34 | 35 | 36 | if __name__=='__main__': 37 | args=parse_args() 38 | print('Model:', args.model) 39 | output_dir = Path('predictions') 40 | 41 | prompt_template=open('prompt/template_path.txt').read() 42 | system_template=open('prompt/system.txt').read() 43 | system_message=system_template.format(lang='python') 44 | 45 | dataset=read_jsonl('data/leetcode-py-instrumented.jsonl') 46 | path_dataset=read_jsonl('data/tgt_paths.jsonl') 47 | data_size=len(dataset) 48 | testing_results=[] 49 | 50 | for i in tqdm(range(data_size)): 51 | data=dataset[i] 52 | func_name=data['func_name'] 53 | desc=data['description'] 54 | code=data['python_solution'] 55 | difficulty=data['difficulty'] 56 | code_withlineno=add_lineno(code) 57 | 
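        # For each sampled target path, the human-readable condition sequence is
        # rendered as "'cond1' -> 'cond2' -> ..." via generate_path() and substituted
        # into the {path} slot of prompt/template_path.txt before querying the model.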
58 | log_paths=path_dataset[i]['sampled_paths'] 59 | condition_paths=path_dataset[i]['sampled_condition_paths'] 60 | generated_path_tests=[] 61 | for j in range(len(log_paths)): 62 | log_path=log_paths[j] 63 | condition_path=condition_paths[j] 64 | #print(log_path, condition_path) 65 | path_prompt=generate_path(condition_path) 66 | 67 | prompt=prompt_template.format(func_name=func_name, description=desc, program=code_withlineno, path=path_prompt) 68 | 69 | generated_test=generate_completion(args,prompt,system_message) 70 | print(generated_test) 71 | generated_path_tests.append(generated_test) 72 | 73 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':generated_path_tests} 74 | testing_results.append(testing_data) 75 | 76 | write_jsonl(testing_results, output_dir / f'pathcov_{args.model}.jsonl') 77 | -------------------------------------------------------------------------------- /generate_targetcov_gemini.py: -------------------------------------------------------------------------------- 1 | import google.generativeai as genai 2 | from google.generativeai import GenerationConfig 3 | from google.api_core.exceptions import InternalServerError, ResourceExhausted 4 | 5 | import os 6 | import time 7 | 8 | api_key=os.getenv("GOOGLE_API_KEY") 9 | from argparse import ArgumentParser 10 | from tqdm import tqdm 11 | from pathlib import Path 12 | 13 | from data_utils import read_jsonl, write_jsonl, add_lineno 14 | 15 | genai.configure(api_key=api_key) 16 | 17 | def parse_args(): 18 | parser = ArgumentParser() 19 | parser.add_argument("--dataset", type=str, default='leetcode') 20 | parser.add_argument("--lang", type=str, default='python') 21 | parser.add_argument("--model", type=str, default='models/gemini-1.0-pro-latest', choices=['models/gemini-1.0-pro-latest', 'models/gemini-1.5-pro-latest']) 22 | parser.add_argument("--covmode", type=str, default='line', choices=['line', 'branch'], help='cover targets at line level or branch level') 23 | parser.add_argument("--max_tokens", type=int, default=256) 24 | parser.add_argument("--temperature", type=float, default=0) 25 | return parser.parse_args() 26 | 27 | 28 | if __name__=='__main__': 29 | args=parse_args() 30 | print('Model:', args.model) 31 | model_abbrv=args.model.split('/')[-1] 32 | model = genai.GenerativeModel(args.model) 33 | print('task:', args.covmode) 34 | output_dir = Path('predictions') 35 | 36 | prompt_template=open('prompt/template_line.txt').read() 37 | prompt_template_branch=open('prompt/template_branch.txt').read() 38 | system_template=open('prompt/system.txt').read() 39 | system_message=system_template.format(lang='python') 40 | 41 | dataset=read_jsonl('data/leetcode-py.jsonl') 42 | 43 | generation_config = GenerationConfig( 44 | candidate_count=1, 45 | max_output_tokens=args.max_tokens, 46 | temperature=args.temperature 47 | ) 48 | 49 | data_size=len(dataset) 50 | testing_results=[] 51 | for i in tqdm(range(data_size)): 52 | data=dataset[i] 53 | func_name=data['func_name'] 54 | desc=data['description'] 55 | code=data['python_solution'] 56 | difficulty=data['difficulty'] 57 | code_withlineno=add_lineno(code) 58 | 59 | #generate test case 60 | if args.covmode=='line': 61 | target_lines=data['target_lines'] 62 | tests={} 63 | print(data['task_num'],target_lines) 64 | for lineno in target_lines: #line number to be tested 65 | code_lines=code.split('\n') 66 | target_line=code_lines[lineno-1] 67 | target_line_withlineno=f'{lineno}: {target_line}' 68 
| 69 | code_input=code_withlineno 70 | line_input=target_line_withlineno 71 | 72 | prompt=prompt_template.format(lang='python', program=code_input, description=desc, func_name=func_name, lineno=line_input) 73 | prompt=system_message+prompt 74 | 75 | generated=model.generate_content(prompt, generation_config=generation_config) 76 | 77 | if generated.candidates[0].finish_reason==1: #normal stop 78 | generated_test=generated.text 79 | else: #max_token, safety, ... 80 | generated_test='' 81 | 82 | print(generated_test) 83 | tests[lineno]=generated_test 84 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests} 85 | 86 | elif args.covmode=='branch': 87 | tests_branch=[] 88 | print(data['task_num']) 89 | branches=data['blocks'] 90 | for branch in branches: 91 | print(branch) 92 | startline=branch['start'] 93 | endline=branch['end'] 94 | 95 | code_input=code_withlineno 96 | split_lines=code_withlineno.split('\n') 97 | target_lines=split_lines[startline-1:endline] 98 | target_branch_withlineno='\n'.join(target_lines) 99 | branch_input="\n'''\n"+target_branch_withlineno+"\n'''" 100 | 101 | prompt=prompt_template_branch.format(lang='python', program=code_input, description=desc, func_name=func_name, branch=branch_input) 102 | prompt=system_message+prompt 103 | generated=model.generate_content(prompt, generation_config=generation_config) 104 | if generated.candidates[0].finish_reason==1: #normal stop 105 | generated_test=generated.text 106 | else: #max_token, safety, ... 107 | generated_test='' 108 | print(generated_test) 109 | generatedtest_branch={'start':startline,'end':endline,'test':generated_test} 110 | tests_branch.append(generatedtest_branch) 111 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests_branch} 112 | 113 | testing_results.append(testing_data) 114 | 115 | print('<<<<----------------------------------------->>>>') 116 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{model_abbrv}_temp.jsonl') 117 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{model_abbrv}.jsonl') 118 | -------------------------------------------------------------------------------- /generate_targetcov_hf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import transformers 3 | import torch 4 | import textwrap 5 | from argparse import ArgumentParser 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | 9 | from transformers import LlamaForCausalLM, CodeLlamaTokenizer, AutoTokenizer, AutoModelForCausalLM 10 | from transformers import pipeline 11 | access_token=os.getenv("HUGGINGFACE_TOKEN") 12 | 13 | from data_utils import read_jsonl, write_jsonl, add_lineno 14 | 15 | def parse_args(): 16 | parser = ArgumentParser() 17 | parser.add_argument("--dataset", type=str, default='leetcode') 18 | parser.add_argument("--lang", type=str, default='python', choices=['python', 'java', 'c++']) 19 | parser.add_argument("--model", type=str, default='meta-llama/Meta-Llama-3-8B-Instruct') 20 | parser.add_argument("--covmode", type=str, default='line', choices=['line', 'branch'], help='cover targets at line level or branch level') 21 | parser.add_argument("--max_tokens", type=int, default=256) 22 | parser.add_argument("--temperature", type=float, default=1e-5) 23 | return parser.parse_args() 24 | 25 | 
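# model_list below appears to be a reference list of the chat checkpoints evaluated
# with this script (it is not read anywhere else in the file); models_nosys marks the
# models whose chat templates do not accept a separate system turn, for which the
# system prompt is prepended to the user message instead.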
model_list=['codellama/CodeLlama-7b-Instruct-hf','codellama/CodeLlama-13b-Instruct-hf','codellama/CodeLlama-34b-Instruct-hf', 26 | 'meta-llama/Meta-Llama-3-8B-Instruct', 27 | 'bigcode/starcoder2-15b-instruct-v0.1', 28 | 'google/gemma-1.1-7b-it' 29 | 'deepseek-ai/deepseek-coder-1.3b-instruct', 'deepseek-ai/deepseek-coder-6.7b-instruct', 30 | 'deepseek-ai/deepseek-coder-33b-instruct', 31 | 'mistralai/Mistral-7B-Instruct-v0.3' 32 | ] 33 | 34 | #models do not support system message 35 | models_nosys=['google/gemma-1.1-7b-it', 36 | 'bigcode/starcoder2-15b-instruct-v0.1', 37 | 'mistralai/Mistral-7B-Instruct-v0.3'] 38 | 39 | 40 | if __name__=='__main__': 41 | args=parse_args() 42 | model_abbrv=args.model.split('/')[-1] 43 | print('Model:', model_abbrv) 44 | print('task:', args.covmode) 45 | output_dir = Path('predictions') 46 | 47 | prompt_template=open('prompt/template_line.txt').read() 48 | prompt_template_branch=open('prompt/template_branch.txt').read() 49 | system_template=open('prompt/system.txt').read() 50 | system_message=system_template.format(lang='python') 51 | 52 | dataset=read_jsonl('data/leetcode-py.jsonl') 53 | 54 | model = AutoModelForCausalLM.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 55 | tokenizer = AutoTokenizer.from_pretrained(args.model, token=access_token, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto') 56 | generator = pipeline("text-generation",model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16, device_map='auto', token=access_token) 57 | 58 | data_size=len(dataset) 59 | testing_results=[] 60 | print('number of samples:',len(dataset)) 61 | 62 | for i in tqdm(range(data_size)): 63 | data=dataset[i] 64 | func_name=data['func_name'] 65 | desc=data['description'] 66 | code=data['python_solution'] 67 | difficulty=data['difficulty'] 68 | code_withlineno=add_lineno(code) 69 | 70 | #generate test case 71 | if args.covmode=='line': 72 | target_lines=data['target_lines'] 73 | tests={} 74 | print(data['task_num'],target_lines) 75 | for lineno in target_lines: #line number to be tested 76 | code_lines=code.split('\n') 77 | target_line=code_lines[lineno-1] 78 | target_line_withlineno=f'{lineno}: {target_line}' 79 | 80 | code_input=code_withlineno 81 | line_input=target_line_withlineno 82 | 83 | prompt=prompt_template.format(lang='python', program=code_input, description=desc, func_name=func_name, lineno=line_input) 84 | if args.model in models_nosys: #models don't support system message 85 | messages=[{"role": "user", "content": system_message+prompt}] 86 | else: 87 | messages=[ 88 | {"role": "system", "content": system_message}, 89 | {"role": "user", "content": prompt}, 90 | ] 91 | prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 92 | 93 | generated=generator(prompt, 94 | max_new_tokens=args.max_tokens, 95 | temperature=args.temperature, 96 | return_full_text=False) 97 | generated_test=generated[0]['generated_text'] 98 | if generated_test.startswith(' '): #remove extra indents (encountered in codellama) 99 | generated_test=textwrap.dedent(generated_test) 100 | print(generated_test) 101 | tests[lineno]=generated_test 102 | print('----------') 103 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests} 104 | 105 | elif args.covmode=='branch': 106 | tests_branch=[] 107 | print(data['task_num']) 108 | branches=data['blocks'] 109 | for 
branch in branches: 110 | print(branch) 111 | startline=branch['start'] 112 | endline=branch['end'] 113 | 114 | code_input=code_withlineno 115 | 116 | split_lines=code_withlineno.split('\n') 117 | target_lines=split_lines[startline-1:endline] 118 | target_branch_withlineno='\n'.join(target_lines) 119 | branch_input="\n'''\n"+target_branch_withlineno+"\n'''" 120 | 121 | prompt=prompt_template_branch.format(lang='python', program=code_input, description=desc, func_name=func_name, branch=branch_input) 122 | if args.model in models_nosys: #models don't support system message 123 | messages=[{"role": "user", "content": system_message+prompt}] 124 | else: 125 | messages=[ 126 | {"role": "system", "content": system_message}, 127 | {"role": "user", "content": prompt}, 128 | ] 129 | prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 130 | 131 | generated=generator(prompt, 132 | max_new_tokens=args.max_tokens, 133 | temperature=args.temperature, 134 | return_full_text=False) 135 | generated_test=generated[0]['generated_text'] 136 | if generated_test.startswith(' '): #remove extra indents (encountered in codellama) 137 | generated_test=textwrap.dedent(generated_test) 138 | print(generated_test) 139 | generatedtest_branch={'start':startline,'end':endline,'test':generated_test} 140 | tests_branch.append(generatedtest_branch) 141 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests_branch} 142 | 143 | testing_results.append(testing_data) 144 | print('<<<<----------------------------------------->>>>') 145 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{model_abbrv}_temp.jsonl') 146 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{model_abbrv}.jsonl') 147 | -------------------------------------------------------------------------------- /generate_targetcov_openai.py: -------------------------------------------------------------------------------- 1 | import os 2 | from argparse import ArgumentParser 3 | from tqdm import tqdm 4 | import openai 5 | from openai import OpenAI 6 | openai.api_key=os.getenv("OPENAI_API_KEY") 7 | client=OpenAI(api_key=openai.api_key) 8 | from pathlib import Path 9 | from data_utils import read_jsonl, write_jsonl, add_lineno 10 | 11 | 12 | def parse_args(): 13 | parser = ArgumentParser() 14 | parser.add_argument("--dataset", type=str, default='leetcode') 15 | parser.add_argument("--model", type=str, default='gpt-3.5-turbo', choices=['gpt-3.5-turbo', 'gpt-4', 'gpt-4-turbo', 'gpt-4o']) 16 | parser.add_argument("--covmode", type=str, default='line', choices=['line', 'branch'], help='cover targets at line level or branch level') 17 | parser.add_argument("--max_tokens", type=int, default=256) 18 | parser.add_argument("--temperature", type=float, default=0) 19 | return parser.parse_args() 20 | 21 | 22 | def generate_completion(args,prompt,system_message=''): 23 | response = client.chat.completions.create( 24 | model=args.model, 25 | messages=[ 26 | {"role": "system", "content": system_message}, 27 | {"role": "user", "content": prompt}, 28 | ], 29 | temperature=args.temperature, 30 | max_tokens=args.max_tokens 31 | ) 32 | code_output=response.choices[0].message.content 33 | return code_output 34 | 35 | 36 | if __name__=='__main__': 37 | args=parse_args() 38 | print('Model:', args.model) 39 | print('task:', args.covmode) 40 | output_dir = Path('predictions') 41 | 42 | 
prompt_template=open('prompt/template_line.txt').read() 43 | prompt_template_branch=open('prompt/template_branch.txt').read() 44 | system_template=open('prompt/system.txt').read() 45 | system_message=system_template.format(lang='python') 46 | 47 | dataset=read_jsonl('data/leetcode-py.jsonl') 48 | 49 | data_size=len(dataset) 50 | testing_results=[] 51 | for i in tqdm(range(data_size)): 52 | data=dataset[i] 53 | func_name=data['func_name'] 54 | desc=data['description'] 55 | code=data['python_solution'] 56 | difficulty=data['difficulty'] 57 | code_withlineno=add_lineno(code) 58 | 59 | #generate test case 60 | if args.covmode=='line': 61 | target_lines=data['target_lines'] 62 | tests={} 63 | print(data['task_num'],target_lines) 64 | 65 | for lineno in target_lines: #line number to be tested 66 | code_lines=code.split('\n') 67 | target_line=code_lines[lineno-1] 68 | target_line_withlineno=f'{lineno}: {target_line}' 69 | 70 | code_input=code_withlineno 71 | line_input=target_line_withlineno 72 | 73 | prompt=prompt_template.format(lang='python', program=code_input, description=desc, func_name=func_name, lineno=line_input) 74 | 75 | generated_test=generate_completion(args,prompt,system_message) 76 | print(generated_test) 77 | tests[lineno]=generated_test 78 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests} 79 | 80 | elif args.covmode=='branch': 81 | tests_branch=[] 82 | print(data['task_num']) 83 | branches=data['blocks'] 84 | for branch in branches: 85 | print(branch) 86 | startline=branch['start'] 87 | endline=branch['end'] 88 | 89 | code_input=code_withlineno 90 | 91 | split_lines=code_withlineno.split('\n') 92 | target_lines=split_lines[startline-1:endline] 93 | target_branch_withlineno='\n'.join(target_lines) 94 | branch_input="\n'''\n"+target_branch_withlineno+"\n'''" 95 | 96 | prompt=prompt_template_branch.format(lang='python', program=code_input, description=desc, func_name=func_name, branch=branch_input) 97 | 98 | generated_test=generate_completion(args,prompt,system_message) 99 | print(generated_test) 100 | generatedtest_branch={'start':startline,'end':endline,'test':generated_test} 101 | tests_branch.append(generatedtest_branch) 102 | testing_data={'task_num':data['task_num'],'task_title':data['task_title'],'func_name':func_name,'difficulty':difficulty,'code':code,'tests':tests_branch} 103 | 104 | testing_results.append(testing_data) 105 | print('<<<<----------------------------------------->>>>') 106 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{args.model}_temp.jsonl') 107 | write_jsonl(testing_results, output_dir / f'{args.covmode}cov_{args.model}.jsonl') 108 | -------------------------------------------------------------------------------- /prompt/line_oneshot_gencond.txt: -------------------------------------------------------------------------------- 1 | Given a Python code snippet and a target line number, you are asked to generate reasoning steps to satisfy a specific line to be executed. 
2 | [Example] 3 | Given the following code snippet: 4 | ```Python 5 | class Solution: #1 6 | def twoSum(self, nums: List[int], target: int) -> List[int]: #2 7 | numMap = {{}} #3 8 | n = len(nums) #4 9 | #5 10 | for i in range(n): #6 11 | numMap[nums[i]] = i #7 12 | #8 13 | for i in range(n): #9 14 | complement = target - nums[i] #10 15 | if complement in numMap and numMap[complement] != i: #11 16 | return [i, numMap[complement]] #12 17 | #13 18 | return [] #14 19 | ``` 20 | Identify, when executing function twoSum, what conditions need to be satisfied if line 12 is to be executed. 21 | 22 | Answer: 23 | <cond> 24 | Step 1: If line 12 `return [i, numMap[complement]]` is executed, then the `if` condition `(complement in numMap and numMap[complement] != i)` in line 11 should be true. 25 | Step 2: If condition `complement in numMap` is true, at least one `target - nums[i]` in line 10 equals an element in nums, which means there exist two elements in `nums` whose sum is equal to `target`. 26 | Step 3: If condition `numMap[complement] != i` is true, then `numMap[target - nums[i]] != i`, meaning that the index of `target - nums[i]` is not equal to `i`. 27 | Step 4: Generally, to execute line 12, we need to ensure that there exist two different elements in `nums` whose sum is equal to `target`. 28 | <\cond> 29 | [\Example] 30 | 31 | In a similar fashion, identify the conditions that need to be satisfied when line {targetline} is to be executed for the following Python code. 32 | ```Python 33 | {program} 34 | ``` 35 | Surround your answer with <cond> and <\cond>. 36 | -------------------------------------------------------------------------------- /prompt/line_oneshot_gentest.txt: -------------------------------------------------------------------------------- 1 | For the given code snippet and a list of conditions that need to be satisfied, generate a test case that will satisfy these conditions. Here is an example: 2 | [Example] 3 | Code: 4 | ```Python 5 | class Solution: #1 6 | def twoSum(self, nums: List[int], target: int) -> List[int]: #2 7 | numMap = {{}} #3 8 | n = len(nums) #4 9 | #5 10 | for i in range(n): #6 11 | numMap[nums[i]] = i #7 12 | #8 13 | for i in range(n): #9 14 | complement = target - nums[i] #10 15 | if complement in numMap and numMap[complement] != i: #11 16 | return [i, numMap[complement]] #12 17 | #13 18 | return [] #14 19 | ``` 20 | Conditions: 21 | Step 1: If line 12 `return [i, numMap[complement]]` is executed, then the `if` condition `(complement in numMap and numMap[complement] != i)` in line 11 should be true. 22 | Step 2: If condition `complement in numMap` is true, at least one `target - nums[i]` in line 10 equals an element in nums, which means there exist two elements in `nums` whose sum is equal to `target`. 23 | Step 3: If condition `numMap[complement] != i` is true, then `numMap[target - nums[i]] != i`, meaning that the index of `target - nums[i]` is not equal to `i`. 24 | Step 4: Generally, to execute line 12, we need to ensure that there exist two different elements in `nums` whose sum is equal to `target`. 25 | 26 | Generated test case: 27 | ```Python 28 | def test_twoSum(): 29 | solution = Solution() 30 | assert solution.twoSum([2,7,11,15], 9) == [0, 1] 31 | ``` 32 | [\Example] 33 | 34 | In a similar fashion, generate a test case for the following code snippet and conditions. Your test function should be named `test_{func_name}`. 
--------------------------------------------------------------------------------
/prompt/system.txt:
--------------------------------------------------------------------------------
1 | You are a professional who writes {lang} test methods. You always respond with a {lang} function without any natural language descriptions. In particular, your answer should not include any quotation marks or descriptions outside the function definition.
2 | 
--------------------------------------------------------------------------------
/prompt/system_exec.txt:
--------------------------------------------------------------------------------
1 | You are an expert Python programmer who excels at generating test cases and reasoning about program execution behaviors.
2 | 
--------------------------------------------------------------------------------
/prompt/template_base.txt:
--------------------------------------------------------------------------------
1 | Please write a test method for the function '{func_name}' given the following program under test and function description. Your answer should only contain one test input.
2 | 
3 | Program under test:
4 | ----
5 | {program}
6 | ----
7 | 
8 | Function description for '{func_name}':
9 | ----
10 | {description}
11 | ----
12 | 
13 | Your test method should begin with:
14 | def test_{func_name}():
15 |     solution=Solution()
16 | 
--------------------------------------------------------------------------------
/prompt/template_branch.txt:
--------------------------------------------------------------------------------
1 | Please write a test method for the function '{func_name}' given the following program under test and function description. Your answer should only contain one test input.
2 | 
3 | Program under test:
4 | ----
5 | {program}
6 | ----
7 | 
8 | Function description for '{func_name}':
9 | ----
10 | {description}
11 | ----
12 | Your test case must cover the branch {branch}.
13 | Your test method should begin with:
14 | def test_{func_name}():
15 |     solution=Solution()
16 | 
--------------------------------------------------------------------------------
/prompt/template_line.txt:
--------------------------------------------------------------------------------
1 | Please write a test method for the function '{func_name}' given the following program under test and function description. Your answer should only contain one test input.
2 | 
3 | Program under test:
4 | ----
5 | {program}
6 | ----
7 | 
8 | Function description for '{func_name}':
9 | ----
10 | {description}
11 | ----
12 | 
13 | Your test case must cover line {lineno}.
14 | Your test method should begin with:
15 | def test_{func_name}():
16 |     solution=Solution()
17 | 
--------------------------------------------------------------------------------
/prompt/template_path.txt:
--------------------------------------------------------------------------------
1 | Please write a test method for the function '{func_name}' given the following program under test and function description. Your answer should only contain one test input.
2 | 
3 | Program under test:
4 | ----
5 | {program}
6 | ----
7 | 
8 | Function description for '{func_name}':
9 | ----
10 | {description}
11 | ----
12 | 
13 | Your test case must cover the following execution path in function {func_name}. The path is a sequence of branch conditions. When executing your test case, each branch condition in the target execution path must be satisfied sequentially.
14 | Target execution path: {path}
15 | ----
16 | Your test method should begin with:
17 | def test_{func_name}():
18 |     solution=Solution()
19 | 
--------------------------------------------------------------------------------
/prompt_utils.py:
--------------------------------------------------------------------------------
1 | 
2 | def generate_path(pathdata):
3 |     for i in range(len(pathdata)):
4 |         #pathdata[i]=pathdata[i].replace('\n','')
5 |         pathdata[i]=f"'{pathdata[i]}'"
6 |     path_prompt=' -> '.join(pathdata)
7 |     return path_prompt
8 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | openai>=1.17.1
2 | google-generativeai>=0.5.2
3 | transformers>=4.39.3
4 | sortedcontainers
5 | textwrap
6 | pytest-cov>=5.0.0
--------------------------------------------------------------------------------
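For reference, the {path} placeholder in prompt/template_path.txt above is filled with the string produced by generate_path in prompt_utils.py: each branch condition in the sampled path is wrapped in single quotes and the conditions are joined with ' -> '. A small usage sketch follows; the condition strings are made up for illustration, and note that generate_path quotes the elements of its input list in place.

```python
from prompt_utils import generate_path

# Hypothetical branch conditions for one sampled execution path.
sampled_path = [
    'complement in numMap and numMap[complement] != i',
    'not (complement in numMap and numMap[complement] != i)',
]
print(generate_path(sampled_path))
# 'complement in numMap and numMap[complement] != i' -> 'not (complement in numMap and numMap[complement] != i)'
```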