├── .env.example ├── .gitignore ├── requirements.txt ├── images └── agentcoder_pipeline.pdf ├── .gitmodules ├── scripts └── run.sh ├── prompts ├── zero_shot_mbpp_prompt_update.txt ├── zero_shot_test_designer_mbpp_prompt_update.txt ├── zero_shot_test_designer_humaneval_prompt_update.txt ├── zero_shot_humaneval_prompt_update.txt ├── zero_shot_mbpp_prompt.txt ├── zero_shot_humaneval_prompt.txt ├── test_designer_mbpp_prompt_update.txt ├── zero_shot_test_designer_mbpp_prompt.txt ├── zero_shot_test_designer_humaneval_prompt.txt ├── mbpp_prompt_update.txt ├── test_designer_humaneval_prompt_update.txt ├── mbpp_prompt.txt ├── test_designer_mbpp_prompt.txt ├── humaneval_prompt_update.txt ├── test_designer_humaneval_prompt.txt └── humaneval_prompt.txt ├── README.md └── src ├── test_designer_humaneval.py ├── programmer_humaneval.py ├── test_designer_mbpp.py ├── programmer_mbpp.py ├── test_executor_mbpp.py └── test_executor_humaneval.py /.env.example: -------------------------------------------------------------------------------- 1 | OPENAI_API_KEY="YOUR_OPENAI_API_KEY" 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .venv/ 2 | .DS_Store 3 | .env 4 | *.pyc 5 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | datasets==3.3.1 2 | openai==0.28.0 3 | python-dotenv==1.0.1 4 | -------------------------------------------------------------------------------- /images/agentcoder_pipeline.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huangd1999/AgentCoder/HEAD/images/agentcoder_pipeline.pdf -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "CodeGeeX"] 2 | path = CodeGeeX 3 | url = https://github.com/THUDM/CodeGeeX.git 4 | ignore = dirty 5 | -------------------------------------------------------------------------------- /scripts/run.sh: -------------------------------------------------------------------------------- 1 | python programmer_[humaneval/mbpp].py 2 | python test_designer_[humaneval/mbpp].py 3 | python test_executor_[humaneval/mbpp].py -------------------------------------------------------------------------------- /prompts/zero_shot_mbpp_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. -------------------------------------------------------------------------------- /prompts/zero_shot_test_designer_mbpp_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. 
2 | 3 | - The format of test cases should be: 4 | ```python 5 | assert function_name(input) == expected_output, "Test Case Description" 6 | ``` 7 | -------------------------------------------------------------------------------- /prompts/zero_shot_test_designer_humaneval_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. 2 | 3 | - The format of test cases should be: 4 | ```python 5 | assert function_name(input) == expected_output, "Test Case Description" 6 | ``` 7 | -------------------------------------------------------------------------------- /prompts/zero_shot_humaneval_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Code Formatting**: Please write code in 6 | ```python 7 | [Code] 8 | ``` 9 | format. -------------------------------------------------------------------------------- /prompts/zero_shot_mbpp_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Instructions**: 6 | 1. **Understand and Clarify**: Make sure you understand the task. 7 | 2. **Algorithm/Method Selection**: Decide on the most efficient way. 8 | 3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode. 9 | 4. **Code Generation**: Translate your pseudocode into executable Python code. -------------------------------------------------------------------------------- /prompts/zero_shot_humaneval_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Code Formatting**: Please write code in 6 | ```python 7 | [Code] 8 | ``` 9 | format. 10 | 11 | **Instructions**: 12 | 1. **Understand and Clarify**: Make sure you understand the task. 13 | 2. **Algorithm/Method Selection**: Decide on the most efficient way. 14 | 3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode. 15 | 4. **Code Generation**: Translate your pseudocode into executable Python code. -------------------------------------------------------------------------------- /prompts/test_designer_mbpp_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. 2 | 3 | - The format of test cases should be: 4 | ```python 5 | assert function_name(input) == expected_output, "Test Case Description" 6 | ``` 7 | 8 | # For example: 9 | 10 | ## Prompt 1: 11 | ```python 12 | Write a function to find the shared elements from the given two lists. 
13 | ``` 14 | 15 | ## Completion 1: 16 | ```python 17 | assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5)) 18 | assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4)) 19 | assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14)) 20 | ``` 21 | 22 | ## Prompt 2: 23 | ```python 24 | Write a python function to identify non-prime numbers. 25 | ``` 26 | 27 | ## Completion 2: 28 | ```python 29 | assert is_not_prime(2) == False 30 | assert is_not_prime(10) == True 31 | assert is_not_prime(35) == True 32 | assert is_not_prime(37) == False 33 | ``` -------------------------------------------------------------------------------- /prompts/zero_shot_test_designer_mbpp_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. These test cases should encompass Basic, Edge, and Large Scale scenarios to ensure the code's robustness, reliability, and scalability. 2 | 3 | **1. Basic Test Cases**: 4 | - **Objective**: To verify the fundamental functionality of the `has_close_elements` function under normal conditions. 5 | 6 | **2. Edge Test Cases**: 7 | - **Objective**: To evaluate the function's behavior under extreme or unusual conditions. 8 | 9 | **3. Large Scale Test Cases**: 10 | - **Objective**: To assess the function’s performance and scalability with large data samples. 11 | 12 | **Instructions**: 13 | - Implement a comprehensive set of test cases following the guidelines above. 14 | - Ensure each test case is well-documented with comments explaining the scenario it covers. 15 | - Pay special attention to edge cases as they often reveal hidden bugs. 16 | - For large-scale tests, focus on the function's efficiency and performance under heavy loads. 17 | 18 | - The format of test cases should be: 19 | ```python 20 | assert function_name(input) == expected_output, "Test Case Description" 21 | ``` -------------------------------------------------------------------------------- /prompts/zero_shot_test_designer_humaneval_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. These test cases should encompass Basic, Edge, and Large Scale scenarios to ensure the code's robustness, reliability, and scalability. 2 | 3 | **1. Basic Test Cases**: 4 | - **Objective**: To verify the fundamental functionality of the `has_close_elements` function under normal conditions. 5 | 6 | **2. Edge Test Cases**: 7 | - **Objective**: To evaluate the function's behavior under extreme or unusual conditions. 8 | 9 | **3. Large Scale Test Cases**: 10 | - **Objective**: To assess the function’s performance and scalability with large data samples. 11 | 12 | **Instructions**: 13 | - Implement a comprehensive set of test cases following the guidelines above. 14 | - Ensure each test case is well-documented with comments explaining the scenario it covers. 15 | - Pay special attention to edge cases as they often reveal hidden bugs. 16 | - For large-scale tests, focus on the function's efficiency and performance under heavy loads. 
17 | 18 | - The format of test cases should be: 19 | ```python 20 | assert function_name(input) == expected_output, "Test Case Description" 21 | ``` 22 | -------------------------------------------------------------------------------- /prompts/mbpp_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | # For example: 6 | 7 | ## Prompt 1: 8 | ```python 9 | Write a python function to remove first and last occurrence of a given character from the string. 10 | ``` 11 | ## Test Case 1: 12 | ```python 13 | assert remove_Occ("hello","l") == "heo" 14 | assert remove_Occ("abcda","a") == "bcd" 15 | assert remove_Occ("PHP","P") == "H" 16 | ``` 17 | 18 | ## Completion 1: 19 | ```python 20 | def remove_Occ(s,ch): 21 | for i in range(len(s)): 22 | if (s[i] == ch): 23 | s = s[0 : i] + s[i + 1:] 24 | break 25 | for i in range(len(s) - 1,-1,-1): 26 | if (s[i] == ch): 27 | s = s[0 : i] + s[i + 1:] 28 | break 29 | return s 30 | ``` 31 | 32 | ## Prompt 2: 33 | ```python 34 | Write a function to sort a given matrix in ascending order according to the sum of its rows. 35 | ``` 36 | 37 | ## Test Case 1: 38 | ```python 39 | assert sort_matrix([[1, 2, 3], [2, 4, 5], [1, 1, 1]])==[[1, 1, 1], [1, 2, 3], [2, 4, 5]] 40 | assert sort_matrix([[1, 2, 3], [-2, 4, -5], [1, -1, 1]])==[[-2, 4, -5], [1, -1, 1], [1, 2, 3]] 41 | assert sort_matrix([[5,8,9],[6,4,3],[2,1,4]])==[[2, 1, 4], [6, 4, 3], [5, 8, 9]] 42 | ``` 43 | 44 | ## Completion 2: 45 | ```python 46 | def sort_matrix(M): 47 | result = sorted(M, key=sum) 48 | return result 49 | ``` -------------------------------------------------------------------------------- /prompts/test_designer_humaneval_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. 2 | 3 | - The format of test cases should be: 4 | ```python 5 | assert function_name(input) == expected_output, "Test Case Description" 6 | ``` 7 | 8 | # For example: 9 | 10 | ## Prompt 1: 11 | ```python 12 | from typing import List 13 | 14 | 15 | def has_close_elements(numbers: List[float], threshold: float) -> bool: 16 | """ Check if in given list of numbers, are any two numbers closer to each other than 17 | given threshold. 18 | >>> has_close_elements([1.0, 2.0, 3.0], 0.5) 19 | False 20 | >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) 21 | True 22 | """ 23 | 24 | ``` 25 | 26 | ## Completion 1: 27 | ```python 28 | 29 | assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False 30 | assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)== True 31 | 32 | ``` 33 | 34 | ## Prompt 2: 35 | ```python 36 | from typing import List 37 | 38 | 39 | def separate_paren_groups(paren_string: str) -> List[str]: 40 | """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to 41 | separate those group into separate strings and return the list of those. 42 | Separate groups are balanced (each open brace is properly closed) and not nested within each other 43 | Ignore any spaces in the input string. 
44 | >>> separate_paren_groups('( ) (( )) (( )( ))') 45 | ['()', '(())', '(()())'] 46 | """ 47 | 48 | ``` 49 | 50 | ## Completion 2: 51 | ```python 52 | 53 | assert separate_paren_groups('( ) (( )) (( )( ))') == ['()', '(())', '(()())'] 54 | 55 | ``` -------------------------------------------------------------------------------- /prompts/mbpp_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Instructions**: 6 | 1. **Understand and Clarify**: Make sure you understand the task. 7 | 2. **Algorithm/Method Selection**: Decide on the most efficient way. 8 | 3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode. 9 | 4. **Code Generation**: Translate your pseudocode into executable Python code. 10 | 11 | 12 | # For example: 13 | 14 | ## Prompt 1: 15 | ```python 16 | Write a python function to remove first and last occurrence of a given character from the string. 17 | ``` 18 | ## Test Case 1: 19 | ```python 20 | assert remove_Occ("hello","l") == "heo" 21 | assert remove_Occ("abcda","a") == "bcd" 22 | assert remove_Occ("PHP","P") == "H" 23 | ``` 24 | 25 | ## Completion 1: 26 | ```python 27 | def remove_Occ(s,ch): 28 | for i in range(len(s)): 29 | if (s[i] == ch): 30 | s = s[0 : i] + s[i + 1:] 31 | break 32 | for i in range(len(s) - 1,-1,-1): 33 | if (s[i] == ch): 34 | s = s[0 : i] + s[i + 1:] 35 | break 36 | return s 37 | ``` 38 | 39 | ## Prompt 2: 40 | ```python 41 | Write a function to sort a given matrix in ascending order according to the sum of its rows. 42 | ``` 43 | 44 | ## Test Case 1: 45 | ```python 46 | assert sort_matrix([[1, 2, 3], [2, 4, 5], [1, 1, 1]])==[[1, 1, 1], [1, 2, 3], [2, 4, 5]] 47 | assert sort_matrix([[1, 2, 3], [-2, 4, -5], [1, -1, 1]])==[[-2, 4, -5], [1, -1, 1], [1, 2, 3]] 48 | assert sort_matrix([[5,8,9],[6,4,3],[2,1,4]])==[[2, 1, 4], [6, 4, 3], [5, 8, 9]] 49 | ``` 50 | 51 | ## Completion 2: 52 | ```python 53 | def sort_matrix(M): 54 | result = sorted(M, key=sum) 55 | return result 56 | ``` -------------------------------------------------------------------------------- /prompts/test_designer_mbpp_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. These test cases should encompass Basic, Edge, and Large Scale scenarios to ensure the code's robustness, reliability, and scalability. 2 | 3 | **1. Basic Test Cases**: 4 | - **Objective**: To verify the fundamental functionality of the `has_close_elements` function under normal conditions. 5 | 6 | **2. Edge Test Cases**: 7 | - **Objective**: To evaluate the function's behavior under extreme or unusual conditions. 8 | 9 | **3. Large Scale Test Cases**: 10 | - **Objective**: To assess the function’s performance and scalability with large data samples. 11 | 12 | **Instructions**: 13 | - Implement a comprehensive set of test cases following the guidelines above. 14 | - Ensure each test case is well-documented with comments explaining the scenario it covers. 15 | - Pay special attention to edge cases as they often reveal hidden bugs. 16 | - For large-scale tests, focus on the function's efficiency and performance under heavy loads. 
17 | 18 | - The format of test cases should be: 19 | ```python 20 | assert function_name(input) == expected_output, "Test Case Description" 21 | ``` 22 | 23 | # For example: 24 | 25 | ## Prompt 1: 26 | ```python 27 | Write a function to find the shared elements from the given two lists. 28 | ``` 29 | 30 | ## Completion 1: 31 | ```python 32 | assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5)) 33 | assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4)) 34 | assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14)) 35 | ``` 36 | 37 | ## Prompt 2: 38 | ```python 39 | Write a python function to identify non-prime numbers. 40 | ``` 41 | 42 | ## Completion 2: 43 | ```python 44 | assert is_not_prime(2) == False 45 | assert is_not_prime(10) == True 46 | assert is_not_prime(35) == True 47 | assert is_not_prime(37) == False 48 | ``` -------------------------------------------------------------------------------- /prompts/humaneval_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Code Formatting**: Please write code in 6 | ```python 7 | [Code] 8 | ``` 9 | format. 10 | 11 | # For example: 12 | 13 | ## Prompt 1: 14 | ```python 15 | from typing import List 16 | 17 | 18 | def has_close_elements(numbers: List[float], threshold: float) -> bool: 19 | """ Check if in given list of numbers, are any two numbers closer to each other than 20 | given threshold. 21 | >>> has_close_elements([1.0, 2.0, 3.0], 0.5) 22 | False 23 | >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) 24 | True 25 | """ 26 | 27 | ``` 28 | 29 | ## Completion 1: 30 | ```python 31 | for idx, elem in enumerate(numbers): 32 | for idx2, elem2 in enumerate(numbers): 33 | if idx != idx2: 34 | distance = abs(elem - elem2) 35 | if distance < threshold: 36 | return True 37 | 38 | return False 39 | 40 | ``` 41 | 42 | ## Prompt 2: 43 | ```python 44 | from typing import List 45 | 46 | 47 | def separate_paren_groups(paren_string: str) -> List[str]: 48 | """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to 49 | separate those group into separate strings and return the list of those. 50 | Separate groups are balanced (each open brace is properly closed) and not nested within each other 51 | Ignore any spaces in the input string. 52 | >>> separate_paren_groups('( ) (( )) (( )( ))') 53 | ['()', '(())', '(()())'] 54 | """ 55 | 56 | ``` 57 | 58 | ## Completion 2: 59 | ```python 60 | result = [] 61 | current_string = [] 62 | current_depth = 0 63 | 64 | for c in paren_string: 65 | if c == '(': 66 | current_depth += 1 67 | current_string.append(c) 68 | elif c == ')': 69 | current_depth -= 1 70 | current_string.append(c) 71 | 72 | if current_depth == 0: 73 | result.append(''.join(current_string)) 74 | current_string.clear() 75 | 76 | return result 77 | ``` -------------------------------------------------------------------------------- /prompts/test_designer_humaneval_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. 
These test cases should encompass Basic, Edge, and Large Scale scenarios to ensure the code's robustness, reliability, and scalability. 2 | 3 | **1. Basic Test Cases**: 4 | - **Objective**: To verify the fundamental functionality of the `has_close_elements` function under normal conditions. 5 | 6 | **2. Edge Test Cases**: 7 | - **Objective**: To evaluate the function's behavior under extreme or unusual conditions. 8 | 9 | **3. Large Scale Test Cases**: 10 | - **Objective**: To assess the function’s performance and scalability with large data samples. 11 | 12 | **Instructions**: 13 | - Implement a comprehensive set of test cases following the guidelines above. 14 | - Ensure each test case is well-documented with comments explaining the scenario it covers. 15 | - Pay special attention to edge cases as they often reveal hidden bugs. 16 | - For large-scale tests, focus on the function's efficiency and performance under heavy loads. 17 | 18 | - The format of test cases should be: 19 | ```python 20 | assert function_name(input) == expected_output, "Test Case Description" 21 | ``` 22 | 23 | 24 | # For example: 25 | 26 | ## Prompt 1: 27 | ```python 28 | from typing import List 29 | 30 | 31 | def has_close_elements(numbers: List[float], threshold: float) -> bool: 32 | """ Check if in given list of numbers, are any two numbers closer to each other than 33 | given threshold. 34 | >>> has_close_elements([1.0, 2.0, 3.0], 0.5) 35 | False 36 | >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) 37 | True 38 | """ 39 | 40 | ``` 41 | 42 | ## Completion 1: 43 | ```python 44 | 45 | assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False 46 | assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)== True 47 | 48 | ``` 49 | 50 | ## Prompt 2: 51 | ```python 52 | from typing import List 53 | 54 | 55 | def separate_paren_groups(paren_string: str) -> List[str]: 56 | """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to 57 | separate those group into separate strings and return the list of those. 58 | Separate groups are balanced (each open brace is properly closed) and not nested within each other 59 | Ignore any spaces in the input string. 60 | >>> separate_paren_groups('( ) (( )) (( )( ))') 61 | ['()', '(())', '(()())'] 62 | """ 63 | 64 | ``` 65 | 66 | ## Completion 2: 67 | ```python 68 | 69 | assert separate_paren_groups('( ) (( )) (( )( ))') == ['()', '(())', '(()())'] 70 | 71 | ``` -------------------------------------------------------------------------------- /prompts/humaneval_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Code Formatting**: Please write code in 6 | ```python 7 | [Code] 8 | ``` 9 | format. 10 | 11 | **Instructions**: 12 | 1. **Understand and Clarify**: Make sure you understand the task. 13 | 2. **Algorithm/Method Selection**: Decide on the most efficient way. 14 | 3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode. 15 | 4. **Code Generation**: Translate your pseudocode into executable Python code. 
16 | 17 | 18 | # For example: 19 | 20 | ## Prompt 1: 21 | ```python 22 | from typing import List 23 | 24 | 25 | def has_close_elements(numbers: List[float], threshold: float) -> bool: 26 | """ Check if in given list of numbers, are any two numbers closer to each other than 27 | given threshold. 28 | >>> has_close_elements([1.0, 2.0, 3.0], 0.5) 29 | False 30 | >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) 31 | True 32 | """ 33 | 34 | ``` 35 | 36 | ## Completion 1: 37 | ```python 38 | for idx, elem in enumerate(numbers): 39 | for idx2, elem2 in enumerate(numbers): 40 | if idx != idx2: 41 | distance = abs(elem - elem2) 42 | if distance < threshold: 43 | return True 44 | 45 | return False 46 | 47 | ``` 48 | 49 | ## Prompt 2: 50 | ```python 51 | from typing import List 52 | 53 | 54 | def separate_paren_groups(paren_string: str) -> List[str]: 55 | """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to 56 | separate those group into separate strings and return the list of those. 57 | Separate groups are balanced (each open brace is properly closed) and not nested within each other 58 | Ignore any spaces in the input string. 59 | >>> separate_paren_groups('( ) (( )) (( )( ))') 60 | ['()', '(())', '(()())'] 61 | """ 62 | 63 | ``` 64 | 65 | ## Completion 2: 66 | ```python 67 | result = [] 68 | current_string = [] 69 | current_depth = 0 70 | 71 | for c in paren_string: 72 | if c == '(': 73 | current_depth += 1 74 | current_string.append(c) 75 | elif c == ')': 76 | current_depth -= 1 77 | current_string.append(c) 78 | 79 | if current_depth == 0: 80 | result.append(''.join(current_string)) 81 | current_string.clear() 82 | 83 | return result 84 | ``` -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AgentCoder: multi-agent code generation framework 2 | 3 | AgentCoder is a novel multiagent-code generation framework that leverages the power of large language models (LLMs) to enhance the effectiveness of code generation. The framework consists of three specialized agents: the programmer agent, the test designer agent, and the test executor agent. These agents collaborate to generate high-quality code snippets, design comprehensive test cases, and ensure the correctness of the generated code through an iterative feedback loop. 4 | 5 | ## Key Features 6 | 7 | - **Multiagent Collaboration**: AgentCoder utilizes a multiagent framework where each agent specializes in a specific task, leading to improved code generation effectiveness. 8 | - **Independent Test Case Generation**: The test designer agent generates diverse and objective test cases independently, ensuring comprehensive testing of the generated code. 9 | - **Iterative Code Refinement**: The test executor agent executes the generated test cases against the code and provides feedback to the programmer agent for iterative code refinement. 10 | - **Modularity and Scalability**: The modular structure of AgentCoder allows for easy integration with advanced models and future enhancements, ensuring adaptability in the evolving landscape of code generation. 11 | 12 | ## Installation 13 | 14 | To use AgentCoder, you need to have an API key from OpenAI or other similar third-party providers. 15 | 1. 
Clone the AgentCoder repository: 16 | ``` 17 | git clone https://github.com/your-username/AgentCoder.git 18 | cd AgentCoder 19 | git clone https://github.com/THUDM/CodeGeeX 20 | ``` 21 | 22 | 2. Install the required dependencies: 23 | ``` 24 | pip install -r requirements.txt 25 | ``` 26 | 27 | 3. Copy `.env.example` to `.env` and add your OpenAI API key: 28 | ```python 29 | OPENAI_API_KEY="YOUR_OPENAI_API_KEY" 30 | ``` 31 | 32 | ## Usage 33 | 34 | ### Code Generation 35 | 36 | To generate code snippets, run one of the following commands: 37 | ``` 38 | python src/programmer_[humaneval/mbpp].py 39 | ``` 40 | These scripts generate the code snippets that are later used for test case generation. Run them from the repository root so that the relative `./prompts` and `./dataset` paths resolve, and create the `dataset/` directory first if it does not already exist. 41 | 42 | ### Test Case Generation 43 | 44 | To generate test cases, run one of the following commands: 45 | ``` 46 | python src/test_designer_[humaneval/mbpp].py 47 | ``` 48 | These scripts generate diverse and comprehensive test cases based on the coding requirements. 49 | 50 | ### Self-Optimization Process 51 | 52 | To perform the self-optimization process, run one of the following commands: 53 | ``` 54 | python src/test_executor_[humaneval/mbpp].py 55 | ``` 56 | These scripts execute the generated test cases against the code and feed the results back to the programmer agent for iterative code refinement. 57 | 58 | 59 | ## Contributions 60 | 61 | Contributions to AgentCoder are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository. 62 | 63 | ## License 64 | 65 | AgentCoder is released under the [MIT License](LICENSE). 66 | 67 | ## Acknowledgments 68 | 69 | We would like to thank AIOHUB for providing funding and support for the development of AgentCoder. We also acknowledge the contributions of the open-source community and the developers of the large language models used in this project.
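
## Pipeline Sketch

For reference, `src/test_executor_humaneval.py` drives the whole loop: it scores candidate completions against the generated tests, reports accuracy on the canonical HumanEval tests, and then asks the programmer and test designer agents to regenerate entries that are still failing. The snippet below is a simplified sketch of that existing loop; the function names are taken from the source, while the surrounding scaffolding (loading `dataset`, `model`, `lg`) is omitted for brevity.

```python
# Simplified view of the refinement loop in src/test_executor_humaneval.py.
# The real script loads ./dataset/{model}_{lg}.json and writes a checkpoint after each epoch.
for current_epoch in range(5):
    dataset = test_agent_concurrency(dataset, lg)                     # score completions against generated tests
    test_report(dataset, lg)                                          # pass rate on the canonical HumanEval tests
    dataset = call_fetch_completion_helper(dataset, model, lg)        # programmer agent regenerates weak solutions
    dataset = call_fetch_test_completion_helper(dataset, model, lg)   # test designer agent regenerates test cases
```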
70 | -------------------------------------------------------------------------------- /src/test_designer_humaneval.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | from tqdm import tqdm 5 | import copy 6 | import openai 7 | from concurrent.futures import ThreadPoolExecutor 8 | import concurrent.futures 9 | import time 10 | from datasets import load_dataset 11 | from dotenv import load_dotenv 12 | 13 | load_dotenv() 14 | 15 | # Setting API parameters 16 | openai.api_key = os.getenv("OPENAI_API_KEY") 17 | 18 | dataset = load_dataset("openai_humaneval",split="test") 19 | dataset = [entry for entry in dataset] 20 | 21 | prompt_path = "./prompts/test_designer_humaneval_prompt_update.txt" 22 | with open(prompt_path, "r") as f: 23 | construct_few_shot_prompt = f.read() 24 | 25 | def preprocess_data(test_case_string): 26 | if f"```python" in test_case_string: 27 | test_case_string = test_case_string[test_case_string.find(f"```python")+len(f"```python"):] 28 | test_case_string = test_case_string[:test_case_string.find("```")] 29 | 30 | return test_case_string 31 | 32 | # Function to fetch completion 33 | def fetch_completion(data_entry, model, lg,times=10): 34 | global construct_few_shot_prompt 35 | if "need_reproduce" in data_entry.keys() and data_entry["need_reproduce"]==False: 36 | return data_entry 37 | prompt = data_entry["prompt"] 38 | entry_point = data_entry["entry_point"] 39 | 40 | text = f""" 41 | {construct_few_shot_prompt} 42 | 43 | **Input Code Snippet**: 44 | ```python 45 | {prompt} 46 | ``` 47 | """ 48 | test_case_list = [] 49 | for i in range(times): 50 | while True: 51 | try: 52 | completions = openai.ChatCompletion.create( 53 | model="gpt-3.5-turbo-1106", 54 | stream=False, 55 | messages=[ 56 | {"role": "system", "content": "You are a code developer assistant."}, 57 | {"role": "user", "content":text}, 58 | ], 59 | request_timeout=100, 60 | ) 61 | test_case = completions.choices[0]["message"]["content"] 62 | test_case = preprocess_data(test_case) 63 | except Exception as e: 64 | time.sleep(20) 65 | print(e) 66 | test_case = "" 67 | if test_case!="": 68 | break 69 | test_case_list.append(test_case) 70 | data_entry["test_case_list"] = test_case_list 71 | return data_entry 72 | 73 | def call_fetch_test_completion_helper(dataset, model,lg): 74 | print("Fixing bug...") 75 | with ThreadPoolExecutor(max_workers=5) as executor: 76 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 77 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 78 | entry = future_to_entry[future] 79 | try: 80 | updated_entry = future.result() 81 | idx = dataset.index(entry) 82 | dataset[idx] = updated_entry 83 | except Exception as e: 84 | print(repr(e)) 85 | return dataset 86 | 87 | 88 | if __name__ == "__main__": 89 | model_list = ["gpt-3.5-turbo-1106"] 90 | language = ["python"] 91 | for model in model_list: 92 | for lg in language: 93 | from datasets import load_dataset 94 | with open(f"./dataset/{model}_{lg}.json", "r") as f: 95 | dataset = json.load(f) 96 | dataset = [entry for entry in dataset] 97 | with ThreadPoolExecutor(max_workers=5) as executor: 98 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 99 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 100 | entry = future_to_entry[future] 101 | try: 102 | updated_entry = 
future.result() 103 | idx = dataset.index(entry) 104 | dataset[idx] = updated_entry 105 | except Exception as e: 106 | print(repr(e)) 107 | 108 | with open(f"./dataset/{model}_{lg}.json", "w") as f: 109 | json.dump(dataset, f, indent=4) 110 | -------------------------------------------------------------------------------- /src/programmer_humaneval.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | from tqdm import tqdm 5 | import copy 6 | import openai 7 | from concurrent.futures import ThreadPoolExecutor 8 | import concurrent.futures 9 | import time 10 | from datasets import load_dataset 11 | from dotenv import load_dotenv 12 | 13 | load_dotenv() 14 | 15 | # Setting API parameters 16 | openai.api_base = "https://api.aiohub.org/v1" 17 | openai.api_key = os.getenv("OPENAI_API_KEY") 18 | 19 | dataset = load_dataset("openai_humaneval",split="test") 20 | dataset = [entry for entry in dataset] 21 | 22 | prompt_path = "./prompts/humaneval_prompt_update.txt" 23 | with open(prompt_path, "r") as f: 24 | construct_few_shot_prompt = f.read() 25 | 26 | def preprocess_data(completion_string): 27 | if f"```python" in completion_string: 28 | completion_string = completion_string[completion_string.find(f"```python")+len(f"```python"):] 29 | completion_string = completion_string[:completion_string.find("```")] 30 | else: 31 | print("Error: No code block found") 32 | return completion_string 33 | 34 | # Function to fetch completion 35 | def fetch_completion(data_entry, model,lg,times = 5): 36 | global construct_few_shot_prompt 37 | if "need_reproduce" in data_entry.keys() and data_entry["need_reproduce"]==False: 38 | return data_entry 39 | prompt = data_entry["prompt"] 40 | text = f""" 41 | {construct_few_shot_prompt} 42 | 43 | **Input Code Snippet**: 44 | ```python 45 | {prompt} 46 | ``` 47 | ## Completion 3: 48 | """ 49 | completions_code = [] 50 | for i in range(times): 51 | while True: 52 | try: 53 | completions = openai.ChatCompletion.create( 54 | model=model, 55 | stream=False, 56 | messages=[ 57 | {"role": "system", "content": "You are a software programmer."}, 58 | {"role": "user", "content":text}, 59 | ], 60 | request_timeout=100, 61 | ) 62 | completion = completions.choices[0]["message"]["content"] 63 | completion = preprocess_data(completion) 64 | 65 | except Exception as e: 66 | print(e) 67 | time.sleep(10) 68 | completion = "" 69 | if completion!="": 70 | break 71 | completions_code.append(completion) 72 | data_entry["completion_list"] = completions_code 73 | return data_entry 74 | 75 | 76 | def call_fetch_completion_helper(dataset, model,lg): 77 | print("Fixing bug...") 78 | with ThreadPoolExecutor(max_workers=5) as executor: 79 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 80 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 81 | entry = future_to_entry[future] 82 | try: 83 | updated_entry = future.result() 84 | idx = dataset.index(entry) 85 | dataset[idx] = updated_entry 86 | except Exception as e: 87 | print(repr(e)) 88 | return dataset 89 | 90 | if __name__ == "__main__": 91 | model_list = ["gpt-3.5-turbo-1106"] 92 | language = ["python"] 93 | for model in model_list: 94 | for lg in language: 95 | from datasets import load_dataset 96 | dataset = load_dataset("openai_humaneval",split="test") 97 | dataset = [entry for entry in dataset] 98 | with ThreadPoolExecutor(max_workers=5) as executor: 99 | 
future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 100 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 101 | entry = future_to_entry[future] 102 | try: 103 | updated_entry = future.result() 104 | idx = dataset.index(entry) 105 | dataset[idx] = updated_entry 106 | except Exception as e: 107 | print(repr(e)) 108 | with open(f"./dataset/{model}_{lg}.json", "w") as f: 109 | json.dump(dataset, f, indent=4) 110 | -------------------------------------------------------------------------------- /src/test_designer_mbpp.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | from tqdm import tqdm 5 | import copy 6 | import openai 7 | from concurrent.futures import ThreadPoolExecutor 8 | import concurrent.futures 9 | import time 10 | from datasets import load_dataset 11 | from dotenv import load_dotenv 12 | 13 | load_dotenv() 14 | 15 | # Setting API parameters 16 | openai.api_key = os.getenv("OPENAI_API_KEY") 17 | 18 | dataset = load_dataset("evalplus/mbppplus",split="test") 19 | dataset = [entry for entry in dataset] 20 | task_0_tests = "\n".join(dataset[0]["test_list"]) 21 | task_1_tests = "\n".join(dataset[1]["test_list"]) 22 | 23 | prompt_path = "./prompts/test_designer_mbpp_prompt_update.txt" 24 | with open(prompt_path, "r") as f: 25 | construct_few_shot_prompt = f.read() 26 | 27 | def preprocess_data(test_case_string): 28 | if f"```python" in test_case_string: 29 | test_case_string = test_case_string[test_case_string.find(f"```python")+len(f"```python"):] 30 | test_case_string = test_case_string[:test_case_string.find("```")] 31 | return test_case_string 32 | 33 | # Function to fetch completion 34 | def fetch_completion(data_entry, model, lg,times=5): 35 | global construct_few_shot_prompt 36 | if "need_reproduce" in data_entry.keys() and data_entry["need_reproduce"]==False: 37 | return data_entry 38 | prompt = data_entry["prompt"] 39 | test_case_0 = data_entry["test_list"][0] 40 | function_name = test_case_0.split("(")[0].split(" ")[-1] 41 | 42 | text = f""" 43 | {construct_few_shot_prompt} 44 | 45 | **Input Code Snippet**: 46 | ```python 47 | {prompt} 48 | ``` 49 | """ 50 | test_case_list = [] 51 | for i in range(times): 52 | while True: 53 | try: 54 | completions = openai.ChatCompletion.create( 55 | model="gpt-3.5-turbo-1106", 56 | stream=False, 57 | messages=[ 58 | {"role": "system", "content": "You are a code developer assistant."}, 59 | {"role": "user", "content":text}, 60 | ], 61 | request_timeout=100, 62 | ) 63 | test_case = completions.choices[0]["message"]["content"] 64 | test_case = preprocess_data(test_case) 65 | except Exception as e: 66 | time.sleep(20) 67 | print(e) 68 | test_case = "" 69 | if test_case!="": 70 | break 71 | test_case_list.append(test_case) 72 | data_entry["test_case_list"] = test_case_list 73 | return data_entry 74 | 75 | def call_fetch_test_completion_helper(dataset, model,lg): 76 | print("Fixing bug...") 77 | with ThreadPoolExecutor(max_workers=5) as executor: 78 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 79 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 80 | entry = future_to_entry[future] 81 | try: 82 | updated_entry = future.result() 83 | idx = dataset.index(entry) 84 | dataset[idx] = updated_entry 85 | except Exception as e: 86 | print(repr(e)) 87 | return dataset 88 | 89 | 90 | if 
__name__ == "__main__": 91 | model_list = ["gpt-3.5-turbo-0301"] 92 | language = ["python"] 93 | for model in model_list: 94 | for lg in language: 95 | from datasets import load_dataset 96 | with open(f"./dataset/{model}_mbpp.json", "r") as f: 97 | dataset = json.load(f) 98 | dataset = [entry for entry in dataset] 99 | with ThreadPoolExecutor(max_workers=5) as executor: 100 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 101 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 102 | entry = future_to_entry[future] 103 | try: 104 | updated_entry = future.result() 105 | idx = dataset.index(entry) 106 | dataset[idx] = updated_entry 107 | except Exception as e: 108 | print(repr(e)) 109 | 110 | with open(f"./dataset/{model}_mbpp.json", "w") as f: 111 | json.dump(dataset, f, indent=4) 112 | -------------------------------------------------------------------------------- /src/programmer_mbpp.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | from tqdm import tqdm 5 | import copy 6 | import openai 7 | from concurrent.futures import ThreadPoolExecutor 8 | import concurrent.futures 9 | from dotenv import load_dotenv 10 | 11 | load_dotenv 12 | 13 | # Setting API parameters 14 | openai.api_base = "https://api.aiohub.org/v1" 15 | openai.api_key = os.getenv("OPENAI_API_KEY") 16 | 17 | prompt_path = "./prompts/mbpp_prompt_update.txt" 18 | with open(prompt_path, "r") as f: 19 | construct_few_shot_prompt = f.read() 20 | 21 | def preprocess_data(data,lg): 22 | if f"```{lg}" in data["completion"]: 23 | data["completion"] = data["completion"][data["completion"].find(f"```{lg}")+len(f"```{lg}"):] 24 | data["completion"] = data["completion"][:data["completion"].find("```")] 25 | else: 26 | print(data["task_id"]) 27 | return data 28 | 29 | # Function to fetch completion 30 | def fetch_completion(data_entry, model,lg): 31 | global construct_few_shot_prompt 32 | lg = "py" 33 | if "passed" in data_entry.keys() and data_entry["passed"] == True: 34 | return data_entry 35 | prompt = data_entry["prompt"] 36 | test_case = data_entry["test_list"] 37 | code = data_entry["completion"] 38 | tests = "" 39 | for test in test_case: 40 | tests+="\n"+test 41 | text = f""" 42 | construct_few_shot_prompt 43 | 44 | **Task**: 45 | ```python 46 | {prompt} 47 | ``` 48 | Your code should pass these tests: 49 | ```python 50 | {tests} 51 | ``` 52 | """ 53 | try: 54 | completions = openai.ChatCompletion.create( 55 | model = model, 56 | stream=False, 57 | messages=[ 58 | {"role": "system", "content": "You are a code developer."}, 59 | {"role": "user", "content":text}, 60 | ], 61 | request_timeout=100, 62 | ) 63 | data_entry["completion"] = completions.choices[0]["message"]["content"] 64 | data_entry = preprocess_data(data_entry,lg) 65 | return data_entry 66 | except Exception as e: 67 | print(repr(e)) 68 | data_entry["completion"] = "" 69 | return data_entry 70 | 71 | def fix_bug(data_entry, model,lg,preprocess_data = preprocess_data): 72 | if "passed" in data_entry.keys() and data_entry["passed"] == True: 73 | return data_entry 74 | else: 75 | gpt_prompt = ( 76 | "Please re-completion the code to fix the error message. "+ 77 | f"\nHere is the previous version:\n```{lg}\n" + 78 | data_entry['completion'] + f"\n```\nWhen we use this test cases: ```{lg}\n"+data_entry["test_case"]+f"\n``` to evaluate the code. 
It raise the error:\n```{lg}\n" + data_entry["result"] + 79 | f"\n```\nPlease fix the bug and return the code. The re-completion code should in triple backticks format(i.e., in ```{lg} ```)." 80 | ) 81 | try: 82 | completions = openai.ChatCompletion.create( 83 | model = model, 84 | stream=False, 85 | messages=[ 86 | {"role": "system", "content": "You are a code developer assistant."}, 87 | {"role": "user", "content":gpt_prompt}, 88 | ], 89 | request_timeout=100, 90 | ) 91 | data_entry["completion"] = completions.choices[0]["message"]["content"] 92 | data_entry = preprocess_data(data_entry,"py") 93 | except Exception as e: 94 | print(repr(e)) 95 | return data_entry 96 | 97 | def call_fix_bug(dataset, model,lg): 98 | print("Fixing bug...") 99 | with ThreadPoolExecutor() as executor: 100 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 101 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 102 | entry = future_to_entry[future] 103 | try: 104 | updated_entry = future.result() 105 | idx = dataset.index(entry) 106 | dataset[idx] = updated_entry 107 | except Exception as e: 108 | print(repr(e)) 109 | return dataset 110 | 111 | def call_completion(dataset, model,lg): 112 | print("Fixing bug...") 113 | with ThreadPoolExecutor() as executor: 114 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 115 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 116 | entry = future_to_entry[future] 117 | try: 118 | updated_entry = future.result() 119 | idx = dataset.index(entry) 120 | dataset[idx] = updated_entry 121 | except Exception as e: 122 | print(repr(e)) 123 | return dataset 124 | 125 | 126 | 127 | if __name__ == "__main__": 128 | model_list = ["gpt-3.5-turbo-1106"] 129 | language = ["py"] 130 | for model in model_list: 131 | for lg in language: 132 | from datasets import load_dataset 133 | dataset = load_dataset("mbpp",name="sanitized",split="test") 134 | dataset = [entry for entry in dataset] 135 | with open(path, "r") as f: 136 | dataset = json.load(f) 137 | with ThreadPoolExecutor(max_workers=20) as executor: 138 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 139 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 140 | entry = future_to_entry[future] 141 | try: 142 | updated_entry = future.result() 143 | idx = dataset.index(entry) 144 | dataset[idx] = updated_entry 145 | except Exception as e: 146 | print(repr(e)) 147 | 148 | with open(f"./dataset/{model}_mbpp.json", "w") as f: 149 | json.dump(dataset, f, indent=4) 150 | -------------------------------------------------------------------------------- /src/test_executor_mbpp.py: -------------------------------------------------------------------------------- 1 | import random 2 | import json 3 | from typing import Optional, Callable, Dict 4 | import ast 5 | import doctest 6 | import io 7 | from concurrent.futures import ThreadPoolExecutor, as_completed 8 | import inspect 9 | import numpy as np 10 | import sys 11 | sys.path.append('./CodeGeeX/') 12 | import contextlib 13 | import faulthandler 14 | import io 15 | import os 16 | import multiprocessing 17 | import platform 18 | import signal 19 | from tqdm import tqdm 20 | from programmer_mbpp import fix_bug,call_fix_bug,call_completion,single_agent_helper 21 | from codegeex.benchmark.utils import read_dataset, IMPORT_HELPER 
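# Note: the codegeex.benchmark helpers imported here (read_dataset, IMPORT_HELPER, check_correctness)
# live in the CodeGeeX submodule declared in .gitmodules; it must be present at ./CodeGeeX
# (hence the sys.path.append earlier in this file) before this script can run.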
22 | from codegeex.benchmark.execution import check_correctness 23 | import tempfile 24 | correct_doctest = 0 25 | correct_before_doctest = 0 26 | correct_after_doctest = 0 27 | result_original = 0 28 | result_canonical_solution = 0 29 | result_fuzzer = 0 30 | result_fuzzer_canonical_solution = 0 31 | idx_run_tests_orginal = [] 32 | idx_run_tests_canonical_solution = [] 33 | idx_run_tests_fuzzer = [] 34 | idx_run_tests_fuzzer_canonical_solution = [] 35 | 36 | language = ["python","cpp","js","go","js"] 37 | 38 | 39 | def process_humaneval_test(sample, problems, example_test=False,language=language, test_case=True,canonical_solution=False): 40 | task_id = sample["task_id"] 41 | task_id = problems.index(sample) 42 | prompt = sample["prompt"] 43 | code = sample["completion"] 44 | if canonical_solution: 45 | code = sample["code"] 46 | # Pre-process for different languages 47 | if language == "python" or language == "py": 48 | if test_case: 49 | tests = sample["test_case"] 50 | else: 51 | test_case = sample["test_list"] 52 | tests = "" 53 | for test in test_case: 54 | tests+="\n"+test 55 | test_string = code + "\n" + tests 56 | return test_string 57 | 58 | 59 | 60 | def preprocess_data(task,lg): 61 | if f"```{lg}" in task["completion"]: 62 | task["completion"] = task["completion"][task["completion"].find(f"```{lg}") +len(f"```{lg}"):] 63 | task["completion"] = task["completion"][:task["completion"].find("```")] 64 | elif "```" in task["completion"]: 65 | task["completion"] = task["completion"][task["completion"].find("```") +3:] 66 | task["completion"] = task["completion"][:task["completion"].find("```")] 67 | 68 | if f"```{lg}" in task["prompt"]: 69 | task["prompt"] = task["prompt"][task["prompt"].find(f"```{lg}") +len(f"```{lg}"):] 70 | task["prompt"] = task["prompt"][:task["prompt"].find("```")] 71 | elif "```" in task["prompt"]: 72 | task["prompt"] = task["prompt"][task["prompt"].find("```") +3:] 73 | task["prompt"] = task["prompt"][:task["prompt"].find("```")] 74 | 75 | if "assert" in task["prompt"]: 76 | task["prompt"] = task["prompt"][:task["prompt"].find("assert")] 77 | return task 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | class TimeoutException(Exception): 86 | pass 87 | class WriteOnlyStringIO(io.StringIO): 88 | """ StringIO that throws an exception when it's read from """ 89 | 90 | def read(self, *args, **kwargs): 91 | raise IOError 92 | 93 | def readline(self, *args, **kwargs): 94 | raise IOError 95 | 96 | def readlines(self, *args, **kwargs): 97 | raise IOError 98 | 99 | def readable(self, *args, **kwargs): 100 | """ Returns True if the IO object can be read. 
""" 101 | return False 102 | class redirect_stdin(contextlib._RedirectStream): # type: ignore 103 | _stream = 'stdin' 104 | 105 | @contextlib.contextmanager 106 | def swallow_io(): 107 | stream = WriteOnlyStringIO() 108 | with contextlib.redirect_stdout(stream): 109 | with contextlib.redirect_stderr(stream): 110 | with redirect_stdin(stream): 111 | yield 112 | 113 | @contextlib.contextmanager 114 | def time_limit(seconds: float): 115 | def signal_handler(signum, frame): 116 | raise TimeoutException("Timed out!") 117 | signal.setitimer(signal.ITIMER_REAL, seconds) 118 | signal.signal(signal.SIGALRM, signal_handler) 119 | try: 120 | yield 121 | finally: 122 | signal.setitimer(signal.ITIMER_REAL, 0) 123 | 124 | # def check_correctness_mbpp(code_string): 125 | 126 | 127 | def test_report(dataset,lg): 128 | correct = 0 129 | for i in tqdm(range(len(dataset))): 130 | dataset[i]["full_code"] = process_humaneval_test(dataset[i], dataset, example_test=False,language=lg,test_case=False) 131 | result = check_correctness(dataset[i]["task_id"],dataset[i],lg,5,"./tmp") 132 | if result["passed"]==True: 133 | correct+=1 134 | dataset[i]["report_passed"] = result["passed"] 135 | dataset[i]["report_result"] = result["result"] 136 | print("==============Start Report Testing==============") 137 | correct_percent = correct/len(dataset)*100 138 | print(f"test_report, {correct_percent:0.2f}") 139 | return dataset 140 | 141 | def test_agent(dataset,lg): 142 | correct = 0 143 | for i in tqdm(range(len(dataset))): 144 | dataset[i]["full_code"] = process_humaneval_test(dataset[i], dataset, example_test=False,language=lg,test_case=False) 145 | result = check_correctness(dataset[i]["task_id"],dataset[i],lg,5,"./tmp") 146 | if result["passed"]==True: 147 | correct+=1 148 | dataset[i]["result"] = result["result"] 149 | dataset[i]["passed"] = result["passed"] 150 | print("============Start Agent Testing=================") 151 | print("test_report",correct) 152 | return dataset 153 | 154 | if __name__ == "__main__": 155 | model_list = ["gpt-3.5-turbo-1106"] 156 | language = ["python"] 157 | 158 | for model_name in model_list: 159 | for lg in language: 160 | path = f"./dataset/zero_shot_{model_name}_mbpp.json" 161 | with open(path, "r") as f: 162 | dataset = json.load(f) 163 | epoch = 5 164 | for current_epoch in range(epoch): 165 | print(lg,current_epoch) 166 | test_report(dataset,lg) 167 | test_agent(dataset,lg) 168 | dataset = call_completion(dataset,model_name,lg) 169 | with open(f"./dataset/zero_shot_{model_name}_{current_epoch}_mbpp.json", "w") as f: 170 | json.dump(dataset, f, indent=4) 171 | with open(f"./dataset/zero_shot_{model_name}_{current_epoch}_mbpp_total.json", "w") as f: 172 | json.dump(dataset, f, indent=4) 173 | 174 | 175 | 176 | -------------------------------------------------------------------------------- /src/test_executor_humaneval.py: -------------------------------------------------------------------------------- 1 | # test 2 | import random 3 | import json 4 | from typing import Optional, Callable, Dict 5 | import ast 6 | import doctest 7 | from concurrent.futures import ThreadPoolExecutor, as_completed 8 | import inspect 9 | import numpy as np 10 | import sys 11 | sys.path.append('./CodeGeeX/') 12 | import contextlib 13 | import faulthandler 14 | import io 15 | import os 16 | import multiprocessing 17 | import platform 18 | import signal 19 | import concurrent.futures 20 | from tqdm import tqdm 21 | from tqdm import tqdm 22 | from programmer_humaneval import call_fetch_completion_helper 23 | 
from test_designer_humaneval import call_fetch_test_completion_helper 24 | from codegeex.benchmark.utils import read_dataset, IMPORT_HELPER 25 | from codegeex.benchmark.execution import check_correctness 26 | import tempfile 27 | correct_doctest = 0 28 | correct_before_doctest = 0 29 | correct_after_doctest = 0 30 | result_original = 0 31 | result_canonical_solution = 0 32 | result_fuzzer = 0 33 | result_fuzzer_canonical_solution = 0 34 | idx_run_tests_orginal = [] 35 | idx_run_tests_canonical_solution = [] 36 | idx_run_tests_fuzzer = [] 37 | idx_run_tests_fuzzer_canonical_solution = [] 38 | 39 | language = ["python","cpp","js","go","js"] 40 | 41 | 42 | class TimeoutException(Exception): 43 | pass 44 | class WriteOnlyStringIO(io.StringIO): 45 | """ StringIO that throws an exception when it's read from """ 46 | 47 | def read(self, *args, **kwargs): 48 | raise IOError 49 | 50 | def readline(self, *args, **kwargs): 51 | raise IOError 52 | 53 | def readlines(self, *args, **kwargs): 54 | raise IOError 55 | 56 | def readable(self, *args, **kwargs): 57 | """ Returns True if the IO object can be read. """ 58 | return False 59 | class redirect_stdin(contextlib._RedirectStream): # type: ignore 60 | _stream = 'stdin' 61 | 62 | @contextlib.contextmanager 63 | def swallow_io(): 64 | stream = WriteOnlyStringIO() 65 | with contextlib.redirect_stdout(stream): 66 | with contextlib.redirect_stderr(stream): 67 | with redirect_stdin(stream): 68 | yield 69 | 70 | @contextlib.contextmanager 71 | def time_limit(seconds: float): 72 | def signal_handler(signum, frame): 73 | raise TimeoutException("Timed out!") 74 | signal.setitimer(signal.ITIMER_REAL, seconds) 75 | signal.signal(signal.SIGALRM, signal_handler) 76 | try: 77 | yield 78 | finally: 79 | signal.setitimer(signal.ITIMER_REAL, 0) 80 | 81 | def process_humaneval_test(sample, problems, example_test=False,language=language, test_case=True): 82 | task_id = sample["task_id"] 83 | task_id = problems.index(sample) 84 | prompt = sample["prompt"] 85 | if example_test and "example_test" in problems[task_id] and problems[task_id]["example_test"] != "": 86 | test = problems[task_id]["example_test"] 87 | else: 88 | test = problems[task_id]["test"] 89 | if test_case: 90 | test = problems[task_id]["test_case"] 91 | code = sample["completion"] 92 | # Pre-process for different languages 93 | if language == "python": 94 | code_ = [] 95 | test_setup = "\n".join(IMPORT_HELPER["python"]) + "\n" 96 | if f"class sample['entry_point']" in code: 97 | test_string = test_setup + code + "\n" + test + "\n" + f"check({sample['entry_point']})" 98 | else: 99 | test_string = test_setup + prompt + code + "\n" + test + "\n" + f"check({sample['entry_point']})" 100 | elif language == "cpp": 101 | test_set_up = "" 102 | for s in IMPORT_HELPER["cpp"]: 103 | if s not in prompt: 104 | test_set_up += s + "\n" 105 | # test_string = test_set_up + "\n" + prompt + code + "\n" + test 106 | test_string = test_set_up + "\n" + code + "\n" + test 107 | elif language == "java": 108 | # if sample["declaration"] in code: 109 | if "class Solution" in code: 110 | test_string = code + "\n" + test 111 | else: 112 | test_string = prompt + code + "\n" + test 113 | # else: 114 | # test_string = prompt + code + "\n" + test 115 | elif language == "js" or language == "javascript": 116 | # test_string = prompt + code + "\n" + test 117 | test_string = code + "\n" + test 118 | elif language == "go": 119 | # import_string = problems[task_id]["import"] 120 | # prompt = prompt.replace(import_string, "") 121 | if 
example_test and "example_test" in problems[task_id]: 122 | test = problems[task_id]["example_test"] 123 | else: 124 | test = problems[task_id]["test"] 125 | candidate_import = ["math.","strings.","strconv.","sort.","time.","regexp.","fmt.","bytes.","md5.","rand."] 126 | test_setup = "package main\nimport (\n \"testing\"\n \"github.com/stretchr/testify/assert\"\n)" 127 | total_string = sample["declaration"] + code + "\n" + test 128 | other_pkgs = [] 129 | for pkg in candidate_import: 130 | if pkg in total_string: 131 | if pkg != "md5." and pkg!="rand": 132 | other_pkgs.append(" " + "\"" + pkg[:len(pkg)-1] + "\"" + "\n") 133 | elif pkg == "md5.": 134 | other_pkgs.append(" " + "\"" + "crypto/md5" + "\"" + "\n") 135 | elif pkg == "rand.": 136 | other_pkgs.append(" " + "\"" + "math/rand" + "\"" + "\n") 137 | if other_pkgs: 138 | import_other_pkgs = "import (\n" + " ".join([p + "\n" for p in other_pkgs]) + ")" 139 | # test_string = test_setup + "\n" + import_other_pkgs + "\n" + prompt + code + "\n" + test 140 | test_string = test_setup + "\n" + import_other_pkgs + "\n" + code + "\n" + test 141 | else: 142 | # test_string = test_setup + "\n" + prompt + code + "\n" + test 143 | test_string = test_setup + "\n" + code + "\n" + test 144 | elif language == "rust": 145 | main = "\nfn main(){ \n } \n" 146 | declaration = problems[task_id]["declaration"] 147 | test_string = main + declaration + prompt + code + test 148 | # print(test_string) 149 | return test_string 150 | 151 | 152 | 153 | def preprocess_data(task,lg): 154 | if f"```{lg}" in task["completion"]: 155 | task["completion"] = task["completion"][task["completion"].find(f"```{lg}") +len(f"```{lg}"):] 156 | task["completion"] = task["completion"][:task["completion"].find("```")] 157 | elif "```" in task["completion"]: 158 | task["completion"] = task["completion"][task["completion"].find("```") +3:] 159 | task["completion"] = task["completion"][:task["completion"].find("```")] 160 | 161 | if f"```{lg}" in task["prompt"]: 162 | task["prompt"] = task["prompt"][task["prompt"].find(f"```{lg}") +len(f"```{lg}"):] 163 | task["prompt"] = task["prompt"][:task["prompt"].find("```")] 164 | elif "```" in task["prompt"]: 165 | task["prompt"] = task["prompt"][task["prompt"].find("```") +3:] 166 | task["prompt"] = task["prompt"][:task["prompt"].find("```")] 167 | 168 | if "assert" in task["prompt"]: 169 | task["prompt"] = task["prompt"][:task["prompt"].find("assert")] 170 | return task 171 | 172 | 173 | def test_report(dataset,lg): 174 | correct = 0 175 | test_setup = "\n".join(IMPORT_HELPER["python"]) + "\n" 176 | for i in tqdm(range(len(dataset))): 177 | try: 178 | with swallow_io(): 179 | with time_limit(2.0): 180 | exec(test_setup + "\n" + dataset[i]["completion"] + "\n" + dataset[i]["test"] + "\n" + f"check({dataset[i]['entry_point']})") 181 | correct+=1 182 | except Exception as exc: 183 | pass 184 | print("==============Start Report Testing==============") 185 | print(f"test_report: {(correct/len(dataset)*100):.1f}") 186 | 187 | 188 | 189 | def test_agent_concurrency(dataset, lg): 190 | test_setup = "\n".join(IMPORT_HELPER["python"]) + "\n" 191 | total_correct = 0 192 | _for_completion = 0 193 | 194 | def process_item(i): 195 | if "need_reproduce" in dataset[i].keys() and dataset[i]["need_reproduce"]==False: 196 | # dataset[i]["need_reproduce"] = True 197 | return dataset[i]["max_correct"], dataset[i]["idx"] 198 | completion_list = dataset[i]["completion_list"] 199 | test_case_list = dataset[i]["test_case_list"] 200 | correct_list = [] 201 | 202 | for 
j in range(len(completion_list)): 203 | correct = 0 204 | if f"def {dataset[i]['entry_point']}" not in completion_list[j]: 205 | correct_list.append(correct) 206 | continue 207 | for k in range(len(test_case_list)): 208 | if f"assert {dataset[i]['entry_point']}(" not in test_case_list[k]: 209 | continue 210 | dataset[i]["full_code"] = test_setup + "\n" + completion_list[j] + "\n" + test_case_list[k] 211 | result = check_correctness(dataset[i]["task_id"], dataset[i], lg, 3, "./tmp") 212 | if result["passed"]: 213 | correct += 1 214 | correct_list.append(correct) 215 | 216 | max_correct = max(correct_list) 217 | idx = correct_list.index(max_correct) 218 | 219 | return max_correct, idx 220 | 221 | with concurrent.futures.ThreadPoolExecutor() as executor: 222 | futures = [executor.submit(process_item, i) for i in range(len(dataset))] 223 | 224 | for future in tqdm(concurrent.futures.as_completed(futures), total=len(dataset)): 225 | max_correct, idx = future.result() 226 | if max_correct >= 3: # GPT-3.5-turbo-1106's test case accuracy is about 67%. So we choice 60% as the bar. 227 | i = futures.index(future) 228 | dataset[i]["completion"] = dataset[i]["completion_list"][idx] 229 | dataset[i]["need_reproduce"] = False 230 | dataset[i]["idx"] = idx 231 | dataset[i]["max_correct"] = max_correct 232 | _for_completion += 1 233 | else: 234 | i = futures.index(future) 235 | dataset[i]["completion"] = dataset[i]["completion_list"][idx] 236 | 237 | 238 | print("==============Start Agent Testing==============") 239 | print(f"test_report: {(total_correct/len(dataset)*100):.1f}") 240 | print(f"test_for_completion: {(_for_completion/len(dataset)*100):.1f}") 241 | return dataset 242 | 243 | 244 | if __name__ == "__main__": 245 | model_list = ["gpt-3.5-turbo-1106"] 246 | language = ["python"] 247 | for model in model_list: 248 | for lg in language: 249 | path = f"./dataset/{model}_{lg}.json" 250 | with open(path, "r") as f: 251 | dataset = json.load(f) 252 | epoch = 5 253 | for current_epoch in range(epoch): 254 | dataset = test_agent_concurrency(dataset,lg) 255 | test_report(dataset,lg) 256 | dataset = call_fetch_completion_helper(dataset,model,lg) 257 | dataset = call_fetch_test_completion_helper(dataset,model,lg) 258 | with open(f"./dataset/{model}_{current_epoch}.json", "w") as f: 259 | json.dump(dataset, f, indent=4) 260 | dataset = test_agent_concurrency(dataset,lg) 261 | test_report(dataset,lg) 262 | --------------------------------------------------------------------------------
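
A minimal, self-contained sketch of the candidate-selection idea behind the HumanEval executor: run every candidate completion against every model-generated test case and keep the candidate that passes the most. The repository's real implementation (`test_agent_concurrency` in `src/test_executor_humaneval.py`) delegates execution to CodeGeeX's `check_correctness` in a separate process; the illustration below simply `exec()`s the code under a signal-based timeout (Unix only), so it is an assumption-laden sketch and not a safe sandbox for untrusted model output.

```python
# Minimal sketch of AgentCoder's selection step: score each candidate completion by the
# number of model-generated test snippets it passes, then keep the highest-scoring one.
# NOT a sandbox: exec() runs the code in-process; the real pipeline uses CodeGeeX's
# check_correctness in a separate process with its own time limit.
import signal
from contextlib import contextmanager

@contextmanager
def time_limit(seconds: float):
    """Raise TimeoutError if the body runs longer than `seconds` (Unix only)."""
    def handler(signum, frame):
        raise TimeoutError("Timed out!")
    signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)

def passes(completion: str, test_case: str, timeout: float = 3.0) -> bool:
    """Return True if `completion` followed by `test_case` executes without raising."""
    try:
        with time_limit(timeout):
            exec(completion + "\n" + test_case, {})
        return True
    except Exception:
        return False

def select_best(completion_list, test_case_list):
    """Return (index, score) of the candidate that passes the most generated tests."""
    scores = [sum(passes(code, test) for test in test_case_list) for code in completion_list]
    best = scores.index(max(scores))
    return best, scores[best]

if __name__ == "__main__":
    candidates = [
        "def add(a, b):\n    return a - b",   # buggy candidate
        "def add(a, b):\n    return a + b",   # correct candidate
    ]
    tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
    idx, score = select_best(candidates, tests)
    print(f"picked candidate {idx} with {score}/{len(tests)} generated tests passed")
```

In the actual pipeline a dataset entry is only frozen (`need_reproduce = False`) once its best candidate passes at least 3 of the generated tests, which is the bar used in `test_agent_concurrency`.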