├── .env.example ├── .gitignore ├── requirements.txt ├── images └── agentcoder_pipeline.pdf ├── .gitmodules ├── scripts └── run.sh ├── prompts ├── zero_shot_mbpp_prompt_update.txt ├── zero_shot_test_designer_mbpp_prompt_update.txt ├── zero_shot_test_designer_humaneval_prompt_update.txt ├── zero_shot_humaneval_prompt_update.txt ├── zero_shot_mbpp_prompt.txt ├── zero_shot_humaneval_prompt.txt ├── test_designer_mbpp_prompt_update.txt ├── zero_shot_test_designer_mbpp_prompt.txt ├── zero_shot_test_designer_humaneval_prompt.txt ├── mbpp_prompt_update.txt ├── test_designer_humaneval_prompt_update.txt ├── mbpp_prompt.txt ├── test_designer_mbpp_prompt.txt ├── humaneval_prompt_update.txt ├── test_designer_humaneval_prompt.txt └── humaneval_prompt.txt ├── README.md └── src ├── test_designer_humaneval.py ├── programmer_humaneval.py ├── test_designer_mbpp.py ├── programmer_mbpp.py ├── test_executor_mbpp.py └── test_executor_humaneval.py /.env.example: -------------------------------------------------------------------------------- 1 | OPENAI_API_KEY="YOUR_OPENAI_API_KEY" 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .venv/ 2 | .DS_Store 3 | .env 4 | *.pyc 5 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | datasets==3.3.1 2 | openai==0.28.0 3 | python-dotenv==1.0.1 4 | -------------------------------------------------------------------------------- /images/agentcoder_pipeline.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huangd1999/AgentCoder/HEAD/images/agentcoder_pipeline.pdf -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "CodeGeeX"] 2 | path = CodeGeeX 3 | url = https://github.com/THUDM/CodeGeeX.git 4 | ignore = dirty 5 | -------------------------------------------------------------------------------- /scripts/run.sh: -------------------------------------------------------------------------------- 1 | python programmer_[humaneval/mbpp].py 2 | python test_designer_[humaneval/mbpp].py 3 | python test_executor_[humaneval/mbpp].py -------------------------------------------------------------------------------- /prompts/zero_shot_mbpp_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. -------------------------------------------------------------------------------- /prompts/zero_shot_test_designer_mbpp_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. 
2 | 3 | - The format of test cases should be: 4 | ```python 5 | assert function_name(input) == expected_output, "Test Case Description" 6 | ``` 7 | -------------------------------------------------------------------------------- /prompts/zero_shot_test_designer_humaneval_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. 2 | 3 | - The format of test cases should be: 4 | ```python 5 | assert function_name(input) == expected_output, "Test Case Description" 6 | ``` 7 | -------------------------------------------------------------------------------- /prompts/zero_shot_humaneval_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Code Formatting**: Please write code in 6 | ```python 7 | [Code] 8 | ``` 9 | format. -------------------------------------------------------------------------------- /prompts/zero_shot_mbpp_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Instructions**: 6 | 1. **Understand and Clarify**: Make sure you understand the task. 7 | 2. **Algorithm/Method Selection**: Decide on the most efficient way. 8 | 3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode. 9 | 4. **Code Generation**: Translate your pseudocode into executable Python code. -------------------------------------------------------------------------------- /prompts/zero_shot_humaneval_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Code Formatting**: Please write code in 6 | ```python 7 | [Code] 8 | ``` 9 | format. 10 | 11 | **Instructions**: 12 | 1. **Understand and Clarify**: Make sure you understand the task. 13 | 2. **Algorithm/Method Selection**: Decide on the most efficient way. 14 | 3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode. 15 | 4. **Code Generation**: Translate your pseudocode into executable Python code. -------------------------------------------------------------------------------- /prompts/test_designer_mbpp_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. 2 | 3 | - The format of test cases should be: 4 | ```python 5 | assert function_name(input) == expected_output, "Test Case Description" 6 | ``` 7 | 8 | # For example: 9 | 10 | ## Prompt 1: 11 | ```python 12 | Write a function to find the shared elements from the given two lists. 
13 | ``` 14 | 15 | ## Completion 1: 16 | ```python 17 | assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5)) 18 | assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4)) 19 | assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14)) 20 | ``` 21 | 22 | ## Prompt 2: 23 | ```python 24 | Write a python function to identify non-prime numbers. 25 | ``` 26 | 27 | ## Completion 2: 28 | ```python 29 | assert is_not_prime(2) == False 30 | assert is_not_prime(10) == True 31 | assert is_not_prime(35) == True 32 | assert is_not_prime(37) == False 33 | ``` -------------------------------------------------------------------------------- /prompts/zero_shot_test_designer_mbpp_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. These test cases should encompass Basic, Edge, and Large Scale scenarios to ensure the code's robustness, reliability, and scalability. 2 | 3 | **1. Basic Test Cases**: 4 | - **Objective**: To verify the fundamental functionality of the `has_close_elements` function under normal conditions. 5 | 6 | **2. Edge Test Cases**: 7 | - **Objective**: To evaluate the function's behavior under extreme or unusual conditions. 8 | 9 | **3. Large Scale Test Cases**: 10 | - **Objective**: To assess the function’s performance and scalability with large data samples. 11 | 12 | **Instructions**: 13 | - Implement a comprehensive set of test cases following the guidelines above. 14 | - Ensure each test case is well-documented with comments explaining the scenario it covers. 15 | - Pay special attention to edge cases as they often reveal hidden bugs. 16 | - For large-scale tests, focus on the function's efficiency and performance under heavy loads. 17 | 18 | - The format of test cases should be: 19 | ```python 20 | assert function_name(input) == expected_output, "Test Case Description" 21 | ``` -------------------------------------------------------------------------------- /prompts/zero_shot_test_designer_humaneval_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. These test cases should encompass Basic, Edge, and Large Scale scenarios to ensure the code's robustness, reliability, and scalability. 2 | 3 | **1. Basic Test Cases**: 4 | - **Objective**: To verify the fundamental functionality of the `has_close_elements` function under normal conditions. 5 | 6 | **2. Edge Test Cases**: 7 | - **Objective**: To evaluate the function's behavior under extreme or unusual conditions. 8 | 9 | **3. Large Scale Test Cases**: 10 | - **Objective**: To assess the function’s performance and scalability with large data samples. 11 | 12 | **Instructions**: 13 | - Implement a comprehensive set of test cases following the guidelines above. 14 | - Ensure each test case is well-documented with comments explaining the scenario it covers. 15 | - Pay special attention to edge cases as they often reveal hidden bugs. 16 | - For large-scale tests, focus on the function's efficiency and performance under heavy loads. 
17 | 18 | - The format of test cases should be: 19 | ```python 20 | assert function_name(input) == expected_output, "Test Case Description" 21 | ``` 22 | -------------------------------------------------------------------------------- /prompts/mbpp_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | # For example: 6 | 7 | ## Prompt 1: 8 | ```python 9 | Write a python function to remove first and last occurrence of a given character from the string. 10 | ``` 11 | ## Test Case 1: 12 | ```python 13 | assert remove_Occ("hello","l") == "heo" 14 | assert remove_Occ("abcda","a") == "bcd" 15 | assert remove_Occ("PHP","P") == "H" 16 | ``` 17 | 18 | ## Completion 1: 19 | ```python 20 | def remove_Occ(s,ch): 21 | for i in range(len(s)): 22 | if (s[i] == ch): 23 | s = s[0 : i] + s[i + 1:] 24 | break 25 | for i in range(len(s) - 1,-1,-1): 26 | if (s[i] == ch): 27 | s = s[0 : i] + s[i + 1:] 28 | break 29 | return s 30 | ``` 31 | 32 | ## Prompt 2: 33 | ```python 34 | Write a function to sort a given matrix in ascending order according to the sum of its rows. 35 | ``` 36 | 37 | ## Test Case 1: 38 | ```python 39 | assert sort_matrix([[1, 2, 3], [2, 4, 5], [1, 1, 1]])==[[1, 1, 1], [1, 2, 3], [2, 4, 5]] 40 | assert sort_matrix([[1, 2, 3], [-2, 4, -5], [1, -1, 1]])==[[-2, 4, -5], [1, -1, 1], [1, 2, 3]] 41 | assert sort_matrix([[5,8,9],[6,4,3],[2,1,4]])==[[2, 1, 4], [6, 4, 3], [5, 8, 9]] 42 | ``` 43 | 44 | ## Completion 2: 45 | ```python 46 | def sort_matrix(M): 47 | result = sorted(M, key=sum) 48 | return result 49 | ``` -------------------------------------------------------------------------------- /prompts/test_designer_humaneval_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. 2 | 3 | - The format of test cases should be: 4 | ```python 5 | assert function_name(input) == expected_output, "Test Case Description" 6 | ``` 7 | 8 | # For example: 9 | 10 | ## Prompt 1: 11 | ```python 12 | from typing import List 13 | 14 | 15 | def has_close_elements(numbers: List[float], threshold: float) -> bool: 16 | """ Check if in given list of numbers, are any two numbers closer to each other than 17 | given threshold. 18 | >>> has_close_elements([1.0, 2.0, 3.0], 0.5) 19 | False 20 | >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) 21 | True 22 | """ 23 | 24 | ``` 25 | 26 | ## Completion 1: 27 | ```python 28 | 29 | assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False 30 | assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)== True 31 | 32 | ``` 33 | 34 | ## Prompt 2: 35 | ```python 36 | from typing import List 37 | 38 | 39 | def separate_paren_groups(paren_string: str) -> List[str]: 40 | """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to 41 | separate those group into separate strings and return the list of those. 42 | Separate groups are balanced (each open brace is properly closed) and not nested within each other 43 | Ignore any spaces in the input string. 
44 | >>> separate_paren_groups('( ) (( )) (( )( ))') 45 | ['()', '(())', '(()())'] 46 | """ 47 | 48 | ``` 49 | 50 | ## Completion 2: 51 | ```python 52 | 53 | assert separate_paren_groups('( ) (( )) (( )( ))') == ['()', '(())', '(()())'] 54 | 55 | ``` -------------------------------------------------------------------------------- /prompts/mbpp_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Instructions**: 6 | 1. **Understand and Clarify**: Make sure you understand the task. 7 | 2. **Algorithm/Method Selection**: Decide on the most efficient way. 8 | 3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode. 9 | 4. **Code Generation**: Translate your pseudocode into executable Python code. 10 | 11 | 12 | # For example: 13 | 14 | ## Prompt 1: 15 | ```python 16 | Write a python function to remove first and last occurrence of a given character from the string. 17 | ``` 18 | ## Test Case 1: 19 | ```python 20 | assert remove_Occ("hello","l") == "heo" 21 | assert remove_Occ("abcda","a") == "bcd" 22 | assert remove_Occ("PHP","P") == "H" 23 | ``` 24 | 25 | ## Completion 1: 26 | ```python 27 | def remove_Occ(s,ch): 28 | for i in range(len(s)): 29 | if (s[i] == ch): 30 | s = s[0 : i] + s[i + 1:] 31 | break 32 | for i in range(len(s) - 1,-1,-1): 33 | if (s[i] == ch): 34 | s = s[0 : i] + s[i + 1:] 35 | break 36 | return s 37 | ``` 38 | 39 | ## Prompt 2: 40 | ```python 41 | Write a function to sort a given matrix in ascending order according to the sum of its rows. 42 | ``` 43 | 44 | ## Test Case 1: 45 | ```python 46 | assert sort_matrix([[1, 2, 3], [2, 4, 5], [1, 1, 1]])==[[1, 1, 1], [1, 2, 3], [2, 4, 5]] 47 | assert sort_matrix([[1, 2, 3], [-2, 4, -5], [1, -1, 1]])==[[-2, 4, -5], [1, -1, 1], [1, 2, 3]] 48 | assert sort_matrix([[5,8,9],[6,4,3],[2,1,4]])==[[2, 1, 4], [6, 4, 3], [5, 8, 9]] 49 | ``` 50 | 51 | ## Completion 2: 52 | ```python 53 | def sort_matrix(M): 54 | result = sorted(M, key=sum) 55 | return result 56 | ``` -------------------------------------------------------------------------------- /prompts/test_designer_mbpp_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. These test cases should encompass Basic, Edge, and Large Scale scenarios to ensure the code's robustness, reliability, and scalability. 2 | 3 | **1. Basic Test Cases**: 4 | - **Objective**: To verify the fundamental functionality of the `has_close_elements` function under normal conditions. 5 | 6 | **2. Edge Test Cases**: 7 | - **Objective**: To evaluate the function's behavior under extreme or unusual conditions. 8 | 9 | **3. Large Scale Test Cases**: 10 | - **Objective**: To assess the function’s performance and scalability with large data samples. 11 | 12 | **Instructions**: 13 | - Implement a comprehensive set of test cases following the guidelines above. 14 | - Ensure each test case is well-documented with comments explaining the scenario it covers. 15 | - Pay special attention to edge cases as they often reveal hidden bugs. 16 | - For large-scale tests, focus on the function's efficiency and performance under heavy loads. 
17 | 18 | - The format of test cases should be: 19 | ```python 20 | assert function_name(input) == expected_output, "Test Case Description" 21 | ``` 22 | 23 | # For example: 24 | 25 | ## Prompt 1: 26 | ```python 27 | Write a function to find the shared elements from the given two lists. 28 | ``` 29 | 30 | ## Completion 1: 31 | ```python 32 | assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5)) 33 | assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4)) 34 | assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14)) 35 | ``` 36 | 37 | ## Prompt 2: 38 | ```python 39 | Write a python function to identify non-prime numbers. 40 | ``` 41 | 42 | ## Completion 2: 43 | ```python 44 | assert is_not_prime(2) == False 45 | assert is_not_prime(10) == True 46 | assert is_not_prime(35) == True 47 | assert is_not_prime(37) == False 48 | ``` -------------------------------------------------------------------------------- /prompts/humaneval_prompt_update.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Code Formatting**: Please write code in 6 | ```python 7 | [Code] 8 | ``` 9 | format. 10 | 11 | # For example: 12 | 13 | ## Prompt 1: 14 | ```python 15 | from typing import List 16 | 17 | 18 | def has_close_elements(numbers: List[float], threshold: float) -> bool: 19 | """ Check if in given list of numbers, are any two numbers closer to each other than 20 | given threshold. 21 | >>> has_close_elements([1.0, 2.0, 3.0], 0.5) 22 | False 23 | >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) 24 | True 25 | """ 26 | 27 | ``` 28 | 29 | ## Completion 1: 30 | ```python 31 | for idx, elem in enumerate(numbers): 32 | for idx2, elem2 in enumerate(numbers): 33 | if idx != idx2: 34 | distance = abs(elem - elem2) 35 | if distance < threshold: 36 | return True 37 | 38 | return False 39 | 40 | ``` 41 | 42 | ## Prompt 2: 43 | ```python 44 | from typing import List 45 | 46 | 47 | def separate_paren_groups(paren_string: str) -> List[str]: 48 | """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to 49 | separate those group into separate strings and return the list of those. 50 | Separate groups are balanced (each open brace is properly closed) and not nested within each other 51 | Ignore any spaces in the input string. 52 | >>> separate_paren_groups('( ) (( )) (( )( ))') 53 | ['()', '(())', '(()())'] 54 | """ 55 | 56 | ``` 57 | 58 | ## Completion 2: 59 | ```python 60 | result = [] 61 | current_string = [] 62 | current_depth = 0 63 | 64 | for c in paren_string: 65 | if c == '(': 66 | current_depth += 1 67 | current_string.append(c) 68 | elif c == ')': 69 | current_depth -= 1 70 | current_string.append(c) 71 | 72 | if current_depth == 0: 73 | result.append(''.join(current_string)) 74 | current_string.clear() 75 | 76 | return result 77 | ``` -------------------------------------------------------------------------------- /prompts/test_designer_humaneval_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: As a tester, your task is to create comprehensive test cases for the incomplete function. 
These test cases should encompass Basic, Edge, and Large Scale scenarios to ensure the code's robustness, reliability, and scalability. 2 | 3 | **1. Basic Test Cases**: 4 | - **Objective**: To verify the fundamental functionality of the `has_close_elements` function under normal conditions. 5 | 6 | **2. Edge Test Cases**: 7 | - **Objective**: To evaluate the function's behavior under extreme or unusual conditions. 8 | 9 | **3. Large Scale Test Cases**: 10 | - **Objective**: To assess the function’s performance and scalability with large data samples. 11 | 12 | **Instructions**: 13 | - Implement a comprehensive set of test cases following the guidelines above. 14 | - Ensure each test case is well-documented with comments explaining the scenario it covers. 15 | - Pay special attention to edge cases as they often reveal hidden bugs. 16 | - For large-scale tests, focus on the function's efficiency and performance under heavy loads. 17 | 18 | - The format of test cases should be: 19 | ```python 20 | assert function_name(input) == expected_output, "Test Case Description" 21 | ``` 22 | 23 | 24 | # For example: 25 | 26 | ## Prompt 1: 27 | ```python 28 | from typing import List 29 | 30 | 31 | def has_close_elements(numbers: List[float], threshold: float) -> bool: 32 | """ Check if in given list of numbers, are any two numbers closer to each other than 33 | given threshold. 34 | >>> has_close_elements([1.0, 2.0, 3.0], 0.5) 35 | False 36 | >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) 37 | True 38 | """ 39 | 40 | ``` 41 | 42 | ## Completion 1: 43 | ```python 44 | 45 | assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False 46 | assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)== True 47 | 48 | ``` 49 | 50 | ## Prompt 2: 51 | ```python 52 | from typing import List 53 | 54 | 55 | def separate_paren_groups(paren_string: str) -> List[str]: 56 | """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to 57 | separate those group into separate strings and return the list of those. 58 | Separate groups are balanced (each open brace is properly closed) and not nested within each other 59 | Ignore any spaces in the input string. 60 | >>> separate_paren_groups('( ) (( )) (( )( ))') 61 | ['()', '(())', '(()())'] 62 | """ 63 | 64 | ``` 65 | 66 | ## Completion 2: 67 | ```python 68 | 69 | assert separate_paren_groups('( ) (( )) (( )( ))') == ['()', '(())', '(()())'] 70 | 71 | ``` -------------------------------------------------------------------------------- /prompts/humaneval_prompt.txt: -------------------------------------------------------------------------------- 1 | **Role**: You are a software programmer. 2 | 3 | **Task**: As a programmer, you are required to complete the function. Use a Chain-of-Thought approach to break down the problem, create pseudocode, and then write the code in Python language. 4 | 5 | **Code Formatting**: Please write code in 6 | ```python 7 | [Code] 8 | ``` 9 | format. 10 | 11 | **Instructions**: 12 | 1. **Understand and Clarify**: Make sure you understand the task. 13 | 2. **Algorithm/Method Selection**: Decide on the most efficient way. 14 | 3. **Pseudocode Creation**: Write down the steps you will follow in pseudocode. 15 | 4. **Code Generation**: Translate your pseudocode into executable Python code. 
16 | 17 | 18 | # For example: 19 | 20 | ## Prompt 1: 21 | ```python 22 | from typing import List 23 | 24 | 25 | def has_close_elements(numbers: List[float], threshold: float) -> bool: 26 | """ Check if in given list of numbers, are any two numbers closer to each other than 27 | given threshold. 28 | >>> has_close_elements([1.0, 2.0, 3.0], 0.5) 29 | False 30 | >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) 31 | True 32 | """ 33 | 34 | ``` 35 | 36 | ## Completion 1: 37 | ```python 38 | for idx, elem in enumerate(numbers): 39 | for idx2, elem2 in enumerate(numbers): 40 | if idx != idx2: 41 | distance = abs(elem - elem2) 42 | if distance < threshold: 43 | return True 44 | 45 | return False 46 | 47 | ``` 48 | 49 | ## Prompt 2: 50 | ```python 51 | from typing import List 52 | 53 | 54 | def separate_paren_groups(paren_string: str) -> List[str]: 55 | """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to 56 | separate those group into separate strings and return the list of those. 57 | Separate groups are balanced (each open brace is properly closed) and not nested within each other 58 | Ignore any spaces in the input string. 59 | >>> separate_paren_groups('( ) (( )) (( )( ))') 60 | ['()', '(())', '(()())'] 61 | """ 62 | 63 | ``` 64 | 65 | ## Completion 2: 66 | ```python 67 | result = [] 68 | current_string = [] 69 | current_depth = 0 70 | 71 | for c in paren_string: 72 | if c == '(': 73 | current_depth += 1 74 | current_string.append(c) 75 | elif c == ')': 76 | current_depth -= 1 77 | current_string.append(c) 78 | 79 | if current_depth == 0: 80 | result.append(''.join(current_string)) 81 | current_string.clear() 82 | 83 | return result 84 | ``` -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AgentCoder: multi-agent code generation framework 2 | 3 | AgentCoder is a novel multiagent-code generation framework that leverages the power of large language models (LLMs) to enhance the effectiveness of code generation. The framework consists of three specialized agents: the programmer agent, the test designer agent, and the test executor agent. These agents collaborate to generate high-quality code snippets, design comprehensive test cases, and ensure the correctness of the generated code through an iterative feedback loop. 4 | 5 | ## Key Features 6 | 7 | - **Multiagent Collaboration**: AgentCoder utilizes a multiagent framework where each agent specializes in a specific task, leading to improved code generation effectiveness. 8 | - **Independent Test Case Generation**: The test designer agent generates diverse and objective test cases independently, ensuring comprehensive testing of the generated code. 9 | - **Iterative Code Refinement**: The test executor agent executes the generated test cases against the code and provides feedback to the programmer agent for iterative code refinement. 10 | - **Modularity and Scalability**: The modular structure of AgentCoder allows for easy integration with advanced models and future enhancements, ensuring adaptability in the evolving landscape of code generation. 11 | 12 | ## Installation 13 | 14 | To use AgentCoder, you need to have an API key from OpenAI or other similar third-party providers. 15 | 1. 
Clone the AgentCoder repository: 16 | ``` 17 | git clone https://github.com/your-username/AgentCoder.git 18 | cd AgentCoder 19 | git clone https://github.com/THUDM/CodeGeeX 20 | ``` 21 | 22 | 2. Install the required dependencies: 23 | ``` 24 | pip install -r requirements.txt 25 | ``` 26 | 27 | 3. Copy `.env.example` to `.env` and add your OpenAI API key: 28 | ```python 29 | OPENAI_API_KEY="YOUR_OPENAI_API_KEY" 30 | ``` 31 | 32 | ## Usage 33 | 34 | ### Code Generation 35 | 36 | To generate code snippets, run one of the following commands: 37 | ``` 38 | python src/programmer_[humaneval/mbpp].py 39 | ``` 40 | These scripts generate the code snippets that are later used for test case generation. Run them from the repository root so that the relative `./prompts` and `./dataset` paths resolve, and create the `dataset/` directory first if it does not already exist. 41 | 42 | ### Test Case Generation 43 | 44 | To generate test cases, run one of the following commands: 45 | ``` 46 | python src/test_designer_[humaneval/mbpp].py 47 | ``` 48 | These scripts generate diverse and comprehensive test cases based on the coding requirements. 49 | 50 | ### Self-Optimization Process 51 | 52 | To perform the self-optimization process, run one of the following commands: 53 | ``` 54 | python src/test_executor_[humaneval/mbpp].py 55 | ``` 56 | These scripts execute the generated test cases against the code and feed the results back to the programmer agent for iterative code refinement. 57 | 58 | 59 | ## Contributions 60 | 61 | Contributions to AgentCoder are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository. 62 | 63 | ## License 64 | 65 | AgentCoder is released under the [MIT License](LICENSE). 66 | 67 | ## Acknowledgments 68 | 69 | We would like to thank AIOHUB for providing funding and support for the development of AgentCoder. We also acknowledge the contributions of the open-source community and the developers of the large language models used in this project.
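
## Pipeline Sketch

For reference, `src/test_executor_humaneval.py` drives the whole loop: it scores candidate completions against the generated tests, reports accuracy on the canonical HumanEval tests, and then asks the programmer and test designer agents to regenerate entries that are still failing. The snippet below is a simplified sketch of that existing loop; the function names are taken from the source, while the surrounding scaffolding (loading `dataset`, `model`, `lg`) is omitted for brevity.

```python
# Simplified view of the refinement loop in src/test_executor_humaneval.py.
# The real script loads ./dataset/{model}_{lg}.json and writes a checkpoint after each epoch.
for current_epoch in range(5):
    dataset = test_agent_concurrency(dataset, lg)                     # score completions against generated tests
    test_report(dataset, lg)                                          # pass rate on the canonical HumanEval tests
    dataset = call_fetch_completion_helper(dataset, model, lg)        # programmer agent regenerates weak solutions
    dataset = call_fetch_test_completion_helper(dataset, model, lg)   # test designer agent regenerates test cases
```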
70 | -------------------------------------------------------------------------------- /src/test_designer_humaneval.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | from tqdm import tqdm 5 | import copy 6 | import openai 7 | from concurrent.futures import ThreadPoolExecutor 8 | import concurrent.futures 9 | import time 10 | from datasets import load_dataset 11 | from dotenv import load_dotenv 12 | 13 | load_dotenv() 14 | 15 | # Setting API parameters 16 | openai.api_key = os.getenv("OPENAI_API_KEY") 17 | 18 | dataset = load_dataset("openai_humaneval",split="test") 19 | dataset = [entry for entry in dataset] 20 | 21 | prompt_path = "./prompts/test_designer_humaneval_prompt_update.txt" 22 | with open(prompt_path, "r") as f: 23 | construct_few_shot_prompt = f.read() 24 | 25 | def preprocess_data(test_case_string): 26 | if f"```python" in test_case_string: 27 | test_case_string = test_case_string[test_case_string.find(f"```python")+len(f"```python"):] 28 | test_case_string = test_case_string[:test_case_string.find("```")] 29 | 30 | return test_case_string 31 | 32 | # Function to fetch completion 33 | def fetch_completion(data_entry, model, lg,times=10): 34 | global construct_few_shot_prompt 35 | if "need_reproduce" in data_entry.keys() and data_entry["need_reproduce"]==False: 36 | return data_entry 37 | prompt = data_entry["prompt"] 38 | entry_point = data_entry["entry_point"] 39 | 40 | text = f""" 41 | {construct_few_shot_prompt} 42 | 43 | **Input Code Snippet**: 44 | ```python 45 | {prompt} 46 | ``` 47 | """ 48 | test_case_list = [] 49 | for i in range(times): 50 | while True: 51 | try: 52 | completions = openai.ChatCompletion.create( 53 | model="gpt-3.5-turbo-1106", 54 | stream=False, 55 | messages=[ 56 | {"role": "system", "content": "You are a code developer assistant."}, 57 | {"role": "user", "content":text}, 58 | ], 59 | request_timeout=100, 60 | ) 61 | test_case = completions.choices[0]["message"]["content"] 62 | test_case = preprocess_data(test_case) 63 | except Exception as e: 64 | time.sleep(20) 65 | print(e) 66 | test_case = "" 67 | if test_case!="": 68 | break 69 | test_case_list.append(test_case) 70 | data_entry["test_case_list"] = test_case_list 71 | return data_entry 72 | 73 | def call_fetch_test_completion_helper(dataset, model,lg): 74 | print("Fixing bug...") 75 | with ThreadPoolExecutor(max_workers=5) as executor: 76 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 77 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 78 | entry = future_to_entry[future] 79 | try: 80 | updated_entry = future.result() 81 | idx = dataset.index(entry) 82 | dataset[idx] = updated_entry 83 | except Exception as e: 84 | print(repr(e)) 85 | return dataset 86 | 87 | 88 | if __name__ == "__main__": 89 | model_list = ["gpt-3.5-turbo-1106"] 90 | language = ["python"] 91 | for model in model_list: 92 | for lg in language: 93 | from datasets import load_dataset 94 | with open(f"./dataset/{model}_{lg}.json", "r") as f: 95 | dataset = json.load(f) 96 | dataset = [entry for entry in dataset] 97 | with ThreadPoolExecutor(max_workers=5) as executor: 98 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 99 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 100 | entry = future_to_entry[future] 101 | try: 102 | updated_entry = 
future.result() 103 | idx = dataset.index(entry) 104 | dataset[idx] = updated_entry 105 | except Exception as e: 106 | print(repr(e)) 107 | 108 | with open(f"./dataset/{model}_{lg}.json", "w") as f: 109 | json.dump(dataset, f, indent=4) 110 | -------------------------------------------------------------------------------- /src/programmer_humaneval.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | from tqdm import tqdm 5 | import copy 6 | import openai 7 | from concurrent.futures import ThreadPoolExecutor 8 | import concurrent.futures 9 | import time 10 | from datasets import load_dataset 11 | from dotenv import load_dotenv 12 | 13 | load_dotenv() 14 | 15 | # Setting API parameters 16 | openai.api_base = "https://api.aiohub.org/v1" 17 | openai.api_key = os.getenv("OPENAI_API_KEY") 18 | 19 | dataset = load_dataset("openai_humaneval",split="test") 20 | dataset = [entry for entry in dataset] 21 | 22 | prompt_path = "./prompts/humaneval_prompt_update.txt" 23 | with open(prompt_path, "r") as f: 24 | construct_few_shot_prompt = f.read() 25 | 26 | def preprocess_data(completion_string): 27 | if f"```python" in completion_string: 28 | completion_string = completion_string[completion_string.find(f"```python")+len(f"```python"):] 29 | completion_string = completion_string[:completion_string.find("```")] 30 | else: 31 | print("Error: No code block found") 32 | return completion_string 33 | 34 | # Function to fetch completion 35 | def fetch_completion(data_entry, model,lg,times = 5): 36 | global construct_few_shot_prompt 37 | if "need_reproduce" in data_entry.keys() and data_entry["need_reproduce"]==False: 38 | return data_entry 39 | prompt = data_entry["prompt"] 40 | text = f""" 41 | {construct_few_shot_prompt} 42 | 43 | **Input Code Snippet**: 44 | ```python 45 | {prompt} 46 | ``` 47 | ## Completion 3: 48 | """ 49 | completions_code = [] 50 | for i in range(times): 51 | while True: 52 | try: 53 | completions = openai.ChatCompletion.create( 54 | model=model, 55 | stream=False, 56 | messages=[ 57 | {"role": "system", "content": "You are a software programmer."}, 58 | {"role": "user", "content":text}, 59 | ], 60 | request_timeout=100, 61 | ) 62 | completion = completions.choices[0]["message"]["content"] 63 | completion = preprocess_data(completion) 64 | 65 | except Exception as e: 66 | print(e) 67 | time.sleep(10) 68 | completion = "" 69 | if completion!="": 70 | break 71 | completions_code.append(completion) 72 | data_entry["completion_list"] = completions_code 73 | return data_entry 74 | 75 | 76 | def call_fetch_completion_helper(dataset, model,lg): 77 | print("Fixing bug...") 78 | with ThreadPoolExecutor(max_workers=5) as executor: 79 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 80 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 81 | entry = future_to_entry[future] 82 | try: 83 | updated_entry = future.result() 84 | idx = dataset.index(entry) 85 | dataset[idx] = updated_entry 86 | except Exception as e: 87 | print(repr(e)) 88 | return dataset 89 | 90 | if __name__ == "__main__": 91 | model_list = ["gpt-3.5-turbo-1106"] 92 | language = ["python"] 93 | for model in model_list: 94 | for lg in language: 95 | from datasets import load_dataset 96 | dataset = load_dataset("openai_humaneval",split="test") 97 | dataset = [entry for entry in dataset] 98 | with ThreadPoolExecutor(max_workers=5) as executor: 99 | 
future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 100 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 101 | entry = future_to_entry[future] 102 | try: 103 | updated_entry = future.result() 104 | idx = dataset.index(entry) 105 | dataset[idx] = updated_entry 106 | except Exception as e: 107 | print(repr(e)) 108 | with open(f"./dataset/{model}_{lg}.json", "w") as f: 109 | json.dump(dataset, f, indent=4) 110 | -------------------------------------------------------------------------------- /src/test_designer_mbpp.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | from tqdm import tqdm 5 | import copy 6 | import openai 7 | from concurrent.futures import ThreadPoolExecutor 8 | import concurrent.futures 9 | import time 10 | from datasets import load_dataset 11 | from dotenv import load_dotenv 12 | 13 | load_dotenv() 14 | 15 | # Setting API parameters 16 | openai.api_key = os.getenv("OPENAI_API_KEY") 17 | 18 | dataset = load_dataset("evalplus/mbppplus",split="test") 19 | dataset = [entry for entry in dataset] 20 | task_0_tests = "\n".join(dataset[0]["test_list"]) 21 | task_1_tests = "\n".join(dataset[1]["test_list"]) 22 | 23 | prompt_path = "./prompts/test_designer_mbpp_prompt_update.txt" 24 | with open(prompt_path, "r") as f: 25 | construct_few_shot_prompt = f.read() 26 | 27 | def preprocess_data(test_case_string): 28 | if f"```python" in test_case_string: 29 | test_case_string = test_case_string[test_case_string.find(f"```python")+len(f"```python"):] 30 | test_case_string = test_case_string[:test_case_string.find("```")] 31 | return test_case_string 32 | 33 | # Function to fetch completion 34 | def fetch_completion(data_entry, model, lg,times=5): 35 | global construct_few_shot_prompt 36 | if "need_reproduce" in data_entry.keys() and data_entry["need_reproduce"]==False: 37 | return data_entry 38 | prompt = data_entry["prompt"] 39 | test_case_0 = data_entry["test_list"][0] 40 | function_name = test_case_0.split("(")[0].split(" ")[-1] 41 | 42 | text = f""" 43 | {construct_few_shot_prompt} 44 | 45 | **Input Code Snippet**: 46 | ```python 47 | {prompt} 48 | ``` 49 | """ 50 | test_case_list = [] 51 | for i in range(times): 52 | while True: 53 | try: 54 | completions = openai.ChatCompletion.create( 55 | model="gpt-3.5-turbo-1106", 56 | stream=False, 57 | messages=[ 58 | {"role": "system", "content": "You are a code developer assistant."}, 59 | {"role": "user", "content":text}, 60 | ], 61 | request_timeout=100, 62 | ) 63 | test_case = completions.choices[0]["message"]["content"] 64 | test_case = preprocess_data(test_case) 65 | except Exception as e: 66 | time.sleep(20) 67 | print(e) 68 | test_case = "" 69 | if test_case!="": 70 | break 71 | test_case_list.append(test_case) 72 | data_entry["test_case_list"] = test_case_list 73 | return data_entry 74 | 75 | def call_fetch_test_completion_helper(dataset, model,lg): 76 | print("Fixing bug...") 77 | with ThreadPoolExecutor(max_workers=5) as executor: 78 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 79 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 80 | entry = future_to_entry[future] 81 | try: 82 | updated_entry = future.result() 83 | idx = dataset.index(entry) 84 | dataset[idx] = updated_entry 85 | except Exception as e: 86 | print(repr(e)) 87 | return dataset 88 | 89 | 90 | if 
__name__ == "__main__": 91 | model_list = ["gpt-3.5-turbo-0301"] 92 | language = ["python"] 93 | for model in model_list: 94 | for lg in language: 95 | from datasets import load_dataset 96 | with open(f"./dataset/{model}_mbpp.json", "r") as f: 97 | dataset = json.load(f) 98 | dataset = [entry for entry in dataset] 99 | with ThreadPoolExecutor(max_workers=5) as executor: 100 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 101 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 102 | entry = future_to_entry[future] 103 | try: 104 | updated_entry = future.result() 105 | idx = dataset.index(entry) 106 | dataset[idx] = updated_entry 107 | except Exception as e: 108 | print(repr(e)) 109 | 110 | with open(f"./dataset/{model}_mbpp.json", "w") as f: 111 | json.dump(dataset, f, indent=4) 112 | -------------------------------------------------------------------------------- /src/programmer_mbpp.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import json 4 | from tqdm import tqdm 5 | import copy 6 | import openai 7 | from concurrent.futures import ThreadPoolExecutor 8 | import concurrent.futures 9 | from dotenv import load_dotenv 10 | 11 | load_dotenv 12 | 13 | # Setting API parameters 14 | openai.api_base = "https://api.aiohub.org/v1" 15 | openai.api_key = os.getenv("OPENAI_API_KEY") 16 | 17 | prompt_path = "./prompts/mbpp_prompt_update.txt" 18 | with open(prompt_path, "r") as f: 19 | construct_few_shot_prompt = f.read() 20 | 21 | def preprocess_data(data,lg): 22 | if f"```{lg}" in data["completion"]: 23 | data["completion"] = data["completion"][data["completion"].find(f"```{lg}")+len(f"```{lg}"):] 24 | data["completion"] = data["completion"][:data["completion"].find("```")] 25 | else: 26 | print(data["task_id"]) 27 | return data 28 | 29 | # Function to fetch completion 30 | def fetch_completion(data_entry, model,lg): 31 | global construct_few_shot_prompt 32 | lg = "py" 33 | if "passed" in data_entry.keys() and data_entry["passed"] == True: 34 | return data_entry 35 | prompt = data_entry["prompt"] 36 | test_case = data_entry["test_list"] 37 | code = data_entry["completion"] 38 | tests = "" 39 | for test in test_case: 40 | tests+="\n"+test 41 | text = f""" 42 | construct_few_shot_prompt 43 | 44 | **Task**: 45 | ```python 46 | {prompt} 47 | ``` 48 | Your code should pass these tests: 49 | ```python 50 | {tests} 51 | ``` 52 | """ 53 | try: 54 | completions = openai.ChatCompletion.create( 55 | model = model, 56 | stream=False, 57 | messages=[ 58 | {"role": "system", "content": "You are a code developer."}, 59 | {"role": "user", "content":text}, 60 | ], 61 | request_timeout=100, 62 | ) 63 | data_entry["completion"] = completions.choices[0]["message"]["content"] 64 | data_entry = preprocess_data(data_entry,lg) 65 | return data_entry 66 | except Exception as e: 67 | print(repr(e)) 68 | data_entry["completion"] = "" 69 | return data_entry 70 | 71 | def fix_bug(data_entry, model,lg,preprocess_data = preprocess_data): 72 | if "passed" in data_entry.keys() and data_entry["passed"] == True: 73 | return data_entry 74 | else: 75 | gpt_prompt = ( 76 | "Please re-completion the code to fix the error message. "+ 77 | f"\nHere is the previous version:\n```{lg}\n" + 78 | data_entry['completion'] + f"\n```\nWhen we use this test cases: ```{lg}\n"+data_entry["test_case"]+f"\n``` to evaluate the code. 
It raise the error:\n```{lg}\n" + data_entry["result"] + 79 | f"\n```\nPlease fix the bug and return the code. The re-completion code should in triple backticks format(i.e., in ```{lg} ```)." 80 | ) 81 | try: 82 | completions = openai.ChatCompletion.create( 83 | model = model, 84 | stream=False, 85 | messages=[ 86 | {"role": "system", "content": "You are a code developer assistant."}, 87 | {"role": "user", "content":gpt_prompt}, 88 | ], 89 | request_timeout=100, 90 | ) 91 | data_entry["completion"] = completions.choices[0]["message"]["content"] 92 | data_entry = preprocess_data(data_entry,"py") 93 | except Exception as e: 94 | print(repr(e)) 95 | return data_entry 96 | 97 | def call_fix_bug(dataset, model,lg): 98 | print("Fixing bug...") 99 | with ThreadPoolExecutor() as executor: 100 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 101 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 102 | entry = future_to_entry[future] 103 | try: 104 | updated_entry = future.result() 105 | idx = dataset.index(entry) 106 | dataset[idx] = updated_entry 107 | except Exception as e: 108 | print(repr(e)) 109 | return dataset 110 | 111 | def call_completion(dataset, model,lg): 112 | print("Fixing bug...") 113 | with ThreadPoolExecutor() as executor: 114 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 115 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 116 | entry = future_to_entry[future] 117 | try: 118 | updated_entry = future.result() 119 | idx = dataset.index(entry) 120 | dataset[idx] = updated_entry 121 | except Exception as e: 122 | print(repr(e)) 123 | return dataset 124 | 125 | 126 | 127 | if __name__ == "__main__": 128 | model_list = ["gpt-3.5-turbo-1106"] 129 | language = ["py"] 130 | for model in model_list: 131 | for lg in language: 132 | from datasets import load_dataset 133 | dataset = load_dataset("mbpp",name="sanitized",split="test") 134 | dataset = [entry for entry in dataset] 135 | with open(path, "r") as f: 136 | dataset = json.load(f) 137 | with ThreadPoolExecutor(max_workers=20) as executor: 138 | future_to_entry = {executor.submit(fetch_completion, copy.deepcopy(entry), model, lg): entry for entry in tqdm(dataset)} 139 | for future in tqdm(concurrent.futures.as_completed(future_to_entry)): 140 | entry = future_to_entry[future] 141 | try: 142 | updated_entry = future.result() 143 | idx = dataset.index(entry) 144 | dataset[idx] = updated_entry 145 | except Exception as e: 146 | print(repr(e)) 147 | 148 | with open(f"./dataset/{model}_mbpp.json", "w") as f: 149 | json.dump(dataset, f, indent=4) 150 | -------------------------------------------------------------------------------- /src/test_executor_mbpp.py: -------------------------------------------------------------------------------- 1 | import random 2 | import json 3 | from typing import Optional, Callable, Dict 4 | import ast 5 | import doctest 6 | import io 7 | from concurrent.futures import ThreadPoolExecutor, as_completed 8 | import inspect 9 | import numpy as np 10 | import sys 11 | sys.path.append('./CodeGeeX/') 12 | import contextlib 13 | import faulthandler 14 | import io 15 | import os 16 | import multiprocessing 17 | import platform 18 | import signal 19 | from tqdm import tqdm 20 | from programmer_mbpp import fix_bug,call_fix_bug,call_completion,single_agent_helper 21 | from codegeex.benchmark.utils import read_dataset, IMPORT_HELPER 
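# Note: the codegeex.benchmark helpers imported here (read_dataset, IMPORT_HELPER, check_correctness)
# live in the CodeGeeX submodule declared in .gitmodules; it must be present at ./CodeGeeX
# (hence the sys.path.append earlier in this file) before this script can run.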
22 | from codegeex.benchmark.execution import check_correctness 23 | import tempfile 24 | correct_doctest = 0 25 | correct_before_doctest = 0 26 | correct_after_doctest = 0 27 | result_original = 0 28 | result_canonical_solution = 0 29 | result_fuzzer = 0 30 | result_fuzzer_canonical_solution = 0 31 | idx_run_tests_orginal = [] 32 | idx_run_tests_canonical_solution = [] 33 | idx_run_tests_fuzzer = [] 34 | idx_run_tests_fuzzer_canonical_solution = [] 35 | 36 | language = ["python","cpp","js","go","js"] 37 | 38 | 39 | def process_humaneval_test(sample, problems, example_test=False,language=language, test_case=True,canonical_solution=False): 40 | task_id = sample["task_id"] 41 | task_id = problems.index(sample) 42 | prompt = sample["prompt"] 43 | code = sample["completion"] 44 | if canonical_solution: 45 | code = sample["code"] 46 | # Pre-process for different languages 47 | if language == "python" or language == "py": 48 | if test_case: 49 | tests = sample["test_case"] 50 | else: 51 | test_case = sample["test_list"] 52 | tests = "" 53 | for test in test_case: 54 | tests+="\n"+test 55 | test_string = code + "\n" + tests 56 | return test_string 57 | 58 | 59 | 60 | def preprocess_data(task,lg): 61 | if f"```{lg}" in task["completion"]: 62 | task["completion"] = task["completion"][task["completion"].find(f"```{lg}") +len(f"```{lg}"):] 63 | task["completion"] = task["completion"][:task["completion"].find("```")] 64 | elif "```" in task["completion"]: 65 | task["completion"] = task["completion"][task["completion"].find("```") +3:] 66 | task["completion"] = task["completion"][:task["completion"].find("```")] 67 | 68 | if f"```{lg}" in task["prompt"]: 69 | task["prompt"] = task["prompt"][task["prompt"].find(f"```{lg}") +len(f"```{lg}"):] 70 | task["prompt"] = task["prompt"][:task["prompt"].find("```")] 71 | elif "```" in task["prompt"]: 72 | task["prompt"] = task["prompt"][task["prompt"].find("```") +3:] 73 | task["prompt"] = task["prompt"][:task["prompt"].find("```")] 74 | 75 | if "assert" in task["prompt"]: 76 | task["prompt"] = task["prompt"][:task["prompt"].find("assert")] 77 | return task 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | class TimeoutException(Exception): 86 | pass 87 | class WriteOnlyStringIO(io.StringIO): 88 | """ StringIO that throws an exception when it's read from """ 89 | 90 | def read(self, *args, **kwargs): 91 | raise IOError 92 | 93 | def readline(self, *args, **kwargs): 94 | raise IOError 95 | 96 | def readlines(self, *args, **kwargs): 97 | raise IOError 98 | 99 | def readable(self, *args, **kwargs): 100 | """ Returns True if the IO object can be read. 
""" 101 | return False 102 | class redirect_stdin(contextlib._RedirectStream): # type: ignore 103 | _stream = 'stdin' 104 | 105 | @contextlib.contextmanager 106 | def swallow_io(): 107 | stream = WriteOnlyStringIO() 108 | with contextlib.redirect_stdout(stream): 109 | with contextlib.redirect_stderr(stream): 110 | with redirect_stdin(stream): 111 | yield 112 | 113 | @contextlib.contextmanager 114 | def time_limit(seconds: float): 115 | def signal_handler(signum, frame): 116 | raise TimeoutException("Timed out!") 117 | signal.setitimer(signal.ITIMER_REAL, seconds) 118 | signal.signal(signal.SIGALRM, signal_handler) 119 | try: 120 | yield 121 | finally: 122 | signal.setitimer(signal.ITIMER_REAL, 0) 123 | 124 | # def check_correctness_mbpp(code_string): 125 | 126 | 127 | def test_report(dataset,lg): 128 | correct = 0 129 | for i in tqdm(range(len(dataset))): 130 | dataset[i]["full_code"] = process_humaneval_test(dataset[i], dataset, example_test=False,language=lg,test_case=False) 131 | result = check_correctness(dataset[i]["task_id"],dataset[i],lg,5,"./tmp") 132 | if result["passed"]==True: 133 | correct+=1 134 | dataset[i]["report_passed"] = result["passed"] 135 | dataset[i]["report_result"] = result["result"] 136 | print("==============Start Report Testing==============") 137 | correct_percent = correct/len(dataset)*100 138 | print(f"test_report, {correct_percent:0.2f}") 139 | return dataset 140 | 141 | def test_agent(dataset,lg): 142 | correct = 0 143 | for i in tqdm(range(len(dataset))): 144 | dataset[i]["full_code"] = process_humaneval_test(dataset[i], dataset, example_test=False,language=lg,test_case=False) 145 | result = check_correctness(dataset[i]["task_id"],dataset[i],lg,5,"./tmp") 146 | if result["passed"]==True: 147 | correct+=1 148 | dataset[i]["result"] = result["result"] 149 | dataset[i]["passed"] = result["passed"] 150 | print("============Start Agent Testing=================") 151 | print("test_report",correct) 152 | return dataset 153 | 154 | if __name__ == "__main__": 155 | model_list = ["gpt-3.5-turbo-1106"] 156 | language = ["python"] 157 | 158 | for model_name in model_list: 159 | for lg in language: 160 | path = f"./dataset/zero_shot_{model_name}_mbpp.json" 161 | with open(path, "r") as f: 162 | dataset = json.load(f) 163 | epoch = 5 164 | for current_epoch in range(epoch): 165 | print(lg,current_epoch) 166 | test_report(dataset,lg) 167 | test_agent(dataset,lg) 168 | dataset = call_completion(dataset,model_name,lg) 169 | with open(f"./dataset/zero_shot_{model_name}_{current_epoch}_mbpp.json", "w") as f: 170 | json.dump(dataset, f, indent=4) 171 | with open(f"./dataset/zero_shot_{model_name}_{current_epoch}_mbpp_total.json", "w") as f: 172 | json.dump(dataset, f, indent=4) 173 | 174 | 175 | 176 | -------------------------------------------------------------------------------- /src/test_executor_humaneval.py: -------------------------------------------------------------------------------- 1 | # test 2 | import random 3 | import json 4 | from typing import Optional, Callable, Dict 5 | import ast 6 | import doctest 7 | from concurrent.futures import ThreadPoolExecutor, as_completed 8 | import inspect 9 | import numpy as np 10 | import sys 11 | sys.path.append('./CodeGeeX/') 12 | import contextlib 13 | import faulthandler 14 | import io 15 | import os 16 | import multiprocessing 17 | import platform 18 | import signal 19 | import concurrent.futures 20 | from tqdm import tqdm 21 | from tqdm import tqdm 22 | from programmer_humaneval import call_fetch_completion_helper 23 | 
from test_designer_humaneval import call_fetch_test_completion_helper 24 | from codegeex.benchmark.utils import read_dataset, IMPORT_HELPER 25 | from codegeex.benchmark.execution import check_correctness 26 | import tempfile 27 | correct_doctest = 0 28 | correct_before_doctest = 0 29 | correct_after_doctest = 0 30 | result_original = 0 31 | result_canonical_solution = 0 32 | result_fuzzer = 0 33 | result_fuzzer_canonical_solution = 0 34 | idx_run_tests_orginal = [] 35 | idx_run_tests_canonical_solution = [] 36 | idx_run_tests_fuzzer = [] 37 | idx_run_tests_fuzzer_canonical_solution = [] 38 | 39 | language = ["python","cpp","js","go","js"] 40 | 41 | 42 | class TimeoutException(Exception): 43 | pass 44 | class WriteOnlyStringIO(io.StringIO): 45 | """ StringIO that throws an exception when it's read from """ 46 | 47 | def read(self, *args, **kwargs): 48 | raise IOError 49 | 50 | def readline(self, *args, **kwargs): 51 | raise IOError 52 | 53 | def readlines(self, *args, **kwargs): 54 | raise IOError 55 | 56 | def readable(self, *args, **kwargs): 57 | """ Returns True if the IO object can be read. """ 58 | return False 59 | class redirect_stdin(contextlib._RedirectStream): # type: ignore 60 | _stream = 'stdin' 61 | 62 | @contextlib.contextmanager 63 | def swallow_io(): 64 | stream = WriteOnlyStringIO() 65 | with contextlib.redirect_stdout(stream): 66 | with contextlib.redirect_stderr(stream): 67 | with redirect_stdin(stream): 68 | yield 69 | 70 | @contextlib.contextmanager 71 | def time_limit(seconds: float): 72 | def signal_handler(signum, frame): 73 | raise TimeoutException("Timed out!") 74 | signal.setitimer(signal.ITIMER_REAL, seconds) 75 | signal.signal(signal.SIGALRM, signal_handler) 76 | try: 77 | yield 78 | finally: 79 | signal.setitimer(signal.ITIMER_REAL, 0) 80 | 81 | def process_humaneval_test(sample, problems, example_test=False,language=language, test_case=True): 82 | task_id = sample["task_id"] 83 | task_id = problems.index(sample) 84 | prompt = sample["prompt"] 85 | if example_test and "example_test" in problems[task_id] and problems[task_id]["example_test"] != "": 86 | test = problems[task_id]["example_test"] 87 | else: 88 | test = problems[task_id]["test"] 89 | if test_case: 90 | test = problems[task_id]["test_case"] 91 | code = sample["completion"] 92 | # Pre-process for different languages 93 | if language == "python": 94 | code_ = [] 95 | test_setup = "\n".join(IMPORT_HELPER["python"]) + "\n" 96 | if f"class sample['entry_point']" in code: 97 | test_string = test_setup + code + "\n" + test + "\n" + f"check({sample['entry_point']})" 98 | else: 99 | test_string = test_setup + prompt + code + "\n" + test + "\n" + f"check({sample['entry_point']})" 100 | elif language == "cpp": 101 | test_set_up = "" 102 | for s in IMPORT_HELPER["cpp"]: 103 | if s not in prompt: 104 | test_set_up += s + "\n" 105 | # test_string = test_set_up + "\n" + prompt + code + "\n" + test 106 | test_string = test_set_up + "\n" + code + "\n" + test 107 | elif language == "java": 108 | # if sample["declaration"] in code: 109 | if "class Solution" in code: 110 | test_string = code + "\n" + test 111 | else: 112 | test_string = prompt + code + "\n" + test 113 | # else: 114 | # test_string = prompt + code + "\n" + test 115 | elif language == "js" or language == "javascript": 116 | # test_string = prompt + code + "\n" + test 117 | test_string = code + "\n" + test 118 | elif language == "go": 119 | # import_string = problems[task_id]["import"] 120 | # prompt = prompt.replace(import_string, "") 121 | if 
example_test and "example_test" in problems[task_id]: 122 | test = problems[task_id]["example_test"] 123 | else: 124 | test = problems[task_id]["test"] 125 | candidate_import = ["math.","strings.","strconv.","sort.","time.","regexp.","fmt.","bytes.","md5.","rand."] 126 | test_setup = "package main\nimport (\n \"testing\"\n \"github.com/stretchr/testify/assert\"\n)" 127 | total_string = sample["declaration"] + code + "\n" + test 128 | other_pkgs = [] 129 | for pkg in candidate_import: 130 | if pkg in total_string: 131 | if pkg != "md5." and pkg!="rand": 132 | other_pkgs.append(" " + "\"" + pkg[:len(pkg)-1] + "\"" + "\n") 133 | elif pkg == "md5.": 134 | other_pkgs.append(" " + "\"" + "crypto/md5" + "\"" + "\n") 135 | elif pkg == "rand.": 136 | other_pkgs.append(" " + "\"" + "math/rand" + "\"" + "\n") 137 | if other_pkgs: 138 | import_other_pkgs = "import (\n" + " ".join([p + "\n" for p in other_pkgs]) + ")" 139 | # test_string = test_setup + "\n" + import_other_pkgs + "\n" + prompt + code + "\n" + test 140 | test_string = test_setup + "\n" + import_other_pkgs + "\n" + code + "\n" + test 141 | else: 142 | # test_string = test_setup + "\n" + prompt + code + "\n" + test 143 | test_string = test_setup + "\n" + code + "\n" + test 144 | elif language == "rust": 145 | main = "\nfn main(){ \n } \n" 146 | declaration = problems[task_id]["declaration"] 147 | test_string = main + declaration + prompt + code + test 148 | # print(test_string) 149 | return test_string 150 | 151 | 152 | 153 | def preprocess_data(task,lg): 154 | if f"```{lg}" in task["completion"]: 155 | task["completion"] = task["completion"][task["completion"].find(f"```{lg}") +len(f"```{lg}"):] 156 | task["completion"] = task["completion"][:task["completion"].find("```")] 157 | elif "```" in task["completion"]: 158 | task["completion"] = task["completion"][task["completion"].find("```") +3:] 159 | task["completion"] = task["completion"][:task["completion"].find("```")] 160 | 161 | if f"```{lg}" in task["prompt"]: 162 | task["prompt"] = task["prompt"][task["prompt"].find(f"```{lg}") +len(f"```{lg}"):] 163 | task["prompt"] = task["prompt"][:task["prompt"].find("```")] 164 | elif "```" in task["prompt"]: 165 | task["prompt"] = task["prompt"][task["prompt"].find("```") +3:] 166 | task["prompt"] = task["prompt"][:task["prompt"].find("```")] 167 | 168 | if "assert" in task["prompt"]: 169 | task["prompt"] = task["prompt"][:task["prompt"].find("assert")] 170 | return task 171 | 172 | 173 | def test_report(dataset,lg): 174 | correct = 0 175 | test_setup = "\n".join(IMPORT_HELPER["python"]) + "\n" 176 | for i in tqdm(range(len(dataset))): 177 | try: 178 | with swallow_io(): 179 | with time_limit(2.0): 180 | exec(test_setup + "\n" + dataset[i]["completion"] + "\n" + dataset[i]["test"] + "\n" + f"check({dataset[i]['entry_point']})") 181 | correct+=1 182 | except Exception as exc: 183 | pass 184 | print("==============Start Report Testing==============") 185 | print(f"test_report: {(correct/len(dataset)*100):.1f}") 186 | 187 | 188 | 189 | def test_agent_concurrency(dataset, lg): 190 | test_setup = "\n".join(IMPORT_HELPER["python"]) + "\n" 191 | total_correct = 0 192 | _for_completion = 0 193 | 194 | def process_item(i): 195 | if "need_reproduce" in dataset[i].keys() and dataset[i]["need_reproduce"]==False: 196 | # dataset[i]["need_reproduce"] = True 197 | return dataset[i]["max_correct"], dataset[i]["idx"] 198 | completion_list = dataset[i]["completion_list"] 199 | test_case_list = dataset[i]["test_case_list"] 200 | correct_list = [] 201 | 202 | for 
j in range(len(completion_list)): 203 | correct = 0 204 | if f"def {dataset[i]['entry_point']}" not in completion_list[j]: 205 | correct_list.append(correct) 206 | continue 207 | for k in range(len(test_case_list)): 208 | if f"assert {dataset[i]['entry_point']}(" not in test_case_list[k]: 209 | continue 210 | dataset[i]["full_code"] = test_setup + "\n" + completion_list[j] + "\n" + test_case_list[k] 211 | result = check_correctness(dataset[i]["task_id"], dataset[i], lg, 3, "./tmp") 212 | if result["passed"]: 213 | correct += 1 214 | correct_list.append(correct) 215 | 216 | max_correct = max(correct_list) 217 | idx = correct_list.index(max_correct) 218 | 219 | return max_correct, idx 220 | 221 | with concurrent.futures.ThreadPoolExecutor() as executor: 222 | futures = [executor.submit(process_item, i) for i in range(len(dataset))] 223 | 224 | for future in tqdm(concurrent.futures.as_completed(futures), total=len(dataset)): 225 | max_correct, idx = future.result() 226 | if max_correct >= 3: # GPT-3.5-turbo-1106's test case accuracy is about 67%. So we choice 60% as the bar. 227 | i = futures.index(future) 228 | dataset[i]["completion"] = dataset[i]["completion_list"][idx] 229 | dataset[i]["need_reproduce"] = False 230 | dataset[i]["idx"] = idx 231 | dataset[i]["max_correct"] = max_correct 232 | _for_completion += 1 233 | else: 234 | i = futures.index(future) 235 | dataset[i]["completion"] = dataset[i]["completion_list"][idx] 236 | 237 | 238 | print("==============Start Agent Testing==============") 239 | print(f"test_report: {(total_correct/len(dataset)*100):.1f}") 240 | print(f"test_for_completion: {(_for_completion/len(dataset)*100):.1f}") 241 | return dataset 242 | 243 | 244 | if __name__ == "__main__": 245 | model_list = ["gpt-3.5-turbo-1106"] 246 | language = ["python"] 247 | for model in model_list: 248 | for lg in language: 249 | path = f"./dataset/{model}_{lg}.json" 250 | with open(path, "r") as f: 251 | dataset = json.load(f) 252 | epoch = 5 253 | for current_epoch in range(epoch): 254 | dataset = test_agent_concurrency(dataset,lg) 255 | test_report(dataset,lg) 256 | dataset = call_fetch_completion_helper(dataset,model,lg) 257 | dataset = call_fetch_test_completion_helper(dataset,model,lg) 258 | with open(f"./dataset/{model}_{current_epoch}.json", "w") as f: 259 | json.dump(dataset, f, indent=4) 260 | dataset = test_agent_concurrency(dataset,lg) 261 | test_report(dataset,lg) 262 | --------------------------------------------------------------------------------
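
A minimal, self-contained sketch of the candidate-selection idea behind the HumanEval executor: run every candidate completion against every model-generated test case and keep the candidate that passes the most. The repository's real implementation (`test_agent_concurrency` in `src/test_executor_humaneval.py`) delegates execution to CodeGeeX's `check_correctness` in a separate process; the illustration below simply `exec()`s the code under a signal-based timeout (Unix only), so it is an assumption-laden sketch and not a safe sandbox for untrusted model output.

```python
# Minimal sketch of AgentCoder's selection step: score each candidate completion by the
# number of model-generated test snippets it passes, then keep the highest-scoring one.
# NOT a sandbox: exec() runs the code in-process; the real pipeline uses CodeGeeX's
# check_correctness in a separate process with its own time limit.
import signal
from contextlib import contextmanager

@contextmanager
def time_limit(seconds: float):
    """Raise TimeoutError if the body runs longer than `seconds` (Unix only)."""
    def handler(signum, frame):
        raise TimeoutError("Timed out!")
    signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)

def passes(completion: str, test_case: str, timeout: float = 3.0) -> bool:
    """Return True if `completion` followed by `test_case` executes without raising."""
    try:
        with time_limit(timeout):
            exec(completion + "\n" + test_case, {})
        return True
    except Exception:
        return False

def select_best(completion_list, test_case_list):
    """Return (index, score) of the candidate that passes the most generated tests."""
    scores = [sum(passes(code, test) for test in test_case_list) for code in completion_list]
    best = scores.index(max(scores))
    return best, scores[best]

if __name__ == "__main__":
    candidates = [
        "def add(a, b):\n    return a - b",   # buggy candidate
        "def add(a, b):\n    return a + b",   # correct candidate
    ]
    tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
    idx, score = select_best(candidates, tests)
    print(f"picked candidate {idx} with {score}/{len(tests)} generated tests passed")
```

In the actual pipeline a dataset entry is only frozen (`need_reproduce = False`) once its best candidate passes at least 3 of the generated tests, which is the bar used in `test_agent_concurrency`.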