23 |
24 |
25 |
26 | SWE-smith is a toolkit for training software engineering (SWE) agents. With SWE-smith, you can:
27 | * Create an *unlimited* number of [SWE-bench](https://github.com/SWE-bench/SWE-bench) style task instances for any Python repository.
28 | * *Generate trajectories* of [SWE-agent](https://github.com/SWE-agent/SWE-agent) solving those task instances.
29 | * *Train local LMs* on these trajectories to improve their software engineering capabilities ([SWE-agent-LM-32B](https://huggingface.co/SWE-bench/SWE-agent-LM-32B)).
30 |
31 | ## 🚀 Get Started
32 | Check out the [documentation](https://swesmith.com/getting_started/) for a complete guide on how to use SWE-smith, including how to
33 | * [Install](https://swesmith.com/getting_started/installation/) the repository locally or as a PyPI package.
34 | * [Create Task Instances](https://swesmith.com/guides/create_instances/) for any Python repository with SWE-smith.
35 | * Use your task instances to [train your own SWE-agents](https://swesmith.com/guides/train_swe_agent/).
36 |
37 | ## 🏎️ Quick Start
38 | Install the repo:
39 | ```bash
40 | git clone https://github.com/SWE-bench/SWE-smith
41 | cd SWE-smith
42 | conda create -n smith python=3.10
43 | conda activate smith
44 | pip install -e .
45 | ```
46 |
47 | Then, check out `scripts/cheatsheet.sh` for scripts to (1) create execution environments, (2) create task instances, and (3) train SWE-agents.
48 |
49 | > [!TIP]
50 | > SWE-smith requires Docker to create execution environments. SWE-smith was developed and tested on Ubuntu 22.04.4 LTS.
> We do *not* plan on supporting Windows or macOS.
52 |
53 | ## 💿 Resources
54 | In addition to this toolkit, we've also provided several artifacts on the [SWE-bench HuggingFace](https://huggingface.co/SWE-bench), including:
55 | * [50k Python Task Instances](https://huggingface.co/datasets/SWE-bench/SWE-smith), created using SWE-smith.
56 | * [SWE-agent-LM-32B](https://huggingface.co/SWE-bench/SWE-agent-LM-32B), trained using SWE-smith. Achieves **41.6%** pass@1 on [SWE-bench Verified](https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified)!
57 | * [5k Trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories) that SWE-agent-LM-32B was trained on.
58 |
59 | And there's more coming!
60 |
61 | ## 💫 Contributions
62 | Excited about SWE-smith? We're actively working on several follow-ups, and we'd love meaningful collaborations! What we're thinking about...
63 | * Make SWE-smith work for non-Python languages
64 | * Develop new bug generation techniques
65 | * Train SWE-agents with more trajectories and new methods
66 |
67 | Check out the [Contributing Guide](CONTRIBUTING.md) for more.
68 |
69 | Contact Person: [John Yang](https://john-b-yang.github.io/), [Kilian Lieret](https://lieret.net)
70 | (Email: [johnby@stanford.edu](mailto:johnby@stanford.edu))
71 |
72 | ## 🪪 License
73 | MIT. Check `LICENSE` for more information.
74 |
75 | ## ✍️ Citation
76 |
77 | ```bibtex
78 | @misc{yang2025swesmith,
79 | title={SWE-smith: Scaling Data for Software Engineering Agents},
80 | author={John Yang and Kilian Lieret and Carlos E. Jimenez and Alexander Wettig and Kabir Khandpur and Yanzhe Zhang and Binyuan Hui and Ofir Press and Ludwig Schmidt and Diyi Yang},
81 | year={2025},
82 | eprint={2504.21798},
83 | archivePrefix={arXiv},
84 | primaryClass={cs.SE},
85 | url={https://arxiv.org/abs/2504.21798},
86 | }
87 | ```
88 |
89 | ## 📕 Related Works
90 |
--------------------------------------------------------------------------------
/agent/_gen_trajs.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | sweagent run-batch --num_workers 20 \
4 | --instances.deployment.docker_args=--memory=10g \
5 | --config agent/swesmith_gen_claude.yaml \
6 | --instances.path /home/john-b-yang/swe-smith/logs/experiments/exp8__ig_orig.json \
7 | --output_dir trajectories/john-b-yang/swesmith_gen__claude-3.5__t-0.00_p-1.00__c.2.00__exp8__ig_orig_run2 \
8 | --random_delay_multiplier=1 \
9 | --agent.model.temperature 0.0
10 |
11 | # Remember to set CLAUDE_API_KEY_ROTATION=key1:::key2:::key3
12 |
--------------------------------------------------------------------------------
/agent/_infer_model.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | sweagent run-batch --config agent/swesmith_infer.yaml \
4 | --instances.deployment.docker_args=--memory=10g \
5 | --agent.model.api_base https://svt25nwvnpipwz.r20.modal.host/v1 \
6 | --random_delay_multiplier=1 \
7 | --output_dir trajectories/john-b-yang/swesmith.ablation.bug.lm_reimplement_500
8 |
--------------------------------------------------------------------------------
/agent/_traj_mgr.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | python -m swesmith.train.traj_mgr.clean_trajs trajectories/
4 |
5 | python -m swesmith.train.traj_mgr.combine_trajs
6 |
7 | python -m swesmith.train.traj_mgr.transform_to_ft
--------------------------------------------------------------------------------
/agent/swesmith_gen_claude.yaml:
--------------------------------------------------------------------------------
1 | # Heavily based on https://github.com/SWE-agent/SWE-agent/blob/main/config/anthropic_filemap.yaml
2 | instances:
3 | type: swesmith
4 | shuffle: true
5 | agent:
6 | templates:
7 | system_template: |-
8 | You are a helpful assistant that can interact with a computer to solve tasks.
9 | instance_template: |-
10 | <uploaded_files>
11 | {{working_dir}}
12 | </uploaded_files>
13 | I've uploaded a python code repository in the directory {{working_dir}}. Consider the following PR description:
14 |
15 | <pr_description>
16 | {{problem_statement}}
17 | </pr_description>
18 |
19 | Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met?
20 | I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!
21 | Your task is to make the minimal changes to non-test files in the {{working_dir}} directory to ensure the <pr_description> is satisfied.
22 | Follow these steps to resolve the issue:
23 | 1. As a first step, it might be a good idea to find and read code relevant to the <pr_description>
24 | 2. Create a script to reproduce the error and execute it with `python <filename.py>` using the bash tool, to confirm the error
25 | 3. Edit the source code of the repo to resolve the issue
26 | 4. Rerun your reproduce script and confirm that the error is fixed!
27 | 5. Think about edge cases and make sure your fix handles them as well
28 | Your thinking should be thorough and so it's fine if it's very long.
29 | next_step_template: |-
30 | OBSERVATION:
31 | {{observation}}
32 | next_step_no_output_template: |-
33 | Your command ran successfully and did not produce any output.
34 | tools:
35 | bundles:
36 | - path: tools/registry
37 | - path: tools/edit_anthropic
38 | - path: tools/review_on_submit_m
39 | registry_variables:
40 | USE_FILEMAP: 'true'
41 | SUBMIT_REVIEW_MESSAGES:
42 | - |
43 | Thank you for your work on this issue. Please carefully follow the steps below to help review your changes.
44 |
45 | 1. If you made any changes to your code after running the reproduction script, please run the reproduction script again.
46 | If the reproduction script is failing, please revisit your changes and make sure they are correct.
47 | If you have already removed your reproduction script, please ignore this step.
48 | 2. Remove your reproduction script (if you haven't done so already).
49 | 3. If you have modified any TEST files, please revert them to the state they had before you started fixing the issue.
50 | You can do this with `git checkout -- /path/to/test/file.py`. Use the <diff> output below to find the files you need to revert.
51 | 4. Run the submit command again to confirm.
52 |
53 | Here is a list of all of your changes:
54 |
55 | <diff>
56 | {{diff}}
57 | </diff>
58 | enable_bash_tool: true
59 | parse_function:
60 | type: function_calling
61 | execution_timeout: 300
62 | history_processors:
63 | - type: cache_control
64 | last_n_messages: 2
65 | model:
66 | # name: claude-3-5-sonnet-20241022
67 | name: claude-3-7-sonnet-20250219
68 | max_output_tokens: 64000
69 | api_key: $CLAUDE_API_KEY_ROTATION
70 | per_instance_cost_limit: 2.
71 | per_instance_call_limit: 75
72 | # delay: 1
73 |
--------------------------------------------------------------------------------
/agent/swesmith_gen_gpt.yaml:
--------------------------------------------------------------------------------
1 | # Heavily based on https://github.com/SWE-agent/SWE-agent/blob/main/config/anthropic_filemap.yaml
2 | instances:
3 | type: swesmith
4 | shuffle: true
5 | agent:
6 | templates:
7 | system_template: |-
8 | You are a helpful assistant that can interact with a computer to solve tasks.
9 | instance_template: |-
10 | <uploaded_files>
11 | {{working_dir}}
12 | </uploaded_files>
13 | I've uploaded a python code repository in the directory {{working_dir}}. Consider the following PR description:
14 |
15 | <pr_description>
16 | {{problem_statement}}
17 | </pr_description>
18 |
19 | Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met?
20 | I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!
21 | Your task is to make the minimal changes to non-test files in the {{working_dir}} directory to ensure the <pr_description> is satisfied.
22 | Follow these steps to resolve the issue:
23 | 1. As a first step, it might be a good idea to find and read code relevant to the <pr_description>
24 | 2. Create a script to reproduce the error and execute it with `python <filename.py>` using the bash tool, to confirm the error
25 | 3. Edit the source code of the repo to resolve the issue
26 | 4. Rerun your reproduce script and confirm that the error is fixed!
27 | 5. Think about edge cases and make sure your fix handles them as well
28 | Your thinking should be thorough and so it's fine if it's very long.
29 | next_step_template: |-
30 | OBSERVATION:
31 | {{observation}}
32 | next_step_no_output_template: |-
33 | Your command ran successfully and did not produce any output.
34 | tools:
35 | execution_timeout: 300
36 | bundles:
37 | - path: tools/registry
38 | - path: tools/edit_anthropic
39 | - path: tools/submit
40 | env_variables:
41 | USE_FILEMAP: 'true'
42 | enable_bash_tool: true
43 | parse_function:
44 | type: function_calling
45 | model:
46 | name: gpt-4o-2024-08-06
47 | per_instance_cost_limit: 2.
48 | per_instance_call_limit: 75
49 | # delay: 1
50 |
--------------------------------------------------------------------------------
/agent/swesmith_install_repo.yaml:
--------------------------------------------------------------------------------
1 | agent:
2 | templates:
3 | system_template: |-
4 | You are a helpful assistant that can interact with a computer to solve tasks.
5 | instance_template: |-
6 | <uploaded_files>
7 | {{working_dir}}
8 | </uploaded_files>
9 | I've uploaded a python code repository in the directory {{working_dir}}.
10 |
11 | Can you please help me install this repository?
12 | Your goal should be to configure the repository's development environment such that existing tests pass.
13 | You are currently in the root directory of the repository, and nothing has been installed yet.
14 | You are in an Ubuntu 22.04 environment.
15 |
16 | The repository is predominantly written in Python. Here are several tips for installing it:
17 | 1. A good place to start is to look for a `CONTRIBUTING.[md|rst]` file, which will often contain instructions on how to install the repository and any dependencies it may have. Occasionally, the `README.md` file may also contain installation instructions.
18 | 2. Usually, a repository may have `setup.py` or `pyproject.toml` files which can be used to install the package. `pip install -e .` is commonly used, although many packages will also require an additional specifier that installs development packages as well (e.g. `pip install -e .[dev]`).
19 | 3. To check whether the repository was installed successfully, run tests and see if they pass. You can usually find tests in a `tests/` or `test/` directory. You can run tests using `pytest` or `unittest`, depending on the framework used by the repository.
20 | 4. Sometimes, you will need to install additional packages, often listed in a `requirements.txt` or `environment.yml` file. Also, be mindful of Ubuntu system dependencies that may need to be installed via `apt-get` (e.g. `sudo apt-get install <package>`).
21 |
22 | Once you are finished with installing the repository, run the `submit` command to submit your changes for review.
23 | next_step_template: |-
24 | OBSERVATION:
25 | {{observation}}
26 | next_step_no_output_template: |-
27 | Your command ran successfully and did not produce any output.
28 | tools:
29 | bundles:
30 | - path: tools/registry
31 | - path: tools/edit_anthropic
32 | - path: tools/submit
33 | registry_variables:
34 | USE_FILEMAP: 'true'
35 | enable_bash_tool: true
36 | parse_function:
37 | type: function_calling
38 | execution_timeout: 300
39 | history_processors:
40 | - type: cache_control
41 | last_n_messages: 2
42 | model:
43 | name: claude-3-7-sonnet-20250219
44 | api_key: $CLAUDE_API_KEY_ROTATION
45 | per_instance_cost_limit: 2.
46 | per_instance_call_limit: 150
47 | delay: 1
48 |
--------------------------------------------------------------------------------
/configs/bug_gen/README.md:
--------------------------------------------------------------------------------
1 | # Writing Config Files for Bug Generation
2 |
3 | To create bugs using `swesmith.bug_gen.llm.modify`, the script takes in a configuration file
4 | that allows one to (1) define what kind of bug(s) the LLM should generate, and (2) identify
5 | what functions to run this generation for.
6 |
7 | Here are the steps to create a config file for creating a specific kind of bug.
8 |
9 | 1. Create a `configs/bug_gen/*.yaml` file. Typically, the naming convention is `func_<bug_type>.yaml`.
10 | 2. Within the `.yaml` file, define the following prompts / fields:
11 | ```yaml
12 | name:
13 | criteria: reference to criteria in swesmith/bug_gen/llm/criteria.py
14 | parameters: any additional information you'd like to include + can be referenced in the prompts
15 | system: |-
16 | prompt
17 | demonstration: |-
18 | prompt
19 | instance: |-
20 | prompt
21 | ```
22 | 3. (Optional) You can use one of the existing criteria, or create a new one in `swesmith/bug_gen/llm/criteria.py`.
23 | * The purpose of defining a criterion is to only consider functions where it would be possible to introduce such a bug.
24 | * For example, if you write a prompt for off-by-one bugs, but the function doesn't have loops or list indexing, then it's unlikely the LLM can generate a reasonably effective and difficult bug.
25 |
26 | > A criterion function usually follows the form below:
27 | ```python
28 | def filter_<criteria_name>(code_entity: CodeEntity) -> bool:
29 | """
30 | `code_entity` is an object representing a function. It includes several
31 | pieces of information, most notably:
32 | * `src_code`: The raw string repr. of a function
33 | * `src_node`: An AST node representation of a function.
34 | """
35 | node = code_entity.src_node
36 | # Whether a function has a given property is typically checked by
37 | # inspecting node properties (of course, you're not limited to this)
38 | if satisfies_criteria:
39 | return True
40 | return False
41 | ```
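
For example, a minimal criterion for an off-by-one bug prompt might look like the sketch below. The function name is hypothetical, and it assumes `src_node` is a standard-library `ast` node; following the motivation above, it keeps only functions that contain a loop or index-based access:

```python
import ast

def filter_off_by_one(code_entity) -> bool:
    """Keep only functions where an off-by-one bug is plausible,
    i.e. ones that contain a loop or index-based access."""
    node = code_entity.src_node
    return any(
        isinstance(child, (ast.For, ast.While, ast.Subscript))
        for child in ast.walk(node)
    )
```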
42 |
43 | Once you've created the `.yaml` file with a specified criterion, run the following from this repo:
44 | ```bash
45 | python -m swesmith.bug_gen.llm.modify \
46 | --repo datamade/usaddress \
47 | --model openai/gpt-4o \
48 | --entity_type func \
49 | --prompt_config configs/bug_gen/func_<bug_type>.yml \
50 | --n_workers 4 # 4 parallel queries to LM etc.
51 | ```
52 | where `--repo` should point to one of the repositories [here](https://github.com/orgs/swesmith/repositories). (Note: this should just be `<owner>/<repo>`, without the `.<commit>` suffix.)
53 |
--------------------------------------------------------------------------------
/configs/bug_gen/func_fun.yml:
--------------------------------------------------------------------------------
1 | version: 1
2 | name: func_fun
3 | criteria: all
4 | parameters:
5 | tips:
6 | system: |-
7 | You are a simulation of a tired, deadline-pressured developer who has just worked 14 consecutive hours.
8 |
9 | Your task was to improve the provided code.
10 | Despite your best intentions, your exhausted state causes you to introduce subtle, real-world bugs that would pass code review but cause issues in production.
11 |
12 | Rewrite a function such that it introduces a logical bug that will subtly break existing unit tests in a codebase.
13 |
14 | Here's how to proceed:
15 |
16 | 1. First understand what the code is trying to achieve
17 | 2. Consider how a well-intentioned but fatigued developer might misunderstand it
18 | 3. Implement changes based on that flawed understanding
19 | 4. Ensure the bug represents a genuine cognitive error, not a contrived modification
20 | 5. The code should look like a good-faith attempt at solving the problem
21 | 6. The bug should be something that could genuinely ship to production
22 |
23 | Tips about the bug-introducing task:
24 |
25 | - It should not cause compilation errors.
26 | - It should not be a syntax error.
27 | - It should be subtle and challenging to detect.
28 | - It should not modify the function signature.
29 | - It should not modify the documentation significantly.
30 | - For longer functions, if there is an opportunity to introduce multiple bugs, please do!
31 | - Please DO NOT INCLUDE COMMENTS IN THE CODE indicating the bug location or the bug itself.
32 | - Your code must be included in triple backticks.
33 |
34 | Your answer should be formatted as follows:
35 |
36 | Explanation:
37 |
38 |
39 | Bugged Code:
40 | ```
41 |
42 | ```
43 | demonstration: ""
44 | instance: |-
45 |
46 | {{src_code}}
47 |
48 |
49 | As a reminder, please DO NOT INCLUDE ANY COMMENTS IN THE CODE OR POINT OUT THE BUG IN ANY WAY.
50 |
51 | OUTPUT:
--------------------------------------------------------------------------------
/configs/bug_gen/lm_modify.yml:
--------------------------------------------------------------------------------
1 | version: 2
2 | name: lm_modify
3 | criteria: simple_complexity10
4 | parameters:
5 | bug_examples:
6 | - "Alter calculation order for incorrect results: Rearrange the sequence of operations in a calculation to subtly change the output (e.g., change (a + b) * c to a + (b * c))."
7 | - "Introduce subtle data transformation errors: Modify data processing logic, such as flipping a sign, truncating a value, or applying the wrong transformation function."
8 | - "Change variable assignments to alter computation state: Assign a wrong or outdated value to a variable that affects subsequent logic."
9 | - "Mishandle edge cases for specific inputs: Change handling logic to ignore or improperly handle boundary cases, like an empty array or a null input."
10 | - "Modify logic in conditionals or loops: Adjust conditions or loop boundaries (e.g., replace <= with <) to change the control flow."
11 | - "Introduce off-by-one errors in indices or loop boundaries: Shift an index or iteration boundary by one, such as starting a loop at 1 instead of 0."
12 | - "Adjust default values or constants to affect behavior: Change a hardcoded value or default parameter that alters how the function behaves under normal use."
13 | - "Reorder operations while maintaining syntax: Rearrange steps in a process so the function produces incorrect intermediate results without breaking the code."
14 | - "Swallow exceptions or return defaults silently: Introduce logic that catches an error but doesn't log or handle it properly, leading to silent failures."
15 | tips:
16 | - "It should not cause compilation errors."
17 | - "It should not be a syntax error."
18 | - "It should be subtle and challenging to detect."
19 | - "It should not modify the function signature."
20 | - "It should not modify the documentation significantly."
21 | - "For longer functions, if there is an opportunity to introduce multiple bugs, please do!"
22 | - "Please DO NOT INCLUDE COMMENTS IN THE CODE indicating the bug location or the bug itself."
23 | system: |-
24 | You are a software developer doing chaos monkey testing.
25 | Your job is to rewrite a function such that it introduces a logical bug that will break existing unit test(s) in a codebase.
26 |
27 | To this end, some kinds of bugs you might introduce include:
28 | {% for bug in (bug_examples | shuffle)[:3] %}
29 | - {{ bug -}}
30 | {% endfor %}
31 |
32 | Tips about the bug-introducing task:
33 | {% for tip in tips | shuffle %}
34 | - {{ tip -}}
35 | {% endfor %}
36 |
37 | Your answer should be formatted as follows:
38 |
39 | Explanation:
40 |
41 |
42 | Bugged Code:
43 | ```
44 |
45 | ```
46 | demonstration: ""
47 | instance: |-
48 |
49 | {{src_code}}
50 |
51 |
52 | As a reminder, please DO NOT INCLUDE ANY COMMENTS IN THE CODE OR POINT OUT THE BUG IN ANY WAY.
53 |
54 | OUTPUT:
--------------------------------------------------------------------------------
/configs/bug_gen/lm_rewrite.yml:
--------------------------------------------------------------------------------
1 | name: lm_rewrite
2 | system: |-
3 | You are a software developer and you have been asked to implement a function.
4 |
5 | You will be given the contents of an entire file, with one or more functions defined in it.
6 | Please implement the function(s) that are missing.
7 | Do NOT modify the function signature, including the function name, parameters, return types, or docstring if provided.
8 | Do NOT change any other code in the file.
9 | You should not use any external libraries.
10 | instance: |-
11 | Please implement the function `{func_signature}` in the following code:
12 |
13 | ```
14 | {file_src_code}
15 | ```
16 |
17 | Remember, you should not modify the function signature, including the function name, parameters, return types, or docstring if provided.
18 | Do NOT change any other code in the file.
19 | Format your output as:
20 |
21 |
22 |
23 | ```
24 | {func_to_write}
25 | ```
--------------------------------------------------------------------------------
/configs/install_repo.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | . /opt/miniconda3/bin/activate
4 | conda create -n testbed python=3.10 -yq
5 | conda activate testbed
6 | pip install -e .
7 | pip install pytest
8 |
--------------------------------------------------------------------------------
/configs/issue_gen/ig_tests.yaml:
--------------------------------------------------------------------------------
1 | system: |-
2 | You are a software engineer and you have been asked to write an issue report.
3 |
4 | You will be given the following input:
5 | 1. Test Source Code: The source code for a test in a GitHub repository that is currently failing.
6 | 2. Test Execution Output: The execution output of running the test.
7 |
8 | Given this input, please write a GitHub issue report.
9 |
10 | Guidelines:
11 | - Use a natural tone, as if reported by a developer.
12 | - DO NOT mention the test that failed.
13 | - Include information about how to reproduce the issue. You can use the test source code to write reproduction code. Use the test execution output to convey the expected behavior and what the actual current behavior is.
14 | demonstration: |-
15 | Here is an example of a well written GitHub issue. Mimic the style and information of this issue in your response.
16 | -----------------------------------
17 | {demo}
18 | instance: |-
19 | Now, write a GitHub issue that conveys the problem reflected in the failing test.
20 |
21 | Remember,
22 | - DO NOT GIVE AWAY THE TEST THAT FAILED.
23 | - DO NOT SAY THAT EXISTING TEST(s) FAILED.
24 | - DO NOT SUGGEST RUNNING ANY TESTING COMMANDS (e.g., pytest).
25 | - Mimic the style and information of the issue text from the demonstration.
26 | - Keep the length of the issue text reasonable and similar to the demonstration.
27 | - Use the test source code to write reproduction code.
28 | - Use the test execution output to convey the expected behavior and what the actual current behavior is.
29 |
30 | {input}
31 |
32 | **Issue Text**
33 |
--------------------------------------------------------------------------------
/configs/issue_gen/ig_v1.yaml:
--------------------------------------------------------------------------------
1 | settings:
2 | n_instructions: 1 # number of instructions to generate
3 | repro_code_n: 1 # number of repo tests to include in prompt
4 | repro_code_rate: 0 # % of task instances to generate repro code for
5 | add_test_output: True # whether to include test output (from validation step)
6 | system: |-
7 | **Task:**
8 | Write a realistic GitHub issue for the following **patch (diff output)** that introduces a bug. The issue should:
9 | - Clearly describe the problem observed in the original (buggy) code.
10 | - Include relevant details like which function or part of the code is affected.
11 | - Explain expected vs. actual behavior.
12 | - Suggest possible causes without explicitly stating the correct fix.
13 | - Use a natural tone, as if reported by a developer.
14 |
15 | Additional Context:
16 | - The diff shows changes to a file, where - lines represent the original (working) code that was removed.
17 | - + lines represent the new (fixed) code that was added.
18 | - The bug existed in the removed (-) lines, and the fix is in the added (+) lines.
19 | - Focus on describing the issue in the removed lines, not explaining the new fix verbatim.
20 | demonstration: |-
21 | Here is an example of a well formed GitHub issue:
22 |
23 | **Issue Text**
24 | {{problem_statement}}
25 | instance: |-
26 | Now, write a GitHub issue for the following patch (diff output).
27 |
28 | Remember to:
29 | - Clearly describe the problem observed in the original (buggy) code.
30 | - Include some relevant details like which function or part of the code is affected, BUT don't be too specific.
31 | - DO NOT GIVE AWAY THE FIX! THE SOLUTION CODE SHOULD NEVER APPEAR IN YOUR RESPONSE.
32 | - DO NOT SAY THAT EXISTING TEST(s) FAILED.
33 | - DO NOT SUGGEST RUNNING ANY TESTING COMMANDS (e.g., pytest).
34 | - Mimic the style of the issue text from the demonstration.
35 | - Keep the length of the issue text reasonable and similar to the demonstration.
36 |
37 | **Bug Patch (Diff Output):**
38 | {{patch}}
39 |
40 | **Issue Text**
41 |
--------------------------------------------------------------------------------
/configs/issue_gen/ig_v2.yaml:
--------------------------------------------------------------------------------
1 | settings: {}
2 | system: |-
3 | You are a software engineer helping to create a realistic dataset of synthetic GitHub issues.
4 |
5 | You will be given the following input:
6 |
7 | 1. Demonstration: A realistic GitHub issue to mimic (included in its corresponding tag).
8 | 2. Patch: A git diff output/pull request changes that introduces a bug (included in its corresponding tag).
9 | 3. Test output: The output of running the tests after the patch is applied (included in its corresponding tag).
10 | 4. Test source code: Source code for one or more tests that failed (included in its corresponding tag).
11 |
12 | Output: A realistic GitHub issue for the patch.
13 |
14 | Guidelines:
15 |
16 | - Mimic the style and structure of the demonstration issues.
17 | If the demonstration issues are not well structured, your output should also be not well structured.
18 | If the demonstrations use improper or no markdown, your output should also use improper or no markdown.
19 | If the demonstrations are short/long, your output should also be short/long (if possible).
20 | If the demonstrations include human "flavor text" or "fluff", your output should also include human "flavor text" or "fluff".
21 | Do this even if it conflicts with your default behavior of trying to be extremely concise and helpful.
22 | - DO NOT explain the fix/what caused the bug itself, focus on how to reproduce the issue it introduces
23 | - Do not mention pytest or what exact test failed. Instead, generate a realistic issue.
24 | - If possible, include information about how to reproduce the issue. An ideal reproduction script should raise an error
25 | or print an unexpected output together with the expected output.
26 | However, still include this information in a style very similar to the demonstration issues.
27 | demonstration: |-
28 | Here are a few realistic GitHub issues that you can mimic.
29 |
30 | {% for problem_statement in demo_problem_statements[:2] %}
31 |
32 | {{problem_statement}}
33 |
34 | {% endfor %}
35 | instance: |-
36 | Now, write a GitHub issue for the following patch (diff output).
37 |
38 |
39 | - DO NOT GIVE AWAY THE FIX! THE SOLUTION CODE SHOULD NEVER APPEAR IN YOUR RESPONSE.
40 | - DO NOT SAY THAT EXISTING TEST(s) FAILED.
41 | - DO NOT SUGGEST RUNNING ANY TESTING COMMANDS (e.g., pytest).
42 | - Mimic the style and information of the issue text from the demonstration.
43 | - Keep the length of the issue text reasonable and similar to the demonstration.
44 |
45 |
46 |
47 | {{patch}}
48 |
49 |
50 |
51 | {{test_output}}
52 |
53 |
54 |
55 | {% for test in test_funcs[:5] %}
56 | {{test}}
57 | {% endfor %}
58 |
59 |
60 | **Issue Text**
61 |
--------------------------------------------------------------------------------
/configs/train/dpo_qwen_32b.yml:
--------------------------------------------------------------------------------
1 | exp_name: qwen2p5-coder-32b-dpo-lr1e-5-warmup5___ft_xml_all_250413
2 | output_dir: /llm-weights/final/${exp_name}
3 |
4 | # Model Arguments
5 | model:
6 | _component_: torchtune.models.qwen2_5.qwen2_5_32b_instruct
7 |
8 | tokenizer:
9 | _component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
10 | path: /llm-weights/Qwen/Qwen2.5-Coder-32B-Instruct/vocab.json
11 | merges_file: /llm-weights/Qwen/Qwen2.5-Coder-32B-Instruct/merges.txt
12 | max_seq_len: 32768
13 |
14 | checkpointer:
15 | _component_: torchtune.training.FullModelHFCheckpointer
16 | checkpoint_dir: /llm-weights/Qwen/Qwen2.5-Coder-32B-Instruct
17 | checkpoint_files: [
18 | model-00001-of-00014.safetensors,
19 | model-00002-of-00014.safetensors,
20 | model-00003-of-00014.safetensors,
21 | model-00004-of-00014.safetensors,
22 | model-00005-of-00014.safetensors,
23 | model-00006-of-00014.safetensors,
24 | model-00007-of-00014.safetensors,
25 | model-00008-of-00014.safetensors,
26 | model-00009-of-00014.safetensors,
27 | model-00010-of-00014.safetensors,
28 | model-00011-of-00014.safetensors,
29 | model-00012-of-00014.safetensors,
30 | model-00013-of-00014.safetensors,
31 | model-00014-of-00014.safetensors,
32 | ]
33 | recipe_checkpoint: null
34 | output_dir: ${output_dir}
35 | model_type: QWEN2
36 | safe_serialization: True
37 | resume_from_checkpoint: False
38 |
39 | # Dataset and Sampler
40 | dataset:
41 | _component_: torchtune.datasets.preference_dataset
42 | source: json
43 | data_files: /datasets/trajectories_dpo/dpo_250413.json
44 | conversation_column: messages
45 | conversation_style: openai
46 | new_system_prompt: null
47 | packed: False # True increases speed
48 | column_map:
49 | chosen: chosen_conversations
50 | rejected: rejected_conversations
51 | train_on_input: False
52 | split: train
53 | seed: 42
54 | shuffle: True
55 | batch_size: 1
56 |
57 | # Optimizer and Scheduler
58 | optimizer:
59 | _component_: torch.optim.AdamW
60 | fused: True
61 | weight_decay: 0.01
62 | lr: 1e-5
63 | lr_scheduler:
64 | _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
65 | num_warmup_steps: 5
66 | optimizer_in_bwd: True
67 | loss:
68 | _component_: torchtune.rlhf.loss.DPOLoss
69 | beta: 0.05
70 | label_smoothing: 0
71 |
72 | # Training
73 | epochs: 3
74 | max_steps_per_epoch: null
75 | gradient_accumulation_steps: 1 # Use to increase virtual batch size
76 | compile: False # pytorch compile, set to true for better perf/memory
77 |
78 | # Logging
79 | metric_logger:
80 | _component_: torchtune.training.metric_logging.WandBLogger
81 | project: devrl-sft
82 | group: ${exp_name}
83 | job_type: full_dpo_distributed
84 | log_every_n_steps: 1
85 | log_peak_memory_stats: True
86 |
87 | # Environment
88 | device: cuda
89 | dtype: bf16
90 | enable_activation_checkpointing: True # True reduces memory
91 | enable_activation_offloading: False # True reduces memory
92 |
93 | # Showcase the usage of the pytorch profiler
94 | # Set enabled to False as it's only needed for debugging training
95 | profiler:
96 | _component_: torchtune.training.setup_torch_profiler
97 |
98 | enabled: False
99 |
100 | #Output directory of trace artifacts
101 | output_dir: ${output_dir}/profiling_outputs
102 |
103 | #`torch.profiler.ProfilerActivity` types to trace
104 | cpu: True
105 | cuda: True
106 |
107 | #trace options passed to `torch.profiler.profile`
108 | profile_memory: False
109 | with_stack: False
110 | record_shapes: True
111 | with_flops: False
112 |
113 | # `torch.profiler.schedule` options:
114 | # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
115 | wait_steps: 5
116 | warmup_steps: 5
117 | active_steps: 2
118 | num_cycles: 1
--------------------------------------------------------------------------------
/configs/train/dpo_qwen_7b.yml:
--------------------------------------------------------------------------------
1 | exp_name: qwen2p5-coder-7b-dpo-lr1e-5-warmup5___ft_xml_all_250414
2 | output_dir: /llm-weights/dpo/${exp_name}
3 |
4 | # Model Arguments
5 | model:
6 | _component_: torchtune.models.qwen2_5.qwen2_5_7b_instruct
7 |
8 | tokenizer:
9 | _component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
10 | path: /llm-weights/Qwen/Qwen2.5-Coder-7B-Instruct/vocab.json
11 | merges_file: /llm-weights/Qwen/Qwen2.5-Coder-7B-Instruct/merges.txt
12 | max_seq_len: 32768
13 |
14 | checkpointer:
15 | _component_: torchtune.training.FullModelHFCheckpointer
16 | checkpoint_dir: /llm-weights/outputs/qwen2p5-coder-7b-full-lr1e-4-warmup5___all_250331.jsonl/epoch_4
17 | checkpoint_files: [
18 | ft-model-00001-of-00004.safetensors,
19 | ft-model-00002-of-00004.safetensors,
20 | ft-model-00003-of-00004.safetensors,
21 | ft-model-00004-of-00004.safetensors,
22 | ]
23 | recipe_checkpoint: null
24 | output_dir: ${output_dir}
25 | model_type: QWEN2
26 | safe_serialization: True
27 | resume_from_checkpoint: False
28 |
29 | # The ref_checkpointer should always point to the original weights.
30 | ref_checkpointer:
31 | _component_: torchtune.training.FullModelHFCheckpointer
32 | checkpoint_dir: /llm-weights/Qwen/Qwen2.5-Coder-7B-Instruct
33 | checkpoint_files: [
34 | model-00001-of-00004.safetensors,
35 | model-00002-of-00004.safetensors,
36 | model-00003-of-00004.safetensors,
37 | model-00004-of-00004.safetensors,
38 | ]
39 | recipe_checkpoint: null
40 | output_dir: ${output_dir}
41 | model_type: QWEN2
42 | safe_serialization: True
43 |
44 | # Dataset and Sampler
45 | dataset:
46 | _component_: torchtune.datasets.preference_dataset
47 | source: json
48 | data_files: /datasets/trajectories_dpo/swesmith_dpo_250414.json
49 | column_map:
50 | chosen: chosen_conversations
51 | rejected: rejected_conversations
52 | train_on_input: False
53 | seed: 42
54 | shuffle: True
55 | batch_size: 1
56 |
57 | # Optimizer and Scheduler
58 | optimizer:
59 | _component_: torch.optim.AdamW
60 | fused: True
61 | weight_decay: 0.05
62 | lr: 2e-5
63 | lr_scheduler:
64 | _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
65 | num_warmup_steps: 5
66 | optimizer_in_bwd: False
67 | loss:
68 | _component_: torchtune.rlhf.loss.DPOLoss
69 | beta: 0.05
70 | label_smoothing: 0
71 |
72 | # Training
73 | epochs: 2
74 | max_steps_per_epoch: null
75 | gradient_accumulation_steps: 4 # Use to increase effective batch size
76 | compile: False # torch.compile the model + loss, True increases speed + decreases memory
77 |
78 | # Logging
79 | metric_logger:
80 | _component_: torchtune.training.metric_logging.WandBLogger
81 | project: devrl-sft
82 | group: ${exp_name}
83 | job_type: full_dpo_distributed
84 | log_every_n_steps: 1
85 | log_peak_memory_stats: True
86 |
87 | # Environment
88 | device: cuda
89 | dtype: bf16
90 | enable_activation_checkpointing: True # True reduces memory
91 | enable_activation_offloading: False # True reduces memory
92 |
93 | # Showcase the usage of the pytorch profiler
94 | # Set enabled to False as it's only needed for debugging training
95 | profiler:
96 | _component_: torchtune.training.setup_torch_profiler
97 |
98 | enabled: False
99 |
100 | #Output directory of trace artifacts
101 | output_dir: ${output_dir}/profiling_outputs
102 |
103 | #`torch.profiler.ProfilerActivity` types to trace
104 | cpu: True
105 | cuda: True
106 |
107 | #trace options passed to `torch.profiler.profile`
108 | profile_memory: False
109 | with_stack: False
110 | record_shapes: True
111 | with_flops: False
112 |
113 | # `torch.profiler.schedule` options:
114 | # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
115 | wait_steps: 5
116 | warmup_steps: 5
117 | active_steps: 2
118 | num_cycles: 1
--------------------------------------------------------------------------------
/configs/train/full_ft_qwen_32b.yml:
--------------------------------------------------------------------------------
1 | # Config for multi-device full finetuning in full_finetune_distributed.py
2 | # using a Qwen2.5 Coder 32B model
3 | #
4 | # This config assumes that you've run the following command before launching
5 | # this run:
6 | # tune download Qwen/Qwen2.5-Coder-32B-Instruct --output-dir /llm-weights/Qwen/Qwen2.5-Coder-32B-Instruct
7 | #
8 | # To launch on 2 devices, run the following command from root:
9 | # tune run --nnodes 1 --nproc_per_node 2 full_finetune_distributed --config configs/train/full_ft_qwen_32b.yml
10 | #
11 | # You can add specific overrides through the command line. For example
12 | # to override the checkpointer directory while launching training
13 | # you can run:
14 | # tune run --nnodes 1 --nproc_per_node 2 full_finetune_distributed --config configs/train/full_ft_qwen_32b.yml checkpointer.checkpoint_dir=
15 | #
16 | # This config works best when the model is being fine-tuned on 2+ GPUs.
17 | # Single device full finetuning requires more memory optimizations. It's
18 | # best to use a single-device config for those cases.
19 |
20 | exp_name: qwen2p5-coder-32b-full-lr5e-5-warmup5___ft_xml_all_250413
21 | output_dir: /llm-weights/final/${exp_name}
22 | # Model Arguments
23 | model:
24 | _component_: torchtune.models.qwen2_5.qwen2_5_32b_instruct
25 |
26 | tokenizer:
27 | _component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
28 | path: /llm-weights/Qwen/Qwen2.5-Coder-32B-Instruct/vocab.json
29 | merges_file: /llm-weights/Qwen/Qwen2.5-Coder-32B-Instruct/merges.txt
30 | max_seq_len: 32768
31 |
32 | checkpointer:
33 | _component_: torchtune.training.FullModelHFCheckpointer
34 | checkpoint_dir: /llm-weights/Qwen/Qwen2.5-Coder-32B-Instruct
35 | checkpoint_files: [
36 | model-00001-of-00014.safetensors,
37 | model-00002-of-00014.safetensors,
38 | model-00003-of-00014.safetensors,
39 | model-00004-of-00014.safetensors,
40 | model-00005-of-00014.safetensors,
41 | model-00006-of-00014.safetensors,
42 | model-00007-of-00014.safetensors,
43 | model-00008-of-00014.safetensors,
44 | model-00009-of-00014.safetensors,
45 | model-00010-of-00014.safetensors,
46 | model-00011-of-00014.safetensors,
47 | model-00012-of-00014.safetensors,
48 | model-00013-of-00014.safetensors,
49 | model-00014-of-00014.safetensors,
50 | ]
51 | recipe_checkpoint: null
52 | output_dir: ${output_dir}
53 | model_type: QWEN2
54 | safe_serialization: True
55 | resume_from_checkpoint: False
56 |
57 | # Dataset and Sampler
58 | dataset:
59 | _component_: torchtune.datasets.chat_dataset
60 | source: json
61 | data_files: /datasets/trajectories_sft/ft_xml_all_250413.jsonl
62 | split: train
63 | conversation_column: messages
64 | conversation_style: openai
65 | train_on_input: False
66 | new_system_prompt: null
67 | packed: False # True increases speed
68 | seed: 42
69 | shuffle: True
70 | batch_size: 1
71 |
72 | # Optimizer and Scheduler
73 | optimizer:
74 | _component_: torch.optim.AdamW
75 | fused: True
76 | weight_decay: 0.01
77 | lr: 5e-5
78 | lr_scheduler:
79 | _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
80 | num_warmup_steps: 5
81 | optimizer_in_bwd: True
82 | loss:
83 | _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
84 |
85 | # Training
86 | epochs: 3
87 | max_steps_per_epoch: null
88 | gradient_accumulation_steps: 1 # Use to increase virtual batch size
89 | compile: True # pytorch compile, set to true for better perf/memory
90 |
91 | # Logging
92 | metric_logger:
93 | _component_: torchtune.training.metric_logging.WandBLogger
94 | project: devrl-sft
95 | group: ${exp_name}
96 | job_type: full_finetune_distributed
97 | log_every_n_steps: 1
98 | log_peak_memory_stats: True
99 |
100 | # Environment
101 | device: cuda
102 | dtype: bf16
103 | enable_activation_checkpointing: True # True reduces memory
104 | enable_activation_offloading: False # True reduces memory
105 | # custom_sharded_layers: ['tok_embeddings'] # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.
106 |
107 | # Showcase the usage of the pytorch profiler
108 | # Set enabled to False as it's only needed for debugging training
109 | profiler:
110 | _component_: torchtune.training.setup_torch_profiler
111 |
112 | enabled: False
113 |
114 | #Output directory of trace artifacts
115 | output_dir: ${output_dir}/profiling_outputs
116 |
117 | #`torch.profiler.ProfilerActivity` types to trace
118 | cpu: True
119 | cuda: True
120 |
121 | #trace options passed to `torch.profiler.profile`
122 | profile_memory: False
123 | with_stack: False
124 | record_shapes: True
125 | with_flops: False
126 |
127 | # `torch.profiler.schedule` options:
128 | # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
129 | wait_steps: 5
130 | warmup_steps: 5
131 | active_steps: 2
132 | num_cycles: 1
--------------------------------------------------------------------------------
/configs/train/full_ft_qwen_7b.yml:
--------------------------------------------------------------------------------
1 | exp_name: qwen2p5-coder-7b-full-lr5e-5-warmup5___ft_xml_all_250331
2 | output_dir: /llm-weights/outputs/${exp_name}
3 |
4 | # Model Arguments
5 | model:
6 | _component_: torchtune.models.qwen2_5.qwen2_5_7b_instruct
7 |
8 | tokenizer:
9 | _component_: torchtune.models.qwen2_5.qwen2_5_tokenizer
10 | path: /llm-weights/Qwen/Qwen2.5-Coder-7B-Instruct/vocab.json
11 | merges_file: /llm-weights/Qwen/Qwen2.5-Coder-7B-Instruct/merges.txt
12 | max_seq_len: 32768
13 |
14 | checkpointer:
15 | _component_: torchtune.training.FullModelHFCheckpointer
16 | checkpoint_dir: /llm-weights/Qwen/Qwen2.5-Coder-7B-Instruct
17 | checkpoint_files: [
18 | model-00001-of-00004.safetensors,
19 | model-00002-of-00004.safetensors,
20 | model-00003-of-00004.safetensors,
21 | model-00004-of-00004.safetensors,
22 | ]
23 | recipe_checkpoint: null
24 | output_dir: ${output_dir}
25 | model_type: QWEN2
26 | safe_serialization: True
27 | resume_from_checkpoint: False
28 |
29 | # Dataset and Sampler
30 | dataset:
31 | _component_: torchtune.datasets.chat_dataset
32 | source: json
33 | data_files: /datasets/trajectories_sft/ft_xml_all_250331.jsonl
34 | split: train
35 | conversation_column: messages
36 | conversation_style: openai
37 | train_on_input: False
38 | new_system_prompt: null
39 | packed: False # True increases speed
40 | seed: 42
41 | shuffle: True
42 | batch_size: 1
43 |
44 | # Optimizer and Scheduler
45 | optimizer:
46 | _component_: torch.optim.AdamW
47 | fused: True
48 | weight_decay: 0.01
49 | lr: 5e-5
50 | lr_scheduler:
51 | _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
52 | num_warmup_steps: 5
53 | optimizer_in_bwd: False
54 | loss:
55 | _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
56 |
57 | # Training
58 | epochs: 3
59 | max_steps_per_epoch: null
60 | gradient_accumulation_steps: 4 # Use to increase virtual batch size
61 | compile: True # pytorch compile, set to true for better perf/memory
62 |
63 | # Logging
64 | metric_logger:
65 | _component_: torchtune.training.metric_logging.WandBLogger
66 | project: devrl-sft
67 | group: ${exp_name}
68 | job_type: full_finetune_distributed
69 | log_every_n_steps: 1
70 | log_peak_memory_stats: True
71 |
72 | # Environment
73 | device: cuda
74 | dtype: bf16
75 | enable_activation_checkpointing: True # True reduces memory
76 | enable_activation_offloading: False # True reduces memory
77 | # custom_sharded_layers: ['tok_embeddings'] # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.
78 |
79 | # Showcase the usage of the pytorch profiler
80 | # Set enabled to False as it's only needed for debugging training
81 | profiler:
82 | _component_: torchtune.training.setup_torch_profiler
83 |
84 | enabled: False
85 |
86 | #Output directory of trace artifacts
87 | output_dir: ${output_dir}/profiling_outputs
88 |
89 | #`torch.profiler.ProfilerActivity` types to trace
90 | cpu: True
91 | cuda: True
92 |
93 | #trace options passed to `torch.profiler.profile`
94 | profile_memory: False
95 | with_stack: False
96 | record_shapes: True
97 | with_flops: False
98 |
99 | # `torch.profiler.schedule` options:
100 | # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
101 | wait_steps: 5
102 | warmup_steps: 5
103 | active_steps: 2
104 | num_cycles: 1
--------------------------------------------------------------------------------
/docs/CNAME:
--------------------------------------------------------------------------------
1 | swesmith.com
2 |
--------------------------------------------------------------------------------
/docs/assets/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SWE-bench/SWE-smith/cfb6cf29568d0841ce15620c96ec795243b229fa/docs/assets/banner.png
--------------------------------------------------------------------------------
/docs/assets/bug_gen_overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SWE-bench/SWE-smith/cfb6cf29568d0841ce15620c96ec795243b229fa/docs/assets/bug_gen_overview.png
--------------------------------------------------------------------------------
/docs/assets/combine.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SWE-bench/SWE-smith/cfb6cf29568d0841ce15620c96ec795243b229fa/docs/assets/combine.png
--------------------------------------------------------------------------------
/docs/assets/home/collection.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SWE-bench/SWE-smith/cfb6cf29568d0841ce15620c96ec795243b229fa/docs/assets/home/collection.png
--------------------------------------------------------------------------------
/docs/assets/home/leaderboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SWE-bench/SWE-smith/cfb6cf29568d0841ce15620c96ec795243b229fa/docs/assets/home/leaderboard.png
--------------------------------------------------------------------------------
/docs/assets/home/swesmith.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SWE-bench/SWE-smith/cfb6cf29568d0841ce15620c96ec795243b229fa/docs/assets/home/swesmith.png
--------------------------------------------------------------------------------
/docs/assets/lm_generate.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SWE-bench/SWE-smith/cfb6cf29568d0841ce15620c96ec795243b229fa/docs/assets/lm_generate.png
--------------------------------------------------------------------------------
/docs/assets/overview-light.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SWE-bench/SWE-smith/cfb6cf29568d0841ce15620c96ec795243b229fa/docs/assets/overview-light.png
--------------------------------------------------------------------------------
/docs/assets/overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SWE-bench/SWE-smith/cfb6cf29568d0841ce15620c96ec795243b229fa/docs/assets/overview.png
--------------------------------------------------------------------------------
/docs/assets/paper.pdf.html:
--------------------------------------------------------------------------------
1 | Redirecting...
2 | If you are not redirected automatically, follow this link.
--------------------------------------------------------------------------------
/docs/getting_started/index.md:
--------------------------------------------------------------------------------
6 |
7 | SWE-smith is a toolkit for training software engineering (SWE) agents. With SWE-smith, you can:
8 |
9 | * Create an *unlimited* number of [SWE-bench](https://github.com/SWE-bench/SWE-bench) style task instances for any Python repository.
10 | * *Generate trajectories* of [SWE-agent](https://github.com/SWE-agent/SWE-agent) solving those task instances.
11 | * *Train local LMs* on these trajectories to improve their software engineering capabilities ([SWE-agent-LM-32B](https://huggingface.co/SWE-bench/SWE-agent-LM-32B)).
12 |
14 | Check out the [installation](installation.md) guide to get started, then head over to the [tutorials](../guides/index.md) to learn
15 | how to use SWE-smith.
16 |
17 | If you use SWE-smith in your work, we'd greatly appreciate a citation:
18 |
19 | ```bibtex
20 | @misc{yang2025swesmith,
21 | title={SWE-smith: Scaling Data for Software Engineering Agents},
22 | author={John Yang and Kilian Lieret and Carlos E. Jimenez and Alexander Wettig and Kabir Khandpur and Yanzhe Zhang and Binyuan Hui and Ofir Press and Ludwig Schmidt and Diyi Yang},
23 | year={2025},
24 | eprint={2504.21798},
25 | archivePrefix={arXiv},
26 | primaryClass={cs.SE},
27 | url={https://arxiv.org/abs/2504.21798},
28 | }
29 | ```
30 |
--------------------------------------------------------------------------------
/docs/getting_started/installation.md:
--------------------------------------------------------------------------------
1 | # Installation
2 |
3 | For the latest stable release:
4 |
5 | ```bash
6 | pip install swesmith
7 | ```
8 |
9 | For the latest development version:
10 |
11 | ```bash
12 | git clone https://github.com/SWE-bench/SWE-smith
13 | cd SWE-smith
14 | conda create -n swesmith python=3.10; conda activate swesmith
15 | pip install -e .
16 | ```
17 |
18 | If you plan to contribute to SWE-smith, please also perform:
19 |
20 | ```bash
21 | pre-commit install
22 | ```
--------------------------------------------------------------------------------
/docs/getting_started/quickstart.md:
--------------------------------------------------------------------------------
1 | We recommend checking out the [tutorials](../guides/index.md) for comprehensive guidance on SWE-smith usage.
2 |
3 | However, if you learn more easily by playing with the code, here are sequences of scripts corresponding to different SWE-smith workflows.
4 | If you run into issues, please consult the [tutorials](../guides/index.md) first, then open an [issue](https://github.com/SWE-bench/SWE-smith/issues/new/choose) if you can't find a solution.
5 |
6 | ### Creating Task Instances
7 | ```bash
8 | # Run LM rewrite strategy to produce bugs
9 | python -m swesmith.bug_gen.llm.modify pandas-dev__pandas.95280573 \
10 | --config_file configs/bug_gen/lm_modify.yml \
11 | --model claude-3-7-sonnet-20250219 \
12 | --n_bugs 1 \
13 | --n_workers=20
14 |
15 | # Collect all task instances into a single file for validation
16 | python -m swesmith.bug_gen.collect_patches logs/bug_gen/pandas-dev__pandas.95280573/
17 |
18 | # Run validation on the collected task instances
19 | python -m swesmith.harness.valid logs/bug_gen/pandas-dev__pandas.95280573_all_patches.json \
20 | --run_id pandas_test \
21 | --max_workers=8
22 |
23 | # Gather valid task instances
24 | python -m swesmith.harness.gather logs/run_validation/pandas_test
25 |
26 | # Generate issues for the valid task instances
27 | python -m swesmith.issue_gen.generate \
28 | --dataset_path logs/run_validation/basic/pandas_test.json \
29 | --model claude-3-7-sonnet-20250219 \
30 | --n_workers=1 \
31 | --config_file configs/issue_gen/ig_v2.yaml \
32 | --experiment_id ig_v2
33 | ```
34 |
35 | !!! tip "Next steps"
36 |
37 | We provide [detailed tutorials](../guides/index.md) on each of these steps.
--------------------------------------------------------------------------------
/docs/guides/difficulty_rating.md:
--------------------------------------------------------------------------------
1 | To see how SWE-smith compares against real-world tasks (e.g. SWE-bench), we LoRA fine-tuned a [Qwen 2.5 32B Coder Instruct](https://github.com/QwenLM/Qwen2.5-Coder) model on 1.5k human ratings of the difficulty of real-world bugs.
2 |
3 | Given the issue text and patch associated with a task instance, the model will rate the task as "easy" (< 15 min), "medium" (15 min - 1 hour), or "hard" (1+ hours).
4 |
5 | ## Inference
6 |
7 | You can rate the difficulty of your own task instances by following these steps:
8 |
9 | 1. Download the [HuggingFace checkpoint]().
10 |
11 | 2. Use `sglang` to serve the checkpoint. The training scripts available in the SWE-smith repository use [Modal](https://modal.com/) as a compute service for hosting inference.
12 |
13 | ```bash
14 | N_HOURS=4 N_GPUS=4 modal run --detach swesmith/train/serve_sglang.py \
15 | --model-path /path/to/checkpoint \
16 | --served-model-name gpt-4o \
17 | --tokenizer-path /path/to/Qwen2.5-Coder-32B-Instruct
18 | ```
19 |
20 | 3. Run the following script:
21 |
22 | ```bash
23 | python swesmith/train/difficulty_rater/get_difficulties.py \
24 | --base_url \
25 | --dataset_path path/to/dataset.json
26 | ```
27 |
28 | The script will generate a `.json` file containing a mapping from each task instance to a difficulty score.
29 | You can then compute the dataset's difficulty score as the average of all task instance scores.
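
For instance, assuming the output file is a flat `{instance_id: difficulty}` JSON mapping (the filename here is hypothetical), the average could be computed with a short sketch like:

```python
import json

# Hypothetical path; get_difficulties.py writes a {task instance: difficulty} mapping.
with open("difficulties.json") as f:
    ratings = json.load(f)

# Map easy/medium/hard to 1/5/9 (the scoring used for the table below);
# values that are already numeric pass through unchanged.
score_map = {"easy": 1, "medium": 5, "hard": 9}
scores = [score_map.get(v, v) for v in ratings.values()]
print(f"Dataset difficulty: {sum(scores) / len(scores):.3f}")
```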
30 |
31 | ## Prior Datasets
32 |
33 | Using our model, we've assessed the difficulty of existing datasets, assigning scores of 1/5/9 to easy/medium/hard tasks.
34 |
35 | | Dataset | # Instances | Score | `easy` | `med` | `hard` |
36 | |------------------------|-------------|--------|--------|-------|--------|
37 | | SWE-bench | 2294 | 5.014 | 438 | 1408 | 446 |
38 | | └── Lite | 300 | 3.893 | 93 | 197 | 10 |
39 | | └── Verified | 500 | 3.960 | 173 | 284 | 43 |
40 | | SWE-bench Multimodal | 510 | 6.036 | 55 | 265 | 186 |
41 | | SWE-gym | 2438 | 5.625 | 288 | 1456 | 664 |
42 | | └── Lite | 230 | 3.890 | 67 | 156 | 4 |
43 | | SWE-smith (LM Modify) | 1000 | 3.304 | 441 | 542 | 17 |
44 | | SWE-smith (LM Rewrite) | 1000 | 5.272 | 68 | 796 | 136 |
45 | | SWE-smith (Procedural) | 1000 | 3.596 | 374 | 603 | 23 |
46 | | SWE-smith (PR Mirror) | 1000 | 4.876 | 206 | 619 | 175 |
47 | | SWE-smith (Combine) | 1000 | 5.720 | 52 | 716 | 232 |
48 |
49 | From the table, we demonstrate that SWE-smith task instances are comparable to real-world tasks, and that our bug generation techniques allow for a wide range of task difficulties.
--------------------------------------------------------------------------------
/docs/guides/env_construction.md:
--------------------------------------------------------------------------------
1 | SWE-smith enables automatic construction of execution environments for repositories.
2 | We'll review the two steps of this process:
3 |
4 | 1. SWE-agent + LM attempts to install a repository + run the testing suite.
5 | 2. Construct an execution environment (Docker image).
6 |
7 | For this section, we'll use the [Instagram/MonkeyType](https://github.com/Instagram/MonkeyType/) repository as a running example,
8 | specifically at commit [`70c3acf`](https://github.com/Instagram/MonkeyType/tree/70c3acf62950be5dfb28743c7a719bfdecebcd84).
9 |
10 | ## Automatically Install Repos with SWE-agent
11 |
12 | Coming soon!
13 |
14 | ## Create an Execution Environment
15 | First, create the conda environment for the target repository.
16 | ```bash
17 | python -m swesmith.build_repo.try_install Instagram/MonkeyType install_repo.sh \
18 | --commit 70c3acf62950be5dfb28743c7a719bfdecebcd84
19 | ```
20 | where `install_repo.sh` is the script that installs the repository.
21 | ([Example](https://github.com/SWE-bench/SWE-smith/blob/main/configs/install_repo.sh))
22 |
23 | If successful, two artifacts will be produced under `logs/build_repo/records/`:
24 | * `sweenv_[repo + commit].yml`: A dump of the conda environment that was created.
25 | * `sweenv_[repo + commit].sh`: A log of the installation process.
26 |
27 | Next, run the following command to create a Docker image for the repository.
28 |
29 | ```bash
30 | python -m swesmith.build_repo.create_images --repos Instagram/MonkeyType
31 | ```
32 |
33 | This command will create two artifacts:
34 | 1. A mirror of the original repository at the specified commit, created under [`swesmith`](https://github.com/orgs/swesmith/repositories). To change the organization, you can...
35 | * Pass in an `--org` argument, or
36 | * (If built from source) Change `ORG_NAME` in `swesmith/constants.py`
37 | 2. A Docker image (`swesmith.x86_64.<repo>.<commit>`) which contains the installed codebase.
38 |
39 | It's good practice to check that your Docker image works as expected.
40 | ```bash
41 | docker run -it --rm swesmith.x86_64.instagram__monkeytype.70c3acf6
42 | ```
43 | Within the container, run the testing suite (e.g. `pytest`) to ensure that the codebase is functioning as expected.
44 |
45 | !!! note "Get existing Docker images"
46 |
47 | All repositories represented in the SWE-smith [dataset](https://huggingface.co/datasets/SWE-bench/SWE-smith) are available to download. Simply run:
48 | ```bash
49 | python -m swesmith.build_repo.download_images
50 | ```
51 |
--------------------------------------------------------------------------------
/docs/guides/harnesses.md:
--------------------------------------------------------------------------------
1 | # Validation & Evaluation
2 |
3 | Great! You now have an execution environment + a bunch of candidate task instances. How do we determine which ones can be used for training?
4 |
5 | We provide two harnesses for the purposes of:
6 |
7 | * Validation: To check if a candidate task instance is usable (breaks 1+ existing tests).
8 | * Evaluation: To check if the proposed solution for a task instance is correct.
9 |
10 | These harnesses serve the same purposes as their counterparts in [SWE-bench](https://swe-bench.github.io).
11 |
12 | ## Validation
13 | The validation harness is used to check if a candidate task instance is usable (breaks 1+ existing tests).
14 |
15 | Once you've generated task instance candidates, follow these steps to validate them:
16 |
17 | 1. Collect the candidates
18 |
19 | ```bash
20 | python -m swesmith.bug_gen.collect_patches logs/bug_gen/
21 | ```
22 |
23 | This produces a `logs/bug_gen/_all_patches.json` file with all the candidate task instances.
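
As a quick check, you can confirm how many candidates were collected (a sketch, assuming the file is a JSON list):

```python
import json

with open("logs/bug_gen/_all_patches.json") as f:
    candidates = json.load(f)
print(f"{len(candidates)} candidate task instances")
```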
24 |
25 | 2. Run validation
26 |
27 | ```bash
28 | python -m swesmith.harness.valid \
29 | logs/bug_gen/_all_patches.json \
30 | --run_id <run_id>
31 | ```
32 |
33 | The validation harness works in two steps.
34 | First, it runs the original repository's test suite to get the passing statuses of the existing tests.
35 | Then, it applies each candidate task instance to the repository and runs the test suite again.
36 | If the candidate task instance breaks 1+ existing tests, it is considered a usable task instance.
37 |
38 | For each task instance, the validation harness produces a `logs/run_validation/<run_id>/<instance_id>/` folder containing the following information:
39 |
40 | * `eval.sh`: The sequence of test command(s) run
41 | * `patch.diff`: The candidate task instance
42 | * `report.json`: `FAIL_TO_PASS` and `PASS_TO_PASS` test cases
43 | * `run_instance.log`: The full trace of running validation
44 | * `test_output.txt`: The standard output of the test command(s)
45 |
46 | 3. Collect validated task instances
47 |
48 | ```bash
49 | python -m swesmith.harness.gather logs/run_validation/<run_id>
50 | ```
51 |
52 | Task instances with 1+ `FAIL_TO_PASS` test cases and 1+ `PASS_TO_PASS` test cases are considered valid.
53 |
54 | This script performs two actions:
55 |
56 | * It collects all valid task instances into a `logs/task_insts/<repo>.json` file. Each instance contains the following information:
57 | ```json
58 | {
59 | "instance_id": ,
60 | "repo": ,
61 | "patch": ,
62 | "FAIL_TO_PASS": ,
63 | "PASS_TO_PASS": ,
64 | "created_at": ,
65 | "image_name": ,
66 | "base_commit": ,
67 | }
68 | ```
69 | * For each valid task instance, a branch called `<instance_id>` is created in the repository. The branch corresponds to the repository with the task instance's bug patch applied.
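
The validity criterion above is easy to check programmatically. A minimal sketch, assuming a `report.json` with the `FAIL_TO_PASS` and `PASS_TO_PASS` lists described earlier:

```python
import json

def is_valid(report_path: str) -> bool:
    # Usable bugs break at least one existing test (FAIL_TO_PASS)
    # while leaving at least one other test passing (PASS_TO_PASS).
    with open(report_path) as f:
        report = json.load(f)
    return (len(report.get("FAIL_TO_PASS", [])) >= 1
            and len(report.get("PASS_TO_PASS", [])) >= 1)
```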
70 |
71 | ## Evaluation
72 |
73 | The evaluation harness is used to check if the proposed solution for a task instance is correct.
74 |
75 | You can run this script to sanity check that testing for validated task instances works as expected:
76 |
77 | ```bash
78 | python -m swesmith.harness.eval \
79 |     --dataset_path logs/task_insts/{repo}.json \
80 | --predictions_path gold \
81 | --run_id sanity
82 | ```
83 |
84 | If you want to run on real predictions, simply replace `gold` with the path to your predictions, which should look like:
85 |
86 | ```json
87 | {
88 | "instance_id": ,
89 | "patch": ,
90 | "model_name_or_path": ,
91 | }
92 | ```
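
For instance, you might assemble that file from your model's outputs like so (a sketch; the instance ID, patch path, and the assumption that the harness accepts a JSON list of such records are all illustrative):

```python
import json

# Illustrative values; swap in your own instance IDs and model patches.
predictions = [
    {
        "instance_id": "instagram__monkeytype.70c3acf6.pr_1234",  # hypothetical
        "patch": open("model_patch.diff").read(),
        "model_name_or_path": "my-swe-agent-lm",
    }
]
with open("preds.json", "w") as f:
    json.dump(predictions, f, indent=2)
```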
93 |
--------------------------------------------------------------------------------
/docs/guides/index.md:
--------------------------------------------------------------------------------
1 | # Tutorials
--------------------------------------------------------------------------------
/docs/guides/issue_gen.md:
--------------------------------------------------------------------------------
1 | You have a bunch of task instances with executable environments.
2 | You're very close to training SWE-agents on this data.
3 | There's one last step - let's generate issue text.
4 |
5 | We primarily use LMs to generate issue text.
6 |
7 | ```bash
8 | python swesmith/issue_gen/generate.py logs/task_insts/<repo>.json \
9 | --config_file configs/issue_gen/ig_v2.yaml \
10 | --model anthropic/claude-3-7-sonnet-20250219 \
11 | --n_workers 4 \
12 | --experiment_id ig_v2 \
13 | --use_existing
14 | ```
15 |
16 | This will generate issue text for each task instance, producing several artifacts along the way:
17 |
18 | * Under `logs/issue_gen/ig_v2/`, there will be a folder for each task instance, containing:
19 | * `messages.json`: The messages fed to the LM to generate the issue text.
20 | * `metadata.json`: Contains the issue text + inference cost.
21 | * A `logs/issue_gen/<repo>__ig_v2_n1.json` file will also be created: a copy of the original `logs/task_insts/<repo>.json` with issue text added to each task instance (as the `problem_statement` field).
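
A quick way to spot-check the generated issues (a sketch; the file name follows the pattern above, with the MonkeyType repository as a stand-in):

```python
import json

path = "logs/issue_gen/instagram__monkeytype.70c3acf6__ig_v2_n1.json"  # example
with open(path) as f:
    insts = json.load(f)
for inst in insts[:3]:
    print(inst["instance_id"])
    print(inst["problem_statement"][:200], "...\n")
```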
22 |
23 | ## Alternatives
24 |
25 | In our paper, we discuss several alternatives for generating issue text.
26 | While our experiments suggest that LM-generated issue text is the best proxy for real issue text, we provide instructions for the alternatives below.
27 |
28 | **Static Issue Text**
29 |
30 | The problem statement is generated by randomly selecting one of 7 static issue text templates.
31 |
32 | ```bash
33 | python swesmith/issue_gen/get_static.py logs/task_insts/<repo>.json
34 | ```
35 |
36 | Produces a `logs/issue_gen/<repo>__ig_static.json` file.
37 |
38 | **Random F2P Test Case**
39 |
40 | The problem statement shows a randomly selected Fail-to-Pass test case from the task instance.
41 |
42 | ```bash
43 | python swesmith/issue_gen/get_from_tests.py logs/task_insts/<repo>.json
44 | ```
45 |
46 | **Original Issue Text**
47 |
48 | !!! note
49 | This strategy only works for PR Mirror task instances whose underlying pull request has one or more associated issues.
50 |
51 | ```bash
52 | python swesmith/issue_gen/get_from_pr.py logs/task_insts/<repo>.json
53 | ```
54 |
55 | Produces a `logs/issue_gen/<repo>__ig_orig.json` file.
56 |
--------------------------------------------------------------------------------
/docs/guides/train_swe_agent.md:
--------------------------------------------------------------------------------
1 | # Training SWE-agents
2 |
3 | Now the fun part - we provide details on how to operationalize SWE-smith for training SWE-agents!
4 |
5 | Specifically, we'll cover the workflow for Rejection Sampling Fine Tuning.
6 |
7 | !!! note "SWE-agent"
8 |
9 | The documentation in this section is heavily grounded in the [SWE-agent](https://github.com/SWE-agent/SWE-agent) library.
10 | We do *not* plan to explicitly support non-SWE-agent scaffolds, but adapting one should not be difficult; the main changes would be how you generate expert trajectories and predictions for evaluation.
11 |
12 | There are several steps we'll cover:
13 |
14 | 1. Creating a subset of SWE-smith task instances.
15 | 2. Generating expert trajectories for those task instances.
16 | 3. Training a model on the expert trajectories.
17 | 4. Evaluating the model on SWE-bench (Lite/Verified/Multimodal).
18 |
19 | ## Creating SWE-smith Subset
20 |
21 | The full [SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith) dataset is quite large.
22 | Usually, we recommend training on a subset.
23 | To curate a subset, you might use logic like the following.
24 |
25 | ```python
26 | import json
27 | 
28 | from datasets import load_dataset
29 | swesmith = load_dataset("SWE-bench/SWE-smith", split="train")
30 | 
31 | subset_name = "subset0"
32 | def criteria(task_instance):
33 |     # Keep PR-mirror bugs that break between 2 and 5 tests.
34 |     return ".pr_" in task_instance["instance_id"] and \
35 |         2 <= len(task_instance["FAIL_TO_PASS"]) <= 5
36 | bugs = [x for x in swesmith if criteria(x)]
37 | print(f"Found {len(bugs)} bugs that match criteria")
38 | with open(f"logs/experiments/{subset_name}.json", "w") as f:
39 |     json.dump(bugs, f, indent=2)
40 | ```
41 |
42 | ## Generate Expert Trajectories
43 |
44 | 1. Clone [SWE-agent](https://github.com/SWE-agent/SWE-agent). Make sure to follow the installation instructions [here](https://swe-agent.com/latest/installation/source/).
45 |
46 | 2. Create a symbolic link to SWE-smith's `agent/` folder inside SWE-agent. That is, from the SWE-agent repository root, run:
47 | ```bash
48 | ln -s path/to/SWE-smith/agent/ .
49 | ```
50 |
51 | 3. In SWE-agent, run expert trajectory generation:
52 | ```bash
53 | ./agent/_gen_trajs.sh
54 | ```
55 | Check the file to see how the script works. You'll need to adjust the `--instances.path` argument to point to the subset you created in the previous step.
56 |
57 | ## Train Model
58 |
59 | The previous step will generate individual trajectories per task instance under the `SWE-agent/trajectories/<user>/<run_id>/` folder.
60 |
61 | We'll now determine which trajectories correspond to resolved instances, convert them to a format that can be used for SFT, and then train a model with them.
62 |
63 | 1. (From SWE-smith) Run evaluation on training task instances.
64 | ```bash
65 | python -m swesmith.harness.eval \
66 | --dataset_path path/to/subset0.json \
67 | --predictions_path path/to/trajectories/<user>/<run_id>/preds.json \
68 | --run_id <run_id> \
69 | --max_workers 10 \
70 | --timeout 240
71 | ```
72 |
73 | !!! tip "`preds.json`"
74 | If there is no `preds.json`, run `sweagent merge-preds trajectories/<user>/<run_id>/`.
75 |
76 | This evaluation will generate a `logs/run_evaluation/<run_id>/`
77 | folder with a `report.json` file indicating which instance IDs were successfully resolved.
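
To gauge your resolve rate before converting trajectories, you can peek at that report (a sketch; the exact keys may vary, so this just summarizes whatever is there):

```python
import json

report = json.load(open("logs/run_evaluation/my_run/report.json"))  # hypothetical run_id
for key, value in report.items():
    print(key, len(value) if isinstance(value, (list, dict)) else value)
```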
78 |
79 | 2. (From SWE-smith) Convert trajectories into SFT format.
80 |
81 | ```bash
82 | python -m swesmith.train.traj_mgr.transform_to_ft \
83 | --traj_dir path/to/trajectories/<user>/<run_id>/ \
84 | --eval_dir logs/run_evaluation/<run_id>/ \
85 | --only_resolved
86 | ```
87 |
88 | This will produce an `ft_xml_*.jsonl` file under the `trajectories_sft/` folder.
89 | This dataset can be used directly for SFT.
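
Before training, it's worth peeking at one record to confirm the schema (a sketch; the file name is a placeholder):

```python
import json

with open("trajectories_sft/ft_xml_example.jsonl") as f:  # placeholder name
    first = json.loads(next(f))
print(list(first.keys()))
```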
90 |
91 | 3. Run training. First, upload the file to Modal
92 | ```bash
93 | modal volume put <volume> trajectories_sft/ft_xml_*.jsonl
94 | ```
95 |
96 | Then, modify `config/train/full_ft_qwen_7b.yml` to point to the file in Modal.
97 |
98 | Finally, run the training script:
99 | ```bash
100 | ./scripts/train.run_ft_torchtune.py
101 | ```
102 |
103 | ## Evaluation
104 | Run inference with SWE-agent + your SFT'ed model on SWE-bench (Lite/Verified/Multimodal).
105 |
106 | 1. (From SWE-smith) Update `scripts/train.serve_sglang.sh` to point at your SFT'ed model, then run it.
107 |
108 | 2. (From SWE-agent) Run inference:
109 | ```bash
110 | ./agent/_infer_model.sh
111 | ```
112 | Make sure the Modal URL is correct and change the evaluation dataset as desired.
113 |
114 | 3. When inference finishes, run evaluation on the model's predictions. (Check out [sb-cli](https://github.com/SWE-bench/sb-cli/tree/main) for more information on how to conveniently run evaluation for SWE-bench-* datasets.)
115 | ```bash
116 | sb-cli submit swe-bench_verified test \
117 | --predictions_path trajectories/<user>/<run_id>/preds.json \
118 | --run_id <run_id>
119 | ```
--------------------------------------------------------------------------------
/docs/overrides/main.html:
--------------------------------------------------------------------------------
1 | {% extends "base.html" %}
2 |
3 | {% block content %}
4 | {{ super() }}
5 |
6 |
7 |