├── LICENSE
└── README.md
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 Yerba
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # 🤖✨ Awesome Repository-Level Code Generation ✨🤖
2 |
10 | 🌟 A curated list of awesome repository-level code generation research papers and resources. If you want to contribute to this list (please do), feel free to send a pull request. 🚀 If you have any questions, feel free to contact [Yuling Shi](https://yerbasite.github.io) or [Xiaodong Gu](https://guxd.github.io) (SJTU).
11 |
12 | ## 📚 Contents
13 |
14 | - [📚 Contents](#-contents)
15 | - [💥 Repo-Level Issue Resolution](#-repo-level-issue-resolution)
16 | - [🤖 Repo-Level Code Completion](#-repo-level-code-completion)
17 | - [🔄 Repo-Level Code Translation](#-repo-level-code-translation)
18 | - [🧪 Repo-Level Unit Test Generation](#-repo-level-unit-test-generation)
19 | - [🔍 Repo-Level Code QA](#-repo-level-code-qa)
20 | - [👩‍💻 Repo-Level Issue Task Synthesis](#-repo-level-issue-task-synthesis)
21 | - [📊 Datasets and Benchmarks](#-datasets-and-benchmarks)
22 |
23 | ## 💥 Repo-Level Issue Resolution
24 |
25 | - SWE-Exp: Experience-Driven Software Issue Resolution [2025-07-arXiv] [[📄 paper](http://arxiv.org/abs/2507.23361)] [[🔗 repo](https://github.com/YerbaPage/SWE-Exp)]
26 |
27 | - SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution [2025-07-arXiv] [[📄 paper](http://arxiv.org/abs/2507.23348)] [[🔗 repo](https://github.com/YerbaPage/SWE-Debate)]
28 |
29 | - LIVE-SWE-AGENT: Can Software Engineering Agents Self-Evolve on the Fly? [2025-11-arXiv] [[📄 paper](https://arxiv.org/abs/2511.13646)]
30 |
31 | - Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories [2025-10-arXiv] [[📄 paper](https://arxiv.org/abs/2511.00197)]
32 |
33 | - BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [2025-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2510.19898)]
34 |
35 | - Where LLM Agents Fail and How They Can Learn From Failures [2025-09-arXiv] [[📄 paper](https://www.arxiv.org/abs/2509.25370)] [[🔗 repo](https://github.com/ulab-uiuc/AgentDebug)]
36 |
37 | - SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints [2025-09-arXiv] [[📄 paper](https://arxiv.org/abs/2509.09853)]
38 |
39 | - Diffusion is a code repair operator and generator [2025-08-arXiv] [[📄 paper](https://arxiv.org/abs/2508.11110)]
40 |
41 | - The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [2025-06-arXiv] [[📄 paper](https://arxiv.org/abs/2506.12286)]
42 |
43 | - Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards [2025-06-arXiv] [[📄 paper](https://arxiv.org/pdf/2506.11425)]
44 |
45 | - EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair [2025-06-arXiv] [[📄 paper](https://arxiv.org/pdf/2506.10484)]
46 |
47 | - Coding Agents with Multimodal Browsing are Generalist Problem Solvers [2025-06-arXiv] [[📄 paper](https://arxiv.org/pdf/2506.03011)] [[🔗 repo](https://github.com/adityasoni9998/OpenHands-Versa)]
48 |
49 | - CoRet: Improved Retriever for Code Editing [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.24715)]
50 |
51 | - Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.22954)] [[🔗 repo](https://github.com/jennyzzt/dgm)]
52 |
53 | - SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.16975)] [[🔗 repo](https://github.com/justLittleWhite/SWE-Dev)]
54 |
55 | - Putting It All into Context: Simplifying Agents with LCLMs [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.08120)]
56 |
57 | - SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning [2025-05-arXiv] [[📄 blog](https://novasky-ai.notion.site/skyrl-v0)] [[🔗 repo](https://github.com/novasky-ai/skyrl-v0)]
58 |
59 | - AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions [2025-FSE] [[📄 paper](https://arxiv.org/pdf/2411.18015)]
60 |
61 | - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.23803)] [[🔗 repo](https://github.com/yingweima2022/SWE-Reasoner)]
62 |
63 | - Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.21710)]
64 |
65 | - CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.22424)]
66 |
67 | - SEAlign: Alignment Training for Software Engineering Agent [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.18455)]
68 |
69 | - DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.14269)] [[🔗 repo](https://github.com/darsagent/DARS-Agent)]
70 |
71 | - LocAgent: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.09089)] [[🔗 repo](https://github.com/gersteinlab/LocAgent)]
72 |
73 | - SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning [2025-02-arXiv] [[📄 paper](https://arxiv.org/abs/2502.20127)]
74 |
75 | - SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [2025-02-arXiv] [[📄 paper](https://arxiv.org/pdf/2502.18449)] [[🔗 repo](https://github.com/facebookresearch/swe-rl)]
76 |
77 | - SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [2025-01-arXiv] [[📄 paper](https://arxiv.org/pdf/2501.05040)] [[🔗 repo](https://github.com/InternLM/SWE-Fixer)]
78 |
79 | - CodeMonkeys: Scaling Test-Time Compute for Software Engineering [2025-01-arXiv] [[📄 paper](https://arxiv.org/abs/2501.14723)] [[🔗 repo](https://github.com/ScalingIntelligence/codemonkeys)]
80 |
81 | - Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.21139)] [[🔗 repo](https://github.com/SWE-Gym/SWE-Gym)]
82 |
83 | - CODEV: Issue Resolving with Visual Data [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.17315)] [[🔗 repo](https://github.com/luolin101/CodeV)]
84 |
85 | - LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues [2024-11-arXiv] [[📄 paper](https://arxiv.org/pdf/2411.13941)]
86 |
87 | - Globant Code Fixer Agent Whitepaper [2024-11] [[📄 paper](https://ai.globant.com/wp-content/uploads/2024/11/Whitepaper-Globant-Code-Fixer-Agent.pdf)]
88 |
89 | - MarsCode Agent: AI-native Automated Bug Fixing [2024-11-arXiv] [[📄 paper](https://arxiv.org/abs/2409.00899)]
90 |
91 | - Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement [2024-11-arXiv] [[📄 paper](https://arxiv.org/html/2411.00622v1)] [[🔗 repo](https://github.com/LingmaTongyi/Lingma-SWE-GPT)]
92 |
93 | - SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement [2024-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2410.20285)] [[🔗 repo](https://github.com/aorwall/moatless-tree-search)]
94 |
95 | - AutoCodeRover: Autonomous Program Improvement [2024-09-ISSTA] [[📄 paper](https://dl.acm.org/doi/10.1145/3650212.3680384)] [[🔗 repo](https://github.com/nus-apr/auto-code-rover)]
96 |
97 | - SpecRover: Code Intent Extraction via LLMs [2024-08-arXiv] [[📄 paper](https://arxiv.org/abs/2408.02232)]
98 |
99 | - OpenHands: An Open Platform for AI Software Developers as Generalist Agents [2024-07-arXiv] [[📄 paper](https://arxiv.org/abs/2407.16741)] [[🔗 repo](https://github.com/All-Hands-AI/OpenHands)]
100 |
101 | - AGENTLESS: Demystifying LLM-based Software Engineering Agents [2024-07-arXiv] [[📄 paper](https://arxiv.org/abs/2407.01489)]
102 |
103 | - RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph [2024-10-arXiv] [[📄 paper](https://arxiv.org/abs/2410.14684)] [[🔗 repo](https://github.com/ozyyshr/RepoGraph)]
104 |
105 | - CodeR: Issue Resolving with Multi-Agent and Task Graphs [2024-06-arXiv] [[📄 paper](https://arxiv.org/pdf/2406.01304)] [[🔗 repo](https://github.com/NL2Code/CodeR)]
106 |
107 | - Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.01422v2)]
108 |
109 | - SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [2024-NeurIPS] [[📄 paper](https://arxiv.org/abs/2405.15793)] [[🔗 repo](https://github.com/SWE-agent/SWE-agent)]
110 |
111 | ## 🤖 Repo-Level Code Completion
112 |
113 | - Enhancing Project-Specific Code Completion by Inferring Internal API Information [2025-07-TSE] [[📄 paper](https://ieeexplore.ieee.org/abstract/document/11096713)] [[🔗 repo](https://github.com/ZJU-CTAG/InferCom)]
114 |
115 | - CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.10046)]
116 |
117 | - CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases [2025-04-NAACL] [[📄 paper](https://aclanthology.org/2025.naacl-long.7/)]
118 |
119 | - RTLRepoCoder: Repository-Level RTL Code Completion through the Combination of Fine-Tuning and Retrieval Augmentation [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.08862)]
120 |
121 | - Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs [2025-04-AAAI] [[📄 paper](https://ojs.aaai.org/index.php/AAAI/article/view/34782)] [[🔗 repo](https://github.com/Hambaobao/HCP-Coder)]
122 |
123 | - What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.20589)]
124 |
125 | - REPOFILTER: Adaptive Retrieval Context Trimming for Repository-Level Code Completion [2025-04-OpenReview] [[📄 paper](https://openreview.net/forum?id=oOSeOEXrFA)]
126 |
127 | - Improving FIM Code Completions via Context & Curriculum Based Learning [2024-12-arXiv] [[📄 paper](https://arxiv.org/abs/2412.16589)]
128 |
129 | - ContextModule: Improving Code Completion via Repository-level Contextual Information [2024-12-arXiv] [[📄 paper](https://arxiv.org/abs/2412.08063)]
130 |
131 | - A^3-CodGen: A Repository-Level Code Generation Framework for Code Reuse With Local-Aware, Global-Aware, and Third-Party-Library-Aware Information [2024-12-TSE] [[📄 paper](https://www.computer.org/csdl/journal/ts/2024/12/10734067/21iLh4j0oG4)]
132 |
133 | - RepoGenReflex: Enhancing Repository-Level Code Completion with Verbal Reinforcement and Retrieval-Augmented Generation [2024-09-arXiv] [[📄 paper](https://arxiv.org/abs/2409.13122)]
134 |
135 | - RAMBO: Enhancing RAG-based Repository-Level Method Body Completion [2024-09-arXiv] [[📄 paper](https://arxiv.org/abs/2409.15204)] [[🔗 repo](https://github.com/ise-uet-vnu/rambo)]
136 |
137 | - RLCoder: Reinforcement Learning for Repository-Level Code Completion [2024-07-arXiv] [[📄 paper](https://arxiv.org/abs/2407.19487)] [[🔗 repo](https://github.com/DeepSoftwareAnalytics/RLCoder)]
138 |
139 | - STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.10018)]
140 |
141 | - GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.07003)]
142 |
143 | - Enhancing Repository-Level Code Generation with Integrated Contextual Information [2024-06-arXiv] [[📄 paper](https://arxiv.org/pdf/2406.03283)]
144 |
145 | - R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.01359)]
146 |
147 | - Natural Language to Class-level Code Generation by Iterative Tool-augmented Reasoning over Repository [2024-05-arXiv] [[📄 paper](https://arxiv.org/abs/2405.01573)] [[🔗 repo](https://github.com/microsoft/repoclassbench)]
148 |
149 | - Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback [2024-03-arXiv] [[📄 paper](https://arxiv.org/abs/2403.16792)] [[🔗 repo](https://github.com/CGCL-codes/naturalcc/tree/main/examples/cocogen)]
150 |
151 | - Repoformer: Selective Retrieval for Repository-Level Code Completion [2024-03-arXiv] [[📄 paper](https://arxiv.org/abs/2403.10059)] [[🔗 repo](https://repoformer.github.io/)]
152 |
153 | - RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion [2024-03-arXiv] [[📄 paper](https://arxiv.org/abs/2403.06095)] [[🔗 repo](https://github.com/FSoft-AI4Code/RepoHyper)]
154 |
155 | - RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening [2024-07-Internetware] [[📄 paper](https://dl.acm.org/doi/10.1145/3671016.3674819)]
156 |
157 | - CodePlan: Repository-Level Coding using LLMs and Planning [2024-07-FSE] [[📄 paper](https://dl.acm.org/doi/abs/10.1145/3643757)] [[🔗 repo](https://github.com/microsoft/codeplan)]
158 |
159 | - DraCo: Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion [2024-05-ACL] [[📄 paper](https://aclanthology.org/2024.acl-long.431/)] [[🔗 repo](https://github.com/nju-websoft/DraCo)]
160 |
161 | - RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [2023-10-EMNLP] [[📄 paper](https://aclanthology.org/2023.emnlp-main.151/)] [[🔗 repo](https://github.com/microsoft/CodeT/tree/main/RepoCoder)]
162 |
163 | - Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context [2023-09-NeurIPS] [[📄 paper](https://neurips.cc/virtual/2023/poster/70362)] [[🔗 repo](https://aka.ms/monitors4codegen)]
164 |
165 | - RepoFusion: Training Code Models to Understand Your Repository [2023-06-arXiv] [[📄 paper](https://arxiv.org/abs/2306.10998)] [[🔗 repo](https://github.com/ServiceNow/RepoFusion)]
166 |
167 | - Repository-Level Prompt Generation for Large Language Models of Code [2023-06-ICML] [[📄 paper](https://proceedings.mlr.press/v202/shrivastava23a.html)] [[🔗 repo](https://github.com/shrivastavadisha/repo_level_prompt_generation)]
168 |
169 | - Fully Autonomous Programming with Large Language Models [2023-06-GECCO] [[📄 paper](https://dl.acm.org/doi/pdf/10.1145/3583131.3590481)] [[🔗 repo](https://github.com/KoutchemeCharles/aied2023)]
170 |
171 | ## 🔄 Repo-Level Code Translation
172 |
173 | - EVOC2RUST: A Skeleton-guided Framework for Project-Level C-to-Rust Translation [2025-08-arXiv] [[📄 paper](https://arxiv.org/abs/2508.04295)]
174 |
175 | - Scalable, Validated Code Translation of Entire Projects using Large Language Models [2025-06-PLDI] [[📄 paper](https://dl.acm.org/doi/abs/10.1145/3729315)]
176 |
177 | - A Systematic Literature Review on Neural Code Translation [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.07425)]
178 |
179 | - Enhancing LLM-based Code Translation in Repository Context via Triple Knowledge-Augmented [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.18305)]
180 |
181 | - C2SaferRust: Transforming C Projects into Safer Rust with NeuroSymbolic Techniques [2025-01-arXiv] [[📄 paper](https://www.arxiv.org/pdf/2501.14257)] [[🔗 repo](https://github.com/vikramnitin9/c2saferrust)]
182 |
183 | - Syzygy: Dual Code-Test C to (safe) Rust Translation using LLMs and Dynamic Analysis [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.14234)] [[🕸️ website](https://syzygy-project.github.io/)]
184 |
185 | - RustRepoTrans: Repository-level Code Translation Benchmark Targeting Rust [2024-11-arXiv] [[📄 paper](https://arxiv.org/abs/2411.13990)] [[🔗 repo](https://github.com/SYSUSELab/RustRepoTrans)]
186 |
187 | - Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code [2024-04-ICSE] [[📄 paper](https://doi.org/10.1145/3597503.3639226)] [[🔗 repo](https://github.com/Intelligent-CAT-Lab/PLTranslationEmpirical)]
188 |
189 | ## 🧪 Repo-Level Unit Test Generation
190 | - Execution-Feedback Driven Test Generation from SWE Issues [2025-08-arXiv] [[📄 paper](https://www.arxiv.org/abs/2508.06365)]
191 |
192 | - AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests [2025-07-arXiv] [[📄 paper](https://arxiv.org/abs/2507.17542)]
193 |
194 | - Mystique: Automated Vulnerability Patch Porting with Semantic and Syntactic-Enhanced LLM [2025-06] [[📄 paper](https://dl.acm.org/doi/10.1145/3715718)]
195 |
196 | - Issue2Test: Generating Reproducing Test Cases from Issue Reports [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.16320)]
197 |
198 | - Agentic Bug Reproduction for Effective Automated Program Repair at Google [2025-02-arXiv] [[📄 paper](https://arxiv.org/abs/2502.01821)]
199 |
200 | - LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues [2024-11-arXiv] [[📄 paper](https://arxiv.org/pdf/2411.13941)]
201 |
202 |
203 | ## 🔍 Repo-Level Code QA
204 |
205 | - SWE-QA: Can Language Models Answer Repository-level Code Questions? [2025-09-arXiv] [[📄 paper](https://arxiv.org/abs/2509.14635)] [[🔗 repo](https://github.com/peng-weihan/SWE-QA-Bench)]
206 |
207 | - Decompositional Reasoning for Graph Retrieval with Large Language Models [2025-06-arXiv] [[📄 paper](https://arxiv.org/abs/2506.13380)]
208 |
209 | - LongCodeBench: Evaluating Coding LLMs at 1M Context Windows [2025-05-arXiv] [[📄 paper](https://arxiv.org/pdf/2505.07897)]
210 |
211 | - LocAgent: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.09089)] [[🔗 repo](https://github.com/gersteinlab/LocAgent)]
212 |
213 | - CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering [2025-01-arXiv] [[📄 paper](https://arxiv.org/pdf/2501.03447)]
214 |
215 | - RepoChat Arena [2025-Blog] [[📄 blog](https://blog.lmarena.ai/blog/2025/repochat-arena/)]
216 |
217 | - RepoChat: An LLM-Powered Chatbot for GitHub Repository Question-Answering [MSR-2025] [[🕸️ website](https://2025.msrconf.org/details/msr-2025-data-and-tool-showcase-track/35/RepoChat-An-LLM-Powered-Chatbot-for-GitHub-Repository-Question-Answering)]
218 |
219 | - CodeQueries: A Dataset of Semantic Queries over Code [2022-09-arXiv] [[📄 paper](https://arxiv.org/abs/2209.08372)]
220 |
221 | ## 👩‍💻 Repo-Level Issue Task Synthesis
222 | - SWE-Mirror: Scaling Issue-Resolving Datasets by Mirroring Issues Across Repositories [2025-09-arXiv] [[📄 paper](https://arxiv.org/pdf/2509.08724)]
223 |
224 | - SWE-bench Goes Live! [2025-05-arXiv] [[📄 paper](https://www.arxiv.org/abs/2505.23419)] [[🔗 repo](https://github.com/microsoft/SWE-bench-Live)]
225 |
226 | - R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.07164)] [[🔗 repo](https://r2e-gym.github.io/)]
227 |
228 | - Scaling Data for Software Engineering Agents [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.21798)] [[🔗 repo](https://swesmith.com/)]
229 |
230 | - Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.14757v1)] [[🔗 repo](https://github.com/FSoft-AI4Code/SWE-Synth)]
231 |
232 | - Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.21139)] [[🔗 repo](https://github.com/SWE-Gym/SWE-Gym)]
233 |
234 |
235 | ## 📊 Datasets and Benchmarks
236 | - **SWE-Bench++**: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories [2025-12-arXiv] [[📄 paper](https://arxiv.org/abs/2512.17419)]
237 |
238 | - **Multi-Docker-Eval**: A ‘Shovel of the Gold Rush’ Benchmark on Automatic Environment Building for Software Engineering? [2025-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2512.06915)]
239 |
240 | - **CodeClash**: Benchmarking Goal-Oriented Software Engineering [2025-11-arXiv] [[📄 paper](https://arxiv.org/abs/2511.00839)] [[🔗 repo](https://github.com/CodeClash-ai/CodeClash)]
241 |
242 | - **SWE-fficiency**: Can Language Models Optimize Real-World Repositories on Real Workloads? [2025-11-arXiv] [[📄 paper](https://arxiv.org/abs/2511.06090)]
243 |
244 | - **SWE-Compass**: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [2025-11-arXiv] [[📄 paper](https://arxiv.org/abs/2511.05459)]
245 |
246 | - **SWE-Sharp-Bench**: A Reproducible Benchmark for C# Software Engineering Tasks [2025-11-arXiv] [[📄 paper](https://arxiv.org/abs/2511.02352)]
247 |
248 | - **ImpossibleBench**: Measuring LLMs’ Propensity of Exploiting Test Cases [2025-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2510.20270)]
249 |
250 | - **SWE-QA**: Can Language Models Answer Repository-level Code Questions? [2025-09-arXiv] [[📄 paper](https://arxiv.org/abs/2509.14635)] [[🔗 repo](https://github.com/peng-weihan/SWE-QA-Bench)]
251 |
252 | - **SR-Eval**: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement [2025-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2509.18808)]
253 |
254 | - **RECODE-H**: A Benchmark for Research Code Development with Interactive Human Feedback [2025-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2510.06186v1)]
255 |
256 | - **BigCodeBench**: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [ICLR-2025 Oral] [[📄 paper](https://arxiv.org/abs/2406.15877)] [[🔗 repo](https://github.com/bigcode-project/bigcodebench)]
257 |
258 | - **Vibe Checker**: Aligning Code Evaluation with Human Preference [2025-10-arXiv] [[📄 paper](https://arxiv.org/abs/2510.07315)]
259 |
260 | - **MULocBench**: A Benchmark for Localizing Code and Non-Code Issues in Software Projects [2025-09-arXiv] [[📄 paper](https://www.arxiv.org/abs/2509.25242)] [[🕸️ website](https://huggingface.co/datasets/somethingone/MULocBench)]
261 |
262 | - **SecureAgentBench**: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios [2025-09-arXiv] [[📄 paper](https://arxiv.org/html/2509.22097v1)]
263 |
264 | - **SWE-bench Pro**: Can AI Agents Solve Long-Horizon Software Engineering Tasks? [2025-09] [[📄 paper](https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf)] [[🔗 repo](https://github.com/scaleapi/SWE-bench_Pro-os/tree/main?tab=readme-ov-file)]
265 |
266 | - **AutoCodeBench**: Large Language Models are Automatic Code Benchmark Generators [2025-08-arXiv] [[📄 paper](https://arxiv.org/abs/2508.09101)] [[🔗 repo](https://autocodebench.github.io/)]
268 |
269 | - **LiveRepoReflection**: Turning the Tide: Repository-based Code Reflection [2025-07-arXiv] [[📄 paper](https://arxiv.org/abs/2507.09866)] [[🔗 repo](https://livereporeflection.github.io/)]
270 |
271 | - **SWE-Perf**: Can Language Models Optimize Code Performance on Real-World Repositories? [2025-07-arXiv] [[📄 paper](https://arxiv.org/abs/2507.12415)] [[🔗 repo](https://swe-perf.github.io/)]
272 |
273 | - **ResearchCodeBench**: Benchmarking LLMs on Implementing Novel Machine Learning Research Code [2025-06-arXiv] [[📄 paper](https://arxiv.org/abs/2506.02314)] [[🔗 repo](https://researchcodebench.github.io/)]
274 |
275 | - **SWE-Factory**: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks [2025-06-arXiv] [[📄 paper](https://arxiv.org/abs/2506.10954v1)] [[🔗 repo](https://github.com/DeepSoftwareAnalytics/swe-factory)]
276 |
277 | - **UTBoost**: Rigorous Evaluation of Coding Agents on SWE-Bench [ACL-2025] [[📄 paper](https://arxiv.org/abs/2506.09289)]
278 |
279 | - **SWE-Flow**: Synthesizing Software Engineering Data in a Test-Driven Manner [ICML-2025] [[📄 paper](https://arxiv.org/abs/2506.09003)] [[🔗 repo](https://github.com/Hambaobao/SWE-Flow)]
280 |
281 | - **AgentIssue-Bench**: Can Agents Fix Agent Issues? [2025-08-arXiv] [[📄 paper](https://arxiv.org/pdf/2505.20749)] [[🔗 repo](https://github.com/alfin06/AgentIssue-Bench)]
282 |
283 | - **CodeAssistBench (CAB)**: Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance [2025-07-arXiv] [[📄 paper](https://arxiv.org/abs/2507.10646)]
284 |
285 | - **OmniGIRL**: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution [2025-05-ISSTA] [[📄 paper](https://arxiv.org/abs/2505.04606)] [[🔗 repo](https://github.com/DeepSoftwareAnalytics/OmniGIRL)]
286 |
287 | - **CodeFlowBench**: A Multi-turn, Iterative Benchmark for Complex Code Generation [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.21751)] [[🔗 repo](https://github.com/Rise-1210/codeflow)]
288 |
289 | - **SWE-Smith**: Scaling Data for Software Engineering Agents [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.21798)] [[🔗 repo](https://swesmith.com/)]
290 |
291 | - **SWE-Synth**: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.14757v1)] [[🔗 repo](https://github.com/FSoft-AI4Code/SWE-Synth)]
292 |
293 | - Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.15223)]
294 |
295 | - **Unveiling Pitfalls**: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.12374)]
296 |
297 | - **ConvCodeWorld**: Benchmarking Conversational Code Generation in Reproducible Feedback Environments [2025-02-arXiv] [[📄 paper](https://arxiv.org/abs/2502.19852)] [[🔗 repo](https://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld)]
298 |
301 | - Evaluating Agent-based Program Repair at Google [2025-01-arXiv] [[📄 paper](https://arxiv.org/pdf/2501.07531)]
302 |
303 | - **SWE-rebench**: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.20411)] [[🕸️ website](https://swe-rebench.com/leaderboard)]
304 |
305 | - **SWE-bench-Live**: A Live Benchmark for Repository-Level Issue Resolution [2025-05-arXiv] [[📄 paper](https://www.arxiv.org/abs/2505.23419)] [[🔗 repo](https://github.com/microsoft/SWE-bench-Live)]
306 |
307 | - **FEA-Bench**: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation [2025-05-ACL] [[📄 paper](https://arxiv.org/abs/2503.06680)] [[🔗 repo](https://github.com/microsoft/FEA-Bench)]
308 |
311 | - **SWE-PolyBench**: A Multi-Language Benchmark for Repository-Level Evaluation of Coding Agents [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.08703)] [[🔗 repo](https://github.com/FSoft-AI4Code/SWE-PolyBench)]
312 |
313 | - **Multi-SWE-bench**: A Multilingual Benchmark for Issue Resolving [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.02605)] [[🔗 repo](https://github.com/multi-swe-bench/multi-swe-bench)]
314 |
315 | - **LibEvolutionEval**: A Benchmark and Study for Version-Specific Code Generation [2025-04-NAACL] [[📄 paper](https://arxiv.org/abs/2412.04478)] [[🕸️ website](https://lib-evolution-eval.github.io/)]
316 |
317 | - **SWEE-Bench & SWA-Bench**: Automated Benchmark Generation for Repository-Level Coding Tasks [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.07701)]
318 |
319 | - **ProjectEval**: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation [2025-ACL-Findings] [[📄 paper](https://arxiv.org/pdf/2503.07010)] [[🔗 repo](https://github.com/RyanLoil/ProjectEval/)]
320 |
321 | - **REPOST-TRAIN**: Scalable Repository-Level Coding Environment Construction with Sandbox Testing [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.07358)] [[🔗 repo](https://github.com/yiqingxyq/RepoST)]
322 |
323 | - **Loc-Bench**: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.09089)] [[🔗 repo](https://github.com/gersteinlab/LocAgent)]
324 |
325 | - **SWE-Lancer**: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? [2025-02-arXiv] [[📄 paper](https://arxiv.org/pdf/2502.12115)] [[🔗 repo](https://github.com/openai/SWELancer-Benchmark)]
326 |
327 | - **SolEval**: Benchmarking Large Language Models for Repository-level Solidity Code Generation [2025-02-arXiv] [[📄 paper](https://arxiv.org/abs/2502.18793)] [[🔗 repo](https://anonymous.4open.science/r/SolEval-1C06/)]
328 |
329 | - **HumanEvo**: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation [2025-ICSE] [[📄 paper](https://www.computer.org/csdl/proceedings-article/icse/2025/056900a764/251mHzzKizu)] [[🔗 repo](https://github.com/DeepSoftwareAnalytics/HumanEvo)]
330 |
331 | - **RepoExec**: On the Impacts of Contexts on Repository-Level Code Generation [2025-NAACL] [[📄 paper](https://arxiv.org/abs/2406.11927)] [[🔗 repo](https://github.com/FSoft-AI4Code/RepoExec)]
332 |
333 | - **SWE-Gym**: Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.21139)] [[🔗 repo](https://github.com/SWE-Gym/SWE-Gym)]
334 |
335 | - **RepoTransBench**: A Real-World Benchmark for Repository-Level Code Translation [2024-12-arXiv] [[📄 paper](https://arxiv.org/abs/2412.17744)] [[🔗 repo](https://github.com/DeepSoftwareAnalytics/RepoTransBench)]
336 |
337 | - **Visual SWE-bench**: Issue Resolving with Visual Data [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.17315)] [[🔗 repo](https://github.com/luolin101/CodeV)]
338 |
339 | - **ExecRepoBench**: Multi-level Executable Code Completion Evaluation [2024-12-arXiv] [[📄 paper](https://arxiv.org/abs/2412.11990)] [[🔗 site](https://execrepobench.github.io/)]
340 |
341 | - **REPOCOD**: Can Language Models Replace Programmers? REPOCOD Says 'Not Yet' [2024-10-arXiv] [[📄 paper](https://arxiv.org/abs/2410.21647)] [[🔗 repo](https://github.com/lt-asset/REPOCOD)]
342 |
343 | - **M2RC-EVAL**: Massively Multilingual Repository-level Code Completion Evaluation [2024-10-arXiv] [[📄 paper](https://arxiv.org/abs/2410.21157)] [[🔗 repo](https://github.com/M2RC-Eval-Team/M2RC-Eval)]
344 |
345 | - **SWE-bench+**: Enhanced Coding Benchmark for LLMs [2024-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2410.06992)]
346 |
347 | - **SWE-bench Multimodal**: Multimodal Software Engineering Benchmark [2024-10-arXiv] [[📄 paper](https://arxiv.org/abs/2410.03859)] [[🔗 site](https://swebench.com/multimodal)]
348 |
349 | - **Codev-Bench**: How Do LLMs Understand Developer-Centric Code Completion? [2024-10-arXiv] [[📄 paper](https://arxiv.org/abs/2410.01353)] [[🔗 repo](https://github.com/LingmaTongyi/Codev-Bench)]
350 |
351 | - **SWT-Bench**: Testing and Validating Real-World Bug-Fixes with Code Agents [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.12952)] [[🕸️ website](https://swtbench.com/?results=verified)]
353 |
354 | - **CodeRAG-Bench**: Can Retrieval Augment Code Generation? [2024-06-arXiv] [[📄 paper](http://arxiv.org/abs/2406.14497)] [[🔗 repo](https://github.com/code-rag-bench/code-rag-bench/tree/main)]
355 |
356 | - **R2C2-Bench**: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.01359)]
357 |
358 | - **RepoClassBench**: Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [2024-05-arXiv] [[📄 paper](https://arxiv.org/abs/2405.01573)] [[🔗 repo](https://github.com/microsoft/repoclassbench/tree/main)]
359 |
360 | - **DevEval**: Evaluating Code Generation in Practical Software Projects [2024-ACL-Findings] [[📄 paper](https://aclanthology.org/2024.findings-acl.214.pdf)] [[🔗 repo](https://github.com/seketeam/DevEval)]
361 |
362 | - **CodAgentBench**: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges [2024-ACL] [[📄 paper](https://aclanthology.org/2024.acl-long.737/)]
363 |
364 | - **RepoBench**: Benchmarking Repository-Level Code Auto-Completion Systems [2024-ICLR] [[📄 paper](https://openreview.net/forum?id=pPjZIOuQuF)] [[🔗 repo](https://github.com/Leolty/repobench)]
365 |
366 | - **SWE-bench**: Can Language Models Resolve Real-World GitHub Issues? [2024-ICLR] [[📄 paper](https://arxiv.org/pdf/2310.06770)] [[🔗 repo](https://github.com/princeton-nlp/SWE-bench)]
367 |
368 | - **CrossCodeLongEval**: Repoformer: Selective Retrieval for Repository-Level Code Completion [2024-ICML] [[📄 paper](https://arxiv.org/abs/2403.10059)] [[🔗 repo](https://repoformer.github.io/)]
369 |
370 | - **R2E-Eval**: Turning Any GitHub Repository into a Programming Agent Test Environment [2024-ICML] [[📄 paper](https://proceedings.mlr.press/v235/jain24c.html)] [[🔗 repo](https://r2e.dev/)]
371 |
372 | - **RepoEval**: Repository-Level Code Completion Through Iterative Retrieval and Generation [2023-EMNLP] [[📄 paper](https://aclanthology.org/2023.emnlp-main.151/)] [[🔗 repo](https://github.com/microsoft/CodeT/tree/main/RepoCoder)]
373 |
374 | - **CrossCodeEval**: A Diverse and Multilingual Benchmark for Cross-File Code Completion [2023-NeurIPS] [[📄 paper](https://proceedings.neurips.cc/paper_files/paper/2023/file/920f2dced7d32ab2ba2f1970bc306af6-Paper-Datasets_and_Benchmarks.pdf)] [[🔗 site](https://crosscodeeval.github.io/)]
375 |
376 | - **Skeleton-Guided-Translation**: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation [2025-01-arXiv] [[📄 paper](https://arxiv.org/abs/2501.16050)] [[🔗 repo](https://github.com/microsoft/TransRepo)]
377 |
378 | - **SWE-Dev**: Evaluating and Training Autonomous Feature-Driven Software Development [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.16975)] [[🔗 repo](https://github.com/justLittleWhite/SWE-Dev)]
379 |
380 | ## Star History
381 |
382 | [](https://www.star-history.com/#YerbaPage/Awesome-Repo-Level-Code-Generation&Date)
383 |
--------------------------------------------------------------------------------