# 🤖✨ Awesome Repository-Level Code Generation ✨🤖
🌟 A curated list of awesome repository-level code generation research papers and resources. If you want to contribute to this list (please do), feel free to send me a pull request following the entry template sketched below. 🚀 If you have any further questions, contact [Yuling Shi](https://yerbasite.github.io) or [Xiaodong Gu](https://guxd.github.io) (SJTU).
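
To keep the list consistent, a new entry can follow the format used throughout this page. A minimal template (the title, date tag, and both URLs are placeholders, not a real paper):

```markdown
- Paper Title: Subtitle If Any [YYYY-MM-arXiv] [[📄 paper](https://arxiv.org/abs/XXXX.XXXXX)] [[🔗 repo](https://github.com/USER/REPO)]
```

For entries under [📊 Datasets and Benchmarks](#-datasets-and-benchmarks), additionally bold the benchmark name, e.g. `**BenchName**: Title ...`.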

## 📚 Contents

- [📚 Contents](#-contents)
- [💥 Repo-Level Issue Resolution](#-repo-level-issue-resolution)
- [🤖 Repo-Level Code Completion](#-repo-level-code-completion)
- [🔄 Repo-Level Code Translation](#-repo-level-code-translation)
- [🧪 Repo-Level Unit Test Generation](#-repo-level-unit-test-generation)
- [🔍 Repo-Level Code QA](#-repo-level-code-qa)
- [👩‍💻 Repo-Level Issue Task Synthesis](#-repo-level-issue-task-synthesis)
- [📊 Datasets and Benchmarks](#-datasets-and-benchmarks)

## 💥 Repo-Level Issue Resolution

- SWE-Exp: Experience-Driven Software Issue Resolution [2025-07-arXiv] [[📄 paper](http://arxiv.org/abs/2507.23361)] [[🔗 repo](https://github.com/YerbaPage/SWE-Exp)]

- SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution [2025-07-arXiv] [[📄 paper](http://arxiv.org/abs/2507.23348)] [[🔗 repo](https://github.com/YerbaPage/SWE-Debate)]

- LIVE-SWE-AGENT: Can Software Engineering Agents Self-Evolve on the Fly? [2025-11-arXiv] [[📄 paper](https://arxiv.org/abs/2511.13646)]

- Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories [2025-10-arXiv] [[📄 paper](https://arxiv.org/abs/2511.00197)]

- BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [2025-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2510.19898)]

- Where LLM Agents Fail and How They Can Learn From Failures [2025-09-arXiv] [[📄 paper](https://www.arxiv.org/abs/2509.25370)] [[🔗 repo](https://github.com/ulab-uiuc/AgentDebug)]

- SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints [2025-09-arXiv] [[📄 paper](https://arxiv.org/abs/2509.09853)]

- Diffusion is a code repair operator and generator [2025-08-arXiv] [[📄 paper](https://arxiv.org/abs/2508.11110)]

- The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [2025-06-arXiv] [[📄 paper](https://arxiv.org/abs/2506.12286)]

- Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards [2025-06-arXiv] [[📄 paper](https://arxiv.org/pdf/2506.11425)]

- EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair [2025-06-arXiv] [[📄 paper](https://arxiv.org/pdf/2506.10484)]

- Coding Agents with Multimodal Browsing are Generalist Problem Solvers [2025-06-arXiv] [[📄 paper](https://arxiv.org/pdf/2506.03011)] [[🔗 repo](https://github.com/adityasoni9998/OpenHands-Versa)]

- CoRet: Improved Retriever for Code Editing [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.24715)]

- Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.22954)] [[🔗 repo](https://github.com/jennyzzt/dgm)]

- SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.16975)] [[🔗 repo](https://github.com/justLittleWhite/SWE-Dev)]

- Putting It All into Context: Simplifying Agents with LCLMs [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.08120)]

- SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning [2025-05-arXiv] [[📄 blog](https://novasky-ai.notion.site/skyrl-v0)] [[🔗 repo](https://github.com/novasky-ai/skyrl-v0)]

- AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions [2025-FSE] [[📄 paper](https://arxiv.org/pdf/2411.18015)]

- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.23803)] [[🔗 repo](https://github.com/yingweima2022/SWE-Reasoner)]

- Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.21710)]

- CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.22424)]

- SEAlign: Alignment Training for Software Engineering Agent [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.18455)]

- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.14269)] [[🔗 repo](https://github.com/darsagent/DARS-Agent)]

- LocAgent: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.09089)] [[🔗 repo](https://github.com/gersteinlab/LocAgent)]

- SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning [2025-02-arXiv] [[📄 paper](https://arxiv.org/abs/2502.20127)]

- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [2025-02-arXiv] [[📄 paper](https://arxiv.org/pdf/2502.18449)] [[🔗 repo](https://github.com/facebookresearch/swe-rl)]

- SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [2025-01-arXiv] [[📄 paper](https://arxiv.org/pdf/2501.05040)] [[🔗 repo](https://github.com/InternLM/SWE-Fixer)]

- CodeMonkeys: Scaling Test-Time Compute for Software Engineering [2025-01-arXiv] [[📄 paper](https://arxiv.org/abs/2501.14723)] [[🔗 repo](https://github.com/google-research/code-monkeys)]

- Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.21139)] [[🔗 repo](https://github.com/SWE-Gym/SWE-Gym)]

- CODEV: Issue Resolving with Visual Data [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.17315)] [[🔗 repo](https://github.com/luolin101/CodeV)]

- LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues [2024-11-arXiv] [[📄 paper](https://arxiv.org/pdf/2411.13941)]

- Globant Code Fixer Agent Whitepaper [2024-11] [[📄 paper](https://ai.globant.com/wp-content/uploads/2024/11/Whitepaper-Globant-Code-Fixer-Agent.pdf)]

- MarsCode Agent: AI-native Automated Bug Fixing [2024-11-arXiv] [[📄 paper](https://arxiv.org/abs/2409.00899)]

- Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement [2024-11-arXiv] [[📄 paper](https://arxiv.org/html/2411.00622v1)] [[🔗 repo](https://github.com/LingmaTongyi/Lingma-SWE-GPT)]

- SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement [2024-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2410.20285)] [[🔗 repo](https://github.com/aorwall/moatless-tree-search)]

- AutoCodeRover: Autonomous Program Improvement [2024-09-ISSTA] [[📄 paper](https://dl.acm.org/doi/10.1145/3650212.3680384)] [[🔗 repo](https://github.com/nus-apr/auto-code-rover)]

- SpecRover: Code Intent Extraction via LLMs [2024-08-arXiv] [[📄 paper](https://arxiv.org/abs/2408.02232)]

- OpenHands: An Open Platform for AI Software Developers as Generalist Agents [2024-07-arXiv] [[📄 paper](https://arxiv.org/abs/2407.16741)] [[🔗 repo](https://github.com/All-Hands-AI/OpenHands)]

- AGENTLESS: Demystifying LLM-based Software Engineering Agents [2024-07-arXiv] [[📄 paper](https://arxiv.org/abs/2407.01489)]

- RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph [2024-07-arXiv] [[📄 paper](https://arxiv.org/abs/2410.14684)] [[🔗 repo](https://github.com/ozyyshr/RepoGraph)]

- CodeR: Issue Resolving with Multi-Agent and Task Graphs [2024-06-arXiv] [[📄 paper](https://arxiv.org/pdf/2406.01304)] [[🔗 repo](https://github.com/NL2Code/CodeR)]

- Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.01422v2)]

- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [2024-NeurIPS] [[📄 paper](https://arxiv.org/abs/2405.15793)] [[🔗 repo](https://github.com/SWE-agent/SWE-agent)]

## 🤖 Repo-Level Code Completion

- Enhancing Project-Specific Code Completion by Inferring Internal API Information [2025-07-TSE] [[📄 paper](https://ieeexplore.ieee.org/abstract/document/11096713)] [[🔗 repo](https://github.com/ZJU-CTAG/InferCom)]

- CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.10046)]

- CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases [2025-04-NAACL] [[📄 paper](https://aclanthology.org/2025.naacl-long.7/)]

- RTLRepoCoder: Repository-Level RTL Code Completion through the Combination of Fine-Tuning and Retrieval Augmentation [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.08862)]

- Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs [2025-04-AAAI] [[📄 paper](https://ojs.aaai.org/index.php/AAAI/article/view/34782)] [[🔗 repo](https://github.com/Hambaobao/HCP-Coder)]

- What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.20589)]

- REPOFILTER: Adaptive Retrieval Context Trimming for Repository-Level Code Completion [2025-04-OpenReview] [[📄 paper](https://openreview.net/forum?id=oOSeOEXrFA)]

- Improving FIM Code Completions via Context & Curriculum Based Learning [2024-12-arXiv] [[📄 paper](https://arxiv.org/abs/2412.16589)]

- ContextModule: Improving Code Completion via Repository-level Contextual Information [2024-12-arXiv] [[📄 paper](https://arxiv.org/abs/2412.08063)]

- A^3-CodGen: A Repository-Level Code Generation Framework for Code Reuse With Local-Aware, Global-Aware, and Third-Party-Library-Aware [2024-12-TSE] [[📄 paper](https://www.computer.org/csdl/journal/ts/2024/12/10734067/21iLh4j0oG4)]

- RepoGenReflex: Enhancing Repository-Level Code Completion with Verbal Reinforcement and Retrieval-Augmented Generation [2024-09-arXiv] [[📄 paper](https://arxiv.org/abs/2409.13122)]

- RAMBO: Enhancing RAG-based Repository-Level Method Body Completion [2024-09-arXiv] [[📄 paper](https://arxiv.org/abs/2409.15204)] [[🔗 repo](https://github.com/ise-uet-vnu/rambo)]

- RLCoder: Reinforcement Learning for Repository-Level Code Completion [2024-07-arXiv] [[📄 paper](https://arxiv.org/abs/2407.19487)] [[🔗 repo](https://github.com/DeepSoftwareAnalytics/RLCoder)]

- STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.10018)]

- GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.07003)]

- Enhancing Repository-Level Code Generation with Integrated Contextual Information [2024-06-arXiv] [[📄 paper](https://arxiv.org/pdf/2406.03283)]

- R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.01359)]

- Natural Language to Class-level Code Generation by Iterative Tool-augmented Reasoning over Repository [2024-05-arXiv] [[📄 paper](https://arxiv.org/abs/2405.01573)] [[🔗 repo](https://github.com/microsoft/repoclassbench)]

- Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback [2024-03-arXiv] [[📄 paper](https://arxiv.org/abs/2403.16792)] [[🔗 repo](https://github.com/CGCL-codes/naturalcc/tree/main/examples/cocogen)]

- Repoformer: Selective Retrieval for Repository-Level Code Completion [2024-03-arXiv] [[📄 paper](https://arxiv.org/abs/2403.10059)] [[🔗 repo](https://repoformer.github.io/)]

- RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion [2024-03-arXiv] [[📄 paper](https://arxiv.org/abs/2403.06095)] [[🔗 repo](https://github.com/FSoft-AI4Code/RepoHyper)]

- RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening [2024-07-Internetware] [[📄 paper](https://dl.acm.org/doi/10.1145/3671016.3674819)]

- CodePlan: Repository-Level Coding using LLMs and Planning [2024-07-FSE] [[📄 paper](https://dl.acm.org/doi/abs/10.1145/3643757)] [[🔗 repo](https://github.com/microsoft/codeplan)]

- DraCo: Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion [2024-05-ACL] [[📄 paper](https://aclanthology.org/2024.acl-long.431/)] [[🔗 repo](https://github.com/nju-websoft/DraCo)]

- RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [2023-10-EMNLP] [[📄 paper](https://aclanthology.org/2023.emnlp-main.151/)] [[🔗 repo](https://github.com/microsoft/CodeT/tree/main/RepoCoder)]

- Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context [2023-09-NeurIPS] [[📄 paper](https://neurips.cc/virtual/2023/poster/70362)] [[🔗 repo](https://aka.ms/monitors4codegen)]

- RepoFusion: Training Code Models to Understand Your Repository [2023-06-arXiv] [[📄 paper](https://arxiv.org/abs/2306.10998)] [[🔗 repo](https://github.com/ServiceNow/RepoFusion)]

- Repository-Level Prompt Generation for Large Language Models of Code [2023-06-ICML] [[📄 paper](https://proceedings.mlr.press/v202/shrivastava23a.html)] [[🔗 repo](https://github.com/shrivastavadisha/repo_level_prompt_generation)]

- Fully Autonomous Programming with Large Language Models [2023-06-GECCO] [[📄 paper](https://dl.acm.org/doi/pdf/10.1145/3583131.3590481)] [[🔗 repo](https://github.com/KoutchemeCharles/aied2023)]

## 🔄 Repo-Level Code Translation

- A Systematic Literature Review on Neural Code Translation [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.07425)]

- EVOC2RUST: A Skeleton-guided Framework for Project-Level C-to-Rust Translation [2025-08-arXiv] [[📄 paper](https://arxiv.org/abs/2508.04295)]

- Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code [2024-04-ICSE] [[📄 paper](https://doi.org/10.1145/3597503.3639226)] [[🔗 repo](https://github.com/Intelligent-CAT-Lab/PLTranslationEmpirical)]

- Enhancing LLM-based Code Translation in Repository Context via Triple Knowledge-Augmented [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.18305)]

- C2SaferRust: Transforming C Projects into Safer Rust with NeuroSymbolic Techniques [2025-01-arXiv] [[📄 paper](https://www.arxiv.org/pdf/2501.14257)] [[🔗 repo](https://github.com/vikramnitin9/c2saferrust)]

- Scalable, Validated Code Translation of Entire Projects using Large Language Models [2025-06-PLDI] [[📄 paper](https://dl.acm.org/doi/abs/10.1145/3729315)]

- Syzygy: Dual Code-Test C to (safe) Rust Translation using LLMs and Dynamic Analysis [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.14234)] [[🕸️ website](https://syzygy-project.github.io/)]

- RustRepoTrans: Repository-level Code Translation Benchmark Targeting Rust [2024-11-arXiv] [[📄 paper](https://arxiv.org/abs/2411.13990)] [[🔗 repo](https://github.com/SYSUSELab/RustRepoTrans)]

## 🧪 Repo-Level Unit Test Generation

- Execution-Feedback Driven Test Generation from SWE Issues [2025-08-arXiv] [[📄 paper](https://www.arxiv.org/abs/2508.06365)]

- AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests [2025-07-arXiv] [[📄 paper](https://arxiv.org/abs/2507.17542)]

- Mystique: Automated Vulnerability Patch Porting with Semantic and Syntactic-Enhanced LLM [2025-06-arXiv] [[📄 paper](https://dl.acm.org/doi/10.1145/3715718)]

- Issue2Test: Generating Reproducing Test Cases from Issue Reports [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.16320)]

- Agentic Bug Reproduction for Effective Automated Program Repair at Google [2025-02-arXiv] [[📄 paper](https://arxiv.org/abs/2502.01821)]

- LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues [2024-11-arXiv] [[📄 paper](https://arxiv.org/pdf/2411.13941)]

## 🔍 Repo-Level Code QA

- SWE-QA: Can Language Models Answer Repository-level Code Questions? [2025-09-arXiv] [[📄 paper](https://arxiv.org/abs/2509.14635)] [[🔗 repo](https://github.com/peng-weihan/SWE-QA-Bench)]

- Decompositional Reasoning for Graph Retrieval with Large Language Models [2025-06-arXiv] [[📄 paper](https://arxiv.org/abs/2506.13380)]

- LongCodeBench: Evaluating Coding LLMs at 1M Context Windows [2025-05-arXiv] [[📄 paper](https://arxiv.org/pdf/2505.07897)]

- LocAgent: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.09089)] [[🔗 repo](https://github.com/gersteinlab/LocAgent)]

- CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering [2025-01-arXiv] [[📄 paper](https://arxiv.org/pdf/2501.03447)]

- RepoChat Arena [2025-Blog] [[📄 blog](https://blog.lmarena.ai/blog/2025/repochat-arena/)]

- RepoChat: An LLM-Powered Chatbot for GitHub Repository Question-Answering [MSR-2025] [[🕸️ website](https://2025.msrconf.org/details/msr-2025-data-and-tool-showcase-track/35/RepoChat-An-LLM-Powered-Chatbot-for-GitHub-Repository-Question-Answering)]

- CodeQueries: A Dataset of Semantic Queries over Code [2022-09-arXiv] [[📄 paper](https://arxiv.org/abs/2209.08372)]

## 👩‍💻 Repo-Level Issue Task Synthesis

- SWE-Mirror: Scaling Issue-Resolving Datasets by Mirroring Issues Across Repositories [2025-09-arXiv] [[📄 paper](https://arxiv.org/pdf/2509.08724)]

- R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.07164)] [[🔗 repo](https://r2e-gym.github.io/)]

- SWE-bench Goes Live! [2025-05-arXiv] [[📄 paper](https://www.arxiv.org/abs/2505.23419)] [[🔗 repo](https://github.com/microsoft/SWE-bench-Live)]

- Scaling Data for Software Engineering Agents [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.21798)] [[🔗 repo](https://swesmith.com/)]

- Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.14757v1)] [[🔗 repo](https://github.com/FSoft-AI4Code/SWE-Synth)]

- Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.21139)] [[🔗 repo](https://github.com/SWE-Gym/SWE-Gym)]

## 📊 Datasets and Benchmarks

- **SWE-Bench++**: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories [2025-12-arXiv] [[📄 paper](https://arxiv.org/abs/2512.17419)]

- **Multi-Docker-Eval**: A ‘Shovel of the Gold Rush’ Benchmark on Automatic Environment Building for Software Engineering? [2025-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2512.06915)]

- **CodeClash**: Benchmarking Goal-Oriented Software Engineering [2025-11-arXiv] [[📄 paper](https://arxiv.org/abs/2511.00839)] [[🔗 repo](https://github.com/CodeClash-ai/CodeClash)]

- **SWE-fficiency**: Can Language Models Optimize Real-World Repositories on Real Workloads? [2025-11-arXiv] [[📄 paper](https://arxiv.org/abs/2511.06090)]

- **SWE-Compass**: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [2025-11-arXiv] [[📄 paper](https://arxiv.org/abs/2511.05459)]

- **SWE-Sharp-Bench**: A Reproducible Benchmark for C# Software Engineering Tasks [2025-11-arXiv] [[📄 paper](https://arxiv.org/abs/2511.02352)]

- **ImpossibleBench**: Measuring LLMs’ Propensity of Exploiting Test Cases [2025-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2510.20270)]

- **SWE-QA**: Can Language Models Answer Repository-level Code Questions? [2025-09-arXiv] [[📄 paper](https://arxiv.org/abs/2509.14635)] [[🔗 repo](https://github.com/peng-weihan/SWE-QA-Bench)]

- **SR-Eval**: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement [2025-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2509.18808)]

- **RECODE-H**: A Benchmark for Research Code Development with Interactive Human Feedback [2025-09-arXiv] [[📄 paper](https://arxiv.org/pdf/2510.06186v1)]

- **BigCodeBench**: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [ICLR-2025 Oral] [[📄 paper](https://arxiv.org/abs/2406.15877)] [[🔗 repo](https://github.com/bigcode-project/bigcodebench)]

- **Vibe Checker**: Aligning Code Evaluation with Human Preference [2025-10-arXiv] [[📄 paper](https://arxiv.org/abs/2510.07315)]

- **MULocBench**: A Benchmark for Localizing Code and Non-Code Issues in Software Projects [2025-09-arXiv] [[📄 paper](https://www.arxiv.org/abs/2509.25242)] [[🕸️ website](https://huggingface.co/datasets/somethingone/MULocBench)]

- **SecureAgentBench**: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios [2025-09-arXiv] [[📄 paper](https://arxiv.org/html/2509.22097v1)]

- **SWE-bench Pro**: Can AI Agents Solve Long-Horizon Software Engineering Tasks? [2025-09] [[📄 paper](https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf)] [[🔗 repo](https://github.com/scaleapi/SWE-bench_Pro-os)]

- **AutoCodeBench**: Large Language Models are Automatic Code Benchmark Generators [2025-08-arXiv] [[📄 paper](https://arxiv.org/abs/2508.09101)] [[🔗 repo](https://autocodebench.github.io/)]

- **LiveRepoReflection**: Turning the Tide: Repository-based Code Reflection [2025-07-arXiv] [[📄 paper](https://arxiv.org/abs/2507.09866)] [[🔗 repo](https://livereporeflection.github.io/)]

- **SWE-Perf**: Can Language Models Optimize Code Performance on Real-World Repositories? [2025-07-arXiv] [[📄 paper](https://arxiv.org/abs/2507.12415)] [[🔗 repo](https://swe-perf.github.io/)]

- **ResearchCodeBench**: Benchmarking LLMs on Implementing Novel Machine Learning Research Code [2025-06-arXiv] [[📄 paper](https://arxiv.org/abs/2506.02314)] [[🔗 repo](https://researchcodebench.github.io/)]

- **SWE-Factory**: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks [2025-06-arXiv] [[📄 paper](https://arxiv.org/abs/2506.10954v1)] [[🔗 repo](https://github.com/DeepSoftwareAnalytics/swe-factory)]

- **UTBoost**: Rigorous Evaluation of Coding Agents on SWE-Bench [ACL-2025] [[📄 paper](https://arxiv.org/abs/2506.09289)]

- **SWE-Flow**: Synthesizing Software Engineering Data in a Test-Driven Manner [ICML-2025] [[📄 paper](https://arxiv.org/abs/2506.09003)] [[🔗 repo](https://github.com/Hambaobao/SWE-Flow)]

- **AgentIssue-Bench**: Can Agents Fix Agent Issues? [2025-08-arXiv] [[📄 paper](https://arxiv.org/pdf/2505.20749)] [[🔗 repo](https://github.com/alfin06/AgentIssue-Bench)]

- **CodeAssistBench (CAB)**: Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance [2025-07-arXiv] [[📄 paper](https://arxiv.org/abs/2507.10646)]

- **OmniGIRL**: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution [2025-05-ISSTA] [[📄 paper](https://arxiv.org/abs/2505.04606)] [[🔗 repo](https://github.com/DeepSoftwareAnalytics/OmniGIRL)]

- **CodeFlowBench**: A Multi-turn, Iterative Benchmark for Complex Code Generation [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.21751)] [[🔗 repo](https://github.com/Rise-1210/codeflow)]

- **SWE-Smith**: Scaling Data for Software Engineering Agents [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.21798)] [[🔗 repo](https://swesmith.com/)]

- **SWE-Synth**: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.14757v1)] [[🔗 repo](https://github.com/FSoft-AI4Code/SWE-Synth)]

- Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study [2025-03-arXiv] [[📄 paper](https://arxiv.org/abs/2503.15223)]

- **Unveiling Pitfalls**: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.12374)]

- **ConvCodeWorld**: Benchmarking Conversational Code Generation in Reproducible Feedback Environments [2025-02-arXiv] [[📄 paper](https://arxiv.org/abs/2502.19852)] [[🔗 repo](https://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld)]

- **SWE-Lancer**: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? [2025-02-arXiv] [[📄 paper](https://arxiv.org/pdf/2502.12115)] [[🔗 repo](https://github.com/openai/SWELancer-Benchmark)]

- Evaluating Agent-based Program Repair at Google [2025-01-arXiv] [[📄 paper](https://arxiv.org/pdf/2501.07531)]

- **SWE-rebench**: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.20411)] [[🕸️ website](https://swe-rebench.com/leaderboard)]

- **SWE-bench-Live**: A Live Benchmark for Repository-Level Issue Resolution [2025-05-arXiv] [[📄 paper](https://www.arxiv.org/abs/2505.23419)] [[🔗 repo](https://github.com/microsoft/SWE-bench-Live)]

- **FEA-Bench**: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation [2025-05-ACL] [[📄 paper](https://arxiv.org/abs/2503.06680)] [[🔗 repo](https://github.com/microsoft/FEA-Bench)]

- **SWE-PolyBench**: A multi-language benchmark for repository level evaluation of coding agents [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.08703)] [[🔗 repo](https://github.com/FSoft-AI4Code/SWE-PolyBench)]

- **Multi-SWE-bench**: A Multilingual Benchmark for Issue Resolving [2025-04-arXiv] [[📄 paper](https://arxiv.org/abs/2504.02605)] [[🔗 repo](https://github.com/multi-swe-bench/multi-swe-bench)]

- **LibEvolutionEval**: A Benchmark and Study for Version-Specific Code Generation [2025-04-NAACL] [[📄 paper](https://arxiv.org/abs/2412.04478)] [[🕸️ website](https://lib-evolution-eval.github.io/)]

- **SWEE-Bench & SWA-Bench**: Automated Benchmark Generation for Repository-Level Coding Tasks [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.07701)]

- **ProjectEval**: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation [2025-ACL-Findings] [[📄 paper](https://arxiv.org/pdf/2503.07010)] [[🔗 repo](https://github.com/RyanLoil/ProjectEval/)]

- **REPOST-TRAIN**: Scalable Repository-Level Coding Environment Construction with Sandbox Testing [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.07358)] [[🔗 repo](https://github.com/yiqingxyq/RepoST)]

- **Loc-Bench**: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [[📄 paper](https://arxiv.org/pdf/2503.09089)] [[🔗 repo](https://github.com/gersteinlab/LocAgent)]

- **SolEval**: Benchmarking Large Language Models for Repository-level Solidity Code Generation [2025-02-arXiv] [[📄 paper](https://arxiv.org/abs/2502.18793)] [[🔗 repo](https://anonymous.4open.science/r/SolEval-1C06/)]

- **HumanEvo**: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation [2025-ICSE] [[📄 paper](https://www.computer.org/csdl/proceedings-article/icse/2025/056900a764/251mHzzKizu)] [[🔗 repo](https://github.com/DeepSoftwareAnalytics/HumanEvo)]

- **RepoExec**: On the Impacts of Contexts on Repository-Level Code Generation [2025-NAACL] [[📄 paper](https://arxiv.org/abs/2406.11927)] [[🔗 repo](https://github.com/FSoft-AI4Code/RepoExec)]

- **SWE-Gym**: Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.21139)] [[🔗 repo](https://github.com/SWE-Gym/SWE-Gym)]

- **RepoTransBench**: A Real-World Benchmark for Repository-Level Code Translation [2024-12-arXiv] [[📄 paper](https://arxiv.org/abs/2412.17744)] [[🔗 repo](https://github.com/DeepSoftwareAnalytics/RepoTransBench)]

- **Visual SWE-bench**: Issue Resolving with Visual Data [2024-12-arXiv] [[📄 paper](https://arxiv.org/pdf/2412.17315)] [[🔗 repo](https://github.com/luolin101/CodeV)]

- **ExecRepoBench**: Multi-level Executable Code Completion Evaluation [2024-12-arXiv] [[📄 paper](https://arxiv.org/abs/2412.11990)] [[🕸️ website](https://execrepobench.github.io/)]

- **REPOCOD**: Can Language Models Replace Programmers? REPOCOD Says 'Not Yet' [2024-10-arXiv] [[📄 paper](https://arxiv.org/abs/2410.21647)] [[🔗 repo](https://github.com/lt-asset/REPOCOD)]

- **M2RC-Eval**: Massively Multilingual Repository-level Code Completion Evaluation [2024-10-arXiv] [[📄 paper](https://arxiv.org/abs/2410.21157)] [[🔗 repo](https://github.com/M2RC-Eval-Team/M2RC-Eval)]

- **SWE-bench+**: Enhanced Coding Benchmark for LLMs [2024-10-arXiv] [[📄 paper](https://arxiv.org/pdf/2410.06992)]

- **SWE-bench Multimodal**: Multimodal Software Engineering Benchmark [2024-10-arXiv] [[📄 paper](https://arxiv.org/abs/2410.03859)] [[🕸️ website](https://swebench.com/multimodal)]

- **Codev-Bench**: How Do LLMs Understand Developer-Centric Code Completion? [2024-10-arXiv] [[📄 paper](https://arxiv.org/abs/2410.01353)] [[🔗 repo](https://github.com/LingmaTongyi/Codev-Bench)]

- **SWT-Bench**: Testing and Validating Real-World Bug-Fixes with Code Agents [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.12952)] [[🕸️ website](https://swtbench.com/?results=verified)]

- **CodeRAG-Bench**: Can Retrieval Augment Code Generation? [2024-06-arXiv] [[📄 paper](http://arxiv.org/abs/2406.14497)] [[🔗 repo](https://github.com/code-rag-bench/code-rag-bench/tree/main)]

- **R2C2-Bench**: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models [2024-06-arXiv] [[📄 paper](https://arxiv.org/abs/2406.01359)]

- **RepoClassBench**: Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [2024-05-arXiv] [[📄 paper](https://arxiv.org/abs/2405.01573)] [[🔗 repo](https://github.com/microsoft/repoclassbench/tree/main)]

- **DevEval**: Evaluating Code Generation in Practical Software Projects [2024-ACL-Findings] [[📄 paper](https://aclanthology.org/2024.findings-acl.214.pdf)] [[🔗 repo](https://github.com/seketeam/DevEval)]

- **CodeAgentBench**: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges [2024-ACL] [[📄 paper](https://aclanthology.org/2024.acl-long.737/)]

- **RepoBench**: Benchmarking Repository-Level Code Auto-Completion Systems [2024-ICLR] [[📄 paper](https://openreview.net/forum?id=pPjZIOuQuF)] [[🔗 repo](https://github.com/Leolty/repobench)]

- **SWE-bench**: Can Language Models Resolve Real-World GitHub Issues? [2024-ICLR] [[📄 paper](https://arxiv.org/pdf/2310.06770)] [[🔗 repo](https://github.com/princeton-nlp/SWE-bench)]

- **CrossCodeLongEval**: Repoformer: Selective Retrieval for Repository-Level Code Completion [2024-ICML] [[📄 paper](https://arxiv.org/abs/2403.10059)] [[🔗 repo](https://repoformer.github.io/)]

- **R2E-Eval**: Turning Any GitHub Repository into a Programming Agent Test Environment [2024-ICML] [[📄 paper](https://proceedings.mlr.press/v235/jain24c.html)] [[🔗 repo](https://r2e.dev/)]

- **RepoEval**: Repository-Level Code Completion Through Iterative Retrieval and Generation [2023-EMNLP] [[📄 paper](https://aclanthology.org/2023.emnlp-main.151/)] [[🔗 repo](https://github.com/microsoft/CodeT/tree/main/RepoCoder)]

- **CrossCodeEval**: A Diverse and Multilingual Benchmark for Cross-File Code Completion [2023-NeurIPS] [[📄 paper](https://proceedings.neurips.cc/paper_files/paper/2023/file/920f2dced7d32ab2ba2f1970bc306af6-Paper-Datasets_and_Benchmarks.pdf)] [[🕸️ website](https://crosscodeeval.github.io/)]

- **Skeleton-Guided-Translation**: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation [2025-01-arXiv] [[📄 paper](https://arxiv.org/abs/2501.16050)] [[🔗 repo](https://github.com/microsoft/TransRepo)]

- **SWE-Dev**: Evaluating and Training Autonomous Feature-Driven Software Development [2025-05-arXiv] [[📄 paper](https://arxiv.org/abs/2505.16975)] [[🔗 repo](https://github.com/justLittleWhite/SWE-Dev)]

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=YerbaPage/Awesome-Repo-Level-Code-Generation&type=Date)](https://www.star-history.com/#YerbaPage/Awesome-Repo-Level-Code-Generation&Date)
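
If you want to mine this list programmatically, the short sketch below is one way to pull out every linked URL. It is a convenience sketch rather than official tooling for this repo: it assumes the README is saved locally as `README.md` and that entries keep the `[[📄 paper](...)]`-style links used above.

```python
import re

# Match the [[📄 paper](...)] / [[🔗 repo](...)] / [[🕸️ website](...)] / [[📄 blog](...)]
# link pattern used throughout this README, capturing only the URL.
# (URLs that themselves contain parentheses may be skipped by this simple pattern.)
LINK_PATTERN = re.compile(r"\[\[[^\]]*?(?:paper|repo|website|blog)\]\((https?://[^)\s]+)\)\]")

with open("README.md", encoding="utf-8") as f:
    readme = f.read()

for url in LINK_PATTERN.findall(readme):
    print(url)
```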