├── assets
├── evolve-path.png
├── evolve-tree.jpg
└── evolve-framework.png
├── LICENSE
└── README.md
/assets/evolve-path.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EvoAgentX/Awesome-Self-Evolving-Agents/HEAD/assets/evolve-path.png
--------------------------------------------------------------------------------
/assets/evolve-tree.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EvoAgentX/Awesome-Self-Evolving-Agents/HEAD/assets/evolve-tree.jpg
--------------------------------------------------------------------------------
/assets/evolve-framework.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EvoAgentX/Awesome-Self-Evolving-Agents/HEAD/assets/evolve-framework.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 EvoAgentX
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | Awesome-Self-Evolving-Agents
7 |
8 |
9 |
10 |
11 | [](https://awesome.re)
12 | [](https://arxiv.org/abs/2508.07407)
13 | [](https://github.com/EvoAgentX/Awesome-Self-Evolving-Agents/pulls)
14 | [](https://star-history.com/#EvoAgentX/Awesome-Self-Evolving-Agents)
15 | [](LICENSE)
16 |
17 |
18 | 🤖 We're still cooking — Stay tuned! 🤖
⭐ Give us a star if you like it! ⭐
19 |
20 |
21 |

22 |
23 |
Figure: A visual taxonomy of AI agent evolution and optimisation techniques, categorised into three major directions: single-agent optimisation, multi-agent optimisation, and domain-specific optimisation. The tree structure illustrates the development of these approaches from 2023 to 2025, including representative methods within each branch.
24 |
25 |
26 |
27 | ## AI Agents Development Path
28 |
29 |
30 |
31 |
32 |
33 | ## Conceptual Framework of the Self-Evolving AI Agents
34 |
35 |
36 |
37 |
38 | ## Open-Source Framework
39 | - (*EMNLP'25 Demo*) **EvoAgentX**: An Automated Framework for Evolving Agentic Workflows
40 | [[💻 Code](https://github.com/EvoAgentX/EvoAgentX)] [[📝 Paper](https://arxiv.org/abs/2507.03616)]
41 | - (*Arxiv'25*) MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems [[📝 Paper](https://arxiv.org/abs/2505.16988)] [[💻 Code](https://github.com/MASWorks/MASLab)]
42 |
43 |
44 | ## 1. Single-Agent Optimisation
45 |
46 | ### 1.1 🤖 LLM Behaviour Optimisation
47 |
48 | #### 1.1.1 📌 Training-Based Behaviour Optimisation
49 |
50 | ##### (1) 🔧 Supervised Fine-Tuning Approaches
51 | - (*ICLR'24*) ToRA: A tool-integrated reasoning agent for mathematical problem solving [[📝 Paper](https://arxiv.org/abs/2309.17452)] [[💻 Code](https://github.com/microsoft/ToRA)]
52 | - (*NeurIPS'22*) STaR : Bootstrapping reasoning with reasoning [[📝 Paper](https://arxiv.org/abs/2203.14465)] [[💻 Code](https://github.com/ezelikman/STaR)]
53 | - (*Arxiv'24*) NExT: Teaching large language models to reason about code execution [[📝 Paper](https://arxiv.org/abs/2404.14662)]
54 | - (*EMNLP'24*) MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning [[📝 Paper](https://arxiv.org/abs/2405.07551)]
55 | - (*ICML'25*) MAS-GPT: Training LLMs to build LLM-based multi-agent systems [[📝 Paper](https://arxiv.org/abs/2503.03686)] [[💻 Code](https://github.com/MASWorks/MAS-GPT)]
56 |
57 | ##### (2) 🔧 Reinforcement Learning Approaches
58 | - (*ICML'24*) Self-Rewarding Language Models [[📝 Paper](https://arxiv.org/abs/2401.10020)] [[💻 Code](https://github.com/lucidrains/self-rewarding-lm-pytorch)]
59 | - (*Arxiv'24*) Tulu 3: Pushing Frontiers in Open Language Model Post-Training [[📝 Paper](https://arxiv.org/abs/2411.15124)] [[💻 Code](https://github.com/allenai/open-instruct)]
60 | - (*EMNLP'24*) Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [[📝 Paper](https://arxiv.org/abs/2402.00658)] [[💻 Code](https://github.com/SparkJiao/dpo-trajectory-reasoning)]
61 | - (*Arxiv'24*) Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [[📝 Paper](https://arxiv.org/abs/2408.07199)]
62 | - (*Arxiv'24*) DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [[📝 Paper](https://arxiv.org/abs/2405.14333)]
63 | - (*ICML'25*) Diving into Self-Evolving Training for Multimodal Reasoning [[📝 Paper](https://arxiv.org/abs/2412.17451)] [[💻 Code](https://github.com/hkust-nlp/mstar)]
64 | - (*Arxiv'25*) Absolute Zero: Reinforced Self-play Reasoning with Zero Data [[📝 Paper](https://arxiv.org/abs/2505.03335)]
65 | - (*Arxiv'25*) R-Zero: Self-Evolving Reasoning LLM from Zero Data [[📝 Paper](https://arxiv.org/abs/2508.05004)] [[💻 Code](https://github.com/Chengsong-Huang/R-Zero)]
66 | - (*Arxiv'25*) SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning [[📝 Paper](https://arxiv.org/abs/2506.24119)] [[💻 Code](https://github.com/spiral-rl/spiral)]
67 | - (*Arxiv'25*) DistFlow: A Fully Distributed RL Framework for Scalable and Efficient LLM Post-Training [[📝 Paper](https://arxiv.org/abs/2507.13833)] [[💻 Code](https://github.com/sii-research/siiRL)]
68 | - (*Arxiv'25*) Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play [[📝 Paper](https://www.arxiv.org/abs/2509.25541)] [[💻 Code](https://github.com/wangqinsi1/Vision-Zero)]
69 | - (*Arxiv'25*) Parallel-R1: Towards Parallel Thinking via Reinforcement Learning [[📝 Paper](https://arxiv.org/abs/2509.07980)] [[💻 Code](https://github.com/zhengkid/Parallel-R1)]
70 | - (*Arxiv'25*) SSRL: Self-Search Reinforcement Learning [[📝 Paper](https://arxiv.org/abs/2508.10874)] [[💻 Code](https://github.com/TsinghuaC3I/SSRL)]
71 | - (*Arxiv'25*) SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data [[📝 Paper](https://arxiv.org/abs/2505.20347)] [[💻 Code](https://github.com/wantbook-book/SeRL)]
72 |
73 |
74 | #### 1.1.2 📌 Test-Time Behaviour Optimisation
75 |
76 | ##### (1) 🔧 Feedback-Based Approaches
77 | - (*ICLR'23*) CodeT: Code Generation with Generated Tests [[📝 Paper](https://arxiv.org/abs/2207.10397)] [[💻 Code](https://github.com/microsoft/CodeT)]
78 | - (*ICML'23*) LEVER: Learning to Verify Language-to-Code Generation with Execution [[📝 Paper](https://arxiv.org/abs/2302.08468)] [[💻 Code](https://github.com/niansong1996/lever)]
79 | - (*ESEC/FSE'23*) Baldur: Whole-Proof Generation and Repair with Large Language Models [[📝 Paper](https://arxiv.org/abs/2303.04910)]
80 | - (*ACL'24*) Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [[📝 Paper](https://arxiv.org/abs/2312.08935)]
81 | - (*EMNLP'24*) Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [[📝 Paper](https://arxiv.org/abs/2402.00658)] [[💻 Code](https://github.com/SparkJiao/dpo-trajectory-reasoning)]
82 | - (*Arxiv'24*) Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs [[📝 Paper](https://arxiv.org/abs/2410.18451)]
83 | - (*ICLR'25*) Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [[📝 Paper](https://arxiv.org/abs/2410.08146)]
84 | - (*Arxiv'25*) Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy [[📝 Paper](https://arxiv.org/abs/2507.01352)] [[💻 Code](https://github.com/SkyworkAI/Skywork-Reward-V2)]
85 |
86 |
87 | ##### (2) 🔧 Search-Based Approaches
88 | - (*ICLR'23*) Self-consistency improves chain of thought reasoning in language models [[📝 Paper](https://arxiv.org/abs/2203.11171)]
89 | - (*ACL'23*) Solving Math Word Problems via Cooperative Reasoning induced Language Models [[📝 Paper](https://arxiv.org/abs/2210.16257)] [[💻 Code](https://github.com/TianHongZXY/CoRe)]
90 | - (*NeurIPS'23*) Tree of thoughts: Deliberate problem solving with large language models [[📝 Paper](https://arxiv.org/abs/2305.10601)] [[💻 Code](https://github.com/princeton-nlp/tree-of-thought-llm)]
91 | - (*NeurIPS'24*) Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models [[📝 Paper](https://arxiv.org/abs/2406.04271)] [[💻 Code](https://github.com/YangLing0818/buffer-of-thought-llm)]
92 | - (*COLM'24*) Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning [[📝 Paper](https://arxiv.org/abs/2401.17686)] [[💻 Code](https://github.com/OSU-NLP-Group/Deductive-Beam-Search)]
93 | - (*AAAI'24*) Graph of thoughts: Solving elaborate problems with large language models [[📝 Paper](https://arxiv.org/abs/2308.09687)] [[💻 Code](https://github.com/spcl/graph-of-thoughts)]
94 | - (*ICML'25*) Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [[📝 Paper](https://arxiv.org/abs/2412.09078)] [[💻 Code](https://github.com/iamhankai/Forest-of-Thought)]
95 |
96 | ##### (3)🔧 Reasoning-Based Approaches
97 | - (*EMNLP’25*) START: Self‑taught Reasoner with Tools [[📝 Paper](https://arxiv.org/abs/2503.04625)]
98 | - (*ArXiv’25*) CoRT: Code‑integrated Reasoning within Thinking [[📝 Paper](https://arxiv.org/abs/2506.09820)] [[💻 Code](https://github.com/ChengpengLi1003/CoRT)]
99 |
100 | ### 1.2 💬 Prompt Optimisation
101 |
102 | #### 1.2.1 📌 Edit-Based Prompt Optimisation
103 | - (*EMNLP'22*) GPS: Genetic Prompt Search for Efficient Few-shot Learning [[📝 Paper](https://arxiv.org/abs/2210.17041)] [[💻 Code](https://github.com/hwxu20/GPS)]
104 | - (*EACL'23*) GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models [[📝 Paper](https://arxiv.org/abs/2203.07281)] [[💻 Code](https://github.com/archiki/GrIPS)]
105 | - (*ICLR'23*) TEMPERA: Test-Time Prompting via Reinforcement Learning [[📝 Paper](https://arxiv.org/abs/2211.11890)] [[💻 Code](https://github.com/tianjunz/TEMPERA)]
106 | - (*ACL'24*) Plum: Prompt Learning using Metaheuristic [[📝 Paper](https://arxiv.org/abs/2311.08364)] [[💻 Code](https://github.com/research4pan/Plum)]
107 |
108 | #### 1.2.2 📌 Evolutionary Prompt Optimisation
109 | - (*ICLR'24*) EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers [[📝 Paper](https://arxiv.org/abs/2309.08532)] [[💻 Code](https://github.com/beeevita/EvoPrompt)]
110 | - (*ICML'24*) Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution [[📝 Paper](https://arxiv.org/abs/2309.16797)]
111 | - (*Arxiv'25*) GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning [[📝 Paper](https://arxiv.org/abs/2507.19457)]
112 |
113 | #### 1.2.3 📌 Generative Prompt Optimisation
114 | - (*ICLR'23*) Large Language Models Are Human-Level Prompt Engineers [[📝 Paper](https://arxiv.org/abs/2211.01910)] [[💻 Code](https://github.com/keirp/automatic_prompt_engineer)]
115 | - (*ICLR'24*) PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization [[📝 Paper](https://arxiv.org/abs/2310.16427)] [[💻 Code](https://github.com/XinyuanWangCS/PromptAgent)]
116 | - (*ICLR'24*) Large Language Models as Optimizers [[📝 Paper](https://arxiv.org/abs/2309.03409)] [[💻 Code](https://github.com/google-deepmind/opro)]
117 | - (*ICLR'24*) Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization [[📝 Paper](https://arxiv.org/abs/2308.02151)] [[💻 Code](https://github.com/weirayao/Retroformer)]
118 | - (*EMNLP'24*) Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs [[📝 Paper](https://arxiv.org/abs/2406.11695)] [[💻 Code](https://github.com/stanfordnlp/dspy)]
119 | - (*Arxiv'24*) Prompt Optimization with Human Feedback [[📝 Paper](https://arxiv.org/abs/2405.17346)] [[💻 Code](https://github.com/xqlin98/APOHF)]
120 | - (*Arxiv'24*) StraGo: Harnessing Strategic Guidance for Prompt Optimization [[📝 Paper](https://arxiv.org/abs/2410.08601)]
121 | - (*Arxiv'25*) Self-Supervised Prompt Optimization [[📝 Paper](https://arxiv.org/abs/2502.06855)]
122 |
123 | #### 1.2.4 📌 Text Gradient-Based Prompt Optimisation
124 | - (*EMNLP'23*) Automatic Prompt Optimization with "Gradient Descent" and Beam Search [[📝 Paper](https://arxiv.org/abs/2305.03495)] [[💻 Code](https://github.com/microsoft/LMOps/tree/main/prompt_optimization)]
125 | - (*Arxiv'24*) TextGrad: Automatic "Differentiation" via Text [[📝 Paper](https://arxiv.org/abs/2406.07496)] [[💻 Code](https://github.com/zou-group/textgrad)]
126 | - (*Arxiv'24*) How to Correctly do Semantic Backpropagation on Language-based Agentic Systems [[📝 Paper](https://arxiv.org/abs/2412.03624)] [[💻 Code](https://github.com/HishamAlyahya/semantic_backprop)]
127 | - (*Arxiv'24*) GRAD-SUM: Leveraging Gradient Summarization for Optimal Prompt Engineering [[📝 Paper](https://arxiv.org/abs/2407.12865)]
128 | - (*AAAI'25*) Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers [[📝 Paper](https://arxiv.org/abs/2402.17564)] [[💻 Code](https://github.com/RUCAIBox/GPO)]
129 | - (*ICML'25*) REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization [[📝 Paper](https://arxiv.org/abs/2412.03092)] [[💻 Code](https://github.com/Peiyance/REVOLVE)]
130 | - (*Arxiv'25*) PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time [[📝 Paper](https://arxiv.org/abs/2506.06254)]
131 |
132 | ### 1.3 🧠 Memory Optimization
133 | - (*ICML'24*) A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts [[📝 Paper](https://arxiv.org/abs/2402.09727)]
134 | - (*ICML'24*) Agent Workflow Memory [[📝 Paper](https://arxiv.org/abs/2409.07429)]
135 | - (*AAAI'24*) MemoryBank: Enhancing Large Language Models with Long-Term Memory [[📝 Paper](https://arxiv.org/abs/2305.10250)]
136 | - (*EMNLP'24*) GraphReader: Building graph-based agent to enhance long-context [[📝 Paper](https://arxiv.org/abs/2406.14550)]
137 | - (*Arxiv'24*) "My agent understands me better": Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents [[📝 Paper](https://arxiv.org/abs/2404.00573)]
138 | - (*ICLR'25*) Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations [[📝 Paper](https://arxiv.org/abs/2402.11975)]
139 | - (*ICLR'25*) Boosting knowledge intensive reasoning of llms via inference-time hybrid information [[📝 Paper](https://arxiv.org/abs/2410.08815)] [[💻 Code](https://github.com/icip-cas/StructRAG)]
140 | - (*ACL'25*) Improving factuality with explicit working memory [[📝 Paper](https://arxiv.org/abs/2412.18069)]
141 | - (*Arxiv'25*) A-MEM: Agentic Memory for LLM Agents [[📝 Paper](https://arxiv.org/abs/2502.12110)]
142 | - (*Arxiv'25*) Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory [[📝 Paper](https://arxiv.org/abs/2504.19413)]
143 | - (*Arxiv'25*) Memento: Fine‑tuning LLM Agents without Fine‑tuning LLMs [[📝 Paper](https://www.arxiv.org/abs/2508.16153)] [[💻 Code](https://github.com/Agent-on-the-Fly/Memento)]
144 | - (*Arxiv'25*) Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning [[📝 Paper](https://arxiv.org/abs/2508.19828)]
145 | - (*Arxiv'25*) Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory [[📝 Paper](https://arxiv.org/abs/2508.09736)] [[💻 Code](https://github.com/bytedance-seed/m3-agent)]
146 | - (*Arxiv'25*) PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time [[📝 Paper](https://arxiv.org/abs/2506.06254)]
147 |
148 |
149 | ### 1.4 🧰 Tool Optimization
150 |
151 | #### 1.4.1 📌 Training-Based Tool Optimisation
152 |
153 | ##### (1) Supervised Fine-Tuning for Tool Optimisation
154 | - (*NeurIPS'23*) GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [[📝 Paper](https://arxiv.org/abs/2305.18752)] [[💻 Code](https://github.com/AILab-CVC/GPT4Tools)]
155 | - (*ICLR'24*) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [[📝 Paper](https://arxiv.org/abs/2307.16789)] [[💻 Code](https://github.com/OpenBMB/ToolBench)]
156 | - (*ACL'24*) LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [[📝 Paper](https://arxiv.org/abs/2403.04746)] [[💻 Code](https://github.com/microsoft/simulated-trial-and-error)]
157 | - (*AAAI'24*) Confucius: Iterative tool learning from introspection feedback by easy-to-difficult curriculum [[📝 Paper](https://arxiv.org/abs/2308.14034)] [[💻 Code](https://github.com/shizhl/Confucius)]
158 | - (*ICLR'25*) Learning Evolving Tools for Large Language Models [[📝 Paper](https://arxiv.org/abs/2410.06617)] [[💻 Code](https://github.com/Chen-GX/ToolEVO)]
159 | - (*ICLR'25*) Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning [[📝 Paper](https://arxiv.org/abs/2410.12952)] [[💻 Code](https://github.com/PKU-Baichuan-MLSystemLab/BUTTON)]
160 | - (*ICLR'25*) Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage [[📝 Paper](https://arxiv.org/abs/2412.15606)] [[💻 Code](https://github.com/mat-agent/MAT-Agent)]
161 | - (*Arxiv'25*) Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation [[📝 Paper](https://arxiv.org/abs/2503.07826)]
162 | - (*ICML'25*) Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation [[📝 Paper](https://arxiv.org/abs/2411.00412)] [[💻 Code](https://github.com/Rose-STL-Lab/Adapting-While-Learning)]
163 |
164 |
165 | ##### (2) Reinforcement Learning for Tool Optimisation
166 | - (*Arxiv'25*) ReTool: Reinforcement Learning for Strategic Tool Use in LLMs [[📝 Paper](https://arxiv.org/abs/2504.11536)] [[💻 Code](https://github.com/ReTool-RL/ReTool)]
167 | - (*Arxiv'25*) ToolRL: Reward is All Tool Learning Needs [[📝 Paper](https://arxiv.org/abs/2504.13958)] [[💻 Code](https://github.com/qiancheng0/ToolRL)]
168 | - (*Arxiv'25*) Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning [[📝 Paper](https://arxiv.org/abs/2505.00024)] [[💻 Code](https://github.com/NVlabs/Tool-N1)]
169 | - (*Arxiv'25*) Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use [[📝 Paper](https://arxiv.org/abs/2504.04736)]
170 | - (*Arxiv'25*) Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning [[📝 Paper](https://arxiv.org/abs/2504.21561)] [[💻 Code](https://github.com/SPORT-Agents/SPORT-Agents)]
171 | - (*Arxiv'25*) Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning [[📝 Paper](https://arxiv.org/abs/2505.16410)] [[💻 Code](https://github.com/RUC-NLPIR/Tool-Star)]
172 | - (*Arxiv'25*) Agentic Reinforced Policy Optimization [[📝 Paper](https://arxiv.org/abs/2507.19849)] [[💻 Code](https://github.com/RUC-NLPIR/ARPO)]
173 | - (*Arxiv'25*) AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning [[📝 Paper](https://arxiv.org/abs/2507.21836)] [[💻 Code](https://github.com/weiyifan1023/AutoTIR)]
174 |
175 |
176 | #### 1.4.2 📌 Inference-Time Tool Optimisation
177 |
178 | ##### (1) Prompt-Based Optimisation
179 | - (*NAACL'25*) EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [[📝 Paper](https://arxiv.org/abs/2401.06201)] [[💻 Code](https://github.com/microsoft/JARVIS/tree/main/easytool)]
180 | - (*ICLR'25*) From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions [[📝 Paper](https://arxiv.org/abs/2410.08197)] [[💻 Code](https://github.com/quchangle1/DRAFT)]
181 | - (*ACL'25*) Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play [[📝 Paper](https://arxiv.org/abs/2503.14432)] [[💻 Code](https://github.com/wfangtw/play2prompt)]
182 |
183 | ##### (2) Reasoning-Based Optimisation
184 | - (*ICLR'24*) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [[📝 Paper](https://arxiv.org/abs/2307.16789)] [[💻 Code](https://github.com/beijixiong1/ToolLLM)]
185 | - (*ICLR'24*) ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [[📝 Paper](https://arxiv.org/abs/2310.13227)]
186 | - (*ICLR'25*) Tool-Planner: Task Planning with Clusters across Multiple Tools [[📝 Paper](https://arxiv.org/abs/2406.03807)] [[💻 Code](https://github.com/OceannTwT/Tool-Planner)]
187 | - (*Arxiv'25*) MCP-Zero: Active Tool Discovery for Autonomous LLM Agents [[📝 Paper](https://arxiv.org/abs/2506.01056)][[💻 Code](https://github.com/xfey/MCP-Zero)]
188 |
189 |
190 | #### 1.4.3 📌 Tool Functionality Optimisation
191 | - (*EMNLP'23*) CREATOR : Tool creation for disentangling abstract and concrete reasoning of large language model [[📝 Paper](https://arxiv.org/abs/2305.14318)] [[💻 Code](https://github.com/qiancheng0/CREATOR)]
192 | - (*ICML'24*) Offline Training of Language Model Agents with Functions as Learnable Weights [[📝 Paper](https://arxiv.org/abs/2402.11359)]
193 | - (*CVPR'24*) CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update [[📝 Paper](https://arxiv.org/abs/2312.10908)] [[💻 Code](https://github.com/clova-tool/CLOVA-tool)]
194 | - (*Arxiv'25*) Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution [[📝 Paper](https://arxiv.org/abs/2505.20286)] [[💻 Code](https://github.com/CharlesQ9/Alita)]
195 |
196 |
197 | ### 1.5 🧰 Unified Optimization
198 | - (*Arxiv'25*) Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark [[📝 Paper](https://arxiv.org/abs/2508.19005)] [[💻 Code](https://github.com/ECNU-ICALK/ELL-StuLife)]
199 | - (*Arxiv'25*) EvoAgent: Self-evolving Agent with Continual World Model for Long-Horizon Tasks [[📝 Paper](https://arxiv.org/abs/2502.05907)]
200 |
201 |
202 | ## 2. Multi-Agent Optimisation
203 |
204 | ### 2.1 ⚙️ Automatic Multi-Agent Construction
205 | - (*ICML'25*) MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines [[📝 Paper](https://arxiv.org/abs/2507.22606)] [[💻 Code](https://github.com/SaFoLab-WISC/MetaAgent/)]
206 |
207 | ### 2.2 🚀 MAS Optimisation
208 | - (*Arxiv' 25*) R&D-Agent: Automating Data-Driven AI Solution Building Through LLM-Powered Automated Research, Development, and Evolution [[📝 Paper](https://arxiv.org/abs/2505.14738)] [[💻 Code](https://github.com/microsoft/RD-Agent)]
209 | - (*ICML'25*) Multi-Agent Architecture Search via Agentic Supernet [[📝 Paper](https://doi.org/10.48550/arXiv.2502.04180)] [[💻Code](https://github.com/bingreeky/MaAS)]
210 | - (*ICML'25*) MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving [[📝 Paper](https://arxiv.org/abs/2503.03205)]
211 | - (*ICLR'25*) AFlow: Automating Agentic Workflow Generation [[📝 Paper](https://arxiv.org/abs/2410.10762)] [[💻 Code](https://github.com/geekan/MetaGPT/tree/main/examples/aflow)]
212 | - (*ICLR'25*) WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [[📝 Paper](https://openreview.net/forum?id=3Hy00Wvabi)]
213 | - (*ICLR'25*) Flow: Modularized Agentic Workflow Automation [[📝 Paper](https://openreview.net/forum?id=sLKDbuyq99)]
214 | - (*ICLR'25*) Automated Design of Agentic Systems [[📝 Paper](https://arxiv.org/abs/2408.08435)] [[💻 Code](https://github.com/ShengranHu/ADAS)]
215 | - (*Arxiv'25*) FlowReasoner: Reinforcing Query-Level Meta-Agents [[📝 Paper](https://doi.org/10.48550/arXiv.2504.15257)]
216 | - (*Arxiv'25*) AgentNet: Decentralized Evolutionary Coordination for LLM-Based Multi-Agent Systems [[📝 Paper](https://arxiv.org/abs/2504.00587)]
217 | - (*Arxiv'25*) MAS-GPT: Training LLMs to Build LLM-Based Multi-Agent Systems [[📝 Paper](https://arxiv.org/abs/2503.03686)]
218 | - (*Arxiv'25*) FlowAgent: Achieving Compliance and Flexibility for Workflow Agents [[📝 Paper](https://doi.org/10.48550/arXiv.2502.14345)]
219 | - (*Arxiv'25*) ScoreFlow: Mastering LLM Agent Workflows via Score-Based Preference Optimization [[📝 Paper](https://arxiv.org/abs/2502.04306)] [[💻 Code](https://github.com/Gen-Verse/ScoreFlow)]
220 | - (*Arxiv'25*) Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies [[📝 Paper](https://arxiv.org/abs/2502.02533)]
221 | - (*Arxiv'25*) MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision [[📝 Paper](https://arxiv.org/abs/2505.14996)]
222 | - (*Arxiv'25*) MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming [[📝 Paper](https://arxiv.org/abs/2505.22967)]
223 | - (*ICML'24*) GPTSwarm: Language Agents as Optimizable Graphs [[📝 Paper](https://arxiv.org/abs/2402.16823)] [[Code](https://github.com/metauto-ai/gptswarm)]
224 | - (*ICLR'24*) DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines [[📝 Paper](https://openreview.net/forum?id=sY5N0zY5Od)] [[💻 Code](https://github.com/stanfordnlp/dspy)]
225 | - (*ICLR'24*) AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors [[📝 Paper](https://openreview.net/forum?id=EHg5GDnyq1)] [[💻 Code](https://github.com/OpenBMB/AgentVerse)]
226 | - (*ICLR'24*) MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework [[📝 Paper](https://openreview.net/forum?id=VtmBAGCN7o)] [[💻 Code](https://github.com/geekan/MetaGPT)]
227 | - (*COLM'24*) A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration [[📝 Paper](https://openreview.net/forum?id=XII0Wp1XA9)]
228 | - (*COLM'24*) AutoGen: Enabling next-Gen LLM Applications via Multi-Agent Conversations [[📝 Paper](https://openreview.net/forum?id=BAakY1hNKS)] [[💻 Code](https://github.com/microsoft/autogen)]
229 | - (*Arxiv'24*) G-Designer: Architecting Multi-Agent Communication Topologies via Graph Neural Networks [[📝 Paper](https://doi.org/10.48550/arXiv.2410.11782)]
230 | - (*Arxiv'24*) AutoFlow: Automated Workflow Generation for Large Language Model Agents [[📝 Paper](https://arxiv.org/abs/2407.12821)] [[💻 Code](https://github.com/agiresearch/AutoFlow)]
231 | - (*Arxiv'24*) Symbolic Learning Enables Self-Evolving Agents [[📝 Paper](https://arxiv.org/abs/2406.18532)] [[💻 Code](https://github.com/aiwaves-cn/agents)]
232 | - (*Arxiv'24*) Adaptive In-Conversation Team Building for Language Model Agents [[📝 Paper](https://arxiv.org/abs/2405.19425)]
233 | - (*ICLR'25*) Self-Evolving Multi-Agent Collaboration Networks for Software Development [[📝 Paper](https://openreview.net/forum?id=4R71pdPBZp)] [[💻 Code](https://github.com/yuzhu-cai/rSDE-Bench)]
234 | - (*Arxiv'25*) Chain‑of‑Agents: End‑to‑End Agent Foundation Models via Multi‑Agent Distillation and Agentic RL [[📝 Paper](https://arxiv.org/abs/2508.13167)] [[💻 Code](https://github.com/OPPO-PersonalAI/Agent_Foundation_Models)]
235 | - (*Arxiv’25*) Agent KB: Leveraging Cross‑Domain Experience for Agentic Problem Solving [[📝 Paper](https://arxiv.org/abs/2507.06229)] [[💻 Code](https://github.com/OPPO-PersonalAI/Agent-KB)]
236 |
237 |
238 | ## 3. Domain-Specific Optimisation
239 |
240 | ### 3.1 🧬 Biomedicine
241 |
242 | #### 3.1.1 📌 Medical Diagnosis
243 |
244 | - (*EMNLP'24*) MMedAgent: Learning to Use Medical Tools with Multi-modal Agent [[📝 Paper](https://arxiv.org/abs/2407.02483)] [[💻 Code](https://github.com/Wangyixinxin/MMedAgent)]
245 | - (*NeurIPS'24*) MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making [[📝 Paper](https://arxiv.org/abs/2404.15155)] [[💻 Code](https://github.com/mitmedialab/MDAgents)]
246 | - (*Arxiv'25*) HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research [[📝 Paper](https://arxiv.org/pdf/2508.02621)][[💻 Code](https://github.com/yhzhu99/HealthFlow)]
247 | - (*Arxiv'25*) STELLA: Self-Evolving LLM Agent for Biomedical Research [[📝 Paper](https://arxiv.org/abs/2507.02004)][[💻 Code](https://github.com/zaixizhang/STELLA)]
248 | - (*MICCAI'25*) MedAgentSim: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions [[📝 Paper](https://arxiv.org/pdf/2503.22678)] [[💻 Code](https://github.com/MAXNORM8650/MedAgentSim)]
249 | - (*Arxiv'25*) PathFinder: A Multi-Modal Multi-Agent System
250 | for Medical Diagnostic Decision-Making Applied to Histopathology
[[📝 Paper](https://arxiv.org/pdf/2502.08916)]
251 | - (*Arxiv'25*) MDTeamGPT: A Self-Evolving LLM-based Multi-Agent Framework for Multi-Disciplinary Team Medical Consultation
[[📝 Paper](https://arxiv.org/abs/2503.13856)] [[💻 Code](https://github.com/KaiChenNJ/MDTeamGPT)]
252 | - (*Arxiv'25*) MedAgent-Pro: Towards Evidence-based Multi-modal
253 | Medical Diagnosis via Reasoning Agentic Workflow
[[📝 Paper](https://arxiv.org/pdf/2503.18968)] [[💻 Code](https://github.com/jinlab-imvr/MedAgent-Pro)]
254 | - (*Arxiv'25*) Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs [[📝 Paper](https://arxiv.org/abs/2505.07184)] [[💻 Code](https://github.com/weiyifan1023/senator)]
255 |
256 |
257 | #### 3.1.2 📌 Molecular Discovery
258 |
259 | - (*ACS omega'24*) CACTUS: Chemistry Agent Connecting Tool-Usage to Science [[📝 Paper](https://arxiv.org/abs/2405.00972)] [[💻 Code](https://github.com/pnnl/cactus)]
260 | - (*NMI'24*) ChemCrow: Augmenting large language models with chemistry tools [[📝 Paper](https://arxiv.org/abs/2304.05376)] [[💻 Code](https://github.com/ur-whitelab/chemcrow-public)]
261 | - (*ICLR'25*) ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning[[📝 Paper](https://arxiv.org/abs/2501.06590)] [[💻 Code](https://github.com/gersteinlab/ChemAgent)]
262 | - (*ICLR'25*) OSDA Agent: Leveraging Large Language Models for De Novo Design of Organic Structure Directing Agents [[📝 Paper](https://openreview.net/forum?id=9YNyiCJE3k)]
263 | - (*Arxiv'25*) DrugAgent: Automating AI-aided Drug Discovery Programming through
264 | LLM Multi-Agent Collaboration [[📝 Paper](https://arxiv.org/pdf/2411.15692)]
265 | - (*Arxiv'25*) LIDDIA: Language-based Intelligent Drug Discovery Agent [[📝 Paper](https://arxiv.org/abs/2502.13959)]
266 | - (*Arxiv'25*) GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
[[📝 Paper](https://arxiv.org/abs/2507.21035)] [[💻 Code](https://github.com/Liu-Hy/GenoMAS)]
267 |
268 | ### 3.2 💻 Programming
269 |
270 | #### 3.2.1 📌 Code Refinement
271 |
272 | - (*Arxiv'23*) AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation [[📝 Paper](https://arxiv.org/abs/2312.13010)] [[💻 Code](https://github.com/huangd1999/AgentCoder)]
273 | - (*Arxiv'23*) Self-Refine: Iterative Refinement with Self-Feedback [[📝 Paper](https://arxiv.org/abs/2303.17651)] [[💻 Code](https://github.com/madaan/self-refine)]
274 | - (*EMNLP'24*) CodeAgent: Autonomous Communicative Agents for Code Review [[📝 Paper](https://aclanthology.org/2024.emnlp-main.632/)] [[💻 Code](https://github.com/Daniel4SE/codeagent)]
275 | - (*ICLR'25*) OpenHands: An Open Platform for AI Software Developers as Generalist Agents [[📝 Paper](https://arxiv.org/abs/2407.16741)] [[💻 Code](https://github.com/All-Hands-AI/OpenHands)]
276 | - (*Arxiv'25*) CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation [[📝 Paper](https://arxiv.org/abs/2501.07811)]
277 | - (Arxiv’25) AlphaEvolve: A coding agent for scientific and algorithmic discovery [[📝 Paper](https://arxiv.org/abs/2506.13131)]
278 | - (*Arxiv'25*) Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents [[📝 Paper](https://arxiv.org/pdf/2505.22954)] [[💻 Code](https://github.com/jennyzzt/dgm)]
279 | - (*Software'25*) OpenEvolve: an open-source evolutionary coding agent [[📝 Instructions](https://huggingface.co/blog/codelion/openevolve)] [[💻 Code](https://github.com/codelion/openevolve)]
280 | - (*ICLR'25*) Self-Evolving Multi-Agent Collaboration Networks for Software Development [[📝 Paper](https://openreview.net/forum?id=4R71pdPBZp)] [[💻 Code](https://github.com/yuzhu-cai/rSDE-Bench)]
281 |
282 | #### 3.2.2 📌 Code Debugging
283 |
284 | - (*ACL'23*) Self-Edit: Fault-Aware Code Editor for Code Generation [[📝 Paper](https://arxiv.org/abs/2305.04087)]
285 | - (*ICLR'24*) Teaching Large Language Models to Self-Debug [[📝 Paper](https://arxiv.org/abs/2304.05128)]
286 | - (*ICA'24*) RGD: Multi-LLM based agent debugger via refinement and generation guidance. [[📝 Paper](https://arxiv.org/abs/2410.01242)]
287 | - (*Arxiv'25*) Large Language Model Guided Self-Debugging Code Generation [[📝 Paper](https://arxiv.org/abs/2502.02928)]
288 |
289 |
290 | ### 3.3 Scientific Research
291 | - (*Arxiv’25*) PiFlow: Principle‑aware Scientific Discovery with Multi‑Agent Collaboration [[📝 Paper](https://arxiv.org/abs/2505.15047)] [[💻 Code](https://github.com/amair-lab/PiFlow)]
292 |
293 |
294 | ### 3.4 💰📚 Financial and Legal Research
295 |
296 | #### 3.4.1 📌 Financial Decision-Making
297 |
298 | - (*Arxiv'25*) R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization [[📝 Paper](https://arxiv.org/abs/2505.15155)] [[💻 Code](https://github.com/microsoft/RD-Agent)]
299 | - (*Arxiv'24*) FinRobot: an open-source ai agent platform for financial applications using large language
300 | models [[📝 Paper](https://arxiv.org/abs/2405.14767)] [[💻 Code](https://github.com/AI4Finance-Foundation/FinRobot)]
301 | - (*Arxiv'24*) PEER: Expertizing domain-specific tasks with a multi-agent framework and tuning methods [[📝 Paper](https://arxiv.org/abs/2407.06985)] [[💻 Code](https://github.com/agentuniverse-ai/agentUniverse)]
302 | - (*NeurIPS'25*) Fincon: A synthesized llm multi-agent system with conceptual verbal reinforcement for enhanced
303 | financial decision making [[📝 Paper](https://proceedings.neurips.cc/paper_files/paper/2024/file/f7ae4fe91d96f50abc2211f09b6a7e49-Paper-Conference.pdf)] [[💻 Code](https://github.com/The-FinAI/FinCon)]
304 |
305 |
306 | #### 3.4.2 📌 Legal Reasoning
307 | - (*Arxiv'24*) LawLuo: A Multi-Agent Collaborative Framework for Multi-Round Chinese Legal Consultation [[📝 Paper](https://arxiv.org/abs/2407.16252)]
308 | - (*ICIC'24*) Legalgpt: Legal chain of thought for the legal large
309 | language model multi-agent framework [[📝 Paper](https://dl.acm.org/doi/10.1007/978-981-97-5678-0_3)]
310 | - (*Arxiv'24*) LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model
311 | [[📝 Paper](https://arxiv.org/abs/2406.04614)] [[💻 Code](https://github.com/pengxiao-song/LaWGPT)]
312 | - (*ACL Findings'25*) AgentCourt: Simulating Court with Adversarial Evolvable Lawyer Agents [[📝 Paper](https://aclanthology.org/2025.findings-acl.304/)] [[💻 Code](https://github.com/relic-yuexi/AgentCourt)]
313 |
314 |
315 | ### 3.5 🧩 Other Domain-Specific Optimisation
316 | - (*Arxiv'25*) Agents of Change: Self-Evolving LLM Agents for Strategic Planning [[📝 Paper](https://arxiv.org/abs/2506.04651)]
317 | - (*Arxiv'25*) EarthLink: A Self-Evolving AI Agent for Climate Science [[📝 Paper](https://arxiv.org/abs/2507.17311)] [[🖥️ System](https://earthlink.intern-ai.org.cn/)]
318 | - (*Arxiv'25*) SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience [[📝 Paper](https://arxiv.org/abs/2508.04700)][[💻 Code](https://github.com/SunzeY/SEAgent)]
319 |
320 |
321 | ## 4. Evaluation
322 |
323 | ### 4.1 📈 Benchmark-Based Evaluation
324 |
325 | - (*NeurIPS'23*) OpenAGI: When LLM Meets Domain Experts [[📝 Paper](https://arxiv.org/abs/2304.04370)] [[💻 Code](https://github.com/agiresearch/OpenAGI)]
326 | - (*Arxiv'25*) Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark [[📝 Paper](https://arxiv.org/abs/2508.19005)]
327 | - (*Arxiv'25*) MLGym: A New Framework and Benchmark for Advancing AI Research Agents [[📝 Paper](https://arxiv.org/abs/2502.14499)] [[💻 Code](https://github.com/facebookresearch/MLGym)]
328 | - (*Arxiv'25*) X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs [[📝 Paper](https://arxiv.org/abs/2505.16997)] [[💻 Code](https://github.com/MASWorks/X-MAS)]
329 |
330 | #### 4.1.1 📌 Tool and API-Driven Agents
331 | - (*Arxiv'23*) On the Tool Manipulation Capability of Open-source Large Language Models [[📝 Paper](https://arxiv.org/abs/2305.16504)] [[💻 Code](https://github.com/sambanova/toolbench)]
332 | - (*EMNLP'23*) API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [[📝 Paper](https://arxiv.org/abs/2304.08244)] [[💻 Code](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank)]
333 | - (*NeurIPS'23*) ToolQA: A Dataset for LLM Question Answering with External Tools [[📝 Paper](https://arxiv.org/abs/2306.13304)] [[💻 Code](https://github.com/night-chen/ToolQA)]
334 | - (*ICLR'24*) MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
335 | [[📝 Paper](https://arxiv.org/abs/2310.03128)] [[💻 Code](https://github.com/HowieHwong/MetaTool)]
336 | - (*Arxiv'25*) Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub [[📝 Paper](https://arxiv.org/abs/2312.17294)] [[💻 Code](https://github.com/OpenBMB/OpenAct)]
337 | - (*Arxiv'25*) LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries [[📝 Paper](https://arxiv.org/abs/2508.15760)]
338 |
339 | #### 4.1.2 📌 Web Navigation and Browsing Agents
340 | - (*ICLR'24*) WebArena: A Realistic Web Environment for Building Autonomous Agents [[📝 Paper](https://arxiv.org/abs/2307.13854)] [[💻 Code](https://github.com/web-arena-x/webarena)]
341 | - (*Arxiv'25*) BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [[📝 Paper](https://arxiv.org/abs/2504.12516)] [[💻 Code](https://github.com/openai/simple-evals)]
342 | - (*ACL'25*) WebWalker: Benchmarking LLMs in Web Traversal [[📝 Paper](https://arxiv.org/abs/2501.07572)] [[💻 Code](https://github.com/Alibaba-NLP/WebAgent)]
343 |
344 | #### 4.1.3 📌 Coding Agents
345 | - (*ICLR'24*) SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [[📝 Paper](https://arxiv.org/abs/2310.06770)] [[💻 Code](https://github.com/SWE-bench/SWE-bench)]
346 | - (*ICLR'25*) Self-Evolving Multi-Agent Collaboration Networks for Software Development [[📝 Paper](https://openreview.net/forum?id=4R71pdPBZp)] [[💻 Code](https://github.com/yuzhu-cai/rSDE-Bench)]
347 |
348 | #### 4.1.4 Scientific Research Agents
349 | - (*Arxiv'25*) DataSciBench: An LLM Agent Benchmark for Data Science [[📝 Paper](https://arxiv.org/abs/2502.13897)] [[💻 Code](https://github.com/THUDM/DataSciBench)]
350 |
351 | #### 4.1.4 📌 Multi-Agent Collaboration and Generalists
352 | - (*ICLR'23*) GAIA: a benchmark for General AI Assistants [[📝 Paper](https://arxiv.org/abs/2311.12983)] [[💻 Code](https://huggingface.co/gaia-benchmark)]
353 | - (*ICLR'24*) AgentBench: Evaluating LLMs as Agents [[📝 Paper](https://arxiv.org/abs/2308.03688)] [[💻 Code](https://github.com/THUDM/AgentBench)]
354 | - (*Arxiv'25*) MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents [[📝 Paper](https://arxiv.org/abs/2503.01935)] [[💻 Code](https://github.com/MultiagentBench/MARBLE)]
355 | - (*Arxiv'25*) Benchmarking LLMs' Swarm intelligence [[📝 Paper](https://arxiv.org/abs/2505.04364)] [[💻 Code](https://github.com/RUC-GSAI/YuLan-SwarmIntell)]
356 |
357 | #### 4.1.5 📌 GUI and Multimodal Environment Agents
358 | - (*ACL'24*) Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents [[📝 Paper](https://arxiv.org/abs/2407.00993)] [[💻 Code](https://github.com/XiaoMi/MobileBench)]
359 | - (*NeurIPS'24*) OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [[📝 Paper](https://arxiv.org/abs/2404.07972)] [[💻 Code](https://github.com/xlang-ai/OSWorld)]
360 | - (*ICLR'25*) AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents [[📝 Paper](https://arxiv.org/abs/2405.14573)] [[💻 Code](https://github.com/google-research/android_world)]
361 |
362 | ### 4.2 ⚖️ LLM-Based Evaluation
363 |
364 | #### 4.2.1 📌 LLM-as-a-Judge
365 |
366 | - (*Arxiv'24*) Towards Better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications [[📝 Paper](https://arxiv.org/abs/2402.09015)]
367 | - (*Arxiv'24*) LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods [[📝 Paper](https://arxiv.org/abs/2412.05579)]
368 | - (*Arxiv'25*) LiveIdeaBench: Evaluating LLMs’ Divergent Thinking for Scientific Idea Generation with Minimal Context [[📝 Paper](https://arxiv.org/abs/2412.17596)] [[💻 Code](https://github.com/x66ccff/liveideabench)]
369 | - (*ACL'25*) Auto-Arena: Automating LLM Evaluations with Agent Peer Debate and Committee Voting [[📝 Paper](https://arxiv.org/abs/2405.20267)] [[💻 Code](https://github.com/DAMO-NLP-SG/Auto-Arena-LLMs)]
370 | - (*Arxiv'25*) MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [[📝 Paper](https://arxiv.org/abs/2502.12468)]
371 |
372 | #### 4.2.2 📌 Agent-as-a-Judge
373 |
374 | - (*Arxiv'24*) Agent-as-a-Judge: Evaluate Agents with Agents [[📝 Paper](https://arxiv.org/abs/2410.10934)] [[💻 Code](https://github.com/metauto-ai/agent-as-a-judge)]
375 |
376 | ### 4.3 🛡 Safety, Alignment, and Robustness for Lifelong / Self-Evolving Agents
377 |
378 | - (*Arxiv'24*) AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [[📝 Paper](https://arxiv.org/abs/2410.09024) ]
379 | - (*NeurIPS'24 – Datasets & Benchmarks*) RedCode: Risky Code Execution and Generation [[📝 Paper](https://arxiv.org/abs/2411.07781) ]
380 | - (*Arxiv'24*) MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control [[📝 Paper](https://arxiv.org/abs/2410.17520)] [[💻 Code](https://mobilesafetybench.github.io/)]
381 | - (*Arxiv'23*) Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [[📝 Paper](https://arxiv.org/abs/2304.03279) ]
382 | - (*Arxiv'24*) R-Judge: Benchmarking Safety Risk Awareness for LLM Judges [[📝 Paper](https://arxiv.org/abs/2401.10019)] [[💻 Code](https://rjudgebench.github.io/)]
383 | - (*ACL'25*) SafeLawBench: Towards Safe Alignment of Large Language Models [[📝 Paper](https://arxiv.org/abs/2506.06636) ]
384 | - (*Arxiv'25*) Accuracy Paradox in Large Language Models: Regulating Hallucination Risks in Generative AI
385 | [[📝 Paper](https://www.arxiv.org/abs/2509.13345) ]
386 | - (*ICLR'25 Spotlight*) AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs [[📝 Paper](https://arxiv.org/abs/2410.05295)] [[💻 Code](https://github.com/SaFoLab-WISC/AutoDAN-Turbo)]
387 | - (*ACL'25*) AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection [[📝 Paper](https://arxiv.org/abs/2502.11448)] [[💻 Code](https://github.com/SaFoLab-WISC/AGrail4Agent)]
388 |
389 |
390 | [](https://www.star-history.com/#EvoAgentX/Awesome-Self-Evolving-Agents&Date&Date)
391 |
392 |
393 | ## 📚 Citation
394 |
395 | If you find this survey useful in your research and applications, please cite using this BibTeX:
396 |
397 | ```
398 | @article{fang2025comprehensive,
399 | title={A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems},
400 | author={Fang, Jinyuan and Peng, Yanwen and Zhang, Xi and Wang, Yingxu and Yi, Xinhao and Zhang, Guibin and Xu, Yi and Wu, Bin and Liu, Siwei and Li, Zihao and others},
401 | journal={arXiv preprint arXiv:2508.07407},
402 | year={2025}
403 | }
404 | ```
405 |
406 |
407 | ## ☕ Acknowledgement
408 |
409 | We would like to thank Shuyu Guo for his valuable contributions to the early-stage exploration and literature review on agent optimisation.
410 |
411 | ## ✉️ Contact Us
412 |
413 | If you have any questions or suggestions, please feel free to contact us via:
414 |
415 | Email: j.fang.2@research.gla.ac.uk and zaiqiao.meng@gmail.com
416 |
--------------------------------------------------------------------------------