├── logo.png
├── index.html
└── README.md

/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thinkwee/AgentsMeetRL/HEAD/logo.png
--------------------------------------------------------------------------------

/index.html:
--------------------------------------------------------------------------------
AgentsMeetRL Dashboard

🤖 AgentsMeetRL 🔥

An Awesome List of Open-Source Repos for Reinforcement-Learning-Based Agentic LLMs

Welcome to the Era of Experience!

View on GitHub

Total Repositories in AgentsMeetRL
Stars of All Repositories in AgentsMeetRL
Categories
AgentsMeetRL Stars

Repositories by Category
Stars Distribution by Category
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
![AgentsMeetRL Logo](logo.png)

![Base Framework](https://img.shields.io/badge/Base_Framework-12-BFA2DB?style=for-the-badge)
![General](https://img.shields.io/badge/General-7-4E6813?style=for-the-badge)
![Web](https://img.shields.io/badge/Web-17-845C40?style=for-the-badge)
![GUI](https://img.shields.io/badge/GUI-9-A259FF?style=for-the-badge)
![Tool](https://img.shields.io/badge/Tool-10-D89F7B?style=for-the-badge)
![Game](https://img.shields.io/badge/Game-7-1F4CAD?style=for-the-badge)

![Code](https://img.shields.io/badge/Code-12-A47B67?style=for-the-badge)
![QA](https://img.shields.io/badge/QA-17-FF69B4?style=for-the-badge)
![Memory](https://img.shields.io/badge/Memory-3-007a88?style=for-the-badge)
![Embodied](https://img.shields.io/badge/Embodied-2-C0C5CE?style=for-the-badge)
![Biomedical](https://img.shields.io/badge/Biomedical-3-ffc884?style=for-the-badge)
![Environment](https://img.shields.io/badge/Environment-32-FA5A4C?style=for-the-badge)

[![Interactive Dashboard](https://img.shields.io/badge/📊_Interactive_Dashboard-Visit_Website-blue?style=for-the-badge)](https://thinkwee.top/amr/)

# When LLM Agents Meet Reinforcement Learning

**AgentsMeetRL** is an awesome list that summarizes **open-source repositories** for training LLM agents with reinforcement learning:
- 🤖 A project counts as an agent project if it has at least one of the following: multi-turn interaction or tool use (so Tool-Integrated Reasoning (TIR) projects are included).
- ⚠️ The entries are based on GitHub Copilot Agent's analysis of each repository's code and may contain unfaithful cases. Although they have been reviewed manually, omissions may remain. If you find any errors, please let us know through issues or PRs; we warmly welcome them!
- 🚀 We focus in particular on the RL frameworks, RL algorithms, rewards, and environments each project depends on, as a reference for how these excellent open-source projects make their technical choices. See "📋 Click to view technical details" under each table.
- 🤗 Feel free to submit your own projects anytime; we welcome contributions!

Enumerations used in the tables:
- Reward Type:
  - External Verifier: e.g., a compiler or math solver
  - Rule-Based: e.g., a LaTeX parser with exact-match scoring
  - Model-Based: e.g., a trained verifier LLM or reward LLM
  - Custom

---

## 🔧 Base Framework

| Github Repo | 🌟 Stars | Date | Org | Paper Link |
| :----: | :----: | :----: | :----: | :----: |
| [siiRL](https://github.com/sii-research/siiRL) | Stars | 2025.7 | Shanghai Innovation Institute | [Paper](https://arxiv.org/abs/2507.13833) |
| [slime](https://github.com/THUDM/slime) | ![](https://img.shields.io/github/stars/THUDM/slime?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700) | 2025.6 | Tsinghua University (THUDM) | [Blog](https://lmsys.org/blog/2025-07-09-slime/) |
| [agent-lightning](https://github.com/microsoft/agent-lightning) | Stars | 2025.6 | Microsoft Research | [Paper](https://arxiv.org/abs/2508.03680) |
| [AReaL](https://github.com/inclusionAI/AReaL) | Stars | 2025.6 | AntGroup/Tsinghua | [Paper](https://arxiv.org/pdf/2505.24298) |
| [ROLL](https://github.com/alibaba/ROLL) | Stars | 2025.6 | Alibaba | [Paper](https://arxiv.org/pdf/2506.06122) |
| [MARTI](https://github.com/TsinghuaC3I/MARTI) | Stars | 2025.5 | Tsinghua | -- |
| [RL2](https://github.com/ChenmienTan/RL2) | Stars | 2025.4 | Accio | -- |
| [verifiers](https://github.com/willccbb/verifiers) | Stars | 2025.3 | Individual | -- |
| [oat](https://github.com/sail-sg/oat) | Stars | 2024.11 | NUS/Sea AI | [Paper](https://arxiv.org/pdf/2411.01493) |
| [veRL](https://github.com/volcengine/verl) | Stars | 2024.10 | ByteDance | [Paper](https://arxiv.org/pdf/2409.19256) |
| [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) | Stars | 2023.7 | OpenRLHF | [Paper](https://arxiv.org/abs/2405.11143) |
| [trl](https://github.com/huggingface/trl) | Stars | 2019.11 | HuggingFace | -- |

📋 Click to view technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [siiRL](https://github.com/sii-research/siiRL) | PPO/GRPO/CPGD/MARFT | Multi | Both | Multi | LLM/VLM/LLM-MAS Post-Training | Model/Rule | Planned |
| [slime](https://github.com/THUDM/slime) | GRPO/GSPO/REINFORCE++ | Single | Both | Both | Math/Code | External Verifier | Yes |
| [agent-lightning](https://github.com/microsoft/agent-lightning) | PPO/Custom/Automatic Prompt Optimization | Multi | Outcome | Multi | Calculator/SQL | Model/External/Rule | Yes |
| [AReaL](https://github.com/inclusionAI/AReaL) | PPO | Both | Outcome | Both | Math/Code | External | Yes |
| [ROLL](https://github.com/alibaba/ROLL) | PPO/GRPO/REINFORCE++/TOPR/RAFT++ | Multi | Both | Multi | Math/QA/Code/Alignment | All | Yes |
| [MARTI](https://github.com/TsinghuaC3I/MARTI) | PPO/GRPO/REINFORCE++/TTRL | Multi | Both | Multi | Math | All | Yes |
| [RL2](https://github.com/ChenmienTan/RL2) | Dr. GRPO/PPO/DPO | Single | Both | Both | QA/Dialogue | Rule/Model/External | Yes |
| [verifiers](https://github.com/willccbb/verifiers) | GRPO | Multi | Outcome | Both | Reasoning/Math/Code | All | Code |
| [oat](https://github.com/sail-sg/oat) | PPO/GRPO | Single | Outcome | Multi | Math/Alignment | External | No |
| [veRL](https://github.com/volcengine/verl) | PPO/GRPO | Single | Outcome | Both | Math/QA/Reasoning/Search | All | Yes |
| [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) | PPO/REINFORCE++/GRPO/DPO/IPO/KTO/RLOO | Multi | Both | Both | Dialogue/Chat/Completion | Rule/Model/External | Yes |
| [trl](https://github.com/huggingface/trl) | PPO/GRPO/DPO | Single | Both | Single | QA | Custom | No |
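
The **Reward Type** column follows the enumeration defined above. As a rough illustration of the two categories that need no learned model, here is a minimal Python sketch: a Rule-Based reward that extracts the last `\boxed{...}` answer and scores it by exact match, and an External Verifier reward that runs a candidate program and compares its output. The `\boxed{}` answer convention, the whitespace normalization, and every function name here are assumptions for illustration only; none of the listed repositories is claimed to use this code. Model-Based rewards are left out because they require a trained verifier or reward LLM.

```python
import re
import subprocess
import sys
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, respecting nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as "no answer"


def normalize(answer: str) -> str:
    """Assumed, very light normalization: drop all whitespace and a trailing period."""
    return re.sub(r"\s+", "", answer).rstrip(".")


def rule_based_reward(completion: str, reference: str) -> float:
    """Rule-Based: parse the final answer, then score 1.0 on exact match, else 0.0."""
    predicted = extract_boxed(completion)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def external_verifier_reward(program: str, stdin: str, expected_stdout: str,
                             timeout_s: float = 5.0) -> float:
    """External Verifier: execute a candidate Python program and compare its stdout.

    WARNING: this runs the program with no sandboxing; real pipelines isolate it.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            input=stdin, capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # programs that do not terminate get no reward
    if result.returncode != 0:
        return 0.0  # programs that crash get no reward
    return 1.0 if result.stdout.strip() == expected_stdout.strip() else 0.0


if __name__ == "__main__":
    print(rule_based_reward(r"... so the answer is \boxed{\frac{1}{2}}.", r"\frac{1}{2}"))
    print(external_verifier_reward("print(int(input()) * 2)", "21\n", "42"))
```

Both functions return a single scalar per trajectory, i.e. an outcome reward; a process reward would instead score intermediate steps.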

## 💪 General/MultiTask

| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
| :----: | :----: | :----: | :----: | :----: | :----: |
| [DEPO](https://github.com/OpenCausaLab/DEPO) | Stars | 2025.11 | HKUST/SJTU | [Paper](https://arxiv.org/abs/2511.15392) | LLaMA-Factory |
| [SPEAR](https://github.com/TencentYoutuResearch/SPEAR) | Stars | 2025.10 | Tencent Youtu Lab | [Paper](https://arxiv.org/abs/2509.22601) | veRL/verl-agent |
| [AgentRL](https://github.com/THUDM/AgentRL) | Stars | 2025.9 | Tsinghua | [Paper](https://arxiv.org/abs/2510.04206) | veRL |
| [AgentGym-RL](https://github.com/WooooDyy/AgentGym-RL) | Stars | 2025.9 | Fudan University | [Paper](https://arxiv.org/abs/2509.08755) | veRL |
| [Agent_Foundation_Models](https://github.com/OPPO-PersonalAI/Agent_Foundation_Models) | Stars | 2025.8 | OPPO Personal AI Lab | [Paper](https://arxiv.org/abs/2508.13167) | veRL |
| [SPA-RL-Agent](https://github.com/WangHanLinHenry/SPA-RL-Agent) | Stars | 2025.5 | PolyU | [Paper](https://arxiv.org/pdf/2505.20732) | TRL |
| [verl-agent](https://github.com/langfengQ/verl-agent) | Stars | 2025.5 | NTU/Skywork | [Paper](https://arxiv.org/pdf/2505.10978) | veRL |

📋 Click to view technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [DEPO](https://github.com/OpenCausaLab/DEPO) | KTO + Efficiency Loss | Single | Both | Multi | Agent (BabyAI/WebShop) | Rule | Yes |
| [SPEAR](https://github.com/TencentYoutuResearch/SPEAR) | GRPO/GiGPO + SIL | Single | Both | Multi | Math/Agent | Rule/External | Yes (Search, Sandbox, Browser) |
| [AgentRL](https://github.com/THUDM/AgentRL) | GRPO/REINFORCE++/RLOO/ReMax/GAE | Single | Outcome | Multi | Agent Tasks | External | Yes |
| [AgentGym-RL](https://github.com/WooooDyy/AgentGym-RL) | PPO/GRPO/RLOO/REINFORCE++ | Single | Outcome | Multi | Web/Search/Game/Embodied/Science | Rule/Model/External | Yes (Web, Search, Env APIs) |
| [Agent_Foundation_Models](https://github.com/OPPO-PersonalAI/Agent_Foundation_Models) | DAPO/PPO | Single | Outcome | Single | QA/Code/Math | Rule/External | Yes |
| [SPA-RL-Agent](https://github.com/WangHanLinHenry/SPA-RL-Agent) | PPO | Single | Process | Multi | Navigation/Web/TextGame | Model | No |
| [verl-agent](https://github.com/langfengQ/verl-agent) | PPO/GRPO/GiGPO/DAPO/RLOO/REINFORCE++ | Multi | Both | Multi | Phone Use/Math/Code/Web/TextGame | All | Yes |

## 🔍 Search/Research/Web

| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
| :----: | :----: | :----: | :----: | :----: | :----: |
| [ReSeek](https://github.com/TencentBAC/ReSeek) | Stars | 2025.10 | Tencent PCG BAC/Tsinghua University | [Paper](https://arxiv.org/abs/2510.00568) | veRL |
| [Tree-GRPO](https://github.com/AMAP-ML/Tree-GRPO) | Stars | 2025.9 | AMAP | [Paper](https://arxiv.org/abs/2509.21240) | veRL |
| [ASearcher](https://github.com/inclusionAI/ASearcher) | Stars | 2025.8 | Ant Research RL Lab/Tsinghua University & UW | [Paper](https://arxiv.org/abs/2508.07976) | RealHF/AReaL |
| [Kimi-Researcher](https://github.com/moonshotai/Kimi-Researcher) | Stars | 2025.6 | Moonshot AI | [Blog](https://moonshotai.github.io/Kimi-Researcher/) | Custom |
| [TTI](https://github.com/test-time-interaction/TTI) | Stars | 2025.6 | CMU | [Paper](https://arxiv.org/abs/2506.07976) | Custom |
| [R-Search](https://github.com/QingFei1/R-Search) | Stars | 2025.6 | Individual | -- | veRL |
| [R1-Searcher-plus](https://github.com/RUCAIBox/R1-Searcher-plus) | Stars | 2025.5 | RUC | [Paper](https://arxiv.org/pdf/2505.17005) | Custom |
| [StepSearch](https://github.com/Zillwang/StepSearch) | Stars | 2025.5 | SenseTime | [Paper](https://arxiv.org/pdf/2505.15107) | veRL |
| [AutoRefine](https://github.com/syr-cn/AutoRefine) | Stars | 2025.5 | USTC | [Paper](https://www.arxiv.org/pdf/2505.11277) | veRL |
| [ZeroSearch](https://github.com/Alibaba-NLP/ZeroSearch) | Stars | 2025.5 | Alibaba | [Paper](https://arxiv.org/pdf/2505.04588) | veRL |
| [WebThinker](https://github.com/RUC-NLPIR/WebThinker) | Stars | 2025.4 | RUC | [Paper](https://arxiv.org/pdf/2504.21776) | Custom |
| [DeepResearcher](https://github.com/GAIR-NLP/DeepResearcher) | Stars | 2025.4 | SJTU | [Paper](https://arxiv.org/pdf/2504.03160) | veRL |
| [Search-R1](https://github.com/PeterGriffinJin/Search-R1) | Stars | 2025.3 | UIUC/Google | [Paper1](https://arxiv.org/pdf/2503.09516), [Paper2](https://arxiv.org/pdf/2505.15117) | veRL |
| [R1-Searcher](https://github.com/RUCAIBox/R1-Searcher) | Stars | 2025.3 | RUC | [Paper](https://arxiv.org/pdf/2503.05592) | OpenRLHF |
| [C-3PO](https://github.com/Chen-GX/C-3PO) | Stars | 2025.2 | Alibaba | [Paper](https://arxiv.org/pdf/2502.06205) | OpenRLHF |
| [Search-o1](https://github.com/RUC-NLPIR/Search-o1) | Stars | 2025.1 | Renmin University of China (RUC) | [Paper](https://arxiv.org/abs/2501.05366) | N/A (Inference Only) |
| [WebAgent](https://github.com/Alibaba-NLP/WebAgent) | Stars | 2025.1 | Alibaba | [Paper1](https://arxiv.org/pdf/2501.07572), [Paper2](https://arxiv.org/pdf/2505.22648) | LLaMA-Factory |

📋 Click to view technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [ReSeek](https://github.com/TencentBAC/ReSeek) | GRPO/PPO | Single | Both | Multi | QA/Search | Rule | Search/JUDGE |
| [Tree-GRPO](https://github.com/AMAP-ML/Tree-GRPO) | GRPO/Tree-GRPO | Single | Outcome | Multi | Search | Rule | Search |
| [ASearcher](https://github.com/inclusionAI/ASearcher) | PPO/GRPO + Decoupled PPO | Single | Outcome | Multi | Math/Code/SearchQA | External/Rule | Yes |
| [Kimi-Researcher](https://github.com/moonshotai/Kimi-Researcher) | REINFORCE | Single | Outcome | Multi | Research | Outcome | Search, Browse, Coding |
| [TTI](https://github.com/test-time-interaction/TTI) | REINFORCE/BC | Single | Outcome | Multi | Web | External | Web Browsing |
| [R-Search](https://github.com/QingFei1/R-Search) | PPO/GRPO | Single | Both | Multi | QA/Search | All | Yes |
| [R1-Searcher-plus](https://github.com/RUCAIBox/R1-Searcher-plus) | Custom | Single | Outcome | Multi | Search | Model | Search |
| [StepSearch](https://github.com/Zillwang/StepSearch) | PPO | Single | Process | Multi | QA | Model | Search |
| [AutoRefine](https://github.com/syr-cn/AutoRefine) | PPO/GRPO | Multi | Both | Multi | RAG QA | Rule | Search |
| [ZeroSearch](https://github.com/Alibaba-NLP/ZeroSearch) | PPO/GRPO/REINFORCE | Single | Outcome | Multi | QA/Search | Rule | Yes |
| [WebThinker](https://github.com/RUC-NLPIR/WebThinker) | DPO | Single | Outcome | Multi | Reasoning/QA/Research | Model/External | Web Browsing |
| [DeepResearcher](https://github.com/GAIR-NLP/DeepResearcher) | PPO/GRPO | Multi | Outcome | Multi | Research | All | Yes |
| [Search-R1](https://github.com/PeterGriffinJin/Search-R1) | PPO/GRPO | Single | Outcome | Multi | Search | All | Search |
| [R1-Searcher](https://github.com/RUCAIBox/R1-Searcher) | PPO/DPO | Single | Both | Multi | Search | All | Yes |
| [C-3PO](https://github.com/Chen-GX/C-3PO) | PPO | Multi | Outcome | Multi | Search | Model | Yes |
| [Search-o1](https://github.com/RUC-NLPIR/Search-o1) | N/A | Single | N/A | Multi | Math/Science QA/Code/Open QA | N/A | Web Search |
| [WebAgent](https://github.com/Alibaba-NLP/WebAgent) | DAPO | Multi | Process | Multi | Web | Model | Yes |

## 📱 GUI

| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
| :----: | :----: | :----: | :----: | :----: | :----: |
| [MobileAgent](https://github.com/X-PLUG/MobileAgent) | Stars | 2025.9 | X-PLUG (TongyiQwen) | [Paper](https://arxiv.org/abs/2509.11543) | veRL |
| [InfiGUI-G1](https://github.com/InfiXAI/InfiGUI-G1) | Stars | 2025.8 | InfiX AI | [Paper](https://arxiv.org/abs/2508.05731) | veRL |
| [Grounding-R1](https://github.com/Yan98/Grounding-R1) | Stars | 2025.6 | Salesforce | [Blog](https://huggingface.co/blog/HelloKKMe/grounding-r1) | TRL |
| [AgentCPM-GUI](https://github.com/OpenBMB/AgentCPM-GUI) | Stars | 2025.6 | OpenBMB/Tsinghua/RUC | [Paper](https://arxiv.org/pdf/2506.01391) | HuggingFace |
| [SE-GUI](https://github.com/YXB-NKU/SE-GUI) | Stars | 2025.5 | Nankai University/vivo | [Paper](https://arxiv.org/pdf/2505.12370) | TRL |
| [ARPO](https://github.com/dvlab-research/ARPO) | Stars | 2025.5 | CUHK/HKUST | [Paper](https://arxiv.org/pdf/2505.16282) | veRL |
| [GUI-G1](https://github.com/Yuqi-Zhou/GUI-G1) | Stars | 2025.5 | RUC | [Paper](https://arxiv.org/pdf/2505.15810) | TRL |
| [GUI-R1](https://github.com/ritzz-ai/GUI-R1) | Stars | 2025.4 | CAS/NUS | [Paper](https://arxiv.org/pdf/2504.10458) | veRL |
| [UI-R1](https://github.com/lll6gg/UI-R1) | Stars | 2025.3 | vivo/CUHK | [Paper](https://arxiv.org/pdf/2503.21620) | TRL |

📋 Click to view technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [MobileAgent](https://github.com/X-PLUG/MobileAgent) | Semi-online RL | Single | Both | Multi | Mobile GUI/Automation | Rule | Yes |
| [InfiGUI-G1](https://github.com/InfiXAI/InfiGUI-G1) | AEPO | Single | Outcome | Single | GUI/Grounding | Rule | No |
| [Grounding-R1](https://github.com/Yan98/Grounding-R1) | GRPO | Single | Outcome | Multi | GUI Grounding | Model | Yes |
| [AgentCPM-GUI](https://github.com/OpenBMB/AgentCPM-GUI) | GRPO | Single | Outcome | Multi | Mobile GUI | Model | Yes |
| [SE-GUI](https://github.com/YXB-NKU/SE-GUI) | GRPO | Single | Both | Single | GUI Grounding | Rule | Yes |
| [ARPO](https://github.com/dvlab-research/ARPO) | GRPO | Single | Outcome | Multi | GUI | External | Computer Use |
| [GUI-G1](https://github.com/Yuqi-Zhou/GUI-G1) | GRPO | Single | Outcome | Single | GUI | Rule/External | No |
| [GUI-R1](https://github.com/ritzz-ai/GUI-R1) | GRPO | Single | Outcome | Multi | GUI | Rule | No |
| [UI-R1](https://github.com/lll6gg/UI-R1) | GRPO | Single | Process | Both | GUI | Rule | Computer/Phone Use |

## 🔨 Tool

| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
| :----: | :----: | :----: | :----: | :----: | :----: |
| [MiroRL](https://github.com/MiroMindAI/MiroRL) | Stars | 2025.8 | MiroMindAI | [HF Repo](https://huggingface.co/miromind-ai) | veRL |
| [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) | Stars | 2025.6 | TIGER-Lab | [X](https://x.com/DongfuJiang/status/1929198238017720379) | veRL |
| [Multi-Turn-RL-Agent](https://github.com/SiliangZeng/Multi-Turn-RL-Agent) | Stars | 2025.5 | University of Minnesota | [Paper](https://arxiv.org/pdf/2505.11821) | Custom |
| [Tool-N1](https://github.com/NVlabs/Tool-N1) | Stars | 2025.5 | NVIDIA | [Paper](https://arxiv.org/pdf/2505.00024) | veRL |
| [Tool-Star](https://github.com/dongguanting/Tool-Star) | Stars | 2025.5 | RUC | [Paper](https://arxiv.org/pdf/2505.16410) | LLaMA-Factory |
| [RL-Factory](https://github.com/Simple-Efficient/RL-Factory) | Stars | 2025.5 | Simple-Efficient | [Model](https://huggingface.co/Simple-Efficient/RLFactory-Qwen3-8B-GRPO) | veRL |
| [ReTool](https://github.com/ReTool-RL/ReTool) | Stars | 2025.4 | ByteDance | [Paper](https://arxiv.org/pdf/2504.11536) | veRL |
| [AWorld](https://github.com/inclusionAI/AWorld) | Stars | 2025.3 | Ant Group (inclusionAI) | [Paper](https://arxiv.org/abs/2508.20404) | veRL |
| [Agent-R1](https://github.com/0russwest0/Agent-R1) | Stars | 2025.3 | USTC | [Paper](https://arxiv.org/abs/2511.14460) | veRL |
| [ReCall](https://github.com/Agent-RL/ReCall) | Stars | 2025.3 | Baichuan | [Paper](https://arxiv.org/pdf/2503.19470) | veRL |

📋 Click to view technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [MiroRL](https://github.com/MiroMindAI/MiroRL) | GRPO | Single | Both | Multi | Reasoning/Planning/Tool Use | Rule | MCP |
| [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) | PPO/GRPO | Single | Both | Both | Math/Code | Rule/External | Yes |
| [Multi-Turn-RL-Agent](https://github.com/SiliangZeng/Multi-Turn-RL-Agent) | GRPO | Single | Both | Multi | Tool Use/Math | Rule/External | Yes |
| [Tool-N1](https://github.com/NVlabs/Tool-N1) | PPO | Single | Outcome | Multi | Math/Dialogue | All | Yes |
| [Tool-Star](https://github.com/dongguanting/Tool-Star) | PPO/DPO/ORPO/SimPO/KTO | Single | Outcome | Multi | Multi-modal/Tool Use/Dialogue | Model/External | Yes |
| [RL-Factory](https://github.com/Simple-Efficient/RL-Factory) | GRPO | Multi | Both | Multi | Tool Use/NL2SQL | All | MCP |
| [ReTool](https://github.com/ReTool-RL/ReTool) | PPO | Single | Outcome | Multi | Math | External | Code |
| [AWorld](https://github.com/inclusionAI/AWorld) | GRPO | Both | Outcome | Multi | Search/Web/Code | External/Rule | Yes |
| [Agent-R1](https://github.com/0russwest0/Agent-R1) | PPO/GRPO | Single | Both | Multi | Tool Use/QA | Model | Yes |
| [ReCall](https://github.com/Agent-RL/ReCall) | PPO/GRPO/RLOO/REINFORCE++/ReMax | Single | Outcome | Multi | Tool Use/Math/QA | All | Yes |

## 🎮 TextGame

| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
| :----: | :----: | :----: | :----: | :----: | :----: |
| [ARIA](https://github.com/rhyang2021/ARIA) | Stars | 2025.6 | Fudan University | [Paper](https://arxiv.org/abs/2506.00539) | Custom |
| [AMPO](https://github.com/MozerWang/AMPO) | Stars | 2025.5 | Tongyi Lab, Alibaba | [Paper](https://arxiv.org/abs/2505.02156) | veRL |
| [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) | Stars | 2025.5 | Alibaba | [Paper](https://arxiv.org/pdf/2505.17826) | veRL |
| [VAGEN](https://github.com/RAGEN-AI/VAGEN) | Stars | 2025.3 | RAGEN-AI | [Paper](https://www.notion.so/VAGEN-Training-VLM-Agents-with-Multi-Turn-Reinforcement-Learning-1bfde13afb6e80b792f6d80c7c2fcad0) | veRL |
| [ART](https://github.com/OpenPipe/ART) | Stars | 2025.3 | OpenPipe | [Paper](https://github.com/OpenPipe/ART#-citation) | TRL |
| [OpenManus-RL](https://github.com/OpenManus/OpenManus-RL) | Stars | 2025.3 | UIUC/MetaGPT | -- | Custom |
| [RAGEN](https://github.com/RAGEN-AI/RAGEN) | Stars | 2025.1 | RAGEN-AI | [Paper](https://arxiv.org/pdf/2504.20073) | veRL |

📋 Click to view technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [ARIA](https://github.com/rhyang2021/ARIA) | REINFORCE | Both | Process | Multi | Negotiation/Bargaining | Other | No |
| [AMPO](https://github.com/MozerWang/AMPO) | BC/AMPO (GRPO improvement) | Multi | Outcome | Multi | Social Interaction | Model | No |
| [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) | PPO/GRPO | Single | Outcome | Both | Math/TextGame/Web | All | Yes |
| [VAGEN](https://github.com/RAGEN-AI/VAGEN) | PPO/GRPO | Single | Both | Multi | TextGame/Navigation | All | Yes |
| [ART](https://github.com/OpenPipe/ART) | GRPO | Multi | Both | Multi | TextGame | All | Yes |
| [OpenManus-RL](https://github.com/OpenManus/OpenManus-RL) | PPO/DPO/GRPO | Multi | Outcome | Multi | TextGame | All | Yes |
| [RAGEN](https://github.com/RAGEN-AI/RAGEN) | PPO/GRPO | Single | Both | Multi | TextGame | All | Yes |

## 💻 Code

| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
| :----: | :----: | :----: | :----: | :----: | :----: |
| [PPP-Agent](https://github.com/sunnweiwei/PPP-Agent) | Stars | 2025.11 | CMU/OpenHands | [Paper](https://arxiv.org/abs/2511.02208) | veRL |
| [RepoDeepSearch](https://github.com/Mizersy/RepoDeepSearch) | Stars | 2025.8 | PKU/ByteDance/BIT | [Paper](https://arxiv.org/abs/2508.03012) | veRL |
| [MedAgentGym](https://github.com/wshi83/MedAgentGym) | Stars | 2025.6 | Emory/Georgia Tech | [Paper](https://arxiv.org/pdf/2506.04405) | HuggingFace |
| [CURE](https://github.com/Gen-Verse/CURE) | Stars | 2025.6 | University of Chicago/Princeton/ByteDance | [Paper](https://arxiv.org/pdf/2506.03136) | HuggingFace |
| [MASLab](https://github.com/MASWorks/MASLab) | Stars | 2025.5 | MASWorks | [Paper](https://arxiv.org/pdf/2505.16988) | Custom |
| [Time-R1](https://github.com/ulab-uiuc/Time-R1) | Stars | 2025.5 | UIUC | [Paper](https://arxiv.org/pdf/2505.13508) | veRL |
| [ML-Agent](https://github.com/MASWorks/ML-Agent) | Stars | 2025.5 | MASWorks | [Paper](https://arxiv.org/pdf/2505.23723) | Custom |
| [SkyRL](https://github.com/NovaSky-AI/SkyRL) | Stars | 2025.4 | NovaSky | [Paper](https://arxiv.org/abs/2511.16108) | veRL |
| [digitalhuman](https://github.com/Tencent/digitalhuman) | Stars | 2025.4 | Tencent | [Paper](https://arxiv.org/abs/2507.03112) | veRL |
| [sweet_rl](https://github.com/facebookresearch/sweet_rl) | Stars | 2025.3 | Meta/UCB | [Paper](https://arxiv.org/pdf/2503.15478) | OpenRLHF |
| [rllm](https://github.com/agentica-project/rllm) | Stars | 2025.1 | Berkeley Sky Computing Lab/BAIR/Together AI | [Notion Blog](https://pretty-radio-b75.notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31) | veRL |
| [open-r1](https://github.com/huggingface/open-r1) | Stars | 2025.1 | HuggingFace | -- | TRL |

📋 Click to view technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [PPP-Agent](https://github.com/sunnweiwei/PPP-Agent) | PPP-RL | Single | Both | Multi | SWE/Research | Rule/Model | Search, Ask, Browse |
| [RepoDeepSearch](https://github.com/Mizersy/RepoDeepSearch) | GRPO | Single | Both | Multi | Search/Repair | Rule/External | Yes |
| [MedAgentGym](https://github.com/wshi83/MedAgentGym) | SFT/DPO/PPO/GRPO | Single | Outcome | Multi | Medical/Code | External | Yes |
| [CURE](https://github.com/Gen-Verse/CURE) | PPO | Single | Outcome | Single | Code | External | No |
| [MASLab](https://github.com/MASWorks/MASLab) | No RL | Multi | Outcome | Multi | Code/Math/Reasoning | External | Yes |
| [Time-R1](https://github.com/ulab-uiuc/Time-R1) | PPO/GRPO/DPO | Multi | Outcome | Multi | Temporal | All | Code |
| [ML-Agent](https://github.com/MASWorks/ML-Agent) | Custom | Single | Process | Multi | Code | All | Yes |
| [SkyRL](https://github.com/NovaSky-AI/SkyRL) | PPO/GRPO | Single | Outcome | Multi | Math/Code | All | Code |
| [digitalhuman](https://github.com/Tencent/digitalhuman) | PPO/GRPO/ReMax/RLOO | Multi | Outcome | Multi | Empathy/Math/Code/Multimodal QA | Rule/Model/External | Yes |
| [sweet_rl](https://github.com/facebookresearch/sweet_rl) | DPO | Multi | Process | Multi | Design/Code | Model | Web Browsing |
| [rllm](https://github.com/agentica-project/rllm) | PPO/GRPO | Single | Outcome | Multi | Code Edit | External | Yes |
| [open-r1](https://github.com/huggingface/open-r1) | GRPO | Single | Outcome | Single | Math/Code | All | Yes |

## 🤔 QA (Reasoning/Math)

| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
| :----: | :----: | :----: | :----: | :----: | :----: |
| [SafeSearch](https://github.com/amazon-science/SafeSearch) | Stars | 2025.11 | Amazon Science | [Paper](https://arxiv.org/abs/2510.17017) | veRL |
| [Agent0](https://github.com/aiming-lab/Agent0) | Stars | 2025.10 | UNC-Chapel Hill/Salesforce Research/Stanford University | [Paper](https://arxiv.org/abs/2511.16043) | veRL |
| [KG-R1](https://github.com/Jinyeop3110/KG-R1) | Stars | 2025.9 | UIUC/Google | [Paper1](https://arxiv.org/pdf/2503.09516), [Paper2](https://arxiv.org/abs/2505.15117) | veRL |
| [AgentFlow](https://github.com/lupantech/AgentFlow) | Stars | 2025.9 | Stanford University | [Paper](https://arxiv.org/abs/2510.05592) | veRL |
| [ARPO](https://github.com/dongguanting/ARPO) | Stars | 2025.7 | RUC/Kuaishou | [Paper](https://arxiv.org/abs/2507.19849) | veRL |
| [terminal-bench-rl](https://github.com/Danau5tin/terminal-bench-rl) | Stars | 2025.7 | Individual (Danau5tin) | -- | rLLM |
| [MOTIF](https://github.com/purbeshmitra/MOTIF) | Stars | 2025.6 | University of Maryland | [Paper](https://arxiv.org/abs/2507.02851) | TRL |
| [cmriat/l0](https://github.com/cmriat/l0) | Stars | 2025.6 | CMRIAT | [Paper](https://arxiv.org/abs/2506.23667) | veRL |
| [agent-distillation](https://github.com/Nardien/agent-distillation) | Stars | 2025.5 | KAIST | [Paper](https://arxiv.org/pdf/2505.17612) | Custom |
| [VDeepEyes](https://github.com/Visual-Agent/DeepEyes) | Stars | 2025.5 | Xiaohongshu/XJTU | [Paper](https://arxiv.org/pdf/2505.14362) | veRL |
| [EasyR1](https://github.com/hiyouga/EasyR1) | Stars | 2025.4 | Individual | [Repo](https://github.com/hiyouga/EasyR1), [Paper](https://arxiv.org/pdf/2409.19256) | veRL |
| [AutoCoA](https://github.com/ADaM-BJTU/AutoCoA) | Stars | 2025.3 | BJTU | [Paper](https://arxiv.org/pdf/2503.06580) | veRL |
| [ToRL](https://github.com/GAIR-NLP/ToRL) | Stars | 2025.3 | SJTU | [Paper](https://arxiv.org/pdf/2503.23383) | veRL |
| [ReMA](https://github.com/ziyuwan/ReMA-public) | Stars | 2025.3 | SJTU/UCL | [Paper](https://arxiv.org/pdf/2503.09501) | veRL |
| [Agentic-Reasoning](https://github.com/theworldofagents/Agentic-Reasoning) | Stars | 2025.2 | Oxford | [Paper](https://arxiv.org/pdf/2502.04644) | Custom |
| [SimpleTIR](https://github.com/ltzheng/SimpleTIR) | Stars | 2025.2 | NTU/ByteDance | [Notion Blog](https://simpletir.notion.site/report) | veRL |
| [openrlhf_async_pipline](https://github.com/yyht/openrlhf_async_pipline) | Stars | 2024.5 | OpenRLHF | [Paper](https://arxiv.org/pdf/2405.11143) | OpenRLHF |

📋 Click to view technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [SafeSearch](https://github.com/amazon-science/SafeSearch) | PPO (GAE/GRPO) | Single | Both | Multi | QA/Search | Rule/Model | Search |
| [Agent0](https://github.com/aiming-lab/Agent0) | ADPO | Multi | Process | Multi | Math/Visual | Model/Verifier | Yes |
| [KG-R1](https://github.com/Jinyeop3110/KG-R1) | GRPO/PPO | Single | Both | Multi | KGQA | Rule/Model | KG Retrieval |
| [AgentFlow](https://github.com/lupantech/AgentFlow) | Flow-GRPO | Single | Outcome | Multi | Search/Math/QA | Model/External | Yes |
| [ARPO](https://github.com/dongguanting/ARPO) | GRPO | Single | Outcome | Multi | Math/Coding | Model/Rule | Yes |
| [terminal-bench-rl](https://github.com/Danau5tin/terminal-bench-rl) | GRPO | Single | Outcome | Multi | Coding/Terminal | Model/External | Yes |
| [MOTIF](https://github.com/purbeshmitra/MOTIF) | GRPO | Single | Outcome | Multi | QA | Rule | No |
| [cmriat/l0](https://github.com/cmriat/l0) | PPO | Multi | Process | Multi | QA | All | Yes |
| [agent-distillation](https://github.com/Nardien/agent-distillation) | PPO | Single | Process | Multi | QA/Math | External | Yes |
| [VDeepEyes](https://github.com/Visual-Agent/DeepEyes) | PPO/GRPO | Multi | Process | Multi | VQA | All | Yes |
| [EasyR1](https://github.com/hiyouga/EasyR1) | GRPO | Single | Process | Multi | Vision-Language | Model | Yes |
| [AutoCoA](https://github.com/ADaM-BJTU/AutoCoA) | GRPO | Multi | Outcome | Multi | Reasoning/Math/QA | All | Yes |
| [ToRL](https://github.com/GAIR-NLP/ToRL) | GRPO | Single | Outcome | Single | Math | Rule/External | Yes |
| [ReMA](https://github.com/ziyuwan/ReMA-public) | PPO | Multi | Outcome | Multi | Math | Rule | No |
| [Agentic-Reasoning](https://github.com/theworldofagents/Agentic-Reasoning) | Custom | Single | Process | Multi | QA/Math | External | Web Browsing |
| [SimpleTIR](https://github.com/ltzheng/SimpleTIR) | PPO/GRPO (with extensions) | Single | Outcome | Multi | Math/Coding | All | Yes |
| [openrlhf_async_pipline](https://github.com/yyht/openrlhf_async_pipline) | PPO/REINFORCE++/DPO/RLOO | Single | Outcome | Multi | Dialogue/Reasoning/QA | All | No |
## 🧠 Memory

| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
| :----: | :----: | :----: | :----: | :----: | :----: |
| [MEM1](https://github.com/MIT-MI/MEM1) | Stars | 2025.7 | MIT | [Paper](https://arxiv.org/abs/2506.15841) | veRL (based on Search-R1) |
| [Memento](https://github.com/Agent-on-the-Fly/Memento) | Stars | 2025.6 | UCL, Huawei | [Paper](https://arxiv.org/abs/2508.16153) | Custom |
| [MemAgent](https://github.com/BytedTsinghua-SIA/MemAgent) | Stars | 2025.6 | Bytedance, Tsinghua-SIA | [Paper](https://arxiv.org/abs/2507.02259) | veRL |

📋 Technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [MEM1](https://github.com/MIT-MI/MEM1) | PPO/GRPO | Single | Outcome | Multi | WebShop/GSM8K/QA | Rule/Model | Yes |
| [Memento](https://github.com/Agent-on-the-Fly/Memento) | Soft Q-Learning | Single | Outcome | Multi | Research/QA/Code/Web | External/Rule | Yes |
| [MemAgent](https://github.com/BytedTsinghua-SIA/MemAgent) | PPO/GRPO/DPO | Multi | Outcome | Multi | Long-context QA | Rule/Model/External | Yes |

## 🦾 Embodied

| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
| :----: | :----: | :----: | :----: | :----: | :----: |
| [Embodied-R1](https://github.com/pickxiguapi/Embodied-R1) | Stars | 2025.6 | Tianjin University | [Paper](http://arxiv.org/abs/2508.13998) | veRL |
| [STeCa](https://github.com/WangHanLinHenry/STeCa) | Stars | 2025.2 | The Hong Kong Polytechnic University | [Paper](https://arxiv.org/abs/2502.14276) | FastChat/TRL |

📋 Technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [Embodied-R1](https://github.com/pickxiguapi/Embodied-R1) | GRPO | Single | Outcome | Single | Grounding/Waypoint | Rule | No |
| [STeCa](https://github.com/WangHanLinHenry/STeCa) | DPO (RFT) | Single | Both | Multi | Embodied/Household | Rule/MC | Environment Actions |

## 🏥 Biomedical

| Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
| :----: | :----: | :----: | :----: | :----: | :----: |
| [MMedAgent-RL](https://github.com/JanerhYang/MMedAgent-RL) | Stars | 2025.8 | Unknown | [Paper](https://arxiv.org/abs/2506.00555) | Unknown |
| [DoctorAgent-RL](https://github.com/JarvisUSTC/DoctorAgent-RL) | Stars | 2025.5 | UCAS/CAS/USTC | [Paper](https://arxiv.org/pdf/2505.19630) | RAGEN |
| [Biomni](https://github.com/snap-stanford/Biomni) | Stars | 2025.3 | Stanford University (SNAP) | [Paper](https://www.biorxiv.org/content/10.1101/2025.05.30.656746v1) | Custom |

📋 Technical details

| Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| [MMedAgent-RL](https://github.com/JanerhYang/MMedAgent-RL) | Unknown | Multi | Unknown | Unknown | Unknown | Unknown | Unknown |
| [DoctorAgent-RL](https://github.com/JarvisUSTC/DoctorAgent-RL) | GRPO | Multi | Both | Multi | Consultation/Diagnosis | Model/Rule | No |
| [Biomni](https://github.com/snap-stanford/Biomni) | TBD | Single | TBD | Single | scRNAseq/CRISPR/ADMET/Knowledge | TBD | Yes |

## ⛰️ Environment

| Github Repo | 🌟 Stars | Date | Org | Task |
| :----: | :----: | :----: | :----: | :----: |
| [LoCoBench-Agent](https://github.com/SalesforceAIResearch/LoCoBench-Agent) | ![](https://img.shields.io/github/stars/SalesforceAIResearch/LoCoBench-Agent.svg?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700) | 2025.11 | Salesforce AI Research | SWE |
| [Simia-Agent-Training](https://github.com/microsoft/Simia-Agent-Training) | ![](https://img.shields.io/github/stars/microsoft/Simia-Agent-Training.svg?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700) | 2025.10 | Microsoft | ToolUse/API |
| [PaperArena](https://github.com/Melmaphother/PaperArena) | Stars | 2025.9 | University of Science and Technology of China | ScientificLiteratureQA |
| [enterprise-deep-research](https://github.com/SalesforceAIResearch/enterprise-deep-research) | ![](https://img.shields.io/github/stars/SalesforceAIResearch/enterprise-deep-research.svg?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700) | 2025.9 | Salesforce AI Research | DeepResearch |
| [CompassVerifier](https://github.com/open-compass/CompassVerifier) | Stars | 2025.7 | Shanghai AI Lab | Knowledge/Math/Science/GeneralReasoning |
| [Mind2Web-2](https://github.com/OSU-NLP-Group/Mind2Web-2) | Stars | 2025.6 | Ohio State University | Web |
| [gem](https://github.com/axon-rl/gem) | Stars | 2025.5 | Sea AI Lab | Math/Code/Game/QA |
| [MLE-Dojo](https://github.com/MLE-Dojo/MLE-Dojo) | Stars | 2025.5 | GIT, Stanford | MLE |
| [atropos](https://github.com/NousResearch/atropos) | Stars | 2025.4 | Nous Research | Game/Code/Tool |
| [InternBootcamp](https://github.com/InternLM/InternBootcamp) | Stars | 2025.4 | InternBootcamp | Coding/QA/Game |
| [loong](https://github.com/camel-ai/loong) | Stars | 2025.3 | CAMEL-AI.org | RLVR |
| [DataSciBench](https://github.com/THUDM/DataSciBench) | Stars | 2025.2 | Tsinghua | Data Analysis |
| [reasoning-gym](https://github.com/open-thought/reasoning-gym) | Stars | 2025.1 | open-thought | Math/Game |
| [llmgym](https://github.com/tensorzero/llmgym) | Stars | 2025.1 | tensorzero | TextGame/Tool |
| [debug-gym](https://github.com/microsoft/debug-gym) | Stars | 2024.11 | Microsoft Research | Debugging/Game/Code |
| [gym-llm](https://github.com/rsanchezmo/gym-llm) | Stars | 2024.8 | Rodrigo Sánchez Molina | Control/Game |
| [AgentGym](https://github.com/WooooDyy/AgentGym) | Stars | 2024.6 | Fudan | Web/Game |
| [tau-bench](https://github.com/sierra-research/tau-bench) | Stars | 2024.6 | Sierra | Tool |
| [appworld](https://github.com/StonyBrookNLP/appworld) | Stars | 2024.6 | Stony Brook University | Phone Use |
| [android_world](https://github.com/google-research/android_world) | Stars | 2024.5 | Google Research | Phone Use |
| [TheAgentCompany](https://github.com/TheAgentCompany/TheAgentCompany) | Stars | 2024.3 | CMU, Duke | Coding |
| [LlamaGym](https://github.com/KhoomeiK/LlamaGym) | Stars | 2024.3 | Rohan Pandey | Game |
| [visualwebarena](https://github.com/web-arena-x/visualwebarena) | Stars | 2024.1 | CMU | Web |
| [LMRL-Gym](https://github.com/abdulhaim/LMRL-Gym) | Stars | 2023.12 | UC Berkeley | Game |
| [OSWorld](https://github.com/xlang-ai/OSWorld) | Stars | 2023.10 | HKU, CMU, Salesforce, Waterloo | Computer Use |
| [webarena](https://github.com/web-arena-x/webarena) | Stars | 2023.7 | CMU | Web |
| [AgentBench](https://github.com/THUDM/AgentBench) | Stars | 2023.7 | Tsinghua University | Game/Web/QA/Tool |
| [WebShop](https://github.com/princeton-nlp/WebShop) | Stars | 2022.7 | Princeton-NLP | Web |
| [ScienceWorld](https://github.com/allenai/ScienceWorld) | Stars | 2022.3 | AllenAI | TextGame/ScienceQA |
| [alfworld](https://github.com/alfworld/alfworld) | Stars | 2020.10 | Microsoft, CMU, UW | Embodied |
| [factorio-learning-environment](https://github.com/JackHopkins/factorio-learning-environment) | Stars | 2021.6 | JackHopkins | Game |
| [jericho](https://github.com/microsoft/jericho) | Stars | 2018.10 | Microsoft, GIT | TextGame |
| [TextWorld](https://github.com/microsoft/TextWorld) | Stars | 2018.6 | Microsoft Research | TextGame |

## Under Review/Waiting for Open Source

- [JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning](https://arxiv.org/abs/2506.19846)
- [Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning](https://arxiv.org/abs/2507.17842)
- [Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning](https://arxiv.org/abs/2508.03501)
- [Acting Less is Reasoning More! Teaching Model to Act Efficiently](https://arxiv.org/abs/2504.14870)
- [Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning](https://arxiv.org/abs/2505.01441)
- [ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents](https://arxiv.org/abs/2508.14040)
- [Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward](https://github.com/antgroup/Research-Venus)
- [MUA-RL: Multi-Turn User-Interacting Agent Reinforcement Learning for Agentic Tool Use](https://github.com/zzwkk/MUA-RL)
- [Understanding Tool-Integrated Reasoning](https://zhongwenxu.notion.site/Understanding-Tool-Integrated-Reasoning-2551c4e140e3805489fadcc802a1ea83)
- [Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning](https://arxiv.org/abs/2508.19828)
- [Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning](https://arxiv.org/abs/2508.19598)
- [SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents](https://arxiv.org/abs/2509.06283)
- [WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents](https://arxiv.org/abs/2509.06501)
- [EnvX: Agentize Everything with Agentic AI](https://arxiv.org/abs/2509.08088)
- [UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning](https://arxiv.org/abs/2509.02544)
- [UI-Venus Technical Report: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833)
- [Agent²: An Agent-Generates-Agent Framework for Reinforcement Learning Automation](https://arxiv.org/abs/2509.13368)
- [Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use](https://arxiv.org/abs/2509.12867v1)
- [Adversarial Reinforcement Learning for Large Language Model Agent Safety](https://arxiv.org/abs/2510.05442)
- [Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction](https://www.arxiv.org/abs/2511.11770)
- [InfoFlow: Reinforcing Search Agent Via Reward Density Optimization](https://arxiv.org/abs/2510.26575)

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=thinkwee/agentsMeetRL&type=Date)](https://www.star-history.com/#thinkwee/agentsMeetRL&Date)

## Citation

If you find this repository useful, please consider citing it:

```bibtex
@misc{agentsMeetRL,
  title={When LLM Agents Meet Reinforcement Learning: A Comprehensive Survey},
  author={AgentsMeetRL Contributors},
  year={2025},
  url={https://github.com/thinkwee/agentsMeetRL}
}
```

---

Made with ❤️ by the AgentsMeetRL community