├── logo.png
├── index.html
└── README.md
/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thinkwee/AgentsMeetRL/HEAD/logo.png
--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
[Single-page dashboard; markup stripped in extraction. Recoverable text: page title "AgentsMeetRL Dashboard"; a "Loading repository data..." placeholder; four stat counters initialized to 0 ("Total Repositories in AgentsMeetRL", "Stars of All Repositories in AgentsMeetRL", "Categories", "AgentsMeetRL Stars"); and two chart headings ("Repositories by Category", "Stars Distribution by Category").]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
25 | [](https://thinkwee.top/amr/)
26 |
27 |
28 |
29 | # When LLM Agents Meet Reinforcement Learning
30 |
31 | **AgentsMeetRL** is an awesome list that summarizes **open-source repositories** for training LLM agents with reinforcement learning:
32 | - 🤖 The criterion for counting a project as an agent project is that it features at least one of the following: multi-turn interaction or tool use (so Tool-Integrated Reasoning (TIR) projects are included in this repo).
33 | - ⚠️ This project is based on code analysis of open-source repositories performed with GitHub Copilot Agent, which may produce unfaithful cases. Although everything is manually reviewed, omissions may remain. If you find any errors, please let us know through issues or PRs - we warmly welcome them!
34 | - 🚀 We focus in particular on the reinforcement learning frameworks, RL algorithms, rewards, and environments each project depends on, as a reference for how these excellent open-source projects make their technical choices. See [Click to view technical details] under each table.
35 | - 🤗 Feel free to submit your own projects anytime - we welcome contributions!
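In practice, the multi-turn + tool-use criterion above describes a rollout loop. A minimal sketch of such a trajectory (with hypothetical `call_llm`/`run_tool` stubs standing in for a real policy model and tool backend; not any listed framework's API):

```python
def call_llm(messages):
    # Stub: a real agent would query the policy model here.
    return {"tool": "calculator", "args": "2+2"} if len(messages) == 1 else {"answer": "4"}

def run_tool(name, args):
    # Stub: a real agent would dispatch to a search engine, sandbox, etc.
    return str(eval(args)) if name == "calculator" else ""

def rollout(task, max_turns=4):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):                  # multi-turn interaction
        action = call_llm(messages)
        if "tool" in action:                    # tool use
            observation = run_tool(action["tool"], action["args"])
            messages.append({"role": "tool", "content": observation})
        else:
            messages.append({"role": "assistant", "content": action["answer"]})
            break
    return messages  # the trajectory that a reward function then scores
```

The RL frameworks below differ mainly in how they collect such trajectories, score them, and turn the scores into policy-gradient updates.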
36 |
37 | Some enumerations:
38 | - Enumeration for Reward Type:
39 |   - External Verifier: e.g., a compiler or math solver
40 |   - Rule-Based: e.g., a LaTeX parser with exact-match scoring
41 |   - Model-Based: e.g., a trained verifier LLM or reward LLM
42 |   - Custom
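The reward types above can be sketched as interchangeable scoring functions. These are illustrative stubs only (the function names and `task` fields are invented for this example); real projects plug in compilers, parsers, or trained reward models:

```python
def external_verifier(response, task):
    # e.g. run a compiler or math solver on the response
    try:
        return 1.0 if eval(response) == task["expected"] else 0.0
    except Exception:
        return 0.0

def rule_based(response, task):
    # e.g. exact-match scoring after parsing
    return 1.0 if response.strip() == task["gold"] else 0.0

def model_based(response, task, judge):
    # e.g. a trained verifier/reward LLM returning a scalar score
    return judge(response, task)

def custom(response, task):
    # project-specific logic, e.g. a length penalty on top of a match
    return rule_based(response, task) - 0.001 * len(response)
```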
43 |
44 | ---
45 |
46 | ## 🔧 Base Framework
47 |
48 |
49 | | Github Repo | 🌟 Stars | Date | Org | Paper Link |
50 | | :----: | :----: | :----: | :----: | :----: |
51 | | [siiRL](https://github.com/sii-research/siiRL) |  | 2025.7 | Shanghai Innovation Institute | [Paper](https://arxiv.org/abs/2507.13833) |
52 | | [slime](https://github.com/THUDM/slime) |  | 2025.6 | Tsinghua University (THUDM) | [blog](https://lmsys.org/blog/2025-07-09-slime/) |
53 | | [agent-lightning](https://github.com/microsoft/agent-lightning) |  | 2025.6 | Microsoft Research | [Paper](https://arxiv.org/abs/2508.03680) |
54 | | [AReaL](https://github.com/inclusionAI/AReaL) |  | 2025.6 | AntGroup/Tsinghua | [Paper](https://arxiv.org/pdf/2505.24298) |
55 | | [ROLL](https://github.com/alibaba/ROLL) |  | 2025.6 | Alibaba | [Paper](https://arxiv.org/pdf/2506.06122) |
56 | | [MARTI](https://github.com/TsinghuaC3I/MARTI) |  | 2025.5 | Tsinghua | -- |
57 | | [RL2](https://github.com/ChenmienTan/RL2) |  | 2025.4 | Accio | -- |
58 | | [verifiers](https://github.com/willccbb/verifiers) |  | 2025.3 | Individual | -- |
59 | | [oat](https://github.com/sail-sg/oat) |  | 2024.11 | NUS/Sea AI | [Paper](https://arxiv.org/pdf/2411.01493) |
60 | | [veRL](https://github.com/volcengine/verl) |  | 2024.10 | ByteDance | [Paper](https://arxiv.org/pdf/2409.19256) |
61 | | [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) |  | 2023.7 | OpenRLHF | [Paper](https://arxiv.org/abs/2405.11143) |
62 | | [trl](https://github.com/huggingface/trl) |  | 2019.11 | HuggingFace | -- |
63 |
64 |
65 | 📋 Click to view technical details
66 |
67 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
68 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
69 | | [siiRL](https://github.com/sii-research/siiRL) | PPO/GRPO/CPGD/MARFT | Multi | Both | Multi | LLM/VLM/LLM-MAS PostTraining | Model/Rule | Planned |
70 | | [slime](https://github.com/THUDM/slime) | GRPO/GSPO/REINFORCE++ | Single | Both | Both | Math/Code | External Verifier | Yes |
71 | | [agent-lightning](https://github.com/microsoft/agent-lightning) | PPO/Custom/Automatic Prompt Optimization | Multi | Outcome | Multi | Calculator/SQL | Model/External/Rule | Yes |
72 | | [AReaL](https://github.com/inclusionAI/AReaL) | PPO | Both | Outcome | Both | Math/Code | External | Yes |
73 | | [ROLL](https://github.com/alibaba/ROLL) | PPO/GRPO/Reinforce++/TOPR/RAFT++ | Multi | Both | Multi | Math/QA/Code/Alignment | All | Yes |
74 | | [MARTI](https://github.com/TsinghuaC3I/MARTI) | PPO/GRPO/REINFORCE++/TTRL | Multi | Both | Multi | Math | All | Yes |
75 | | [RL2](https://github.com/ChenmienTan/RL2) | Dr. GRPO/PPO/DPO | Single | Both | Both | QA/Dialogue | Rule/Model/External | Yes |
76 | | [verifiers](https://github.com/willccbb/verifiers) | GRPO | Multi | Outcome | Both | Reasoning/Math/Code | All | Code |
77 | | [oat](https://github.com/sail-sg/oat) | PPO/GRPO | Single | Outcome | Multi | Math/Alignment | External | No |
78 | | [veRL](https://github.com/volcengine/verl) | PPO/GRPO | Single | Outcome | Both | Math/QA/Reasoning/Search | All | Yes |
79 | | [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) | PPO/REINFORCE++/GRPO/DPO/IPO/KTO/RLOO | Multi | Both | Both | Dialogue/Chat/Completion | Rule/Model/External | Yes |
80 | | [trl](https://github.com/huggingface/trl) | PPO/GRPO/DPO | Single | Both | Single | QA | Custom | No |
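GRPO recurs throughout the tables above. As a rough illustration of the group-relative advantage it computes (a generic sketch, not taken from any listed framework):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and std of its group (responses to the same prompt)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Each group holds several responses sampled for the same prompt; a response's normalized advantage then weights its token log-probabilities in a PPO-style clipped objective, with no learned value function.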
81 |
82 |
83 |
84 | ## 💪 General/MultiTask
85 |
86 | | Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
87 | | :----: | :----: | :----: | :----: | :----: | :----: |
88 | | [DEPO](https://github.com/OpenCausaLab/DEPO) |  | 2025.11 | HKUST/SJTU | [Paper](https://arxiv.org/abs/2511.15392) | LLaMA-Factory |
89 | | [SPEAR](https://github.com/TencentYoutuResearch/SPEAR) |  | 2025.10 | Tencent Youtu Lab | [Paper](https://arxiv.org/abs/2509.22601) | veRL/verl-agent |
90 | | [AgentRL](https://github.com/THUDM/AgentRL) |  | 2025.9 | Tsinghua | [Paper](https://arxiv.org/abs/2510.04206) | veRL |
91 | | [AgentGym-RL](https://github.com/WooooDyy/AgentGym-RL) |  | 2025.9 | Fudan University | [Paper](https://arxiv.org/abs/2509.08755) | veRL |
92 | | [Agent_Foundation_Models](https://github.com/OPPO-PersonalAI/Agent_Foundation_Models) |  | 2025.8 | OPPO Personal AI Lab | [Paper](https://arxiv.org/abs/2508.13167) | veRL |
93 | | [SPA-RL-Agent](https://github.com/WangHanLinHenry/SPA-RL-Agent) |  | 2025.5 | PolyU | [Paper](https://arxiv.org/pdf/2505.20732) | TRL |
94 | | [verl-agent](https://github.com/langfengQ/verl-agent) |  | 2025.5 | NTU/Skywork | [Paper](https://arxiv.org/pdf/2505.10978) | veRL |
95 |
96 |
97 | 📋 Click to view technical details
98 |
99 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
100 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
101 | | [DEPO](https://github.com/OpenCausaLab/DEPO) | KTO + Efficiency Loss | Single | Both | Multi | Agent (BabyAI/WebShop) | Rule | Yes |
102 | | [SPEAR](https://github.com/TencentYoutuResearch/SPEAR) | GRPO/GiGPO + SIL | Single | Both | Multi | Math/Agent | Rule/External | Yes (Search, Sandbox, Browser) |
103 | | [AgentRL](https://github.com/THUDM/AgentRL) | GRPO/REINFORCE++/RLOO/ReMax/GAE | Single | Outcome | Multi | Agent Tasks | External | Yes |
104 | | [AgentGym-RL](https://github.com/WooooDyy/AgentGym-RL) | PPO/GRPO/RLOO/REINFORCE++ | Single | Outcome | Multi | Web/Search/Game/Embodied/Science | Rule/Model/External | Yes (Web, Search, Env APIs) |
105 | | [Agent_Foundation_Models](https://github.com/OPPO-PersonalAI/Agent_Foundation_Models) | DAPO/PPO | Single | Outcome | Single | QA/Code/Math | Rule/External | Yes |
106 | | [SPA-RL-Agent](https://github.com/WangHanLinHenry/SPA-RL-Agent) | PPO | Single | Process | Multi | Navigation/Web/TextGame | Model | No |
107 | | [verl-agent](https://github.com/langfengQ/verl-agent) | PPO/GRPO/GiGPO/DAPO/RLOO/REINFORCE++ | Multi | Both | Multi | Phone Use/Math/Code/Web/TextGame | All | Yes |
108 |
109 |
110 |
111 | ## 🔍 Search/Research/Web
112 |
113 |
114 | | Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
115 | | :----: | :----: | :----: | :----: | :----: | :----: |
116 | | [ReSeek](https://github.com/TencentBAC/ReSeek) |  | 2025.10 | Tencent PCG BAC/Tsinghua University | [Paper](https://arxiv.org/abs/2510.00568) | veRL |
117 | | [Tree-GRPO](https://github.com/AMAP-ML/Tree-GRPO) |  | 2025.9 | AMAP | [Paper](https://arxiv.org/abs/2509.21240) | veRL |
118 | | [ASearcher](https://github.com/inclusionAI/ASearcher) |  | 2025.8 | Ant Research RL Lab/Tsinghua University & UW | [Paper](https://arxiv.org/abs/2508.07976) | RealHF/AReaL |
119 | | [Kimi-Researcher](https://github.com/moonshotai/Kimi-Researcher) |  | 2025.6 | Moonshot AI | [blog](https://moonshotai.github.io/Kimi-Researcher/) | Custom |
120 | | [TTI](https://github.com/test-time-interaction/TTI) |  | 2025.6 | CMU | [Paper](https://arxiv.org/abs/2506.07976) | Custom |
121 | | [R-Search](https://github.com/QingFei1/R-Search) |  | 2025.6 | Individual | -- | veRL |
122 | | [R1-Searcher-plus](https://github.com/RUCAIBox/R1-Searcher-plus) |  | 2025.5 | RUC | [Paper](https://arxiv.org/pdf/2505.17005) | Custom |
123 | | [StepSearch](https://github.com/Zillwang/StepSearch) |  | 2025.5 | SenseTime | [Paper](https://arxiv.org/pdf/2505.15107) | veRL |
124 | | [AutoRefine](https://github.com/syr-cn/AutoRefine) |  | 2025.5 | USTC | [Paper](https://www.arxiv.org/pdf/2505.11277) | veRL |
125 | | [ZeroSearch](https://github.com/Alibaba-NLP/ZeroSearch) |  | 2025.5 | Alibaba | [Paper](https://arxiv.org/pdf/2505.04588) | veRL |
126 | | [WebThinker](https://github.com/RUC-NLPIR/WebThinker) |  | 2025.4 | RUC | [Paper](https://arxiv.org/pdf/2504.21776) | Custom |
127 | | [DeepResearcher](https://github.com/GAIR-NLP/DeepResearcher) |  | 2025.4 | SJTU | [Paper](https://arxiv.org/pdf/2504.03160) | veRL |
128 | | [Search-R1](https://github.com/PeterGriffinJin/Search-R1) |  | 2025.3 | UIUC/Google | [paper1](https://arxiv.org/pdf/2503.09516), [paper2](https://arxiv.org/pdf/2505.15117) | veRL |
129 | | [R1-Searcher](https://github.com/RUCAIBox/R1-Searcher) |  | 2025.3 | RUC | [Paper](https://arxiv.org/pdf/2503.05592) | OpenRLHF |
130 | | [C-3PO](https://github.com/Chen-GX/C-3PO) |  | 2025.2 | Alibaba | [Paper](https://arxiv.org/pdf/2502.06205) | OpenRLHF |
131 | | [Search-o1](https://github.com/RUC-NLPIR/Search-o1) |  | 2025.1 | Renmin University of China (RUC) | [Paper](https://arxiv.org/abs/2501.05366) | N/A (Inference Only) |
132 | | [WebAgent](https://github.com/Alibaba-NLP/WebAgent) |  | 2025.1 | Alibaba | [paper1](https://arxiv.org/pdf/2501.07572), [paper2](https://arxiv.org/pdf/2505.22648) | LLaMA-Factory |
133 |
134 |
135 | 📋 Click to view technical details
136 |
137 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
138 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
139 | | [ReSeek](https://github.com/TencentBAC/ReSeek) | GRPO/PPO | Single | Both | Multi | QA/Search | Rule | Search/JUDGE |
140 | | [Tree-GRPO](https://github.com/AMAP-ML/Tree-GRPO) | GRPO/Tree-GRPO | Single | Outcome | Multi | Search | Rule | Search |
141 | | [ASearcher](https://github.com/inclusionAI/ASearcher) | PPO/GRPO + Decoupled PPO | Single | Outcome | Multi | Math/Code/SearchQA | External/Rule | Yes |
142 | | [Kimi-Researcher](https://github.com/moonshotai/Kimi-Researcher) | REINFORCE | Single | Outcome | Multi | Research | Outcome | Search, Browse, Coding |
143 | | [TTI](https://github.com/test-time-interaction/TTI) | REINFORCE/BC | Single | Outcome | Multi | Web | External | Web Browsing |
144 | | [R-Search](https://github.com/QingFei1/R-Search) | PPO/GRPO | Single | Both | Multi | QA/Search | All | Yes |
145 | | [R1-Searcher-plus](https://github.com/RUCAIBox/R1-Searcher-plus) | Custom | Single | Outcome | Multi | Search | Model | Search |
146 | | [StepSearch](https://github.com/Zillwang/StepSearch) | PPO | Single | Process | Multi | QA | Model | Search |
147 | | [AutoRefine](https://github.com/syr-cn/AutoRefine) | PPO/GRPO | Multi | Both | Multi | RAG QA | Rule | Search |
148 | | [ZeroSearch](https://github.com/Alibaba-NLP/ZeroSearch) | PPO/GRPO/REINFORCE | Single | Outcome | Multi | QA/Search | Rule | Yes |
149 | | [WebThinker](https://github.com/RUC-NLPIR/WebThinker) | DPO | Single | Outcome | Multi | Reasoning/QA/Research | Model/External | Web Browsing |
150 | | [DeepResearcher](https://github.com/GAIR-NLP/DeepResearcher) | PPO/GRPO | Multi | Outcome | Multi | Research | All | Yes |
151 | | [Search-R1](https://github.com/PeterGriffinJin/Search-R1) | PPO/GRPO | Single | Outcome | Multi | Search | All | Search |
152 | | [R1-Searcher](https://github.com/RUCAIBox/R1-Searcher) | PPO/DPO | Single | Both | Multi | Search | All | Yes |
153 | | [C-3PO](https://github.com/Chen-GX/C-3PO) | PPO | Multi | Outcome | Multi | Search | Model | Yes |
154 | | [Search-o1](https://github.com/RUC-NLPIR/Search-o1) | N/A | Single | N/A | Multi | Math/Science QA/Code/Open QA | N/A | Web Search |
155 | | [WebAgent](https://github.com/Alibaba-NLP/WebAgent) | DAPO | Multi | Process | Multi | Web | Model | Yes |
156 |
157 |
158 |
159 | ## 📱 GUI
160 |
161 |
162 | | Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
163 | | :----: | :----: | :----: | :----: | :----: | :----: |
164 | | [MobileAgent](https://github.com/X-PLUG/MobileAgent) |  | 2025.9 | X-PLUG (TongyiQwen) | [paper](https://arxiv.org/abs/2509.11543) | veRL |
165 | | [InfiGUI-G1](https://github.com/InfiXAI/InfiGUI-G1) |  | 2025.8 | InfiX AI | [Paper](https://arxiv.org/abs/2508.05731) | veRL |
166 | | [Grounding-R1](https://github.com/Yan98/Grounding-R1) |  | 2025.6 | Salesforce | [blog](https://huggingface.co/blog/HelloKKMe/grounding-r1) | trl |
167 | | [AgentCPM-GUI](https://github.com/OpenBMB/AgentCPM-GUI) |  | 2025.6 | OpenBMB/Tsinghua/RUC | [Paper](https://arxiv.org/pdf/2506.01391) | Huggingface |
168 | | [SE-GUI](https://github.com/YXB-NKU/SE-GUI) |  | 2025.5 | Nankai University/vivo | [Paper](https://arxiv.org/pdf/2505.12370) | trl |
169 | | [ARPO](https://github.com/dvlab-research/ARPO) |  | 2025.5 | CUHK/HKUST | [Paper](https://arxiv.org/pdf/2505.16282) | veRL |
170 | | [GUI-G1](https://github.com/Yuqi-Zhou/GUI-G1) |  | 2025.5 | RUC | [Paper](https://arxiv.org/pdf/2505.15810) | TRL |
171 | | [GUI-R1](https://github.com/ritzz-ai/GUI-R1) |  | 2025.4 | CAS/NUS | [Paper](https://arxiv.org/pdf/2504.10458) | veRL |
172 | | [UI-R1](https://github.com/lll6gg/UI-R1) |  | 2025.3 | vivo/CUHK | [Paper](https://arxiv.org/pdf/2503.21620) | TRL |
173 |
174 |
175 | 📋 Click to view technical details
176 |
177 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
178 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
179 | | [MobileAgent](https://github.com/X-PLUG/MobileAgent) | semi-online RL | Single | Both | Multi | MobileGUI/Automation | Rule | Yes |
180 | | [InfiGUI-G1](https://github.com/InfiXAI/InfiGUI-G1) | AEPO | Single | Outcome | Single | GUI/Grounding | Rule | No |
181 | | [Grounding-R1](https://github.com/Yan98/Grounding-R1) | GRPO | Single | Outcome | Multi | GUI Grounding | Model | Yes |
182 | | [AgentCPM-GUI](https://github.com/OpenBMB/AgentCPM-GUI) | GRPO | Single | Outcome | Multi | Mobile GUI | Model | Yes |
183 | | [SE-GUI](https://github.com/YXB-NKU/SE-GUI) | GRPO | Single | Both | Single | GUI Grounding | Rule | Yes |
184 | | [ARPO](https://github.com/dvlab-research/ARPO) | GRPO | Single | Outcome | Multi | GUI | External | Computer Use |
185 | | [GUI-G1](https://github.com/Yuqi-Zhou/GUI-G1) | GRPO | Single | Outcome | Single | GUI | Rule/External | No |
186 | | [GUI-R1](https://github.com/ritzz-ai/GUI-R1) | GRPO | Single | Outcome | Multi | GUI | Rule | No |
187 | | [UI-R1](https://github.com/lll6gg/UI-R1) | GRPO | Single | Process | Both | GUI | Rule | Computer/Phone Use |
188 |
189 |
190 |
191 | ## 🔨 Tool
192 |
193 |
194 | | Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
195 | | :----: | :----: | :----: | :----: | :----: | :----: |
196 | | [MiroRL](https://github.com/MiroMindAI/MiroRL) |  | 2025.8 | MiroMindAI | [HF Repo](https://huggingface.co/miromind-ai) | veRL |
197 | | [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) |  | 2025.6 | TIGER-Lab | [X](https://x.com/DongfuJiang/status/1929198238017720379) | veRL |
198 | | [Multi-Turn-RL-Agent](https://github.com/SiliangZeng/Multi-Turn-RL-Agent) |  | 2025.5 | University of Minnesota | [Paper](https://arxiv.org/pdf/2505.11821) | Custom |
199 | | [Tool-N1](https://github.com/NVlabs/Tool-N1) |  | 2025.5 | NVIDIA | [Paper](https://arxiv.org/pdf/2505.00024) | veRL |
200 | | [Tool-Star](https://github.com/dongguanting/Tool-Star) |  | 2025.5 | RUC | [Paper](https://arxiv.org/pdf/2505.16410) | LLaMA-Factory |
201 | | [RL-Factory](https://github.com/Simple-Efficient/RL-Factory) |  | 2025.5 | Simple-Efficient | [model](https://huggingface.co/Simple-Efficient/RLFactory-Qwen3-8B-GRPO) | veRL |
202 | | [ReTool](https://github.com/ReTool-RL/ReTool) |  | 2025.4 | ByteDance | [Paper](https://arxiv.org/pdf/2504.11536) | veRL |
203 | | [AWorld](https://github.com/inclusionAI/AWorld) |  | 2025.3 | Ant Group (inclusionAI) | [Paper](https://arxiv.org/abs/2508.20404) | veRL |
204 | | [Agent-R1](https://github.com/0russwest0/Agent-R1) |  | 2025.3 | USTC | [Paper](https://arxiv.org/abs/2511.14460) | veRL |
205 | | [ReCall](https://github.com/Agent-RL/ReCall) |  | 2025.3 | BaiChuan | [Paper](https://arxiv.org/pdf/2503.19470) | veRL |
206 |
207 |
208 | 📋 Click to view technical details
209 |
210 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
211 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
212 | | [MiroRL](https://github.com/MiroMindAI/MiroRL) | GRPO | Single | Both | Multi | Reasoning/Planning/ToolUse | Rule-based | MCP |
213 | | [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) | PPO/GRPO | Single | Both | Both | Math/Code | Rule/External | Yes |
214 | | [Multi-Turn-RL-Agent](https://github.com/SiliangZeng/Multi-Turn-RL-Agent) | GRPO | Single | Both | Multi | Tool-use/Math | Rule/External | Yes |
215 | | [Tool-N1](https://github.com/NVlabs/Tool-N1) | PPO | Single | Outcome | Multi | Math/Dialogue | All | Yes |
216 | | [Tool-Star](https://github.com/dongguanting/Tool-Star) | PPO/DPO/ORPO/SimPO/KTO | Single | Outcome | Multi | Multi-modal/Tool Use/Dialogue | Model/External | Yes |
217 | | [RL-Factory](https://github.com/Simple-Efficient/RL-Factory) | GRPO | Multi | Both | Multi | Tool-use/NL2SQL | All | MCP |
218 | | [ReTool](https://github.com/ReTool-RL/ReTool) | PPO | Single | Outcome | Multi | Math | External | Code |
219 | | [AWorld](https://github.com/inclusionAI/AWorld) | GRPO | Both | Outcome | Multi | Search/Web/Code | External/Rule | Yes |
220 | | [Agent-R1](https://github.com/0russwest0/Agent-R1) | PPO/GRPO | Single | Both | Multi | Tool-use/QA | Model | Yes |
221 | | [ReCall](https://github.com/Agent-RL/ReCall) | PPO/GRPO/RLOO/REINFORCE++/ReMax | Single | Outcome | Multi | Tool-use/Math/QA | All | Yes |
222 |
223 |
224 |
225 | ## 🎮 TextGame
226 |
227 |
228 | | Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
229 | | :----: | :----: | :----: | :----: | :----: | :----: |
230 | | [ARIA](https://github.com/rhyang2021/ARIA) |  | 2025.6 | Fudan University | [Paper](https://arxiv.org/abs/2506.00539) | Custom |
231 | | [AMPO](https://github.com/MozerWang/AMPO) |  | 2025.5 | Tongyi Lab, Alibaba | [Paper](https://arxiv.org/abs/2505.02156) | veRL |
232 | | [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) |  | 2025.5 | Alibaba | [Paper](https://arxiv.org/pdf/2505.17826) | veRL |
233 | | [VAGEN](https://github.com/RAGEN-AI/VAGEN) |  | 2025.3 | RAGEN-AI | [Paper](https://www.notion.so/VAGEN-Training-VLM-Agents-with-Multi-Turn-Reinforcement-Learning-1bfde13afb6e80b792f6d80c7c2fcad0) | veRL |
234 | | [ART](https://github.com/OpenPipe/ART) |  | 2025.3 | OpenPipe | [Paper](https://github.com/OpenPipe/ART#-citation) | TRL |
235 | | [OpenManus-RL](https://github.com/OpenManus/OpenManus-RL) |  | 2025.3 | UIUC/MetaGPT | -- | Custom |
236 | | [RAGEN](https://github.com/RAGEN-AI/RAGEN) |  | 2025.1 | RAGEN-AI | [Paper](https://arxiv.org/pdf/2504.20073) | veRL |
237 |
238 |
239 | 📋 Click to view technical details
240 |
241 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
242 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
243 | | [ARIA](https://github.com/rhyang2021/ARIA) | REINFORCE | Both | Process | Multi | Negotiation/Bargaining | Other | No |
244 | | [AMPO](https://github.com/MozerWang/AMPO) | BC/AMPO(GRPO improvement) | Multi | Outcome | Multi | Social Interaction | Model-based | No |
245 | | [Trinity-RFT](https://github.com/modelscope/Trinity-RFT) | PPO/GRPO | Single | Outcome | Both | Math/TextGame/Web | All | Yes |
246 | | [VAGEN](https://github.com/RAGEN-AI/VAGEN) | PPO/GRPO | Single | Both | Multi | TextGame/Navigation | All | Yes |
247 | | [ART](https://github.com/OpenPipe/ART) | GRPO | Multi | Both | Multi | TextGame | All | Yes |
248 | | [OpenManus-RL](https://github.com/OpenManus/OpenManus-RL) | PPO/DPO/GRPO | Multi | Outcome | Multi | TextGame | All | Yes |
249 | | [RAGEN](https://github.com/RAGEN-AI/RAGEN) | PPO/GRPO | Single | Both | Multi | TextGame | All | Yes |
250 |
251 |
252 |
253 | ## 💻 Code
254 |
255 |
256 | | Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
257 | | :----: | :----: | :----: | :----: | :----: | :----: |
258 | | [PPP-Agent](https://github.com/sunnweiwei/PPP-Agent) |  | 2025.11 | CMU/OpenHands | [Paper](https://arxiv.org/abs/2511.02208) | veRL |
259 | | [RepoDeepSearch](https://github.com/Mizersy/RepoDeepSearch) |  | 2025.8 | PKU, Bytedance, BIT | [Paper](https://arxiv.org/abs/2508.03012) | veRL |
260 | | [MedAgentGym](https://github.com/wshi83/MedAgentGym) |  | 2025.6 | Emory/Georgia Tech | [Paper](https://arxiv.org/pdf/2506.04405) | Huggingface |
261 | | [CURE](https://github.com/Gen-Verse/CURE) |  | 2025.6 | University of Chicago/Princeton/ByteDance | [Paper](https://arxiv.org/pdf/2506.03136) | Huggingface |
262 | | [MASLab](https://github.com/MASWorks/MASLab) |  | 2025.5 | MASWorks | [Paper](https://arxiv.org/pdf/2505.16988) | Custom |
263 | | [Time-R1](https://github.com/ulab-uiuc/Time-R1) |  | 2025.5 | UIUC | [Paper](https://arxiv.org/pdf/2505.13508) | veRL |
264 | | [ML-Agent](https://github.com/MASWorks/ML-Agent) |  | 2025.5 | MASWorks | [Paper](https://arxiv.org/pdf/2505.23723) | Custom |
265 | | [SkyRL](https://github.com/NovaSky-AI/SkyRL) |  | 2025.4 | NovaSky | [Paper](https://arxiv.org/abs/2511.16108) | veRL |
266 | | [digitalhuman](https://github.com/Tencent/digitalhuman) |  | 2025.4 | Tencent | [Paper](https://arxiv.org/abs/2507.03112) | veRL |
267 | | [sweet_rl](https://github.com/facebookresearch/sweet_rl) |  | 2025.3 | Meta/UCB | [Paper](https://arxiv.org/pdf/2503.15478) | OpenRLHF |
268 | | [rllm](https://github.com/agentica-project/rllm) |  | 2025.1 | Berkeley Sky Computing Lab/BAIR/Together AI | [Notion Blog](https://pretty-radio-b75.notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31) | veRL |
269 | | [open-r1](https://github.com/huggingface/open-r1) |  | 2025.1 | HuggingFace | -- | TRL |
270 |
271 |
272 | 📋 Click to view technical details
273 |
274 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
275 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
276 | | [PPP-Agent](https://github.com/sunnweiwei/PPP-Agent) | PPP-RL | Single | Both | Multi | SWE/Research | Rule+Model | Search, Ask, Browse |
277 | | [RepoDeepSearch](https://github.com/Mizersy/RepoDeepSearch) | GRPO | Single | Both | Multi | Search/Repair | Rule/External | Yes |
278 | | [MedAgentGym](https://github.com/wshi83/MedAgentGym) | SFT/DPO/PPO/GRPO | Single | Outcome | Multi | Medical/Code | External | Yes |
279 | | [CURE](https://github.com/Gen-Verse/CURE) | PPO | Single | Outcome | Single | Code | External | No |
280 | | [MASLab](https://github.com/MASWorks/MASLab) | NO RL | Multi | Outcome | Multi | Code/Math/Reasoning | External | Yes |
281 | | [Time-R1](https://github.com/ulab-uiuc/Time-R1) | PPO/GRPO/DPO | Multi | Outcome | Multi | Temporal | All | Code |
282 | | [ML-Agent](https://github.com/MASWorks/ML-Agent) | Custom | Single | Process | Multi | Code | All | Yes |
283 | | [SkyRL](https://github.com/NovaSky-AI/SkyRL) | PPO/GRPO | Single | Outcome | Multi | Math/Code | All | Code |
284 | | [digitalhuman](https://github.com/Tencent/digitalhuman) | PPO/GRPO/ReMax/RLOO | Multi | Outcome | Multi | Empathy/Math/Code/MultimodalQA | Rule/Model/External | Yes |
285 | | [sweet_rl](https://github.com/facebookresearch/sweet_rl) | DPO | Multi | Process | Multi | Design/Code | Model | Web Browsing |
286 | | [rllm](https://github.com/agentica-project/rllm) | PPO/GRPO | Single | Outcome | Multi | Code Edit | External | Yes |
287 | | [open-r1](https://github.com/huggingface/open-r1) | GRPO | Single | Outcome | Single | Math/Code | All | Yes |
288 |
289 |
290 |
291 | ## 🤔 QA(Reasoning/Math)
292 |
293 |
294 | | Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
295 | | :----: | :----: | :----: | :----: | :----: | :----: |
296 | | [SafeSearch](https://github.com/amazon-science/SafeSearch) |  | 2025.11 | Amazon Science | [Paper](https://arxiv.org/abs/2510.17017) | veRL |
297 | | [Agent0](https://github.com/aiming-lab/Agent0) |  | 2025.10 | UNC-Chapel Hill/Salesforce Research/Stanford University | [Paper](https://arxiv.org/abs/2511.16043) | veRL |
298 | | [KG-R1](https://github.com/Jinyeop3110/KG-R1) |  | 2025.9 | UIUC/Google | [Paper1](https://arxiv.org/pdf/2503.09516), [Paper2](https://arxiv.org/abs/2505.15117) | veRL |
299 | | [AgentFlow](https://github.com/lupantech/AgentFlow) |  | 2025.9 | Stanford University | [arXiv](https://arxiv.org/abs/2510.05592) | veRL |
300 | | [ARPO](https://github.com/dongguanting/ARPO) |  | 2025.7 | RUC, Kuaishou | [Paper](https://arxiv.org/abs/2507.19849) | veRL |
301 | | [terminal-bench-rl](https://github.com/Danau5tin/terminal-bench-rl) |  | 2025.7 | Individual (Danau5tin) | N/A | rLLM |
302 | | [MOTIF](https://github.com/purbeshmitra/MOTIF) |  | 2025.6 | University of Maryland | [Paper](https://arxiv.org/abs/2507.02851) | trl |
303 | | [cmriat/l0](https://github.com/cmriat/l0) |  | 2025.6 | CMRIAT | [Paper](https://arxiv.org/abs/2506.23667) | veRL |
304 | | [agent-distillation](https://github.com/Nardien/agent-distillation) |  | 2025.5 | KAIST | [Paper](https://arxiv.org/pdf/2505.17612) | Custom |
305 | | [VDeepEyes](https://github.com/Visual-Agent/DeepEyes) |  | 2025.5 | Xiaohongshu/XJTU | [Paper](https://arxiv.org/pdf/2505.14362) | veRL |
306 | | [EasyR1](https://github.com/hiyouga/EasyR1) |  | 2025.4 | Individual | [repo1](https://github.com/hiyouga/EasyR1)/[paper2](https://arxiv.org/pdf/2409.19256) | veRL |
307 | | [AutoCoA](https://github.com/ADaM-BJTU/AutoCoA) |  | 2025.3 | BJTU | [Paper](https://arxiv.org/pdf/2503.06580) | veRL |
308 | | [ToRL](https://github.com/GAIR-NLP/ToRL) |  | 2025.3 | SJTU | [Paper](https://arxiv.org/pdf/2503.23383) | veRL |
309 | | [ReMA](https://github.com/ziyuwan/ReMA-public) |  | 2025.3 | SJTU, UCL | [Paper](https://arxiv.org/pdf/2503.09501) | veRL |
310 | | [Agentic-Reasoning](https://github.com/theworldofagents/Agentic-Reasoning) |  | 2025.2 | Oxford | [Paper](https://arxiv.org/pdf/2502.04644) | Custom |
311 | | [SimpleTIR](https://github.com/ltzheng/SimpleTIR) |  | 2025.2 | NTU, Bytedance | [Notion Blog](https://simpletir.notion.site/report) | veRL |
312 | | [openrlhf_async_pipline](https://github.com/yyht/openrlhf_async_pipline) |  | 2024.5 | OpenRLHF | [Paper](https://arxiv.org/pdf/2405.11143) | OpenRLHF |
313 |
314 |
315 | 📋 Click to view technical details
316 |
317 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
318 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
319 | | [SafeSearch](https://github.com/amazon-science/SafeSearch) | PPO (GAE/GRPO) | Single | Both | Multi | QA/Search | Rule + Model | Search |
320 | | [Agent0](https://github.com/aiming-lab/Agent0) | ADPO | Multi | Process | Multi | Math/Visual | Model/Verifier | Yes |
321 | | [KG-R1](https://github.com/Jinyeop3110/KG-R1) | GRPO/PPO | Single | Both | Multi | KGQA | Rule/Model | KG Retrieval |
322 | | [AgentFlow](https://github.com/lupantech/AgentFlow) | Flow-GRPO | Single | Outcome | Multi | Search/Math/QA | Model/External | Yes |
323 | | [ARPO](https://github.com/dongguanting/ARPO) | GRPO | Single | Outcome | Multi | Math/Coding | Model/Rule | Yes |
324 | | [terminal-bench-rl](https://github.com/Danau5tin/terminal-bench-rl) | GRPO | Single | Outcome | Multi | Coding/Terminal | Model+External Verifier | Yes |
325 | | [MOTIF](https://github.com/purbeshmitra/MOTIF) | GRPO | Single | Outcome | Multi | QA | Rule | No |
326 | | [cmriat/l0](https://github.com/cmriat/l0) | PPO | Multi | Process | Multi | QA | All | Yes |
327 | | [agent-distillation](https://github.com/Nardien/agent-distillation) | PPO | Single | Process | Multi | QA/Math | External | Yes |
328 | | [VDeepEyes](https://github.com/Visual-Agent/DeepEyes) | PPO/GRPO | Multi | Process | Multi | VQA | All | Yes |
329 | | [EasyR1](https://github.com/hiyouga/EasyR1) | GRPO | Single | Process | Multi | Vision-Language | Model | Yes |
330 | | [AutoCoA](https://github.com/ADaM-BJTU/AutoCoA) | GRPO | Multi | Outcome | Multi | Reasoning/Math/QA | All | Yes |
331 | | [ToRL](https://github.com/GAIR-NLP/ToRL) | GRPO | Single | Outcome | Single | Math | Rule/External | Yes |
332 | | [ReMA](https://github.com/ziyuwan/ReMA-public) | PPO | Multi | Outcome | Multi | Math | Rule | No |
333 | | [Agentic-Reasoning](https://github.com/theworldofagents/Agentic-Reasoning) | Custom | Single | Process | Multi | QA/Math | External | Web Browsing |
334 | | [SimpleTIR](https://github.com/ltzheng/SimpleTIR) | PPO/GRPO (with extensions) | Single | Outcome | Multi | Math, Coding | All | Yes |
335 | | [openrlhf_async_pipline](https://github.com/yyht/openrlhf_async_pipline) | PPO/REINFORCE++/DPO/RLOO | Single | Outcome | Multi | Dialogue/Reasoning/QA | All | No |
336 |
337 |
338 |
339 | ## 🧠 Memory
340 |
341 |
342 | | Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
343 | | :----: | :----: | :----: | :----: | :----: | :----: |
344 | | [MEM1](https://github.com/MIT-MI/MEM1) |  | 2025.7 | MIT | [Paper](https://arxiv.org/abs/2506.15841) | veRL (based on Search-R1) |
345 | | [Memento](https://github.com/Agent-on-the-Fly/Memento) |  | 2025.6 | UCL, Huawei | [Paper](https://arxiv.org/abs/2508.16153) | Custom |
346 | | [MemAgent](https://github.com/BytedTsinghua-SIA/MemAgent) |  | 2025.6 | Bytedance, Tsinghua-SIA | [Paper](https://arxiv.org/abs/2507.02259) | veRL |
347 |
348 |
349 | 📋 Click to view technical details
350 |
351 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
352 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
353 | | [MEM1](https://github.com/MIT-MI/MEM1) | PPO/GRPO | Single | Outcome | Multi | WebShop/GSM8K/QA | Rule/Model | Yes |
354 | | [Memento](https://github.com/Agent-on-the-Fly/Memento) | soft Q-Learning | Single | Outcome | Multi | Research/QA/Code/Web | External/Rule | Yes |
355 | | [MemAgent](https://github.com/BytedTsinghua-SIA/MemAgent) | PPO, GRPO, DPO | Multi | Outcome | Multi | Long-context QA | Rule/Model/External | Yes |
356 |
357 |
358 |
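The tables classify repos by Outcome vs. Process reward. A toy sketch of the distinction — one scalar for the whole trajectory versus one score per intermediate step (both functions and the trajectory are hypothetical, not from any listed repo):

```python
def outcome_reward(final_answer, gold):
    """Rule-based outcome reward: a single scalar assigned only
    at the end of the trajectory (exact match here)."""
    return 1.0 if final_answer.strip() == gold.strip() else 0.0

def process_rewards(steps, step_checker):
    """Process reward: one scalar per intermediate step; step_checker
    stands in for a verifier model or rule-based step judge."""
    return [step_checker(s) for s in steps]

# A trajectory whose final answer is right but whose middle step is flagged:
traj = ["parse question", "call wrong tool", "answer: 42"]
r_out = outcome_reward("42", "42")
r_proc = process_rewards(traj, lambda s: 0.0 if "wrong" in s else 1.0)
```

Outcome rewards are cheap and hard to hack but sparse; process rewards densify the signal at the cost of needing a trustworthy per-step judge, which is why many repos above mix both ("Both" in the table).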
359 | ## 🦾 Embodied
360 |
361 |
362 | | Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
363 | | :----: | :----: | :----: | :----: | :----: | :----: |
364 | | [Embodied-R1](https://github.com/pickxiguapi/Embodied-R1) |  | 2025.6 | Tianjin University | [Paper](http://arxiv.org/abs/2508.13998) | veRL |
365 | | [STeCa](https://github.com/WangHanLinHenry/STeCa) |  | 2025.2 | The Hong Kong Polytechnic University | [Paper](https://arxiv.org/abs/2502.14276) | FastChat/TRL |
366 |
367 |
368 |
369 |
370 | 📋 Click to view technical details
371 |
372 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
373 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
374 | | [Embodied-R1](https://github.com/pickxiguapi/Embodied-R1) | GRPO | Single | Outcome | Single | Grounding/Waypoint | Rule | No |
375 | | [STeCa](https://github.com/WangHanLinHenry/STeCa) | DPO (RFT) | Single | Both | Multi | Embodied/Household | Rule/MC | Environment Actions |
376 |
377 |
378 |
379 |
380 | ## 🏥 Biomedical
381 |
382 |
383 | | Github Repo | 🌟 Stars | Date | Org | Paper Link | RL Framework |
384 | | :----: | :----: | :----: | :----: | :----: | :----: |
385 | | [MMedAgent-RL](https://github.com/JanerhYang/MMedAgent-RL) |  | 2025.8 | Unknown | [Paper](https://arxiv.org/abs/2506.00555) | Unknown |
386 | | [DoctorAgent-RL](https://github.com/JarvisUSTC/DoctorAgent-RL) |  | 2025.5 | UCAS/CAS/USTC | [Paper](https://arxiv.org/pdf/2505.19630) | RAGEN |
387 | | [Biomni](https://github.com/snap-stanford/Biomni) |  | 2025.3 | Stanford University (SNAP) | [Paper](https://www.biorxiv.org/content/10.1101/2025.05.30.656746v1) | Custom |
388 |
389 |
390 |
391 | 📋 Click to view technical details
392 |
393 | | Github Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool usage |
394 | | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
395 | | [MMedAgent-RL](https://github.com/JanerhYang/MMedAgent-RL) | Unknown | Multi | Unknown | Unknown | Unknown | Unknown | Unknown |
396 | | [DoctorAgent-RL](https://github.com/JarvisUSTC/DoctorAgent-RL) | GRPO | Multi | Both | Multi | Consultation/Diagnosis | Model/Rule | No |
397 | | [Biomni](https://github.com/snap-stanford/Biomni) | TBD | Single | TBD | Single | scRNAseq/CRISPR/ADMET/Knowledge | TBD | Yes |
398 |
399 |
400 |
401 |
402 |
403 | ## ⛰️ Environment
404 |
405 | | Github Repo | 🌟 Stars | Date | Org | Task |
406 | | :----: | :----: | :----: | :----: | :----: |
407 | | [LoCoBench-Agent](https://github.com/SalesforceAIResearch/LoCoBench-Agent) |  | 2025.11 | Salesforce AI Research | SWE |
408 | | [Simia-Agent-Training](https://github.com/microsoft/Simia-Agent-Training) |  | 2025.10 | Microsoft | ToolUse/API |
409 | | [PaperArena](https://github.com/Melmaphother/PaperArena) |  | 2025.9 | University of Science and Technology of China | ScientificLiteratureQA |
410 | | [enterprise-deep-research](https://github.com/SalesforceAIResearch/enterprise-deep-research) |  | 2025.9 | Salesforce AI Research | DeepResearch |
411 | | [CompassVerifier](https://github.com/open-compass/CompassVerifier) |  | 2025.7 | Shanghai AI Lab | Knowledge/Math/Science/GeneralReasoning |
412 | | [Mind2Web-2](https://github.com/OSU-NLP-Group/Mind2Web-2) |  | 2025.6 | Ohio State University | Web |
413 | | [gem](https://github.com/axon-rl/gem) |  | 2025.5 | Sea AI Lab | Math/Code/Game/QA |
414 | | [MLE-Dojo](https://github.com/MLE-Dojo/MLE-Dojo) |  | 2025.5 | GIT, Stanford | MLE |
415 | | [atropos](https://github.com/NousResearch/atropos) |  | 2025.4 | Nous Research | Game/Code/Tool |
416 | | [InternBootcamp](https://github.com/InternLM/InternBootcamp) |  | 2025.4 | InternLM | Coding/QA/Game |
417 | | [loong](https://github.com/camel-ai/loong) |  | 2025.3 | CAMEL-AI.org | RLVR |
418 | | [DataSciBench](https://github.com/THUDM/DataSciBench) |  | 2025.2 | Tsinghua | Data Analysis |
419 | | [reasoning-gym](https://github.com/open-thought/reasoning-gym) |  | 2025.1 | open-thought | Math/Game |
420 | | [llmgym](https://github.com/tensorzero/llmgym) |  | 2025.1 | tensorzero | TextGame/Tool |
421 | | [debug-gym](https://github.com/microsoft/debug-gym) |  | 2024.11 | Microsoft Research | Debugging/Game/Code |
422 | | [gym-llm](https://github.com/rsanchezmo/gym-llm) |  | 2024.8 | Rodrigo Sánchez Molina | Control/Game |
423 | | [AgentGym](https://github.com/WooooDyy/AgentGym) |  | 2024.6 | Fudan | Web/Game |
424 | | [tau-bench](https://github.com/sierra-research/tau-bench) |  | 2024.6 | Sierra | Tool |
425 | | [appworld](https://github.com/StonyBrookNLP/appworld) |  | 2024.6 | Stony Brook University | Phone Use |
426 | | [android_world](https://github.com/google-research/android_world) |  | 2024.5 | Google Research | Phone Use |
427 | | [TheAgentCompany](https://github.com/TheAgentCompany/TheAgentCompany) |  | 2024.3 | CMU, Duke | Coding |
428 | | [LlamaGym](https://github.com/KhoomeiK/LlamaGym) |  | 2024.3 | Rohan Pandey | Game |
429 | | [visualwebarena](https://github.com/web-arena-x/visualwebarena) |  | 2024.1 | CMU | Web |
430 | | [LMRL-Gym](https://github.com/abdulhaim/LMRL-Gym) |  | 2023.12 | UC Berkeley | Game |
431 | | [OSWorld](https://github.com/xlang-ai/OSWorld) |  | 2023.10 | HKU, CMU, Salesforce, Waterloo | Computer Use |
432 | | [webarena](https://github.com/web-arena-x/webarena) |  | 2023.7 | CMU | Web |
433 | | [AgentBench](https://github.com/THUDM/AgentBench) |  | 2023.7 | Tsinghua University | Game/Web/QA/Tool |
434 | | [WebShop](https://github.com/princeton-nlp/WebShop) |  | 2022.7 | Princeton-NLP | Web |
435 | | [ScienceWorld](https://github.com/allenai/ScienceWorld) |  | 2022.3 | AllenAI | TextGame/ScienceQA |
436 | | [alfworld](https://github.com/alfworld/alfworld) |  | 2020.10 | Microsoft, CMU, UW | Embodied |
437 | | [factorio-learning-environment](https://github.com/JackHopkins/factorio-learning-environment) |  | 2021.6 | JackHopkins | Game |
438 | | [jericho](https://github.com/microsoft/jericho) |  | 2018.10 | Microsoft, GIT | TextGame |
439 | | [TextWorld](https://github.com/microsoft/TextWorld) |  | 2018.6 | Microsoft Research | TextGame |
440 |
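Most of the environments above expose a gym-style reset/step interface that RL frameworks roll trajectories out against. A toy, purely illustrative sketch of that loop (`EchoTextEnv` and the hard-coded policy are hypothetical, not taken from any listed environment):

```python
class EchoTextEnv:
    """Toy stand-in for the text environments above (WebShop, TextWorld, ...):
    the agent must emit the hidden target command within max_turns."""
    def __init__(self, target="open mailbox", max_turns=3):
        self.target, self.max_turns = target, max_turns

    def reset(self):
        self.turn = 0
        return "You see a small mailbox."  # initial observation

    def step(self, action):
        self.turn += 1
        done = action == self.target or self.turn >= self.max_turns
        reward = 1.0 if action == self.target else 0.0  # sparse outcome reward
        return f"Turn {self.turn}: nothing happens.", reward, done

def rollout(env, policy):
    """Collect one trajectory: (action, reward) pairs plus the return."""
    obs, total, done, traj = env.reset(), 0.0, False, []
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        traj.append((action, reward))
        total += reward
    return traj, total

traj, ret = rollout(EchoTextEnv(), policy=lambda obs: "open mailbox")
```

Real environments differ in observation richness and tool interfaces, but the multi-turn observe/act/reward contract sketched here is what makes them pluggable into the RL frameworks listed in the tables.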
441 | ## Under Review/Waiting for Open Source
442 | - [JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning](https://arxiv.org/abs/2506.19846)
443 | - [Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning](https://arxiv.org/abs/2507.17842)
444 | - [Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning](https://arxiv.org/abs/2508.03501)
445 | - [Acting Less is Reasoning More! Teaching Model to Act Efficiently](https://arxiv.org/abs/2504.14870)
446 | - [Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning](https://arxiv.org/abs/2505.01441)
447 | - [ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents](https://arxiv.org/abs/2508.14040)
448 | - [Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward](https://github.com/antgroup/Research-Venus)
449 | - [MUA-RL: Multi-Turn User-Interacting Agent Reinforcement Learning for Agentic Tool Use](https://github.com/zzwkk/MUA-RL)
450 | - [Understanding Tool-Integrated Reasoning](https://zhongwenxu.notion.site/Understanding-Tool-Integrated-Reasoning-2551c4e140e3805489fadcc802a1ea83)
451 | - [Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning](https://arxiv.org/abs/2508.19828)
452 | - [Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning](https://arxiv.org/abs/2508.19598)
453 | - [SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents](https://arxiv.org/abs/2509.06283)
454 | - [WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents](https://arxiv.org/abs/2509.06501)
455 | - [EnvX: Agentize Everything with Agentic AI](https://arxiv.org/abs/2509.08088)
456 | - [UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning](https://arxiv.org/abs/2509.02544)
457 | - [UI-Venus Technical Report: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833)
458 | - [Agent2: An Agent-Generates-Agent Framework for Reinforcement Learning Automation](https://arxiv.org/abs/2509.13368)
459 | - [Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use](https://arxiv.org/abs/2509.12867v1)
460 | - [Adversarial Reinforcement Learning for Large Language Model Agent Safety](https://arxiv.org/abs/2510.05442)
461 | - [Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction](https://www.arxiv.org/abs/2511.11770)
462 | - [InfoFlow: Reinforcing Search Agent Via Reward Density Optimization](https://arxiv.org/abs/2510.26575)
463 |
464 | ## Star History
465 |
466 | [](https://www.star-history.com/#thinkwee/agentsMeetRL&Date)
467 |
468 |
469 | ## Citation
470 |
471 | If you find this repository useful, please consider citing it:
472 |
473 | ```bibtex
474 | @misc{agentsMeetRL,
475 |   title={When LLM Agents Meet Reinforcement Learning: A Comprehensive Survey},
476 |   author={AgentsMeetRL Contributors},
477 |   year={2025},
478 |   url={https://github.com/thinkwee/agentsMeetRL}
479 | }
480 | ```
481 |
482 | ---
483 |
484 |
485 |
Made with ❤️ by the AgentsMeetRL community
486 |
487 |
--------------------------------------------------------------------------------