├── CONTRIBUTING.md
├── LICENSE
├── README.md
└── assets
    ├── mobile.jpg
    └── pc.jpg
/CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to Awesome UI Agent 2 | 3 | Anyone interested in UI Agent is welcome to contribute to this repo: 4 | 5 | - You can add classic or the latest publications / tutorials directly to `README.md` and `your_create.md`. 6 | 7 | - You are welcome to update anything helpful. 8 | 9 | 10 | ## Pull Requests 11 | 12 | In general, we follow the "fork-and-pull" Git workflow. 13 | 14 | 1. Fork this repo to your personal GitHub account. 15 | 16 | 2. Clone the fork to your own machine. 17 | ``` 18 | git clone https://github.com/<your_username>/awesome-ui-agents.git 19 | ``` 20 | 21 | 3. Make the necessary changes and commit them. 22 | 23 | - If you go to the project directory and run `git status`, you'll see your changes. 24 | 25 | - Stage those changes with the `git add` command: 26 | ``` 27 | git add <changed_files> 28 | ``` 29 | - Now commit those changes using the `git commit` command: 30 | ``` 31 | git commit -m "add(<committer_name>): commit message" 32 | ``` 33 | * Commit messages should follow this convention: 34 | * `Template:` add/feature/polish(committer_name or project_name): commit message 35 | * `For example:` add(jrn): add one paper about UI-Agent 36 | 37 | 38 | 4. Push your work back up to your fork. 39 | ``` 40 | git push origin <branch_name> 41 | ``` 42 | 43 | 5. Submit a pull request so that we can review your changes. 44 | 45 | - If you go to your repository on GitHub, you'll see a `Contribute` button. Click it, then click the `Open pull request` and `Create pull request` buttons in turn. 46 | 47 | - We will then merge your changes into the main branch of this repo, and you will get a notification email once they have been merged. 48 | 49 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types.
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Awesome UI Agent 2 | 3 | [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome) 4 | ![visitor badge](https://visitor-badge.lithub.cc/badge?page_id=opendilab.awesome-ui-agents&left_text=Visitors) 5 | [![Twitter](https://img.shields.io/twitter/url?style=social&url=https%3A%2F%2Ftwitter.com%2Fopendilab)](https://twitter.com/opendilab) 6 | [![GitHub stars](https://img.shields.io/github/stars/opendilab/awesome-ui-agents)](https://github.com/opendilab/awesome-ui-agents/stargazers) 7 | [![GitHub forks](https://img.shields.io/github/forks/opendilab/awesome-ui-agents)](https://github.com/opendilab/awesome-ui-agents/network) 8 | ![GitHub commit activity](https://img.shields.io/github/commit-activity/m/opendilab/awesome-ui-agents) 9 | [![GitHub issues](https://img.shields.io/github/issues/opendilab/awesome-ui-agents)](https://github.com/opendilab/awesome-ui-agents/issues) 10 | [![GitHub pulls](https://img.shields.io/github/issues-pr/opendilab/awesome-ui-agents)](https://github.com/opendilab/awesome-ui-agents/pulls) 11 | [![Contributors](https://img.shields.io/github/contributors/opendilab/awesome-ui-agents)](https://github.com/opendilab/awesome-ui-agents/graphs/contributors) 12 | [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) 13 | 14 | This is a collection of research papers for **UI Agent**, including models, tools, and datasets. 15 | The repository will be continuously updated to track the frontier of UI Agent and related fields. 16 | 17 | Welcome to follow and star! 18 | 19 | ## Table of Contents 20 | 21 | - [Awesome UI Agent](#awesome-ui-agent) 22 | - [Table of Contents](#table-of-contents) 23 | - [Overview of UI Agent](#overview-of-ui-agent) 24 | - [Papers](#papers) 25 | - [Models](#models) 26 | - [2025](#2025) 27 | - [2024](#2024) 28 | - [2023](#2023) 29 | - [Tools](#tools) 30 | - [Datasets](#datasets) 31 | - [Related Repositories](#related-repositories) 32 | - [Contributing](#contributing) 33 | - [License](#license) 34 | 35 | ## Overview of UI Agent 36 | 37 | UI Agent aims to build a generalist agent that can interact with various user interfaces (UIs) in different environments, such as mobile apps, web pages, and PC applications. Such an agent understands UIs through vision-language models and interacts with them to complete tasks, and can be applied to scenarios such as mobile device operation, web browsing, and game playing. It can be trained in a simulated environment or with real-world data, and is evaluated in terms of task completion rate, efficiency, and generalization ability. 38 | 39 |
<div align="center">
40 | <img src="./assets/mobile.jpg" alt="Image Description 1"> 41 |
</div>
42 | 43 | Research on UI Agent is still at an early stage, and many challenges remain to be addressed, such as the scalability, robustness, and interpretability of these agents. The field is interdisciplinary, involving computer vision, natural language processing, reinforcement learning, human-computer interaction, and software engineering, and it has the potential to revolutionize the way we interact with computers and improve the efficiency and usability of computer systems. 44 | 45 |
<div align="center">
46 | <img src="./assets/pc.jpg" alt="Image Description 1"> 47 |
</div>
48 | 49 | ## Papers 50 | 51 | ``` 52 | format: 53 | - [title](paper link) [links] 54 | - author1, author2, and author3... 55 | - year 56 | - publisher 57 | - key 58 | - code 59 | - experiment environment 60 | ``` 61 | 62 | ### Models 63 | 64 | #### 2025 65 | 66 | - [VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning](https://arxiv.org/abs/2502.07949) 67 | - Qingyuan Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao 68 | - Key: framework, reinforcement learning, subgoal generation, VSC-RL, learning efficiency 69 | - ExpEnv: Android in the Wild 70 | 71 | - [AppVLM: A Lightweight Vision Language Model for Online App Control](https://arxiv.org/abs/2502.06395) 72 | - Georgios Papoudakis, Thomas Coste, Zhihao Wu, Jianye Hao, Jun Wang, Kun Shao 73 | - Key: vision-language model, multi-modal, AppVLM, on-device control 74 | - ExpEnv: two open-source mobile control datasets 75 | 76 | - [DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents](https://arxiv.org/abs/2410.14803) 77 | - Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao 78 | - Key: framework, reinforcement learning, distributed training, A-RIDE, on-device control 79 | - ExpEnv: Android in the Wild 80 | - [code](https://ai-agents-2030.github.io/DistRL/) 81 | 82 | - [OpenAI Operator](https://openai.com/index/introducing-operator/) 83 | - OpenAI 84 | - Key: a research preview of an agent that can use its own browser to perform tasks for you 85 | - ExpEnv: OSWorld, WebArena, WebVoyager 86 | 87 | - [OpenAI Computer-Using Agent](https://openai.com/index/computer-using-agent/) 88 | - OpenAI 89 | - Key: a universal interface for AI to interact with the digital world 90 | - ExpEnv: OSWorld, WebArena, WebVoyager 91 | 92 | - [Claude computer use](https://www.anthropic.com/news/developing-computer-use) 93 | - Anthropic 94 | - Key: emulating the way people interact with their own computer
95 | - ExpEnv: OSWorld 96 | 97 | - [Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage](https://openreview.net/forum?id=0bmGL4q7vJ) 98 | - Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li 99 | - Key: Multimodal Agents, Vision-language Model, Tool usage 100 | - ExpEnv: GTA, GAIA benchmarks 101 | 102 | - [Lightweight Neural App Control](https://openreview.net/forum?id=BL4WBIfyrz) 103 | - Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao 104 | - Key: vision-language model, multi-modal, android control, app agent 105 | - ExpEnv: two open-source mobile control datasets 106 | 107 | - [Enhancing Software Agents with Monte Carlo Tree Search and Hindsight Feedback](https://openreview.net/forum?id=G7sIFXugTX) 108 | - Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, William Yang Wang 109 | - Key: agents, LLM, SWE-agents, SWE-bench, search, planning, reasoning, self-improvement, open-ended 110 | - ExpEnv: SWE-bench 111 | 112 | #### 2024 113 | 114 | - [On the Effects of Data Scale on UI Control Agents](https://arxiv.org/abs/2406.03679) 115 | - Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, Oriana Riva 116 | - Key: Autonomous agents, UI control, AndroidControl dataset, fine-tuning, in-domain vs out-of-domain performance 117 | - 2024 118 | - [code](https://github.com/google-research/google-research/tree/master/android_control) 119 | 120 | - [SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering](https://arxiv.org/abs/2405.15793) 121 | - John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press 122 | - Key: Language model agents, agent-computer interface (ACI), automated software engineering, SWE-bench, HumanEvalFix 123 | - 2024 124 | - [code](https://github.com/SWE-agent/SWE-agent) 125 | 126 | - [Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making](https://arxiv.org/abs/2410.07166) 127 | - Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, Jiajun Wu 128 | - Key: Large language models, embodied decision making, generalized interface, fine-grained metrics, subgoal decomposition, action sequencing 129 | - 2024 130 | - [code](https://github.com/embodied-agent-eval/embodied-agent-eval) 131 | 132 | - [Cradle: Empowering Foundation Agents Towards General Computer Control](https://arxiv.org/abs/2403.03186) 133 | - Weihao Tan and Wentao Zhang and Xinrun Xu and Haochong Xia et al.
134 | - Key: various virtual scenarios, General Computer Control 135 | - 2024 136 | - [code](https://github.com/BAAI-Agents/Cradle) 137 | 138 | - [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883) 139 | - Filippos Christianos and Georgios Papoudakis and Thomas Coste and Jianye Hao and Jun Wang and Kun Shao 140 | - Key: app agents, Android apps, Action Transformer 141 | - 2024 142 | 143 | - [SeeAct: GPT-4V(ision) is a Generalist Web Agent, if Grounded](https://arxiv.org/abs/2401.01614) 144 | - Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su 145 | - Key: live websites, grounding, image captioning, visual question answering 146 | - 2024 147 | - [code](https://osu-nlp-group.github.io/SeeAct) 148 | 149 | - [MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot](https://arxiv.org/abs/2404.18074) 150 | - Zirui Song and Yaohang Li and Meng Fang and Zhenhao Chen and Zecheng Shi and Yuan Huang and Ling Chen 151 | - Key: Autonomous virtual agents, Multi-Modal Agent Collaboration 152 | - 2024 153 | 154 | - [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935) 155 | - Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu 156 | - Key: Graphical User Interface, screenshots 157 | - 2024 158 | - [code](https://github.com/njucckevin/SeeClick) 159 | 160 | - [OS-ATLAS: A Foundation Action Model for Generalist GUI Agents](https://arxiv.org/abs/2410.23218) 161 | - Zhiyong Wu and Zhenyu Wu and Fangzhi Xu and Yian Wang and Qiushi Sun and Chengyou Jia and Kanzhi Cheng and Zichen Ding and Liheng Chen and Paul Pu Liang and Yu Qiao 162 | - Key: Out-Of-Distribution, GUI grounding, language agent 163 | - 2024 164 | - [code](https://github.com/OS-Copilot/OS-Atlas) 165 | 166 | - [Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents](https://arxiv.org/abs/2412.13194) 167 | - Yifei Zhou and Qianlan Yang and Kaixiang Lin and Min Bai and Xiong Zhou and Yu-Xiong Wang and Sergey Levine and Erran Li 168 | - Key: Large language models, Internet-browsing agent, autonomous task proposal 169 | - 2024 170 | - [code](https://yanqval.github.io/PAE/) 171 | 172 | - [AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent](https://arxiv.org/abs/2404.03648) 173 | - Hanyu Lai and Xiao Liu and Iat Long Iong and Shuntian Yao and Yuxuan Chen and Pengbo Shen and Hao Yu and Hanchen Zhang and Xiaohan Zhang and Yuxiao Dong and Jie Tang 174 | - Key: Large language models, real-world web navigation, bilingual benchmark 175 | - 2024 176 | - [code](https://github.com/THUDM/AutoWebGLM) 177 | 178 | - [Dual-View Visual Contextualization for Web Navigation](https://arxiv.org/abs/2402.04476) 179 | - Jihyung Kil and Chan Hee Song and Boyuan Zheng and Xiang Deng and Yu Su and Wei-Lun Chao 180 | - Key: Automatic web navigation, language instructions, HTML elements 181 | - 2024 182 | 183 | - [Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems](https://arxiv.org/abs/2407.13032) 184 | - Tamer Abuelsaad and Deepak Akkil and Prasenjit Dey and Ashish Jagmohan and Aditya Vempaty and Ravi Kokku 185 | - Key: hierarchical architecture, flexible DOM distillation, denoising method 186 | - 2024 187 | - [code](https://github.com/EmergenceAI/Agent-E) 188 | 189 | - [Tree Search for Language Model Agents](https://arxiv.org/abs/2407.01476) 190 | -
Jing Yu Koh and Stephen McAleer and Daniel Fried and Ruslan Salakhutdinov 191 | - Key: multi-step reasoning, planning, best-first tree search 192 | - 2024 193 | - [code](https://github.com/kohjingyu/search-agents) 194 | 195 | - [Agent S: An Open Agentic Framework that Uses Computers Like a Human](https://arxiv.org/abs/2410.08164) 196 | - Saaket Agashe and Jiuzhou Han and Shuyu Gan and Jiachen Yang and Ang Li and Xin Eric Wang 197 | - Key: Multimodal Large Language Models, Graphical User Interface, Agent-Computer Interface 198 | - 2024 199 | - [code](https://github.com/simular-ai/Agent-S) 200 | 201 | - [Apple Intelligence Foundation Language Models](https://arxiv.org/pdf/2407.21075) 202 | - Apple 203 | - Key: Vision-Language Model, Private Cloud Compute 204 | - 2024 205 | 206 | - [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://arxiv.org/abs/2402.11941) 207 | - Xinbei Ma and Zhuosheng Zhang and Hai Zhao 208 | - Key: Vision-Language Model, Phone 209 | - 2024 210 | - [code](https://github.com/xbmxb/CoCo-Agent) 211 | 212 | - [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935) 213 | - Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu 214 | - Key: Vision-Language Model, PC 215 | - 2024 216 | - [code](https://github.com/njucckevin/SeeClick) 217 | 218 | - [Intention-in-Interaction (IN3): Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents](https://arxiv.org/abs/2402.09205) 219 | - Cheng Qian and Bingxiang He and Zhong Zhuang and Jia Deng and Yujia Qin and Xin Cong and Zhong Zhang and Jie Zhou and Yankai Lin and Zhiyuan Liu and Maosong Sun 220 | - Key: Language Model, User Intention 221 | - 2024 222 | - [code](https://github.com/OpenBMB/Tell_Me_More) 223 | 224 | - [LATS: Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models](https://arxiv.org/abs/2310.04406) 225 | - Andy Zhou and Kai Yan and Michal Shlapentokh-Rothman and Haohan Wang and Yu-Xiong Wang 226 | - Key: Tree Search, Language Model 227 | - 2024 228 | - [code](https://github.com/lapisrocks/LanguageAgentTreeSearch) 229 | 230 | - [DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning](https://arxiv.org/abs/2406.11896) 231 | - Hao Bai and Yifei Zhou and Mert Cemri and Jiayi Pan and Alane Suhr and Sergey Levine and Aviral Kumar 232 | - Key: Vision-Language Model, Android, Reinforcement Learning 233 | - 2024 234 | - [code](https://github.com/DigiRL-agent/digirl) 235 | 236 | - [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) 237 | - Gilles Baechler and Srinivas Sunkara and Maria Wang and Fedir Zubach and Hassan Mansoor and Vincent Etter and Victor Cărbune and Jason Lin and Jindong Chen and Abhanshu Sharma 238 | - Key: Vision-Language Model, Mobile, Infographics 239 | - 2024 240 | - [code](https://github.com/kyegomez/ScreenAI) 241 | 242 | - [ScreenAgent: A Vision Language Model-driven Computer Control Agent](https://arxiv.org/abs/2402.07945) 243 | - Runliang Niu and Jindong Li and Shiqi Wang and Yali Fu and Xiyu Hu and Xueyuan Leng and He Kong and Yi Chang and Qi Wang 244 | - Key: Vision Language Model, PC 245 | - 2024 246 | - [code](https://github.com/niuzaisheng/ScreenAgent) 247 | 248 | - [Android in the Zoo: Chain-of-Action-Thought for GUI Agents](https://arxiv.org/abs/2403.02713) 249 | - Jiwen Zhang and Jihao Wu and Yihua Teng and Minghui
Liao and Nuo Xu and Xiao Xiao and Zhongyu Wei and Duyu Tang 250 | - Key: Vision-Language Model, Android, Chain-of-Action-Thought 251 | - 2024 252 | - [code](https://github.com/IMNearth/CoAT) 253 | 254 | 255 | - [Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration](https://arxiv.org/abs/2406.01014) 256 | - Junyang Wang and Haiyang Xu and Haitao Jia and Xi Zhang and Ming Yan and Weizhou Shen and Ji Zhang and Fei Huang and Jitao Sang 257 | - Key: Vision-Language Model, Android 258 | - 2024 259 | - [code](https://github.com/X-PLUG/MobileAgent) 260 | 261 | - [Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception](https://arxiv.org/abs/2401.16158) 262 | - Junyang Wang and Haiyang Xu and Jiabo Ye and Ming Yan and Weizhou Shen and Ji Zhang and Fei Huang and Jitao Sang 263 | - Key: Vision-Language Model, Android 264 | - 2024 265 | - [code](https://github.com/X-PLUG/MobileAgent) 266 | 267 | - [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models](https://arxiv.org/abs/2401.13919) 268 | - Hongliang He and Wenlin Yao and Kaixin Ma and Wenhao Yu and Yong Dai and Hongming Zhang and Zhenzhong Lan and Dong Yu 269 | - Key: Vision-Language Model, Web 270 | - 2024 271 | - [code](https://github.com/MinorJerry/WebVoyager) 272 | 273 | - [OS-Copilot: Towards Generalist Computer Agents with Self-Improvement](https://arxiv.org/abs/2402.07456) 274 | - Zhiyong Wu and Chengcheng Han and Zichen Ding and Zhenmin Weng and Zhoumianze Liu and Shunyu Yao and Tao Yu and Lingpeng Kong 275 | - Key: Vision-Language Model, PC 276 | - 2024 277 | - [code](https://github.com/OS-Copilot/OS-Copilot) 278 | 279 | - [UFO: A UI-Focused Agent for Windows OS Interaction](https://arxiv.org/abs/2402.07939) 280 | - Chaoyun Zhang and Liqun Li and Shilin He and Xu Zhang and Bo Qiao and Si Qin and Minghua Ma and Yu Kang and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang and Qi Zhang 281 | - Key: Vision-Language Model, PC, Windows OS 282 | - 2024 283 | - [code](https://github.com/microsoft/UFO) 284 | 285 | - [Octopus v2: On-device language model for super agent](https://arxiv.org/abs/2404.01744) 286 | - Wei Chen and Zhiyuan Li 287 | - Key: Vision-Language Model, Android, iOS 288 | - 2024 289 | 290 | 291 | #### 2023 292 | - [OpenAgents: An Open Platform for Language Agents in the Wild](https://arxiv.org/abs/2310.10634) 293 | - Tianbao Xie and Fan Zhou and Zhoujun Cheng et al. 294 | - Key: language agents, the wild of everyday life, real-world evaluations 295 | - 2023 296 | - [code](https://github.com/xlang-ai/OpenAgents) 297 | 298 | - [LASER: LLM Agent with State-Space Exploration for Web Navigation](https://arxiv.org/abs/2309.08172) 299 | - Kaixin Ma and Hongming Zhang and Hongwei Wang and Xiaoman Pan and Wenhao Yu and Dong Yu 300 | - Key: Large language models, web navigation, interactive task 301 | - 2023 302 | - [code](https://github.com/Mayer123/LASER) 303 | 304 | - [AppAgent: Multimodal Agents as Smartphone Users](https://arxiv.org/abs/2312.13771) 305 | - Chi Zhang and Zhao Yang and Jiaxuan Liu and Yucheng Han and Xin Chen and Zebiao Huang and Bin Fu and Gang Yu 306 | - Key: Vision-Language Model, Android 307 | - 2023 308 | - [code](https://github.com/mnotgod96/AppAgent) 309 | 310 | - [CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914) 311 | - Wenyi Hong and Weihan Wang and Qingsong Lv and Jiazheng Xu and Wenmeng Yu and Junhui Ji and Yan Wang and Zihan Wang and Yuxuan Zhang and
Juanzi Li and Bin Xu and Yuxiao Dong and Ming Ding and Jie Tang 312 | - Key: Vision-Language Model, PC, Android, screenshots 313 | - 2023 314 | - [code](https://github.com/THUDM/CogVLM) 315 | 316 | - [Octopus: Embodied Vision-Language Programmer from Environmental Feedback](https://arxiv.org/abs/2310.08588) 317 | - Jingkang Yang and Yuhao Dong and Shuai Liu and Bo Li and Ziyue Wang and Chencheng Jiang and Haoran Tan and Jiamu Kang and Yuanhan Zhang and Kaiyang Zhou and Ziwei Liu 318 | - Key: Vision-Language Model, Android, iOS 319 | - 2023 320 | - [code](https://github.com/dongyh20/Octopus) 321 | 322 | - [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) 323 | - Zhuosheng Zhang and Aston Zhang 324 | - Key: Vision-Language Model, Android, Chain-of-Action-Thought 325 | - 2023 326 | - [code](https://github.com/cooelf/Auto-GUI) 327 | 328 | - [LASER: LLM Agent with State-Space Exploration for Web Navigation](https://arxiv.org/abs/2309.08172) 329 | - Kaixin Ma and Hongming Zhang and Hongwei Wang and Xiaoman Pan and Wenhao Yu and Dong Yu 330 | - Key: Vision-Language Model, Web, State-Space Exploration 331 | - 2023 332 | - [code](https://github.com/Mayer123/LASER) 333 | 334 | - [A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis](https://arxiv.org/abs/2307.12856) 335 | - Izzeddin Gur and Hiroki Furuta and Austin Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust 336 | - Key: Vision-Language Model, Web, Planning, Program Synthesis 337 | - 2023 338 | 339 | - [Augmenting Autotelic Agents with Large Language Models](https://arxiv.org/abs/2305.12487) 340 | - Cédric Colas and Laetitia Teodorescu and Pierre-Yves Oudeyer and Xingdi Yuan and Marc-Alexandre Côté 341 | - Key: Language Model 342 | - 2023 343 | 344 | - [Language Models can Solve Computer Tasks](https://arxiv.org/abs/2303.17491) 345 | - Geunwoo Kim and Pierre Baldi and Stephen McAleer 346 | - Key: Language Model 347 | - 2023 348 | - [code](https://github.com/posgnu/rci-agent) 349 | 350 | ### Tools 351 | - [Opera Browser Operator: AI-based Agentic Browsing](https://press.opera.com/2025/03/03/opera-browser-operator-ai-agentics/) 352 | - Opera Software 353 | - Key: AI agent, agentic browsing, native client-side solution, privacy-focused 354 | - 2025 355 | - [code](https://press.opera.com) 356 | 357 | - [OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent](https://arxiv.org/abs/2408.00203) 358 | - Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah 359 | - Key: UI parsing, vision-based agent, GPT-4V, structured elements 360 | - 2024 361 | - [code](https://github.com/microsoft/OmniParser) 362 | 363 | - [Make Websites Accessible for Agents](https://browser-use.com) 364 | - Magnus Müller and Gregor Žunič 365 | - Key: websites, Agents 366 | - 2024 367 | - [code](https://github.com/browser-use/browser-use) 368 | 369 | - [ToolGen: Unified Tool Retrieval and Calling via Generation](https://openreview.net/forum?id=XLMAMmowdY) 370 | - Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, Haonan Li 371 | - Key: Agent, Tool Learning, Virtual Token 372 | - 2024 373 | - [code](https://github.com/Reason-Wang/ToolGen) 374 | 375 | - [OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models](https://aclanthology.org/2024.acl-demos.8/) 376 | - Iat Long Iong and Xiao Liu and Yuxuan Chen and Hanyu Lai and Shuntian Yao and
Pengbo Shen and Hao Yu and Yuxiao Dong and Jie Tang 377 | - Key: Webpage, deployment 378 | - 2024 379 | - [code](https://github.com/boxworld18/OpenWebAgent) 380 | 381 | - [LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation](https://arxiv.org/abs/2404.16054) 382 | - Li Zhang and Shihe Wang and Xianqing Jia and Zhihan Zheng and Yunhe Yan and Longxi Gao and Yuanchun Li and Mengwei Xu 383 | - Key: Mobile UI, Simulator 384 | - 2024 385 | - [code](https://github.com/llamatouch/llamatouch) 386 | 387 | - [WebArena: A Realistic Web Environment for Building Autonomous Agents](https://arxiv.org/abs/2307.13854) 388 | - Shuyan Zhou and Frank F. Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig 389 | - Key: Web, Simulator 390 | - 2023 391 | - [code](https://github.com/web-arena-x/webarena) 392 | 393 | - [Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction](https://arxiv.org/abs/2305.08144) 394 | - Danyang Zhang and Zhennan Shen and Rui Xie and Situo Zhang and Tianbao Xie and Zihan Zhao and Siyuan Chen and Lu Chen and Hongshen Xu and Ruisheng Cao and Kai Yu 395 | - Key: Android, Simulator 396 | - 2023 397 | - [code](https://github.com/X-LANCE/Mobile-Env) 398 | 399 | - [AndroidEnv: A Reinforcement Learning Platform for Android](https://arxiv.org/abs/2105.13231) 400 | - Daniel Toyama and Philippe Hamel and Anita Gergely and Gheorghe Comanici and Amelia Glaese and Zafarali Ahmed and Tyler Jackson and Shibl Mourad and Doina Precup 401 | - Key: Android, Reinforcement Learning, Simulator 402 | - 2021 403 | - [code](https://github.com/google-deepmind/android_env) 404 | 405 | ### Datasets 406 | - [SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation](https://arxiv.org/abs/2410.15164) 407 | - Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao 408 | - Key: Two Languages, Interactive Environment, Plug-and-play Framework, 11 Agents, Diverse Metrics 409 | - 2025 410 | - [code](https://ai-agents-2030.github.io/SPA-Bench/) 411 | 412 | - [AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?](https://arxiv.org/abs/2407.15711) 413 | - Ori Yoran and Samuel Joseph Amouyal and Chaitanya Malaviya and Ben Bogin and Ofir Press and Jonathan Berant 414 | - Key: Web, Realistic, Time-Consuming, Benchmark 415 | - 2024 416 | - [code](https://assistantbench.github.io) 417 | 418 | - [WebCanvas: Benchmarking Web Agents in Online Environments](https://arxiv.org/abs/2406.12373) 419 | - Yichen Pan and Dehan Kong and Sida Zhou and Cheng Cui and Yifei Leng and Bing Jiang and Hangyu Liu and Yanyi Shang and Shuyan Zhou and Tongshuang Wu and Zhengyang Wu 420 | - Key: Web, Online Environments, Benchmark 421 | - 2024 422 | - [code](https://www.imean.ai/web-canvas) 423 | 424 | - [MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents](https://arxiv.org/abs/2406.08184) 425 | - Luyuan Wang and Yongyu Deng and Yiwei Zha and Guodong Mao and Qinmin Wang and Tianchen Min and Wei Chen and Shoufa Chen 426 | - Key: Mobile, Benchmark 427 | - 2024 428 | - [code](https://github.com/MobileAgentBench/mobile-agent-bench) 429 | 430 | - [VillagerBench/VillagerAgent: A Graph-Based Multi-Agent Framework for Coordinating Complex Task Dependencies in Minecraft](https://arxiv.org/abs/2406.05720) 431 | -
Yubo Dong and Xukun Zhu and Zhengzhe Pan and Linchao Zhu and Yi Yang 432 | - Key: Vision-Language Model, Game 433 | - 2024 434 | - [code](https://github.com/cnsdqd-dyb/VillagerAgent) 435 | 436 | - [CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation in Real-World API Interactions](https://aclanthology.org/2024.findings-acl.928/) 437 | - Guo, Zishan and Huang, Yufei and Xiong, Deyi 438 | - Key: Vision-Language Model, Phone 439 | - 2024 440 | - [code](https://github.com/tjunlp-lab/CToolEval) 441 | 442 | - [Multi-Turn Mind2Web: On the Multi-turn Instruction Following for Conversational Web Agents](https://arxiv.org/pdf/2402.15057) 443 | - Yang Deng and Xuan Zhang and Wenxuan Zhang and Yifei Yuan and See-Kiong Ng and Tat-Seng Chua 444 | - Key: Vision-Language Model, Web Tasks 445 | - 2024 446 | - [code](https://github.com/magicgh/self-map) 447 | 448 | - [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) 449 | - Jing Yu Koh and Robert Lo and Lawrence Jang and Vikram Duvvur and Ming Chong Lim and Po-Yu Huang and Graham Neubig and Shuyan Zhou and Ruslan Salakhutdinov and Daniel Fried 450 | - Key: Vision-Language Model, Web Tasks 451 | - 2024 452 | - [code](https://github.com/web-arena-x/visualwebarena) 453 | 454 | - [Android in the Zoo: Chain-of-Action-Thought for GUI Agents](https://arxiv.org/abs/2403.02713) 455 | - Jiwen Zhang and Jihao Wu and Yihua Teng and Minghui Liao and Nuo Xu and Xiao Xiao and Zhongyu Wei and Duyu Tang 456 | - Key: Vision-Language Model, Android, Chain-of-Action-Thought 457 | - 2024 458 | - [code](https://github.com/IMNearth/CoAT) 459 | 460 | - [Android in the Wild: A Large-Scale Dataset for Android Device Control](https://arxiv.org/abs/2307.10088) 461 | - Christopher Rawles and Alice Li and Daniel Rodriguez and Oriana Riva and Timothy Lillicrap 462 | - Key: Android, datasets 463 | - 2023 464 | - [code](https://github.com/google-research/google-research/blob/master/android_in_the_wild/README.md) 465 | 466 | - [Mind2Web: Towards a Generalist Agent for the Web](https://arxiv.org/abs/2306.06070) 467 | - Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su 468 | - Key: Web, datasets 469 | - 2023 470 | - [code](https://github.com/OSU-NLP-Group/Mind2Web) 471 | 472 | - [WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents](https://arxiv.org/abs/2207.01206) 473 | - Shunyu Yao and Howard Chen and John Yang and Karthik Narasimhan 474 | - Key: Web, datasets 475 | - 2022 476 | - [code](https://github.com/princeton-nlp/WebShop) 477 | 478 | - [Rico: A Mobile App Dataset for Building Data-Driven Design Applications](https://dl.acm.org/doi/10.1145/3126594.3126651) 479 | - Deka, Biplab and Huang, Zifeng and Franzen, Chad and Hibschman, Joshua and Afergan, Daniel and Li, Yang and Nichols, Jeffrey and Kumar, Ranjitha 480 | - Key: mobile app, datasets 481 | - 2017 482 | 483 | ## Related Repositories 484 | 485 | - [awesome-llm-powered-agent](https://github.com/hyp1231/awesome-llm-powered-agent) 486 | - [Awesome-LLM-based-Web-Agent-and-Tools](https://github.com/albzni/Awesome-LLM-based-Web-Agent-and-Tools) 487 | - [Awesome-GUI-Agent](https://github.com/showlab/Awesome-GUI-Agent) 488 | - [computer-control-agent-knowledge-base](https://github.com/James4Ever0/computer_control_agent_knowledge_base) 489 | 490 | ## Contributing 491 | 492 | Our purpose is to make this repo even better. 
If you are interested in contributing, please refer to [HERE](CONTRIBUTING.md) for contribution instructions. 493 | 494 | ## License 495 | 496 | This repository is released under the Apache 2.0 license. 497 | -------------------------------------------------------------------------------- /assets/mobile.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/opendilab/awesome-ui-agents/3dc02de3889057db6c63017a19cc4f55d02194d1/assets/mobile.jpg -------------------------------------------------------------------------------- /assets/pc.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/opendilab/awesome-ui-agents/3dc02de3889057db6c63017a19cc4f55d02194d1/assets/pc.jpg --------------------------------------------------------------------------------