# Awesome GUI Agent [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

A curated list of papers, projects, and resources for multi-modal Graphical User Interface (GUI) agents.

<div align="center">
  <img src="assets/teaser.webp" alt="Awesome GUI Agent teaser">
  <br>
  <i>Build a digital assistant on your screen. Generated by DALL-E-3.</i>
</div>

**CONTRIBUTIONS WELCOME!**

🔥 This project is actively maintained, and we welcome your contributions. If you have any suggestions, such as missing papers or information, please feel free to open an issue or submit a pull request.

🤖 Try our [Awesome-Paper-Agent](https://chatgpt.com/g/g-qqs9km6wi-awesome-paper-agent). Just provide an arXiv URL, and it will automatically return a formatted entry, like this:

```
User:
https://arxiv.org/abs/2312.13108

GPT:
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2312.13108) (Dec. 2023)

  [![Star](https://img.shields.io/github/stars/showlab/assistgui.svg?style=social&label=Star)](https://github.com/showlab/assistgui)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2312.13108)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/assistgui/)
```

You can then copy the formatted entry directly into your pull request.

⭐ If you find this repository useful, please give it a star.

---

**Quick Navigation**: [[Datasets / Benchmarks]](#datasets--benchmarks) [[Models / Agents]](#models--agents) [[Surveys]](#surveys) [[Projects]](#projects)

## Datasets / Benchmarks

+ [World of Bits: An Open-Domain Platform for Web-Based Agents](https://proceedings.mlr.press/v70/shi17a.html) (Aug. 2017, ICML 2017)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://proceedings.mlr.press/v70/shi17a/shi17a.pdf)

+ [A Unified Solution for Structured Web Data Extraction](https://dl.acm.org/doi/10.1145/2009916.2010020) (Jul. 2011, SIGIR 2011)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/10.1145/2009916.2010020)

+ [Rico: A Mobile App Dataset for Building Data-Driven Design Applications](https://dl.acm.org/doi/10.1145/3126594.3126651) (Oct. 2017)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/10.1145/3126594.3126651)

+ [Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration](https://arxiv.org/abs/1802.08802) (Feb. 2018, ICLR 2018)

  [![Star](https://img.shields.io/github/stars/stanfordnlp/wge.svg?style=social&label=Star)](https://github.com/stanfordnlp/wge)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/1802.08802)

+ [Mapping Natural Language Instructions to Mobile UI Action Sequences](https://arxiv.org/abs/2005.03776) (May 2020, ACL 2020)

  [![Star](https://img.shields.io/github/stars/deepneuralmachine/seq2act-tensorflow.svg?style=social&label=Star)](https://github.com/deepneuralmachine/seq2act-tensorflow)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2005.03776)

+ [WebSRC: A Dataset for Web-Based Structural Reading Comprehension](https://arxiv.org/abs/2101.09465) (Jan. 2021, EMNLP 2021)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2101.09465)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://x-lance.github.io/WebSRC/)

+ [AndroidEnv: A Reinforcement Learning Platform for Android](https://arxiv.org/abs/2105.13231) (May 2021)

  [![Star](https://img.shields.io/github/stars/deepmind/android_env.svg?style=social&label=Star)](https://github.com/deepmind/android_env)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2105.13231)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://github.com/deepmind/android_env)

+ [A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility](https://arxiv.org/abs/2202.02312) (Feb. 2022)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2202.02312)

+ [META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI](https://arxiv.org/abs/2205.11029) (May 2022)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2205.11029)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://x-lance.github.io/META-GUI-Leaderboard/)

+ [WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents](https://arxiv.org/abs/2207.01206) (Jul. 2022)

  [![Star](https://img.shields.io/github/stars/princeton-nlp/WebShop.svg?style=social&label=Star)](https://github.com/princeton-nlp/WebShop)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2207.01206)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://webshop-pnlp.github.io/)

+ [Language Models can Solve Computer Tasks](https://arxiv.org/abs/2303.17491) (Mar. 2023)

  [![Star](https://img.shields.io/github/stars/posgnu/rci-agent.svg?style=social&label=Star)](https://github.com/posgnu/rci-agent)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2303.17491)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://posgnu.github.io/rci-web/)

+ [Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction](https://arxiv.org/abs/2305.08144) (May 2023)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2305.08144)
  [![GitHub](https://img.shields.io/badge/GitHub-181717.svg?style=social&logo=github)](https://github.com/X-LANCE/Mobile-Env)

+ [Mind2Web: Towards a Generalist Agent for the Web](https://arxiv.org/abs/2306.06070) (Jun. 2023)

  [![Star](https://img.shields.io/github/stars/osu-nlp-group/mind2web.svg?style=social&label=Star)](https://github.com/osu-nlp-group/mind2web)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2306.06070)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://osu-nlp-group.github.io/Mind2Web/)

+ [Android in the Wild: A Large-Scale Dataset for Android Device Control](https://arxiv.org/abs/2307.10088) (Jul. 2023)

  [![Star](https://img.shields.io/github/stars/google-research/google-research.svg?style=social&label=Star)](https://github.com/google-research/google-research/tree/master/android_in_the_wild)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2307.10088)

+ [WebArena: A Realistic Web Environment for Building Autonomous Agents](https://arxiv.org/abs/2307.13854) (Jul. 2023)

  [![Star](https://img.shields.io/github/stars/web-arena-x/webarena.svg?style=social&label=Star)](https://github.com/web-arena-x/webarena)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2307.13854)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://webarena.dev/)

+ [Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models](https://arxiv.org/abs/2311.09278) (Nov. 2023)

  [![Star](https://img.shields.io/github/stars/xufangzhi/ENVISIONS.svg?style=social&label=Star)](https://github.com/xufangzhi/ENVISIONS)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2311.09278)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://xufangzhi.github.io/symbol-llm-page/)

+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2401.07781) (Dec. 2023, CVPR 2024)

  [![Star](https://img.shields.io/github/stars/showlab/assistgui.svg?style=social&label=Star)](https://github.com/showlab/assistgui)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2401.07781)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/assistgui/)

+ [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) (Jan. 2024, ACL 2024)

  [![Star](https://img.shields.io/github/stars/web-arena-x/visualwebarena.svg?style=social&label=Star)](https://github.com/jykoh/visualwebarena)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2401.13649)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://jykoh.com/vwa)

+ [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) (Feb. 2024)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.17553)

+ [WebLINX: Real-World Website Navigation with Multi-Turn Dialogue](https://arxiv.org/abs/2402.05930) (Feb. 2024)

  [![Star](https://img.shields.io/github/stars/mcgill-nlp/weblinx.svg?style=social&label=Star)](https://github.com/mcgill-nlp/weblinx)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.05930)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://mcgill-nlp.github.io/weblinx/)

+ [On the Multi-turn Instruction Following for Conversational Web Agents](https://arxiv.org/abs/2402.15057) (Feb. 2024)

  [![Star](https://img.shields.io/github/stars/magicgh/self-map.svg?style=social&label=Star)](https://github.com/magicgh/self-map)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.15057)

+ [AgentStudio: A Toolkit for Building General Virtual Agents](https://arxiv.org/abs/2403.17918) (Mar. 2024)

  [![Star](https://img.shields.io/github/stars/skyworkai/agent-studio.svg?style=social&label=Star)](https://github.com/skyworkai/agent-studio)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2403.17918)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://skyworkai.github.io/agent-studio/)

+ [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https://arxiv.org/abs/2404.07972) (Apr. 2024)

  [![Star](https://img.shields.io/github/stars/xlang-ai/OSWorld.svg?style=social&label=Star)](https://github.com/xlang-ai/OSWorld)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.07972)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://os-world.github.io/)

+ [Benchmarking Mobile Device Control Agents across Diverse Configurations](https://arxiv.org/abs/2404.16660) (Apr. 2024, ICLR 2024)

  [![Star](https://img.shields.io/github/stars/gimme1dollar/b-moca.svg?style=social&label=Star)](https://github.com/gimme1dollar/b-moca)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.16660)

+ [MMInA: Benchmarking Multihop Multimodal Internet Agents](https://arxiv.org/abs/2404.09992) (Apr. 2024)

  [![Star](https://img.shields.io/github/stars/shulin16/MMInA.svg?style=social&label=Star)](https://github.com/shulin16/MMInA)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.09992)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://mmina.cliangyu.com)

+ [Autonomous Evaluation and Refinement of Digital Agents](https://arxiv.org/abs/2404.06474) (Apr. 2024)

  [![Star](https://img.shields.io/github/stars/Berkeley-NLP/Agent-Eval-Refine.svg?style=social&label=Star)](https://github.com/Berkeley-NLP/Agent-Eval-Refine)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.06474)

+ [LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation](https://arxiv.org/abs/2404.16054) (Apr. 2024)

  [![Star](https://img.shields.io/github/stars/LlamaTouch/LlamaTouch.svg?style=social&label=Star)](https://github.com/LlamaTouch/LlamaTouch)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.16054)

+ [VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?](https://arxiv.org/abs/2404.05955) (Apr. 2024)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.05955)

+ [GUICourse: From General Vision Language Models to Versatile GUI Agents](https://arxiv.org/abs/2406.11317) (Jun. 2024)

  [![Star](https://img.shields.io/github/stars/yiye3/GUICourse.svg?style=social&label=Star)](https://github.com/yiye3/GUICourse)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.11317)

+ [GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents](https://arxiv.org/abs/2406.10819) (Jun. 2024)

  [![Star](https://img.shields.io/github/stars/Dongping-Chen/GUI-World.svg?style=social&label=Star)](https://github.com/Dongping-Chen/GUI-World)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.10819)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://gui-world.github.io/)

+ [GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices](https://arxiv.org/abs/2406.08451) (Jun. 2024)

  [![Star](https://img.shields.io/github/stars/OpenGVLab/GUI-Odyssey.svg?style=social&label=Star)](https://github.com/OpenGVLab/GUI-Odyssey)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.08451)

+ [VideoGUI: A Benchmark for GUI Automation from Instructional Videos](https://arxiv.org/abs/2406.10227) (Jun. 2024)

  [![Star](https://img.shields.io/github/stars/showlab/videogui.svg?style=social&label=Star)](https://github.com/showlab/videogui)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.10227)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/videogui/)

+ [Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding](https://arxiv.org/abs/2406.19263) (Jun. 2024)

  [![Star](https://img.shields.io/github/stars/eric-ai-lab/Screen-Point-and-Read.svg?style=social&label=Star)](https://github.com/eric-ai-lab/Screen-Point-and-Read)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.19263)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://screen-point-and-read.github.io/)

+ [MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents](https://arxiv.org/abs/2406.08184) (Jun. 2024)

  [![Star](https://img.shields.io/github/stars/MobileAgentBench/mobile-agent-bench.svg?style=social&label=Star)](https://github.com/MobileAgentBench/mobile-agent-bench)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.08184)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://mobileagentbench.github.io)

+ [AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents](https://arxiv.org/abs/2405.14573) (Jun. 2024)

  [![Star](https://img.shields.io/github/stars/google-research/android_world.svg?style=social&label=Star)](https://github.com/google-research/android_world)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2405.14573)

+ [Practical, Automated Scenario-based Mobile App Testing](https://arxiv.org/abs/2406.08340) (Jun. 2024)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.08340)

+ [WebCanvas: Benchmarking Web Agents in Online Environments](https://arxiv.org/abs/2406.12373) (Jun. 2024)

  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.12373)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://www.imean.ai/web-canvas)

+ [On the Effects of Data Scale on Computer Control Agents](https://arxiv.org/abs/2406.03679) (Jun. 2024)

  [![Star](https://img.shields.io/github/stars/google-research/google-research.svg?style=social&label=Star)](https://github.com/google-research/google-research/tree/master/android_control)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.03679)

+ [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511) (Jul. 2024)

  [![Star](https://img.shields.io/github/stars/camel-ai/crab.svg?style=social&label=Star)](https://github.com/camel-ai/crab)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.01511)

+ [WebVLN: Vision-and-Language Navigation on Websites](https://ojs.aaai.org/index.php/AAAI/article/view/27878) (AAAI 2024)

  [![Star](https://img.shields.io/github/stars/WebVLN/WebVLN.svg?style=social&label=Star)](https://github.com/WebVLN/WebVLN)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://ojs.aaai.org/index.php/AAAI/article/view/27878)

+ [Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?](https://arxiv.org/abs/2407.10956) (Jul. 2024)

  [![Star](https://img.shields.io/github/stars/xlang-ai/Spider2-V.svg?style=social&label=Star)](https://github.com/xlang-ai/Spider2-V)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.10956)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://spider2-v.github.io/)

+ [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://arxiv.org/abs/2407.17490)

  [![Star](https://img.shields.io/github/stars/YuxiangChai/AMEX-codebase.svg?style=social&label=Star)](https://github.com/YuxiangChai/AMEX-codebase)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.17490)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://yuxiangchai.github.io/AMEX/)

+ [Windows Agent Arena](https://raw.githubusercontent.com/microsoft/WindowsAgentArena/website/static/files/windows_agent_arena.pdf)

  [![Star](https://img.shields.io/github/stars/microsoft/WindowsAgentArena.svg?style=social&label=Star)](https://github.com/microsoft/WindowsAgentArena)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://microsoft.github.io/WindowsAgentArena/)
  [![PDF](https://img.shields.io/badge/PDF-4285f4.svg)](https://raw.githubusercontent.com/microsoft/WindowsAgentArena/website/static/files/windows_agent_arena.pdf)

+ [Harnessing Webpage UIs for Text-Rich Visual Understanding](https://arxiv.org/abs/2410.13824) (Oct. 2024)

  [![Star](https://img.shields.io/github/stars/neulab/multiui.svg?style=social&label=Star)](https://github.com/neulab/multiui)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2410.13824)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://neulab.github.io/MultiUI/)

+ [GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent](https://arxiv.org/abs/2412.18426) (Dec. 2024)
279 | [![Star](https://img.shields.io/github/stars/ZJU-ACES-ISE/ChatUITest.svg?style=social&label=Star)](https://github.com/ZJU-ACES-ISE/ChatUITest) 280 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2412.18426) 281 | 282 | + [A3: Android Agent Arena for Mobile GUI Agents](https://arxiv.org/abs/2501.01149) (Jan. 2025) 283 | 284 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2501.01149) 285 | [![Website](https://img.shields.io/badge/Website-9cf)](https://yuxiangchai.github.io/Android-Agent-Arena/) 286 | 287 | + [ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use](https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf) 288 | 289 | [![Star](https://img.shields.io/github/stars/likaixin2000/ScreenSpot-Pro-GUI-Grounding.svg?style=social&label=Star)](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding) 290 | [![Website](https://img.shields.io/badge/Website-9cf)](https://gui-agent.github.io/grounding-leaderboard/) 291 | [![PDF](https://img.shields.io/badge/Paper-PDF-red)](https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf) 292 | 293 | + [WebWalker: Benchmarking LLMs in Web Traversal](https://github.com/Alibaba-nlp/WebWalker) 294 | 295 | [![Star](https://img.shields.io/github/stars/Alibaba-nlp/WebWalker.svg?style=social&label=Star)](https://github.com/Alibaba-nlp/WebWalker) 296 | [![Website](https://img.shields.io/badge/Website-9cf)](https://alibaba-nlp.github.io/WebWalker/) 297 | [![PDF](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/pdf/2501.07572) 298 | 299 | + [SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation](https://ai-agents-2030.github.io/SPA-Bench/) (ICLR 2025) 300 | 301 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2410.15164) 302 | [![Website](https://img.shields.io/badge/Website-9cf)](https://ai-agents-2030.github.io/SPA-Bench/) 303 | 304 | + [WorldGUI: Dynamic Testing for 
Comprehensive Desktop GUI Automation](https://arxiv.org/abs/2502.08047) (Feb. 2025) 305 | 306 | [![Star](https://img.shields.io/github/stars/showlab/WorldGUI.svg?style=social&label=Star)](https://github.com/showlab/GUI-Thinker) 307 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2502.08047) 308 | [![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/GUI-Thinker/) 309 | 310 | + [LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark](https://arxiv.org/abs/2504.13805) (Apr. 2025) 311 | 312 | [![Star](https://img.shields.io/github/stars/lgy0404/LearnAct.svg?style=social&label=Star)](https://github.com/lgy0404/LearnAct) 313 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2504.13805) 314 | [![Website](https://img.shields.io/badge/Website-9cf)](https://lgy0404.github.io/LearnAct/) 315 | 316 | ## Models / Agents 317 | 318 | + [Grounding Open-Domain Instructions to Automate Web Support Tasks](https://web3.arxiv.org/abs/2103.16057) (Mar. 2021) 319 | 320 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://web3.arxiv.org/abs/2103.16057) 321 | 322 | + [Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning](https://arxiv.org/abs/2108.03353) (Aug. 2021) 323 | 324 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](http://arxiv.org/abs/2108.03353) 325 | 326 | + [A Data-Driven Approach for Learning to Control Computers](https://arxiv.org/abs/2202.08137) (Feb. 2022) 327 | 328 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2202.08137) 329 | 330 | + [Augmenting Autotelic Agents with Large Language Models](https://arxiv.org/pdf/2305.12487) (May. 2023) 331 | 332 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2305.12487) 333 | 334 | + [Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control](https://arxiv.org/abs/2306.07863) (Jun. 
2023, ICLR 2024) 335 | 336 | [![Star](https://img.shields.io/github/stars/ltzheng/synapse.svg?style=social&label=Star)](https://github.com/ltzheng/synapse) 337 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2306.07863) 338 | 339 | + [A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis](https://arxiv.org/abs/2307.12856) (Jul. 2023, ICLR 2024) 340 | 341 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](http://arxiv.org/abs/2307.12856) 342 | 343 | + [LASER: LLM Agent with State-Space Exploration for Web Navigation](https://arxiv.org/abs/2309.08172) (Sep. 2023) 344 | 345 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2309.08172) 346 | 347 | + [CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914) (Dec. 2023, CVPR 2024) 348 | 349 | [![Star](https://img.shields.io/github/stars/THUDM/CogVLM.svg?style=social&label=Star)](https://github.com/THUDM/CogVLM) 350 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2312.08914) 351 | 352 | + [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models](https://arxiv.org/abs/2401.13919) 353 | 354 | [![Star](https://img.shields.io/github/stars/MinorJerry/WebVoyager.svg?style=social&label=Star)](https://github.com/MinorJerry/WebVoyager) 355 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2401.13919) 356 | 357 | + [OS-Copilot: Towards Generalist Computer Agents with Self-Improvement](https://arxiv.org/abs/2402.07456) (Feb. 
2024) 358 | 359 | [![Star](https://img.shields.io/github/stars/OS-Copilot/OS-Copilot.svg?style=social&label=Star)](https://github.com/OS-Copilot/OS-Copilot) 360 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.07456) 361 | [![Website](https://img.shields.io/badge/Website-9cf)](https://os-copilot.github.io/) 362 | 363 | + [UFO: A UI-Focused Agent for Windows OS Interaction](https://arxiv.org/abs/2402.07939) (Feb. 2024) 364 | 365 | [![Star](https://img.shields.io/github/stars/microsoft/UFO.svg?style=social&label=Star)](https://github.com/microsoft/UFO) 366 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.07939) 367 | [![Website](https://img.shields.io/badge/Website-9cf)](https://microsoft.github.io/UFO/) 368 | 369 | + [Comprehensive Cognitive LLM Agent for Smartphone GUI Automation](https://arxiv.org/abs/2402.11941) (Feb. 2024) 370 | 371 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.11941) 372 | 373 | + [Improving Language Understanding from Screenshots](https://arxiv.org/abs/2402.14073) (Feb. 2024) 374 | 375 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.14073) 376 | 377 | + [AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent](https://arxiv.org/abs/2404.03648) (Apr. 2024, KDD 2024) 378 | 379 | [![Star](https://img.shields.io/github/stars/THUDM/AutoWebGLM.svg?style=social&label=Star)](https://github.com/THUDM/AutoWebGLM) 380 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.03648) 381 | 382 | + [SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models](https://arxiv.org/abs/2305.19308) (May. 
2023, NeurIPS 2023) 383 | 384 | [![Star](https://img.shields.io/github/stars/BraveGroup/SheetCopilot.svg?style=social&label=Star)](https://github.com/BraveGroup/SheetCopilot) 385 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2305.19308) 386 | [![Website](https://img.shields.io/badge/Website-9cf)](https://sheetcopilot.github.io/) 387 | 388 | + [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) (Sep. 2023) 389 | 390 | [![Star](https://img.shields.io/github/stars/cooelf/Auto-UI.svg?style=social&label=Star)](https://github.com/cooelf/Auto-UI) 391 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2309.11436) 392 | 393 | + [Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API](https://arxiv.org/abs/2310.04716) (Oct. 2023) 394 | 395 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2310.04716) 396 | 397 | + [OpenAgents: AN OPEN PLATFORM FOR LANGUAGE AGENTS IN THE WILD](https://arxiv.org/pdf/2310.10634) (Oct. 2023) 398 | 399 | [![Star](https://img.shields.io/github/stars/xlang-ai/OpenAgents.svg?style=social&label=Star)](https://github.com/xlang-ai/OpenAgents) 400 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2310.10634) 401 | 402 | + [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) (Oct. 
2024) 403 | 404 | [![Star](https://img.shields.io/github/stars/chengyou-jia/AgentStore.svg?style=social&label=Star)](https://github.com/chengyou-jia/AgentStore) 405 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2410.18603) 406 | [![Website](https://img.shields.io/badge/Website-9cf)](https://chengyou-jia.github.io/AgentStore-Home/) 407 | 408 | + [GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation](https://arxiv.org/abs/2311.07562) (Nov. 2023) 409 | 410 | [![Star](https://img.shields.io/github/stars/zzxslp/MM-Navigator.svg?style=social&label=Star)](https://github.com/zzxslp/MM-Navigator) 411 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2311.07562) 412 | 413 | + [AppAgent: Multimodal Agents as Smartphone Users](https://arxiv.org/abs/2312.13771) (Dec. 2023) 414 | 415 | [![Star](https://img.shields.io/github/stars/mnotgod96/AppAgent.svg?style=social&label=Star)](https://github.com/mnotgod96/AppAgent) 416 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2312.13771) 417 | [![Website](https://img.shields.io/badge/Website-9cf)](https://appagent-official.github.io) 418 | 419 | + [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935) (Jan. 2024, ACL 2024) 420 | 421 | [![Star](https://img.shields.io/github/stars/njucckevin/SeeClick.svg?style=social&label=Star)](https://github.com/njucckevin/SeeClick) 422 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2401.10935) 423 | 424 | + [GPT-4V(ision) is a Generalist Web Agent, if Grounded](https://arxiv.org/abs/2401.01614) (Jan. 
2024, ICML 2024) 425 | 426 | [![Star](https://img.shields.io/github/stars/OSU-NLP-Group/SeeAct.svg?style=social&label=Star)](https://github.com/OSU-NLP-Group/SeeAct) 427 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2401.01614) 428 | [![Website](https://img.shields.io/badge/Website-9cf)](https://osu-nlp-group.github.io/SeeAct/) 429 | 430 | 431 | + [Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception](http://arxiv.org/abs/2401.16158) (Jan. 2024) 432 | 433 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](http://arxiv.org/abs/2401.16158) 434 | 435 | + [Dual-View Visual Contextualization for Web Navigation](https://arxiv.org/abs/2402.04476) (Feb. 2024, CVPR 2024) 436 | 437 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.04476) 438 | 439 | + [DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning](https://arxiv.org/abs/2406.11896) (Jun. 2024) 440 | 441 | [![Star](https://img.shields.io/github/stars/DigiRL-agent/digirl.svg?style=social&label=Star)](https://github.com/DigiRL-agent/digirl) 442 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.11896) 443 | [![Website](https://img.shields.io/badge/Website-9cf)](https://digirl-agent.github.io/) 444 | 445 | + [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9.pdf) (NAACL 2024) 446 | 447 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://aclanthology.org/2024.naacl-industry.9.pdf) 448 | 449 | + [ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model](https://arxiv.org/abs/2402.07945) (Feb. 
2024) 450 | 451 | [![Star](https://img.shields.io/github/stars/niuzaisheng/ScreenAgent.svg?style=social&label=Star)](https://github.com/niuzaisheng/ScreenAgent) 452 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.07945) 453 | [![Website](https://img.shields.io/badge/Website-9cf)](https://screenagent.pages.dev/) 454 | 455 | + [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) (Feb. 2024) 456 | 457 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.04615) 458 | 459 | + [Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs](https://arxiv.org/abs/2404.05719) (Apr. 2024) 460 | 461 | [![Star](https://img.shields.io/github/stars/apple/ml-ferret.svg?style=social&label=Star)](https://github.com/apple/ml-ferret) 462 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.05719) 463 | 464 | + [Octopus: On-device language model for function calling of software APIs](https://arxiv.org/abs/2404.01549) (Apr. 2024) 465 | 466 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.01549) 467 | 468 | + [Octopus v2: On-device language model for super agent](https://arxiv.org/abs/2404.01744) (Apr. 2024) 469 | 470 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.01744) 471 | 472 | + [Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent](https://arxiv.org/abs/2404.11459) (Apr. 2024) 473 | 474 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.11459) 475 | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.nexa4ai.com/octopus-v3) 476 | 477 | + [Octopus v4: Graph of language models](https://arxiv.org/abs/2404.19296) (Apr. 2024) 478 | 479 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.19296) 480 | 481 | + [AutoWebGLM: Bootstrap and
Reinforce a Large Language Model-based Web Navigating Agent](https://arxiv.org/abs/2404.03648) (Apr. 2024) 482 | 483 | [![Star](https://img.shields.io/github/stars/THUDM/AutoWebGLM.svg?style=social&label=Star)](https://github.com/THUDM/AutoWebGLM) 484 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.03648) 485 | 486 | + [Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning](https://arxiv.org/abs/2404.10887) (Apr. 2024) 487 | 488 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.10887) 489 | 490 | + [Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking](https://arxiv.org/pdf/2404.08860v3) (Apr. 2024, SIGIR 2024) 491 | 492 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2404.08860v3) 493 | 494 | + [AutoDroid: LLM-powered Task Automation in Android](https://arxiv.org/abs/2308.15272) (Aug. 2023) 495 | 496 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2308.15272) 497 | 498 | + [Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation](https://arxiv.org/abs/2312.03003) (Dec. 2023, MobiCom 2024) 499 | 500 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2312.03003) 501 | [![Website](https://img.shields.io/badge/Website-9cf)](https://mobile-gpt.github.io/) 502 | 503 | 504 | + [Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study](https://arxiv.org/abs/2403.03186) (Mar.
2024) 505 | 506 | [![Star](https://img.shields.io/github/stars/BAAI-Agents/Cradle.svg?style=social&label=Star)](https://github.com/BAAI-Agents/Cradle) 507 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2403.03186) 508 | [![Website](https://img.shields.io/badge/Website-9cf)](https://baai-agents.github.io/Cradle/) 509 | 510 | + [Android in the Zoo: Chain-of-Action-Thought for GUI Agents](https://arxiv.org/abs/2403.02713) (Mar. 2024) 511 | 512 | [![Star](https://img.shields.io/github/stars/IMNearth/CoAT.svg?style=social&label=Star)](https://github.com/IMNearth/CoAT) 513 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2403.02713) 514 | 515 | + [Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning](https://arxiv.org/abs/2405.00516v1) (May 2024) 516 | 517 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2405.00516v1) 518 | 519 | + [GUI Action Narrator: Where and When Did That Action Take Place?](https://arxiv.org/abs/2406.13719) (Jun. 2024) 520 | 521 | [![Star](https://img.shields.io/github/stars/showlab/GUI-Action-Narrator.svg?style=social&label=Star)](https://github.com/showlab/GUI-Action-Narrator) 522 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.13719) 523 | [![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/GUI-Narrator) 524 | 525 | + [Identifying User Goals from UI Trajectories](https://arxiv.org/abs/2406.14314) (Jun. 2024) 526 | 527 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.14314) 528 | 529 | + [VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning](https://arxiv.org/abs/2406.14056) (Jun. 
2024) 530 | 531 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.14056) 532 | 533 | + [Octo-planner: On-device Language Model for Planner-Action Agents](https://arxiv.org/abs/2406.18082) (Jun. 2024) 534 | 535 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.18082) 536 | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.nexa4ai.com/octo-planner#video) 537 | 538 | + [E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion](https://arxiv.org/abs/2406.14250) (Jun. 2024) 539 | 540 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.14250) 541 | 542 | + [Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration](https://arxiv.org/abs/2406.01014) (Jun. 2024) 543 | 544 | [![Star](https://img.shields.io/github/stars/X-PLUG/MobileAgent.svg?style=social&label=Star)](https://github.com/X-PLUG/MobileAgent) 545 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.01014) 546 | 547 | + [MobileFlow: A Multimodal LLM For Mobile GUI Agent](https://arxiv.org/abs/2407.04346) (Jul. 2024) 548 | 549 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.04346) 550 | 551 | + [Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model](https://arxiv.org/abs/2407.03037) (Jul. 2024) 552 | 553 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.03037) 554 | 555 | + [Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence](https://arxiv.org/abs/2407.07061) (Jul. 
2024) 556 | 557 | [![Star](https://img.shields.io/github/stars/OpenBMB/IoA.svg?style=social&label=Star)](https://github.com/OpenBMB/IoA) 558 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.07061) 559 | 560 | + [MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices](https://arxiv.org/abs/2407.03913) (Jul. 2024) 561 | 562 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.03913) 563 | 564 | + [AUITestAgent: Automatic Requirements Oriented GUI Function Testing](https://arxiv.org/abs/2407.09018) (Jul. 2024) 565 | 566 | [![Star](https://img.shields.io/github/stars/bz-lab/AUITestAgent.svg?style=social&label=Star)](https://github.com/bz-lab/AUITestAgent) 567 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.09018) 568 | 569 | + [Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems](https://arxiv.org/abs/2407.13032) (Jul. 2024) 570 | 571 | [![Star](https://img.shields.io/github/stars/EmergenceAI/Agent-E.svg?style=social&label=Star)](https://github.com/EmergenceAI/Agent-E) 572 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.13032) 573 | 574 | + [OmniParser for Pure Vision Based GUI Agent](https://arxiv.org/pdf/2408.00203) (Aug. 2024) 575 | 576 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2408.00203) 577 | 578 | + [VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents](https://arxiv.org/abs/2408.06327) (Aug. 
2024) 579 | 580 | [![Star](https://img.shields.io/github/stars/THUDM/VisualAgentBench.svg?style=social&label=Star)](https://github.com/THUDM/VisualAgentBench) 581 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2408.06327) 582 | [![Website](https://img.shields.io/badge/Website-9cf)](https://github.com/THUDM/VisualAgentBench) 583 | 584 | + [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.07199) (Aug. 2024) 585 | 586 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2408.07199) 587 | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities) 588 | 589 | + [MindSearch: Mimicking Human Minds Elicits Deep AI Searcher](https://arxiv.org/abs/2407.20183) (Jul. 2024) 590 | 591 | [![Star](https://img.shields.io/github/stars/InternLM/MindSearch.svg?style=social&label=Star)](https://github.com/InternLM/MindSearch) 592 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.20183) 593 | [![Website](https://img.shields.io/badge/Website-9cf)](https://mindsearch.netlify.app/) 594 | 595 | + [AppAgent v2: Advanced Agent for Flexible Mobile Interactions](https://arxiv.org/abs/2408.11824) (Aug. 2024) 596 | 597 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2408.11824) 598 | 599 | + [Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.02544) (Aug. 2024) 601 | 602 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2408.02544) 603 | 604 | + [Agent Workflow Memory](https://arxiv.org/abs/2409.07429) (Sep.
2024) 605 | 606 | [![Star](https://img.shields.io/github/stars/zorazrw/agent-workflow-memory.svg?style=social&label=Star)](https://github.com/zorazrw/agent-workflow-memory) 607 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2409.07429) 608 | 609 | + [MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding](https://arxiv.org/abs/2409.14818) (Sep. 2024) 610 | 611 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2409.14818) 612 | 613 | + [Agent S: An Open Agentic Framework that Uses Computers Like a Human](https://arxiv.org/abs/2410.08164) (Oct. 2024) 614 | 615 | [![Star](https://img.shields.io/github/stars/simular-ai/Agent-S.svg?style=social&label=Star)](https://github.com/simular-ai/Agent-S) 616 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2410.08164) 617 | 618 | + [MobA: A Two-Level Agent System for Efficient Mobile Task Automation](https://arxiv.org/abs/2410.13757) (Oct. 2024) 619 | 620 | [![Star](https://img.shields.io/github/stars/OpenDFM/MobA.svg?style=social&label=Star)](https://github.com/OpenDFM/MobA) 621 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2410.13757) 622 | [![Dataset](https://img.shields.io/badge/Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/OpenDFM/MobA-MobBench) 623 | 624 | + [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://arxiv.org/abs/2410.05243) (Oct.
2024) 625 | 626 | [![Star](https://img.shields.io/github/stars/OSU-NLP-Group/UGround.svg?style=social&label=Star)](https://github.com/OSU-NLP-Group/UGround) 627 | [![Website](https://img.shields.io/badge/Website-9cf)](https://osu-nlp-group.github.io/UGround/) 628 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2410.05243) 629 | 630 | 631 | + [OS-ATLAS: A Foundation Action Model For Generalist GUI Agents](https://arxiv.org/pdf/2410.23218) (Oct. 2024) 632 | 633 | [![Star](https://img.shields.io/github/stars/OS-Copilot/OS-Atlas.svg?style=social&label=Star)](https://github.com/OS-Copilot/OS-Atlas) 634 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2410.23218) 635 | [![Website](https://img.shields.io/badge/Website-9cf)](https://osatlas.github.io/) 636 | [![Dataset](https://img.shields.io/badge/Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data) 637 | 638 | + [Attacking Vision-Language Computer Agents via Pop-ups](https://arxiv.org/abs/2411.02391) (Nov. 2024) 639 | 640 | [![Star](https://img.shields.io/github/stars/SALT-NLP/PopupAttack.svg?style=social&label=Star)](https://github.com/SALT-NLP/PopupAttack) 641 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2411.02391) 642 | 643 | 644 | + [AutoGLM: Autonomous Foundation Agents for GUIs](https://arxiv.org/abs/2411.00820) (Nov. 2024) 645 | 646 | [![Star](https://img.shields.io/github/stars/THUDM/AutoGLM.svg?style=social&label=Star)](https://github.com/THUDM/AutoGLM) 647 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2411.00820) 648 | 649 | + [AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations](https://arxiv.org/abs/2411.13451) (Nov. 
2024) 650 | 651 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2411.13451) 652 | 653 | + [ShowUI: One Vision-Language-Action Model for Generalist GUI Agent](https://arxiv.org/abs/2411.17465) (Nov. 2024) 654 | 655 | [![Star](https://img.shields.io/github/stars/showlab/ShowUI.svg?style=social&label=Star)](https://github.com/showlab/ShowUI) 656 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2411.17465) 657 | 658 | + [Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction](https://arxiv.org/abs/2412.04454) (Dec. 2024) 659 | 660 | [![Website](https://img.shields.io/badge/Website-9cf)](https://aguvis-project.github.io/) 661 | [![Star](https://img.shields.io/github/stars/xlang-ai/Aguvis.svg?style=social&label=Star)](https://github.com/xlang-ai/aguvis) 662 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2412.04454) 663 | 664 | + [Falcon-UI: Understanding GUI Before Following User Instructions](https://arxiv.org/abs/2412.09362) (Dec. 2024) 665 | 666 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2412.09362) 667 | 668 | + [PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World](https://arxiv.org/abs/2412.17589) (Dec. 2024) 669 | 670 | [![Star](https://img.shields.io/github/stars/GAIR-NLP/PC-Agent.svg?style=social&label=Star)](https://github.com/GAIR-NLP/PC-Agent) 671 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2412.17589) 672 | [![Website](https://img.shields.io/badge/Website-9cf)](https://gair-nlp.github.io/PC-Agent/) 673 | 674 | + [Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining](https://arxiv.org/pdf/2412.10342) (Dec. 2024) 675 | 676 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2412.10342) 677 | 678 | + [Aria-UI: Visual Grounding for GUI Instructions](https://arxiv.org/abs/2412.16256) (Dec. 
2024) 679 | 680 | [![Star](https://img.shields.io/github/stars/AriaUI/Aria-UI.svg?style=social&label=Star)](https://github.com/AriaUI/Aria-UI) 681 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2412.16256) 682 | [![Website](https://img.shields.io/badge/Website-9cf)](https://ariaui.github.io) 683 | [![Dataset](https://img.shields.io/badge/Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/Aria-UI/Aria-UI_Data) 684 | 685 | + [CogAgent v2](https://github.com/THUDM/CogAgent) (Dec. 2024) 686 | 687 | [![Star](https://img.shields.io/github/stars/THUDM/CogAgent.svg?style=social&label=Star)](https://github.com/THUDM/CogAgent) 688 | 689 | + [OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis](https://arxiv.org/abs/2412.19723) (Dec. 2024) 690 | 691 | [![Star](https://img.shields.io/github/stars/OS-Copilot/OS-Genesis.svg?style=social&label=Star)](https://github.com/OS-Copilot/OS-Genesis) 692 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2412.19723) 693 | [![Website](https://img.shields.io/badge/Website-9cf)](https://qiushisun.github.io/OS-Genesis-Home/) 694 | 695 | + [InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection](https://arxiv.org/pdf/2501.04575) (Jan. 2025) 696 | 697 | [![Star](https://img.shields.io/github/stars/Reallm-Labs/InfiGUIAgent.svg?style=social&label=Star)](https://github.com/Reallm-Labs/InfiGUIAgent) 698 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.04575) 699 | 700 | + [GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration](https://arxiv.org/pdf/2501.13896) (Jan.
2025) 701 | 702 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.13896) 703 | [![Website](https://img.shields.io/badge/Website-9cf)](https://gui-bee.github.io/) 704 | 705 | + [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883) (ICLR 2025) 706 | 707 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2410.17883) 708 | 709 | + [DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents](https://arxiv.org/abs/2410.14803) (ICLR 2025) 710 | 711 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2410.14803) 712 | [![Website](https://img.shields.io/badge/Website-9cf)](https://ai-agents-2030.github.io/DistRL/) 713 | 714 | + [AppVLM: A Lightweight Vision Language Model for Online App Control](https://arxiv.org/abs/2502.06395) (Feb. 2025) 715 | 716 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2502.06395) 717 | 718 | + [VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning](https://arxiv.org/abs/2502.07949) (Feb. 2025) 719 | 720 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2502.07949) 721 | 722 | + [GUI-Thinker: A Basic yet Comprehensive GUI Agent Developed with Self-Reflection](https://arxiv.org/abs/2502.08047) (Feb.
2025) 723 | 724 | [![Star](https://img.shields.io/github/stars/showlab/GUI-Thinker.svg?style=social&label=Star)](https://github.com/showlab/GUI-Thinker) 725 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2502.08047) 726 | [![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/GUI-Thinker/) 727 | 728 | + [MobileA3gent: Training Mobile GUI Agents Using Decentralized Self-Sourced Data from Diverse Users](https://arxiv.org/abs/2502.02982) (Feb. 2025) 729 | 730 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2502.02982) 730 | 731 | + [FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data](https://arxiv.org/abs/2503.05143) (Mar. 2025) 732 | 733 | [![Star](https://img.shields.io/github/stars/wwh0411/FedMABench.svg?style=social&label=Star)](https://github.com/wwh0411/FedMABench) 734 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2503.05143) 735 | [![Dataset](https://img.shields.io/badge/Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/wwh0411/FedMABench) 736 | 737 | 738 | ## Surveys 739 | + [OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use](https://github.com/OS-Agent-Survey/OS-Agent-Survey) (Dec. 2024) 740 | 741 | [![Star](https://img.shields.io/github/stars/OS-Agent-Survey/OS-Agent-Survey.svg?style=social&label=Star)](https://github.com/OS-Agent-Survey/OS-Agent-Survey) 742 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://github.com/OS-Agent-Survey/OS-Agent-Survey/blob/main/paper.pdf) 743 | [![Website](https://img.shields.io/badge/Website-9cf)](https://os-agent-survey.github.io/) 744 | 745 | + [GUI Agents with Foundation Models: A Comprehensive Survey](https://arxiv.org/abs/2411.04890) (Nov.
2024) 746 | 747 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2411.04890) 748 | 749 | + [Large Language Model-Brained GUI Agents: A Survey](https://arxiv.org/abs/2411.18279) (Nov. 2024) 750 | 751 | [![Website](https://img.shields.io/badge/Website-9cf)](https://vyokky.github.io/LLM-Brained-GUI-Agents-Survey/) 752 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2411.18279) 753 | 754 | + [GUI Agents: A Survey](https://arxiv.org/abs/2412.13501) (Dec. 2024) 755 | 756 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2412.13501) 757 | 758 | + [LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects](https://arxiv.org/abs/2504.19838) (Apr. 2025) 759 | 760 | [![Star](https://img.shields.io/github/stars/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents.svg?style=social&label=Star)](https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents) 761 | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2504.19838) 762 | 763 | ## Projects 764 | + [PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/index.html) 765 | 766 | [![Star](https://img.shields.io/github/stars/asweigart/pyautogui.svg?style=social&label=Star)](https://github.com/asweigart/pyautogui/tree/master) 767 | [![Website](https://img.shields.io/badge/Website-9cf)](https://pyautogui.readthedocs.io/en/latest/) 768 | 769 | + [nut.js](https://nutjs.dev/) 770 | 771 | [![Star](https://img.shields.io/github/stars/nut-tree/nut.js.svg?style=social&label=Star)](https://github.com/nut-tree/nut.js) 772 | [![Website](https://img.shields.io/badge/Website-9cf)](https://nutjs.dev/) 773 | 774 | + [GPT-4V-Act: AI agent using GPT-4V(ision) for web UI interaction](https://github.com/ddupont808/GPT-4V-Act) 775 | 776 | [![Star](https://img.shields.io/github/stars/ddupont808/GPT-4V-Act.svg?style=social&label=Star)](https://github.com/ddupont808/GPT-4V-Act) 777 | 778 | + 
[gpt-computer-assistant](https://github.com/onuratakan/gpt-computer-assistant) 779 | 780 | [![Star](https://img.shields.io/github/stars/onuratakan/gpt-computer-assistant.svg?style=social&label=Star)](https://github.com/onuratakan/gpt-computer-assistant) 781 | 782 | + [Mobile-Agent: The Powerful Mobile Device Operation Assistant Family](https://github.com/X-PLUG/MobileAgent) 783 | 784 | [![Star](https://img.shields.io/github/stars/X-PLUG/MobileAgent.svg?style=social&label=Star)](https://github.com/X-PLUG/MobileAgent) 785 | 786 | + [OpenUI](https://github.com/wandb/openui) 787 | 788 | [![Star](https://img.shields.io/github/stars/wandb/openui.svg?style=social&label=Star)](https://github.com/wandb/openui) 789 | [![Website](https://img.shields.io/badge/Website-9cf)](https://openui.fly.dev) 790 | 791 | + [ACT-1](https://www.adept.ai/blog/act-1) 792 | 793 | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.adept.ai/blog/act-1) 794 | 795 | + [NatBot](https://github.com/nat/natbot) 796 | 797 | [![Star](https://img.shields.io/github/stars/nat/natbot.svg?style=social&label=Star)](https://github.com/nat/natbot) 798 | 799 | + [Multion](https://www.multion.ai) 800 | 801 | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.multion.ai/) 802 | 803 | + [Auto-GPT](https://github.com/Significant-Gravitas/Auto-GPT) 804 | 805 | [![Star](https://img.shields.io/github/stars/Significant-Gravitas/Auto-GPT.svg?style=social&label=Star)](https://github.com/Significant-Gravitas/Auto-GPT) 806 | 807 | + [WebLlama](https://github.com/McGill-NLP/webllama) 808 | 809 | [![Star](https://img.shields.io/github/stars/McGill-NLP/webllama.svg?style=social&label=Star)](https://github.com/McGill-NLP/webllama) 810 | [![Website](https://img.shields.io/badge/Website-9cf)](https://webllama.github.io) 811 | 812 | + [LaVague: Large Action Model Framework to Develop AI Web Agents](https://github.com/lavague-ai/LaVague) 813 | 814 | 
[![Star](https://img.shields.io/github/stars/lavague-ai/LaVague.svg?style=social&label=Star)](https://github.com/lavague-ai/LaVague) 815 | [![Website](https://img.shields.io/badge/Website-9cf)](https://docs.lavague.ai/) 816 | 817 | + [OpenAdapt: AI-First Process Automation with Large Multimodal Models](https://github.com/OpenAdaptAI/OpenAdapt) 818 | 819 | [![Star](https://img.shields.io/github/stars/OpenAdaptAI/OpenAdapt.svg?style=social&label=Star)](https://github.com/OpenAdaptAI/OpenAdapt) 820 | 821 | + [Surfkit: A toolkit for building and sharing AI agents that operate on devices](https://github.com/agentsea/surfkit) 822 | 823 | [![Star](https://img.shields.io/github/stars/agentsea/surfkit.svg?style=social&label=Star)](https://github.com/agentsea/surfkit) 824 | 825 | + [AGI Computer Control](https://github.com/James4Ever0/agi_computer_control) 826 | 827 | + [Open Interpreter](https://github.com/OpenInterpreter/open-interpreter) 828 | 829 | [![Star](https://img.shields.io/github/stars/OpenInterpreter/open-interpreter.svg?style=social&label=Star)](https://github.com/OpenInterpreter/open-interpreter) 830 | [![Website](https://img.shields.io/badge/Website-9cf)](https://openinterpreter.com/) 831 | 832 | + [WebMarker: Mark web pages for use with vision-language models](https://github.com/reidbarber/webmarker) 833 | 834 | [![Star](https://img.shields.io/github/stars/reidbarber/webmarker.svg?style=social&label=Star)](https://github.com/reidbarber/webmarker) 835 | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.webmarkerjs.com/) 836 | 837 | + [Computer Use Out-of-the-box](https://github.com/showlab/computer_use_ootb) 838 | 839 | [![Star](https://img.shields.io/github/stars/showlab/computer_use_ootb.svg?style=social&label=Star)](https://github.com/showlab/computer_use_ootb/tree/master) 840 | [![Website](https://img.shields.io/badge/Website-9cf)](https://computer-use-ootb.github.io/) 841 | 842 | ## Safety 843 | 844 | + [Adversarial Attacks on Multimodal 
Agents](https://github.com/ChenWu98/agent-attack) 845 | 846 | [![Star](https://img.shields.io/github/stars/ChenWu98/agent-attack.svg?style=social&label=Star)](https://github.com/ChenWu98/agent-attack) 847 | [![Website](https://img.shields.io/badge/Website-9cf)](https://chenwu.io/attack-agent/) 848 | 849 | + [AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents](https://github.com/AI-secure/AdvWeb) 850 | 851 | [![Star](https://img.shields.io/github/stars/AI-secure/AdvWeb.svg?style=social&label=Star)](https://github.com/AI-secure/AdvWeb) 852 | [![Website](https://img.shields.io/badge/Website-9cf)](https://ai-secure.github.io/AdvWeb/) 853 | 854 | + [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://github.com/jylee425/mobilesafetybench) 855 | 856 | [![Star](https://img.shields.io/github/stars/jylee425/mobilesafetybench.svg?style=social&label=Star)](https://github.com/jylee425/mobilesafetybench) 857 | [![Website](https://img.shields.io/badge/Website-9cf)](https://mobilesafetybench.github.io/) 858 | 859 | + [EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage](https://github.com/OSU-NLP-Group/EIA_against_webagent) 860 | 861 | [![Star](https://img.shields.io/github/stars/OSU-NLP-Group/EIA_against_webagent.svg?style=social&label=Star)](https://github.com/OSU-NLP-Group/EIA_against_webagent) 862 | 863 | + [Is Your LLM Secretly a World Model of the Internet? 
Model-Based Planning for Web Agents](https://github.com/OSU-NLP-Group/WebDreamer) 864 | 865 | [![Star](https://img.shields.io/github/stars/OSU-NLP-Group/WebDreamer.svg?style=social&label=Star)](https://github.com/OSU-NLP-Group/WebDreamer) 866 | 867 | + [Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.02544) 868 | 869 | + [Security Matrix for Multimodal Agents on Mobile Devices: A Systematic and Proof of Concept Study](https://arxiv.org/html/2407.09295v2) 870 | 871 | 872 | ## Related Repositories 873 | 874 | - [awesome-llm-powered-agent](https://github.com/hyp1231/awesome-llm-powered-agent) 875 | - [Awesome-LLM-based-Web-Agent-and-Tools](https://github.com/albzni/Awesome-LLM-based-Web-Agent-and-Tools) 876 | - [awesome-ui-agents](https://github.com/opendilab/awesome-ui-agents/) 877 | - [computer-control-agent-knowledge-base](https://github.com/James4Ever0/computer_control_agent_knowledge_base) 878 | - [Awesome GUI Agent Paper List](https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List/) 879 | 880 | ## Acknowledgements 881 | 882 | This template is provided by [Awesome-Video-Diffusion](https://github.com/showlab/Awesome-Video-Diffusion) and [Awesome-MLLM-Hallucination](https://github.com/showlab/Awesome-MLLM-Hallucination). 883 |