# Awesome GUI Agent [](https://github.com/sindresorhus/awesome)

A curated list of papers, projects, and resources for multi-modal Graphical User Interface (GUI) agents.

![Teaser](assets/teaser.webp)

*Build a digital assistant on your screen. Generated by DALL-E-3.*

**CONTRIBUTIONS WELCOME!**

🔥 This project is actively maintained, and we welcome your contributions. If you have any suggestions, such as missing papers or information, please feel free to open an issue or submit a pull request.

🤖 Try our [Awesome-Paper-Agent](https://chatgpt.com/g/g-qqs9km6wi-awesome-paper-agent). Just provide an arXiv URL, and it will automatically return a formatted entry, like this:

```
User:
https://arxiv.org/abs/2312.13108

GPT:
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2312.13108) (Dec. 2023)

  [](https://github.com/showlab/assistgui)
  [](https://arxiv.org/abs/2312.13108)
  [](https://showlab.github.io/assistgui/)
```

You can then copy the formatted entry directly into your pull request.

⭐ If you find this repository useful, please give it a star.

---
**Quick Navigation**: [[Datasets / Benchmarks]](#datasets--benchmarks) [[Models / Agents]](#models--agents) [[Surveys]](#surveys) [[Projects]](#projects) [[Safety]](#safety)

## Datasets / Benchmarks
+ [World of Bits: An Open-Domain Platform for Web-Based Agents](https://proceedings.mlr.press/v70/shi17a.html) (Aug. 2017, ICML 2017)

  [](https://proceedings.mlr.press/v70/shi17a/shi17a.pdf)

+ [A Unified Solution for Structured Web Data Extraction](https://dl.acm.org/doi/10.1145/2009916.2010020) (Jul. 2011, SIGIR 2011)

  [](https://dl.acm.org/doi/10.1145/2009916.2010020)

+ [Rico: A Mobile App Dataset for Building Data-Driven Design Applications](https://dl.acm.org/doi/10.1145/3126594.3126651) (Oct. 2017)

  [](https://dl.acm.org/doi/10.1145/3126594.3126651)

+ [Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration](https://arxiv.org/abs/1802.08802) (Feb. 2018, ICLR 2018)

  [](https://github.com/stanfordnlp/wge)
  [](https://arxiv.org/abs/1802.08802)

+ [Mapping Natural Language Instructions to Mobile UI Action Sequences](https://arxiv.org/abs/2005.03776) (May. 2020, ACL 2020)

  [](https://github.com/deepneuralmachine/seq2act-tensorflow)
  [](https://arxiv.org/abs/2005.03776)

+ [WebSRC: A Dataset for Web-Based Structural Reading Comprehension](https://arxiv.org/abs/2101.09465) (Jan. 2021, EMNLP 2021)

  [](https://arxiv.org/abs/2101.09465)
  [](https://x-lance.github.io/WebSRC/)

+ [AndroidEnv: A Reinforcement Learning Platform for Android](https://arxiv.org/abs/2105.13231) (May. 2021)

  [](https://github.com/deepmind/android_env)
  [](https://arxiv.org/abs/2105.13231)
  [](https://github.com/deepmind/android_env)

+ [A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility](https://arxiv.org/abs/2202.02312) (Feb. 2022)

  [](https://arxiv.org/abs/2202.02312)

+ [META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI](https://arxiv.org/abs/2205.11029) (May. 2022)

  [](https://arxiv.org/abs/2205.11029)
  [](https://x-lance.github.io/META-GUI-Leaderboard/)

+ [WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents](https://arxiv.org/abs/2207.01206) (Jul. 2022)

  [](https://github.com/princeton-nlp/WebShop)
  [](https://arxiv.org/abs/2207.01206)
  [](https://webshop-pnlp.github.io/)

+ [Language Models can Solve Computer Tasks](https://arxiv.org/abs/2303.17491) (Mar.
2023) 87 | 88 | [](https://github.com/posgnu/rci-agent) 89 | [](https://arxiv.org/abs/2303.17491) 90 | [](https://posgnu.github.io/rci-web/) 91 | 92 | 93 | + [Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction](https://arxiv.org/abs/2305.08144) (May. 2023) 94 | 95 | [](https://arxiv.org/abs/2305.08144) 96 | [](https://github.com/X-LANCE/Mobile-Env) 97 | 98 | + [Mind2Web: Towards a Generalist Agent for the Web](https://arxiv.org/abs/2306.06070) (Jun. 2023) 99 | 100 | [](https://github.com/osu-nlp-group/mind2web) 101 | [](https://arxiv.org/abs/2306.06070) 102 | [](https://osu-nlp-group.github.io/Mind2Web/) 103 | 104 | 105 | + [Android in the Wild: A Large-Scale Dataset for Android Device Control](https://arxiv.org/abs/2307.10088) (Jul. 2023) 106 | 107 | [](https://github.com/google-research/google-research/tree/master/android_in_the_wild) 108 | [](https://arxiv.org/abs/2307.10088) 109 | 110 | 111 | + [WebArena: A Realistic Web Environment for Building Autonomous Agents](https://arxiv.org/abs/2307.13854) (Jul. 2023) 112 | 113 | [](https://github.com/web-arena-x/webarena) 114 | [](https://arxiv.org/abs/2307.13854) 115 | [](https://webarena.dev/) 116 | 117 | + [Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models](https://arxiv.org/abs/2311.09278) (Nov. 2023) 118 | 119 | [](https://github.com/xufangzhi/ENVISIONS) 120 | [](https://arxiv.org/abs/2311.09278) 121 | [](https://xufangzhi.github.io/symbol-llm-page/) 122 | 123 | + [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2401.07781) (Dec. 2023, CVPR 2024) 124 | 125 | [](https://github.com/showlab/assistgui) 126 | [](https://arxiv.org/abs/2401.07781) 127 | [](https://showlab.github.io/assistgui/) 128 | 129 | + [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) (Jan. 2024, ACL 2024) 130 | 131 | [](https://github.com/jykoh/visualwebarena) 132 | [](https://arxiv.org/abs/2401.13649) 133 | [](https://jykoh.com/vwa) 134 | 135 | + [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) (Feb. 2024) 136 | 137 | [](https://arxiv.org/abs/2402.17553) 138 | 139 | 140 | + [WebLINX: Real-World Website Navigation with Multi-Turn Dialogue](https://arxiv.org/abs/2402.05930) (Feb. 2024) 141 | 142 | [](https://github.com/mcgill-nlp/weblinx) 143 | [](https://arxiv.org/abs/2402.05930) 144 | [](https://mcgill-nlp.github.io/weblinx/) 145 | 146 | + [On the Multi-turn Instruction Following for Conversational Web Agents](https://arxiv.org/abs/2402.15057) (Feb. 2024) 147 | 148 | [](https://github.com/magicgh/self-map) 149 | [](https://arxiv.org/abs/2402.15057) 150 | 151 | + [AgentStudio: A Toolkit for Building General Virtual Agents](https://arxiv.org/abs/2403.17918) (Mar. 2024) 152 | 153 | [](https://github.com/skyworkai/agent-studio) 154 | [](https://arxiv.org/abs/2403.17918) 155 | [](https://skyworkai.github.io/agent-studio/) 156 | 157 | 158 | + [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https://arxiv.org/abs/2404.07972) (Apr. 2024) 159 | 160 | [](https://github.com/xlang-ai/OSWorld) 161 | [](https://arxiv.org/abs/2404.07972) 162 | [](https://os-world.github.io/) 163 | 164 | 165 | + [Benchmarking Mobile Device Control Agents across Diverse Configurations](https://arxiv.org/abs/2404.16660) (Apr. 
2024, ICLR 2024) 166 | 167 | [](https://github.com/gimme1dollar/b-moca) 168 | [](https://arxiv.org/abs/2404.16660) 169 | 170 | + [MMInA: Benchmarking Multihop Multimodal Internet Agents](https://arxiv.org/abs/2404.09992) (Apr. 2024) 171 | 172 | [](https://github.com/shulin16/MMInA) 173 | [](https://arxiv.org/abs/2404.09992) 174 | [](https://mmina.cliangyu.com) 175 | 176 | + [Autonomous Evaluation and Refinement of Digital Agents](https://arxiv.org/abs/2404.06474) (Apr. 2024) 177 | 178 | [](https://github.com/Berkeley-NLP/Agent-Eval-Refine) 179 | [](https://arxiv.org/abs/2404.06474) 180 | 181 | + [LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation](https://arxiv.org/abs/2404.16054) (Apr. 2024) 182 | 183 | [](https://github.com/LlamaTouch/LlamaTouch) 184 | [](https://arxiv.org/abs/2404.16054) 185 | 186 | + [VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?](https://arxiv.org/abs/2404.05955) (Apr. 2024) 187 | 188 | [](https://arxiv.org/abs/2404.05955) 189 | 190 | + [GUICourse: From General Vision Language Models to Versatile GUI Agents](https://arxiv.org/abs/2406.11317) (Jun. 2024) 191 | 192 | [](https://github.com/yiye3/GUICourse) 193 | [](https://arxiv.org/abs/2406.11317) 194 | 195 | + [GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents](https://arxiv.org/abs/2406.10819) (Jun. 2024) 196 | 197 | [](https://github.com/Dongping-Chen/GUI-World) 198 | [](https://arxiv.org/abs/2406.10819) 199 | [](https://gui-world.github.io/) 200 | 201 | + [GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices](https://arxiv.org/abs/2406.08451) (Jun. 2024) 202 | 203 | [](https://github.com/OpenGVLab/GUI-Odyssey) 204 | [](https://arxiv.org/abs/2406.08451) 205 | 206 | + [VideoGUI: A Benchmark for GUI Automation from Instructional Videos](https://arxiv.org/abs/2406.10227) (Jun. 2024) 207 | 208 | [](https://github.com/showlab/videogui) 209 | [](https://arxiv.org/abs/2406.10227) 210 | [](https://showlab.github.io/videogui/) 211 | 212 | + [Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding](https://arxiv.org/abs/2406.19263) (Jun. 2024) 213 | 214 | [](https://github.com/eric-ai-lab/Screen-Point-and-Read) 215 | [](https://arxiv.org/abs/2406.19263) 216 | [](https://screen-point-and-read.github.io/) 217 | 218 | + [MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents](https://arxiv.org/abs/2406.08184) (Jun. 2024) 219 | 220 | [](https://github.com/MobileAgentBench/mobile-agent-bench) 221 | [](https://arxiv.org/abs/2406.08184) 222 | [](https://mobileagentbench.github.io) 223 | 224 | + [AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents](https://arxiv.org/abs/2405.14573) (Jun. 2024) 225 | 226 | [](https://github.com/google-research/android_world) 227 | [](https://arxiv.org/abs/2405.14573) 228 | 229 | + [Practical, Automated Scenario-based Mobile App Testing](https://arxiv.org/abs/2406.08340) (Jun. 2024) 230 | 231 | [](https://arxiv.org/abs/2406.08340) 232 | 233 | + [WebCanvas: Benchmarking Web Agents in Online Environments](https://arxiv.org/abs/2406.12373) (Jun. 2024) 234 | 235 | [](https://arxiv.org/abs/2406.12373) 236 | [](https://www.imean.ai/web-canvas) 237 | 238 | + [On the Effects of Data Scale on Computer Control Agents](https://arxiv.org/abs/2406.03679) (Jun. 
2024) 239 | 240 | [](https://github.com/google-research/google-research/tree/master/android_control) 241 | [](https://arxiv.org/abs/2406.03679) 242 | 243 | + [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511) (Jul. 2024) 244 | 245 | [](https://github.com/camel-ai/crab) 246 | [](https://arxiv.org/abs/2407.01511) 247 | 248 | + [WebVLN: Vision-and-Language Navigation on Websites](https://ojs.aaai.org/index.php/AAAI/article/view/27878) (AAAI 2024) 249 | 250 | [](https://github.com/WebVLN/WebVLN) 251 | [](https://ojs.aaai.org/index.php/AAAI/article/view/27878) 252 | 253 | + [Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?](https://arxiv.org/abs/2407.10956) (Jul. 2024) 254 | 255 | [](https://github.com/xlang-ai/Spider2-V) 256 | [](https://arxiv.org/abs/2407.10956) 257 | [](https://spider2-v.github.io/) 258 | 259 | + [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://arxiv.org/abs/2407.17490) 260 | 261 | [](https://github.com/YuxiangChai/AMEX-codebase) 262 | [](https://arxiv.org/abs/2407.17490) 263 | [](https://yuxiangchai.github.io/AMEX/) 264 | 265 | + [Windows Agent Arena](https://raw.githubusercontent.com/microsoft/WindowsAgentArena/website/static/files/windows_agent_arena.pdf) 266 | 267 | [](https://github.com/microsoft/WindowsAgentArena) 268 | [](https://microsoft.github.io/WindowsAgentArena/) 269 | [](https://raw.githubusercontent.com/microsoft/WindowsAgentArena/website/static/files/windows_agent_arena.pdf) 270 | 271 | + [Harnessing Webpage UIs for Text-Rich Visual Understanding](https://arxiv.org/abs/2410.13824) (Oct, 2024) 272 | 273 | [](https://github.com/neulab/multiui) 274 | [](https://neulab.github.io/MultiUI/) 275 | [](https://neulab.github.io/MultiUI/) 276 | 277 | + [GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent](https://arxiv.org/abs/2412.18426) (Dec, 2024) 278 | 279 | [](https://github.com/ZJU-ACES-ISE/ChatUITest) 280 | [](https://arxiv.org/abs/2412.18426) 281 | 282 | + [A3: Android Agent Arena for Mobile GUI Agents](https://arxiv.org/abs/2501.01149) (Jan. 2025) 283 | 284 | [](https://arxiv.org/abs/2501.01149) 285 | [](https://yuxiangchai.github.io/Android-Agent-Arena/) 286 | 287 | + [ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use](https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf) 288 | 289 | [](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding) 290 | [](https://gui-agent.github.io/grounding-leaderboard/) 291 | [](https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf) 292 | 293 | + [WebWalker: Benchmarking LLMs in Web Traversal](https://github.com/Alibaba-nlp/WebWalker) 294 | 295 | [](https://github.com/Alibaba-nlp/WebWalker) 296 | [](https://alibaba-nlp.github.io/WebWalker/) 297 | [](https://arxiv.org/pdf/2501.07572) 298 | 299 | + [SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation](https://ai-agents-2030.github.io/SPA-Bench/) (ICLR 2025) 300 | 301 | [](https://arxiv.org/abs/2410.15164) 302 | [](https://ai-agents-2030.github.io/SPA-Bench/) 303 | 304 | + [WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation](https://arxiv.org/abs/2502.08047) (Feb. 2025) 305 | 306 | [](https://github.com/showlab/GUI-Thinker) 307 | [](https://arxiv.org/abs/2502.08047) 308 | [](https://showlab.github.io/GUI-Thinker/) 309 | 310 | + [LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark](https://arxiv.org/abs/2504.13805) (Apr. 
2025) 311 | 312 | [](https://github.com/lgy0404/LearnAct) 313 | [](https://arxiv.org/abs/2504.13805) 314 | [](https://lgy0404.github.io/LearnAct/) 315 | 316 | ## Models / Agents 317 | 318 | + [Grounding Open-Domain Instructions to Automate Web Support Tasks](https://web3.arxiv.org/abs/2103.16057) (Mar. 2021) 319 | 320 | [](https://web3.arxiv.org/abs/2103.16057) 321 | 322 | + [Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning](https://arxiv.org/abs/2108.03353) (Aug. 2021) 323 | 324 | [](http://arxiv.org/abs/2108.03353) 325 | 326 | + [A Data-Driven Approach for Learning to Control Computers](https://arxiv.org/abs/2202.08137) (Feb. 2022) 327 | 328 | [](https://arxiv.org/abs/2202.08137) 329 | 330 | + [Augmenting Autotelic Agents with Large Language Models](https://arxiv.org/pdf/2305.12487) (May. 2023) 331 | 332 | [](https://arxiv.org/pdf/2305.12487) 333 | 334 | + [Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control](https://arxiv.org/abs/2306.07863) (Jun. 2023, ICLR 2024) 335 | 336 | [](https://github.com/ltzheng/synapse) 337 | [](https://arxiv.org/abs/2306.07863) 338 | 339 | + [A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis](https://arxiv.org/abs/2307.12856) (Jul. 2023, ICLR 2024) 340 | 341 | [](http://arxiv.org/abs/2307.12856) 342 | 343 | + [LASER: LLM Agent with State-Space Exploration for Web Navigation](https://arxiv.org/abs/2309.08172) (Sep. 2023) 344 | 345 | [](https://arxiv.org/abs/2309.08172) 346 | 347 | + [CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914) (Dec. 2023, CVPR 2024) 348 | 349 | [](https://github.com/THUDM/CogVLM) 350 | [](https://arxiv.org/abs/2312.08914) 351 | 352 | + [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models](https://arxiv.org/abs/2401.13919) 353 | 354 | [](https://github.com/MinorJerry/WebVoyager) 355 | [](https://arxiv.org/abs/2401.13919) 356 | 357 | + [OS-Copilot: Towards Generalist Computer Agents with Self-Improvement](https://arxiv.org/abs/2402.07456) (Feb. 2024) 358 | 359 | [](https://github.com/OS-Copilot/OS-Copilot) 360 | [](https://arxiv.org/abs/2402.07456) 361 | [](https://os-copilot.github.io/) 362 | 363 | + [UFO: A UI-Focused Agent for Windows OS Interaction](https://arxiv.org/abs/2402.07939) (Feb. 2024) 364 | 365 | [](https://github.com/microsoft/UFO) 366 | [](https://arxiv.org/abs/2402.07939) 367 | [](https://microsoft.github.io/UFO/) 368 | 369 | + [Comprehensive Cognitive LLM Agent for Smartphone GUI Automation](https://arxiv.org/abs/2402.11941) (Feb. 2024) 370 | 371 | [](https://arxiv.org/abs/2402.11941) 372 | 373 | + [Improving Language Understanding from Screenshots](https://arxiv.org/abs/2402.14073) (Feb. 2024) 374 | 375 | [](https://arxiv.org/abs/2402.14073) 376 | 377 | + [AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent](https://arxiv.org/abs/2404.03648) (Apr. 2024, KDD 2024) 378 | 379 | [](https://github.com/THUDM/AutoWebGLM) 380 | [](https://arxiv.org/abs/2404.03648) 381 | 382 | + [SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models](https://arxiv.org/abs/2305.19308) (May. 2023, NeurIPS 2023) 383 | 384 | [](https://github.com/BraveGroup/SheetCopilot) 385 | [](https://arxiv.org/abs/2305.19308) 386 | [](https://sheetcopilot.github.io/) 387 | 388 | + [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) (Sep. 
2023) 389 | 390 | [](https://github.com/cooelf/Auto-UI) 391 | [](https://arxiv.org/abs/2309.11436) 392 | 393 | + [Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API](https://arxiv.org/abs/2310.04716) (Oct. 2023) 394 | 395 | [](https://arxiv.org/abs/2310.04716) 396 | 397 | + [OpenAgents: AN OPEN PLATFORM FOR LANGUAGE AGENTS IN THE WILD](https://arxiv.org/pdf/2310.10634) (Oct. 2023) 398 | 399 | [](https://github.com/xlang-ai/OpenAgents) 400 | [](https://arxiv.org/pdf/2310.10634) 401 | 402 | + [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) (Oct. 2024) 403 | 404 | [](https://github.com/chengyou-jia/AgentStore) 405 | [](https://arxiv.org/abs/2410.18603) 406 | [](https://chengyou-jia.github.io/AgentStore-Home/) 407 | 408 | + [GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation](https://arxiv.org/abs/2311.07562) (Nov. 2023) 409 | 410 | [](https://github.com/zzxslp/MM-Navigator) 411 | [](https://arxiv.org/abs/2311.07562) 412 | 413 | + [AppAgent: Multimodal Agents as Smartphone Users](https://arxiv.org/abs/2312.13771) (Dec. 2023) 414 | 415 | [](https://github.com/mnotgod96/AppAgent) 416 | [](https://arxiv.org/abs/2312.13771) 417 | [](https://appagent-official.github.io) 418 | 419 | + [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935) (Jan. 2024, ACL 2024) 420 | 421 | [](https://github.com/njucckevin/SeeClick) 422 | [](https://arxiv.org/abs/2401.10935) 423 | 424 | + [GPT-4V(ision) is a Generalist Web Agent, if Grounded](https://arxiv.org/abs/2401.01614) (Jan. 2024, ICML 2024) 425 | 426 | [](https://github.com/OSU-NLP-Group/SeeAct) 427 | [](https://arxiv.org/abs/2401.01614) 428 | [](https://osu-nlp-group.github.io/SeeAct/) 429 | 430 | 431 | + [Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception](http://arxiv.org/abs/2401.16158) (Jan. 2024) 432 | 433 | [](http://arxiv.org/abs/2401.16158) 434 | 435 | + [Dual-View Visual Contextualization for Web Navigation](https://arxiv.org/abs/2402.04476) (Feb. 2024, CVPR 2024) 436 | 437 | [](https://arxiv.org/abs/2402.04476) 438 | 439 | + [DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning](https://arxiv.org/abs/2406.11896) (Jun. 2024) 440 | 441 | [](https://github.com/DigiRL-agent/digirl) 442 | [](https://arxiv.org/abs/2406.11896) 443 | [](https://digirl-agent.github.io/) 444 | 445 | + [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9.pdf) (NAACL 2024) 446 | 447 | [](https://aclanthology.org/2024.naacl-industry.9.pdf) 448 | 449 | + [ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model](https://arxiv.org/abs/2402.07945) (Feb. 2024) 450 | 451 | [](https://github.com/niuzaisheng/ScreenAgent) 452 | [](https://arxiv.org/abs/2402.07945) 453 | [](https://screenagent.pages.dev/) 454 | 455 | + [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) (Feb. 2024) 456 | 457 | [](https://arxiv.org/abs/2402.04615) 458 | 459 | + [Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs](https://arxiv.org/abs/2404.05719) (Apr. 
2024) 460 | 461 | [](https://github.com/apple/ml-ferret) 462 | [](https://arxiv.org/abs/2404.05719) 463 | 464 | + [Octopus: On-device language model for function calling of software APIs](https://arxiv.org/abs/2404.01549) (Apr., 2024) 465 | 466 | [](https://arxiv.org/abs/2404.01549) 467 | 468 | + [Octopus v2: On-device language model for super agent](https://arxiv.org/abs/2404.01744) (Apr., 2024) 469 | 470 | [](https://arxiv.org/abs/2404.01744) 471 | 472 | + [Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent](https://arxiv.org/abs/2404.11459) (Apr., 2024) 473 | 474 | [](https://arxiv.org/abs/2404.11459) 475 | [](https://www.nexa4ai.com/octopus-v3) 476 | 477 | + [Octopus v4: Graph of language models](https://arxiv.org/abs/2404.19296) (Apr., 2024) 478 | 479 | [](https://arxiv.org/abs/2404.19296) 480 | 481 | + [AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent](https://arxiv.org/abs/2404.03648) (Apr. 2024) 482 | 483 | [](https://github.com/THUDM/AutoWebGLM) 484 | [](https://arxiv.org/abs/2404.03648) 485 | 486 | + [Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning](https://arxiv.org/abs/2404.10887) (Apr. 2024) 487 | 488 | [](https://arxiv.org/abs/2404.10887) 489 | 490 | + [Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking](https://arxiv.org/pdf/2404.08860v3) (Apr. 2024, SIGIR 2024) 491 | 492 | [](https://arxiv.org/pdf/2404.08860v3) 493 | 494 | + [AutoDroid: LLM-powered Task Automation in Android](https://arxiv.org/abs/2308.15272) 495 | 496 | [](https://arxiv.org/abs/2308.15272) 497 | 498 | + [Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation](https://arxiv.org/abs/2312.03003) (Dec. 2023, MobiCom 2024) 499 | 500 | [](https://arxiv.org/abs/2312.03003) 501 | [](https://mobile-gpt.github.io/) 502 | 503 | 504 | + [Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study](https://arxiv.org/abs/2403.03186) (Mar. 2024) 505 | 506 | [](https://github.com/BAAI-Agents/Cradle) 507 | [](https://arxiv.org/abs/2403.03186) 508 | [](https://baai-agents.github.io/Cradle/) 509 | 510 | + [Android in the Zoo: Chain-of-Action-Thought for GUI Agents](https://arxiv.org/abs/2403.02713) (Mar. 2024) 511 | 512 | [](https://github.com/IMNearth/CoAT) 513 | [](https://arxiv.org/abs/2403.02713) 514 | 515 | + [Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning](https://arxiv.org/abs/2405.00516v1) (May 2024) 516 | 517 | [](https://arxiv.org/abs/2405.00516v1) 518 | 519 | + [GUI Action Narrator: Where and When Did That Action Take Place?](https://arxiv.org/abs/2406.13719) (Jun. 2024) 520 | 521 | [](https://github.com/showlab/GUI-Action-Narrator) 522 | [](https://arxiv.org/abs/2406.13719) 523 | [](https://showlab.github.io/GUI-Narrator) 524 | 525 | + [Identifying User Goals from UI Trajectories](https://arxiv.org/abs/2406.14314) (Jun. 2024) 526 | 527 | [](https://arxiv.org/abs/2406.14314) 528 | 529 | + [VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning](https://arxiv.org/abs/2406.14056) (Jun. 2024) 530 | 531 | [](https://arxiv.org/abs/2406.14056) 532 | 533 | + [Octo-planner: On-device Language Model for Planner-Action Agents](https://arxiv.org/abs/2406.18082) (Jun. 
2024) 534 | 535 | [](https://arxiv.org/abs/2406.18082) 536 | [](https://www.nexa4ai.com/octo-planner#video) 537 | 538 | + [E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion](https://arxiv.org/abs/2406.14250) (Jun. 2024) 539 | 540 | [](https://arxiv.org/abs/2406.14250) 541 | 542 | + [Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration](https://arxiv.org/abs/2406.01014) (Jun. 2024) 543 | 544 | [](https://github.com/X-PLUG/MobileAgent) 545 | [](https://arxiv.org/abs/2406.01014) 546 | 547 | + [MobileFlow: A Multimodal LLM For Mobile GUI Agent](https://arxiv.org/abs/2407.04346) (Jul. 2024) 548 | 549 | [](https://arxiv.org/abs/2407.04346) 550 | 551 | + [Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model](https://arxiv.org/abs/2407.03037) (Jul. 2024) 552 | 553 | [](https://arxiv.org/abs/2407.03037) 554 | 555 | + [Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence](https://arxiv.org/abs/2407.07061) (Jul. 2024) 556 | 557 | [](https://github.com/OpenBMB/IoA) 558 | [](https://arxiv.org/abs/2407.07061) 559 | 560 | + [MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices](https://arxiv.org/abs/2407.03913) (Jul. 2024) 561 | 562 | [](https://arxiv.org/abs/2407.03913) 563 | 564 | + [AUITestAgent: Automatic Requirements Oriented GUI Function Testing](https://arxiv.org/abs/2407.09018) (Jul. 2024) 565 | 566 | [](https://github.com/bz-lab/AUITestAgent) 567 | [](https://arxiv.org/abs/2407.09018) 568 | 569 | + [Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems](https://arxiv.org/abs/2407.13032) (Jul. 2024) 570 | 571 | [](https://github.com/EmergenceAI/Agent-E) 572 | [](https://arxiv.org/abs/2407.13032) 573 | 574 | + [OmniParser for Pure Vision Based GUI Agent](https://arxiv.org/pdf/2408.00203) (Aug. 2024) 575 | 576 | [](https://arxiv.org/pdf/2408.00203) 577 | 578 | + [VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents](https://arxiv.org/abs/2408.06327) (Aug. 2024) 579 | 580 | [](https://github.com/THUDM/VisualAgentBench) 581 | [](https://arxiv.org/abs/2408.06327) 582 | [](https://github.com/THUDM/VisualAgentBench) 583 | 584 | + [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://web3.arxiv.org/abs/2408.07199v1) (Aug. 2024) 585 | 586 | [](https://web3.arxiv.org/abs/2408.07199v1) 587 | [](https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities) 588 | 589 | + [MindSearch: Mimicking Human Minds Elicits Deep AI Searcher](https://arxiv.org/abs/2407.20183) (Jul. 2023) 590 | 591 | [](https://github.com/InternLM/MindSearch) 592 | [](https://arxiv.org/abs/2407.20183) 593 | [](https://mindsearch.netlify.app/) 594 | 595 | + [AppAgent v2: Advanced Agent for Flexible Mobile Interactions](https://arxiv.org/abs/2408.11824) (Aug. 2024) 596 | 597 | [](https://arxiv.org/abs/2408.11824) 598 | 599 | + [Caution for the Environment: 600 | Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.02544) (Aug. 2024) 601 | 602 | [](https://arxiv.org/pdf/2408.02544) 603 | 604 | + [Agent Workflow Memory](https://arxiv.org/abs/2409.07429) (Sep. 
2024) 605 | 606 | [](https://github.com/zorazrw/agent-workflow-memory) 607 | [](https://arxiv.org/abs/2409.07429) 608 | 609 | + [MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understandin](https://arxiv.org/abs/2409.14818) (Sep. 2024) 610 | 611 | [](https://arxiv.org/abs/2409.14818) 612 | 613 | + [Agent S: An Open Agentic Framework that Uses Computers Like a Human](https://arxiv.org/abs/2410.08164) (Oct. 2024) 614 | 615 | [](https://github.com/simular-ai/Agent-S) 616 | [](https://arxiv.org/abs/2410.08164) 617 | 618 | + [MobA: A Two-Level Agent System for Efficient Mobile Task Automation](https://arxiv.org/abs/2410.13757) (Oct. 2024) 619 | 620 | [](https://github.com/OpenDFM/MobA) 621 | [](https://arxiv.org/abs/2410.13757) 622 | [](https://huggingface.co/datasets/OpenDFM/MobA-MobBench) 623 | 624 | + [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://arxiv.org/abs/2410.05243) (Oct. 2024) 625 | 626 | [](https://github.com/OSU-NLP-Group/UGround) 627 | [](https://osu-nlp-group.github.io/UGround/) 628 | [](https://arxiv.org/abs/2410.05243) 629 | 630 | 631 | + [OS-ATLAS: A Foundation Action Model For Generalist GUI Agents](https://arxiv.org/pdf/2410.23218) (Oct. 2024) 632 | 633 | [](https://github.com/OS-Copilot/OS-Atlas) 634 | [](https://arxiv.org/abs/2410.23218) 635 | [](https://osatlas.github.io/) 636 | [](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data) 637 | 638 | + [Attacking Vision-Language Computer Agents via Pop-ups](https://arxiv.org/abs/2411.02391) (Nov. 2024) 639 | 640 | [](https://github.com/SALT-NLP/PopupAttack) 641 | [](https://arxiv.org/abs/2411.02391) 642 | 643 | 644 | + [AutoGLM: Autonomous Foundation Agents for GUIs](https://arxiv.org/abs/2411.00820) (Nov. 2024) 645 | 646 | [](https://github.com/THUDM/AutoGLM) 647 | [](https://arxiv.org/abs/2411.00820) 648 | 649 | + [AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations](https://arxiv.org/abs/2411.13451) (Nov. 2024) 650 | 651 | [](https://arxiv.org/abs/2411.13451) 652 | 653 | + [ShowUI: One Vision-Language-Action Model for Generalist GUI Agent](https://arxiv.org/abs/2411.17465) (Nov. 2024) 654 | 655 | [](https://github.com/showlab/ShowUI) 656 | [](https://arxiv.org/abs/2411.17465) 657 | 658 | + [Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction](https://arxiv.org/abs/2412.04454) (Dec. 2024) 659 | 660 | [](https://aguvis-project.github.io/) 661 | [](https://github.com/xlang-ai/aguvis) 662 | [](https://arxiv.org/abs/2412.04454) 663 | 664 | + [Falcon-UI: Understanding GUI Before Following User Instructions](https://arxiv.org/abs/2412.09362) (Dec. 2024) 665 | 666 | [](https://arxiv.org/abs/2412.09362) 667 | 668 | + [PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World](https://arxiv.org/abs/2412.17589) (Dec. 2024) 669 | 670 | [](https://github.com/GAIR-NLP/PC-Agent) 671 | [](https://arxiv.org/abs/2412.17589) 672 | [](https://gair-nlp.github.io/PC-Agent/) 673 | 674 | + [Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining](https://arxiv.org/pdf/2412.10342) (Dec. 2024) 675 | 676 | [](https://arxiv.org/pdf/2412.10342) 677 | 678 | + [Aria-UI: Visual Grounding for GUI Instructions](https://arxiv.org/abs/2412.16256) (Dec. 
2024) 679 | 680 | [](https://github.com/AriaUI/Aria-UI) 681 | [](https://arxiv.org/abs/2412.16256) 682 | [](https://ariaui.github.io) 683 | [](https://huggingface.co/datasets/Aria-UI/Aria-UI_Data) 684 | 685 | + [CogAgent v2](https://github.com/THUDM/CogAgent) (Dec. 2024) 686 | 687 | [](https://github.com/THUDM/CogAgent) 688 | 689 | + [OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis](https://arxiv.org/abs/2412.19723) (Dec. 2024) 690 | 691 | [](https://github.com/OS-Copilot/OS-Genesis) 692 | [](https://arxiv.org/abs/2412.19723) 693 | [](https://qiushisun.github.io/OS-Genesis-Home/) 694 | 695 | + [InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection](https://arxiv.org/pdf/2501.04575) (Jan. 2025) 696 | 697 | [](https://github.com/Reallm-Labs/InfiGUIAgent) 698 | [](https://arxiv.org/pdf/2501.04575) 699 | 700 | + [GUI-Bee : Align GUI Action Grounding to Novel Environments via Autonomous Exploration](https://arxiv.org/pdf/2501.13896) (Jan. 2025) 701 | 702 | [](https://arxiv.org/pdf/2501.13896) 703 | [](https://gui-bee.github.io/) 704 | 705 | + [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883) (ICLR 2025) 706 | 707 | [](https://arxiv.org/abs/2410.17883) 708 | 709 | + [DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents](https://arxiv.org/abs/2410.14803) (ICLR 2025) 710 | 711 | [](https://arxiv.org/abs/2410.14803) 712 | [](https://ai-agents-2030.github.io/DistRL/) 713 | 714 | + [AppVLM: A Lightweight Vision Language Model for Online App Control](https://arxiv.org/abs/2502.06395) (Feb. 2025) 715 | 716 | [](https://arxiv.org/abs/2502.06395) 717 | 718 | + [VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning](https://arxiv.org/abs/2502.07949) (Feb. 2025) 719 | 720 | [](https://arxiv.org/abs/2502.07949) 721 | 722 | + [GUI-Thinker: GUI-Thinker: A Basic yet Comprehensive GUI Agent Developed with Self-Reflection](https://arxiv.org/abs/2502.08047) (Feb. 2025) 723 | 724 | [](https://github.com/showlab/GUI-Thinker) 725 | [](https://arxiv.org/abs/2502.08047) 726 | [](https://showlab.github.io/GUI-Thinker/) 727 | 728 | + [MobileA3gent: Training Mobile GUI Agents Using Decentralized Self-Sourced Data from Diverse Users](https://arxiv.org/abs/2502.02982) (Feb., 2025) 729 | [](https://arxiv.org/abs/2502.02982) 730 | 731 | + [FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data](https://arxiv.org/abs/2503.05143) (Mar., 2025) 732 | 733 | [](https://github.com/wwh0411/FedMABench) 734 | [](https://arxiv.org/abs/2503.05143) 735 | [](https://huggingface.co/datasets/wwh0411/FedMABench) 736 | 737 | 738 | ## Surveys 739 | + [OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use](https://github.com/OS-Agent-Survey/OS-Agent-Survey) (Dec. 2024) 740 | 741 | [](https://github.com/OS-Agent-Survey/OS-Agent-Survey) 742 | [](https://github.com/OS-Agent-Survey/OS-Agent-Survey/blob/main/paper.pdf) 743 | [](https://os-agent-survey.github.io/) 744 | 745 | + [GUI Agents with Foundation Models: A Comprehensive Survey](https://arxiv.org/abs/2411.04890) (Nov. 2024) 746 | 747 | [](https://arxiv.org/abs/2411.04890) 748 | 749 | + [Large Language Model-Brained GUI Agents: A Survey](https://arxiv.org/abs/2411.18279) (Nov. 
2024) 750 | 751 | [](https://vyokky.github.io/LLM-Brained-GUI-Agents-Survey/) 752 | [](https://arxiv.org/abs/2411.18279) 753 | 754 | + [GUI Agents: A Survey](https://arxiv.org/abs/2412.13501) (Dec. 2024) 755 | 756 | [](https://arxiv.org/abs/2412.13501) 757 | 758 | + [LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects](https://arxiv.org/abs/2504.19838) (Apr. 2025) 759 | 760 | [](https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents) 761 | [](https://arxiv.org/abs/2504.19838) 762 | 763 | ## Projects 764 | + [PyAutoGUI](https://pyautogui.readthedocs.io/en/latest/index.html) 765 | 766 | [](https://github.com/asweigart/pyautogui/tree/master) 767 | [](https://pyautogui.readthedocs.io/en/latest/) 768 | 769 | + [nut.js](https://nutjs.dev/) 770 | 771 | [](https://github.com/nut-tree/nut.js) 772 | [](https://nutjs.dev/) 773 | 774 | + [GPT-4V-Act: AI agent using GPT-4V(ision) for web UI interaction](https://github.com/ddupont808/GPT-4V-Act) 775 | 776 | [](https://github.com/ddupont808/GPT-4V-Act) 777 | 778 | + [gpt-computer-assistant](https://github.com/onuratakan/gpt-computer-assistant) 779 | 780 | [](https://github.com/onuratakan/gpt-computer-assistant) 781 | 782 | + [Mobile-Agent: The Powerful Mobile Device Operation Assistant Family](https://github.com/X-PLUG/MobileAgent) 783 | 784 | [](https://github.com/X-PLUG/MobileAgent) 785 | 786 | + [OpenUI](https://github.com/wandb/openui) 787 | 788 | [](https://github.com/wandb/openui) 789 | [](https://openui.fly.dev) 790 | 791 | + [ACT-1](https://www.adept.ai/blog/act-1) 792 | 793 | [](https://www.adept.ai/blog/act-1) 794 | 795 | + [NatBot](https://github.com/nat/natbot) 796 | 797 | [](https://github.com/nat/natbot) 798 | 799 | + [Multion](https://www.multion.ai) 800 | 801 | [](https://www.multion.ai/) 802 | 803 | + [Auto-GPT](https://github.com/Significant-Gravitas/Auto-GPT) 804 | 805 | [](https://github.com/Significant-Gravitas/Auto-GPT) 806 | 807 | + [WebLlama](https://github.com/McGill-NLP/webllama) 808 | 809 | [](https://github.com/McGill-NLP/webllama) 810 | [](https://webllama.github.io) 811 | 812 | + [LaVague: Large Action Model Framework to Develop AI Web Agents](https://github.com/lavague-ai/LaVague) 813 | 814 | [](https://github.com/lavague-ai/LaVague) 815 | [](https://docs.lavague.ai/) 816 | 817 | + [OpenAdapt: AI-First Process Automation with Large Multimodal Models](https://github.com/OpenAdaptAI/OpenAdapt) 818 | 819 | [](https://github.com/OpenAdaptAI/OpenAdapt) 820 | 821 | + [Surfkit: A toolkit for building and sharing AI agents that operate on devices](https://github.com/agentsea/surfkit) 822 | 823 | [](https://github.com/agentsea/surfkit) 824 | 825 | + [AGI Computer Control](https://github.com/James4Ever0/agi_computer_control) 826 | 827 | + [Open Interpreter](https://github.com/OpenInterpreter/open-interpreter) 828 | 829 | [](https://github.com/OpenInterpreter/open-interpreter) 830 | [](https://openinterpreter.com/) 831 | 832 | + [WebMarker: Mark web pages for use with vision-language models](https://github.com/reidbarber/webmarker) 833 | 834 | [](https://github.com/reidbarber/webmarker) 835 | [](https://www.webmarkerjs.com/) 836 | 837 | + [Computer Use Out-of-the-box](https://github.com/showlab/computer_use_ootb) 838 | 839 | [](https://github.com/showlab/computer_use_ootb/tree/master) 840 | [](https://computer-use-ootb.github.io/) 841 | 842 | ## Safety 843 | 844 | + [Adversarial Attacks on Multimodal Agents](https://github.com/ChenWu98/agent-attack) 845 | 846 | 
  [](https://github.com/ChenWu98/agent-attack)
  [](https://chenwu.io/attack-agent/)

+ [AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents](https://github.com/AI-secure/AdvWeb)

  [](https://github.com/AI-secure/AdvWeb)
  [](https://ai-secure.github.io/AdvWeb/)

+ [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://github.com/jylee425/mobilesafetybench)

  [](https://github.com/jylee425/mobilesafetybench)
  [](https://mobilesafetybench.github.io/)

+ [EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage](https://github.com/OSU-NLP-Group/EIA_against_webagent)

  [](https://github.com/OSU-NLP-Group/EIA_against_webagent)

+ [Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents](https://github.com/OSU-NLP-Group/WebDreamer)

  [](https://github.com/OSU-NLP-Group/WebDreamer)

+ [Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.02544)

+ [Security Matrix for Multimodal Agents on Mobile Devices: A Systematic and Proof of Concept Study](https://arxiv.org/html/2407.09295v2)

## Related Repositories

- [awesome-llm-powered-agent](https://github.com/hyp1231/awesome-llm-powered-agent)
- [Awesome-LLM-based-Web-Agent-and-Tools](https://github.com/albzni/Awesome-LLM-based-Web-Agent-and-Tools)
- [awesome-ui-agents](https://github.com/opendilab/awesome-ui-agents/)
- [computer-control-agent-knowledge-base](https://github.com/James4Ever0/computer_control_agent_knowledge_base)
- [Awesome GUI Agent Paper List](https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List/)

## Acknowledgements

This template is provided by [Awesome-Video-Diffusion](https://github.com/showlab/Awesome-Video-Diffusion) and [Awesome-MLLM-Hallucination](https://github.com/showlab/Awesome-MLLM-Hallucination).
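---

If you prefer a local script to the Awesome-Paper-Agent for drafting entries, here is a minimal sketch of how the same format could be produced from an arXiv link. It is not part of this repository: the `format_entry` helper, its regex-based ID extraction, and the badge-less output line are illustrative assumptions; it relies only on the public arXiv Atom API and the Python standard library.

```python
import re
import sys
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
MONTHS = ["Jan.", "Feb.", "Mar.", "Apr.", "May.", "Jun.",
          "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]


def format_entry(arxiv_url: str) -> str:
    """Build a '+ [Title](abs-url) (Mon. Year)' entry from an arXiv URL."""
    # Extract a modern arXiv id such as 2312.13108 from the given URL.
    match = re.search(r"\d{4}\.\d{4,5}", arxiv_url)
    if match is None:
        raise ValueError(f"No arXiv id found in: {arxiv_url}")
    paper_id = match.group(0)

    # Query the public arXiv Atom API for the title and publication date.
    api_url = f"http://export.arxiv.org/api/query?id_list={paper_id}"
    with urllib.request.urlopen(api_url) as response:
        feed = ET.fromstring(response.read())
    entry = feed.find(f"{ATOM}entry")
    title = " ".join(entry.findtext(f"{ATOM}title").split())
    published = entry.findtext(f"{ATOM}published")  # e.g. "2023-12-20T01:23:45Z"
    month, year = MONTHS[int(published[5:7]) - 1], published[:4]

    abs_url = f"https://arxiv.org/abs/{paper_id}"
    # Badge links (code, website, etc.) still need to be added by hand.
    return f"+ [{title}]({abs_url}) ({month} {year})\n\n  []({abs_url})"


if __name__ == "__main__":
    print(format_entry(sys.argv[1]))
```

For example, running it with `https://arxiv.org/abs/2312.13108` should print an entry similar to the GPT example near the top of this README; project and website badges must then be filled in manually.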