<div align="center">
  <img src="img/logo.png" alt="logo">

[![X Community](https://img.shields.io/badge/Community-black?logo=x&style=flat-square)](https://x.com/i/communities/1874549355442802764)

# ACU - Awesome Agents for Computer Use

</div>
> An AI Agent for Computer Use is an autonomous program that can **reason** about tasks, **plan** sequences of actions, and **act** within the domain of a computer or mobile device in the form of clicks, keystrokes, other computer events, command-line operations, and internal/external API calls. These agents combine perception, decision-making, and control capabilities to interact with digital interfaces and accomplish user-specified goals independently.

A curated list of resources about AI agents for Computer Use, including research papers, projects, frameworks, and tools.
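To make this definition concrete, the sketch below shows the bare perceive-reason-act loop that most agents in this list implement in some form. It is a minimal illustration, not the API of any project listed here; all three helpers are hypothetical stubs.

```python
"""A minimal perceive-reason-act loop. Illustrative sketch only: the helpers
are stand-ins for a screen-capture backend, a (V)LM policy, and an
input-injection layer, not any specific project's API."""
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                       # "click", "type", "key", "shell", or "done"
    payload: dict = field(default_factory=dict)

def capture_screen() -> bytes:
    return b""                      # stub: real agents grab a screenshot here

def choose_action(goal: str, screenshot: bytes, history: list[Action]) -> Action:
    return Action("done")           # stub: real agents query a (V)LM policy here

def execute(action: Action) -> None:
    pass                            # stub: real agents inject mouse/keyboard events here

def run_agent(goal: str, max_steps: int = 25) -> None:
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                      # perceive
        action = choose_action(goal, screenshot, history)  # reason and plan
        if action.kind == "done":
            break
        execute(action)                                    # act
        history.append(action)

run_agent("open the settings app")
```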
## Table of Contents

- [ACU - Awesome Agents for Computer Use](#acu---awesome-agents-for-computer-use)
  - [Table of Contents](#table-of-contents)
  - [Articles](#articles)
  - [Papers](#papers)
    - [Surveys](#surveys)
    - [Frameworks & Models](#frameworks--models)
    - [UI Grounding](#ui-grounding)
    - [Dataset](#dataset)
    - [Benchmark](#benchmark)
    - [Safety](#safety)
  - [Projects](#projects)
    - [Open Source](#open-source)
      - [Frameworks & Models](#frameworks--models-1)
      - [UI Grounding](#ui-grounding-1)
      - [Environment & Sandbox](#environment--sandbox)
      - [Automation](#automation)
    - [Commercial](#commercial)
      - [Frameworks & Models](#frameworks--models-2)
  - [Contributing](#contributing)

## Articles

- [Anthropic | Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku](https://www.anthropic.com/news/3-5-models-and-computer-use)
- [Bill Gates | AI is about to completely change how you use computers](https://www.gatesnotes.com/AI-agents)
- [Ethan Mollick | When you give a Claude a mouse](https://www.oneusefulthing.org/p/when-you-give-a-claude-a-mouse)
- [OpenAI | Introducing Operator: A research preview of an agent that can use its own browser to perform tasks for you](https://openai.com/index/introducing-operator)

## Papers

### Surveys

- [AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants](https://arxiv.org/abs/2501.16150) (Jan. 2025)
  - Comprehensive review establishing a taxonomy of computer control agents (CCAs) from environment, interaction, and agent perspectives, analyzing 86 CCAs and 33 datasets

- [GUI Agents: A Survey](https://arxiv.org/abs/2412.13501) (Dec. 2024)
  - General survey of GUI agents

- [Large Language Model-Brained GUI Agents: A Survey](https://arxiv.org/abs/2411.18279) (Nov. 2024)
  - Focus on LLM-based approaches
  - [Website](https://vyokky.github.io/LLM-Brained-GUI-Agents-Survey/)

- [GUI Agents with Foundation Models: A Comprehensive Survey](https://arxiv.org/abs/2411.04890) (Nov. 2024)
  - Comprehensive overview of foundation model-based GUI agents
### Frameworks & Models

- [Reinforcement Learning for Long-Horizon Interactive LLM Agents](https://arxiv.org/abs/2502.01600) (Feb. 2025)
  - Novel RL approach (LOOP) for training IDAs directly in target environments
  - 32B-parameter agent outperforms OpenAI o1 by 9 percentage points on AppWorld

- [Large Action Models: From Inception to Implementation](https://arxiv.org/abs/2412.10047) (Dec. 2024)
  - Comprehensive framework for developing LAMs that can perform real-world actions beyond language generation
  - Details key stages including data collection, model training, environment integration, grounding, and evaluation

- [Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation](https://openreview.net/forum?id=jR6YMxVG9i) (Dec. 2024)
  - Novel reward-guided navigation approach

- [SpiritSight Agent: Advanced GUI Agent with One Look](https://openreview.net/forum?id=jY2ow7jRdZ) (Dec. 2024)
  - Single-shot GUI interaction approach

- [AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs](https://openreview.net/forum?id=wl4c9jvcyY) (Dec. 2024)
  - Novel approach for automatic GUI functionality annotation

- [Simulate Before Act: Model-Based Planning for Web Agents](https://openreview.net/forum?id=JDa5RiTIC7) (Dec. 2024)
  - Novel model-based planning approach using LLM world models

- [Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents](https://arxiv.org/abs/2412.13194) (Dec. 2024)
  - Novel autonomous skill discovery framework for web agents
  - [Website](https://yanqval.github.io/PAE/)

- [Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents](https://openreview.net/forum?id=3Gzz7ZQLiz) (Dec. 2024)
  - Novel framework for contextualizing web pages to enhance LLM agent decision-making

- [Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL](https://openreview.net/forum?id=CjfQssZtAb) (Dec. 2024)
  - Novel value-based offline RL approach for training VLM device-control agents

- [Magentic-One](https://www.microsoft.com/en-us/research/uploads/prod/2024/11/MagenticOne.pdf) (Nov. 2024)
  - Multi-agent system with orchestrator-led coordination
  - Strong performance on GAIA, WebArena, and AssistantBench

- [Agent Workflow Memory](https://arxiv.org/abs/2409.07429) (Sep. 2024)
  - Novel workflow memory framework for agents
  - [Code](https://github.com/zorazrw/agent-workflow-memory)

- [The Impact of Element Ordering on LM Agent Performance](https://arxiv.org/abs/2409.12089) (Sep. 2024)
  - Novel study of element ordering's impact on agent performance
  - [Code](https://github.com/waynchi/gui-agent)

- [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.07199) (Aug. 2024)
  - Novel reasoning and learning framework
  - [Website](https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities)

- [OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models](https://aclanthology.org/2024.acl-demos.8/) (Aug. 2024)
  - Open platform for web-based agent deployment
  - [Code](https://github.com/boxworld18/OpenWebAgent)

- [Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems](https://arxiv.org/abs/2407.13032) (Jul. 2024)
  - Hierarchical architecture with flexible DOM distillation
  - Novel denoising method for web navigation

- [Apple Intelligence Foundation Language Models](https://arxiv.org/abs/2407.21075) (Jul. 2024)
  - Vision-language model with Private Cloud Compute
  - Novel foundation model architecture

- [Tree Search for Language Model Agents](https://arxiv.org/abs/2407.01476) (Jul. 2024)
  - Multi-step reasoning and planning with best-first tree search
  - Novel approach for LLM-based agents

- [DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning](https://arxiv.org/abs/2406.11896) (Jun. 2024)
  - Novel reinforcement learning approach
  - [Code](https://github.com/DigiRL-agent/digirl)

- [Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration](https://arxiv.org/abs/2406.01014) (Jun. 2024)
  - Multi-agent collaboration for mobile device operation
  - [Code](https://github.com/X-PLUG/MobileAgent)

- [Octopus Series: On-device Language Models for Computer Control](https://arxiv.org/abs/2404.01549) (Apr. 2024)
  - v4: Graph of language models with functional tokens integration (Apr. 2024)
  - v3: Sub-billion parameter multimodal model for edge devices (Apr. 2024)
  - v2: Super agent for Android and iOS (Apr. 2024)
  - v1: Function calling of software APIs (Apr. 2024)
  - [Website](https://www.nexa4ai.com/octopus-v3)
  - [Code](https://github.com/NexaAI/octopus-v4)

- [AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent](https://arxiv.org/abs/2404.03648) (Apr. 2024)
  - Novel approach for real-world web navigation, with a bilingual benchmark
  - [Code](https://github.com/THUDM/WebGLM)

- [Cradle: Empowering Foundation Agents towards General Computer Control](https://arxiv.org/abs/2403.03186) (Mar. 2024)
  - Focus on general computer control, using Red Dead Redemption II as a case study
  - [Code](https://github.com/BAAI-Agents/Cradle)

- [Android in the Zoo: Chain-of-Action-Thought for GUI Agents](https://arxiv.org/abs/2403.02713) (Mar. 2024)
  - Novel Chain-of-Action-Thought framework for Android interaction
  - [Code](https://github.com/IMNearth/CoAT)

- [ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model](https://arxiv.org/abs/2402.07945) (Feb. 2024)
  - Vision-language model for computer control
  - [Code](https://github.com/niuzaisheng/ScreenAgent)

- [OS-Copilot: Towards Generalist Computer Agents with Self-Improvement](https://arxiv.org/abs/2402.07456) (Feb. 2024)
  - Vision-language model for PC interaction
  - [Code](https://github.com/OS-Copilot/OS-Copilot)

- [UFO: A UI-Focused Agent for Windows OS Interaction](https://arxiv.org/abs/2402.07939) (Feb. 2024)
  - Specialized for Windows OS interaction
  - [Code](https://github.com/microsoft/UFO)

- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://arxiv.org/abs/2402.11941) (Feb. 2024)
  - Novel comprehensive environment perception (CEP) approach for exhaustive GUI perception
  - Introduces conditional action prediction (CAP) for reliable action response

- [Intention-in-Interaction (IN3): Tell Me More!](https://arxiv.org/abs/2402.09205) (Feb. 2024)
  - Novel benchmark for evaluating user intention understanding in agent designs
  - Introduces model experts for robust user-agent interaction

- [Dual-View Visual Contextualization for Web Navigation](https://arxiv.org/abs/2402.04476) (Feb. 2024)
  - Novel approach for automatic web navigation with language instructions
  - Grounds HTML elements in their visual screenshot context

- [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) (Feb. 2024)
  - Specialized for mobile UI and infographics understanding
  - Novel approach for visual interface comprehension

- [GPT-4V(ision) is a Generalist Web Agent, if Grounded](https://arxiv.org/abs/2401.01614) (Jan. 2024)
  - Demonstrates GPT-4V capabilities for web interaction
  - [Code](https://github.com/OSU-NLP-Group/SeeAct)

- [Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception](https://arxiv.org/abs/2401.16158) (Jan. 2024)
  - Visual perception for mobile device interaction
  - [Code](https://github.com/X-PLUG/MobileAgent)

- [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models](https://arxiv.org/abs/2401.13919) (Jan. 2024)
  - End-to-end approach for web interaction
  - [Code](https://github.com/MinorJerry/WebVoyager)

- [CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914) (Dec. 2023)
  - Works across PC and Android platforms
  - [Code](https://github.com/THUDM/CogVLM)

- [AppAgent: Multimodal Agents as Smartphone Users](https://arxiv.org/abs/2312.13771) (Dec. 2023)
  - Focused on smartphone interaction
  - [Code](https://github.com/mnotgod96/AppAgent)

- [LASER: LLM Agent with State-Space Exploration for Web Navigation](https://arxiv.org/abs/2309.08172) (Sep. 2023)
  - Novel approach to web navigation
  - [Code](https://github.com/Mayer123/LASER)

- [AndroidEnv: A Reinforcement Learning Platform for Android](https://arxiv.org/abs/2105.13231) (May 2021)
  - Reinforcement learning platform for Android interaction
  - [Code](https://github.com/google-deepmind/android_env)
### UI Grounding

- [UI-Pro: A Hidden Recipe for Building Vision-Language Models for GUI Grounding](https://openreview.net/forum?id=5wmAfwDBoi) (Dec. 2024)
  - Novel framework for building VLMs with strong UI element grounding capabilities

- [Grounding Multimodal Large Language Model in GUI World](https://openreview.net/forum?id=M9iky9Ruhx) (Dec. 2024)
  - Novel GUI grounding framework with an automated data collection engine and a lightweight grounding module

- [Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms](https://arxiv.org/abs/2410.18967) (Oct. 2024)
  - Multimodal LLM for universal UI understanding across diverse platforms
  - Introduces adaptive gridding for high-resolution perception
  - Preprint

- [OS-ATLAS: Foundation Action Model for Generalist GUI Agents](https://arxiv.org/abs/2410.23218) (Oct. 2024)
  - Comprehensive action modeling
  - [Code](https://github.com/OS-Copilot/OS-Atlas)

- [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://arxiv.org/abs/2410.05243) (Oct. 2024)
  - Universal approach to GUI interaction
  - [Code](https://github.com/OSU-NLP-Group/UGround)

- [OmniParser for Pure Vision Based GUI Agent](https://arxiv.org/abs/2408.00203) (Aug. 2024)
  - Novel vision-based screen parsing method for UI screenshots
  - Combines finetuned interactable icon detection and functional description models (a schematic sketch of this pipeline follows the list)
  - [Code](https://github.com/microsoft/OmniParser)

- [Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs](https://arxiv.org/abs/2404.05719) (Apr. 2024)
  - Mobile UI understanding
  - [Code](https://github.com/apple/ml-ferret)

- [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935) (Jan. 2024)
  - Advanced visual grounding techniques
  - [Code](https://github.com/njucckevin/SeeClick)
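The OmniParser entry above describes a pipeline shape that recurs across UI grounding work: an interactable-region detector feeding a functional-description (captioning) model. The sketch below illustrates only that shape; `detect_regions`, `describe`, and `UIElement` are hypothetical stand-ins, not the actual OmniParser API.

```python
"""Schematic of a vision-based screen-parsing pipeline (detector + captioner).
All helpers are hypothetical; see the OmniParser repo for a real implementation."""
from dataclasses import dataclass

@dataclass
class UIElement:
    bbox: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)
    description: str                         # functional caption, e.g. "search button"

def detect_regions(screenshot: bytes) -> list[tuple[float, float, float, float]]:
    return []  # stand-in for a finetuned interactable-icon detector

def describe(screenshot: bytes, bbox: tuple[float, float, float, float]) -> str:
    return ""  # stand-in for a functional-description (captioning) model

def parse_screen(screenshot: bytes) -> list[UIElement]:
    # The detector proposes candidate interactable regions; the captioner labels
    # each one, yielding a structured element list an LLM agent can act on.
    return [UIElement(bbox, describe(screenshot, bbox)) for bbox in detect_regions(screenshot)]
```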
### Dataset

- [Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents](https://arxiv.org/abs/2502.11357) (Feb. 2025)
  - Scalable multi-agent pipeline that leverages exploration for diverse web agent trajectory synthesis

- [OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis](https://arxiv.org/abs/2412.19723) (Dec. 2024)
  - Novel interaction-driven approach for automated GUI trajectory synthesis
  - Introduces reverse task synthesis and a trajectory reward model
  - [Code](https://github.com/OS-Copilot/OS-Genesis)

- [AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials](https://arxiv.org/abs/2412.09605) (Dec. 2024)
  - Web tutorial-based trajectory synthesis

- [Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale](https://arxiv.org/abs/2409.15637) (Sep. 2024)
  - Scalable demonstration generation

- [UiPad: UI Parsing and Accessibility Dataset](https://huggingface.co/datasets/MacPaw/uipad) (Sep. 2024)
  - macOS desktop UI dataset with accessibility trees and evaluation questions

- [CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation](https://aclanthology.org/2024.findings-acl.928/) (Aug. 2024)
  - Chinese benchmark for agent evaluation
  - [Code](https://github.com/tjunlp-lab/CToolEval)

- [AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?](https://arxiv.org/abs/2407.15711) (Jul. 2024)
  - Benchmark for realistic and time-consuming web tasks
  - [Website](https://assistantbench.github.io)

- [ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights](https://arxiv.org/abs/2406.14596) (Jun. 2024)
  - Novel approach to continual learning from trajectories

- [Multi-Turn Mind2Web: On the Multi-turn Instruction Following](https://arxiv.org/abs/2402.15057) (Feb. 2024)
  - Multi-turn instruction dataset for web agents
  - [Code](https://github.com/magicgh/self-map)

- [Android in the Wild: A Large-Scale Dataset for Android Device Control](https://arxiv.org/abs/2307.10088) (Jul. 2023)
  - Large-scale dataset for Android interaction
  - Real-world device control scenarios

- [Mind2Web: Towards a Generalist Agent for the Web](https://arxiv.org/abs/2306.06070) (Jun. 2023)
  - Large-scale web interaction dataset
  - [Code](https://github.com/OSU-NLP-Group/Mind2Web)

- [WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents](https://arxiv.org/abs/2207.01206) (Jul. 2022)
  - Dataset for grounded language agents in web interaction
  - [Code](https://github.com/princeton-nlp/WebShop)

- [Rico: A Mobile App Dataset for Building Data-Driven Design Applications](https://dl.acm.org/doi/10.1145/3126594.3126651) (Oct. 2017)
  - Mobile app UI dataset
  - Design-focused data collection
### Benchmark

- [A3: Android Agent Arena for Mobile GUI Agents](https://arxiv.org/abs/2501.01149) (Jan. 2025)
  - Novel evaluation platform with 201 tasks across 21 widely used third-party apps
  - [Website](https://yuxiangchai.github.io/Android-Agent-Arena/)
  - [Code](https://github.com/AndroidArenaAgent/AndroidArena)

- [Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale](https://arxiv.org/abs/2409.08264) (Sep. 2024)
  - Windows OS-focused evaluation framework
  - [Code](https://github.com/microsoft/WindowsAgentArena)
  - [Website](https://microsoft.github.io/WindowsAgentArena/)

- [AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents](https://arxiv.org/abs/2407.18901) (Jul. 2024)
  - Comprehensive benchmark with 750 natural tasks across 9 day-to-day apps and 457 APIs
  - GPT-4o achieves only ~49% on normal tasks and ~30% on challenge tasks
  - [Code](https://github.com/stonybrooknlp/appworld/)

- [Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?](https://arxiv.org/abs/2407.10956) (Jul. 2024)
  - Evaluation in data science workflows
  - [Code](https://github.com/xlang-ai/Spider2-V)

- [τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains](https://arxiv.org/abs/2406.12045) (Jun. 2024)
  - Novel benchmark for evaluating agent-user interaction and policy compliance
  - State-of-the-art agents achieve <50% success rate and <25% consistency (pass^8; see the sketch after this list)
  - [Code](https://github.com/sierra-research/tau-bench)

- [MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents](https://arxiv.org/abs/2406.08184) (Jun. 2024)
  - Mobile agent evaluation
  - [Code](https://github.com/MobileAgentBench/mobile-agent-bench)

- [AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents](https://arxiv.org/abs/2405.14573) (May 2024)
  - Android-focused evaluation
  - [Code](https://github.com/google-research/android_world)

- [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https://arxiv.org/abs/2404.07972) (Apr. 2024)
  - Comprehensive evaluation framework
  - [Code](https://github.com/xlang-ai/OSWorld)

- [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) (Jan. 2024)
  - Web-focused evaluation
  - [Code](https://github.com/web-arena-x/visualwebarena)

- [Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction](https://arxiv.org/abs/2305.08144) (May 2023)
  - Mobile-focused evaluation framework
  - [Code](https://github.com/X-LANCE/Mobile-Env)
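A note on τ-bench's pass^k metric referenced above: it measures the probability that all k i.i.d. trials of a task succeed, averaged over tasks. Assuming it uses the standard combinatorial estimator (the all-success analogue of the pass@k estimator; worth verifying against the paper), it can be computed like this:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(all k i.i.d. trials succeed), given c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)  # math.comb(c, k) is 0 when c < k

print(pass_hat_k(n=8, c=6, k=2))  # 15/28 ~= 0.536
print(pass_hat_k(n=8, c=6, k=8))  # 0.0: with 8 trials, pass^8 is nonzero only if all 8 succeeded
```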
### Safety

- [Attacking Vision-Language Computer Agents via Pop-ups](https://arxiv.org/abs/2411.02391) (Nov. 2024)
  - Security analysis of computer agents
  - [Code](https://github.com/SALT-NLP/PopupAttack)

- [EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage](https://arxiv.org/abs/2409.11295) (Sep. 2024)
  - Privacy and security analysis

- [GuardAgent: Safeguard LLM Agent by a Guard Agent via Knowledge-Enabled Reasoning](https://arxiv.org/abs/2406.09187) (Jun. 2024)
  - Safety mechanisms for agents
## Projects

### Open Source
### Frameworks & Models

- [AutoGen](https://github.com/microsoft/autogen)
  - Framework for building AI agent systems
  - Simplifies the creation of event-driven, distributed, scalable, and resilient agentic applications

- [Auto-GPT](https://github.com/Significant-Gravitas/Auto-GPT)
  - Autonomous GPT-4 agent
  - Task automation focus

- [Browser Use](https://github.com/browser-use/browser-use)
  - Makes websites accessible to AI agents with vision + HTML extraction (see the usage sketch after this list)
  - Supports multi-tab management and custom actions with LangChain integration

- [Claude Computer Use Demo](https://github.com/PallavAg/claude-computer-use-macos)
  - macOS implementation
  - Claude integration

- [Claude Minecraft Use](https://github.com/ObservedObserver/claude-minecraft-use)
  - Game automation
  - Specialized use case

- [Computer Use OOTB](https://github.com/showlab/computer_use_ootb)
  - Ready-to-use implementation
  - Comprehensive toolset

- [Cua](https://github.com/trycua)
  - Computer Use Interface & Agent

- [Cybergod](https://github.com/james4ever0/agi_computer_control)
  - Advanced computer control

- [Grunty](https://github.com/suitedaces/computer-agent)
  - Computer control agent
  - Task automation focus

- [Inferable](https://github.com/inferablehq/inferable)
  - Distributed agent builder platform
  - Build tools with existing code

- [LaVague](https://github.com/lavague-ai/LaVague)
  - AI web agent framework
  - Modular architecture

- [Mac Computer Use](https://github.com/deedy/mac_computer_use)
  - macOS-specific tools
  - Anthropic integration

- [NatBot](https://github.com/nat/natbot)
  - Browser automation
  - GPT-4 Vision integration

- [Notte Browser Using Agent](https://github.com/nottelabs/notte)
  - Full-stack web AI agent framework (agents, automations, cloud browser sessions)
  - Notte turns websites into structured, navigable maps described in natural language

- [OpenAdapt](https://github.com/OpenAdaptAI/OpenAdapt)
  - AI-First Process Automation
  - Multimodal model integration

- [OpenInterface](https://github.com/AmberSahdev/Open-Interface/)
  - Open-source UI interaction framework
  - Cross-platform support

- [OpenInterpreter](https://github.com/OpenInterpreter/open-interpreter)
  - General-purpose computer control framework
  - Python-based, extensible architecture

- [Open Source Computer Use by E2B](https://github.com/e2b-dev/secure-computer-use/tree/os-computer-use)
  - Open-source implementation of computer control capabilities
  - Secure sandboxed environment for AI agents

- [Self-Operating Computer](https://github.com/OthersideAI/self-operating-computer) (Nov. 2023)
  - The first computer-use framework created
  - Vision-based computer control and automation

- [Skyvern](https://github.com/skyvern-ai/skyvern)
  - AI web agent framework
  - Automates browser-based workflows with LLMs using vision and HTML extraction

- [Surfkit](https://github.com/agentsea/surfkit)
  - Device operation toolkit
  - Extensible agent framework

- [Upsonic](https://github.com/upsonic/upsonic)
  - Reliable agent framework with MCP support
  - Integrated Browser Use and Computer Use

- [WebMarker](https://github.com/reidbarber/webmarker)
  - Web page annotation tool
  - Vision-language model support
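As a taste of what these frameworks look like in use, here is a minimal Browser Use snippet adapted from the project's README at the time of writing. The `Agent` constructor and LangChain wiring may have changed since, so treat it as a sketch rather than current API documentation.

```python
# Minimal Browser Use sketch (API as documented at the time of writing; verify against the repo).
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI  # Browser Use integrates with LangChain chat models

async def main() -> None:
    agent = Agent(
        task="Find the top post on Hacker News and return its title",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()  # the agent drives a browser session until the task completes

asyncio.run(main())
```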
### UI Grounding

- [AskUI/PTA-1](https://huggingface.co/AskUI/PTA-1)
  - A small vision-language model for computer and phone automation, based on Florence-2
  - With only 270M parameters, it outperforms much larger models at GUI text and element localization

- [Microsoft/OmniParser](https://huggingface.co/microsoft/OmniParser)
  - A general screen-parsing tool that converts UI screenshots into a structured format to improve existing LLM-based UI agents
### Environment & Sandbox

- [Cua](https://github.com/trycua)
  - macOS/Linux sandbox on Apple Silicon

- [dockur/windows](https://github.com/dockur/windows)
  - Windows inside a Docker container (see the launch sketch after this list)

- [E2B Desktop Sandbox](https://github.com/e2b-dev/desktop)
  - Secure desktop environment
  - Agent testing platform

- [qemus/qemu-docker](https://github.com/qemus/qemu-docker)
  - Docker container for running virtual machines using QEMU
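As an example of wiring one of these sandboxes into an agent harness, the sketch below starts the dockur/windows image with the Docker SDK for Python. The image name, port, and `VERSION` variable follow that project's README at the time of writing; verify them against the repo before relying on this.

```python
# Launch a Windows sandbox via dockur/windows using the Docker SDK for Python
# (pip install docker). Image name, port, and env var are assumptions taken
# from the project's README; requires KVM on the host.
import docker

client = docker.from_env()
container = client.containers.run(
    "dockurr/windows",                  # published image name (assumption)
    detach=True,
    environment={"VERSION": "11"},      # Windows version to install (assumption)
    devices=["/dev/kvm:/dev/kvm:rwm"],  # hardware virtualization passthrough
    ports={"8006/tcp": 8006},           # web-based viewer (assumption)
)
print(f"sandbox starting: http://localhost:8006 ({container.short_id})")
```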
### Automation

- [nut.js](https://github.com/nut-tree/nut.js)
  - Native UI automation
  - JavaScript/TypeScript implementation

- [PyAutoGUI](https://github.com/asweigart/pyautogui)
  - Cross-platform GUI automation
  - Python-based control library (see the example after this list)
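Libraries like these provide the low-level "act" layer for many of the agents above. A short PyAutoGUI example:

```python
# Basic PyAutoGUI usage: the kind of primitive actions an agent's "act" step emits.
import pyautogui

pyautogui.FAILSAFE = True            # abort by slamming the mouse into a screen corner
width, height = pyautogui.size()     # current screen resolution

pyautogui.moveTo(width // 2, height // 2, duration=0.25)  # glide to screen center
pyautogui.click()                                         # left-click
pyautogui.write("hello from an agent", interval=0.05)     # type with a per-key delay
pyautogui.hotkey("ctrl", "s")                             # chorded keyboard shortcut

screenshot = pyautogui.screenshot("observation.png")      # capture for the next "perceive" step
```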
### Commercial
### Frameworks & Models

- [Anthropic Claude Computer Use](https://www.anthropic.com/news/3-5-models-and-computer-use)
  - Commercial computer control capability
  - Integrated with Claude 3.5 models

- [Multion](https://www.multion.ai)
  - AI agents that can fully complete tasks in any web environment

- [Runner H](https://www.hcompany.ai/)
  - Advanced AI agent for real-world applications
  - Scores 67% on WebVoyager
## Contributing

We welcome and encourage contributions from the community! Here's how you can help:

- **Add new resources**: Found a relevant paper, project, or tool? Submit a PR to add it
- **Fix errors**: Help us correct any mistakes in existing entries
- **Improve organization**: Suggest better ways to structure the information
- **Update content**: Keep entries up to date with the latest developments

To contribute:

1. Fork the repository
2. Create a new branch for your changes
3. Submit a pull request with a clear description of your additions/changes
4. Post in the [X Community](https://x.com/i/communities/1874549355442802764) to let everyone know about the new resource

For an example of how to format your contribution, please refer to [this PR](https://github.com/francedot/acu/pull/1).
*Thank you for helping spread knowledge about AI agents for computer use!*