<div align="center">
  <img src="img/logo.png" alt="logo">

[![X Community](https://img.shields.io/badge/Community-black?logo=x&style=flat-square)](https://x.com/i/communities/1874549355442802764)

# ACU - Awesome Agents for Computer Use

</div>
> An AI Agent for Computer Use is an autonomous program that can **reason** about tasks, **plan** sequences of actions, and **act** within the domain of a computer or mobile device in the form of clicks, keystrokes, other computer events, command-line operations, and internal/external API calls. These agents combine perception, decision-making, and control capabilities to interact with digital interfaces and accomplish user-specified goals independently.

A curated list of resources about AI agents for Computer Use, including research papers, projects, frameworks, and tools.
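To make this definition concrete, the sketch below shows the bare perceive-reason-act loop that most agents in this list implement in some form. It is a minimal illustration, not the API of any project listed here; all three helpers are hypothetical stubs.

```python
"""A minimal perceive-reason-act loop. Illustrative sketch only: the helpers
are stand-ins for a screen-capture backend, a (V)LM policy, and an
input-injection layer, not any specific project's API."""
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                       # "click", "type", "key", "shell", or "done"
    payload: dict = field(default_factory=dict)

def capture_screen() -> bytes:
    return b""                      # stub: real agents grab a screenshot here

def choose_action(goal: str, screenshot: bytes, history: list[Action]) -> Action:
    return Action("done")           # stub: real agents query a (V)LM policy here

def execute(action: Action) -> None:
    pass                            # stub: real agents inject mouse/keyboard events here

def run_agent(goal: str, max_steps: int = 25) -> None:
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                      # perceive
        action = choose_action(goal, screenshot, history)  # reason and plan
        if action.kind == "done":
            break
        execute(action)                                    # act
        history.append(action)

run_agent("open the settings app")
```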
## Table of Contents

- [ACU - Awesome Agents for Computer Use](#acu---awesome-agents-for-computer-use)
  - [Table of Contents](#table-of-contents)
  - [Articles](#articles)
  - [Papers](#papers)
    - [Surveys](#surveys)
    - [Frameworks & Models](#frameworks--models)
    - [UI Grounding](#ui-grounding)
    - [Dataset](#dataset)
    - [Benchmark](#benchmark)
    - [Safety](#safety)
  - [Projects](#projects)
    - [Open Source](#open-source)
      - [Frameworks & Models](#frameworks--models-1)
      - [UI Grounding](#ui-grounding-1)
      - [Environment & Sandbox](#environment--sandbox)
      - [Automation](#automation)
    - [Commercial](#commercial)
      - [Frameworks & Models](#frameworks--models-2)
  - [Contributing](#contributing)

## Articles

- [Anthropic | Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku](https://www.anthropic.com/news/3-5-models-and-computer-use)
- [Bill Gates | AI is about to completely change how you use computers](https://www.gatesnotes.com/AI-agents)
- [Ethan Mollick | When you give a Claude a mouse](https://www.oneusefulthing.org/p/when-you-give-a-claude-a-mouse)
- [OpenAI | Introducing Operator: A research preview of an agent that can use its own browser to perform tasks for you](https://openai.com/index/introducing-operator)

## Papers

### Surveys

- [AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants](https://arxiv.org/abs/2501.16150) (Jan. 2025)
  - Comprehensive review establishing a taxonomy of computer control agents (CCAs) from environment, interaction, and agent perspectives, analyzing 86 CCAs and 33 datasets

- [GUI Agents: A Survey](https://arxiv.org/abs/2412.13501) (Dec. 2024)
  - General survey of GUI agents

- [Large Language Model-Brained GUI Agents: A Survey](https://arxiv.org/abs/2411.18279) (Nov. 2024)
  - Focus on LLM-based approaches
  - [Website](https://vyokky.github.io/LLM-Brained-GUI-Agents-Survey/)

- [GUI Agents with Foundation Models: A Comprehensive Survey](https://arxiv.org/abs/2411.04890) (Nov. 2024)
  - Comprehensive overview of foundation model-based GUI agents
### Frameworks & Models

- [Reinforcement Learning for Long-Horizon Interactive LLM Agents](https://arxiv.org/abs/2502.01600) (Feb. 2025)
  - Novel RL approach (LOOP) for training IDAs directly in target environments
  - 32B-parameter agent outperforms OpenAI o1 by 9 percentage points on AppWorld

- [Large Action Models: From Inception to Implementation](https://arxiv.org/abs/2412.10047) (Dec. 2024)
  - Comprehensive framework for developing LAMs that can perform real-world actions beyond language generation
  - Details key stages including data collection, model training, environment integration, grounding, and evaluation

- [Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation](https://openreview.net/forum?id=jR6YMxVG9i) (Dec. 2024)
  - Novel reward-guided navigation approach

- [SpiritSight Agent: Advanced GUI Agent with One Look](https://openreview.net/forum?id=jY2ow7jRdZ) (Dec. 2024)
  - Single-shot GUI interaction approach

- [AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs](https://openreview.net/forum?id=wl4c9jvcyY) (Dec. 2024)
  - Novel approach for automatic GUI functionality annotation

- [Simulate Before Act: Model-Based Planning for Web Agents](https://openreview.net/forum?id=JDa5RiTIC7) (Dec. 2024)
  - Novel model-based planning approach using LLM world models

- [Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents](https://arxiv.org/abs/2412.13194) (Dec. 2024)
  - Novel autonomous skill discovery framework for web agents
  - [Website](https://yanqval.github.io/PAE/)

- [Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents](https://openreview.net/forum?id=3Gzz7ZQLiz) (Dec. 2024)
  - Novel framework for contextualizing web pages to enhance LLM agent decision-making

- [Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL](https://openreview.net/forum?id=CjfQssZtAb) (Dec. 2024)
  - Novel value-based offline RL approach for training VLM device-control agents

- [Magentic-One](https://www.microsoft.com/en-us/research/uploads/prod/2024/11/MagenticOne.pdf) (Nov. 2024)
  - Multi-agent system with orchestrator-led coordination
  - Strong performance on GAIA, WebArena, and AssistantBench

- [Agent Workflow Memory](https://arxiv.org/abs/2409.07429) (Sep. 2024)
  - Novel workflow memory framework for agents
  - [Code](https://github.com/zorazrw/agent-workflow-memory)

- [The Impact of Element Ordering on LM Agent Performance](https://arxiv.org/abs/2409.12089) (Sep. 2024)
  - Novel study of element ordering's impact on agent performance
  - [Code](https://github.com/waynchi/gui-agent)

- [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.07199) (Aug. 2024)
  - Novel reasoning and learning framework
  - [Website](https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities)

- [OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models](https://aclanthology.org/2024.acl-demos.8/) (Aug. 2024)
  - Open platform for web-based agent deployment
  - [Code](https://github.com/boxworld18/OpenWebAgent)

- [Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems](https://arxiv.org/abs/2407.13032) (Jul. 2024)
  - Hierarchical architecture with flexible DOM distillation
  - Novel denoising method for web navigation

- [Apple Intelligence Foundation Language Models](https://arxiv.org/abs/2407.21075) (Jul. 2024)
  - Vision-language model with Private Cloud Compute
  - Novel foundation model architecture

- [Tree Search for Language Model Agents](https://arxiv.org/abs/2407.01476) (Jul. 2024)
  - Multi-step reasoning and planning with best-first tree search
  - Novel approach for LLM-based agents

- [DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning](https://arxiv.org/abs/2406.11896) (Jun. 2024)
  - Novel reinforcement learning approach
  - [Code](https://github.com/DigiRL-agent/digirl)

- [Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration](https://arxiv.org/abs/2406.01014) (Jun. 2024)
  - Multi-agent collaboration for mobile device operation
  - [Code](https://github.com/X-PLUG/MobileAgent)

- [Octopus Series: On-device Language Models for Computer Control](https://arxiv.org/abs/2404.01549) (Apr. 2024)
  - v4: Graph of language models with functional tokens integration (Apr. 2024)
  - v3: Sub-billion parameter multimodal model for edge devices (Apr. 2024)
  - v2: Super agent for Android and iOS (Apr. 2024)
  - v1: Function calling of software APIs (Apr. 2024)
  - [Website](https://www.nexa4ai.com/octopus-v3)
  - [Code](https://github.com/NexaAI/octopus-v4)

- [AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent](https://arxiv.org/abs/2404.03648) (Apr. 2024)
  - Novel approach for real-world web navigation, with a bilingual benchmark
  - [Code](https://github.com/THUDM/WebGLM)

- [Cradle: Empowering Foundation Agents towards General Computer Control](https://arxiv.org/abs/2403.03186) (Mar. 2024)
  - Focus on general computer control, using Red Dead Redemption II as a case study
  - [Code](https://github.com/BAAI-Agents/Cradle)

- [Android in the Zoo: Chain-of-Action-Thought for GUI Agents](https://arxiv.org/abs/2403.02713) (Mar. 2024)
  - Novel Chain-of-Action-Thought framework for Android interaction
  - [Code](https://github.com/IMNearth/CoAT)

- [ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model](https://arxiv.org/abs/2402.07945) (Feb. 2024)
  - Vision-language model for computer control
  - [Code](https://github.com/niuzaisheng/ScreenAgent)

- [OS-Copilot: Towards Generalist Computer Agents with Self-Improvement](https://arxiv.org/abs/2402.07456) (Feb. 2024)
  - Vision-language model for PC interaction
  - [Code](https://github.com/OS-Copilot/OS-Copilot)

- [UFO: A UI-Focused Agent for Windows OS Interaction](https://arxiv.org/abs/2402.07939) (Feb. 2024)
  - Specialized for Windows OS interaction
  - [Code](https://github.com/microsoft/UFO)

- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://arxiv.org/abs/2402.11941) (Feb. 2024)
  - Novel comprehensive environment perception (CEP) approach for exhaustive GUI perception
  - Introduces conditional action prediction (CAP) for reliable action response

- [Intention-in-Interaction (IN3): Tell Me More!](https://arxiv.org/abs/2402.09205) (Feb. 2024)
  - Novel benchmark for evaluating user intention understanding in agent designs
  - Introduces model experts for robust user-agent interaction

- [Dual-View Visual Contextualization for Web Navigation](https://arxiv.org/abs/2402.04476) (Feb. 2024)
  - Novel approach for automatic web navigation with language instructions
  - Grounds HTML elements in their visual screenshot context

- [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) (Feb. 2024)
  - Specialized for mobile UI and infographics understanding
  - Novel approach for visual interface comprehension

- [GPT-4V(ision) is a Generalist Web Agent, if Grounded](https://arxiv.org/abs/2401.01614) (Jan. 2024)
  - Demonstrates GPT-4V capabilities for web interaction
  - [Code](https://github.com/OSU-NLP-Group/SeeAct)

- [Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception](https://arxiv.org/abs/2401.16158) (Jan. 2024)
  - Visual perception for mobile device interaction
  - [Code](https://github.com/X-PLUG/MobileAgent)

- [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models](https://arxiv.org/abs/2401.13919) (Jan. 2024)
  - End-to-end approach for web interaction
  - [Code](https://github.com/MinorJerry/WebVoyager)

- [CogAgent: A Visual Language Model for GUI Agents](https://arxiv.org/abs/2312.08914) (Dec. 2023)
  - Works across PC and Android platforms
  - [Code](https://github.com/THUDM/CogVLM)

- [AppAgent: Multimodal Agents as Smartphone Users](https://arxiv.org/abs/2312.13771) (Dec. 2023)
  - Focused on smartphone interaction
  - [Code](https://github.com/mnotgod96/AppAgent)

- [LASER: LLM Agent with State-Space Exploration for Web Navigation](https://arxiv.org/abs/2309.08172) (Sep. 2023)
  - Novel approach to web navigation
  - [Code](https://github.com/Mayer123/LASER)

- [AndroidEnv: A Reinforcement Learning Platform for Android](https://arxiv.org/abs/2105.13231) (May 2021)
  - Reinforcement learning platform for Android interaction
  - [Code](https://github.com/google-deepmind/android_env)
### UI Grounding

- [UI-Pro: A Hidden Recipe for Building Vision-Language Models for GUI Grounding](https://openreview.net/forum?id=5wmAfwDBoi) (Dec. 2024)
  - Novel framework for building VLMs with strong UI element grounding capabilities

- [Grounding Multimodal Large Language Model in GUI World](https://openreview.net/forum?id=M9iky9Ruhx) (Dec. 2024)
  - Novel GUI grounding framework with an automated data collection engine and a lightweight grounding module

- [Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms](https://arxiv.org/abs/2410.18967) (Oct. 2024)
  - Multimodal LLM for universal UI understanding across diverse platforms
  - Introduces adaptive gridding for high-resolution perception
  - Preprint

- [OS-ATLAS: Foundation Action Model for Generalist GUI Agents](https://arxiv.org/abs/2410.23218) (Oct. 2024)
  - Comprehensive action modeling
  - [Code](https://github.com/OS-Copilot/OS-Atlas)

- [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://arxiv.org/abs/2410.05243) (Oct. 2024)
  - Universal approach to GUI interaction
  - [Code](https://github.com/OSU-NLP-Group/UGround)

- [OmniParser for Pure Vision Based GUI Agent](https://arxiv.org/abs/2408.00203) (Aug. 2024)
  - Novel vision-based screen parsing method for UI screenshots
  - Combines finetuned interactable icon detection and functional description models (a schematic sketch of this pipeline follows the list)
  - [Code](https://github.com/microsoft/OmniParser)

- [Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs](https://arxiv.org/abs/2404.05719) (Apr. 2024)
  - Mobile UI understanding
  - [Code](https://github.com/apple/ml-ferret)

- [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935) (Jan. 2024)
  - Advanced visual grounding techniques
  - [Code](https://github.com/njucckevin/SeeClick)
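The OmniParser entry above describes a pipeline shape that recurs across UI grounding work: an interactable-region detector feeding a functional-description (captioning) model. The sketch below illustrates only that shape; `detect_regions`, `describe`, and `UIElement` are hypothetical stand-ins, not the actual OmniParser API.

```python
"""Schematic of a vision-based screen-parsing pipeline (detector + captioner).
All helpers are hypothetical; see the OmniParser repo for a real implementation."""
from dataclasses import dataclass

@dataclass
class UIElement:
    bbox: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)
    description: str                         # functional caption, e.g. "search button"

def detect_regions(screenshot: bytes) -> list[tuple[float, float, float, float]]:
    return []  # stand-in for a finetuned interactable-icon detector

def describe(screenshot: bytes, bbox: tuple[float, float, float, float]) -> str:
    return ""  # stand-in for a functional-description (captioning) model

def parse_screen(screenshot: bytes) -> list[UIElement]:
    # The detector proposes candidate interactable regions; the captioner labels
    # each one, yielding a structured element list an LLM agent can act on.
    return [UIElement(bbox, describe(screenshot, bbox)) for bbox in detect_regions(screenshot)]
```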
### Dataset

- [Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents](https://arxiv.org/abs/2502.11357) (Feb. 2025)
  - Scalable multi-agent pipeline that leverages exploration for diverse web agent trajectory synthesis

- [OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis](https://arxiv.org/abs/2412.19723) (Dec. 2024)
  - Novel interaction-driven approach for automated GUI trajectory synthesis
  - Introduces reverse task synthesis and a trajectory reward model
  - [Code](https://github.com/OS-Copilot/OS-Genesis)

- [AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials](https://arxiv.org/abs/2412.09605) (Dec. 2024)
  - Web tutorial-based trajectory synthesis

- [Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale](https://arxiv.org/abs/2409.15637) (Sep. 2024)
  - Scalable demonstration generation

- [UiPad: UI Parsing and Accessibility Dataset](https://huggingface.co/datasets/MacPaw/uipad) (Sep. 2024)
  - macOS desktop UI dataset with accessibility trees and evaluation questions

- [CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation](https://aclanthology.org/2024.findings-acl.928/) (Aug. 2024)
  - Chinese benchmark for agent evaluation
  - [Code](https://github.com/tjunlp-lab/CToolEval)

- [AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?](https://arxiv.org/abs/2407.15711) (Jul. 2024)
  - Benchmark for realistic and time-consuming web tasks
  - [Website](https://assistantbench.github.io)

- [ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights](https://arxiv.org/abs/2406.14596) (Jun. 2024)
  - Novel approach to continual learning from trajectories

- [Multi-Turn Mind2Web: On the Multi-turn Instruction Following](https://arxiv.org/abs/2402.15057) (Feb. 2024)
  - Multi-turn instruction dataset for web agents
  - [Code](https://github.com/magicgh/self-map)

- [Android in the Wild: A Large-Scale Dataset for Android Device Control](https://arxiv.org/abs/2307.10088) (Jul. 2023)
  - Large-scale dataset for Android interaction
  - Real-world device control scenarios

- [Mind2Web: Towards a Generalist Agent for the Web](https://arxiv.org/abs/2306.06070) (Jun. 2023)
  - Large-scale web interaction dataset
  - [Code](https://github.com/OSU-NLP-Group/Mind2Web)

- [WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents](https://arxiv.org/abs/2207.01206) (Jul. 2022)
  - Dataset for grounded language agents in web interaction
  - [Code](https://github.com/princeton-nlp/WebShop)

- [Rico: A Mobile App Dataset for Building Data-Driven Design Applications](https://dl.acm.org/doi/10.1145/3126594.3126651) (Oct. 2017)
  - Mobile app UI dataset
  - Design-focused data collection
### Benchmark

- [A3: Android Agent Arena for Mobile GUI Agents](https://arxiv.org/abs/2501.01149) (Jan. 2025)
  - Novel evaluation platform with 201 tasks across 21 widely used third-party apps
  - [Website](https://yuxiangchai.github.io/Android-Agent-Arena/)
  - [Code](https://github.com/AndroidArenaAgent/AndroidArena)

- [Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale](https://arxiv.org/abs/2409.08264) (Sep. 2024)
  - Windows OS-focused evaluation framework
  - [Code](https://github.com/microsoft/WindowsAgentArena)
  - [Website](https://microsoft.github.io/WindowsAgentArena/)

- [AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents](https://arxiv.org/abs/2407.18901) (Jul. 2024)
  - Comprehensive benchmark with 750 natural tasks across 9 day-to-day apps and 457 APIs
  - GPT-4o achieves only ~49% on normal tasks and ~30% on challenge tasks
  - [Code](https://github.com/stonybrooknlp/appworld/)

- [Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?](https://arxiv.org/abs/2407.10956) (Jul. 2024)
  - Evaluation in data science workflows
  - [Code](https://github.com/xlang-ai/Spider2-V)

- [τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains](https://arxiv.org/abs/2406.12045) (Jun. 2024)
  - Novel benchmark for evaluating agent-user interaction and policy compliance
  - State-of-the-art agents achieve <50% success rate and <25% consistency (pass^8; see the sketch after this list)
  - [Code](https://github.com/sierra-research/tau-bench)

- [MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents](https://arxiv.org/abs/2406.08184) (Jun. 2024)
  - Mobile agent evaluation
  - [Code](https://github.com/MobileAgentBench/mobile-agent-bench)

- [AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents](https://arxiv.org/abs/2405.14573) (May 2024)
  - Android-focused evaluation
  - [Code](https://github.com/google-research/android_world)

- [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https://arxiv.org/abs/2404.07972) (Apr. 2024)
  - Comprehensive evaluation framework
  - [Code](https://github.com/xlang-ai/OSWorld)

- [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) (Jan. 2024)
  - Web-focused evaluation
  - [Code](https://github.com/web-arena-x/visualwebarena)

- [Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction](https://arxiv.org/abs/2305.08144) (May 2023)
  - Mobile-focused evaluation framework
  - [Code](https://github.com/X-LANCE/Mobile-Env)
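A note on τ-bench's pass^k metric referenced above: it measures the probability that all k i.i.d. trials of a task succeed, averaged over tasks. Assuming it uses the standard combinatorial estimator (the all-success analogue of the pass@k estimator; worth verifying against the paper), it can be computed like this:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(all k i.i.d. trials succeed), given c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)  # math.comb(c, k) is 0 when c < k

print(pass_hat_k(n=8, c=6, k=2))  # 15/28 ~= 0.536
print(pass_hat_k(n=8, c=6, k=8))  # 0.0: with 8 trials, pass^8 is nonzero only if all 8 succeeded
```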
### Safety

- [Attacking Vision-Language Computer Agents via Pop-ups](https://arxiv.org/abs/2411.02391) (Nov. 2024)
  - Security analysis of computer agents
  - [Code](https://github.com/SALT-NLP/PopupAttack)

- [EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage](https://arxiv.org/abs/2409.11295) (Sep. 2024)
  - Privacy and security analysis

- [GuardAgent: Safeguard LLM Agent by a Guard Agent via Knowledge-Enabled Reasoning](https://arxiv.org/abs/2406.09187) (Jun. 2024)
  - Safety mechanisms for agents
## Projects

### Open Source
### Frameworks & Models

- [AutoGen](https://github.com/microsoft/autogen)
  - Framework for building AI agent systems
  - Simplifies the creation of event-driven, distributed, scalable, and resilient agentic applications

- [Auto-GPT](https://github.com/Significant-Gravitas/Auto-GPT)
  - Autonomous GPT-4 agent
  - Task automation focus

- [Browser Use](https://github.com/browser-use/browser-use)
  - Makes websites accessible to AI agents with vision + HTML extraction (see the usage sketch after this list)
  - Supports multi-tab management and custom actions with LangChain integration

- [Claude Computer Use Demo](https://github.com/PallavAg/claude-computer-use-macos)
  - macOS implementation
  - Claude integration

- [Claude Minecraft Use](https://github.com/ObservedObserver/claude-minecraft-use)
  - Game automation
  - Specialized use case

- [Computer Use OOTB](https://github.com/showlab/computer_use_ootb)
  - Ready-to-use implementation
  - Comprehensive toolset

- [Cua](https://github.com/trycua)
  - Computer Use Interface & Agent

- [Cybergod](https://github.com/james4ever0/agi_computer_control)
  - Advanced computer control

- [Grunty](https://github.com/suitedaces/computer-agent)
  - Computer control agent
  - Task automation focus

- [Inferable](https://github.com/inferablehq/inferable)
  - Distributed agent builder platform
  - Build tools with existing code

- [LaVague](https://github.com/lavague-ai/LaVague)
  - AI web agent framework
  - Modular architecture

- [Mac Computer Use](https://github.com/deedy/mac_computer_use)
  - macOS-specific tools
  - Anthropic integration

- [NatBot](https://github.com/nat/natbot)
  - Browser automation
  - GPT-4 Vision integration

- [Notte Browser Using Agent](https://github.com/nottelabs/notte)
  - Full-stack web AI agent framework (agents, automations, cloud browser sessions)
  - Notte turns websites into structured, navigable maps described in natural language

- [OpenAdapt](https://github.com/OpenAdaptAI/OpenAdapt)
  - AI-First Process Automation
  - Multimodal model integration

- [OpenInterface](https://github.com/AmberSahdev/Open-Interface/)
  - Open-source UI interaction framework
  - Cross-platform support

- [OpenInterpreter](https://github.com/OpenInterpreter/open-interpreter)
  - General-purpose computer control framework
  - Python-based, extensible architecture

- [Open Source Computer Use by E2B](https://github.com/e2b-dev/secure-computer-use/tree/os-computer-use)
  - Open-source implementation of computer control capabilities
  - Secure sandboxed environment for AI agents

- [Self-Operating Computer](https://github.com/OthersideAI/self-operating-computer) (Nov. 2023)
  - The first computer-use framework created
  - Vision-based computer control and automation

- [Skyvern](https://github.com/skyvern-ai/skyvern)
  - AI web agent framework
  - Automates browser-based workflows with LLMs using vision and HTML extraction

- [Surfkit](https://github.com/agentsea/surfkit)
  - Device operation toolkit
  - Extensible agent framework

- [Upsonic](https://github.com/upsonic/upsonic)
  - Reliable agent framework with MCP support
  - Integrated Browser Use and Computer Use

- [WebMarker](https://github.com/reidbarber/webmarker)
  - Web page annotation tool
  - Vision-language model support
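As a taste of what these frameworks look like in use, here is a minimal Browser Use snippet adapted from the project's README at the time of writing. The `Agent` constructor and LangChain wiring may have changed since, so treat it as a sketch rather than current API documentation.

```python
# Minimal Browser Use sketch (API as documented at the time of writing; verify against the repo).
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI  # Browser Use integrates with LangChain chat models

async def main() -> None:
    agent = Agent(
        task="Find the top post on Hacker News and return its title",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()  # the agent drives a browser session until the task completes

asyncio.run(main())
```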
### UI Grounding

- [AskUI/PTA-1](https://huggingface.co/AskUI/PTA-1)
  - A small vision-language model for computer and phone automation, based on Florence-2
  - With only 270M parameters, it outperforms much larger models at GUI text and element localization

- [Microsoft/OmniParser](https://huggingface.co/microsoft/OmniParser)
  - A general screen-parsing tool that converts UI screenshots into a structured format to improve existing LLM-based UI agents
### Environment & Sandbox

- [Cua](https://github.com/trycua)
  - macOS/Linux sandbox on Apple Silicon

- [dockur/windows](https://github.com/dockur/windows)
  - Windows inside a Docker container (see the launch sketch after this list)

- [E2B Desktop Sandbox](https://github.com/e2b-dev/desktop)
  - Secure desktop environment
  - Agent testing platform

- [qemus/qemu-docker](https://github.com/qemus/qemu-docker)
  - Docker container for running virtual machines using QEMU
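As an example of wiring one of these sandboxes into an agent harness, the sketch below starts the dockur/windows image with the Docker SDK for Python. The image name, port, and `VERSION` variable follow that project's README at the time of writing; verify them against the repo before relying on this.

```python
# Launch a Windows sandbox via dockur/windows using the Docker SDK for Python
# (pip install docker). Image name, port, and env var are assumptions taken
# from the project's README; requires KVM on the host.
import docker

client = docker.from_env()
container = client.containers.run(
    "dockurr/windows",                  # published image name (assumption)
    detach=True,
    environment={"VERSION": "11"},      # Windows version to install (assumption)
    devices=["/dev/kvm:/dev/kvm:rwm"],  # hardware virtualization passthrough
    ports={"8006/tcp": 8006},           # web-based viewer (assumption)
)
print(f"sandbox starting: http://localhost:8006 ({container.short_id})")
```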
### Automation

- [nut.js](https://github.com/nut-tree/nut.js)
  - Native UI automation
  - JavaScript/TypeScript implementation

- [PyAutoGUI](https://github.com/asweigart/pyautogui)
  - Cross-platform GUI automation
  - Python-based control library (see the example after this list)
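Libraries like these provide the low-level "act" layer for many of the agents above. A short PyAutoGUI example:

```python
# Basic PyAutoGUI usage: the kind of primitive actions an agent's "act" step emits.
import pyautogui

pyautogui.FAILSAFE = True            # abort by slamming the mouse into a screen corner
width, height = pyautogui.size()     # current screen resolution

pyautogui.moveTo(width // 2, height // 2, duration=0.25)  # glide to screen center
pyautogui.click()                                         # left-click
pyautogui.write("hello from an agent", interval=0.05)     # type with a per-key delay
pyautogui.hotkey("ctrl", "s")                             # chorded keyboard shortcut

screenshot = pyautogui.screenshot("observation.png")      # capture for the next "perceive" step
```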
### Commercial
### Frameworks & Models

- [Anthropic Claude Computer Use](https://www.anthropic.com/news/3-5-models-and-computer-use)
  - Commercial computer control capability
  - Integrated with Claude 3.5 models

- [Multion](https://www.multion.ai)
  - AI agents that can fully complete tasks in any web environment

- [Runner H](https://www.hcompany.ai/)
  - Advanced AI agent for real-world applications
  - Scores 67% on WebVoyager
## Contributing

We welcome and encourage contributions from the community! Here's how you can help:

- **Add new resources**: Found a relevant paper, project, or tool? Submit a PR to add it
- **Fix errors**: Help us correct any mistakes in existing entries
- **Improve organization**: Suggest better ways to structure the information
- **Update content**: Keep entries up to date with the latest developments

To contribute:

1. Fork the repository
2. Create a new branch for your changes
3. Submit a pull request with a clear description of your additions/changes
4. Post in the [X Community](https://x.com/i/communities/1874549355442802764) to let everyone know about the new resource

For an example of how to format your contribution, please refer to [this PR](https://github.com/francedot/acu/pull/1).
*Thank you for helping spread knowledge about AI agents for computer use!*