├── .github
│   ├── CODE_OF_CONDUCT.md
│   ├── ISSUE_TEMPLATE.md
│   └── PULL_REQUEST_TEMPLATE.md
├── .gitignore
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE.md
├── README.md
└── computer-use
    ├── .vscode
    │   └── launch.json
    ├── cua.py
    ├── local_computer.py
    ├── main.py
    └── requirements.txt

/.github/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Microsoft Open Source Code of Conduct
2 |
3 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
4 |
5 | Resources:
6 |
7 | - [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
8 | - [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
9 | - Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
10 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE.md:
--------------------------------------------------------------------------------
1 |
4 | > Please provide us with the following information:
5 | > ---------------------------------------------------------------
6 |
7 | ### This issue is for a: (mark with an `x`)
8 | ```
9 | - [ ] bug report -> please search issues before submitting
10 | - [ ] feature request
11 | - [ ] documentation issue or request
12 | - [ ] regression (a behavior that used to work and stopped in a new release)
13 | ```
14 |
15 | ### Minimal steps to reproduce
16 | >
17 |
18 | ### Any log messages given by the failure
19 | >
20 |
21 | ### Expected/desired behavior
22 | >
23 |
24 | ### OS and Version?
25 | > Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
26 |
27 | ### Versions
28 | >
29 |
30 | ### Mention any other details that might be useful
31 |
32 | > ---------------------------------------------------------------
33 | > Thanks! We'll be in touch soon.
34 |
--------------------------------------------------------------------------------
/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | ## Purpose
2 |
3 | * ...
4 |
5 | ## Does this introduce a breaking change?
6 |
7 | ```
8 | [ ] Yes
9 | [ ] No
10 | ```
11 |
12 | ## Pull Request Type
13 | What kind of change does this Pull Request introduce?
14 |
15 |
16 | ```
17 | [ ] Bugfix
18 | [ ] Feature
19 | [ ] Code style update (formatting, local variables)
20 | [ ] Refactoring (no functional changes, no api changes)
21 | [ ] Documentation content changes
22 | [ ] Other... Please describe:
23 | ```
24 |
25 | ## How to Test
26 | * Get the code
27 |
28 | ```
29 | git clone [repo-address]
30 | cd [repo-name]
31 | git checkout [branch-name]
32 | npm install
33 | ```
34 |
35 | * Test the code
36 |
37 | ```
38 | ```
39 |
40 | ## What to Check
41 | Verify that the following are valid
42 | * ...
43 |
44 | ## Other Information
45 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | ## Ignore Visual Studio temporary files, build results, and
2 | ## files generated by popular Visual Studio add-ons.
3 | ##
4 | ## Get latest from https://github.com/github/gitignore/blob/main/VisualStudio.gitignore
5 |
6 | # User-specific files
7 | *.rsuser
8 | *.suo
9 | *.user
10 | *.userosscache
11 | *.sln.docstates
12 |
13 | # User-specific files (MonoDevelop/Xamarin Studio)
14 | *.userprefs
15 |
16 | # Mono auto generated files
17 | mono_crash.*
18 |
19 | # Build results
20 | [Dd]ebug/
21 | [Dd]ebugPublic/
22 | [Rr]elease/
23 | [Rr]eleases/
24 | x64/
25 | x86/
26 | [Ww][Ii][Nn]32/
27 | [Aa][Rr][Mm]/
28 | [Aa][Rr][Mm]64/
29 | bld/
30 | [Bb]in/
31 | [Oo]bj/
32 | [Ll]og/
33 | [Ll]ogs/
34 |
35 | # Visual Studio 2015/2017 cache/options directory
36 | .vs/
37 | # Uncomment if you have tasks that create the project's static files in wwwroot
38 | #wwwroot/
39 |
40 | # Visual Studio 2017 auto generated files
41 | Generated\ Files/
42 |
43 | # MSTest test Results
44 | [Tt]est[Rr]esult*/
45 | [Bb]uild[Ll]og.*
46 |
47 | # NUnit
48 | *.VisualState.xml
49 | TestResult.xml
50 | nunit-*.xml
51 |
52 | # Build Results of an ATL Project
53 | [Dd]ebugPS/
54 | [Rr]eleasePS/
55 | dlldata.c
56 |
57 | # Benchmark Results
58 | BenchmarkDotNet.Artifacts/
59 |
60 | # .NET Core
61 | project.lock.json
62 | project.fragment.lock.json
63 | artifacts/
64 |
65 | # ASP.NET Scaffolding
66 | ScaffoldingReadMe.txt
67 |
68 | # StyleCop
69 | StyleCopReport.xml
70 |
71 | # Files built by Visual Studio
72 | *_i.c
73 | *_p.c
74 | *_h.h
75 | *.ilk
76 | *.meta
77 | *.obj
78 | *.iobj
79 | *.pch
80 | *.pdb
81 | *.ipdb
82 | *.pgc
83 | *.pgd
84 | *.rsp
85 | # but not Directory.Build.rsp, as it configures directory-level build defaults
86 | !Directory.Build.rsp
87 | *.sbr
88 | *.tlb
89 | *.tli
90 | *.tlh
91 | *.tmp
92 | *.tmp_proj
93 | *_wpftmp.csproj
94 | *.log
95 | *.tlog
96 | *.vspscc
97 | *.vssscc
98 | .builds
99 | *.pidb
100 | *.svclog
101 | *.scc
102 |
103 | # Chutzpah Test files
104 | _Chutzpah*
105 |
106 | # Visual C++ cache files
107 | ipch/
108 | *.aps
109 | *.ncb
110 | *.opendb
111 | *.opensdf
112 | *.sdf
113 | *.cachefile
114 | *.VC.db
115 | *.VC.VC.opendb
116 |
117 | # Visual Studio profiler
118 | *.psess
119 | *.vsp
120 | *.vspx
121 | *.sap
122 |
123 | # Visual Studio Trace Files
124 | *.e2e
125 |
126 | # TFS 2012 Local Workspace
127 | $tf/
128 |
129 | # Guidance Automation Toolkit
130 | *.gpState
131 |
132 | # ReSharper is a .NET coding add-in
133 | _ReSharper*/
134 | *.[Rr]e[Ss]harper
135 | *.DotSettings.user
136 |
137 | # TeamCity is a build add-in
138 | _TeamCity*
139 |
140 | # DotCover is a Code Coverage Tool
141 | *.dotCover
142 |
143 | # AxoCover is a Code Coverage Tool
144 | .axoCover/*
145 | !.axoCover/settings.json
146 |
147 | # Coverlet is a free, cross platform Code Coverage Tool
148 | coverage*.json
149 | coverage*.xml
150 | coverage*.info
151 |
152 | # Visual Studio code coverage results
153 | *.coverage
154 | *.coveragexml
155 |
156 | # NCrunch
157 | _NCrunch_*
158 | .*crunch*.local.xml
159 | nCrunchTemp_*
160 |
161 | # MightyMoose
162 | *.mm.*
163 | AutoTest.Net/
164 |
165 | # Web workbench (sass)
166 | .sass-cache/
167 |
168 | # Installshield output folder
169 | [Ee]xpress/
170 |
171 | # DocProject is a documentation generator add-in
172 | DocProject/buildhelp/
173 | DocProject/Help/*.HxT
174 | DocProject/Help/*.HxC
175 | DocProject/Help/*.hhc
176 | DocProject/Help/*.hhk
177 | DocProject/Help/*.hhp
178 | DocProject/Help/Html2
179 | DocProject/Help/html
180 |
181 | # Click-Once directory
182 | publish/
183 |
184 | # Publish Web Output
185 | *.[Pp]ublish.xml
186 | *.azurePubxml
187 | # Note: Comment the next line if you want to checkin your web deploy settings,
188 | # but database connection strings (with potential passwords) will be unencrypted
189 | *.pubxml
190 | *.publishproj
191 |
192 | # Microsoft Azure Web App publish settings. Comment the next line if you want to
193 | # checkin your Azure Web App publish settings, but sensitive information contained
194 | # in these scripts will be unencrypted
195 | PublishScripts/
196 |
197 | # NuGet Packages
198 | *.nupkg
199 | # NuGet Symbol Packages
200 | *.snupkg
201 | # The packages folder can be ignored because of Package Restore
202 | **/[Pp]ackages/*
203 | # except build/, which is used as an MSBuild target.
204 | !**/[Pp]ackages/build/
205 | # Uncomment if necessary however generally it will be regenerated when needed
206 | #!**/[Pp]ackages/repositories.config
207 | # NuGet v3's project.json files produces more ignorable files
208 | *.nuget.props
209 | *.nuget.targets
210 |
211 | # Microsoft Azure Build Output
212 | csx/
213 | *.build.csdef
214 |
215 | # Microsoft Azure Emulator
216 | ecf/
217 | rcf/
218 |
219 | # Windows Store app package directories and files
220 | AppPackages/
221 | BundleArtifacts/
222 | Package.StoreAssociation.xml
223 | _pkginfo.txt
224 | *.appx
225 | *.appxbundle
226 | *.appxupload
227 |
228 | # Visual Studio cache files
229 | # files ending in .cache can be ignored
230 | *.[Cc]ache
231 | # but keep track of directories ending in .cache
232 | !?*.[Cc]ache/
233 |
234 | # Others
235 | ClientBin/
236 | ~$*
237 | *~
238 | *.dbmdl
239 | *.dbproj.schemaview
240 | *.jfm
241 | *.pfx
242 | *.publishsettings
243 | orleans.codegen.cs
244 |
245 | # Including strong name files can present a security risk
246 | # (https://github.com/github/gitignore/pull/2483#issue-259490424)
247 | #*.snk
248 |
249 | # Since there are multiple workflows, uncomment next line to ignore bower_components
250 | # (https://github.com/github/gitignore/pull/1529#issuecomment-104372622)
251 | #bower_components/
252 |
253 | # RIA/Silverlight projects
254 | Generated_Code/
255 |
256 | # Backup & report files from converting an old project file
257 | # to a newer Visual Studio version. Backup files are not needed,
258 | # because we have git ;-)
259 | _UpgradeReport_Files/
260 | Backup*/
261 | UpgradeLog*.XML
262 | UpgradeLog*.htm
263 | ServiceFabricBackup/
264 | *.rptproj.bak
265 |
266 | # SQL Server files
267 | *.mdf
268 | *.ldf
269 | *.ndf
270 |
271 | # Business Intelligence projects
272 | *.rdl.data
273 | *.bim.layout
274 | *.bim_*.settings
275 | *.rptproj.rsuser
276 | *- [Bb]ackup.rdl
277 | *- [Bb]ackup ([0-9]).rdl
278 | *- [Bb]ackup ([0-9][0-9]).rdl
279 |
280 | # Microsoft Fakes
281 | FakesAssemblies/
282 |
283 | # GhostDoc plugin setting file
284 | *.GhostDoc.xml
285 |
286 | # Node.js Tools for Visual Studio
287 | .ntvs_analysis.dat
288 | node_modules/
289 |
290 | # Visual Studio 6 build log
291 | *.plg
292 |
293 | # Visual Studio 6 workspace options file
294 | *.opt
295 |
296 | # Visual Studio 6 auto-generated workspace file (contains which files were open etc.)
297 | *.vbw
298 |
299 | # Visual Studio 6 auto-generated project file (contains which files were open etc.)
300 | *.vbp
301 |
302 | # Visual Studio 6 workspace and project file (working project files containing files to include in project)
303 | *.dsw
304 | *.dsp
305 |
306 | # Visual Studio 6 technical files
307 | *.ncb
308 | *.aps
309 |
310 | # Visual Studio LightSwitch build output
311 | **/*.HTMLClient/GeneratedArtifacts
312 | **/*.DesktopClient/GeneratedArtifacts
313 | **/*.DesktopClient/ModelManifest.xml
314 | **/*.Server/GeneratedArtifacts
315 | **/*.Server/ModelManifest.xml
316 | _Pvt_Extensions
317 |
318 | # Paket dependency manager
319 | .paket/paket.exe
320 | paket-files/
321 |
322 | # FAKE - F# Make
323 | .fake/
324 |
325 | # CodeRush personal settings
326 | .cr/personal
327 |
328 | # Python Tools for Visual Studio (PTVS)
329 | __pycache__/
330 | *.pyc
331 |
332 | # Cake - Uncomment if you are using it
333 | # tools/**
334 | # !tools/packages.config
335 |
336 | # Tabs Studio
337 | *.tss
338 |
339 | # Telerik's JustMock configuration file
340 | *.jmconfig
341 |
342 | # BizTalk build output
343 | *.btp.cs
344 | *.btm.cs
345 | *.odx.cs
346 | *.xsd.cs
347 |
348 | # OpenCover UI analysis results
349 | OpenCover/
350 |
351 | # Azure Stream Analytics local run output
352 | ASALocalRun/
353 |
354 | # MSBuild Binary and Structured Log
355 | *.binlog
356 |
357 | # NVidia Nsight GPU debugger configuration file
358 | *.nvuser
359 |
360 | # MFractors (Xamarin productivity tool) working folder
361 | .mfractor/
362 |
363 | # Local History for Visual Studio
364 | .localhistory/
365 |
366 | # Visual Studio History (VSHistory) files
367 | .vshistory/
368 |
369 | # BeatPulse healthcheck temp database
370 | healthchecksdb
371 |
372 | # Backup folder for Package Reference Convert tool in Visual Studio 2017
373 | MigrationBackup/
374 |
375 | # Ionide (cross platform F# VS Code tools) working folder
376 | .ionide/
377 |
378 | # Fody - auto-generated XML schema
379 | FodyWeavers.xsd
380 |
381 | # VS Code files for those working on multiple tools
382 | .vscode/*
383 | !.vscode/settings.json
384 | !.vscode/tasks.json
385 | !.vscode/launch.json
386 | !.vscode/extensions.json
387 | *.code-workspace
388 |
389 | # Local History for Visual Studio Code
390 | .history/
391 |
392 | # Windows Installer files from build outputs
393 | *.cab
394 | *.msi
395 | *.msix
396 | *.msm
397 | *.msp
398 |
399 | # JetBrains Rider
400 | *.sln.iml
401 |
--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
1 | ## [project-title] Changelog
2 |
3 |
4 | # x.y.z (yyyy-mm-dd)
5 |
6 | *Features*
7 | * ...
8 |
9 | *Bug Fixes*
10 | * ...
11 |
12 | *Breaking Changes*
13 | * ...
14 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing to [project-title]
2 |
3 | This project welcomes contributions and suggestions. Most contributions require you to agree to a
4 | Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
5 | the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
6 |
7 | When you submit a pull request, a CLA bot will automatically determine whether you need to provide
8 | a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
9 | provided by the bot. You will only need to do this once across all repos using our CLA.
10 |
11 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
12 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
13 | contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
14 |
15 | - [Code of Conduct](#coc)
16 | - [Issues and Bugs](#issue)
17 | - [Feature Requests](#feature)
18 | - [Submission Guidelines](#submit)
19 |
20 | ## Code of Conduct
21 | Help us keep this project open and inclusive. Please read and follow our [Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
22 |
23 | ## Found an Issue?
24 | If you find a bug in the source code or a mistake in the documentation, you can help us by
25 | [submitting an issue](#submit-issue) to the GitHub Repository. Even better, you can
26 | [submit a Pull Request](#submit-pr) with a fix.
27 |
28 | ## Want a Feature?
29 | You can *request* a new feature by [submitting an issue](#submit-issue) to the GitHub
30 | Repository. If you would like to *implement* a new feature, please submit an issue with
31 | a proposal for your work first, to be sure that we can use it.
32 |
33 | * **Small Features** can be crafted and directly [submitted as a Pull Request](#submit-pr).
34 |
35 | ## Submission Guidelines
36 |
37 | ### Submitting an Issue
38 | Before you submit an issue, search the archive; your question may already have been answered.
39 |
40 | If your issue appears to be a bug, and hasn't been reported, open a new issue.
41 | Help us to maximize the effort we can spend fixing issues and adding new
42 | features, by not reporting duplicate issues. Providing the following information will increase the
43 | chances of your issue being dealt with quickly:
44 |
45 | * **Overview of the Issue** - if an error is being thrown a non-minified stack trace helps
46 | * **Version** - what version is affected (e.g. 0.1.2)
47 | * **Motivation for or Use Case** - explain what you are trying to do and why the current behavior is a bug for you
48 | * **Browsers and Operating System** - is this a problem with all browsers?
49 | * **Reproduce the Error** - provide a live example or an unambiguous set of steps
50 | * **Related Issues** - has a similar issue been reported before?
51 | * **Suggest a Fix** - if you can't fix the bug yourself, perhaps you can point to what might be
52 | causing the problem (line of code or commit)
53 |
54 | You can file new issues by providing the above information at the corresponding repository's issues link: https://github.com/[organization-name]/[repository-name]/issues/new.
55 |
56 | ### Submitting a Pull Request (PR)
57 | Before you submit your Pull Request (PR) consider the following guidelines:
58 |
59 | * Search the repository (https://github.com/[organization-name]/[repository-name]/pulls) for an open or closed PR
60 | that relates to your submission. You don't want to duplicate effort.
61 |
62 | * Make your changes in a new git fork:
63 |
64 | * Commit your changes using a descriptive commit message
65 | * Push your fork to GitHub:
66 | * In GitHub, create a pull request
67 | * If we suggest changes then:
68 | * Make the required updates.
69 | * Rebase your fork and force push to your GitHub repository (this will update your Pull Request):
70 |
71 | ```shell
72 | git rebase master -i
73 | git push -f
74 | ```
75 |
76 | That's it! Thank you for your contribution!
77 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) Microsoft Corporation.
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Computer Use Assistant (CUA)
2 | > **Important:** You must apply for access in order to use the Computer Use model. Apply here: https://aka.ms/oai/cuaaccess
3 |
4 | This is a sample repository demonstrating how to use the Computer Use model, an AI model capable of interacting with graphical user interfaces (GUIs) through natural language instructions. The Computer Use model can understand visual interfaces, take actions, and complete tasks by controlling a computer just like a human would.
5 |
6 | This framework provides a bridge between the Computer Use model and computer control, allowing for automated task execution while maintaining safety checks and user consent. It serves as a practical example of how to integrate the Computer Use model into applications that require GUI interaction.
7 |
8 | ## Features
9 |
10 | * Natural language computer control through AI models
11 | * Screenshot capture and analysis
12 | * Mouse and keyboard control
13 | * Safety checks and user consent mechanisms
14 | * Support for both OpenAI and Azure OpenAI endpoints
15 | * Cross-platform compatibility (Windows, macOS, Linux)
16 | * Screen resolution scaling for consistent AI model input
17 |
18 | ## Getting Started
19 |
20 | ### Prerequisites
21 |
22 | * Python 3.10 or higher (`cua.py` uses `tuple[int, int] | None` type hints)
23 | * Operating System: Windows, macOS, or Linux
24 | * OpenAI API key or Azure OpenAI credentials
25 |
26 | ### Installation
27 |
28 | 1. Clone the repository:
29 | ```bash
30 | git clone [repository-url]
31 | cd computer-use
32 | ```
33 |
34 | 2. Install the required packages:
35 | ```bash
36 | pip install -r requirements.txt
37 | ```
38 |
39 | 3. Set up your environment variables:
40 |
41 | **For macOS or Linux:**
42 | ```bash
43 | # Azure OpenAI
44 | export AZURE_OPENAI_ENDPOINT="your-azure-endpoint"
45 | export AZURE_OPENAI_API_KEY="your-azure-api-key"
46 |
47 | # OpenAI
48 | export OPENAI_API_KEY="your-openai-api-key"
49 | ```
50 | **For Windows:**
51 | ```powershell
52 | # Azure OpenAI
53 | setx AZURE_OPENAI_ENDPOINT "your-azure-endpoint"
54 | setx AZURE_OPENAI_API_KEY "your-azure-api-key"
55 |
56 | # OpenAI
57 | setx OPENAI_API_KEY "your-openai-api-key"
58 | ```
59 |
60 | ## Usage
61 |
62 | ### Local Computer Control
63 |
64 | The framework is designed to work directly with your local computer. Here's how to use it:
65 |
66 | 1. Run the example application:
67 | ```bash
68 | python main.py --instructions "Open web browser and go to microsoft.com"
69 | ```
70 |
71 | 2. The AI model will:
72 | - Take screenshots of your screen
73 | - Analyze the visual information
74 | - Execute appropriate actions to complete the task
75 | - Request user consent for safety-critical actions
76 |
77 | ### Command Line Arguments
78 |
79 | * `--instructions`: The task to perform (default: "Open web browser and go to microsoft.com")
80 | * `--model`: The AI model to use (default: "computer-use-preview")
81 | * `--endpoint`: The API endpoint to use ("azure" or "openai", default: "azure")
82 | * `--autoplay`: Automatically execute actions without confirmation (default: true)
83 |
84 | ### VM/Remote Control
85 |
86 | For scenarios requiring remote computer control or VM automation, we recommend using Playwright. Playwright provides robust browser automation capabilities and is well-suited for VM-based testing and automation scenarios.
87 |
88 | For more information on VM automation with Playwright, please refer to:
89 | * [Playwright Documentation](https://playwright.dev/docs/intro)
90 | * [Playwright VM Setup Guide](https://playwright.dev/docs/ci-intro)
91 |
92 | ## Demo
93 |
94 | The included demo application (`main.py`) demonstrates how to use the CUA framework:
95 |
96 | 1. Start the demo:
97 | ```bash
98 | python main.py
99 | ```
100 |
101 | 2. Enter your instructions when prompted, or use the `--instructions` parameter to provide them directly.
102 |
103 | 3. Watch as the AI model:
104 | - Captures and analyzes your screen
105 | - Performs mouse and keyboard actions
106 | - Requests consent for safety-critical operations
107 | - Provides reasoning for its actions
108 |
109 | ## Resources
110 |
111 | * [OpenAI API Documentation](https://platform.openai.com/docs/api-reference)
112 | * [Azure OpenAI Documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/)
113 | * [PyAutoGUI Documentation](https://pyautogui.readthedocs.io/)
114 |
--------------------------------------------------------------------------------
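The command-line arguments listed in README.md can be sketched with `argparse`. This is a minimal illustration only: `main.py` is not shown in this listing, so the helper names `str_to_bool` and `build_parser` are hypothetical, and treating `--autoplay` as a true/false string is an assumption. The flag names and defaults are taken from the README.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Interpret common spellings of true/false given on the command line."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean value, got '{value}'.")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults mirror the README (layout assumed)."""
    parser = argparse.ArgumentParser(description="Computer Use Assistant demo")
    parser.add_argument(
        "--instructions",
        default="Open web browser and go to microsoft.com",
        help="The task to perform",
    )
    parser.add_argument(
        "--model",
        default="computer-use-preview",
        help="The AI model to use",
    )
    parser.add_argument(
        "--endpoint",
        choices=["azure", "openai"],
        default="azure",
        help="The API endpoint to use",
    )
    parser.add_argument(
        "--autoplay",
        type=str_to_bool,
        default=True,
        help="Automatically execute actions without confirmation",
    )
    return parser
```

Calling `build_parser().parse_args(["--endpoint", "openai", "--autoplay", "false"])` would select the OpenAI endpoint and turn off automatic execution, leaving the other options at their documented defaults.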
/computer-use/.vscode/launch.json:
--------------------------------------------------------------------------------
1 | {
2 |     "version": "0.2.0",
3 |     "configurations": [
4 |
5 |         {
6 |             "name": "Launch Python Debugger",
7 |             "type": "debugpy",
8 |             "request": "launch",
9 |             "program": "main.py",
10 |             "console": "integratedTerminal"
11 |         }
12 |     ]
13 | }
--------------------------------------------------------------------------------
/computer-use/cua.py:
--------------------------------------------------------------------------------
1 | import asyncio
2 | import base64
3 | import inspect
4 | import io
5 | import json
6 | import re
7 |
8 | import openai
9 | import PIL.Image  # "import PIL" alone does not load the Image submodule
10 |
11 |
12 | class Scaler:
13 |     """Wrapper for a computer that performs resizing and coordinate translation."""
14 |
15 |     def __init__(self, computer, dimensions: tuple[int, int] | None = None):
16 |         self.computer = computer
17 |         self.size = dimensions
18 |         self.screen_width = -1
19 |         self.screen_height = -1
20 |
21 |     @property
22 |     def environment(self):
23 |         return self.computer.environment
24 |
25 |     @property
26 |     def dimensions(self):
27 |         if not self.size:
28 |             # If no dimensions are given, take a screenshot and scale to fit in 2048px
29 |             # https://platform.openai.com/docs/guides/images
30 |             width, height = self.computer.dimensions
31 |             max_size = 2048
32 |             longest = max(width, height)
33 |             if longest <= max_size:
34 |                 self.size = (width, height)
35 |             else:
36 |                 scale = max_size / longest
37 |                 self.size = (int(width * scale), int(height * scale))
38 |         return self.size
39 |
40 |     async def screenshot(self) -> str:
41 |         # Take a screenshot from the actual computer
42 |         screenshot = await self.computer.screenshot()
43 |         screenshot = base64.b64decode(screenshot)
44 |         buffer = io.BytesIO(screenshot)
45 |         image = PIL.Image.open(buffer)
46 |         # Scale the screenshot
47 |         self.screen_width, self.screen_height = image.size
48 |         width, height = self.dimensions
49 |         ratio = min(width / self.screen_width, height / self.screen_height)
50 |         new_width = int(self.screen_width * ratio)
51 |         new_height = int(self.screen_height * ratio)
52 |         new_size = (new_width, new_height)
53 |         resized_image = image.resize(new_size, PIL.Image.Resampling.LANCZOS)
54 |         image = PIL.Image.new("RGB", (width, height), (0, 0, 0))
55 |         image.paste(resized_image, (0, 0))
56 |         buffer = io.BytesIO()
57 |         image.save(buffer, format="PNG")
58 |         buffer.seek(0)
59 |         data = bytearray(buffer.getvalue())
60 |         return base64.b64encode(data).decode("utf-8")
61 |
62 |     async def click(self, x: int, y: int, button: str = "left") -> None:
63 |         x, y = self._point_to_screen_coords(x, y)
64 |         await self.computer.click(x, y, button=button)
65 |
66 |     async def double_click(self, x: int, y: int) -> None:
67 |         x, y = self._point_to_screen_coords(x, y)
68 |         await self.computer.double_click(x, y)
69 |
70 |     async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
71 |         x, y = self._point_to_screen_coords(x, y)
72 |         await self.computer.scroll(x, y, scroll_x, scroll_y)
73 |
74 |     async def type(self, text: str) -> None:
75 |         await self.computer.type(text)
76 |
77 |     async def wait(self, ms: int = 1000) -> None:
78 |         await self.computer.wait(ms)
79 |
80 |     async def move(self, x: int, y: int) -> None:
81 |         x, y = self._point_to_screen_coords(x, y)
82 |         await self.computer.move(x, y)
83 |
84 |     async def keypress(self, keys: list[str]) -> None:
85 |         await self.computer.keypress(keys)
86 |
87 |     async def drag(self, path: list[tuple[int, int]]) -> None:
88 |         path = [self._point_to_screen_coords(*point) for point in path]
89 |         await self.computer.drag(path)
90 |
91 |     def _point_to_screen_coords(self, x, y):
92 |         width, height = self.dimensions
93 |         ratio = min(width / self.screen_width, height / self.screen_height)
94 |         x = x / ratio
95 |         y = y / ratio
96 |         return int(x), int(y)
97 |
98 |
99 | class Agent:
100 |     """CUA agent to start and continue task execution"""
101 |
102 |     def __init__(self, client, model: str, computer, logger=None):
103 |         self.client = client
104 |         self.model = model
105 |         self.computer = computer
106 |         self.logger = logger
107 |         self.tools = {}
108 |         self.extra_headers = None
109 |         self.parallel_tool_calls = False
110 |         self.start_task()
111 |
112 |     def add_tool(self, tool: dict, func):
113 |         name = tool["name"]
114 |         self.tools[name] = (tool, func)
115 |
116 |     @property
117 |     def requires_user_input(self) -> bool:
118 |         if self.response is None or len(self.response.output) == 0:
119 |             return True
120 |         item = self.response.output[-1]
121 |         return item.type == "message" and item.role == "assistant"
122 |
123 |     @property
124 |     def requires_consent(self) -> bool:
125 |         return any(item.type == "computer_call" for item in self.response.output)
126 |
127 |     @property
128 |     def pending_safety_checks(self):
129 |         items = [item for item in self.response.output if item.type == "computer_call"]
130 |         return [check for item in items for check in item.pending_safety_checks]
131 |
132 |     @property
133 |     def reasoning_summary(self):
134 |         items = [item for item in self.response.output if item.type == "reasoning"]
135 |         return "".join([summary.text for item in items for summary in item.summary])
136 |
137 |     @property
138 |     def messages(self) -> list[str]:
139 |         result: list[str] = []
140 |         if self.response:
141 |             for item in self.response.output:
142 |                 if item.type == "message":
143 |                     for content in item.content:
144 |                         if content.type == "output_text":
145 |                             result.append(content.text)
146 |         return result
147 |
148 |     @property
149 |     def actions(self):
150 |         actions = []
151 |         for item in self.response.output:
152 |             if item.type == "computer_call":
153 |                 action_args = vars(item.action) | {}
154 |                 action = action_args.pop("type")
155 |                 if action == "drag":
156 |                     path = [(point.x, point.y) for point in item.action.path]
157 |                     action_args["path"] = path
158 |                 actions.append((action, action_args))
159 |         return actions
160 |
161 |     def start_task(self):
162 |         self.response = None
163 |
164 |     async def continue_task(self, user_message: str = "", temperature=None):
165 |         inputs = []
166 |         screenshot = ""
167 |         response_input_param = openai.types.responses.response_input_param
168 |         previous_response = self.response
169 |         previous_response_id = None
170 |         if previous_response:
171 |             previous_response_id = previous_response.id
172 |             for item in previous_response.output:
173 |                 if item.type == "computer_call":
174 |                     action, action_args = self.actions[0]
175 |                     method = getattr(self.computer, action)
176 |                     if action != "screenshot":
177 |                         if inspect.iscoroutinefunction(method):
178 |                             result = await method(**action_args)
179 |                         else:
180 |                             result = method(**action_args)
181 |                     screenshot = await self.computer.screenshot()
182 |                     output = response_input_param.ComputerCallOutput(
183 |                         type="computer_call_output",
184 |                         call_id=item.call_id,
185 |                         output=response_input_param.ResponseComputerToolCallOutputScreenshotParam(
186 |                             type="computer_screenshot",
187 |                             image_url=f"data:image/png;base64,{screenshot}",
188 |                         ),
189 |                         acknowledged_safety_checks=self.pending_safety_checks,
190 |                     )
191 |                     inputs.append(output)
192 |                 elif item.type == "function_call":
193 |                     tool_name = item.name
194 |                     kwargs = json.loads(item.arguments)
195 |                     if tool_name not in self.tools:
196 |                         raise ValueError(f"Unsupported tool '{tool_name}'.")
197 |                     _, func = self.tools[tool_name]
198 |                     if inspect.iscoroutinefunction(func):
199 |                         result = await func(**kwargs)
200 |                     else:
201 |                         result = func(**kwargs)
202 |                     output = response_input_param.FunctionCallOutput(
203 |                         type="function_call_output",
204 |                         call_id=item.call_id,
205 |                         output=json.dumps(result),
206 |                     )
207 |                     inputs.append(output)
208 |                 elif item.type == "reasoning" or item.type == "message":
209 |                     pass
210 |                 else:
211 |                     message = f"Unsupported response output type '{item.type}'."
212 |                     raise NotImplementedError(message)
213 |         if user_message:
214 |             message = response_input_param.Message(role="user", content=user_message)
215 |             inputs.append(message)
216 |         self.response = None
217 |         wait = 0
218 |         retry = 10
219 |         while retry > 0:
220 |             retry -= 1
221 |             try:
222 |                 await asyncio.sleep(wait)
223 |                 kwargs = {
224 |                     "model": self.model,
225 |                     "input": inputs,
226 |                     "previous_response_id": previous_response_id,
227 |                     "tools": self.get_tools(),
228 |                     "reasoning": {"generate_summary": "concise"},
229 |                     "temperature": temperature,
230 |                     "truncation": "auto",
231 |                     "extra_headers": self.extra_headers,
232 |                     "parallel_tool_calls": self.parallel_tool_calls,
233 |                 }
234 |                 if isinstance(self.client, openai.AsyncOpenAI):
235 |                     self.response = await self.client.responses.create(**kwargs)
236 |                 else:
237 |                     self.response = self.client.responses.create(**kwargs)
238 |                 assert self.response.status == "completed"
239 |                 return
240 |             except openai.RateLimitError as e:
241 |                 match = re.search(r"Please try again in (\d+)s", e.message)
242 |                 wait = int(match.group(1)) if match else 10
243 |                 if self.logger:
244 |                     self.logger.exception(
245 |                         f"Rate limit exceeded. Waiting for {wait} seconds.",
246 |                         exc_info=e,
247 |                     )
248 |                 if retry == 0:
249 |                     raise
250 |             except openai.InternalServerError as e:
251 |                 if self.logger:
252 |                     self.logger.exception(
253 |                         f"Internal server error: {e.message}",
254 |                         exc_info=e,
255 |                     )
256 |                 if retry == 0:
257 |                     raise
258 |
259 | def get_tools(self) -> list[openai.types.responses.tool_param.ToolParam]:
260 | tools = [entry[0] for entry in self.tools.values()]
261 | return [self.computer_tool(), *tools]
262 |
263 | def computer_tool(self) -> openai.types.responses.ComputerToolParam:
264 | environment = self.computer.environment
265 | dimensions = self.computer.dimensions
266 | return openai.types.responses.ComputerToolParam(
267 | type="computer_use_preview",
268 | display_width=dimensions[0],
269 | display_height=dimensions[1],
270 | environment=environment,
271 | )
272 |
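The retry loop above reads the suggested delay out of the rate-limit error text and falls back to 10 seconds when the text does not match. A minimal sketch of that parsing step as a standalone helper follows; `parse_retry_delay` is a hypothetical name used only for illustration and is not part of `cua.py`.

```python
import re


def parse_retry_delay(message: str, default: int = 10) -> int:
    """Return the delay in seconds suggested by a rate-limit message.

    Hypothetical helper: it mirrors the regular expression used in the
    retry loop and returns `default` when no whole-second hint is found.
    """
    match = re.search(r"Please try again in (\d+)s", message)
    return int(match.group(1)) if match else default
```

Note that the pattern only matches whole seconds, so a hint such as `1.5s` falls through to the default delay.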
--------------------------------------------------------------------------------
/computer-use/local_computer.py:
--------------------------------------------------------------------------------
import asyncio
import base64
import io
import platform

import pyautogui


class LocalComputer:
    """Use pyautogui to take screenshots and perform actions on the local computer."""

    def __init__(self):
        # Cached screen size as (width, height); filled in lazily.
        self.size = None

    @property
    def environment(self):
        system = platform.system()
        if system == "Windows":
            return "windows"
        elif system == "Darwin":
            return "mac"
        elif system == "Linux":
            return "linux"
        else:
            raise NotImplementedError(f"Unsupported operating system: '{system}'")

    @property
    def dimensions(self):
        if not self.size:
            screenshot = pyautogui.screenshot()
            self.size = screenshot.size
        return self.size

    async def screenshot(self) -> str:
        """Return the current screen as a base64-encoded PNG."""
        screenshot = pyautogui.screenshot()
        self.size = screenshot.size
        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode("utf-8")

    async def click(self, x: int, y: int, button: str = "left") -> None:
        # Use the lazy property so a click before the first screenshot
        # does not fail on an unset size.
        width, height = self.dimensions
        if 0 <= x < width and 0 <= y < height:
            button = "middle" if button == "wheel" else button
            pyautogui.moveTo(x, y, duration=0.1)
            pyautogui.click(x, y, button=button)

    async def double_click(self, x: int, y: int) -> None:
        width, height = self.dimensions
        if 0 <= x < width and 0 <= y < height:
            pyautogui.moveTo(x, y, duration=0.1)
            pyautogui.doubleClick(x, y)

    async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
        pyautogui.moveTo(x, y, duration=0.5)
        # The sign is flipped to match pyautogui, where positive scrolls up.
        pyautogui.scroll(-scroll_y)
        pyautogui.hscroll(scroll_x)

    async def type(self, text: str) -> None:
        pyautogui.write(text)

    async def wait(self, ms: int = 1000) -> None:
        await asyncio.sleep(ms / 1000)

    async def move(self, x: int, y: int) -> None:
        pyautogui.moveTo(x, y, duration=0.1)

    async def keypress(self, keys: list[str]) -> None:
        keys = [key.lower() for key in keys]
        # Translate the model's arrow key names to pyautogui key names.
        keymap = {
            "arrowdown": "down",
            "arrowleft": "left",
            "arrowright": "right",
            "arrowup": "up",
        }
        keys = [keymap.get(key, key) for key in keys]
        # Hold all keys down first, then release them, so that
        # combinations such as ctrl+c are sent as a chord.
        for key in keys:
            pyautogui.keyDown(key)
        for key in keys:
            pyautogui.keyUp(key)

    async def drag(self, path: list[tuple[int, int]]) -> None:
        if len(path) <= 1:
            pass
        elif len(path) == 2:
            pyautogui.moveTo(*path[0], duration=0.5)
            pyautogui.dragTo(*path[1], duration=1.0, button="left")
        else:
            # Multi-point drag: press once, visit every point, release once.
            pyautogui.moveTo(*path[0], duration=0.5)
            pyautogui.mouseDown(button="left")
            for point in path[1:]:
                pyautogui.dragTo(*point, duration=1.0, mouseDownUp=False)
            pyautogui.mouseUp(button="left")
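The `keypress` method lowercases the key names it receives and maps the arrow key names onto the names `pyautogui` expects. A minimal sketch of that normalization as a pure function, which can be checked without a display, follows; `normalize_keys` and `ARROW_KEYMAP` are hypothetical names introduced only for this example.

```python
# Hypothetical standalone version of the key-name mapping used by keypress().
ARROW_KEYMAP = {
    "arrowdown": "down",
    "arrowleft": "left",
    "arrowright": "right",
    "arrowup": "up",
}


def normalize_keys(keys: list[str]) -> list[str]:
    """Lowercase each key name and translate arrow keys; pass others through."""
    lowered = [key.lower() for key in keys]
    return [ARROW_KEYMAP.get(key, key) for key in lowered]
```

For instance, `normalize_keys(["CTRL", "ArrowDown"])` yields `["ctrl", "down"]`, while names that are not in the map are only lowercased.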
--------------------------------------------------------------------------------
/computer-use/main.py:
--------------------------------------------------------------------------------
"""
This is a basic example of how to use the CUA model along with the Responses API.
The code runs a loop that takes screenshots and performs the actions suggested by the model.
Make sure to install the required packages before running the script.
"""

import argparse
import asyncio
import logging
import os

import cua
import local_computer
import openai


async def main():

    logging.basicConfig(level=logging.WARNING, format="%(message)s")
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)

    parser = argparse.ArgumentParser()
    parser.add_argument("--instructions", dest="instructions",
                        default="Open web browser and go to microsoft.com.",
                        help="Instructions to follow")
    parser.add_argument("--model", dest="model",
                        default="computer-use-preview")
    parser.add_argument("--endpoint", default="azure",
                        help='The endpoint to use: "azure" for Azure OpenAI, '
                             "any other value for OpenAI")
    # BooleanOptionalAction (Python 3.9+) keeps the default of True but also
    # accepts --no-autoplay; with store_true the flag could never be disabled.
    parser.add_argument("--autoplay", dest="autoplay",
                        action=argparse.BooleanOptionalAction, default=True,
                        help="Autoplay actions without confirmation")
    parser.add_argument("--environment", dest="environment", default="linux")
    args = parser.parse_args()

    if args.endpoint == "azure":
        client = openai.AsyncAzureOpenAI(
            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
            api_key=os.environ["AZURE_OPENAI_API_KEY"],
            api_version="2025-03-01-preview",
        )
    else:
        client = openai.AsyncOpenAI()

    model = args.model

    # Computer is used to take screenshots and send keystrokes or mouse clicks
    computer = local_computer.LocalComputer()

    # Scaler is used to resize the screen to a smaller size
    computer = cua.Scaler(computer, (1024, 768))

    # Agent to run the CUA model and keep track of state
    agent = cua.Agent(client, model, computer)

    # Get the user request
    if args.instructions:
        user_input = args.instructions
    else:
        user_input = input("Please enter the initial task: ")

    logger.info(f"User: {user_input}")
    agent.start_task()
    while True:
        if not user_input and agent.requires_user_input:
            logger.info("")
            user_input = input("User: ")
        await agent.continue_task(user_input)
        user_input = None
        if agent.requires_consent and not args.autoplay:
            input("Press Enter to run computer tool...")
        elif agent.pending_safety_checks and not args.autoplay:
            logger.info(f"Safety checks: {agent.pending_safety_checks}")
            input("Press Enter to acknowledge and continue...")
        if agent.reasoning_summary:
            logger.info("")
            logger.info(f"Action: {agent.reasoning_summary}")
        for action, action_args in agent.actions:
            logger.info(f" {action} {action_args}")
        if agent.messages:
            logger.info("")
            # Single quotes inside the f-string keep this valid before Python 3.12.
            logger.info(f"Agent: {''.join(agent.messages)}")


if __name__ == "__main__":
    asyncio.run(main())
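A `store_true` flag whose default is `True` cannot be switched off from the command line, which matters for an option like `--autoplay` that gates the consent and safety-check prompts. A minimal sketch of how `argparse.BooleanOptionalAction` (available from Python 3.9) behaves for such a flag follows; `build_parser` is a hypothetical helper used only for this illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with an on/off --autoplay flag that defaults to on."""
    parser = argparse.ArgumentParser()
    # BooleanOptionalAction registers both --autoplay and --no-autoplay.
    parser.add_argument(
        "--autoplay",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Autoplay actions without confirmation",
    )
    return parser
```

With this parser, omitting the flag or passing `--autoplay` gives `True`, and passing `--no-autoplay` gives `False`, so the confirmation prompts become reachable.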
--------------------------------------------------------------------------------
/computer-use/requirements.txt:
--------------------------------------------------------------------------------
openai>=1.68.2
pyautogui>=0.9.54
Pillow>=11.1.0
--------------------------------------------------------------------------------