# Awesome GUI Agent Datasets for Computer-Use and Phone-Use

A curated list of datasets for training GUI agents—AI systems that automate interactions with graphical user interfaces on computers, phones, and browsers. These datasets support tasks like clicking, typing, and navigating visual elements, making them essential for researchers and developers advancing AI agent training and GUI automation. Sorted by year (most recent first), each entry includes the dataset name, a brief description, a data summary, and a URL where available.

---

## 2025

- **Aria-UI/Aria-UI_Data**
  - *Description*: A comprehensive collection of GUI grounding data covering web, mobile, and desktop interfaces, designed for versatile grounding instruction understanding and context-aware grounding.
  - *Data*: Web Data (2.9M instructions, 173k images), Mobile Data (1.1M instructions, 104k images from AMEX), Desktop Data (150k instructions, 7.8k images from Ubuntu).
  - *URL*: [https://huggingface.co/datasets/Aria-UI/Aria-UI_Data](https://huggingface.co/datasets/Aria-UI/Aria-UI_Data)

- **Multimodal-Mind2Web**
  - *Description*: A multimodal version of Mind2Web, pairing HTML documents with corresponding website screenshots to support the development of general-purpose web agents.
  - *Data*: 7,775 actions from 1,009 training tasks; 1,339 actions from 177 test tasks (same website), 1,019 actions from 142 test tasks (new website), and 4,060 actions from 694 test tasks (new domain). Includes HTML and screenshot data.
  - *URL*: [https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web](https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web) (see the loading sketch at the end of this section)

- **GUIMid**
  - *Description*: A consolidated mid-training dataset designed to enhance the foundational agentic capabilities of Vision Language Models (VLMs) for GUI tasks by leveraging data from adjacent, non-GUI domains.
  - *Data*: 300,000 samples, combining MathInstruct (150k), CodeI/O (20k), Olympiad Math (50k), and Multi-modal Math (80k).
  - *URL*: [https://github.com/hkust-nlp/GUIMid](https://github.com/hkust-nlp/GUIMid)

- **STEVE (Windows OS dataset)**
  - *Description*: A private Windows OS dataset for UI grounding, collected through a Windows virtual machine, OmniParser, screenshots, and accessibility tree data. It aims to train VLMs specialized in UI grounding.
  - *Data*: 10,000 desktop images and 80,000 UI elements, augmented with publicly available AITW data.
  - *URL*: [https://github.com/FanbinLu/STEVE](https://github.com/FanbinLu/STEVE)

- **Aguvis Data Collection (xlangai/aguvis-stage1 & xlangai/aguvis-stage2)**
  - *Description*: A large-scale cross-platform dataset of GUI agent trajectories featuring multimodal grounding and reasoning annotations, including inner monologue. Stage 1 focuses on GUI grounding, while Stage 2 covers trajectory training with planning and reasoning across web, desktop, and mobile.
  - *Data*: Previews show image filenames and conversations pairing human instructions with model-generated `pyautogui` actions (the full dataset viewer is currently unavailable). Associated with arXiv:2412.04454.
  - *URL*: [https://huggingface.co/datasets/xlangai/aguvis-stage1](https://huggingface.co/datasets/xlangai/aguvis-stage1), [https://huggingface.co/datasets/xlangai/aguvis-stage2](https://huggingface.co/datasets/xlangai/aguvis-stage2)

- **OS-Genesis**
  - *Description*: An interaction-driven pipeline that synthesizes high-quality, diverse GUI agent trajectory data without human supervision or predefined tasks, using reverse task synthesis and a trajectory reward model.
  - *Data*: Includes raw collected triples, complete trajectory data on Hugging Face, and screenshots and texts with Set-of-Mark (SoM) information. Contains both mobile and web data, with Google Drive links for screenshots and JSON data.
  - *URL*: [https://github.com/OS-Copilot/OS-Genesis](https://github.com/OS-Copilot/OS-Genesis)

- **Mobile-R1 (PG23/Mobile-R1)**
  - *Description*: A high-quality dataset for training and using VLM-based mobile agents, focusing on Chinese mobile applications.
  - *Data*: 1,007 trajectories across 28 different mobile applications, with 3,924 total steps. Includes screenshots and `data.jsonl` with full interaction trajectories, action history, instructions, system prompts, image dimensions, and detailed action parameters.
  - *URL*: [https://mobile-r1.github.io/Mobile-R1/](https://mobile-r1.github.io/Mobile-R1/)

- **ShowUI_desktop**
  - *Description*: A vision–language–action dataset designed to improve desktop GUI element grounding across diverse applications, featuring rich bounding-box and keypoint annotations.
  - *Data*: ~7,500 desktop screenshots spanning 15 applications; 8,000 element annotations with appearance, spatial-relation, and intent queries; relative bounding boxes and action keypoints for each element.
  - *URL*: [https://huggingface.co/datasets/Voxel51/ShowUI_desktop](https://huggingface.co/datasets/Voxel51/ShowUI_desktop)

- **GUI-Robust**
  - *Description*: A comprehensive dataset explicitly crafted to evaluate GUI agent performance under seven real-world anomaly conditions (e.g., occlusion, dynamic content changes). It employs a semi-automated collection paradigm combining RPA-tool recordings and MLLM-assisted annotation.
  - *Data*: 10,000+ action sequences with seven anomaly types, task descriptions, stepwise instructions, screenshots, and grounding metadata; generated via a human-in-the-loop pipeline that reduces annotation time by 19×.
  - *URL*: [https://github.com/chessbean1/GUI-Robust](https://github.com/chessbean1/GUI-Robust)

- **Mind2Web 2**
  - *Description*: A benchmark of 130 realistic, high-quality, long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. It introduces an Agent-as-a-Judge framework to evaluate time-varying and complex answers.
  - *Data*: 130 realistic, high-quality, long-horizon web tasks requiring real-time browsing and information synthesis.
  - *URL*: [https://huggingface.co/datasets/osunlp/Mind2Web-2](https://huggingface.co/datasets/osunlp/Mind2Web-2)

- **Online-Mind2Web**
  - *Description*: An online version of Mind2Web, featuring 300 high-quality tasks from 136 popular websites across domains like clothing, food, housing, and transportation, designed to evaluate web agent performance in real-world online environments.
  - *Data*: 300 tasks from 136 websites, with fields including task_id (str), website (str), task_description (str), and reference_length (int).
  - *URL*: [https://github.com/OSU-NLP-Group/Online-Mind2Web](https://github.com/OSU-NLP-Group/Online-Mind2Web)

- **LearnGUI**
  - *Description*: A dataset for studying demonstration-based learning in mobile GUI agents, improving performance in unseen scenarios.
  - *Data*: 2,252 offline and 101 online tasks across 73 apps, with high-quality human demonstrations, screenshots, and action sequences.
  - *URL*: [https://huggingface.co/datasets/lgy0404/LearnGUI](https://huggingface.co/datasets/lgy0404/LearnGUI)

- **AndroidInteraction**
  - *Description*: Focuses on user interaction needs and notifications in phone UI automation, enabling agent-initiated interactions.
  - *Data*: Over 750 demonstrations across 250+ apps, including action-observation pairs with screenshots and accessibility metadata.
  - *URL*: [https://arxiv.org/abs/2503.19537](https://arxiv.org/abs/2503.19537)

- **WorldGUI**
  - *Description*: An interactive benchmark for desktop GUI automation, supporting tasks across multiple applications from any starting point.
  - *Data*: 611 tasks across 10 desktop and web apps, with user queries, instructional videos, and project files.
  - *URL*: [https://github.com/showlab/WorldGUI](https://github.com/showlab/WorldGUI)

- **DeskVision**
  - *Description*: A large-scale desktop region-captioning dataset for advanced GUI agents, improving visual element understanding.
  - *Data*: Large-scale desktop GUI data with rich annotations for diverse UI systems and elements, produced with automated captioning.
  - *URL*: [https://arxiv.org/abs/2503.11170](https://arxiv.org/abs/2503.11170)

- **GUI-Lasagne**
  - *Description*: A multi-level, large-scale dataset for training agents like SpiritSight, focusing on GUI understanding and grounding.
  - *Data*: 5.73M samples, 2.24M screenshots, and 57.8M elements, with a three-tier structure for image-text alignment and navigation.
  - *URL*: [https://arxiv.org/abs/2503.03196](https://arxiv.org/abs/2503.03196)

- **TongUI / GUI-Net**
  - *Description*: Builds generalized GUI agents by learning from multimodal web tutorials across multiple operating systems.
  - *Data*: 143K trajectory data points (a larger 1M-scale variant is also reported) across 5 operating systems and 200+ apps, with multimodal web instructions, text, and screenshots.
  - *URL*: [https://tongui-agent.github.io/](https://tongui-agent.github.io/)

- **ScreenSpot-Pro**
  - *Description*: A benchmark for GUI grounding in high-resolution professional environments, aimed at multimodal LLMs.
  - *Data*: 1,581 task data points across 23 industries, with high-resolution full-screen images, natural language instructions, and bounding box annotations.
  - *URL*: [https://huggingface.co/datasets/Voxel51/ScreenSpot-Pro](https://huggingface.co/datasets/Voxel51/ScreenSpot-Pro)

- **WebGames**
  - *Description*: A benchmark of interactive web-based challenges for evaluating general-purpose web-browsing agents on GUI tasks.
  - *Data*: Over 50 unique interactive challenges in JSONL format, with 160px × 210px environments and text-based goals.
  - *URL*: [https://github.com/convergence-ai/webgames](https://github.com/convergence-ai/webgames)

- **Explorer**
  - *Description*: The largest-scale web trajectory dataset to date, built by dynamically exploring web pages to create contextually relevant tasks.
  - *Data*: 94K successful web trajectories, 49K unique URLs, and 720K screenshots, generated by a multi-agent LLM pipeline.
  - *URL*: [https://arxiv.org/abs/2502.11357](https://arxiv.org/abs/2502.11357)

- **InSTA**
  - *Description*: An Internet-scale dataset for training GUI-based web agents, generated entirely through an automated LLM pipeline without human annotations.
  - *Data*: Covers 150k diverse websites sourced from Common Crawl and includes rich web navigation tasks, trajectories as Playwright API calls, and evaluations using LLM-based judges.
  - *URL*: [https://huggingface.co/datasets/data-for-agents/insta-150k-v3](https://huggingface.co/datasets/data-for-agents/insta-150k-v3)

- **VideoCAD**
  - *Description*: A large-scale synthetic dataset for learning UI interactions and 3D reasoning from CAD software, consisting of annotated video recordings of CAD operations.
  - *Data*: Over 41K annotated video recordings of CAD operations, offering significantly higher complexity in UI interaction learning for real-world engineering tasks (up to 20× longer time horizons than other datasets).
  - *URL*: [https://arxiv.org/html/2505.24838v1](https://arxiv.org/html/2505.24838v1)

- **AutomotiveUI-Bench-4K**
  - *Description*: An open-source dataset for understanding and interacting with automotive infotainment systems, enabling adaptation across different UI designs. It serves as a validation benchmark for automotive UI.
  - *Data*: 998 images with 4,208 annotations focusing on interaction with in-vehicle infotainment (IVI) systems. Images are primarily photographs of IVI displays (due to screenshot limitations) plus some direct screenshots (e.g., Android Auto). Annotation classes include "Test Action" (bounding box + imperative command) and "Expected Result" (bounding box + expected outcome + pass/fail status). Covers 15 automotive brands/OEMs (2018–2025 models); IVI UIs are in German and English, annotations in English.
  - *URL*: [https://paperswithcode.com/dataset/automotiveui-bench-4k](https://paperswithcode.com/dataset/automotiveui-bench-4k)

- **OS-Atlas**
  - *Description*: A foundational GUI action model excelling at GUI grounding and out-of-distribution (OOD) agentic tasks, leveraging a large open-source cross-platform GUI grounding corpus.
  - *Data*: Over 13 million GUI elements across mobile, desktop, and web platforms, synthesized using a specialized toolkit. Includes extensive evaluation across six benchmarks.
  - *URL*: [https://osatlas.github.io/](https://osatlas.github.io/), [https://huggingface.co/papers/2410.23218](https://huggingface.co/papers/2410.23218)

- **AGUVIS**
  - *Description*: A unified pure vision-based framework for autonomous GUI agents that operates across web, desktop, and mobile platforms, using vision-based observations and a consistent action space for better generalization.
  - *Data*: The AGUVIS dataset is split into two stages:
    - **Stage 1 (Grounding)**: 4.2 million samples for computer/mobile grounding training, including multimodal grounding annotations.
    - **Stage 2 (Planning and Reasoning)**: 1.3 million GUI agent trajectories with reasoning annotations across web, desktop, and mobile platforms.
  - *URL*: [https://github.com/xlang-ai/aguvis](https://github.com/xlang-ai/aguvis), [https://huggingface.co/datasets/xlangai/aguvis-stage1](https://huggingface.co/datasets/xlangai/aguvis-stage1), [https://huggingface.co/datasets/xlangai/aguvis-stage2](https://huggingface.co/datasets/xlangai/aguvis-stage2)
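Many of the Hugging Face-hosted datasets in this list can be pulled directly with the `datasets` library. Below is a minimal loading sketch using the Multimodal-Mind2Web hub id from this section; split names, configs, and column schemas differ per dataset, so the sketch inspects them rather than hard-coding any fields.

```python
from datasets import load_dataset

# Hub id taken from the Multimodal-Mind2Web entry above; substitute any other
# Hugging Face dataset id from this list. Splits and configs vary per dataset,
# so inspect them instead of assuming a fixed schema.
ds = load_dataset("osunlp/Multimodal-Mind2Web")
print(ds)  # available splits and their row counts

split_name = next(iter(ds))         # first split name in the DatasetDict
print(ds[split_name].features)      # column names and feature types
print(ds[split_name][0].keys())     # keys of a single example
```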
---

## 2024

- **MultiUI**
  - *Description*: A large-scale dataset designed to enhance GUI agents' text-rich visual understanding. It uses structured accessibility trees to generate high-quality multimodal instructions.
  - *Data*: 7.3 million multimodal instruction samples collected from 1 million websites, covering key web UI tasks such as element grounding, action prediction, and interaction modeling.
  - *URL*: [https://huggingface.co/datasets/neulab/MultiUI](https://huggingface.co/datasets/neulab/MultiUI)

- **Mind2Web-Live**
  - *Description*: A dataset focused on dynamic evaluation using "key nodes," which represent critical intermediate states in web tasks.
  - *Data*: 542 tasks with 4,550 detailed annotation steps, annotated by human experts.
  - *URL*: [https://huggingface.co/datasets/iMeanAI/Mind2Web-Live](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live)

- **WebVLN**
  - *Description*: Expands web GUI tasks by combining navigation with question answering, giving agents text-based queries that guide them to locate relevant web pages and extract information.
  - *Data*: 8,990 navigation paths and 14,825 QA pairs, leveraging both HTML and visual content from websites.
  - *URL*: [https://drive.google.com/drive/folders/1Gzm44P5QBxvBYUU4BiYW-WlxbiB5M19K?usp=sharing](https://drive.google.com/drive/folders/1Gzm44P5QBxvBYUU4BiYW-WlxbiB5M19K?usp=sharing)

- **WebLINX**
  - *Description*: Focuses on conversational GUI agents, emphasizing real-world web navigation through multi-turn dialogue.
  - *Data*: Over 2,300 expert demonstrations with more than 100,000 interactions across 155 real-world websites, including DOM trees and screenshots.
  - *URL*: [https://huggingface.co/datasets/McGill-NLP/WebLINX](https://huggingface.co/datasets/McGill-NLP/WebLINX)

- **AgentTrek**
  - *Description*: Synthesizes high-quality trajectory data by leveraging web tutorials.
  - *Data*: 4,902 trajectories with task metadata, step-by-step instructions, action sequences, visual observations, and reproducible native traces.
  - *URL*: [https://agenttrek.github.io/](https://agenttrek.github.io/)

- **ScreenAI**
  - *Description*: Extends data collection to both mobile and desktop interfaces, covering tasks such as screen annotation, question answering, and navigation.
  - *Data*: Hundreds of millions of annotated samples.
  - *URL*: [https://github.com/google-research-datasets/screen_qa](https://github.com/google-research-datasets/screen_qa)

- **VisualAgentBench**
  - *Description*: A cross-platform benchmark designed to assess GUI agents in both mobile and web settings, emphasizing interaction-focused tasks.
  - *Data*: Not summarized in detail here; tasks run in interactive environments such as Android Virtual Device and WebArena-Lite.
  - *URL*: [https://github.com/THUDM/VisualAgentBench](https://github.com/THUDM/VisualAgentBench)

- **VGA**
  - *Description*: A Visual Question Answering (VQA) dataset designed to minimize hallucinations in VLMs for GUI understanding by balancing attention between image and text inputs. It aims to improve accuracy in extracting information from images and aligning with human intent.
  - *Data*: 63.8k high-quality VQA examples, developed using a "Referent Method," focusing on responses tied to visual content. Based on the Rico dataset.
  - *URL*: [https://github.com/Linziyang1999/VGA-visual-GUI-assistant](https://github.com/Linziyang1999/VGA-visual-GUI-assistant)

- **VideoGUI**
  - *Description*: A multi-modal benchmark sourced from professional instructional videos, focusing on visually intensive, multi-step software tasks. It evaluates agents across hierarchical levels—high-level planning, middle-level planning, and atomic action execution.
  - *Data*: 86 complex tasks (avg. 22.7 actions each), 463 subtasks, and 2.7K manually annotated actions with precise element locations, covering 11 software applications such as Adobe Photoshop and Stable Diffusion WebUI; includes video-derived screenshots and action transcripts.
  - *URL*: [https://github.com/showlab/videogui](https://github.com/showlab/videogui)

- **MobileViews**
  - *Description*: A large-scale mobile GUI dataset with over 600,000 screenshot–view hierarchy pairs from 20,000+ Android apps.
  - *Data*: 600,000+ screenshot–view hierarchy pairs, including .jpg screenshots, .json/.xml view hierarchies, and action logs in .zip/.parquet formats.
  - *URL*: [https://huggingface.co/datasets/mllmTeam/MobileViews](https://huggingface.co/datasets/mllmTeam/MobileViews)

- **AMEX (Android Multi-annotation EXpo)**
  - *Description*: Over 104,000 high-resolution screenshots from 110 popular mobile apps with detailed annotations.
  - *Data*: 104,000+ high-resolution screenshots, 711,000 element-level functions, and 3,000 unique instructions with multi-step GUI action sequences.
  - *URL*: [https://huggingface.co/datasets/Yuxiang007/AMEX](https://huggingface.co/datasets/Yuxiang007/AMEX)

- **AndroidControl**
  - *Description*: 15,283 demonstrations of daily tasks across 833 Android apps, used to study the effects of data scale.
  - *Data*: 15,283 task demonstrations covering 14,548 unique tasks, including high- and low-level instructions, screenshots, and accessibility trees in TFRecords (see the TFRecord sketch at the end of this section).
  - *URL*: [https://huggingface.co/datasets/HarrytheOrange/parsed_AndroidControl](https://huggingface.co/datasets/HarrytheOrange/parsed_AndroidControl)

- **B-MoCA**
  - *Description*: A benchmark for evaluating mobile control agents across diverse device configurations.
  - *Data*: 131 daily tasks and 60 real-world tasks across varied device configurations, with interactive environment data.
  - *URL*: [https://github.com/gimme1dollar/b-moca](https://github.com/gimme1dollar/b-moca)

- **MobileAgentBench**
  - *Description*: A user-friendly benchmark with 100 tasks across 10 open-source apps for testing mobile LLM agents.
  - *Data*: 100 tasks across 10 open-source Android apps, categorized by difficulty, with real-device execution support.
  - *URL*: [https://MobileAgentBench.github.io](https://MobileAgentBench.github.io)

- **Mobile3M**
  - *Description*: A dataset for mobile app understanding with 3 million examples for training GUI agents.
  - *Data*: 3M UI pages and real-world transitions from 49 popular Chinese apps, with XML documents and directed-graph structures in Parquet format.
  - *URL*: [https://huggingface.co/datasets/xwk123/Mobile3M](https://huggingface.co/datasets/xwk123/Mobile3M)

- **OmniACT**
  - *Description*: A dataset for multimodal generalist agents performing tasks on desktop and web interfaces.
  - *Data*: Over 9,800 image-instruction pairs, with natural language tasks and PyAutoGUI-executable commands.
  - *URL*: [https://arxiv.org/abs/2402.17553](https://arxiv.org/abs/2402.17553)

- **VisualWebArena**
  - *Description*: A benchmark for agents that interact with web interfaces using visual inputs.
  - *Data*: Built on WebArena; includes image-based tasks with executable evaluations, focusing on multimodal web interactions.
  - *URL*: [https://github.com/web-arena-x/visualwebarena](https://github.com/web-arena-x/visualwebarena)

- **WebCanvas**
  - *Description*: A framework and dataset for building and evaluating web agents in live online web environments.
  - *Data*: Raw challenge data including HTML, screenshots, DOM trees, accessibility trees (AXTrees), videos, and recorded actions for online web environments.
  - *URL*: [https://github.com/iMeanAI/WebCanvas](https://github.com/iMeanAI/WebCanvas)

- **VisualWebBench**
  - *Description*: A benchmark for evaluating visual web interaction tasks.
  - *Data*: Image and text data from websites, designed for evaluating multimodal LLM web understanding and grounding.
  - *URL*: [https://github.com/VisualWebBench/VisualWebBench](https://github.com/VisualWebBench/VisualWebBench)

- **CRAB**
  - *Description*: A cross-environment benchmark for multimodal agents, supporting Ubuntu and Android tasks.
  - *Data*: 120 tasks in CRAB Benchmark-v0, with multimodal observations (screenshots) and Python-based task definitions and evaluations.
  - *URL*: [https://crab.camel-ai.org/](https://crab.camel-ai.org/)

- **GUI-World**
  - *Description*: A comprehensive GUI dataset with over 12K videos and 100K queries for evaluating multimodal LLM-based agents.
  - *Data*: 12K+ videos and 100K queries with annotated keyframes, detailed captions, and diverse QA types in JSON/Parquet formats.
  - *URL*: [https://gui-world.github.io/](https://gui-world.github.io/)

- **GUICourse (GUIEnv)**
  - *Description*: Features 10M page-caption pairs for training vision-based GUI agents across web and mobile.
  - *Data*: 10M page-caption pairs and 0.7M region-text QA pairs in JSON/Parquet, covering OCR, grounding, and navigation.
  - *URL*: [https://github.com/yiye3/GUICourse](https://github.com/yiye3/GUICourse)

- **GUICourse (GUIAct)**
  - *Description*: Contains 67K single-step and 15K multi-step instructions for GUI actions.
  - *Data*: 67K single-step and 15K multi-step action instructions in JSON/Parquet, spanning web and Android scenarios.
  - *URL*: [https://github.com/yiye3/GUICourse](https://github.com/yiye3/GUICourse)

- **GUICourse (GUIChat)**
  - *Description*: Offers 44K single-turn QAs and 6K multi-turn dialogues for GUI interactions.
  - *Data*: 44K single-turn QAs and 6K multi-turn dialogues with text-rich images and bounding box annotations in JSON/Parquet.
  - *URL*: [https://github.com/yiye3/GUICourse](https://github.com/yiye3/GUICourse)

- **AutoUI**
  - *Description*: Leverages AITW to evaluate Auto-GUI, an LLM-based task automation system for Android.
  - *Data*: Uses AITW's 715,000 episodes and 30,000 unique instructions, with natural language instructions, screenshots, and actions.
  - *URL*: [https://github.com/cooelf/Auto-GUI](https://github.com/cooelf/Auto-GUI)

- **AndroidWorld**
  - *Description*: An environment for building and benchmarking autonomous computer control agents.
  - *Data*: 116 diverse tasks across 20 real-world Android apps, with dynamic task initialization for millions of variants.
  - *URL*: [https://github.com/google-research/android_world](https://github.com/google-research/android_world)

- **RICOSCA**
  - *Description*: A dataset of synthetic natural language commands grounded to UI elements on Rico screens, used for mapping instructions to mobile UI actions.
  - *Data*: 295,476 synthetic one-step commands for 177,962 objects across 25,677 Android screens, derived from the Rico dataset.
  - *URL*: [https://huggingface.co/datasets/rootsautomation/RICO-SCA](https://huggingface.co/datasets/rootsautomation/RICO-SCA)

- **WebVoyager**
  - *Description*: A dataset for training agents to navigate and interact with web environments.
  - *Data*: Browser screen perceptions as pixels, with mouse/keyboard actions, tested on 15 real-world websites such as Amazon and GitHub.
  - *URL*: [https://arxiv.org/abs/2401.13919](https://arxiv.org/abs/2401.13919)

- **ScreenAgent**
  - *Description*: A dataset for training agents to interact with screen-based interfaces.
  - *Data*: Details not fully specified; includes data for vision-language-model-driven computer control, released under the Apache-2.0 license.
  - *URL*: [https://github.com/niuzaisheng/ScreenAgent](https://github.com/niuzaisheng/ScreenAgent)

- **OpenDFM/MoGUI**
  - *Description*: A dataset for mobile GUI interaction, forming part of the MoGUI and MoCon projects. It provides data for conversational agents on mobile GUIs.
  - *Data*: Over 2.6 million GUI screens from more than 250,000 applications; each screen includes a screenshot and XML metadata, and each application includes a navigation graph that explicitly captures the transition relationships between its GUIs, collected during crawling.
  - *URL*: [https://huggingface.co/datasets/OpenDFM/MoGUI](https://huggingface.co/datasets/OpenDFM/MoGUI)

- **OpenDFM/MoCon**
  - *Description*: A dataset for mobile GUI interaction, also part of the MoGUI and MoCon projects.
  - *Data*: MoCon data released on March 1, 2024; see the associated technical report for a detailed data summary.
  - *URL*: [https://huggingface.co/datasets/OpenDFM/MoCon](https://huggingface.co/datasets/OpenDFM/MoCon)

- **OpenDFM/MobA-MobBench**
  - *Description*: A benchmark dataset for evaluating mobile phone agents, supporting both English and Chinese.
  - *Data*: 50 tasks (rows) across 11 columns, including task IDs, types, descriptions (EN/ZH), involved applications, preparation steps, scoring milestones, and human expert steps. The dataset size is small (<1KB).
  - *URL*: [https://huggingface.co/datasets/OpenDFM/MobA-MobBench](https://huggingface.co/datasets/OpenDFM/MobA-MobBench)

- **WebUI (biglab/webui-all)**
  - *Description*: A large dataset of rendered web pages paired with automatically extracted metadata, created by crawling the web to enhance visual UI understanding with web semantics.
  - *Data*: Contains 400,000 web UIs. The Hugging Face version is a filtered subset of the original dataset; the raw dataset is available on Google Drive.
  - *URL*: [https://huggingface.co/datasets/biglab/webui-all](https://huggingface.co/datasets/biglab/webui-all)
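Several of the Android datasets above store episodes as TFRecords (e.g., AndroidControl here and AITW in the 2023 section). Below is a minimal inspection sketch; the shard filename is a placeholder, and the feature keys are printed rather than assumed, since each dataset defines its own schema.

```python
import tensorflow as tf

# Placeholder filename; point this at a downloaded shard of AndroidControl,
# AITW, or any other TFRecord-based GUI dataset.
path = "android_episodes-00000-of-00001.tfrecord"

for raw_record in tf.data.TFRecordDataset([path]).take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    # Each dataset defines its own features (screenshots, instructions,
    # actions, ...), so list the keys instead of hard-coding them.
    print(sorted(example.features.feature.keys()))
```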
---

## 2023

- **UEyes**
  - *Description*: An eye-tracking dataset for understanding visual saliency across various user interfaces.
  - *Data*: Eye-tracking data from 62 participants and 1,980 UI screenshots, with raw gaze logs (.csv), saliency maps (.png), and scan paths (.png).
  - *URL*: [https://github.com/YueJiang-nj/UEyes-CHI2023](https://github.com/YueJiang-nj/UEyes-CHI2023)

- **Android in the Wild (AITW)**
  - *Description*: A large-scale dataset with 715,142 episodes for Android device control across 30,378 unique instructions.
  - *Data*: 715,142 episodes and 30,378 unique instructions across 159 apps and 198+ websites, with 5.6M+ RGB screenshots and action sequences in TFRecord format.
  - *URL*: [https://github.com/google-research/google-research/tree/master/android_in_the_wild](https://github.com/google-research/google-research/tree/master/android_in_the_wild)

- **GUI Odyssey**
  - *Description*: A dataset for evaluating GUI agents across diverse tasks and environments.
  - *Data*: 7,735 episodes from 6 devices, covering 6 multi-app task types, 201 apps, and 1,400 unique app combinations.
  - *URL*: [https://github.com/OpenGVLab/GUI-Odyssey](https://github.com/OpenGVLab/GUI-Odyssey)

- **Mobile-Env**
  - *Description*: A dataset for training agents to interact with mobile apps in simulated environments.
  - *Data*: WikiHow task set with screenshots, view hierarchies, and touch/type token actions in ProtoBuf 3 format.
  - *URL*: [https://github.com/X-LANCE/Mobile-Env](https://github.com/X-LANCE/Mobile-Env)

- **Mind2Web**
  - *Description*: A dataset for training agents to interact with web pages using natural language.
  - *Data*: 2,350 tasks across 137 real-world websites in 31 domains, with HTML inputs and action sequences (Click, Type, Select).
  - *URL*: [https://osu-nlp-group.github.io/Mind2Web/](https://osu-nlp-group.github.io/Mind2Web/)

- **WebArena**
  - *Description*: A dataset and environment for training agents to perform tasks on web pages.
  - *Data*: 812 long-horizon web tasks from 241 templates, with natural language intents, HTML/DOM trees, screenshots, and keyboard/mouse actions.
  - *URL*: [https://webarena.dev/](https://webarena.dev/)

- **Synapse**
  - *Description*: A dataset for training agents to perform tasks across multiple applications.
  - *Data*: 100,000 synthetic demonstrations across 21 domains, with Python programs, natural language plans, CoT reasoning, and HTML snippets.
  - *URL*: [https://ltzheng.github.io/Synapse](https://ltzheng.github.io/Synapse)

- **ASSISTGUI**
  - *Description*: A benchmark for evaluating GUI agents on task-oriented desktop automation in widely used productivity software.
  - *Data*: 100 tasks across 9 widely used productivity applications, with the project files needed for task execution.
  - *URL*: [https://showlab.github.io/assistgui/](https://showlab.github.io/assistgui/)

---

## 2022

- **META-GUI**
  - *Description*: A benchmark for GUI-based task-oriented dialogue systems with 1,125 dialogues across six domains.
  - *Data*: 1,125 dialogues (4,684 turns) and 18,337 action prediction data points with screenshots, XML view hierarchies, and GUI actions.
  - *URL*: [https://x-lance.github.io/META-GUI-Leaderboard/](https://x-lance.github.io/META-GUI-Leaderboard/)

- **UGIF**
  - *Description*: A multilingual dataset for UI-grounded instruction following on smartphone interfaces.
  - *Data*: 523 multilingual natural language instructions with UI screen-action sequences, supporting 8 languages and including XML view hierarchies.
  - *URL*: [https://arxiv.org/abs/2211.07615](https://arxiv.org/abs/2211.07615)

- **WebShop**
  - *Description*: A dataset and environment for training autonomous agents in online shopping with 1.18 million real-world products (see the interaction sketch below).
  - *Data*: 1.18M real-world products, 12,087 crowdsourced text instructions, and 1,600 human demonstrations in OpenAI Gym format.
  - *URL*: [https://webshop-pnlp.github.io](https://webshop-pnlp.github.io)
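Since WebShop exposes its tasks in OpenAI Gym format, interaction follows the standard reset/step loop. The sketch below is illustrative only: the environment id and the text-action format are assumptions, and it presumes the WebShop package is installed and has registered its environments; check the WebShop repository for the exact interface.

```python
import gym

# Assumes WebShop is installed and its Gym environments are registered;
# the environment id and action string below are illustrative, not verified.
env = gym.make("WebAgentTextEnv-v0")
observation = env.reset()

# WebShop-style text actions typically look like search[...] or click[...];
# a real policy would derive the action from `observation`.
observation, reward, done, info = env.step("search[running shoes]")
print(reward, done)
env.close()
```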
---

## 2021

- **UIBert (AppSim & RefExp)**
  - *Description*: A dataset for understanding and generating UI descriptions.
  - *Data*: Built on Rico's 72,000 UI data points, with AppSim (similar UI element pairs) and RefExp (referring expressions for UI elements) in TFRecords.
  - *URL*: [https://github.com/google-research-datasets/uibert](https://github.com/google-research-datasets/uibert)

- **AndroidEnv**
  - *Description*: A simulated environment for testing Android GUI agents.
  - *Data*: 100 example tasks with RGB pixel observations, (x, y) action spaces, and support for custom task extensions.
  - *URL*: [https://github.com/deepmind/android_env](https://github.com/deepmind/android_env)

- **Screen Annotation**
  - *Description*: A dataset for generating concise language descriptions of mobile screens.
  - *Data*: 22,417 mobile screenshots with 15,743 training, 2,364 validation, and 4,310 test annotations in CSV format, derived from Rico.
  - *URL*: [https://github.com/google-research-datasets/screen2words](https://github.com/google-research-datasets/screen2words)

- **MoTIF (Mobile app Tasks with Iterative Feedback)**
  - *Description*: A dataset for training agents to perform tasks on mobile apps with user feedback.
  - *Data*: Over 6,100 free-form natural language commands across 125 Android apps, with action coordinates, screenshots, and feasibility annotations.
  - *URL*: [https://vigilworkshop.github.io/static/papers-2021/26.pdf](https://vigilworkshop.github.io/static/papers-2021/26.pdf)

---

## 2020

- **PixelHelp**
  - *Description*: Features 187 multi-step instructions for common tasks on Google Pixel phones.
  - *Data*: 187 multi-step instructions across 4 task categories (general, Gmail, Chrome, Photos) with human-annotated step-by-step actions.
  - *URL*: [https://arxiv.org/abs/2005.03776](https://arxiv.org/abs/2005.03776)

- **ANDROIDHOWTO**
  - *Description*: A dataset for training agents to follow step-by-step instructions on Android devices.
  - *Data*: 32,436 data points from 9,893 unique "How-to" instructions, with 190K action and 172K object segments in JSON/TFRecords.
  - *URL*: [https://github.com/debymf/generating_android_howto](https://github.com/debymf/generating_android_howto)

---

## 2018

- **RICO**
  - *Description*: Contains 72,000 unique Android app UIs, a foundational dataset for mobile GUI research.
  - *Data*: 72,000 Android app UI screenshots with view hierarchies and annotations, widely used for mobile GUI studies.
  - *URL*: [https://www.kaggle.com/datasets/onurgunes1993/rico-dataset](https://www.kaggle.com/datasets/onurgunes1993/rico-dataset)

---

This Awesome List is a comprehensive resource for GUI agent datasets, covering mobile, desktop, and web environments. Contributions are welcome to keep it updated with the latest advancements in computer interaction datasets!