16 |
17 | 1Harbin Institute of Technology, Shenzhen, 2Huawei Noah’s Ark Lab
18 | *Corresponding author
19 |
20 | Annual Meeting of the Association for Computational Linguistics (**ACL**) 2025
21 |
22 | [[Paper]](https://arxiv.org/abs/2505.16827) [[Code]](https://github.com/JiuTian-VL/GUI-explorer) [[Project Page]](https://xieincz.github.io/GUI-explorer.github.io/)
23 |
24 | :fire: Details will be released. Stay tuned :beers: :+1:
25 |
26 |
27 |
28 |
29 | ## If you find this work useful for your research, please kindly cite our paper and star our repo.
30 |
31 | ## Updates
32 |
33 | - [05/2025] [Project Page](https://xieincz.github.io/GUI-explorer.github.io/) released.
34 | - [05/2025] [Arxiv paper](https://arxiv.org/abs/2505.16827) released.
35 | - [05/2025] [Code](https://github.com/JiuTian-VL/GUI-explorer) released.
36 |
37 | ## Introduction
38 |
 39 | This is the official GitHub repository of *GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent*. In this work, we propose GUI-explorer, which synergizes two key components: (1) autonomous exploration of function-aware trajectories; (2) unsupervised mining of transition-aware knowledge.
40 |
41 | The overview of the proposed GUI-explorer:
42 |
43 |
44 |
45 |
46 |
47 | ## Installation
48 |
49 | ### Download
50 |
51 | ```bash
52 | git clone https://github.com/JiuTian-VL/GUI-explorer.git
53 | cd GUI-explorer
54 | mkdir knowledge_base
55 | cd knowledge_base
56 | wget https://github.com/JiuTian-VL/GUI-explorer/releases/download/knowledge_base/knowledge_data.pkl
57 | ```
58 |
59 | ### Environment
60 |
61 | ```bash
62 | cd GUI-explorer
63 | conda create -n GUI_explorer python=3.12 -y
64 | conda activate GUI_explorer
65 | pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
66 | pip install -r requirements.txt
67 | ```
68 |
 69 | Duplicate `.env.example` and rename it to `.env`. Then, in the `.env` file, fill in your `OPENAI_API_KEY`. Optional variables read by the code include `OPENAI_BASE_URL`, `OPENAI_API_MODEL`, `RAG_SERVER_ENDPOINT`, and `KNOWLEDGE_BASE_ABSOLUTE_ROOT_PATH` (see `utils/utils.py` and `utils/retrieval.py`).
70 |
71 | ## Usage
72 |
 73 | ### Prepare API servers
74 |
75 | ```bash
76 | # Open a new shell window and run
77 | cd GUI-explorer
78 | conda activate GUI_explorer
79 | python -m utils.embedding_pipeline
80 |
 81 | # Open a new shell window and run (wait for embedding_pipeline to finish starting up first)
82 | cd GUI-explorer
83 | conda activate GUI_explorer
84 | python -m utils.retrieval
85 | ```
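Once both servers are running, you can sanity-check retrieval directly from Python. Below is a minimal sketch using the `retrieval_api` helper from `utils/retrieval.py`; the screenshot path and package name are placeholders:

```python
from PIL import Image

from utils.retrieval import retrieval_api

# Query the retrieval server (default endpoint: http://localhost:8769,
# overridable via RAG_SERVER_ENDPOINT).
screenshot = Image.open("screenshot.png")  # placeholder: any UI screenshot or element crop
results = retrieval_api(screenshot, top_k=3, threshold=0.9, package_name="net.osmand")
for r in results:  # each hit carries the mined knowledge text and its similarity score
    print(r["similarity"], r["knowledge"])
```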
86 |
87 | #### Exploration
88 |
89 | ```bash
 90 | # After preparing the API servers
91 | cd GUI-explorer
92 | conda activate GUI_explorer
93 | python exploration_and_mining.py -device_serial emulator-5554 -max_branching_factor 10 -max_exploration_steps 30 -max_exploration_depth 5 -package_name net.osmand
 94 | # After the knowledge base is updated, restart `python -m utils.retrieval` to load the new knowledge_base
95 | ```
96 |
 97 | `device_serial` can be obtained by running `adb devices`. (If no device is listed, follow the `Setup` section in [this tutorial](https://github.com/ai-agents-2030/SPA-Bench/blob/main/Documentation.md#setup).)
98 |
99 | `package_name` can be obtained from the app's link on the app store. For example, in `https://play.google.com/store/apps/details?id=net.osmand`, `net.osmand` is the `package_name` for this app.
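If the app is already installed on the device, you can also look up its package name over adb. A small sketch using the `execute_cmd` helper from `utils/utils.py` (the serial and the `osmand` filter are just examples):

```python
from utils.utils import execute_cmd

# `pm list packages` prints one line per installed app, e.g. "package:net.osmand".
out = execute_cmd("adb -s emulator-5554 shell pm list packages")
matches = [line.removeprefix("package:") for line in out.splitlines() if "osmand" in line]
print(matches)  # e.g. ['net.osmand']
```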
100 |
101 | #### Demo
102 |
103 | ```bash
 104 | # After preparing the API servers
105 | # Connect an Android device to this computer and make sure you can see it in `adb devices`.
106 | # Open a new shell window and run
107 | cd GUI-explorer
108 | conda activate GUI_explorer
109 | python -m demo.demo_web_backend
110 |
111 | # Open a new shell window and run
112 | cd GUI-explorer
113 | conda activate GUI_explorer
114 | python -m demo.demo_agent_backend
115 |
116 | # Open a new shell window and run
117 | cd GUI-explorer/demo/demo_web_frontend
118 | pnpm install
119 | pnpm run dev
120 | ```
121 |
122 | Open http://localhost:5173 in your browser.
123 |
124 | You should be able to see something like this:
125 |
126 | 
127 |
128 |
129 |
130 | ## Evaluation Results
131 |
132 | Table 1: Main Result of GUI-explorer on SPA-Bench single-app English Level 3 tasks.
133 |
134 |
135 | Table 2: Main Result of GUI-explorer on AndroidWorld tasks.
136 |
137 |
138 | Table 3: Main Result of GUI-explorer on GUI-KRB.
139 |
140 |
141 |
142 |
143 | ## Showcases
144 |
145 | | Instruction | Video |
146 | | :----------------------------------------------------------: | :----------------------------------------------------------: |
147 | | Open Google Chrome and search for today's weather in Shenzhen. Carefully observe the screen and record the current weather conditions. Then, in Markor, create a note named "today.md" and write the temperature read from the webpage into it. | |
148 | | Get the search results for stay tonight near 'wembley stadium' for 1 adult. Add one result to wishlist. Confirm that this item is in the wishlist. | |
149 |
150 |
151 |
152 | ## More Examples
153 |
154 |
155 |
156 |
157 |
158 | ## Citation
159 |
160 | If you find this work useful for your research, please kindly cite our paper:
161 | ```
162 | @inproceedings{xie2025gui,
163 | title={GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent},
164 | author={Bin Xie and Rui Shao and Gongwei Chen and Kaiwen Zhou and Yinchuan Li and Jie Liu and Min Zhang and Liqiang Nie},
165 | booktitle={Annual Meeting of the Association for Computational Linguistics (ACL)},
166 | year={2025}
167 | }
168 | ```
169 |
--------------------------------------------------------------------------------
/utils/prompt_templates.py:
--------------------------------------------------------------------------------
1 | # INPUTS: app_name, package_name, activity_list
2 | TASK_GOAL_GENERATOR = """Given the screenshot of {app_name} and its available activities, generate a comprehensive list of practical user tasks that:
3 |
4 | 1. Start from the current screen shown in the screenshot
5 | 2. Can be completed within 10-30 steps
6 | 3. Utilize the app's full feature set based on the activity list
7 | 4. Are concrete and specific (like searching for a particular item rather than just "search")
8 | 5. Cover different user interaction patterns (viewing, editing, sharing, etc.)
9 | 6. Include both basic and advanced features
10 | 7. Represent realistic user behaviors and goals
11 | 8. Avoid excessive steps on form-filling or scrolling pages
12 |
13 | Important context:
14 | - App name: {app_name}
15 | - Package name: {package_name}
16 | - Available activities (app screens/features):
17 | ```{activity_list}```
18 |
19 | Format requirements:
20 | 1. List only the tasks without explanations or commentary
21 | 2. Each task should be a single, clear directive
22 | 3. Use specific examples (e.g., concrete search terms, actions, settings)
23 | 4. Include the expected outcome where relevant
24 | 5. Tasks should follow this pattern: [Starting action] + [Specific steps] + [End goal]
25 |
26 | Example tasks from other apps (for reference only):
27 | 1. Search for "ocean waves" white noise, then sort results by most played
28 | 2. Open the first recommended video, then post "Great content!" as a comment
29 | 3. Play the trending video, then add it to your "Watch Later" playlist
30 | 4. Navigate to the comments section of a featured video, then like the top comment
31 |
32 | Generate diverse tasks that would help a user explore and utilize all major features visible in the screenshot and implied by the activity list."""
33 |
34 |
35 | # INPUTS: task_description, numeric_tag_of_element, ui_element_attributes, action
36 | KNOWLEDGE_EXTRACTOR = """Objective: Describe the functionality of a specific UI element in a mobile app screenshot.
37 |
38 | Input:
39 | - Two screenshots: Before and after interacting with a UI element
40 | - UI element marked with a numeric tag in the top-left corner
41 | - Element number: {numeric_tag_of_element}
42 | - Broader task context: {task_description}
43 | - Action taken: {action}
44 | - UI Element Attributes:
45 | ```
46 | {ui_element_attributes}
47 | ```
48 |
49 | Requirements for Functionality Description:
50 | 1. Concise: 1-2 sentences
51 | 2. Focus on general function, not specific details
52 | 3. Avoid mentioning the numeric tag
53 | 4. Use generic terms like "UI element" or appropriate pronouns
54 |
55 | Example:
56 | - Incorrect: "Tapping the element #3 displays David's saved recipes in the results panel"
 57 | - Correct: "Tapping this element initiates a search and displays matching results"
58 |
59 | Guidance:
60 | - Describe the core action and immediate result of interacting with the UI element
61 | - Prioritize clarity and generality in the description"""
62 |
63 |
64 | # INPUTS: task_goal, knowledge_a, knowledge_b
65 | RANKER = """Given the user instruction: {task_goal}, determine which of the following two knowledge entries is more useful.
 66 | Respond ONLY with an integer value:
67 | 1 means Knowledge A is strictly better.
68 | 2 means Knowledge B is strictly better.
69 |
70 | Knowledge A: {knowledge_a}
71 | Knowledge B: {knowledge_b}
72 |
73 | Please provide your response:
74 | """
75 |
76 |
77 | # INPUTS: task_goal, history, ui_elements, knowledge
78 | REASONING = """## Role Definition
79 | You are an Android operation AI that fulfills user requests through precise screen interactions.
80 | The current screenshot and the same screenshot with bounding boxes and labels added are also given to you.
81 |
82 | ## Action Catalog
83 | Available actions (STRICT JSON FORMAT REQUIRED):
84 | 1. Status Operations:
85 | - Task Complete: {{"action_type": "status", "goal_status": "complete"}}
86 | - Task Infeasible: {{"action_type": "status", "goal_status": "infeasible"}}
87 | 2. Information Actions:
88 | - Answer Question: {{"action_type": "answer", "text": ""}}
89 | 3. Screen Interactions:
90 | - Tap Element: {{"action_type": "click", "index": }}
91 | - Long Press: {{"action_type": "long_press", "index": }}
92 | - Scroll: Scroll the screen or a specific scrollable UI element. Use the `index` of the target element if scrolling a specific element, or omit `index` to scroll the whole screen. {{"action_type": "scroll", "direction": <"up"|"down"|"left"|"right">, "index": }}
93 | 4. Input Operations:
94 | - Text Entry: {{"action_type": "input_text", "text": "", "index": }}
95 | - Keyboard Enter: {{"action_type": "keyboard_enter"}}
96 | 5. Navigation:
97 | - Home Screen: {{"action_type": "navigate_home"}}
98 | - Back Navigation: {{"action_type": "navigate_back"}}
99 | 6. System Actions:
100 | - Launch App: {{"action_type": "open_app", "app_name": ""}}
101 | - Wait Refresh: {{"action_type": "wait"}}
102 |
103 | ## Current Objective
104 | User Goal: {task_goal}
105 |
106 | ## Execution Context
107 | Action History:
108 | {history}
109 |
 110 | Visible UI Elements (only interact with elements where visible=true):
111 | {ui_elements}
112 |
113 | ## Core Strategy
114 | 1. Path Optimization:
115 | - Prefer direct methods (e.g., open_app > app drawer navigation)
116 | - Always use the `input_text` action for entering text into designated text fields.
117 | - Verify element visibility (`visible=true`) before attempting any interaction (click, long_press, input_text). Do not interact with elements marked `visible=false`.
118 | - Use `scroll` when necessary to bring off-screen elements into view. Prioritize scrolling specific containers (`index` provided) over full-screen scrolls if possible.
119 |
120 | 2. Error Handling Protocol:
121 | - Switch approach after ≥ 2 failed attempts
122 | - Prioritize scrolling (`scroll` action) over force-acting on invisible elements
123 | - If an element is not visible, use `scroll` in the likely direction (e.g., 'down' to find elements below the current view).
124 | - Try opposite scroll direction if initial fails (up/down, left/right)
125 | - If the `open_app` action fails to correctly open the app, find the corresponding app in the app drawer and open it.
126 |
127 | 3. Information Tasks:
128 | - MANDATORY: Use answer action for questions
129 | - Verify data freshness (e.g., check calendar date)
130 |
131 | ## Expert Techniques
132 | Here are some tips for you:
133 | {knowledge}
134 |
135 | ## Response Format
136 | STRICTLY follow:
137 | Reasoning: [Step-by-step analysis covering:
138 | - Visibility verification
139 | - History effectiveness evaluation
140 | - Alternative approach comparison
141 | - Consideration of scrolling if needed]
142 | Action: [SINGLE JSON action from catalog]
143 |
144 | Generate response:
145 | """
146 |
147 | # INPUTS: task_goal, before_ui_elements, after_ui_elements, action, reasoning
 148 | SUMMARY = """
149 | Goal: {task_goal}
150 |
151 | Before screenshot elements:
152 | {before_ui_elements}
153 |
154 | After screenshot elements:
155 | {after_ui_elements}
156 |
157 | Action: {action}
158 | Reasoning: {reasoning}
159 |
160 | Provide a concise single-line summary (under 50 words) of this step by comparing screenshots and action outcome. Include:
161 | - What was intended
162 | - Whether it succeeded
163 | - Key information for future actions
164 | - Critical analysis if action/reasoning was flawed
165 | - Important data to remember across apps
166 |
167 | For actions like 'answer' or 'wait' with no screen change, assume they worked as intended.
168 |
169 | Summary:
170 | """
--------------------------------------------------------------------------------
/utils/retrieval.py:
--------------------------------------------------------------------------------
1 | """
 2 | Usage: python -m utils.retrieval
3 | """
4 |
5 | import copy
6 | from typing import Any, List, Dict, Tuple, Union
7 | from PIL import Image
8 | from utils.utils import str_to_md5
9 | import os
10 |
11 | import uuid
12 | import numpy as np
13 |
14 | from fastapi import FastAPI, Request, HTTPException
15 | from fastapi.middleware.cors import CORSMiddleware
16 | from utils.memory import load_memories, KnowledgeStore
17 |
18 | app = FastAPI()
19 | app.add_middleware(
20 | CORSMiddleware,
21 | allow_origins=["*"],
22 | allow_credentials=True,
23 | allow_methods=["*"],
24 | allow_headers=["*"],
25 | )
26 |
27 | __KNOWLEDGE_BASE_ABSOLUTE_ROOT_PATH = None
28 | __MEMORY: dict[str, KnowledgeStore] = {}
29 |
30 | import base64
31 | import io
32 | from PIL import Image
33 |
34 |
35 | def pil_to_base64(pil_image: Image.Image) -> str:
36 | """Convert a PIL Image to a base64 encoded string.
37 |
38 | Args:
39 | pil_image (Image.Image): The PIL Image object.
40 |
41 | Returns:
42 | str: The base64 encoded string.
43 | """
44 | buffered = io.BytesIO()
45 | pil_image.save(buffered, format="WEBP", quality=95)
46 | return base64.b64encode(buffered.getvalue()).decode()
47 |
48 |
49 | def ndarray_to_base64(ndarray) -> str:
50 | """Convert a numpy array to a base64 encoded string.
51 |
52 | Args:
53 | ndarray (np.ndarray): The numpy array.
54 |
55 | Returns:
56 | str: The base64 encoded string.
57 | """
58 | return pil_to_base64(Image.fromarray(ndarray))
59 |
60 |
61 | import requests
62 | import time
63 | import os
64 |
65 |
66 | def retrieval_api(
67 | query: Image.Image, top_k: int = 1, threshold: float = 0.9, package_name: str = None
 68 | ) -> list[dict]:  # NOTE: copy this function into any module that needs to call retrieval_api
 69 | """Retrieve the knowledge entries matching the query image.
 70 | 
 71 | Returns:
 72 | List[dict[str,Any]]: the result list (note: it may contain fewer than top_k items)
 73 | """
74 | ret, rsp, max_retries = None, None, 3
75 | data = {
76 | "package_name": package_name, # "com.example.app",
77 | "query": pil_to_base64(query), # "base64 image",
78 | "top_k": top_k,
79 | "threshold": threshold,
80 | }
81 | url = os.getenv("RAG_SERVER_ENDPOINT", "http://localhost:8769") + "/retrieval"
82 | for i in range(max_retries):
83 | try:
84 | rsp = requests.post(url, json=data, timeout=300)
85 | ret = rsp.json()
86 | return ret["results"]
87 | except Exception as e:
88 | print(f"retrieval_api error: {e} retrying {i+1}/{max_retries}")
89 | if i == max_retries - 1:
90 | raise e
91 | time.sleep(1)
92 |
93 |
94 | def retrieval_batch_api(
95 | queries: list[Image.Image],
96 | top_k: int = 1,
97 | threshold: float = 0.9,
98 | package_name: str = None,
99 | ) -> list[list[dict]]:
100 | """检索出query对应的knowledge
101 |
102 | Returns:
103 | List[List[dict[str,Any]]]: 返回的结果列表(注意长度可能小于top_k)
104 | """
105 | ret, rsp, max_retries = None, None, 3
106 | data = {
107 | "package_name": package_name, # "com.example.app",
108 | "queries": [pil_to_base64(query) for query in queries], # "base64 image",
109 | "top_k": top_k,
110 | "threshold": threshold,
111 | }
112 | url = os.getenv("RAG_SERVER_ENDPOINT", "http://localhost:8769") + "/retrieval_batch"
113 | for i in range(max_retries):
114 | try:
115 | rsp = requests.post(url, json=data, timeout=300)
116 | ret = rsp.json()
117 | return ret["results"]
118 | except Exception as e:
119 | print(f"retrieval_api error: {e} retrying {i+1}/{max_retries}")
120 | if i == max_retries - 1:
121 | raise e
122 | time.sleep(1)
123 |
124 |
125 | def base64_to_pil(base64_str: str) -> Image.Image:
126 | """Convert a base64 encoded string to a PIL Image.
127 |
128 | Args:
129 | base64_str (str): The base64 string representing the image.
130 |
131 | Returns:
132 | Image.Image: A PIL Image object.
133 | """
134 | return Image.open(io.BytesIO(base64.b64decode(base64_str))).convert("RGB")
135 |
136 |
137 | @app.post("/retrieval")
138 | async def retrieval(request: Request):
139 | """
140 | body: {
141 | "package_name": "com.example.app", # 在哪个app对应的知识库中检索,如果不指定包名就在所有的知识库中检索
142 | "query": "base64 image", # base64编码的图片
143 | "top_k": 1, # 返回的结果数量 Optional
144 | "threshold": 0 # 距离或者相似度的阈值
145 | "similarity": "cosine" or "l2" # 相似度计算方式(目前暂时限定为cosine) Optional TODO:等待后续支持l2
146 | }
147 |
148 | response: {
149 | "results": List[dict[str,Any]] # 返回的结果列表(注意长度可能小于top_k)。{"knowledge": str,"similarity": float,}
150 | }
151 | """
152 | try:
 153 | # Parse the raw JSON from the request
154 | data = await request.json()
155 | query = data.get("query", None)
156 | top_k = data.get("top_k", 1)
157 | package_name = data.get("package_name", None)
158 | threshold = data.get("threshold", 0.9)
 159 | # similarity = data.get("similarity", "l2") # TODO: support l2 later
160 | result = {"results": []}
161 | if query is not None:
162 | memory = __MEMORY["fusion"]
163 | if package_name is not None:
164 | if package_name in __MEMORY:
165 | memory = __MEMORY[package_name]
166 |
167 | if memory is not None:
168 | result["results"] = memory.search(
169 | base64_to_pil(query), top_k, threshold
170 | )
171 | return result
172 | except Exception as e:
173 | raise HTTPException(status_code=500, detail=str(e))
174 |
175 |
176 | @app.post("/retrieval_batch")
177 | async def retrieval_batch(request: Request):
178 | """
 179 | Batch retrieval from the knowledge base.
180 | body: {
181 | "package_name": "com.example.app", # 在哪个app对应的知识库中检索,如果不指定包名就在所有的知识库中检索
182 | "queries": ["base64 image"], # base64编码的图片
183 | "top_k": 1, # 返回的结果数量 Optional
184 | "threshold": 0 # 距离或者相似度的阈值
185 | "similarity": "cosine" or "l2" # 相似度计算方式(目前暂时限定为cosine) Optional TODO:等待后续支持l2
186 | }
187 |
188 | response: {
189 | "results": List[List[dict[str,Any]]] # 返回的结果列表(注意长度可能小于top_k)。{"knowledge": str,"similarity": float,}
190 | }
191 | """
192 | try:
 193 | # Parse the raw JSON from the request
194 | data = await request.json()
195 | queries = data.get("queries", [])
196 | top_k = data.get("top_k", 1)
197 | package_name = data.get("package_name", None)
198 | threshold = data.get("threshold", 0.9)
 199 | # similarity = data.get("similarity", "l2") # TODO: support l2 later
200 | result = {"results": []}
201 | if queries:
202 | memory = __MEMORY["fusion"]
203 | if package_name is not None:
204 | if package_name in __MEMORY:
205 | memory = __MEMORY[package_name]
206 |
207 | if memory is not None:
208 | pil_images = [base64_to_pil(query) for query in queries]
209 | res = memory.search_batch(pil_images, top_k, threshold)
210 | result["results"] = res
211 | return result
212 | except Exception as e:
213 | raise HTTPException(status_code=500, detail=str(e))
214 |
215 |
216 | if __name__ == "__main__":
217 | """
218 | Usage: python -m utils.retrieval
219 | """
220 |
221 | __KNOWLEDGE_BASE_ABSOLUTE_ROOT_PATH = os.getenv("KNOWLEDGE_BASE_ABSOLUTE_ROOT_PATH")
222 |
223 | if __KNOWLEDGE_BASE_ABSOLUTE_ROOT_PATH is not None and os.path.exists(
224 | __KNOWLEDGE_BASE_ABSOLUTE_ROOT_PATH
225 | ):
226 | print(f"Using knowledge base at {__KNOWLEDGE_BASE_ABSOLUTE_ROOT_PATH}")
227 | else:
228 | print(
229 | f"WARNING: No knowledge base found at {__KNOWLEDGE_BASE_ABSOLUTE_ROOT_PATH}, please set KNOWLEDGE_BASE_ABSOLUTE_ROOT_PATH in environment variable or .env file"
230 | )
231 | exit(1)
232 |
233 | os.environ["no_proxy"] = "localhost, 127.0.0.1/8, ::1"
234 | print("Retrieval Service")
235 | print("Loading Memory...")
236 | __MEMORY = load_memories(__KNOWLEDGE_BASE_ABSOLUTE_ROOT_PATH)
237 |
238 | print("Fast API is starting")
239 | import uvicorn
240 |
241 | uvicorn.run(app, host="0.0.0.0", port=8769, timeout_graceful_shutdown=3)
242 |
243 | exit(0)
244 |
--------------------------------------------------------------------------------
/utils/utils.py:
--------------------------------------------------------------------------------
1 | from androguard.util import set_log
2 |
 3 | try:
 4 | set_log("ERROR") # silence androguard's noisy DEBUG output
 5 | except Exception:
 6 | pass
7 |
8 | import subprocess
9 | import time
10 | import re
11 | from androguard.core.apk import APK
12 | import os
13 |
14 | # from dotenv import load_dotenv
15 | import io
16 | import json
17 | from PIL import Image
18 | import uuid
19 | import base64
20 | import hashlib
21 | import cv2
22 |
23 |
24 | # load_dotenv(verbose=True, override=True)
25 |
26 | import requests
27 | import urllib3
28 |
29 | urllib3.disable_warnings()
30 |
31 | import pickle
32 | import zstd
33 |
34 |
35 | def save_object_to_disk(obj: object, file_path: str, compress_level: int = 3):
36 | """将对象序列化为pickle格式并使用Zstandard压缩保存到本地文件
37 | Args:
38 | obj (object): 要保存的对象
39 | file_path (str): 保存文件的路径
40 | compress_level (int): compression level, ultra-fast levels from -100 (ultra) to -1 (fast) available since zstd-1.3.4, and from 1 (fast) to 22 (slowest), 0 or unset - means default (3). Default 3.
41 | """
42 | pickled_data = pickle.dumps(obj)
43 | compressed_data = zstd.compress(pickled_data, compress_level)
44 | with open(file_path, "wb") as file:
45 | file.write(compressed_data)
46 |
47 |
48 | def load_object_from_disk(file_path: str) -> object:
49 | """从本地文件读取Zstandard压缩的pickle数据并反序列化为对象"""
50 | with open(file_path, "rb") as file:
51 | compressed_data = file.read()
52 | pickled_data = zstd.decompress(compressed_data)
53 | return pickle.loads(pickled_data)
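# Round-trip sketch (hypothetical path): save_object_to_disk({"a": 1}, "cache.pkl.zst")
# followed by load_object_from_disk("cache.pkl.zst") returns an equal dict.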
54 |
55 |
56 | from PIL import Image
57 | import numpy as np
58 |
59 |
60 | def resize_pil_image(image: Image.Image, target_max_size: int = 1000) -> Image.Image:
61 | """
62 | Resize a PIL image to fit within a square of target_max_size x target_max_size pixels,
63 | maintaining the aspect ratio.
64 | """
65 | width, height = image.size
66 | if width > height:
67 | new_width = target_max_size
68 | new_height = int((height / width) * target_max_size)
69 | else:
70 | new_height = target_max_size
71 | new_width = int((width / height) * target_max_size)
72 | return image.resize((new_width, new_height), Image.LANCZOS)
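# E.g., a 2000x1000 image is resized to 1000x500, and a 600x800 image to 750x1000.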
73 |
74 |
75 | def resize_ndarray_image(image: np.ndarray, target_max_size: int = 1000) -> np.ndarray:
76 | """
77 | Resize a numpy ndarray image to fit within a square of target_max_size x target_max_size pixels, maintaining the aspect ratio.
78 | """
79 | return np.array(resize_pil_image(Image.fromarray(image), target_max_size))
80 |
81 |
82 | def openai_request(
83 | messages: list,
84 | model: str = "env",
85 | max_retry: int = 5,
86 | timeout: int = 60,
87 | temperature: float = 0.0,
88 | max_tokens: int = 300,
89 | usage: dict[str, int] = {"prompt_tokens": 0, "completion_tokens": 0},
90 | ) -> str:
91 | headers = {
92 | "Content-Type": "application/json",
93 | "Authorization": f'Bearer {os.getenv("OPENAI_API_KEY")}',
94 | }
95 | data = {
96 | "model": os.getenv("OPENAI_API_MODEL", model) if model == "env" else model,
97 | "messages": messages,
98 | "max_tokens": max_tokens,
99 | "temperature": temperature,
100 | }
101 | url = (
102 | f"{os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")}/chat/completions"
103 | )
104 | HTTP_PROXY = os.getenv("HTTP_PROXY", None)
105 | proxies = None
106 | # if HTTP_PROXY:
107 | # proxies = {
108 | # "http": HTTP_PROXY,
109 | # "https": HTTP_PROXY,
110 | # }
111 | r = None
112 | for i in range(max_retry + 1):
113 | try:
114 | r = requests.post(
115 | url,
116 | headers=headers,
117 | json=data,
118 | timeout=timeout,
 119 | verify=False, # disable TLS certificate verification
120 | proxies=proxies,
121 | ) # .json()
122 | d = r.json()
123 | content = d.get("choices", [{}])[0].get("message", {})["content"]
124 | usage["prompt_tokens"] += d.get("usage", {}).get("prompt_tokens", 0)
125 | usage["completion_tokens"] += d.get("usage", {}).get("completion_tokens", 0)
126 | return content
127 | except Exception as e:
128 | print(
129 | f"Request failed: {e} , retrying {i+1} of {max_retry} after {(i + 1) ** 3} seconds"
130 | )
131 | if r is not None:
132 | print(r.text)
133 | time.sleep((i + 1) ** 3)
134 | raise Exception(f"Request failed after retrying {max_retry} times")
135 |
136 |
137 | def str_to_md5(input_str: str) -> str:
138 | return hashlib.md5(input_str.encode()).hexdigest().upper()
139 |
140 |
141 | def pil_to_webp_base64(img: Image.Image) -> str:
142 | buffered = io.BytesIO()
143 | img.convert("RGB").save(buffered, format="WEBP", quality=95)
144 | return base64.b64encode(buffered.getvalue()).decode("utf-8")
145 |
146 |
147 | def ndarray_to_webp_base64(img: np.ndarray) -> str:
148 | """
149 | Convert a numpy ndarray image to a base64 encoded string.
150 | """
151 | return pil_to_webp_base64(Image.fromarray(img))
152 |
153 |
154 | def base64_to_pil(base64_str: str) -> Image.Image:
155 | """
156 | Convert a base64 encoded string to a PIL Image.
157 |
158 | Args:
159 | base64_str (str): The base64 string representing the image.
160 |
161 | Returns:
162 | Image.Image: A PIL Image object.
163 | """
164 | return Image.open(io.BytesIO(base64.b64decode(base64_str))).convert("RGB")
165 |
166 |
167 | def cv2_to_pil(cv2_img):
 168 | # Convert the cv2 image from BGR (OpenCV's channel order) to RGB
 169 | cv2_img_rgb = cv2.cvtColor(cv2_img, cv2.COLOR_BGR2RGB)
 170 | # Convert the NumPy array to a PIL Image
171 | pil_img = Image.fromarray(cv2_img_rgb)
172 | return pil_img
173 |
174 |
175 | def safe_decode(byte_data, encoding_list=["utf-8", "gbk"]):
176 | for encoding in encoding_list:
177 | try:
178 | return byte_data.decode(encoding)
179 | except UnicodeDecodeError:
180 | continue
 181 | raise UnicodeError(f"Unable to decode with encodings: {encoding_list}")
182 |
183 |
184 | import ast
185 | import re
186 | import json
187 | from typing import Any, Optional
188 |
189 |
190 | def extract_json(s: str) -> Optional[dict[str, Any]]:
191 | """Extracts the first JSON object found in a string.
192 |
193 | Handles multi-line JSON and JSON embedded within other text.
194 |
195 | Args:
196 | s: A string potentially containing a JSON object.
197 | E.g., "{'hello': 'world'}" (Python-like) or '"key": "value", "boolean": true, "nothing": null' (Standard JSON) or CoT: "let's think step-by-step, ..., { ... json ... } ... more text"
198 |
199 | Returns:
200 | The parsed JSON object as a Python dictionary, or None if no valid
201 | JSON object is found or parsing fails.
202 | """
203 | pattern = r"\{.*\}"
204 | match = re.search(pattern, s, re.DOTALL)
205 | if match:
206 | potential_json_string = match.group()
207 | try:
208 | return json.loads(potential_json_string)
209 | except json.JSONDecodeError as json_error:
210 | # print(
211 | # f"JSON parsing failed ({json_error}), attempting Python literal eval."
212 | # )
213 | try:
214 | return ast.literal_eval(potential_json_string)
215 | except (SyntaxError, ValueError) as eval_error:
216 | print(
217 | f"Python literal eval also failed ({eval_error}), cannot extract dictionary."
218 | )
219 | return None
220 | else:
221 | return None
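# E.g., extract_json('Action: {"action_type": "click", "index": 3}') returns
# {"action_type": "click", "index": 3}; a string without a {...} span returns None.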
222 |
223 |
224 | def get_apk(package_name: str, local_apk_path: str, device_serial: str = None) -> str:
225 | command = "adb "
226 | if device_serial:
227 | command += f" -s {device_serial} "
228 | command += f" shell pm path {package_name}"
229 | apk_path = execute_cmd(command)
230 | if apk_path == "ERROR":
231 | return "ERROR"
232 | apk_path = apk_path.split("package:")[1].strip()
233 | command = "adb "
234 | if device_serial:
235 | command += f" -s {device_serial} "
236 | command += f" pull {apk_path} {local_apk_path}"
237 | return execute_cmd(command)
238 |
239 |
240 | def execute_cmd(command: str, verbose=True) -> str:
241 | result = subprocess.run(command, shell=True, capture_output=True, text=True)
242 | if result.returncode == 0:
243 | return result.stdout.strip()
244 | if verbose:
245 | print(f"Command execution failed: {command}")
246 | print(result.stderr)
247 | return "ERROR"
248 |
249 |
250 | def get_all_devices() -> list:
251 | command = "adb devices"
252 | device_list = []
253 | result = execute_cmd(command)
254 | if result != "ERROR":
255 | devices = result.split("\n")[1:]
256 | for d in devices:
257 | device_list.append(d.split()[0])
258 |
259 | return device_list
260 |
--------------------------------------------------------------------------------
/utils/knowledge_generation.py:
--------------------------------------------------------------------------------
1 | """
2 | 在执行完task后,调用 update_trajectory_to_knowledge 即可,如果需要马上用上新的知识库,可以再继续看看 memory.py
3 | """
4 |
5 | import xml.etree.ElementTree as ET
6 | import subprocess
7 | import time
8 | import re
9 | import os
10 | import re
11 | import glob
12 | import imagehash
13 | import openai
14 |
15 | # from dotenv import load_dotenv
16 | from PIL import Image
17 | import io
18 | import os
19 | import json
20 | import uuid
21 | import base64
22 | import hashlib
23 |
24 | from utils.utils import pil_to_webp_base64, cv2_to_pil
25 | from typing import List, Any, Dict, Union
26 | import cv2
27 | import copy
28 |
 29 | # Whether to clear and regenerate the knowledge base when the app version is upgraded
 30 | EMPTY_KNOWLEDGE_BASE_WHEN_VERSION_UPGRADE = (
 31 | os.getenv("EMPTY_KNOWLEDGE_BASE_WHEN_VERSION_UPGRADE", "False").lower() == "true"
 32 | )
 33 | # Whether to clear and regenerate the knowledge base when the app version is downgraded
 34 | EMPTY_KNOWLEDGE_BASE_WHEN_VERSION_DOWNGRADE = (
 35 | os.getenv("EMPTY_KNOWLEDGE_BASE_WHEN_VERSION_DOWNGRADE", "True").lower() == "true"
 36 | )
37 |
38 |
39 | from utils.memory import KnowledgeStore
40 | from tqdm import tqdm
41 | import math
42 | import operator
43 | from dataclasses import asdict
44 | import numpy as np
45 | from utils.utils import ndarray_to_webp_base64, resize_ndarray_image
46 | from utils.device import UIElement
47 | import imagehash
48 | from PIL import Image
49 | from utils.prompt_templates import KNOWLEDGE_EXTRACTOR
50 | from utils.utils import openai_request
51 | from utils.device import (
52 | _generate_ui_element_description,
53 | add_screenshot_label,
54 | add_ui_element_mark,
55 | )
56 |
57 |
58 | def pil_image_to_phash(pil_image: Image.Image) -> str:
59 | """Convert a PIL Image to a perceptual hash.
60 |
61 | Args:
62 | pil_image (Image.Image): The PIL Image object.
63 |
64 | Returns:
65 | str: The perceptual hash.
66 | """
67 |
68 | return str(imagehash.phash(pil_image, hash_size=16, highfreq_factor=8)).upper()
69 |
70 |
71 | def ndarray_image_to_phash(ndarray_image: np.ndarray) -> str:
72 | """Convert a NumPy ndarray image to a perceptual hash.
73 |
74 | Args:
75 | ndarray_image (np.ndarray): The NumPy ndarray image.
76 |
77 | Returns:
78 | str: The perceptual hash.
79 | """
80 | return pil_image_to_phash(Image.fromarray(ndarray_image))
81 |
82 |
83 | def dot_product(v1: list, v2: list) -> float:
84 | return sum(map(operator.mul, v1, v2))
85 |
86 |
87 | def cosine_similarity(v1: list, v2: list) -> float:
88 | """越接近1越相似"""
89 | prod = dot_product(v1, v2)
90 | len1 = math.sqrt(dot_product(v1, v1))
91 | len2 = math.sqrt(dot_product(v2, v2))
92 | return prod / (len1 * len2)
93 |
94 |
95 | def update_trajectory_to_knowledge(
96 | trajectory_data: list[dict],
97 | locations: list[tuple[str, int]],
98 | fusion_memory: KnowledgeStore,
99 | knowledge_data: dict[str, dict],
100 | usage: dict[str, int] = {"prompt_tokens": 0, "completion_tokens": 0},
101 | ) -> None:
102 | """将轨迹数据转换为知识数据,并且更新到知识库(knowledge_data)中。需要在调用完这个函数之后手动保存更新后的knowledge_data,并且重新载入一次fusion memory(因为concat的消息尚未在memory中进行更新)
103 |
104 | Args:
105 | trajectory_data (list[dict]): 轨迹数据
106 | locations (list[tuple[str, int]]): 检索出来的index到knowledge_data位置的映射,val分别是package_name和index
107 | fusion_memory (KnowledgeStore): 知识库
108 | knowledge_data (dict[str, dict[str, Any]]): 知识数据
109 | usage (dict[str, int], optional): 本次调用使用的token数,会在这个函数中更新
110 |
111 | Returns:
112 | None
113 | """
114 | tmp_memory = None
115 |
116 | def is_transition_valid(
117 | before_screenshot: np.ndarray, after_screenshot: np.ndarray
118 | ) -> bool:
119 | """判断两个截图之间的转换是否有效"""
120 | return ndarray_image_to_phash(before_screenshot) != ndarray_image_to_phash(
121 | after_screenshot
122 | )
123 |
124 | for d in tqdm(trajectory_data, ncols=80, leave=False, desc="Updating knowledge"):
125 | before_screenshot = d["before_screenshot"]
126 | after_screenshot = d["after_screenshot"]
127 | if not is_transition_valid(before_screenshot, after_screenshot):
128 | continue
129 | pil_before_screenshot = Image.fromarray(before_screenshot).convert("RGB")
130 | task_description = d["goal"]
131 | numeric_tag_of_element = (
132 | d["converted_action"].index
133 | if hasattr(d["converted_action"], "index")
134 | else None
135 | )
136 | if numeric_tag_of_element is None:
137 | continue
138 | if d["target_element"] is None:
139 | continue
140 | e = UIElement(**d["target_element"])
141 | x_min, y_min, x_max, y_max = (
142 | e.bbox_pixels.x_min,
143 | e.bbox_pixels.y_min,
144 | e.bbox_pixels.x_max,
145 | e.bbox_pixels.y_max,
146 | )
147 | w, h = pil_before_screenshot.size
148 | x_min = int(max(x_min, 0))
149 | y_min = int(max(y_min, 0))
150 | x_max = int(min(x_max, w))
151 | y_max = int(min(y_max, h))
152 | image_patch = pil_before_screenshot.crop((x_min, y_min, x_max, y_max))
153 | logical_screen_size = (w, h)
154 | physical_frame_boundary = (0, 0, w, h)
155 | orientation = 0
156 | add_ui_element_mark(
157 | before_screenshot,
158 | e,
159 | numeric_tag_of_element,
160 | logical_screen_size,
161 | physical_frame_boundary,
162 | orientation,
163 | )
164 | add_screenshot_label(
165 | before_screenshot,
166 | "Before",
167 | )
168 | add_screenshot_label(
169 | after_screenshot,
170 | "After",
171 | )
172 | if tmp_memory is not None:
173 | res = tmp_memory.search(image_patch, k=1, similarity_threshold=0.99)
 174 | if len(res) > 0: # this element patch was already processed in the current trajectory
175 | continue
176 | res = fusion_memory.search(image_patch, k=1, similarity_threshold=0.99)
177 | ui_element_attributes = (
178 | _generate_ui_element_description(e, numeric_tag_of_element)
179 | if d["target_element"] is not None
180 | else "None"
181 | )
182 | action = d["converted_action"].json_str()
183 | package_name = d["top_app_package_name"]
184 | p = KNOWLEDGE_EXTRACTOR.format(
185 | task_description=task_description,
186 | numeric_tag_of_element=numeric_tag_of_element,
187 | ui_element_attributes=ui_element_attributes,
188 | action=action,
189 | )
190 | low_resolution = os.getenv("LOW_RESOLUTION", "False").lower() == "true"
191 | if low_resolution:
192 | before_screenshot = resize_ndarray_image(before_screenshot, 1000)
193 | after_screenshot = resize_ndarray_image(after_screenshot, 1000)
194 | messages = [
195 | {
196 | "role": "user",
197 | "content": [
198 | {
199 | "type": "image_url",
200 | "image_url": {
201 | "url": f"data:image/webp;base64,{ndarray_to_webp_base64(before_screenshot)}",
202 | },
203 | },
204 | {
205 | "type": "image_url",
206 | "image_url": {
207 | "url": f"data:image/webp;base64,{ndarray_to_webp_base64(after_screenshot)}",
208 | },
209 | },
210 | {"type": "text", "text": p},
211 | ],
212 | },
213 | ]
214 | rsp_txt = openai_request(
215 | messages=messages,
216 | temperature=0.0,
217 | max_tokens=1000,
218 | timeout=120,
219 | usage=usage,
220 | )
221 | rsp_txt = rsp_txt.strip()
222 | assert rsp_txt != "", "empty response from MLLM"
223 | if not rsp_txt.endswith("."):
224 | rsp_txt += "."
225 | rsp_txt += " "
226 | if tmp_memory is None:
227 | tmp_memory = KnowledgeStore(
228 | knowledge_items=[{"image": np.array(image_patch), "info": rsp_txt}],
229 | embedding_pipeline=fusion_memory.embedding_pipeline,
230 | )
231 | else:
232 | tmp_memory.add_knowledge_items(
233 | [{"image": np.array(image_patch), "info": rsp_txt}]
234 | )
235 |
236 | add_as_new = True
237 | rsp_txt_embedding = fusion_memory.embedding_pipeline(rsp_txt)
238 | for r in res:
239 | add_as_new = False
240 | idx = r["index"]
241 | txt = r["knowledge"]
242 | txt_embedding = fusion_memory.embedding_pipeline(txt)
243 | similarity = cosine_similarity(rsp_txt_embedding, txt_embedding)
244 | if similarity <= 0.1:
245 | pkg, k_idx = locations[idx]
246 | knowledge_data[pkg]["knowledge"][k_idx]["info"] += rsp_txt
247 | if add_as_new:
248 | d = {
249 | "attrib": asdict(e),
250 | "image": np.array(image_patch),
251 | "info": rsp_txt,
252 | }
253 | if "knowledge" not in knowledge_data[package_name]:
254 | knowledge_data[package_name]["knowledge"] = []
255 | knowledge_data[package_name]["knowledge"].append(d)
256 | fusion_memory.add_knowledge_items([d])
257 | locations.append(
258 | (package_name, len(knowledge_data[package_name]["knowledge"]) - 1)
259 | )
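# Call-site sketch (names hedged; see the exploration script for the actual usage):
# fusion_memory = load_memories(knowledge_base_root)["fusion"]  # from utils.memory
# update_trajectory_to_knowledge(trajectory_data, locations, fusion_memory, knowledge_data)
# ...then persist the updated knowledge_data (e.g., with utils.utils.save_object_to_disk)
# and reload the fusion memory so the newly merged entries take effect.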
260 |
--------------------------------------------------------------------------------
/demo/demo_web_frontend/src/components/ChatPage.vue:
--------------------------------------------------------------------------------
1 |
2 |