├── LICENSE
├── README.md
├── agent
│   ├── grounding.py
│   ├── main.py
│   ├── pcagent.py
│   ├── planning.py
│   ├── prompt.py
│   ├── server
│   │   ├── molmo_server.sh
│   │   └── planner_server.sh
│   └── utils.py
├── assets
│   ├── animation.png
│   ├── overview.png
│   └── raw_trajectory_example.png
├── environment.yml
├── postprocess
│   ├── completion.py
│   ├── data
│   │   └── events_example
│   │       ├── free_task_20241126_161517.jsonl
│   │       ├── free_task_20241126_161517.md
│   │       └── screenshot
│   │           ├── 20241126_161519_1.png
│   │           ├── 20241126_161519_1_marked.png
│   │           ├── 20241126_161521_2.png
│   │           ├── 20241126_161521_2_marked.png
│   │           ├── 20241126_161524_3.png
│   │           ├── 20241126_161524_3_marked.png
│   │           ├── 20241126_161526_4.png
│   │           ├── 20241126_161526_4_marked.png
│   │           ├── 20241126_161528_5.png
│   │           ├── 20241126_161536_6.png
│   │           ├── 20241126_161536_6_marked.png
│   │           ├── 20241126_161538_7.png
│   │           ├── 20241126_161538_7_marked.png
│   │           └── 20241126_161540_8.png
│   ├── prompt.py
│   ├── refinement.py
│   └── utils.py
└── tracker
    ├── README.md
    ├── README_zh.md
    ├── capturer.py
    ├── fs.py
    ├── main.py
    ├── main.spec
    ├── monitor.py
    ├── package.ps1
    ├── recorder.py
    ├── task.py
    ├── task_cnt.json
    ├── tasks.json
    ├── tracker.py
    └── utils.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 Yanheng He
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World
2 |
3 |
4 | 📄 Paper |
5 | 🌐 Website |
6 | 📘 机器之心
7 |
8 |
9 |
10 |
11 |
12 |
13 | ## News
14 |
15 | - [2025/05/21] 🔥 **PC Agent-E** is released, a new SOTA open-source model for Windows computer use. [[Paper](https://arxiv.org/pdf/2505.13909)] [[Code](https://github.com/GAIR-NLP/PC-Agent-E/)] [[Model](https://huggingface.co/henryhe0123/PC-Agent-E)] [[Data](https://huggingface.co/datasets/henryhe0123/PC-Agent-E)]
16 | - [2024/12/24] 🔥 We released our [paper](https://arxiv.org/abs/2412.17589), [code](https://github.com/GAIR-NLP/PC-Agent/) and [project page](https://gair-nlp.github.io/PC-Agent/). Check it out!
17 |
18 | ## Demo
19 |
20 | Check out our demo of PC Agent autonomously controlling a computer to complete complex tasks involving dozens of steps!
21 |
22 | https://github.com/user-attachments/assets/0b7613c6-e3b1-41cf-86d3-0e7a828fe863
23 |
24 | ## Introduction
25 |
26 | **PC Agent** introduces a novel framework to empower autonomous digital agents through **human cognition transfer**.
27 | This transfer is implemented through three key components:
28 | 1. **PC Tracker**, the first lightweight infrastructure for large-scale human-computer interaction data collection;
29 | 2. A **Cognition Completion** post-processing pipeline that transforms raw interaction data into cognitive trajectories;
30 | 3. A multi-agent system combining a planning agent for decision-making with a grounding agent for robust visual grounding.
31 |
32 | ![overview](./assets/overview.png)
33 |
34 | ## Quick Start
35 |
36 | ### Setup
37 |
38 | To get started with PC Agent, we recommend setting up your Python environment using conda:
39 |
40 | ```bash
41 | # Clone the repository and navigate to the folder
42 | git clone https://github.com/GAIR-NLP/PC-Agent.git
43 | cd PC-Agent
44 | # Create and activate conda environment
45 | conda env create -f environment.yml
46 | conda activate pcagent
47 | ```
48 |
49 | ### PC Tracker
50 |
51 | PC Tracker is an infrastructure for human-computer interaction data collection. The source code in the `tracker/` directory can be modified to fit your specific data collection requirements.
52 |
53 | To deploy:
54 | 1. Build the executable (Windows):
55 | ```powershell
56 | cd tracker
57 | .\package.ps1
58 | ```
59 | 2. Customize `tasks.json` according to your annotation needs
60 | 3. Distribute to annotators
61 | 4. Collect annotation data from annotators - it will be saved in the hidden `events/` folder under the working directory
62 |
63 | For user instructions, please refer to our [PC Tracker User Manual](./tracker/README.md).
64 |
65 | ### Post Processing
66 |
67 | To convert raw interaction data into cognitive trajectories, follow these steps:
68 | 1. Place your data in the `postprocess/data/` directory. Example data is available in this directory for reference.
69 | 2. Run the post-processing pipeline:
70 | ```bash
71 | python postprocess/refinement.py # Data refinement
72 | python postprocess/completion.py # Cognition completion
73 | ```
74 |
75 | Note: You need to prepare your OpenAI API key in advance to perform cognition completion.
76 |
77 | ### Agent
78 |
79 | We provide a reference implementation of our multi-agent system in the `agent/` directory, combining planning and grounding agents. To run:
80 |
81 | ```bash
82 | python agent/main.py
83 | ```
84 |
85 | Reference scripts for model deployment can be found in the `agent/server/` directory.
86 |
87 | ## Citation
88 |
89 | If you find this work helpful, please consider citing:
90 |
91 | ```
92 | @article{he2024pcagent,
93 | title={PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World},
94 | author={Yanheng He and Jiahe Jin and Shijie Xia and Jiadi Su and Runze Fan and Haoyang Zou and Xiangkun Hu and Pengfei Liu},
95 | year={2024},
96 | journal={arXiv preprint arXiv:2412.17589},
97 | url={https://arxiv.org/abs/2412.17589}
98 | }
99 | ```
100 |
--------------------------------------------------------------------------------
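The post-processing step above pairs each recorded `.jsonl` events file with a markdown record whose second line carries the task text (the same convention `postprocess/completion.py` relies on). A minimal sketch of that extraction, with a hypothetical helper name and a synthetic record standing in for real tracker output:

```python
import os
import tempfile

def read_task_description(md_path):
    # the tracker's markdown record keeps the task text on its second
    # line, prefixed with "**Description:** "
    with open(md_path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    return lines[1].replace("**Description:** ", "").strip()

# exercise against a synthetic record
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "free_task_example.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write("# Free Task\n**Description:** Open the browser\n")
    desc = read_task_description(path)

print(desc)  # Open the browser
```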
/agent/grounding.py:
--------------------------------------------------------------------------------
1 | from utils import *
2 | from prompt import GROUNDING_AGENT_PROMPT
3 | import re
4 |
5 | class GroundingAgent:
6 | def __init__(self, grounding_client):
7 | self.grounding_client = grounding_client
8 | self.grounding_model = grounding_client.models.list().data[0].id
9 | print(f"Grounding model: {self.grounding_model}")
10 | screenshot = get_screenshot()
11 | self.window_width = screenshot.width
12 | self.window_height = screenshot.height
13 |
14 | def find_element(self, element_description, screenshot):
15 | """
16 | find the element and return the coordinates (with check),
17 | return x, y, there_are_none
18 | """
19 | cnt = 0
20 | retry_limit = 3
21 |
22 | while cnt < retry_limit:
23 | x, y, there_are_none = self.call_grounding(element_description, screenshot)
24 | if there_are_none:
25 | return None, None, True
26 | elif self.check_grounding(x, y, screenshot, element_description):
27 | return x, y, False
28 | cnt += 1
29 |
30 | return None, None, True
31 |
32 | def call_grounding(self, element_description, screenshot):
33 | """
34 | call the grounding model to locate the element,
35 | return x, y, there_are_none
36 | """
37 | base64_image = encode_image(screenshot)
38 | instruction = f"Point to {element_description}"
39 | messages = get_mllm_messages(instruction, base64_image)
40 |
41 | completion = self.grounding_client.chat.completions.create(
42 | model=self.grounding_model,
43 | messages=messages,
44 | max_tokens=512,
45 | temperature=0.8,
46 | n=3 # Request n completions in parallel
47 | )
48 |
49 | # Try each response until we find valid coordinates
50 | for choice in completion.choices:
51 | x, y = self.parse_coordinates(choice.message.content)
52 | if x is not None and y is not None:
53 | return x, y, False
54 |
55 | # If no valid coordinates found in any response
56 | return None, None, True
57 |
58 | def check_grounding(self, x, y, screenshot, target_description):
59 | """
60 | check if the grounding result is correct,
61 | return True if accept
62 | """
63 | element = get_element_info_from_position(x, y)
64 | element_name = element['name'] if element['name'] != '' else 'unknown'
65 | marked_screenshot = mark_screenshot(screenshot, (x, y), element["coordinates"])
66 | base64_marked_screenshot = encode_image(marked_screenshot)
67 |
68 | instruction = GROUNDING_AGENT_PROMPT + \
69 | f"The element name from the accessibility tree: {element_name}\n" + \
70 | f"Target Element Description: {target_description}\n"
71 |
72 | messages = get_mllm_messages(instruction, base64_marked_screenshot)
73 | completion = self.grounding_client.chat.completions.create(
74 | model=self.grounding_model,
75 | messages=messages,
76 | max_tokens=512,
77 | )
78 | output_text = completion.choices[0].message.content
79 | return "Result: Reject" not in output_text
80 |
81 | def parse_coordinates(self, output_text):
82 | """
83 | parse the coordinates from the model output
84 | """
85 | # match single point coordinates
86 | point_pattern = r'<point x="([\d.]+)" y="([\d.]+)"'
46 | if retry_click > PLANNING_MAX_RETRY:
47 | print(f"Plan Model failed to make valid plan after {PLANNING_MAX_RETRY} retries")
48 | self.exit(1)
49 | else:
50 | print(f"Retry after click not found: {self.retry_click_elements[-1]}")
51 | # call planning agent to get plan
52 | plan, action = self.planning_agent.get_plan(screenshot, self.task_description, self.retry_click_elements)
53 |
54 | if "click element:" in action:
55 | # call grounding agent to find element
56 | element_description = action.split("click element:")[1].strip()
57 | x, y, there_are_none = self.grounding_agent.find_element(element_description, screenshot)
58 |
59 | if there_are_none:
60 | # if element not found, retry
61 | self.retry_click_elements.append(element_description)
62 | self.add_fail_block(plan)
63 | return self.step(screenshot, retry_click+1)
64 | else:
65 | # if element found, execute action
66 | element = get_element_info_from_position(x, y)
67 | marked_screenshot = mark_screenshot(screenshot, (x, y), element["coordinates"])
68 | action = self.get_click_action(action, x, y) # rewrite click action
69 | self.add_success_block(plan, action)
70 | self.execute_click_action(action, x, y)
71 | output = f"{plan}\nAction: {action}"
72 | self.planning_agent.add_to_history(output)
73 | self.after_action(output)
74 | return output, marked_screenshot
75 | else:
76 | # non-click action
77 | self.add_success_block(plan, action)
78 | self.execute_non_click_action(action)
79 | output = f"{plan}\nAction: {action}"
80 | self.planning_agent.add_to_history(output)
81 | self.after_action(output)
82 | return output, screenshot
83 |
84 | def get_click_action(self, action, x, y):
85 | if action.startswith("click"):
86 | return f"click ({x}, {y})"
87 | elif action.startswith("right click"):
88 | return f"right click ({x}, {y})"
89 | elif action.startswith("double click"):
90 | return f"double click ({x}, {y})"
91 |
92 | def after_action(self, output):
93 | print_in_green(f"\nAgent Done:\n{output}")
94 | self.step_cnt += 1
95 |
96 | def execute_click_action(self, action, x, y):
97 | if action.startswith("click"):
98 | pyautogui.click(x, y)
99 | elif action.startswith("right click"):
100 | pyautogui.rightClick(x, y)
101 | elif action.startswith("double click"):
102 | pyautogui.doubleClick(x, y)
103 |
104 | def execute_non_click_action(self, action):
105 | # drag
106 | match = re.match(r"(drag from) \((-?\d+), (-?\d+)\) to \((-?\d+), (-?\d+)\)", action)
107 | if match:
108 | x1 = int(match.group(2)) # start x coordinate
109 | y1 = int(match.group(3)) # start y coordinate
110 | x2 = int(match.group(4)) # target x coordinate
111 | y2 = int(match.group(5)) # target y coordinate
112 | pyautogui.mouseDown(x1, y1)
113 | pyautogui.dragTo(x2, y2, duration=0.5)
114 | return
115 |
116 | # scroll
117 | match = re.match(r"scroll \((-?\d+), (-?\d+)\)", action)
118 | if match:
119 | x = int(match.group(1)) # horizontal scroll distance
120 | y = int(match.group(2)) # vertical scroll distance
121 | if x != 0:
122 | pyautogui.hscroll(x) # horizontal scroll
123 | if y != 0:
124 | pyautogui.scroll(y) # vertical scroll
125 | return
126 |
127 | # press key
128 | match = re.match(r"press key: (.+)", action)
129 | if match:
130 | key_content = match.group(1)
131 | pyautogui.press(key_content)
132 | return
133 |
134 | # hotkey
135 | match = re.match(r"hotkey \((.+), (.+)\)", action)
136 | if match:
137 | key1 = match.group(1).lower()
138 | key2 = match.group(2).lower()
139 | pyautogui.hotkey(key1, key2)
140 | return
141 |
142 | # type text
143 | match = re.match(r"type text: (.+)", action)
144 | if match:
145 | text_content = match.group(1)
146 | pyautogui.write(text_content)
147 | return
148 |
149 | # wait
150 | if action == "wait":
151 | time.sleep(3)
152 |
153 | # finish
154 | if action == "finish":
155 | self.exit(0)
156 |
157 | # fail
158 | if action == "fail":
159 | self.exit(1)
160 |
161 | def record(self, output, screenshot):
162 | # record in markdown
163 | first_event = self.step_cnt == 1
164 | record_in_md(self.directory_path, self.task_description, f"{self.step_cnt}.png", output, first_event=first_event)
165 | # save image
166 | screenshot_path = os.path.join(self.directory_path, f"{self.step_cnt}.png")
167 | save_screenshot(screenshot, screenshot_path)
168 |
169 | def add_success_block(self, plan, action):
170 | if self.output_queue is not None:
171 | self.output_queue.put(f"{plan}\n\nAction: {action}")
172 |
173 | def add_fail_block(self, plan):
174 | if self.output_queue is not None:
175 | self.output_queue.put(f"{plan}")
176 |
177 | def exit(self, exit_code):
178 | if exit_code == 0:
179 | print("Task is done!")
180 |
181 | exit(exit_code)
182 |
--------------------------------------------------------------------------------
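The regex dispatch in `execute_non_click_action` can be exercised without `pyautogui` by factoring the string parsing into a pure function. A sketch under that assumption (the `parse_action` name is hypothetical; the patterns are copied from the file above):

```python
import re

def parse_action(action):
    """Parse a PC Agent action string into (kind, args); no side effects."""
    m = re.match(r"drag from \((-?\d+), (-?\d+)\) to \((-?\d+), (-?\d+)\)", action)
    if m:
        return ("drag", tuple(int(g) for g in m.groups()))
    m = re.match(r"scroll \((-?\d+), (-?\d+)\)", action)
    if m:
        return ("scroll", (int(m.group(1)), int(m.group(2))))
    m = re.match(r"press key: (.+)", action)
    if m:
        return ("press", (m.group(1),))
    m = re.match(r"hotkey \((.+), (.+)\)", action)
    if m:
        return ("hotkey", (m.group(1).lower(), m.group(2).lower()))
    m = re.match(r"type text: (.+)", action)
    if m:
        return ("type", (m.group(1),))
    if action in ("wait", "finish", "fail"):
        return (action, ())
    return (None, ())

print(parse_action("drag from (10, 20) to (30, 40)"))  # ('drag', (10, 20, 30, 40))
print(parse_action("hotkey (Ctrl, S)"))                # ('hotkey', ('ctrl', 's'))
```

Keeping parsing separate from execution also makes malformed planner output easy to reject before any mouse or keyboard event fires.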
/agent/planning.py:
--------------------------------------------------------------------------------
1 | from utils import *
2 | from prompt import PLANNING_AGENT_PROMPT
3 |
4 | class PlanningAgent:
5 | def __init__(self, plan_client):
6 | self.plan_client = plan_client
7 | self.plan_model = plan_client.models.list().data[0].id
8 | print(f"Planning model: {self.plan_model}")
9 | self.history = []
10 | self.HISTORY_CUT_OFF = 10
11 |
12 | def get_plan(self, screenshot, task_description, retry_click_elements=None):
13 | """
14 | get the next plan
15 | Args:
16 | screenshot: the screenshot
17 | task_description: task description
18 | retry_click_elements: the list of elements that failed to click before
19 | Returns:
20 | plan_str: plan description
21 | action_str: specific action
22 | """
23 | instruction = self.get_plan_instruction(task_description)
24 |
25 | if retry_click_elements:
26 | retry_elements_str = "> and <".join(retry_click_elements)
27 | instruction += f"\n\nNote: The element <{retry_elements_str}> you tried to click before was not found. Please try a new plan."
28 |
29 | base64_image = encode_image(screenshot)
30 | messages = get_mllm_messages(instruction, base64_image)
31 | completion = self.plan_client.chat.completions.create(
32 | model=self.plan_model,
33 | messages=messages,
34 | max_tokens=512,
35 | temperature=0.8
36 | )
37 | output_text = completion.choices[0].message.content
38 | return self.split_output(output_text)
39 |
40 | def add_to_history(self, output):
41 | """
42 | add the output to the history
43 | """
44 | self.history.append(output)
45 |
46 | def get_plan_instruction(self, task_description):
47 | """
48 | generate the planning instruction
49 | """
50 | prompt = PLANNING_AGENT_PROMPT + f"Your task is: {task_description}\n\n"
51 |
52 | if len(self.history) > self.HISTORY_CUT_OFF:
53 | history_str = "\n\n".join(f"[{i+1}] {item}" for i, item in enumerate(self.history[-self.HISTORY_CUT_OFF:]))
54 | else:
55 | history_str = "\n\n".join(f"[{i+1}] {item}" for i, item in enumerate(self.history))
56 |
57 | if history_str == '':
58 | history_str = "None"
59 |
60 | prompt += f"History of the previous actions and thoughts you have done to reach the current screen: {history_str}\n\n"
61 | prompt += "--------------------------------------------\n\n"
62 | prompt += "Given the screenshot, what's the next step you will take to help with the task?"
63 | return prompt
64 |
65 | def split_output(self, output):
66 | """
67 | split the output into plan and action
68 | """
69 | plan_str, _, action_str = output.partition("Action:")
70 | # partition avoids an IndexError when the model omits "Action:"
71 | return plan_str.strip(), action_str.strip()
72 |
--------------------------------------------------------------------------------
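Both branches of the history formatting in `get_plan_instruction` can collapse into a single slice, since `history[-N:]` already handles lists shorter than N. A sketch of the equivalent logic:

```python
HISTORY_CUT_OFF = 10

def format_history(history):
    # slicing covers both the short and the long case
    recent = history[-HISTORY_CUT_OFF:]
    return "\n\n".join(f"[{i+1}] {item}" for i, item in enumerate(recent)) or "None"

print(format_history([]))       # None
print(format_history(["open browser", "type query"]))
```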
/agent/prompt.py:
--------------------------------------------------------------------------------
1 | PLANNING_AGENT_PROMPT = """You are a helpful assistant who can help users complete computer tasks, with **full permission** to make any operations on the user's computer.
2 | Based on the provided current state, you need to suggest the next action to complete the task. Do not try to complete the entire task in one step. Break it down into smaller steps, and at each step you will get a new state to interact with.
3 |
4 | IMPORTANT: You must strictly adhere to the following rules:
5 | 1. Choose ONLY ONE action from the list below for each response, DO NOT perform more than one action per step.
6 | 2. Follow the exact syntax format for the selected action, DO NOT create or use any actions other than those listed.
7 | 3. Once the task is completed, output "finish" without any further actions required.
8 | 4. If external reflection is provided, use it to improve your next action.
9 |
10 | Valid actions:
11 |
12 | 1. click element: element_description
13 | click the element with the description element_description on current screen
14 |
15 | 2. right click element: element_description
16 | right click the element with the description element_description on current screen
17 |
18 | 3. double click element: element_description
19 | double click the element with the description element_description on current screen
20 |
21 | 4. drag from (x1, y1) to (x2, y2)
22 | drag the element from position (x1, y1) to (x2, y2).
23 |
24 | 5. scroll (dx, dy)
25 | scroll the screen with the offset (dx, dy). dx is the horizontal offset, and dy is the vertical offset.
26 |
27 | 6. press key: key_content
28 | press the key key_content on the keyboard.
29 |
30 | 7. hotkey (key1, key2)
31 | press the hotkey composed of key1 and key2.
32 |
33 | 8. type text: text_content
34 | type content text_content on the keyboard.
35 |
36 | 9. wait
37 | wait for some time, usually for the system to respond, screen to refresh, advertisement to finish.
38 |
39 | 10. finish
40 | indicating that the task has been completed.
41 |
42 | 11. fail
43 | indicating that the task has failed.
44 |
45 | Response Format: {Your thought process}\n\nAction: {The specific action you choose to take}
46 |
47 | --------------------------------------------
48 |
49 | """
50 |
51 | GROUNDING_AGENT_PROMPT = """You are an assistant evaluating the accuracy of click actions performed by a PC agent. Your role is to verify if the executed click matches the intended target based on:
52 |
53 | 1. A screenshot showing:
54 | - A red dot and circle marking the exact click location
55 | - A red box outlining the general area of the clicked element
56 | Note: While the dot and circle are precise, the box might be less accurate
57 |
58 | 2. The element name from the accessibility tree
59 | Note: This information might be incomplete, with many elements labeled as "unknown". Ignore it in this case.
60 |
61 | 3. The target element description
62 |
63 | Your Task is to verify if the click action matches the target element based on the above information.
64 |
65 | # Important Notes
66 | 1. Generally, be cautious about rejecting valid clicks - avoid false negatives when possible.
67 | 2. However, be strict about distinguishing between clearly different elements.
68 | 3. Position of target element description is not a strict criterion.
69 |
70 | # Evaluation Process
71 | 1. Locate the click point with red markers.
72 | 2. Check the element name for useful info.
73 | 3. Compare the target description with your findings.
74 |
75 | Response Format:
76 | Evaluation Process: {your evaluation process}
77 | Result: {your result}
78 |
79 | Your result should be either:
80 | - "Accept" if the click matches the target element
81 | - "Reject" if the click does not match the target element
82 |
83 | Example Responses:
84 | [1]
85 | "Evaluation Process: The click is at the center of the element labeled "close button", which matches the target description.
86 | Result: Accept"
87 |
88 | [2]
89 | "Evaluation Process: The click element name from the accessibility tree is "Copy image address", which is not a match for the target description "Copy image" option.
90 | Result: Reject"
91 |
92 | --------------------------------------------
93 |
94 | """
95 |
--------------------------------------------------------------------------------
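A planner response can be checked against the action grammar above before anything is executed. A rough validator sketch (the regex is an assumption distilled from the eleven listed actions, not code from the repo):

```python
import re

# one alternative per action in PLANNING_AGENT_PROMPT
VALID_ACTION = re.compile(
    r"^(?:(?:right |double )?click element: .+"
    r"|drag from \(-?\d+, -?\d+\) to \(-?\d+, -?\d+\)"
    r"|scroll \(-?\d+, -?\d+\)"
    r"|press key: .+"
    r"|hotkey \(.+, .+\)"
    r"|type text: .+"
    r"|wait|finish|fail)$"
)

def is_valid_response(output):
    # expects "{thought}\n\nAction: {action}" per the response format
    if "Action:" not in output:
        return False
    action = output.split("Action:")[-1].strip()
    return bool(VALID_ACTION.match(action))

print(is_valid_response("Save the file.\n\nAction: hotkey (ctrl, s)"))  # True
print(is_valid_response("Save the file."))                              # False
```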
/agent/server/molmo_server.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | vllm serve allenai/Molmo-72B-0924 --tensor-parallel-size 4 --trust-remote-code --port 8000
4 |
--------------------------------------------------------------------------------
/agent/server/planner_server.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | vllm serve henryhe0123/pc-agent --tensor-parallel-size 4 --port 8002
4 |
--------------------------------------------------------------------------------
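The two scripts above expose OpenAI-compatible endpoints, so the agent can point standard `OpenAI` clients at them. A sketch of the client configuration, assuming a localhost deployment on the ports in the scripts (this builds the kwargs only; it makes no live call):

```python
# assumed localhost deployment matching the ports in the scripts above
GROUNDING_BASE_URL = "http://localhost:8000/v1"
PLANNER_BASE_URL = "http://localhost:8002/v1"

def client_kwargs(base_url):
    # vLLM's OpenAI-compatible server does not check the key's value,
    # but the OpenAI client requires one to be set
    return {"base_url": base_url, "api_key": "EMPTY"}

print(client_kwargs(PLANNER_BASE_URL)["base_url"])  # http://localhost:8002/v1
```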
/agent/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 | import io
4 | import base64
5 | from PIL import ImageDraw, ImageGrab
6 | from pywinauto import Desktop
7 |
8 | desktop = Desktop(backend="uia")
9 |
10 |
11 | def get_screenshot():
12 | screenshot = ImageGrab.grab()
13 | return screenshot
14 |
15 |
16 | def encode_image(image):
17 | # encode image to base64 string
18 | buffered = io.BytesIO()
19 | image.save(buffered, format="PNG")
20 | return base64.b64encode(buffered.getvalue()).decode('utf-8')
21 |
22 |
23 | def save_screenshot(screenshot, path):
24 | screenshot.save(path, format="PNG")
25 |
26 |
27 | def get_mllm_messages(instruction, base64_image):
28 | messages = [
29 | {
30 | "role": "user",
31 | "content": [
32 | {
33 | "type": "image_url",
34 | "image_url": {
35 | "url": f"data:image/png;base64,{base64_image}"
36 | },
37 | },
38 | {
39 | "type": "text",
40 | "text": instruction
41 | },
42 | ],
43 | },
44 | ]
45 | return messages
46 |
47 |
48 | def get_element_info_from_position(x, y):
49 | # get the UI element info at the specified coordinates
50 | try:
51 | element = desktop.from_point(x, y)
52 | # get the rectangle coordinates of the element
53 | rect = element.rectangle()
54 |
55 | return {
56 | "name": element.element_info.name,
57 | "coordinates": {
58 | "left": rect.left,
59 | "top": rect.top,
60 | "right": rect.right,
61 | "bottom": rect.bottom
62 | }
63 | }
64 | except Exception as e:
65 | print(f"Error occurred while getting element from position: {e}")
66 | return None
67 |
68 |
69 | def mark_screenshot(original_screenshot, coordinates, rect=None):
70 | screenshot = original_screenshot.copy()
71 | x, y = coordinates
72 | point = {"x": x, "y": y}
73 |
74 | if rect is not None:
75 | # create a drawable object
76 | draw = ImageDraw.Draw(screenshot)
77 | # draw the rectangle
78 | draw.rectangle(
79 | [(rect["left"], rect["top"]), (rect["right"], rect["bottom"])],
80 | outline="red",
81 | width=3 # line width
82 | )
83 |
84 | if point is not None:
85 | draw = ImageDraw.Draw(screenshot)
86 |
87 | # calculate the top-left and bottom-right coordinates of the solid circle
88 | radius = 3
89 | left = point["x"] - radius
90 | top = point["y"] - radius
91 | right = point["x"] + radius
92 | bottom = point["y"] + radius
93 |
94 | # draw the solid circle
95 | draw.ellipse(
96 | [(left, top), (right, bottom)],
97 | fill="red"
98 | )
99 |
100 | # add a larger hollow circle
101 | circle_radius = 18
102 | circle_left = point["x"] - circle_radius
103 | circle_top = point["y"] - circle_radius
104 | circle_right = point["x"] + circle_radius
105 | circle_bottom = point["y"] + circle_radius
106 |
107 | # draw the hollow circle
108 | draw.ellipse(
109 | [(circle_left, circle_top), (circle_right, circle_bottom)],
110 | outline="red",
111 | width=2
112 | )
113 |
114 | return screenshot
115 |
116 |
117 | def record_in_md(directory_path, task_description, screenshot_path, output, external_reflection=None,
118 | first_event=False):
119 | file_name = "inference_record.md"
120 | with open(os.path.join(directory_path, file_name), "a", encoding="utf-8") as file:
121 | if first_event:
122 | file.write(f"# Inference Task\n")
123 | file.write(f"**Description:** {task_description}\n\n")
124 | file.write(f"### {time.strftime('%Y-%m-%d %H:%M:%S')}\n\n")
125 | file.write(f"**Screenshot:**\n")
126 | file.write(f'<img src="{screenshot_path}">\n\n')
127 | if external_reflection: file.write(f"**External Reflection:**\n{external_reflection}\n\n")
128 | file.write(f"**Output:**\n{output}\n\n")
129 |
130 |
131 | def log(message, filename="agent.log"):
132 | current_time = time.strftime("%Y-%m-%d %H:%M:%S")
133 | # open the file with UTF-8 encoding
134 | with open(filename, 'a', encoding='utf-8') as file:
135 | file.write(f"{current_time}\n{message}\n\n")
136 |
137 |
138 | def print_in_green(message):
139 | print(f"\033[92m{message}\033[0m")
140 |
--------------------------------------------------------------------------------
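`mark_screenshot` draws an 18-pixel hollow circle around the click point, which can spill past the image border for clicks near an edge. A clamped bounding-box helper (hypothetical; not in the file above) would keep the marker on-canvas:

```python
def marker_bbox(x, y, radius, width, height):
    # clamp the circle's bbox to the image so edge clicks stay drawable
    left = max(x - radius, 0)
    top = max(y - radius, 0)
    right = min(x + radius, width - 1)
    bottom = min(y + radius, height - 1)
    return left, top, right, bottom

print(marker_bbox(5, 5, 18, 1920, 1080))  # (0, 0, 23, 23)
```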
/assets/animation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/assets/animation.png
--------------------------------------------------------------------------------
/assets/overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/assets/overview.png
--------------------------------------------------------------------------------
/assets/raw_trajectory_example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/assets/raw_trajectory_example.png
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: pcagent
2 | channels:
3 | - defaults
4 | dependencies:
5 | - python=3.10.14
6 | - aiofiles=22.1.0
7 | - blas=1.0=mkl
8 | - blinker=1.6.2
9 | - bzip2=1.0.8
10 | - ca-certificates=2024.7.2
11 | - click=8.1.7
12 | - colorama=0.4.6
13 | - flask=3.0.3
14 | - intel-openmp=2023.1.0
15 | - itsdangerous=2.2.0
16 | - jinja2=3.1.4
17 | - libffi=3.4.4
18 | - libiconv=1.16
19 | - libxml2=2.10.4
20 | - libxslt=1.1.37
21 | - lxml=5.2.1
22 | - markupsafe=2.1.3
23 | - mkl=2023.1.0
24 | - mkl-service=2.4.0
25 | - mkl_fft=1.3.8
26 | - mkl_random=1.2.4
27 | - numpy=1.26.4
28 | - numpy-base=1.26.4
29 | - openssl=3.0.14
30 | - pip=24.0
31 | - setuptools=69.5.1
32 | - sortedcontainers=2.4.0
33 | - sqlite=3.45.3
34 | - tbb=2021.8.0
35 | - tk=8.6.14
36 | - tzdata=2024a
37 | - vc=14.2
38 | - vs2015_runtime=14.29.30133
39 | - werkzeug=3.0.3
40 | - wheel=0.43.0
41 | - xz=5.4.6
42 | - zlib=1.2.13
43 | - pip:
44 | - aiohappyeyeballs==2.4.0
45 | - aiohttp==3.10.5
46 | - aiosignal==1.3.1
47 | - altgraph==0.17.4
48 | - annotated-types==0.7.0
49 | - anyio==4.4.0
50 | - astor==0.8.1
51 | - async-timeout==4.0.3
52 | - attrs==24.2.0
53 | - beautifulsoup4==4.12.3
54 | - certifi==2024.8.30
55 | - charset-normalizer==3.3.2
56 | - comtypes==1.4.5
57 | - contourpy==1.3.0
58 | - customtkinter==5.2.2
59 | - cycler==0.12.1
60 | - cython==3.0.11
61 | - darkdetect==0.8.0
62 | - dashscope==1.20.9
63 | - decorator==5.1.1
64 | - distro==1.9.0
65 | - exceptiongroup==1.2.2
66 | - fire==0.7.0
67 | - fonttools==4.54.1
68 | - frozenlist==1.4.1
69 | - h11==0.14.0
70 | - httpcore==1.0.5
71 | - httpx==0.27.2
72 | - idna==3.8
73 | - imageio==2.35.1
74 | - imgaug==0.4.0
75 | - jiter==0.5.0
76 | - kiwisolver==1.4.7
77 | - lazy-loader==0.4
78 | - lmdb==1.5.1
79 | - matplotlib==3.9.2
80 | - mouseinfo==0.1.3
81 | - multidict==6.1.0
82 | - networkx==3.3
83 | - openai==1.44.1
84 | - opencv-contrib-python==4.10.0.84
85 | - opencv-python==4.10.0.84
86 | - opt-einsum==3.3.0
87 | - packaging==24.1
88 | - paddleocr==2.8.1
89 | - paddlepaddle==2.6.2
90 | - pefile==2023.2.7
91 | - pillow==10.4.0
92 | - protobuf==3.20.2
93 | - psutil==6.1.0
94 | - pyautogui==0.9.54
95 | - pyclipper==1.3.0.post5
96 | - pydantic==2.9.1
97 | - pydantic-core==2.23.3
98 | - pygetwindow==0.0.9
99 | - pyinstaller==6.9.0
100 | - pyinstaller-hooks-contrib==2024.7
101 | - pymsgbox==1.0.9
102 | - pynput==1.7.7
103 | - pyparsing==3.1.4
104 | - pyperclip==1.9.0
105 | - pyqt5==5.15.11
106 | - pyqt5-qt5==5.15.2
107 | - pyqt5-sip==12.15.0
108 | - pyrect==0.2.0
109 | - pyscreeze==0.1.30
110 | - python-dateutil==2.9.0.post0
111 | - python-docx==1.1.2
112 | - pytweening==1.2.0
113 | - pywin32==306
114 | - pywin32-ctypes==0.2.2
115 | - pywinauto==0.6.8
116 | - pyyaml==6.0.2
117 | - rapidfuzz==3.10.0
118 | - requests==2.32.3
119 | - scikit-image==0.24.0
120 | - scipy==1.14.1
121 | - shapely==2.0.6
122 | - six==1.16.0
123 | - sniffio==1.3.1
124 | - soupsieve==2.6
125 | - termcolor==2.4.0
126 | - tifffile==2024.9.20
127 | - tqdm==4.66.5
128 | - typing-extensions==4.12.2
129 | - urllib3==2.2.2
130 | - websocket-client==1.8.0
131 | - wmi==1.5.1
132 | - yarl==1.11.1
--------------------------------------------------------------------------------
/postprocess/completion.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import sys
4 | import random
5 | import concurrent.futures
6 | from datetime import datetime
7 | from openai import OpenAI
8 | from concurrent.futures import ThreadPoolExecutor
9 | from prompt import *
10 | from utils import *
11 |
12 | client = OpenAI()
13 | model = "gpt-4o"
14 |
15 | CONCURRENT_NUM = 80
16 | RE_GENERATE = False
17 | MAX_CONTEXT_ENTRIES = 10
18 | DETAILED_OUTPUT = False
19 |
20 |
21 | def call_openai(query, base64_image=None):
22 | messages = [
23 | {
24 | "role": "user",
25 | "content": [
26 | {
27 | "type": "image_url",
28 | "image_url": {
29 | "url": f"data:image/jpeg;base64,{base64_image}"
30 | }
31 | } if base64_image else None,
32 | {
33 | "type": "text",
34 | "text": query
35 | },
36 | ],
37 | },
38 | ]
39 | messages[0]["content"] = [part for part in messages[0]["content"] if part is not None]
40 | completion = client.chat.completions.create(
41 | model=model,
42 | messages=messages,
43 | max_tokens=1000
44 | )
45 |
46 | reply = completion.choices[0].message.content
47 | return reply
48 |
49 |
50 | def process_concurrently(data_dir, function):
51 | tasks = []
52 |
53 | for item in os.listdir(data_dir):
54 | item_path = os.path.join(data_dir, item)
55 | directory_name = "events_"
56 |
57 | if os.path.isdir(item_path) and item.startswith(directory_name):
58 | print(f'Processing directory: {item_path}')
59 | for filename in os.listdir(item_path):
60 | if filename.endswith('.jsonl') and 'task' in filename:
61 | file_path = os.path.join(item_path, filename)
62 | md_path = os.path.join(item_path, filename.replace('.jsonl', '.md'))
63 | try:
64 | with open(md_path, 'r', encoding='utf-8') as file:
65 | lines = file.readlines()
66 | task_description = lines[1].replace('**Description:** ', '').strip()
67 | tasks.append((file_path, task_description))
68 | except Exception as e:
69 | print(f"error: failed to extract task description from {md_path}: {e}")
70 |
71 | random.shuffle(tasks)
72 | with ThreadPoolExecutor(max_workers=CONCURRENT_NUM) as executor:
73 | futures = [executor.submit(function, file_path, task_description)
74 | for file_path, task_description in tasks]
75 | concurrent.futures.wait(futures)
76 |
77 |
78 | def get_action_description(action, element_name, marked_screenshot_path=None, marked_screenshot_data=None):
79 | """
80 | Generate action description for click-related actions.
81 | """
82 | if marked_screenshot_path:
83 | base64_image = encode_image(marked_screenshot_path)
84 | elif marked_screenshot_data:
85 | base64_image = marked_screenshot_data
86 | else:
87 | base64_image = None
88 |
89 | click_action, _ = parse_click_action(action)
90 | if click_action:
91 | # Is a click-related action, generate action description
92 | query = CLICK_ACTION_DESCRIPTION_PROMPT \
93 | + f"The name of the clicked target for reference: {element_name}\n\n"
94 |
95 | reply = call_openai(query, base64_image)
96 |         description = f"{click_action} <\\{reply}>"
97 |
98 | else:
99 | # Not a click-related action, return the original action as description
100 | description = action
101 |
102 | return description
103 |
104 |
105 | def get_action_description_check(action, element_name, action_description, marked_screenshot_path=None, marked_screenshot_data=None):
106 | """
107 | Check the action description for click-related actions.
108 | """
109 | if marked_screenshot_path:
110 | base64_image = encode_image(marked_screenshot_path)
111 | elif marked_screenshot_data:
112 | base64_image = marked_screenshot_data
113 | else:
114 | base64_image = None
115 |
116 | click_action, coordinates = parse_click_action(action)
117 | if click_action:
118 | # Is a click-related action, check the action description
119 | x, y = coordinates
120 | clicked_element_description = action_description.split('<\\')[1].split('>')[0]
121 |
122 | query = CLICK_ACTION_DESCRIPTION_CHECK_PROMPT \
123 | + f"The exact coordinates of the mouse click: ({x}, {y})\n" \
124 | + f"The element name from the accessibility tree: {element_name}\n" \
125 | + f"The pre-generated description of the click location: {clicked_element_description}\n"
126 |
127 | try_time = 0
128 |
129 | while True:
130 | try_time += 1
131 |
132 | reply = call_openai(query, base64_image)
133 |
134 | if "Answer:" in reply:
135 | check_result = reply.split("Answer:")[1].strip().strip('*')
136 | break
137 |
138 | if try_time > 3:
139 | check_result = None
140 |                 print(f"action description check failed after {try_time} attempts in {marked_screenshot_path}")
141 | break
142 |
143 | else:
144 | # Not a click-related action, return None
145 | check_result = None
146 |
147 | if check_result and check_result.strip().startswith("Wrong"):
148 |         modified_description = check_result.split("Wrong. Correct Description:")[1].strip()
149 |         final_answer = f"{click_action} <\\{modified_description}>"
150 | else:
151 | final_answer = None
152 |
153 | return check_result, final_answer
154 |
155 |
156 | def get_thought(task_description, action, history, following_actions, marked_screenshot_path=None, marked_screenshot_data=None):
157 | """
158 | Generate thought for the action.
159 | """
160 | if marked_screenshot_path:
161 | base64_image = encode_image(marked_screenshot_path)
162 | elif marked_screenshot_data:
163 | base64_image = marked_screenshot_data
164 | else:
165 | base64_image = None
166 |
167 | query = THOUGHT_PROMPT \
168 | + f"The task you are attempting to complete: {task_description}\n\n" \
169 | + f"Your performing history:\n{history}\n\n" \
170 | + f"Your subsequent actions:\n{following_actions}\n\n" \
171 | + f"The specific action you chose to perform: {action}\n\n"
172 |
173 | thought = call_openai(query, base64_image)
174 |
175 | if "Action:" in thought:
176 | print(f"warning: found 'Action:' in thought generation, deleting it")
177 | thought = thought.split("Action:")[0].strip()
178 | if "*Action*:" in thought:
179 | print(f"warning: found '*Action*:' in thought generation, deleting it")
180 | thought = thought.split("*Action*:")[0].strip()
181 |
182 | return thought
183 |
184 |
185 | def add_field_for_file(file_path, field, task_description):
186 | print(f"begin adding {field} for {file_path}")
187 | entries = []
188 |
189 | try:
190 | with open(file_path, 'r', encoding='utf-8') as file:
191 | entries = [json.loads(line) for line in file]
192 | except Exception as e:
193 | print(f"error: failed to read file {file_path}: {e}")
194 | return
195 |
196 | if field == 'thought':
197 | all_actions = [entry['action_description'] for entry in entries]
198 |
199 | try:
200 | for id, entry in enumerate(entries):
201 | if field in entry and entry[field] and content_is_rational(entry[field]):
202 | if not RE_GENERATE:
203 | continue
204 |
205 | if 'marked_screenshot' not in entry:
206 | print(f"error: marked_screenshot field not found: {file_path}")
207 | continue
208 |
209 | marked_screenshot_path = os.path.join(os.path.dirname(file_path), entry['marked_screenshot'])
210 | if not os.path.isfile(marked_screenshot_path):
211 | print(f"error: screenshot file not found: {marked_screenshot_path}")
212 | continue
213 |
214 | try:
215 | if field == 'action_description':
216 | action_description = get_action_description(entry['action'], entry['element'], marked_screenshot_path=marked_screenshot_path)
217 |
218 | entry['action_description'] = action_description
219 |
220 | if DETAILED_OUTPUT:
221 | print(f"generated action_description: {action_description}")
222 | elif field == 'action_description_checked':
223 | action_description_checked, modified_action_description = get_action_description_check(
224 | entry['action'], entry['element'], entry['action_description'],
225 | marked_screenshot_path=marked_screenshot_path)
226 |
227 | entry['action_description_checked'] = action_description_checked
228 |
229 | if modified_action_description:
230 | entry['action_description'] = modified_action_description
231 |
232 | if DETAILED_OUTPUT and action_description_checked:
233 | print(f"generated action_description_checked: {action_description_checked}")
234 | elif field == 'thought':
235 | # build history steps
236 | history_steps = []
237 | start_idx = max(0, id - MAX_CONTEXT_ENTRIES)
238 | for hist_id in range(start_idx, id):
239 | hist_entry = entries[hist_id]
240 | if 'thought' in hist_entry and hist_entry['thought'] and content_is_rational(hist_entry['thought']):
241 | history_steps.append(f"{hist_id+1}:\nThought: {hist_entry['thought']}\nAction: {hist_entry['action_description']}")
242 | # build subsequent steps
243 | subsequent_actions = all_actions[id+1:id+1+MAX_CONTEXT_ENTRIES]
244 | subsequent_actions_str = get_action_string(subsequent_actions)
245 | thought = get_thought(
246 | task_description, entry['action_description'], '\n'.join(history_steps), subsequent_actions_str, marked_screenshot_path=marked_screenshot_path)
247 |
248 | entry['thought'] = thought
249 |
250 | if DETAILED_OUTPUT:
251 | print(f"generated thought: {thought}")
252 | else:
253 | print(f"error: unknown field: {field}")
254 | except Exception as e:
255 | print(f"error: failed to get {field} for {marked_screenshot_path}: {e}")
256 | continue
257 |
258 | with open(file_path, 'w', encoding='utf-8') as file:
259 | for entry in entries:
260 | json.dump(entry, file, ensure_ascii=False)
261 | file.write('\n')
262 |
263 | rewrite_markdown_file_by_jsonl(file_path)
264 | print(f"finished adding {field} for {file_path}")
265 |
266 | except Exception as e:
267 | print(f"error: failed to process file {file_path}: {e}")
268 | if "Expecting" in str(e) or "Invalid control character" in str(e):
269 | print(f"file {file_path} is corrupted, deleting...")
270 | try:
271 | os.remove(file_path)
272 | print(f"deleted corrupted file: {file_path}")
273 | except OSError as delete_error:
274 | print(f"error: failed to delete corrupted file: {delete_error}")
275 |
276 |
277 | def action_semantic_completion(file_path, task_description):
278 | """
279 | Adds the field 'action_description' to the jsonl file as action semantics.
280 |
281 | Parameters:
282 | file_path (str): The path to the jsonl file to be processed.
283 | """
284 | # add action description
285 | add_field_for_file(file_path, 'action_description', task_description)
286 | # add action description check
287 | add_field_for_file(file_path, 'action_description_checked', task_description)
288 | # add thought
289 | add_field_for_file(file_path, 'thought', task_description)
290 |
291 |
292 |
293 | if __name__ == "__main__":
294 | start_time = datetime.now()
295 | print(f"start time: {start_time}")
296 |
297 | current_dir = os.path.dirname(os.path.abspath(__file__))
298 | if len(sys.argv) > 1:
299 | data_dir = os.path.join(current_dir, sys.argv[1])
300 | else:
301 | data_dir = os.path.join(current_dir, 'data') # write the data directory here
302 | if not os.path.exists(data_dir):
303 | print(f"error: {data_dir} directory does not exist")
304 |         sys.exit(1)
305 |
306 | process_concurrently(data_dir, action_semantic_completion)
307 |
308 | end_time = datetime.now()
309 | print(f"end time: {end_time}")
310 | print(f"Total time: {end_time - start_time}")
311 |
--------------------------------------------------------------------------------
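The `click <\...>` action-description format that `completion.py` produces and re-parses can be sketched end-to-end. `parse_click_action` below is a hypothetical minimal stand-in for the version imported from `utils.py`, which is not shown in this excerpt:

```python
import re

def parse_click_action(action):
    # Hypothetical minimal stand-in for utils.parse_click_action:
    # returns (action_name, (x, y)) for click-like actions, else (None, None).
    m = re.match(r"^(click|right click|double click)\s*\((\d+),\s*(\d+)\)$", action)
    if not m:
        return None, None
    return m.group(1), (int(m.group(2)), int(m.group(3)))

click_action, coords = parse_click_action("click (828, 1075)")

# completion.py wraps the generated element description in a "<\...>" marker...
description = f"{click_action} <\\the Start button on the taskbar>"

# ...and get_action_description_check later recovers it by splitting on "<\":
element_desc = description.split('<\\')[1].split('>')[0]
```

The single-backslash delimiter is what the `split('<\\')` call in `get_action_description_check` expects, so the two halves must stay in sync.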
/postprocess/data/events_example/free_task_20241126_161517.jsonl:
--------------------------------------------------------------------------------
1 | {"timestamp": "2024-11-26_16:15:19", "action": "click (828, 1075)", "screenshot": "screenshot/20241126_161519_1.png", "element": "Start", "rect": {"left": 794, "top": 1020, "right": 850, "bottom": 1080}, "marked_screenshot": "screenshot/20241126_161519_1_marked.png"}
2 | {"timestamp": "2024-11-26_16:15:21", "action": "click (1111, 334)", "screenshot": "screenshot/20241126_161521_2.png", "element": "PowerPoint", "rect": {"left": 1079, "top": 273, "right": 1199, "bottom": 378}, "marked_screenshot": "screenshot/20241126_161521_2_marked.png"}
3 | {"timestamp": "2024-11-26_16:15:24", "action": "click (407, 275)", "screenshot": "screenshot/20241126_161524_3.png", "element": "Blank Presentation", "rect": {"left": 225, "top": 181, "right": 450, "bottom": 371}, "marked_screenshot": "screenshot/20241126_161524_3_marked.png"}
4 | {"timestamp": "2024-11-26_16:15:26", "action": "click (1048, 546)", "screenshot": "screenshot/20241126_161526_4.png", "element": "Unknown", "rect": {"left": -9, "top": 184, "right": 1929, "bottom": 1002}, "marked_screenshot": "screenshot/20241126_161526_4_marked.png"}
5 | {"timestamp": "2024-11-26_16:15:28", "action": "type text: Hello, world", "screenshot": "screenshot/20241126_161528_5.png", "element": null, "rect": null, "marked_screenshot": "screenshot/20241126_161528_5.png"}
6 | {"timestamp": "2024-11-26_16:15:36", "action": "click (1877, 21)", "screenshot": "screenshot/20241126_161536_6.png", "element": "Close", "rect": {"left": 1860, "top": 0, "right": 1920, "bottom": 60}, "marked_screenshot": "screenshot/20241126_161536_6_marked.png"}
7 | {"timestamp": "2024-11-26_16:15:38", "action": "click (951, 658)", "screenshot": "screenshot/20241126_161538_7.png", "element": "Save", "rect": {"left": 901, "top": 645, "right": 989, "bottom": 675}, "marked_screenshot": "screenshot/20241126_161538_7_marked.png"}
8 | {"timestamp": "2024-11-26_16:15:40", "action": "finish", "screenshot": "screenshot/20241126_161540_8.png", "element": null, "rect": null, "marked_screenshot": "screenshot/20241126_161540_8.png"}
9 |
--------------------------------------------------------------------------------
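Each line of the event log above is a self-contained JSON object, which is why `add_field_for_file` can read a whole trajectory with a one-line list comprehension. A minimal sketch, using an abbreviated record from the first line above (the real lines also carry `screenshot`, `rect`, and `marked_screenshot` fields):

```python
import json

# Abbreviated record from the events .jsonl above.
line = ('{"timestamp": "2024-11-26_16:15:19", '
        '"action": "click (828, 1075)", "element": "Start"}')
entry = json.loads(line)

# completion.py loads full files the same way, one JSON object per line:
#   entries = [json.loads(line) for line in file]
```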
/postprocess/data/events_example/free_task_20241126_161517.md:
--------------------------------------------------------------------------------
1 | # Free Task
2 | **Description:** create a new blank slide with title "Hello, world" on desktop.
3 |
4 | **Level:** easy
5 |
6 | ### 2024-11-26_16:15:19
7 | **Input:**
8 |
9 | Given the screenshot as below. What's the next step that you will do to help with the task?
10 |
11 |
12 |
13 | **Output:**
14 |
15 | click element Start at (828, 1075)
16 |
17 |
18 | ### 2024-11-26_16:15:21
19 | **Input:**
20 |
21 | Given the screenshot as below. What's the next step that you will do to help with the task?
22 |
23 |
24 |
25 | **Output:**
26 |
27 | click element PowerPoint at (1111, 334)
28 |
29 |
30 | ### 2024-11-26_16:15:24
31 | **Input:**
32 |
33 | Given the screenshot as below. What's the next step that you will do to help with the task?
34 |
35 |
36 |
37 | **Output:**
38 |
39 | click element Blank Presentation at (407, 275)
40 |
41 |
42 | ### 2024-11-26_16:15:26
43 | **Input:**
44 |
45 | Given the screenshot as below. What's the next step that you will do to help with the task?
46 |
47 |
48 |
49 | **Output:**
50 |
51 | click element Unknown at (1048, 546)
52 |
53 |
54 | ### 2024-11-26_16:15:28
55 | **Input:**
56 |
57 | Given the screenshot as below. What's the next step that you will do to help with the task?
58 |
59 |
60 |
61 | **Output:**
62 |
63 | type text: Hello, world
64 |
65 |
66 | ### 2024-11-26_16:15:36
67 | **Input:**
68 |
69 | Given the screenshot as below. What's the next step that you will do to help with the task?
70 |
71 |
72 |
73 | **Output:**
74 |
75 | click element Close at (1877, 21)
76 |
77 |
78 | ### 2024-11-26_16:15:38
79 | **Input:**
80 |
81 | Given the screenshot as below. What's the next step that you will do to help with the task?
82 |
83 |
84 |
85 | **Output:**
86 |
87 | click element Save at (951, 658)
88 |
89 |
90 | ### 2024-11-26_16:15:40
91 | **Input:**
92 |
93 | Given the screenshot as below. What's the next step that you will do to help with the task?
94 |
95 |
96 |
97 | **Output:**
98 |
99 | finish
100 |
101 |
102 |
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161519_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161519_1.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161519_1_marked.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161519_1_marked.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161521_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161521_2.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161521_2_marked.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161521_2_marked.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161524_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161524_3.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161524_3_marked.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161524_3_marked.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161526_4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161526_4.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161526_4_marked.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161526_4_marked.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161528_5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161528_5.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161536_6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161536_6.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161536_6_marked.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161536_6_marked.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161538_7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161538_7.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161538_7_marked.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161538_7_marked.png
--------------------------------------------------------------------------------
/postprocess/data/events_example/screenshot/20241126_161540_8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161540_8.png
--------------------------------------------------------------------------------
/postprocess/prompt.py:
--------------------------------------------------------------------------------
1 | CLICK_ACTION_DESCRIPTION_PROMPT = """Help me describe the target in the screenshot. The target may be a GUI element or an empty area on the screen.
2 |
3 | You will be provided with:
4 | 1. A screenshot with a red mark quadruplet:
5 | - Frame: rectangular border around the target (may be inaccurate)
6 | - Circle: circle at the center of the target
7 | - Point: dot marking the exact click position
8 | - Arrow: pointing to the target
9 | 2. The name of the clicked target, provided for reference only. If this name is "Unknown" or appears to be incorrect, just ignore it.
10 |
11 | Description Rules:
12 | 1. Priority Order:
13 | - Highest: Circle, Point and Arrow
14 | - Second: Reference name (if reliable)
15 | - Lowest: Frame
16 |
17 | 2. Description Strategy:
18 | A. For Clear GUI Elements:
19 | - Include position info ("top", "left", "center", etc.) if possible
20 | - Use visual information to describe the element
21 | - Refer to the provided element name if reliable
22 | - Examples:
23 | √ "the button in the top-right corner of the window"
24 | √ "the current tab at the top of the browser"
25 |                x "the red circle" (red marks don't belong to the original screenshot or element)
26 |
27 | B. For Empty Areas or Uncertain Elements:
28 | - Focus on positional relationships
29 | - Use visual information to locate the position
30 | - Examples:
31 | √ "empty area on the right side of the window"
32 | √ "area near the bottom toolbar"
33 |
34 | 3. Prohibited:
35 | - No speculation about element functions
36 | - No uncertain terms like "seems", "appears", "probably"
37 | - No description of elements outside the frame
38 |
39 | Output Format:
40 | - For GUI elements: "{position description} + {element description}"
41 | - For empty areas: "empty area + {position description}"
42 |
43 | Examples:
44 | √ "the close button in the top-right corner of the window"
45 | √ "the 'Chrome' icon on the desktop"
46 | √ "the left thumbnail panel in current window"
47 | √ "the 'Images' tab below the search bar"
48 | √ "'click to add title'"
49 | √ "the button in the top-right corner of the browser" (when the reference name is not reliable and you are unsure about the element)
50 | x "what appears to be a settings button" (avoid speculation)
51 |
52 | Important:
53 | 1. Carefully observe the screenshot and the red mark quadruplet. Use these visual cues to describe the element or position as accurately as possible. But **DO NOT** explicitly state the red marks in your description. Avoid phrases like "red arrow marking on the slide.." or "the red circle.."
54 | 2. When uncertain, prefer positional description over semantic or functional speculation. Be extraordinarily cautious to avoid hallucinations.
55 | 3. Be precise and output the description directly in an objective tone. Avoid sentences starting with "the target is", "The pointed target is", or "it appears to be".
56 | 4. Do not directly use the provided element name, create your own natural description based on visual information.
57 |
58 | Note:
59 | 1. For the name of the clicked target for reference, it is either very precise or completely worthless. Judge its reliability based on visual information.
60 | If unreliable, ignore it and be cautious, preferably using only positional descriptions; if reliable, try to expand on its description as much as possible.
61 |
62 | 2. Special cases: for the text box in PowerPoint, the name of the clicked target is usually "click to add title" or "click to add text".
63 | - "'click to add title'": for the title text box above the content text box or on the cover slide
64 | - "'click to add text'": for the content text box below the title text box
65 | - "'click to add subtitle'": for the subtitle text box below the title text box
66 | - "'the left thumbnail panel in current window'": for the **left slides thumbnail panel in PowerPoint**. But **DO NOT** abuse the use of "thumbnail" in other cases.
67 | """
68 |
69 | CLICK_ACTION_DESCRIPTION_CHECK_PROMPT = """
70 | You are provided with the following information about a mouse click on a computer screen:
71 |
72 | 1. A screenshot showing:
73 | - A red dot and circle marking the exact click location
74 | - A red arrow pointing to the click location
75 | - A red box outlining the general area of the clicked element
76 | Note: While the dot, circle, and arrow are precise, the box might be less accurate
77 |
78 | 2. The exact coordinates of the mouse click
79 |
80 | 3. The element name from the accessibility tree
81 | Note: This information might be incomplete, with many elements labeled as "unknown".
82 |
83 | 4. A pre-generated description of the click location
84 | Types:
85 | - Empty area description (e.g., "empty area near the bottom toolbar")
86 | - Specific element description (e.g., "the start button on the left corner of the taskbar")
87 |
88 | # Your Task
89 | Evaluate the provided description, determine if it is accurate. If not, provide the correct description. You can describe it as an empty area or a specific element. Do not mention the red marks on the screenshot.
90 |
91 | # Critical Evaluation Points
92 | 1. **Priority of Visual Evidence**: The red markers (dot, circle, arrow) on the screenshot show the ACTUAL click location. This is your primary source of truth. But **DO NOT** explicitly state the red marks in your description. Avoid phrases like "red arrow marking on the slide.." or "the red circle.."
93 |
94 | 2. **Element Name Usage**:
95 | - Ignore if marked as "unknown"
96 | - If available, use it to verify the description's accuracy
97 | - If there's a conflict between the element name and the description, carefully evaluate which is correct
98 |
99 | 3. **Empty Area vs. Specific Element Distinction**:
100 | - True empty areas: Locations where clicks produce no effect
101 | - False empty areas: Locations that appear empty but are part of specific functional elements
102 |
103 | # Evaluation Process
104 | 1. First, locate the exact click point using the red markers
105 | 2. Check if the provided element name offers any useful information
106 | 3. Determine if the location is a true empty area or part of a specific functional element
107 | 4. Compare the given description against your findings
108 | 5. Provide your response based on the required format
109 |
110 | # Important
111 | - Be cautious when judging a description as wrong. Most of the time, the provided description is correct.
112 | - The pre-generated description may have hallucinations. Carefully evaluate it.
113 |
114 | Final Answer Format: (Respond in English even if the element name is Chinese)
115 | Thought Process: {your thought process}
116 | Answer:{your answer}
117 |
118 | Your answer should be either:
119 | - "Good" if the description is accurate
120 | - "Wrong. Correct Description: {your description}" if the description is inaccurate
121 | --------------------------------------------
122 |
123 | """
124 |
125 |
126 | THOUGHT_PROMPT = """You are a helpful PC Agent designed to complete tasks on a computer. Your goal is to recreate your **thought process** behind a specific action.
127 |
128 | You will be provided with:
129 |
130 | 1. The task you are attempting to complete.
131 | 2. A history of the steps you have already performed (up to 50, if any; none if it was the first action).
132 | 3. Subsequent actions (none if this is the last action).
133 | 4. The specific action you chose to take.
134 | 5. A screenshot of the computer screen at the moment you decided to take the action.
135 | 6. The red marks on the screenshot:
136 | A. For Click Actions (click, right click, double click):
137 | - Frame: rectangular border around clicked element
138 | - Center: circle at element center
139 | - Click: point at exact click position
140 | - Arrow: pointing to clicked element
141 | B. For Drag Actions:
142 | - Start: red point and circle
143 | - End: red point and circle
144 | - Arrow: from start to end position
145 |
146 | Explanation of actions:
147 | 1. **click element: <{element description}>**: Click the element described by `{element description}`.
148 | 2. **right click element: <{element description}>**: Right-click the element described by `{element description}`.
149 | 3. **double click element: <{element description}>**: Double-click the element described by `{element description}`.
150 | 4. **drag from (x1, y1) to (x2, y2)**: Drag the mouse from the position (x1, y1) to (x2, y2).
151 | 5. **scroll (dx, dy)**: Scroll with offsets (dx for horizontal movement, dy for vertical movement).
152 | 6. **press key: key_content**: Press the `key_content` on the keyboard.
153 | 7. **hotkey (key1, key2)**: Press the combination of `key1` and `key2`.
154 | 8. **type text: text_content**: Type the text `text_content` on the keyboard.
155 | 9. **wait**: Pause briefly, usually for system responses or screen updates.
156 | 10. **finish**: Indicate the task has been completed.
157 | 11. **fail**: Indicate the task has failed.
158 |
159 | Further explanation of the drag operation: drag from (x1, y1) to (x2, y2) is a combination of pressing the mouse at (x1, y1) and dragging it to (x2, y2). It might have the following purposes:
160 | 1. Move/Translate - Moving an element from position (x1,y1) to (x2,y2)
161 | Common scenarios:
162 | - Dragging a file/folder to a new location
163 | - Moving a UI element (window, widget) to a different position
164 | - Moving elements (shapes, text boxes, images) in a PowerPoint slide
165 | - Adjusting slider controls or resizing elements
166 | - Reordering items in a list or menu
167 |
168 | 2. Range Selection - Selecting content within a rectangular region defined by (x1,y1) and (x2,y2) as diagonal points
169 | Common scenarios:
170 | - Selecting multiple files/icons in a folder
171 | - Selecting text in a document. This is usually performed before copy/cut/delete/adjust text operation. After this action, the selected text will be highlighted.
172 | - Selecting cells in a spreadsheet
173 | - Drawing selection rectangle on a canvas
174 |
175 | Consider the following to give your thought process:
176 | 1. The current state of the screen and your last step (if any). Does the current state align with your last plan? Is this action trying to fix something?
177 | 2. Based on the history steps, how far have you progressed in the whole task? And based on your subsequent actions, what is the expected outcome of this action? (**DO NOT** explicitly state the next action in your output.)
178 | 3. Based on all the information (task, observation, history, future), if this action seems not related to the task, is it possibly exploring the environment?
179 | Based on the above, recreate your thought process in a clear, natural first-person narrative.
180 |
181 | Other requirements:
182 | 1. Be confident in your thought process. Avoid speculative or uncertain phrases like "it seems" or "this action might have been for."
183 | 2. You may reference future actions as context, but **DO NOT** explicitly state the next action in your explanation.
184 | 3. If there are red marks on the screenshot, you should use them to understand the action, but **DO NOT** explicitly state the red marks in your explanation. Avoid phrases like "I notice the red circles around..." or "the red arrow indicates...".
185 | 4. Keep your explanations **concise and short**; do not conduct meaningless analysis and emphasis.
186 | 5. Do not repeat the action after your thought process.
187 |
188 | Here are some examples of the thought process:
189 | - "I see the 'View' menu is successfully opened, so I can click the 'Slide Master' button on it to change the font for all slides."
190 | - "To open the target powerpoint file, I should open the folder containing it first. So I need to click the folder icon on the left of the taskbar."
191 | - "I wanted to click the close button to close the current window in my last step, but I see it is not closed yet on the current screen. Maybe my click was not precise last time, so I need to click it again. I should click the close button in the top-right corner of the window."
192 | - "After saving the file to the desktop, I have successfully completed the task."
193 | - "I need to modify the 5th slide, but it is not in the current screen. I should scroll down the page to find it."
194 | - "I have insert a new text box and focus on it, so I can type the content now."
195 | - "I have finished typing content in the text box. Now I can click anywhere outside the text box to deselect it and view the content on the slide."
196 | - "I see the current file name is 'Untitled', so I should change it to a proper name. First I need to click the text box of the file name to focus on it."
197 | - "I need to insert a new slide, so I can first click the left thumbnail panel in the PowerPoint window."
198 | - "I need to insert a new slide, and I have clicked the left thumbnail panel in the PowerPoint window. Now I need to press key enter to insert a new slide."
199 |
200 | Examples of thought processes for exploratory actions:
201 | - "I need to save the file to the desktop, but I don't see a desktop option in the window. Maybe I should scroll down to see if there's a desktop option."
202 | - "I want to select the save button, but I don't see a save option in the window. I guess I might find it by clicking the File button."
203 | - "I need to open the settings menu, but I don't see an obvious settings icon on the current interface. Perhaps I should click on the three dots or three horizontal lines icon in the top right corner, as these often hide more options."
204 | - "I want to change the document's font, but I can't find the font option on the toolbar. I might need to click on the 'Format' or 'Style' menu to see if I can find the font settings there."
205 | - "I need to insert an image, but I don't see an obvious 'Insert' button. I guess I might need to right-click on a blank area of the document to see if there's an option to insert an image in the context menu."
206 | - "I want to check the version information of this application, but I can't find the relevant option on the main interface. Maybe I should click on the 'Help' or 'About' menu, as version information is often found there."
207 | - "I need to exit this full-screen program, but I don't see an exit button. I can try pressing the ESC key or moving the mouse to the top of the screen to see if a hidden menu bar appears."
208 | - "I want to search for specific content on this webpage, but I don't see a search box. I can try using the shortcut Ctrl+F (or Command+F) to see if it brings up the in-page search function."
209 |
210 | Additional PowerPoint Operation Tip:
211 | - These steps are to add a new slide at the end of the presentation:
212 | 1. Click in the left thumbnail panel of the PowerPoint window.
213 | 2. Press the Enter key to insert a new slide.
214 | - These steps are to add text in the text box:
215 | 1. Click 'click to add text'/'click to add title'/'click to add subtitle' to focus on the text box.
216 | 2. Type the content in the text box.
217 | 3. (Optional) Press the Enter key to finish.
218 |
219 | Again, you are recreating the thought process you had when you made the action, so do not include any post-event evaluation or similar phrases.
220 |
221 | --------------------------------------------
222 |
223 | """
224 |
225 |
226 |
227 |
228 |
229 |
230 |
231 |
232 |
233 |
234 |
235 |
--------------------------------------------------------------------------------
/postprocess/refinement.py:
--------------------------------------------------------------------------------
1 | # multi-function script for data refinement
2 | # 1. rewrite screenshot path
3 | # 2. clean fail and error record
4 | # 3. check last action finish
5 | # 4. merge press and drag
6 | # 5. remove redundant actions
7 | # 6. remove meaningless actions
8 | # 7. resize screenshot and coordinates to 1080p
9 | # 8. clean tracker interface
10 | # 9. mark screenshot with red rect and point
11 | # 10. rewrite markdown file
12 | # 11. statistics
13 | # supports being interrupted and re-run
14 |
15 | import os
16 | import json
17 | import sys
18 | import numpy as np
19 | from PIL import Image
20 | from utils import *
21 |
22 | OVERWRITE_MARKED = False
23 | REMOVE_FAIL_RECORD = True
24 | DETAIL_OUTPUT = False
25 |
26 |
27 | def screenshot_of_tracker(screenshot_path, sample_size=100):
28 | """
29 | check if the screenshot is a Tracker interface.
30 | """
31 |     if get_file_size_kb(screenshot_path) > 83:  # magic number: the mostly-uniform Tracker window compresses to a small PNG
32 | return False
33 |
34 | bg_color = "#f0f0f0"
35 | bg_threshold = 0.8
36 | top_offset = 40 # top area offset
37 | bottom_offset = 80 # bottom area offset
38 |
39 | with Image.open(screenshot_path) as img:
40 | width, height = img.size
41 |
42 | # define the sampling regions
43 | sample_regions = [
44 | (0, top_offset, sample_size, sample_size + top_offset), # top left corner
45 | (width - sample_size, top_offset, width, sample_size + top_offset), # top right corner
46 | (0, height - sample_size - bottom_offset, sample_size, height - bottom_offset), # bottom left corner
47 | (width - sample_size, height - sample_size - bottom_offset, width, height - bottom_offset) # bottom right corner
48 | ]
49 |
50 | # convert the background color to numpy array
51 | bg_color_rgb = np.array([int(bg_color[i:i + 2], 16) for i in (1, 3, 5)])
52 |
53 | # check the four regions
54 | for region in sample_regions:
55 | sample_region = img.crop(region)
56 | sample_array = np.array(sample_region)[:, :, :3]
57 | matches = np.all(sample_array == bg_color_rgb, axis=2)
58 | bg_ratio = np.sum(matches) / matches.size
59 |
60 | if bg_ratio < bg_threshold:
61 | return False
62 |
63 | return True
64 |
65 |
66 | def clean_tracker_interface(file_path):
67 | """
68 | clean the action records of the Tracker interface.
69 |
70 | return the number of actions after cleaning, -1 means the file is deleted
71 | """
72 | if DETAIL_OUTPUT:
73 | print(f"Clean tracker interface: {file_path}")
74 | screenshot_paths = []
75 | entries = []
76 |
77 | with open(file_path, 'r', encoding='utf-8') as file:
78 | for line in file:
79 | entry = json.loads(line)
80 | full_path = os.path.join(os.path.dirname(file_path), entry['screenshot'])
81 | screenshot_paths.append(full_path)
82 | entries.append(entry)
83 |
84 | last_entry_action = entries[-1].get('action')
85 | markdown_path = file_path.replace('.jsonl', '.md')
86 |
87 | # scan and identify the action of the Tracker interface
88 | begin = -1
89 | interval_list = [] # [begin, end)
90 | for index, screenshot_path in enumerate(screenshot_paths):
91 | # find the screenshot of the Tracker interface
92 | if screenshot_of_tracker(screenshot_path):
93 | if begin == -1:
94 | begin = index
95 | else:
96 | # back to the screenshot of non-Tracker interface, end the interval
97 | if begin != -1:
98 | interval_list.append((begin, index))
99 | begin = -1
100 |
101 |     interval_list.append((begin, len(screenshot_paths))) # the last interval (begin may be -1)
102 |
103 | # delete the last interval (finish/fail)
104 | begin, end = interval_list.pop()
105 | if begin != -1:
106 | entries = entries[:begin]
107 | print(f"begin: {begin}, end: {end}")
108 | try:
109 | entries[-1]['action'] = last_entry_action
110 | entries[-1]['element'] = None
111 | entries[-1]['rect'] = None
112 | except Exception as e: # empty data
113 | print(f"Error: {e}")
114 | print("Delete related records (probably empty)...")
115 | # delete the JSONL file
116 | os.remove(file_path)
117 | # delete the Markdown file
118 | os.remove(markdown_path)
119 | # delete the screenshot files
120 | for screenshot_path in screenshot_paths:
121 | remove_screenshot(screenshot_path)
122 | return -1
123 |
124 | for i in range(begin, end):
125 | remove_screenshot(screenshot_paths[i])
126 |
127 | # delete other intervals
128 | to_remove_entry_set = set()
129 | for begin, end in interval_list:
130 |         for i in range(max(begin - 1, 0), end):  # also drop the action that opened the Tracker window
131 | remove_screenshot(screenshot_paths[i])
132 | to_remove_entry_set.add(i)
133 |
134 | entries = [entry for i, entry in enumerate(entries) if i not in to_remove_entry_set]
135 |
136 | # save the updated JSONL file
137 | with open(file_path, 'w', encoding='utf-8') as file:
138 | for entry in entries:
139 | json.dump(entry, file, ensure_ascii=False)
140 | file.write('\n')
141 |
142 | return len(entries)
143 |
144 |
145 | def clean_fail_and_error(file_path):
146 | """
147 | clean the records without corresponding Markdown files or the last action is 'fail'.
148 |
149 | return True if the file is deleted, False otherwise.
150 | """
151 | markdown_path = file_path.replace('.jsonl', '.md')
152 | if DETAIL_OUTPUT:
153 | print(f"Clean fail: {file_path}")
154 | try:
155 | with open(file_path, 'r', encoding='utf-8') as infile:
156 | entries = [json.loads(line) for line in infile]
157 | except Exception as e:
158 | print(f"[ERROR] Failed to read file {file_path}: {e}")
159 | return False
160 |
161 | screenshot_paths = [os.path.join(os.path.dirname(file_path), entry['screenshot']) for entry in entries]
162 | last_entry_action = entries[-1]['action'] if entries else ''
163 |
164 | # delete the records without corresponding Markdown files
165 | if not os.path.exists(markdown_path):
166 | print(f"File {file_path} has no corresponding Markdown file")
167 | print("Delete related records...")
168 | # delete the JSONL file
169 | os.remove(file_path)
170 | # delete the screenshot files
171 | for screenshot_path in screenshot_paths:
172 | remove_screenshot(screenshot_path)
173 | return True
174 |
175 | # clean the fail records (optional)
176 | if REMOVE_FAIL_RECORD and last_entry_action == 'fail':
177 | print(f"File {file_path} ends with fail action")
178 | print("Delete related records...")
179 | # delete the JSONL file
180 | os.remove(file_path)
181 | # delete the Markdown file
182 | os.remove(markdown_path)
183 | # delete the screenshot files
184 | for screenshot_path in screenshot_paths:
185 | remove_screenshot(screenshot_path)
186 | return True
187 |
188 | return False
189 |
190 |
191 | def resize(file_path):
192 | if DETAIL_OUTPUT:
193 | print(f"Resize file: {file_path}")
194 |
195 | # get the directory of the file
196 | task_dir = os.path.dirname(file_path)
197 |
198 | # read the screenshot path of the last entry
199 | try:
200 | with open(file_path, 'r', encoding='utf-8') as infile:
201 | lines = infile.readlines()
202 | last_line = lines[-1]
203 | last_entry = json.loads(last_line)
204 | screenshot_path = os.path.join(task_dir, last_entry['screenshot'])
205 | except Exception as e:
206 | print(f"[ERROR] Failed to read the screenshot path of the last entry: {e}")
207 | return
208 |
209 | if not os.path.exists(screenshot_path):
210 | print(f"[ERROR] The screenshot file does not exist: {screenshot_path}")
211 | return
212 |
213 | # get the resolution of the screenshot
214 | try:
215 | with Image.open(screenshot_path) as img:
216 | original_width, original_height = img.size
217 | if DETAIL_OUTPUT:
218 | print(f"Original resolution: {original_width}x{original_height}")
219 | except Exception as e:
220 | print(f"[ERROR] Failed to open the screenshot file {screenshot_path}: {e}")
221 | return
222 |
223 | # original_width, original_height = 2560, 1440
224 |
225 | # target resolution
226 | target_width, target_height = 1920, 1080
227 | if original_width == target_width and original_height == target_height:
228 | if DETAIL_OUTPUT:
229 | print(f"The screenshot resolution is the same as the target resolution, no need to resize")
230 | return
231 |
232 | scale_x = target_width / original_width
233 | scale_y = target_height / original_height
234 | if DETAIL_OUTPUT:
235 | print(f"Resize ratio - X: {scale_x:.4f}, Y: {scale_y:.4f}")
236 |
237 | # process the JSONL file
238 | modified_lines = []
239 | for line in lines:
240 | try:
241 | data = json.loads(line)
242 |
243 | # process the screenshot
244 | screenshot_path = os.path.join(task_dir, data['screenshot'])
245 |             assert resize_to_1080p(screenshot_path), "Error occurred!"
246 |
247 | # process the action
248 | data['action'] = resize_action(data['action'], scale_x, scale_y)
249 |
250 | # process the rect
251 | if 'rect' in data and isinstance(data['rect'], dict):
252 | rect = data['rect']
253 | rect['left'] = round(rect['left'] * scale_x)
254 | rect['top'] = round(rect['top'] * scale_y)
255 | rect['right'] = round(rect['right'] * scale_x)
256 | rect['bottom'] = round(rect['bottom'] * scale_y)
257 | if DETAIL_OUTPUT:
258 | print(f"Resize rect: {rect}")
259 |
260 | modified_lines.append(json.dumps(data, ensure_ascii=False) + '\n')
261 | except Exception as e:
262 | print(f"[WARNING] Error when processing the line: {line.strip()} - {e}")
263 | modified_lines.append(line)
264 |
265 | # directly write the modified content, overwrite the original file
266 | try:
267 | with open(file_path, 'w', encoding='utf-8') as outfile:
268 | outfile.writelines(modified_lines)
269 | if DETAIL_OUTPUT:
270 | print(f"Saved the modified file: {file_path}")
271 | except Exception as e:
272 | print(f"[ERROR] Failed to write the file {file_path}: {e}")
273 |
274 |
275 | def mark(file_path):
276 | if DETAIL_OUTPUT:
277 | print(f"Mark file: {file_path}")
278 |
279 | # get the directory of the file
280 | task_dir = os.path.dirname(file_path)
281 |
282 | # process the JSONL file
283 | modified_lines = []
284 | with open(file_path, 'r', encoding='utf-8') as infile:
285 | for line in infile:
286 | entry = json.loads(line)
287 |
288 | if not OVERWRITE_MARKED and 'marked_screenshot' in entry:
289 | if DETAIL_OUTPUT:
290 | print(f"Already marked: {entry['marked_screenshot']}")
291 | modified_lines.append(line)
292 | continue
293 |
294 | screenshot = os.path.join(task_dir, entry.get('screenshot'))
295 | action = entry.get('action')
296 | rect = entry.get('rect')
297 |
298 | if rect is not None: # click or drag
299 | click_action_name, coordinates = parse_click_action(action)
300 |             if click_action_name is not None: # click related action
301 | x, y = coordinates
302 | marked_screenshot = mark_image(is_click_action=True, image_path=screenshot, rect=rect, point1={'x': x, 'y': y})
303 | entry['marked_screenshot'] = marked_screenshot
304 | else: # drag related action
305 | (x1, y1), (x2, y2) = parse_drag_action(action)
306 | marked_screenshot = mark_image(is_click_action=False, image_path=screenshot, rect=rect, point1={'x': x1, 'y': y1}, point2={'x': x2, 'y': y2})
307 | entry['marked_screenshot'] = marked_screenshot
308 | else:
309 | # rect is None, copy the original screenshot path
310 | entry['marked_screenshot'] = screenshot
311 |
312 | # remove the task_dir prefix of marked_screenshot
313 | entry['marked_screenshot'] = entry['marked_screenshot'].replace(
314 | task_dir + '/', '')
315 |
316 | modified_lines.append(json.dumps(entry, ensure_ascii=False) + '\n')
317 |
318 | # write the modified content, overwrite the original file
319 | with open(file_path, 'w', encoding='utf-8') as outfile:
320 | outfile.writelines(modified_lines)
321 |
322 |
323 | def rewrite_screenshot_path(file_path):
324 | if DETAIL_OUTPUT:
325 | print(f"Rewrite screenshot path: {file_path}")
326 |
327 | modified_lines = []
328 | with open(file_path, 'r', encoding='utf-8') as file:
329 | for line in file:
330 | entry = json.loads(line)
331 |
332 | # process the screenshot field, remove the possible prefix 'events\\'
333 | if entry['screenshot'].startswith('events\\'):
334 | entry['screenshot'] = entry['screenshot'][7:] # remove the 'events\\' prefix
335 |
336 | # replace the backslash with the forward slash (Linux format)
337 | if "\\" in entry['screenshot']:
338 | entry['screenshot'] = entry['screenshot'].replace("\\", "/")
339 |
340 | modified_lines.append(json.dumps(entry, ensure_ascii=False) + '\n')
341 |
342 | with open(file_path, 'w', encoding='utf-8') as outfile:
343 | outfile.writelines(modified_lines)
344 |
345 |
346 | duplicate_clicks = 0
347 | adjacent_clicks = 0
348 |
349 |
350 | def remove_redundant_actions(file_path):
351 | if DETAIL_OUTPUT:
352 | print(f"Remove redundant actions: {file_path}")
353 | ctrl_cnt = 0
354 | shift_cnt = 0
355 | wait_cnt = 0
356 | all_entries = []
357 | kept_entries = []
358 | screenshot_paths = []
359 | continuous_wait_at_begin = False
360 |
361 | with open(file_path, 'r', encoding='utf-8') as file:
362 | for line in file:
363 | entry = json.loads(line)
364 | all_entries.append(entry)
365 |
366 | total_cnt = len(all_entries)
367 | skip = False
368 | for id, entry in enumerate(all_entries):
369 | if skip:
370 | skip = False
371 | continue
372 | # check the continuous adjacent clicks
373 | screenshot_path = os.path.join(os.path.dirname(file_path), entry['screenshot'])
374 |         if id != total_cnt - 1 and 'click' in entry['action'] and 'click' in all_entries[id+1]['action']:
375 | _, (x1, y1) = parse_click_action(entry['action'])
376 | _, (x2, y2) = parse_click_action(all_entries[id+1]['action'])
377 | global adjacent_clicks
378 | global duplicate_clicks
379 | if entry['action'] == all_entries[id+1]['action']:
380 |                 duplicate_clicks += 1
381 |                 print(f"action{id}: {entry['action']} in {file_path} is a click identical to the next action")
382 |             elif abs(x1-x2) + abs(y1-y2) < 5:
383 |                 adjacent_clicks += 1
384 | print(f"action{id}: {entry['action']} in {file_path} is a click adjacent to the next action")
385 |
386 | # delete the continuous wait at the beginning
387 | if entry['action'] != 'wait':
388 | continuous_wait_at_begin = False
389 | if entry['action'] == 'wait' and (id == 0 or continuous_wait_at_begin):
390 | wait_cnt += 1
391 | screenshot_paths.append(screenshot_path)
392 | continuous_wait_at_begin = True
393 | # delete the redundant ctrl and shift
394 |         elif entry['action'] == 'press key ctrl' and (id == total_cnt - 1 or all_entries[id+1]['action'] == 'press key ctrl' or all_entries[id+1]['action'].startswith("hotkey (Ctrl,")):
395 |             ctrl_cnt += 1
396 |             screenshot_paths.append(screenshot_path)
397 |         elif entry['action'] == 'press key shift' and (id == total_cnt - 1 or all_entries[id+1]['action'] == 'press key shift' or all_entries[id+1]['action'].startswith('type')):
398 | shift_cnt += 1
399 | screenshot_paths.append(screenshot_path)
400 | elif entry['action'] == 'press key ctrl' and all_entries[id+1]['action'] == 'press key shift':
401 | # this action and the next action should be deleted
402 | ctrl_cnt += 1
403 | shift_cnt += 1
404 | screenshot_paths.append(screenshot_path)
405 | screenshot_paths.append(os.path.join(os.path.dirname(file_path), all_entries[id+1]['screenshot']))
406 | print(f"remove ctrl + shift in {file_path} action {id}")
407 | skip = True
408 | else:
409 | kept_entries.append(entry)
410 |
411 | with open(file_path, 'w', encoding='utf-8') as file:
412 | for entry in kept_entries:
413 | json.dump(entry, file, ensure_ascii=False)
414 | file.write('\n')
415 |
416 | if len(kept_entries) == len(all_entries):
417 | if DETAIL_OUTPUT:
418 | print(f"File {file_path} has no redundant actions")
419 | return
420 | if DETAIL_OUTPUT:
421 | if wait_cnt != 0:
422 | print(f"File {file_path} has {wait_cnt}/{total_cnt} redundant wait, removed")
423 | if ctrl_cnt != 0:
424 | print(f"File {file_path} has {ctrl_cnt}/{total_cnt} redundant ctrl, removed")
425 | if shift_cnt != 0:
426 | print(f"File {file_path} has {shift_cnt}/{total_cnt} redundant shift, removed")
427 |
428 | # delete the screenshot files
429 | for screenshot_path in screenshot_paths:
430 | os.remove(screenshot_path)
431 |
432 |
433 | def remove_meaningless_actions(file_path):
434 | if DETAIL_OUTPUT:
435 | print(f"Remove meaningless actions: {file_path}")
436 | all_entries = []
437 | kept_entries = []
438 | screenshot_paths = []
439 |
440 | with open(file_path, 'r', encoding='utf-8') as file:
441 | for line in file:
442 | entry = json.loads(line)
443 | all_entries.append(entry)
444 |
445 | for id, entry in enumerate(all_entries):
446 | # check the similarity of two continuous screenshots
447 |         if id != len(all_entries) - 1 and (entry['action'] == 'wait' or 'click' in entry['action']):
448 | screenshot_path1 = os.path.join(os.path.dirname(file_path), entry['screenshot'])
449 | screenshot_path2 = os.path.join(os.path.dirname(file_path), all_entries[id+1]['screenshot'])
450 | if are_screenshots_identical(screenshot_path1, screenshot_path2):
451 | screenshot_paths.append(screenshot_path1)
452 | print(f"action {id}: {entry['action']} in {file_path} is a meaningless action, it has been removed")
453 | else:
454 | kept_entries.append(entry)
455 | else:
456 | kept_entries.append(entry)
457 |
458 | if len(kept_entries) == len(all_entries):
459 | if DETAIL_OUTPUT:
460 | print(f"File {file_path} has no meaningless actions")
461 | return
462 |
463 | # rewrite the JSON file
464 | with open(file_path, 'w', encoding='utf-8') as file:
465 | for entry in kept_entries:
466 | json.dump(entry, file, ensure_ascii=False)
467 | file.write('\n')
468 |
469 | # delete the screenshot files
470 | for screenshot_path in screenshot_paths:
471 | os.remove(screenshot_path)
472 |
473 |
474 | def merge_press_drag(file_path):
475 | if DETAIL_OUTPUT:
476 | print(f"Merge press and drag: {file_path}")
477 |
478 | all_entries = []
479 | kept_entries = []
480 | screenshot_paths = []
481 |
482 | with open(file_path, 'r', encoding='utf-8') as file:
483 | for line in file:
484 | entry = json.loads(line)
485 | all_entries.append(entry)
486 |
487 | id = 0
488 | while id < len(all_entries):
489 | # check the press action
490 | if id != len(all_entries) - 1 and all_entries[id]['action'].startswith("press ("):
491 | # the next action must be drag to
492 | assert all_entries[id+1]['action'].startswith("drag"), f"Error: In file {file_path}, action {id+1} should start with 'drag', but it's {all_entries[id+1]['action']}"
493 | x1, y1 = extract_coordinates(all_entries[id]['action'])
494 | x2, y2 = extract_coordinates(all_entries[id+1]['action'])
495 | if abs(x1-x2) + abs(y1-y2) <= 5:
496 | print(f"delta: {abs(x1-x2) + abs(y1-y2)} in {file_path} action {id} is too small, it's merged into a single click")
497 | all_entries[id]['action'] = f"click ({x2}, {y2})"
498 | else:
499 | print(f"action {id}: {all_entries[id]['action']} in {file_path} has been merged with action {id+1}: {all_entries[id+1]['action']}")
500 | all_entries[id]['action'] = f"drag from ({x1}, {y1}) to ({x2}, {y2})"
501 | screenshot_paths.append(os.path.join(os.path.dirname(file_path), all_entries[id+1]['screenshot']))
502 | kept_entries.append(all_entries[id])
503 | id += 1 # skip the next action
504 | else:
505 | kept_entries.append(all_entries[id])
506 |
507 | id += 1
508 |
509 | if len(kept_entries) == len(all_entries):
510 | if DETAIL_OUTPUT:
511 | print(f"File {file_path} has no press and drag to be merged")
512 | return
513 |
514 | # rewrite the JSON file
515 | with open(file_path, 'w', encoding='utf-8') as file:
516 | for entry in kept_entries:
517 | json.dump(entry, file, ensure_ascii=False)
518 | file.write('\n')
519 |
520 | # delete the screenshot files
521 | for screenshot_path in screenshot_paths:
522 | os.remove(screenshot_path)
523 |
524 |
525 | def check_finish(file_path):
526 | if DETAIL_OUTPUT:
527 | print(f"Check finish: {file_path}")
528 |
529 | # read all lines
530 | try:
531 | with open(file_path, 'r', encoding='utf-8') as infile:
532 | lines = infile.readlines()
533 | last_line = lines[-1]
534 | last_entry = json.loads(last_line)
535 | except Exception as e:
536 | print(f"[ERROR] Failed to read the file content: {e}")
537 | return
538 |
539 | # replace the last action with finish
540 | if last_entry.get('action') == 'finish':
541 | if DETAIL_OUTPUT:
542 | print("The last entry is already 'finish'")
543 | return
544 | else:
545 | if DETAIL_OUTPUT:
546 | print("The last entry is ", last_entry.get('action'))
547 | print("Modify the last entry to 'finish'")
548 | last_entry['action'] = 'finish'
549 |
550 | # update the last line
551 | lines[-1] = json.dumps(last_entry, ensure_ascii=False) + '\n'
552 |
553 | # write back to file
554 | try:
555 | with open(file_path, 'w', encoding='utf-8') as outfile:
556 | outfile.writelines(lines)
557 | if DETAIL_OUTPUT:
558 | print(f"Saved the modified file: {file_path}")
559 | except Exception as e:
560 | print(f"[ERROR] Failed to write the file {file_path}: {e}")
561 |
562 |
563 | def process_task_jsonl_file(file_path):
564 | print(f"Process task jsonl file: {file_path}")
565 | rewrite_screenshot_path(file_path)
566 | if clean_fail_and_error(file_path):
567 | return -1 # the file is deleted
568 | check_finish(file_path)
569 | merge_press_drag(file_path)
570 | remove_redundant_actions(file_path)
571 | remove_meaningless_actions(file_path)
572 | resize(file_path)
573 | cnt = clean_tracker_interface(file_path)
574 | if cnt != -1:
575 | mark(file_path)
576 | rewrite_markdown_file_by_jsonl(file_path)
577 | return cnt
578 |
579 |
580 | def process_events_directories():
581 | # get the directory of the script
582 | current_dir = os.path.dirname(os.path.abspath(__file__))
583 |
584 | # build the path of the data folder
585 | data_dir = os.path.join(current_dir, 'data')
586 |
587 | total_action_cnt = 0
588 | total_record_cnt = 0
589 | max_action_cnt = 0
590 |
591 | # traverse all subdirectories of the data folder
592 | for item in os.listdir(data_dir):
593 | item_path = os.path.join(data_dir, item)
594 |
595 | if len(sys.argv) > 1:
596 | directory_prefix = sys.argv[1]
597 | else:
598 |             # default directory prefix
599 | directory_prefix = "events"
600 |
601 | # check if it's a directory and starts with the specified name
602 | if os.path.isdir(item_path) and item.startswith(directory_prefix):
603 | print(f'Processing directory: {item_path}')
604 | for filename in os.listdir(item_path):
605 | # task jsonl file
606 | if filename.endswith('.jsonl') and 'task' in filename:
607 | file_path = os.path.join(item_path, filename)
608 | cnt = process_task_jsonl_file(file_path)
609 | if cnt != -1:
610 | total_action_cnt += cnt
611 | total_record_cnt += 1
612 | max_action_cnt = max(max_action_cnt, cnt)
613 |
614 |     average_action_cnt = total_action_cnt / total_record_cnt if total_record_cnt else 0
615 | print(f"Total records: {total_record_cnt}")
616 | print(f"Average actions per record: {average_action_cnt:.2f}")
617 | print(f"Maximum actions: {max_action_cnt}")
618 |
619 |
620 | if __name__ == "__main__":
621 | process_events_directories()
622 |
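
For reference, the press/drag handling in `merge_press_drag` above hinges on a small Manhattan-distance test over parsed coordinates. A minimal, self-contained sketch of that check (illustrative helper names; the project's own parsing lives in `postprocess/utils.py`):

```python
import re

def extract_xy(action: str):
    # parse "(x, y)" out of an action string such as "click (120, 45)"
    m = re.search(r"\((\d+),\s*(\d+)\)", action)
    return (int(m.group(1)), int(m.group(2))) if m else None

def should_merge_to_click(press_action: str, drag_action: str, threshold: int = 5) -> bool:
    # a press immediately followed by a drag collapses into a single click
    # when the Manhattan distance between the two points is within threshold
    x1, y1 = extract_xy(press_action)
    x2, y2 = extract_xy(drag_action)
    return abs(x1 - x2) + abs(y1 - y2) <= threshold
```

Under this check, `press (100, 100)` followed by `drag to (102, 103)` (distance 5) would be rewritten as a single click at the drag endpoint, while larger deltas are kept as one `drag from (x1, y1) to (x2, y2)` action.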
--------------------------------------------------------------------------------
/postprocess/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import json
4 | import base64
5 | import cv2
6 | import numpy as np
7 | from PIL import Image, ImageDraw
8 |
9 | POINT_RADIUS = 3
10 | CIRCLE_RADIUS = 18
11 | CIRCLE_WIDTH = 3
12 |
13 |
14 | def rewrite_markdown_file_by_jsonl(jsonl_path):
15 | """
16 | rewrite markdown file by jsonl file
17 | """
18 | with open(jsonl_path, 'r', encoding='utf-8') as file:
19 | lines = file.readlines()
20 |
21 | entries = [json.loads(line) for line in lines]
22 | markdown_path = jsonl_path.replace('.jsonl', '.md')
23 | rewrite_markdown_file(markdown_path, entries)
24 |
25 |
26 | def rewrite_markdown_file(markdown_path, entries):
27 | """
28 | rewrite markdown file by entries, use marked_screenshot if exists
29 | """
30 | prompt = '''Given the screenshot as below. What's the next step that you will do to help with the task?'''
31 | with open(markdown_path, 'r', encoding='utf-8') as file:
32 | lines = file.readlines()
33 |
34 | # keep the first 5 lines
35 | kept_lines = lines[:5]
36 |
37 | # add new lines after the kept lines
38 | for entry in entries:
39 | timestamp = entry['timestamp']
40 | action = get_full_action(entry)
41 | screenshot_path = entry['marked_screenshot'] if 'marked_screenshot' in entry else entry['screenshot']
42 | image_description = entry['image_description'] if 'image_description' in entry else None
43 | thought = entry['thought'] if 'thought' in entry else None
44 | action_description = entry['action_description'] if 'action_description' in entry else None
45 | action_description_checked = entry['action_description_checked'] if 'action_description_checked' in entry else None
46 |
47 | kept_lines.append(f'### {timestamp}\n')
48 | kept_lines.append(f'**Input:** \n\n{prompt}\n\n')
49 |         kept_lines.append(
50 |             f'<img src="{screenshot_path}">\n\n')
51 | # if image_description:
52 | # kept_lines.append(f'**Image Description:** \n\n{image_description}\n\n')
53 | if action_description:
54 | kept_lines.append(
55 | f'**Action Description:** \n\n{action_description}\n\n')
56 | if thought:
57 | kept_lines.append(f'**Thought:** \n\n{thought}\n\n')
58 | if action_description_checked:
59 | kept_lines.append(
60 | f'**Action Description Checked:** \n\n{action_description_checked}\n\n')
61 | kept_lines.append(f'**Output:** \n\n{action}\n\n')
62 |
63 | # rewrite the file
64 | with open(markdown_path, 'w', encoding='utf-8') as file:
65 | file.writelines(kept_lines)
66 |
67 |
68 | def remove_screenshot(screenshot_path):
69 | """
70 | remove the screenshot file and the possible _marked file
71 | """
72 | if os.path.exists(screenshot_path):
73 | os.remove(screenshot_path)
74 |
75 | # remove the possible _marked file
76 | marked_screenshot_path = screenshot_path.replace('.png', '_marked.png')
77 | if os.path.exists(marked_screenshot_path):
78 | os.remove(marked_screenshot_path)
79 |
80 |
81 | def get_full_action(entry):
82 | """
83 | get the full action string from entry
84 | """
85 | action = entry['action']
86 | element = entry['element']
87 | if element:
88 | target = 'click'
89 | index = action.find(target)
90 | if index != -1:
91 | # find the end position of 'click'
92 | insert_position = index + len(target)
93 | # insert ':' after 'click'
94 | action = action[:insert_position] + \
95 | f' element {element} at' + action[insert_position:]
96 | return action
97 |
98 | def get_action_string(following_actions):
99 |     result = ""
100 |     for id, action in enumerate(following_actions):
101 |         result += f"{id+1}:\n{action}\n"
102 |     return result if result != "" else "None\n"
103 |
104 | def content_is_rational(content):
105 |     if "sorry, I can't assist" in content:
106 | return False
107 | return True
108 |
109 | def encode_image(image_path):
110 | """
111 | encode image to base64
112 | """
113 | with open(image_path, "rb") as image_file:
114 | return base64.b64encode(image_file.read()).decode('utf-8')
115 |
116 |
117 | def get_file_size_kb(file_path):
118 | file_size_bytes = os.path.getsize(file_path)
119 | file_size_kb = file_size_bytes / 1024 # convert to KB
120 | return round(file_size_kb, 1) # keep 1 decimal place
121 |
122 |
123 | def mark_image(is_click_action, image_path, rect, point1, point2=None):
124 | """
125 | mark the image and save as a new file, return the new file path
126 | """
127 | # open the image
128 | with Image.open(image_path) as image:
129 | if is_click_action:
130 | # create a drawable object
131 | draw = ImageDraw.Draw(image)
132 |
133 | # draw a rectangle
134 | draw.rectangle(
135 | [(rect["left"], rect["top"]), (rect["right"], rect["bottom"])],
136 | outline="red",
137 | width=3 # line width
138 | )
139 |
140 | # draw a point
141 | draw_point(point1["x"], point1["y"], draw)
142 |
143 | # draw a circle
144 | draw_circle(point1["x"], point1["y"], draw)
145 |
146 | # draw a short arrow
147 | draw_short_arrow(point1["x"], point1["y"], draw)
148 |
149 | else:
150 | draw = ImageDraw.Draw(image)
151 |
152 | # draw a point
153 | draw_point(point1["x"], point1["y"], draw)
154 | draw_point(point2["x"], point2["y"], draw)
155 |
156 | if (abs(point1["x"] - point2["x"]) + abs(point1["y"] - point2["y"])) > 15:
157 | # draw a circle
158 | draw_circle(point1["x"], point1["y"], draw)
159 | draw_circle(point2["x"], point2["y"], draw)
160 | else:
161 | print(f"the distance between point1 and point2 in image {image_path} is too small, skip drawing circles")
162 |
163 | # draw a long arrow
164 | draw_long_arrow(point1["x"], point1["y"], point2["x"], point2["y"], draw)
165 |
166 | # generate the output path, add "_marked" to the original file name
167 | base, ext = os.path.splitext(image_path)
168 | output_path = f"{base}_marked{ext}"
169 |
170 | # save the marked image
171 | image.save(output_path)
172 | # print(f"marked image saved to: {output_path}")
173 | return output_path
174 |
175 |
176 | def resize_to_1080p(image_path):
177 | """
178 |     check and resize the image to a fixed 1920x1080 resolution, return whether it succeeded
179 | """
180 | try:
181 | with Image.open(image_path) as img:
182 | img.verify() # verify the image integrity
183 |     except Exception:
184 | print(f"[ERROR] image corrupted: {image_path}")
185 | return False
186 |
187 | # open the image
188 | with Image.open(image_path) as img:
189 | # check if the image is already 1080p
190 | if img.size == (1920, 1080):
191 | print(f"image is already 1080p, no need to resize: {image_path}")
192 | return True
193 |
194 | # resize the image to fixed 1920x1080 resolution
195 | try:
196 | resized_img = img.resize((1920, 1080), Image.LANCZOS)
197 |         except Exception:
198 | print(f"[ERROR] cannot resize image: {image_path}")
199 | return False
200 |
201 | # save the resized image, overwrite the original file
202 | resized_img.save(image_path, optimize=True)
203 | print(f"image resized and saved: {image_path}")
204 | return True
205 |
206 |
207 | def resize_action(action_str, scale_x, scale_y):
208 | """
209 | extract coordinates from the action string, scale them, and replace the coordinate part in the original string.
210 |
211 | :param action_str: action string, e.g. "double click (1415, 741)"
212 | :param scale_x: X axis scale factor
213 | :param scale_y: Y axis scale factor
214 | :return: the scaled action string
215 | """
216 | # use regex to match the coordinate part
217 | pattern = r'\((\d+),\s*(\d+)\)'
218 | match = re.search(pattern, action_str)
219 |
220 | if match:
221 | original_x = float(match.group(1))
222 | original_y = float(match.group(2))
223 | scaled_x = round(original_x * scale_x)
224 | scaled_y = round(original_y * scale_y)
225 | print(
226 | f"scale coordinates: ({original_x}, {original_y}) -> ({scaled_x}, {scaled_y})")
227 |
228 | # construct the new coordinate string
229 | new_coords = f"({scaled_x}, {scaled_y})"
230 |
231 | # replace the original coordinate string
232 | new_action_str = re.sub(pattern, new_coords, action_str)
233 | return new_action_str
234 | else:
235 | return action_str
236 |
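The coordinate-scaling logic of `resize_action` can be exercised standalone; below is a compact re-implementation for illustration (the helper name `scale_coords` is hypothetical):

```python
import re

def scale_coords(action_str, scale_x, scale_y):
    # Mirror resize_action: find "(x, y)" in the action string and scale it.
    pattern = r'\((\d+),\s*(\d+)\)'
    match = re.search(pattern, action_str)
    if not match:
        return action_str
    x = round(float(match.group(1)) * scale_x)
    y = round(float(match.group(2)) * scale_y)
    return re.sub(pattern, f"({x}, {y})", action_str)

print(scale_coords("double click (1415, 741)", 0.5, 0.5))
# -> "double click (708, 370)" (Python's round() ties to even)
```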
237 |
238 | def are_screenshots_identical(screenshot_path1, screenshot_path2):
239 | """
240 | check if two screenshots are identical
241 | """
242 | # read the images
243 | img1 = cv2.imread(screenshot_path1)
244 | img2 = cv2.imread(screenshot_path2)
245 |
246 | # check if the images are successfully read
247 | if img1 is None or img2 is None:
248 | print(f"cannot read image: {screenshot_path1} or {screenshot_path2}")
249 | return False
250 |
251 | # check if the images have the same size
252 | if img1.shape != img2.shape:
253 | return False
254 |
255 |     # check if the images are identical
256 |     difference = cv2.absdiff(img1, img2)  # absdiff avoids cv2.subtract's saturation at 0
257 |     return not np.any(difference)
258 |
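The pixel-identity check can be reproduced with plain NumPy on in-memory arrays instead of files; a minimal sketch using signed differences (note that unsigned uint8 subtraction would saturate at 0 and could miss differences):

```python
import numpy as np

def identical(img1, img2):
    # Same-shape check, then element-wise signed difference: all-zero means identical.
    if img1 is None or img2 is None or img1.shape != img2.shape:
        return False
    return not np.any(img1.astype(np.int16) - img2.astype(np.int16))

a = np.zeros((1080, 1920, 3), dtype=np.uint8)
b = a.copy()
b[0, 0, 0] = 255  # flip one channel of one pixel
print(identical(a, a.copy()), identical(a, b))  # True False
```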
259 |
260 | def parse_click_action(action):
261 | pattern = r'((?:double |right )?click)\s*\((\d+),\s*(\d+)\)'
262 | match = re.match(pattern, action)
263 |
264 | if match:
265 | action = match.group(1) # extract the action name
266 | x = int(match.group(2)) # extract x coordinate and convert to integer
267 | y = int(match.group(3)) # extract y coordinate and convert to integer
268 | return action, (x, y)
269 | else:
270 | return None, None
271 |
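The click-parsing regex above can be tried standalone; a quick sketch (the `parse` wrapper is illustrative):

```python
import re

pattern = r'((?:double |right )?click)\s*\((\d+),\s*(\d+)\)'

def parse(action):
    # Same behavior as parse_click_action: verb plus integer coordinates, or (None, None).
    m = re.match(pattern, action)
    return (m.group(1), (int(m.group(2)), int(m.group(3)))) if m else (None, None)

print(parse("right click (25, 300)"))  # ('right click', (25, 300))
print(parse("scroll down"))            # (None, None)
```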
272 |
273 | def parse_drag_action(action):
274 | assert action.startswith('drag from'), f"error: action '{action}' is not a drag action"
275 | start1 = action.find('from (') + 6
276 | end1 = action.find(') to (')
277 | start2 = action.find('to (') + 4
278 | end2 = len(action) - 1
279 |
280 | # extract two sets of coordinates
281 | coord1 = action[start1:end1]
282 | coord2 = action[start2:end2]
283 |
284 | # split and convert to integers
285 | x1, y1 = map(int, coord1.split(', '))
286 | x2, y2 = map(int, coord2.split(', '))
287 |
288 | return (x1, y1), (x2, y2)
289 |
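The slicing in `parse_drag_action` assumes the exact format `drag from (x1, y1) to (x2, y2)`; a standalone copy of that logic for illustration:

```python
def parse_drag(action):
    # Mirrors parse_drag_action: slice out the two "(x, y)" coordinate groups.
    start1 = action.find('from (') + 6
    end1 = action.find(') to (')
    start2 = action.find('to (') + 4
    end2 = len(action) - 1  # drop the trailing ')'
    x1, y1 = map(int, action[start1:end1].split(', '))
    x2, y2 = map(int, action[start2:end2].split(', '))
    return (x1, y1), (x2, y2)

print(parse_drag("drag from (100, 200) to (300, 400)"))  # ((100, 200), (300, 400))
```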
290 |
291 | def extract_coordinates(text):
292 | pattern = r'(?:drag to|press) \((\d+), (\d+)\)'
293 | match = re.search(pattern, text)
294 | if match:
295 | x, y = map(int, match.groups())
296 | return x, y
297 | return None
298 |
299 |
300 | def draw_point(x, y, draw):
301 | radius = POINT_RADIUS
302 | left = x - radius
303 | top = y - radius
304 | right = x + radius
305 | bottom = y + radius
306 |
307 | draw.ellipse(
308 | [(left, top), (right, bottom)],
309 | fill="red"
310 | )
311 |
312 |
313 | def draw_circle(x, y, draw):
314 | radius = CIRCLE_RADIUS
315 | left = x - radius
316 | top = y - radius
317 | right = x + radius
318 | bottom = y + radius
319 |
320 | draw.ellipse(
321 | [(left, top), (right, bottom)],
322 | outline="red",
323 | width=CIRCLE_WIDTH
324 | )
325 |
326 |
327 | def draw_short_arrow(x, y, draw):
328 | arrow_length = 50 # arrow length
329 | arrow_gap = CIRCLE_RADIUS + 2 # arrow gap
330 | arrow_width = 18 # arrow width
331 | angle = np.radians(30) # arrow angle
332 | cos_angle = np.cos(angle)
333 | sin_angle = np.sin(angle)
334 |
335 | # draw the arrow body
336 | start_x = x - arrow_length * cos_angle
337 | start_y = y - arrow_length * sin_angle
338 | end_x = x - arrow_gap * cos_angle
339 | end_y = y - arrow_gap * sin_angle
340 | draw.line([(start_x, start_y), (end_x, end_y)],
341 | fill="red", width=3)
342 |
343 | # draw the arrow head
344 | arrow_point1 = (
345 | int(end_x - arrow_width),
346 | int(end_y)
347 | )
348 | arrow_point2 = (
349 | int(end_x - arrow_width * sin_angle),
350 | int(end_y - arrow_width * cos_angle)
351 | )
352 |
353 | draw.polygon([
354 | (end_x, end_y),
355 | arrow_point1,
356 | arrow_point2
357 | ], fill="red")
358 |
359 |
360 | def draw_long_arrow(x1, y1, x2, y2, draw):
361 | head_length = 18 # arrow head length
362 | head_angle = np.radians(30) # arrow head angle
363 |
364 | # calculate the midpoint of the line
365 | mid_x = (x1 + x2) / 2
366 | mid_y = (y1 + y2) / 2
367 |
368 | # draw the arrow body
369 | draw.line([(x1, y1), (x2, y2)], fill="red", width=3)
370 |
371 | # arrow head direction vector
372 | vector_x = x2 - x1
373 | vector_y = y2 - y1
374 | length = np.hypot(vector_x, vector_y)
375 | unit_vector_x = vector_x / length
376 | unit_vector_y = vector_y / length
377 |
378 | # calculate the positions of the two points of the arrow head (now based on the midpoint)
379 | left_x = mid_x - head_length * \
380 | (unit_vector_x * np.cos(head_angle) +
381 | unit_vector_y * np.sin(head_angle))
382 | left_y = mid_y - head_length * \
383 | (unit_vector_y * np.cos(head_angle) -
384 | unit_vector_x * np.sin(head_angle))
385 |
386 | right_x = mid_x - head_length * \
387 | (unit_vector_x * np.cos(head_angle) -
388 | unit_vector_y * np.sin(head_angle))
389 | right_y = mid_y - head_length * \
390 | (unit_vector_y * np.cos(head_angle) +
391 | unit_vector_x * np.sin(head_angle))
392 |
393 | # use the midpoint as the vertex of the arrow head
394 | draw.polygon([(mid_x, mid_y), (left_x, left_y),
395 | (right_x, right_y)], fill="red")
396 |
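The corner computations in `draw_long_arrow` are a 2D rotation of the arrow's unit direction by ±`head_angle`, stepped back from the head vertex (which the repo places at the segment midpoint). The same math written with an explicit rotation matrix, as a sketch (the helper name is hypothetical, NumPy assumed):

```python
import numpy as np

def arrow_head_corners(p_from, p_tip, length=18.0, angle_deg=30.0):
    # Rotate the unit direction by +/- angle, then step back from the tip,
    # matching draw_long_arrow's left_x/left_y and right_x/right_y terms.
    d = np.asarray(p_tip, float) - np.asarray(p_from, float)
    u = d / np.hypot(d[0], d[1])
    a = np.radians(angle_deg)
    def rot(s):
        return np.array([[np.cos(a), s * np.sin(a)],
                         [-s * np.sin(a), np.cos(a)]])
    left = np.asarray(p_tip, float) - length * (rot(+1) @ u)
    right = np.asarray(p_tip, float) - length * (rot(-1) @ u)
    return left, right

left, right = arrow_head_corners((0, 0), (100, 0))
print(np.round(left, 2), np.round(right, 2))
```

For a horizontal arrow ending at (100, 0) the corners land symmetrically at x ≈ 84.41, y = ±9.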
--------------------------------------------------------------------------------
/tracker/README.md:
--------------------------------------------------------------------------------
1 | # PC Tracker User Manual
2 |
3 | \[ English | [中文](README_zh.md) \]
4 |
5 | - Version: 1.0
6 | - Last updated: 2024-12-25
7 |
8 | ## 1. Introduction
9 |
10 | PC Tracker is a lightweight infrastructure for efficiently collecting large-scale human-computer interaction trajectories. The program runs seamlessly in the background, automatically capturing screenshots and keyboard & mouse activities.
11 |
12 | Below is an example of the collected human-computer interaction trajectories:
13 |
14 | 
15 |
16 | ## 2. Installation
17 |
18 | - Ensure your operating system is Windows.
19 | - Extract our software package to a location with sufficient disk space (more than 3GB of available space is recommended for storing recorded data).
20 |
21 | ## 3. Quick Start
22 |
23 | - [Optional] Set screen resolution to 16:9 (recommended 1920 x 1080).
24 | - Open the extracted folder and launch `main.exe`.
25 |
26 | ## 4. Instructions
27 |
28 | After starting the software, you can choose between **Task Oriented Mode** or **Non-Task Oriented Mode** for recording.
29 |
30 | ### Task Oriented Mode
31 |
32 | This mode is divided into two sub-modes: **Given Task** and **Free Task**.
33 |
34 | #### Given Task
35 |
36 | In this mode, you will be assigned an uncompleted task each time.
37 |
38 | - **Next Task**: Click `Next Task` to get the next task.
39 | - **Previous Task**: Click `Previous Task` to return to the previous task.
40 | - **Bad Task Feedback**: If you think the current task is difficult to complete, click `Bad Task` to discard it permanently. Alternatively, you can start the task and modify its description after completion based on your actual execution.
41 | - **Start Recording**: Click `Start`, and the tracker window will automatically minimize while recording begins.
42 | - **End Task**: After completing the task, click `Finish` to save the record. If the task execution fails or you don't want to keep the record, click `Fail`.
43 | - **Modify Task Description**: After finishing the task, you can modify the task description based on your actual execution.
44 |
45 | #### Free Task
46 |
47 | In this mode, you can freely use the computer and summarize the task description and difficulty yourself.
48 |
49 | - **Start Recording**: Click `Start`, and the tracker window will automatically minimize while recording begins.
50 | - **Save and Summarize This Record**: Fill in the task description, select difficulty (easy/medium/hard), and click `Save` to save the record.
51 | - **Discard This Record**: Click `Discard` to discard the record.
52 |
53 | ### Non-Task Oriented Mode
54 |
55 | In this mode, you can freely use the computer, with similar methods to start and stop recording as described above.
56 |
57 | ## 5. Usage Notes
58 |
59 | - **Extended screens are not currently supported**.
60 | - **Chinese input methods are not currently supported**.
61 | - **Touchpads are not currently supported**.
62 | - **The tracker window is fixed in fullscreen.** To support the filtering of tracker-related actions (such as clicking the Start button) in post-processing, the tracker window is fixed in fullscreen. You can reopen the tracker window to view the task description and then minimize it again, but please do not drag it into a non-fullscreen state.
63 |
64 | ## 6. Data Privacy
65 |
66 | - After starting recording, your screenshots and keyboard & mouse operations will be automatically recorded. PC Tracker does not record any information from unopened software. If you believe the recording may infringe on your privacy, you can choose to discard the record.
67 | - Collected data will be saved in the `./events` folder (hidden by default). Each trajectory comes with a Markdown file for easy visualization.
68 |
69 | ## 7. FAQ
70 |
71 | **1. Does the software have networking capabilities?**
72 |
73 | PC Tracker is completely local, does not support networking, and will not upload your data.
74 |
75 | **2. What if my computer screen resolution is not 16:9?**
76 |
77 | If your screen resolution is not 16:9, it will affect the subsequent unified processing of data. We recommend adjusting your screen resolution to 16:9.
78 |
79 | **3. How much space will the collected data approximately occupy?**
80 |
81 | The exact data size varies. Generally, even an hour of intensive recording will not generate more than 1GB of data.
82 |
83 | **4. What should I do if the interface doesn't display properly after launching the tracker?**
84 |
85 | If some interface elements (such as buttons) appear incomplete after launching the software, this may be caused by your computer's display scaling settings. You can try adjusting the display scaling in Settings -> System -> Display, and then restart the software.
86 |
87 | ## 8. Contact
88 |
89 | If you have any questions, please contact us at henryhe_sjtu@sjtu.edu.cn or zizi0123@sjtu.edu.cn.
90 |
--------------------------------------------------------------------------------
/tracker/README_zh.md:
--------------------------------------------------------------------------------
1 | # PC Tracker User Manual
2 |
3 | \[ [English](./README.md) | 中文 \]
4 |
5 | - Version: 1.0
6 | - Last updated: 2024-12-25
7 |
8 | ## 1. Introduction
9 |
10 | PC Tracker is a lightweight tool for efficiently collecting large-scale real human-computer interaction trajectories. Like a screen recorder, PC Tracker runs seamlessly in the background, automatically capturing screenshots and keyboard & mouse operations. Below is an example of a collected human-computer interaction trajectory:
11 |
12 | ![image](../assets/raw_trajectory_example.png)
13 |
14 | ## 2. Installation
15 |
16 | - Ensure your computer runs the Windows operating system.
17 | - Extract our software package to a location with sufficient disk space (more than 3GB of available space is recommended for storing recorded data).
18 |
19 | ## 3. Quick Start
20 |
21 | - [Optional] Set the screen resolution to a 16:9 aspect ratio (1920 x 1080 recommended).
22 | - Open the extracted folder and launch main.exe.
23 |
24 | ## 4. Instructions
25 |
26 | After starting the software, you can choose between **Task Oriented Mode** and **Non-Task Oriented Mode** for recording.
27 |
28 | ### Task Oriented Mode
29 |
30 | This mode has two sub-modes: **Given Task** and **Free Task**.
31 |
32 | #### Given Task
33 |
34 | In this mode, you will be assigned an uncompleted task each time.
35 |
36 | - **Next Task**: Click `Next Task` to get the next task.
37 | - **Previous Task**: Click `Previous Task` to return to the previous task.
38 | - **Bad Task Feedback**: If you find the current task difficult to complete, click `Bad Task` to discard it permanently. Alternatively, you can start the task and modify its description afterwards based on your actual execution.
39 | - **Start Recording**: Click `Start`; the software window will automatically minimize and recording will begin.
40 | - **End Task**: After completing the task, click `Finish` to save the record. If the task execution fails or you don't want to keep the record, click `Fail`.
41 | - **Modify Task Description**: After finishing the task, you can modify the task description based on your actual execution.
42 |
43 | #### Free Task
44 |
45 | In this mode, you can use the computer freely and summarize the task description and difficulty yourself.
46 |
47 | - **Start Recording**: Click `Start`; the software window will automatically minimize and recording will begin.
48 | - **Save and Summarize This Record**: Fill in the task description, select a difficulty (easy/medium/hard), and click `Save` to save the record.
49 | - **Discard This Record**: Click `Discard` to discard the record.
50 |
51 | ### Non-Task Oriented Mode
52 |
53 | In this mode, you can use the computer freely and choose when to start and stop recording.
54 |
55 | ## 5. Usage Notes
56 |
57 | - **Extended screens are not currently supported**.
58 | - **Chinese input methods are not currently supported**.
59 | - **Touchpads are not currently supported**.
60 | - **The tracker window is fixed in fullscreen**. To support the filtering of PC Tracker-related actions (such as clicking the Start button) in post-processing, the software window is fixed in fullscreen. You can reopen the PC Tracker window to view the task description and then minimize it again, but please do not drag it into a non-fullscreen state.
61 |
62 | ## 6. Data Privacy
63 |
64 | - After recording starts, your screenshots and keyboard & mouse operations will be recorded automatically. PC Tracker does not record any information from software that has not been opened. If you believe a recording may infringe on your privacy, you can choose to discard it.
65 | - Collected data is saved in the `./events` folder (hidden by default). A Markdown visualization file is provided for every recorded trajectory.
66 |
67 | ## 7. FAQ
68 |
69 | **1. Does the software have networking capabilities?**
70 |
71 | PC Tracker is completely local; it does not support networking and will not upload your data.
72 |
73 | **2. What happens if my screen resolution is not 16:9?**
74 |
75 | If your screen resolution is not 16:9, it will affect the subsequent unified processing of the data. We recommend adjusting your screen resolution to 16:9.
76 |
77 | **3. How much space will the collected data occupy?**
78 |
79 | The exact data size varies. Generally, even an hour of intensive recording will not generate more than 1GB of data.
80 |
81 | **4. Why does the software interface display incorrectly?**
82 |
83 | Abnormal display of the tracker interface (such as partially hidden buttons) may be caused by your computer's display scaling settings. You can adjust the scaling in Settings -> System -> Display and then reopen the tracker.
84 |
85 | ## 8. Contact
86 |
87 | If you have any questions, please contact us at henryhe_sjtu@sjtu.edu.cn or zizi0123@sjtu.edu.cn.
88 |
--------------------------------------------------------------------------------
/tracker/capturer.py:
--------------------------------------------------------------------------------
1 | import threading
2 | import time
3 |
4 | import pyautogui
5 | import win32con
6 | import win32gui
7 | import win32ui
8 |
9 | screen_size = pyautogui.size()
10 |
11 |
12 | class ScreenCapturer:
13 | def __init__(self):
14 | self.hwindow = win32gui.GetDesktopWindow()
15 |
16 | def capture(self):
17 | # dc: device context
18 | window_dc = win32gui.GetWindowDC(self.hwindow)
19 | img_dc = win32ui.CreateDCFromHandle(window_dc)
20 | mem_dc = img_dc.CreateCompatibleDC()
21 | # Create a bitmap object
22 | screenshot = win32ui.CreateBitmap()
23 | # Create a bitmap compatible with the device context and set its width and height
24 | screenshot.CreateCompatibleBitmap(img_dc, screen_size[0], screen_size[1])
25 | # Select the bitmap into the memory device context
26 | mem_dc.SelectObject(screenshot)
27 | # Perform a bit block transfer
28 | mem_dc.BitBlt((0, 0), screen_size, img_dc, (0, 0), win32con.SRCCOPY)
29 | # screenshot: bitmap byte stream
30 | bits = screenshot.GetBitmapBits(True)
31 | # Release resources
32 | mem_dc.DeleteDC()
33 | win32gui.ReleaseDC(self.hwindow, window_dc)
34 | win32gui.DeleteObject(screenshot.GetHandle())
35 | return bits
36 |
37 |
38 | capturer = ScreenCapturer()
39 |
40 |
41 | class RecentScreen:
42 | def __init__(self, capture_interval=0.1):
43 | self.screenshot = capturer.capture()
44 | self.capture_interval = capture_interval
45 | self.lock = threading.Lock()
46 | self.refresh_thread = threading.Thread(target=self.refreshing)
47 | self.refresh_thread.daemon = True
48 | self.refresh_thread.start()
49 |
50 | def refreshing(self):
51 | while True:
52 | screenshot = capturer.capture()
53 | with self.lock:
54 | self.screenshot = screenshot
55 | time.sleep(self.capture_interval)
56 |
57 | def get(self):
58 | with self.lock:
59 | return self.screenshot
60 |
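`RecentScreen`'s pattern — a daemon thread that keeps refreshing a lock-protected snapshot while readers fetch the latest copy via `get()` — generalizes beyond screen capture. A minimal, platform-independent sketch with a fake producer standing in for the Win32 capture (class and names are illustrative):

```python
import threading
import time

class LatestValue:
    """Background thread keeps the freshest result of an expensive call."""

    def __init__(self, produce, interval=0.05):
        self._produce = produce
        self._value = produce()  # initial snapshot, like RecentScreen.__init__
        self._lock = threading.Lock()
        t = threading.Thread(target=self._refresh, args=(interval,), daemon=True)
        t.start()

    def _refresh(self, interval):
        while True:
            value = self._produce()   # expensive work done outside the lock
            with self._lock:
                self._value = value   # swap the snapshot under the lock
            time.sleep(interval)

    def get(self):
        with self._lock:
            return self._value

counter = iter(range(10**6))
latest = LatestValue(lambda: next(counter))
time.sleep(0.2)  # let the refresh thread run a few times
print(latest.get() > 0)  # True: the background thread has refreshed at least once
```

Producing the snapshot outside the lock keeps readers responsive: `get()` only blocks for the duration of an attribute swap, never for a full capture.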
--------------------------------------------------------------------------------
/tracker/fs.py:
--------------------------------------------------------------------------------
1 | import os
2 | import shutil
3 | import ctypes
4 | import subprocess
5 | import time
6 |
7 |
8 | def ensure_folder(folder_path):
9 | # Ensure the folder exists, create if it doesn't
10 | if not os.path.exists(folder_path):
11 | os.makedirs(folder_path)
12 |
13 |
14 | def hide_folder(folder_path):
15 | # Set folder attribute to hidden
16 | FILE_ATTRIBUTE_HIDDEN = 0x02
17 | ctypes.windll.kernel32.SetFileAttributesW(folder_path, FILE_ATTRIBUTE_HIDDEN)
18 |
19 |
20 | def set_hidden_file(file_path, hide=True):
21 | # Set or remove hidden attribute
22 | if hide:
23 | os.system(f'attrib +h "{file_path}"')
24 | else:
25 | os.system(f'attrib -h "{file_path}"')
26 |
27 |
28 | class FolderReset:
29 | def __init__(self, folder_path="./files", backup_path="./__files__"):
30 | self.folder_path = folder_path # Visible folder
31 | self.backup_path = backup_path # Backup hidden folder
32 | if self.backup_path and not os.path.exists(backup_path):
33 | self.backup_path = None # No backup path
34 | self.reset()
35 |
36 | def reset(self):
37 | if self.backup_path:
38 | # Ensure the folder exists
39 | ensure_folder(self.folder_path)
40 |
41 | # Attempt to clear the original folder
42 | MAX_RETRIES = 2
43 | for attempt in range(MAX_RETRIES):
44 | try:
45 | # On Windows, use system command to delete
46 | subprocess.call(f'rd /s /q "{self.folder_path}"', shell=True)
47 | break # Exit loop if successfully deleted
48 | except PermissionError as e:
49 |                     print(f"PermissionError in folder reset: {e}. Retrying ({attempt + 1}/{MAX_RETRIES})...")
50 | time.sleep(3)
51 | else:
52 | print("WARNING: Failed to clear folder after multiple attempts.")
53 | return # Exit method, do not proceed
54 |
55 | # Restore backup content, make folder writable
56 | shutil.copytree(self.backup_path, self.folder_path)
57 |
58 | # else: No backup path, no reset needed
59 |
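`FolderReset.reset` relies on Python's `for`/`else`: the `else` branch runs only when the loop finishes without hitting `break`, i.e. when every attempt failed. A minimal, self-contained sketch of that retry shape (the helper name is hypothetical):

```python
import time

def clear_with_retries(clear, max_retries=2, delay=0.0):
    # Mirrors FolderReset.reset's structure: break on success,
    # for/else fires only if all attempts raised.
    for attempt in range(max_retries):
        try:
            clear()
            break  # success, skip the else branch
        except PermissionError as e:
            print(f"PermissionError: {e}. Retrying ({attempt + 1}/{max_retries})...")
            time.sleep(delay)
    else:
        print("WARNING: failed after all retries")
        return False
    return True

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise PermissionError("locked")

print(clear_with_retries(flaky))  # first attempt fails, second succeeds -> True
```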
60 |
61 | def delete_file(file_path):
62 | if os.path.exists(file_path):
63 | try:
64 | os.remove(file_path)
65 | except OSError as e:
66 | print(f"Failed to delete {file_path}: {e}")
67 |
68 |
69 | def delete_folder(folder_path):
70 | try:
71 | if os.name == 'nt':
72 | # On Windows, use system command to delete
73 | subprocess.call(f'rd /s /q "{folder_path}"', shell=True)
74 | else:
75 | # On other systems, use shutil to delete
76 | shutil.rmtree(folder_path)
77 | except OSError as e:
78 | print(f"Failed to delete {folder_path}: {e}")
79 |
--------------------------------------------------------------------------------
/tracker/main.py:
--------------------------------------------------------------------------------
1 | import multiprocessing
2 | from tracker import Tracker
3 | from task import Task
4 | import tkinter as tk
5 | import tkinter.font as tkFont
6 | import tkinter.ttk as ttk
7 | from tkinter import messagebox # Import messagebox
8 |
9 |
10 | class TrackerApp:
11 | def __init__(self, root):
12 | self.root = root
13 | self.root.title("Tracker")
14 | self.tracker = Tracker()
15 |
16 | # Set window size to screen size and maximize window
17 | self.root.geometry(
18 | f"{self.root.winfo_screenwidth()}x{self.root.winfo_screenheight()}")
19 | self.root.state("zoomed") # Maximize window
20 |
21 | self.root.configure(bg="#f0f0f0") # Set background color
22 |         # The window size is fixed: disable resizing
23 |         self.root.resizable(False, False)
24 |
25 | # Font
26 | self.title_font = tkFont.Font(
27 | family="Helvetica", size=26, weight="bold")
28 | self.label_font = tkFont.Font(family="Helvetica", size=18)
29 | self.button_font = tkFont.Font(family="Arial", size=12)
30 | self.text_font = tkFont.Font(family="Helvetica", size=12)
31 |
32 | # Label
33 | self.title_label = tk.Label(
34 | root, text="", font=self.label_font,
35 | wraplength=400, bg="#f0f0f0", fg="#555555")
36 | self.title_label.pack(pady=(20, 40))
37 |
38 | # Intercept close button click event
39 | self.root.protocol("WM_DELETE_WINDOW", self.quit_app)
40 |
41 | # Create initial interface
42 | self.initial_interface()
43 |
44 | def quit_app(self):
45 | self.tracker.stop()
46 | self.tracker.update_tasks()
47 | self.root.destroy()
48 |
49 | """
50 | Interface Definitions
51 | """
52 |
53 | def initial_interface(self):
54 | self.clear_interface()
55 | self.title_label.config(text="Welcome to PC Tracker!")
56 |
57 | self.task_button = tk.Button(
58 | self.root, text="Task Oriented Mode", command=self.task_oriented_interface,
59 | width=25, height=2, font=self.button_font)
60 | self.task_button.pack(pady=10)
61 | ToolTip(self.task_button, "Tracking with a specific task")
62 |
63 | self.non_task_button = tk.Button(
64 | self.root, text="Non-Task Oriented Mode", command=self.non_task_oriented_interface,
65 | width=25, height=2, font=self.button_font)
66 | self.non_task_button.pack(pady=10)
67 | ToolTip(self.non_task_button,
68 | "Tracking while using computer freely")
69 |
70 | def task_oriented_interface(self):
71 | self.clear_interface()
72 | self.title_label.config(text="Task Oriented Mode")
73 |
74 | self.given_task_button = tk.Button(
75 | self.root, text="Given Task", command=self.next_given_task_interface,
76 | width=15, height=2, font=self.button_font)
77 | self.given_task_button.pack(pady=10)
78 | ToolTip(self.given_task_button, "Complete given task")
79 |
80 | self.free_task_button = tk.Button(
81 | self.root, text="Free Task", command=self.free_task_interface,
82 | width=15, height=2, font=self.button_font)
83 | self.free_task_button.pack(pady=10)
84 | ToolTip(self.free_task_button,
85 | "Freely use pc and summarize the tasks completed on your own.")
86 |
87 | self.back_button = tk.Button(
88 | self.root, text="Back", command=self.initial_interface,
89 | width=15, height=2, font=self.button_font)
90 | self.back_button.pack(pady=10)
91 |
92 | def non_task_oriented_interface(self):
93 | self.clear_interface()
94 | self.title_label.config(text="Non-Task Oriented Mode")
95 |
96 | self.start_button = tk.Button(
97 | self.root, text="Start", command=self.start_non_task_tracking,
98 | width=15, height=2, font=self.button_font)
99 | self.start_button.pack(pady=10)
100 |
101 | self.back_button = tk.Button(
102 | self.root, text="Back", command=self.initial_interface,
103 | width=15, height=2, font=self.button_font)
104 | self.back_button.pack(pady=10)
105 |
106 | def free_task_interface(self):
107 | self.clear_interface()
108 | self.title_label.config(text="Free Task Mode")
109 |
110 | self.subtitle_label = tk.Label(
111 | self.root, text="Freely use pc and summarize the tasks completed on your own.", font=("Arial", 15),
112 | wraplength=750)
113 | self.subtitle_label.pack(pady=(0, 30))
114 |
115 | self.start_button = tk.Button(
116 | self.root, text="Start", command=self.start_free_task_tracking,
117 | width=15, height=2, font=self.button_font)
118 | self.start_button.pack(pady=10)
119 |
120 | self.back_button = tk.Button(
121 | self.root, text="Back", command=self.task_oriented_interface,
122 | width=15, height=2, font=self.button_font)
123 | self.back_button.pack(pady=10)
124 |
125 | self.corner_label = tk.Label(self.root, text=f"You have finished {self.tracker.finished_free_cnt} free tasks.",
126 | font=("Arial", 14), bg="#f0f0f0")
127 | self.corner_label.pack(side="bottom", anchor="se", padx=30, pady=30)
128 |
129 | def next_given_task_interface(self):
130 | self.given_task_interface(offset=1)
131 |
132 | def current_given_task_interface(self):
133 | self.given_task_interface(offset=0)
134 |
135 | def previous_given_task_interface(self):
136 | self.given_task_interface(offset=-1)
137 |
138 | def given_task_interface(self, offset):
139 | if self.tracker.finish_all():
140 | messagebox.showinfo(
141 | "Task Completed", "All tasks have been finished!")
142 | self.initial_interface()
143 | else:
144 | self.clear_interface()
145 | self.title_label.config(text="Given Task Mode")
146 |
147 | self.tracker.get_given_task(offset)
148 |
149 | self.subtitle_label = tk.Label(
150 | self.root, text=f"Category: {self.tracker.task.category}", font=("Arial", 15), wraplength=750)
151 | self.subtitle_label.pack(pady=(0, 30))
152 |
153 | self.corner_label = tk.Label(self.root,
154 | text=f"You have finished {self.tracker.finished_given_cnt} given tasks.",
155 | font=("Arial", 14), bg="#f0f0f0")
156 | self.corner_label.pack(side="bottom", anchor="se", padx=30, pady=30)
157 |
158 | # Create a Canvas widget for the rounded rectangle
159 | canvas = tk.Canvas(self.root, width=1500,
160 | height=510, bg="#f0f0f0", highlightthickness=0)
161 | canvas.pack(pady=5, padx=30, anchor="center")
162 |
163 | # Draw a rounded rectangle
164 | create_roundrectangle(
165 | canvas, 20, 0, 1480, 500, radius=30, fill="#ffffff", outline="#cccccc")
166 |
167 | # Add task description text to the canvas
168 | canvas.create_text(30, 10, text=self.tracker.task.description, font=self.text_font,
169 | width=1450, anchor="nw")
170 |
171 | # Create a frame to hold the buttons
172 | button_frame = tk.Frame(self.root, bg="#f0f0f0")
173 | button_frame.pack(pady=20)
174 |
175 | # left column
176 | self.previous_button = tk.Button(button_frame, text="Previous Task",
177 | command=self.previous_given_task_interface,
178 | width=15, height=1, font=self.button_font)
179 | self.previous_button.grid(row=0, column=0, padx=20, pady=10)
180 |
181 | self.next_button = tk.Button(button_frame, text="Next Task", command=self.next_given_task_interface,
182 | width=15, height=1, font=self.button_font)
183 | self.next_button.grid(row=1, column=0, padx=20, pady=10)
184 |
185 | # right column
186 | self.start_button = tk.Button(button_frame, text="Start", command=self.start_given_task_tracking,
187 | width=15, height=1, font=self.button_font)
188 | self.start_button.grid(row=0, column=1, padx=20, pady=10)
189 | ToolTip(self.start_button, "Start tracking with this task")
190 |
191 | self.bad_task_button = tk.Button(button_frame, text="Bad Task", command=self.mark_bad_task,
192 | width=15, height=1, font=self.button_font)
193 | self.bad_task_button.grid(row=1, column=1, padx=20, pady=10)
194 |
195 | # back button centered below the other buttons with the same size
196 | self.back_button = tk.Button(button_frame, text="Back", command=self.task_oriented_interface,
197 | width=15, height=1, font=self.button_font)
198 | self.back_button.grid(
199 | row=2, column=0, columnspan=2, padx=20, pady=20)
200 |
201 | def modify_description_interface(self):
202 | self.clear_interface()
203 | self.title_label.config(text="Modify Task Description")
204 |
205 | # Add multi-line input text box and set initial content to task description
206 | self.entry = tk.Text(self.root, font=self.text_font,
207 | width=120, height=20) # Adjust width and height
208 | # Set initial content, "1.0" represents the first character position of the first line
209 | self.entry.insert("1.0", self.tracker.task.description)
210 | self.entry.pack(pady=(10, 10)) # Leave 25 pixels at top, 10 pixels at bottom
211 |
212 | self.save_button = tk.Button(
213 | self.root, text="Save", command=self.save_modified_description,
214 | width=15, height=2, font=self.button_font)
215 | self.save_button.pack(pady=10)
216 |
217 | self.cancel_button = tk.Button(
218 | self.root, text="Cancel", command=self.cancel_modify_description,
219 | width=15, height=2, font=self.button_font)
220 | self.cancel_button.pack(pady=10)
221 |
222 | def clear_interface(self):
223 | for widget in self.root.winfo_children():
224 | if widget != self.title_label:
225 | widget.destroy()
226 |
227 | """
228 | Given Task Mode Functions
229 | """
230 |
231 | def start_given_task_tracking(self):
232 | self.clear_interface()
233 | self.tracker.start()
234 |
235 | self.title_label.config(text="Tracking...")
236 | self.title_label.pack(pady=(30, 10))
237 |
238 | canvas_width = 1500 # Adjusted for padding
239 | canvas_height = 680
240 |         text_font = ("Helvetica", 15)
241 |
242 |         # Create a Canvas widget for the rounded rectangle
243 |         canvas = tk.Canvas(self.root, width=canvas_width,
244 |                            height=canvas_height, bg="#f0f0f0", highlightthickness=0)
245 |         canvas.pack(pady=5, padx=30, anchor="center")
246 |
247 |         # Draw a rounded rectangle
248 |         create_roundrectangle(
249 |             canvas, 20, 0, 1480, 650, radius=30, fill="#ffffff", outline="#cccccc")
250 |
251 |         # Add task description text to the canvas
252 |         canvas.create_text(30, 10, text=self.tracker.task.description, font=text_font,
253 |                            width=1450, anchor="nw")
254 |
255 | self.finish_button = tk.Button(
256 | self.root, text="Finish", command=self.finish_given_task,
257 | width=15, height=2, font=self.button_font)
258 | self.finish_button.pack(pady=10)
259 | ToolTip(self.finish_button, "Task finished")
260 |
261 | self.fail_button = tk.Button(
262 | self.root, text="Fail", command=self.fail_given_task,
263 | width=15, height=2, font=self.button_font)
264 | self.fail_button.pack(pady=10)
265 | ToolTip(self.fail_button, "Task failed")
266 |
267 | print("Task oriented tracking started...")
268 |
269 | self.root.iconify() # Minimize window
270 |
271 | def finish_given_task(self):
272 | self.tracker.stop_without_task()
273 | if messagebox.askyesno("Modify description", "Do you want to modify the description?"):
274 | self.modify_description_interface()
275 | else:
276 | self.after_finish_given_task()
277 |
278 | def after_finish_given_task(self):
279 | self.tracker.finish()
280 | if self.tracker.finish_all():
281 | messagebox.showinfo(
282 | "Task Completed", "All tasks have been finished!")
283 | self.initial_interface()
284 | else:
285 |             self.next_given_task_interface()  # proceed to the next task
286 |
287 | def fail_given_task(self):
288 | self.tracker.fail()
289 | print("Task failed.")
290 |
291 | if messagebox.askyesno("Confirm Task Failure", "Do you want to discard the record?"):
292 | self.tracker.discard()
293 |
294 |         self.current_given_task_interface()  # back to the current task interface
295 |
296 | def mark_bad_task(self):
297 | # Show a confirmation dialog
298 | confirm = messagebox.askyesno(
299 | "Confirm Bad Task",
300 | "Mark the current task as bad?\nThe task you marked as a bad task will be permanently discarded."
301 | )
302 |
303 | if confirm:
304 | # Mark the task as bad if the user confirms
305 | self.tracker.task.is_bad = True
306 | self.tracker.bad_task_cnt += 1
307 | self.next_given_task_interface()
308 |
309 | def save_modified_description(self):
310 | entry_text = self.entry.get("1.0", "end-1c")
311 | if not entry_text:
312 | messagebox.showwarning(
313 | "Input Error", "Please enter your task description")
314 | return
315 |
316 | self.tracker.task.description = entry_text
317 | self.after_finish_given_task()
318 |
319 | def cancel_modify_description(self):
320 |         if messagebox.askyesno("Confirm Cancel", "Do you want to discard the record?"):
321 | self.tracker.discard()
322 |             self.current_given_task_interface()  # back to the current task interface
323 | else:
324 | self.after_finish_given_task()
325 |
326 | """
327 | Free Task Mode Functions
328 | """
329 |
330 | def start_free_task_tracking(self):
331 | self.clear_interface()
332 | self.tracker.get_free_task()
333 | self.tracker.start()
334 |
335 | self.title_label.config(text="Tracking...")
336 | self.title_label.pack(pady=(30, 10))
337 |
338 | self.stop_button = tk.Button(
339 | self.root, text="Stop", command=self.stop_free_task_tracking,
340 | width=15, height=2, font=self.button_font)
341 | self.stop_button.pack(pady=10)
342 |
343 | self.root.iconify() # Minimize window
344 |
345 | def stop_free_task_tracking(self):
346 | self.tracker.stop_without_task()
347 | self.clear_interface()
348 | self.title_label.config(text="")
349 |
350 | # Create info label
351 | self.description_label = tk.Label(
352 | self.root, text="Please enter task description:", font=("Helvetica", 15), bg="#f0f0f0")
353 | self.description_label.pack(pady=(5, 5))
354 |
355 | # Add input text box
356 | self.entry = tk.Text(self.root, font=self.text_font,
357 | width=120, height=20) # Adjust width and height
358 |         self.entry.pack(pady=(10, 10))  # 10 pixels of padding above and below
359 |
360 | # Level dropdown box
361 | self.level_var = tk.StringVar(value="easy") # Default select "easy"
362 | self.level_label = tk.Label(
363 | self.root, text="Please select task level:", font=("Helvetica", 15), bg="#f0f0f0")
364 | self.level_label.pack(pady=(10, 5))
365 |
366 | # Create and configure style
367 | style = ttk.Style()
368 | style.configure('TMenubutton', font=('Helvetica', 14))
369 |
370 | self.level_menu = ttk.OptionMenu(
371 | self.root, self.level_var, "easy", "medium", "hard", style='TMenubutton')
372 | self.level_menu.config(width=15) # Increase width
373 | self.level_menu.pack(pady=(5, 20))
374 |
375 | # Set larger font for dropdown menu options
376 | menu = self.level_menu["menu"]
377 | for index in range(menu.index("end") + 1):
378 | menu.entryconfig(index, font=tkFont.Font(
379 | family="Helvetica", size=14))
380 |
381 | # Save button
382 | self.save_button = tk.Button(self.root, text="Save", command=self.save_free_task,
383 | width=15, height=1, font=self.button_font)
384 | self.save_button.pack(pady=(10, 20))
385 |
386 | # Discard button
387 | self.discard_button = tk.Button(
388 | self.root, text="Discard", command=self.discard_free_task,
389 | width=15, height=1, font=self.button_font)
390 | self.discard_button.pack(pady=(10, 20))
391 |
392 | def save_free_task(self): # Save user-defined task
393 | entry_text = self.entry.get("1.0", "end-1c")
394 | selected_level = self.level_var.get()
395 |
396 | if not entry_text:
397 | messagebox.showwarning(
398 | "Input Error", "Please enter your task description")
399 | return
400 |
401 | task = Task(entry_text, 0, selected_level)
402 | self.tracker.save_free_task(task)
403 | self.free_task_interface()
404 |
405 | def discard_free_task(self): # Discard user-defined task record
406 | self.tracker.discard()
407 | self.free_task_interface()
408 |
409 | def discard_non_task(self): # Discard non-task oriented mode record
410 | self.tracker.discard()
411 | self.non_task_oriented_interface()
412 |
413 | """
414 | Non Task-Oriented Mode Functions
415 | """
416 |
417 | def start_non_task_tracking(self):
418 | self.clear_interface()
419 | self.tracker.start()
420 |
421 | self.title_label.config(text="Tracking...")
422 | self.title_label.pack(pady=30)
423 |
424 | self.stop_button = tk.Button(
425 | self.root, text="Stop", command=self.stop_non_task_tracking,
426 | width=15, height=2, font=self.button_font)
427 | self.stop_button.pack(pady=30)
428 |
429 | print("Non-task oriented tracking started...")
430 |
431 | self.root.iconify() # Minimize window
432 |
433 | def stop_non_task_tracking(self):
434 | self.tracker.stop()
435 | self.clear_interface()
436 |
437 | self.save_button = tk.Button(
438 | self.root, text="Save", command=self.non_task_oriented_interface,
439 | width=15, height=2, font=self.button_font)
440 | self.save_button.pack(pady=30)
441 |
442 | self.discard_button = tk.Button(
443 | self.root, text="Discard", command=self.discard_non_task,
444 | width=15, height=2, font=self.button_font)
445 | self.discard_button.pack(pady=30)
446 |
447 | print("Non-task oriented tracking stopped.")
448 |
449 |
450 | """
451 | Tools
452 | """
453 |
454 |
455 | class ToolTip:
456 | def __init__(self, widget, text):
457 | self.widget = widget
458 | self.text = text
459 | self.tooltip = None
460 |         self.widget.bind("<Enter>", self.show_tooltip)
461 |         self.widget.bind("<Leave>", self.hide_tooltip)
462 |
463 | def show_tooltip(self, event):
464 | x, y, _, _ = self.widget.bbox("insert")
465 | x += self.widget.winfo_rootx() + 55
466 | y += self.widget.winfo_rooty() + 55
467 | self.tooltip = tk.Toplevel(self.widget)
468 | self.tooltip.wm_overrideredirect(True)
469 | self.tooltip.wm_geometry(f"+{x}+{y}")
470 | label = tk.Label(self.tooltip, text=self.text,
471 | background="#f7f7f7", relief="solid", borderwidth=1)
472 | label.pack()
473 |
474 | def hide_tooltip(self, event):
475 | if self.tooltip:
476 | self.tooltip.destroy()
477 | self.tooltip = None
478 |
479 |
480 | def create_roundrectangle(canvas, x1, y1, x2, y2, radius=25, **kwargs):
481 | points = [x1 + radius, y1,
482 | x1 + radius, y1,
483 | x2 - radius, y1,
484 | x2 - radius, y1,
485 | x2, y1,
486 | x2, y1 + radius,
487 | x2, y1 + radius,
488 | x2, y2 - radius,
489 | x2, y2 - radius,
490 | x2, y2,
491 | x2 - radius, y2,
492 | x2 - radius, y2,
493 | x1 + radius, y2,
494 | x1 + radius, y2,
495 | x1, y2,
496 | x1, y2 - radius,
497 | x1, y2 - radius,
498 | x1, y1 + radius,
499 | x1, y1 + radius,
500 | x1, y1]
501 |
502 | return canvas.create_polygon(points, **kwargs, smooth=True)
503 |
504 |
505 | def main():
506 | root = tk.Tk()
507 | TrackerApp(root)
508 | root.mainloop()
509 |
510 |
511 | if __name__ == "__main__":
512 | multiprocessing.freeze_support() # important for pyinstaller
513 | main()
514 |
--------------------------------------------------------------------------------
/tracker/main.spec:
--------------------------------------------------------------------------------
1 | # -*- mode: python ; coding: utf-8 -*-
2 |
3 |
4 | a = Analysis(
5 | ['main.py'],
6 | pathex=[],
7 | binaries=[],
8 | datas=[],
9 | hiddenimports=['comtypes', 'comtypes.gen', 'comtypes.client', 'comtypes.stream'],
10 | hookspath=[],
11 | hooksconfig={},
12 | runtime_hooks=[],
13 | excludes=[],
14 | noarchive=False,
15 | optimize=0,
16 | )
17 | pyz = PYZ(a.pure)
18 |
19 | exe = EXE(
20 | pyz,
21 | a.scripts,
22 | a.binaries,
23 | a.datas,
24 | [],
25 | name='tracker',
26 | debug=False,
27 | bootloader_ignore_signals=False,
28 | strip=False,
29 | upx=True,
30 | upx_exclude=[],
31 | runtime_tmpdir=None,
32 | console=False,
33 | disable_windowed_traceback=False,
34 | argv_emulation=False,
35 | target_arch=None,
36 | codesign_identity=None,
37 | entitlements_file=None,
38 | )
39 |
--------------------------------------------------------------------------------
/tracker/monitor.py:
--------------------------------------------------------------------------------
1 | import threading
2 | import time
3 | from enum import Enum
4 | from pynput import keyboard, mouse
5 | from pynput.keyboard import Key
6 | from recorder import Recorder
7 | from utils import *
8 |
9 | WAIT_INTERVAL = 6 # 6s per wait
10 | DOUBLE_CLICK_INTERVAL = 0.5 # 0.5s for double click
11 |
12 | HOT_KEY = [
13 | ["alt", "tab"], # Switch between running program windows
14 | ["alt", "f4"], # Close current window or program
15 | ["cmd", 'd'], # Show desktop
16 | ["cmd", 'e'], # Open file explorer
17 | ["cmd", 'l'], # Lock computer
18 | ["cmd", 'r'], # Open run dialog
19 | ["cmd", 't'], # Cycle through taskbar programs
20 | ["cmd", 'x'], # Open advanced user menu (Start button right-click menu)
21 | ["cmd", "space"], # Switch input method
22 | ["cmd", 'i'], # Open Windows settings
23 | ["cmd", 'a'], # Open action center
24 | ["cmd", 's'], # Open search
25 | ["cmd", 'u'], # Open accessibility settings
26 | ["cmd", 'p'], # Open projection settings
27 | ["cmd", 'v'], # Open clipboard history
28 | ["cmd", "tab"], # Open task view
29 | ["shift", "delete"] # Permanently delete selected items (bypass recycle bin)
30 | ]
31 |
32 |
33 | def switch_caption(char):
34 | if char.isalpha() and get_capslock_state() == 1: # Caps lock is on
35 | if char.islower():
36 | return char.upper()
37 | else:
38 | return char.lower()
39 | else:
40 | return char
41 |
42 |
43 | class ActionType(Enum):
44 | CLICK = "click"
45 | RIGHT_CLICK = "right click"
46 | DOUBLE_CLICK = "double click"
47 | MOUSE_DOWN = "press"
48 | DRAG = "drag to"
49 | SCROLL = "scroll"
50 | KEY_DOWN = "press key"
51 | HOTKEY = "hotkey"
52 | TYPE = "type text"
53 | WAIT = "wait"
54 | FINISH = "finish"
55 | FAIL = "fail"
56 |
57 |
58 | class Action:
59 | def __init__(self, action_type: ActionType, **kwargs):
60 | self.action_type = action_type
61 | self.kwargs = kwargs
62 |
63 |     def __str__(self):
64 |         s = f"{self.action_type.value}"
65 |         if self.action_type in (ActionType.CLICK, ActionType.RIGHT_CLICK, ActionType.MOUSE_DOWN, ActionType.DOUBLE_CLICK):
66 |             # s += f" element: {self.kwargs['name']} at ({self.kwargs['x']}, {self.kwargs['y']})"
67 |             s += f" ({self.kwargs['x']}, {self.kwargs['y']})"
68 |         if self.action_type == ActionType.DRAG:
69 |             s += f" ({self.kwargs['x']}, {self.kwargs['y']})"
70 |         if self.action_type == ActionType.SCROLL:
71 |             s += f" ({self.kwargs['dx']}, {self.kwargs['dy']})"
72 |         if self.action_type == ActionType.KEY_DOWN:
73 |             s += f" {self.kwargs['key']}"
74 |         if self.action_type == ActionType.HOTKEY:
75 |             s += f" ({self.kwargs['key1']}, {self.kwargs['key2']})"
76 |         if self.action_type == ActionType.TYPE:
77 |             s += f": {self.kwargs['text']}"
78 |         return s
79 |
80 | def get_element(self):
81 | ele = self.kwargs.get('name')
82 | return ele if ele != "" else "Unknown"
83 |
84 |
85 | class Monitor:
86 | def __init__(self, task):
87 | self.recorder = Recorder(task)
88 |         self.type_buffer = TypeBuffer(self.recorder)  # Buffers consecutive keystrokes so they can be merged into one TYPE action
89 | self.timer = Timer(self.recorder, self.type_buffer)
90 | self.scroll_buffer = ScrollBuffer(self.recorder)
91 | self.keyboard_monitor = KeyboardMonitor(
92 | self.recorder, self.type_buffer, self.timer, self.scroll_buffer)
93 | self.mouse_monitor = MouseMonitor(
94 | self.recorder, self.type_buffer, self.timer, self.scroll_buffer)
95 |
96 | def start(self):
97 | self.keyboard_monitor.start()
98 | self.mouse_monitor.start()
99 | self.type_buffer.reset()
100 | self.timer.reset()
101 |
102 | def stop_without_md(self):
103 | self.keyboard_monitor.stop()
104 | self.mouse_monitor.stop()
105 | self.timer.stop()
106 | self.recorder.wait()
107 |
108 | def generate_md(self, task=None):
109 | self.recorder.generate_md(task)
110 |
111 | def stop(self):
112 | self.stop_without_md()
113 | self.generate_md()
114 |
115 | def finish(self):
116 | self.recorder.record_action(Action(ActionType.FINISH))
117 | self.stop()
118 |
119 | def finish_without_md(self):
120 | self.recorder.record_action(Action(ActionType.FINISH))
121 | self.stop_without_md()
122 |
123 | def fail(self):
124 | self.recorder.record_action(Action(ActionType.FAIL))
125 | self.stop()
126 |
127 | def discard_record(self):
128 | self.recorder.discard()
129 |
130 |
131 | class Timer:
132 | def __init__(self, recorder: Recorder, type_buffer):
133 | self.timer_inner = None
134 | self.recorder = recorder
135 | self.type_buffer = type_buffer
136 | self.reset()
137 |
138 | def reset(self):
139 | if self.timer_inner:
140 | self.timer_inner.cancel()
141 | self.timer_inner = threading.Timer(
142 | WAIT_INTERVAL, self.save_wait) # Start timing, execute save_wait after interval seconds
143 | self.timer_inner.start()
144 |
145 | def stop(self):
146 | if self.timer_inner:
147 | self.timer_inner.cancel()
148 |
149 | def save_wait(self):
150 | if not self.type_buffer.last_action_is_typing:
151 | self.recorder.record_action(Action(ActionType.WAIT))
152 | self.reset()
153 |
154 |
155 | class HotKeyBuffer:
156 | def __init__(self):
157 | self.buffer = []
158 |
159 | def add(self, key):
160 | self.buffer.append(key)
161 |
162 | def pop(self):
163 | if len(self.buffer) > 0:
164 | self.buffer.pop()
165 |
166 | def reset(self):
167 | self.buffer.clear()
168 |
169 |
170 | class TypeBuffer:
171 | def __init__(self, recorder: Recorder):
172 | self.recorder = recorder
173 | self.type_action_cnt = 0
174 | self.text = ""
175 | self.is_typing = False
176 | self.last_action_is_typing = False # Whether the last action could be typing
177 | self.last_action_is_shift = False
178 | self.pre_saved_type_event = None # for TYPE action
179 | self.events_buffer = [] # Buffer keyboard events before confirming typing
180 |
181 | def pre_save_type_event(self):
182 | self.pre_saved_type_event = self.recorder.get_event()
183 |
184 | def reset(self):
185 | # save buffer
186 | if self.is_typing and not self.is_empty():
187 | # At this time, there should be a pre_saved_type_event
188 | assert self.pre_saved_type_event is not None
189 | type_action = Action(ActionType.TYPE, text=self.text)
190 | self.pre_saved_type_event['action'] = type_action
191 | self.recorder.record_event(self.pre_saved_type_event)
192 | elif not self.is_typing:
193 | # self.recorder.save_all()
194 | # Record all previous operations that were cached
195 | for event in self.events_buffer:
196 | self.recorder.record_event(event)
197 |
198 | # reset type buffer
199 | self.text = ""
200 | self.is_typing = False
201 | self.last_action_is_typing = False
202 | self.last_action_is_shift = False
203 | self.pre_saved_type_event = None
204 | self.events_buffer.clear()
205 |
206 | def append(self, char):
207 | self.text += char
208 | if not self.is_typing:
209 | press_action = Action(ActionType.KEY_DOWN, key=char)
210 | # self.recorder.record(press_action)
211 | press_event = self.recorder.get_event(press_action)
212 | self.events_buffer.append(press_event)
213 |
214 | def add_type_related_action(self):
215 | # The typing operation is about to be added
216 | if len(self.text) >= 2 and not self.is_typing:
217 | self.is_typing = True # Enter typing state
218 | self.events_buffer.clear() # The previous recorded keyboard operations will be merged into TYPE, no need to record separately
219 |
220 | def backspace(self):
221 | if len(self.text) > 0:
222 | self.text = self.text[:-1]
223 | if not self.is_typing:
224 | backspace_action = Action(ActionType.KEY_DOWN, key="backspace")
225 | # self.recorder.record(backspace_action)
226 | backspace_event = self.recorder.get_event(backspace_action)
227 | self.events_buffer.append(backspace_event)
228 | else:
229 | self.reset()
230 | backspace_action = Action(ActionType.KEY_DOWN, key="backspace")
231 | self.recorder.record_action(backspace_action)
232 |
233 | def set_last_action_is_typing(self):
234 | self.last_action_is_typing = True
235 |
236 | def reset_last_action_is_typing(self):
237 | self.last_action_is_typing = False
238 |
239 | def set_last_action_is_shift(self):
240 | self.last_action_is_shift = True
241 |
242 | def reset_last_action_is_shift(self):
243 | self.last_action_is_shift = False
244 |
245 | def set_typing(self):
246 | self.is_typing = True
247 |
248 | def is_empty(self) -> bool:
249 | return len(self.text) == 0
250 |
251 |
252 | class ScrollBuffer:
253 | def __init__(self, recorder: Recorder):
254 | self.recorder = recorder
255 | self.dx = 0
256 | self.dy = 0
257 | self.pre_saved_scroll_event = None
258 | # self.empty = self.pre_saved_scroll_event is None
259 |
260 | def is_empty(self):
261 | return self.pre_saved_scroll_event is None
262 |
263 | def reset(self):
264 | if not self.is_empty() and (self.dx != 0 or self.dy != 0):
265 | scroll_action = Action(ActionType.SCROLL, dx=self.dx, dy=self.dy)
266 | self.pre_saved_scroll_event['action'] = scroll_action
267 | self.recorder.record_event(self.pre_saved_scroll_event)
268 | self.dx = 0
269 | self.dy = 0
270 | self.pre_saved_scroll_event = None
271 |
272 | def new(self, dx, dy):
273 | self.dx = dx
274 | self.dy = dy
275 | self.pre_saved_scroll_event = self.recorder.get_event()
276 |
277 | def add_delta(self, dx, dy):
278 | self.dx += dx
279 | self.dy += dy
280 |
281 |
282 | class KeyboardMonitor:
283 | def __init__(self, recorder: Recorder, type_buffer: TypeBuffer, timer: Timer, scroll_buffer: ScrollBuffer):
284 | self.recorder = recorder
285 | self.listener = keyboard.Listener(
286 | on_press=self.on_press, on_release=self.on_release)
287 | self.type_buffer = type_buffer
288 | self.timer = timer
289 | self.scroll_buffer = scroll_buffer
290 | self.hotkey_buffer = HotKeyBuffer()
291 |
292 | def start(self):
293 | self.listener.start()
294 |
295 | def stop(self):
296 | self.listener.stop()
297 |
298 | def on_press(self, key: Key):
299 | try:
300 | # Keyboard operation triggers timer and scroll buffer reset
301 | self.timer.reset()
302 | self.scroll_buffer.reset()
303 |
304 | # Record whether this operation is related to typing
305 | if is_related_to_type(key):
306 | self.type_buffer.set_last_action_is_typing()
307 | self.type_buffer.add_type_related_action()
308 | else:
309 | self.type_buffer.reset_last_action_is_typing()
310 |
311 | # Record whether the last key pressed was the shift key
312 | if key == Key.shift:
313 | self.type_buffer.set_last_action_is_shift()
314 | else:
315 | self.type_buffer.reset_last_action_is_shift()
316 |
317 | record_hotkey = False
318 | # Determine hotkey operation
319 | self.hotkey_buffer.add(get_key_str(key))
320 | if self.hotkey_buffer.buffer in HOT_KEY:
321 | record_hotkey = True # Should be recorded as a hotkey operation
322 |
323 | # Handle record operation
324 | if not is_related_to_type(key): # Keys that cannot appear in typing scenarios
325 | self.type_buffer.reset() # Before entering typing state, all previous operations must have been recorded, just save the text in the buffer
326 | if self.type_buffer.last_action_is_shift:
327 | shift_action = Action(ActionType.KEY_DOWN, key="shift")
328 | self.recorder.record_action(shift_action)
329 | hotkey_2 = get_ctrl_hotkey(key)
330 | if hotkey_2 is not None:
331 | last_action = self.recorder.get_last_action()
332 | if last_action is not None and last_action.action_type == ActionType.KEY_DOWN and (
333 | last_action.kwargs['key'] == 'ctrl_l' or last_action.kwargs['key'] == 'ctrl_r'):
334 | ctrl_hotkey_action = Action(
335 | ActionType.HOTKEY, key1='Ctrl', key2=hotkey_2)
336 | self.recorder.change_last_action(ctrl_hotkey_action)
337 | else:
338 | ctrl_hotkey_action = Action(
339 | ActionType.HOTKEY, key1='Ctrl', key2=hotkey_2)
340 | self.recorder.record_action(ctrl_hotkey_action)
341 | elif not record_hotkey:
342 | key_name = get_key_str(key)
343 | key_press_action = Action(
344 | ActionType.KEY_DOWN, key=key_name)
345 | self.recorder.record_action(key_press_action)
346 | elif not record_hotkey: # Keys that may appear in typing scenarios
347 | if self.type_buffer.is_empty(): # Only characters can be the first element of the buffer
348 | if hasattr(key, 'char'):
349 | switched_char = switch_caption(key.char)
350 | self.type_buffer.append(switched_char)
351 | self.type_buffer.pre_save_type_event() # Save observation when entering typing state
352 | else:
353 | # At this time, the buffer is empty, directly record special keys
354 | key_name = get_key_str(key)
355 | key_press_action = Action(
356 | ActionType.KEY_DOWN, key=key_name)
357 | self.recorder.record_action(key_press_action)
358 | else:
359 | # Just throw into the buffer
360 | if key == Key.backspace:
361 | self.type_buffer.backspace()
362 | elif key == Key.space:
363 | self.type_buffer.append(' ')
364 | elif hasattr(key, 'char'):
365 | switched_char = switch_caption(key.char)
366 | self.type_buffer.append(switched_char)
367 |
368 | if record_hotkey:
369 | last_action = self.recorder.get_last_action()
370 |                 if last_action is not None and last_action.action_type == ActionType.KEY_DOWN and \
371 |                         last_action.kwargs['key'] == self.hotkey_buffer.buffer[0]:
372 | ctrl_hotkey_action = Action(
373 | ActionType.HOTKEY, key1=self.hotkey_buffer.buffer[0], key2=self.hotkey_buffer.buffer[1])
374 | self.recorder.change_last_action(ctrl_hotkey_action)
375 | else:
376 | ctrl_hotkey_action = Action(
377 | ActionType.HOTKEY, key1=self.hotkey_buffer.buffer[0], key2=self.hotkey_buffer.buffer[1])
378 | self.recorder.record_action(ctrl_hotkey_action)
379 | except AttributeError:
380 | print_debug("error!")
381 |
382 | def on_release(self, key: Key):
383 | self.hotkey_buffer.pop()
384 |
385 |
386 | class LastClick:
387 | def __init__(self):
388 | self.x = 0
389 | self.y = 0
390 | self.time = 0
391 | self.button = mouse.Button.left
392 | self.element_name = ""
393 |
394 | def update(self, x, y, button, element_name):
395 | self.x = x
396 | self.y = y
397 | self.time = time.time()
398 | self.button = button
399 | self.element_name = element_name
400 |
401 |
402 | class MouseMonitor:
403 | def __init__(self, recorder: Recorder, type_buffer: TypeBuffer, timer: Timer, scroll_buffer: ScrollBuffer):
404 | self.recorder = recorder
405 | self.listener = mouse.Listener(
406 | on_click=self.on_click, on_scroll=self.on_scroll, on_move=self.on_move)
407 | self.type_buffer = type_buffer
408 | self.timer = timer
409 | self.scroll_buffer = scroll_buffer
410 | self.last_click = LastClick()
411 | self.pre_saved_drag_event = None
412 |
413 | def start(self):
414 | self.listener.start()
415 |
416 | def stop(self):
417 | self.listener.stop()
418 |
419 | def on_click(self, x, y, button, pressed):
420 | self.timer.reset()
421 | self.type_buffer.reset_last_action_is_typing()
422 | self.type_buffer.reset_last_action_is_shift()
423 | self.scroll_buffer.reset()
424 | if pressed:
425 | # Mouse click triggers information update
426 | element = get_element_info_at_position(x, y) # Get UI element info at mouse click position
427 | self.type_buffer.reset() # reset type buffer
428 | # Save observation when mouse is pressed, for possible drag operation
429 | self.pre_saved_drag_event = self.recorder.get_event()
430 |
431 | delta_time = time.time() - self.last_click.time
432 | if delta_time < DOUBLE_CLICK_INTERVAL and x == self.last_click.x and y == self.last_click.y:
433 | # Double click
434 | last_action = self.recorder.get_last_action()
435 | if last_action is not None and last_action.action_type == ActionType.CLICK:
436 | double_click_action = Action(
437 | ActionType.DOUBLE_CLICK, x=x, y=y, name=last_action.kwargs['name'])
438 | self.recorder.change_last_action(double_click_action)
439 | else:
440 | # Click
441 | if button == mouse.Button.left:
442 | click_action = Action(
443 | ActionType.CLICK, x=x, y=y, name=element['name'])
444 | self.recorder.record_action(
445 | click_action, element['coordinates'])
446 | elif button == mouse.Button.right:
447 | click_action = Action(
448 | ActionType.RIGHT_CLICK, x=x, y=y, name=element['name'])
449 | self.recorder.record_action(
450 | click_action, element['coordinates'])
451 | else:
452 | print_debug(f"Unknown button {button}")
453 |
454 | self.last_click.update(x, y, button, element['name'])
455 |
456 | else: # released
457 | if x != self.last_click.x or y != self.last_click.y: # Mouse dragged
458 | last_action = self.recorder.get_last_action()
459 |                 if last_action is not None and last_action.action_type == ActionType.CLICK:  # Previous operation was a click operation
460 | press_action = Action(ActionType.MOUSE_DOWN, x=self.last_click.x,
461 | y=self.last_click.y, name=self.last_click.element_name)
462 | self.recorder.change_last_action(
463 | press_action) # Modify the previous click operation to press operation
464 | # Record drag operation
465 | drag_action = Action(ActionType.DRAG, x=x, y=y)
466 | self.pre_saved_drag_event['action'] = drag_action
467 | self.recorder.record_event(self.pre_saved_drag_event)
468 | else: # Normal click
469 | pass
470 |
471 | def on_move(self, x, y):
472 | # print(f"Mouse moved to {(x, y)}")
473 | pass
474 |
475 | def on_scroll(self, x, y, dx, dy):
476 | self.timer.stop() # Close timer during scrolling to avoid recording wait operations during scrolling
477 | self.type_buffer.reset_last_action_is_typing()
478 | self.type_buffer.reset_last_action_is_shift()
479 | self.type_buffer.reset()
480 | if self.scroll_buffer.is_empty():
481 | self.scroll_buffer.new(dx, dy)
482 | else:
483 | self.scroll_buffer.add_delta(dx, dy)
484 |
485 |
486 | def is_related_to_type(key):
487 | if isinstance(key, Key):
488 | return key in [Key.shift, Key.space, Key.caps_lock, Key.backspace]
489 | elif isinstance(key, keyboard.KeyCode):
490 | return key.char is not None and ord(key.char) > 31
491 | return False
492 |
493 |
494 | def get_ctrl_hotkey(key):
495 | if isinstance(key, keyboard.KeyCode) and key.char is not None and ord(key.char) <= 31:
496 | return chr(ord('@') + ord(key.char))
497 | return None
498 |
499 |
500 | def get_key_str(key):
501 | if isinstance(key, Key):
502 | key_str = str(key)
503 | if "ctrl" in key_str:
504 | return "ctrl"
505 | if "shift" in key_str:
506 | return "shift"
507 | if "alt" in key_str:
508 | return "alt"
509 | if "cmd" in key_str:
510 | return "cmd"
511 | return key_str[4:]
512 | elif isinstance(key, keyboard.KeyCode):
513 | return key.char
514 |
--------------------------------------------------------------------------------
/tracker/package.ps1:
--------------------------------------------------------------------------------
1 | # 0. clean ./dist/ folder (if exists)
2 | $distPath = "./dist"
3 | if (Test-Path $distPath)
4 | {
5 | Remove-Item -Path $distPath -Recurse -Force
6 | Write-Output "dist folder cleared."
7 | }
8 | else
9 | {
10 | Write-Output "dist folder does not exist, skipping clearing."
11 | }
12 |
13 | # 1. run pyinstaller
14 | pyinstaller main.spec
15 |
16 | # 2. check if ./dist/tracker.exe is generated
17 | if (Test-Path "./dist/tracker.exe")
18 | {
19 | Write-Output "tracker.exe successfully created."
20 | }
21 | else
22 | {
23 | Write-Output "Error: tracker.exe not created."
24 | exit 1
25 | }
26 |
27 | # 3. copy ./tasks.json and ./README.md to ./dist/
28 | Copy-Item -Path "./tasks.json" -Destination "./dist/" -Force
29 | Copy-Item -Path "./README.md" -Destination "./dist/" -Force
30 |
31 | # 4. copy ./task_cnt.json to ./dist/
32 | Copy-Item -Path "./task_cnt.json" -Destination "./dist/" -Force
33 |
34 | Write-Output "./dist/ successfully prepared, ready for zip."
35 |
--------------------------------------------------------------------------------
/tracker/recorder.py:
--------------------------------------------------------------------------------
1 | import json
2 | import multiprocessing
3 |
4 | from PIL import Image, ImageDraw
5 |
6 | from capturer import *
7 | from fs import *
8 | from utils import *
9 |
10 | MARK_IMAGE = False
11 |
12 |
13 | class Recorder:
14 | def __init__(self, task=None, buffer_len=1, directory="events"):
15 | self.pool = multiprocessing.Pool()
16 | self.task = task
17 | self.buffer_len = buffer_len
18 | self.directory = directory
19 | self.screenshot_dir = os.path.join(directory, "screenshot")
20 | self.buffer = [] # event buffer
21 | self.saved_cnt = 0
22 | self.timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
23 | # Ensure directories exist
24 | ensure_folder(self.directory)
25 | ensure_folder(self.screenshot_dir)
26 | # Hide directory
27 | hide_folder(self.directory)
28 | # Generate filename prefix
29 | if self.task is not None:
30 | index = self.task.id
31 | prefix = f"task{index}" if index != 0 else "free_task"
32 | else:
33 | prefix = "events"
34 | # Generate filename
35 | self.event_filename = os.path.join(
36 | self.directory, f"{prefix}_{self.timestamp_str}.jsonl")
37 | self.md_filename = os.path.join(
38 | self.directory, f"{prefix}_{self.timestamp_str}.md")
39 |
40 | self.recent_screen = RecentScreen()
41 |
42 | self.screenshot_f_list = []
43 |
44 | def get_event(self, action=None):
45 | timestamp = get_current_time()
46 | screenshot = self.recent_screen.get()
47 | event = {
48 | 'timestamp': timestamp,
49 | 'action': action,
50 | 'screenshot': screenshot,
51 | }
52 | return event
53 |
54 | def record_event(self, event, rect=None):
55 | self.buffer.append((event, rect))
56 | if len(self.buffer) > self.buffer_len:
57 | ev, rec = self.buffer.pop(0)
58 | self.save(ev, rec)
59 |
60 | def record_action(self, action, rect=None):
61 | event = self.get_event(action)
62 | self.record_event(event, rect)
63 |
64 | def get_last_action(self):
65 |         if self.buffer:
66 | event, _ = self.buffer[-1]
67 | return event['action']
68 | else:
69 | return None
70 |
71 | def change_last_action(self, action):
72 | if self.buffer:
73 | event, rect = self.buffer.pop()
74 | event['action'] = action
75 | self.buffer.append((event, rect))
76 | else:
77 | print("WARNING: No record to change in the buffer!")
78 |
79 | def save(self, event, rect):
80 | self.saved_cnt += 1
81 | timestamp = event['timestamp'].replace(':', '').replace('-', '')
82 | action = event['action']
83 | screenshot_filename = os.path.join(
84 | self.screenshot_dir, f"{timestamp}_{self.saved_cnt}.png")
85 |
86 | point = {"x": action.kwargs.get('x'), "y": action.kwargs.get('y')}
87 | if None in point.values():
88 | point = None
89 |
90 | # Async save screenshot
91 | self.pool.apply_async(
92 | save_screenshot, (screenshot_filename, event['screenshot'], rect, point))
93 |
94 | event['screenshot'] = screenshot_filename
95 | event['action'] = str(action)
96 | event['element'] = action.get_element()
97 | event['rect'] = rect
98 | with open(self.event_filename, 'a', encoding='utf-8') as f:
99 | json.dump(event, f, ensure_ascii=False)
100 | f.write('\n')
101 |
102 | self.screenshot_f_list.append(screenshot_filename)
103 |
104 | def wait(self):
105 | # Save all buffered events
106 | for event, rect in self.buffer:
107 | self.save(event, rect)
108 | # Close process pool
109 | self.pool.close()
110 | self.pool.join()
111 |
112 | def generate_md(self, task=None):
113 | if task is not None:
114 | self.task = task # Reset task
115 |
116 | prompt = '''Given the screenshot as below. What's the next step that you will do to help with the task?'''
117 |
118 | with open(self.event_filename, 'r', encoding='utf-8') as file:
119 | lines = file.readlines()
120 |
121 | markdown_content = []
122 | if self.task is not None:
123 | index = self.task.id
124 | description = self.task.description
125 | level = self.task.level
126 |
127 | if index == 0:
128 | markdown_content.append(f'# Free Task\n')
129 | else:
130 | markdown_content.append(f'# Task {index}\n')
131 |
132 | markdown_content.append(f'**Description:** {description}\n\n')
133 | markdown_content.append(f'**Level:** {level}\n\n')
134 | else:
135 | markdown_content.append(f'# Non task-oriented events\n')
136 |
137 | for line in lines:
138 | event = json.loads(line.strip())
139 | timestamp = event.get('timestamp', '')
140 | action = event.get('action', '')
141 | screenshot_path = event.get('screenshot', '')
142 | screenshot_path = '\\'.join(screenshot_path.split(
143 | '\\')[1:]) # remove the first directory
144 |
145 | markdown_content.append(f'### {timestamp}\n')
146 | markdown_content.append(f'**Input:** \n\n{prompt}\n\n')
147 |             markdown_content.append(
148 |                 f'![screenshot]({screenshot_path})\n\n')
149 | markdown_content.append(f'**Output:** \n\n{action}\n\n')
150 |
151 | # Write content to Markdown file
152 | with open(self.md_filename, 'w', encoding='utf-8') as md_file:
153 | md_file.writelines(markdown_content)
154 |
155 | def discard(self):
156 | # Delete all record files
157 | delete_file(self.event_filename)
158 | # markdown may not be recorded, but not a problem
159 | delete_file(self.md_filename)
160 | for f in self.screenshot_f_list:
161 | delete_file(f)
162 |
163 |
164 | def save_screenshot(save_filename, screenshot, rect=None, point=None):
165 | # Create image from buffer
166 | image = Image.frombuffer(
167 | 'RGB',
168 | screen_size,
169 | screenshot, 'raw', 'BGRX', 0, 1
170 | )
171 |
172 | if MARK_IMAGE:
173 | mark_image(image, rect, point)
174 |
175 | # Save image
176 | image.save(save_filename)
177 |
178 |
179 | def mark_image(image, rect, point):
180 | if rect is not None:
181 | # Create a drawable object
182 | draw = ImageDraw.Draw(image)
183 | # Draw rectangle
184 | draw.rectangle(
185 | [(rect["left"], rect["top"]), (rect["right"], rect["bottom"])],
186 | outline="red",
187 | width=3 # line width
188 | )
189 |
190 | if point is not None:
191 | draw = ImageDraw.Draw(image)
192 |
193 | # Calculate circle's top-left and bottom-right coordinates
194 | radius = 6
195 | left = point["x"] - radius
196 | top = point["y"] - radius
197 | right = point["x"] + radius
198 | bottom = point["y"] + radius
199 |
200 | # Draw circle
201 | draw.ellipse(
202 | [(left, top), (right, bottom)],
203 | fill="red"
204 | )
205 |
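`mark_image` reduces both marks to simple bounding-box geometry: a corner pair for the red rectangle and a `2*radius` square centered on the click point for the filled circle. A minimal stdlib-only sketch of that geometry (the helper names are illustrative, not part of the tracker):

```python
def point_bbox(point, radius=6):
    # Bounding box of the filled circle drawn around a click point,
    # matching the radius used in mark_image above.
    return (point["x"] - radius, point["y"] - radius,
            point["x"] + radius, point["y"] + radius)

def rect_bbox(rect):
    # Corner pair for the rectangle drawn around a UI element.
    return ((rect["left"], rect["top"]), (rect["right"], rect["bottom"]))

# A click at (100, 50) yields a 12x12 box centered on the point
print(point_bbox({"x": 100, "y": 50}))  # (94, 44, 106, 56)
```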
--------------------------------------------------------------------------------
/tracker/task.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import time
4 |
5 |
6 | def find_tasks_json():
7 | start_dir = os.path.abspath(".")
8 | for root, dirs, files in os.walk(start_dir):
9 | if 'tasks.json' in files:
10 | return os.path.join(root, 'tasks.json')
11 | return None
12 |
13 |
14 | def find_task_cnt_json():
15 | start_dir = os.path.abspath(".")
16 | for root, dirs, files in os.walk(start_dir):
17 | if 'task_cnt.json' in files:
18 | return os.path.join(root, 'task_cnt.json')
19 | return None
20 |
21 |
22 | tasks_path = find_tasks_json()
23 | task_cnt_path = find_task_cnt_json()
24 | task_cnt = 0
25 |
26 |
27 | class Task:
28 | def __init__(self, description, id, level, file_input=None, category="other", finished=False, is_bad=False):
29 | self.description = description
30 | self.level = level
31 | self.id = id
32 | self.category = category
33 | self.file_input = file_input
34 | self.finished = finished
35 | self.is_bad = is_bad
36 |
37 |
38 | def from_json(task, task_cnt) -> Task:
39 | return Task(task['task'], task_cnt, task['level'], task['file_input'], task['category'], task['finished'])
40 |
41 |
42 | def free_task():
43 | return Task("free task", 0, "easy")
44 |
45 |
46 | def load_task_cnt():
47 | with open(task_cnt_path, 'r') as file:
48 | data = json.load(file)
49 | return data['given_task'], data['free_task']
50 |
51 |
52 | def load_given_tasks():
53 | tasks = []
54 | global task_cnt
55 | if tasks_path is None:
56 | return [free_task()] # for robustness
57 | with open(tasks_path, 'r') as file:
58 | data = json.load(file)
59 | for task in data:
60 | task_cnt += 1
61 | tasks.append(from_json(task, task_cnt))
62 | return tasks
63 |
64 |
65 | def update_given_tasks(given_tasks):
66 | if tasks_path is None:
67 | return
68 | try:
69 | # set_hidden_file(tasks_path, False)
70 | with open(tasks_path, 'w') as file:
71 | json.dump(
72 | [{'task': task.description,
73 | 'level': task.level,
74 | 'file_input': task.file_input,
75 | 'category': task.category,
76 | 'finished': task.finished}
77 | for task in given_tasks if not task.is_bad],
78 | file,
79 | indent=2 # Set indentation to 2 spaces
80 | )
81 | # set_hidden_file(tasks_path, True)
82 | except Exception as e:
83 | print(e)
84 | # sleep for 10 seconds
85 | time.sleep(10)
86 |
87 |
88 | def update_task_cnt(finished_given_cnt, finished_free_cnt):
89 | print(f"update task cnt: {finished_given_cnt}, {finished_free_cnt}")
90 | # set_hidden_file(task_cnt_path, False)
91 | with open(task_cnt_path, 'w') as file:
92 | json.dump({'given_task': finished_given_cnt, 'free_task': finished_free_cnt}, file, indent=2)
93 | # set_hidden_file(task_cnt_path, True)
94 |
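`update_task_cnt` and `load_task_cnt` round-trip a two-field JSON counter file. The same round trip can be sketched against a throwaway temp file (`save_cnt`/`load_cnt` are stand-in names, not the tracker's own API):

```python
import json
import os
import tempfile

def save_cnt(path, given, free):
    # Mirrors update_task_cnt: write both counters with 2-space indentation
    with open(path, 'w') as f:
        json.dump({'given_task': given, 'free_task': free}, f, indent=2)

def load_cnt(path):
    # Mirrors load_task_cnt: read the counters back as a tuple
    with open(path) as f:
        data = json.load(f)
    return data['given_task'], data['free_task']

fd, path = tempfile.mkstemp(suffix='.json')
os.close(fd)
save_cnt(path, 3, 5)
print(load_cnt(path))  # (3, 5)
os.remove(path)
```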
--------------------------------------------------------------------------------
/tracker/task_cnt.json:
--------------------------------------------------------------------------------
1 | {
2 | "given_task": 0,
3 | "free_task": 0
4 | }
--------------------------------------------------------------------------------
/tracker/tasks.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "task": "Open PowerPoint and create a new presentation about \"Star Wars Saga Overview\".\nOn the first slide, add the title \"Star Wars: A Galaxy Far, Far Away\" and the subtitle \"The Epic Saga | 1977-2019\".\nFor the second slide, title it \"The Force\". Search for an image of \"Star Wars Lightsaber Duel\" on Chrome, copy the image, and paste it into the slide.\nOn the third slide, use the title \"Key Characters\" and include the following text: \"Follow the journey of Luke Skywalker, Darth Vader, Princess Leia and Rey as they shape the destiny of the galaxy.\"\nFor the fourth slide, title it \"Legacy\" and add: \"Star Wars revolutionized cinema with groundbreaking effects and created a cultural phenomenon that spans movies, games and books.\"\nOn the final slide, write \"May The Force Be With You\".\nSave the presentation to desktop, then close PowerPoint and Chrome.",
4 | "level": "hard",
5 | "file_input": false,
6 | "category": "PowerPointTasks",
7 | "finished": false
8 | },
9 | {
10 | "task": "Open Chrome and navigate to the product page for the MacBook Air M3 on Amazon.",
11 | "level": "medium",
12 | "file_input": false,
13 | "category": "ChromeTasks",
14 | "finished": false
15 | },
16 | {
17 | "task": "Create a new file named 'diary.txt' on the desktop.",
18 | "level": "easy",
19 | "file_input": false,
20 | "category": "FileOperationTasks",
21 | "finished": false
22 | }
23 | ]
24 |
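Each entry in `tasks.json` carries the five fields that `from_json` in `task.py` reads. A quick stdlib check of that schema (the sample entry is copied from the file above):

```python
import json

REQUIRED = {'task', 'level', 'file_input', 'category', 'finished'}

sample = json.loads("""
[
  {"task": "Create a new file named 'diary.txt' on the desktop.",
   "level": "easy", "file_input": false,
   "category": "FileOperationTasks", "finished": false}
]
""")

for entry in sample:
    # Any key from_json expects but the entry lacks would show up here
    missing = REQUIRED - entry.keys()
    print(missing)  # set() -- every required key is present
```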
--------------------------------------------------------------------------------
/tracker/tracker.py:
--------------------------------------------------------------------------------
1 | import random
2 | from monitor import Monitor
3 | from task import *
4 |
5 |
6 | class Tracker:
7 | def __init__(self):
8 | self.monitor = None
9 | self.running = False
10 | self.given_tasks = load_given_tasks()
11 | self.finished_given_cnt, self.finished_free_cnt = load_task_cnt()
12 | self.bad_task_cnt = 0
13 | self.task_num = len(self.given_tasks)
14 | print(f"task num = {self.task_num}")
15 | self.task_id = random.randint(0, self.task_num - 1)
16 | self.task = None
17 |
18 | def get_given_task(self, offset):
19 | while True:
20 | self.task_id = (self.task_id + self.task_num + offset) % self.task_num
21 | task = self.given_tasks[self.task_id]
22 | if not task.finished and not task.is_bad:
23 | break
24 | self.task = self.given_tasks[self.task_id]
25 |
26 | def finish_all(self):
27 | return self.finished_given_cnt + self.bad_task_cnt == self.task_num
28 |
29 | def update_tasks(self):
30 | update_given_tasks(self.given_tasks)
31 | update_task_cnt(self.finished_given_cnt, self.finished_free_cnt)
32 |
33 | def get_free_task(self):
34 | self.task = free_task()
35 |
36 | def start(self):
37 | if not self.running:
38 | self.monitor = Monitor(self.task)
39 | self.monitor.start()
40 | self.running = True
41 |
42 | def stop(self):
43 | if self.running:
44 | self.monitor.stop()
45 | self.running = False
46 |
47 | def finish(self):
48 | if self.running:
49 | self.monitor.finish()
50 | self.running = False
51 | else:
52 | # recording already stopped; only the markdown still needs generating
53 | self.monitor.generate_md()
54 | # in either case, mark the task finished and bump the counter
55 | self.given_tasks[self.task_id].finished = True
56 | self.finished_given_cnt += 1
57 |
58 | def fail(self):
59 | if self.running:
60 | self.monitor.fail()
61 | self.running = False
62 |
63 | def stop_without_task(self):
64 | # stop without markdown (task unknown)
65 | if self.running:
66 | self.monitor.finish_without_md()
67 | self.running = False
68 |
69 | def save_free_task(self, task):
70 | # Called after stop_without_task: save the record once the task is set
71 | self.monitor.generate_md(task)
72 | self.finished_free_cnt += 1
73 |
74 | def discard(self):
75 | # Called after stopping a given/free task: discard the record
76 | self.monitor.discard_record()
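`Tracker.get_given_task` scans the task list circularly, skipping finished and bad entries. A minimal sketch of that selection loop in isolation (`FakeTask` is a hypothetical stand-in for `Task`; like the original, it assumes at least one selectable task remains, otherwise the loop would never terminate):

```python
from dataclasses import dataclass

@dataclass
class FakeTask:  # stand-in for tracker's Task, illustration only
    id: int
    finished: bool = False
    is_bad: bool = False

def next_task_id(tasks, task_id, offset):
    # Circular scan mirroring Tracker.get_given_task: step by offset
    # (mod list length) until an unfinished, non-bad task is found.
    n = len(tasks)
    while True:
        task_id = (task_id + n + offset) % n
        t = tasks[task_id]
        if not t.finished and not t.is_bad:
            return task_id

tasks = [FakeTask(0, finished=True), FakeTask(1), FakeTask(2, is_bad=True)]
print(next_task_id(tasks, 0, 1))  # 1 -- task 0 is finished, task 2 is bad
```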
77 |
--------------------------------------------------------------------------------
/tracker/utils.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from pywinauto import Desktop
3 |
4 |
5 | def get_current_time():
6 | return datetime.now().strftime('%Y-%m-%d_%H:%M:%S')
7 |
8 |
9 | desktop = Desktop(backend="uia")
10 |
11 |
12 | def get_element_info_at_position(x, y):
13 | """
14 | Get UI element info at specified coordinates
15 | """
16 | try:
17 | element = desktop.from_point(x, y)
18 | # Get element's rectangle coordinates
19 | rect = element.rectangle()
20 |
21 | return {
22 | "name": element.element_info.name,
23 | "coordinates": {
24 | "left": rect.left,
25 | "top": rect.top,
26 | "right": rect.right,
27 | "bottom": rect.bottom
28 | }
29 | }
30 | except Exception as e:
31 | print(f"Error occurred while getting element at position: {e}")
32 | return None
33 |
34 |
35 | def print_debug(string):
36 | import sys
37 | sys.stderr.write(string + "\n")
38 |
39 |
40 | # Return 1 if the caps lock key is on; return 0 if it is off
41 | def get_capslock_state():
42 | import ctypes
43 | hllDll = ctypes.WinDLL("User32.dll")
44 | VK_CAPITAL = 0x14
45 | return hllDll.GetKeyState(VK_CAPITAL) & 1  # low-order bit holds the toggle state
46 |
--------------------------------------------------------------------------------