├── LICENSE
├── README.md
├── agent
│   ├── grounding.py
│   ├── main.py
│   ├── pcagent.py
│   ├── planning.py
│   ├── prompt.py
│   ├── server
│   │   ├── molmo_server.sh
│   │   └── planner_server.sh
│   └── utils.py
├── assets
│   ├── animation.png
│   ├── overview.png
│   └── raw_trajectory_example.png
├── environment.yml
├── postprocess
│   ├── completion.py
│   ├── data
│   │   └── events_example
│   │       ├── free_task_20241126_161517.jsonl
│   │       ├── free_task_20241126_161517.md
│   │       └── screenshot
│   │           ├── 20241126_161519_1.png
│   │           ├── 20241126_161519_1_marked.png
│   │           ├── 20241126_161521_2.png
│   │           ├── 20241126_161521_2_marked.png
│   │           ├── 20241126_161524_3.png
│   │           ├── 20241126_161524_3_marked.png
│   │           ├── 20241126_161526_4.png
│   │           ├── 20241126_161526_4_marked.png
│   │           ├── 20241126_161528_5.png
│   │           ├── 20241126_161536_6.png
│   │           ├── 20241126_161536_6_marked.png
│   │           ├── 20241126_161538_7.png
│   │           ├── 20241126_161538_7_marked.png
│   │           └── 20241126_161540_8.png
│   ├── prompt.py
│   ├── refinement.py
│   └── utils.py
└── tracker
    ├── README.md
    ├── README_zh.md
    ├── capturer.py
    ├── fs.py
    ├── main.py
    ├── main.spec
    ├── monitor.py
    ├── package.ps1
    ├── recorder.py
    ├── task.py
    ├── task_cnt.json
    ├── tasks.json
    ├── tracker.py
    └── utils.py
/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Yanheng He 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World 2 | 3 |

4 | 📄 Paper   |   5 | 🌐 Website   |   6 | 📘 机器之心 7 |

8 | 9 |

10 | animation 11 |

12 | 13 | ## News 14 | 15 | - [2025/05/21] 🔥 **PC Agent-E** is released, a new SOTA open-source model for Windows computer use. [[Paper](https://arxiv.org/pdf/2505.13909)] [[Code](https://github.com/GAIR-NLP/PC-Agent-E/)] [[Model](https://huggingface.co/henryhe0123/PC-Agent-E)] [[Data](https://huggingface.co/datasets/henryhe0123/PC-Agent-E)] 16 | - [2024/12/24] 🔥 We released our [paper](https://arxiv.org/abs/2412.17589), [code](https://github.com/GAIR-NLP/PC-Agent/) and [project page](https://gair-nlp.github.io/PC-Agent/). Check it out! 17 | 18 | ## Demo 19 | 20 | Check out our demo of PC Agent autonomously controlling a computer to complete complex tasks involving dozens of steps! 21 | 22 | https://github.com/user-attachments/assets/0b7613c6-e3b1-41cf-86d3-0e7a828fe863 23 | 24 | ## Introduction 25 | 26 | **PC Agent** introduces a novel framework to empower autonomous digital agents through **human cognition transfer**. 27 | This transfer is implemented through three key components: 28 | 1. **PC Tracker**, the first lightweight infrastructure for large-scale human-computer interaction data collection; 29 | 2. A **Cognition Completion** postprocess pipeline that transforms raw interaction data into cognitive trajectories; 30 | 3. A multi-agent system combining a planning agent for decision-making with a grounding agent for robust visual grounding. 31 | 32 | ![overview](./assets/overview.png) 33 | 34 | ## Quick Start 35 | 36 | ### Setup 37 | 38 | To get started with PC Agent, we recommend setting up your Python environment using conda: 39 | 40 | ```bash 41 | # Clone the repository and navigate to the folder 42 | git clone https://github.com/GAIR-NLP/PC-Agent.git 43 | cd PC-Agent 44 | # Create and activate conda environment 45 | conda env create -f environment.yml 46 | conda activate pcagent 47 | ``` 48 | 49 | ### PC Tracker 50 | 51 | PC Tracker is an infrastructure for human-computer interaction data collection. 
The source code in the `tracker/` directory can be modified to fit your specific data collection requirements. 52 | 53 | To deploy: 54 | 1. Build the executable (Windows): 55 | ```powershell 56 | cd tracker 57 | .\package.ps1 58 | ``` 59 | 2. Customize `tasks.json` according to your annotation needs 60 | 3. Distribute to annotators 61 | 4. Collect annotation data from annotators - annotated data will be saved in the `events/` folder (hidden) under the working directory 62 | 63 | For user instructions, please refer to our [PC Tracker User Manual](./tracker/README.md). 64 | 65 | ### Post Processing 66 | 67 | To convert raw interaction data into cognitive trajectories, follow these steps: 68 | 1. Place your data in the `postprocess/data/` directory. Example data is available in this directory for reference. 69 | 2. Run the post-processing pipeline: 70 | ```bash 71 | python postprocess/refinement.py # Data refinement 72 | python postprocess/completion.py # Cognition completion 73 | ``` 74 | 75 | Note: You need to set up your OpenAI API key in advance to perform cognition completion. 76 | 77 | ### Agent 78 | 79 | We provide a reference implementation of our multi-agent system in the `agent/` directory, combining planning and grounding agents. To run: 81 | ```bash 82 | python agent/main.py 83 | ``` 84 | 85 | Reference scripts for model deployment can be found in the `agent/server/` directory. 
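As a quick reference, the plain-text actions emitted by the planning agent can be parsed with a small dispatcher before being executed. The sketch below is illustrative and not part of the repository: the `parse_action` helper and its return values are assumptions, while the action formats themselves mirror those defined in `agent/prompt.py`.

```python
import re

def parse_action(action: str):
    """Parse a planner action string into (kind, payload).

    Hypothetical helper for illustration only; the action grammar
    mirrors the formats listed in agent/prompt.py.
    """
    m = re.match(r"drag from \((-?\d+), (-?\d+)\) to \((-?\d+), (-?\d+)\)", action)
    if m:
        return "drag", tuple(int(g) for g in m.groups())
    m = re.match(r"scroll \((-?\d+), (-?\d+)\)", action)
    if m:
        return "scroll", (int(m.group(1)), int(m.group(2)))
    # order matters: "right click"/"double click" must be tried before "click"
    m = re.match(r"(right click|double click|click) element: (.+)", action)
    if m:
        return m.group(1), m.group(2).strip()
    m = re.match(r"press key: (.+)", action)
    if m:
        return "press key", m.group(1)
    m = re.match(r"hotkey \((.+), (.+)\)", action)
    if m:
        return "hotkey", (m.group(1).lower(), m.group(2).lower())
    m = re.match(r"type text: (.+)", action)
    if m:
        return "type text", m.group(1)
    if action in ("wait", "finish", "fail"):
        return action, None
    return "unknown", action

print(parse_action("click element: the Chrome icon on the taskbar"))
```

A dispatcher like this keeps plan parsing separate from the executor (the `pyautogui` calls), which makes the action grammar easy to unit-test.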
86 | 87 | ## Citation 88 | 89 | If you find this work helpful, please consider citing: 90 | 91 | ``` 92 | @article{he2024pcagent, 93 | title={PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World}, 94 | author={Yanheng He and Jiahe Jin and Shijie Xia and Jiadi Su and Runze Fan and Haoyang Zou and Xiangkun Hu and Pengfei Liu}, 95 | year={2024}, 96 | journal={arXiv preprint arXiv:2412.17589}, 97 | url={https://arxiv.org/abs/2412.17589} 98 | } 99 | ``` 100 | -------------------------------------------------------------------------------- /agent/grounding.py: -------------------------------------------------------------------------------- 1 | from utils import * 2 | from prompt import GROUNDING_AGENT_PROMPT 3 | import re 4 | 5 | class GroundingAgent: 6 | def __init__(self, grounding_client): 7 | self.grounding_client = grounding_client 8 | self.grounding_model = grounding_client.models.list().data[0].id 9 | print(f"Grounding model: {self.grounding_model}") 10 | screenshot = get_screenshot() 11 | self.window_width = screenshot.width 12 | self.window_height = screenshot.height 13 | 14 | def find_element(self, element_description, screenshot): 15 | """ 16 | find the element and return the coordinates (with check), 17 | return x, y, there_are_none 18 | """ 19 | cnt = 0 20 | retry_limit = 3 21 | 22 | while cnt < retry_limit: 23 | x, y, there_are_none = self.call_grounding(element_description, screenshot) 24 | if there_are_none: 25 | return None, None, True 26 | elif self.check_grounding(x, y, screenshot, element_description): 27 | return x, y, False 28 | cnt += 1 29 | 30 | return None, None, True 31 | 32 | def call_grounding(self, element_description, screenshot): 33 | """ 34 | call the grounding model to locate the element, 35 | return x, y, there_are_none 36 | """ 37 | base64_image = encode_image(screenshot) 38 | instruction = f"Point to {element_description}" 39 | messages = get_mllm_messages(instruction, base64_image) 40 | 41 | completion = 
self.grounding_client.chat.completions.create( 42 | model=self.grounding_model, 43 | messages=messages, 44 | max_tokens=512, 45 | temperature=0.8, 46 | n=3 # Request n completions in parallel 47 | ) 48 | 49 | # Try each response until we find valid coordinates 50 | for choice in completion.choices: 51 | x, y = self.parse_coordinates(choice.message.content) 52 | if x is not None and y is not None: 53 | return x, y, False 54 | 55 | # If no valid coordinates found in any response 56 | return None, None, True 57 | 58 | def check_grounding(self, x, y, screenshot, target_description): 59 | """ 60 | check if the grounding result is correct, 61 | return True if accept 62 | """ 63 | element = get_element_info_from_position(x, y) 64 | element_name = element['name'] if element['name'] != '' else 'unknown' 65 | marked_screenshot = mark_screenshot(screenshot, (x, y), element["coordinates"]) 66 | base64_marked_screenshot = encode_image(marked_screenshot) 67 | 68 | instruction = GROUNDING_AGENT_PROMPT + \ 69 | f"The element name from the accessibility tree: {element_name}\n" + \ 70 | f"Target Element Description: {target_description}\n" 71 | 72 | messages = get_mllm_messages(instruction, base64_marked_screenshot) 73 | completion = self.grounding_client.chat.completions.create( 74 | model=self.grounding_model, 75 | messages=messages, 76 | max_tokens=512, 77 | ) 78 | output_text = completion.choices[0].message.content 79 | return "Result: Reject" not in output_text 80 | 81 | def parse_coordinates(self, output_text): 82 | """ 83 | parse the coordinates from the model output 84 | """ 85 | # match single point coordinates 86 | point_pattern = r' PLANNING_MAX_RETRY: 47 | print(f"Plan Model failed to make valid plan after {PLANNING_MAX_RETRY} retries") 48 | self.exit(1) 49 | else: 50 | print(f"Retry after click not found: {self.retry_click_elements[-1]}") 51 | # call planning agent to get plan 52 | plan, action = self.planning_agent.get_plan(screenshot, self.task_description, 
self.retry_click_elements) 53 | 54 | if "click element:" in action: 55 | # call grounding agent to find element 56 | element_description = action.split("click element:")[1].strip() 57 | x, y, there_are_none = self.grounding_agent.find_element(element_description, screenshot) 58 | 59 | if there_are_none: 60 | # if element not found, retry 61 | self.retry_click_elements.append(element_description) 62 | self.add_fail_block(plan) 63 | return self.step(screenshot, retry_click+1) 64 | else: 65 | # if element found, execute action 66 | element = get_element_info_from_position(x, y) 67 | marked_screenshot = mark_screenshot(screenshot, (x, y), element["coordinates"]) 68 | action = self.get_click_action(action, x, y) # rewrite click action 69 | self.add_success_block(plan, action) 70 | self.execute_click_action(action, x, y) 71 | output = f"{plan}\nAction: {action}" 72 | self.planning_agent.add_to_history(output) 73 | self.after_action(output) 74 | return output, marked_screenshot 75 | else: 76 | # non-click action 77 | self.add_success_block(plan, action) 78 | self.execute_non_click_action(action) 79 | output = f"{plan}\nAction: {action}" 80 | self.planning_agent.add_to_history(output) 81 | self.after_action(output) 82 | return output, screenshot 83 | 84 | def get_click_action(self, action, x, y): 85 | if action.startswith("click"): 86 | return f"click ({x}, {y})" 87 | elif action.startswith("right click"): 88 | return f"right click ({x}, {y})" 89 | elif action.startswith("double click"): 90 | return f"double click ({x}, {y})" 91 | 92 | def after_action(self, output): 93 | print_in_green(f"\nAgent Done:\n{output}") 94 | self.step_cnt += 1 95 | 96 | def execute_click_action(self, action, x, y): 97 | if action.startswith("click"): 98 | pyautogui.click(x, y) 99 | elif action.startswith("right click"): 100 | pyautogui.rightClick(x, y) 101 | elif action.startswith("double click"): 102 | pyautogui.doubleClick(x, y) 103 | 104 | def execute_non_click_action(self, action): 105 | # 
drag 106 | match = re.match(r"(drag from) \((-?\d+), (-?\d+)\) to \((-?\d+), (-?\d+)\)", action) 107 | if match: 108 | x1 = int(match.group(2)) # start x coordinate 109 | y1 = int(match.group(3)) # start y coordinate 110 | x2 = int(match.group(4)) # target x coordinate 111 | y2 = int(match.group(5)) # target y coordinate 112 | pyautogui.mouseDown(x1, y1) 113 | pyautogui.dragTo(x2, y2, duration=0.5) 114 | return 115 | 116 | # scroll 117 | match = re.match(r"scroll \((-?\d+), (-?\d+)\)", action) 118 | if match: 119 | x = int(match.group(1)) # horizontal scroll distance 120 | y = int(match.group(2)) # vertical scroll distance 121 | if x != 0: 122 | pyautogui.hscroll(x) # horizontal scroll 123 | if y != 0: 124 | pyautogui.scroll(y) # vertical scroll 125 | return 126 | 127 | # press key 128 | match = re.match(r"press key: (.+)", action) 129 | if match: 130 | key_content = match.group(1) 131 | pyautogui.press(key_content) 132 | return 133 | 134 | # hotkey 135 | match = re.match(r"hotkey \((.+), (.+)\)", action) 136 | if match: 137 | key1 = match.group(1).lower() 138 | key2 = match.group(2).lower() 139 | pyautogui.hotkey(key1, key2) 140 | return 141 | 142 | # type text 143 | match = re.match(r"type text: (.+)", action) 144 | if match: 145 | text_content = match.group(1) 146 | pyautogui.write(text_content) 147 | return 148 | 149 | # wait 150 | if action == "wait": 151 | time.sleep(3) 152 | 153 | # finish 154 | if action == "finish": 155 | self.exit(0) 156 | 157 | # fail 158 | if action == "fail": 159 | self.exit(1) 160 | 161 | def record(self, output, screenshot): 162 | # record in markdown 163 | first_event = self.step_cnt == 1 164 | record_in_md(self.directory_path, self.task_description, f"{self.step_cnt}.png", output, first_event=first_event) 165 | # save image 166 | screenshot_path = os.path.join(self.directory_path, f"{self.step_cnt}.png") 167 | save_screenshot(screenshot, screenshot_path) 168 | 169 | def add_success_block(self, plan, action): 170 | if 
self.output_queue is not None: 171 | self.output_queue.put(f"{plan}\n\nAction: {action}") 172 | 173 | def add_fail_block(self, plan): 174 | if self.output_queue is not None: 175 | self.output_queue.put(f"{plan}") 176 | 177 | def exit(self, exit_code): 178 | if exit_code == 0: 179 | print("Task is done!") 180 | 181 | exit(exit_code) 182 | -------------------------------------------------------------------------------- /agent/planning.py: -------------------------------------------------------------------------------- 1 | from utils import * 2 | from prompt import PLANNING_AGENT_PROMPT 3 | 4 | class PlanningAgent: 5 | def __init__(self, plan_client): 6 | self.plan_client = plan_client 7 | self.plan_model = plan_client.models.list().data[0].id 8 | print(f"Planning model: {self.plan_model}") 9 | self.history = [] 10 | self.HISTORY_CUT_OFF = 10 11 | 12 | def get_plan(self, screenshot, task_description, retry_click_elements=None): 13 | """ 14 | get the next plan 15 | Args: 16 | screenshot: the screenshot 17 | task_description: task description 18 | retry_click_elements: the list of elements that failed to click before 19 | Returns: 20 | plan_str: plan description 21 | action_str: specific action 22 | """ 23 | instruction = self.get_plan_instruction(task_description) 24 | 25 | if retry_click_elements: 26 | retry_elements_str = "> and <".join(retry_click_elements) 27 | instruction += f"\n\nNote: The element <{retry_elements_str}> you want to click before is not found, please try a new plan." 
28 | 29 | base64_image = encode_image(screenshot) 30 | messages = get_mllm_messages(instruction, base64_image) 31 | completion = self.plan_client.chat.completions.create( 32 | model=self.plan_model, 33 | messages=messages, 34 | max_tokens=512, 35 | temperature=0.8 36 | ) 37 | output_text = completion.choices[0].message.content 38 | return self.split_output(output_text) 39 | 40 | def add_to_history(self, output): 41 | """ 42 | add the output to the history 43 | """ 44 | self.history.append(output) 45 | 46 | def get_plan_instruction(self, task_description): 47 | """ 48 | generate the planning instruction 49 | """ 50 | prompt = PLANNING_AGENT_PROMPT + f"Your task is: {task_description}\n\n" 51 | 52 | if len(self.history) > self.HISTORY_CUT_OFF: 53 | history_str = "\n\n".join(f"[{i+1}] {item}" for i, item in enumerate(self.history[-self.HISTORY_CUT_OFF:])) 54 | else: 55 | history_str = "\n\n".join(f"[{i+1}] {item}" for i, item in enumerate(self.history)) 56 | 57 | if history_str == '': 58 | history_str = "None" 59 | 60 | prompt += f"History of the previous actions and thoughts you have done to reach the current screen: {history_str}\n\n" 61 | prompt += "--------------------------------------------\n\n" 62 | prompt += f"Given the screenshot. What's the next step that you will do to help with the task?" 63 | return prompt 64 | 65 | def split_output(self, output): 66 | """ 67 | split the output into plan and action 68 | """ 69 | plan_str = output.split("Action:")[0].strip() 70 | action_str = output.split("Action:")[1].strip() 71 | return plan_str, action_str 72 | -------------------------------------------------------------------------------- /agent/prompt.py: -------------------------------------------------------------------------------- 1 | PLANNING_AGENT_PROMPT = """You are a helpful assistant who can help users complete computer tasks, with **full permission** to make any operations on the user's computer. 
2 | Based on the provided current state, you need to suggest the next action to complete the task. Do not try to complete the entire task in one step. Break it down into smaller steps, and at each step you will get a new state to interact with. 3 | 4 | IMPORTANT: You must strictly adhere to the following rules: 5 | 1. Choose ONLY ONE action from the list below for each response, DO NOT perform more than one action per step. 6 | 2. Follow the exact syntax format for the selected action, DO NOT create or use any actions other than those listed. 7 | 3. Once the task is completed, output "finish" without any further actions required. 8 | 4. If external reflection is provided, use it to improve your next action. 9 | 10 | Valid actions: 11 | 12 | 1. click element: element_description 13 | click the element with the description element_description on current screen 14 | 15 | 2. right click element: element_description 16 | right click the element with the description element_description on current screen 17 | 18 | 3. double click element: element_description 19 | double click the element with the description element_description on current screen 20 | 21 | 4. drag from (x1, y1) to (x2, y2) 22 | drag the element from position (x1, y1) to (x2, y2). 23 | 24 | 5. scroll (dx, dy) 25 | scroll the screen with the offset (dx, dy). dx is the horizontal offset, and dy is the vertical offset. 26 | 27 | 6. press key: key_content 28 | press the key key_content on the keyboard. 29 | 30 | 7. hotkey (key1, key2) 31 | press the hotkey composed of key1 and key2. 32 | 33 | 8. type text: text_content 34 | type content text_content on the keyboard. 35 | 36 | 9. wait 37 | wait for some time, usually for the system to respond, screen to refresh, advertisement to finish. 38 | 39 | 10. finish 40 | indicating that the task has been completed. 41 | 42 | 11. fail 43 | indicating that the task has failed. 
44 | 45 | Response Format: {Your thought process}\n\nAction: {The specific action you choose to take} 46 | 47 | -------------------------------------------- 48 | 49 | """ 50 | 51 | GROUNDING_AGENT_PROMPT = """You are an assistant evaluating the accuracy of click actions performed by a PC agent. Your role is to verify if the executed click matches the intended target based on: 52 | 53 | 1. A screenshot showing: 54 | - A red dot and circle marking the exact click location 55 | - A red box outlining the general area of the clicked element 56 | Note: While the dot and circle are precise, the box might be less accurate 57 | 58 | 2. The element name from the accessibility tree 59 | Note: This information might be incomplete, with many elements labeled as "unknown". Ignore it in this case. 60 | 61 | 3. The target element description 62 | 63 | Your Task is to verify if the click action matches the target element based on the above information. 64 | 65 | # Important Notes 66 | 1. Generally, be cautious about rejecting valid clicks - avoid false negatives when possible. 67 | 2. However, be strict about distinguishing between clearly different elements. 68 | 3. Position of target element description is not a strict criterion. 69 | 70 | # Evaluation Process 71 | 1. Locate the click point with red markers. 72 | 2. Check the element name for useful info. 73 | 3. Compare the target description with your findings. 74 | 75 | Response Format: 76 | Evaluation Process: {your evaluation process} 77 | Result: {your result} 78 | 79 | Your result should be either: 80 | - "Accept" if the click matches the target element 81 | - "Reject" if the click does not match the target element 82 | 83 | Few Example Responses: 84 | [1] 85 | "Evaluation Process: The click is at the center of the element labeled "close button", which matches the target description. 
86 | Result: Accept" 87 | 88 | [2] 89 | "Evaluation Process: The click element name from the accessibility tree is "Copy image address", which is not a match for the target description "Copy image" option. 90 | Result: Reject" 91 | 92 | -------------------------------------------- 93 | 94 | """ 95 | -------------------------------------------------------------------------------- /agent/server/molmo_server.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | vllm serve allenai/Molmo-72B-0924 --tensor-parallel-size 4 --trust-remote-code --port 8000 4 | -------------------------------------------------------------------------------- /agent/server/planner_server.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | vllm serve henryhe0123/pc-agent --tensor-parallel-size 4 --port 8002 4 | -------------------------------------------------------------------------------- /agent/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import io 4 | import base64 5 | from PIL import ImageDraw, ImageGrab 6 | from pywinauto import Desktop 7 | 8 | desktop = Desktop(backend="uia") 9 | 10 | 11 | def get_screenshot(): 12 | screenshot = ImageGrab.grab() 13 | return screenshot 14 | 15 | 16 | def encode_image(image): 17 | # encode image to base64 string 18 | buffered = io.BytesIO() 19 | image.save(buffered, format="PNG") 20 | return base64.b64encode(buffered.getvalue()).decode('utf-8') 21 | 22 | 23 | def save_screenshot(screenshot, path): 24 | screenshot.save(path, format="PNG") 25 | 26 | 27 | def get_mllm_messages(instruction, base64_image): 28 | messages = [ 29 | { 30 | "role": "user", 31 | "content": [ 32 | { 33 | "type": "image_url", 34 | "image_url": { 35 | "url": f"data:image/jpeg;base64,{base64_image}" 36 | }, 37 | }, 38 | { 39 | "type": "text", 40 | "text": instruction 41 | }, 42 | ], 43 | }, 44 | ] 
45 | return messages 46 | 47 | 48 | def get_element_info_from_position(x, y): 49 | # get the UI element info at the specified coordinates 50 | try: 51 | element = desktop.from_point(x, y) 52 | # get the rectangle coordinates of the element 53 | rect = element.rectangle() 54 | 55 | return { 56 | "name": element.element_info.name, 57 | "coordinates": { 58 | "left": rect.left, 59 | "top": rect.top, 60 | "right": rect.right, 61 | "bottom": rect.bottom 62 | } 63 | } 64 | except Exception as e: 65 | print(f"Error occurs when get element from position: {e}") 66 | return None 67 | 68 | 69 | def mark_screenshot(original_screenshot, coordinates, rect=None): 70 | screenshot = original_screenshot.copy() 71 | x, y = coordinates 72 | point = {"x": x, "y": y} 73 | 74 | if rect is not None: 75 | # create a drawable object 76 | draw = ImageDraw.Draw(screenshot) 77 | # draw the rectangle 78 | draw.rectangle( 79 | [(rect["left"], rect["top"]), (rect["right"], rect["bottom"])], 80 | outline="red", 81 | width=3 # line width 82 | ) 83 | 84 | if point is not None: 85 | draw = ImageDraw.Draw(screenshot) 86 | 87 | # calculate the top-left and bottom-right coordinates of the solid circle 88 | radius = 3 89 | left = point["x"] - radius 90 | top = point["y"] - radius 91 | right = point["x"] + radius 92 | bottom = point["y"] + radius 93 | 94 | # draw the solid circle 95 | draw.ellipse( 96 | [(left, top), (right, bottom)], 97 | fill="red" 98 | ) 99 | 100 | # add a larger hollow circle 101 | circle_radius = 18 102 | circle_left = point["x"] - circle_radius 103 | circle_top = point["y"] - circle_radius 104 | circle_right = point["x"] + circle_radius 105 | circle_bottom = point["y"] + circle_radius 106 | 107 | # draw the hollow circle 108 | draw.ellipse( 109 | [(circle_left, circle_top), (circle_right, circle_bottom)], 110 | outline="red", 111 | width=2 112 | ) 113 | 114 | return screenshot 115 | 116 | 117 | def record_in_md(directory_path, task_description, screenshot_path, output, 
external_reflection=None, 118 | first_event=False): 119 | file_name = "inference_record.md" 120 | with open(os.path.join(directory_path, file_name), "a", encoding="utf-8") as file: 121 | if first_event: 122 | file.write(f"# Inference Task\n") 123 | file.write(f"**Description:** {task_description}\n\n") 124 | file.write(f"### {time.strftime('%Y-%m-%d %H:%M:%S')}\n\n") 125 | file.write(f"**Screenshot:**\n") 126 | file.write(f'\n\n') 127 | file.write(f"**External Reflection:**\n{external_reflection}\n\n") if external_reflection else None 128 | file.write(f"**Output:**\n{output}\n\n") 129 | 130 | 131 | def log(message, filename="agent.log"): 132 | current_time = time.strftime("%Y-%m-%d %H:%M:%S") 133 | # open the file with UTF-8 encoding 134 | with open(filename, 'a', encoding='utf-8') as file: 135 | file.write(f"{current_time}\n{message}\n\n") 136 | 137 | 138 | def print_in_green(message): 139 | print(f"\033[92m{message}\033[0m") 140 | -------------------------------------------------------------------------------- /assets/animation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/assets/animation.png -------------------------------------------------------------------------------- /assets/overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/assets/overview.png -------------------------------------------------------------------------------- /assets/raw_trajectory_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/assets/raw_trajectory_example.png -------------------------------------------------------------------------------- /environment.yml: 
-------------------------------------------------------------------------------- 1 | name: pcagent 2 | channels: 3 | - defaults 4 | dependencies: 5 | - python=3.10.14 6 | - aiofiles=22.1.0 7 | - blas=1.0=mkl 8 | - blinker=1.6.2 9 | - bzip2=1.0.8 10 | - ca-certificates=2024.7.2 11 | - click=8.1.7 12 | - colorama=0.4.6 13 | - flask=3.0.3 14 | - intel-openmp=2023.1.0 15 | - itsdangerous=2.2.0 16 | - jinja2=3.1.4 17 | - libffi=3.4.4 18 | - libiconv=1.16 19 | - libxml2=2.10.4 20 | - libxslt=1.1.37 21 | - lxml=5.2.1 22 | - markupsafe=2.1.3 23 | - mkl=2023.1.0 24 | - mkl-service=2.4.0 25 | - mkl_fft=1.3.8 26 | - mkl_random=1.2.4 27 | - numpy=1.26.4 28 | - numpy-base=1.26.4 29 | - openssl=3.0.14 30 | - pip=24.0 31 | - setuptools=69.5.1 32 | - sortedcontainers=2.4.0 33 | - sqlite=3.45.3 34 | - tbb=2021.8.0 35 | - tk=8.6.14 36 | - tzdata=2024a 37 | - vc=14.2 38 | - vs2015_runtime=14.29.30133 39 | - werkzeug=3.0.3 40 | - wheel=0.43.0 41 | - xz=5.4.6 42 | - zlib=1.2.13 43 | - pip: 44 | - aiohappyeyeballs==2.4.0 45 | - aiohttp==3.10.5 46 | - aiosignal==1.3.1 47 | - altgraph==0.17.4 48 | - annotated-types==0.7.0 49 | - anyio==4.4.0 50 | - astor==0.8.1 51 | - async-timeout==4.0.3 52 | - attrs==24.2.0 53 | - beautifulsoup4==4.12.3 54 | - certifi==2024.8.30 55 | - charset-normalizer==3.3.2 56 | - comtypes==1.4.5 57 | - contourpy==1.3.0 58 | - customtkinter==5.2.2 59 | - cycler==0.12.1 60 | - cython==3.0.11 61 | - darkdetect==0.8.0 62 | - dashscope==1.20.9 63 | - decorator==5.1.1 64 | - distro==1.9.0 65 | - exceptiongroup==1.2.2 66 | - fire==0.7.0 67 | - fonttools==4.54.1 68 | - frozenlist==1.4.1 69 | - h11==0.14.0 70 | - httpcore==1.0.5 71 | - httpx==0.27.2 72 | - idna==3.8 73 | - imageio==2.35.1 74 | - imgaug==0.4.0 75 | - jiter==0.5.0 76 | - kiwisolver==1.4.7 77 | - lazy-loader==0.4 78 | - lmdb==1.5.1 79 | - matplotlib==3.9.2 80 | - mouseinfo==0.1.3 81 | - multidict==6.1.0 82 | - networkx==3.3 83 | - openai==1.44.1 84 | - opencv-contrib-python==4.10.0.84 85 | - 
opencv-python==4.10.0.84 86 | - opt-einsum==3.3.0 87 | - packaging==24.1 88 | - paddleocr==2.8.1 89 | - paddlepaddle==2.6.2 90 | - pefile==2023.2.7 91 | - pillow==10.4.0 92 | - protobuf==3.20.2 93 | - psutil==6.1.0 94 | - pyautogui==0.9.54 95 | - pyclipper==1.3.0.post5 96 | - pydantic==2.9.1 97 | - pydantic-core==2.23.3 98 | - pygetwindow==0.0.9 99 | - pyinstaller==6.9.0 100 | - pyinstaller-hooks-contrib==2024.7 101 | - pymsgbox==1.0.9 102 | - pynput==1.7.7 103 | - pyparsing==3.1.4 104 | - pyperclip==1.9.0 105 | - pyqt5==5.15.11 106 | - pyqt5-qt5==5.15.2 107 | - pyqt5-sip==12.15.0 108 | - pyrect==0.2.0 109 | - pyscreeze==0.1.30 110 | - python-dateutil==2.9.0.post0 111 | - python-docx==1.1.2 112 | - pytweening==1.2.0 113 | - pywin32==306 114 | - pywin32-ctypes==0.2.2 115 | - pywinauto==0.6.8 116 | - pyyaml==6.0.2 117 | - rapidfuzz==3.10.0 118 | - requests==2.32.3 119 | - scikit-image==0.24.0 120 | - scipy==1.14.1 121 | - shapely==2.0.6 122 | - six==1.16.0 123 | - sniffio==1.3.1 124 | - soupsieve==2.6 125 | - termcolor==2.4.0 126 | - tifffile==2024.9.20 127 | - tqdm==4.66.5 128 | - typing-extensions==4.12.2 129 | - urllib3==2.2.2 130 | - websocket-client==1.8.0 131 | - wmi==1.5.1 132 | - yarl==1.11.1 -------------------------------------------------------------------------------- /postprocess/completion.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import sys 4 | import random 5 | import concurrent.futures 6 | from datetime import datetime 7 | from openai import OpenAI 8 | from concurrent.futures import ThreadPoolExecutor 9 | from prompt import * 10 | from utils import * 11 | 12 | client = OpenAI() 13 | model = "gpt-4o" 14 | 15 | CONCURRENT_NUM = 80 16 | RE_GENERATE = False 17 | MAX_CONTEXT_ENTRIES = 10 18 | DETAILED_OUTPUT = False 19 | 20 | 21 | def call_openai(query, base64_image=None): 22 | messages = [ 23 | { 24 | "role": "user", 25 | "content": [ 26 | { 27 | "type": "image_url", 28 | 
"image_url": { 29 | "url": f"data:image/jpeg;base64,{base64_image}" 30 | } 31 | } if base64_image else None, 32 | { 33 | "type": "text", 34 | "text": query 35 | }, 36 | ], 37 | }, 38 | ] 39 | 40 | completion = client.chat.completions.create( 41 | model=model, 42 | messages=messages, 43 | max_tokens=1000 44 | ) 45 | 46 | reply = completion.choices[0].message.content 47 | return reply 48 | 49 | 50 | def process_concurrently(data_dir, function): 51 | tasks = [] 52 | 53 | for item in os.listdir(data_dir): 54 | item_path = os.path.join(data_dir, item) 55 | directory_name = "events_" 56 | 57 | if os.path.isdir(item_path) and item.startswith(directory_name): 58 | print(f'Processing directory: {item_path}') 59 | for filename in os.listdir(item_path): 60 | if filename.endswith('.jsonl') and 'task' in filename: 61 | file_path = os.path.join(item_path, filename) 62 | md_path = os.path.join(item_path, filename.replace('.jsonl', '.md')) 63 | try: 64 | with open(md_path, 'r', encoding='utf-8') as file: 65 | lines = file.readlines() 66 | task_description = lines[1].replace('**Description:** ', '').strip() 67 | tasks.append((file_path, task_description)) 68 | except Exception as e: 69 | print(f"error: failed to extract task description from {md_path}: {e}") 70 | 71 | random.shuffle(tasks) 72 | with ThreadPoolExecutor(max_workers=CONCURRENT_NUM) as executor: 73 | futures = [executor.submit(function, file_path, task_description) 74 | for file_path, task_description in tasks] 75 | concurrent.futures.wait(futures) 76 | 77 | 78 | def get_action_description(action, element_name, marked_screenshot_path=None, marked_screenshot_data=None): 79 | """ 80 | Generate action description for click-related actions. 
81 | """ 82 | if marked_screenshot_path: 83 | base64_image = encode_image(marked_screenshot_path) 84 | elif marked_screenshot_data: 85 | base64_image = marked_screenshot_data 86 | else: 87 | base64_image = None 88 | 89 | click_action, _ = parse_click_action(action) 90 | if click_action: 91 | # Is a click-related action, generate action description 92 | query = CLICK_ACTION_DESCRIPTION_PROMPT \ 93 | + f"The name of the clicked target for reference: {element_name}\n\n" 94 | 95 | reply = call_openai(query, base64_image) 96 | description = f"{click_action} <\{reply}>" 97 | 98 | else: 99 | # Not a click-related action, return the original action as description 100 | description = action 101 | 102 | return description 103 | 104 | 105 | def get_action_description_check(action, element_name, action_description, marked_screenshot_path=None, marked_screenshot_data=None): 106 | """ 107 | Check the action description for click-related actions. 108 | """ 109 | if marked_screenshot_path: 110 | base64_image = encode_image(marked_screenshot_path) 111 | elif marked_screenshot_data: 112 | base64_image = marked_screenshot_data 113 | else: 114 | base64_image = None 115 | 116 | click_action, coordinates = parse_click_action(action) 117 | if click_action: 118 | # Is a click-related action, check the action description 119 | x, y = coordinates 120 | clicked_element_description = action_description.split('<\\')[1].split('>')[0] 121 | 122 | query = CLICK_ACTION_DESCRIPTION_CHECK_PROMPT \ 123 | + f"The exact coordinates of the mouse click: ({x}, {y})\n" \ 124 | + f"The element name from the accessibility tree: {element_name}\n" \ 125 | + f"The pre-generated description of the click location: {clicked_element_description}\n" 126 | 127 | try_time = 0 128 | 129 | while True: 130 | try_time += 1 131 | 132 | reply = call_openai(query, base64_image) 133 | 134 | if "Answer:" in reply: 135 | check_result = reply.split("Answer:")[1].strip().strip('*') 136 | break 137 | 138 | if try_time > 3: 139 | 
check_result = None 140 | print(f"action description check failed after 3 attempts in {marked_screenshot_path}") 141 | break 142 | 143 | else: 144 | # Not a click-related action, return None 145 | check_result = None 146 | 147 | if check_result and check_result.strip().startswith("Wrong"): 148 | modified_description = check_result.split("Wrong. Correct Description:")[1].strip() 149 | final_answer = f"{click_action} <\\{modified_description}>" 150 | else: 151 | final_answer = None 152 | 153 | return check_result, final_answer 154 | 155 | 156 | def get_thought(task_description, action, history, following_actions, marked_screenshot_path=None, marked_screenshot_data=None): 157 | """ 158 | Generate thought for the action. 159 | """ 160 | if marked_screenshot_path: 161 | base64_image = encode_image(marked_screenshot_path) 162 | elif marked_screenshot_data: 163 | base64_image = marked_screenshot_data 164 | else: 165 | base64_image = None 166 | 167 | query = THOUGHT_PROMPT \ 168 | + f"The task you are attempting to complete: {task_description}\n\n" \ 169 | + f"Your performing history:\n{history}\n\n" \ 170 | + f"Your subsequent actions:\n{following_actions}\n\n" \ 171 | + f"The specific action you chose to perform: {action}\n\n" 172 | 173 | thought = call_openai(query, base64_image) 174 | 175 | if "Action:" in thought: 176 | print(f"warning: found 'Action:' in thought generation, deleting it") 177 | thought = thought.split("Action:")[0].strip() 178 | if "*Action*:" in thought: 179 | print(f"warning: found '*Action*:' in thought generation, deleting it") 180 | thought = thought.split("*Action*:")[0].strip() 181 | 182 | return thought 183 | 184 | 185 | def add_field_for_file(file_path, field, task_description): 186 | print(f"begin adding {field} for {file_path}") 187 | entries = [] 188 | 189 | try: 190 | with open(file_path, 'r', encoding='utf-8') as file: 191 | entries = [json.loads(line) for line in file] 192 | except Exception as e: 193 | print(f"error: failed to read file 
{file_path}: {e}") 194 | return 195 | 196 | if field == 'thought': 197 | all_actions = [entry['action_description'] for entry in entries] 198 | 199 | try: 200 | for id, entry in enumerate(entries): 201 | if field in entry and entry[field] and content_is_rational(entry[field]): 202 | if not RE_GENERATE: 203 | continue 204 | 205 | if 'marked_screenshot' not in entry: 206 | print(f"error: marked_screenshot field not found: {file_path}") 207 | continue 208 | 209 | marked_screenshot_path = os.path.join(os.path.dirname(file_path), entry['marked_screenshot']) 210 | if not os.path.isfile(marked_screenshot_path): 211 | print(f"error: screenshot file not found: {marked_screenshot_path}") 212 | continue 213 | 214 | try: 215 | if field == 'action_description': 216 | action_description = get_action_description(entry['action'], entry['element'], marked_screenshot_path=marked_screenshot_path) 217 | 218 | entry['action_description'] = action_description 219 | 220 | if DETAILED_OUTPUT: 221 | print(f"generated action_description: {action_description}") 222 | elif field == 'action_description_checked': 223 | action_description_checked, modified_action_description = get_action_description_check( 224 | entry['action'], entry['element'], entry['action_description'], 225 | marked_screenshot_path=marked_screenshot_path) 226 | 227 | entry['action_description_checked'] = action_description_checked 228 | 229 | if modified_action_description: 230 | entry['action_description'] = modified_action_description 231 | 232 | if DETAILED_OUTPUT and action_description_checked: 233 | print(f"generated action_description_checked: {action_description_checked}") 234 | elif field == 'thought': 235 | # build history steps 236 | history_steps = [] 237 | start_idx = max(0, id - MAX_CONTEXT_ENTRIES) 238 | for hist_id in range(start_idx, id): 239 | hist_entry = entries[hist_id] 240 | if 'thought' in hist_entry and hist_entry['thought'] and content_is_rational(hist_entry['thought']): 241 | 
history_steps.append(f"{hist_id+1}:\nThought: {hist_entry['thought']}\nAction: {hist_entry['action_description']}") 242 | # build subsequent steps 243 | subsequent_actions = all_actions[id+1:id+1+MAX_CONTEXT_ENTRIES] 244 | subsequent_actions_str = get_action_string(subsequent_actions) 245 | thought = get_thought( 246 | task_description, entry['action_description'], '\n'.join(history_steps), subsequent_actions_str, marked_screenshot_path=marked_screenshot_path) 247 | 248 | entry['thought'] = thought 249 | 250 | if DETAILED_OUTPUT: 251 | print(f"generated thought: {thought}") 252 | else: 253 | print(f"error: unknown field: {field}") 254 | except Exception as e: 255 | print(f"error: failed to get {field} for {marked_screenshot_path}: {e}") 256 | continue 257 | 258 | with open(file_path, 'w', encoding='utf-8') as file: 259 | for entry in entries: 260 | json.dump(entry, file, ensure_ascii=False) 261 | file.write('\n') 262 | 263 | rewrite_markdown_file_by_jsonl(file_path) 264 | print(f"finished adding {field} for {file_path}") 265 | 266 | except Exception as e: 267 | print(f"error: failed to process file {file_path}: {e}") 268 | if "Expecting" in str(e) or "Invalid control character" in str(e): 269 | print(f"file {file_path} is corrupted, deleting...") 270 | try: 271 | os.remove(file_path) 272 | print(f"deleted corrupted file: {file_path}") 273 | except OSError as delete_error: 274 | print(f"error: failed to delete corrupted file: {delete_error}") 275 | 276 | 277 | def action_semantic_completion(file_path, task_description): 278 | """ 279 | Adds the field 'action_description' to the jsonl file as action semantics. 280 | 281 | Parameters: 282 | file_path (str): The path to the jsonl file to be processed. 
283 | """ 284 | # add action description 285 | add_field_for_file(file_path, 'action_description', task_description) 286 | # add action description check 287 | add_field_for_file(file_path, 'action_description_checked', task_description) 288 | # add thought 289 | add_field_for_file(file_path, 'thought', task_description) 290 | 291 | 292 | 293 | if __name__ == "__main__": 294 | start_time = datetime.now() 295 | print(f"start time: {start_time}") 296 | 297 | current_dir = os.path.dirname(os.path.abspath(__file__)) 298 | if len(sys.argv) > 1: 299 | data_dir = os.path.join(current_dir, sys.argv[1]) 300 | else: 301 | data_dir = os.path.join(current_dir, 'data') # write the data directory here 302 | if not os.path.exists(data_dir): 303 | print(f"error: {data_dir} directory does not exist") 304 | exit() 305 | 306 | process_concurrently(data_dir, action_semantic_completion) 307 | 308 | end_time = datetime.now() 309 | print(f"end time: {end_time}") 310 | print(f"Total time: {end_time - start_time}") 311 | -------------------------------------------------------------------------------- /postprocess/data/events_example/free_task_20241126_161517.jsonl: -------------------------------------------------------------------------------- 1 | {"timestamp": "2024-11-26_16:15:19", "action": "click (828, 1075)", "screenshot": "screenshot/20241126_161519_1.png", "element": "Start", "rect": {"left": 794, "top": 1020, "right": 850, "bottom": 1080}, "marked_screenshot": "screenshot/20241126_161519_1_marked.png"} 2 | {"timestamp": "2024-11-26_16:15:21", "action": "click (1111, 334)", "screenshot": "screenshot/20241126_161521_2.png", "element": "PowerPoint", "rect": {"left": 1079, "top": 273, "right": 1199, "bottom": 378}, "marked_screenshot": "screenshot/20241126_161521_2_marked.png"} 3 | {"timestamp": "2024-11-26_16:15:24", "action": "click (407, 275)", "screenshot": "screenshot/20241126_161524_3.png", "element": "Blank Presentation", "rect": {"left": 225, "top": 181, "right": 450, 
"bottom": 371}, "marked_screenshot": "screenshot/20241126_161524_3_marked.png"} 4 | {"timestamp": "2024-11-26_16:15:26", "action": "click (1048, 546)", "screenshot": "screenshot/20241126_161526_4.png", "element": "Unknown", "rect": {"left": -9, "top": 184, "right": 1929, "bottom": 1002}, "marked_screenshot": "screenshot/20241126_161526_4_marked.png"} 5 | {"timestamp": "2024-11-26_16:15:28", "action": "type text: Hello, world", "screenshot": "screenshot/20241126_161528_5.png", "element": null, "rect": null, "marked_screenshot": "screenshot/20241126_161528_5.png"} 6 | {"timestamp": "2024-11-26_16:15:36", "action": "click (1877, 21)", "screenshot": "screenshot/20241126_161536_6.png", "element": "Close", "rect": {"left": 1860, "top": 0, "right": 1920, "bottom": 60}, "marked_screenshot": "screenshot/20241126_161536_6_marked.png"} 7 | {"timestamp": "2024-11-26_16:15:38", "action": "click (951, 658)", "screenshot": "screenshot/20241126_161538_7.png", "element": "Save", "rect": {"left": 901, "top": 645, "right": 989, "bottom": 675}, "marked_screenshot": "screenshot/20241126_161538_7_marked.png"} 8 | {"timestamp": "2024-11-26_16:15:40", "action": "finish", "screenshot": "screenshot/20241126_161540_8.png", "element": null, "rect": null, "marked_screenshot": "screenshot/20241126_161540_8.png"} 9 | -------------------------------------------------------------------------------- /postprocess/data/events_example/free_task_20241126_161517.md: -------------------------------------------------------------------------------- 1 | # Free Task 2 | **Description:** create a new blank slide with title "Hello, world" on desktop. 3 | 4 | **Level:** easy 5 | 6 | ### 2024-11-26_16:15:19 7 | **Input:** 8 | 9 | Given the screenshot as below. What's the next step that you will do to help with the task? 10 | 11 | 12 | 13 | **Output:** 14 | 15 | click element Start at (828, 1075) 16 | 17 | 18 | ### 2024-11-26_16:15:21 19 | **Input:** 20 | 21 | Given the screenshot as below. 
What's the next step that you will do to help with the task? 22 | 23 | 24 | 25 | **Output:** 26 | 27 | click element PowerPoint at (1111, 334) 28 | 29 | 30 | ### 2024-11-26_16:15:24 31 | **Input:** 32 | 33 | Given the screenshot as below. What's the next step that you will do to help with the task? 34 | 35 | 36 | 37 | **Output:** 38 | 39 | click element Blank Presentation at (407, 275) 40 | 41 | 42 | ### 2024-11-26_16:15:26 43 | **Input:** 44 | 45 | Given the screenshot as below. What's the next step that you will do to help with the task? 46 | 47 | 48 | 49 | **Output:** 50 | 51 | click element Unknown at (1048, 546) 52 | 53 | 54 | ### 2024-11-26_16:15:28 55 | **Input:** 56 | 57 | Given the screenshot as below. What's the next step that you will do to help with the task? 58 | 59 | 60 | 61 | **Output:** 62 | 63 | type text: Hello, world 64 | 65 | 66 | ### 2024-11-26_16:15:36 67 | **Input:** 68 | 69 | Given the screenshot as below. What's the next step that you will do to help with the task? 70 | 71 | 72 | 73 | **Output:** 74 | 75 | click element Close at (1877, 21) 76 | 77 | 78 | ### 2024-11-26_16:15:38 79 | **Input:** 80 | 81 | Given the screenshot as below. What's the next step that you will do to help with the task? 82 | 83 | 84 | 85 | **Output:** 86 | 87 | click element Save at (951, 658) 88 | 89 | 90 | ### 2024-11-26_16:15:40 91 | **Input:** 92 | 93 | Given the screenshot as below. What's the next step that you will do to help with the task? 
94 | 95 | 96 | 97 | **Output:** 98 | 99 | finish 100 | 101 | 102 | -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161519_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161519_1.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161519_1_marked.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161519_1_marked.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161521_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161521_2.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161521_2_marked.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161521_2_marked.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161524_3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161524_3.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161524_3_marked.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161524_3_marked.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161526_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161526_4.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161526_4_marked.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161526_4_marked.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161528_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161528_5.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161536_6.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161536_6.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161536_6_marked.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161536_6_marked.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161538_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161538_7.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161538_7_marked.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161538_7_marked.png -------------------------------------------------------------------------------- /postprocess/data/events_example/screenshot/20241126_161540_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GAIR-NLP/PC-Agent/e60ee6e225d367b121da2d40c4da4bc71bdfcf99/postprocess/data/events_example/screenshot/20241126_161540_8.png -------------------------------------------------------------------------------- /postprocess/prompt.py: -------------------------------------------------------------------------------- 1 | CLICK_ACTION_DESCRIPTION_PROMPT = """Help me describe the target in the 
screenshot. The target may be a GUI element or an empty area on the screen. 2 | 3 | You will be provided with: 4 | 1. A screenshot with a red mark quadruplet: 5 | - Frame: rectangular border around the target (may be inaccurate) 6 | - Circle: circle at the center of the target 7 | - Point: dot marking the exact click position 8 | - Arrow: pointing to the target 9 | 2. The name of the clicked target. It's just for reference. If this name is "Unknown" or appears to be incorrect, just ignore it. 10 | 11 | Description Rules: 12 | 1. Priority Order: 13 | - Highest: Circle, Point and Arrow 14 | - Second: Reference name (if reliable) 15 | - Lowest: Frame 16 | 17 | 2. Description Strategy: 18 | A. For Clear GUI Elements: 19 | - Include position info ("top", "left", "center", etc.) if possible 20 | - Use visual information to describe the element 21 | - Refer to the provided element name if reliable 22 | - Examples: 23 | √ "the button in the top-right corner of the window" 24 | √ "the current tab at the top of the browser" 25 | x "the red circle" (the red marks don't belong to the original screenshot or element) 26 | 27 | B. For Empty Areas or Uncertain Elements: 28 | - Focus on positional relationships 29 | - Use visual information to locate the position 30 | - Examples: 31 | √ "empty area on the right side of the window" 32 | √ "area near the bottom toolbar" 33 | 34 | 3. 
Prohibited: 35 | - No speculation about element functions 36 | - No uncertain terms like "seems", "appears", "probably" 37 | - No description of elements outside the frame 38 | 39 | Output Format: 40 | - For GUI elements: "{position description} + {element description}" 41 | - For empty areas: "empty area + {position description}" 42 | 43 | Examples: 44 | √ "the close button in the top-right corner of the window" 45 | √ "the 'Chrome' icon on the desktop" 46 | √ "the left thumbnail panel in current window" 47 | √ "the 'Images' tab below the search bar" 48 | √ "'click to add title'" 49 | √ "the button in the top-right corner of the browser" (when the reference name is not reliable and you are unsure about the element) 50 | x "what appears to be a settings button" (avoid speculation) 51 | 52 | Important: 53 | 1. Carefully observe the screenshot and the red mark quadruplet. Use these visual cues to describe the element or position as accurately as possible. But **DO NOT** explicitly state the red marks in your description. Avoid phrases like "red arrow marking on the slide.." or "the red circle.." 54 | 2. When uncertain, prefer positional description over semantic or functional speculation. Be extraordinarily cautious to avoid hallucinations. 55 | 3. Be precise and output the description directly in an objective tone. Avoid sentences starting with "the target is","The pointed target is", or "it appears to be". 56 | 4. Do not directly use the provided element name, create your own natural description based on visual information. 57 | 58 | Note: 59 | 1. For the name of the clicked target for reference, it is either very precise or completely worthless. Judge its reliability based on visual information. 60 | If unreliable, ignore it and be cautious, preferably using only positional descriptions; if reliable, try to expand on its description as much as possible. 61 | 62 | 2. 
Special cases: for the text box in PowerPoint, the name of the clicked target is usually "click to add title" or "click to add text". 63 | - "'click to add title'": for the title text box above the content text box or on the cover slide 64 | - "'click to add text'": for the content text box below the title text box 65 | - "'click to add subtitle'": for the subtitle text box below the title text box 66 | - "'the left thumbnail panel in current window'": for the **left slides thumbnail panel in PowerPoint**. But **DO NOT** abuse the use of "thumbnail" in other cases. 67 | """ 68 | 69 | CLICK_ACTION_DESCRIPTION_CHECK_PROMPT = """ 70 | You are provided with the following information about a mouse click on a computer screen: 71 | 72 | 1. A screenshot showing: 73 | - A red dot and circle marking the exact click location 74 | - A red arrow pointing to the click location 75 | - A red box outlining the general area of the clicked element 76 | Note: While the dot, circle, and arrow are precise, the box might be less accurate 77 | 78 | 2. The exact coordinates of the mouse click 79 | 80 | 3. The element name from the accessibility tree 81 | Note: This information might be incomplete, with many elements labeled as "unknown". 82 | 83 | 4. A pre-generated description of the click location 84 | Types: 85 | - Empty area description (e.g., "empty area near the bottom toolbar") 86 | - Specific element description (e.g., "the start button on the left corner of the taskbar") 87 | 88 | # Your Task 89 | Evaluate the provided description, determine if it is accurate. If not, provide the correct description. You can describe it as an empty area or a specific element. Do not mention the red marks on the screenshot. 90 | 91 | # Critical Evaluation Points 92 | 1. **Priority of Visual Evidence**: The red markers (dot, circle, arrow) on the screenshot show the ACTUAL click location. This is your primary source of truth. But **DO NOT** explicitly state the red marks in your description. 
Avoid phrases like "red arrow marking on the slide.." or "the red circle.." 93 | 94 | 2. **Element Name Usage**: 95 | - Ignore if marked as "unknown" 96 | - If available, use it to verify the description's accuracy 97 | - If there's a conflict between the element name and the description, carefully evaluate which is correct 98 | 99 | 3. **Empty Area vs. Specific Element Distinction**: 100 | - True empty areas: Locations where clicks produce no effect 101 | - False empty areas: Locations that appear empty but are part of specific functional elements 102 | 103 | # Evaluation Process 104 | 1. First, locate the exact click point using the red markers 105 | 2. Check if the provided element name offers any useful information 106 | 3. Determine if the location is a true empty area or part of a specific functional element 107 | 4. Compare the given description against your findings 108 | 5. Provide your response based on the required format 109 | 110 | # Important 111 | - Carefully determine the wrong description. Most of the time, the provided description is correct. 112 | - The pre-generated description may have hallucinations. Carefully evaluate it. 113 | 114 | Final Answer Format:(Response in English even the element name is Chinese) 115 | Thought Process: {your thought process} 116 | Answer:{your answer} 117 | 118 | Your answer should be either: 119 | - "Good" if the description is accurate 120 | - "Wrong. Correct Description: {your description}" if the description is inaccurate 121 | -------------------------------------------- 122 | 123 | """ 124 | 125 | 126 | THOUGHT_PROMPT = """You are a helpful PC Agent designed to complete tasks on a computer. Your goal is to recreate your **thought process** behind a specific action. 127 | 128 | You will be provided with: 129 | 130 | 1. The task you are attempting to complete. 131 | 2. A history of the steps you have already performed (up to 50, if any; none if it was the first action). 132 | 3. 
Subsequent actions (none if this is the last action). 133 | 4. The specific action you chose to take. 134 | 5. A screenshot of the computer screen at the moment you decided to take the action. 135 | 6. The red marks on the screenshot: 136 | A. For Click Actions (click, right click, double click): 137 | - Frame: rectangular border around clicked element 138 | - Center: circle at element center 139 | - Click: point at exact click position 140 | - Arrow: pointing to clicked element 141 | B. For Drag Actions: 142 | - Start: red point and circle 143 | - End: red point and circle 144 | - Arrow: from start to end position 145 | 146 | Explanation of actions: 147 | 1. **click element: <{element description}>**: Click the element described by `{element description}`. 148 | 2. **right click element: <{element description}>**: Right-click the element described by `{element description}`. 149 | 3. **double click element: <{element description}>**: Double-click the element described by `{element description}`. 150 | 4. **drag from (x1, y1) to (x2, y2)**: Drag the mouse from the position (x1, y1) to (x2, y2). 151 | 5. **scroll (dx, dy)**: Scroll with offsets (dx for horizontal movement, dy for vertical movement). 152 | 6. **press key: key_content**: Press the `key_content` on the keyboard. 153 | 7. **hotkey (key1, key2)**: Press the combination of `key1` and `key2`. 154 | 8. **type text: text_content**: Type the text `text_content` on the keyboard. 155 | 9. **wait**: Pause briefly, usually for system responses or screen updates. 156 | 10. **finish**: Indicate the task has been completed. 157 | 11. **fail**: Indicate the task has failed. 158 | 159 | Further explanation of the drag operation: drag from (x1, y1) to (x2, y2) is a combination of pressing the mouse at (x1, y1) and dragging it to (x2, y2). It may serve the following purposes: 160 | 1. 
Move/Translate - Moving an element from position (x1,y1) to (x2,y2) 161 | Common scenarios: 162 | - Dragging a file/folder to a new location 163 | - Moving a UI element (window, widget) to a different position 164 | - Moving elements (shapes, text boxes, images) in a PowerPoint slide 165 | - Adjusting slider controls or resizing elements 166 | - Reordering items in a list or menu 167 | 168 | 2. Range Selection - Selecting content within a rectangular region defined by (x1,y1) and (x2,y2) as diagonal points 169 | Common scenarios: 170 | - Selecting multiple files/icons in a folder 171 | - Selecting text in a document. This is usually performed before a copy/cut/delete/adjust-text operation. After this action, the selected text will be highlighted. 172 | - Selecting cells in a spreadsheet 173 | - Drawing a selection rectangle on a canvas 174 | 175 | Consider the following to give your thought process: 176 | 1. The current state of the screen and your last step (if any). Does the current state align with your last plan? Is this action trying to fix something? 177 | 2. Based on the history steps, how far have you progressed in the whole task? And based on your subsequent actions, what is the expected outcome of this action? (**DO NOT** explicitly state the next action in your output.) 178 | 3. Based on all the information (task, observation, history, future), if this action seems unrelated to the task, is it possibly exploring the environment? 179 | Based on the above, recreate your thought process in a clear, natural first-person narrative. 180 | 181 | Other requirements: 182 | 1. Be confident in your thought process. Avoid speculative or uncertain phrases like "it seems" or "this action might have been for." 183 | 2. You may reference future actions as context, but **DO NOT** explicitly state the next action in your explanation. 184 | 3. 
If there are red marks on the screenshot, you should use them to understand the action, but **DO NOT** explicitly state the red marks in your explanation. Avoid phrases like "I notice the red circles around..." or "the red arrow indicates...". 185 | 4. Keep your explanations **concise and short**; avoid meaningless analysis and emphasis. 186 | 5. Do not repeat the action after your thought process. 187 | 188 | Here are some examples of the thought process: 189 | - "I see the 'View' menu is successfully opened, so I can click the 'Slide Master' button on it to change the font for all slides." 190 | - "To open the target PowerPoint file, I should open the folder containing it first. So I need to click the folder icon on the left of the taskbar." 191 | - "I wanted to click the close button to close the current window in my last step, but I see it is not closed yet in the current screen. Maybe my click was not precise last time, so I need to click it again. I should click the close button in the top-right corner of the window." 192 | - "After saving the file to the desktop, I have successfully completed the task." 193 | - "I need to modify the 5th slide, but it is not in the current screen. I should scroll down the page to find it." 194 | - "I have inserted a new text box and focused on it, so I can type the content now." 195 | - "I have finished typing content in the text box. Now I can click anywhere outside the text box to deselect it and view the content on the slide." 196 | - "I see the current file name is 'Untitled', so I should change it to a proper name. First I need to click the text box of the file name to focus on it." 197 | - "I need to insert a new slide, so I can first click the left thumbnail panel in the PowerPoint window." 198 | - "I need to insert a new slide, and I have clicked the left thumbnail panel in the PowerPoint window. Now I need to press key enter to insert a new slide." 
199 | 200 | Examples of thought processes for exploratory actions: 201 | - "I need to save the file to the desktop, but I don't see a desktop option in the window. Maybe I should scroll down to see if there's a desktop option." 202 | - "I want to select the save button, but I don't see a save option in the window. I guess I might find it by clicking the File button." 203 | - "I need to open the settings menu, but I don't see an obvious settings icon on the current interface. Perhaps I should click on the three dots or three horizontal lines icon in the top right corner, as these often hide more options." 204 | - "I want to change the document's font, but I can't find the font option on the toolbar. I might need to click on the 'Format' or 'Style' menu to see if I can find the font settings there." 205 | - "I need to insert an image, but I don't see an obvious 'Insert' button. I guess I might need to right-click on a blank area of the document to see if there's an option to insert an image in the context menu." 206 | - "I want to check the version information of this application, but I can't find the relevant option on the main interface. Maybe I should click on the 'Help' or 'About' menu, as version information is often found there." 207 | - "I need to exit this full-screen program, but I don't see an exit button. I can try pressing the ESC key or moving the mouse to the top of the screen to see if a hidden menu bar appears." 208 | - "I want to search for specific content on this webpage, but I don't see a search box. I can try using the shortcut Ctrl+F (or Command+F) to see if it brings up the in-page search function." 209 | 210 | Additional PowerPoint Operation Tip: 211 | - These steps are to add a new slide at the end of the presentation: 212 | 1. Click in the left thumbnail panel of the PowerPoint window. 213 | 2. Press the Enter key to insert a new slide. 214 | - These steps are to add text in the text box: 215 | 1. 
Click 'click to add text'/'click to add title'/'click to add subtitle' to focus on the text box. 216 | 2. Type the content in the text box. 217 | 3. (Optional) Press the Enter key to finish. 218 | 219 | Again, you are recreating your thought process when you made the action, so you should not include any post-event evaluation or similar phrases. 220 | 221 | -------------------------------------------- 222 | 223 | """ 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | -------------------------------------------------------------------------------- /postprocess/refinement.py: -------------------------------------------------------------------------------- 1 | # multi-function script for data refinement 2 | # 1. rewrite screenshot path 3 | # 2. clean fail and error record 4 | # 3. check last action finish 5 | # 4. merge press and drag 6 | # 5. remove redundant actions 7 | # 6. remove meaningless actions 8 | # 7. resize screenshot and coordinates to 1080p 9 | # 8. clean tracker interface 10 | # 9. mark screenshot with red rect and point 11 | # 10. rewrite markdown file 12 | # 11. statistics 13 | # support interrupt 14 | 15 | import os 16 | import json 17 | import sys 18 | import numpy as np 19 | from PIL import Image 20 | from utils import * 21 | 22 | OVERWRITE_MARKED = False 23 | REMOVE_FAIL_RECORD = True 24 | DETAIL_OUTPUT = False 25 | 26 | 27 | def screenshot_of_tracker(screenshot_path, sample_size=100): 28 | """ 29 | check if the screenshot is a Tracker interface. 
30 | """ 31 | if get_file_size_kb(screenshot_path) > 83: # magic number 32 | return False 33 | 34 | bg_color = "#f0f0f0" 35 | bg_threshold = 0.8 36 | top_offset = 40 # top area offset 37 | bottom_offset = 80 # bottom area offset 38 | 39 | with Image.open(screenshot_path) as img: 40 | width, height = img.size 41 | 42 | # define the sampling regions 43 | sample_regions = [ 44 | (0, top_offset, sample_size, sample_size + top_offset), # top left corner 45 | (width - sample_size, top_offset, width, sample_size + top_offset), # top right corner 46 | (0, height - sample_size - bottom_offset, sample_size, height - bottom_offset), # bottom left corner 47 | (width - sample_size, height - sample_size - bottom_offset, width, height - bottom_offset) # bottom right corner 48 | ] 49 | 50 | # convert the background color to numpy array 51 | bg_color_rgb = np.array([int(bg_color[i:i + 2], 16) for i in (1, 3, 5)]) 52 | 53 | # check the four regions 54 | for region in sample_regions: 55 | sample_region = img.crop(region) 56 | sample_array = np.array(sample_region)[:, :, :3] 57 | matches = np.all(sample_array == bg_color_rgb, axis=2) 58 | bg_ratio = np.sum(matches) / matches.size 59 | 60 | if bg_ratio < bg_threshold: 61 | return False 62 | 63 | return True 64 | 65 | 66 | def clean_tracker_interface(file_path): 67 | """ 68 | clean the action records of the Tracker interface. 
69 | 70 | return the number of actions after cleaning, -1 means the file is deleted 71 | """ 72 | if DETAIL_OUTPUT: 73 | print(f"Clean tracker interface: {file_path}") 74 | screenshot_paths = [] 75 | entries = [] 76 | 77 | with open(file_path, 'r', encoding='utf-8') as file: 78 | for line in file: 79 | entry = json.loads(line) 80 | full_path = os.path.join(os.path.dirname(file_path), entry['screenshot']) 81 | screenshot_paths.append(full_path) 82 | entries.append(entry) 83 | 84 | last_entry_action = entries[-1].get('action') 85 | markdown_path = file_path.replace('.jsonl', '.md') 86 | 87 | # scan and identify the action of the Tracker interface 88 | begin = -1 89 | interval_list = [] # [begin, end) 90 | for index, screenshot_path in enumerate(screenshot_paths): 91 | # find the screenshot of the Tracker interface 92 | if screenshot_of_tracker(screenshot_path): 93 | if begin == -1: 94 | begin = index 95 | else: 96 | # back to the screenshot of non-Tracker interface, end the interval 97 | if begin != -1: 98 | interval_list.append((begin, index)) 99 | begin = -1 100 | 101 | interval_list.append((begin, len(screenshot_paths))) # the last interval (begin maybe -1) 102 | 103 | # delete the last interval (finish/fail) 104 | begin, end = interval_list.pop() 105 | if begin != -1: 106 | entries = entries[:begin] 107 | print(f"begin: {begin}, end: {end}") 108 | try: 109 | entries[-1]['action'] = last_entry_action 110 | entries[-1]['element'] = None 111 | entries[-1]['rect'] = None 112 | except Exception as e: # empty data 113 | print(f"Error: {e}") 114 | print("Delete related records (probably empty)...") 115 | # delete the JSONL file 116 | os.remove(file_path) 117 | # delete the Markdown file 118 | os.remove(markdown_path) 119 | # delete the screenshot files 120 | for screenshot_path in screenshot_paths: 121 | remove_screenshot(screenshot_path) 122 | return -1 123 | 124 | for i in range(begin, end): 125 | remove_screenshot(screenshot_paths[i]) 126 | 127 | # delete other 
intervals 128 | to_remove_entry_set = set() 129 | for begin, end in interval_list: 130 | for i in range(begin - 1, end): 131 | remove_screenshot(screenshot_paths[i]) 132 | to_remove_entry_set.add(i) 133 | 134 | entries = [entry for i, entry in enumerate(entries) if i not in to_remove_entry_set] 135 | 136 | # save the updated JSONL file 137 | with open(file_path, 'w', encoding='utf-8') as file: 138 | for entry in entries: 139 | json.dump(entry, file, ensure_ascii=False) 140 | file.write('\n') 141 | 142 | return len(entries) 143 | 144 | 145 | def clean_fail_and_error(file_path): 146 | """ 147 | clean the records without corresponding Markdown files or the last action is 'fail'. 148 | 149 | return True if the file is deleted, False otherwise. 150 | """ 151 | markdown_path = file_path.replace('.jsonl', '.md') 152 | if DETAIL_OUTPUT: 153 | print(f"Clean fail: {file_path}") 154 | try: 155 | with open(file_path, 'r', encoding='utf-8') as infile: 156 | entries = [json.loads(line) for line in infile] 157 | except Exception as e: 158 | print(f"[ERROR] Failed to read file {file_path}: {e}") 159 | return False 160 | 161 | screenshot_paths = [os.path.join(os.path.dirname(file_path), entry['screenshot']) for entry in entries] 162 | last_entry_action = entries[-1]['action'] if entries else '' 163 | 164 | # delete the records without corresponding Markdown files 165 | if not os.path.exists(markdown_path): 166 | print(f"File {file_path} has no corresponding Markdown file") 167 | print("Delete related records...") 168 | # delete the JSONL file 169 | os.remove(file_path) 170 | # delete the screenshot files 171 | for screenshot_path in screenshot_paths: 172 | remove_screenshot(screenshot_path) 173 | return True 174 | 175 | # clean the fail records (optional) 176 | if REMOVE_FAIL_RECORD and last_entry_action == 'fail': 177 | print(f"File {file_path} ends with fail action") 178 | print("Delete related records...") 179 | # delete the JSONL file 180 | os.remove(file_path) 181 | # delete 
the Markdown file 182 | os.remove(markdown_path) 183 | # delete the screenshot files 184 | for screenshot_path in screenshot_paths: 185 | remove_screenshot(screenshot_path) 186 | return True 187 | 188 | return False 189 | 190 | 191 | def resize(file_path): 192 | if DETAIL_OUTPUT: 193 | print(f"Resize file: {file_path}") 194 | 195 | # get the directory of the file 196 | task_dir = os.path.dirname(file_path) 197 | 198 | # read the screenshot path of the last entry 199 | try: 200 | with open(file_path, 'r', encoding='utf-8') as infile: 201 | lines = infile.readlines() 202 | last_line = lines[-1] 203 | last_entry = json.loads(last_line) 204 | screenshot_path = os.path.join(task_dir, last_entry['screenshot']) 205 | except Exception as e: 206 | print(f"[ERROR] Failed to read the screenshot path of the last entry: {e}") 207 | return 208 | 209 | if not os.path.exists(screenshot_path): 210 | print(f"[ERROR] The screenshot file does not exist: {screenshot_path}") 211 | return 212 | 213 | # get the resolution of the screenshot 214 | try: 215 | with Image.open(screenshot_path) as img: 216 | original_width, original_height = img.size 217 | if DETAIL_OUTPUT: 218 | print(f"Original resolution: {original_width}x{original_height}") 219 | except Exception as e: 220 | print(f"[ERROR] Failed to open the screenshot file {screenshot_path}: {e}") 221 | return 222 | 223 | # original_width, original_height = 2560, 1440 224 | 225 | # target resolution 226 | target_width, target_height = 1920, 1080 227 | if original_width == target_width and original_height == target_height: 228 | if DETAIL_OUTPUT: 229 | print(f"The screenshot resolution is the same as the target resolution, no need to resize") 230 | return 231 | 232 | scale_x = target_width / original_width 233 | scale_y = target_height / original_height 234 | if DETAIL_OUTPUT: 235 | print(f"Resize ratio - X: {scale_x:.4f}, Y: {scale_y:.4f}") 236 | 237 | # process the JSONL file 238 | modified_lines = [] 239 | for line in lines: 240 | try: 
241 | data = json.loads(line) 242 | 243 | # process the screenshot 244 | screenshot_path = os.path.join(task_dir, data['screenshot']) 245 | assert resize_to_1080p(screenshot_path), "Error occurred!" 246 | 247 | # process the action 248 | data['action'] = resize_action(data['action'], scale_x, scale_y) 249 | 250 | # process the rect 251 | if 'rect' in data and isinstance(data['rect'], dict): 252 | rect = data['rect'] 253 | rect['left'] = round(rect['left'] * scale_x) 254 | rect['top'] = round(rect['top'] * scale_y) 255 | rect['right'] = round(rect['right'] * scale_x) 256 | rect['bottom'] = round(rect['bottom'] * scale_y) 257 | if DETAIL_OUTPUT: 258 | print(f"Resize rect: {rect}") 259 | 260 | modified_lines.append(json.dumps(data, ensure_ascii=False) + '\n') 261 | except Exception as e: 262 | print(f"[WARNING] Error when processing the line: {line.strip()} - {e}") 263 | modified_lines.append(line) 264 | 265 | # directly write the modified content, overwrite the original file 266 | try: 267 | with open(file_path, 'w', encoding='utf-8') as outfile: 268 | outfile.writelines(modified_lines) 269 | if DETAIL_OUTPUT: 270 | print(f"Saved the modified file: {file_path}") 271 | except Exception as e: 272 | print(f"[ERROR] Failed to write the file {file_path}: {e}") 273 | 274 | 275 | def mark(file_path): 276 | if DETAIL_OUTPUT: 277 | print(f"Mark file: {file_path}") 278 | 279 | # get the directory of the file 280 | task_dir = os.path.dirname(file_path) 281 | 282 | # process the JSONL file 283 | modified_lines = [] 284 | with open(file_path, 'r', encoding='utf-8') as infile: 285 | for line in infile: 286 | entry = json.loads(line) 287 | 288 | if not OVERWRITE_MARKED and 'marked_screenshot' in entry: 289 | if DETAIL_OUTPUT: 290 | print(f"Already marked: {entry['marked_screenshot']}") 291 | modified_lines.append(line) 292 | continue 293 | 294 | screenshot = os.path.join(task_dir, entry.get('screenshot')) 295 | action = entry.get('action') 296 | rect = entry.get('rect') 297 | 298 | 
if rect is not None: # click or drag 299 | click_action_name, coordinates = parse_click_action(action) 300 | if click_action_name is not None: # click related action 301 | x, y = coordinates 302 | marked_screenshot = mark_image(is_click_action=True, image_path=screenshot, rect=rect, point1={'x': x, 'y': y}) 303 | entry['marked_screenshot'] = marked_screenshot 304 | else: # drag related action 305 | (x1, y1), (x2, y2) = parse_drag_action(action) 306 | marked_screenshot = mark_image(is_click_action=False, image_path=screenshot, rect=rect, point1={'x': x1, 'y': y1}, point2={'x': x2, 'y': y2}) 307 | entry['marked_screenshot'] = marked_screenshot 308 | else: 309 | # rect is None, copy the original screenshot path 310 | entry['marked_screenshot'] = screenshot 311 | 312 | # remove the task_dir prefix of marked_screenshot 313 | entry['marked_screenshot'] = entry['marked_screenshot'].replace( 314 | task_dir + '/', '') 315 | 316 | modified_lines.append(json.dumps(entry, ensure_ascii=False) + '\n') 317 | 318 | # write the modified content, overwrite the original file 319 | with open(file_path, 'w', encoding='utf-8') as outfile: 320 | outfile.writelines(modified_lines) 321 | 322 | 323 | def rewrite_screenshot_path(file_path): 324 | if DETAIL_OUTPUT: 325 | print(f"Rewrite screenshot path: {file_path}") 326 | 327 | modified_lines = [] 328 | with open(file_path, 'r', encoding='utf-8') as file: 329 | for line in file: 330 | entry = json.loads(line) 331 | 332 | # process the screenshot field, remove the possible prefix 'events\\' 333 | if entry['screenshot'].startswith('events\\'): 334 | entry['screenshot'] = entry['screenshot'][7:] # remove the 'events\\' prefix 335 | 336 | # replace the backslash with the forward slash (Linux format) 337 | if "\\" in entry['screenshot']: 338 | entry['screenshot'] = entry['screenshot'].replace("\\", "/") 339 | 340 | modified_lines.append(json.dumps(entry, ensure_ascii=False) + '\n') 341 | 342 | with open(file_path, 'w', encoding='utf-8') as outfile: 
343 | outfile.writelines(modified_lines) 344 | 345 | 346 | duplicate_clicks = 0 347 | adjacent_clicks = 0 348 | 349 | 350 | def remove_redundant_actions(file_path): 351 | if DETAIL_OUTPUT: 352 | print(f"Remove redundant actions: {file_path}") 353 | ctrl_cnt = 0 354 | shift_cnt = 0 355 | wait_cnt = 0 356 | all_entries = [] 357 | kept_entries = [] 358 | screenshot_paths = [] 359 | continuous_wait_at_begin = False 360 | 361 | with open(file_path, 'r', encoding='utf-8') as file: 362 | for line in file: 363 | entry = json.loads(line) 364 | all_entries.append(entry) 365 | 366 | total_cnt = len(all_entries) 367 | skip = False 368 | for id, entry in enumerate(all_entries): 369 | if skip: 370 | skip = False 371 | continue 372 | # check the continuous adjacent clicks 373 | screenshot_path = os.path.join(os.path.dirname(file_path), entry['screenshot']) 374 | if entry != all_entries[-1] and 'click' in entry['action'] and 'click' in all_entries[id+1]['action']: 375 | _, (x1, y1) = parse_click_action(entry['action']) 376 | _, (x2, y2) = parse_click_action(all_entries[id+1]['action']) 377 | global adjacent_clicks 378 | global duplicate_clicks 379 | if entry['action'] == all_entries[id+1]['action']: 380 | duplicate_clicks += 1 381 | print(f"action{id}: {entry['action']} in {file_path} is a click identical to the next action") 382 | elif abs(x1-x2) + abs(y1-y2) < 5: 383 | adjacent_clicks += 1 384 | print(f"action{id}: {entry['action']} in {file_path} is a click adjacent to the next action") 385 | 386 | # delete the continuous wait at the beginning 387 | if entry['action'] != 'wait': 388 | continuous_wait_at_begin = False 389 | if entry['action'] == 'wait' and (id == 0 or continuous_wait_at_begin): 390 | wait_cnt += 1 391 | screenshot_paths.append(screenshot_path) 392 | continuous_wait_at_begin = True 393 | # delete the redundant ctrl and shift 394 | elif entry['action'] == 'press key ctrl' and (entry == all_entries[-1] or all_entries[id+1]['action'] == 'press key ctrl' or 
all_entries[id+1]['action'].startswith("hotkey (Ctrl,")): 395 | ctrl_cnt += 1 396 | screenshot_paths.append(screenshot_path) 397 | elif entry['action'] == 'press key shift' and (entry == all_entries[-1] or all_entries[id+1]['action'] == 'press key shift' or all_entries[id+1]['action'].startswith('type')): 398 | shift_cnt += 1 399 | screenshot_paths.append(screenshot_path) 400 | elif entry['action'] == 'press key ctrl' and all_entries[id+1]['action'] == 'press key shift': 401 | # this action and the next action should be deleted 402 | ctrl_cnt += 1 403 | shift_cnt += 1 404 | screenshot_paths.append(screenshot_path) 405 | screenshot_paths.append(os.path.join(os.path.dirname(file_path), all_entries[id+1]['screenshot'])) 406 | print(f"remove ctrl + shift in {file_path} action {id}") 407 | skip = True 408 | else: 409 | kept_entries.append(entry) 410 | 411 | with open(file_path, 'w', encoding='utf-8') as file: 412 | for entry in kept_entries: 413 | json.dump(entry, file, ensure_ascii=False) 414 | file.write('\n') 415 | 416 | if len(kept_entries) == len(all_entries): 417 | if DETAIL_OUTPUT: 418 | print(f"File {file_path} has no redundant actions") 419 | return 420 | if DETAIL_OUTPUT: 421 | if wait_cnt != 0: 422 | print(f"File {file_path} has {wait_cnt}/{total_cnt} redundant wait, removed") 423 | if ctrl_cnt != 0: 424 | print(f"File {file_path} has {ctrl_cnt}/{total_cnt} redundant ctrl, removed") 425 | if shift_cnt != 0: 426 | print(f"File {file_path} has {shift_cnt}/{total_cnt} redundant shift, removed") 427 | 428 | # delete the screenshot files 429 | for screenshot_path in screenshot_paths: 430 | os.remove(screenshot_path) 431 | 432 | 433 | def remove_meaningless_actions(file_path): 434 | if DETAIL_OUTPUT: 435 | print(f"Remove meaningless actions: {file_path}") 436 | all_entries = [] 437 | kept_entries = [] 438 | screenshot_paths = [] 439 | 440 | with open(file_path, 'r', encoding='utf-8') as file: 441 | for line in file: 442 | entry = json.loads(line) 443 | 
all_entries.append(entry) 444 | 445 | for id, entry in enumerate(all_entries): 446 | # check the similarity of two continuous screenshots 447 | if entry != all_entries[-1] and (entry['action'] == 'wait' or 'click' in entry['action']): 448 | screenshot_path1 = os.path.join(os.path.dirname(file_path), entry['screenshot']) 449 | screenshot_path2 = os.path.join(os.path.dirname(file_path), all_entries[id+1]['screenshot']) 450 | if are_screenshots_identical(screenshot_path1, screenshot_path2): 451 | screenshot_paths.append(screenshot_path1) 452 | print(f"action {id}: {entry['action']} in {file_path} is a meaningless action, it has been removed") 453 | else: 454 | kept_entries.append(entry) 455 | else: 456 | kept_entries.append(entry) 457 | 458 | if len(kept_entries) == len(all_entries): 459 | if DETAIL_OUTPUT: 460 | print(f"File {file_path} has no meaningless actions") 461 | return 462 | 463 | # rewrite the JSON file 464 | with open(file_path, 'w', encoding='utf-8') as file: 465 | for entry in kept_entries: 466 | json.dump(entry, file, ensure_ascii=False) 467 | file.write('\n') 468 | 469 | # delete the screenshot files 470 | for screenshot_path in screenshot_paths: 471 | os.remove(screenshot_path) 472 | 473 | 474 | def merge_press_drag(file_path): 475 | if DETAIL_OUTPUT: 476 | print(f"Merge press and drag: {file_path}") 477 | 478 | all_entries = [] 479 | kept_entries = [] 480 | screenshot_paths = [] 481 | 482 | with open(file_path, 'r', encoding='utf-8') as file: 483 | for line in file: 484 | entry = json.loads(line) 485 | all_entries.append(entry) 486 | 487 | id = 0 488 | while id < len(all_entries): 489 | # check the press action 490 | if id != len(all_entries) - 1 and all_entries[id]['action'].startswith("press ("): 491 | # the next action must be drag to 492 | assert all_entries[id+1]['action'].startswith("drag"), f"Error: In file {file_path}, action {id+1} should start with 'drag', but it's {all_entries[id+1]['action']}" 493 | x1, y1 = 
extract_coordinates(all_entries[id]['action']) 494 | x2, y2 = extract_coordinates(all_entries[id+1]['action']) 495 | if abs(x1-x2) + abs(y1-y2) <= 5: 496 | print(f"delta: {abs(x1-x2) + abs(y1-y2)} in {file_path} action {id} is too small, it's merged into a single click") 497 | all_entries[id]['action'] = f"click ({x2}, {y2})" 498 | else: 499 | print(f"action {id}: {all_entries[id]['action']} in {file_path} has been merged with action {id+1}: {all_entries[id+1]['action']}") 500 | all_entries[id]['action'] = f"drag from ({x1}, {y1}) to ({x2}, {y2})" 501 | screenshot_paths.append(os.path.join(os.path.dirname(file_path), all_entries[id+1]['screenshot'])) 502 | kept_entries.append(all_entries[id]) 503 | id += 1 # skip the next action 504 | else: 505 | kept_entries.append(all_entries[id]) 506 | 507 | id += 1 508 | 509 | if len(kept_entries) == len(all_entries): 510 | if DETAIL_OUTPUT: 511 | print(f"File {file_path} has no press and drag to be merged") 512 | return 513 | 514 | # rewrite the JSON file 515 | with open(file_path, 'w', encoding='utf-8') as file: 516 | for entry in kept_entries: 517 | json.dump(entry, file, ensure_ascii=False) 518 | file.write('\n') 519 | 520 | # delete the screenshot files 521 | for screenshot_path in screenshot_paths: 522 | os.remove(screenshot_path) 523 | 524 | 525 | def check_finish(file_path): 526 | if DETAIL_OUTPUT: 527 | print(f"Check finish: {file_path}") 528 | 529 | # read all lines 530 | try: 531 | with open(file_path, 'r', encoding='utf-8') as infile: 532 | lines = infile.readlines() 533 | last_line = lines[-1] 534 | last_entry = json.loads(last_line) 535 | except Exception as e: 536 | print(f"[ERROR] Failed to read the file content: {e}") 537 | return 538 | 539 | # replace the last action with finish 540 | if last_entry.get('action') == 'finish': 541 | if DETAIL_OUTPUT: 542 | print("The last entry is already 'finish'") 543 | return 544 | else: 545 | if DETAIL_OUTPUT: 546 | print("The last entry is ", last_entry.get('action')) 547 | 
print("Modify the last entry to 'finish'") 548 | last_entry['action'] = 'finish' 549 | 550 | # update the last line 551 | lines[-1] = json.dumps(last_entry, ensure_ascii=False) + '\n' 552 | 553 | # write back to file 554 | try: 555 | with open(file_path, 'w', encoding='utf-8') as outfile: 556 | outfile.writelines(lines) 557 | if DETAIL_OUTPUT: 558 | print(f"Saved the modified file: {file_path}") 559 | except Exception as e: 560 | print(f"[ERROR] Failed to write the file {file_path}: {e}") 561 | 562 | 563 | def process_task_jsonl_file(file_path): 564 | print(f"Process task jsonl file: {file_path}") 565 | rewrite_screenshot_path(file_path) 566 | if clean_fail_and_error(file_path): 567 | return -1 # the file is deleted 568 | check_finish(file_path) 569 | merge_press_drag(file_path) 570 | remove_redundant_actions(file_path) 571 | remove_meaningless_actions(file_path) 572 | resize(file_path) 573 | cnt = clean_tracker_interface(file_path) 574 | if cnt != -1: 575 | mark(file_path) 576 | rewrite_markdown_file_by_jsonl(file_path) 577 | return cnt 578 | 579 | 580 | def process_events_directories(): 581 | # get the directory of the script 582 | current_dir = os.path.dirname(os.path.abspath(__file__)) 583 | 584 | # build the path of the data folder 585 | data_dir = os.path.join(current_dir, 'data') 586 | 587 | total_action_cnt = 0 588 | total_record_cnt = 0 589 | max_action_cnt = 0 590 | 591 | # traverse all subdirectories of the data folder 592 | for item in os.listdir(data_dir): 593 | item_path = os.path.join(data_dir, item) 594 | 595 | if len(sys.argv) > 1: 596 | directory_prefix = sys.argv[1] 597 | else: 598 | # input the specified directory 599 | directory_prefix = "events" 600 | 601 | # check if it's a directory and starts with the specified name 602 | if os.path.isdir(item_path) and item.startswith(directory_prefix): 603 | print(f'Processing directory: {item_path}') 604 | for filename in os.listdir(item_path): 605 | # task jsonl file 606 | if filename.endswith('.jsonl') 
and 'task' in filename: 607 | file_path = os.path.join(item_path, filename) 608 | cnt = process_task_jsonl_file(file_path) 609 | if cnt != -1: 610 | total_action_cnt += cnt 611 | total_record_cnt += 1 612 | max_action_cnt = max(max_action_cnt, cnt) 613 | 614 | average_action_cnt = total_action_cnt / total_record_cnt if total_record_cnt else 0 # avoid division by zero when no records remain 615 | print(f"Total records: {total_record_cnt}") 616 | print(f"Average actions per record: {average_action_cnt:.2f}") 617 | print(f"Maximum actions: {max_action_cnt}") 618 | 619 | 620 | if __name__ == "__main__": 621 | process_events_directories() 622 | -------------------------------------------------------------------------------- /postprocess/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import json 4 | import base64 5 | import cv2 6 | import numpy as np 7 | from PIL import Image, ImageDraw 8 | 9 | POINT_RADIUS = 3 10 | CIRCLE_RADIUS = 18 11 | CIRCLE_WIDTH = 3 12 | 13 | 14 | def rewrite_markdown_file_by_jsonl(jsonl_path): 15 | """ 16 | rewrite markdown file by jsonl file 17 | """ 18 | with open(jsonl_path, 'r', encoding='utf-8') as file: 19 | lines = file.readlines() 20 | 21 | entries = [json.loads(line) for line in lines] 22 | markdown_path = jsonl_path.replace('.jsonl', '.md') 23 | rewrite_markdown_file(markdown_path, entries) 24 | 25 | 26 | def rewrite_markdown_file(markdown_path, entries): 27 | """ 28 | rewrite markdown file by entries, use marked_screenshot if exists 29 | """ 30 | prompt = '''Given the screenshot as below. 
What's the next step that you will do to help with the task?''' 31 | with open(markdown_path, 'r', encoding='utf-8') as file: 32 | lines = file.readlines() 33 | 34 | # keep the first 5 lines 35 | kept_lines = lines[:5] 36 | 37 | # add new lines after the kept lines 38 | for entry in entries: 39 | timestamp = entry['timestamp'] 40 | action = get_full_action(entry) 41 | screenshot_path = entry['marked_screenshot'] if 'marked_screenshot' in entry else entry['screenshot'] 42 | image_description = entry['image_description'] if 'image_description' in entry else None 43 | thought = entry['thought'] if 'thought' in entry else None 44 | action_description = entry['action_description'] if 'action_description' in entry else None 45 | action_description_checked = entry['action_description_checked'] if 'action_description_checked' in entry else None 46 | 47 | kept_lines.append(f'### {timestamp}\n') 48 | kept_lines.append(f'**Input:** \n\n{prompt}\n\n') 49 | kept_lines.append( 50 | f'<img src="{screenshot_path}">\n\n') 51 | # if image_description: 52 | # kept_lines.append(f'**Image Description:** \n\n{image_description}\n\n') 53 | if action_description: 54 | kept_lines.append( 55 | f'**Action Description:** \n\n{action_description}\n\n') 56 | if thought: 57 | kept_lines.append(f'**Thought:** \n\n{thought}\n\n') 58 | if action_description_checked: 59 | kept_lines.append( 60 | f'**Action Description Checked:** \n\n{action_description_checked}\n\n') 61 | kept_lines.append(f'**Output:** \n\n{action}\n\n') 62 | 63 | # rewrite the file 64 | with open(markdown_path, 'w', encoding='utf-8') as file: 65 | file.writelines(kept_lines) 66 | 67 | 68 | def remove_screenshot(screenshot_path): 69 | """ 70 | remove the screenshot file and the possible _marked file 71 | """ 72 | if os.path.exists(screenshot_path): 73 | os.remove(screenshot_path) 74 | 75 | # remove the possible _marked file 76 | marked_screenshot_path = screenshot_path.replace('.png', '_marked.png') 77 | if os.path.exists(marked_screenshot_path): 78 | 
os.remove(marked_screenshot_path) 79 | 80 | 81 | def get_full_action(entry): 82 | """ 83 | get the full action string from entry 84 | """ 85 | action = entry['action'] 86 | element = entry['element'] 87 | if element: 88 | target = 'click' 89 | index = action.find(target) 90 | if index != -1: 91 | # find the end position of 'click' 92 | insert_position = index + len(target) 93 | # insert the element description after 'click' 94 | action = action[:insert_position] + \ 95 | f' element {element} at' + action[insert_position:] 96 | return action 97 | 98 | def get_action_string(following_actions): 99 | result = "" 100 | for i, action in enumerate(following_actions): 101 | result += f"{i+1}:\n{action}\n" 102 | return result if result != "" else "None\n" 103 | 104 | def content_is_rational(content): 105 | if "sorry, I can't assist" in content: 106 | return False 107 | return True 108 | 109 | def encode_image(image_path): 110 | """ 111 | encode image to base64 112 | """ 113 | with open(image_path, "rb") as image_file: 114 | return base64.b64encode(image_file.read()).decode('utf-8') 115 | 116 | 117 | def get_file_size_kb(file_path): 118 | file_size_bytes = os.path.getsize(file_path) 119 | file_size_kb = file_size_bytes / 1024 # convert to KB 120 | return round(file_size_kb, 1) # keep 1 decimal place 121 | 122 | 123 | def mark_image(is_click_action, image_path, rect, point1, point2=None): 124 | """ 125 | mark the image and save as a new file, return the new file path 126 | """ 127 | # open the image 128 | with Image.open(image_path) as image: 129 | if is_click_action: 130 | # create a drawable object 131 | draw = ImageDraw.Draw(image) 132 | 133 | # draw a rectangle 134 | draw.rectangle( 135 | [(rect["left"], rect["top"]), (rect["right"], rect["bottom"])], 136 | outline="red", 137 | width=3 # line width 138 | ) 139 | 140 | # draw a point 141 | draw_point(point1["x"], point1["y"], draw) 142 | 143 | # draw a circle 144 | draw_circle(point1["x"], point1["y"], draw) 145 | 146 | # draw a short arrow 147 | 
draw_short_arrow(point1["x"], point1["y"], draw) 148 | 149 | else: 150 | draw = ImageDraw.Draw(image) 151 | 152 | # draw a point 153 | draw_point(point1["x"], point1["y"], draw) 154 | draw_point(point2["x"], point2["y"], draw) 155 | 156 | if (abs(point1["x"] - point2["x"]) + abs(point1["y"] - point2["y"])) > 15: 157 | # draw a circle 158 | draw_circle(point1["x"], point1["y"], draw) 159 | draw_circle(point2["x"], point2["y"], draw) 160 | else: 161 | print(f"the distance between point1 and point2 in image {image_path} is too small, skip drawing circles") 162 | 163 | # draw a long arrow 164 | draw_long_arrow(point1["x"], point1["y"], point2["x"], point2["y"], draw) 165 | 166 | # generate the output path, add "_marked" to the original file name 167 | base, ext = os.path.splitext(image_path) 168 | output_path = f"{base}_marked{ext}" 169 | 170 | # save the marked image 171 | image.save(output_path) 172 | # print(f"marked image saved to: {output_path}") 173 | return output_path 174 | 175 | 176 | def resize_to_1080p(image_path): 177 | """ 178 | check and resize the image to fixed 1920x1080 resolution, return whether success 179 | """ 180 | try: 181 | with Image.open(image_path) as img: 182 | img.verify() # verify the image integrity 183 | except: 184 | print(f"[ERROR] image corrupted: {image_path}") 185 | return False 186 | 187 | # open the image 188 | with Image.open(image_path) as img: 189 | # check if the image is already 1080p 190 | if img.size == (1920, 1080): 191 | print(f"image is already 1080p, no need to resize: {image_path}") 192 | return True 193 | 194 | # resize the image to fixed 1920x1080 resolution 195 | try: 196 | resized_img = img.resize((1920, 1080), Image.LANCZOS) 197 | except: 198 | print(f"[ERROR] cannot resize image: {image_path}") 199 | return False 200 | 201 | # save the resized image, overwrite the original file 202 | resized_img.save(image_path, optimize=True) 203 | print(f"image resized and saved: {image_path}") 204 | return True 205 | 206 | 207 
| def resize_action(action_str, scale_x, scale_y): 208 | """ 209 | extract coordinates from the action string, scale them, and replace the coordinate part in the original string. 210 | 211 | :param action_str: action string, e.g. "double click (1415, 741)" 212 | :param scale_x: X axis scale factor 213 | :param scale_y: Y axis scale factor 214 | :return: the scaled action string 215 | """ 216 | # use regex to match the coordinate part 217 | pattern = r'\((\d+),\s*(\d+)\)' 218 | match = re.search(pattern, action_str) 219 | 220 | if match: 221 | original_x = float(match.group(1)) 222 | original_y = float(match.group(2)) 223 | scaled_x = round(original_x * scale_x) 224 | scaled_y = round(original_y * scale_y) 225 | print( 226 | f"scale coordinates: ({original_x}, {original_y}) -> ({scaled_x}, {scaled_y})") 227 | 228 | # construct the new coordinate string 229 | new_coords = f"({scaled_x}, {scaled_y})" 230 | 231 | # replace the original coordinate string 232 | new_action_str = re.sub(pattern, new_coords, action_str) 233 | return new_action_str 234 | else: 235 | return action_str 236 | 237 | 238 | def are_screenshots_identical(screenshot_path1, screenshot_path2): 239 | """ 240 | check if two screenshots are identical 241 | """ 242 | # read the images 243 | img1 = cv2.imread(screenshot_path1) 244 | img2 = cv2.imread(screenshot_path2) 245 | 246 | # check if the images are successfully read 247 | if img1 is None or img2 is None: 248 | print(f"cannot read image: {screenshot_path1} or {screenshot_path2}") 249 | return False 250 | 251 | # check if the images have the same size 252 | if img1.shape != img2.shape: 253 | return False 254 | 255 | # check if the images are identical 256 | difference = cv2.subtract(img1, img2) 257 | return not np.any(difference) 258 | 259 | 260 | def parse_click_action(action): 261 | pattern = r'((?:double |right )?click)\s*\((\d+),\s*(\d+)\)' 262 | match = re.match(pattern, action) 263 | 264 | if match: 265 | action = match.group(1) # extract the 
action name 266 | x = int(match.group(2)) # extract x coordinate and convert to integer 267 | y = int(match.group(3)) # extract y coordinate and convert to integer 268 | return action, (x, y) 269 | else: 270 | return None, None 271 | 272 | 273 | def parse_drag_action(action): 274 | assert action.startswith('drag from'), f"error: action '{action}' is not a drag action" 275 | start1 = action.find('from (') + 6 276 | end1 = action.find(') to (') 277 | start2 = action.find('to (') + 4 278 | end2 = len(action) - 1 279 | 280 | # extract two sets of coordinates 281 | coord1 = action[start1:end1] 282 | coord2 = action[start2:end2] 283 | 284 | # split and convert to integers 285 | x1, y1 = map(int, coord1.split(', ')) 286 | x2, y2 = map(int, coord2.split(', ')) 287 | 288 | return (x1, y1), (x2, y2) 289 | 290 | 291 | def extract_coordinates(text): 292 | pattern = r'(?:drag to|press) \((\d+), (\d+)\)' 293 | match = re.search(pattern, text) 294 | if match: 295 | x, y = map(int, match.groups()) 296 | return x, y 297 | return None 298 | 299 | 300 | def draw_point(x, y, draw): 301 | radius = POINT_RADIUS 302 | left = x - radius 303 | top = y - radius 304 | right = x + radius 305 | bottom = y + radius 306 | 307 | draw.ellipse( 308 | [(left, top), (right, bottom)], 309 | fill="red" 310 | ) 311 | 312 | 313 | def draw_circle(x, y, draw): 314 | radius = CIRCLE_RADIUS 315 | left = x - radius 316 | top = y - radius 317 | right = x + radius 318 | bottom = y + radius 319 | 320 | draw.ellipse( 321 | [(left, top), (right, bottom)], 322 | outline="red", 323 | width=CIRCLE_WIDTH 324 | ) 325 | 326 | 327 | def draw_short_arrow(x, y, draw): 328 | arrow_length = 50 # arrow length 329 | arrow_gap = CIRCLE_RADIUS + 2 # arrow gap 330 | arrow_width = 18 # arrow width 331 | angle = np.radians(30) # arrow angle 332 | cos_angle = np.cos(angle) 333 | sin_angle = np.sin(angle) 334 | 335 | # draw the arrow body 336 | start_x = x - arrow_length * cos_angle 337 | start_y = y - arrow_length * sin_angle 338 | 
end_x = x - arrow_gap * cos_angle 339 | end_y = y - arrow_gap * sin_angle 340 | draw.line([(start_x, start_y), (end_x, end_y)], 341 | fill="red", width=3) 342 | 343 | # draw the arrow head 344 | arrow_point1 = ( 345 | int(end_x - arrow_width), 346 | int(end_y) 347 | ) 348 | arrow_point2 = ( 349 | int(end_x - arrow_width * sin_angle), 350 | int(end_y - arrow_width * cos_angle) 351 | ) 352 | 353 | draw.polygon([ 354 | (end_x, end_y), 355 | arrow_point1, 356 | arrow_point2 357 | ], fill="red") 358 | 359 | 360 | def draw_long_arrow(x1, y1, x2, y2, draw): 361 | head_length = 18 # arrow head length 362 | head_angle = np.radians(30) # arrow head angle 363 | 364 | # calculate the midpoint of the line 365 | mid_x = (x1 + x2) / 2 366 | mid_y = (y1 + y2) / 2 367 | 368 | # draw the arrow body 369 | draw.line([(x1, y1), (x2, y2)], fill="red", width=3) 370 | 371 | # arrow head direction vector 372 | vector_x = x2 - x1 373 | vector_y = y2 - y1 374 | length = np.hypot(vector_x, vector_y) 375 | unit_vector_x = vector_x / length 376 | unit_vector_y = vector_y / length 377 | 378 | # calculate the positions of the two points of the arrow head (now based on the midpoint) 379 | left_x = mid_x - head_length * \ 380 | (unit_vector_x * np.cos(head_angle) + 381 | unit_vector_y * np.sin(head_angle)) 382 | left_y = mid_y - head_length * \ 383 | (unit_vector_y * np.cos(head_angle) - 384 | unit_vector_x * np.sin(head_angle)) 385 | 386 | right_x = mid_x - head_length * \ 387 | (unit_vector_x * np.cos(head_angle) - 388 | unit_vector_y * np.sin(head_angle)) 389 | right_y = mid_y - head_length * \ 390 | (unit_vector_y * np.cos(head_angle) + 391 | unit_vector_x * np.sin(head_angle)) 392 | 393 | # use the midpoint as the vertex of the arrow head 394 | draw.polygon([(mid_x, mid_y), (left_x, left_y), 395 | (right_x, right_y)], fill="red") 396 | -------------------------------------------------------------------------------- /tracker/README.md: 
-------------------------------------------------------------------------------- 1 | # PC Tracker User Manual 2 | 3 | \[ English | [中文](README_zh.md) \] 4 | 5 | - Version: 1.0 6 | - Last updated: 2024-12-25 7 | 8 | ## 1. Introduction 9 | 10 | PC Tracker is a lightweight infrastructure for efficiently collecting large-scale human-computer interaction trajectories. The program runs seamlessly in the background, automatically capturing screenshots and keyboard & mouse activities. 11 | 12 | Below is an example of the collected human-computer interaction trajectories: 13 | 14 | ![raw_trajectory_example](../assets/raw_trajectory_example.png) 15 | 16 | ## 2. Installation 17 | 18 | - Ensure your operating system is Windows. 19 | - Extract our software package to a location with sufficient disk space (recommended to have more than 3GB of available space for storing recorded data). 20 | 21 | ## 3. Quick Start 22 | 23 | - [Optional] Set screen resolution to 16:9 (recommended 1920 x 1080). 24 | - Open the extracted folder and launch `main.exe`. 25 | 26 | ## 4. Instructions 27 | 28 | After starting the software, you can choose between **Task Oriented Mode** or **Non-Task Oriented Mode** for recording. 29 | 30 | ### Task Oriented Mode 31 | 32 | This mode is divided into two sub-modes: **Given Task** and **Free Task**. 33 | 34 | #### Given Task 35 | 36 | In this mode, you will be assigned an uncompleted task each time. 37 | 38 | - **Next Task**: Click `Next Task` to get the next task. 39 | - **Previous Task**: Click `Previous Task` to return to the previous task. 40 | - **Bad Task Feedback**: If you think the current task is difficult to complete, click `Bad Task` to discard it permanently. Alternatively, you can start the task and modify its description after completion based on your actual execution. 41 | - **Start Recording**: Click `Start`, and the tracker window will automatically minimize while recording begins. 
42 | - **End Task**: After completing the task, click `Finish` to save the record. Or if the task execution fails or you don’t want to record it, click `Fail`. 43 | - **Modify Task Description**: After finishing the task, you can modify the task description based on your actual execution. 44 | 45 | #### Free Task 46 | 47 | In this mode, you can freely use the computer and summarize the task description and difficulty yourself. 48 | 49 | - **Start Recording**: Click `Start`, and the tracker window will automatically minimize while recording begins. 50 | - **Save and Summarize This Record**: Fill in the task description, select difficulty (easy/medium/hard), and click `Save` to save the record. 51 | - **Discard This Record**: Click `Discard` to discard the record. 52 | 53 | ### Non-Task Oriented Mode 54 | 55 | In this mode, you can freely use the computer, with similar methods to start and stop recording as described above. 56 | 57 | ## 5. Usage Notes 58 | 59 | - **Does not currently support using extended screens**. 60 | - **Does not currently support using Chinese input methods**. 61 | - **Does not currently support using touchpads**. 62 | - **The tracker window is fixed in fullscreen.** To support the filtering of tracker-related actions (such as clicking the Start button) in post-processing, the tracker window is fixed in fullscreen. You can reopen the tracker window by clicking to view the task description, then minimize it again, but please do not drag it to display in a non-fullscreen state. 63 | 64 | ## 6. Data Privacy 65 | 66 | - After starting recording, your screenshots and keyboard & mouse operations will be automatically recorded. PC Tracker does not record any information from unopened software. If you believe the recording may infringe on your privacy, you can choose to discard the record. 67 | - Collected data will be saved in the `./events` folder (hidden by default). Each trajectory comes with a Markdown file for easy visualization. 68 | 69 | ## 7. 
FAQ 70 | 71 | **1. Does the software have networking capabilities?** 72 | 73 | PC Tracker is completely local, does not support networking, and will not upload your data. 74 | 75 | **2. What if my computer screen resolution is not 16:9?** 76 | 77 | If your screen resolution is not 16:9, it will affect the subsequent unified processing of data. We recommend adjusting your screen resolution to 16:9. 78 | 79 | **3. How much space will the collected data approximately occupy?** 80 | 81 | The specific data size varies. Generally, even with intensive recording operations for 1 hour, it will not generate more than 1GB of data. 82 | 83 | **4. What should I do if the interface doesn't display properly after launching the tracker?** 84 | 85 | If some interface elements (such as buttons) appear incomplete after launching the software, this may be caused by your computer's display scaling settings. You can try adjusting the display scaling in Settings -> System -> Display, and then restart the software. 86 | 87 | ## 8. Contact 88 | 89 | If you have any questions, please contact us at henryhe_sjtu@sjtu.edu.cn or zizi0123@sjtu.edu.cn. 90 | -------------------------------------------------------------------------------- /tracker/README_zh.md: -------------------------------------------------------------------------------- 1 | # PC Tracker 用户手册 2 | 3 | \[ [English](./README.md) | 中文 \] 4 | 5 | - Version: 1.0 6 | - Last updated: 2024-12-25 7 | 8 | ## 1. 简介 9 | 10 | PC Tracker 是一个轻量级工具,用于高效收集大规模真实人机交互轨迹。类似于屏幕录制,PC Tracker在后台无缝运行,自动捕获屏幕截图和键鼠操作。收集到的人机交互轨迹示例如下: 11 | 12 | ![raw_trajectory_example](../assets/raw_trajectory_example.png) 13 | 14 | ## 2. 安装 15 | 16 | - 确保您的电脑是 Windows 操作系统。 17 | - 解压缩我们的软件包到一个磁盘空间足够的位置(用于存储记录数据,推荐 3GB 以上可用空间)。 18 | 19 | ## 3. 启动 20 | 21 | - [可选] 把屏幕分辨率比例设置为 16:9 (推荐 1920 x 1080)。 22 | - 打开解压后的文件夹,启动 main.exe。 23 | 24 | ## 4. 
使用说明 25 | 26 | 启动软件后,您可选择 **Task Oriented Mode** 或 **Non-Task Oriented Mode** 记录模式。 27 | 28 | ### Task Oriented Mode 29 | 30 | 本模式分为 **Given Task** 与 **Free Task** 两种子模式。 31 | 32 | #### Given Task 33 | 34 | 在本模式下,您每次将被分配一个未被完成的任务。 35 | 36 | - **下一个任务**:点击 `Next Task` 可获取下一个任务。 37 | - **上一个任务**:点击 `Previous Task` 可返回上一个任务。 38 | - **反馈不佳任务**:如果您认为当前任务难以完成,可点击 `Bad Task` 将其永久丢弃。或者,您也可以先开始执行任务,在完成后根据实际执行情况修改任务描述。 39 | - **开启记录**:点击 `Start`,软件窗口将会自动最小化,此后记录将会开始。 40 | - **结束任务**:完成任务后,点击 `Finish` 保存本次操作过程。如果任务执行失败或您不希望记录本次操作,请点击 `Fail`。 41 | - **修改任务描述**:完成任务后,您可以根据实际执行情况修改任务描述。 42 | 43 | #### Free Task 44 | 45 | 在本模式下,您可以自由地使用电脑,并自行总结任务描述与难度。 46 | 47 | - **开启记录**:点击 `Start`,软件页面将会自动最小化,此后记录将会开始。 48 | - **保存并总结记录**:填写任务描述,选择任务难度(easy/medium/hard),然后点击 `Save` 保存记录。 49 | - **丢弃记录**:点击 `Discard` 可丢弃本次记录。 50 | 51 | ### Non-Task Oriented Mode 52 | 53 | 在本模式下,您可以自由地使用电脑,任意选择开启与结束记录的时刻。 54 | 55 | ## 5. 注意事项 56 | 57 | - **暂不支持使用扩展屏**。 58 | - **暂不支持使用中文输入法**。 59 | - **暂不支持使用触控板**。 60 | - **软件窗口固定为全屏显示**。为了在后处理中过滤与PC Tracker相关的操作(如点击Start按钮),软件窗口被固定为全屏显示。您可以通过点击重新打开PC Tracker窗口查看任务描述,然后再次最小化,但请不要将其拖动至非全屏状态显示。 61 | 62 | ## 6. 数据隐私 63 | 64 | - 开启记录后,您的屏幕截图与键盘鼠标操作将会被软件自动记录。PC Tracker不会记录任何未被打开的软件的信息。如果您认为本次记录可能会侵犯您的隐私,可选择丢弃本次记录。 65 | - 收集的数据都会被保存在 `./events` 文件夹(默认隐藏)中。我们为每份记录下来的轨迹都提供了Markdown可视化文件。 66 | 67 | ## 7. 常见问题 68 | 69 | **1. 软件是否有联网功能?** 70 | 71 | PC Tracker 完全是本地的,不支持联网,不会上传您的数据。 72 | 73 | **2. 我的电脑屏幕分辨率不是 16:9,会有什么影响?** 74 | 75 | 如果您的屏幕分辨率不是 16:9,会对数据的后续统一处理产生影响。我们建议您将屏幕分辨率调整为 16:9。 76 | 77 | **3. 收集的数据大概会占用多少空间?** 78 | 79 | 具体的数据大小视情况而异。一般来说,即使高强度记录操作1小时,也不会产生超过 1GB 的数据量。 80 | 81 | **4. 为何软件的界面显示看上去有些问题?** 82 | 83 | tracker 界面的异常显示(如部分按钮显示不全)可能是由于您电脑的缩放率导致的。您可以在设置->系统->屏幕中调整缩放率,然后再重新打开 tracker。 84 | 85 | ## 8. 
联系方式 86 | 87 | 如有任何疑问,请通过 henryhe_sjtu@sjtu.edu.cn 或 zizi0123@sjtu.edu.cn 与我们联系。 88 | -------------------------------------------------------------------------------- /tracker/capturer.py: -------------------------------------------------------------------------------- 1 | import threading 2 | import time 3 | 4 | import pyautogui 5 | import win32con 6 | import win32gui 7 | import win32ui 8 | 9 | screen_size = pyautogui.size() 10 | 11 | 12 | class ScreenCapturer: 13 | def __init__(self): 14 | self.hwindow = win32gui.GetDesktopWindow() 15 | 16 | def capture(self): 17 | # dc: device context 18 | window_dc = win32gui.GetWindowDC(self.hwindow) 19 | img_dc = win32ui.CreateDCFromHandle(window_dc) 20 | mem_dc = img_dc.CreateCompatibleDC() 21 | # Create a bitmap object 22 | screenshot = win32ui.CreateBitmap() 23 | # Create a bitmap compatible with the device context and set its width and height 24 | screenshot.CreateCompatibleBitmap(img_dc, screen_size[0], screen_size[1]) 25 | # Select the bitmap into the memory device context 26 | mem_dc.SelectObject(screenshot) 27 | # Perform a bit block transfer 28 | mem_dc.BitBlt((0, 0), screen_size, img_dc, (0, 0), win32con.SRCCOPY) 29 | # screenshot: bitmap byte stream 30 | bits = screenshot.GetBitmapBits(True) 31 | # Release resources 32 | mem_dc.DeleteDC() 33 | win32gui.ReleaseDC(self.hwindow, window_dc) 34 | win32gui.DeleteObject(screenshot.GetHandle()) 35 | return bits 36 | 37 | 38 | capturer = ScreenCapturer() 39 | 40 | 41 | class RecentScreen: 42 | def __init__(self, capture_interval=0.1): 43 | self.screenshot = capturer.capture() 44 | self.capture_interval = capture_interval 45 | self.lock = threading.Lock() 46 | self.refresh_thread = threading.Thread(target=self.refreshing) 47 | self.refresh_thread.daemon = True 48 | self.refresh_thread.start() 49 | 50 | def refreshing(self): 51 | while True: 52 | screenshot = capturer.capture() 53 | with self.lock: 54 | self.screenshot = screenshot 55 | time.sleep(self.capture_interval) 56 | 
57 | def get(self): 58 | with self.lock: 59 | return self.screenshot 60 | -------------------------------------------------------------------------------- /tracker/fs.py: -------------------------------------------------------------------------------- 1 | import os 2 | import shutil 3 | import ctypes 4 | import subprocess 5 | import time 6 | 7 | 8 | def ensure_folder(folder_path): 9 | # Ensure the folder exists, create if it doesn't 10 | if not os.path.exists(folder_path): 11 | os.makedirs(folder_path) 12 | 13 | 14 | def hide_folder(folder_path): 15 | # Set folder attribute to hidden 16 | FILE_ATTRIBUTE_HIDDEN = 0x02 17 | ctypes.windll.kernel32.SetFileAttributesW(folder_path, FILE_ATTRIBUTE_HIDDEN) 18 | 19 | 20 | def set_hidden_file(file_path, hide=True): 21 | # Set or remove hidden attribute 22 | if hide: 23 | os.system(f'attrib +h "{file_path}"') 24 | else: 25 | os.system(f'attrib -h "{file_path}"') 26 | 27 | 28 | class FolderReset: 29 | def __init__(self, folder_path="./files", backup_path="./__files__"): 30 | self.folder_path = folder_path # Visible folder 31 | self.backup_path = backup_path # Backup hidden folder 32 | if self.backup_path and not os.path.exists(backup_path): 33 | self.backup_path = None # No backup path 34 | self.reset() 35 | 36 | def reset(self): 37 | if self.backup_path: 38 | # Ensure the folder exists 39 | ensure_folder(self.folder_path) 40 | 41 | # Attempt to clear the original folder 42 | MAX_RETRIES = 2 43 | for attempt in range(MAX_RETRIES): 44 | try: 45 | # On Windows, use system command to delete 46 | subprocess.call(f'rd /s /q "{self.folder_path}"', shell=True) 47 | break # Exit loop if successfully deleted 48 | except PermissionError as e: 49 | print(f"Except error in folder reset: {e}. 
Retrying ({attempt + 1}/{MAX_RETRIES})...") 50 | time.sleep(3) 51 | else: 52 | print("WARNING: Failed to clear folder after multiple attempts.") 53 | return # Exit method, do not proceed 54 | 55 | # Restore backup content, make folder writable 56 | shutil.copytree(self.backup_path, self.folder_path) 57 | 58 | # else: No backup path, no reset needed 59 | 60 | 61 | def delete_file(file_path): 62 | if os.path.exists(file_path): 63 | try: 64 | os.remove(file_path) 65 | except OSError as e: 66 | print(f"Failed to delete {file_path}: {e}") 67 | 68 | 69 | def delete_folder(folder_path): 70 | try: 71 | if os.name == 'nt': 72 | # On Windows, use system command to delete 73 | subprocess.call(f'rd /s /q "{folder_path}"', shell=True) 74 | else: 75 | # On other systems, use shutil to delete 76 | shutil.rmtree(folder_path) 77 | except OSError as e: 78 | print(f"Failed to delete {folder_path}: {e}") 79 | -------------------------------------------------------------------------------- /tracker/main.py: -------------------------------------------------------------------------------- 1 | import multiprocessing 2 | from tracker import Tracker 3 | from task import Task 4 | import tkinter as tk 5 | import tkinter.font as tkFont 6 | import tkinter.ttk as ttk 7 | from tkinter import messagebox # Import messagebox 8 | 9 | 10 | class TrackerApp: 11 | def __init__(self, root): 12 | self.root = root 13 | self.root.title("Tracker") 14 | self.tracker = Tracker() 15 | 16 | # Set window size to screen size and maximize window 17 | self.root.geometry( 18 | f"{self.root.winfo_screenwidth()}x{self.root.winfo_screenheight()}") 19 | self.root.state("zoomed") # Maximize window 20 | 21 | self.root.configure(bg="#f0f0f0") # Set background color 22 | self.root.resizable(True, True) 23 | self.root.resizable(False, False) # Disable resizing the window 24 | 25 | # Font 26 | self.title_font = tkFont.Font( 27 | family="Helvetica", size=26, weight="bold") 28 | self.label_font = tkFont.Font(family="Helvetica", 
size=18) 29 | self.button_font = tkFont.Font(family="Arial", size=12) 30 | self.text_font = tkFont.Font(family="Helvetica", size=12) 31 | 32 | # Label 33 | self.title_label = tk.Label( 34 | root, text="", font=self.label_font, 35 | wraplength=400, bg="#f0f0f0", fg="#555555") 36 | self.title_label.pack(pady=(20, 40)) 37 | 38 | # Intercept close button click event 39 | self.root.protocol("WM_DELETE_WINDOW", self.quit_app) 40 | 41 | # Create initial interface 42 | self.initial_interface() 43 | 44 | def quit_app(self): 45 | self.tracker.stop() 46 | self.tracker.update_tasks() 47 | self.root.destroy() 48 | 49 | """ 50 | Interface Definitions 51 | """ 52 | 53 | def initial_interface(self): 54 | self.clear_interface() 55 | self.title_label.config(text="Welcome to PC Tracker!") 56 | 57 | self.task_button = tk.Button( 58 | self.root, text="Task Oriented Mode", command=self.task_oriented_interface, 59 | width=25, height=2, font=self.button_font) 60 | self.task_button.pack(pady=10) 61 | ToolTip(self.task_button, "Tracking with a specific task") 62 | 63 | self.non_task_button = tk.Button( 64 | self.root, text="Non-Task Oriented Mode", command=self.non_task_oriented_interface, 65 | width=25, height=2, font=self.button_font) 66 | self.non_task_button.pack(pady=10) 67 | ToolTip(self.non_task_button, 68 | "Tracking while using computer freely") 69 | 70 | def task_oriented_interface(self): 71 | self.clear_interface() 72 | self.title_label.config(text="Task Oriented Mode") 73 | 74 | self.given_task_button = tk.Button( 75 | self.root, text="Given Task", command=self.next_given_task_interface, 76 | width=15, height=2, font=self.button_font) 77 | self.given_task_button.pack(pady=10) 78 | ToolTip(self.given_task_button, "Complete given task") 79 | 80 | self.free_task_button = tk.Button( 81 | self.root, text="Free Task", command=self.free_task_interface, 82 | width=15, height=2, font=self.button_font) 83 | self.free_task_button.pack(pady=10) 84 | ToolTip(self.free_task_button, 85 | 
"Freely use pc and summarize the tasks completed on your own.") 86 | 87 | self.back_button = tk.Button( 88 | self.root, text="Back", command=self.initial_interface, 89 | width=15, height=2, font=self.button_font) 90 | self.back_button.pack(pady=10) 91 | 92 | def non_task_oriented_interface(self): 93 | self.clear_interface() 94 | self.title_label.config(text="Non-Task Oriented Mode") 95 | 96 | self.start_button = tk.Button( 97 | self.root, text="Start", command=self.start_non_task_tracking, 98 | width=15, height=2, font=self.button_font) 99 | self.start_button.pack(pady=10) 100 | 101 | self.back_button = tk.Button( 102 | self.root, text="Back", command=self.initial_interface, 103 | width=15, height=2, font=self.button_font) 104 | self.back_button.pack(pady=10) 105 | 106 | def free_task_interface(self): 107 | self.clear_interface() 108 | self.title_label.config(text="Free Task Mode") 109 | 110 | self.subtitle_label = tk.Label( 111 | self.root, text="Freely use pc and summarize the tasks completed on your own.", font=("Arial", 15), 112 | wraplength=750) 113 | self.subtitle_label.pack(pady=(0, 30)) 114 | 115 | self.start_button = tk.Button( 116 | self.root, text="Start", command=self.start_free_task_tracking, 117 | width=15, height=2, font=self.button_font) 118 | self.start_button.pack(pady=10) 119 | 120 | self.back_button = tk.Button( 121 | self.root, text="Back", command=self.task_oriented_interface, 122 | width=15, height=2, font=self.button_font) 123 | self.back_button.pack(pady=10) 124 | 125 | self.corner_label = tk.Label(self.root, text=f"You have finished {self.tracker.finished_free_cnt} free tasks.", 126 | font=("Arial", 14), bg="#f0f0f0") 127 | self.corner_label.pack(side="bottom", anchor="se", padx=30, pady=30) 128 | 129 | def next_given_task_interface(self): 130 | self.given_task_interface(offset=1) 131 | 132 | def current_given_task_interface(self): 133 | self.given_task_interface(offset=0) 134 | 135 | def previous_given_task_interface(self): 136 | 
self.given_task_interface(offset=-1) 137 | 138 | def given_task_interface(self, offset): 139 | if self.tracker.finish_all(): 140 | messagebox.showinfo( 141 | "Task Completed", "All tasks have been finished!") 142 | self.initial_interface() 143 | else: 144 | self.clear_interface() 145 | self.title_label.config(text="Given Task Mode") 146 | 147 | self.tracker.get_given_task(offset) 148 | 149 | self.subtitle_label = tk.Label( 150 | self.root, text=f"Category: {self.tracker.task.category}", font=("Arial", 15), wraplength=750) 151 | self.subtitle_label.pack(pady=(0, 30)) 152 | 153 | self.corner_label = tk.Label(self.root, 154 | text=f"You have finished {self.tracker.finished_given_cnt} given tasks.", 155 | font=("Arial", 14), bg="#f0f0f0") 156 | self.corner_label.pack(side="bottom", anchor="se", padx=30, pady=30) 157 | 158 | # Create a Canvas widget for the rounded rectangle 159 | canvas = tk.Canvas(self.root, width=1500, 160 | height=510, bg="#f0f0f0", highlightthickness=0) 161 | canvas.pack(pady=5, padx=30, anchor="center") 162 | 163 | # Draw a rounded rectangle 164 | create_roundrectangle( 165 | canvas, 20, 0, 1480, 500, radius=30, fill="#ffffff", outline="#cccccc") 166 | 167 | # Add task description text to the canvas 168 | canvas.create_text(30, 10, text=self.tracker.task.description, font=self.text_font, 169 | width=1450, anchor="nw") 170 | 171 | # Create a frame to hold the buttons 172 | button_frame = tk.Frame(self.root, bg="#f0f0f0") 173 | button_frame.pack(pady=20) 174 | 175 | # left column 176 | self.previous_button = tk.Button(button_frame, text="Previous Task", 177 | command=self.previous_given_task_interface, 178 | width=15, height=1, font=self.button_font) 179 | self.previous_button.grid(row=0, column=0, padx=20, pady=10) 180 | 181 | self.next_button = tk.Button(button_frame, text="Next Task", command=self.next_given_task_interface, 182 | width=15, height=1, font=self.button_font) 183 | self.next_button.grid(row=1, column=0, padx=20, pady=10) 184 | 185 | 
# right column 186 | self.start_button = tk.Button(button_frame, text="Start", command=self.start_given_task_tracking, 187 | width=15, height=1, font=self.button_font) 188 | self.start_button.grid(row=0, column=1, padx=20, pady=10) 189 | ToolTip(self.start_button, "Start tracking with this task") 190 | 191 | self.bad_task_button = tk.Button(button_frame, text="Bad Task", command=self.mark_bad_task, 192 | width=15, height=1, font=self.button_font) 193 | self.bad_task_button.grid(row=1, column=1, padx=20, pady=10) 194 | 195 | # back button centered below the other buttons with the same size 196 | self.back_button = tk.Button(button_frame, text="Back", command=self.task_oriented_interface, 197 | width=15, height=1, font=self.button_font) 198 | self.back_button.grid( 199 | row=2, column=0, columnspan=2, padx=20, pady=20) 200 | 201 | def modify_description_interface(self): 202 | self.clear_interface() 203 | self.title_label.config(text="Modify Task Description") 204 | 205 | # Add multi-line input text box and set initial content to task description 206 | self.entry = tk.Text(self.root, font=self.text_font, 207 | width=120, height=20) # Adjust width and height 208 | # Set initial content, "1.0" represents the first character position of the first line 209 | self.entry.insert("1.0", self.tracker.task.description) 210 | self.entry.pack(pady=(10, 10)) # Leave 25 pixels at top, 10 pixels at bottom 211 | 212 | self.save_button = tk.Button( 213 | self.root, text="Save", command=self.save_modified_description, 214 | width=15, height=2, font=self.button_font) 215 | self.save_button.pack(pady=10) 216 | 217 | self.cancel_button = tk.Button( 218 | self.root, text="Cancel", command=self.cancel_modify_description, 219 | width=15, height=2, font=self.button_font) 220 | self.cancel_button.pack(pady=10) 221 | 222 | def clear_interface(self): 223 | for widget in self.root.winfo_children(): 224 | if widget != self.title_label: 225 | widget.destroy() 226 | 227 | """ 228 | Given Task Mode 
Functions 229 | """ 230 | 231 | def start_given_task_tracking(self): 232 | self.clear_interface() 233 | self.tracker.start() 234 | 235 | self.title_label.config(text="Tracking...") 236 | self.title_label.pack(pady=(30, 10)) 237 | 238 | canvas_width = 1500 # Adjusted for padding 239 | canvas_height = 680 240 | text_front = ("Helvetica", 15) 241 | 242 | # Create a Canvas widget for the rounded rectangle 243 | canvas = tk.Canvas(self.root, width=canvas_width, 244 | height=canvas_height, bg="#f0f0f0", highlightthickness=0) 245 | canvas.pack(pady=5, padx=30, anchor="center") 246 | 247 | # Draw a rounded rectangle 248 | create_roundrectangle( 249 | canvas, 20, 0, 1480, 650, radius=30, fill="#ffffff", outline="#cccccc") 250 | 251 | # Add task description text to the canvas 252 | canvas.create_text(30, 10, text=self.tracker.task.description, font=text_front, 253 | width=1450, anchor="nw") 254 | 255 | self.finish_button = tk.Button( 256 | self.root, text="Finish", command=self.finish_given_task, 257 | width=15, height=2, font=self.button_font) 258 | self.finish_button.pack(pady=10) 259 | ToolTip(self.finish_button, "Task finished") 260 | 261 | self.fail_button = tk.Button( 262 | self.root, text="Fail", command=self.fail_given_task, 263 | width=15, height=2, font=self.button_font) 264 | self.fail_button.pack(pady=10) 265 | ToolTip(self.fail_button, "Task failed") 266 | 267 | print("Task oriented tracking started...") 268 | 269 | self.root.iconify() # Minimize window 270 | 271 | def finish_given_task(self): 272 | self.tracker.stop_without_task() 273 | if messagebox.askyesno("Modify description", "Do you want to modify the description?"): 274 | self.modify_description_interface() 275 | else: 276 | self.after_finish_given_task() 277 | 278 | def after_finish_given_task(self): 279 | self.tracker.finish() 280 | if self.tracker.finish_all(): 281 | messagebox.showinfo( 282 | "Task Completed", "All tasks have been finished!") 283 | self.initial_interface() 284 | else: 285 | 
self.next_given_task_interface() # back to initial interface 286 | 287 | def fail_given_task(self): 288 | self.tracker.fail() 289 | print("Task failed.") 290 | 291 | if messagebox.askyesno("Confirm Task Failure", "Do you want to discard the record?"): 292 | self.tracker.discard() 293 | 294 | self.current_given_task_interface() # back to initial interface 295 | 296 | def mark_bad_task(self): 297 | # Show a confirmation dialog 298 | confirm = messagebox.askyesno( 299 | "Confirm Bad Task", 300 | "Mark the current task as bad?\nThe task you marked as a bad task will be permanently discarded." 301 | ) 302 | 303 | if confirm: 304 | # Mark the task as bad if the user confirms 305 | self.tracker.task.is_bad = True 306 | self.tracker.bad_task_cnt += 1 307 | self.next_given_task_interface() 308 | 309 | def save_modified_description(self): 310 | entry_text = self.entry.get("1.0", "end-1c") 311 | if not entry_text: 312 | messagebox.showwarning( 313 | "Input Error", "Please enter your task description") 314 | return 315 | 316 | self.tracker.task.description = entry_text 317 | self.after_finish_given_task() 318 | 319 | def cancel_modify_description(self): 320 | if messagebox.askyesno("Confirm Cancel Meaning", "Do you want to discard the record?"): 321 | self.tracker.discard() 322 | self.current_given_task_interface() # back to initial interface 323 | else: 324 | self.after_finish_given_task() 325 | 326 | """ 327 | Free Task Mode Functions 328 | """ 329 | 330 | def start_free_task_tracking(self): 331 | self.clear_interface() 332 | self.tracker.get_free_task() 333 | self.tracker.start() 334 | 335 | self.title_label.config(text="Tracking...") 336 | self.title_label.pack(pady=(30, 10)) 337 | 338 | self.stop_button = tk.Button( 339 | self.root, text="Stop", command=self.stop_free_task_tracking, 340 | width=15, height=2, font=self.button_font) 341 | self.stop_button.pack(pady=10) 342 | 343 | self.root.iconify() # Minimize window 344 | 345 | def stop_free_task_tracking(self): 346 | 
self.tracker.stop_without_task() 347 | self.clear_interface() 348 | self.title_label.config(text="") 349 | 350 | # Create info label 351 | self.description_label = tk.Label( 352 | self.root, text="Please enter task description:", font=("Helvetica", 15), bg="#f0f0f0") 353 | self.description_label.pack(pady=(5, 5)) 354 | 355 | # Add input text box 356 | self.entry = tk.Text(self.root, font=self.text_font, 357 | width=120, height=20) # Adjust width and height 358 | self.entry.pack(pady=(10, 10)) # Leave 25 pixels at top, 10 pixels at bottom 359 | 360 | # Level dropdown box 361 | self.level_var = tk.StringVar(value="easy") # Default select "easy" 362 | self.level_label = tk.Label( 363 | self.root, text="Please select task level:", font=("Helvetica", 15), bg="#f0f0f0") 364 | self.level_label.pack(pady=(10, 5)) 365 | 366 | # Create and configure style 367 | style = ttk.Style() 368 | style.configure('TMenubutton', font=('Helvetica', 14)) 369 | 370 | self.level_menu = ttk.OptionMenu( 371 | self.root, self.level_var, "easy", "medium", "hard", style='TMenubutton') 372 | self.level_menu.config(width=15) # Increase width 373 | self.level_menu.pack(pady=(5, 20)) 374 | 375 | # Set larger font for dropdown menu options 376 | menu = self.level_menu["menu"] 377 | for index in range(menu.index("end") + 1): 378 | menu.entryconfig(index, font=tkFont.Font( 379 | family="Helvetica", size=14)) 380 | 381 | # Save button 382 | self.save_button = tk.Button(self.root, text="Save", command=self.save_free_task, 383 | width=15, height=1, font=self.button_font) 384 | self.save_button.pack(pady=(10, 20)) 385 | 386 | # Discard button 387 | self.discard_button = tk.Button( 388 | self.root, text="Discard", command=self.discard_free_task, 389 | width=15, height=1, font=self.button_font) 390 | self.discard_button.pack(pady=(10, 20)) 391 | 392 | def save_free_task(self): # Save user-defined task 393 | entry_text = self.entry.get("1.0", "end-1c") 394 | selected_level = self.level_var.get() 395 | 396 | 
if not entry_text: 397 | messagebox.showwarning( 398 | "Input Error", "Please enter your task description") 399 | return 400 | 401 | task = Task(entry_text, 0, selected_level) 402 | self.tracker.save_free_task(task) 403 | self.free_task_interface() 404 | 405 | def discard_free_task(self): # Discard user-defined task record 406 | self.tracker.discard() 407 | self.free_task_interface() 408 | 409 | def discard_non_task(self): # Discard non-task oriented mode record 410 | self.tracker.discard() 411 | self.non_task_oriented_interface() 412 | 413 | """ 414 | Non Task-Oriented Mode Functions 415 | """ 416 | 417 | def start_non_task_tracking(self): 418 | self.clear_interface() 419 | self.tracker.start() 420 | 421 | self.title_label.config(text="Tracking...") 422 | self.title_label.pack(pady=30) 423 | 424 | self.stop_button = tk.Button( 425 | self.root, text="Stop", command=self.stop_non_task_tracking, 426 | width=15, height=2, font=self.button_font) 427 | self.stop_button.pack(pady=30) 428 | 429 | print("Non-task oriented tracking started...") 430 | 431 | self.root.iconify() # Minimize window 432 | 433 | def stop_non_task_tracking(self): 434 | self.tracker.stop() 435 | self.clear_interface() 436 | 437 | self.save_button = tk.Button( 438 | self.root, text="Save", command=self.non_task_oriented_interface, 439 | width=15, height=2, font=self.button_font) 440 | self.save_button.pack(pady=30) 441 | 442 | self.discard_button = tk.Button( 443 | self.root, text="Discard", command=self.discard_non_task, 444 | width=15, height=2, font=self.button_font) 445 | self.discard_button.pack(pady=30) 446 | 447 | print("Non-task oriented tracking stopped.") 448 | 449 | 450 | """ 451 | Tools 452 | """ 453 | 454 | 455 | class ToolTip: 456 | def __init__(self, widget, text): 457 | self.widget = widget 458 | self.text = text 459 | self.tooltip = None 460 | self.widget.bind("", self.show_tooltip) 461 | self.widget.bind("", self.hide_tooltip) 462 | 463 | def show_tooltip(self, event): 464 | x, y, 
_, _ = self.widget.bbox("insert") 465 | x += self.widget.winfo_rootx() + 55 466 | y += self.widget.winfo_rooty() + 55 467 | self.tooltip = tk.Toplevel(self.widget) 468 | self.tooltip.wm_overrideredirect(True) 469 | self.tooltip.wm_geometry(f"+{x}+{y}") 470 | label = tk.Label(self.tooltip, text=self.text, 471 | background="#f7f7f7", relief="solid", borderwidth=1) 472 | label.pack() 473 | 474 | def hide_tooltip(self, event): 475 | if self.tooltip: 476 | self.tooltip.destroy() 477 | self.tooltip = None 478 | 479 | 480 | def create_roundrectangle(canvas, x1, y1, x2, y2, radius=25, **kwargs): 481 | points = [x1 + radius, y1, 482 | x1 + radius, y1, 483 | x2 - radius, y1, 484 | x2 - radius, y1, 485 | x2, y1, 486 | x2, y1 + radius, 487 | x2, y1 + radius, 488 | x2, y2 - radius, 489 | x2, y2 - radius, 490 | x2, y2, 491 | x2 - radius, y2, 492 | x2 - radius, y2, 493 | x1 + radius, y2, 494 | x1 + radius, y2, 495 | x1, y2, 496 | x1, y2 - radius, 497 | x1, y2 - radius, 498 | x1, y1 + radius, 499 | x1, y1 + radius, 500 | x1, y1] 501 | 502 | return canvas.create_polygon(points, **kwargs, smooth=True) 503 | 504 | 505 | def main(): 506 | root = tk.Tk() 507 | TrackerApp(root) 508 | root.mainloop() 509 | 510 | 511 | if __name__ == "__main__": 512 | multiprocessing.freeze_support() # important for pyinstaller 513 | main() 514 | -------------------------------------------------------------------------------- /tracker/main.spec: -------------------------------------------------------------------------------- 1 | # -*- mode: python ; coding: utf-8 -*- 2 | 3 | 4 | a = Analysis( 5 | ['main.py'], 6 | pathex=[], 7 | binaries=[], 8 | datas=[], 9 | hiddenimports=['comtypes', 'comtypes.gen', 'comtypes.client', 'comtypes.stream'], 10 | hookspath=[], 11 | hooksconfig={}, 12 | runtime_hooks=[], 13 | excludes=[], 14 | noarchive=False, 15 | optimize=0, 16 | ) 17 | pyz = PYZ(a.pure) 18 | 19 | exe = EXE( 20 | pyz, 21 | a.scripts, 22 | a.binaries, 23 | a.datas, 24 | [], 25 | name='tracker', 26 | 
debug=False, 27 | bootloader_ignore_signals=False, 28 | strip=False, 29 | upx=True, 30 | upx_exclude=[], 31 | runtime_tmpdir=None, 32 | console=False, 33 | disable_windowed_traceback=False, 34 | argv_emulation=False, 35 | target_arch=None, 36 | codesign_identity=None, 37 | entitlements_file=None, 38 | ) 39 | -------------------------------------------------------------------------------- /tracker/monitor.py: -------------------------------------------------------------------------------- 1 | import threading 2 | import time 3 | from enum import Enum 4 | from pynput import keyboard, mouse 5 | from pynput.keyboard import Key 6 | from recorder import Recorder 7 | from utils import * 8 | 9 | WAIT_INTERVAL = 6 # 6s per wait 10 | DOUBLE_CLICK_INTERVAL = 0.5 # 0.5s for double click 11 | 12 | HOT_KEY = [ 13 | ["alt", "tab"], # Switch between running program windows 14 | ["alt", "f4"], # Close current window or program 15 | ["cmd", 'd'], # Show desktop 16 | ["cmd", 'e'], # Open file explorer 17 | ["cmd", 'l'], # Lock computer 18 | ["cmd", 'r'], # Open run dialog 19 | ["cmd", 't'], # Cycle through taskbar programs 20 | ["cmd", 'x'], # Open advanced user menu (Start button right-click menu) 21 | ["cmd", "space"], # Switch input method 22 | ["cmd", 'i'], # Open Windows settings 23 | ["cmd", 'a'], # Open action center 24 | ["cmd", 's'], # Open search 25 | ["cmd", 'u'], # Open accessibility settings 26 | ["cmd", 'p'], # Open projection settings 27 | ["cmd", 'v'], # Open clipboard history 28 | ["cmd", "tab"], # Open task view 29 | ["shift", "delete"] # Permanently delete selected items (bypass recycle bin) 30 | ] 31 | 32 | 33 | def switch_caption(char): 34 | if char.isalpha() and get_capslock_state() == 1: # Caps lock is on 35 | if char.islower(): 36 | return char.upper() 37 | else: 38 | return char.lower() 39 | else: 40 | return char 41 | 42 | 43 | class ActionType(Enum): 44 | CLICK = "click" 45 | RIGHT_CLICK = "right click" 46 | DOUBLE_CLICK = "double click" 47 | MOUSE_DOWN = 
"press" 48 | DRAG = "drag to" 49 | SCROLL = "scroll" 50 | KEY_DOWN = "press key" 51 | HOTKEY = "hotkey" 52 | TYPE = "type text" 53 | WAIT = "wait" 54 | FINISH = "finish" 55 | FAIL = "fail" 56 | 57 | 58 | class Action: 59 | def __init__(self, action_type: ActionType, **kwargs): 60 | self.action_type = action_type 61 | self.kwargs = kwargs 62 | 63 | def __str__(self): 64 | s = f"{self.action_type.value}" # named "s" to avoid shadowing the built-in str 65 | if self.action_type == ActionType.CLICK or self.action_type == ActionType.RIGHT_CLICK or self.action_type == ActionType.MOUSE_DOWN or self.action_type == ActionType.DOUBLE_CLICK: 66 | # s += f" element: {self.kwargs['name']} at ({self.kwargs['x']}, {self.kwargs['y']})" 67 | s += f" ({self.kwargs['x']}, {self.kwargs['y']})" 68 | if self.action_type == ActionType.DRAG: 69 | s += f" ({self.kwargs['x']}, {self.kwargs['y']})" 70 | if self.action_type == ActionType.SCROLL: 71 | s += f" ({self.kwargs['dx']}, {self.kwargs['dy']})" 72 | if self.action_type == ActionType.KEY_DOWN: 73 | s += f" {self.kwargs['key']}" 74 | if self.action_type == ActionType.HOTKEY: 75 | s += f" ({self.kwargs['key1']}, {self.kwargs['key2']})" 76 | if self.action_type == ActionType.TYPE: 77 | s += f": {self.kwargs['text']}" 78 | return s 79 | 80 | def get_element(self): 81 | ele = self.kwargs.get('name') 82 | return ele if ele != "" else "Unknown" 83 | 84 | 85 | class Monitor: 86 | def __init__(self, task): 87 | self.recorder = Recorder(task) 88 | self.type_buffer = TypeBuffer(self.recorder) # Merges consecutive keystrokes into a single TYPE action 89 | self.timer = Timer(self.recorder, self.type_buffer) 90 | self.scroll_buffer = ScrollBuffer(self.recorder) 91 | self.keyboard_monitor = KeyboardMonitor( 92 | self.recorder, self.type_buffer, self.timer, self.scroll_buffer) 93 | self.mouse_monitor = MouseMonitor( 94 | self.recorder, self.type_buffer, self.timer, self.scroll_buffer) 95 | 96 | def start(self): 97 | self.keyboard_monitor.start() 98 | self.mouse_monitor.start() 
99 | self.type_buffer.reset() 100 | self.timer.reset() 101 | 102 | def stop_without_md(self): 103 | self.keyboard_monitor.stop() 104 | self.mouse_monitor.stop() 105 | self.timer.stop() 106 | self.recorder.wait() 107 | 108 | def generate_md(self, task=None): 109 | self.recorder.generate_md(task) 110 | 111 | def stop(self): 112 | self.stop_without_md() 113 | self.generate_md() 114 | 115 | def finish(self): 116 | self.recorder.record_action(Action(ActionType.FINISH)) 117 | self.stop() 118 | 119 | def finish_without_md(self): 120 | self.recorder.record_action(Action(ActionType.FINISH)) 121 | self.stop_without_md() 122 | 123 | def fail(self): 124 | self.recorder.record_action(Action(ActionType.FAIL)) 125 | self.stop() 126 | 127 | def discard_record(self): 128 | self.recorder.discard() 129 | 130 | 131 | class Timer: 132 | def __init__(self, recorder: Recorder, type_buffer): 133 | self.timer_inner = None 134 | self.recorder = recorder 135 | self.type_buffer = type_buffer 136 | self.reset() 137 | 138 | def reset(self): 139 | if self.timer_inner: 140 | self.timer_inner.cancel() 141 | self.timer_inner = threading.Timer( 142 | WAIT_INTERVAL, self.save_wait) # Start timing, execute save_wait after interval seconds 143 | self.timer_inner.start() 144 | 145 | def stop(self): 146 | if self.timer_inner: 147 | self.timer_inner.cancel() 148 | 149 | def save_wait(self): 150 | if not self.type_buffer.last_action_is_typing: 151 | self.recorder.record_action(Action(ActionType.WAIT)) 152 | self.reset() 153 | 154 | 155 | class HotKeyBuffer: 156 | def __init__(self): 157 | self.buffer = [] 158 | 159 | def add(self, key): 160 | self.buffer.append(key) 161 | 162 | def pop(self): 163 | if len(self.buffer) > 0: 164 | self.buffer.pop() 165 | 166 | def reset(self): 167 | self.buffer.clear() 168 | 169 | 170 | class TypeBuffer: 171 | def __init__(self, recorder: Recorder): 172 | self.recorder = recorder 173 | self.type_action_cnt = 0 174 | self.text = "" 175 | self.is_typing = False 176 | 
self.last_action_is_typing = False # Whether the last action could be typing 177 | self.last_action_is_shift = False 178 | self.pre_saved_type_event = None # for TYPE action 179 | self.events_buffer = [] # Buffer keyboard events before confirming typing 180 | 181 | def pre_save_type_event(self): 182 | self.pre_saved_type_event = self.recorder.get_event() 183 | 184 | def reset(self): 185 | # save buffer 186 | if self.is_typing and not self.is_empty(): 187 | # At this time, there should be a pre_saved_type_event 188 | assert self.pre_saved_type_event is not None 189 | type_action = Action(ActionType.TYPE, text=self.text) 190 | self.pre_saved_type_event['action'] = type_action 191 | self.recorder.record_event(self.pre_saved_type_event) 192 | elif not self.is_typing: 193 | # self.recorder.save_all() 194 | # Record all previous operations that were cached 195 | for event in self.events_buffer: 196 | self.recorder.record_event(event) 197 | 198 | # reset type buffer 199 | self.text = "" 200 | self.is_typing = False 201 | self.last_action_is_typing = False 202 | self.last_action_is_shift = False 203 | self.pre_saved_type_event = None 204 | self.events_buffer.clear() 205 | 206 | def append(self, char): 207 | self.text += char 208 | if not self.is_typing: 209 | press_action = Action(ActionType.KEY_DOWN, key=char) 210 | # self.recorder.record(press_action) 211 | press_event = self.recorder.get_event(press_action) 212 | self.events_buffer.append(press_event) 213 | 214 | def add_type_related_action(self): 215 | # The typing operation is about to be added 216 | if len(self.text) >= 2 and not self.is_typing: 217 | self.is_typing = True # Enter typing state 218 | self.events_buffer.clear() # The previous recorded keyboard operations will be merged into TYPE, no need to record separately 219 | 220 | def backspace(self): 221 | if len(self.text) > 0: 222 | self.text = self.text[:-1] 223 | if not self.is_typing: 224 | backspace_action = Action(ActionType.KEY_DOWN, key="backspace") 225 
| # self.recorder.record(backspace_action) 226 | backspace_event = self.recorder.get_event(backspace_action) 227 | self.events_buffer.append(backspace_event) 228 | else: 229 | self.reset() 230 | backspace_action = Action(ActionType.KEY_DOWN, key="backspace") 231 | self.recorder.record_action(backspace_action) 232 | 233 | def set_last_action_is_typing(self): 234 | self.last_action_is_typing = True 235 | 236 | def reset_last_action_is_typing(self): 237 | self.last_action_is_typing = False 238 | 239 | def set_last_action_is_shift(self): 240 | self.last_action_is_shift = True 241 | 242 | def reset_last_action_is_shift(self): 243 | self.last_action_is_shift = False 244 | 245 | def set_typing(self): 246 | self.is_typing = True 247 | 248 | def is_empty(self) -> bool: 249 | return len(self.text) == 0 250 | 251 | 252 | class ScrollBuffer: 253 | def __init__(self, recorder: Recorder): 254 | self.recorder = recorder 255 | self.dx = 0 256 | self.dy = 0 257 | self.pre_saved_scroll_event = None 258 | # self.empty = self.pre_saved_scroll_event is None 259 | 260 | def is_empty(self): 261 | return self.pre_saved_scroll_event is None 262 | 263 | def reset(self): 264 | if not self.is_empty() and (self.dx != 0 or self.dy != 0): 265 | scroll_action = Action(ActionType.SCROLL, dx=self.dx, dy=self.dy) 266 | self.pre_saved_scroll_event['action'] = scroll_action 267 | self.recorder.record_event(self.pre_saved_scroll_event) 268 | self.dx = 0 269 | self.dy = 0 270 | self.pre_saved_scroll_event = None 271 | 272 | def new(self, dx, dy): 273 | self.dx = dx 274 | self.dy = dy 275 | self.pre_saved_scroll_event = self.recorder.get_event() 276 | 277 | def add_delta(self, dx, dy): 278 | self.dx += dx 279 | self.dy += dy 280 | 281 | 282 | class KeyboardMonitor: 283 | def __init__(self, recorder: Recorder, type_buffer: TypeBuffer, timer: Timer, scroll_buffer: ScrollBuffer): 284 | self.recorder = recorder 285 | self.listener = keyboard.Listener( 286 | on_press=self.on_press, on_release=self.on_release) 
287 | self.type_buffer = type_buffer 288 | self.timer = timer 289 | self.scroll_buffer = scroll_buffer 290 | self.hotkey_buffer = HotKeyBuffer() 291 | 292 | def start(self): 293 | self.listener.start() 294 | 295 | def stop(self): 296 | self.listener.stop() 297 | 298 | def on_press(self, key: Key): 299 | try: 300 | # Keyboard operation triggers timer and scroll buffer reset 301 | self.timer.reset() 302 | self.scroll_buffer.reset() 303 | 304 | # Record whether this operation is related to typing 305 | if is_related_to_type(key): 306 | self.type_buffer.set_last_action_is_typing() 307 | self.type_buffer.add_type_related_action() 308 | else: 309 | self.type_buffer.reset_last_action_is_typing() 310 | 311 | # Record whether the last key pressed was the shift key 312 | if key == Key.shift: 313 | self.type_buffer.set_last_action_is_shift() 314 | else: 315 | self.type_buffer.reset_last_action_is_shift() 316 | 317 | record_hotkey = False 318 | # Determine hotkey operation 319 | self.hotkey_buffer.add(get_key_str(key)) 320 | if self.hotkey_buffer.buffer in HOT_KEY: 321 | record_hotkey = True # Should be recorded as a hotkey operation 322 | 323 | # Handle record operation 324 | if not is_related_to_type(key): # Keys that cannot appear in typing scenarios 325 | self.type_buffer.reset() # Before entering typing state, all previous operations must have been recorded, just save the text in the buffer 326 | if self.type_buffer.last_action_is_shift: 327 | shift_action = Action(ActionType.KEY_DOWN, key="shift") 328 | self.recorder.record_action(shift_action) 329 | hotkey_2 = get_ctrl_hotkey(key) 330 | if hotkey_2 is not None: 331 | last_action = self.recorder.get_last_action() 332 | if last_action is not None and last_action.action_type == ActionType.KEY_DOWN and ( 333 | last_action.kwargs['key'] == 'ctrl_l' or last_action.kwargs['key'] == 'ctrl_r'): 334 | ctrl_hotkey_action = Action( 335 | ActionType.HOTKEY, key1='Ctrl', key2=hotkey_2) 336 | 
self.recorder.change_last_action(ctrl_hotkey_action) 337 | else: 338 | ctrl_hotkey_action = Action( 339 | ActionType.HOTKEY, key1='Ctrl', key2=hotkey_2) 340 | self.recorder.record_action(ctrl_hotkey_action) 341 | elif not record_hotkey: 342 | key_name = get_key_str(key) 343 | key_press_action = Action( 344 | ActionType.KEY_DOWN, key=key_name) 345 | self.recorder.record_action(key_press_action) 346 | elif not record_hotkey: # Keys that may appear in typing scenarios 347 | if self.type_buffer.is_empty(): # Only characters can be the first element of the buffer 348 | if hasattr(key, 'char'): 349 | switched_char = switch_caption(key.char) 350 | self.type_buffer.append(switched_char) 351 | self.type_buffer.pre_save_type_event() # Save observation when entering typing state 352 | else: 353 | # At this time, the buffer is empty, directly record special keys 354 | key_name = get_key_str(key) 355 | key_press_action = Action( 356 | ActionType.KEY_DOWN, key=key_name) 357 | self.recorder.record_action(key_press_action) 358 | else: 359 | # Just throw into the buffer 360 | if key == Key.backspace: 361 | self.type_buffer.backspace() 362 | elif key == Key.space: 363 | self.type_buffer.append(' ') 364 | elif hasattr(key, 'char'): 365 | switched_char = switch_caption(key.char) 366 | self.type_buffer.append(switched_char) 367 | 368 | if record_hotkey: 369 | last_action = self.recorder.get_last_action() 370 | if last_action.action_type == ActionType.KEY_DOWN and last_action.kwargs['key'] == \ 371 | self.hotkey_buffer.buffer[0]: 372 | ctrl_hotkey_action = Action( 373 | ActionType.HOTKEY, key1=self.hotkey_buffer.buffer[0], key2=self.hotkey_buffer.buffer[1]) 374 | self.recorder.change_last_action(ctrl_hotkey_action) 375 | else: 376 | ctrl_hotkey_action = Action( 377 | ActionType.HOTKEY, key1=self.hotkey_buffer.buffer[0], key2=self.hotkey_buffer.buffer[1]) 378 | self.recorder.record_action(ctrl_hotkey_action) 379 | except AttributeError: 380 | print_debug("error!") 381 | 382 | def 
on_release(self, key: Key): 383 | self.hotkey_buffer.pop() 384 | 385 | 386 | class LastClick: 387 | def __init__(self): 388 | self.x = 0 389 | self.y = 0 390 | self.time = 0 391 | self.button = mouse.Button.left 392 | self.element_name = "" 393 | 394 | def update(self, x, y, button, element_name): 395 | self.x = x 396 | self.y = y 397 | self.time = time.time() 398 | self.button = button 399 | self.element_name = element_name 400 | 401 | 402 | class MouseMonitor: 403 | def __init__(self, recorder: Recorder, type_buffer: TypeBuffer, timer: Timer, scroll_buffer: ScrollBuffer): 404 | self.recorder = recorder 405 | self.listener = mouse.Listener( 406 | on_click=self.on_click, on_scroll=self.on_scroll, on_move=self.on_move) 407 | self.type_buffer = type_buffer 408 | self.timer = timer 409 | self.scroll_buffer = scroll_buffer 410 | self.last_click = LastClick() 411 | self.pre_saved_drag_event = None 412 | 413 | def start(self): 414 | self.listener.start() 415 | 416 | def stop(self): 417 | self.listener.stop() 418 | 419 | def on_click(self, x, y, button, pressed): 420 | self.timer.reset() 421 | self.type_buffer.reset_last_action_is_typing() 422 | self.type_buffer.reset_last_action_is_shift() 423 | self.scroll_buffer.reset() 424 | if pressed: 425 | # Mouse click triggers information update 426 | element = get_element_info_at_position(x, y) # Get UI element info at mouse click position 427 | self.type_buffer.reset() # reset type buffer 428 | # Save observation when mouse is pressed, for possible drag operation 429 | self.pre_saved_drag_event = self.recorder.get_event() 430 | 431 | delta_time = time.time() - self.last_click.time 432 | if delta_time < DOUBLE_CLICK_INTERVAL and x == self.last_click.x and y == self.last_click.y: 433 | # Double click 434 | last_action = self.recorder.get_last_action() 435 | if last_action is not None and last_action.action_type == ActionType.CLICK: 436 | double_click_action = Action( 437 | ActionType.DOUBLE_CLICK, x=x, y=y, 
name=last_action.kwargs['name']) 438 | self.recorder.change_last_action(double_click_action) 439 | else: 440 | # Click 441 | if button == mouse.Button.left: 442 | click_action = Action( 443 | ActionType.CLICK, x=x, y=y, name=element['name']) 444 | self.recorder.record_action( 445 | click_action, element['coordinates']) 446 | elif button == mouse.Button.right: 447 | click_action = Action( 448 | ActionType.RIGHT_CLICK, x=x, y=y, name=element['name']) 449 | self.recorder.record_action( 450 | click_action, element['coordinates']) 451 | else: 452 | print_debug(f"Unknown button {button}") 453 | 454 | self.last_click.update(x, y, button, element['name']) 455 | 456 | else: # released 457 | if x != self.last_click.x or y != self.last_click.y: # Mouse dragged 458 | last_action = self.recorder.get_last_action() 459 | if last_action.action_type == ActionType.CLICK: # Previous operation was a click operation 460 | press_action = Action(ActionType.MOUSE_DOWN, x=self.last_click.x, 461 | y=self.last_click.y, name=self.last_click.element_name) 462 | self.recorder.change_last_action( 463 | press_action) # Modify the previous click operation to press operation 464 | # Record drag operation 465 | drag_action = Action(ActionType.DRAG, x=x, y=y) 466 | self.pre_saved_drag_event['action'] = drag_action 467 | self.recorder.record_event(self.pre_saved_drag_event) 468 | else: # Normal click 469 | pass 470 | 471 | def on_move(self, x, y): 472 | # print(f"Mouse moved to {(x, y)}") 473 | pass 474 | 475 | def on_scroll(self, x, y, dx, dy): 476 | self.timer.stop() # Close timer during scrolling to avoid recording wait operations during scrolling 477 | self.type_buffer.reset_last_action_is_typing() 478 | self.type_buffer.reset_last_action_is_shift() 479 | self.type_buffer.reset() 480 | if self.scroll_buffer.is_empty(): 481 | self.scroll_buffer.new(dx, dy) 482 | else: 483 | self.scroll_buffer.add_delta(dx, dy) 484 | 485 | 486 | def is_related_to_type(key): 487 | if isinstance(key, Key): 488 | 
return key in [Key.shift, Key.space, Key.caps_lock, Key.backspace] 489 | elif isinstance(key, keyboard.KeyCode): 490 | return key.char is not None and ord(key.char) > 31 491 | return False 492 | 493 | 494 | def get_ctrl_hotkey(key): 495 | if isinstance(key, keyboard.KeyCode) and key.char is not None and ord(key.char) <= 31: 496 | return chr(ord('@') + ord(key.char)) 497 | return None 498 | 499 | 500 | def get_key_str(key): 501 | if isinstance(key, Key): 502 | key_str = str(key) 503 | if "ctrl" in key_str: 504 | return "ctrl" 505 | if "shift" in key_str: 506 | return "shift" 507 | if "alt" in key_str: 508 | return "alt" 509 | if "cmd" in key_str: 510 | return "cmd" 511 | return key_str[4:] 512 | elif isinstance(key, keyboard.KeyCode): 513 | return key.char 514 | -------------------------------------------------------------------------------- /tracker/package.ps1: -------------------------------------------------------------------------------- 1 | # 0. clean ./dist/ folder (if exists) 2 | $distPath = "./dist" 3 | if (Test-Path $distPath) 4 | { 5 | Remove-Item -Path $distPath -Recurse -Force 6 | Write-Output "dist folder cleared." 7 | } 8 | else 9 | { 10 | Write-Output "dist folder does not exist, skipping clearing." 11 | } 12 | 13 | # 1. run pyinstaller 14 | pyinstaller main.spec 15 | 16 | # 2. check if ./dist/tracker.exe is generated 17 | if (Test-Path "./dist/tracker.exe") 18 | { 19 | Write-Output "tracker.exe successfully created." 20 | } 21 | else 22 | { 23 | Write-Output "Error: tracker.exe not created." 24 | exit 1 25 | } 26 | 27 | # 3. copy ./tasks.json and ./README.md to ./dist/ 28 | Copy-Item -Path "./tasks.json" -Destination "./dist/" -Force 29 | Copy-Item -Path "./README.md" -Destination "./dist/" -Force 30 | 31 | # 4. copy ./task_cnt.json to ./dist/ 32 | Copy-Item -Path "./task_cnt.json" -Destination "./dist/" -Force 33 | 34 | Write-Output "./dist/ successfully prepared, ready for zip." 
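The `get_ctrl_hotkey` helper in `monitor.py` above relies on a terminal-era convention: while Ctrl is held, pynput delivers letter keys as ASCII control characters (codes 1-26), so adding the code to `ord('@')` (64) recovers the letter. A minimal, pynput-free sketch of that decoding (`decode_ctrl_char` is a hypothetical name used only for illustration):

```python
def decode_ctrl_char(char):
    # Ctrl+A .. Ctrl+Z arrive as control characters '\x01' .. '\x1a';
    # offsetting from '@' (ASCII 64) maps them back to 'A' .. 'Z'.
    if char is not None and ord(char) <= 31:
        return chr(ord('@') + ord(char))
    return None  # not a control character, so not a Ctrl hotkey

print(decode_ctrl_char('\x03'))  # Ctrl+C -> 'C'
print(decode_ctrl_char('\x16'))  # Ctrl+V -> 'V'
print(decode_ctrl_char('a'))     # plain letter -> None
```

This is why `on_press` can recognize `Ctrl+C` versus a typed `c` without tracking modifier state itself.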
35 | -------------------------------------------------------------------------------- /tracker/recorder.py: -------------------------------------------------------------------------------- 1 | import json 2 | import multiprocessing 3 | 4 | from PIL import Image, ImageDraw 5 | 6 | from capturer import * 7 | from fs import * 8 | from utils import * 9 | 10 | MARK_IMAGE = False 11 | 12 | 13 | class Recorder: 14 | def __init__(self, task=None, buffer_len=1, directory="events"): 15 | self.pool = multiprocessing.Pool() 16 | self.task = task 17 | self.buffer_len = buffer_len 18 | self.directory = directory 19 | self.screenshot_dir = os.path.join(directory, "screenshot") 20 | self.buffer = [] # event buffer 21 | self.saved_cnt = 0 22 | self.timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S") 23 | # Ensure directories exist 24 | ensure_folder(self.directory) 25 | ensure_folder(self.screenshot_dir) 26 | # Hide directory 27 | hide_folder(self.directory) 28 | # Generate filename prefix 29 | if self.task is not None: 30 | index = self.task.id 31 | prefix = f"task{index}" if index != 0 else "free_task" 32 | else: 33 | prefix = "events" 34 | # Generate filename 35 | self.event_filename = os.path.join( 36 | self.directory, f"{prefix}_{self.timestamp_str}.jsonl") 37 | self.md_filename = os.path.join( 38 | self.directory, f"{prefix}_{self.timestamp_str}.md") 39 | 40 | self.recent_screen = RecentScreen() 41 | 42 | self.screenshot_f_list = [] 43 | 44 | def get_event(self, action=None): 45 | timestamp = get_current_time() 46 | screenshot = self.recent_screen.get() 47 | event = { 48 | 'timestamp': timestamp, 49 | 'action': action, 50 | 'screenshot': screenshot, 51 | } 52 | return event 53 | 54 | def record_event(self, event, rect=None): 55 | self.buffer.append((event, rect)) 56 | if len(self.buffer) > self.buffer_len: 57 | ev, rec = self.buffer.pop(0) 58 | self.save(ev, rec) 59 | 60 | def record_action(self, action, rect=None): 61 | event = self.get_event(action) 62 | 
self.record_event(event, rect) 63 | 64 | def get_last_action(self): 65 | if self.buffer and len(self.buffer) > 0: 66 | event, _ = self.buffer[-1] 67 | return event['action'] 68 | else: 69 | return None 70 | 71 | def change_last_action(self, action): 72 | if self.buffer: 73 | event, rect = self.buffer.pop() 74 | event['action'] = action 75 | self.buffer.append((event, rect)) 76 | else: 77 | print("WARNING: No record to change in the buffer!") 78 | 79 | def save(self, event, rect): 80 | self.saved_cnt += 1 81 | timestamp = event['timestamp'].replace(':', '').replace('-', '') 82 | action = event['action'] 83 | screenshot_filename = os.path.join( 84 | self.screenshot_dir, f"{timestamp}_{self.saved_cnt}.png") 85 | 86 | point = {"x": action.kwargs.get('x'), "y": action.kwargs.get('y')} 87 | if None in point.values(): 88 | point = None 89 | 90 | # Async save screenshot 91 | self.pool.apply_async( 92 | save_screenshot, (screenshot_filename, event['screenshot'], rect, point)) 93 | 94 | event['screenshot'] = screenshot_filename 95 | event['action'] = str(action) 96 | event['element'] = action.get_element() 97 | event['rect'] = rect 98 | with open(self.event_filename, 'a', encoding='utf-8') as f: 99 | json.dump(event, f, ensure_ascii=False) 100 | f.write('\n') 101 | 102 | self.screenshot_f_list.append(screenshot_filename) 103 | 104 | def wait(self): 105 | # Save all buffered events 106 | for event, rect in self.buffer: 107 | self.save(event, rect) 108 | # Close process pool 109 | self.pool.close() 110 | self.pool.join() 111 | 112 | def generate_md(self, task=None): 113 | if task is not None: 114 | self.task = task # Reset task 115 | 116 | prompt = '''Given the screenshot as below. 
What's the next step that you will do to help with the task?''' 117 | 118 | with open(self.event_filename, 'r', encoding='utf-8') as file: 119 | lines = file.readlines() 120 | 121 | markdown_content = [] 122 | if self.task is not None: 123 | index = self.task.id 124 | description = self.task.description 125 | level = self.task.level 126 | 127 | if index == 0: 128 | markdown_content.append(f'# Free Task\n') 129 | else: 130 | markdown_content.append(f'# Task {index}\n') 131 | 132 | markdown_content.append(f'**Description:** {description}\n\n') 133 | markdown_content.append(f'**Level:** {level}\n\n') 134 | else: 135 | markdown_content.append(f'# Non task-oriented events\n') 136 | 137 | for line in lines: 138 | event = json.loads(line.strip()) 139 | timestamp = event.get('timestamp', '') 140 | action = event.get('action', '') 141 | screenshot_path = event.get('screenshot', '') 142 | screenshot_path = '\\'.join(screenshot_path.split( 143 | '\\')[1:]) # remove the first directory 144 | 145 | markdown_content.append(f'### {timestamp}\n') 146 | markdown_content.append(f'**Input:** \n\n{prompt}\n\n') 147 | markdown_content.append( 148 | f'<img src="{screenshot_path}">\n\n') 149 | markdown_content.append(f'**Output:** \n\n{action}\n\n') 150 | 151 | # Write content to Markdown file 152 | with open(self.md_filename, 'w', encoding='utf-8') as md_file: 153 | md_file.writelines(markdown_content) 154 | 155 | def discard(self): 156 | # Delete all record files 157 | delete_file(self.event_filename) 158 | # markdown may not be recorded, but not a problem 159 | delete_file(self.md_filename) 160 | for f in self.screenshot_f_list: 161 | delete_file(f) 162 | 163 | 164 | def save_screenshot(save_filename, screenshot, rect=None, point=None): 165 | # Create image from buffer 166 | image = Image.frombuffer( 167 | 'RGB', 168 | screen_size, 169 | screenshot, 'raw', 'BGRX', 0, 1 170 | ) 171 | 172 | if MARK_IMAGE: 173 | mark_image(image, rect, point) 174 | 175 | # Save image 176 | image.save(save_filename) 177 | 178 | 179 | 
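`save_screenshot` in `recorder.py` above rebuilds a PIL image from the raw bytes produced by the screen capturer, which arrive in BGRX order (blue, green, red, one padding byte per pixel, as Windows capture APIs commonly return). A self-contained sketch of that conversion using a synthetic 2x2 "screen" (the buffer contents here are made up for illustration):

```python
from PIL import Image

size = (2, 2)  # hypothetical screen size
# Four pure-blue pixels in BGRX byte order: B=255, G=0, R=0, X=padding
raw = bytes([255, 0, 0, 0] * 4)

# 'BGRX' is Pillow's raw mode for 4-byte little-endian BGR with a pad byte;
# decoding into an 'RGB' image reorders the channels automatically.
image = Image.frombuffer('RGB', size, raw, 'raw', 'BGRX', 0, 1)
print(image.getpixel((0, 0)))  # (0, 0, 255) -- pure blue, now in RGB order
```

The same one-liner is what lets the recorder hand opaque capture buffers to a worker process and only pay the PNG encoding cost off the hot path.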
def mark_image(image, rect, point): 180 | if rect is not None: 181 | # Create a drawable object 182 | draw = ImageDraw.Draw(image) 183 | # Draw rectangle 184 | draw.rectangle( 185 | [(rect["left"], rect["top"]), (rect["right"], rect["bottom"])], 186 | outline="red", 187 | width=3 # line width 188 | ) 189 | 190 | if point is not None: 191 | draw = ImageDraw.Draw(image) 192 | 193 | # Calculate circle's top-left and bottom-right coordinates 194 | radius = 6 195 | left = point["x"] - radius 196 | top = point["y"] - radius 197 | right = point["x"] + radius 198 | bottom = point["y"] + radius 199 | 200 | # Draw circle 201 | draw.ellipse( 202 | [(left, top), (right, bottom)], 203 | fill="red" 204 | ) 205 | -------------------------------------------------------------------------------- /tracker/task.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import time 4 | 5 | 6 | def find_tasks_json(): 7 | start_dir = os.path.abspath(".") 8 | for root, dirs, files in os.walk(start_dir): 9 | if 'tasks.json' in files: 10 | return os.path.join(root, 'tasks.json') 11 | return None 12 | 13 | 14 | def find_task_cnt_json(): 15 | start_dir = os.path.abspath(".") 16 | for root, dirs, files in os.walk(start_dir): 17 | if 'task_cnt.json' in files: 18 | return os.path.join(root, 'task_cnt.json') 19 | return None 20 | 21 | 22 | tasks_path = find_tasks_json() 23 | task_cnt_path = find_task_cnt_json() 24 | task_cnt = 0 25 | 26 | 27 | class Task: 28 | def __init__(self, description, id, level, file_input=None, category="other", finished=False, is_bad=False): 29 | self.description = description 30 | self.level = level 31 | self.id = id 32 | self.category = category 33 | self.file_input = file_input 34 | self.finished = finished 35 | self.is_bad = is_bad 36 | 37 | 38 | def from_json(task, task_cnt) -> Task: 39 | return Task(task['task'], task_cnt, task['level'], task['file_input'], task['category'], task['finished']) 40 | 41 | 42 
def free_task():
    return Task("free task", 0, "easy")


def load_task_cnt():
    with open(task_cnt_path, 'r') as file:
        data = json.load(file)
    return data['given_task'], data['free_task']


def load_given_tasks():
    tasks = []
    global task_cnt
    if tasks_path is None:
        return [free_task()]  # for robustness
    with open(tasks_path, 'r') as file:
        data = json.load(file)
    for task in data:
        task_cnt += 1
        tasks.append(from_json(task, task_cnt))
    return tasks


def update_given_tasks(given_tasks):
    if tasks_path is None:
        return
    try:
        # set_hidden_file(tasks_path, False)
        with open(tasks_path, 'w') as file:
            json.dump(
                [{'task': task.description,
                  'level': task.level,
                  'file_input': task.file_input,
                  'category': task.category,
                  'finished': task.finished}
                 for task in given_tasks if not task.is_bad],
                file,
                indent=2  # set indentation to 2 spaces
            )
        # set_hidden_file(tasks_path, True)
    except Exception as e:
        print(e)
        # sleep for 10 seconds
        time.sleep(10)


def update_task_cnt(finished_given_cnt, finished_free_cnt):
    print(f"update task cnt: {finished_given_cnt}, {finished_free_cnt}")
    # set_hidden_file(task_cnt_path, False)
    with open(task_cnt_path, 'w') as file:
        json.dump({'given_task': finished_given_cnt, 'free_task': finished_free_cnt}, file, indent=2)
    # set_hidden_file(task_cnt_path, True)
--------------------------------------------------------------------------------
/tracker/task_cnt.json:
--------------------------------------------------------------------------------
{
  "given_task": 0,
  "free_task": 0
}
--------------------------------------------------------------------------------
/tracker/tasks.json:
--------------------------------------------------------------------------------
[
  {
    "task": "Open PowerPoint and create a new presentation about \"Star Wars Saga Overview\".\nOn the first slide, add the title \"Star Wars: A Galaxy Far, Far Away\" and the subtitle \"The Epic Saga | 1977-2019\".\nFor the second slide, title it \"The Force\". Search for an image of \"Star Wars Lightsaber Duel\" on Chrome, copy the image, and paste it into the slide.\nOn the third slide, use the title \"Key Characters\" and include the following text: \"Follow the journey of Luke Skywalker, Darth Vader, Princess Leia and Rey as they shape the destiny of the galaxy.\"\nFor the fourth slide, title it \"Legacy\" and add: \"Star Wars revolutionized cinema with groundbreaking effects and created a cultural phenomenon that spans movies, games and books.\"\nOn the final slide, write \"May The Force Be With You\".\nSave the presentation to desktop, then close PowerPoint and Chrome.",
    "level": "hard",
    "file_input": false,
    "category": "PowerPointTasks",
    "finished": false
  },
  {
    "task": "Open Chrome and navigate to the product page for the MacBook Air M3 on Amazon.",
    "level": "medium",
    "file_input": false,
    "category": "ChromeTasks",
    "finished": false
  },
  {
    "task": "Create a new file named 'diary.txt' on the desktop.",
    "level": "easy",
    "file_input": false,
    "category": "FileOperationTasks",
    "finished": false
  }
]
--------------------------------------------------------------------------------
/tracker/tracker.py:
--------------------------------------------------------------------------------
import random
from monitor import Monitor
from task import *


class Tracker:
    def __init__(self):
        self.monitor = None
        self.running = False
        self.given_tasks = load_given_tasks()
        self.finished_given_cnt, self.finished_free_cnt = load_task_cnt()
        self.bad_task_cnt = 0
        self.task_num = len(self.given_tasks)
        print(f"task num = {self.task_num}")
        self.task_id = random.randint(0, self.task_num - 1)
        self.task = None

    def get_given_task(self, offset):
        # Step through the task list by offset (wrapping around) until an
        # unfinished, non-bad task is found
        while True:
            self.task_id = (self.task_id + self.task_num + offset) % self.task_num
            task = self.given_tasks[self.task_id]
            if not task.finished and not task.is_bad:
                break
        self.task = self.given_tasks[self.task_id]

    def finish_all(self):
        return self.finished_given_cnt + self.bad_task_cnt == self.task_num

    def update_tasks(self):
        update_given_tasks(self.given_tasks)
        update_task_cnt(self.finished_given_cnt, self.finished_free_cnt)

    def get_free_task(self):
        self.task = free_task()

    def start(self):
        if not self.running:
            self.monitor = Monitor(self.task)
            self.monitor.start()
            self.running = True

    def stop(self):
        if self.running:
            self.monitor.stop()
            self.running = False

    def finish(self):
        if self.running:
            self.monitor.finish()
            self.running = False
        else:
            self.monitor.generate_md()
        self.given_tasks[self.task_id].finished = True
        self.finished_given_cnt += 1

    def fail(self):
        if self.running:
            self.monitor.fail()
            self.running = False

    def stop_without_task(self):
        # stop without markdown (task unknown)
        if self.running:
            self.monitor.finish_without_md()
            self.running = False

    def save_free_task(self, task):
        # Called after stop_without_task: save the record once the task is filled in
        self.monitor.generate_md(task)
        self.finished_free_cnt += 1

    def discard(self):
        # Called after stop / stopping a free task: discard the record
        self.monitor.discard_record()
--------------------------------------------------------------------------------
/tracker/utils.py:
--------------------------------------------------------------------------------
import ctypes
import sys
from datetime import datetime
from pywinauto import Desktop


def get_current_time():
    return datetime.now().strftime('%Y-%m-%d_%H:%M:%S')


desktop = Desktop(backend="uia")


def get_element_info_at_position(x, y):
    """
    Get UI element info at specified coordinates
    """
    try:
        element = desktop.from_point(x, y)
        # Get element's rectangle coordinates
        rect = element.rectangle()

        return {
            "name": element.element_info.name,
            "coordinates": {
                "left": rect.left,
                "top": rect.top,
                "right": rect.right,
                "bottom": rect.bottom
            }
        }
    except Exception as e:
        print(f"Error occurred while getting element at position: {e}")
        return None


def print_debug(string):
    sys.stderr.write(string + "\n")


# Return 1 if the caps lock key is on; return 0 if it is off
def get_capslock_state():
    hllDll = ctypes.WinDLL("User32.dll")
    VK_CAPITAL = 0x14
    # The low-order bit of GetKeyState reports the toggle state
    return hllDll.GetKeyState(VK_CAPITAL) & 1
--------------------------------------------------------------------------------