├── .gitignore ├── README.md ├── SeeAct ├── data_utils │ ├── __pycache__ │ │ ├── dom_utils.cpython-312.pyc │ │ ├── format_prompt_utils.cpython-312.pyc │ │ ├── image_utils.cpython-312.pyc │ │ └── prompts.cpython-312.pyc │ ├── dom_utils.py │ ├── evaluation_utils.py │ ├── format_prompt_utils.py │ ├── image_utils.py │ └── prompts.py ├── demo_utils │ ├── __pycache__ │ │ ├── conversation.cpython-312.pyc │ │ ├── inference_engine.cpython-312.pyc │ │ ├── inference_engine.cpython-39.pyc │ │ └── llava_engine.cpython-312.pyc │ ├── browser_helper.py │ ├── conversation.py │ ├── format_prompt.py │ ├── inference_engine.py │ ├── llava_engine.py │ ├── ranking_model.py │ └── website_dict.py └── offline_experiments │ ├── .ipynb_checkpoints │ ├── offline_experiment-checkpoint.ipynb │ ├── read_results_offline_experiment-checkpoint.ipynb │ ├── read_results_offline_experiment_general-checkpoint.ipynb │ ├── read_results_offline_experiment_general_aug-checkpoint.ipynb │ ├── read_selected_aug_dataset_stock_and_aug_fix-checkpoint.ipynb │ └── save_selected_stock_tasks_as_standalone-checkpoint.ipynb │ └── gpt_4_standalone │ ├── __pycache__ │ ├── adv_gpt.cpython-312.pyc │ ├── inference_engine.cpython-312.pyc │ ├── inference_engine.cpython-39.pyc │ ├── utils.cpython-312.pyc │ └── utils.cpython-39.pyc │ ├── adv_gpt.py │ ├── adv_hf.py │ ├── inference_engine.py │ ├── offline_experiment_grounding.py │ ├── prompt_no_task.txt │ ├── prompt_no_task_gemini.txt │ ├── prompt_no_task_short.txt │ └── utils.py ├── data ├── task_demo_-1 │ └── f5da4b14-026d-4a10-ab89-f5720418f2b4_9016ffb6-7468-4495-ad07-756ac9f2af03 │ │ ├── images │ │ └── 0.jpg │ │ ├── prediction-4api.jsonl │ │ └── queries.jsonl └── task_demo_-1_aug │ ├── attack_dataset.json │ ├── subset_test_data_aug │ ├── augmented_dataset.json │ ├── imgs │ │ └── f5da4b14-026d-4a10-ab89-f5720418f2b4_9016ffb6-7468-4495-ad07-756ac9f2af03.jpg │ ├── predictions │ │ ├── augmented_dataset_correct.json │ │ ├── prediction-4api-augment-data-correct.jsonl │ │ └── prediction-4api-augment-data.jsonl │ ├── test.json │ └── train.json │ └── together │ └── data │ ├── dpo_test.jsonl │ ├── dpo_train.jsonl │ └── sft_train.jsonl ├── data_generation.ipynb ├── dpo ├── __pycache__ │ ├── booking_6agent_dataset.cpython-312.pyc │ └── build_agent_dataset.cpython-312.pyc ├── build_agent_dataset.py ├── dpo_training.py ├── merge_peft_adapter.py └── template_config │ ├── __pycache__ │ ├── chat_template.cpython-312.pyc │ └── chat_template.cpython-39.pyc │ ├── amberchat.jinja │ ├── chat_template.py │ ├── chatml.jinja │ ├── falcon-instruct.jinja │ ├── gemma-it.jinja │ ├── llama-2-chat.jinja │ ├── mistral-instruct.jinja │ ├── openchat.jinja │ ├── solar-instruct.jinja │ ├── vicuna.jinja │ └── zephyr.jinja ├── dpo_model_merge.sh ├── dpo_training.sh ├── evaluation.sh ├── outputs ├── adv_gpt_demo.json └── adv_hf_task_demo_dpo.json ├── pipe_inference.png ├── requirements.txt └── training_data_generation.sh /.gitignore: -------------------------------------------------------------------------------- 1 | *.DS_Store 2 | *./vscode 3 | .idea 4 | .vscode 5 | data/Multimodal-Mind2Web/* 6 | data/seeact_source_data/test_website_outputs_top50.json 7 | data/seeact_source_data/test_domain_outputs_top50.json 8 | data/seeact_source_data/test_task_outputs_top50.json 9 | data/task_demo_-1_aug/together/new_models/* -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AdvAgent: Controllable Blackbox Red-teaming on 
Web Agents 2 | 3 | Code for our paper "AdvAgent: Controllable Blackbox Red-teaming on Web Agents". 4 | 5 | ## Setup 6 | 7 | Create a virtual environment, for example with conda: 8 | ``` 9 | conda create -n AdvAgent python=3.12.2 10 | conda activate AdvAgent 11 | ``` 12 | 13 | Clone this repository: 14 | ``` 15 | git clone https://github.com/AI-secure/AdvAgent.git 16 | ``` 17 | 18 | Install dependencies: 19 | ``` 20 | pip install -r requirements.txt 21 | ``` 22 | 23 | Export your OpenAI API key and the other required keys as environment variables. 24 | (Our pipeline supports attacking various large language models, such as GPT, Gemini, and Claude; here we use attacking GPT as the running example.) 25 | 26 | ``` 27 | export OPENAI_API_KEY=<your_openai_api_key> 28 | export HUGGING_FACE_HUB_TOKEN=<your_hugging_face_token> 29 | ``` 30 | 31 | ## Data 32 | We conduct experiments on the [Mind2Web](https://osu-nlp-group.github.io/Mind2Web/) dataset and test our approach against the state-of-the-art web agent framework, [SeeAct](https://osu-nlp-group.github.io/SeeAct/). 33 | 34 | Download the source data [Multimodal-Mind2Web](https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web/tree/main) from Hugging Face and store it in the path `data/Multimodal-Mind2Web/data/`. 35 | 36 | Download the [SeeAct Source Data](https://buckeyemailosu-my.sharepoint.com/:f:/g/personal/zheng_2372_buckeyemail_osu_edu/Ei95kzWnWlVAn4DR5I3zDDEBUZtC-9vIf0VBuFMOzZNn2w?e=OcH9Om) and store it in the path `data/seeact_source_data/`. 37 | 38 | ## Run Demo 39 | ### Data Generation 40 | #### Construct the training set and test set 41 | Run the notebook `data_generation.ipynb` to filter data from the source dataset and construct the training set and test set. 42 | #### Build datasets for SFT and DPO 43 | Run `training_data_generation.sh` to check the quality of the training data and to build the SFT and DPO datasets. 44 | 45 | After completing the Data Generation section, your file structure should look like this: 46 | ``` 47 | ├──task_demo_-1_aug 48 | ├──attack_dataset.json 49 | ├──subset_test_data_aug 50 | │ ├── train.json 51 | │ ├── test.json 52 | │ ├── augmented_dataset.json 53 | │ ├── predictions 54 | │ │ ├── prediction-4api-augment-data.jsonl 55 | │ │ ├── augmented_dataset_correct.json 56 | │ │ └── prediction-4api-augment-data-correct.jsonl 57 | │ └── imgs 58 | │ └── f5da4b14-026d-4a10-ab89-f5720418f2b4_9016ffb6-7468-4495-ad07-756ac9f2af03.jpg 59 | └── together 60 | └── data 61 | └── sft_train_data.jsonl 62 | ``` 63 | ### Model Training 64 | #### SFT 65 | We fine-tune the model through Together AI's API. The basic training process is as follows (for more details, please refer to the [Together AI docs](https://docs.together.ai/docs/fine-tuning-overview)): 66 | Set up the Together AI API key: 67 | ``` 68 | export TOGETHER_API_KEY=<your_together_api_key> 69 | ``` 70 | Upload the training dataset: 71 | ``` 72 | together files upload "xxx.jsonl" 73 | ``` 74 | Train the SFT model: 75 | ``` 76 | together fine-tuning create \ 77 | --training-file "file-xxx" \ 78 | --model "mistralai/Mistral-7B-Instruct-v0.2" \ 79 | --lora \ 80 | --batch-size 16 81 | ``` 82 | Download the SFT model: 83 | ``` 84 | together fine-tuning download "ft-xxx" 85 | ``` 86 | You can store the SFT model in the path `data/task_demo_-1_aug/together/new_models/`. 87 | #### DPO 88 | Run `dpo_training.sh` to train the DPO model. 89 | Select the best checkpoint based on the training curve, then run `dpo_model_merge.sh` to merge the trained adapter into the base model. 90 | #### Evaluation 91 | Run `evaluation.sh` to evaluate the SFT and DPO models.
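For quick reference, the whole demo runs in the following order (a minimal sketch assuming the default paths above; replace the Together AI file and job IDs with the values returned by the upload and fine-tuning commands):
```
# 1. Construct the training and test sets from the source data
jupyter nbconvert --to notebook --execute data_generation.ipynb

# 2. Check training-data quality and build the SFT/DPO datasets
bash training_data_generation.sh

# 3. Run SFT through the Together AI CLI (upload, fine-tune, download; see above)

# 4. Train the DPO model, then merge the best checkpoint
bash dpo_training.sh
bash dpo_model_merge.sh

# 5. Evaluate the SFT and DPO models
bash evaluation.sh
```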
92 | ## Citation 93 | If you find this code useful, please cite our paper: 94 | 95 | ``` 96 | ``` 97 | 98 | -------------------------------------------------------------------------------- /SeeAct/data_utils/__pycache__/dom_utils.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-secure/AdvAgent/e01316e33d11ab145daa7cc7ec12d6ddb0bf38a0/SeeAct/data_utils/__pycache__/dom_utils.cpython-312.pyc -------------------------------------------------------------------------------- /SeeAct/data_utils/__pycache__/format_prompt_utils.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-secure/AdvAgent/e01316e33d11ab145daa7cc7ec12d6ddb0bf38a0/SeeAct/data_utils/__pycache__/format_prompt_utils.cpython-312.pyc -------------------------------------------------------------------------------- /SeeAct/data_utils/__pycache__/image_utils.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-secure/AdvAgent/e01316e33d11ab145daa7cc7ec12d6ddb0bf38a0/SeeAct/data_utils/__pycache__/image_utils.cpython-312.pyc -------------------------------------------------------------------------------- /SeeAct/data_utils/__pycache__/prompts.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-secure/AdvAgent/e01316e33d11ab145daa7cc7ec12d6ddb0bf38a0/SeeAct/data_utils/__pycache__/prompts.cpython-312.pyc -------------------------------------------------------------------------------- /SeeAct/data_utils/dom_utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # Copyright (c) 2024 OSU Natural Language Processing Group 3 | # 4 | # Licensed under the OpenRAIL-S License; 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # https://www.licenses.ai/ai-pubs-open-rails-vz1 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
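# Module overview: utilities for turning Mind2Web DOM snapshots into compact
# text for prompting. clean_tree() drops non-salient attributes and collapses
# empty wrapper nodes; prune_tree()/data_prune_tree() keep each candidate
# element plus its ancestors, a bounded set of descendants, and nearby
# siblings; get_tree_repr() serializes the pruned tree, remapping
# backend_node_id values to small integer ids.
#
# Minimal usage sketch (assumes a Mind2Web-style record `sample` with a
# "cleaned_html" field and a list of candidate backend_node_id strings):
#
#   dom_tree = lxml.etree.fromstring(sample["cleaned_html"])
#   pruned, kept_ids = data_prune_tree(dom_tree, candidate_ids)
#   tree_repr, id_mapping = get_tree_repr(pruned, id_mapping={})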
15 | 16 | import copy 17 | import re 18 | from lxml import etree 19 | import lxml 20 | from bs4 import BeautifulSoup 21 | 22 | salient_attributes = { 23 | "alt", 24 | "aria_description", 25 | "aria_label", 26 | "aria_role", 27 | "input_checked", 28 | "input_value", 29 | "label", 30 | "name", 31 | "option_selected", 32 | "placeholder", 33 | "role", 34 | "text_value", 35 | "title", 36 | "type", 37 | "value", 38 | } 39 | 40 | 41 | def remove_extra_eol(text): 42 | # Replace EOL symbols 43 | text = text.replace('\n', ' ') 44 | return re.sub(r'\s{2,}', ' ', text) 45 | 46 | 47 | def clean_text(text): 48 | if text is None: 49 | return "" 50 | text = re.sub(r"\s+", " ", text) 51 | return text.strip() 52 | 53 | 54 | def get_descendants(node, max_depth, current_depth=0): 55 | if current_depth > max_depth: 56 | return [] 57 | descendants = [] 58 | for child in node: 59 | descendants.append(child) 60 | descendants.extend(get_descendants(child, max_depth, current_depth + 1)) 61 | return descendants 62 | 63 | 64 | def clean_tree(dom_tree, all_candidate_ids): 65 | new_tree = copy.deepcopy(dom_tree) 66 | for node in new_tree.xpath("//*")[::-1]: 67 | # check if node have salient attributes 68 | for attr in node.attrib: 69 | if attr == "class" and node.attrib[attr] and node.tag == "svg": 70 | icon_texts = re.findall(r"\S*icon\S*", node.attrib[attr], re.IGNORECASE) 71 | icon_texts = [clean_text(text) for text in icon_texts] 72 | icon_texts = [text for text in icon_texts if text] 73 | if icon_texts: 74 | node.attrib[attr] = " ".join(icon_texts) 75 | else: 76 | node.attrib.pop(attr) 77 | elif attr in salient_attributes: 78 | if not ( 79 | ( 80 | attr == "role" 81 | and node.attrib.get(attr, "") 82 | in {"presentation", "none", "link"} 83 | ) 84 | or (attr == "type" and node.attrib.get(attr, "") == "hidden") 85 | ): 86 | value = clean_text(node.attrib[attr]) 87 | if value != "": 88 | node.attrib[attr] = value 89 | else: 90 | node.attrib.pop(attr) 91 | else: 92 | node.attrib.pop(attr) 93 | elif attr != "backend_node_id": 94 | node.attrib.pop(attr) 95 | if node.tag == "text": 96 | value = clean_text(node.text) 97 | if len(value) > 0: 98 | node.text = value 99 | else: 100 | node.getparent().remove(node) 101 | elif ( 102 | node.attrib.get("backend_node_id", "") not in all_candidate_ids 103 | and len(node.attrib) == 1 104 | and not any([x.tag == "text" for x in node.getchildren()]) 105 | and node.getparent() is not None 106 | and len(node.getchildren()) <= 1 107 | ): 108 | # insert all children into parent 109 | for child in node.getchildren(): 110 | node.addprevious(child) 111 | node.getparent().remove(node) 112 | return new_tree 113 | 114 | 115 | def prune_tree( 116 | dom_tree, 117 | candidate_set, 118 | max_depth=5, 119 | max_children=50, 120 | max_sibling=3, 121 | ): 122 | nodes_to_keep = set() 123 | for candidate_id in candidate_set: 124 | candidate_node = dom_tree.xpath(f'//*[@backend_node_id="{candidate_id}"]')[0] 125 | nodes_to_keep.add(candidate_node.attrib["backend_node_id"]) 126 | # get all ancestors 127 | nodes_to_keep.update( 128 | [ 129 | x.attrib.get("backend_node_id", "") 130 | for x in candidate_node.xpath("ancestor::*") 131 | ] 132 | ) 133 | # get descendants with max depth 134 | nodes_to_keep.update( 135 | [ 136 | x.attrib.get("backend_node_id", "") 137 | for x in get_descendants(candidate_node, max_depth) 138 | ][:max_children] 139 | ) 140 | # get siblings within range 141 | parent = candidate_node.getparent() 142 | if parent is not None: 143 | siblings = [x for x in parent.getchildren() if x.tag 
!= "text"] 144 | idx_in_sibling = siblings.index(candidate_node) 145 | nodes_to_keep.update( 146 | [ 147 | x.attrib.get("backend_node_id", "") 148 | for x in siblings[ 149 | max(0, idx_in_sibling - max_sibling): idx_in_sibling 150 | + max_sibling 151 | + 1 152 | ] 153 | ] 154 | ) 155 | # clone the tree 156 | new_tree = copy.deepcopy(dom_tree) 157 | # remove nodes not in nodes_to_keep 158 | for node in new_tree.xpath("//*")[::-1]: 159 | if node.tag != "text": 160 | is_keep = node.attrib.get("backend_node_id", "") in nodes_to_keep 161 | is_candidate = node.attrib.get("backend_node_id", "") in candidate_set 162 | else: 163 | is_keep = ( 164 | node.getparent().attrib.get("backend_node_id", "") in nodes_to_keep 165 | ) 166 | is_candidate = ( 167 | node.getparent().attrib.get("backend_node_id", "") in candidate_set 168 | ) 169 | if not is_keep and node.getparent() is not None: 170 | node.getparent().remove(node) 171 | else: 172 | if not is_candidate or node.tag == "text": 173 | node.attrib.pop("backend_node_id", None) 174 | if ( 175 | len(node.attrib) == 0 176 | and not any([x.tag == "text" for x in node.getchildren()]) 177 | and node.getparent() is not None 178 | and node.tag != "text" 179 | and len(node.getchildren()) <= 1 180 | ): 181 | # insert all children into parent 182 | for child in node.getchildren(): 183 | node.addprevious(child) 184 | node.getparent().remove(node) 185 | return new_tree 186 | 187 | 188 | def data_prune_tree( 189 | dom_tree, 190 | candidate_set, 191 | max_depth=5, 192 | max_children=50, 193 | max_sibling=3, 194 | ): 195 | nodes_to_keep = set() 196 | for candidate_id in candidate_set: 197 | candidate_node = dom_tree.xpath(f'//*[@backend_node_id="{candidate_id}"]')[0] 198 | nodes_to_keep.add(candidate_node.attrib["backend_node_id"]) 199 | # get all ancestors 200 | nodes_to_keep.update( 201 | [ 202 | x.attrib.get("backend_node_id", "") 203 | for x in candidate_node.xpath("ancestor::*") 204 | ] 205 | ) 206 | # get descendants with max depth 207 | nodes_to_keep.update( 208 | [ 209 | x.attrib.get("backend_node_id", "") 210 | for x in get_descendants(candidate_node, max_depth) 211 | ][:max_children] 212 | ) 213 | # get siblings within range 214 | parent = candidate_node.getparent() 215 | if parent is not None: 216 | siblings = [x for x in parent.getchildren() if x.tag != "text"] 217 | idx_in_sibling = siblings.index(candidate_node) 218 | nodes_to_keep.update( 219 | [ 220 | x.attrib.get("backend_node_id", "") 221 | for x in siblings[ 222 | max(0, idx_in_sibling - max_sibling): idx_in_sibling 223 | + max_sibling 224 | + 1 225 | ] 226 | ] 227 | ) 228 | # clone the tree 229 | new_tree = copy.deepcopy(dom_tree) 230 | # remove nodes not in nodes_to_keep 231 | for node in new_tree.xpath("//*")[::-1]: 232 | if node.tag != "text": 233 | is_keep = node.attrib.get("backend_node_id", "") in nodes_to_keep 234 | is_candidate = node.attrib.get("backend_node_id", "") in candidate_set 235 | else: 236 | is_keep = ( 237 | node.getparent().attrib.get("backend_node_id", "") in nodes_to_keep 238 | ) 239 | is_candidate = ( 240 | node.getparent().attrib.get("backend_node_id", "") in candidate_set 241 | ) 242 | if not is_keep and node.getparent() is not None: 243 | node.getparent().remove(node) 244 | else: 245 | if not is_candidate or node.tag == "text": 246 | node.attrib.pop("backend_node_id", None) 247 | if ( 248 | len(node.attrib) == 0 249 | and not any([x.tag == "text" for x in node.getchildren()]) 250 | and node.getparent() is not None 251 | and node.tag != "text" 252 | and 
len(node.getchildren()) <= 1 253 | ): 254 | # insert all children into parent 255 | for child in node.getchildren(): 256 | node.addprevious(child) 257 | node.getparent().remove(node) 258 | return new_tree, nodes_to_keep 259 | 260 | 261 | def get_attribute_repr(node, max_value_length=5, max_length=20): 262 | # get attribute values in order 263 | attr_values_set = set() 264 | attr_values = "" 265 | for attr in [ 266 | "role", 267 | "aria_role", 268 | "type", 269 | "alt", 270 | "aria_description", 271 | "aria_label", 272 | "label", 273 | "title", 274 | "name", 275 | "text_value", 276 | "value", 277 | "placeholder", 278 | "input_checked", 279 | "input_value", 280 | "option_selected", 281 | "class", 282 | ]: 283 | if attr in node.attrib and node.attrib[attr] is not None: 284 | value = node.attrib[attr].lower() 285 | # less meaningful values 286 | if value in [ 287 | "hidden", 288 | "none", 289 | "presentation", 290 | "null", 291 | "undefined", 292 | ] or value.startswith("http"): 293 | continue 294 | value = value.split() 295 | value = " ".join([v for v in value if len(v) < 15][:max_value_length]) 296 | if value and value not in attr_values_set: 297 | attr_values_set.add(value) 298 | attr_values += value + " " 299 | uid = node.attrib.get("backend_node_id", "") 300 | # clear all attributes 301 | node.attrib.clear() 302 | if uid: 303 | node.attrib["id"] = uid 304 | # add meta attribute 305 | if attr_values: 306 | node.attrib["meta"] = " ".join(attr_values.split()[:max_length]) 307 | 308 | 309 | def get_tree_repr( 310 | tree, max_value_length=5, max_length=20, id_mapping={}, keep_html_brackets=False 311 | ): 312 | if isinstance(tree, str): 313 | tree = etree.fromstring(tree) 314 | else: 315 | tree = copy.deepcopy(tree) 316 | for node in tree.xpath("//*"): 317 | if node.tag != "text": 318 | if "backend_node_id" in node.attrib: 319 | if node.attrib["backend_node_id"] not in id_mapping: 320 | id_mapping[node.attrib["backend_node_id"]] = len(id_mapping) 321 | node.attrib["backend_node_id"] = str( 322 | id_mapping[node.attrib["backend_node_id"]] 323 | ) 324 | get_attribute_repr(node, max_value_length, max_length) 325 | else: 326 | node.text = " ".join(node.text.split()[:max_length]) 327 | tree_repr = etree.tostring(tree, encoding="unicode") 328 | 329 | tree_repr = tree_repr.replace('"', " ") 330 | tree_repr = ( 331 | tree_repr.replace("meta= ", "").replace("id= ", "id=").replace(" >", ">") 332 | ) 333 | tree_repr = re.sub(r"<text>(.*?)</text>", r"\1", tree_repr) 334 | if not keep_html_brackets: 335 | tree_repr = tree_repr.replace("/>", "$/$>") 336 | tree_repr = re.sub(r"</(.+?)>", r")", tree_repr) 337 | tree_repr = re.sub(r"<(.+?)>", r"(\1", tree_repr) 338 | tree_repr = tree_repr.replace("$/$", ")") 339 | 340 | html_escape_table = [ 341 | ("&quot;", '"'), 342 | ("&amp;", "&"), 343 | ("&lt;", "<"), 344 | ("&gt;", ">"), 345 | ("&nbsp;", " "), 346 | ("&ndash;", "-"), 347 | ("&rsquo;", "'"), 348 | ("&lsquo;", "'"), 349 | ("&ldquo;", '"'), 350 | ("&rdquo;", '"'), 351 | ("&#39;", "'"), 352 | ("&lpar;", "("), 353 | ("&rpar;", ")"), 354 | ] 355 | for k, v in html_escape_table: 356 | tree_repr = tree_repr.replace(k, v) 357 | tree_repr = re.sub(r"\s+", " ", tree_repr).strip() 358 | 359 | return tree_repr, id_mapping 360 | 361 | 362 | def extract_elements_from_html(whole_html): 363 | pattern = r'backend_node_id="(.+?)"[^>]*>(.*?)<' 364 | all_element_texts = whole_html.strip().split("\n") 365 | valids = [] 366 | invalids = [] 367 | 368 | for text in all_element_texts: 369 | match = re.search(pattern, text) 370 | 371 | # Extracting the values if a match is found 372 | if match: 373 | backend_node_id = match.group(1) 374 |
inner_text = match.group(2) 375 | valids.append([backend_node_id, inner_text, text]) 376 | else: 377 | backend_node_id, inner_text = None, None 378 | invalids.append(text) 379 | 380 | # Using Beautify Soup 381 | soup = BeautifulSoup(whole_html, 'html.parser') 382 | # Find all elements and get their tag names 383 | tag_names = [tag.name for tag in soup.find_all()] 384 | 385 | # Remove duplicates by converting the list to a set 386 | unique_tag_names = set(tag_names) 387 | 388 | element_dict = {} 389 | for tag in unique_tag_names: 390 | tag_elements = [] 391 | elements = soup.find_all(tag) 392 | for element in elements: 393 | if 'backend_node_id' not in element.attrs: 394 | continue 395 | temp = [element.attrs['backend_node_id'], clean_element_text(element.text)] 396 | if 'alt' in element.attrs: 397 | temp.append(element.attrs['alt']) 398 | tag_elements.append(temp) 399 | if clean_element_text(element.text) == "": 400 | t = element.attrs 401 | element_dict[tag] = tag_elements 402 | 403 | return element_dict 404 | 405 | 406 | def locate_element_attributes( 407 | sample, keep_html_brackets=False 408 | ): 409 | # Parse html into a dom tree 410 | dom_tree = lxml.etree.fromstring(sample["cleaned_html"]) 411 | tree_repr, backend_node_id2id = get_tree_repr( 412 | dom_tree, id_mapping={}, keep_html_brackets=keep_html_brackets 413 | ) 414 | id2backend_node_id = {} 415 | for item in backend_node_id2id: 416 | id2backend_node_id[backend_node_id2id[item]] = item 417 | 418 | if isinstance(dom_tree, str): 419 | tree = etree.fromstring(dom_tree) 420 | else: 421 | tree = copy.deepcopy(dom_tree) 422 | # Collect Attributes 423 | all_node_attributes = [] 424 | node_to_traverse = tree.xpath("//*") 425 | for node in node_to_traverse: 426 | if "backend_node_id" not in node.attrib: 427 | continue 428 | all_node_attributes.append([ 429 | node.attrib, 430 | node.tag, 431 | node.text, 432 | ]) 433 | return all_node_attributes 434 | 435 | 436 | def clean_element_text(element_text): 437 | if not isinstance(element_text, str): 438 | return "" 439 | # Remove Symbols 440 | symbol_list = [".", ":", "/", "'", '"', ","] 441 | for symbol in symbol_list: 442 | element_text = element_text.replace(symbol, "") 443 | element_text = element_text.strip() 444 | # Convert text to lower case for better matching 445 | element_text = element_text.lower() 446 | return element_text 447 | -------------------------------------------------------------------------------- /SeeAct/data_utils/format_prompt_utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # Copyright (c) 2024 OSU Natural Language Processing Group 3 | # 4 | # Licensed under the OpenRAIL-S License; 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # https://www.licenses.ai/ai-pubs-open-rails-vz1 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
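# Module overview: helpers for assembling SeeAct-style two-phase prompts.
# data_format_input_multichoice() renders a pruned DOM plus a multichoice
# question (option A is always "None of the above"); generate_query_prompt()
# and generate_new_query_prompt() build the first-phase prompt from the task,
# previous actions, and question description; the referring-prompt helpers
# add the element/action/value output format. Option letters are Excel-style:
#
#   generate_option_name(0)           # -> "A"
#   generate_option_name(26)          # -> "AA"
#   get_index_from_option_name("AA")  # -> 26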
15 | 16 | import string 17 | import lxml 18 | from .dom_utils import get_tree_repr, data_prune_tree 19 | def data_format_input_multichoice( 20 | sample, candidate_ids, gt=-1, previous_k=5, keep_html_brackets=False 21 | ): 22 | # Parse html into a dom tree 23 | dom_tree = lxml.etree.fromstring(sample["cleaned_html"]) 24 | dom_tree, node_to_keep = data_prune_tree(dom_tree, candidate_ids) 25 | tree_repr, id_mapping = get_tree_repr( 26 | dom_tree, id_mapping={}, keep_html_brackets=keep_html_brackets 27 | ) 28 | candidate_nodes = dom_tree.xpath("//*[@backend_node_id]") 29 | choices = [] 30 | for idx, node in enumerate(candidate_nodes): 31 | temp = get_tree_repr( 32 | node, 33 | id_mapping=id_mapping, 34 | keep_html_brackets=keep_html_brackets, 35 | ) 36 | choices.append( 37 | [ 38 | node.attrib["backend_node_id"], 39 | " ".join( 40 | get_tree_repr( 41 | node, 42 | id_mapping=id_mapping, 43 | keep_html_brackets=keep_html_brackets, 44 | )[0].split()[:10] 45 | ), 46 | ] 47 | ) 48 | gt = id_mapping.get(gt, -1) 49 | seq_input = ( 50 | "Based on the HTML webpage above, try to complete the following task:\n" 51 | f"Task: {sample['confirmed_task']}\n" 52 | f"Previous actions:\n" 53 | ) 54 | if len(sample["previous_actions"]) > 0: 55 | for action in sample["previous_actions"][-previous_k:]: 56 | seq_input += f"{action}\n" 57 | else: 58 | seq_input += "None\n" 59 | seq_input += ( 60 | "What should be the next action? Please select from the following choices " 61 | "(If the correct action is not in the page above, please select A. 'None of the above'):\n\n" 62 | "A. None of the above\n" 63 | ) 64 | for idx, choice in enumerate(choices): 65 | # convert to ascii A, B, C, D, ... 66 | seq_input += f"{chr(66 + idx)}. {choice[1]}\n" 67 | if gt == -1: 68 | seq_target = "A." 
69 | else: 70 | gt += 1 71 | current_action_op = sample["operation"]["op"] 72 | current_action_value = sample["operation"]["value"] 73 | seq_target = f"{chr(65 + gt)}.\n" f"Action: {current_action_op}\n" 74 | if current_action_op != "CLICK": 75 | seq_target += f"Value: {current_action_value}" 76 | return tree_repr, seq_input, seq_target, choices, node_to_keep 77 | 78 | 79 | def generate_query_prompt(system_prompt="", task="", previous_actions=None, question_description=""): 80 | """ 81 | Generate the first phase prompt to ask model to generate general descriptions about {environment, high-level plans, next step action} 82 | Each experiment will have a similar prompt in this phase 83 | This prompt is used to generate models' thoughts without disrupt of formatting/referring prompts 84 | """ 85 | query_text = "" 86 | 87 | # System Prompt 88 | query_text += system_prompt 89 | 90 | # Task Description 91 | query_text += task 92 | query_text += "\n\n" 93 | 94 | # Previous Actions 95 | previous_action_text = "Previous Actions:\n" 96 | if previous_actions is None: 97 | previous_actions = [] 98 | for action_text in previous_actions: 99 | previous_action_text += action_text 100 | previous_action_text += "\n" 101 | query_text += previous_action_text 102 | query_text += "\n" 103 | 104 | # Question Description 105 | query_text += question_description 106 | return query_text 107 | 108 | 109 | def generate_new_query_prompt(system_prompt="", task="", previous_actions=None, question_description=""): 110 | """ 111 | Generate the first phase prompt to ask model to generate general descriptions about {environment, high-level plans, next step action} 112 | Each experiment will have a similar prompt in this phase 113 | This prompt is used to generate models' thoughts without disrupt of formatting/referring prompts 114 | """ 115 | sys_role=""+system_prompt 116 | query_text = "" 117 | 118 | # System Prompt 119 | query_text += "You are asked to complete the following task: " 120 | 121 | # Task Description 122 | query_text += task 123 | query_text += "\n\n" 124 | 125 | # Previous Actions 126 | previous_action_text = "Previous Actions:\n" 127 | if previous_actions is None: 128 | previous_actions = [] 129 | for action_text in previous_actions: 130 | previous_action_text += action_text 131 | previous_action_text += "\n" 132 | query_text += previous_action_text 133 | query_text += "\n" 134 | 135 | # Question Description 136 | query_text += question_description 137 | return [sys_role,query_text] 138 | 139 | def generate_referring_prompt(referring_description="", element_format="", action_format="", value_format="", 140 | choices=None): 141 | referring_prompt = "" 142 | 143 | # Add description about how to format output 144 | if referring_description != "": 145 | referring_prompt += referring_description 146 | referring_prompt += "\n\n" 147 | 148 | # Add element prediction format and choices 149 | if element_format != "": 150 | referring_prompt += element_format 151 | referring_prompt += "\n\n" 152 | 153 | # Prepare Option texts 154 | # For exp {1, 2, 4}, generate option 155 | # For element_atttribute, set options field at None 156 | if choices: 157 | choice_text = format_options(choices) 158 | referring_prompt += choice_text 159 | 160 | # Format Action Prediction 161 | if action_format != "": 162 | referring_prompt += action_format 163 | referring_prompt += "\n\n" 164 | 165 | # Format Value Prediction 166 | if value_format != "": 167 | referring_prompt += value_format 168 | referring_prompt += "" 169 | 170 | return 
referring_prompt 171 | 172 | 173 | def generate_new_referring_prompt(referring_description="", element_format="", action_format="", value_format="", 174 | choices=None,split="4"): 175 | referring_prompt = "" 176 | 177 | # Add description about how to format output 178 | if referring_description != "": 179 | referring_prompt += referring_description 180 | referring_prompt += "\n\n" 181 | 182 | # Add element prediction format and choices 183 | 184 | 185 | # Prepare Option texts 186 | # For exp {1, 2, 4}, generate option 187 | # For element_atttribute, set options field at None 188 | if choices: 189 | choice_text = format_options(choices) 190 | referring_prompt += choice_text 191 | 192 | if element_format != "": 193 | referring_prompt += element_format 194 | referring_prompt += "\n\n" 195 | 196 | # Format Action Prediction 197 | if action_format != "": 198 | referring_prompt += action_format 199 | referring_prompt += "\n\n" 200 | 201 | # Format Value Prediction 202 | if value_format != "": 203 | referring_prompt += value_format 204 | referring_prompt += "" 205 | 206 | return referring_prompt 207 | 208 | def format_options(choices): 209 | option_text = "" 210 | abcd = '' 211 | non_abcd = '' 212 | 213 | multi_choice = '' 214 | for multichoice_idx, choice in enumerate(choices): 215 | multi_choice += f"{generate_option_name(multichoice_idx)}. {choice[1]}\n" 216 | abcd += f"{generate_option_name(multichoice_idx)}, " 217 | 218 | non_abcd = generate_option_name(multichoice_idx + 1) 219 | 220 | multi_choice += f"{non_abcd}. None of the other options match the correct element" 221 | # option_text += abcd 222 | option_text += f"If none of these elements match your target element, please select {non_abcd}. None of the other options match the correct element.\n" 223 | 224 | option_text += (multi_choice + '\n\n') 225 | return option_text 226 | 227 | 228 | def generate_option_name(index): 229 | if index < 26: 230 | return string.ascii_uppercase[index] 231 | else: 232 | first_letter_index = (index - 26) // 26 233 | second_letter_index = (index - 26) % 26 234 | first_letter = string.ascii_uppercase[first_letter_index] 235 | second_letter = string.ascii_uppercase[second_letter_index] 236 | return f"{first_letter}{second_letter}" 237 | 238 | def get_index_from_option_name(name): 239 | if len(name) == 1: 240 | return string.ascii_uppercase.index(name) 241 | elif len(name) == 2: 242 | first_letter_index = string.ascii_uppercase.index(name[0]) 243 | second_letter_index = string.ascii_uppercase.index(name[1]) 244 | return 26 + first_letter_index * 26 + second_letter_index 245 | else: 246 | raise Exception("The string should be either 1 or 2 characters long") 247 | 248 | 249 | -------------------------------------------------------------------------------- /SeeAct/data_utils/image_utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # Copyright (c) 2024 OSU Natural Language Processing Group 3 | # 4 | # Licensed under the OpenRAIL-S License; 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # https://www.licenses.ai/ai-pubs-open-rails-vz1 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | 17 | import supervision as sv 18 | import torch 19 | import json 20 | import copy 21 | 22 | def convert_elements2detections(candidate_elements): 23 | """ 24 | Extract element coordinates 25 | Parse candidate elements coordinates and convert into sv Detection objects 26 | """ 27 | boxes = [] 28 | for box_id, element in enumerate(candidate_elements): 29 | bounding_box_rect = json.loads(element['attributes'])['bounding_box_rect'].strip().split(',') 30 | x1 = float(bounding_box_rect[0]) 31 | y1 = float(bounding_box_rect[1]) 32 | w = float(bounding_box_rect[2]) 33 | h = float(bounding_box_rect[3]) 34 | boxes.append([x1, y1, x1 + w, y1 + h]) 35 | # Format bounding box into transformers output format to convert into supervision detection 36 | transformer_results = { 37 | "boxes": torch.tensor(boxes), 38 | "scores": torch.tensor([0.5 for item in boxes]), 39 | "labels": torch.tensor([1 for item in boxes]) 40 | } 41 | detections = sv.Detections.from_transformers(transformer_results) 42 | return detections 43 | 44 | 45 | def extract_topk_elements(all_elements, k): 46 | topk_elements = [] 47 | for element in all_elements: 48 | rank = element['rank'] 49 | score = element['score'] 50 | if rank < k: 51 | topk_elements.append(copy.deepcopy(element)) 52 | return topk_elements 53 | 54 | 55 | def extract_elements_by_ids(all_elements, ids): 56 | """ 57 | Extract elements specified by the list of element_id 58 | To prevent order change, we will keep the return element the same order as the ids input 59 | """ 60 | output = [] 61 | for element in all_elements: 62 | element_id = element['backend_node_id'] 63 | if element_id in ids: 64 | output.append(element) 65 | 66 | # Order output element to be identical with ids input 67 | element_dict = {} 68 | for element in all_elements: 69 | element_id = element['backend_node_id'] 70 | element_dict[element_id] = element 71 | ordered_output = [] 72 | for element_id in ids: 73 | ordered_output.append(element_dict[element_id]) 74 | 75 | return ordered_output 76 | 77 | 78 | def batch_elements_by_locality(elements, num_choices): 79 | # Sort elements by y1 location (ascending order) 80 | sorted_elements = sorted(elements, key=lambda x: float( 81 | json.loads(x['attributes'])['bounding_box_rect'].strip().split(',')[1])) 82 | 83 | batches = [] 84 | while len(sorted_elements) > 1: 85 | batch = sorted_elements[: num_choices] 86 | sorted_elements = sorted_elements[num_choices:] 87 | batches.append(batch) 88 | 89 | return batches 90 | 91 | def batch_elements_by_locality_16_16_17(elements): 92 | # Sort elements by y1 location (ascending order) 93 | sorted_elements = sorted(elements, key=lambda x: float( 94 | json.loads(x['attributes'])['bounding_box_rect'].strip().split(',')[1])) 95 | 96 | batches = [] 97 | # First batch: 16 98 | batch = sorted_elements[: 16] 99 | sorted_elements = sorted_elements[16:] 100 | batches.append(batch) 101 | 102 | # Second batch: 17 103 | batch = sorted_elements[: 17] 104 | sorted_elements = sorted_elements[17:] 105 | batches.append(batch) 106 | 107 | # Third batch: 17 108 | batch = sorted_elements[: 17] 109 | sorted_elements = sorted_elements[17:] 110 | batches.append(batch) 111 | 112 | return batches 113 | 114 | 115 | def split_elements_by_locality_final_round(elements): 116 | # Sort elements by y1 location (ascending order) 117 | sorted_elements = sorted(elements, key=lambda x: float( 118 | 
json.loads(x['attributes'])['bounding_box_rect'].strip().split(',')[1])) 119 | 120 | y1_axis = [float(json.loads(item['attributes'])['bounding_box_rect'].strip().split(',')[1]) for item in sorted_elements] 121 | batches = [] 122 | window_elements = [] 123 | for idx in range(len(y1_axis)): 124 | if not window_elements: 125 | window_elements.append(y1_axis[idx]) 126 | continue 127 | else: 128 | current_y = y1_axis[idx] 129 | if current_y - window_elements[-1]<2000: 130 | window_elements.append(current_y) 131 | else: 132 | batches.append(window_elements) 133 | window_elements = [current_y] 134 | batches.append(window_elements) 135 | 136 | cropping_locations = [] 137 | idx = 0 138 | for item in batches: 139 | cropping_locations.append([idx, idx+len(item)]) 140 | idx += len(item) 141 | return cropping_locations 142 | 143 | -------------------------------------------------------------------------------- /SeeAct/demo_utils/__pycache__/conversation.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-secure/AdvAgent/e01316e33d11ab145daa7cc7ec12d6ddb0bf38a0/SeeAct/demo_utils/__pycache__/conversation.cpython-312.pyc -------------------------------------------------------------------------------- /SeeAct/demo_utils/__pycache__/inference_engine.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-secure/AdvAgent/e01316e33d11ab145daa7cc7ec12d6ddb0bf38a0/SeeAct/demo_utils/__pycache__/inference_engine.cpython-312.pyc -------------------------------------------------------------------------------- /SeeAct/demo_utils/__pycache__/inference_engine.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-secure/AdvAgent/e01316e33d11ab145daa7cc7ec12d6ddb0bf38a0/SeeAct/demo_utils/__pycache__/inference_engine.cpython-39.pyc -------------------------------------------------------------------------------- /SeeAct/demo_utils/__pycache__/llava_engine.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AI-secure/AdvAgent/e01316e33d11ab145daa7cc7ec12d6ddb0bf38a0/SeeAct/demo_utils/__pycache__/llava_engine.cpython-312.pyc -------------------------------------------------------------------------------- /SeeAct/demo_utils/conversation.py: -------------------------------------------------------------------------------- 1 | import dataclasses 2 | from enum import auto, Enum 3 | from typing import List, Tuple 4 | import base64 5 | from io import BytesIO 6 | from PIL import Image 7 | 8 | 9 | class SeparatorStyle(Enum): 10 | """Different separator style.""" 11 | SINGLE = auto() 12 | TWO = auto() 13 | MPT = auto() 14 | PLAIN = auto() 15 | LLAMA_2 = auto() 16 | 17 | 18 | @dataclasses.dataclass 19 | class Conversation: 20 | """A class that keeps all conversation history.""" 21 | system: str 22 | roles: List[str] 23 | messages: List[List[str]] 24 | offset: int 25 | sep_style: SeparatorStyle = SeparatorStyle.SINGLE 26 | sep: str = "###" 27 | sep2: str = None 28 | version: str = "Unknown" 29 | 30 | skip_next: bool = False 31 | 32 | def get_prompt(self): 33 | messages = self.messages 34 | if len(messages) > 0 and type(messages[0][1]) is tuple: 35 | messages = self.messages.copy() 36 | init_role, init_msg = messages[0].copy() 37 | init_msg = init_msg[0].replace("<image>", "").strip() 38 | if 'mmtag' in self.version: 39 |
messages[0] = (init_role, init_msg) 40 | messages.insert(0, (self.roles[0], "<Image><image></Image>")) 41 | messages.insert(1, (self.roles[1], "Received.")) 42 | else: 43 | messages[0] = (init_role, "<image>\n" + init_msg) 44 | 45 | if self.sep_style == SeparatorStyle.SINGLE: 46 | ret = self.system + self.sep 47 | for role, message in messages: 48 | if message: 49 | if type(message) is tuple: 50 | message, _, _ = message 51 | ret += role + ": " + message + self.sep 52 | else: 53 | ret += role + ":" 54 | elif self.sep_style == SeparatorStyle.TWO: 55 | seps = [self.sep, self.sep2] 56 | ret = self.system + seps[0] 57 | for i, (role, message) in enumerate(messages): 58 | if message: 59 | if type(message) is tuple: 60 | message, _, _ = message 61 | ret += role + ": " + message + seps[i % 2] 62 | else: 63 | ret += role + ":" 64 | elif self.sep_style == SeparatorStyle.MPT: 65 | ret = self.system + self.sep 66 | for role, message in messages: 67 | if message: 68 | if type(message) is tuple: 69 | message, _, _ = message 70 | ret += role + message + self.sep 71 | else: 72 | ret += role 73 | elif self.sep_style == SeparatorStyle.LLAMA_2: 74 | wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n" if len(msg) > 0 else msg 75 | wrap_inst = lambda msg: f"[INST] {msg} [/INST]" 76 | ret = "" 77 | 78 | for i, (role, message) in enumerate(messages): 79 | if i == 0: 80 | assert message, "first message should not be none" 81 | assert role == self.roles[0], "first message should come from user" 82 | if message: 83 | if type(message) is tuple: 84 | message, _, _ = message 85 | if i == 0: message = wrap_sys(self.system) + message 86 | if i % 2 == 0: 87 | message = wrap_inst(message) 88 | ret += self.sep + message 89 | else: 90 | ret += " " + message + " " + self.sep2 91 | else: 92 | ret += "" 93 | ret = ret.lstrip(self.sep) 94 | elif self.sep_style == SeparatorStyle.PLAIN: 95 | seps = [self.sep, self.sep2] 96 | ret = self.system 97 | for i, (role, message) in enumerate(messages): 98 | if message: 99 | if type(message) is tuple: 100 | message, _, _ = message 101 | ret += message + seps[i % 2] 102 | else: 103 | ret += "" 104 | else: 105 | raise ValueError(f"Invalid style: {self.sep_style}") 106 | 107 | return ret 108 | 109 | def append_message(self, role, message): 110 | self.messages.append([role, message]) 111 | 112 | def process_image(self, image, image_process_mode, return_pil=False, image_format='PNG', max_len=1344, min_len=672): 113 | if image_process_mode == "Pad": 114 | def expand2square(pil_img, background_color=(122, 116, 104)): 115 | width, height = pil_img.size 116 | if width == height: 117 | return pil_img 118 | elif width > height: 119 | result = Image.new(pil_img.mode, (width, width), background_color) 120 | result.paste(pil_img, (0, (width - height) // 2)) 121 | return result 122 | else: 123 | result = Image.new(pil_img.mode, (height, height), background_color) 124 | result.paste(pil_img, ((height - width) // 2, 0)) 125 | return result 126 | image = expand2square(image) 127 | elif image_process_mode in ["Default", "Crop"]: 128 | pass 129 | elif image_process_mode == "Resize": 130 | image = image.resize((336, 336)) 131 | else: 132 | raise ValueError(f"Invalid image_process_mode: {image_process_mode}") 133 | if max(image.size) > max_len: 134 | max_hw, min_hw = max(image.size), min(image.size) 135 | aspect_ratio = max_hw / min_hw 136 | shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw)) 137 | longest_edge = int(shortest_edge * aspect_ratio) 138 | W, H = image.size 139 | if H > W: 140 | H, W = longest_edge, shortest_edge 141 |
else: 142 | H, W = shortest_edge, longest_edge 143 | image = image.resize((W, H)) 144 | if return_pil: 145 | return image 146 | else: 147 | buffered = BytesIO() 148 | image.save(buffered, format=image_format) 149 | img_b64_str = base64.b64encode(buffered.getvalue()).decode() 150 | return img_b64_str 151 | 152 | def get_images(self, return_pil=False): 153 | images = [] 154 | for i, (role, msg) in enumerate(self.messages[self.offset:]): 155 | if i % 2 == 0: 156 | if type(msg) is tuple: 157 | msg, image, image_process_mode = msg 158 | image = self.process_image(image, image_process_mode, return_pil=return_pil) 159 | images.append(image) 160 | return images 161 | 162 | def to_gradio_chatbot(self): 163 | ret = [] 164 | for i, (role, msg) in enumerate(self.messages[self.offset:]): 165 | if i % 2 == 0: 166 | if type(msg) is tuple: 167 | msg, image, image_process_mode = msg 168 | img_b64_str = self.process_image( 169 | image, "Default", return_pil=False, 170 | image_format='JPEG') 171 | img_str = f'<img src="data:image/jpeg;base64,{img_b64_str}" alt="user upload image" />' 172 | msg = img_str + msg.replace('<image>', '').strip() 173 | ret.append([msg, None]) 174 | else: 175 | ret.append([msg, None]) 176 | else: 177 | ret[-1][-1] = msg 178 | return ret 179 | 180 | def copy(self): 181 | return Conversation( 182 | system=self.system, 183 | roles=self.roles, 184 | messages=[[x, y] for x, y in self.messages], 185 | offset=self.offset, 186 | sep_style=self.sep_style, 187 | sep=self.sep, 188 | sep2=self.sep2, 189 | version=self.version) 190 | 191 | def dict(self): 192 | if len(self.get_images()) > 0: 193 | return { 194 | "system": self.system, 195 | "roles": self.roles, 196 | "messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages], 197 | "offset": self.offset, 198 | "sep": self.sep, 199 | "sep2": self.sep2, 200 | } 201 | return { 202 | "system": self.system, 203 | "roles": self.roles, 204 | "messages": self.messages, 205 | "offset": self.offset, 206 | "sep": self.sep, 207 | "sep2": self.sep2, 208 | } 209 | 210 | 211 | conv_vicuna_v0 = Conversation( 212 | system="A chat between a curious human and an artificial intelligence assistant. " 213 | "The assistant gives helpful, detailed, and polite answers to the human's questions.", 214 | roles=("Human", "Assistant"), 215 | messages=( 216 | ("Human", "What are the key differences between renewable and non-renewable energy sources?"), 217 | ("Assistant", 218 | "Renewable energy sources are those that can be replenished naturally in a relatively " 219 | "short amount of time, such as solar, wind, hydro, geothermal, and biomass. " 220 | "Non-renewable energy sources, on the other hand, are finite and will eventually be " 221 | "depleted, such as coal, oil, and natural gas. Here are some key differences between " 222 | "renewable and non-renewable energy sources:\n" 223 | "1. Availability: Renewable energy sources are virtually inexhaustible, while non-renewable " 224 | "energy sources are finite and will eventually run out.\n" 225 | "2. Environmental impact: Renewable energy sources have a much lower environmental impact " 226 | "than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, " 227 | "and other negative effects.\n" 228 | "3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically " 229 | "have lower operational costs than non-renewable sources.\n" 230 | "4. Reliability: Renewable energy sources are often more reliable and can be used in more remote " 231 | "locations than non-renewable sources.\n" 232 | "5. 
Flexibility: Renewable energy sources are often more flexible and can be adapted to different " 233 | "situations and needs, while non-renewable sources are more rigid and inflexible.\n" 234 | "6. Sustainability: Renewable energy sources are more sustainable over the long term, while " 235 | "non-renewable sources are not, and their depletion can lead to economic and social instability.\n") 236 | ), 237 | offset=2, 238 | sep_style=SeparatorStyle.SINGLE, 239 | sep="###", 240 | ) 241 | 242 | conv_vicuna_v1 = Conversation( 243 | system="A chat between a curious user and an artificial intelligence assistant. " 244 | "The assistant gives helpful, detailed, and polite answers to the user's questions.", 245 | roles=("USER", "ASSISTANT"), 246 | version="v1", 247 | messages=(), 248 | offset=0, 249 | sep_style=SeparatorStyle.TWO, 250 | sep=" ", 251 | sep2="</s>", 252 | ) 253 | 254 | conv_llama_2 = Conversation( 255 | system="""You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. 256 | 257 | If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.""", 258 | roles=("USER", "ASSISTANT"), 259 | version="llama_v2", 260 | messages=(), 261 | offset=0, 262 | sep_style=SeparatorStyle.LLAMA_2, 263 | sep="<s>", 264 | sep2="</s>", 265 | ) 266 | 267 | conv_llava_llama_2 = Conversation( 268 | system="You are a helpful language and vision assistant. " 269 | "You are able to understand the visual content that the user provides, " 270 | "and assist the user with a variety of tasks using natural language.", 271 | roles=("USER", "ASSISTANT"), 272 | version="llama_v2", 273 | messages=(), 274 | offset=0, 275 | sep_style=SeparatorStyle.LLAMA_2, 276 | sep="<s>", 277 | sep2="</s>", 278 | ) 279 | 280 | conv_mpt = Conversation( 281 | system="""<|im_start|>system 282 | A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""", 283 | roles=("<|im_start|>user\n", "<|im_start|>assistant\n"), 284 | version="mpt", 285 | messages=(), 286 | offset=0, 287 | sep_style=SeparatorStyle.MPT, 288 | sep="<|im_end|>", 289 | ) 290 | 291 | conv_llava_plain = Conversation( 292 | system="", 293 | roles=("", ""), 294 | messages=( 295 | ), 296 | offset=0, 297 | sep_style=SeparatorStyle.PLAIN, 298 | sep="\n", 299 | ) 300 | 301 | conv_llava_v0 = Conversation( 302 | system="A chat between a curious human and an artificial intelligence assistant. " 303 | "The assistant gives helpful, detailed, and polite answers to the human's questions.", 304 | roles=("Human", "Assistant"), 305 | messages=( 306 | ), 307 | offset=0, 308 | sep_style=SeparatorStyle.SINGLE, 309 | sep="###", 310 | ) 311 | 312 | conv_llava_v0_mmtag = Conversation( 313 | system="A chat between a curious user and an artificial intelligence assistant. " 314 | "The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
315 | "The visual content will be provided with the following format: <Image>visual content</Image>.", 316 | roles=("Human", "Assistant"), 317 | messages=( 318 | ), 319 | offset=0, 320 | sep_style=SeparatorStyle.SINGLE, 321 | sep="###", 322 | version="v0_mmtag", 323 | ) 324 | 325 | conv_llava_v1 = Conversation( 326 | system="A chat between a curious human and an artificial intelligence assistant. " 327 | "The assistant gives helpful, detailed, and polite answers to the human's questions.", 328 | roles=("USER", "ASSISTANT"), 329 | version="v1", 330 | messages=(), 331 | offset=0, 332 | sep_style=SeparatorStyle.TWO, 333 | sep=" ", 334 | sep2="</s>", 335 | ) 336 | 337 | conv_llava_v1_mmtag = Conversation( 338 | system="A chat between a curious user and an artificial intelligence assistant. " 339 | "The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language." 340 | "The visual content will be provided with the following format: <Image>visual content</Image>.", 341 | roles=("USER", "ASSISTANT"), 342 | messages=(), 343 | offset=0, 344 | sep_style=SeparatorStyle.TWO, 345 | sep=" ", 346 | sep2="</s>", 347 | version="v1_mmtag", 348 | ) 349 | 350 | conv_mistral_instruct = Conversation( 351 | system="", 352 | roles=("USER", "ASSISTANT"), 353 | version="llama_v2", 354 | messages=(), 355 | offset=0, 356 | sep_style=SeparatorStyle.LLAMA_2, 357 | sep="", 358 | sep2="</s>", 359 | ) 360 | 361 | conv_chatml_direct = Conversation( 362 | system="""<|im_start|>system 363 | Answer the questions.""", 364 | roles=("<|im_start|>user\n", "<|im_start|>assistant\n"), 365 | version="mpt", 366 | messages=(), 367 | offset=0, 368 | sep_style=SeparatorStyle.MPT, 369 | sep="<|im_end|>", 370 | ) 371 | 372 | default_conversation = conv_vicuna_v1 373 | conv_templates = { 374 | "default": conv_vicuna_v0, 375 | "v0": conv_vicuna_v0, 376 | "v1": conv_vicuna_v1, 377 | "vicuna_v1": conv_vicuna_v1, 378 | "llama_2": conv_llama_2, 379 | "mistral_instruct": conv_mistral_instruct, 380 | "chatml_direct": conv_chatml_direct, 381 | "mistral_direct": conv_chatml_direct, 382 | 383 | "plain": conv_llava_plain, 384 | "v0_plain": conv_llava_plain, 385 | "llava_v0": conv_llava_v0, 386 | "v0_mmtag": conv_llava_v0_mmtag, 387 | "llava_v1": conv_llava_v1, 388 | "v1_mmtag": conv_llava_v1_mmtag, 389 | "llava_llama_2": conv_llava_llama_2, 390 | 391 | "mpt": conv_mpt, 392 | } 393 | 394 | 395 | if __name__ == "__main__": 396 | print(default_conversation.get_prompt()) 397 | -------------------------------------------------------------------------------- /SeeAct/demo_utils/format_prompt.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # Copyright (c) 2024 OSU Natural Language Processing Group 3 | # 4 | # Licensed under the OpenRAIL-S License; 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # https://www.licenses.ai/ai-pubs-open-rails-vz1 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | import re 16 | 17 | prompt_dict = { 18 | "default_prompt" : """/* 19 | You are interacting with a web page. You will be given a list of elements that you can interact with. 
The actions you can take for each element are listed below. 20 | : click 21 | : click 22 | : click 23 | : value = "X" 24 | : value = "X" 25 | : value = "X" 26 | : click 27 | : click 28 | : click 29 | You can take one action each time and will be given a new list of elements. You can also choose to stop interacting with the page at any time. 30 | */ 31 | 32 | /* 33 | Now you are at http://yelp.com. Here is the list of elements that you can interact with: 34 | Help 35 | Toggle Menu 36 | Yelp for Business 37 | Write a Review 38 | Yelp 39 | Find 40 | Near 41 | Search 42 | Messages 43 | Notifications 44 | Restaurants 45 | Home Services 46 | Auto Services 47 | More 48 | Select slide 49 | Auto repair 50 | Select slide 0 51 | Select slide 1 52 | RepairSmith 53 | Dismiss card 0 54 | Dismiss card 1 55 | Brassica 56 | Fox In the Snow Cafe 57 | 1 star rating 58 | 2 star rating 59 | 3 star rating 60 | */ 61 | 62 | //OBJECTIVE: find chinese restaurants in san francisco 63 | 64 | //The list of actions you have taken: 65 | 66 | //What is your next action? 67 | document.querySelector('textbox:contains("Find")').value = "chinese"; 68 | 69 | """ 70 | } 71 | 72 | llm_prompt = [ 73 | { 74 | "role": "system", 75 | "content": "You are a helpful assistant that is great at website design, navigation, and executing tasks for the user." 76 | }, 77 | { 78 | "role": "user", 79 | "content": "'''\n Book a reservation Dine in Pickup Delivery Events Wineries Everything Celebrating and supporting leading women shaking up the industry. Explore now \n'''\n\nBased on the HTML webpage above, try to complete the following task:\nTask: Check for pickup restaurant available in Boston, NY on March 18, 5pm with just one guest\nPrevious actions:\nNone\nWhat should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):\n\nA. None of the above\nB. Book a\nC. Dine in Celebrating and supporting leading women shaking up" 80 | }, 81 | { 82 | "role": "assistant", 83 | "content": "Answer: C.\nAction: SELECT\nValue: Pickup" 84 | }, 85 | { 86 | "role": "user", 87 | "content": "'''\n Hotel Dates* Travelers 1 Adult 1 Adult Travel with a pet Clear all \n'''\n\nBased on the HTML webpage above, try to complete the following task:\nTask: Compare the fare types to book a 1-adult ticket from Springfiels, IL to Austin, TX for April 29th 2023\nPrevious actions:\n[combobox] Enter your departing city, airport name, or airpor... -> TYPE: SPRINGFIELD\n[button] Springfield, IL, US (SPI) -> CLICK\n[combobox] Enter your destination city, airport name, or airp... -> TYPE: AUSTIN\n[button] Austin, TX, US (AUS) -> CLICK\nWhat should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):\n\nA. None of the above\nB. Hotel\nC. Dates* Car Sales Buy A Car Plan Your Purchase Its Tax Refund Time. Treat Yourself to an Upgrade. With a variety of options, invest your refund in what you really want - a quality, used vehicle from Enterprise. View Inventory Enterprise Fleet Management 19 Mar 2023
Celebrating and supporting leading women shaking up the industry.
Celebrating and supporting leading women shaking up" 80 | }, 81 | { 82 | "role": "assistant", 83 | "content": "Answer: C.\nAction: SELECT\nValue: Pickup" 84 | }, 85 | { 86 | "role": "user", 87 | "content": "'''\n
1 Adult
With a variety of options, invest your refund in what you really want - a quality, used vehicle from Enterprise.