├── LICENSE ├── README.md ├── data_preprocessors ├── ai2d.py ├── chartqa.py ├── cord.py ├── deepform.py ├── docbank.py ├── docile.py ├── doclaynet.py ├── docvqa.py ├── docvqa_iq.py ├── dude.py ├── funsd.py ├── google_vision_ocr.py ├── hwsquad.py ├── iconqa.py ├── infographicvqa.py ├── klc.py ├── llavar.py ├── ocrvqa.py ├── pwc.py ├── rvlcdip.py ├── rvlcdip_io.py ├── scicap.py ├── scienceqa.py ├── screen2words.py ├── slidevqa.py ├── sroie.py ├── tabfact.py ├── tatdqa.py ├── textbookqa.py ├── utils.py ├── visualmrc.py ├── websrc.py ├── wildreceipt.py └── wtq.py ├── download.sh ├── download_scripts ├── README.md ├── ai2d.sh ├── doclaynet.sh ├── due.sh ├── funsd.sh ├── iconqa.sh ├── llavar.sh ├── screen2words.sh ├── textbookqa.sh ├── websrc.sh └── wildreceipt.sh ├── example.png ├── instructdoc_instructions.xlsx ├── merge_datasets.py └── process_data.sh /LICENSE: -------------------------------------------------------------------------------- 1 | SOFTWARE LICENSE AGREEMENT FOR EVALUATION 2 | 3 | This SOFTWARE EVALUATION LICENSE AGREEMENT (this "Agreement") is a legal contract between a person who uses or otherwise accesses or installs the Software (“User(s)”), and Nippon Telegraph and Telephone corporation ("NTT"). 4 | READ THE TERMS AND CONDITIONS OF THIS AGREEMENT CAREFULLY BEFORE INSTALLING OR OTHERWISE ACCESSING OR USING NTT'S PROPRIETARY SOFTWARE ACCOMPANIED BY THIS AGREEMENT (the "SOFTWARE"). THE SOFTWARE IS COPYRIGHTED AND IT IS LICENSED TO USER UNDER THIS AGREEMENT, NOT SOLD TO USER. BY INSTALLING OR OTHERWISE ACCESSING OR USING THE SOFTWARE, USER ACKNOWLEDGES THAT USER HAS READ THIS AGREEMENT, THAT USER UNDERSTANDS IT, AND THAT USER ACCEPTS AND AGREES TO BE BOUND BY ITS TERMS. IF AT ANY TIME USER IS NOT WILLING TO BE BOUND BY THE TERMS OF THIS AGREEMENT, USER SHOULD TERMINATE THE INSTALLATION PROCESS, IMMEDIATELY CEASE AND REFRAIN FROM ACCESSING OR USING THE SOFTWARE AND DELETE ANY COPIES USER MAY HAVE. THIS AGREEMENT REPRESENTS THE ENTIRE AGREEMENT BETWEEN USER AND NTT CONCERNING THE SOFTWARE. 5 | 6 | 7 | BACKGROUND 8 | A. NTT is the owner of all rights, including all patent rights, copyrights and trade secret rights, in and to the Software and related documentation listed in Exhibit A to this Agreement. 9 | B. User wishes to obtain a royalty free license to use the Software to enable User to evaluate, and NTT wishes to grant such a license to User, pursuant and subject to the terms and conditions of this Agreement. 10 | C. As a condition to NTT's provision of the Software to User, NTT has required User to execute this Agreement. 11 | In consideration of these premises, and the mutual promises and conditions in this Agreement, the parties hereby agree as follows: 12 | 1. Grant of Evaluation License. NTT hereby grants to User, and User hereby accepts, under the terms and conditions of this Agreement, a royalty free, nontransferable and nonexclusive license to use the Software internally for the purposes of testing, analyzing, and evaluating the methods or mechanisms as shown in the research paper submitted by NTT to a certain academy. User may make a reasonable number of backup copies of the Software solely for User's internal use pursuant to the license granted in this Section 1. 13 | 2. Shipment and Installation. NTT will ship or deliver the Software by any method that NTT deems appropriate. User shall be solely responsible for proper installation of the Software. 14 | 3. Term. 
This Agreement is effective whichever is earlier (i) upon User’s acceptance of the Agreement, or (ii) upon User’s installing, accessing, and using the Software, even if User has not expressly accepted this Agreement. Without prejudice to any other rights, NTT may terminate this Agreement without notice to User (i) if User breaches or fails to comply with any of the limitations or other requirements described herein, and (ii) if NTT receives a notice from the academy stating that the research paper would not be published, and in any such case User agrees that NTT may, in addition to any other remedies it may have at law or in equity, remotely disable the Software. User may terminate this Agreement at any time by User’s decision to terminate the Agreement to NTT and ceasing use of the Software. Upon any termination or expiration of this Agreement for any reason, User agrees to uninstall the Software and either return to NTT the Software and all copies thereof, or to destroy all such materials and provide written verification of such destruction to NTT. 15 | 4. Proprietary Rights 16 | (a) The Software is the valuable, confidential, and proprietary property of NTT, and NTT shall retain exclusive title to this property both during the term and after the termination of this Agreement. Without limitation, User acknowledges that all patent rights, copyrights and trade secret rights in the Software shall remain the exclusive property of NTT at all times. User shall use not less than reasonable care in safeguarding the confidentiality of the Software. 17 | (b) USER SHALL NOT, IN WHOLE OR IN PART, AT ANY TIME DURING THE TERM OF OR AFTER THE TERMINATION OF THIS AGREEMENT: (i) SELL, ASSIGN, LEASE, DISTRIBUTE, OR OTHERWISE TRANSFER THE SOFTWARE TO ANY THIRD PARTY; (ii) EXCEPT AS OTHERWISE PROVIDED HEREIN, COPY OR REPRODUCE THE SOFTWARE IN ANY MANNER; (iii) DISCLOSE THE SOFTWARE TO ANY THIRD PARTY, EXCEPT TO USER'S EMPLOYEES WHO REQUIRE ACCESS TO THE SOFTWARE FOR THE PURPOSES OF THIS AGREEMENT; (iv) MODIFY, DISASSEMBLE, DECOMPILE, REVERSE ENGINEER OR TRANSLATE THE SOFTWARE; OR (v) ALLOW ANY PERSON OR ENTITY TO COMMIT ANY OF THE ACTIONS DESCRIBED IN (i) THROUGH (iv) ABOVE. 18 | (c) User shall take appropriate action, by instruction, agreement, or otherwise, with respect to its employees permitted under this Agreement to have access to the Software to ensure that all of User's obligations under this Section 4 shall be satisfied. 19 | 5.  Indemnity. User shall defend, indemnify and hold harmless NTT, its agents and employees, from any loss, damage, or liability arising in connection with User's improper or unauthorized use of the Software. NTT SHALL HAVE THE SOLE RIGHT TO CONDUCT DEFEND ANY ACTTION RELATING TO THE SOFTWARE. 20 | 6. Disclaimer. THE SOFTWARE IS LICENSED TO USER "AS IS," WITHOUT ANY TRAINING, MAINTENANCE, OR SERVICE OBLIGATIONS WHATSOEVER ON THE PART OF NTT. NTT MAKES NO EXPRESS OR IMPLIED WARRANTIES OF ANY TYPE WHATSOEVER, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, OF FITNESS FOR A PARTICULAR PURPOSE AND OF NON-INFRINGEMENT ON COPYRIGHT OR ANY OTHER RIGHT OF THIRD PARTIES. USER ASSUMES ALL RISKS ASSOCIATED WITH ITS USE OF THE SOFTWARE, INCLUDING WITHOUT LIMITATION RISKS RELATING TO QUALITY, PERFORMANCE, DATA LOSS, AND UTILITY IN A PRODUCTION ENVIRONMENT. 21 | 7. Limitation of Liability. 
IN NO EVENT SHALL NTT BE LIABLE TO USER OR TO ANY THIRD PARTY FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO DAMAGES FOR PERSONAL INJURY, PROPERTY DAMAGE, LOST PROFITS, OR OTHER ECONOMIC LOSS, ARISING IN CONNECTION WITH USER'S USE OF OR INABILITY TO USE THE SOFTWARE, IN CONNECTION WITH NTT'S PROVISION OF OR FAILURE TO PROVIDE SERVICES PERTAINING TO THE SOFTWARE, OR AS A RESULT OF ANY DEFECT IN THE SOFTWARE. THIS DISCLAIMER OF LIABILITY SHALL APPLY REGARD¬LESS OF THE FORM OF ACTION THAT MAY BE BROUGHT AGAINST NTT, WHETHER IN CONTRACT OR TORT, INCLUDING WITHOUT LIMITATION ANY ACTION FOR NEGLIGENCE. USER'S SOLE REMEDY IN THE EVENT OF ANY BREACH OF THIS AGREEMENT BY NTT SHALL BE TERMINATION PURSUANT TO SECTION 3. 22 | 8. No Assignment or Sublicense. Neither this Agreement nor any right or license under this Agreement, nor the Software, may be sublicensed, assigned, or otherwise transferred by User without NTT's prior written consent. 23 | 9. General 24 | (a) If any provision, or part of a provision, of this Agreement is or becomes illegal, unenforceable, or invalidated, by operation of law or otherwise, that provision or part shall to that extent be deemed omitted, and the remainder of this Agreement shall remain in full force and effect. 25 | (b) This Agreement is the complete and exclusive statement of the agreement between the parties with respect to the subject matter hereof, and supersedes all written and oral contracts, proposals, and other communications between the parties relating to that subject matter. 26 | (c) Subject to Section 8, this Agreement shall be binding on, and shall inure to the benefit of, the respective successors and assigns of NTT and User. 27 | (d) If either party to this Agreement initiates a legal action or proceeding to enforce or interpret any part of this Agreement, the prevailing party in such action shall be entitled to recover, as an element of the costs of such action and not as damages, its attorneys' fees and other costs associated with such action or proceeding. 28 | (e) This Agreement shall be governed by and interpreted under the laws of Japan, without reference to conflicts of law principles. All disputes arising out of or in connection with this Agreement shall be finally settled by arbitration in Tokyo in accordance with the Commercial Arbitration Rules of the Japan Commercial Arbitration Association. The arbitration shall be conducted by three (3) arbitrators and in Japanese. The award rendered by the arbitrators shall be final and binding upon the parties. Judgment upon the award may be entered in any court having jurisdiction thereof. 29 | (f)   NTT shall not be liable to the User or to any third party for any delay or failure to perform NTT’s obligation set forth under this Agreement due to any cause beyond NTT’s reasonable control. 
30 |   31 | EXHIBIT A 32 | The software and related data include the following files: 33 | - data_preprocessors 34 | - download_scripts 35 | - download.sh 36 | - process_data.sh 37 | - merge_datasets.py 38 | - instructdoc_instructions.xlsx 39 | - README 40 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions 2 | This repository includes the InstructDoc dataset introduced by the following paper: Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. "InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions". In Proc. of AAAI, 2024. 3 | 4 | > We introduce InstructDoc, the first large-scale visual instruction tuning dataset that covers a wide range of VDU tasks and datasets. 5 | 6 | ![Figure 1 from paper](example.png) 7 | 8 | 9 | # Get Started 10 | ## 1. Download datasets 11 | ``` 12 | sh download.sh 13 | ``` 14 | This script downloads most of the datasets automatically. For some datasets, due to license issues and download restrictions, you need to download them manually by following the instructions in [download_scripts/README.md](download_scripts). 15 | 16 | ## 2. Preprocess datasets 17 | ``` 18 | sh process_data.sh API_KEY 19 | ``` 20 | This script preprocesses all the datasets. To extract OCR information from document images, we use the Google Vision API; set the variable API_KEY to an API key obtained from [Google Cloud Platform](https://cloud.google.com/). To get a key, follow the [quickstart guide](https://cloud.google.com/vision/docs/quickstart).
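If you want to verify your API key before running the full pipeline, the bundled OCR helper in [data_preprocessors/google_vision_ocr.py](data_preprocessors) can be called directly. A minimal sketch, run from inside `data_preprocessors/` (the scripts import the helper from that directory); the image path is a placeholder and `YOUR_API_KEY` is your own key:

```python
from PIL import Image
from google_vision_ocr import Google_OCR  # helper shipped in data_preprocessors/

ocr = Google_OCR(api_key="YOUR_API_KEY")          # key obtained from Google Cloud Platform
img = Image.open("path/to/document_image.png")    # placeholder: any downloaded document image
items = ocr.recognize_image(img)                  # raw Vision API response, or "error" on failure
if items != "error":
    # word tokens plus normalized word-level bounding boxes
    words, bboxes = ocr.extract_info(items, *img.size)
    print(words[:10])
    print(bboxes[:10])
```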

21 | If you encounter a FileNotFoundError while processing the datasets, set the --input_data_dir argument of the corresponding script in [data_preprocessors](data_preprocessors) to the correct dataset directory. 22 | 23 | ## 3. Merge preprocessed datasets 24 | ``` 25 | python merge_datasets.py --max_samples 5000 --input_data_dir processed_data --save_dir ./ 26 | ``` 27 | We randomly sampled a maximum of 5000 instances for each held-in dataset. After merging, you obtain JSON files with the following format. 28 | If a dataset provides multiple images per instance (e.g., SlideVQA), we append "_list" to the field names "image", "ocr", and "bboxes". 29 | 30 |
31 |    {
32 |       "dataset_name": dataset name,
33 |       "id": identification of the instance,
34 |       "image" or "image_list": image path,
35 |       "ocr" or "ocr_list": ocr text,
36 |       "bboxes" or "bboxes_list": [x1, y1, x2, y2, w, h],
37 |       "conversations": [
38 |         {'from': 'human', 'value': randomly sampled instruction},
39 |         {'from': 'gpt', 'value': answer}
40 |       ]
41 |    }
42 | 
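A merged file can be inspected with a few lines of Python. This is only a sketch: the output filename below is an assumption (use whatever merge_datasets.py writes into --save_dir), and the field names follow the format shown above:

```python
import json

with open("train.json") as f:   # assumed output name written by merge_datasets.py
    samples = json.load(f)

for sample in samples[:3]:
    # Multi-image datasets (e.g., SlideVQA) use the *_list variants of the fields.
    images = sample["image_list"] if "image_list" in sample else [sample["image"]]
    instruction = sample["conversations"][0]["value"]   # randomly sampled instruction
    answer = sample["conversations"][1]["value"]         # ground-truth answer
    print(sample["dataset_name"], len(images), "image(s)")
    print("  instruction:", instruction[:80])
    print("  answer:", answer)
```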
43 | 44 | # Citation 45 | 46 | You can cite it as follows: 47 | ```bibtex 48 | @inproceedings{InstructDoc2024, 49 | author = {Ryota Tanaka and 50 | Taichi Iki and 51 | Kyosuke Nishida and 52 | Kuniko Saito and 53 | Jun Suzuki}, 54 | title = {InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions}, 55 | booktitle = {AAAI}, 56 | year = {2024} 57 | } 58 | ``` 59 | 60 | If you have any questions about the paper and repository, feel free to contact Ryota Tanaka (ryota.tanaka[at]ntt.com) or open an issue! 61 | -------------------------------------------------------------------------------- /data_preprocessors/ai2d.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import glob 5 | from PIL import Image, ImageDraw, ImageFont 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, load_instructions 9 | from transformers import BertTokenizer 10 | import argparse 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.question_dir = os.path.join(args.input_data_dir, f'questions') 18 | self.ann_dir = os.path.join(args.input_data_dir, f'annotations') 19 | self.img_dir = os.path.join(args.input_data_dir, f'images') 20 | self.font = ImageFont.truetype(args.font_file, size=40) 21 | self.dataset_name = 'ai2d' 22 | self.split = ['train', 'test'] 23 | 24 | def sort_coordinate(self, bboxes): 25 | return sorted(bboxes, key=lambda k: [k[1][1], k[1][0]]) 26 | 27 | def create_data(self): 28 | train = [] 29 | test = [] 30 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 31 | with open(os.path.join(self.data_dir, 'ai2d_test_ids.csv')) as f: 32 | test_ids = f.read().splitlines() 33 | for i, file in enumerate(tqdm(sorted(os.listdir(self.question_dir)))): 34 | file_path = os.path.join(self.question_dir, file) 35 | with open(file_path, 'r', encoding='utf-8') as f: 36 | data = json.load(f) 37 | annotation_path = os.path.join(self.ann_dir, file) 38 | with open(annotation_path, 'r') as f: 39 | ann = json.load(f) 40 | 41 | index = file.replace('.png.json', '') 42 | split = 'test' if str(index) in test_ids else 'train' 43 | 44 | image_path = os.path.join(self.img_dir, file) 45 | image_path = image_path.replace('.json', '') 46 | img = Image.open(image_path) 47 | draw = ImageDraw.Draw(img) 48 | 49 | for index, text in ann['text'].items(): 50 | replacement_text = text['replacementText'] 51 | bbox = text['rectangle'] 52 | bbox = [bbox[0][0], bbox[0][1], bbox[1][0], bbox[1][1]] 53 | text = text['value'] 54 | x1, y1, x2, y2 = bbox 55 | draw.rectangle((x1, y1, x2, y2), outline="lime", width=4) 56 | draw.text((x1, y1-30), replacement_text, font=self.font, fill="blue", align="center") 57 | 58 | image_path = os.path.join(self.out_data_dir, 'draw_images', f'{file.replace(".json", "")}') 59 | os.makedirs(os.path.dirname(image_path), exist_ok=True) 60 | img.save(image_path) 61 | 62 | for question, item in data['questions'].items(): 63 | options = item['answerTexts'] 64 | answer_index = item['correctAnswer'] 65 | value = options[answer_index] 66 | 67 | instruction = random.choice(instructions) 68 | instruction = instruction.replace('', question).replace('', str(options)) 69 | file_name = os.path.abspath(image_path) 70 | metadata = { 71 | "image": file_name, 72 | "conversations": 
[ 73 | {'from': 'human', 'value': instruction}, 74 | {'from': 'gpt', 'value': f"{value}"}, 75 | ], 76 | } 77 | if split == 'train': 78 | train.append(metadata) 79 | elif split == 'test': 80 | test.append(metadata) 81 | 82 | for split, results in [('train', train), ('test', test)]: 83 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 84 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 85 | 86 | print(f'{split}: {len(results)}') 87 | with open(out_filepath, "w") as f: 88 | json.dump(results, f) 89 | 90 | if __name__ == '__main__': 91 | parser = argparse.ArgumentParser() 92 | parser.add_argument('--input_data_dir', default='raw_datasets/ai2d', type=str) 93 | parser.add_argument('--out_data_dir', default='processed_data/ai2d', type=str) 94 | parser.add_argument('--font_file', default='GoNotoCurrent.ttf', type=str) 95 | args = parser.parse_args() 96 | 97 | dataset = InstructData(args) 98 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/chartqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image 7 | from tqdm import tqdm 8 | from pathlib import Path 9 | from utils import load_instructions 10 | from google_vision_ocr import Google_OCR 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 18 | self.dataset_name = 'chartqa' 19 | self.google_ocr = Google_OCR(args.api_key) 20 | self.split = ['train', 'val', 'test'] 21 | os.makedirs(self.ocr_dir, exist_ok=True) 22 | 23 | def create_data(self): 24 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 25 | for split in self.split: 26 | target_format = [] 27 | for qa_type in ['human', 'augmented']: 28 | file_name = os.path.join(self.data_dir, f'{split}/{split}_{qa_type}.json') 29 | with open(file_name, 'r') as f: 30 | data = json.load(f) 31 | for d in tqdm(data): 32 | image_name = d['imgname'] 33 | image_path = os.path.join(self.data_dir, f'{split}/png/{image_name}') 34 | ocr_path = os.path.join(self.ocr_dir, f'{image_name.replace(".png", ".json")}') 35 | try: 36 | img = Image.open(image_path) 37 | img_w, img_h = img.size 38 | if not os.path.exists(ocr_path): 39 | items = self.google_ocr.recognize_image(img) 40 | if items == "error": 41 | print('OCR error: ', image_path) 42 | continue 43 | with open(ocr_path, 'w') as f: 44 | json.dump(items, f) 45 | else: 46 | with open(ocr_path, 'r') as f: 47 | items = json.load(f) 48 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 49 | except: 50 | words, bboxes = [], [] 51 | 52 | question = d['query'] 53 | value = d['label'] 54 | instruction = random.choice(instructions) 55 | instruction = instruction.replace('', question) 56 | ocr = ' '.join(words) 57 | 58 | file_name = os.path.abspath(image_path) 59 | target_format.append({ 60 | "image": file_name, 61 | "ocr": ocr, 62 | "bboxes": bboxes, 63 | "conversations": [ 64 | {'from': 'human', 'value': instruction}, 65 | {'from': 'gpt', 'value': value}, 66 | ], 67 | }) 68 | 69 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 70 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 71 | 72 | print(f'{split}: {len(target_format)}') 73 | with open(out_filepath, "w") as 
f: 74 | json.dump(target_format, f) 75 | 76 | if __name__ == '__main__': 77 | parser = argparse.ArgumentParser() 78 | parser.add_argument('--input_data_dir', default='raw_datasets/chartqa', type=str) 79 | parser.add_argument('--out_data_dir', default='processed_data/chartqa', type=str) 80 | parser.add_argument('--api_key', type=str, help='google vision api key') 81 | args = parser.parse_args() 82 | 83 | dataset = InstructData(args) 84 | dataset.create_data() 85 | -------------------------------------------------------------------------------- /data_preprocessors/cord.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | 5 | from PIL import Image 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, sort_coordinate, load_instructions 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'cord' 17 | self.split = ['train', 'dev', 'test'] 18 | 19 | def create_data(self): 20 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 21 | for split in self.split: 22 | target_format = [] 23 | ann_dir = os.path.join(self.data_dir, f'{split}/json') 24 | img_dir = os.path.join(self.data_dir, f'{split}/image') 25 | for file in tqdm(sorted(os.listdir(ann_dir))): 26 | file_path = os.path.join(ann_dir, file) 27 | with open(file_path, 'r', encoding='utf-8') as f: 28 | data = json.load(f) 29 | 30 | image_path = os.path.join(img_dir, file) 31 | image_path = image_path.replace('.json', '.png') 32 | image = Image.open(image_path) 33 | w, h = image.size 34 | 35 | items = [] 36 | labels = {} 37 | for item in data["valid_line"]: 38 | words, label = item["words"], item["category"] 39 | words = [w for w in words if w["text"].strip() != ""] 40 | if len(words) == 0: 41 | continue 42 | text = " ".join([word["text"] for word in words]) 43 | bbox = [words[0]["quad"]["x1"], words[0]["quad"]["y1"], words[-1]["quad"]["x3"], words[-1]["quad"]["y3"]] 44 | bbox = normalize_bbox(bbox, w, h) 45 | items.append((text, label, bbox)) 46 | 47 | items = sort_coordinate(items) 48 | ocr = [] 49 | bboxes = [] 50 | for item in items: 51 | words, label, bbox = item 52 | labels[words] = label 53 | ocr.append(words) 54 | bbox = [bbox] * len(words.split()) 55 | bboxes += bbox 56 | ocr = ' '.join(ocr) 57 | 58 | for key in labels: 59 | instruction = random.choice(instructions) 60 | instruction = instruction.replace('', key) 61 | value = labels[key] 62 | 63 | file_name = os.path.abspath(image_path) 64 | target_format.append({ 65 | "image": file_name, 66 | "ocr": ocr, 67 | "bboxes": bboxes, 68 | "conversations": [ 69 | {'from': 'human', 'value': instruction}, 70 | {'from': 'gpt', 'value': value}, 71 | ], 72 | }) 73 | 74 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 75 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 76 | 77 | with open(out_filepath, "w") as f: 78 | json.dump(target_format, f) 79 | 80 | if __name__ == '__main__': 81 | parser = argparse.ArgumentParser() 82 | parser.add_argument('--input_data_dir', default='raw_datasets/cord/CORD', type=str) 83 | parser.add_argument('--out_data_dir', default='processed_data/cord', type=str) 84 | args = parser.parse_args() 85 | 86 | dataset = InstructData(args) 87 | dataset.create_data() 
-------------------------------------------------------------------------------- /data_preprocessors/deepform.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'deepform' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | break 43 | return ocrs 44 | 45 | def create_data(self): 46 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 47 | for split in self.split: 48 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 49 | with open(file_name, 'r') as f: 50 | data = f.readlines() 51 | 52 | ocrs = self.create_ocr_data(split) 53 | target_format = [] 54 | for d in tqdm(data): 55 | d = json.loads(d) 56 | image_name = d['name'].replace('.pdf', '') 57 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 58 | file_name = os.path.abspath(file_name) 59 | for ann in d['annotations']: 60 | instruction = random.choice(instructions) 61 | if 'children' in ann['values'][0]: 62 | for v in ann['values']: 63 | for child in v['children']: 64 | value = child['key'] 65 | key = child['values'][0]['value'] 66 | instruction = instruction.replace('', key) 67 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 68 | 69 | target_format.append({ 70 | "image": file_name, 71 | "ocr": ocr, 72 | "bboxes": bboxes, 73 | "conversations": [ 74 | {'from': 'human', 'value': instruction}, 75 | {'from': 'gpt', 'value': value}, 76 | ], 77 | }) 78 | 79 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 80 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 81 | 82 | print(f'{split}: {len(target_format)}') 83 | with open(out_filepath, "w") as f: 84 | json.dump(target_format, f) 85 | 86 | if __name__ == '__main__': 87 | parser = argparse.ArgumentParser() 88 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/DeepForm', type=str) 89 | parser.add_argument('--out_data_dir', default='processed_data/deepform', type=str) 90 | args = parser.parse_args() 91 | 92 | dataset = InstructData(args) 93 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/docbank.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL 
import Image 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, sort_coordinate, load_instructions, convert_wh 8 | from transformers import BertTokenizer 9 | from collections import defaultdict 10 | import argparse 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.dataset_name = 'docbank' 18 | self.split = ['train', 'valid', 'test'] 19 | 20 | def sort_coordinate(self, bboxes): 21 | return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]]) 22 | 23 | def create_ocr_data(self, data): 24 | ocr_info = {} 25 | for image_info in tqdm(data['images']): 26 | file_name = image_info['file_name'] 27 | image_id = image_info['id'] 28 | width, height = image_info['width'], image_info['height'] 29 | 30 | image_path = os.path.join(self.data_dir, f'DocBank_500K_ori_img/{file_name}') 31 | txt_path = os.path.join(self.data_dir, f'DocBank_500K_txt/{file_name.replace("_ori.jpg", ".txt")}') 32 | with open(txt_path, 'r') as f: 33 | txt_data = f.read().splitlines() 34 | 35 | words = [] 36 | bboxes = [] 37 | for d in txt_data: 38 | d = d.split('\t') 39 | word = d[0] 40 | word_position = convert_wh([int(d[1]), int(d[2]), int(d[3]), int(d[4])]) 41 | if word_position[0] >= word_position[2] or word_position[1] >= word_position[3]: 42 | continue 43 | words.append(word) 44 | bboxes.append(word_position) 45 | 46 | text_sequence = ' '.join(words) 47 | ocr_info[image_id] = {'image_path': image_path, 'text_sequence': text_sequence, 'bboxes': bboxes, 'width': width, 'height': height} 48 | return ocr_info 49 | 50 | def create_data(self): 51 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 52 | for split in self.split: 53 | with open(os.path.join(self.data_dir, f'500K_{split}.json'), "r") as f: 54 | data = json.load(f) 55 | 56 | ocr_info = self.create_ocr_data(data) 57 | categories = data['categories'] 58 | 59 | target_format = [] 60 | annotations = defaultdict(list) 61 | for ann_info in data['annotations']: 62 | image_id = ann_info['image_id'] 63 | annotations[image_id].append(ann_info) 64 | 65 | for image_id in tqdm(annotations): 66 | image_info = ocr_info[image_id] 67 | image_path = image_info['image_path'] 68 | text_sequence = image_info['text_sequence'] 69 | bboxes = image_info['bboxes'] 70 | width, height = image_info['width'], image_info['height'] 71 | 72 | items = [] 73 | for ann in annotations[image_id]: 74 | category_id = ann['category_id'] 75 | category_name = categories[category_id-1]['name'] 76 | bbox = ann['bbox'] 77 | bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]] 78 | bbox = normalize_bbox(bbox, width, height) 79 | items.append((category_name, bbox)) 80 | items = self.sort_coordinate(items) 81 | 82 | dla = [] 83 | for item in items: 84 | category_name, bbox = item 85 | dla.append(f'{category_name} {bbox}') 86 | value = ' '.join(dla) 87 | 88 | instruction = random.choice(instructions) 89 | file_name = os.path.abspath(image_path) 90 | 91 | target_format.append({ 92 | "image": file_name, 93 | "ocr": text_sequence, 94 | "bboxes": bboxes, 95 | "conversations": [ 96 | {'from': 'human','value': instruction}, 97 | {'from': 'gpt', 'value': value}, 98 | ], 99 | }) 100 | 101 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 102 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 103 | 104 | print(f'{split}: {len(target_format)}') 105 | with 
open(out_filepath, "w") as f: 106 | json.dump(target_format, f) 107 | 108 | if __name__ == '__main__': 109 | parser = argparse.ArgumentParser() 110 | parser.add_argument('--input_data_dir', default='raw_datasets/docbank', type=str) 111 | parser.add_argument('--out_data_dir', default='processed_data/docbank', type=str) 112 | args = parser.parse_args() 113 | 114 | dataset = InstructData(args) 115 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/docile.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import sort_coordinate, load_instructions, normalize_bbox 8 | import argparse 9 | from collections import defaultdict 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'docile' 17 | self.ann_dir = os.path.join(args.input_data_dir, f'annotations') 18 | self.img_dir = os.path.join(args.input_data_dir, f'images') 19 | self.ocr_dir = os.path.join(args.input_data_dir, f'ocr') 20 | self.split = ['train', 'val'] 21 | 22 | def extract_ocr_info(self, ocr_data): 23 | tokens = [] 24 | bboxes = [] 25 | for page in ocr_data['pages']: 26 | for block in page['blocks']: 27 | for line in block['lines']: 28 | for word in line['words']: 29 | left_top, right_bottom = word['geometry'] 30 | bbox = normalize_bbox([left_top[0], left_top[1], right_bottom[0], right_bottom[1]]) 31 | bboxes.append(bbox) 32 | tokens.append(word['value']) 33 | return tokens, bboxes 34 | 35 | def create_data(self): 36 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 37 | for split in self.split: 38 | file_name = os.path.join(self.data_dir, f'{split}.json') 39 | with open(file_name, 'r') as f: 40 | ann_filenames = json.load(f) 41 | 42 | target_format = [] 43 | for id, file in enumerate(tqdm(ann_filenames)): 44 | image_path = os.path.join(self.img_dir, file + '0001-1.jpg') 45 | with open(os.path.join(self.ocr_dir, f'{file}.json'), 'r', encoding='utf-8') as f: 46 | ocr_data = json.load(f) 47 | with open(os.path.join(self.ann_dir, f'{file}.json'), 'r', encoding='utf-8') as f: 48 | d = json.load(f) 49 | 50 | items = [] 51 | for item in d["field_extractions"]: 52 | if item["page"] == 0: 53 | text, label = item["text"], item["fieldtype"] 54 | bbox = item["bbox"] 55 | items.append((text, label, bbox)) 56 | if len(items) == 0: 57 | continue 58 | items = sort_coordinate(items) 59 | 60 | labels = {} 61 | for item in items: 62 | tokens, label, bbox = item 63 | labels[tokens] = label 64 | 65 | tokens, bboxes = self.extract_ocr_info(ocr_data) 66 | ocr = ' '.join(tokens) 67 | 68 | for key in labels: 69 | instruction = random.choice(instructions) 70 | instruction = instruction.replace('', key) 71 | value = labels[key] 72 | 73 | file_name = os.path.abspath(image_path) 74 | target_format.append({ 75 | "image": file_name, 76 | "ocr": ocr, 77 | "bboxes": bboxes, 78 | "conversations": [ 79 | {'from': 'human', 'value': instruction}, 80 | {'from': 'gpt', 'value': value}, 81 | ], 82 | }) 83 | 84 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 85 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 86 | 87 | print(f'{split}: {len(target_format)}') 88 | with 
open(out_filepath, "w") as f: 89 | json.dump(target_format, f) 90 | 91 | if __name__ == '__main__': 92 | parser = argparse.ArgumentParser() 93 | parser.add_argument('--input_data_dir', default='raw_datasets/docile/data/docile', type=str) 94 | parser.add_argument('--out_data_dir', default='processed_data/docile', type=str) 95 | args = parser.parse_args() 96 | 97 | dataset = InstructData(args) 98 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/doclaynet.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, sort_coordinate, load_instructions, convert_wh 8 | from collections import defaultdict 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'doclaynet' 17 | self.split = ['train', 'val'] 18 | 19 | def sort_coordinate(self, bboxes): 20 | return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]]) 21 | 22 | def create_ocr_data(self, data): 23 | ocr_info = {} 24 | for image_info in data['images']: 25 | file_name = image_info['file_name'] 26 | image_id = image_info['id'] 27 | image_path = os.path.join(self.data_dir, f'PNG/{file_name}') 28 | json_path = os.path.join(self.data_dir, f'JSON/{file_name.replace(".png", ".json")}') 29 | width, height = image_info['width'], image_info['height'] 30 | with open(json_path, 'r') as f: 31 | json_data = json.load(f) 32 | items = [] 33 | 34 | for cell in json_data['cells']: 35 | text = cell['text'] 36 | bbox = cell['bbox'] 37 | bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]] 38 | bbox = convert_wh(normalize_bbox(bbox, width, height)) 39 | items.append((text, bbox)) 40 | 41 | items = self.sort_coordinate(items) 42 | words = [] 43 | bboxes = [] 44 | for text, bbox in items: 45 | words.append(text) 46 | bboxes += bbox 47 | text_sequence = ' '.join(words) 48 | ocr_info[image_id] = {'image_path': image_path, 'text_sequence': text_sequence, 'bboxes': bboxes, 'width': width, 'height': height} 49 | break 50 | return ocr_info 51 | 52 | def create_data(self): 53 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 54 | for split in self.split: 55 | with open(os.path.join(self.data_dir, f'COCO/{split}.json'), "r") as f: 56 | data = json.load(f) 57 | ocr_info = self.create_ocr_data(data) 58 | categories = data['categories'] 59 | 60 | target_format = [] 61 | annotations = defaultdict(list) 62 | for ann_info in data['annotations']: 63 | image_id = ann_info['image_id'] 64 | annotations[image_id].append(ann_info) 65 | 66 | for image_id in tqdm(annotations): 67 | image_info = ocr_info[image_id] 68 | image_path = image_info['image_path'] 69 | text_sequence = image_info['text_sequence'] 70 | bboxes = image_info['bboxes'] 71 | width, height = image_info['width'], image_info['height'] 72 | 73 | items = [] 74 | for ann in annotations[image_id]: 75 | category_id = ann['category_id'] 76 | category_name = categories[category_id-1]['name'] 77 | bbox = ann['bbox'] 78 | bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3], bbox[2], bbox[3]] 79 | bbox = normalize_bbox(bbox, width, height) 80 | items.append((category_name, bbox)) 81 | items = self.sort_coordinate(items) 82 | 83 | 
dla = [] 84 | for item in items: 85 | category_name, bbox = item 86 | dla.append(f'{category_name} {bbox}') 87 | value = ' '.join(dla) 88 | 89 | instruction = random.choice(instructions) 90 | file_name = os.path.abspath(image_path) 91 | 92 | target_format.append({ 93 | "image": file_name, 94 | "ocr": text_sequence, 95 | "bboxes": bboxes, 96 | "conversations": [ 97 | {'from': 'human','value': instruction}, 98 | {'from': 'gpt', 'value': value}, 99 | ], 100 | }) 101 | 102 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 103 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 104 | 105 | print(f'{split}: {len(target_format)}') 106 | with open(out_filepath, "w") as f: 107 | json.dump(target_format, f) 108 | 109 | if __name__ == '__main__': 110 | parser = argparse.ArgumentParser() 111 | parser.add_argument('--input_data_dir', default='raw_datasets/doclaynet', type=str) 112 | parser.add_argument('--out_data_dir', default='processed_data/doclaynet', type=str) 113 | args = parser.parse_args() 114 | 115 | dataset = InstructData(args) 116 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/docvqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'docvqa' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | return ocrs 43 | 44 | def create_data(self): 45 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 46 | for split in self.split: 47 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 48 | with open(file_name, 'r') as f: 49 | data = f.readlines() 50 | 51 | ocrs = self.create_ocr_data(split) 52 | target_format = [] 53 | for d in tqdm(data): 54 | d = json.loads(d) 55 | image_name = d['name'].replace('.pdf', '') 56 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 57 | file_name = os.path.abspath(file_name) 58 | for ann in d['annotations']: 59 | instruction = random.choice(instructions) 60 | question = ann['key'] 61 | instruction = instruction.replace('', question) 62 | bboxes = [] 63 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 64 | value = ann['values'][0]['value'] 65 | values = ann['values'][0]['value_variants'] 66 | 67 | target_format.append({ 68 | "image": file_name, 69 
| "ocr": ocr, 70 | "bboxes": bboxes, 71 | "conversations": [ 72 | {'from': 'human', 'instruction': instruction}, 73 | {'from': 'gpt', 'value': value, 'values': values}, 74 | ], 75 | }) 76 | 77 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 78 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 79 | 80 | print(f'{split}: {len(target_format)}') 81 | with open(out_filepath, "w") as f: 82 | json.dump(target_format, f) 83 | 84 | if __name__ == '__main__': 85 | parser = argparse.ArgumentParser() 86 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/docvqa', type=str) 87 | parser.add_argument('--out_data_dir', default='processed_data/docvqa', type=str) 88 | args = parser.parse_args() 89 | 90 | dataset = InstructData(args) 91 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/docvqa_iq.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'docvqa_iq' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | return ocrs 43 | 44 | def create_data(self): 45 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 46 | for split in self.split: 47 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 48 | with open(file_name, 'r') as f: 49 | data = f.readlines() 50 | 51 | ocrs = self.create_ocr_data(split) 52 | target_format = [] 53 | questions = [] 54 | for d in data: 55 | d = json.loads(d) 56 | for ann in d['annotations']: 57 | question = ann['key'] 58 | questions.append(question) 59 | 60 | for d in tqdm(data): 61 | d = json.loads(d) 62 | image_name = d['name'].replace('.pdf', '') 63 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 64 | file_name = os.path.abspath(file_name) 65 | for ann in d['annotations']: 66 | instruction = random.choice(instructions) 67 | if random.random() > 0.5: 68 | question = random.choice(questions) 69 | value = 'no' 70 | else: 71 | question = ann['key'] 72 | value = 'yes' 73 | 74 | instruction = instruction.replace('', question) 75 | bboxes = [] 76 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 77 | 78 | target_format.append({ 79 | "image": file_name, 80 | "ocr": ocr, 81 | "bboxes": bboxes, 82 | "conversations": [ 83 | {'from': 
'human', 'instruction': instruction}, 84 | {'from': 'gpt', 'value': value}, 85 | ], 86 | }) 87 | 88 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 89 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 90 | 91 | print(f'{split}: {len(target_format)}') 92 | with open(out_filepath, "w") as f: 93 | json.dump(target_format, f) 94 | 95 | if __name__ == '__main__': 96 | parser = argparse.ArgumentParser() 97 | parser.add_argument('--input_data_dir', default='raw_datasets/docvqa', type=str) 98 | parser.add_argument('--out_data_dir', default='processed_data/docvqa_iq', type=str) 99 | args = parser.parse_args() 100 | 101 | dataset = InstructData(args) 102 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/dude.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import glob 6 | 7 | from PIL import Image 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import normalize_bbox, load_instructions 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.image_dir = os.path.join(args.input_data_dir, 'DUDE_train-val-test_binaries/images') 18 | self.ocr_dir = os.path.join(args.input_data_dir, 'DUDE_train-val-test_binaries/OCR') 19 | self.dataset_name = 'dude' 20 | self.split = ['train', 'val'] 21 | 22 | def create_data(self): 23 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 24 | file_name = os.path.join(self.data_dir, '2023-03-23_DUDE_gt_test_PUBLIC.json') 25 | with open(file_name, 'r') as f: 26 | data = json.load() 27 | train, validation = [],[] 28 | for d in tqdm(data['data']): 29 | docid = d['docId'] 30 | question = d['question'] 31 | split = d['data_split'] 32 | if split in self.split: 33 | image_paths = [] 34 | pages = len(glob.glob(os.path.join(self.image_dir, split, f'{docid}_*.jpg'))) 35 | for i in range(pages): 36 | image_path = os.path.join(self.image_dir, split, f'{docid}_{i}.jpg') 37 | image_path = os.path.abspath(image_path) 38 | image_paths.append(image_path) 39 | 40 | ocr_path =os.path.join(self.ocr_dir, f'Azure/{docid}_due.json') 41 | try: 42 | with open(ocr_path, 'r') as f: 43 | ocr_info = json.load(f) 44 | except: 45 | continue 46 | 47 | structure_value = ocr_info['structures']['pages']['structure_value'] 48 | image_sizes = ocr_info['structures']['pages']['positions'] 49 | text_sequences = [] 50 | bboxes = [] 51 | for page_split, image_size in zip(structure_value, image_sizes): 52 | start = page_split[0] 53 | end = page_split[1] 54 | page_tokens = ' '.join(ocr_info['tokens'][start:end]) 55 | page_bboxes = [] 56 | for bbox in ocr_info['positions'][start:end]: 57 | bbox = normalize_bbox(bbox, (image_size[2], image_size[3])) 58 | page_bboxes.append(bbox) 59 | text_sequences.append(page_tokens) 60 | bboxes.append(page_bboxes) 61 | 62 | if len(text_sequences) != len(image_paths): 63 | continue 64 | 65 | instruction = random.choice(instructions) 66 | instruction = instruction.replace('', question) 67 | if 'answers' in d: 68 | value = d['answers'][0] 69 | if d['answer_type'] == 'not-answerable': 70 | d['answers'] = 'none' 71 | else: 72 | value = '' 73 | 74 | file_name = os.path.abspath(image_path) 75 | sample = { 76 | "image_list": image_paths, 77 | "ocr_list": text_sequences, 78 | 
"bboxes_list": bboxes, 79 | "conversations": [ 80 | {'from': 'human', 'value': instruction}, 81 | {'from': 'gpt', 'value': value}, 82 | ], 83 | } 84 | 85 | if split == 'train': 86 | train.append(sample) 87 | elif split == 'val': 88 | validation.append(sample) 89 | 90 | for split, target_format in [('train', train), ('validation', validation)]: 91 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 92 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 93 | 94 | print(f'{split}: {len(target_format)}') 95 | with open(out_filepath, "w") as f: 96 | json.dump(target_format, f) 97 | 98 | if __name__ == '__main__': 99 | parser = argparse.ArgumentParser() 100 | parser.add_argument('--input_data_dir', default='raw_datasets/dude', type=str) 101 | parser.add_argument('--out_data_dir', default='processed_data/dude', type=str) 102 | args = parser.parse_args() 103 | 104 | dataset = InstructData(args) 105 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/funsd.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import cv2 5 | from PIL import Image, ImageSequence 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, sort_coordinate, load_instructions 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'funsd' 17 | self.split = ['training', 'testing'] 18 | self.label_mapping = {'header': 'title', 19 | 'question': 'key', 20 | 'answer': 'value', 21 | 'other': 'other'} 22 | 23 | def create_data(self): 24 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 25 | for split in self.split: 26 | target_format = [] 27 | ann_dir = os.path.join(self.data_dir, f'{split}_data/annotations') 28 | img_dir = os.path.join(self.data_dir, f'{split}_data/images') 29 | for i, file in enumerate(tqdm(sorted(os.listdir(ann_dir)))): 30 | file_path = os.path.join(ann_dir, file) 31 | with open(file_path, 'r', encoding='utf-8') as f: 32 | data = json.load(f) 33 | 34 | image_path = os.path.join(img_dir, file) 35 | image_path = image_path.replace('.json', '.png') 36 | image = cv2.imread(image_path) 37 | h, w, _ = image.shape 38 | 39 | items = [] 40 | for item in data["form"]: 41 | text = item['text'] 42 | words, label = item["words"], item["label"] 43 | label = self.label_mapping[label] 44 | words = [w for w in words if w["text"].strip() != ""] 45 | if len(words) == 0: 46 | continue 47 | start_bbox, end_bbox = words[0]['box'], words[-1]['box'] 48 | bbox = [start_bbox[0], start_bbox[1], end_bbox[2], start_bbox[3]] 49 | bbox = normalize_bbox(bbox, w, h) 50 | items.append((text, label, bbox)) 51 | items = sort_coordinate(items) 52 | 53 | text_sequence = [] 54 | bboxes = [] 55 | labels = {} 56 | for item in items: 57 | text, label, bbox = item 58 | labels[text] = label 59 | text_sequence.append(text) 60 | bbox = [bbox] * len(text) 61 | bboxes += bbox 62 | 63 | ocr = ' '.join(text_sequence) 64 | for key in labels: 65 | instruction = random.choice(instructions) 66 | instruction = instruction.replace('', key) 67 | value = labels[key] 68 | 69 | file_name = os.path.abspath(image_path) 70 | target_format.append({ 71 | "image": file_name, 72 | "ocr": ocr, 73 | "bboxes": bboxes, 74 | "conversations": [ 75 | 
{'from': 'human', 'value': instruction}, 76 | {'from': 'gpt', 'value': value}, 77 | ], 78 | }) 79 | 80 | split = split.replace('ing', '') 81 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 82 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 83 | 84 | with open(out_filepath, "w") as f: 85 | json.dump(target_format, f) 86 | 87 | if __name__ == '__main__': 88 | parser = argparse.ArgumentParser() 89 | parser.add_argument('--input_data_dir', default='raw_datasets/funsd/dataset', type=str) 90 | parser.add_argument('--out_data_dir', default='processed_data/funsd', type=str) 91 | args = parser.parse_args() 92 | 93 | dataset = InstructData(args) 94 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/google_vision_ocr.py: -------------------------------------------------------------------------------- 1 | import base64 2 | import json 3 | from requests import Request, Session 4 | from io import BytesIO 5 | from utils import normalize_bbox 6 | 7 | class Google_OCR: 8 | def __init__(self, api_key): 9 | self.api_key = api_key 10 | 11 | def pil_image_to_base64(self, pil_image): 12 | buffered = BytesIO() 13 | pil_image.save(buffered, format="PNG") 14 | str_encode_file = base64.b64encode(buffered.getvalue()).decode("utf-8") 15 | return str_encode_file 16 | 17 | def recognize_image(self, pil_image): 18 | str_encode_file = self.pil_image_to_base64(pil_image) 19 | str_url = "https://vision.googleapis.com/v1/images:annotate?key=" 20 | str_headers = {'Content-Type': 'application/json'} 21 | str_json_data = { 22 | 'requests': [ 23 | { 24 | 'image': { 25 | 'content': str_encode_file 26 | }, 27 | 'features': [ 28 | { 29 | 'type': "TEXT_DETECTION", 30 | } 31 | ] 32 | } 33 | ] 34 | } 35 | 36 | obj_session = Session() 37 | obj_request = Request("POST", 38 | str_url + self.api_key, 39 | data=json.dumps(str_json_data), 40 | headers=str_headers 41 | ) 42 | obj_prepped = obj_session.prepare_request(obj_request) 43 | obj_response = obj_session.send(obj_prepped, 44 | verify=True, 45 | timeout=60 46 | ) 47 | 48 | if obj_response.status_code == 200: 49 | return obj_response.json() 50 | 51 | else: 52 | return "error" 53 | 54 | def extract_info(self, items, img_w, img_h): 55 | words = [] 56 | bboxes = [] 57 | for page_ocrs in items['responses'][0]['fullTextAnnotation']['pages']: 58 | for block_ocrs in page_ocrs['blocks']: 59 | for para_ocrs in block_ocrs['paragraphs']: 60 | for word_ocrs in para_ocrs['words']: 61 | char_bboxes = [] 62 | word = '' 63 | for sym_ocrs in word_ocrs['symbols']: 64 | try: 65 | bbox = sym_ocrs['boundingBox'] 66 | xmin = max(0, bbox['vertices'][0]['x']) 67 | ymin = max(0, bbox['vertices'][0]['y']) 68 | xmax = max(0, bbox['vertices'][2]['x']) 69 | ymax = max(0, bbox['vertices'][2]['y']) 70 | bbox = [xmin, ymin, xmax, ymax] 71 | except: 72 | continue 73 | word += sym_ocrs['text'] 74 | char_bboxes.append(bbox) 75 | if len(char_bboxes) > 0: 76 | x1 = [w_p[0] for w_p in char_bboxes] 77 | y1 = [w_p[1] for w_p in char_bboxes] 78 | x2 = [w_p[2] for w_p in char_bboxes] 79 | y2 = [w_p[3] for w_p in char_bboxes] 80 | word_bbox = [min(x1), min(y1), max(x2), max(y2)] 81 | if word_bbox[0] >= word_bbox[2] or word_bbox[1] >= word_bbox[3]: 82 | continue 83 | word_bbox = normalize_bbox(word_bbox, img_w, img_h) 84 | words.append(word) 85 | bboxes.append(word_bbox) 86 | return words, bboxes 87 | -------------------------------------------------------------------------------- /data_preprocessors/hwsquad.py: 
-------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import csv 6 | 7 | from PIL import Image, ImageSequence 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import normalize_bbox, load_instructions 11 | from collections import defaultdict 12 | 13 | class InstructData: 14 | def __init__(self, args): 15 | self.instruction_path = Path('instructdoc_instructions.xlsx') 16 | self.data_dir = args.input_data_dir 17 | self.out_data_dir = args.out_data_dir 18 | self.dataset_name = 'hwsquad' 19 | self.split = ['train', 'val', 'test'] 20 | 21 | def create_data(self): 22 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 23 | for split in self.split: 24 | filename = os.path.join(self.data_dir, f"HW-SQuAD_{split}_1.0.json") 25 | with open(filename, "r") as f: 26 | annotations = json.load(f) 27 | 28 | target_format = [] 29 | for ann in tqdm(annotations["data"]): 30 | qas = ann["qas"] 31 | image_path = ann["document_image"]["document_image"] 32 | h, w = ann["document_image"]["image_height"], ann["document_image"]["image_width"] 33 | 34 | words = [] 35 | bboxes = [] 36 | for item in ann["document_image"]["gold_standard_transcription"]: 37 | word = item["text"] 38 | words.append(word) 39 | bbox = [item["xmin"], item["ymin"], item["xmax"], item["ymax"]] 40 | bbox = normalize_bbox(bbox, w, h) 41 | bboxes.append(bbox) 42 | 43 | for qa in qas: 44 | question = qa["question"] 45 | start_index, end_index = qa["answers"][0]["answer_start_word_no"], qa["answers"][0]["answer_end_word_no"]+1 46 | answer = words[start_index:end_index] 47 | answer = " ".join(answer) 48 | 49 | instruction = random.choice(instructions) 50 | instruction = instruction.replace('', question) 51 | ocr = ' '.join(words) 52 | 53 | file_name = os.path.abspath(image_path) 54 | target_format.append({ 55 | "image": file_name, 56 | "ocr": ocr, 57 | "bboxes": bboxes, 58 | "conversations": [ 59 | {'from': 'human', 'value': instruction}, 60 | {'from': 'gpt', 'value': answer}, 61 | ], 62 | }) 63 | 64 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 65 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 66 | 67 | print(f'{split}: {len(target_format)}') 68 | with open(out_filepath, "w") as f: 69 | json.dump(target_format, f) 70 | 71 | if __name__ == '__main__': 72 | parser = argparse.ArgumentParser() 73 | parser.add_argument('--input_data_dir', default='raw_datasets/HW-SQuAD/HW-SQuAD_annotations', type=str) 74 | parser.add_argument('--out_data_dir', default='processed_data/hwsquad', type=str) 75 | args = parser.parse_args() 76 | 77 | dataset = InstructData(args) 78 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/iconqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import glob 5 | from PIL import Image 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, load_instructions 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'iconqa' 17 | self.split = ['train', 'val'] 18 | 19 | def create_data(self): 20 | for split in self.split: 21 | for answer_style in ['fill_in_blank', 
'choose_txt']: 22 | target_format = [] 23 | dataset_name = f'{self.dataset_name}_{answer_style}' 24 | instructions = load_instructions(self.instruction_path)[dataset_name] 25 | 26 | data_dir = os.path.join(self.data_dir, f'{split}/{answer_style}/*') 27 | for file_path in glob.glob(data_dir): 28 | data_path = os.path.join(file_path, 'data.json') 29 | image_path = os.path.join(file_path, 'image.png') 30 | with open(data_path, 'r') as f: 31 | data = json.load(f) 32 | question = data['question'] 33 | instruction = random.choice(instructions) 34 | instruction = instruction.replace('', question) 35 | if answer_style == 'fill_in_blank': 36 | value = data['answer'] 37 | else: 38 | options = data['choices'] 39 | answer_index = data['answer'] 40 | value = str(options[answer_index]) 41 | instruction = instruction.replace('', options) 42 | 43 | file_name = os.path.abspath(image_path) 44 | target_format.append({ 45 | "image": file_name, 46 | "conversations": [ 47 | {'from': 'human', 'value': instruction}, 48 | {'from': 'gpt', 'value': f"{value}"}, 49 | ], 50 | }) 51 | 52 | out_data_dir = f'{self.out_data_dir}_{answer_style}' 53 | out_filepath = os.path.join(out_data_dir, f'{split}.json') 54 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 55 | 56 | print(f'{split}: {len(target_format)}') 57 | with open(out_filepath, "w") as f: 58 | json.dump(target_format, f) 59 | 60 | if __name__ == '__main__': 61 | parser = argparse.ArgumentParser() 62 | parser.add_argument('--input_data_dir', default='raw_datasets/iconqa/iconqa_data', type=str) 63 | parser.add_argument('--out_data_dir', default='processed_data/iconqa', type=str) 64 | args = parser.parse_args() 65 | 66 | dataset = InstructData(args) 67 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/infographicvqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'infographicvqa' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | return ocrs 43 | 44 | def create_data(self): 45 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 46 | for split in self.split: 47 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 48 | with open(file_name, 'r') as f: 49 | data = f.readlines() 50 | 51 
| ocrs = self.create_ocr_data(split) 52 | target_format = [] 53 | for d in tqdm(data): 54 | d = json.loads(d) 55 | image_name = d['name'].replace('.pdf', '') 56 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 57 | file_name = os.path.abspath(file_name) 58 | for ann in d['annotations']: 59 | instruction = random.choice(instructions) 60 | question = ann['key'] 61 | instruction = instruction.replace('', question) 62 | bboxes = [] 63 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 64 | value = ann['values'][0]['value'] 65 | values = ann['values'][0]['value_variants'] 66 | 67 | target_format.append({ 68 | "image": file_name, 69 | "ocr": ocr, 70 | "bboxes": bboxes, 71 | "conversations": [ 72 | {'from': 'human', 'instruction': instruction}, 73 | {'from': 'gpt', 'value': value, 'values': values}, 74 | ], 75 | }) 76 | 77 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 78 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 79 | 80 | print(f'{split}: {len(target_format)}') 81 | with open(out_filepath, "w") as f: 82 | json.dump(target_format, f) 83 | 84 | if __name__ == '__main__': 85 | parser = argparse.ArgumentParser() 86 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/infographics_vqa', type=str) 87 | parser.add_argument('--out_data_dir', default='processed_data/infographicvqa', type=str) 88 | args = parser.parse_args() 89 | 90 | dataset = InstructData(args) 91 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/klc.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'klc' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | break 43 | return ocrs 44 | 45 | def create_data(self): 46 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 47 | for split in self.split: 48 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 49 | with open(file_name, 'r') as f: 50 | data = f.readlines() 51 | 52 | ocrs = self.create_ocr_data(split) 53 | target_format = [] 54 | for d in tqdm(data): 55 | d = json.loads(d) 56 | image_name = d['name'].replace('.pdf', '') 57 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 58 | file_name = os.path.abspath(file_name) 59 | for ann in 
d['annotations']: 60 | instruction = random.choice(instructions) 61 | if 'children' in ann['values'][0]: 62 | for v in ann['values']: 63 | for child in v['children']: 64 | value = child['key'] 65 | key = child['values'][0]['value'] 66 | instruction = instruction.replace('', key) 67 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 68 | 69 | target_format.append({ 70 | "image": file_name, 71 | "ocr": ocr, 72 | "bboxes": bboxes, 73 | "conversations": [ 74 | {'from': 'human', 'value': instruction}, 75 | {'from': 'gpt', 'value': value}, 76 | ], 77 | }) 78 | 79 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 80 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 81 | 82 | print(f'{split}: {len(target_format)}') 83 | with open(out_filepath, "w") as f: 84 | json.dump(target_format, f) 85 | 86 | if __name__ == '__main__': 87 | parser = argparse.ArgumentParser() 88 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/kleister-charity', type=str) 89 | parser.add_argument('--out_data_dir', default='processed_data/klc', type=str) 90 | args = parser.parse_args() 91 | 92 | dataset = InstructData(args) 93 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/llavar.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image, ImageSequence 7 | from tqdm import tqdm 8 | from pathlib import Path 9 | from utils import normalize_bbox, load_instructions 10 | from collections import defaultdict 11 | from google_vision_ocr import Google_OCR 12 | 13 | class InstructData: 14 | def __init__(self, args): 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 18 | self.image_dir = os.path.join(args.input_data_dir, 'images') 19 | self.google_ocr = Google_OCR(args.api_key) 20 | os.makedirs(self.ocr_dir, exist_ok=True) 21 | 22 | def create_data(self): 23 | file_name = os.path.join(self.data_dir, 'llava_instruct_150k_llavar_20k.json') 24 | with open(file_name, 'r') as f: 25 | data = json.load(f) 26 | target_format = [] 27 | for d in data: 28 | image_name = d["image"] 29 | image_path = os.path.join(self.image_dir, image_name) 30 | if not os.path.exists(image_path): 31 | continue 32 | 33 | ocr_path = os.path.join(self.ocr_dir, f"{image_name.replace('.jpg', '.json')}") 34 | try: 35 | img = Image.open(image_path) 36 | img_w, img_h = img.size 37 | if not os.path.exists(ocr_path): 38 | items = self.google_ocr.recognize_image(img) 39 | if items == 'error': 40 | print('OCR error: ', image_path) 41 | continue 42 | with open(ocr_path, 'w') as f: 43 | json.dump(items, f) 44 | else: 45 | with open(ocr_path, 'r') as f: 46 | items = json.load(f) 47 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 48 | except: 49 | words, bboxes = [], [] 50 | 51 | ocr = ' '.join(words) 52 | file_name = os.path.abspath(image_path) 53 | d["image"] = file_name 54 | d["ocr"] = ocr 55 | d["bboxes"] = bboxes 56 | target_format.append(d) 57 | 58 | out_filepath = os.path.join(self.out_data_dir, 'train.json') 59 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 60 | 61 | print(f'train: {len(target_format)}') 62 | with open(out_filepath, "w") as f: 63 | json.dump(target_format, f) 64 | 65 | if __name__ == '__main__': 66 | parser = argparse.ArgumentParser() 67 | 
parser.add_argument('--input_data_dir', default='raw_datasets/llavar', type=str) 68 | parser.add_argument('--out_data_dir', default='processed_data/llavar', type=str) 69 | parser.add_argument('--api_key', default='API_KEY', type=str) 70 | args = parser.parse_args() 71 | 72 | dataset = InstructData(args) 73 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/ocrvqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import csv 6 | 7 | from PIL import Image, ImageSequence 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import normalize_bbox, load_instructions 11 | from collections import defaultdict 12 | from google_vision_ocr import Google_OCR 13 | 14 | class InstructData: 15 | def __init__(self, args): 16 | self.instruction_path = Path('instructdoc_instructions.xlsx') 17 | self.data_dir = args.input_data_dir 18 | self.out_data_dir = args.out_data_dir 19 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 20 | self.image_dir = os.path.join(args.input_data_dir, 'images') 21 | self.dataset_name = 'ocrvqa' 22 | self.google_ocr = Google_OCR(args.api_key) 23 | self.split = ['train', 'dev', 'test'] 24 | self.split_dict = {1: 'train', 2: 'dev', 3: 'test'} 25 | os.makedirs(self.ocr_dir, exist_ok=True) 26 | 27 | def create_data(self): 28 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 29 | for split in self.split: 30 | target_format = [] 31 | file_name = os.path.join(self.data_dir, 'dataset.json') 32 | with open(file_name, 'r') as f: 33 | data = json.load(f) 34 | for image_id in tqdm(data): 35 | d = data[image_id] 36 | split_id = d['split'] 37 | if split != self.split_dict[split_id]: 38 | continue 39 | image_path = os.path.join(self.image_dir, f'{image_id}.jpg') 40 | if not os.path.exists(image_path): 41 | continue 42 | 43 | ocr_path = os.path.join(self.ocr_dir, f"{image_id}.json") 44 | try: 45 | img = Image.open(image_path) 46 | img_w, img_h = img.size 47 | if not os.path.exists(ocr_path): 48 | items = self.google_ocr.recognize_image(img) 49 | if items == "error": 50 | print('error: ', image_path) 51 | continue 52 | with open(ocr_path, 'w') as f: 53 | json.dump(items, f) 54 | else: 55 | with open(ocr_path, 'r') as f: 56 | items = json.load(f) 57 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 58 | except: 59 | words, bboxes = [], [] 60 | 61 | ocr = ' '.join(words) 62 | file_name = os.path.abspath(image_path) 63 | for question, answer in zip(d['questions'], d['answers']): 64 | instruction = random.choice(instructions) 65 | instruction = instruction.replace('', question) 66 | target_format.append({ 67 | "image": file_name, 68 | "ocr": ocr, 69 | "bboxes": bboxes, 70 | "conversations": [ 71 | {'from': 'human', 'value': instruction}, 72 | {'from': 'gpt', 'value': answer}, 73 | ], 74 | }) 75 | 76 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 77 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 78 | 79 | print(f'{split}: {len(target_format)}') 80 | with open(out_filepath, "w") as f: 81 | json.dump(target_format, f) 82 | 83 | if __name__ == '__main__': 84 | parser = argparse.ArgumentParser() 85 | parser.add_argument('--input_data_dir', default='raw_datasets/OCR-VQA-200K', type=str) 86 | parser.add_argument('--out_data_dir', default='processed_data/ocrvqa', type=str) 87 | parser.add_argument('--api_key', default='API_KEY', 
type=str) 88 | args = parser.parse_args() 89 | 90 | dataset = InstructData(args) 91 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/pwc.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'pwc' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | break 43 | return ocrs 44 | 45 | def create_data(self): 46 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 47 | for split in self.split: 48 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 49 | with open(file_name, 'r') as f: 50 | data = f.readlines() 51 | 52 | ocrs = self.create_ocr_data(split) 53 | target_format = [] 54 | for d in tqdm(data): 55 | d = json.loads(d) 56 | image_name = d['name'].replace('.pdf', '') 57 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 58 | file_name = os.path.abspath(file_name) 59 | for ann in d['annotations']: 60 | instruction = random.choice(instructions) 61 | if 'children' in ann['values'][0]: 62 | for v in ann['values']: 63 | for child in v['children']: 64 | value = child['key'] 65 | key = child['values'][0]['value'] 66 | instruction = instruction.replace('', key) 67 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 68 | 69 | target_format.append({ 70 | "image": file_name, 71 | "ocr": ocr, 72 | "bboxes": bboxes, 73 | "conversations": [ 74 | {'from': 'human', 'value': instruction}, 75 | {'from': 'gpt', 'value': value}, 76 | ], 77 | }) 78 | 79 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 80 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 81 | 82 | print(f'{split}: {len(target_format)}') 83 | with open(out_filepath, "w") as f: 84 | json.dump(target_format, f) 85 | 86 | if __name__ == '__main__': 87 | parser = argparse.ArgumentParser() 88 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/AxCell', type=str) 89 | parser.add_argument('--out_data_dir', default='processed_data/pwc', type=str) 90 | args = parser.parse_args() 91 | 92 | dataset = InstructData(args) 93 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/rvlcdip.py: 
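Reference note: infographicvqa.py, klc.py, and pwc.py above (and tabfact.py and wtq.py below) repeat essentially the same create_ocr_data logic for DUE-format annotations; one difference is that klc.py and pwc.py break out of the read loop after the first record, so only one document's OCR is kept there. The standalone sketch below restates the shared pattern, assuming documents_content.jsonl follows the layout those scripts read.

import json
from utils import normalize_bbox

def load_due_ocr(jsonl_path):
    """Map document name -> (ocr text, normalized word boxes) for a DUE-format file."""
    ocrs = {}
    with open(jsonl_path, "r") as f:
        for line in f:
            d = json.loads(line)
            name = d["name"].replace(".pdf", "")
            # Prefer the second OCR source (Microsoft CV); fall back to the first (Tesseract).
            content = d["contents"][1] if len(d["contents"]) > 1 else d["contents"][0]
            tokens, bboxes = [], []
            try:
                _, _, w, h = content["common_format"]["structures"]["pages"]["positions"][0]
                for token, bbox in zip(content["common_format"]["tokens"],
                                       content["common_format"]["positions"]):
                    bboxes.append(normalize_bbox(bbox, w, h))
                    tokens.append(token)
            except (KeyError, IndexError, ValueError):
                pass
            ocrs[name] = (" ".join(tokens), bboxes)
    return ocrs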
-------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image 7 | from tqdm import tqdm 8 | from pathlib import Path 9 | from utils import load_instructions 10 | from google_vision_ocr import Google_OCR 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 18 | self.image_dir = os.path.join(args.input_data_dir, 'images') 19 | self.dataset_name = 'rvlcdip' 20 | self.google_ocr = Google_OCR(args.api_key) 21 | self.split = ['train', 'val', 'test'] 22 | self.class_dict = {'4': "advertisement", '10': "budget", '2': "email", 23 | '8': "file_folder", '1': "form", '3': "handwritten", 24 | '11': "invoice", '0': "letter", '15': "memo", 25 | '9': "news_article", '12': "presentation", '13': "questionnaire", 26 | '14': "resume", '6': "scientific_publication", '5':"scientific_report", 27 | '7': "specification"} 28 | 29 | def create_data(self): 30 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 31 | for split in self.split: 32 | target_format = [] 33 | with open(os.path.join(self.data_dir, f'labels/{split}.txt'), 'r') as f: 34 | labels = f.read().splitlines() 35 | for label in labels: 36 | filename, label = label.split(' ') 37 | value = self.class_dict[label] 38 | image_path = os.path.join(self.image_dir, filename) 39 | ocr_path = os.path.join(self.ocr_dir, f'{file_name.replace(".tif", ".json")}') 40 | try: 41 | img = Image.open(image_path) 42 | img_w, img_h = img.size 43 | if not os.path.exists(ocr_path): 44 | items = self.google_ocr.recognize_image(img) 45 | if items == "error": 46 | print('OCR error: ', image_path) 47 | continue 48 | os.makedirs(os.dirname(ocr_path), exist_ok=True) 49 | with open(ocr_path, 'w') as f: 50 | json.dump(items, f) 51 | else: 52 | with open(ocr_path, 'r') as f: 53 | items = json.load(f) 54 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 55 | except: 56 | words, bboxes = [], [] 57 | 58 | instruction = random.choice(instructions) 59 | ocr = ' '.join(words) 60 | 61 | file_name = os.path.abspath(image_path) 62 | target_format.append({ 63 | "image": file_name, 64 | "ocr": ocr, 65 | "bboxes": bboxes, 66 | "conversations": [ 67 | {'from': 'human', 'value': instruction}, 68 | {'from': 'gpt', 'value': value}, 69 | ], 70 | }) 71 | 72 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 73 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 74 | 75 | print(f'{split}: {len(target_format)}') 76 | with open(out_filepath, "w") as f: 77 | json.dump(target_format, f) 78 | 79 | if __name__ == '__main__': 80 | parser = argparse.ArgumentParser() 81 | parser.add_argument('--input_data_dir', default='raw_datasets/rvlcdip', type=str) 82 | parser.add_argument('--out_data_dir', default='processed_data/rvlcdip', type=str) 83 | parser.add_argument('--api_key', default='API_KEY', type=str) 84 | args = parser.parse_args() 85 | 86 | dataset = InstructData(args) 87 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/rvlcdip_io.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image 7 | from tqdm import tqdm 8 
| from pathlib import Path 9 | from utils import load_instructions 10 | from google_vision_ocr import Google_OCR 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 18 | self.image_dir = os.path.join(args.input_data_dir, 'images') 19 | self.dataset_name = 'rvlcdip_io' 20 | self.google_ocr = Google_OCR(args.api_key) 21 | self.split = ['train', 'val', 'test'] 22 | 23 | def create_data(self): 24 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 25 | for split in self.split: 26 | target_format = [] 27 | ocrs = [] 28 | with open(os.path.join(self.data_dir, f'labels/{split}.txt'), 'r') as f: 29 | labels = f.read().splitlines() 30 | for label in labels: 31 | filename, label = label.split(' ') 32 | value = self.class_dict[label] 33 | image_path = os.path.join(self.image_dir, filename) 34 | ocr_path = os.path.join(self.ocr_dir, f'{file_name.replace(".tif", ".json")}') 35 | try: 36 | img = Image.open(image_path) 37 | img_w, img_h = img.size 38 | if not os.path.exists(ocr_path): 39 | items = self.google_ocr.recognize_image(img) 40 | if items == "error": 41 | print('OCR error: ', image_path) 42 | continue 43 | os.makedirs(os.dirname(ocr_path), exist_ok=True) 44 | with open(ocr_path, 'w') as f: 45 | json.dump(items, f) 46 | else: 47 | with open(ocr_path, 'r') as f: 48 | items = json.load(f) 49 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 50 | except: 51 | words, bboxes = [], [] 52 | 53 | ocr = ' '.join(words) 54 | ocrs.append((ocr, bboxes)) 55 | 56 | for label in labels: 57 | instruction = random.choice(instructions) 58 | if random.random() > 0.5: 59 | ocr, bboxes = random.choice(ocrs) 60 | value = 'no' 61 | else: 62 | value = 'yes' 63 | 64 | file_name = os.path.abspath(image_path) 65 | target_format.append({ 66 | "image": file_name, 67 | "ocr": ocr, 68 | "bboxes": bboxes, 69 | "conversations": [ 70 | {'from': 'human', 'value': instruction}, 71 | {'from': 'gpt', 'value': value}, 72 | ], 73 | }) 74 | 75 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 76 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 77 | 78 | print(f'{split}: {len(target_format)}') 79 | with open(out_filepath, "w") as f: 80 | json.dump(target_format, f) 81 | 82 | if __name__ == '__main__': 83 | parser = argparse.ArgumentParser() 84 | parser.add_argument('--input_data_dir', default='raw_datasets/rvlcdip', type=str) 85 | parser.add_argument('--out_data_dir', default='processed_data/rvlcdip_io', type=str) 86 | parser.add_argument('--api_key', default='API_KEY', type=str) 87 | args = parser.parse_args() 88 | 89 | dataset = InstructData(args) 90 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/scicap.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image 7 | from tqdm import tqdm 8 | from pathlib import Path 9 | from utils import load_instructions 10 | from google_vision_ocr import Google_OCR 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.image_dir = 
os.path.join(args.input_data_dir, 'SciCap-No-Subfig-Img') 18 | self.caption_dir = os.path.join(args.input_data_dir, 'SciCap-Caption-All') 19 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 20 | self.dataset_name = 'scicap' 21 | self.google_ocr = Google_OCR(args.api_key) 22 | self.split = ['train', 'val', 'test'] 23 | os.makedirs(self.ocr_dir, exist_ok=True) 24 | 25 | def create_data(self): 26 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 27 | ocr_info = {} 28 | for split in self.split: 29 | with open(os.path.join(self.data_dir, f'List-of-Files-for-Each-Experiments/Single-Sentence-Caption/No-Subfig/{split}/file_idx.json'), "r") as f: 30 | split_info = json.load(f) 31 | target_format = [] 32 | for file_name in tqdm(split_info): 33 | image_path = os.path.join(self.image_dir, split, file_name) 34 | caption_path = os.path.join(self.caption_dir, split, f'{file_name.replace(".png", ".json")}') 35 | ocr_path = os.path.join(self.ocr_dir, f'{file_name.replace(".png", ".json")}') 36 | 37 | with open(caption_path, "r") as f: 38 | annotation = json.load(f) 39 | try: 40 | img = Image.open(image_path) 41 | img_w, img_h = img.size 42 | if not os.path.exists(ocr_path): 43 | items = self.google_ocr.recognize_image(img) 44 | if items == "error": 45 | print('OCR error: ', image_path) 46 | continue 47 | with open(ocr_path, 'w') as f: 48 | json.dump(items, f) 49 | else: 50 | with open(ocr_path, 'r') as f: 51 | items = json.load(f) 52 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 53 | except: 54 | words, bboxes = [], [] 55 | 56 | value = annotation['1-lowercase-and-token-and-remove-figure-index']['caption'] 57 | instruction = random.choice(instructions) 58 | ocr = ' '.join(words) 59 | 60 | file_name = os.path.abspath(image_path) 61 | target_format.append({ 62 | "image": file_name, 63 | "ocr": ocr, 64 | "bboxes": bboxes, 65 | "conversations": [ 66 | {'from': 'human', 'value': instruction}, 67 | {'from': 'gpt', 'value': value}, 68 | ], 69 | }) 70 | 71 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 72 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 73 | 74 | print(f'{split}: {len(target_format)}') 75 | with open(out_filepath, "w") as f: 76 | json.dump(target_format, f) 77 | 78 | if __name__ == '__main__': 79 | parser = argparse.ArgumentParser() 80 | parser.add_argument('--input_data_dir', default='raw_datasets/scicap', type=str) 81 | parser.add_argument('--out_data_dir', default='processed_data/scicap', type=str) 82 | parser.add_argument('--api_key', default='API_KEY', type=str) 83 | args = parser.parse_args() 84 | 85 | dataset = InstructData(args) 86 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/scienceqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import glob 5 | from PIL import Image 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, sort_coordinate, load_instructions 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'scienceqa' 17 | 18 | def create_data(self): 19 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 20 | train, val, test = [],[],[] 21 | target_format = [] 22 | 
ann_filename = os.path.join(self.data_dir, 'data/scienceqa/problems.json') 23 | with open(ann_filename, 'r') as f: 24 | anns = json.load(f) 25 | for questionId, ann in tqdm(anns.items()): 26 | question = ann['question'] 27 | choices = ann['choices'] 28 | value = choices[ann['answer']] 29 | split = ann['split'] 30 | image_name = ann['image'] 31 | if image_name is None or str(image_name) == 'null': 32 | continue 33 | 34 | image_path = os.path.join(self.data_dir, split, questionId, image_name) 35 | instruction = random.choice(instructions) 36 | instruction = instruction.replace('', question).replace('', str(choices)) 37 | 38 | file_name = os.path.abspath(image_path) 39 | sample = { 40 | "image": file_name, 41 | "conversations": [ 42 | {'from': 'human', 'value': instruction}, 43 | {'from': 'gpt', 'value': f"{value}"}, 44 | ], 45 | } 46 | if split == 'train': 47 | train.append(sample) 48 | elif split == 'val': 49 | val.append(sample) 50 | elif split == 'test': 51 | test.append(sample) 52 | 53 | for split, target_format in [('train', train), ('val', val), ('test', test)]: 54 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 55 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 56 | print(f'{split}: {len(target_format)}') 57 | with open(out_filepath, "w") as f: 58 | json.dump(target_format, f) 59 | 60 | if __name__ == '__main__': 61 | parser = argparse.ArgumentParser() 62 | parser.add_argument('--input_data_dir', default='raw_datasets/scienceqa', type=str) 63 | parser.add_argument('--out_data_dir', default='processed_data/scienceqa', type=str) 64 | args = parser.parse_args() 65 | 66 | dataset = InstructData(args) 67 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/screen2words.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image 7 | from tqdm import tqdm 8 | from pathlib import Path 9 | from utils import load_instructions 10 | from google_vision_ocr import Google_OCR 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 18 | self.image_dir = os.path.join(args.input_data_dir, 'combined') 19 | self.dataset_name = 'screen2words' 20 | self.google_ocr = Google_OCR(args.api_key) 21 | self.split = ['train', 'dev'] 22 | os.makedirs(self.ocr_dir, exist_ok=True) 23 | 24 | def load_captions(self): 25 | with open(os.path.join(self.data_dir, 'screen_summaries.csv'), "r") as f: 26 | lines = f.read().splitlines() 27 | captions = {} 28 | for i, line in enumerate(lines): 29 | if i != 0: 30 | items = line.split(',') 31 | if len(items) > 2: 32 | screenId = items[0] 33 | summary = line[len(screenId)+1:] 34 | else: 35 | screenId, summary = items 36 | captions[screenId] = summary 37 | return captions 38 | 39 | def create_data(self): 40 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 41 | captions = self.load_captions() 42 | for split in self.split: 43 | target_format = [] 44 | with open(os.path.join(self.data_dir, f'split/{split}_screens.txt'), "r") as f: 45 | split_info = f.read().splitlines() 46 | for split_id in tqdm(split_info): 47 | image_path = os.path.join(self.image_dir, f'{split_id}.jpg') 48 | ocr_path = os.path.join(self.ocr_dir, f'{split_id}.json') 49 |
try: 50 | img = Image.open(image_path) 51 | img_w, img_h = img.size 52 | if not os.path.exists(ocr_path): 53 | items = self.google_ocr.recognize_image(img) 54 | if items == "error": 55 | print('OCR error: ', image_path) 56 | continue 57 | with open(ocr_path, 'w') as f: 58 | json.dump(items, f) 59 | else: 60 | with open(ocr_path, 'r') as f: 61 | items = json.load(f) 62 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 63 | except: 64 | words, bboxes = [], [] 65 | 66 | value = captions[split_id] 67 | instruction = random.choice(instructions) 68 | ocr = ' '.join(words) 69 | 70 | file_name = os.path.abspath(image_path) 71 | target_format.append({ 72 | "image": file_name, 73 | "ocr": ocr, 74 | "bboxes": bboxes, 75 | "conversations": [ 76 | {'from': 'human', 'value': instruction}, 77 | {'from': 'gpt', 'value': value}, 78 | ], 79 | }) 80 | 81 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 82 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 83 | 84 | print(f'{split}: {len(target_format)}') 85 | with open(out_filepath, "w") as f: 86 | json.dump(target_format, f) 87 | 88 | if __name__ == '__main__': 89 | parser = argparse.ArgumentParser() 90 | parser.add_argument('--input_data_dir', default='raw_datasets/screen2words', type=str) 91 | parser.add_argument('--out_data_dir', default='processed_data/screen2words', type=str) 92 | parser.add_argument('--api_key', default='API_KEY', type=str) 93 | args = parser.parse_args() 94 | 95 | dataset = InstructData(args) 96 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/slidevqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import glob 6 | 7 | from PIL import Image 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import load_instructions 11 | from google_vision_ocr import Google_OCR 12 | 13 | class InstructData: 14 | def __init__(self, args): 15 | self.instruction_path = Path('instructdoc_instructions.xlsx') 16 | self.data_dir = args.input_data_dir 17 | self.out_data_dir = args.out_data_dir 18 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 19 | self.image_dir = os.path.join(args.input_data_dir, 'images') 20 | self.dataset_name = 'slidevqa' 21 | self.google_ocr = Google_OCR(args.api_key) 22 | self.split = ['train', 'val', 'test'] 23 | os.makedirs(self.ocr_dir, exist_ok=True) 24 | 25 | def create_data(self): 26 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 27 | for split in self.split: 28 | target_format = [] 29 | file_name = os.path.join(self.data_dir, 'annotations/qa', f'{split}.jsonl') 30 | with open(file_name, 'r') as f: 31 | data = f.read().splitlines() 32 | for d in tqdm(data): 33 | question = d['question'] 34 | deck_name = d['deck_name'] 35 | value = d['answer'] 36 | image_paths = [] 37 | text_sequences = [] 38 | bboxes = [] 39 | for image_path in glob.glob(os.path.join(self.image_dir, deck_name, f'slide_*_1024.jpg')): 40 | image_path = os.path.abspath(image_path) 41 | image_name = os.path.basename(image_path) 42 | image_paths.append(image_path) 43 | ocr_path = os.path.join(self.ocr_dir, f'{deck_name}_{image_name.replace(".jpg", ".json")}') 44 | try: 45 | img = Image.open(image_path) 46 | img_w, img_h = img.size 47 | if not os.path.exists(ocr_path): 48 | items = self.google_ocr.recognize_image(img) 49 | if items == 'error': 50 | print('OCR error: ', image_path) 51 
| continue 52 | with open(ocr_path, 'w') as f: 53 | json.dump(items, f) 54 | else: 55 | with open(ocr_path, 'r') as f: 56 | items = json.load(f) 57 | words, page_bboxes = self.google_ocr.extract_info(items, img_w, img_h) 58 | except: 59 | words, page_bboxes = [], [] 60 | text_sequences.append(' '.join(words)) 61 | bboxes.append(page_bboxes) 62 | 63 | instruction = random.choice(instructions) 64 | instruction = instruction.replace('', question) 65 | 66 | file_name = os.path.abspath(image_path) 67 | target_format.append({ 68 | "image_list": image_paths, 69 | "ocr_list": text_sequences, 70 | "bboxes_list": bboxes, 71 | "conversations": [ 72 | {'from': 'human', 'value': instruction}, 73 | {'from': 'gpt', 'value': value}, 74 | ], 75 | }) 76 | 77 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 78 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 79 | 80 | print(f'{split}: {len(target_format)}') 81 | with open(out_filepath, "w") as f: 82 | json.dump(target_format, f) 83 | 84 | if __name__ == '__main__': 85 | parser = argparse.ArgumentParser() 86 | parser.add_argument('--input_data_dir', default='raw_datasets/slidevqa', type=str) 87 | parser.add_argument('--out_data_dir', default='processed_data/slidevqa', type=str) 88 | parser.add_argument('--api_key', type=str, help='google vision api key') 89 | args = parser.parse_args() 90 | 91 | dataset = InstructData(args) 92 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/sroie.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import cv2 5 | from PIL import Image, ImageSequence 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, load_instructions 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'sroie' 17 | self.split = ['train', 'test'] 18 | 19 | def sort_coordinate(self, bboxes): 20 | return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]]) 21 | 22 | def create_data(self): 23 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 24 | for split in self.split: 25 | target_format = [] 26 | ann_dir = os.path.join(self.data_dir, f'{split}/entities') 27 | img_dir = os.path.join(self.data_dir, f'{split}/img') 28 | for file in tqdm(sorted(os.listdir(ann_dir))): 29 | file_path = os.path.join(ann_dir, file) 30 | with open(file_path, 'r', encoding='utf-8') as f: 31 | labels = json.load(f) 32 | image_path = os.path.join(img_dir, file) 33 | image_path = image_path.replace('.txt', '.jpg') 34 | image = cv2.imread(image_path) 35 | h, w, _ = image.shape 36 | 37 | file_path = os.path.join(ann_dir.replace('entities', 'box'), file) 38 | text_sequence = [] 39 | bboxes = [] 40 | with open(file_path, 'r', encoding='utf-8') as f: 41 | items = [] 42 | for item in f.read().splitlines(): 43 | bbox = item.split(',')[:8] 44 | text = item[len(','.join(bbox))+1:] 45 | bbox = [int(bbox[0]), int(bbox[1]), int(bbox[4]), int(bbox[5])] 46 | bbox = normalize_bbox(bbox, w, h) 47 | items.append((text, bbox)) 48 | items = self.sort_coordinate(items) 49 | for item in items: 50 | words, bbox = item 51 | text_sequence.append(words) 52 | bbox = [bbox] * len(words.split()) 53 | bboxes += bbox 54 | 55 | ocr = ' '.join(text_sequence) 56 | for 
label in labels: 57 | instruction = random.choice(instructions) 58 | instruction = instruction.replace('', labels[label]) 59 | 60 | file_name = os.path.abspath(image_path) 61 | target_format.append({ 62 | "image": file_name, 63 | "ocr": ocr, 64 | "bboxes": bboxes, 65 | "conversations": [ 66 | {'from': 'human', 'value': instruction}, 67 | {'from': 'gpt', 'value': label}, 68 | ], 69 | }) 70 | 71 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 72 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 73 | 74 | with open(out_filepath, "w") as f: 75 | json.dump(target_format, f) 76 | 77 | if __name__ == '__main__': 78 | parser = argparse.ArgumentParser() 79 | parser.add_argument('--input_data_dir', default='raw_datasets/SROIE2019', type=str) 80 | parser.add_argument('--out_data_dir', default='processed_data/sroie', type=str) 81 | args = parser.parse_args() 82 | 83 | dataset = InstructData(args) 84 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/tabfact.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'tabfact' 16 | self.split = ['train', 'dev'] 17 | self.options = ['no', 'yes'] 18 | 19 | def create_ocr_data(self, split): 20 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 21 | with open(file_name, 'r') as f: 22 | data = f.readlines() 23 | ocrs = {} 24 | for d in data: 25 | d = json.loads(d) 26 | image_name = d['name'].replace('.pdf', '') 27 | try: 28 | content = d['contents'][1] # microsoft cv 29 | except: 30 | content = d['contents'][0] # tesseract 31 | 32 | bboxes = [] 33 | tokens = [] 34 | try: 35 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 36 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 37 | bbox = normalize_bbox(bbox, w, h) 38 | bboxes.append(bbox) 39 | tokens.append(token) 40 | except: 41 | pass 42 | ocrs[image_name] = (' '.join(tokens), bboxes) 43 | return ocrs 44 | 45 | def create_data(self): 46 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 47 | for split in self.split: 48 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 49 | with open(file_name, 'r') as f: 50 | data = f.readlines() 51 | 52 | ocrs = self.create_ocr_data(split) 53 | target_format = [] 54 | for d in tqdm(data): 55 | d = json.loads(d) 56 | image_name = d['name'].replace('.pdf', '') 57 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 58 | file_name = os.path.abspath(file_name) 59 | for ann in d['annotations']: 60 | instruction = random.choice(instructions) 61 | question = ann['key'] 62 | instruction = instruction.replace('', question) 63 | bboxes = [] 64 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 65 | value = ann['values'][0]['value'] 66 | value = self.options[int(value)] 67 | 68 | target_format.append({ 69 | "image": file_name, 70 | "ocr": ocr, 71 | "bboxes": bboxes, 72 | "conversations": [ 73 | {'from': 'human', 'instruction': instruction}, 
74 | {'from': 'gpt', 'value': value}, 75 | ], 76 | }) 77 | 78 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 79 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 80 | 81 | print(f'{split}: {len(target_format)}') 82 | with open(out_filepath, "w") as f: 83 | json.dump(target_format, f) 84 | 85 | if __name__ == '__main__': 86 | parser = argparse.ArgumentParser() 87 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/TabFact', type=str) 88 | parser.add_argument('--out_data_dir', default='processed_data/tabfact', type=str) 89 | args = parser.parse_args() 90 | 91 | dataset = InstructData(args) 92 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/tatdqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import csv 6 | 7 | from PIL import Image, ImageSequence 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import normalize_bbox, load_instructions 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.dataset_name = 'tatdqa' 18 | self.split = ['train', 'dev', 'test'] 19 | 20 | def create_data(self): 21 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 22 | for split in self.split: 23 | target_format = [] 24 | file_name = os.path.join(self.data_dir, f'tatdqa_dataset_{split}.json') 25 | with open(file_name, 'r') as f: 26 | data = json.load(f) 27 | for d in tqdm(data): 28 | uid = d['doc']['uid'] 29 | page_num = d['doc']['page'] 30 | image_path = f'{split}/{uid}_{page_num}.png' 31 | ocr_file_name = os.path.join(self.data_dir, f'{split}/{uid}.json') 32 | with open(ocr_file_name, 'r') as f: 33 | ocrs = json.load(f) 34 | 35 | text = [] 36 | bboxes = [] 37 | _, _, w, h = ocrs['pages'][page_num-1]['bbox'] 38 | for page in ocrs['pages']: 39 | for block in page['blocks']: 40 | text.append(block['text']) 41 | for bbox in block['words']['bbox_list']: 42 | bbox = normalize_bbox(bbox, w, h) 43 | bboxes.append(bbox) 44 | 45 | for qa in d['questions']: 46 | question =qa['question'] 47 | if 'answer' in qa: 48 | answer = qa['answer'] 49 | if type(qa['answer']) == list: 50 | if len(qa['answer']) > 1: 51 | answer = ', '.join(answer) 52 | else: 53 | answer = answer[0] 54 | else: 55 | answer = "" 56 | 57 | instruction = random.choice(instructions) 58 | instruction = instruction.replace('', question) 59 | ocr = ' '.join(text) 60 | 61 | file_name = os.path.abspath(image_path) 62 | target_format.append({ 63 | "image": file_name, 64 | "ocr": ocr, 65 | "bboxes": bboxes, 66 | "conversations": [ 67 | {'from': 'human', 'value': instruction}, 68 | {'from': 'gpt', 'value': answer}, 69 | ], 70 | }) 71 | 72 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 73 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 74 | 75 | print(f'{split}: {len(target_format)}') 76 | with open(out_filepath, "w") as f: 77 | json.dump(target_format, f) 78 | 79 | if __name__ == '__main__': 80 | parser = argparse.ArgumentParser() 81 | parser.add_argument('--input_data_dir', default='raw_datasets/TAT-DQA', type=str) 82 | parser.add_argument('--out_data_dir', default='processed_data/tatdqa', type=str) 83 | args = parser.parse_args() 84 | 85 | dataset = InstructData(args) 86 | 
dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/textbookqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import glob 5 | from PIL import Image 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, sort_coordinate, load_instructions 9 | from transformers import BertTokenizer 10 | import argparse 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.dataset_name = 'textbookqa' 18 | self.split = ['train', 'val', 'test'] 19 | 20 | def create_data(self): 21 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 22 | for split in self.split: 23 | target_format = [] 24 | ann_filename = f'{split}/tqa_v1_{split}.json' if split != 'test' else f'{split}/tqa_v2_{split}.json' 25 | ann_filename = os.path.join(self.data_dir, ann_filename) 26 | with open(ann_filename, 'r') as f: 27 | anns = json.load(f) 28 | for ann in tqdm(anns): 29 | questions = ann['questions'] 30 | diagram_questions = questions['diagramQuestions'] 31 | if len(diagram_questions) == 0: 32 | continue 33 | 34 | diagram_annotations = ann['diagramAnnotations'] 35 | 36 | for global_id, data in diagram_questions.items(): 37 | options = [] 38 | for option_id, choice in data['answerChoices'].items(): 39 | choice = choice['processedText'] 40 | options.append(choice) 41 | question = data['beingAsked']['processedText'] 42 | value = data['correctAnswer']['rawText'] 43 | image_path = data['imagePath'] 44 | image_name = data['imageName'] 45 | image_path = os.path.join(self.data_dir, f'{split}/{image_path}') 46 | if image_name in diagram_annotations: 47 | annotation = diagram_annotations[image_name] 48 | bboxes = [] 49 | ocr = [] 50 | for item in annotation: 51 | text, bbox = item["text"], item["rectangle"] 52 | try: 53 | bbox = [bbox[0][0], bbox[0][1], bbox[1][0], bbox[1][1]] 54 | except: 55 | continue 56 | if len(text) > 0: 57 | bboxes.append(bbox) 58 | ocr.append(text) 59 | ocr = " ".join(ocr) 60 | else: 61 | ocr = "" 62 | instruction = random.choice(instructions) 63 | instruction = instruction.replace('', question).replace('', str(options)) 64 | 65 | file_name = os.path.abspath(image_path) 66 | target_format.append({ 67 | "image": file_name, 68 | "ocr": ocr, 69 | "bboxes": bboxes, 70 | "conversations": [ 71 | {'from': 'human', 'value': instruction}, 72 | {'from': 'gpt', 'value': f"{value}"}, 73 | ], 74 | }) 75 | 76 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 77 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 78 | 79 | print(f'{split}: {len(target_format)}') 80 | with open(out_filepath, "w") as f: 81 | json.dump(target_format, f) 82 | 83 | if __name__ == '__main__': 84 | parser = argparse.ArgumentParser() 85 | parser.add_argument('--input_data_dir', default='raw_datasets/textbookqa', type=str) 86 | parser.add_argument('--out_data_dir', default='processed_data/textbookqa', type=str) 87 | args = parser.parse_args() 88 | 89 | dataset = InstructData(args) 90 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | def normalize_bbox(bbox, w=-1, 
h=-1): 4 | if w > 0 and h > 0: 5 | normalized_bbox = [ 6 | int(1000 * bbox[0] / w), 7 | int(1000 * bbox[1] / h), 8 | int(1000 * bbox[2] / w), 9 | int(1000 * bbox[3] / h), 10 | ] 11 | else: 12 | normalized_bbox = [ 13 | int(1000 * bbox[0]), 14 | int(1000 * bbox[1]), 15 | int(1000 * bbox[2]), 16 | int(1000 * bbox[3]), 17 | ] 18 | 19 | if len(bbox) == 4: 20 | return convert_wh(normalized_bbox) 21 | elif len(bbox) == 6: 22 | return normalized_bbox 23 | 24 | def convert_wh(bbox): 25 | return [bbox[0], bbox[1], bbox[2], bbox[3], abs(bbox[2]-bbox[0]), abs(bbox[3]-bbox[1])] 26 | 27 | def sort_coordinate(bboxes): 28 | return sorted(bboxes , key=lambda k: [k[2][1], k[2][0]]) 29 | 30 | def load_instructions(instruction_path): 31 | instructions = {} 32 | data = pd.read_excel(instruction_path) 33 | for d in data.values: 34 | dataset_name = d[0] 35 | insts = [] 36 | for prompt in d[3:]: 37 | if pd.isna(prompt): 38 | break 39 | insts.append(prompt) 40 | instructions[dataset_name] = insts 41 | return instructions 42 | -------------------------------------------------------------------------------- /data_preprocessors/visualmrc.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'visualmrc' 16 | self.split = ['train', 'dev', 'test'] 17 | 18 | def create_data(self): 19 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 20 | for split in self.split: 21 | file_name = os.path.join(self.data_dir, f'data/{split}.jsonl') 22 | with open(file_name, 'r') as f: 23 | data = f.readlines() 24 | target_format = [] 25 | for d in tqdm(data): 26 | d = json.loads(d) 27 | file_name = os.path.join(self.data_dir, d['image_filename']) 28 | file_name = os.path.abspath(file_name) 29 | image = Image.open(file_name) 30 | w, h = image.size 31 | 32 | words = [] 33 | bboxes = [] 34 | for bbox in d['bounding_boxes']: 35 | if 'ocr_info' in bbox: 36 | for ocr in bbox['ocr_info']: 37 | word = ocr['word'] 38 | bbox = ocr['bbox'] 39 | bbox = [bbox['x'], bbox['y'], bbox['x']+bbox['width'], bbox['y']+bbox['height']] 40 | bbox = normalize_bbox(bbox, w, h) 41 | bboxes.append(bbox) 42 | words.append(word) 43 | 44 | ocr = " ".join(words) 45 | for qa in d['qa_data']: 46 | question = qa['question']['text'] 47 | value = qa['answer']['text'] 48 | instruction = random.choice(instructions) 49 | instruction = instruction.replace('', question) 50 | 51 | target_format.append({ 52 | "image": file_name, 53 | "ocr": ocr, 54 | "bboxes": bboxes, 55 | "conversations": [ 56 | {'from': 'human', 'instruction': instruction}, 57 | {'from': 'gpt', 'value': value}, 58 | ], 59 | }) 60 | 61 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 62 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 63 | 64 | print(f'{split}: {len(target_format)}') 65 | with open(out_filepath, "w") as f: 66 | json.dump(target_format, f) 67 | 68 | if __name__ == '__main__': 69 | parser = argparse.ArgumentParser() 70 | parser.add_argument('--input_data_dir', default='raw_datasets/VisualMRC_official', type=str) 71 | parser.add_argument('--out_data_dir', 
default='processed_data/visualmrc', type=str) 72 | args = parser.parse_args() 73 | 74 | dataset = InstructData(args) 75 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/websrc.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import csv 6 | 7 | from PIL import Image 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import normalize_bbox, load_instructions 11 | from collections import defaultdict 12 | from google_vision_ocr import Google_OCR 13 | 14 | class InstructData: 15 | def __init__(self, args): 16 | self.instruction_path = Path('instructdoc_instructions.xlsx') 17 | self.data_dir = args.input_data_dir 18 | self.out_data_dir = args.out_data_dir 19 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 20 | self.dataset_name = 'websrc' 21 | self.google_ocr = Google_OCR(args.api_key) 22 | self.split = ['train', 'dev'] 23 | os.makedirs(self.ocr_dir, exist_ok=True) 24 | 25 | def load_split_info(self): 26 | file_name = os.path.join(self.data_dir, 'dataset_split.csv') 27 | with open(file_name) as f: 28 | reader = csv.reader(f) 29 | split_info = defaultdict(list) 30 | for i, row in enumerate(reader): 31 | if i == 0: 32 | continue 33 | number = '0' + row[1] if int(row[1]) < 10 else row[1] 34 | split = row[3] 35 | data_path = os.path.join(self.data_dir, f'{row[0]}/{number}/dataset.csv') 36 | split_info[split].append(data_path) 37 | return split_info 38 | 39 | def create_data(self): 40 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 41 | split_info = self.load_split_info() 42 | for split in self.split: 43 | target_format = [] 44 | for data_path in tqdm(split_info[split]): 45 | with open(data_path) as f: 46 | data_dir = os.path.dirname(data_path) 47 | reader = csv.reader(f) 48 | for i, row in enumerate(reader): 49 | if i == 0: 50 | for index, element in enumerate(row): 51 | if 'question' == element: 52 | question_index = index 53 | elif 'id' == element: 54 | id_index = index 55 | elif 'answer' == element: 56 | answer_index = index 57 | continue 58 | questionId = row[id_index] 59 | image_path = os.path.join(data_dir, f'processed_data/{questionId[2:9]}.png') 60 | img = Image.open(image_path) 61 | img_w, img_h = img.size 62 | 63 | ocr_path = os.path.join(self.ocr_dir, f'{questionId[2:9]}.json') 64 | try: 65 | if not os.path.exists(ocr_path): 66 | items = self.google_ocr.recognize_image(img) 67 | if items == "error": 68 | print('OCR error: ', image_path) 69 | continue 70 | with open(ocr_path, 'w') as f: 71 | json.dump(items, f) 72 | else: 73 | with open(ocr_path, 'r') as f: 74 | items = json.load(f) 75 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 76 | except: 77 | words, bboxes = [], [] 78 | 79 | question = row[question_index] 80 | instruction = random.choice(instructions) 81 | instruction = instruction.replace('', question) 82 | ocr = ' '.join(words) 83 | value = row[answer_index] 84 | 85 | file_name = os.path.abspath(image_path) 86 | target_format.append({ 87 | "image": file_name, 88 | "ocr": ocr, 89 | "bboxes": bboxes, 90 | "conversations": [ 91 | {'from': 'human', 'value': instruction}, 92 | {'from': 'gpt', 'value': value}, 93 | ], 94 | }) 95 | 96 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 97 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 98 | 99 | print(f'{split}: {len(target_format)}') 100 | with 
open(out_filepath, "w") as f: 101 | json.dump(target_format, f) 102 | 103 | if __name__ == '__main__': 104 | parser = argparse.ArgumentParser() 105 | parser.add_argument('--input_data_dir', default='raw_datasets/websrc', type=str) 106 | parser.add_argument('--out_data_dir', default='processed_data/websrc', type=str) 107 | parser.add_argument('--ocr_dir', default='raw_datasets/websrc/ocrs', type=str) 108 | parser.add_argument('--api_key', default='API_KEY', type=str) 109 | args = parser.parse_args() 110 | 111 | dataset = InstructData(args) 112 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/wildreceipt.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, sort_coordinate, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'wildreceipt' 16 | self.split = ['train', 'test'] 17 | self.classes = {} 18 | for items in open(os.path.join(args.input_data_dir, 'class_list.txt')): 19 | index, label = items.split() 20 | self.classes[index] = label 21 | 22 | def create_data(self): 23 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 24 | for split in self.split: 25 | target_format = [] 26 | with open(os.path.join(self.data_dir, f'{split}.txt')) as f: 27 | samples = f.readlines() 28 | for sample in tqdm(samples): 29 | data = json.loads(sample) 30 | file_name = data['file_name'] 31 | image_path = os.path.join(self.data_dir, file_name) 32 | image = Image.open(image_path) 33 | w, h = image.size 34 | 35 | items = [] 36 | labels = {} 37 | for item in data["annotations"]: 38 | text, label_index = item["text"], item["label"] 39 | label = self.classes[str(label_index)] 40 | if label_index == 0: 41 | continue 42 | bbox = item["box"] 43 | bbox = [bbox[0], bbox[1], bbox[4], bbox[5]] 44 | bbox = normalize_bbox(bbox, w, h) 45 | items.append((text, label, bbox)) 46 | 47 | items = sort_coordinate(items) 48 | 49 | ocr = [] 50 | bboxes = [] 51 | for item in items: 52 | words, label, bbox = item 53 | labels[words] = label 54 | ocr.append(words) 55 | bbox = [bbox] * len(words.split()) 56 | bboxes += bbox 57 | ocr = ' '.join(ocr) 58 | 59 | for key in labels: 60 | instruction = random.choice(instructions) 61 | instruction = instruction.replace('', key) 62 | value = labels[key] 63 | 64 | file_name = os.path.abspath(image_path) 65 | target_format.append({ 66 | "image": file_name, 67 | "ocr": ocr, 68 | "bboxes": bboxes, 69 | "conversations": [ 70 | {'from': 'human', 'value': instruction}, 71 | {'from': 'gpt', 'value': value}, 72 | ], 73 | }) 74 | 75 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 76 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 77 | 78 | with open(out_filepath, "w") as f: 79 | json.dump(target_format, f) 80 | 81 | if __name__ == '__main__': 82 | parser = argparse.ArgumentParser() 83 | parser.add_argument('--input_data_dir', default='raw_datasets/wildreceipt/wildreceipt', type=str) 84 | parser.add_argument('--out_data_dir', default='processed_data/wildreceipt', type=str) 85 | args = parser.parse_args() 86 | 87 | dataset = InstructData(args) 88 | 
dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/wtq.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'wtq' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | continue 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | return ocrs 43 | 44 | def create_data(self): 45 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 46 | for split in self.split: 47 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 48 | with open(file_name, 'r') as f: 49 | data = f.readlines() 50 | 51 | ocrs = self.create_ocr_data(split) 52 | target_format = [] 53 | for d in tqdm(data): 54 | d = json.loads(d) 55 | image_name = d['name'].replace('.pdf', '') 56 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 57 | file_name = os.path.abspath(file_name) 58 | for ann in d['annotations']: 59 | instruction = random.choice(instructions) 60 | question = ann['key'] 61 | instruction = instruction.replace('', question) 62 | ocr, bboxes = ocrs[image_name] 63 | value = ann['values'][0]['value'] 64 | 65 | target_format.append({ 66 | "image": file_name, 67 | "ocr": ocr, 68 | "bboxes": bboxes, 69 | "conversations": [ 70 | {'from': 'human', 'value': instruction}, 71 | {'from': 'gpt', 'value': value}, 72 | ], 73 | }) 74 | 75 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 76 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 77 | 78 | print(f'{split}: {len(target_format)}') 79 | with open(out_filepath, "w") as f: 80 | json.dump(target_format, f) 81 | 82 | if __name__ == '__main__': 83 | parser = argparse.ArgumentParser() 84 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/WikiTableQuestions', type=str) 85 | parser.add_argument('--out_data_dir', default='processed_data/wtq', type=str) 86 | args = parser.parse_args() 87 | 88 | dataset = InstructData(args) 89 | dataset.create_data() -------------------------------------------------------------------------------- /download.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | export DATASET_DIR=raw_datasets 3 | 4 | mkdir raw_datasets 5 | 6 | sh ./download_scripts/due.sh 7 | sh ./download_scripts/websrc.sh 8 | sh ./download_scripts/funsd.sh 9 | 
sh ./download_scripts/iconqa.sh 10 | sh ./download_scripts/textbookqa.sh 11 | sh ./download_scripts/screen2words.sh 12 | sh ./download_scripts/doclaynet.sh 13 | sh ./download_scripts/ai2d.sh 14 | sh ./download_scripts/wildreceipt.sh 15 | 16 | # font file for rendering text in AI2D dataset 17 | wget https://huggingface.co/Team-PIXEL/pixel-base-finetuned-masakhaner-swa/resolve/main/GoNotoCurrent.ttf 18 | -------------------------------------------------------------------------------- /download_scripts/README.md: -------------------------------------------------------------------------------- 1 | Below is the list of the datasets used in InstructDoc and how to download them. 2 | ### Automatically downloaded datasets 3 | - DocVQA ([due.sh](download_scripts/due.sh)) 4 | - InfographicVQA ([due.sh](download_scripts/due.sh)) 5 | - PWC ([due.sh](download_scripts/due.sh)) 6 | - KLC ([due.sh](download_scripts/due.sh)) 7 | - DeepForm ([due.sh](download_scripts/due.sh)) 8 | - TabFact ([due.sh](download_scripts/due.sh)) 9 | - WebSRC ([websrc.sh](download_scripts/websrc.sh)) 10 | - FUNSD ([funsd.sh](download_scripts/funsd.sh)) 11 | - IconQA ([iconqa.sh](download_scripts/iconqa.sh)) 12 | - TextbookQA ([textbookqa.sh](download_scripts/textbookqa.sh)) 13 | - Screen2Words ([screen2words.sh](download_scripts/screen2words.sh)) 14 | - DocLayNet ([doclaynet.sh](download_scripts/doclaynet.sh)) 15 | - LLaVAR ([llavar.sh](download_scripts/llavar.sh)) 16 | 17 | ### Manually downloaded datasets 18 | After downloading the datasets below, please place them under the "raw_datasets" directory. 19 | - SROIE ([kaggle](https://www.kaggle.com/datasets/urbikn/sroie-datasetv2)) 20 | - CORD ([google drive](https://drive.google.com/drive/folders/14OEWr86qotVBMAsWk7lymMytxn5u-kM6)) 21 | - OCRVQA ([google drive](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)) 22 | - TAT-DQA ([google drive](https://drive.google.com/drive/folders/1SGpZyRWqycMd_dZim1ygvWhl5KdJYDR2)) 23 | - ScienceQA ([google drive](https://drive.google.com/drive/folders/1w8imCXWYn2LxajmGeGH_g5DaL2rabHev)) 24 | - ChartQA ([google drive](https://drive.google.com/file/d/17-aqtiq_KJ16PIGOp30W0y6OJNax6SVT/view)) 25 | - RVL-CDIP ([google docs](https://docs.google.com/uc?id=0Bz1dfcnrpXM-MUt4cHNzUEFXcmc&export=download)) 26 | - HW-SQuAD ([onedrive](https://www.docvqa.org/datasets/benthamqa-and-hw-squad)) 27 | - SciCap ([dropbox](https://www.dropbox.com/s/t1sjqesl0pynaxo/scicap_data.zip?dl=0)) 28 | - DUDE ([project page](https://rrc.cvc.uab.es/?ch=23&com=introduction)) 29 | - DocBank ([project page](https://doc-analysis.github.io/docbank-page/index.html)) 30 | - DocILE ([project page](https://docile.rossum.ai/)) 31 | - VisualMRC ([project page](https://github.com/nttmdlab-nlp/VisualMRC), request access from the authors via e-mail: ryota.tanaka@ntt.com) 32 | - SlideVQA ([project page](https://github.com/nttmdlab-nlp/SlideVQA), request access from the authors via e-mail: ryota.tanaka@ntt.com) 33 | 34 | 35 | -------------------------------------------------------------------------------- /download_scripts/ai2d.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading AI2D dataset..."
4 | mkdir ai2d 5 | cd ai2d 6 | wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip 7 | wget https://s3-us-east-2.amazonaws.com/prior-datasets/ai2d_test_ids.csv 8 | unzip ai2d-all.zip && rm ai2d-all.zip 9 | -------------------------------------------------------------------------------- /download_scripts/doclaynet.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading DocLayNet dataset..." 4 | mkdir doclaynet 5 | cd doclaynet 6 | wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip 7 | wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_extra.zip 8 | unzip DocLayNet_core.zip && rm DocLayNet_core.zip 9 | unzip DocLayNet_extra.zip && rm DocLayNet_extra.zip 10 | -------------------------------------------------------------------------------- /download_scripts/due.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading DocVQA dataset..." 4 | mkdir docvqa 5 | cd docvqa 6 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/DocVQA.tar.gz 7 | tar xvf DocVQA.tar.gz && rm DocVQA.tar.gz 8 | cd .. 9 | 10 | echo "Downloading InfoVQA dataset..." 11 | mkdir infovqa 12 | cd infovqa 13 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/InfographicsVQA.tar.gz 14 | tar xvf InfographicsVQA.tar.gz && rm InfographicsVQA.tar.gz 15 | cd .. 16 | 17 | echo "Downloading TabFact dataset..." 18 | mkdir tabfact 19 | cd tabfact 20 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/TabFact.tar.gz 21 | tar xvf TabFact.tar.gz && rm TabFact.tar.gz 22 | cd .. 23 | 24 | echo "Downloading WTQ dataset..." 25 | mkdir wtq 26 | cd wtq 27 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/WikiTableQuestions.tar.gz 28 | tar xvf WikiTableQuestions.tar.gz && rm WikiTableQuestions.tar.gz 29 | cd .. 30 | 31 | echo "Downloading KLC dataset..." 32 | mkdir klc 33 | cd klc 34 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/KleisterCharity.tar.gz 35 | tar xvf KleisterCharity.tar.gz && rm KleisterCharity.tar.gz 36 | cd .. 37 | 38 | echo "Downloading DeepForm dataset..." 39 | mkdir deepform 40 | cd deepform 41 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/DeepForm.tar.gz 42 | tar xvf DeepForm.tar.gz && rm DeepForm.tar.gz 43 | cd .. 44 | 45 | echo "Downloading PWC dataset..." 46 | mkdir pwc 47 | cd pwc 48 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/PWC.tar.gz 49 | tar xvf PWC.tar.gz && rm PWC.tar.gz 50 | -------------------------------------------------------------------------------- /download_scripts/funsd.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading FUNSD dataset..." 4 | mkdir funsd 5 | cd funsd 6 | wget https://guillaumejaume.github.io/FUNSD/dataset.zip 7 | unzip dataset.zip && rm dataset.zip 8 | -------------------------------------------------------------------------------- /download_scripts/iconqa.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading IconQA dataset..."
4 | mkdir iconqa 5 | cd iconqa 6 | wget https://iconqa2021.s3.us-west-1.amazonaws.com/iconqa_data.zip 7 | unzip iconqa_data.zip && rm iconqa_data.zip 8 | -------------------------------------------------------------------------------- /download_scripts/llavar.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading LLaVAR dataset..." 4 | mkdir llavar 5 | cd llavar 6 | wget https://huggingface.co/datasets/SALT-NLP/LLaVAR/resolve/main/llava_instruct_150k_llavar_20k.json 7 | mkdir images 8 | cd images 9 | wget https://huggingface.co/datasets/SALT-NLP/LLaVAR/resolve/main/finetune.zip 10 | unzip finetune.zip && rm finetune.zip 11 | -------------------------------------------------------------------------------- /download_scripts/screen2words.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading Screen2Words dataset..." 4 | git clone https://github.com/google-research-datasets/screen2words.git 5 | cd screen2words 6 | wget https://storage.googleapis.com/crowdstf-rico-uiuc-4540/rico_dataset_v0.1/unique_uis.tar.gz 7 | tar xvf unique_uis.tar.gz && rm unique_uis.tar.gz 8 | -------------------------------------------------------------------------------- /download_scripts/textbookqa.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading TextbookQA dataset..." 4 | mkdir textbookqa 5 | cd textbookqa 6 | wget https://ai2-public-datasets.s3.amazonaws.com/tqa/tqa_train_val_test.zip 7 | unzip tqa_train_val_test.zip && rm tqa_train_val_test.zip 8 | -------------------------------------------------------------------------------- /download_scripts/websrc.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading WebSRC dataset..." 4 | mkdir websrc 5 | cd websrc 6 | wget https://websrc-data.s3.amazonaws.com/release.zip 7 | unzip release.zip && rm release.zip 8 | -------------------------------------------------------------------------------- /download_scripts/wildreceipt.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading WildReceipt dataset..."
4 | mkdir wildreceipt 5 | cd wildreceipt 6 | wget https://download.openmmlab.com/mmocr/data/wildreceipt.tar 7 | tar xvf wildreceipt.tar && rm wildreceipt.tar 8 | -------------------------------------------------------------------------------- /example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nttmdlab-nlp/InstructDoc/fadcdabc1d071ae3f64c090acac7cf55eb932abd/example.png -------------------------------------------------------------------------------- /instructdoc_instructions.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nttmdlab-nlp/InstructDoc/fadcdabc1d071ae3f64c090acac7cf55eb932abd/instructdoc_instructions.xlsx -------------------------------------------------------------------------------- /merge_datasets.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import random 4 | import argparse 5 | 6 | train_val_datasets = ['klc', 'pwc', 'deepform', 'sroie', 'docile', 'wildreceipt', 'websrc', 'hwsquad', 7 | 'visualmrc', 'iconqa_fill_in_blank', 'iconqa_choose_txt', 'scienceqa', 8 | 'ai2d', 'docvqa', 'rvlcdip', 'textbookqa', 'wtq', 'tatdqa', 'scicap', 'llavar', 9 | 'screen2words', 'doclaynet', 'docbank', 'docvqa_iq', 'rvlcdip_io', 'ocrvqa'] 10 | 11 | def merge_datasets(input_data_dir='./processed_data', save_dir='./', max_samples=5000): 12 | questionId = 0 13 | for split in [('train',), ('dev', 'val')]: 14 | merge = [] 15 | for dataset_name in train_val_datasets: 16 | for s in split: 17 | dataset_path = os.path.join(input_data_dir, dataset_name, f'{s}.json') 18 | if os.path.exists(dataset_path): 19 | with open(dataset_path, 'r') as f: 20 | data = json.load(f) 21 | if len(data) == 0: 22 | continue 23 | random.shuffle(data); data = data[:max_samples] 24 | for d in data: 25 | d["dataset_name"] = dataset_name 26 | d["id"] = questionId; questionId += 1 27 | merge.append(d) 28 | random.shuffle(merge) 29 | 30 | out_filepath = os.path.join(save_dir, f'{split[0]}.json') 31 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 32 | print(f'{split}: {len(merge)}') 33 | with open(out_filepath, "w") as f: 34 | json.dump(merge, f) 35 | 36 | if __name__ == '__main__': 37 | parser = argparse.ArgumentParser() 38 | parser.add_argument('--input_data_dir', default='processed_data', type=str) 39 | parser.add_argument('--save_dir', default='./', type=str) 40 | parser.add_argument('--max_samples', default=5000, type=int) 41 | args = parser.parse_args() 42 | 43 | merge_datasets(args.input_data_dir, args.save_dir, args.max_samples) -------------------------------------------------------------------------------- /process_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | API_KEY=$1 3 | 4 | # ===== KIE ===== 5 | python data_preprocessors/docile.py 6 | python data_preprocessors/klc.py 7 | python data_preprocessors/deepform.py 8 | python data_preprocessors/funsd.py 9 | python data_preprocessors/pwc.py 10 | python data_preprocessors/wildreceipt.py 11 | python data_preprocessors/cord.py 12 | python data_preprocessors/sroie.py 13 | 14 | # ===== Single-page QA ===== 15 | python data_preprocessors/visualmrc.py 16 | python data_preprocessors/websrc.py --api_key $API_KEY 17 | python data_preprocessors/ocrvqa.py --api_key $API_KEY 18 | python data_preprocessors/docvqa.py 19 | python data_preprocessors/hwsquad.py 20 | 21 | # ===== Single-page QA w/ Discrete Reasoning ===== 22 | 
python data_preprocessors/tatdqa.py 23 | python data_preprocessors/wtq.py 24 | 25 | # ===== Single-page QA w/ Visual Reasoning ===== 26 | python data_preprocessors/iconqa.py 27 | python data_preprocessors/ai2d.py 28 | python data_preprocessors/scienceqa.py 29 | python data_preprocessors/textbookqa.py 30 | 31 | # ===== Single-page QA w/ Discrete and Visual Reasoning ===== 32 | python data_preprocessors/infographicvqa.py 33 | python data_preprocessors/chartqa.py --api_key $API_KEY 34 | 35 | # ===== Multi-page QA w/ Multi-hop, Discrete, and Visual Reasoning ===== 36 | python data_preprocessors/slidevqa.py --api_key $API_KEY 37 | python data_preprocessors/dude.py 38 | 39 | # ===== Document NLI ===== 40 | python data_preprocessors/tabfact.py 41 | 42 | # ===== Dialogue ===== 43 | python data_preprocessors/llavar.py --api_key $API_KEY 44 | 45 | # ===== Captioning ===== 46 | python data_preprocessors/scicap.py --api_key $API_KEY 47 | python data_preprocessors/screen2words.py --api_key $API_KEY 48 | 49 | # ===== Classification ===== 50 | python data_preprocessors/rvlcdip.py --api_key $API_KEY 51 | 52 | # ===== ITM ===== 53 | python data_preprocessors/docbank.py 57 | python data_preprocessors/doclaynet.py 58 | --------------------------------------------------------------------------------
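Taken together, download.sh fetches the automatically downloadable datasets into raw_datasets/, the data_preprocessors/*.py scripts convert each dataset into the shared instruction-style JSON format under processed_data/, and merge_datasets.py combines them into single train.json and dev.json files. A minimal end-to-end sketch, assuming the commands are run from the repository root and that the API key is for the OCR service wrapped by google_vision_ocr.py (an assumption; YOUR_API_KEY is a placeholder):

    sh download.sh                      # populate raw_datasets/ (manually distributed datasets must still be placed there by hand)
    sh process_data.sh YOUR_API_KEY     # run the data_preprocessors, writing per-dataset splits into processed_data/
    python merge_datasets.py --input_data_dir processed_data --save_dir ./ --max_samples 5000   # merge into train.json / dev.json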