├── LICENSE ├── README.md ├── data_preprocessors ├── ai2d.py ├── chartqa.py ├── cord.py ├── deepform.py ├── docbank.py ├── docile.py ├── doclaynet.py ├── docvqa.py ├── docvqa_iq.py ├── dude.py ├── funsd.py ├── google_vision_ocr.py ├── hwsquad.py ├── iconqa.py ├── infographicvqa.py ├── klc.py ├── llavar.py ├── ocrvqa.py ├── pwc.py ├── rvlcdip.py ├── rvlcdip_io.py ├── scicap.py ├── scienceqa.py ├── screen2words.py ├── slidevqa.py ├── sroie.py ├── tabfact.py ├── tatdqa.py ├── textbookqa.py ├── utils.py ├── visualmrc.py ├── websrc.py ├── wildreceipt.py └── wtq.py ├── download.sh ├── download_scripts ├── README.md ├── ai2d.sh ├── doclaynet.sh ├── due.sh ├── funsd.sh ├── iconqa.sh ├── llavar.sh ├── screen2words.sh ├── textbookqa.sh ├── websrc.sh └── wildreceipt.sh ├── example.png ├── instructdoc_instructions.xlsx ├── merge_datasets.py └── process_data.sh /LICENSE: -------------------------------------------------------------------------------- 1 | SOFTWARE LICENSE AGREEMENT FOR EVALUATION 2 | 3 | This SOFTWARE EVALUATION LICENSE AGREEMENT (this "Agreement") is a legal contract between a person who uses or otherwise accesses or installs the Software (“User(s)”), and Nippon Telegraph and Telephone corporation ("NTT"). 4 | READ THE TERMS AND CONDITIONS OF THIS AGREEMENT CAREFULLY BEFORE INSTALLING OR OTHERWISE ACCESSING OR USING NTT'S PROPRIETARY SOFTWARE ACCOMPANIED BY THIS AGREEMENT (the "SOFTWARE"). THE SOFTWARE IS COPYRIGHTED AND IT IS LICENSED TO USER UNDER THIS AGREEMENT, NOT SOLD TO USER. BY INSTALLING OR OTHERWISE ACCESSING OR USING THE SOFTWARE, USER ACKNOWLEDGES THAT USER HAS READ THIS AGREEMENT, THAT USER UNDERSTANDS IT, AND THAT USER ACCEPTS AND AGREES TO BE BOUND BY ITS TERMS. IF AT ANY TIME USER IS NOT WILLING TO BE BOUND BY THE TERMS OF THIS AGREEMENT, USER SHOULD TERMINATE THE INSTALLATION PROCESS, IMMEDIATELY CEASE AND REFRAIN FROM ACCESSING OR USING THE SOFTWARE AND DELETE ANY COPIES USER MAY HAVE. THIS AGREEMENT REPRESENTS THE ENTIRE AGREEMENT BETWEEN USER AND NTT CONCERNING THE SOFTWARE. 5 | 6 | 7 | BACKGROUND 8 | A. NTT is the owner of all rights, including all patent rights, copyrights and trade secret rights, in and to the Software and related documentation listed in Exhibit A to this Agreement. 9 | B. User wishes to obtain a royalty free license to use the Software to enable User to evaluate, and NTT wishes to grant such a license to User, pursuant and subject to the terms and conditions of this Agreement. 10 | C. As a condition to NTT's provision of the Software to User, NTT has required User to execute this Agreement. 11 | In consideration of these premises, and the mutual promises and conditions in this Agreement, the parties hereby agree as follows: 12 | 1. Grant of Evaluation License. NTT hereby grants to User, and User hereby accepts, under the terms and conditions of this Agreement, a royalty free, nontransferable and nonexclusive license to use the Software internally for the purposes of testing, analyzing, and evaluating the methods or mechanisms as shown in the research paper submitted by NTT to a certain academy. User may make a reasonable number of backup copies of the Software solely for User's internal use pursuant to the license granted in this Section 1. 13 | 2. Shipment and Installation. NTT will ship or deliver the Software by any method that NTT deems appropriate. User shall be solely responsible for proper installation of the Software. 14 | 3. Term. 
This Agreement is effective whichever is earlier (i) upon User’s acceptance of the Agreement, or (ii) upon User’s installing, accessing, and using the Software, even if User has not expressly accepted this Agreement. Without prejudice to any other rights, NTT may terminate this Agreement without notice to User (i) if User breaches or fails to comply with any of the limitations or other requirements described herein, and (ii) if NTT receives a notice from the academy stating that the research paper would not be published, and in any such case User agrees that NTT may, in addition to any other remedies it may have at law or in equity, remotely disable the Software. User may terminate this Agreement at any time by User’s decision to terminate the Agreement to NTT and ceasing use of the Software. Upon any termination or expiration of this Agreement for any reason, User agrees to uninstall the Software and either return to NTT the Software and all copies thereof, or to destroy all such materials and provide written verification of such destruction to NTT. 15 | 4. Proprietary Rights 16 | (a) The Software is the valuable, confidential, and proprietary property of NTT, and NTT shall retain exclusive title to this property both during the term and after the termination of this Agreement. Without limitation, User acknowledges that all patent rights, copyrights and trade secret rights in the Software shall remain the exclusive property of NTT at all times. User shall use not less than reasonable care in safeguarding the confidentiality of the Software. 17 | (b) USER SHALL NOT, IN WHOLE OR IN PART, AT ANY TIME DURING THE TERM OF OR AFTER THE TERMINATION OF THIS AGREEMENT: (i) SELL, ASSIGN, LEASE, DISTRIBUTE, OR OTHERWISE TRANSFER THE SOFTWARE TO ANY THIRD PARTY; (ii) EXCEPT AS OTHERWISE PROVIDED HEREIN, COPY OR REPRODUCE THE SOFTWARE IN ANY MANNER; (iii) DISCLOSE THE SOFTWARE TO ANY THIRD PARTY, EXCEPT TO USER'S EMPLOYEES WHO REQUIRE ACCESS TO THE SOFTWARE FOR THE PURPOSES OF THIS AGREEMENT; (iv) MODIFY, DISASSEMBLE, DECOMPILE, REVERSE ENGINEER OR TRANSLATE THE SOFTWARE; OR (v) ALLOW ANY PERSON OR ENTITY TO COMMIT ANY OF THE ACTIONS DESCRIBED IN (i) THROUGH (iv) ABOVE. 18 | (c) User shall take appropriate action, by instruction, agreement, or otherwise, with respect to its employees permitted under this Agreement to have access to the Software to ensure that all of User's obligations under this Section 4 shall be satisfied. 19 | 5.  Indemnity. User shall defend, indemnify and hold harmless NTT, its agents and employees, from any loss, damage, or liability arising in connection with User's improper or unauthorized use of the Software. NTT SHALL HAVE THE SOLE RIGHT TO CONDUCT DEFEND ANY ACTTION RELATING TO THE SOFTWARE. 20 | 6. Disclaimer. THE SOFTWARE IS LICENSED TO USER "AS IS," WITHOUT ANY TRAINING, MAINTENANCE, OR SERVICE OBLIGATIONS WHATSOEVER ON THE PART OF NTT. NTT MAKES NO EXPRESS OR IMPLIED WARRANTIES OF ANY TYPE WHATSOEVER, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, OF FITNESS FOR A PARTICULAR PURPOSE AND OF NON-INFRINGEMENT ON COPYRIGHT OR ANY OTHER RIGHT OF THIRD PARTIES. USER ASSUMES ALL RISKS ASSOCIATED WITH ITS USE OF THE SOFTWARE, INCLUDING WITHOUT LIMITATION RISKS RELATING TO QUALITY, PERFORMANCE, DATA LOSS, AND UTILITY IN A PRODUCTION ENVIRONMENT. 21 | 7. Limitation of Liability. 
IN NO EVENT SHALL NTT BE LIABLE TO USER OR TO ANY THIRD PARTY FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO DAMAGES FOR PERSONAL INJURY, PROPERTY DAMAGE, LOST PROFITS, OR OTHER ECONOMIC LOSS, ARISING IN CONNECTION WITH USER'S USE OF OR INABILITY TO USE THE SOFTWARE, IN CONNECTION WITH NTT'S PROVISION OF OR FAILURE TO PROVIDE SERVICES PERTAINING TO THE SOFTWARE, OR AS A RESULT OF ANY DEFECT IN THE SOFTWARE. THIS DISCLAIMER OF LIABILITY SHALL APPLY REGARD¬LESS OF THE FORM OF ACTION THAT MAY BE BROUGHT AGAINST NTT, WHETHER IN CONTRACT OR TORT, INCLUDING WITHOUT LIMITATION ANY ACTION FOR NEGLIGENCE. USER'S SOLE REMEDY IN THE EVENT OF ANY BREACH OF THIS AGREEMENT BY NTT SHALL BE TERMINATION PURSUANT TO SECTION 3. 22 | 8. No Assignment or Sublicense. Neither this Agreement nor any right or license under this Agreement, nor the Software, may be sublicensed, assigned, or otherwise transferred by User without NTT's prior written consent. 23 | 9. General 24 | (a) If any provision, or part of a provision, of this Agreement is or becomes illegal, unenforceable, or invalidated, by operation of law or otherwise, that provision or part shall to that extent be deemed omitted, and the remainder of this Agreement shall remain in full force and effect. 25 | (b) This Agreement is the complete and exclusive statement of the agreement between the parties with respect to the subject matter hereof, and supersedes all written and oral contracts, proposals, and other communications between the parties relating to that subject matter. 26 | (c) Subject to Section 8, this Agreement shall be binding on, and shall inure to the benefit of, the respective successors and assigns of NTT and User. 27 | (d) If either party to this Agreement initiates a legal action or proceeding to enforce or interpret any part of this Agreement, the prevailing party in such action shall be entitled to recover, as an element of the costs of such action and not as damages, its attorneys' fees and other costs associated with such action or proceeding. 28 | (e) This Agreement shall be governed by and interpreted under the laws of Japan, without reference to conflicts of law principles. All disputes arising out of or in connection with this Agreement shall be finally settled by arbitration in Tokyo in accordance with the Commercial Arbitration Rules of the Japan Commercial Arbitration Association. The arbitration shall be conducted by three (3) arbitrators and in Japanese. The award rendered by the arbitrators shall be final and binding upon the parties. Judgment upon the award may be entered in any court having jurisdiction thereof. 29 | (f)   NTT shall not be liable to the User or to any third party for any delay or failure to perform NTT’s obligation set forth under this Agreement due to any cause beyond NTT’s reasonable control. 
30 |   31 | EXHIBIT A 32 | The software and related data include the following files: 33 | - data_preprocessors 34 | - download_scripts 35 | - download.sh 36 | - process_data.sh 37 | - merge_datasets.py 38 | - instructdoc_instructions.xlsx 39 | - README 40 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions 2 | This repository includes the InstructDoc dataset introduced by the following paper: Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. "InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions". In Proc. of AAAI, 2024. 3 | 4 | > We introduce InstructDoc, the first large-scale visual instruction tuning dataset that covers a wide range of VDU tasks and datasets. 5 | 6 | ![Figure 1 from paper](example.png) 7 | 8 | 9 | # Get Started 10 | ## 1. Download datasets 11 | ``` 12 | sh download.sh 13 | ``` 14 | This script downloads most of the datasets automatically. For some datasets, due to license issues and download restrictions, you need to download them manually by following the instructions in [download_scripts/README.md](download_scripts). 15 | 16 | ## 2. Preprocess datasets 17 | ``` 18 | sh process_data.sh API_KEY 19 | ``` 20 | This script preprocesses all the datasets. To extract OCR information from document images, we use the Google Vision API; set the variable API_KEY to an API key obtained from [Google Cloud Platform](https://cloud.google.com/). To get a key, follow the [quickstart guide](https://cloud.google.com/vision/docs/quickstart).
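If you want to verify your API key before running the full pipeline, the bundled OCR helper in [data_preprocessors/google_vision_ocr.py](data_preprocessors) can be called directly. A minimal sketch, run from inside `data_preprocessors/` (the scripts import the helper from that directory); the image path is a placeholder and `YOUR_API_KEY` is your own key:

```python
from PIL import Image
from google_vision_ocr import Google_OCR  # helper shipped in data_preprocessors/

ocr = Google_OCR(api_key="YOUR_API_KEY")          # key obtained from Google Cloud Platform
img = Image.open("path/to/document_image.png")    # placeholder: any downloaded document image
items = ocr.recognize_image(img)                  # raw Vision API response, or "error" on failure
if items != "error":
    # word tokens plus normalized word-level bounding boxes
    words, bboxes = ocr.extract_info(items, *img.size)
    print(words[:10])
    print(bboxes[:10])
```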

21 | If you encounter a FileNotFoundError while processing the datasets, set the --input_data_dir argument of the corresponding script in [data_preprocessors](data_preprocessors) to the correct dataset directory. 22 | 23 | ## 3. Merge preprocessed datasets 24 | ``` 25 | python merge_datasets.py --max_samples 5000 --input_data_dir processed_data --save_dir ./ 26 | ``` 27 | We randomly sampled a maximum of 5000 instances for each held-in dataset. After merging, you obtain JSON files with the following format. 28 | If a dataset provides multiple images per instance (e.g., SlideVQA), we append "_list" to the field names "image", "ocr", and "bboxes". 29 | 30 |
31 |    {
32 |       "dataset_name": dataset name,
33 |       "id": identification of the instance,
34 |       "image" or "image_list": image path,
35 |       "ocr" or "ocr_list": ocr text,
36 |       "bboxes" or "bboxes_list": [x1, y1, x2, y2, w, h],
37 |       "conversations": [
38 |         {'from': 'human', 'value': randomly sampled instruction},
39 |         {'from': 'gpt', 'value': answer}
40 |       ]
41 |    }
42 | 
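A merged file can be inspected with a few lines of Python. This is only a sketch: the output filename below is an assumption (use whatever merge_datasets.py writes into --save_dir), and the field names follow the format shown above:

```python
import json

with open("train.json") as f:   # assumed output name written by merge_datasets.py
    samples = json.load(f)

for sample in samples[:3]:
    # Multi-image datasets (e.g., SlideVQA) use the *_list variants of the fields.
    images = sample["image_list"] if "image_list" in sample else [sample["image"]]
    instruction = sample["conversations"][0]["value"]   # randomly sampled instruction
    answer = sample["conversations"][1]["value"]         # ground-truth answer
    print(sample["dataset_name"], len(images), "image(s)")
    print("  instruction:", instruction[:80])
    print("  answer:", answer)
```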
43 | 44 | # Citation 45 | 46 | You can cite it as follows: 47 | ```bibtex 48 | @inproceedings{InstructDoc2024, 49 | author = {Ryota Tanaka and 50 | Taichi Iki and 51 | Kyosuke Nishida and 52 | Kuniko Saito and 53 | Jun Suzuki}, 54 | title = {InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions}, 55 | booktitle = {AAAI}, 56 | year = {2024} 57 | } 58 | ``` 59 | 60 | If you have any questions about the paper and repository, feel free to contact Ryota Tanaka (ryota.tanaka[at]ntt.com) or open an issue! 61 | -------------------------------------------------------------------------------- /data_preprocessors/ai2d.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import glob 5 | from PIL import Image, ImageDraw, ImageFont 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, load_instructions 9 | from transformers import BertTokenizer 10 | import argparse 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.question_dir = os.path.join(args.input_data_dir, f'questions') 18 | self.ann_dir = os.path.join(args.input_data_dir, f'annotations') 19 | self.img_dir = os.path.join(args.input_data_dir, f'images') 20 | self.font = ImageFont.truetype(args.font_file, size=40) 21 | self.dataset_name = 'ai2d' 22 | self.split = ['train', 'test'] 23 | 24 | def sort_coordinate(self, bboxes): 25 | return sorted(bboxes, key=lambda k: [k[1][1], k[1][0]]) 26 | 27 | def create_data(self): 28 | train = [] 29 | test = [] 30 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 31 | with open(os.path.join(self.data_dir, 'ai2d_test_ids.csv')) as f: 32 | test_ids = f.read().splitlines() 33 | for i, file in enumerate(tqdm(sorted(os.listdir(self.question_dir)))): 34 | file_path = os.path.join(self.question_dir, file) 35 | with open(file_path, 'r', encoding='utf-8') as f: 36 | data = json.load(f) 37 | annotation_path = os.path.join(self.ann_dir, file) 38 | with open(annotation_path, 'r') as f: 39 | ann = json.load(f) 40 | 41 | index = file.replace('.png.json', '') 42 | split = 'test' if str(index) in test_ids else 'train' 43 | 44 | image_path = os.path.join(self.img_dir, file) 45 | image_path = image_path.replace('.json', '') 46 | img = Image.open(image_path) 47 | draw = ImageDraw.Draw(img) 48 | 49 | for index, text in ann['text'].items(): 50 | replacement_text = text['replacementText'] 51 | bbox = text['rectangle'] 52 | bbox = [bbox[0][0], bbox[0][1], bbox[1][0], bbox[1][1]] 53 | text = text['value'] 54 | x1, y1, x2, y2 = bbox 55 | draw.rectangle((x1, y1, x2, y2), outline="lime", width=4) 56 | draw.text((x1, y1-30), replacement_text, font=self.font, fill="blue", align="center") 57 | 58 | image_path = os.path.join(self.out_data_dir, 'draw_images', f'{file.replace(".json", "")}') 59 | os.makedirs(os.path.dirname(image_path), exist_ok=True) 60 | img.save(image_path) 61 | 62 | for question, item in data['questions'].items(): 63 | options = item['answerTexts'] 64 | answer_index = item['correctAnswer'] 65 | value = options[answer_index] 66 | 67 | instruction = random.choice(instructions) 68 | instruction = instruction.replace('', question).replace('', str(options)) 69 | file_name = os.path.abspath(image_path) 70 | metadata = { 71 | "image": file_name, 72 | "conversations": 
[ 73 | {'from': 'human', 'value': instruction}, 74 | {'from': 'gpt', 'value': f"{value}"}, 75 | ], 76 | } 77 | if split == 'train': 78 | train.append(metadata) 79 | elif split == 'test': 80 | test.append(metadata) 81 | 82 | for split, results in [('train', train), ('test', test)]: 83 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 84 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 85 | 86 | print(f'{split}: {len(results)}') 87 | with open(out_filepath, "w") as f: 88 | json.dump(results, f) 89 | 90 | if __name__ == '__main__': 91 | parser = argparse.ArgumentParser() 92 | parser.add_argument('--input_data_dir', default='raw_datasets/ai2d', type=str) 93 | parser.add_argument('--out_data_dir', default='processed_data/ai2d', type=str) 94 | parser.add_argument('--font_file', default='GoNotoCurrent.ttf', type=str) 95 | args = parser.parse_args() 96 | 97 | dataset = InstructData(args) 98 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/chartqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image 7 | from tqdm import tqdm 8 | from pathlib import Path 9 | from utils import load_instructions 10 | from google_vision_ocr import Google_OCR 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 18 | self.dataset_name = 'chartqa' 19 | self.google_ocr = Google_OCR(args.api_key) 20 | self.split = ['train', 'val', 'test'] 21 | os.makedirs(self.ocr_dir, exist_ok=True) 22 | 23 | def create_data(self): 24 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 25 | for split in self.split: 26 | target_format = [] 27 | for qa_type in ['human', 'augmented']: 28 | file_name = os.path.join(self.data_dir, f'{split}/{split}_{qa_type}.json') 29 | with open(file_name, 'r') as f: 30 | data = json.load(f) 31 | for d in tqdm(data): 32 | image_name = d['imgname'] 33 | image_path = os.path.join(self.data_dir, f'{split}/png/{image_name}') 34 | ocr_path = os.path.join(self.ocr_dir, f'{image_name.replace(".png", ".json")}') 35 | try: 36 | img = Image.open(image_path) 37 | img_w, img_h = img.size 38 | if not os.path.exists(ocr_path): 39 | items = self.google_ocr.recognize_image(img) 40 | if items == "error": 41 | print('OCR error: ', image_path) 42 | continue 43 | with open(ocr_path, 'w') as f: 44 | json.dump(items, f) 45 | else: 46 | with open(ocr_path, 'r') as f: 47 | items = json.load(f) 48 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 49 | except: 50 | words, bboxes = [], [] 51 | 52 | question = d['query'] 53 | value = d['label'] 54 | instruction = random.choice(instructions) 55 | instruction = instruction.replace('', question) 56 | ocr = ' '.join(words) 57 | 58 | file_name = os.path.abspath(image_path) 59 | target_format.append({ 60 | "image": file_name, 61 | "ocr": ocr, 62 | "bboxes": bboxes, 63 | "conversations": [ 64 | {'from': 'human', 'value': instruction}, 65 | {'from': 'gpt', 'value': value}, 66 | ], 67 | }) 68 | 69 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 70 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 71 | 72 | print(f'{split}: {len(target_format)}') 73 | with open(out_filepath, "w") as 
f: 74 | json.dump(target_format, f) 75 | 76 | if __name__ == '__main__': 77 | parser = argparse.ArgumentParser() 78 | parser.add_argument('--input_data_dir', default='raw_datasets/chartqa', type=str) 79 | parser.add_argument('--out_data_dir', default='processed_data/chartqa', type=str) 80 | parser.add_argument('--api_key', type=str, help='google vision api key') 81 | args = parser.parse_args() 82 | 83 | dataset = InstructData(args) 84 | dataset.create_data() 85 | -------------------------------------------------------------------------------- /data_preprocessors/cord.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | 5 | from PIL import Image 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, sort_coordinate, load_instructions 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'cord' 17 | self.split = ['train', 'dev', 'test'] 18 | 19 | def create_data(self): 20 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 21 | for split in self.split: 22 | target_format = [] 23 | ann_dir = os.path.join(self.data_dir, f'{split}/json') 24 | img_dir = os.path.join(self.data_dir, f'{split}/image') 25 | for file in tqdm(sorted(os.listdir(ann_dir))): 26 | file_path = os.path.join(ann_dir, file) 27 | with open(file_path, 'r', encoding='utf-8') as f: 28 | data = json.load(f) 29 | 30 | image_path = os.path.join(img_dir, file) 31 | image_path = image_path.replace('.json', '.png') 32 | image = Image.open(image_path) 33 | w, h = image.size 34 | 35 | items = [] 36 | labels = {} 37 | for item in data["valid_line"]: 38 | words, label = item["words"], item["category"] 39 | words = [w for w in words if w["text"].strip() != ""] 40 | if len(words) == 0: 41 | continue 42 | text = " ".join([word["text"] for word in words]) 43 | bbox = [words[0]["quad"]["x1"], words[0]["quad"]["y1"], words[-1]["quad"]["x3"], words[-1]["quad"]["y3"]] 44 | bbox = normalize_bbox(bbox, w, h) 45 | items.append((text, label, bbox)) 46 | 47 | items = sort_coordinate(items) 48 | ocr = [] 49 | bboxes = [] 50 | for item in items: 51 | words, label, bbox = item 52 | labels[words] = label 53 | ocr.append(words) 54 | bbox = [bbox] * len(words.split()) 55 | bboxes += bbox 56 | ocr = ' '.join(ocr) 57 | 58 | for key in labels: 59 | instruction = random.choice(instructions) 60 | instruction = instruction.replace('', key) 61 | value = labels[key] 62 | 63 | file_name = os.path.abspath(image_path) 64 | target_format.append({ 65 | "image": file_name, 66 | "ocr": ocr, 67 | "bboxes": bboxes, 68 | "conversations": [ 69 | {'from': 'human', 'value': instruction}, 70 | {'from': 'gpt', 'value': value}, 71 | ], 72 | }) 73 | 74 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 75 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 76 | 77 | with open(out_filepath, "w") as f: 78 | json.dump(target_format, f) 79 | 80 | if __name__ == '__main__': 81 | parser = argparse.ArgumentParser() 82 | parser.add_argument('--input_data_dir', default='raw_datasets/cord/CORD', type=str) 83 | parser.add_argument('--out_data_dir', default='processed_data/cord', type=str) 84 | args = parser.parse_args() 85 | 86 | dataset = InstructData(args) 87 | dataset.create_data() 
-------------------------------------------------------------------------------- /data_preprocessors/deepform.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'deepform' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | break 43 | return ocrs 44 | 45 | def create_data(self): 46 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 47 | for split in self.split: 48 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 49 | with open(file_name, 'r') as f: 50 | data = f.readlines() 51 | 52 | ocrs = self.create_ocr_data(split) 53 | target_format = [] 54 | for d in tqdm(data): 55 | d = json.loads(d) 56 | image_name = d['name'].replace('.pdf', '') 57 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 58 | file_name = os.path.abspath(file_name) 59 | for ann in d['annotations']: 60 | instruction = random.choice(instructions) 61 | if 'children' in ann['values'][0]: 62 | for v in ann['values']: 63 | for child in v['children']: 64 | value = child['key'] 65 | key = child['values'][0]['value'] 66 | instruction = instruction.replace('', key) 67 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 68 | 69 | target_format.append({ 70 | "image": file_name, 71 | "ocr": ocr, 72 | "bboxes": bboxes, 73 | "conversations": [ 74 | {'from': 'human', 'value': instruction}, 75 | {'from': 'gpt', 'value': value}, 76 | ], 77 | }) 78 | 79 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 80 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 81 | 82 | print(f'{split}: {len(target_format)}') 83 | with open(out_filepath, "w") as f: 84 | json.dump(target_format, f) 85 | 86 | if __name__ == '__main__': 87 | parser = argparse.ArgumentParser() 88 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/DeepForm', type=str) 89 | parser.add_argument('--out_data_dir', default='processed_data/deepform', type=str) 90 | args = parser.parse_args() 91 | 92 | dataset = InstructData(args) 93 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/docbank.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL 
import Image 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, sort_coordinate, load_instructions, convert_wh 8 | from transformers import BertTokenizer 9 | from collections import defaultdict 10 | import argparse 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.dataset_name = 'docbank' 18 | self.split = ['train', 'valid', 'test'] 19 | 20 | def sort_coordinate(self, bboxes): 21 | return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]]) 22 | 23 | def create_ocr_data(self, data): 24 | ocr_info = {} 25 | for image_info in tqdm(data['images']): 26 | file_name = image_info['file_name'] 27 | image_id = image_info['id'] 28 | width, height = image_info['width'], image_info['height'] 29 | 30 | image_path = os.path.join(self.data_dir, f'DocBank_500K_ori_img/{file_name}') 31 | txt_path = os.path.join(self.data_dir, f'DocBank_500K_txt/{file_name.replace("_ori.jpg", ".txt")}') 32 | with open(txt_path, 'r') as f: 33 | txt_data = f.read().splitlines() 34 | 35 | words = [] 36 | bboxes = [] 37 | for d in txt_data: 38 | d = d.split('\t') 39 | word = d[0] 40 | word_position = convert_wh([int(d[1]), int(d[2]), int(d[3]), int(d[4])]) 41 | if word_position[0] >= word_position[2] or word_position[1] >= word_position[3]: 42 | continue 43 | words.append(word) 44 | bboxes.append(word_position) 45 | 46 | text_sequence = ' '.join(words) 47 | ocr_info[image_id] = {'image_path': image_path, 'text_sequence': text_sequence, 'bboxes': bboxes, 'width': width, 'height': height} 48 | return ocr_info 49 | 50 | def create_data(self): 51 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 52 | for split in self.split: 53 | with open(os.path.join(self.data_dir, f'500K_{split}.json'), "r") as f: 54 | data = json.load(f) 55 | 56 | ocr_info = self.create_ocr_data(data) 57 | categories = data['categories'] 58 | 59 | target_format = [] 60 | annotations = defaultdict(list) 61 | for ann_info in data['annotations']: 62 | image_id = ann_info['image_id'] 63 | annotations[image_id].append(ann_info) 64 | 65 | for image_id in tqdm(annotations): 66 | image_info = ocr_info[image_id] 67 | image_path = image_info['image_path'] 68 | text_sequence = image_info['text_sequence'] 69 | bboxes = image_info['bboxes'] 70 | width, height = image_info['width'], image_info['height'] 71 | 72 | items = [] 73 | for ann in annotations[image_id]: 74 | category_id = ann['category_id'] 75 | category_name = categories[category_id-1]['name'] 76 | bbox = ann['bbox'] 77 | bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]] 78 | bbox = normalize_bbox(bbox, width, height) 79 | items.append((category_name, bbox)) 80 | items = self.sort_coordinate(items) 81 | 82 | dla = [] 83 | for item in items: 84 | category_name, bbox = item 85 | dla.append(f'{category_name} {bbox}') 86 | value = ' '.join(dla) 87 | 88 | instruction = random.choice(instructions) 89 | file_name = os.path.abspath(image_path) 90 | 91 | target_format.append({ 92 | "image": file_name, 93 | "ocr": text_sequence, 94 | "bboxes": bboxes, 95 | "conversations": [ 96 | {'from': 'human','value': instruction}, 97 | {'from': 'gpt', 'value': value}, 98 | ], 99 | }) 100 | 101 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 102 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 103 | 104 | print(f'{split}: {len(target_format)}') 105 | with 
open(out_filepath, "w") as f: 106 | json.dump(target_format, f) 107 | 108 | if __name__ == '__main__': 109 | parser = argparse.ArgumentParser() 110 | parser.add_argument('--input_data_dir', default='raw_datasets/docbank', type=str) 111 | parser.add_argument('--out_data_dir', default='processed_data/docbank', type=str) 112 | args = parser.parse_args() 113 | 114 | dataset = InstructData(args) 115 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/docile.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import sort_coordinate, load_instructions, normalize_bbox 8 | import argparse 9 | from collections import defaultdict 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'docile' 17 | self.ann_dir = os.path.join(args.input_data_dir, f'annotations') 18 | self.img_dir = os.path.join(args.input_data_dir, f'images') 19 | self.ocr_dir = os.path.join(args.input_data_dir, f'ocr') 20 | self.split = ['train', 'val'] 21 | 22 | def extract_ocr_info(self, ocr_data): 23 | tokens = [] 24 | bboxes = [] 25 | for page in ocr_data['pages']: 26 | for block in page['blocks']: 27 | for line in block['lines']: 28 | for word in line['words']: 29 | left_top, right_bottom = word['geometry'] 30 | bbox = normalize_bbox([left_top[0], left_top[1], right_bottom[0], right_bottom[1]]) 31 | bboxes.append(bbox) 32 | tokens.append(word['value']) 33 | return tokens, bboxes 34 | 35 | def create_data(self): 36 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 37 | for split in self.split: 38 | file_name = os.path.join(self.data_dir, f'{split}.json') 39 | with open(file_name, 'r') as f: 40 | ann_filenames = json.load(f) 41 | 42 | target_format = [] 43 | for id, file in enumerate(tqdm(ann_filenames)): 44 | image_path = os.path.join(self.img_dir, file + '0001-1.jpg') 45 | with open(os.path.join(self.ocr_dir, f'{file}.json'), 'r', encoding='utf-8') as f: 46 | ocr_data = json.load(f) 47 | with open(os.path.join(self.ann_dir, f'{file}.json'), 'r', encoding='utf-8') as f: 48 | d = json.load(f) 49 | 50 | items = [] 51 | for item in d["field_extractions"]: 52 | if item["page"] == 0: 53 | text, label = item["text"], item["fieldtype"] 54 | bbox = item["bbox"] 55 | items.append((text, label, bbox)) 56 | if len(items) == 0: 57 | continue 58 | items = sort_coordinate(items) 59 | 60 | labels = {} 61 | for item in items: 62 | tokens, label, bbox = item 63 | labels[tokens] = label 64 | 65 | tokens, bboxes = self.extract_ocr_info(ocr_data) 66 | ocr = ' '.join(tokens) 67 | 68 | for key in labels: 69 | instruction = random.choice(instructions) 70 | instruction = instruction.replace('', key) 71 | value = labels[key] 72 | 73 | file_name = os.path.abspath(image_path) 74 | target_format.append({ 75 | "image": file_name, 76 | "ocr": ocr, 77 | "bboxes": bboxes, 78 | "conversations": [ 79 | {'from': 'human', 'value': instruction}, 80 | {'from': 'gpt', 'value': value}, 81 | ], 82 | }) 83 | 84 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 85 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 86 | 87 | print(f'{split}: {len(target_format)}') 88 | with 
open(out_filepath, "w") as f: 89 | json.dump(target_format, f) 90 | 91 | if __name__ == '__main__': 92 | parser = argparse.ArgumentParser() 93 | parser.add_argument('--input_data_dir', default='raw_datasets/docile/data/docile', type=str) 94 | parser.add_argument('--out_data_dir', default='processed_data/docile', type=str) 95 | args = parser.parse_args() 96 | 97 | dataset = InstructData(args) 98 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/doclaynet.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, sort_coordinate, load_instructions, convert_wh 8 | from collections import defaultdict 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'doclaynet' 17 | self.split = ['train', 'val'] 18 | 19 | def sort_coordinate(self, bboxes): 20 | return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]]) 21 | 22 | def create_ocr_data(self, data): 23 | ocr_info = {} 24 | for image_info in data['images']: 25 | file_name = image_info['file_name'] 26 | image_id = image_info['id'] 27 | image_path = os.path.join(self.data_dir, f'PNG/{file_name}') 28 | json_path = os.path.join(self.data_dir, f'JSON/{file_name.replace(".png", ".json")}') 29 | width, height = image_info['width'], image_info['height'] 30 | with open(json_path, 'r') as f: 31 | json_data = json.load(f) 32 | items = [] 33 | 34 | for cell in json_data['cells']: 35 | text = cell['text'] 36 | bbox = cell['bbox'] 37 | bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3]] 38 | bbox = convert_wh(normalize_bbox(bbox, width, height)) 39 | items.append((text, bbox)) 40 | 41 | items = self.sort_coordinate(items) 42 | words = [] 43 | bboxes = [] 44 | for text, bbox in items: 45 | words.append(text) 46 | bboxes += bbox 47 | text_sequence = ' '.join(words) 48 | ocr_info[image_id] = {'image_path': image_path, 'text_sequence': text_sequence, 'bboxes': bboxes, 'width': width, 'height': height} 49 | break 50 | return ocr_info 51 | 52 | def create_data(self): 53 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 54 | for split in self.split: 55 | with open(os.path.join(self.data_dir, f'COCO/{split}.json'), "r") as f: 56 | data = json.load(f) 57 | ocr_info = self.create_ocr_data(data) 58 | categories = data['categories'] 59 | 60 | target_format = [] 61 | annotations = defaultdict(list) 62 | for ann_info in data['annotations']: 63 | image_id = ann_info['image_id'] 64 | annotations[image_id].append(ann_info) 65 | 66 | for image_id in tqdm(annotations): 67 | image_info = ocr_info[image_id] 68 | image_path = image_info['image_path'] 69 | text_sequence = image_info['text_sequence'] 70 | bboxes = image_info['bboxes'] 71 | width, height = image_info['width'], image_info['height'] 72 | 73 | items = [] 74 | for ann in annotations[image_id]: 75 | category_id = ann['category_id'] 76 | category_name = categories[category_id-1]['name'] 77 | bbox = ann['bbox'] 78 | bbox = [bbox[0], bbox[1], bbox[0]+bbox[2], bbox[1]+bbox[3], bbox[2], bbox[3]] 79 | bbox = normalize_bbox(bbox, width, height) 80 | items.append((category_name, bbox)) 81 | items = self.sort_coordinate(items) 82 | 83 | 
dla = [] 84 | for item in items: 85 | category_name, bbox = item 86 | dla.append(f'{category_name} {bbox}') 87 | value = ' '.join(dla) 88 | 89 | instruction = random.choice(instructions) 90 | file_name = os.path.abspath(image_path) 91 | 92 | target_format.append({ 93 | "image": file_name, 94 | "ocr": text_sequence, 95 | "bboxes": bboxes, 96 | "conversations": [ 97 | {'from': 'human','value': instruction}, 98 | {'from': 'gpt', 'value': value}, 99 | ], 100 | }) 101 | 102 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 103 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 104 | 105 | print(f'{split}: {len(target_format)}') 106 | with open(out_filepath, "w") as f: 107 | json.dump(target_format, f) 108 | 109 | if __name__ == '__main__': 110 | parser = argparse.ArgumentParser() 111 | parser.add_argument('--input_data_dir', default='raw_datasets/doclaynet', type=str) 112 | parser.add_argument('--out_data_dir', default='processed_data/doclaynet', type=str) 113 | args = parser.parse_args() 114 | 115 | dataset = InstructData(args) 116 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/docvqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'docvqa' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | return ocrs 43 | 44 | def create_data(self): 45 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 46 | for split in self.split: 47 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 48 | with open(file_name, 'r') as f: 49 | data = f.readlines() 50 | 51 | ocrs = self.create_ocr_data(split) 52 | target_format = [] 53 | for d in tqdm(data): 54 | d = json.loads(d) 55 | image_name = d['name'].replace('.pdf', '') 56 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 57 | file_name = os.path.abspath(file_name) 58 | for ann in d['annotations']: 59 | instruction = random.choice(instructions) 60 | question = ann['key'] 61 | instruction = instruction.replace('', question) 62 | bboxes = [] 63 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 64 | value = ann['values'][0]['value'] 65 | values = ann['values'][0]['value_variants'] 66 | 67 | target_format.append({ 68 | "image": file_name, 69 
| "ocr": ocr, 70 | "bboxes": bboxes, 71 | "conversations": [ 72 | {'from': 'human', 'instruction': instruction}, 73 | {'from': 'gpt', 'value': value, 'values': values}, 74 | ], 75 | }) 76 | 77 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 78 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 79 | 80 | print(f'{split}: {len(target_format)}') 81 | with open(out_filepath, "w") as f: 82 | json.dump(target_format, f) 83 | 84 | if __name__ == '__main__': 85 | parser = argparse.ArgumentParser() 86 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/docvqa', type=str) 87 | parser.add_argument('--out_data_dir', default='processed_data/docvqa', type=str) 88 | args = parser.parse_args() 89 | 90 | dataset = InstructData(args) 91 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/docvqa_iq.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'docvqa_iq' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | return ocrs 43 | 44 | def create_data(self): 45 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 46 | for split in self.split: 47 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 48 | with open(file_name, 'r') as f: 49 | data = f.readlines() 50 | 51 | ocrs = self.create_ocr_data(split) 52 | target_format = [] 53 | questions = [] 54 | for d in data: 55 | d = json.loads(d) 56 | for ann in d['annotations']: 57 | question = ann['key'] 58 | questions.append(question) 59 | 60 | for d in tqdm(data): 61 | d = json.loads(d) 62 | image_name = d['name'].replace('.pdf', '') 63 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 64 | file_name = os.path.abspath(file_name) 65 | for ann in d['annotations']: 66 | instruction = random.choice(instructions) 67 | if random.random() > 0.5: 68 | question = random.choice(questions) 69 | value = 'no' 70 | else: 71 | question = ann['key'] 72 | value = 'yes' 73 | 74 | instruction = instruction.replace('', question) 75 | bboxes = [] 76 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 77 | 78 | target_format.append({ 79 | "image": file_name, 80 | "ocr": ocr, 81 | "bboxes": bboxes, 82 | "conversations": [ 83 | {'from': 
'human', 'instruction': instruction}, 84 | {'from': 'gpt', 'value': value}, 85 | ], 86 | }) 87 | 88 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 89 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 90 | 91 | print(f'{split}: {len(target_format)}') 92 | with open(out_filepath, "w") as f: 93 | json.dump(target_format, f) 94 | 95 | if __name__ == '__main__': 96 | parser = argparse.ArgumentParser() 97 | parser.add_argument('--input_data_dir', default='raw_datasets/docvqa', type=str) 98 | parser.add_argument('--out_data_dir', default='processed_data/docvqa_iq', type=str) 99 | args = parser.parse_args() 100 | 101 | dataset = InstructData(args) 102 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/dude.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import glob 6 | 7 | from PIL import Image 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import normalize_bbox, load_instructions 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.image_dir = os.path.join(args.input_data_dir, 'DUDE_train-val-test_binaries/images') 18 | self.ocr_dir = os.path.join(args.input_data_dir, 'DUDE_train-val-test_binaries/OCR') 19 | self.dataset_name = 'dude' 20 | self.split = ['train', 'val'] 21 | 22 | def create_data(self): 23 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 24 | file_name = os.path.join(self.data_dir, '2023-03-23_DUDE_gt_test_PUBLIC.json') 25 | with open(file_name, 'r') as f: 26 | data = json.load() 27 | train, validation = [],[] 28 | for d in tqdm(data['data']): 29 | docid = d['docId'] 30 | question = d['question'] 31 | split = d['data_split'] 32 | if split in self.split: 33 | image_paths = [] 34 | pages = len(glob.glob(os.path.join(self.image_dir, split, f'{docid}_*.jpg'))) 35 | for i in range(pages): 36 | image_path = os.path.join(self.image_dir, split, f'{docid}_{i}.jpg') 37 | image_path = os.path.abspath(image_path) 38 | image_paths.append(image_path) 39 | 40 | ocr_path =os.path.join(self.ocr_dir, f'Azure/{docid}_due.json') 41 | try: 42 | with open(ocr_path, 'r') as f: 43 | ocr_info = json.load(f) 44 | except: 45 | continue 46 | 47 | structure_value = ocr_info['structures']['pages']['structure_value'] 48 | image_sizes = ocr_info['structures']['pages']['positions'] 49 | text_sequences = [] 50 | bboxes = [] 51 | for page_split, image_size in zip(structure_value, image_sizes): 52 | start = page_split[0] 53 | end = page_split[1] 54 | page_tokens = ' '.join(ocr_info['tokens'][start:end]) 55 | page_bboxes = [] 56 | for bbox in ocr_info['positions'][start:end]: 57 | bbox = normalize_bbox(bbox, (image_size[2], image_size[3])) 58 | page_bboxes.append(bbox) 59 | text_sequences.append(page_tokens) 60 | bboxes.append(page_bboxes) 61 | 62 | if len(text_sequences) != len(image_paths): 63 | continue 64 | 65 | instruction = random.choice(instructions) 66 | instruction = instruction.replace('', question) 67 | if 'answers' in d: 68 | value = d['answers'][0] 69 | if d['answer_type'] == 'not-answerable': 70 | d['answers'] = 'none' 71 | else: 72 | value = '' 73 | 74 | file_name = os.path.abspath(image_path) 75 | sample = { 76 | "image_list": image_paths, 77 | "ocr_list": text_sequences, 78 | 
"bboxes_list": bboxes, 79 | "conversations": [ 80 | {'from': 'human', 'value': instruction}, 81 | {'from': 'gpt', 'value': value}, 82 | ], 83 | } 84 | 85 | if split == 'train': 86 | train.append(sample) 87 | elif split == 'val': 88 | validation.append(sample) 89 | 90 | for split, target_format in [('train', train), ('validation', validation)]: 91 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 92 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 93 | 94 | print(f'{split}: {len(target_format)}') 95 | with open(out_filepath, "w") as f: 96 | json.dump(target_format, f) 97 | 98 | if __name__ == '__main__': 99 | parser = argparse.ArgumentParser() 100 | parser.add_argument('--input_data_dir', default='raw_datasets/dude', type=str) 101 | parser.add_argument('--out_data_dir', default='processed_data/dude', type=str) 102 | args = parser.parse_args() 103 | 104 | dataset = InstructData(args) 105 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/funsd.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import cv2 5 | from PIL import Image, ImageSequence 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, sort_coordinate, load_instructions 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'funsd' 17 | self.split = ['training', 'testing'] 18 | self.label_mapping = {'header': 'title', 19 | 'question': 'key', 20 | 'answer': 'value', 21 | 'other': 'other'} 22 | 23 | def create_data(self): 24 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 25 | for split in self.split: 26 | target_format = [] 27 | ann_dir = os.path.join(self.data_dir, f'{split}_data/annotations') 28 | img_dir = os.path.join(self.data_dir, f'{split}_data/images') 29 | for i, file in enumerate(tqdm(sorted(os.listdir(ann_dir)))): 30 | file_path = os.path.join(ann_dir, file) 31 | with open(file_path, 'r', encoding='utf-8') as f: 32 | data = json.load(f) 33 | 34 | image_path = os.path.join(img_dir, file) 35 | image_path = image_path.replace('.json', '.png') 36 | image = cv2.imread(image_path) 37 | h, w, _ = image.shape 38 | 39 | items = [] 40 | for item in data["form"]: 41 | text = item['text'] 42 | words, label = item["words"], item["label"] 43 | label = self.label_mapping[label] 44 | words = [w for w in words if w["text"].strip() != ""] 45 | if len(words) == 0: 46 | continue 47 | start_bbox, end_bbox = words[0]['box'], words[-1]['box'] 48 | bbox = [start_bbox[0], start_bbox[1], end_bbox[2], start_bbox[3]] 49 | bbox = normalize_bbox(bbox, w, h) 50 | items.append((text, label, bbox)) 51 | items = sort_coordinate(items) 52 | 53 | text_sequence = [] 54 | bboxes = [] 55 | labels = {} 56 | for item in items: 57 | text, label, bbox = item 58 | labels[text] = label 59 | text_sequence.append(text) 60 | bbox = [bbox] * len(text) 61 | bboxes += bbox 62 | 63 | ocr = ' '.join(text_sequence) 64 | for key in labels: 65 | instruction = random.choice(instructions) 66 | instruction = instruction.replace('', key) 67 | value = labels[key] 68 | 69 | file_name = os.path.abspath(image_path) 70 | target_format.append({ 71 | "image": file_name, 72 | "ocr": ocr, 73 | "bboxes": bboxes, 74 | "conversations": [ 75 | 
{'from': 'human', 'value': instruction}, 76 | {'from': 'gpt', 'value': value}, 77 | ], 78 | }) 79 | 80 | split = split.replace('ing', '') 81 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 82 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 83 | 84 | with open(out_filepath, "w") as f: 85 | json.dump(target_format, f) 86 | 87 | if __name__ == '__main__': 88 | parser = argparse.ArgumentParser() 89 | parser.add_argument('--input_data_dir', default='raw_datasets/funsd/dataset', type=str) 90 | parser.add_argument('--out_data_dir', default='processed_data/funsd', type=str) 91 | args = parser.parse_args() 92 | 93 | dataset = InstructData(args) 94 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/google_vision_ocr.py: -------------------------------------------------------------------------------- 1 | import base64 2 | import json 3 | from requests import Request, Session 4 | from io import BytesIO 5 | from utils import normalize_bbox 6 | 7 | class Google_OCR: 8 | def __init__(self, api_key): 9 | self.api_key = api_key 10 | 11 | def pil_image_to_base64(self, pil_image): 12 | buffered = BytesIO() 13 | pil_image.save(buffered, format="PNG") 14 | str_encode_file = base64.b64encode(buffered.getvalue()).decode("utf-8") 15 | return str_encode_file 16 | 17 | def recognize_image(self, pil_image): 18 | str_encode_file = self.pil_image_to_base64(pil_image) 19 | str_url = "https://vision.googleapis.com/v1/images:annotate?key=" 20 | str_headers = {'Content-Type': 'application/json'} 21 | str_json_data = { 22 | 'requests': [ 23 | { 24 | 'image': { 25 | 'content': str_encode_file 26 | }, 27 | 'features': [ 28 | { 29 | 'type': "TEXT_DETECTION", 30 | } 31 | ] 32 | } 33 | ] 34 | } 35 | 36 | obj_session = Session() 37 | obj_request = Request("POST", 38 | str_url + self.api_key, 39 | data=json.dumps(str_json_data), 40 | headers=str_headers 41 | ) 42 | obj_prepped = obj_session.prepare_request(obj_request) 43 | obj_response = obj_session.send(obj_prepped, 44 | verify=True, 45 | timeout=60 46 | ) 47 | 48 | if obj_response.status_code == 200: 49 | return obj_response.json() 50 | 51 | else: 52 | return "error" 53 | 54 | def extract_info(self, items, img_w, img_h): 55 | words = [] 56 | bboxes = [] 57 | for page_ocrs in items['responses'][0]['fullTextAnnotation']['pages']: 58 | for block_ocrs in page_ocrs['blocks']: 59 | for para_ocrs in block_ocrs['paragraphs']: 60 | for word_ocrs in para_ocrs['words']: 61 | char_bboxes = [] 62 | word = '' 63 | for sym_ocrs in word_ocrs['symbols']: 64 | try: 65 | bbox = sym_ocrs['boundingBox'] 66 | xmin = max(0, bbox['vertices'][0]['x']) 67 | ymin = max(0, bbox['vertices'][0]['y']) 68 | xmax = max(0, bbox['vertices'][2]['x']) 69 | ymax = max(0, bbox['vertices'][2]['y']) 70 | bbox = [xmin, ymin, xmax, ymax] 71 | except: 72 | continue 73 | word += sym_ocrs['text'] 74 | char_bboxes.append(bbox) 75 | if len(char_bboxes) > 0: 76 | x1 = [w_p[0] for w_p in char_bboxes] 77 | y1 = [w_p[1] for w_p in char_bboxes] 78 | x2 = [w_p[2] for w_p in char_bboxes] 79 | y2 = [w_p[3] for w_p in char_bboxes] 80 | word_bbox = [min(x1), min(y1), max(x2), max(y2)] 81 | if word_bbox[0] >= word_bbox[2] or word_bbox[1] >= word_bbox[3]: 82 | continue 83 | word_bbox = normalize_bbox(word_bbox, img_w, img_h) 84 | words.append(word) 85 | bboxes.append(word_bbox) 86 | return words, bboxes 87 | -------------------------------------------------------------------------------- /data_preprocessors/hwsquad.py: 
-------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import csv 6 | 7 | from PIL import Image, ImageSequence 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import normalize_bbox, load_instructions 11 | from collections import defaultdict 12 | 13 | class InstructData: 14 | def __init__(self, args): 15 | self.instruction_path = Path('instructdoc_instructions.xlsx') 16 | self.data_dir = args.input_data_dir 17 | self.out_data_dir = args.out_data_dir 18 | self.dataset_name = 'hwsquad' 19 | self.split = ['train', 'val', 'test'] 20 | 21 | def create_data(self): 22 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 23 | for split in self.split: 24 | filename = os.path.join(self.data_dir, f"HW-SQuAD_{split}_1.0.json") 25 | with open(filename, "r") as f: 26 | annotations = json.load(f) 27 | 28 | target_format = [] 29 | for ann in tqdm(annotations["data"]): 30 | qas = ann["qas"] 31 | image_path = ann["document_image"]["document_image"] 32 | h, w = ann["document_image"]["image_height"], ann["document_image"]["image_width"] 33 | 34 | words = [] 35 | bboxes = [] 36 | for item in ann["document_image"]["gold_standard_transcription"]: 37 | word = item["text"] 38 | words.append(word) 39 | bbox = [item["xmin"], item["ymin"], item["xmax"], item["ymax"]] 40 | bbox = normalize_bbox(bbox, w, h) 41 | bboxes.append(bbox) 42 | 43 | for qa in qas: 44 | question = qa["question"] 45 | start_index, end_index = qa["answers"][0]["answer_start_word_no"], qa["answers"][0]["answer_end_word_no"]+1 46 | answer = words[start_index:end_index] 47 | answer = " ".join(answer) 48 | 49 | instruction = random.choice(instructions) 50 | instruction = instruction.replace('', question) 51 | ocr = ' '.join(words) 52 | 53 | file_name = os.path.abspath(image_path) 54 | target_format.append({ 55 | "image": file_name, 56 | "ocr": ocr, 57 | "bboxes": bboxes, 58 | "conversations": [ 59 | {'from': 'human', 'value': instruction}, 60 | {'from': 'gpt', 'value': answer}, 61 | ], 62 | }) 63 | 64 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 65 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 66 | 67 | print(f'{split}: {len(target_format)}') 68 | with open(out_filepath, "w") as f: 69 | json.dump(target_format, f) 70 | 71 | if __name__ == '__main__': 72 | parser = argparse.ArgumentParser() 73 | parser.add_argument('--input_data_dir', default='raw_datasets/HW-SQuAD/HW-SQuAD_annotations', type=str) 74 | parser.add_argument('--out_data_dir', default='processed_data/hwsquad', type=str) 75 | args = parser.parse_args() 76 | 77 | dataset = InstructData(args) 78 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/iconqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import glob 5 | from PIL import Image 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, load_instructions 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'iconqa' 17 | self.split = ['train', 'val'] 18 | 19 | def create_data(self): 20 | for split in self.split: 21 | for answer_style in ['fill_in_blank', 
'choose_txt']: 22 | target_format = [] 23 | dataset_name = f'{self.dataset_name}_{answer_style}' 24 | instructions = load_instructions(self.instruction_path)[dataset_name] 25 | 26 | data_dir = os.path.join(self.data_dir, f'{split}/{answer_style}/*') 27 | for file_path in glob.glob(data_dir): 28 | data_path = os.path.join(file_path, 'data.json') 29 | image_path = os.path.join(file_path, 'image.png') 30 | with open(data_path, 'r') as f: 31 | data = json.load(f) 32 | question = data['question'] 33 | instruction = random.choice(instructions) 34 | instruction = instruction.replace('', question) 35 | if answer_style == 'fill_in_blank': 36 | value = data['answer'] 37 | else: 38 | options = data['choices'] 39 | answer_index = data['answer'] 40 | value = str(options[answer_index]) 41 | instruction = instruction.replace('', options) 42 | 43 | file_name = os.path.abspath(image_path) 44 | target_format.append({ 45 | "image": file_name, 46 | "conversations": [ 47 | {'from': 'human', 'value': instruction}, 48 | {'from': 'gpt', 'value': f"{value}"}, 49 | ], 50 | }) 51 | 52 | out_data_dir = f'{self.out_data_dir}_{answer_style}' 53 | out_filepath = os.path.join(out_data_dir, f'{split}.json') 54 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 55 | 56 | print(f'{split}: {len(target_format)}') 57 | with open(out_filepath, "w") as f: 58 | json.dump(target_format, f) 59 | 60 | if __name__ == '__main__': 61 | parser = argparse.ArgumentParser() 62 | parser.add_argument('--input_data_dir', default='raw_datasets/iconqa/iconqa_data', type=str) 63 | parser.add_argument('--out_data_dir', default='processed_data/iconqa', type=str) 64 | args = parser.parse_args() 65 | 66 | dataset = InstructData(args) 67 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/infographicvqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'infographicvqa' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | return ocrs 43 | 44 | def create_data(self): 45 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 46 | for split in self.split: 47 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 48 | with open(file_name, 'r') as f: 49 | data = f.readlines() 50 | 51 
| ocrs = self.create_ocr_data(split) 52 | target_format = [] 53 | for d in tqdm(data): 54 | d = json.loads(d) 55 | image_name = d['name'].replace('.pdf', '') 56 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 57 | file_name = os.path.abspath(file_name) 58 | for ann in d['annotations']: 59 | instruction = random.choice(instructions) 60 | question = ann['key'] 61 | instruction = instruction.replace('', question) 62 | bboxes = [] 63 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 64 | value = ann['values'][0]['value'] 65 | values = ann['values'][0]['value_variants'] 66 | 67 | target_format.append({ 68 | "image": file_name, 69 | "ocr": ocr, 70 | "bboxes": bboxes, 71 | "conversations": [ 72 | {'from': 'human', 'instruction': instruction}, 73 | {'from': 'gpt', 'value': value, 'values': values}, 74 | ], 75 | }) 76 | 77 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 78 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 79 | 80 | print(f'{split}: {len(target_format)}') 81 | with open(out_filepath, "w") as f: 82 | json.dump(target_format, f) 83 | 84 | if __name__ == '__main__': 85 | parser = argparse.ArgumentParser() 86 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/infographics_vqa', type=str) 87 | parser.add_argument('--out_data_dir', default='processed_data/infographicvqa', type=str) 88 | args = parser.parse_args() 89 | 90 | dataset = InstructData(args) 91 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/klc.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'klc' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | break 43 | return ocrs 44 | 45 | def create_data(self): 46 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 47 | for split in self.split: 48 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 49 | with open(file_name, 'r') as f: 50 | data = f.readlines() 51 | 52 | ocrs = self.create_ocr_data(split) 53 | target_format = [] 54 | for d in tqdm(data): 55 | d = json.loads(d) 56 | image_name = d['name'].replace('.pdf', '') 57 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 58 | file_name = os.path.abspath(file_name) 59 | for ann in 
d['annotations']: 60 | instruction = random.choice(instructions) 61 | if 'children' in ann['values'][0]: 62 | for v in ann['values']: 63 | for child in v['children']: 64 | value = child['key'] 65 | key = child['values'][0]['value'] 66 | instruction = instruction.replace('', key) 67 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 68 | 69 | target_format.append({ 70 | "image": file_name, 71 | "ocr": ocr, 72 | "bboxes": bboxes, 73 | "conversations": [ 74 | {'from': 'human', 'value': instruction}, 75 | {'from': 'gpt', 'value': value}, 76 | ], 77 | }) 78 | 79 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 80 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 81 | 82 | print(f'{split}: {len(target_format)}') 83 | with open(out_filepath, "w") as f: 84 | json.dump(target_format, f) 85 | 86 | if __name__ == '__main__': 87 | parser = argparse.ArgumentParser() 88 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/kleister-charity', type=str) 89 | parser.add_argument('--out_data_dir', default='processed_data/klc', type=str) 90 | args = parser.parse_args() 91 | 92 | dataset = InstructData(args) 93 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/llavar.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image, ImageSequence 7 | from tqdm import tqdm 8 | from pathlib import Path 9 | from utils import normalize_bbox, load_instructions 10 | from collections import defaultdict 11 | from google_vision_ocr import Google_OCR 12 | 13 | class InstructData: 14 | def __init__(self, args): 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 18 | self.image_dir = os.path.join(args.input_data_dir, 'images') 19 | self.google_ocr = Google_OCR(args.api_key) 20 | os.makedirs(self.ocr_dir, exist_ok=True) 21 | 22 | def create_data(self): 23 | file_name = os.path.join(self.data_dir, 'llava_instruct_150k_llavar_20k.json') 24 | with open(file_name, 'r') as f: 25 | data = json.load(f) 26 | target_format = [] 27 | for d in data: 28 | image_name = d["image"] 29 | image_path = os.path.join(self.image_dir, image_name) 30 | if not os.path.exists(image_path): 31 | continue 32 | 33 | ocr_path = os.path.join(self.ocr_dir, f"{image_name.replace('.jpg', '.json')}") 34 | try: 35 | img = Image.open(image_path) 36 | img_w, img_h = img.size 37 | if not os.path.exists(ocr_path): 38 | items = self.google_ocr.recognize_image(img) 39 | if items == 'error': 40 | print('OCR error: ', image_path) 41 | continue 42 | with open(ocr_path, 'w') as f: 43 | json.dump(items, f) 44 | else: 45 | with open(ocr_path, 'r') as f: 46 | items = json.load(f) 47 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 48 | except: 49 | words, bboxes = [], [] 50 | 51 | ocr = ' '.join(words) 52 | file_name = os.path.abspath(image_path) 53 | d["image"] = file_name 54 | d["ocr"] = ocr 55 | d["bboxes"] = bboxes 56 | target_format.append(d) 57 | 58 | out_filepath = os.path.join(self.out_data_dir, 'train.json') 59 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 60 | 61 | print(f'train: {len(target_format)}') 62 | with open(out_filepath, "w") as f: 63 | json.dump(target_format, f) 64 | 65 | if __name__ == '__main__': 66 | parser = argparse.ArgumentParser() 67 | 
parser.add_argument('--input_data_dir', default='raw_datasets/llavar', type=str) 68 | parser.add_argument('--out_data_dir', default='processed_data/llavar', type=str) 69 | parser.add_argument('--api_key', default='API_KEY', type=str) 70 | args = parser.parse_args() 71 | 72 | dataset = InstructData(args) 73 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/ocrvqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import csv 6 | 7 | from PIL import Image, ImageSequence 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import normalize_bbox, load_instructions 11 | from collections import defaultdict 12 | from google_vision_ocr import Google_OCR 13 | 14 | class InstructData: 15 | def __init__(self, args): 16 | self.instruction_path = Path('instructdoc_instructions.xlsx') 17 | self.data_dir = args.input_data_dir 18 | self.out_data_dir = args.out_data_dir 19 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 20 | self.image_dir = os.path.join(args.input_data_dir, 'images') 21 | self.dataset_name = 'ocrvqa' 22 | self.google_ocr = Google_OCR(args.api_key) 23 | self.split = ['train', 'dev', 'test'] 24 | self.split_dict = {1: 'train', 2: 'dev', 3: 'test'} 25 | os.makedirs(self.ocr_dir, exist_ok=True) 26 | 27 | def create_data(self): 28 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 29 | for split in self.split: 30 | target_format = [] 31 | file_name = os.path.join(self.data_dir, 'dataset.json') 32 | with open(file_name, 'r') as f: 33 | data = json.load(f) 34 | for image_id in tqdm(data): 35 | d = data[image_id] 36 | split_id = d['split'] 37 | if split != self.split_dict[split_id]: 38 | continue 39 | image_path = os.path.join(self.image_dir, f'{image_id}.jpg') 40 | if not os.path.exists(image_path): 41 | continue 42 | 43 | ocr_path = os.path.join(self.ocr_dir, f"{image_id}.json") 44 | try: 45 | img = Image.open(image_path) 46 | img_w, img_h = img.size 47 | if not os.path.exists(ocr_path): 48 | items = self.google_ocr.recognize_image(img) 49 | if items == "error": 50 | print('error: ', image_path) 51 | continue 52 | with open(ocr_path, 'w') as f: 53 | json.dump(items, f) 54 | else: 55 | with open(ocr_path, 'r') as f: 56 | items = json.load(f) 57 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 58 | except: 59 | words, bboxes = [], [] 60 | 61 | ocr = ' '.join(words) 62 | file_name = os.path.abspath(image_path) 63 | for question, answer in zip(d['questions'], d['answers']): 64 | instruction = random.choice(instructions) 65 | instruction = instruction.replace('', question) 66 | target_format.append({ 67 | "image": file_name, 68 | "ocr": ocr, 69 | "bboxes": bboxes, 70 | "conversations": [ 71 | {'from': 'human', 'value': instruction}, 72 | {'from': 'gpt', 'value': answer}, 73 | ], 74 | }) 75 | 76 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 77 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 78 | 79 | print(f'{split}: {len(target_format)}') 80 | with open(out_filepath, "w") as f: 81 | json.dump(target_format, f) 82 | 83 | if __name__ == '__main__': 84 | parser = argparse.ArgumentParser() 85 | parser.add_argument('--input_data_dir', default='raw_datasets/OCR-VQA-200K', type=str) 86 | parser.add_argument('--out_data_dir', default='processed_data/ocrvqa', type=str) 87 | parser.add_argument('--api_key', default='API_KEY', 
type=str) 88 | args = parser.parse_args() 89 | 90 | dataset = InstructData(args) 91 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/pwc.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'pwc' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | pass 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | break 43 | return ocrs 44 | 45 | def create_data(self): 46 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 47 | for split in self.split: 48 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 49 | with open(file_name, 'r') as f: 50 | data = f.readlines() 51 | 52 | ocrs = self.create_ocr_data(split) 53 | target_format = [] 54 | for d in tqdm(data): 55 | d = json.loads(d) 56 | image_name = d['name'].replace('.pdf', '') 57 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 58 | file_name = os.path.abspath(file_name) 59 | for ann in d['annotations']: 60 | instruction = random.choice(instructions) 61 | if 'children' in ann['values'][0]: 62 | for v in ann['values']: 63 | for child in v['children']: 64 | value = child['key'] 65 | key = child['values'][0]['value'] 66 | instruction = instruction.replace('', key) 67 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 68 | 69 | target_format.append({ 70 | "image": file_name, 71 | "ocr": ocr, 72 | "bboxes": bboxes, 73 | "conversations": [ 74 | {'from': 'human', 'value': instruction}, 75 | {'from': 'gpt', 'value': value}, 76 | ], 77 | }) 78 | 79 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 80 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 81 | 82 | print(f'{split}: {len(target_format)}') 83 | with open(out_filepath, "w") as f: 84 | json.dump(target_format, f) 85 | 86 | if __name__ == '__main__': 87 | parser = argparse.ArgumentParser() 88 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/AxCell', type=str) 89 | parser.add_argument('--out_data_dir', default='processed_data/pwc', type=str) 90 | args = parser.parse_args() 91 | 92 | dataset = InstructData(args) 93 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/rvlcdip.py: 
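Reference note: infographicvqa.py, klc.py, and pwc.py above (and tabfact.py and wtq.py below) repeat essentially the same create_ocr_data logic for DUE-format annotations; one difference is that klc.py and pwc.py break out of the read loop after the first record, so only one document's OCR is kept there. The standalone sketch below restates the shared pattern, assuming documents_content.jsonl follows the layout those scripts read.

import json
from utils import normalize_bbox

def load_due_ocr(jsonl_path):
    """Map document name -> (ocr text, normalized word boxes) for a DUE-format file."""
    ocrs = {}
    with open(jsonl_path, "r") as f:
        for line in f:
            d = json.loads(line)
            name = d["name"].replace(".pdf", "")
            # Prefer the second OCR source (Microsoft CV); fall back to the first (Tesseract).
            content = d["contents"][1] if len(d["contents"]) > 1 else d["contents"][0]
            tokens, bboxes = [], []
            try:
                _, _, w, h = content["common_format"]["structures"]["pages"]["positions"][0]
                for token, bbox in zip(content["common_format"]["tokens"],
                                       content["common_format"]["positions"]):
                    bboxes.append(normalize_bbox(bbox, w, h))
                    tokens.append(token)
            except (KeyError, IndexError, ValueError):
                pass
            ocrs[name] = (" ".join(tokens), bboxes)
    return ocrs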
-------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image 7 | from tqdm import tqdm 8 | from pathlib import Path 9 | from utils import load_instructions 10 | from google_vision_ocr import Google_OCR 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 18 | self.image_dir = os.path.join(args.input_data_dir, 'images') 19 | self.dataset_name = 'rvlcdip' 20 | self.google_ocr = Google_OCR(args.api_key) 21 | self.split = ['train', 'val', 'test'] 22 | self.class_dict = {'4': "advertisement", '10': "budget", '2': "email", 23 | '8': "file_folder", '1': "form", '3': "handwritten", 24 | '11': "invoice", '0': "letter", '15': "memo", 25 | '9': "news_article", '12': "presentation", '13': "questionnaire", 26 | '14': "resume", '6': "scientific_publication", '5':"scientific_report", 27 | '7': "specification"} 28 | 29 | def create_data(self): 30 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 31 | for split in self.split: 32 | target_format = [] 33 | with open(os.path.join(self.data_dir, f'labels/{split}.txt'), 'r') as f: 34 | labels = f.read().splitlines() 35 | for label in labels: 36 | filename, label = label.split(' ') 37 | value = self.class_dict[label] 38 | image_path = os.path.join(self.image_dir, filename) 39 | ocr_path = os.path.join(self.ocr_dir, f'{file_name.replace(".tif", ".json")}') 40 | try: 41 | img = Image.open(image_path) 42 | img_w, img_h = img.size 43 | if not os.path.exists(ocr_path): 44 | items = self.google_ocr.recognize_image(img) 45 | if items == "error": 46 | print('OCR error: ', image_path) 47 | continue 48 | os.makedirs(os.dirname(ocr_path), exist_ok=True) 49 | with open(ocr_path, 'w') as f: 50 | json.dump(items, f) 51 | else: 52 | with open(ocr_path, 'r') as f: 53 | items = json.load(f) 54 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 55 | except: 56 | words, bboxes = [], [] 57 | 58 | instruction = random.choice(instructions) 59 | ocr = ' '.join(words) 60 | 61 | file_name = os.path.abspath(image_path) 62 | target_format.append({ 63 | "image": file_name, 64 | "ocr": ocr, 65 | "bboxes": bboxes, 66 | "conversations": [ 67 | {'from': 'human', 'value': instruction}, 68 | {'from': 'gpt', 'value': value}, 69 | ], 70 | }) 71 | 72 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 73 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 74 | 75 | print(f'{split}: {len(target_format)}') 76 | with open(out_filepath, "w") as f: 77 | json.dump(target_format, f) 78 | 79 | if __name__ == '__main__': 80 | parser = argparse.ArgumentParser() 81 | parser.add_argument('--input_data_dir', default='raw_datasets/rvlcdip', type=str) 82 | parser.add_argument('--out_data_dir', default='processed_data/rvlcdip', type=str) 83 | parser.add_argument('--api_key', default='API_KEY', type=str) 84 | args = parser.parse_args() 85 | 86 | dataset = InstructData(args) 87 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/rvlcdip_io.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image 7 | from tqdm import tqdm 8 
| from pathlib import Path 9 | from utils import load_instructions 10 | from google_vision_ocr import Google_OCR 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 18 | self.image_dir = os.path.join(args.input_data_dir, 'images') 19 | self.dataset_name = 'rvlcdip_io' 20 | self.google_ocr = Google_OCR(args.api_key) 21 | self.split = ['train', 'val', 'test'] 22 | 23 | def create_data(self): 24 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 25 | for split in self.split: 26 | target_format = [] 27 | ocrs = [] 28 | with open(os.path.join(self.data_dir, f'labels/{split}.txt'), 'r') as f: 29 | labels = f.read().splitlines() 30 | for label in labels: 31 | filename, label = label.split(' ') 32 | value = self.class_dict[label] 33 | image_path = os.path.join(self.image_dir, filename) 34 | ocr_path = os.path.join(self.ocr_dir, f'{file_name.replace(".tif", ".json")}') 35 | try: 36 | img = Image.open(image_path) 37 | img_w, img_h = img.size 38 | if not os.path.exists(ocr_path): 39 | items = self.google_ocr.recognize_image(img) 40 | if items == "error": 41 | print('OCR error: ', image_path) 42 | continue 43 | os.makedirs(os.dirname(ocr_path), exist_ok=True) 44 | with open(ocr_path, 'w') as f: 45 | json.dump(items, f) 46 | else: 47 | with open(ocr_path, 'r') as f: 48 | items = json.load(f) 49 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 50 | except: 51 | words, bboxes = [], [] 52 | 53 | ocr = ' '.join(words) 54 | ocrs.append((ocr, bboxes)) 55 | 56 | for label in labels: 57 | instruction = random.choice(instructions) 58 | if random.random() > 0.5: 59 | ocr, bboxes = random.choice(ocrs) 60 | value = 'no' 61 | else: 62 | value = 'yes' 63 | 64 | file_name = os.path.abspath(image_path) 65 | target_format.append({ 66 | "image": file_name, 67 | "ocr": ocr, 68 | "bboxes": bboxes, 69 | "conversations": [ 70 | {'from': 'human', 'value': instruction}, 71 | {'from': 'gpt', 'value': value}, 72 | ], 73 | }) 74 | 75 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 76 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 77 | 78 | print(f'{split}: {len(target_format)}') 79 | with open(out_filepath, "w") as f: 80 | json.dump(target_format, f) 81 | 82 | if __name__ == '__main__': 83 | parser = argparse.ArgumentParser() 84 | parser.add_argument('--input_data_dir', default='raw_datasets/rvlcdip', type=str) 85 | parser.add_argument('--out_data_dir', default='processed_data/rvlcdip_io', type=str) 86 | parser.add_argument('--api_key', default='API_KEY', type=str) 87 | args = parser.parse_args() 88 | 89 | dataset = InstructData(args) 90 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/scicap.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image 7 | from tqdm import tqdm 8 | from pathlib import Path 9 | from utils import load_instructions 10 | from google_vision_ocr import Google_OCR 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.image_dir = 
os.path.join(args.input_data_dir, 'SciCap-No-Subfig-Img') 18 | self.caption_dir = os.path.join(args.input_data_dir, 'SciCap-Caption-All') 19 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 20 | self.dataset_name = 'scicap' 21 | self.google_ocr = Google_OCR(args.api_key) 22 | self.split = ['train', 'val', 'test'] 23 | os.makedirs(self.ocr_dir, exist_ok=True) 24 | 25 | def create_data(self): 26 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 27 | ocr_info = {} 28 | for split in self.split: 29 | with open(os.path.join(self.data_dir, f'List-of-Files-for-Each-Experiments/Single-Sentence-Caption/No-Subfig/{split}/file_idx.json'), "r") as f: 30 | split_info = json.load(f) 31 | target_format = [] 32 | for file_name in tqdm(split_info): 33 | image_path = os.path.join(self.image_dir, split, file_name) 34 | caption_path = os.path.join(self.caption_dir, split, f'{file_name.replace(".png", ".json")}') 35 | ocr_path = os.path.join(self.ocr_dir, f'{file_name.replace(".png", ".json")}') 36 | 37 | with open(caption_path, "r") as f: 38 | annotation = json.load(f) 39 | try: 40 | img = Image.open(image_path) 41 | img_w, img_h = img.size 42 | if not os.path.exists(ocr_path): 43 | items = self.google_ocr.recognize_image(img) 44 | if items == "error": 45 | print('OCR error: ', image_path) 46 | continue 47 | with open(ocr_path, 'w') as f: 48 | json.dump(items, f) 49 | else: 50 | with open(ocr_path, 'r') as f: 51 | items = json.load(f) 52 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 53 | except: 54 | words, bboxes = [], [] 55 | 56 | value = annotation['1-lowercase-and-token-and-remove-figure-index']['caption'] 57 | instruction = random.choice(instructions) 58 | ocr = ' '.join(words) 59 | 60 | file_name = os.path.abspath(image_path) 61 | target_format.append({ 62 | "image": file_name, 63 | "ocr": ocr, 64 | "bboxes": bboxes, 65 | "conversations": [ 66 | {'from': 'human', 'value': instruction}, 67 | {'from': 'gpt', 'value': value}, 68 | ], 69 | }) 70 | 71 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 72 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 73 | 74 | print(f'{split}: {len(target_format)}') 75 | with open(out_filepath, "w") as f: 76 | json.dump(target_format, f) 77 | 78 | if __name__ == '__main__': 79 | parser = argparse.ArgumentParser() 80 | parser.add_argument('--input_data_dir', default='raw_datasets/scicap', type=str) 81 | parser.add_argument('--out_data_dir', default='processed_data/scicap', type=str) 82 | parser.add_argument('--api_key', default='API_KEY', type=str) 83 | args = parser.parse_args() 84 | 85 | dataset = InstructData(args) 86 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/scienceqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import glob 5 | from PIL import Image 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, sort_coordinate, load_instructions 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'scienceqa' 17 | 18 | def create_data(self): 19 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 20 | train, val, test = [],[],[] 21 | target_format = [] 22 | 
ann_filename = os.path.join(self.data_dir, 'data/scienceqa/problems.json') 23 | with open(ann_filename, 'r') as f: 24 | anns = json.load(f) 25 | for questionId, ann in tqdm(anns.items()): 26 | question = ann['question'] 27 | choices = ann['choices'] 28 | value = choices[ann['answer']] 29 | split = ann['split'] 30 | image_name = ann['image'] 31 | if image_name is None or str(image_name) == 'null': 32 | continue 33 | 34 | image_path = os.path.join(self.data_dir, split, questionId, image_name) 35 | instruction = random.choice(instructions) 36 | instruction = instruction.replace('', question).replace('', str(choices)) 37 | 38 | file_name = os.path.abspath(image_path) 39 | sample = { 40 | "image": file_name, 41 | "conversations": [ 42 | {'from': 'human', 'value': instruction}, 43 | {'from': 'gpt', 'value': f"{value}"}, 44 | ], 45 | } 46 | if split == 'train': 47 | train.append(sample) 48 | elif split == 'val': 49 | val.append(sample) 50 | elif split == 'test': 51 | test.append(sample) 52 | 53 | for split, target_format in [('train', train), ('val', val), ('test', test)]: 54 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 55 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 56 | print(f'{split}: {len(target_format)}') 57 | with open(out_filepath, "w") as f: 58 | json.dump(target_format, f) 59 | 60 | if __name__ == '__main__': 61 | parser = argparse.ArgumentParser() 62 | parser.add_argument('--input_data_dir', default='raw_datasets/scienceqa', type=str) 63 | parser.add_argument('--out_data_dir', default='processed_data/scienceqa', type=str) 64 | args = parser.parse_args() 65 | 66 | dataset = InstructData(args) 67 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/screen2words.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | 6 | from PIL import Image 7 | from tqdm import tqdm 8 | from pathlib import Path 9 | from utils import load_instructions 10 | from google_vision_ocr import Google_OCR 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 18 | self.image_dir = os.path.join(args.input_data_dir, 'combined') 19 | self.dataset_name = 'screen2words' 20 | self.google_ocr = Google_OCR(args.api_key) 21 | self.split = ['train', 'dev'] 22 | os.makedirs(self.ocr_dir, exist_ok=True) 23 | 24 | def load_captions(self): 25 | with open(os.path.join(self.data_dir, 'screen_summaries.csv'), "r") as f: 26 | lines = f.read().splitlines() 27 | captions = {} 28 | for i, line in enumerate(lines): 29 | if i != 0: 30 | items = line.split(',') 31 | if len(items) > 2: 32 | screenId = items[0] 33 | summary = line[len(screenId)+1:] 34 | else: 35 | screenId, summary = items 36 | captions[screenId] = summary 37 | return captions 38 | 39 | def create_data(self): 40 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 41 | captions = self.load_captions() 42 | for split in self.split: 43 | target_format = [] 44 | with open(os.path.join(self.data_dir, f'split/{split}_screens.txt'), "r") as f: 45 | split_info = f.read().splitlines() 46 | for split_id in tqdm(split_info): 47 | image_path = os.path.join(self.image_dir, f'{split_id}.jpg') 48 | ocr_path = os.path.join(self.ocr_dir, f'{split_id}.json') 49 |
try: 50 | img = Image.open(image_path) 51 | img_w, img_h = img.size 52 | if not os.path.exists(ocr_path): 53 | items = self.google_ocr.recognize_image(img) 54 | if items == "error": 55 | print('OCR error: ', image_path) 56 | continue 57 | with open(ocr_path, 'w') as f: 58 | json.dump(items, f) 59 | else: 60 | with open(ocr_path, 'r') as f: 61 | items = json.load(f) 62 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 63 | except: 64 | words, bboxes = [], [] 65 | 66 | value = captions[split_id] 67 | instruction = random.choice(instructions) 68 | ocr = ' '.join(words) 69 | 70 | file_name = os.path.abspath(image_path) 71 | target_format.append({ 72 | "image": file_name, 73 | "ocr": ocr, 74 | "bboxes": bboxes, 75 | "conversations": [ 76 | {'from': 'human', 'value': instruction}, 77 | {'from': 'gpt', 'value': value}, 78 | ], 79 | }) 80 | 81 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 82 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 83 | 84 | print(f'{split}: {len(target_format)}') 85 | with open(out_filepath, "w") as f: 86 | json.dump(target_format, f) 87 | 88 | if __name__ == '__main__': 89 | parser = argparse.ArgumentParser() 90 | parser.add_argument('--input_data_dir', default='raw_datasets/screen2words', type=str) 91 | parser.add_argument('--out_data_dir', default='processed_data/screen2words', type=str) 92 | parser.add_argument('--api_key', default='API_KEY', type=str) 93 | args = parser.parse_args() 94 | 95 | dataset = InstructData(args) 96 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/slidevqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import glob 6 | 7 | from PIL import Image 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import load_instructions 11 | from google_vision_ocr import Google_OCR 12 | 13 | class InstructData: 14 | def __init__(self, args): 15 | self.instruction_path = Path('instructdoc_instructions.xlsx') 16 | self.data_dir = args.input_data_dir 17 | self.out_data_dir = args.out_data_dir 18 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 19 | self.image_dir = os.path.join(args.input_data_dir, 'images') 20 | self.dataset_name = 'slidevqa' 21 | self.google_ocr = Google_OCR(args.api_key) 22 | self.split = ['train', 'val', 'test'] 23 | os.makedirs(self.ocr_dir, exist_ok=True) 24 | 25 | def create_data(self): 26 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 27 | for split in self.split: 28 | target_format = [] 29 | file_name = os.path.join(self.data_dir, 'annotations/qa', f'{split}.jsonl') 30 | with open(file_name, 'r') as f: 31 | data = f.read().splitlines() 32 | for d in tqdm(data): 33 | question = d['question'] 34 | deck_name = d['deck_name'] 35 | value = d['answer'] 36 | image_paths = [] 37 | text_sequences = [] 38 | bboxes = [] 39 | for image_path in glob.glob(os.path.join(self.image_dir, deck_name, f'slide_*_1024.jpg')): 40 | image_path = os.path.abspath(image_path) 41 | image_name = os.path.basename(image_path) 42 | image_paths.append(image_path) 43 | ocr_path = os.path.join(self.ocr_dir, f'{deck_name}_{image_name.replace(".jpg", ".json")}') 44 | try: 45 | img = Image.open(image_path) 46 | img_w, img_h = img.size 47 | if not os.path.exists(ocr_path): 48 | items = self.google_ocr.recognize_image(img) 49 | if items == 'error': 50 | print('OCR error: ', image_path) 51 
| continue 52 | with open(ocr_path, 'w') as f: 53 | json.dump(items, f) 54 | else: 55 | with open(ocr_path, 'r') as f: 56 | items = json.load(f) 57 | words, page_bboxes = self.google_ocr.extract_info(items, img_w, img_h) 58 | except: 59 | words, page_bboxes = [], [] 60 | text_sequences.append(' '.join(words)) 61 | bboxes.append(page_bboxes) 62 | 63 | instruction = random.choice(instructions) 64 | instruction = instruction.replace('', question) 65 | 66 | file_name = os.path.abspath(image_path) 67 | target_format.append({ 68 | "image_list": image_paths, 69 | "ocr_list": text_sequences, 70 | "bboxes_list": bboxes, 71 | "conversations": [ 72 | {'from': 'human', 'value': instruction}, 73 | {'from': 'gpt', 'value': value}, 74 | ], 75 | }) 76 | 77 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 78 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 79 | 80 | print(f'{split}: {len(target_format)}') 81 | with open(out_filepath, "w") as f: 82 | json.dump(target_format, f) 83 | 84 | if __name__ == '__main__': 85 | parser = argparse.ArgumentParser() 86 | parser.add_argument('--input_data_dir', default='raw_datasets/slidevqa', type=str) 87 | parser.add_argument('--out_data_dir', default='processed_data/slidevqa', type=str) 88 | parser.add_argument('--api_key', type=str, help='google vision api key') 89 | args = parser.parse_args() 90 | 91 | dataset = InstructData(args) 92 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/sroie.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import cv2 5 | from PIL import Image, ImageSequence 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, load_instructions 9 | import argparse 10 | 11 | class InstructData: 12 | def __init__(self, args): 13 | self.instruction_path = Path('instructdoc_instructions.xlsx') 14 | self.data_dir = args.input_data_dir 15 | self.out_data_dir = args.out_data_dir 16 | self.dataset_name = 'sroie' 17 | self.split = ['train', 'test'] 18 | 19 | def sort_coordinate(self, bboxes): 20 | return sorted(bboxes , key=lambda k: [k[1][1], k[1][0]]) 21 | 22 | def create_data(self): 23 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 24 | for split in self.split: 25 | target_format = [] 26 | ann_dir = os.path.join(self.data_dir, f'{split}/entities') 27 | img_dir = os.path.join(self.data_dir, f'{split}/img') 28 | for file in tqdm(sorted(os.listdir(ann_dir))): 29 | file_path = os.path.join(ann_dir, file) 30 | with open(file_path, 'r', encoding='utf-8') as f: 31 | labels = json.load(f) 32 | image_path = os.path.join(img_dir, file) 33 | image_path = image_path.replace('.txt', '.jpg') 34 | image = cv2.imread(image_path) 35 | h, w, _ = image.shape 36 | 37 | file_path = os.path.join(ann_dir.replace('entities', 'box'), file) 38 | text_sequence = [] 39 | bboxes = [] 40 | with open(file_path, 'r', encoding='utf-8') as f: 41 | items = [] 42 | for item in f.read().splitlines(): 43 | bbox = item.split(',')[:8] 44 | text = item[len(','.join(bbox))+1:] 45 | bbox = [int(bbox[0]), int(bbox[1]), int(bbox[4]), int(bbox[5])] 46 | bbox = normalize_bbox(bbox, w, h) 47 | items.append((text, bbox)) 48 | items = self.sort_coordinate(items) 49 | for item in items: 50 | words, bbox = item 51 | text_sequence.append(words) 52 | bbox = [bbox] * len(words.split()) 53 | bboxes += bbox 54 | 55 | ocr = ' '.join(text_sequence) 56 | for 
label in labels: 57 | instruction = random.choice(instructions) 58 | instruction = instruction.replace('', labels[label]) 59 | 60 | file_name = os.path.abspath(image_path) 61 | target_format.append({ 62 | "image": file_name, 63 | "ocr": ocr, 64 | "bboxes": bboxes, 65 | "conversations": [ 66 | {'from': 'human', 'value': instruction}, 67 | {'from': 'gpt', 'value': label}, 68 | ], 69 | }) 70 | 71 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 72 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 73 | 74 | with open(out_filepath, "w") as f: 75 | json.dump(target_format, f) 76 | 77 | if __name__ == '__main__': 78 | parser = argparse.ArgumentParser() 79 | parser.add_argument('--input_data_dir', default='raw_datasets/SROIE2019', type=str) 80 | parser.add_argument('--out_data_dir', default='processed_data/sroie', type=str) 81 | args = parser.parse_args() 82 | 83 | dataset = InstructData(args) 84 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/tabfact.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'tabfact' 16 | self.split = ['train', 'dev'] 17 | self.options = ['no', 'yes'] 18 | 19 | def create_ocr_data(self, split): 20 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 21 | with open(file_name, 'r') as f: 22 | data = f.readlines() 23 | ocrs = {} 24 | for d in data: 25 | d = json.loads(d) 26 | image_name = d['name'].replace('.pdf', '') 27 | try: 28 | content = d['contents'][1] # microsoft cv 29 | except: 30 | content = d['contents'][0] # tesseract 31 | 32 | bboxes = [] 33 | tokens = [] 34 | try: 35 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 36 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 37 | bbox = normalize_bbox(bbox, w, h) 38 | bboxes.append(bbox) 39 | tokens.append(token) 40 | except: 41 | pass 42 | ocrs[image_name] = (' '.join(tokens), bboxes) 43 | return ocrs 44 | 45 | def create_data(self): 46 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 47 | for split in self.split: 48 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 49 | with open(file_name, 'r') as f: 50 | data = f.readlines() 51 | 52 | ocrs = self.create_ocr_data(split) 53 | target_format = [] 54 | for d in tqdm(data): 55 | d = json.loads(d) 56 | image_name = d['name'].replace('.pdf', '') 57 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 58 | file_name = os.path.abspath(file_name) 59 | for ann in d['annotations']: 60 | instruction = random.choice(instructions) 61 | question = ann['key'] 62 | instruction = instruction.replace('', question) 63 | bboxes = [] 64 | ocr, bboxes = ocrs[image_name][0], ocrs[image_name][1] 65 | value = ann['values'][0]['value'] 66 | value = self.options[int(value)] 67 | 68 | target_format.append({ 69 | "image": file_name, 70 | "ocr": ocr, 71 | "bboxes": bboxes, 72 | "conversations": [ 73 | {'from': 'human', 'instruction': instruction}, 
74 | {'from': 'gpt', 'value': value}, 75 | ], 76 | }) 77 | 78 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 79 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 80 | 81 | print(f'{split}: {len(target_format)}') 82 | with open(out_filepath, "w") as f: 83 | json.dump(target_format, f) 84 | 85 | if __name__ == '__main__': 86 | parser = argparse.ArgumentParser() 87 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/TabFact', type=str) 88 | parser.add_argument('--out_data_dir', default='processed_data/tabfact', type=str) 89 | args = parser.parse_args() 90 | 91 | dataset = InstructData(args) 92 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/tatdqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import csv 6 | 7 | from PIL import Image, ImageSequence 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import normalize_bbox, load_instructions 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.dataset_name = 'tatdqa' 18 | self.split = ['train', 'dev', 'test'] 19 | 20 | def create_data(self): 21 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 22 | for split in self.split: 23 | target_format = [] 24 | file_name = os.path.join(self.data_dir, f'tatdqa_dataset_{split}.json') 25 | with open(file_name, 'r') as f: 26 | data = json.load(f) 27 | for d in tqdm(data): 28 | uid = d['doc']['uid'] 29 | page_num = d['doc']['page'] 30 | image_path = f'{split}/{uid}_{page_num}.png' 31 | ocr_file_name = os.path.join(self.data_dir, f'{split}/{uid}.json') 32 | with open(ocr_file_name, 'r') as f: 33 | ocrs = json.load(f) 34 | 35 | text = [] 36 | bboxes = [] 37 | _, _, w, h = ocrs['pages'][page_num-1]['bbox'] 38 | for page in ocrs['pages']: 39 | for block in page['blocks']: 40 | text.append(block['text']) 41 | for bbox in block['words']['bbox_list']: 42 | bbox = normalize_bbox(bbox, w, h) 43 | bboxes.append(bbox) 44 | 45 | for qa in d['questions']: 46 | question =qa['question'] 47 | if 'answer' in qa: 48 | answer = qa['answer'] 49 | if type(qa['answer']) == list: 50 | if len(qa['answer']) > 1: 51 | answer = ', '.join(answer) 52 | else: 53 | answer = answer[0] 54 | else: 55 | answer = "" 56 | 57 | instruction = random.choice(instructions) 58 | instruction = instruction.replace('', question) 59 | ocr = ' '.join(text) 60 | 61 | file_name = os.path.abspath(image_path) 62 | target_format.append({ 63 | "image": file_name, 64 | "ocr": ocr, 65 | "bboxes": bboxes, 66 | "conversations": [ 67 | {'from': 'human', 'value': instruction}, 68 | {'from': 'gpt', 'value': answer}, 69 | ], 70 | }) 71 | 72 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 73 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 74 | 75 | print(f'{split}: {len(target_format)}') 76 | with open(out_filepath, "w") as f: 77 | json.dump(target_format, f) 78 | 79 | if __name__ == '__main__': 80 | parser = argparse.ArgumentParser() 81 | parser.add_argument('--input_data_dir', default='raw_datasets/TAT-DQA', type=str) 82 | parser.add_argument('--out_data_dir', default='processed_data/tatdqa', type=str) 83 | args = parser.parse_args() 84 | 85 | dataset = InstructData(args) 86 | 
dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/textbookqa.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import glob 5 | from PIL import Image 6 | from tqdm import tqdm 7 | from pathlib import Path 8 | from utils import normalize_bbox, sort_coordinate, load_instructions 9 | from transformers import BertTokenizer 10 | import argparse 11 | 12 | class InstructData: 13 | def __init__(self, args): 14 | self.instruction_path = Path('instructdoc_instructions.xlsx') 15 | self.data_dir = args.input_data_dir 16 | self.out_data_dir = args.out_data_dir 17 | self.dataset_name = 'textbookqa' 18 | self.split = ['train', 'val', 'test'] 19 | 20 | def create_data(self): 21 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 22 | for split in self.split: 23 | target_format = [] 24 | ann_filename = f'{split}/tqa_v1_{split}.json' if split != 'test' else f'{split}/tqa_v2_{split}.json' 25 | ann_filename = os.path.join(self.data_dir, ann_filename) 26 | with open(ann_filename, 'r') as f: 27 | anns = json.load(f) 28 | for ann in tqdm(anns): 29 | questions = ann['questions'] 30 | diagram_questions = questions['diagramQuestions'] 31 | if len(diagram_questions) == 0: 32 | continue 33 | 34 | diagram_annotations = ann['diagramAnnotations'] 35 | 36 | for global_id, data in diagram_questions.items(): 37 | options = [] 38 | for option_id, choice in data['answerChoices'].items(): 39 | choice = choice['processedText'] 40 | options.append(choice) 41 | question = data['beingAsked']['processedText'] 42 | value = data['correctAnswer']['rawText'] 43 | image_path = data['imagePath'] 44 | image_name = data['imageName'] 45 | image_path = os.path.join(self.data_dir, f'{split}/{image_path}') 46 | if image_name in diagram_annotations: 47 | annotation = diagram_annotations[image_name] 48 | bboxes = [] 49 | ocr = [] 50 | for item in annotation: 51 | text, bbox = item["text"], item["rectangle"] 52 | try: 53 | bbox = [bbox[0][0], bbox[0][1], bbox[1][0], bbox[1][1]] 54 | except: 55 | continue 56 | if len(text) > 0: 57 | bboxes.append(bbox) 58 | ocr.append(text) 59 | ocr = " ".join(ocr) 60 | else: 61 | ocr = "" 62 | instruction = random.choice(instructions) 63 | instruction = instruction.replace('', question).replace('', str(options)) 64 | 65 | file_name = os.path.abspath(image_path) 66 | target_format.append({ 67 | "image": file_name, 68 | "ocr": ocr, 69 | "bboxes": bboxes, 70 | "conversations": [ 71 | {'from': 'human', 'value': instruction}, 72 | {'from': 'gpt', 'value': f"{value}"}, 73 | ], 74 | }) 75 | 76 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 77 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 78 | 79 | print(f'{split}: {len(target_format)}') 80 | with open(out_filepath, "w") as f: 81 | json.dump(target_format, f) 82 | 83 | if __name__ == '__main__': 84 | parser = argparse.ArgumentParser() 85 | parser.add_argument('--input_data_dir', default='raw_datasets/textbookqa', type=str) 86 | parser.add_argument('--out_data_dir', default='processed_data/textbookqa', type=str) 87 | args = parser.parse_args() 88 | 89 | dataset = InstructData(args) 90 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | def normalize_bbox(bbox, w=-1, 
h=-1): 4 | if w > 0 and h > 0: 5 | normalized_bbox = [ 6 | int(1000 * bbox[0] / w), 7 | int(1000 * bbox[1] / h), 8 | int(1000 * bbox[2] / w), 9 | int(1000 * bbox[3] / h), 10 | ] 11 | else: 12 | normalized_bbox = [ 13 | int(1000 * bbox[0]), 14 | int(1000 * bbox[1]), 15 | int(1000 * bbox[2]), 16 | int(1000 * bbox[3]), 17 | ] 18 | 19 | if len(bbox) == 4: 20 | return convert_wh(normalized_bbox) 21 | elif len(bbox) == 6: 22 | return normalized_bbox 23 | 24 | def convert_wh(bbox): 25 | return [bbox[0], bbox[1], bbox[2], bbox[3], abs(bbox[2]-bbox[0]), abs(bbox[3]-bbox[1])] 26 | 27 | def sort_coordinate(bboxes): 28 | return sorted(bboxes , key=lambda k: [k[2][1], k[2][0]]) 29 | 30 | def load_instructions(instruction_path): 31 | instructions = {} 32 | data = pd.read_excel(instruction_path) 33 | for d in data.values: 34 | dataset_name = d[0] 35 | insts = [] 36 | for prompt in d[3:]: 37 | if pd.isna(prompt): 38 | break 39 | insts.append(prompt) 40 | instructions[dataset_name] = insts 41 | return instructions 42 | -------------------------------------------------------------------------------- /data_preprocessors/visualmrc.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'visualmrc' 16 | self.split = ['train', 'dev', 'test'] 17 | 18 | def create_data(self): 19 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 20 | for split in self.split: 21 | file_name = os.path.join(self.data_dir, f'data/{split}.jsonl') 22 | with open(file_name, 'r') as f: 23 | data = f.readlines() 24 | target_format = [] 25 | for d in tqdm(data): 26 | d = json.loads(d) 27 | file_name = os.path.join(self.data_dir, d['image_filename']) 28 | file_name = os.path.abspath(file_name) 29 | image = Image.open(file_name) 30 | w, h = image.size 31 | 32 | words = [] 33 | bboxes = [] 34 | for bbox in d['bounding_boxes']: 35 | if 'ocr_info' in bbox: 36 | for ocr in bbox['ocr_info']: 37 | word = ocr['word'] 38 | bbox = ocr['bbox'] 39 | bbox = [bbox['x'], bbox['y'], bbox['x']+bbox['width'], bbox['y']+bbox['height']] 40 | bbox = normalize_bbox(bbox, w, h) 41 | bboxes.append(bbox) 42 | words.append(word) 43 | 44 | ocr = " ".join(words) 45 | for qa in d['qa_data']: 46 | question = qa['question']['text'] 47 | value = qa['answer']['text'] 48 | instruction = random.choice(instructions) 49 | instruction = instruction.replace('', question) 50 | 51 | target_format.append({ 52 | "image": file_name, 53 | "ocr": ocr, 54 | "bboxes": bboxes, 55 | "conversations": [ 56 | {'from': 'human', 'instruction': instruction}, 57 | {'from': 'gpt', 'value': value}, 58 | ], 59 | }) 60 | 61 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 62 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 63 | 64 | print(f'{split}: {len(target_format)}') 65 | with open(out_filepath, "w") as f: 66 | json.dump(target_format, f) 67 | 68 | if __name__ == '__main__': 69 | parser = argparse.ArgumentParser() 70 | parser.add_argument('--input_data_dir', default='raw_datasets/VisualMRC_official', type=str) 71 | parser.add_argument('--out_data_dir', 
default='processed_data/visualmrc', type=str) 72 | args = parser.parse_args() 73 | 74 | dataset = InstructData(args) 75 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/websrc.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | import argparse 5 | import csv 6 | 7 | from PIL import Image 8 | from tqdm import tqdm 9 | from pathlib import Path 10 | from utils import normalize_bbox, load_instructions 11 | from collections import defaultdict 12 | from google_vision_ocr import Google_OCR 13 | 14 | class InstructData: 15 | def __init__(self, args): 16 | self.instruction_path = Path('instructdoc_instructions.xlsx') 17 | self.data_dir = args.input_data_dir 18 | self.out_data_dir = args.out_data_dir 19 | self.ocr_dir = os.path.join(args.input_data_dir, 'ocrs') 20 | self.dataset_name = 'websrc' 21 | self.google_ocr = Google_OCR(args.api_key) 22 | self.split = ['train', 'dev'] 23 | os.makedirs(self.ocr_dir, exist_ok=True) 24 | 25 | def load_split_info(self): 26 | file_name = os.path.join(self.data_dir, 'dataset_split.csv') 27 | with open(file_name) as f: 28 | reader = csv.reader(f) 29 | split_info = defaultdict(list) 30 | for i, row in enumerate(reader): 31 | if i == 0: 32 | continue 33 | number = '0' + row[1] if int(row[1]) < 10 else row[1] 34 | split = row[3] 35 | data_path = os.path.join(self.data_dir, f'{row[0]}/{number}/dataset.csv') 36 | split_info[split].append(data_path) 37 | return split_info 38 | 39 | def create_data(self): 40 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 41 | split_info = self.load_split_info() 42 | for split in self.split: 43 | target_format = [] 44 | for data_path in tqdm(split_info[split]): 45 | with open(data_path) as f: 46 | data_dir = os.path.dirname(data_path) 47 | reader = csv.reader(f) 48 | for i, row in enumerate(reader): 49 | if i == 0: 50 | for index, element in enumerate(row): 51 | if 'question' == element: 52 | question_index = index 53 | elif 'id' == element: 54 | id_index = index 55 | elif 'answer' == element: 56 | answer_index = index 57 | continue 58 | questionId = row[id_index] 59 | image_path = os.path.join(data_dir, f'processed_data/{questionId[2:9]}.png') 60 | img = Image.open(image_path) 61 | img_w, img_h = img.size 62 | 63 | ocr_path = os.path.join(self.ocr_dir, f'{questionId[2:9]}.json') 64 | try: 65 | if not os.path.exists(ocr_path): 66 | items = self.google_ocr.recognize_image(img) 67 | if items == "error": 68 | print('OCR error: ', image_path) 69 | continue 70 | with open(ocr_path, 'w') as f: 71 | json.dump(items, f) 72 | else: 73 | with open(ocr_path, 'r') as f: 74 | items = json.load(f) 75 | words, bboxes = self.google_ocr.extract_info(items, img_w, img_h) 76 | except: 77 | words, bboxes = [], [] 78 | 79 | question = row[question_index] 80 | instruction = random.choice(instructions) 81 | instruction = instruction.replace('', question) 82 | ocr = ' '.join(words) 83 | value = row[answer_index] 84 | 85 | file_name = os.path.abspath(image_path) 86 | target_format.append({ 87 | "image": file_name, 88 | "ocr": ocr, 89 | "bboxes": bboxes, 90 | "conversations": [ 91 | {'from': 'human', 'value': instruction}, 92 | {'from': 'gpt', 'value': value}, 93 | ], 94 | }) 95 | 96 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 97 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 98 | 99 | print(f'{split}: {len(target_format)}') 100 | with 
open(out_filepath, "w") as f: 101 | json.dump(target_format, f) 102 | 103 | if __name__ == '__main__': 104 | parser = argparse.ArgumentParser() 105 | parser.add_argument('--input_data_dir', default='raw_datasets/websrc', type=str) 106 | parser.add_argument('--out_data_dir', default='processed_data/websrc', type=str) 107 | parser.add_argument('--ocr_dir', default='raw_datasets/websrc/ocrs', type=str) 108 | parser.add_argument('--api_key', default='API_KEY', type=str) 109 | args = parser.parse_args() 110 | 111 | dataset = InstructData(args) 112 | dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/wildreceipt.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, sort_coordinate, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'wildreceipt' 16 | self.split = ['train', 'test'] 17 | self.classes = {} 18 | for items in open(os.path.join(args.input_data_dir, 'class_list.txt')): 19 | index, label = items.split() 20 | self.classes[index] = label 21 | 22 | def create_data(self): 23 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 24 | for split in self.split: 25 | target_format = [] 26 | with open(os.path.join(self.data_dir, f'{split}.txt')) as f: 27 | samples = f.readlines() 28 | for sample in tqdm(samples): 29 | data = json.loads(sample) 30 | file_name = data['file_name'] 31 | image_path = os.path.join(self.data_dir, file_name) 32 | image = Image.open(image_path) 33 | w, h = image.size 34 | 35 | items = [] 36 | labels = {} 37 | for item in data["annotations"]: 38 | text, label_index = item["text"], item["label"] 39 | label = self.classes[str(label_index)] 40 | if label_index == 0: 41 | continue 42 | bbox = item["box"] 43 | bbox = [bbox[0], bbox[1], bbox[4], bbox[5]] 44 | bbox = normalize_bbox(bbox, w, h) 45 | items.append((text, label, bbox)) 46 | 47 | items = sort_coordinate(items) 48 | 49 | ocr = [] 50 | bboxes = [] 51 | for item in items: 52 | words, label, bbox = item 53 | labels[words] = label 54 | ocr.append(words) 55 | bbox = [bbox] * len(words.split()) 56 | bboxes += bbox 57 | ocr = ' '.join(ocr) 58 | 59 | for key in labels: 60 | instruction = random.choice(instructions) 61 | instruction = instruction.replace('', key) 62 | value = labels[key] 63 | 64 | file_name = os.path.abspath(image_path) 65 | target_format.append({ 66 | "image": file_name, 67 | "ocr": ocr, 68 | "bboxes": bboxes, 69 | "conversations": [ 70 | {'from': 'human', 'value': instruction}, 71 | {'from': 'gpt', 'value': value}, 72 | ], 73 | }) 74 | 75 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 76 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 77 | 78 | with open(out_filepath, "w") as f: 79 | json.dump(target_format, f) 80 | 81 | if __name__ == '__main__': 82 | parser = argparse.ArgumentParser() 83 | parser.add_argument('--input_data_dir', default='raw_datasets/wildreceipt/wildreceipt', type=str) 84 | parser.add_argument('--out_data_dir', default='processed_data/wildreceipt', type=str) 85 | args = parser.parse_args() 86 | 87 | dataset = InstructData(args) 88 | 
dataset.create_data() -------------------------------------------------------------------------------- /data_preprocessors/wtq.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import random 4 | from PIL import Image, ImageSequence 5 | from tqdm import tqdm 6 | from pathlib import Path 7 | from utils import normalize_bbox, load_instructions 8 | import argparse 9 | 10 | class InstructData: 11 | def __init__(self, args): 12 | self.instruction_path = Path('instructdoc_instructions.xlsx') 13 | self.data_dir = args.input_data_dir 14 | self.out_data_dir = args.out_data_dir 15 | self.dataset_name = 'wtq' 16 | self.split = ['train', 'dev'] 17 | 18 | def create_ocr_data(self, split): 19 | file_name = os.path.join(self.data_dir, split, 'documents_content.jsonl') 20 | with open(file_name, 'r') as f: 21 | data = f.readlines() 22 | ocrs = {} 23 | for d in data: 24 | d = json.loads(d) 25 | image_name = d['name'].replace('.pdf', '') 26 | try: 27 | content = d['contents'][1] # microsoft cv 28 | except: 29 | content = d['contents'][0] # tesseract 30 | 31 | bboxes = [] 32 | tokens = [] 33 | try: 34 | _ , _, w, h = content['common_format']['structures']['pages']['positions'][0] 35 | for token, bbox in zip(content['common_format']['tokens'], content['common_format']['positions']): 36 | bbox = normalize_bbox(bbox, w, h) 37 | bboxes.append(bbox) 38 | tokens.append(token) 39 | except: 40 | continue 41 | ocrs[image_name] = (' '.join(tokens), bboxes) 42 | return ocrs 43 | 44 | def create_data(self): 45 | instructions = load_instructions(self.instruction_path)[self.dataset_name] 46 | for split in self.split: 47 | file_name = os.path.join(self.data_dir, split, 'document.jsonl') 48 | with open(file_name, 'r') as f: 49 | data = f.readlines() 50 | 51 | ocrs = self.create_ocr_data(split) 52 | target_format = [] 53 | for d in tqdm(data): 54 | d = json.loads(d) 55 | image_name = d['name'].replace('.pdf', '') 56 | file_name = os.path.join(self.data_dir, 'png', image_name, '0.jpg') 57 | file_name = os.path.abspath(file_name) 58 | for ann in d['annotations']: 59 | instruction = random.choice(instructions) 60 | question = ann['key'] 61 | instruction = instruction.replace('', question) 62 | ocr, bboxes = ocrs[image_name] 63 | value = ann['values'][0]['value'] 64 | 65 | target_format.append({ 66 | "image": file_name, 67 | "ocr": ocr, 68 | "bboxes": bboxes, 69 | "conversations": [ 70 | {'from': 'human', 'value': instruction}, 71 | {'from': 'gpt', 'value': value}, 72 | ], 73 | }) 74 | 75 | out_filepath = os.path.join(self.out_data_dir, f'{split}.json') 76 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 77 | 78 | print(f'{split}: {len(target_format)}') 79 | with open(out_filepath, "w") as f: 80 | json.dump(target_format, f) 81 | 82 | if __name__ == '__main__': 83 | parser = argparse.ArgumentParser() 84 | parser.add_argument('--input_data_dir', default='raw_datasets/aws_neurips_time/WikiTableQuestions', type=str) 85 | parser.add_argument('--out_data_dir', default='processed_data/wtq', type=str) 86 | args = parser.parse_args() 87 | 88 | dataset = InstructData(args) 89 | dataset.create_data() -------------------------------------------------------------------------------- /download.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | export DATASET_DIR=raw_datasets 3 | 4 | mkdir raw_datasets 5 | 6 | sh ./download_scripts/due.sh 7 | sh ./download_scripts/websrc.sh 8 | sh ./download_scripts/funsd.sh 9 | 
sh ./download_scripts/iconqa.sh 10 | sh ./download_scripts/textbookqa.sh 11 | sh ./download_scripts/screen2words.sh 12 | sh ./download_scripts/doclaynet.sh 13 | sh ./download_scripts/ai2d.sh 14 | sh ./download_scripts/wildreceipt.sh 15 | 16 | # font file for rendering text in AI2D dataset 17 | wget https://huggingface.co/Team-PIXEL/pixel-base-finetuned-masakhaner-swa/resolve/main/GoNotoCurrent.ttf 18 | -------------------------------------------------------------------------------- /download_scripts/README.md: -------------------------------------------------------------------------------- 1 | Below is the list of the datasets used in InstructDoc and how to download them. 2 | ### Automatically downloaded datasets 3 | - DocVQA ([due.sh](download_scripts/due.sh)) 4 | - InfographicVQA ([due.sh](download_scripts/due.sh)) 5 | - PWC ([due.sh](download_scripts/due.sh)) 6 | - KLC ([due.sh](download_scripts/due.sh)) 7 | - DeepForm ([due.sh](download_scripts/due.sh)) 8 | - TabFact ([due.sh](download_scripts/due.sh)) 9 | - WebSRC ([websrc.sh](download_scripts/websrc.sh)) 10 | - FUNSD ([funsd.sh](download_scripts/funsd.sh)) 11 | - IconQA ([iconqa.sh](download_scripts/iconqa.sh)) 12 | - TextbookQA ([textbookqa.sh](download_scripts/textbookqa.sh)) 13 | - Screen2Words ([screen2words.sh](download_scripts/screen2words.sh)) 14 | - DocLayNet ([doclaynet.sh](download_scripts/doclaynet.sh)) 15 | - LLaVAR ([llavar.sh](download_scripts/llavar.sh)) 16 | 17 | ### Manually downloaded datasets 18 | After downloading the datasets below, please place them under the "raw_datasets" directory. 19 | - SROIE ([kaggle](https://www.kaggle.com/datasets/urbikn/sroie-datasetv2)) 20 | - CORD ([google drive](https://drive.google.com/drive/folders/14OEWr86qotVBMAsWk7lymMytxn5u-kM6)) 21 | - OCRVQA ([google drive](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)) 22 | - TAT-DQA ([google drive](https://drive.google.com/drive/folders/1SGpZyRWqycMd_dZim1ygvWhl5KdJYDR2)) 23 | - ScienceQA ([google drive](https://drive.google.com/drive/folders/1w8imCXWYn2LxajmGeGH_g5DaL2rabHev)) 24 | - ChartQA ([google drive](https://drive.google.com/file/d/17-aqtiq_KJ16PIGOp30W0y6OJNax6SVT/view)) 25 | - RVL-CDIP ([google docs](https://docs.google.com/uc?id=0Bz1dfcnrpXM-MUt4cHNzUEFXcmc&export=download)) 26 | - HW-SQuAD ([onedrive](https://www.docvqa.org/datasets/benthamqa-and-hw-squad)) 27 | - SciCap ([dropbox](https://www.dropbox.com/s/t1sjqesl0pynaxo/scicap_data.zip?dl=0)) 28 | - DUDE ([project page](https://rrc.cvc.uab.es/?ch=23&com=introduction)) 29 | - DocBank ([project page](https://doc-analysis.github.io/docbank-page/index.html)) 30 | - DocILE ([project page](https://docile.rossum.ai/)) 31 | - VisualMRC ([project page](https://github.com/nttmdlab-nlp/VisualMRC), request access from the authors via e-mail: ryota.tanaka@ntt.com) 32 | - SlideVQA ([project page](https://github.com/nttmdlab-nlp/SlideVQA), request access from the authors via e-mail: ryota.tanaka@ntt.com) 33 | 34 | 35 | -------------------------------------------------------------------------------- /download_scripts/ai2d.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading AI2D dataset..."
4 | mkdir ai2d 5 | cd ai2d 6 | wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip 7 | wget https://s3-us-east-2.amazonaws.com/prior-datasets/ai2d_test_ids.csv 8 | unzip ai2d-all.zip && rm ai2d-all.zip 9 | -------------------------------------------------------------------------------- /download_scripts/doclaynet.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading DocLayNet dataset..." 4 | mkdir doclaynet 5 | cd doclaynet 6 | wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip 7 | wget https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_extra.zip 8 | unzip DocLayNet_core.zip && rm DocLayNet_core.zip 9 | unzip DocLayNet_extra.zip && rm DocLayNet_extra.zip 10 | -------------------------------------------------------------------------------- /download_scripts/due.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading DocVQA dataset..." 4 | mkdir docvqa 5 | cd docvqa 6 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/DocVQA.tar.gz 7 | tar xvf DocVQA.tar.gz && rm DocVQA.tar.gz 8 | cd .. 9 | 10 | echo "Downloading InfoVQA dataset..." 11 | mkdir infovqa 12 | cd infovqa 13 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/InfographicsVQA.tar.gz 14 | tar xvf InfographicsVQA.tar.gz && rm InfographicsVQA.tar.gz 15 | cd .. 16 | 17 | echo "Downloading TabFact dataset..." 18 | mkdir tabfact 19 | cd tabfact 20 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/TabFact.tar.gz 21 | tar xvf TabFact.tar.gz && rm TabFact.tar.gz 22 | cd .. 23 | 24 | echo "Downloading WTQ dataset..." 25 | mkdir wtq 26 | cd wtq 27 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/WikiTableQuestions.tar.gz 28 | tar xvf WikiTableQuestions.tar.gz && rm WikiTableQuestions.tar.gz 29 | cd .. 30 | 31 | echo "Downloading KLC dataset..." 32 | mkdir klc 33 | cd klc 34 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/KleisterCharity.tar.gz 35 | tar xvf KleisterCharity.tar.gz && rm KleisterCharity.tar.gz 36 | cd .. 37 | 38 | echo "Downloading DeepForm dataset..." 39 | mkdir deepform 40 | cd deepform 41 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/DeepForm.tar.gz 42 | tar xvf DeepForm.tar.gz && rm DeepForm.tar.gz 43 | cd .. 44 | 45 | echo "Downloading PWC dataset..." 46 | mkdir pwc 47 | cd pwc 48 | wget https://applica-public.s3.eu-west-1.amazonaws.com/due/datasets/PWC.tar.gz 49 | tar xvf PWC.tar.gz && rm PWC.tar.gz 50 | -------------------------------------------------------------------------------- /download_scripts/funsd.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading FUNSD dataset..." 4 | mkdir funsd 5 | cd funsd 6 | wget https://guillaumejaume.github.io/FUNSD/dataset.zip 7 | unzip dataset.zip && rm dataset.zip 8 | -------------------------------------------------------------------------------- /download_scripts/iconqa.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading IconQA dataset..."
4 | mkdir iconqa 5 | cd iconqa 6 | wget https://iconqa2021.s3.us-west-1.amazonaws.com/iconqa_data.zip 7 | unzip iconqa_data.zip && rm iconqa_data.zip 8 | -------------------------------------------------------------------------------- /download_scripts/llavar.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading LLaVAR dataset..." 4 | mkdir llavar 5 | cd llavar 6 | wget https://huggingface.co/datasets/SALT-NLP/LLaVAR/resolve/main/llava_instruct_150k_llavar_20k.json 7 | mkdir images 8 | cd images 9 | wget https://huggingface.co/datasets/SALT-NLP/LLaVAR/resolve/main/finetune.zip 10 | unzip finetune.zip && rm finetune.zip 11 | -------------------------------------------------------------------------------- /download_scripts/screen2words.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading Screen2Words dataset..." 4 | git clone https://github.com/google-research-datasets/screen2words.git 5 | cd screen2words 6 | wget https://storage.googleapis.com/crowdstf-rico-uiuc-4540/rico_dataset_v0.1/unique_uis.tar.gz 7 | tar xvf unique_uis.tar.gz && rm unique_uis.tar.gz 8 | -------------------------------------------------------------------------------- /download_scripts/textbookqa.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading TextbookQA dataset..." 4 | mkdir textbookqa 5 | cd textbookqa 6 | wget https://ai2-public-datasets.s3.amazonaws.com/tqa/tqa_train_val_test.zip 7 | unzip tqa_train_val_test.zip && rm tqa_train_val_test.zip 8 | -------------------------------------------------------------------------------- /download_scripts/websrc.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading WebSRC dataset..." 4 | mkdir websrc 5 | cd websrc 6 | wget https://websrc-data.s3.amazonaws.com/release.zip 7 | unzip release.zip && rm release.zip 8 | -------------------------------------------------------------------------------- /download_scripts/wildreceipt.sh: -------------------------------------------------------------------------------- 1 | cd $DATASET_DIR 2 | 3 | echo "Downloading WildReceipt dataset..."
4 | mkdir wildreceipt 5 | cd wildreceipt 6 | wget https://download.openmmlab.com/mmocr/data/wildreceipt.tar 7 | tar xvf wildreceipt.tar && rm wildreceipt.tar 8 | -------------------------------------------------------------------------------- /example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nttmdlab-nlp/InstructDoc/fadcdabc1d071ae3f64c090acac7cf55eb932abd/example.png -------------------------------------------------------------------------------- /instructdoc_instructions.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nttmdlab-nlp/InstructDoc/fadcdabc1d071ae3f64c090acac7cf55eb932abd/instructdoc_instructions.xlsx -------------------------------------------------------------------------------- /merge_datasets.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import random 4 | import argparse 5 | 6 | train_val_datasets = ['klc', 'pwc', 'deepform', 'sroie', 'docile', 'wildreceipt', 'websrc', 'hwsquad', 7 | 'visualmrc', 'iconqa_fill_in_blank', 'iconqa_choose_txt', 'scienceqa', 8 | 'ai2d', 'docvqa', 'rvlcdip', 'textbookqa', 'wtq', 'tatdqa', 'scicap', 'llavar', 9 | 'screen2words', 'doclaynet', 'docbank', 'docvqa_iq', 'rvlcdip_io', 'ocrvqa'] 10 | 11 | def merge_datasets(input_data_dir='./processed_data', save_dir='./', max_samples=5000): 12 | questionId = 0 13 | for split in [('train',), ('dev', 'val')]: 14 | merge = [] 15 | for dataset_name in train_val_datasets: 16 | for s in split: 17 | dataset_path = os.path.join(input_data_dir, dataset_name, f'{s}.json') 18 | if os.path.exists(dataset_path): 19 | with open(dataset_path, 'r') as f: 20 | data = json.load(f) 21 | if len(data) == 0: 22 | continue 23 | random.shuffle(data); data = data[:max_samples] 24 | for d in data: 25 | d["dataset_name"] = dataset_name 26 | d["id"] = questionId; questionId += 1 27 | merge.append(d) 28 | random.shuffle(merge) 29 | 30 | out_filepath = os.path.join(save_dir, f'{split[0]}.json') 31 | os.makedirs(os.path.dirname(out_filepath), exist_ok=True) 32 | print(f'{split}: {len(merge)}') 33 | with open(out_filepath, "w") as f: 34 | json.dump(merge, f) 35 | 36 | if __name__ == '__main__': 37 | parser = argparse.ArgumentParser() 38 | parser.add_argument('--input_data_dir', default='processed_data', type=str) 39 | parser.add_argument('--save_dir', default='./', type=str) 40 | parser.add_argument('--max_samples', default=5000, type=int) 41 | args = parser.parse_args() 42 | 43 | merge_datasets(args.input_data_dir, args.save_dir, args.max_samples) -------------------------------------------------------------------------------- /process_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | API_KEY=$1 3 | 4 | # ===== KIE ===== 5 | python data_preprocessors/docile.py 6 | python data_preprocessors/klc.py 7 | python data_preprocessors/deepform.py 8 | python data_preprocessors/funsd.py 9 | python data_preprocessors/pwc.py 10 | python data_preprocessors/wildreceipt.py 11 | python data_preprocessors/cord.py 12 | python data_preprocessors/sroie.py 13 | 14 | # ===== Single-page QA ===== 15 | python data_preprocessors/visualmrc.py 16 | python data_preprocessors/websrc.py --api_key $API_KEY 17 | python data_preprocessors/ocrvqa.py --api_key $API_KEY 18 | python data_preprocessors/docvqa.py 19 | python data_preprocessors/hwsquad.py 20 | 21 | # ===== Single-page QA w/ Discrete Reasoning ===== 22 | 
python data_preprocessors/tatdqa.py 23 | python data_preprocessors/wtq.py 24 | 25 | # ===== Single-page QA w/ Visual Reasoning ===== 26 | python data_preprocessors/iconqa.py 27 | python data_preprocessors/ai2d.py 28 | python data_preprocessors/scienceqa.py 29 | python data_preprocessors/textbookqa.py 30 | 31 | # ===== Single-page QA w/ Discrete and Visual Reasoning ===== 32 | python data_preprocessors/infographicvqa.py 33 | python data_preprocessors/chartqa.py --api_key $API_KEY 34 | 35 | # ===== Multi-page QA w/ Multi-hop, Discrete, and Visual Reasoning ===== 36 | python data_preprocessors/slidevqa.py --api_key $API_KEY 37 | python data_preprocessors/dude.py 38 | 39 | # ===== Document NLI ===== 40 | python data_preprocessors/tabfact.py 41 | 42 | # ===== Dialogue ===== 43 | python data_preprocessors/llavar.py --api_key $API_KEY 44 | 45 | # ===== Captioning ===== 46 | python data_preprocessors/scicap.py --api_key $API_KEY 47 | python data_preprocessors/screen2words.py --api_key $API_KEY 48 | 49 | # ===== Classification ===== 50 | python data_preprocessors/rvlcdip.py --api_key $API_KEY 51 | 52 | # ===== ITM ===== 53 | python data_preprocessors/docbank.py 57 | python data_preprocessors/doclaynet.py 58 | --------------------------------------------------------------------------------
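Taken together, download.sh fetches the automatically downloadable datasets into raw_datasets/, the data_preprocessors/*.py scripts convert each dataset into the shared instruction-style JSON format under processed_data/, and merge_datasets.py combines them into single train.json and dev.json files. A minimal end-to-end sketch, assuming the commands are run from the repository root and that the API key is for the OCR service wrapped by google_vision_ocr.py (an assumption; YOUR_API_KEY is a placeholder):

    sh download.sh                      # populate raw_datasets/ (manually distributed datasets must still be placed there by hand)
    sh process_data.sh YOUR_API_KEY     # run the data_preprocessors, writing per-dataset splits into processed_data/
    python merge_datasets.py --input_data_dir processed_data --save_dir ./ --max_samples 5000   # merge into train.json / dev.json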