├── LICENSE
├── README.md
├── annotations
│   ├── bbox
│   │   ├── dev.jsonl
│   │   ├── test.jsonl
│   │   └── train.jsonl
│   └── qa
│       ├── dev.jsonl
│       ├── test.jsonl
│       └── train.jsonl
├── download_slides_slideshare.py
├── evaluate.py
├── example.png
├── extract_ocr_tesseract.py
├── extract_ocr_visionAPI.py
└── requirements.txt
/LICENSE:
--------------------------------------------------------------------------------
SOFTWARE LICENSE AGREEMENT FOR EVALUATION

This SOFTWARE EVALUATION LICENSE AGREEMENT (this "Agreement") is a legal contract between a person who uses or otherwise accesses or installs the Software (“User(s)”), and Nippon Telegraph and Telephone Corporation ("NTT").
READ THE TERMS AND CONDITIONS OF THIS AGREEMENT CAREFULLY BEFORE INSTALLING OR OTHERWISE ACCESSING OR USING NTT'S PROPRIETARY SOFTWARE ACCOMPANIED BY THIS AGREEMENT (the "SOFTWARE"). THE SOFTWARE IS COPYRIGHTED AND IT IS LICENSED TO USER UNDER THIS AGREEMENT, NOT SOLD TO USER. BY INSTALLING OR OTHERWISE ACCESSING OR USING THE SOFTWARE, USER ACKNOWLEDGES THAT USER HAS READ THIS AGREEMENT, THAT USER UNDERSTANDS IT, AND THAT USER ACCEPTS AND AGREES TO BE BOUND BY ITS TERMS. IF AT ANY TIME USER IS NOT WILLING TO BE BOUND BY THE TERMS OF THIS AGREEMENT, USER SHOULD TERMINATE THE INSTALLATION PROCESS, IMMEDIATELY CEASE AND REFRAIN FROM ACCESSING OR USING THE SOFTWARE AND DELETE ANY COPIES USER MAY HAVE. THIS AGREEMENT REPRESENTS THE ENTIRE AGREEMENT BETWEEN USER AND NTT CONCERNING THE SOFTWARE.


BACKGROUND
A. NTT is the owner of all rights, including all patent rights, copyrights and trade secret rights, in and to the Software and related documentation listed in Exhibit A to this Agreement.
B. User wishes to obtain a royalty free license to use the Software to enable User to evaluate, and NTT wishes to grant such a license to User, pursuant and subject to the terms and conditions of this Agreement.
C. As a condition to NTT's provision of the Software to User, NTT has required User to execute this Agreement.
In consideration of these premises, and the mutual promises and conditions in this Agreement, the parties hereby agree as follows:
1. Grant of Evaluation License. NTT hereby grants to User, and User hereby accepts, under the terms and conditions of this Agreement, a royalty free, nontransferable and nonexclusive license to use the Software internally for the purposes of testing, analyzing, and evaluating the methods or mechanisms as shown in the research paper submitted by NTT to a certain academy. User may make a reasonable number of backup copies of the Software solely for User's internal use pursuant to the license granted in this Section 1.
2. Shipment and Installation. NTT will ship or deliver the Software by any method that NTT deems appropriate. User shall be solely responsible for proper installation of the Software.
3. Term. This Agreement is effective upon whichever is earlier: (i) User’s acceptance of the Agreement, or (ii) User’s installing, accessing, or using the Software, even if User has not expressly accepted this Agreement. Without prejudice to any other rights, NTT may terminate this Agreement without notice to User (i) if User breaches or fails to comply with any of the limitations or other requirements described herein, and (ii) if NTT receives a notice from the academy stating that the research paper would not be published, and in any such case User agrees that NTT may, in addition to any other remedies it may have at law or in equity, remotely disable the Software. User may terminate this Agreement at any time by notifying NTT of User’s decision to terminate the Agreement and ceasing use of the Software. Upon any termination or expiration of this Agreement for any reason, User agrees to uninstall the Software and either return to NTT the Software and all copies thereof, or to destroy all such materials and provide written verification of such destruction to NTT.
4. Proprietary Rights
(a) The Software is the valuable, confidential, and proprietary property of NTT, and NTT shall retain exclusive title to this property both during the term and after the termination of this Agreement. Without limitation, User acknowledges that all patent rights, copyrights and trade secret rights in the Software shall remain the exclusive property of NTT at all times. User shall use not less than reasonable care in safeguarding the confidentiality of the Software.
(b) USER SHALL NOT, IN WHOLE OR IN PART, AT ANY TIME DURING THE TERM OF OR AFTER THE TERMINATION OF THIS AGREEMENT: (i) SELL, ASSIGN, LEASE, DISTRIBUTE, OR OTHERWISE TRANSFER THE SOFTWARE TO ANY THIRD PARTY; (ii) EXCEPT AS OTHERWISE PROVIDED HEREIN, COPY OR REPRODUCE THE SOFTWARE IN ANY MANNER; (iii) DISCLOSE THE SOFTWARE TO ANY THIRD PARTY, EXCEPT TO USER'S EMPLOYEES WHO REQUIRE ACCESS TO THE SOFTWARE FOR THE PURPOSES OF THIS AGREEMENT; (iv) MODIFY, DISASSEMBLE, DECOMPILE, REVERSE ENGINEER OR TRANSLATE THE SOFTWARE; OR (v) ALLOW ANY PERSON OR ENTITY TO COMMIT ANY OF THE ACTIONS DESCRIBED IN (i) THROUGH (iv) ABOVE.
(c) User shall take appropriate action, by instruction, agreement, or otherwise, with respect to its employees permitted under this Agreement to have access to the Software to ensure that all of User's obligations under this Section 4 shall be satisfied.
5. Indemnity. User shall defend, indemnify and hold harmless NTT, its agents and employees, from any loss, damage, or liability arising in connection with User's improper or unauthorized use of the Software. NTT SHALL HAVE THE SOLE RIGHT TO CONDUCT AND DEFEND ANY ACTION RELATING TO THE SOFTWARE.
6. Disclaimer. THE SOFTWARE IS LICENSED TO USER "AS IS," WITHOUT ANY TRAINING, MAINTENANCE, OR SERVICE OBLIGATIONS WHATSOEVER ON THE PART OF NTT. NTT MAKES NO EXPRESS OR IMPLIED WARRANTIES OF ANY TYPE WHATSOEVER, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, OF FITNESS FOR A PARTICULAR PURPOSE AND OF NON-INFRINGEMENT ON COPYRIGHT OR ANY OTHER RIGHT OF THIRD PARTIES. USER ASSUMES ALL RISKS ASSOCIATED WITH ITS USE OF THE SOFTWARE, INCLUDING WITHOUT LIMITATION RISKS RELATING TO QUALITY, PERFORMANCE, DATA LOSS, AND UTILITY IN A PRODUCTION ENVIRONMENT.
7. Limitation of Liability. IN NO EVENT SHALL NTT BE LIABLE TO USER OR TO ANY THIRD PARTY FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO DAMAGES FOR PERSONAL INJURY, PROPERTY DAMAGE, LOST PROFITS, OR OTHER ECONOMIC LOSS, ARISING IN CONNECTION WITH USER'S USE OF OR INABILITY TO USE THE SOFTWARE, IN CONNECTION WITH NTT'S PROVISION OF OR FAILURE TO PROVIDE SERVICES PERTAINING TO THE SOFTWARE, OR AS A RESULT OF ANY DEFECT IN THE SOFTWARE. THIS DISCLAIMER OF LIABILITY SHALL APPLY REGARDLESS OF THE FORM OF ACTION THAT MAY BE BROUGHT AGAINST NTT, WHETHER IN CONTRACT OR TORT, INCLUDING WITHOUT LIMITATION ANY ACTION FOR NEGLIGENCE. USER'S SOLE REMEDY IN THE EVENT OF ANY BREACH OF THIS AGREEMENT BY NTT SHALL BE TERMINATION PURSUANT TO SECTION 3.
8. No Assignment or Sublicense. Neither this Agreement nor any right or license under this Agreement, nor the Software, may be sublicensed, assigned, or otherwise transferred by User without NTT's prior written consent.
9. General
(a) If any provision, or part of a provision, of this Agreement is or becomes illegal, unenforceable, or invalidated, by operation of law or otherwise, that provision or part shall to that extent be deemed omitted, and the remainder of this Agreement shall remain in full force and effect.
(b) This Agreement is the complete and exclusive statement of the agreement between the parties with respect to the subject matter hereof, and supersedes all written and oral contracts, proposals, and other communications between the parties relating to that subject matter.
(c) Subject to Section 8, this Agreement shall be binding on, and shall inure to the benefit of, the respective successors and assigns of NTT and User.
(d) If either party to this Agreement initiates a legal action or proceeding to enforce or interpret any part of this Agreement, the prevailing party in such action shall be entitled to recover, as an element of the costs of such action and not as damages, its attorneys' fees and other costs associated with such action or proceeding.
(e) This Agreement shall be governed by and interpreted under the laws of Japan, without reference to conflicts of law principles. All disputes arising out of or in connection with this Agreement shall be finally settled by arbitration in Tokyo in accordance with the Commercial Arbitration Rules of the Japan Commercial Arbitration Association. The arbitration shall be conducted by three (3) arbitrators and in Japanese. The award rendered by the arbitrators shall be final and binding upon the parties. Judgment upon the award may be entered in any court having jurisdiction thereof.
(f) NTT shall not be liable to the User or to any third party for any delay or failure to perform NTT’s obligations set forth under this Agreement due to any cause beyond NTT’s reasonable control.

EXHIBIT A
The software and related data include the following files:
- annotations/qa/train.jsonl
- annotations/qa/dev.jsonl
- annotations/qa/test.jsonl
- annotations/bbox/train.jsonl
- annotations/bbox/dev.jsonl
- annotations/bbox/test.jsonl
- download_slides_slideshare.py
- extract_ocr_tesseract.py
- extract_ocr_visionAPI.py
- evaluate.py
- requirements.txt
- README.md
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SlideVQA
This repository includes the SlideVQA dataset introduced by the following paper: Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. "SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images". In Proc. of AAAI. 2023.

> We introduce a new document VQA dataset, SlideVQA, for tasks wherein given a slide deck composed of multiple slide images and a corresponding question, a system selects a set of evidence images and answers the question.

![Figure 1 from paper](example.png)

# 📢 News
- [2025.03.26] Our SlideVQA dataset is available on 🤗[HuggingFace](https://huggingface.co/datasets/NTT-hil-insight/SlideVQA).


# Software installation
```
pip install -r requirements.txt
```
If you want to extract OCR with Tesseract, please also install [Google Tesseract OCR](https://github.com/tesseract-ocr/tesseract).

# Get Started
If you want to manually collect slide images along with their corresponding OCR text, follow these steps. Alternatively, you can access the SlideVQA dataset (excluding OCR and bbox data) on 🤗[HuggingFace](https://huggingface.co/datasets/NTT-hil-insight/SlideVQA).

## 1. Download slide images
Download 2,619 slide decks from [SlideShare](https://www.slideshare.net/). Each deck is composed of 20 slide images.
```
python download_slides_slideshare.py --target_dir TARGET_DIR --split SPLIT --sleep_time 5
```
Some images can no longer be downloaded due to 404 errors. If you want to obtain all images, please download them through 🤗[HuggingFace](https://huggingface.co/datasets/NTT-hil-insight/SlideVQA).

## 2. OCR (Google Cloud Vision API)
The Google Cloud Vision API is a paid OCR service; we used its OCR results in our main experiments.

Before running the OCR script, you should obtain an API key through the [Google Cloud Platform](https://cloud.google.com/). To get one, see the [quickstart guide](https://cloud.google.com/vision/docs/quickstart).
```
python extract_ocr_visionAPI.py --image_dir IMAGE_DIR --save_dir SAVE_DIR --split SPLIT
```

## 3. OCR (Tesseract)
Alternatively, Tesseract is a free OCR engine. Our script uses [pytesseract](https://github.com/madmaze/pytesseract), a Python wrapper for Google's Tesseract-OCR Engine.
```
python extract_ocr_tesseract.py --image_dir IMAGE_DIR --save_dir SAVE_DIR --split SPLIT
```

# Dataset Format
SlideVQA provides 14,484 annotated QA pairs and 890,945 annotated bounding boxes.

## QA format
<pre>
    {
      "deck_name": slide deck name,
      "deck_url": slide deck URL on SlideShare,
      "image_urls": URL list of the slide images in the deck,
      "qa_id": unique ID of the QA sample,
      "question": question text,
      "answer": answer text,
      "arithmetic_expression": arithmetic expression used to derive the answer,
      "evidence_pages": evidence pages (1-20) needed to answer the question,
      "reasoning_type": reasoning type,
      "answer_type": answer type
    }
</pre>
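
Each line of a QA file is a standalone JSON object with the fields above, so a split can be loaded in a few lines of Python. A minimal sketch (the path assumes the annotations/ directory of this repository):
```python
import json

# Load the dev split of the QA annotations (one JSON object per line).
with open('annotations/qa/dev.jsonl') as f:
    samples = [json.loads(line) for line in f]

# Inspect the first sample.
sample = samples[0]
print(sample['question'])
print('answer:', sample['answer'])
print('evidence pages:', sample['evidence_pages'])
```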

## Bounding boxes format
<pre>
    {
      "deck_name": slide deck name,
      "deck_url": slide deck URL on SlideShare,
      "image_urls": URL list of the slide images in the deck,
      "category": category of the slide deck as defined on SlideShare,
      "bboxes": [
                  [
                  "page_number",
                     [{
                       "bbox_id": unique ID of the bounding box,
                       "class": class name of the bounding box,
                       "bbox": [x1, y1, w, h] (top-left x, top-left y, width, height)
                     }]
                  ]
                ]
    }
</pre>
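
Each line of a bounding-box file describes one deck, and each entry of "bboxes" pairs a page number with its list of boxes. A minimal sketch that converts the [x1, y1, w, h] boxes to corner coordinates (the path assumes the annotations/ directory of this repository):
```python
import json

# Read the first deck of the dev split (one JSON object per line).
with open('annotations/bbox/dev.jsonl') as f:
    deck = json.loads(f.readline())

for page_number, boxes in deck['bboxes']:
    for box in boxes:
        x1, y1, w, h = box['bbox']
        # Convert [x1, y1, w, h] to [x1, y1, x2, y2] corner format.
        print(page_number, box['class'], [x1, y1, x1 + w, y1 + h])
```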


# Evaluate

With the prediction files and the ground-truth file, you can compute evaluation results for all evaluation tasks (QA, evidence selection, and the joint Main task).

```
python evaluate.py --qa_preds_file QA_PREDICTIONS_FILE_NAME --es_preds_file ES_PREDICTIONS_FILE_NAME --gts_file TARGETS_FILE_NAME
```
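
evaluate.py reads the two prediction files and the ground-truth file line by line in parallel, so the predictions must appear in the same order as the ground-truth samples. A minimal sketch of the expected prediction lines, with fields inferred from evaluate.py and illustrative values:
```
{"answer": "10 hrs"}          <- one line of QA_PREDICTIONS_FILE_NAME
{"evidence_pages": [3, 5]}    <- one line of ES_PREDICTIONS_FILE_NAME
```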

# Citation
You can cite it as follows:
```bibtex
@inproceedings{SlideVQA2023,
  author    = {Ryota Tanaka and
               Kyosuke Nishida and
               Kosuke Nishida and
               Taku Hasegawa and
               Itsumi Saito and
               Kuniko Saito},
  title     = {SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images},
  booktitle = {AAAI},
  year      = {2023}
}
```

If you have any questions about the paper and repository, feel free to contact Ryota Tanaka (ryouta.tanaka.rg[at]hco.ntt.co.jp) or open an issue!
--------------------------------------------------------------------------------
/download_slides_slideshare.py:
--------------------------------------------------------------------------------
import json
import os
import argparse
import time
from urllib import request
from tqdm import tqdm

def download_deck(args, sample, deck_name):
    ### Download the slide images of one deck from SlideShare ###
    if not os.path.exists(f'images/{args.split}/{deck_name}'):
        os.makedirs(f'images/{args.split}/{deck_name}')

    for url in tqdm(sample['image_urls'], desc=f'Download 20 slides from {deck_name}'):
        save_name = f'images/{args.split}/{deck_name}/{os.path.basename(url)}'
        if not os.path.exists(save_name):
            request.urlretrieve(url, save_name)
            time.sleep(args.sleep_time)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--target_dir', type=str, default='annotations/bbox')
    parser.add_argument('--split', type=str, default='dev')
    parser.add_argument('--sleep_time', type=float, default=5)
    args = parser.parse_args()

    with open(f'{args.target_dir}/{args.split}.jsonl', 'r') as f:
        samples = f.readlines()

    for sample in tqdm(samples):
        sample = json.loads(sample)
        deck_name = sample['deck_name']
        download_deck(args, sample, deck_name)
--------------------------------------------------------------------------------
/evaluate.py:
--------------------------------------------------------------------------------
import json
import re
import string
from collections import Counter

WORD_NUMBER_MAP = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
                   "five": 5, "six": 6, "seven": 7, "eight": 8,
                   "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
                   "thirteen": 13, "fourteen": 14, "fifteen": 15,
                   "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19}

def normalize_answer(s, question):
    ### Normalize an answer string before token-level comparison ###
    def remove_articles(text):
        regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
        return re.sub(regex, ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    def yesno(text):
        if 'yes' == text[:3] or 'no' == text[:2]:
            text = text.split()[0]
        return text
    def replace_text(text):
        return text.replace('this is ', '').replace('it is ', '').replace('&', ',').replace('and', ',').replace('percent', '').replace('organisation', 'organization').replace('because of', '').replace('because', '').replace('due to', '').replace('hours', 'hrs').replace('minites', 'min')
    def word2number(text):
        words = text.split()
        return ' '.join([str(WORD_NUMBER_MAP[word]) if word in WORD_NUMBER_MAP else word for word in words])
    def remove_unit(text, question):
        # Drop the unit word that directly follows "how many"/"which" in the question.
        if 'how many' in question:
            idx = question.find('how many')
            unit = question[idx+len('how many'):].split()[0]
            text = text.replace(unit, '')
        if 'which' in question:
            idx = question.find('which')
            unit = question[idx+len('which'):].split()[0]
            text = text.replace(unit, '')
        return text
    return word2number(white_space_fix(yesno(remove_articles(remove_punc(remove_unit(replace_text(lower(s)), question))))))

def evaluate_f1_em_qa(gts, preds):
    ### Token-level F1 and exact match for the question answering task ###
    f1 = exact_match = 0
    precisions = {}
    recalls = {}
    ems = {}
    for qa_id in gts:
        question = gts[qa_id]['question']
        prediction = preds[qa_id]['answer']
        ground_truth = gts[qa_id]['answer']
        prediction_tokens = normalize_answer(prediction, question).split()
        ground_truth_tokens = normalize_answer(ground_truth, question).split()
        common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            precisions[qa_id] = recalls[qa_id] = ems[qa_id] = 0
            continue
        precision = 1.0 * num_same / len(prediction_tokens)
        recall = 1.0 * num_same / len(ground_truth_tokens)
        f1 += (2 * precision * recall) / (precision + recall)
        exact_match += (prediction_tokens == ground_truth_tokens)
        precisions[qa_id] = precision
        recalls[qa_id] = recall
        ems[qa_id] = (prediction_tokens == ground_truth_tokens)
    exact_match = exact_match / len(gts)
    f1 = f1 / len(gts)
    return {'F1': f1, 'EM': exact_match, 'precisions': precisions, 'recalls': recalls, 'EMs': ems}

def evaluate_f1_em_es(gts, preds):
    ### Page-level F1 and exact match for the evidence selection task ###
    f1 = exact_match = 0
    precisions = {}
    recalls = {}
    ems = {}
    for qa_id in gts:
        prediction = preds[qa_id]['evidence_pages']
        ground_truth = gts[qa_id]['evidence_pages']
        common = Counter(prediction) & Counter(ground_truth)
        num_same = sum(common.values())
        if num_same == 0:
            precisions[qa_id] = recalls[qa_id] = ems[qa_id] = 0
            continue
        precision = 1.0 * num_same / len(prediction)
        recall = 1.0 * num_same / len(ground_truth)
        f1 += (2 * precision * recall) / (precision + recall)
        exact_match += (prediction == ground_truth)
        precisions[qa_id] = precision
        recalls[qa_id] = recall
        ems[qa_id] = (prediction == ground_truth)
    exact_match = exact_match / len(gts)
    f1 = f1 / len(gts)
    return {'F1': f1, 'EM': exact_match, 'precisions': precisions, 'recalls': recalls, 'EMs': ems}
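
# The Main task combines both subtasks per sample: joint precision is the product
# of the QA and ES precisions, and joint recall is the product of the recalls.
# For example, QA P = R = 1.0 with ES P = R = 0.5 gives joint P = R = 0.5 and
# hence joint F1 = 0.5.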
def evaluate_f1_em_main(metrics_qa, metrics_es):
    ### Joint F1 and exact match over answers and evidence pages ###
    qa_precisions = metrics_qa['precisions']
    qa_recalls = metrics_qa['recalls']
    qa_ems = metrics_qa['EMs']
    es_precisions = metrics_es['precisions']
    es_recalls = metrics_es['recalls']
    es_ems = metrics_es['EMs']
    f1 = exact_match = 0
    for qa_id in qa_precisions:
        qa_precision = qa_precisions[qa_id]
        qa_recall = qa_recalls[qa_id]
        qa_em = qa_ems[qa_id]
        es_precision = es_precisions[qa_id]
        es_recall = es_recalls[qa_id]
        es_em = es_ems[qa_id]
        joint_precision = qa_precision * es_precision
        joint_recall = qa_recall * es_recall
        if (joint_precision + joint_recall) == 0:
            continue
        f1 += (2 * joint_precision * joint_recall) / (joint_precision + joint_recall)
        exact_match += int(qa_em == es_em == 1)
    f1 = f1 / len(qa_precisions)
    exact_match = exact_match / len(qa_precisions)
    return {'F1': f1, 'EM': exact_match}

def print_metrics(res_metrics, task):
    keys = ['F1', 'EM']
    print(f'**********\nFinal model performance ({task}):\n**********')
    for k in keys:
        print(k, ': %.1f' % (res_metrics[k] * 100))

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--qa_preds_file', type=str, default='qa_preds.jsonl')
    parser.add_argument('--es_preds_file', type=str, default='es_preds.jsonl')
    parser.add_argument('--gts_file', type=str, default='test.jsonl')
    args = parser.parse_args()

    with open(args.qa_preds_file) as f:
        qa_preds = f.read().splitlines()

    with open(args.es_preds_file) as f:
        es_preds = f.read().splitlines()

    # The three files are aligned line by line: the i-th prediction lines
    # correspond to the i-th ground-truth line.
    with open(args.gts_file) as f:
        gt_lines = f.read().splitlines()

    gts = {}
    preds = {}
    for qa_pred, es_pred, gt in zip(qa_preds, es_preds, gt_lines):
        qa_pred = json.loads(qa_pred)
        es_pred = json.loads(es_pred)
        gt = json.loads(gt)
        qa_id = gt['qa_id']
        preds[qa_id] = {'question': gt['question'], 'answer': qa_pred['answer'], 'evidence_pages': es_pred['evidence_pages']}
        gts[qa_id] = {'question': gt['question'], 'answer': gt['answer'], 'evidence_pages': gt['evidence_pages']}

    qa_metrics = evaluate_f1_em_qa(gts, preds)
    es_metrics = evaluate_f1_em_es(gts, preds)
    main_metrics = evaluate_f1_em_main(qa_metrics, es_metrics)

    print_metrics(qa_metrics, 'QA')
    print_metrics(es_metrics, 'ES')
    print_metrics(main_metrics, 'Main')
--------------------------------------------------------------------------------
/example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nttmdlab-nlp/SlideVQA/0d4168a935d60b0b218097205a02fb1133c8bead/example.png
--------------------------------------------------------------------------------
/extract_ocr_tesseract.py:
--------------------------------------------------------------------------------
import json
import os
import pytesseract
import cv2
import argparse
import glob
from pytesseract import Output
from tqdm import tqdm

def extract_ocr(filename, output_filename):
    ### Run Tesseract on one slide image and save the OCR output as JSON ###
    if os.path.exists(output_filename):
        return

    if not filename.endswith('.jpg'):
        return

    img = cv2.imread(filename)
    if img is None:
        # cv2.imread returns None (rather than raising) when an image cannot be read.
        print('Image Loading Error', filename)
        return

    d = pytesseract.image_to_data(img, output_type=Output.DICT)
    with open(output_filename, 'w') as f:
        json.dump(d, f)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--image_dir', type=str, default='images')
    parser.add_argument('--save_dir', type=str, default='ocrs_tesseract')
    parser.add_argument('--split', type=str, default='dev')
    args = parser.parse_args()

    for filename in tqdm(glob.glob(f'{args.image_dir}/{args.split}/*/*')):
        save_dir = os.path.dirname(filename).replace(args.image_dir + '/', args.save_dir + '/')
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)

        output_filename = filename.replace(args.image_dir + '/', args.save_dir + '/').replace('.jpg', '.json')
        extract_ocr(filename, output_filename)
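
# Note: the saved JSON mirrors pytesseract's image_to_data dictionary, i.e.,
# parallel lists such as 'text', 'left', 'top', 'width', 'height', and 'conf'.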
--------------------------------------------------------------------------------
/extract_ocr_visionAPI.py:
--------------------------------------------------------------------------------
import json
import gzip
import os
import argparse
import glob
from google.cloud import vision
from tqdm import tqdm

def extract_ocr(filename, output_filename):
    ### Run the Vision API on one slide image and save the OCR output as gzip-compressed JSON ###
    if os.path.exists(output_filename):
        return

    if not filename.endswith('.jpg'):
        return

    try:
        client = vision.ImageAnnotatorClient()
        with open(filename, 'rb') as image_file:
            content = image_file.read()
        img = vision.Image(content=content)
        response = client.document_text_detection(image=img)

        # Note: the output is gzip-compressed JSON, although the file keeps a .json extension.
        with gzip.open(output_filename, 'wt') as fp:
            str_data = vision.AnnotateImageResponse.to_json(response)
            data = json.loads(str_data)
            # Keep only the full-text annotation; drop the other (unused) annotation types.
            for key in ['textAnnotations', 'faceAnnotations', 'landmarkAnnotations',
                        'logoAnnotations', 'labelAnnotations', 'localizedObjectAnnotations']:
                data.pop(key, None)
            print(json.dumps(data, ensure_ascii=False), file=fp)
    except Exception as e:
        print('OCR Error', filename, e)
        return

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--image_dir', type=str, default='images')
    parser.add_argument('--save_dir', type=str, default='ocrs_visionAPI')
    parser.add_argument('--split', type=str, default='dev')
    args = parser.parse_args()

    for filename in tqdm(glob.glob(f'{args.image_dir}/{args.split}/*/*')):
        save_dir = os.path.dirname(filename).replace(args.image_dir + '/', args.save_dir + '/')
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)

        output_filename = filename.replace(args.image_dir + '/', args.save_dir + '/').replace('.jpg', '.json')
        extract_ocr(filename, output_filename)
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
pytesseract==0.3.8
google-cloud-vision==2.7.3
opencv-python
tqdm
--------------------------------------------------------------------------------