├── README.md ├── config └── api_config.json ├── data └── LongDocURL.jsonl ├── eval ├── __pycache__ │ └── utils_api.cpython-311.pyc ├── api_models │ ├── __pycache__ │ │ ├── model.cpython-311.pyc │ │ └── pure_ocr_utils.cpython-311.pyc │ ├── eval_api_models.py │ ├── model.py │ └── pure_ocr_utils.py ├── extract_concise_answer_from_vllm.py ├── prompt_for_answer_extraction.md └── utils_api.py ├── evaluation_results ├── api_models │ ├── results_gemini15_pro.jsonl │ └── results_gpt4o.jsonl └── scores_sample_fine_grained.json ├── examples ├── example_docmind.json └── example_structured_text.txt ├── human_baselines ├── baseline_1.jsonl └── baseline_2.jsonl ├── scripts ├── calculate_metrics.sh ├── calculate_metrics_fine_grained.sh └── eval_api_models.sh └── utils ├── __pycache__ └── utils_score_v3.cpython-311.pyc ├── calculate_metrics.py ├── calculate_metrics_fine_grained.py ├── extract_ccpdf.py ├── run_extract_ccpdf.sh └── utils_score_v3.py /README.md: -------------------------------------------------------------------------------- 1 | # Welcome to LongDocURL! 2 | Repository for the paper "LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating". 3 | 4 | **Paper**: [paper](https://arxiv.org/pdf/2412.18424) 5 | 6 | **Blog Website**: [longdocurl.github.io](https://longdocurl.github.io/) 7 | 8 | **Dataset**: [dataset](https://huggingface.co/datasets/dengchao/LongDocURL/) 9 | 10 | ## About LongDocURL 11 | The LongDocURL benchmark is specifically designed to assess models' ability to understand long documents. 12 | We collect 2,325 high-quality question-answering pairs covering 396 PDF-formatted documents and more than 33,000 pages, significantly surpassing existing benchmarks in scale. 13 | Our open dataset can be found at [LongDocURL](https://huggingface.co/datasets/dengchao/LongDocURL/). You can refer to the [Blog Website](https://longdocurl.github.io/) for more information. 14 | 15 | ## Evaluation 16 | 17 | **1. (Optional) Download & Extract PDFs** 18 | 19 | Download the PDFs and the QA file (.jsonl) from [LongDocURL](https://huggingface.co/datasets/dengchao/LongDocURL/). Run the following command to extract the PDFs into PNGs and JSON files (using PyMuPDF). 20 | 21 | ```bash 22 | bash utils/run_extract_ccpdf.sh 23 | ``` 24 | 25 | Images will be organized as follows: 26 | ```markdown 27 | ├── 4000 28 | │ └── 4000001.png 29 | └── 4001 30 | ├── 4001001.png 31 | └── 4001002.png 32 | ``` 33 | 34 | **2. Other Configurations** 35 | - `api_key`: update `config/api_config.json`; used to extract the short answer from the detailed response. 36 | - `qa_jsonl`: update `data/LongDocURL.jsonl`, downloaded from [LongDocURL](https://huggingface.co/datasets/dengchao/LongDocURL/). 37 | - `api_models`: default `gpt4o-2024-05-13` for extracting the short answer. If you use our code to evaluate proprietary models, please check and modify `eval/api_models/model.py`. 38 | 39 | **3. Evaluating API Models** 40 | ```bash 41 | bash scripts/eval_api_models.sh 42 | ``` 43 | 44 | Options to note: 45 | - `process_mode`: default `serial`. Set `parallel` if parallel execution is needed. The default number of parallel processes is 8. 46 | - `image_prefix`: default `None`. Add an image prefix when needed to build proper image paths. 47 | - `model_name`: the model abbreviation is mapped to the actual model class defined in `eval/api_models/model.py`. 48 | 49 | **4. 
Calculate Metrics** 50 | 51 | To calculate the final generalized accuracy: 52 | ```bash 53 | bash scripts/calculate_metrics.sh 54 | ``` 55 | To calculate generalized accuracy in a more fine-grained way like `evaluation_results/scores_sample_fine_grained.json`: 56 | ```bash 57 | bash scripts/calculate_metrics_fine_grained.sh 58 | ``` 59 | 60 | ## 🏆 Leaderboard 🏆 61 | 62 | | Model | Size | Understanding | Reasoning | Locating | Total | 63 | |---------------------------|--------|----------------|-------------|------------|-------| 64 | | GPT-4o-24-05-13 🥇 | - | 68.6 | 59.9 | 59.6 | 64.5 | 65 | | Gemini-1.5-Pro 🥈 | - | 55.7 | 43.4 | 46.4 | 50.9 | 66 | | Qwen-VL-Max 🥉 | - | 58.8 | 43.9 | 36.0 | 49.5 | 67 | | Qwen2-VL | 7B | 36.9 | 24.8 | 22.6 | 30.6 | 68 | | LLaVA-OneVision-Chat | 7B | 30.5 | 19.0 | 18.7 | 25.0 | 69 | | LLaVA-Next-Interleave-DPO | 7B | 21.6 | 13.9 | 7.6 | 16.2 | 70 | | Llama-3.2 | 11B | 12.9 | 9.4 | 2.7 | 9.2 | 71 | -------------------------------------------------------------------------------- /config/api_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "gpt4o": { 3 | "access_key": "", 4 | "base_url": "" 5 | } 6 | } -------------------------------------------------------------------------------- /data/LongDocURL.jsonl: -------------------------------------------------------------------------------- 1 | {"question_id": "free_gemini15_pro_4055915_42_71_8", "doc_no": "4055915", "total_pages": 84, "start_end_idx": [42, 71], "question_type": "calculate", "question": "How many judging criterias are there for judging the homemade chocolate cake competition?", "answer": 5, "detailed_evidences": "The judging criteria for the Homemade Chocolate Cake Contest are listed on page 54, (0.08, 0.26, 0.92, 0.52). 
The criteria include flavor (30 points), texture (25 points), inside characteristics (20 points), outside characteristics (15 points), and frosting (10 points).", "evidence_pages": [54], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "images": ["/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_41.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_42.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_43.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_44.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_45.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_46.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_47.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_48.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_49.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_50.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_51.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_52.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_53.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_54.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_55.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_56.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_57.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_58.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_59.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_60.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_61.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_62.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_63.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_64.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_65.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_66.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_67.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_68.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_69.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_70.png"], "pdf_path": "/data/oss_bucket_0/achao.dc/public_datasets/ccpdf_zip/4000-4999/4055915.pdf"} 2 | {"question_id": "summary2title_4127644_8_37_8", "doc_no": "4127644", "total_pages": 98, "start_end_idx": [8, 37], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nThe text describes a merger between Vistra and Dynegy, where Dynegy merged with and into Vistra, with Vistra continuing as the surviving corporation. 
The merger was intended to qualify as a tax-free reorganization under the IRC, so that none of Vistra, Dynegy or any of the Dynegy stockholders would recognize any gain or loss in the transaction, except that Dynegy stockholders could recognize a gain or loss with respect to cash received in lieu of fractional shares of Vistra's common stock.\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "Dywegy.Merger Transaction", "detailed_evidences": "", "evidence_pages": [15, 16], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "images": ["/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_7.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_8.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_9.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_10.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_11.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_12.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_13.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_14.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_15.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_16.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_17.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_18.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_19.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_20.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_21.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_22.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_23.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_24.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_25.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_26.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_27.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_28.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_29.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_30.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_31.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_32.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_33.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_34.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_35.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_36.png"], "pdf_path": "/data/oss_bucket_0/achao.dc/public_datasets/ccpdf_zip/4000-4999/4127644.pdf"} 3 | {"question_id": "extract_fig2tab_4151603_24_53_2", "doc_no": "4151603", "total_pages": 84, "start_end_idx": [24, 53], "question_type": "extract_fig2tab", "question": "List names of the figures at the page which contains a table whose 
name is \"Table 4.16. Stability database of digestate in Belgium\".", "answer": ["Table 4.15. Irish digestate samples tested for RBP and OUR for this study", "Figure 4.2. Relationship between RBP and OUR in Irish digestate samples(n=5)."], "detailed_evidences": "", "evidence_pages": [42], "evidence_sources": ["Table", "Figure"], "answer_format": "List", "task_tag": "Locating", "images": ["/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_23.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_24.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_25.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_26.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_27.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_28.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_29.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_30.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_31.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_32.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_33.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_34.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_35.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_36.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_37.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_38.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_39.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_40.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_41.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_42.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_43.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_44.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_45.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_46.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_47.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_48.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_49.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_50.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_51.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_52.png"], "pdf_path": "/data/oss_bucket_0/achao.dc/public_datasets/ccpdf_zip/4000-4999/4151603.pdf"} 4 | -------------------------------------------------------------------------------- /eval/__pycache__/utils_api.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dengc2023/LongDocURL/84cd1ac4d314f24719e5e5e8e093d336a8c39faf/eval/__pycache__/utils_api.cpython-311.pyc -------------------------------------------------------------------------------- /eval/api_models/__pycache__/model.cpython-311.pyc: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/dengc2023/LongDocURL/84cd1ac4d314f24719e5e5e8e093d336a8c39faf/eval/api_models/__pycache__/model.cpython-311.pyc -------------------------------------------------------------------------------- /eval/api_models/__pycache__/pure_ocr_utils.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dengc2023/LongDocURL/84cd1ac4d314f24719e5e5e8e093d336a8c39faf/eval/api_models/__pycache__/pure_ocr_utils.cpython-311.pyc -------------------------------------------------------------------------------- /eval/api_models/eval_api_models.py: -------------------------------------------------------------------------------- 1 | # TODO: 2 | import sys 3 | import pathlib 4 | sys.path.append(str(pathlib.Path(__file__).absolute().parent.parent.parent)) 5 | 6 | import argparse 7 | import os 8 | from io import BytesIO 9 | 10 | import oss2 11 | import base64 12 | import json 13 | from tqdm import tqdm 14 | import requests 15 | import re 16 | import time 17 | from multiprocessing import Pool 18 | import datetime 19 | from openai import OpenAI 20 | 21 | from eval.utils_api import * 22 | from utils.utils_score_v3 import * 23 | from model import Gemini15ProInferencer, GPT4oInferencer, QwenVLMaxInferencer, O1PreviewInferencer, QwenMaxInferencer 24 | from pure_ocr_utils import * 25 | 26 | system_prompt = "You are an expert in visual document question-answering, please answer our questions based on the given images.\n" 27 | 28 | # TODO 29 | project_prefix = "/mnt/workspace/Projects/CodeLib/LongDocURL/" 30 | 31 | config_file = os.path.join(project_prefix, "config/api_config.json") 32 | extractor_prompt_path = os.path.join(project_prefix, "eval/prompt_for_answer_extraction.md") 33 | 34 | with open(config_file, "r", encoding="utf-8") as rf: 35 | config = json.load(rf) 36 | client = OpenAI(api_key=config["gpt4o"]["access_key"], base_url=config["gpt4o"]["base_url"]) 37 | 38 | model_name2inferencer = {"gpt4o": "GPT4oInferencer", "gemini15_pro": "Gemini15ProInferencer", "qwen_vl_max": "QwenVLMaxInferencer", \ 39 | "o1_preview": "O1PreviewInferencer", "qwen_max": "QwenMaxInferencer"} 40 | 41 | prompt_sign = True 42 | 43 | def preprocess(input_datapath, output_datapath, image_prefix=None): 44 | dataset = read_jsonl_file(input_datapath) 45 | print("dataset cnt: ", len(dataset)) 46 | 47 | if os.path.exists(output_datapath): 48 | output_dataset = read_jsonl_file(output_datapath) 49 | dataset = delete_generate_dataset(dataset, output_dataset) 50 | 51 | if image_prefix is not None: 52 | for _ in dataset: 53 | for i, image_path in enumerate(_["images"]): 54 | _["images"][i] = os.path.join(image_prefix, "/".join(image_path.split("/")[-2:])) 55 | 56 | print("dataset cnt need to do: ", len(dataset)) 57 | 58 | return dataset 59 | 60 | def read_jsonl_file(file_path): 61 | data = [] 62 | with open(file_path, "r", encoding="utf-8") as jsonl_file: 63 | for i, line in enumerate(jsonl_file): 64 | data_dict = json.loads(line.strip()) 65 | if 'question_id' not in data_dict: 66 | data_dict['question_id'] = i 67 | data.append(data_dict) 68 | return data 69 | 70 | def call_llm(prompt, urls, temperature=0.1, seed=42, max_tokens=4096): 71 | msgs = get_msg_format(prompt, urls) 72 | response = None 73 | max_try = 2 74 | while response is None and max_try > 0: 75 | try: 76 | # TODO 77 | completion = client.chat.completions.create(model="gpt-4o-0513", messages=msgs, 
temperature=0.) 78 | # completion = client.chat.completions.create(model="gpt-4o", messages=msgs, temperature=0.) 79 | response = completion.choices[0].message.content 80 | except Exception as e: 81 | print(f"error with {e}, response = {response}") 82 | max_try -= 1 83 | response = None 84 | 85 | return response 86 | 87 | def delete_generate_dataset(dataset, output_dataset): 88 | finished_question_id_set = set([sample['question_id'] for sample in output_dataset]) 89 | unfinished_dataset = [sample for sample in dataset if sample['question_id'] not in finished_question_id_set] 90 | return unfinished_dataset 91 | 92 | def eval_per_record(args): 93 | print("--------------------------------------") 94 | case, output_datapath, model_name = args 95 | 96 | inferencer = eval(model_name2inferencer[model_name])() 97 | 98 | question = case["question"] 99 | prompt = system_prompt + "Following is our question: \n" + f"{question}" + "\n" 100 | 101 | result = inferencer.infer(prompt, case["images"]) 102 | 103 | if result is None: 104 | return 105 | 106 | # extract concise answer 107 | with open(extractor_prompt_path) as f: 108 | extractor_prompt = f.read() 109 | prompt = system_prompt + extractor_prompt + "\nQuestion: " + question + "\nAnalysis: " + result 110 | extractor_result = call_llm(prompt, None) 111 | try: 112 | import re 113 | concise_answer = re.findall(r"(.*?)", extractor_result, re.DOTALL)[0] 114 | answer_format = re.findall(r"(.*?)", extractor_result, re.DOTALL)[0] 115 | except: 116 | concise_answer = "Fail to extract" 117 | answer_format = "None" 118 | 119 | # calculate scores 120 | try: 121 | # pred_ans = eval(concise_answer) 122 | pred_ans = eval(concise_answer) if not isinstance(eval(concise_answer), set) else list(eval(concise_answer)) 123 | except: 124 | pred_ans = concise_answer 125 | if pred_ans == "Fail to extract": 126 | score_v3 = 0.0 127 | else: 128 | score_v3 = eval_score(case["answer"], pred_ans, case["answer_format"]) 129 | 130 | case["detailed_response"] = result 131 | case["pred"] = pred_ans 132 | case["score_v3"] = score_v3 133 | 134 | print("\n\n") 135 | print("Question: {}".format(case["question"])) 136 | print("Response: {}".format(case["pred"])) 137 | 138 | print("Gt: {}\tPred: {}\tScore_v3: {}".format(case["answer"], case["pred"], case["score_v3"])) 139 | 140 | if result is not None: # Check if result is not None 141 | try: # not json serialable 142 | with open(output_datapath, "a") as output_review_file: 143 | output_review_file.write(json.dumps(case, ensure_ascii=False) + "\n") 144 | except Exception as e: 145 | print("error: ", e) 146 | else: 147 | print("error") 148 | 149 | 150 | def evaluate(dataset, output_datapath, model_name="gpt4o", process_mode="serial", extra_infos=None): 151 | 152 | if os.path.exists(output_datapath): 153 | output_dataset = read_jsonl_file(output_datapath) 154 | dataset = delete_generate_dataset(dataset, output_dataset) 155 | 156 | print("dataset cnt: ", len(dataset)) 157 | if not len(dataset): 158 | return 159 | 160 | args_list = [] 161 | for case in dataset: 162 | args_list.append((case, output_datapath, model_name)) 163 | 164 | start_time = datetime.datetime.now() 165 | print("job start time:", start_time) 166 | 167 | if process_mode == "serial": 168 | for args in args_list: 169 | eval_per_record(args) 170 | elif process_mode == "parallel": 171 | with Pool(processes=8) as pool: # You can adjust the number of processes as needed 172 | list(tqdm(pool.imap(eval_per_record, args_list), total=len(args_list))) 173 | else: 174 | print("process 
mode error!") 175 | 176 | 177 | if __name__ == "__main__": 178 | parser = argparse.ArgumentParser() 179 | parser.add_argument('--qa_file', type=str, default="data/LongDocURL.jsonl") 180 | parser.add_argument('--results_file', type=str, default="evaluation_results/api_models/results_gpt4o.jsonl") 181 | parser.add_argument('--process_mode', type=str, default="serial") # serial/parallel 182 | # parser.add_argument('--input_format', type=str, default="e2e") # e2e/ocr 183 | parser.add_argument('--image_prefix', type=str, default="/mnt/achao/Downloads/pdf_pngs/4000-4999") 184 | parser.add_argument('--model_name', type=str, default="gpt4o") # gemini15_pro/claude35_sonnet/qwen_vl_max/gpt4o 185 | 186 | args = parser.parse_args() 187 | 188 | input_datapath = args.qa_file 189 | output_datapath = args.results_file 190 | 191 | # load data 192 | # dataset = preprocess(input_datapath, output_datapath) 193 | # if image paths are not modified in .jsonl file, add image prefix when executed 194 | dataset = preprocess(input_datapath, output_datapath, image_prefix=args.image_prefix) 195 | 196 | try_cnt = 2 197 | while try_cnt: 198 | try_cnt -= 1 199 | try: 200 | evaluate(dataset, output_datapath, model_name=args.model_name, process_mode=args.process_mode) 201 | except Exception as e: 202 | print(f"An error occurred: {e}") 203 | print("Restarting script...") 204 | time.sleep(1) 205 | 206 | acc, f1, = calculate_acc_and_f1(output_datapath) 207 | print("--------------------------------------") 208 | print("Avg acc: {}".format(acc)) 209 | print("Avg f1: {}".format(f1)) -------------------------------------------------------------------------------- /eval/api_models/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import base64 3 | from io import BytesIO 4 | from transformers import AutoModelForCausalLM, AutoTokenizer, Blip2Processor, Blip2ForConditionalGeneration, BitsAndBytesConfig 5 | from PIL import Image 6 | from abc import ABC, abstractmethod 7 | from openai import OpenAI 8 | import requests 9 | import os 10 | from typing import Union 11 | import oss2 12 | import json 13 | 14 | # TODO 15 | project_prefix = "/mnt/workspace/Projects/CodeLib/LongDocURL/" 16 | config_file = os.path.join(project_prefix, "config/api_config.json") 17 | 18 | 19 | class APIInferencer(ABC): 20 | def __init__(self): 21 | pass 22 | # uncomment if oss paths are used 23 | # self.bucket = self.get_alimama_oss_bucket() 24 | 25 | def get_alimama_oss_bucket(self): 26 | # TODO 27 | endpoint = '' 28 | access_key_id = '' 29 | access_key_secret = '' 30 | bucket_name = '' 31 | bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name) 32 | return bucket 33 | 34 | @abstractmethod 35 | def infer(self, prompt: str, image_path: str) -> str: 36 | pass 37 | 38 | def load_client(self): 39 | with open(config_file, "r", encoding="utf-8") as rf: 40 | config = json.load(rf) 41 | return OpenAI(api_key=config["gpt4o"]["access_key"], base_url=config["gpt4o"]["base_url"]) 42 | 43 | def cleanup(self): 44 | if hasattr(self, 'client'): 45 | del self.client 46 | 47 | def encode_image_to_base64(self, image_path: str) -> str: 48 | if 'https' in image_path: 49 | response = requests.get(image_path) 50 | img = BytesIO(response.content) 51 | return base64.b64encode(img.read()).decode('utf-8') 52 | 53 | if image_path.startswith('oss://'): 54 | return base64.b64encode(self.bucket.get_object(image_path[6:].split("/", 1)[1]).read()).decode("utf-8") 55 | 56 | with open(image_path, "rb") 
as image_file: 57 | return base64.b64encode(image_file.read()).decode('utf-8') 58 | 59 | def get_correct_response(self, model_name: str, prompt: str, image_path: Union[list, str]) -> str: 60 | response = self.model_chat(model_name, prompt, image_path) 61 | return response 62 | 63 | def model_chat(self, model_name: str, prompt: str, image_path: str) -> str: 64 | client = self.load_client() 65 | messages = [ 66 | { 67 | "role": "user", 68 | "content": self.build_message_content(prompt, image_path) 69 | } 70 | ] 71 | max_try = 2 72 | response = None 73 | while response is None and max_try > 0: 74 | try: 75 | completion = client.chat.completions.create(model=model_name, messages=messages, temperature=0.) 76 | response = completion.choices[0].message.content 77 | except Exception as e: 78 | print("exception: ", e) 79 | max_try -= 1 80 | return response 81 | 82 | def build_message_content(self, prompt: str, image_path: str): 83 | content = [{"type": "text", "text": prompt}] 84 | if image_path is None: 85 | return content 86 | if isinstance(image_path, str): 87 | image_paths = [image_path] 88 | elif isinstance(image_path, Union[list, tuple]): 89 | image_paths = image_path 90 | base64_images = [self.encode_image_to_base64(image_path) for image_path in image_paths] 91 | for i, base64_image in enumerate(base64_images): 92 | content += [ 93 | {"type": "text", "text": f"Below is the {i+1}-th image (total {len(base64_images)} images).\n"}, 94 | { 95 | "type": "image_url", 96 | "image_url": { 97 | "url": f"data:image/png;base64,{base64_image}" 98 | }, 99 | }, 100 | ] 101 | return content 102 | 103 | class QwenMaxInferencer(APIInferencer): 104 | def infer(self, prompt: str, image_path: str) -> str: 105 | response = self.get_correct_response('qwen_max', prompt, image_path) 106 | return response 107 | 108 | 109 | class O1PreviewInferencer(APIInferencer): 110 | def infer(self, prompt: str, image_path: str) -> str: 111 | response = self.get_correct_response('o1-preview-0912', prompt, image_path) 112 | return response 113 | 114 | class GPT4oInferencer(APIInferencer): 115 | def infer(self, prompt: str, image_path: str) -> str: 116 | response = self.get_correct_response('gpt-4o-0513', prompt, image_path) 117 | # response = self.get_correct_response('gpt-4o', prompt, image_path) 118 | return response 119 | 120 | class Gemini15ProInferencer(APIInferencer): 121 | def infer(self, prompt: str, image_path: str) -> str: 122 | response = self.get_correct_response('gemini-1.5-pro', prompt, image_path) 123 | return response 124 | 125 | 126 | class QwenVLMaxInferencer(APIInferencer): 127 | def infer(self, prompt: str, image_path: str) -> str: 128 | response = self.get_correct_response('qwen-vl-max', prompt, image_path) 129 | return response 130 | -------------------------------------------------------------------------------- /eval/api_models/pure_ocr_utils.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def record2text_with_layout(record): 5 | text = "" 6 | img_width, img_height = record['docInfo']['pages'][0]['imageWidth'], record['docInfo']['pages'][0]['imageHeight'] 7 | for item in record['layouts']: 8 | _type, sub_type = item['type'], item['subType'] 9 | item_text = item['text'] 10 | x1y1 = item['pos'][0]; x2y2 = item['pos'][2] 11 | text += f"(type: {_type}, sub_type: {sub_type}, box: {tuple(float(f'{_:.2f}') for _ in (x1y1['x']/img_width, x1y1['y']/img_height, x2y2['x']/img_width, x2y2['y']/img_height))})" + f" {item_text}\n" 12 | 13 | return text 14 | 
15 | def record2text(record): 16 | text = "" 17 | for item in record['layouts']: 18 | item_text = item['text'] 19 | text += f"{item_text}\n" 20 | 21 | return text 22 | 23 | 24 | def get_pure_ocr_prompt_docmind(doc_no: str, **kwargs): 25 | zip_no = doc_no[:4] 26 | json_path = "/mnt/achao/Downloads/pdf_jsons/{}/{}_docmind_results.json" 27 | record = json.load(open(json_path.format(zip_no, doc_no), "r", encoding="utf-8"))['contents'] 28 | ocr_text_template = "page_no: {}\n{}\n\n" 29 | 30 | start_page = kwargs.pop("start_page", 0); end_page = kwargs.pop("end_page", start_page+1) 31 | if "extra_infos" in kwargs and "with_layout" in kwargs["extra_infos"] and kwargs["extra_infos"]["with_layout"]: 32 | ocr_texts = [record2text_with_layout(record[f"page_{idx}"]) for idx in range(start_page, end_page+1) if f"page_{idx}" in record] 33 | else: 34 | ocr_texts = [record2text(record[f"page_{idx}"]) for idx in range(start_page, end_page+1) if f"page_{idx}" in record] 35 | pages_used = end_page - start_page + 1 36 | print("number of pages used: ", end_page - start_page + 1) 37 | 38 | ocr_prompt = "\n\n" 39 | for page_no, ocr_text in zip(range(start_page, end_page+1), ocr_texts): 40 | ocr_prompt += ocr_text_template.format(page_no+1, ocr_text) # why page_no+1? 41 | 42 | return ocr_prompt 43 | 44 | def get_pure_ocr_prompt_pymupdf(doc_no: str, **kwargs): 45 | zip_no = doc_no[:4] 46 | json_path = "/mnt/achao/Downloads/pdf_jsons/{}/{}_line_level.json" 47 | ocr_texts_doc = json.load(open(json_path.format(zip_no, doc_no), "r", encoding="utf-8"))["pages_str"] # list 48 | ocr_text_template = "page_no: {}\n{}\n\n" 49 | 50 | start_page = kwargs.pop("start_page", 0); end_page = kwargs.pop("end_page", start_page+1) 51 | 52 | pages_used = end_page - start_page + 1 53 | print("number of pages used: ", end_page - start_page + 1) 54 | 55 | ocr_prompt = "\n\n" 56 | for page_no, ocr_texts_page in zip(range(start_page, end_page+1), ocr_texts_doc): 57 | ocr_prompt += ocr_text_template.format(page_no+1, ocr_texts_page) # why page_no+1? 
58 | 59 | return ocr_prompt -------------------------------------------------------------------------------- /eval/extract_concise_answer_from_vllm.py: -------------------------------------------------------------------------------- 1 | # # TODO: 2 | # import sys 3 | # sys.path.append("/root_dir/LongDocURL/") 4 | 5 | import argparse 6 | import os 7 | from io import BytesIO 8 | 9 | import oss2 10 | import base64 11 | import json 12 | from tqdm import tqdm 13 | import requests 14 | import re 15 | import time 16 | from multiprocessing import Pool 17 | import datetime 18 | from openai import OpenAI 19 | 20 | from eval.utils_api import * 21 | from utils.utils_score_v3 import * 22 | 23 | system_prompt = "You are an expert in visual document question-answering, please answer our questions based on the given images.\n" 24 | 25 | # TODO 26 | project_prefix = "/root_dir/LongDocURL/" 27 | config_file = os.path.join(project_prefix, "config/api_config.json") 28 | extractor_prompt_path = os.path.join(project_prefix, "eval/prompt_for_answer_extraction.md") 29 | 30 | with open(config_file, "r", encoding="utf-8") as rf: 31 | config = json.load(rf) 32 | client = OpenAI(api_key=config["gpt4o"]["access_key"], base_url=config["gpt4o"]["base_url"]) 33 | 34 | 35 | def call_llm(prompt, urls, temperature=0.1, seed=42, max_tokens=4096): 36 | msgs = get_msg_format(prompt, urls) 37 | response = None 38 | max_try = 6 39 | while response is None and max_try > 0: 40 | try: 41 | completion = client.chat.completions.create(model="gpt-4o-0513", messages=msgs, temperature=0.) 42 | response = completion.choices[0].message.content 43 | except Exception as e: 44 | print(f"error with {e}, response = {response}") 45 | max_try -= 1 46 | response = None 47 | 48 | return response 49 | 50 | 51 | def delete_generated_dataset(records, output_datapath): 52 | finished_sample_ids = set([json.loads(_.strip())["question_id"] for _ in open(output_datapath, "r", encoding="utf-8").readlines()]) if os.path.exists(output_datapath) else set() 53 | return [_ for _ in records if _["question_id"] not in finished_sample_ids] 54 | 55 | 56 | def extract_per_record(args): 57 | case, result, output_datapath = args 58 | question = case["question"] 59 | print(case["question_id"]) 60 | 61 | # extract concise answer 62 | with open(extractor_prompt_path) as f: 63 | extractor_prompt = f.read() 64 | prompt = system_prompt + extractor_prompt + "\nQuestion: " + question + "\nAnalysis: " + result 65 | extractor_result = call_llm(prompt, None) 66 | try: 67 | import re 68 | concise_answer = re.findall(r"(.*?)", extractor_result, re.DOTALL)[0] 69 | answer_format = re.findall(r"(.*?)", extractor_result, re.DOTALL)[0] 70 | except: 71 | concise_answer = "Fail to extract" 72 | answer_format = "None" 73 | 74 | # calculate scores 75 | try: 76 | # pred_ans = eval(concise_answer) 77 | pred_ans = eval(concise_answer) if not isinstance(eval(concise_answer), set) else list(eval(concise_answer)) 78 | except: 79 | pred_ans = concise_answer 80 | if pred_ans == "Fail to extract": 81 | score_v3 = 0.0 82 | else: 83 | score_v3 = eval_score(case["answer"], pred_ans, case["answer_format"]) 84 | 85 | case["pred"] = pred_ans 86 | case["score_v3"] = score_v3 87 | 88 | print("\n\n") 89 | print("Question: {}".format(case["question"])) 90 | print("Response: {}".format(case["pred"])) 91 | print("Gt: {}\tPred: {}\tScore_v3: {}".format(case["answer"], case["pred"], case["score_v3"])) 92 | 93 | try: 94 | with open(output_datapath, "a") as output_review_file: 95 | 
output_review_file.write(json.dumps(case, ensure_ascii=False) + "\n") 96 | except Exception as e: 97 | print("error: ", e) 98 | print("error: ", case["question_id"]) 99 | 100 | 101 | 102 | def extract_answers(records, output_datapath): 103 | 104 | records = delete_generated_dataset(records, output_datapath) 105 | args_list = [] 106 | for record in records: 107 | args_list.append((record, record["detailed_response"], output_datapath)) 108 | 109 | with Pool(processes=1) as pool: 110 | list(tqdm(pool.imap(extract_per_record, args_list), total=len(args_list))) 111 | 112 | 113 | # def run_test(): 114 | # # case = {} 115 | # question = case["question"] 116 | # result = case["detailed_response"] 117 | # output_datapath = "" 118 | # args = (case, question, result, output_datapath) 119 | # extract_per_record(args) 120 | 121 | 122 | if __name__ == "__main__": 123 | parser = argparse.ArgumentParser() 124 | parser.add_argument("--qa_file", type=str, default="./evaluation_results/api_models/results_detailed.jsonl") 125 | parser.add_argument("--results_file", type=str, default="./evaluation_results/api_models/results_extracted.jsonl") 126 | args = parser.parse_args() 127 | 128 | with open(args.qa_file, "r", encoding="utf-8") as rf: 129 | records = [json.loads(_.strip()) for i, _ in enumerate(rf.readlines())] 130 | 131 | extract_answers(records, args.results_file) 132 | 133 | # run_test() 134 | 135 | 136 | -------------------------------------------------------------------------------- /eval/prompt_for_answer_extraction.md: -------------------------------------------------------------------------------- 1 | Given the question and analysis, you are tasked to extract answers with required formats from the free-form analysis. 2 | - Your extracted answers should be one of the following formats: (1) Integer, (2) Float, (3) String and (4) List. If you find the analysis the question can not be answered from the given documents, type "Not answerable". Exception: If the analysis only tells you that it can not read/understand the images or documents, type "Fail to answer". 3 | - Answers may not meet visibility requirements within questions, such as, one question requires to find answers 'between 7-th to 20-th images' but invisible information in 'image 23' is included in analysis. So, these invisible components should be deprecated and removed from final answers. 4 | - Please use and tokens at the start and end of the extracted answer. For example, if the extracted answer is number 3, the format is 3. 5 | - Please use and tokens at the start and end of the answer format. For example, if the answer format is List, the format is List. 6 | - Please make your response as concise as possible. Also note that your response should be formatted as below: 7 | ``` 8 | Extracted answer: [answer] 9 | Answer format: [answer format] 10 | ``` 11 | 12 | Please read the following example, then extract the answer from the model response and type it at the end of the prompt. 13 | 14 | --- 15 | Question: List the primary questions asked about the services in this report. 16 | Analysis: The primary questions asked about the services in the report for The Limes Residential Home are: \n\n1. Is the service safe? \n\n2. Is the service effective? \n\n3. Is the service caring? \n\n4. Is the service responsive? \n\n5. Is the service well-led? 
17 | Extracted answer: ['Is the service safe?', 'Is the service effective?', 'Is the service caring?', 'Is the service responsive?', 'Is the service well-led?'] 18 | Answer format: List 19 | 20 | --- 21 | Question: How many regulations of the HSCA 2008 are breached in all according to this report? 22 | Analysis: According to the report, the provider breached 10 Health and Social Care Act 2008 (Regulated Activities) Regulations in total. Here are the specifics: 23 | 24 | 1. Regulation 13: Safeguarding service users from abuse and improper treatment 25 | 26 | 2. Regulation 12: Safe care and treatment 27 | 28 | 3. Regulation 18: Staffing 29 | 30 | 4. Regulation 11: Need for consent 31 | 32 | 5. Regulation 10: Dignity and respect 33 | 34 | 6. Regulation 9: Person-centred care 35 | 36 | 7. Regulation 17: Good governance 37 | 38 | 8. Regulation 18 (CQC Registration Regulations 2009): Notification of other incidents 39 | 40 | 9. Regulation 18: Failure to maintain an accurate and up-to-date care plan 41 | 42 | 10. Regulation 11: Failure to implement the Mental Capacity Act 2005 code of practice effectively 43 | 44 | These breaches involve issues concerning staffing, safeguarding, medicines management, dignity and respect, consent, care planning, governance, and failure to notify the CQC of incidents. 45 | Extracted answer: 10 46 | Answer format: Integer 47 | 48 | --- 49 | Question: According to the survey, what is the percentage of Chinese who are paying more or about the same attention to politics after Trump’s election? 50 | Analysis: The survey provided does not specify the percentage of Chinese individuals specifically who are paying more or about the same attention to politics after Trump’s election. The report focuses primarily on American demographics and does not include specific details about the Chinese population in relation to this question. If you need information about a different demographic or a summary of the findings from the American demographic, I can certainly help with that! 51 | Extracted answer: Not answerable 52 | Answer format: None 53 | 54 | --- 55 | Question: How many quotations from male respondents over 50 years old are included in this report? 56 | Analysis: The image you’ve provided appears to be a screenshot of a document with multiple charts. However, the text is too small and blurry to read accurately. If you can provide a clearer image or more context, I might be able to help you with your question. 
57 | Extracted answer: Fail to answer 58 | Answer format: None 59 | 60 | --- -------------------------------------------------------------------------------- /eval/utils_api.py: -------------------------------------------------------------------------------- 1 | import os 2 | from io import BytesIO 3 | import oss2 4 | import base64 5 | import json 6 | import requests 7 | import re 8 | from typing import Union 9 | 10 | # TODO 11 | def get_alimama_oss_bucket(): 12 | endpoint = '' 13 | access_key_id = '' 14 | access_key_secret = '' 15 | bucket_name = '' 16 | bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name) 17 | return bucket 18 | 19 | # # uncomment if oss paths are used 20 | # bucket = get_alimama_oss_bucket() 21 | 22 | 23 | def encode_image_to_base64(image_path): 24 | if 'https' in image_path: 25 | response = requests.get(image_path) 26 | img = BytesIO(response.content) 27 | return base64.b64encode(img.read()).decode('utf-8') 28 | 29 | if image_path.startswith('oss://'): 30 | return base64.b64encode(bucket.get_object(image_path[6:].split("/", 1)[1]).read()).decode("utf-8") 31 | 32 | with open(image_path, "rb") as image_file: 33 | return base64.b64encode(image_file.read()).decode('utf-8') 34 | 35 | 36 | def get_msg_format(prompt, img_urls): 37 | content = [{"type": "text", "text": prompt}] 38 | if img_urls is not None: 39 | if isinstance(img_urls, str): 40 | base64_images = [encode_image_to_base64(img_urls)] 41 | elif isinstance(img_urls, Union[list, tuple]): 42 | base64_images = [encode_image_to_base64(img_url) for img_url in img_urls] 43 | 44 | for i, base64_image in enumerate(base64_images): 45 | content += [ 46 | {"type": "text", "text": f"Below is the {i+1}-th image (total {len(base64_images)} images).\n"}, 47 | { 48 | "type": "image_url", 49 | "image_url": { 50 | "url": f"data:image/png;base64,{base64_image}" 51 | }, 52 | }, 53 | ] 54 | messages = [ 55 | { 56 | "role": "user", 57 | "content": content 58 | } 59 | ] 60 | return messages 61 | 62 | 63 | -------------------------------------------------------------------------------- /evaluation_results/api_models/results_gemini15_pro.jsonl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dengc2023/LongDocURL/84cd1ac4d314f24719e5e5e8e093d336a8c39faf/evaluation_results/api_models/results_gemini15_pro.jsonl -------------------------------------------------------------------------------- /evaluation_results/api_models/results_gpt4o.jsonl: -------------------------------------------------------------------------------- 1 | {"question_id": "free_gemini15_pro_4055915_42_71_8", "doc_no": "4055915", "total_pages": 84, "start_end_idx": [42, 71], "question_type": "calculate", "question": "How many judging criterias are there for judging the homemade chocolate cake competition?", "answer": 5, "detailed_evidences": "The judging criteria for the Homemade Chocolate Cake Contest are listed on page 54, (0.08, 0.26, 0.92, 0.52). 
The criteria include flavor (30 points), texture (25 points), inside characteristics (20 points), outside characteristics (15 points), and frosting (10 points).", "evidence_pages": [54], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "images": ["/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_41.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_42.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_43.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_44.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_45.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_46.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_47.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_48.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_49.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_50.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_51.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_52.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_53.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_54.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_55.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_56.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_57.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_58.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_59.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_60.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_61.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_62.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_63.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_64.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_65.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_66.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_67.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_68.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_69.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_70.png"], "pdf_path": "/data/oss_bucket_0/achao.dc/public_datasets/ccpdf_zip/4000-4999/4055915.pdf", "detailed_response": "There are five judging criteria for the homemade chocolate cake competition. They are:\n\n1. Flavor (aroma, taste, good balance of flavorings) - 30 points\n2. Texture (moist and tender crumb; not soggy or dry) - 25 points\n3. Inside Characteristics (even grain, evenly baked, free from air pockets) - 20 points\n4. Outside Characteristics (consistent shape/size/surface; overall appeal) - 15 points\n5. 
Frosting (taste, texture, even color) - 10 points", "pred": 5, "score_v3": 1.0} 2 | -------------------------------------------------------------------------------- /evaluation_results/scores_sample_fine_grained.json: -------------------------------------------------------------------------------- 1 | { 2 | "scores":{ 3 | "Main_Task":{ 4 | "Understanding": 0.0, 5 | "Reasoning": 0.0, 6 | "Locating": 0.0 7 | }, 8 | "Element_Type":{ 9 | "Text": 0.0, 10 | "Layout": 0.0, 11 | "Figure": 0.0, 12 | "Table": 0.0 13 | }, 14 | "Evidence_Pages":{ 15 | "Single_Page": 0.0, 16 | "Multi_Page": 0.0 17 | }, 18 | "Num_of_Element_Types":{ 19 | "Cross_Element": 0.0 20 | }, 21 | "Fine_Grained":{ 22 | "Understanding":{ 23 | "Single_Page":{ 24 | "Text": 0.0, 25 | "Layout": 0.0, 26 | "Figure": 0.0, 27 | "Table": 0.0 28 | }, 29 | "Multi_Page":{ 30 | "Text": 0.0, 31 | "Layout": 0.0, 32 | "Figure": 0.0, 33 | "Table": 0.0 34 | } 35 | }, 36 | "Reasoning":{ 37 | "Single_Page":{ 38 | "Text": 0.0, 39 | "Layout": 0.0, 40 | "Figure": 0.0, 41 | "Table": 0.0 42 | }, 43 | "Multi_Page":{ 44 | "Text": 0.0, 45 | "Layout": 0.0, 46 | "Figure": 0.0, 47 | "Table": 0.0 48 | } 49 | }, 50 | "Locating":{ 51 | "Cross_Element":{ 52 | "Cross_Title": 0.0, 53 | "Cross_Table": 0.0, 54 | "Para_Title": 0.0, 55 | "Figure_Table": 0.0 56 | } 57 | } 58 | } 59 | }, 60 | "sample_cnt":{ 61 | "Main_Task":{ 62 | "Understanding": 1243, 63 | "Reasoning": 387, 64 | "Locating": 695 65 | }, 66 | "Element_Type":{ 67 | "Text": 994, 68 | "Layout": 779, 69 | "Figure": 556, 70 | "Table": 871 71 | }, 72 | "Evidence_Pages":{ 73 | "Single_Page": 1093, 74 | "Multi_Page": 1230 75 | }, 76 | "Num_of_Element_Types":{ 77 | "Cross_Element": 862 78 | }, 79 | "Fine_Grained":{ 80 | "Understanding":{ 81 | "Single_Page":{ 82 | "Text": 259, 83 | "Layout": 91, 84 | "Figure": 94, 85 | "Table": 263 86 | }, 87 | "Multi_Page":{ 88 | "Text": 443, 89 | "Layout": 172, 90 | "Figure": 174, 91 | "Table": 115 92 | } 93 | }, 94 | "Reasoning":{ 95 | "Single_Page":{ 96 | "Text": 40, 97 | "Layout": 12, 98 | "Figure": 28, 99 | "Table": 98 100 | }, 101 | "Multi_Page":{ 102 | "Text": 115, 103 | "Layout": 40, 104 | "Figure": 85, 105 | "Table": 69 106 | } 107 | }, 108 | "Locating":{ 109 | "Cross_Element":{ 110 | "Cross_Title": 201, 111 | "Cross_Table": 126, 112 | "Para_Title": 137, 113 | "Figure_Table": 231 114 | } 115 | } 116 | } 117 | } 118 | 119 | } -------------------------------------------------------------------------------- /examples/example_docmind.json: -------------------------------------------------------------------------------- 1 | { 2 | "record_id": "4020103_1", 3 | "data": { 4 | "logics": { 5 | "docTree": [ 6 | { 7 | "backlink": { 8 | "上级": [ 9 | "ROOT" 10 | ] 11 | }, 12 | "level": 0, 13 | "link": { 14 | "下级": [], 15 | "包含": [] 16 | }, 17 | "uniqueId": "a73142f82a0bf25f1f9a033332428d81" 18 | }, 19 | { 20 | "backlink": { 21 | "上级": [ 22 | "ROOT" 23 | ] 24 | }, 25 | "level": 0, 26 | "link": { 27 | "下级": [], 28 | "包含": [] 29 | }, 30 | "uniqueId": "bc31341899f848a372c4e5312d15ffcf" 31 | }, 32 | { 33 | "backlink": { 34 | "上级": [ 35 | "ROOT" 36 | ] 37 | }, 38 | "level": 0, 39 | "link": { 40 | "下级": [], 41 | "包含": [] 42 | }, 43 | "uniqueId": "309eed9853f2ff2d0d548760e8661858" 44 | }, 45 | { 46 | "backlink": { 47 | "上级": [ 48 | "bc31341899f848a372c4e5312d15ffcf" 49 | ] 50 | }, 51 | "level": 1, 52 | "link": { 53 | "下级": [], 54 | "包含": [] 55 | }, 56 | "uniqueId": "647a777b4a1fa759eff5d8f8a76720a3" 57 | }, 58 | { 59 | "backlink": { 60 | "上级": [ 61 | "bc31341899f848a372c4e5312d15ffcf" 62 | ] 63 
| }, 64 | "level": 1, 65 | "link": { 66 | "下级": [], 67 | "包含": [] 68 | }, 69 | "uniqueId": "dcce8b595a53d275c769634a69cfa2b0" 70 | }, 71 | { 72 | "backlink": { 73 | "上级": [ 74 | "bc31341899f848a372c4e5312d15ffcf" 75 | ] 76 | }, 77 | "level": 1, 78 | "link": { 79 | "下级": [], 80 | "包含": [] 81 | }, 82 | "uniqueId": "16f675dd06680c65c3d88ace0635389e" 83 | }, 84 | { 85 | "backlink": { 86 | "上级": [ 87 | "bc31341899f848a372c4e5312d15ffcf" 88 | ] 89 | }, 90 | "level": 1, 91 | "link": { 92 | "下级": [], 93 | "包含": [] 94 | }, 95 | "uniqueId": "4aafb975055e208a637931dc387e5040" 96 | } 97 | ], 98 | "paragraphKVs": [] 99 | }, 100 | "docInfo": { 101 | "pages": [ 102 | { 103 | "imageWidth": 432, 104 | "fileName": null, 105 | "sheetName": "", 106 | "fileIndex": 0, 107 | "pageIdCurDoc": 0, 108 | "source": null, 109 | "pdfParseResult": null, 110 | "imageHeight": 648, 111 | "isBlankPage": null, 112 | "sourceFileType": "cover", 113 | "imageUrl": "http://docmind-api-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/publicDocStructure/docmind-20240827-e89d28a1fb2d49639f0b4a1a613138ab/0.png?Expires=1724793864&OSSAccessKeyId=LTAI5tQL9bqLHC5HYKV68DA9&Signature=dxBVmKA3KZQR421uzb8dW1sG8Pw%3D", 114 | "angle": 0.0, 115 | "sourceImagePath": null, 116 | "excelParseResult": null, 117 | "fileUrl": null, 118 | "wordParseResult": null, 119 | "imageType": "JPEG", 120 | "imageStorageKey": "docmind-20240827-e89d28a1fb2d49639f0b4a1a613138ab/0.png", 121 | "pageIdAllDocs": 0, 122 | "imageParseResult": "http://docmind-api-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/ocr?Expires=1724793864&OSSAccessKeyId=LTAI5tQL9bqLHC5HYKV68DA9&Signature=1PrwDsB9eWxmvmz90%2FwqRIDfsxc%3D" 123 | } 124 | ], 125 | "docType": "png", 126 | "pageCountEstimate": 1 127 | }, 128 | "styles": [ 129 | { 130 | "fontName": "Verdana", 131 | "charScale": 1.0, 132 | "color": "000000", 133 | "underline": false, 134 | "styleId": 0, 135 | "deleteLine": false, 136 | "fontSize": 12, 137 | "bold": true, 138 | "italic": false 139 | }, 140 | { 141 | "fontName": "微软雅黑", 142 | "charScale": 1.0, 143 | "color": "000000", 144 | "underline": false, 145 | "styleId": 1, 146 | "deleteLine": false, 147 | "fontSize": 58, 148 | "bold": true, 149 | "italic": false 150 | }, 151 | { 152 | "fontName": "微软雅黑", 153 | "charScale": 1.0, 154 | "color": "000000", 155 | "underline": false, 156 | "styleId": 2, 157 | "deleteLine": false, 158 | "fontSize": 25, 159 | "bold": true, 160 | "italic": false 161 | }, 162 | { 163 | "fontName": "微软雅黑", 164 | "charScale": 1.0, 165 | "color": "000000", 166 | "underline": false, 167 | "styleId": 3, 168 | "deleteLine": false, 169 | "fontSize": 55, 170 | "bold": true, 171 | "italic": false 172 | }, 173 | { 174 | "fontName": "微软雅黑", 175 | "charScale": 1.0, 176 | "color": "000000", 177 | "underline": false, 178 | "styleId": 4, 179 | "deleteLine": false, 180 | "fontSize": 49, 181 | "bold": true, 182 | "italic": false 183 | }, 184 | { 185 | "fontName": "微软雅黑", 186 | "charScale": 1.0, 187 | "color": "000000", 188 | "underline": false, 189 | "styleId": 5, 190 | "deleteLine": false, 191 | "fontSize": 24, 192 | "bold": true, 193 | "italic": false 194 | }, 195 | { 196 | "fontName": "微软雅黑", 197 | "charScale": 1.0, 198 | "color": "000000", 199 | "underline": false, 200 | "styleId": 6, 201 | "deleteLine": false, 202 | "fontSize": 21, 203 | "bold": true, 204 | "italic": false 205 | }, 206 | { 207 | "fontName": "Arial_Narrow", 208 | "charScale": 1.0, 209 | "color": "000000", 210 | "underline": false, 211 | "styleId": 7, 212 | "deleteLine": false, 213 | "fontSize": 15, 214 | "bold": true, 
215 | "italic": false 216 | }, 217 | { 218 | "fontName": "Verdana", 219 | "charScale": 1.0, 220 | "color": "000000", 221 | "underline": false, 222 | "styleId": 8, 223 | "deleteLine": false, 224 | "fontSize": 11, 225 | "bold": true, 226 | "italic": false 227 | }, 228 | { 229 | "fontName": "Verdana", 230 | "charScale": 1.0, 231 | "color": "000000", 232 | "underline": false, 233 | "styleId": 9, 234 | "deleteLine": false, 235 | "fontSize": 13, 236 | "bold": true, 237 | "italic": false 238 | }, 239 | { 240 | "fontName": "微软雅黑", 241 | "charScale": 1.0, 242 | "color": "000000", 243 | "underline": false, 244 | "styleId": 10, 245 | "deleteLine": false, 246 | "fontSize": 13, 247 | "bold": true, 248 | "italic": false 249 | }, 250 | { 251 | "fontName": "Arial_Narrow", 252 | "charScale": 1.0, 253 | "color": "000000", 254 | "underline": false, 255 | "styleId": 11, 256 | "deleteLine": false, 257 | "fontSize": 17, 258 | "bold": true, 259 | "italic": false 260 | }, 261 | { 262 | "fontName": "微软雅黑", 263 | "charScale": 1.0, 264 | "color": "000000", 265 | "underline": false, 266 | "styleId": 12, 267 | "deleteLine": false, 268 | "fontSize": 17, 269 | "bold": true, 270 | "italic": false 271 | }, 272 | { 273 | "fontName": "Verdana", 274 | "charScale": 1.0, 275 | "color": "000000", 276 | "underline": false, 277 | "styleId": 13, 278 | "deleteLine": false, 279 | "fontSize": 15, 280 | "bold": true, 281 | "italic": false 282 | }, 283 | { 284 | "fontName": "Times_New_Roman", 285 | "charScale": 1.0, 286 | "color": "000000", 287 | "underline": false, 288 | "styleId": 14, 289 | "deleteLine": false, 290 | "fontSize": 13, 291 | "bold": true, 292 | "italic": false 293 | }, 294 | { 295 | "fontName": "Times_New_Roman", 296 | "charScale": 1.0, 297 | "color": "000000", 298 | "underline": false, 299 | "styleId": 15, 300 | "deleteLine": false, 301 | "fontSize": 10, 302 | "bold": true, 303 | "italic": false 304 | }, 305 | { 306 | "fontName": "Arial_Narrow", 307 | "charScale": 1.0, 308 | "color": "000000", 309 | "underline": false, 310 | "styleId": 16, 311 | "deleteLine": false, 312 | "fontSize": 13, 313 | "bold": true, 314 | "italic": false 315 | } 316 | ], 317 | "layouts": [ 318 | { 319 | "firstLinesChars": 0, 320 | "pos": [ 321 | { 322 | "x": 9, 323 | "y": 3 324 | }, 325 | { 326 | "x": 331, 327 | "y": 3 328 | }, 329 | { 330 | "x": 331, 331 | "y": 12 332 | }, 333 | { 334 | "x": 9, 335 | "y": 12 336 | } 337 | ], 338 | "blocks": [ 339 | { 340 | "pos": [ 341 | { 342 | "x": 7, 343 | "y": 1 344 | }, 345 | { 346 | "x": 331, 347 | "y": 1 348 | }, 349 | { 350 | "x": 331, 351 | "y": 13 352 | }, 353 | { 354 | "x": 7, 355 | "y": 13 356 | } 357 | ], 358 | "styleId": 0, 359 | "text": "Innovations in the Food System: Exploring the Future of Fcod: Proceedings of a Workshop" 360 | } 361 | ], 362 | "index": 0, 363 | "subType": "page_header", 364 | "lineHeight": 0, 365 | "text": "Innovations in the Food System: Exploring the Future of Fcod: Proceedings of a Workshop", 366 | "alignment": "center", 367 | "type": "head", 368 | "pageNum": [ 369 | 0 370 | ], 371 | "uniqueId": "16b4f10a0114aae6d5a47d8b78259659" 372 | }, 373 | { 374 | "firstLinesChars": 0, 375 | "pos": [ 376 | { 377 | "x": 119, 378 | "y": 64 379 | }, 380 | { 381 | "x": 314, 382 | "y": 64 383 | }, 384 | { 385 | "x": 314, 386 | "y": 190 387 | }, 388 | { 389 | "x": 119, 390 | "y": 190 391 | } 392 | ], 393 | "blocks": [ 394 | { 395 | "pos": [ 396 | { 397 | "x": 119, 398 | "y": 63 399 | }, 400 | { 401 | "x": 311, 402 | "y": 63 403 | }, 404 | { 405 | "x": 311, 406 | "y": 109 407 | }, 408 | { 
409 | "x": 119, 410 | "y": 109 411 | } 412 | ], 413 | "styleId": 1, 414 | "text": "INNOVATIONS" 415 | }, 416 | { 417 | "pos": [ 418 | { 419 | "x": 199, 420 | "y": 109 421 | }, 422 | { 423 | "x": 231, 424 | "y": 109 425 | }, 426 | { 427 | "x": 231, 428 | "y": 124 429 | }, 430 | { 431 | "x": 199, 432 | "y": 124 433 | } 434 | ], 435 | "styleId": 2, 436 | "text": " in the" 437 | }, 438 | { 439 | "pos": [ 440 | { 441 | "x": 114, 442 | "y": 128 443 | }, 444 | { 445 | "x": 195, 446 | "y": 128 447 | }, 448 | { 449 | "x": 195, 450 | "y": 170 451 | }, 452 | { 453 | "x": 114, 454 | "y": 170 455 | } 456 | ], 457 | "styleId": 3, 458 | "text": "FOOD" 459 | }, 460 | { 461 | "pos": [ 462 | { 463 | "x": 195, 464 | "y": 127 465 | }, 466 | { 467 | "x": 317, 468 | "y": 127 469 | }, 470 | { 471 | "x": 317, 472 | "y": 170 473 | }, 474 | { 475 | "x": 195, 476 | "y": 170 477 | } 478 | ], 479 | "styleId": 4, 480 | "text": "SYSTEM" 481 | }, 482 | { 483 | "pos": [ 484 | { 485 | "x": 117, 486 | "y": 175 487 | }, 488 | { 489 | "x": 314, 490 | "y": 175 491 | }, 492 | { 493 | "x": 314, 494 | "y": 195 495 | }, 496 | { 497 | "x": 117, 498 | "y": 195 499 | } 500 | ], 501 | "styleId": 5, 502 | "text": " EXPLORING THE FUTURE OF FOOD" 503 | } 504 | ], 505 | "index": 1, 506 | "subType": "none", 507 | "lineHeight": 3, 508 | "text": "INNOVATIONS in the FOODSYSTEM EXPLORING THE FUTURE OF FOOD", 509 | "alignment": "center", 510 | "type": "title", 511 | "pageNum": [ 512 | 0 513 | ], 514 | "uniqueId": "a73142f82a0bf25f1f9a033332428d81" 515 | }, 516 | { 517 | "firstLinesChars": 0, 518 | "pos": [ 519 | { 520 | "x": 112, 521 | "y": 220 522 | }, 523 | { 524 | "x": 320, 525 | "y": 220 526 | }, 527 | { 528 | "x": 320, 529 | "y": 234 530 | }, 531 | { 532 | "x": 112, 533 | "y": 234 534 | } 535 | ], 536 | "blocks": [ 537 | { 538 | "pos": [ 539 | { 540 | "x": 110, 541 | "y": 218 542 | }, 543 | { 544 | "x": 323, 545 | "y": 218 546 | }, 547 | { 548 | "x": 323, 549 | "y": 236 550 | }, 551 | { 552 | "x": 110, 553 | "y": 236 554 | } 555 | ], 556 | "styleId": 6, 557 | "text": "PROCEEDINGS OF A WORKSHOP" 558 | } 559 | ], 560 | "index": 2, 561 | "subType": "none", 562 | "lineHeight": 0, 563 | "text": "PROCEEDINGS OF A WORKSHOP", 564 | "alignment": "center", 565 | "type": "text", 566 | "pageNum": [ 567 | 0 568 | ], 569 | "uniqueId": "bc31341899f848a372c4e5312d15ffcf" 570 | }, 571 | { 572 | "firstLinesChars": 0, 573 | "pos": [ 574 | { 575 | "x": 133, 576 | "y": 288 577 | }, 578 | { 579 | "x": 298, 580 | "y": 288 581 | }, 582 | { 583 | "x": 298, 584 | "y": 322 585 | }, 586 | { 587 | "x": 133, 588 | "y": 322 589 | } 590 | ], 591 | "blocks": [ 592 | { 593 | "pos": [ 594 | { 595 | "x": 134, 596 | "y": 287 597 | }, 598 | { 599 | "x": 298, 600 | "y": 287 601 | }, 602 | { 603 | "x": 298, 604 | "y": 299 605 | }, 606 | { 607 | "x": 134, 608 | "y": 299 609 | } 610 | ], 611 | "styleId": 7, 612 | "text": "Melissa Maitin-Shepard, Rapporteur" 613 | }, 614 | { 615 | "pos": [ 616 | { 617 | "x": 187, 618 | "y": 312 619 | }, 620 | { 621 | "x": 243, 622 | "y": 312 623 | }, 624 | { 625 | "x": 243, 626 | "y": 322 627 | }, 628 | { 629 | "x": 187, 630 | "y": 322 631 | } 632 | ], 633 | "styleId": 8, 634 | "text": " Food Forum" 635 | } 636 | ], 637 | "index": 3, 638 | "subType": "none", 639 | "lineHeight": 15, 640 | "text": "Melissa Maitin-Shepard, Rapporteur Food Forum", 641 | "alignment": "center", 642 | "type": "text", 643 | "pageNum": [ 644 | 0 645 | ], 646 | "uniqueId": "309eed9853f2ff2d0d548760e8661858" 647 | }, 648 | { 649 | "firstLinesChars": 0, 650 | "pos": [ 651 | { 
652 | "x": 157, 653 | "y": 336 654 | }, 655 | { 656 | "x": 274, 657 | "y": 336 658 | }, 659 | { 660 | "x": 274, 661 | "y": 346 662 | }, 663 | { 664 | "x": 157, 665 | "y": 346 666 | } 667 | ], 668 | "blocks": [ 669 | { 670 | "pos": [ 671 | { 672 | "x": 157, 673 | "y": 336 674 | }, 675 | { 676 | "x": 274, 677 | "y": 336 678 | }, 679 | { 680 | "x": 274, 681 | "y": 346 682 | }, 683 | { 684 | "x": 157, 685 | "y": 346 686 | } 687 | ], 688 | "styleId": 9, 689 | "text": "Food and Nutrition Board" 690 | } 691 | ], 692 | "index": 4, 693 | "subType": "para", 694 | "lineHeight": 0, 695 | "text": "Food and Nutrition Board", 696 | "alignment": "center", 697 | "type": "text", 698 | "pageNum": [ 699 | 0 700 | ], 701 | "uniqueId": "647a777b4a1fa759eff5d8f8a76720a3" 702 | }, 703 | { 704 | "firstLinesChars": 0, 705 | "pos": [ 706 | { 707 | "x": 150, 708 | "y": 360 709 | }, 710 | { 711 | "x": 283, 712 | "y": 360 713 | }, 714 | { 715 | "x": 283, 716 | "y": 371 717 | }, 718 | { 719 | "x": 150, 720 | "y": 371 721 | } 722 | ], 723 | "blocks": [ 724 | { 725 | "pos": [ 726 | { 727 | "x": 148, 728 | "y": 360 729 | }, 730 | { 731 | "x": 282, 732 | "y": 360 733 | }, 734 | { 735 | "x": 282, 736 | "y": 370 737 | }, 738 | { 739 | "x": 148, 740 | "y": 370 741 | } 742 | ], 743 | "styleId": 10, 744 | "text": "Health and Medicine Division" 745 | } 746 | ], 747 | "index": 5, 748 | "subType": "para", 749 | "lineHeight": 0, 750 | "text": "Health and Medicine Division", 751 | "alignment": "center", 752 | "type": "text", 753 | "pageNum": [ 754 | 0 755 | ], 756 | "uniqueId": "dcce8b595a53d275c769634a69cfa2b0" 757 | }, 758 | { 759 | "firstLinesChars": 47, 760 | "pos": [ 761 | { 762 | "x": 106, 763 | "y": 395 764 | }, 765 | { 766 | "x": 325, 767 | "y": 395 768 | }, 769 | { 770 | "x": 325, 771 | "y": 423 772 | }, 773 | { 774 | "x": 106, 775 | "y": 423 776 | } 777 | ], 778 | "blocks": [ 779 | { 780 | "pos": [ 781 | { 782 | "x": 151, 783 | "y": 393 784 | }, 785 | { 786 | "x": 282, 787 | "y": 393 788 | }, 789 | { 790 | "x": 282, 791 | "y": 407 792 | }, 793 | { 794 | "x": 151, 795 | "y": 407 796 | } 797 | ], 798 | "styleId": 11, 799 | "text": "The National Academies of" 800 | }, 801 | { 802 | "pos": [ 803 | { 804 | "x": 104, 805 | "y": 410 806 | }, 807 | { 808 | "x": 327, 809 | "y": 410 810 | }, 811 | { 812 | "x": 327, 813 | "y": 424 814 | }, 815 | { 816 | "x": 104, 817 | "y": 424 818 | } 819 | ], 820 | "styleId": 12, 821 | "text": " SCIENCES· ENGINEERING ·MEDICINE" 822 | } 823 | ], 824 | "index": 6, 825 | "subType": "para", 826 | "lineHeight": 4, 827 | "text": "The National Academies of SCIENCES· ENGINEERING ·MEDICINE", 828 | "alignment": "center", 829 | "type": "text", 830 | "pageNum": [ 831 | 0 832 | ], 833 | "uniqueId": "16f675dd06680c65c3d88ace0635389e" 834 | }, 835 | { 836 | "firstLinesChars": 0, 837 | "pos": [ 838 | { 839 | "x": 143, 840 | "y": 539 841 | }, 842 | { 843 | "x": 289, 844 | "y": 539 845 | }, 846 | { 847 | "x": 289, 848 | "y": 578 849 | }, 850 | { 851 | "x": 143, 852 | "y": 578 853 | } 854 | ], 855 | "blocks": [ 856 | { 857 | "pos": [ 858 | { 859 | "x": 136, 860 | "y": 538 861 | }, 862 | { 863 | "x": 295, 864 | "y": 538 865 | }, 866 | { 867 | "x": 295, 868 | "y": 549 869 | }, 870 | { 871 | "x": 136, 872 | "y": 549 873 | } 874 | ], 875 | "styleId": 13, 876 | "text": "THE NATIONAL ACADEMIES PRESS" 877 | }, 878 | { 879 | "pos": [ 880 | { 881 | "x": 182, 882 | "y": 554 883 | }, 884 | { 885 | "x": 250, 886 | "y": 554 887 | }, 888 | { 889 | "x": 250, 890 | "y": 564 891 | }, 892 | { 893 | "x": 182, 894 | "y": 564 895 | } 896 
| ], 897 | "styleId": 14, 898 | "text": " Washington,DC" 899 | }, 900 | { 901 | "pos": [ 902 | { 903 | "x": 185, 904 | "y": 569 905 | }, 906 | { 907 | "x": 245, 908 | "y": 569 909 | }, 910 | { 911 | "x": 245, 912 | "y": 577 913 | }, 914 | { 915 | "x": 185, 916 | "y": 577 917 | } 918 | ], 919 | "styleId": 15, 920 | "text": " www.nap.edu" 921 | } 922 | ], 923 | "index": 7, 924 | "subType": "para", 925 | "lineHeight": 5, 926 | "text": "THE NATIONAL ACADEMIES PRESS Washington,DC www.nap.edu", 927 | "alignment": "center", 928 | "type": "text", 929 | "pageNum": [ 930 | 0 931 | ], 932 | "uniqueId": "4aafb975055e208a637931dc387e5040" 933 | }, 934 | { 935 | "firstLinesChars": 0, 936 | "pos": [ 937 | { 938 | "x": 81, 939 | "y": 630 940 | }, 941 | { 942 | "x": 350, 943 | "y": 630 944 | }, 945 | { 946 | "x": 350, 947 | "y": 640 948 | }, 949 | { 950 | "x": 81, 951 | "y": 640 952 | } 953 | ], 954 | "blocks": [ 955 | { 956 | "pos": [ 957 | { 958 | "x": 79, 959 | "y": 629 960 | }, 961 | { 962 | "x": 351, 963 | "y": 629 964 | }, 965 | { 966 | "x": 351, 967 | "y": 640 968 | }, 969 | { 970 | "x": 79, 971 | "y": 640 972 | } 973 | ], 974 | "styleId": 16, 975 | "text": "Copyright National Academy of Sciences. All rights reserved." 976 | } 977 | ], 978 | "index": 8, 979 | "subType": "none", 980 | "lineHeight": 0, 981 | "text": "Copyright National Academy of Sciences. All rights reserved.", 982 | "alignment": "center", 983 | "type": "text", 984 | "pageNum": [ 985 | 0 986 | ], 987 | "uniqueId": "3c8b07209763ad8f3feb01ff1d77555e" 988 | } 989 | ], 990 | "version": "1.0.0" 991 | } 992 | } 993 | -------------------------------------------------------------------------------- /examples/example_structured_text.txt: -------------------------------------------------------------------------------- 1 | (type: head, sub_type: page_header, box: (0.02, 0.0, 0.77, 0.02)) Innovations in the Food System: Exploring the Future of Fcod: Proceedings of a Workshop 2 | (type: title, sub_type: none, box: (0.28, 0.1, 0.73, 0.29)) INNOVATIONS in the FOODSYSTEM EXPLORING THE FUTURE OF FOOD 3 | (type: text, sub_type: none, box: (0.26, 0.34, 0.74, 0.36)) PROCEEDINGS OF A WORKSHOP 4 | (type: text, sub_type: none, box: (0.31, 0.44, 0.69, 0.5)) Melissa Maitin-Shepard, Rapporteur Food Forum 5 | (type: text, sub_type: para, box: (0.36, 0.52, 0.63, 0.53)) Food and Nutrition Board 6 | (type: text, sub_type: para, box: (0.35, 0.56, 0.66, 0.57)) Health and Medicine Division 7 | (type: text, sub_type: para, box: (0.25, 0.61, 0.75, 0.65)) The National Academies of SCIENCES· ENGINEERING ·MEDICINE 8 | (type: text, sub_type: para, box: (0.33, 0.83, 0.67, 0.89)) THE NATIONAL ACADEMIES PRESS Washington,DC www.nap.edu 9 | (type: text, sub_type: none, box: (0.19, 0.97, 0.81, 0.99)) Copyright National Academy of Sciences. All rights reserved. 10 | -------------------------------------------------------------------------------- /human_baselines/baseline_1.jsonl: -------------------------------------------------------------------------------- 1 | {"question_id": "free_gpt4o_4033497_24_51_15", "doc_no": "4033497", "total_pages": 51, "start_end_idx": [24, 51], "question_type": "extract", "question": "What should be installed in a high-traffic area to control passage and allow plant material to grow?", "answer": "A fence", "detailed_evidences": "The key recommendations for erosion and shore access as stated in the document are:\n1. Establish an Erosion and Control Plan.\n2. Re-grade, replant, and add obstacles to control access.\n3. 
Place fencing in high traffic areas to control access and allow plant material to grow.\n4. Develop a policy with the Town's legal counsel to assess liability surrounding shore access.\nThese recommendations are detailed on page 49 under the “Erosion/Safety/Shore Access” section ([Pure-text, Title], (0.03, 0.19, 0.25, 0.24), [49]).", "evidence_pages": [49], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding", "SP_Layout_Understanding"], "pred": "fencing", "score": 0.0} 2 | {"question_id": "free_gpt4o_4153452_61_90_9", "doc_no": "4153452", "total_pages": 128, "start_end_idx": [61, 90], "question_type": "extract", "question": "What is the name of the transfer agent and registrar for the Company’s Common Stock?", "answer": "BankBoston, N.A.", "detailed_evidences": "The name of the transfer agent and registrar responsible for the Company’s Common Stock is stated under related section of the document.\n\nOn page 66, under \"TRANSFER AGENT\" section, it mentions:\n\"The transfer agent and registrar for the Common Stock of the Company is BankBoston,N.A.\" ((0.06, 0.16, 0.20)).\n\nThus, the transfer agent and registrar for the Company’s Common Stock is BankBoston, N.A.\n\nEvidence is sourced from [Pure-text, Title], page [66].\n", "evidence_pages": [66], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding", "SP_Layout_Understanding"], "pred": "BankBoston, N.A.", "score": 1.0} 3 | {"question_id": "free_gpt4o_4027862_74_102_11", "doc_no": "4027862", "total_pages": 102, "start_end_idx": [74, 102], "question_type": "extract", "question": "What was the percentage of total import for cut flowers in 2020?", "answer": "95.3%", "detailed_evidences": "The percentage of total import for cut flowers in 2020 was 95.3% ((0.15, 0.28, 0.86, 0.58), page 85).", "evidence_pages": [85], "evidence_sources": ["Layout"], "answer_format": "Float", "task_tag": "Understanding", "subTask": ["SP_Layout_Understanding"], "pred": "95.3%", "score": 1.0} 4 | {"question_id": "topic2title_4082886_66_66_7", "doc_no": "4082886", "total_pages": 72, "start_end_idx": [66, 66], "question_type": "topic2title", "question": "Which sections provide insights into Energoprojekt's strategic vision for the future?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["THEFUTUREOFENERSOPROJEST", "STRATEGICVISION INTO THE FUTURE OF ENERGOPROJEKT"], "detailed_evidences": "", "evidence_pages": [66], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": "THE FUTURE OF ENERGOPROJEKT: STRATEGIC VISION INTO THE FUTURE OF ENERGOPROJEKT", "score": 0.3076923076923077} 5 | {"question_id": "topic2title_4026569_45_48_4", "doc_no": "4026569", "total_pages": 83, "start_end_idx": [45, 48], "question_type": "topic2title", "question": "From which sections can we learn about the methodologies and recommendations for site preparation and grading?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["SITE PREPARATION AND GRADING", "Compaction Standard and Methodology"], "detailed_evidences": "", "evidence_pages": [45, 48], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["SITE PREPARATION AND GRADING"], "score": 0.5} 6 | 
{"question_id": "topic2title_4081036_8_29_3", "doc_no": "4081036", "total_pages": 68, "start_end_idx": [8, 29], "question_type": "topic2title", "question": "Which sections would you refer to for understanding the company's accounting principles and recent changes in accounting standards?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["Significant Accounting Policies", "Recently Issued Accounting Pronouncements"], "detailed_evidences": "", "evidence_pages": [8, 29], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["Recently Issued Accounting Pronouncements", "Pending Accounting Pronouncements"], "score": 0.5} 7 | {"question_id": "topic2title_4031221_13_13_2", "doc_no": "4031221", "total_pages": 52, "start_end_idx": [13, 13], "question_type": "topic2title", "question": "Where can we find details about the financial obligations and debts of a campaign?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["SCHEDULE D: LIABILITIES", "SCHEDULE D: LIABILITIES"], "detailed_evidences": "", "evidence_pages": [13], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["Form CPF M 102: Campaign Finance Report"], "score": 0.0} 8 | {"question_id": "topic2title_4026569_1_19_1", "doc_no": "4026569", "total_pages": 83, "start_end_idx": [1, 19], "question_type": "topic2title", "question": "Which sections provide detailed information on the geological and soil characteristics of the Newport Banning Ranch development site?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["APPENDIX B GEOLOGY AND SOILS", "SITE GEOLOGY"], "detailed_evidences": "", "evidence_pages": [1, 19], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["SITE GEOLOGY"], "score": 0.5} 9 | {"question_id": "topic2title_4153185_3_4_1", "doc_no": "4153185", "total_pages": 53, "start_end_idx": [3, 4], "question_type": "topic2title", "question": "Which sections provide insights into the objectives and structure of Child Labour Monitoring and Remediation Systems (CLMRS) in the cocoa sector?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["What is a Child Labour Monitoring and Remediation System?"], "detailed_evidences": "", "evidence_pages": [3, 4], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["Objectives, scope and structure of this review"], "score": 0.0} 10 | {"question_id": "topic2title_4081036_22_22_9", "doc_no": "4081036", "total_pages": 68, "start_end_idx": [22, 22], "question_type": "topic2title", "question": "Which sections discuss the company's forward-looking statements and overall business overview?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["Fonswnd-Looking Statements", "Business Overview"], "detailed_evidences": "", "evidence_pages": [22], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["Forward-Looking Statements", "Business Overview"], "score": 0.9230769230769231} 11 | {"question_id": 
"topic2title_4153185_15_20_4", "doc_no": "4153185", "total_pages": 53, "start_end_idx": [15, 20], "question_type": "topic2title", "question": "From which sections can we learn about the data collection and analysis methods used in the review?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["Data sources and methodology overview"], "detailed_evidences": "", "evidence_pages": [15, 20], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["Data sources and methodology overview"], "score": 1.0} 12 | {"question_id": "free_gpt4o_4094684_68_97_21", "doc_no": "4094684", "total_pages": 113, "start_end_idx": [68, 97], "question_type": "extract", "question": "Changes made to the construction drawings for Tomillo involved adding specific details like elastic in the back of the pants, a flat lining, and ensuring multiple pockets were incorporated, yes or no?", "answer": "yes", "detailed_evidences": "Changes made to the construction drawings for Tomillo involved adding specific details like elastic in the back of the pants, a flat lining, and ensuring multiple pockets were incorporated. These modifications ensured functionality on stage, as described in figures on pages 78 and 79. ", "evidence_pages": [78, 79], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "yes", "score": 1.0} 13 | {"question_id": "free_gpt4o_4131533_52_68_14", "doc_no": "4131533", "total_pages": 73, "start_end_idx": [52, 68], "question_type": "extract", "question": "A NaN (Not a Number) value is used to Indicate when the system cannot calculate a strain or temperature value for a particular gage, yes or no?", "answer": "yes", "detailed_evidences": "In the ODiSI protocol messages, the JSON value 'null' is used to indicate a NaN (Not a Number) value. This condition occurs specifically when the ODiSI system cannot calculate a strain or temperature value for a particular gage. The JSON null value serves to denote these uncalculated or invalid data points in the Measurement Messages. This is mentioned across multiple pages, with detailed explanations on pages 63 ((0.15, 0.68, 0.88, 0.75), (0.14, 0.85, 0.88, 0.9)) and 65 ((0.14, 0.17, 0.88, 0.24), (0.15, 0.32, 0.88, 0.37)).", "evidence_pages": [63, 65], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": "yes", "score": 1.0} 14 | {"question_id": "free_gpt4o_4048779_16_45_5", "doc_no": "4048779", "total_pages": 52, "start_end_idx": [16, 45], "question_type": "extract", "question": "What is the prerequisite for 'English, Grade 12, University Preparation'?", "answer": "\"English, Grade 11, University Preparation (ENG3U)\"", "detailed_evidences": "The prerequisite for 'English, Grade 12, University Preparation (ENG4U)' is 'English, Grade 11, University Preparation (ENG3U)' ((0.12, 0.83, 0.5, 0.85), page 27). The prerequisite for 'Mathematics, Grade 12, University Preparation' courses are:\n1. 'Advanced Functions (MHF4U)': 'Grade 12, University Preparation'\n2. 'Calculus and Vectors (MCV4U)': 'Grade 12 Advanced Functions, University Preparation' (to be taken prior to or concurrently) ((0.12, 0.13, 0.86, 0.17), page 29). 
\nEvidence collected from the Title and Text parts of the given pages.", "evidence_pages": [27, 29], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": "English, Grade 11, University Preparation", "score": 1.0} 15 | {"question_id": "free_gpt4o_4174181_15_44_18", "doc_no": "4174181", "total_pages": 122, "start_end_idx": [15, 44], "question_type": "extract", "question": "List all technical reviewers who contributed to the book.", "answer": ["Chris McCoy", "Benjamin Taylor"], "detailed_evidences": "The technical reviewers who contributed to the book are:\nChris McCoy\nBenjamin Taylor\nThis information is found on pages 18 and 19 ((0.15, 0.66, 0.88)).", "evidence_pages": [18, 19], "evidence_sources": ["Text"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": ["Chris McCoy", "Benjamin Taylor"], "score": 1.0} 16 | {"question_id": "free_gemini15_pro_4025531_16_45_1", "doc_no": "4025531", "total_pages": 69, "start_end_idx": [16, 45], "question_type": "extract", "question": "What is the last update date of the explosion action plan BG3115?", "answer": "12 February 2016", "detailed_evidences": "According to the text on page 17, (0.15, 0.24, 0.45, 0.26), the explosion action plan BG3115 was last updated on 12 February 2016. The same page, (0.15, 0.27, 0.43, 0.28), indicates that the update was done by the person answering the question (referred to as \"you\"). ", "evidence_pages": [16, 17], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": "12 February 2016", "score": 1.0} 17 | {"question_id": "free_gpt4o_4120884_42_71_17-2", "doc_no": "4120884", "total_pages": 143, "start_end_idx": [42, 71], "question_type": "extract", "question": "What is the orientation of the scattering vector (Q) for hoop strain measurements?", "answer": "The scattering vector (Q) should be tangential to the cylinder bore and perpendicular to the specified planes.", "detailed_evidences": "For neutron diffraction measurements in hoop strain orientation (Picture, Pure-text):\n1. Align the engine block so that the scattering vector (Q) is tangential to the cylinder bore, perpendicular to the specified planes.\n2. Position the engine block on the spectrometer table, as shown in Figure 28(a) and (b) on page 68.\n3. Allow the monochromated incident beam to penetrate the bulk material where the sampling volume is formed at the intersection of the incident and diffracted beams.\n4. Translate the block in X, Y, and Z directions to acquire strain measurements from top to bottom of the cylinder bridge.\nThis method is detailed on pages 68 to 70.", "evidence_pages": [68, 69, 70], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "radial", "score": 0.0} 18 | {"question_id": "free_gpt4o_4041821_2_31_3", "doc_no": "4041821", "total_pages": 79, "start_end_idx": [2, 31], "question_type": "extract", "question": "How many residential dwellings are estimated to be part of the development?", "answer": 165, "detailed_evidences": "The estimated number of residential dwellings is 165. This information can be found in the section titled \"Population and Residential Densities\" on page 8 ((0.13, 0.36, 0.88, 0.65), page 8), evidence source is Figure. 
Additionally, it is also discussed under section 4.2.1 DWELLING TARGET on page 10 ((0.52, 0.84, 0.61, 0.85), page 10), evidence source is Pure-text. Therefore, the estimated residential dwellings are 165, and this information is provided in sections \"Population and Residential Densities\" and \"4.2.1 DWELLING TARGET\" on pages [8, 10].", "evidence_pages": [8, 10], "evidence_sources": ["Text", "Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Table_Understanding"], "pred": 140, "score": 0.0} 19 | {"question_id": "free_gpt4o_4065289_60_89_9", "doc_no": "4065289", "total_pages": 91, "start_end_idx": [60, 89], "question_type": "extract", "question": "Was there any significant excess for any cancer type associated with aviation electronics technicians in the Groves et al., (2002) study? Choose yes or no?", "answer": "no", "detailed_evidences": "There was no significant excess for any cancer type, including lymphoid malignancies, brain cancer, or testicular cancer, associated with aviation electronics technicians in the Groves et al (2002) study. This is indicated on pages 61 ((0.23, 0.13, 0.18)) and 62 ((0.23, 0.14, 0.18)). Evidence source is Pure-texts spread across these two pages.", "evidence_pages": [61, 62], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": "no", "score": 1.0} 20 | {"question_id": "free_gpt4o_4076912_11_40_4", "doc_no": "4076912", "total_pages": 103, "start_end_idx": [11, 40], "question_type": "extract", "question": "Determine the percentage of animals used in research for the purposes of regulatory use in 2018.", "answer": 17.9, "detailed_evidences": "In 2018, 17.9% of animals were used for regulatory purposes. This information can be found on pages 28 and 31. The percentage is mentioned in Figure 14 on page 28 with coordinates ((0.15, 0.39, 0.87, 0.81)), and further details are provided in the text on page 31 with coordinates ((0.09, 0.56, 0.91, 0.57)).", "evidence_pages": [28, 31], "evidence_sources": ["Text", "Figure"], "answer_format": "Float", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "17.9%", "score": 1.0} 21 | {"question_id": "free_gpt4o_4142297_76_95_13", "doc_no": "4142297", "total_pages": 95, "start_end_idx": [76, 95], "question_type": "extract", "question": "What exposure time(in ms) was determined optimal for ArcLight measurements?", "answer": "100", "detailed_evidences": "The optimal exposure time determined for ArcLight measurements was 100 ms, based on the need for clarity and avoidance of background interferences. Reducing the time to 20 ms achieved higher recording speed but required the cells to be very bright to avoid dimming impacts ((0.12, 0.62, 0.88, 0.87), pages [76, 77]).", "evidence_pages": [76, 77], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": 25, "score": 0.0} 22 | {"question_id": "free_gpt4o_4118675_34_63_10", "doc_no": "4118675", "total_pages": 73, "start_end_idx": [34, 63], "question_type": "extract", "question": "What is the land use type for soils PE011 and PE020?", "answer": ["cultivated pasture without irrigation", "high-level arable farming"], "detailed_evidences": "The land use type for soil PE011 is detailed on page 36, and it is cultivated pasture without irrigation ((0.06, 0.54, 0.95)). 
For soil PE020, the land use type is high-level arable farming, as mentioned on page 54 ((0.05, 0.06, 0.6)).", "evidence_pages": [36, 54], "evidence_sources": ["Text"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": ["cultivated pasture; no irrigation; Rotation: not relevent; Inprovenents: none", "high level arable farming; Crops: annual crops; continuously irrigated; Rotation: continuous crop rotation; Improvements: terracing"], "score": 0.0} 23 | {"question_id": "free_gpt4o_4118675_34_63_1", "doc_no": "4118675", "total_pages": 73, "start_end_idx": [34, 63], "question_type": "extract", "question": "What are the Print dates for the soils in PE010 and PE011?", "answer": "22/03/95", "detailed_evidences": "The Print date for the reference soil PE010 can be found on page 35, with coordinates ((0.71, 0.06, 0.94, 0.08)), and belongs to the \"Text\" type. The Print date for the reference soil PE011 is on page 36, located at coordinates ((0.71, 0.08, 0.95, 0.09)), it also belongs to the \"Text\" type. The Print dates for PE010 and PE011 are both 22/03/95.", "evidence_pages": [35, 36], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": "22/03/95", "score": 1.0} 24 | {"question_id": "free_gemini15_pro_4185929_30_59_1", "doc_no": "4185929", "total_pages": 75, "start_end_idx": [30, 59], "question_type": "extract", "question": "What percentage of carbon has been lost from the world's soils due to poor farming practices, according to the provided text?", "answer": 80, "detailed_evidences": "The text mentions that up to 80 percent of carbon from the world's soils has been lost due to poor farming and ranching practices, especially modern industrial agriculture. This information is found across multiple pages. Specifically, page 30 contains the statement \"industrial agriculture- have led to the loss of up to 80 percent of carbon from the world's soils,\" (0.11, 0.41, 0.87, 0.67). This is corroborated by similar statements on pages 34 (0.11, 0.41, 0.88, 0.61), 36 (0.11, 0.42, 0.61, 0.46)(0.11, 0.47, 0.87, 0.6), 48 (0.11, 0.41, 0.86, 0.67), and 56 (0.11, 0.41, 0.87, 0.67). ", "evidence_pages": [30, 34, 36, 48, 56], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": 80, "score": 1.0} 25 | {"question_id": "free_gpt4o_4069340_65_91_25", "doc_no": "4069340", "total_pages": 96, "start_end_idx": [65, 91], "question_type": "extract", "question": "The camera is considered not omnipotent because it cannot fully cover every possible shooting point, yes or no?", "answer": "no", "detailed_evidences": "The consequences of omniscience and omnipotence in cameras are that concepts no longer signify the world but rather the universe signifies the program within cameras. This bi-univocal relationship means that cameras are considered omnipotent in their photographic universe due to their comprehensive coverage of every possible photographic point. This shift leads to an absurd omniscience where cameras know everything and can do everything within their programmed universe, making human concepts redundant. 
This discussion is found on page 68 ((0.19, 0.16, 0.85, 0.79)) and further elaborated on page 69 ((0.13, 0.14, 0.78, 0.53)).", "evidence_pages": [68, 69], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": "no", "score": 1.0} 26 | {"question_id": "free_gpt4o_4048779_16_45_8", "doc_no": "4048779", "total_pages": 52, "start_end_idx": [16, 45], "question_type": "extract", "question": "How many courses related to health and physical education are there in 9th grade?", "answer": 1, "detailed_evidences": "The 'Health and Physical Education' section lists the following courses by grade:\n1. Grade 9: \n - Health and Physical Education (Healthy Active Living) (PPL1O)\n2. Grade 10:\n - Health and Physical Education (Healthy Active Living) (PPL2O)\n3. Grade 12:\n - Health and Physical Education (Healthy Active Living) (PPL4O)\n - Introductory Kinesiology (PSK4U)\n\nCounting these, there are 1 course listed for Grade 9, 1 for Grade 10, and 2 for Grade 12, making a total of 4 courses. Evidence present on pages 38 and 39.", "evidence_pages": [38, 39], "evidence_sources": ["Text", "Layout"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": 1, "score": 1.0} 27 | {"question_id": "free_gpt4o_4176503_99_112_8-2", "doc_no": "4176503", "total_pages": 112, "start_end_idx": [99, 112], "question_type": "extract", "question": "Jim Schultz approved of Lockport's security system, yes or no?", "answer": "no", "detailed_evidences": "Jim Shultz raised several arguments against Lockport's security system:\n1. The system posed significant privacy risks for students and teachers.\n2. The community was not properly consulted.\n3. Potential conflicts of interest between the district and the security consultant.\n4. Effectiveness of the system was questionable.\nHe also actively wrote columns in the local newspaper, organized petitions, and called for public hearings to challenge the lack of transparency and efficacy of the implemented security system. (pages 107, 108) Evidence: Text from [Pure-text, Corner-note] coordinates (0.11, 0.49, 0.89, 0.59), page 107 and (0.11, 0.08, 0.88, 0.4), page 108.", "evidence_pages": [107, 108], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": "no", "score": 1.0} 28 | {"question_id": "free_gpt4o_4049875_8_27_13", "doc_no": "4049875", "total_pages": 81, "start_end_idx": [8, 27], "question_type": "extract", "question": "What is the governing body that regulates the Responsible Entity of the Fund?", "answer": "Australian Securities and Investments Commission (ASIC)", "detailed_evidences": "The governing body that regulates the Responsible Entity of the Fund is the Australian Securities and Investments Commission (ASIC). 
This information is located on page 10, coordinates ((0.08, 0.09, 0.92, 0.7)), specifying \"The Fund is a managed investment scheme registered with ASIC under the Corporations Act.\" Another supporting reference is found on page 19 in the text coordinates ((0.08, 0.22, 0.33, 0.32)) stating the involvement of operators and networks regulated by ASIC.", "evidence_pages": [10, 19], "evidence_sources": ["Text", "Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Table_Understanding"], "pred": "The Trust Company (RE Services) Limited", "score": 0.0} 29 | {"question_id": "free_gpt4o_4120884_42_71_11-2", "doc_no": "4120884", "total_pages": 143, "start_end_idx": [42, 71], "question_type": "extract", "question": "Enumerate the experimental analysis steps on the engine block castings as noted in the project outline.", "answer": ["Analysis of cylinder profiles in service-tested blocks", "Microstructure analysis", "Thermal analysis", "Mechanical properties testing", "Residual stress/strain evolution"], "detailed_evidences": "The experimental analysis steps on the engine block castings noted in the project outline (Pure-text) include:\n1. Analysis of cylinder profiles in service-tested blocks.\n2. Microstructure analysis.\n3. Thermal analysis.\n4. Mechanical properties testing.\n5. Residual stress/strain evolution, as illustrated in the flow chart on page 56 (Picture).", "evidence_pages": [55, 56], "evidence_sources": ["Text", "Figure"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": ["Analysis of the cylinder profiles in service tested block", "Microstructure", "Thermal analysis", "Mechanical properties", "Residual stress/strain evolution"], "score": 0.8455140850627318} 30 | {"question_id": "free_gpt4o_4120884_42_71_6", "doc_no": "4120884", "total_pages": 143, "start_end_idx": [42, 71], "question_type": "extract", "question": "The specific volume change associated with aging precipitates in Al-Cu alloys is mentioned, yes or no?", "answer": "yes", "detailed_evidences": "The specific volume change associated with aging precipitates in Al-Cu alloys is discussed on pages 51 and 52. According to the text and corresponding figure, the unit linear dimensional change varies significantly with different precipitates. For instance, precipitation of the \\( \\theta' \\) phase results in the largest specific volume change, being approximately three times larger than the equilibrium \\( \\theta \\) phase ([Pure-text] (0.11, 0.49, 0.88, 0.8), [Figure] (0.31, 0.1, 0.67, 0.37), [51, 52]). 
Hence, the specific volume change with aging precipitates can achieve the most considerable expansion with the \\( \\theta' \\) phase.", "evidence_pages": [51, 52], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "yes", "score": 1.0} 31 | {"question_id": "free_gpt4o_4115594_31_60_5", "doc_no": "4115594", "total_pages": 63, "start_end_idx": [31, 60], "question_type": "extract", "question": "In the migrant survey, how many people took German courses?", "answer": 8, "detailed_evidences": "From the migrant survey, 8 participants attended German language courses, provided by public education/training centers, NGOs, and e-learning platforms (page 35, (0.12, 0.38, 0.68), evidence source: Picture).", "evidence_pages": [35, 36], "evidence_sources": ["Figure"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["MP_Figure_Understanding"], "pred": 8, "score": 1.0} 32 | {"question_id": "free_gpt4o_4013909_54_76_2", "doc_no": "4013909", "total_pages": 76, "start_end_idx": [54, 76], "question_type": "extract", "question": "Where were the experiments conducted with CyberSpaceship One?", "answer": "GNC laboratory", "detailed_evidences": "The experiments conducted using CyberShip I were performed in the GNC laboratory at the Department of Engineering Cybernetics, NTNU. This information can be subdivided across multiple pages. On page 65, it states, \"The first experiments were performed in the GNC laboratory at the Department of Engineering Cybernetics, NTNU using CyberShip I which is an offshore supply vessel scale 1:70\"((0.12, 0.33, 0.4, 0.51), page 65), this leads to pictures and further explanations of the experiments across page 65 and page 66.", "evidence_pages": [65, 66], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "the GNC laboratory", "score": 0.7777777777777778} 33 | {"question_id": "free_gpt4o_4013909_54_76_4", "doc_no": "4013909", "total_pages": 76, "start_end_idx": [54, 76], "question_type": "extract", "question": "Does the combination of compass and angular velocity sensor improve heading retention? Choose yes or no.", "answer": "yes", "detailed_evidences": "The results of comparing 'y' values and estimates using the compass and angular rate sensor show improvements in course-keeping capabilities. On page 74, it mentions, \"It is seen that the WF motion components are quite well removed from the estimate of resulting in good course-keeping capabilities\"((0.13, 0.26, 0.45, 0.36), page 74). The figures on page 70 show the results graphically, comparing 'y=v+y_w' and estimated 'y_r'((0.26, 0.18, 0.76, 0.9), page 70). 
Thus, both textual and graphical comparisons across these pages indicate better course-keeping when combining the observations from multiple sensors.", "evidence_pages": [70, 74], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "yes", "score": 1.0} 34 | {"question_id": "free_gpt4o_4142297_65_94_12", "doc_no": "4142297", "total_pages": 95, "start_end_idx": [65, 94], "question_type": "extract", "question": "What was the main reason for no correlation between electrical and optical signals in the experiment discussed in Figure 3.11?", "answer": "High optical noise and bleaching effects", "detailed_evidences": "The primary reason for the lack of correlation between electrical and optical signals in the experiment (Figure 3.11) is the high optical noise and bleaching effects. The noise arises due to fluctuations, especially when photon count is low, dropping SNR. Despite the membrane depolarization and action potentials, significant reductions in fluorescence were not recorded. The bleaching rate was noted at 9% per 70 s with 100% intensity and 100 ms exposure, contributing significantly to the variance in results (pages 71, 72, 73).", "evidence_pages": [71, 72, 73], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "the trade-off between the recording settings for Andor Zyla sCMOS camera and the sensitivity of the measurements we would have liked to achieve", "score": 0.0} 35 | {"question_id": "free_gpt4o_4014524_22_51_8", "doc_no": "4014524", "total_pages": 55, "start_end_idx": [22, 51], "question_type": "extract", "question": "What percentage of Ward 3 residents had no physical activity in the last 30 days?", "answer": 38, "detailed_evidences": "Thirty-eight percent (38%) of Ward 7 residents had no physical activity in the last 30 days as mentioned in the document related to the Parks, Recreation, and Open Space Element. However, the document provided does not specify the percentage for Ward 3. You may need to review the other images in the set of 30 provided to find specific information about Ward 3.", "evidence_pages": [43], "evidence_sources": ["Text", "Layout"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding", "SP_Layout_Understanding"], "pred": "6%", "score": 0.0} 36 | {"question_id": "free_gemini15_pro_4139004_25_54_8", "doc_no": "4139004", "total_pages": 57, "start_end_idx": [25, 54], "question_type": "extract", "question": "What is the radius of the circle (Ic) used for error calculations in Example 1 of the Neumann case (Figure 24)?\n(A). 5A/3\n(B). 10A/3\n(C). 5λ/3\n(D). 
10λ/3\nChoose the letter name in front of the right option from A, B, C, D.", "answer": "C", "detailed_evidences": "The text on page 41, with coordinates (0.21, 0.63, 0.79, 0.75), describes Example 1 in the Neumann case (Figure 24, (0.25, 0.28, 0.78, 0.55), page 41), stating that the errors are calculated on the circle Ic with radius 5λ/3.", "evidence_pages": [41], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding", "SP_Figure_Understanding"], "pred": "C", "score": 1.0} 37 | {"question_id": "free_gpt4o_4183202_75_104_5", "doc_no": "4183202", "total_pages": 120, "start_end_idx": [75, 104], "question_type": "extract", "question": "Orkhon II is steward of Xa Deshret, yes or no?", "answer": "yes", "detailed_evidences": "The steward of Xa Deshret is Orkhon II. This title is mentioned in the context of description about leadership: 'Orkhon Il is steward of Xa Deshret and has ruled it well for nearly three decades' ((0.14, 0.51, 0.52, 0.69), page 80). The importance of their role lies in 'protecting the remains of the Sphinx Empire from the savagery of the Vermilion Tribes and the ever-growing menace of Gorhgijesk and Rhaz al-Khali' ((0.14, 0.51, 0.52, 0.69), page 80). This role maintains governance and defends against external threats.", "evidence_pages": [80], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "no", "score": 0.0} 38 | {"question_id": "free_gpt4o_4091930_40_58_11", "doc_no": "4091930", "total_pages": 58, "start_end_idx": [40, 58], "question_type": "extract", "question": "What is the URL for InfraMap.net?", "answer": "\"800.863.3748\"", "detailed_evidences": "The URL for InfraMap.net provided in the document is \"800.863.3748\" (page 47, Text, (0.22, 0.9, 0.79, 0.93)). They provide professional subsurface utility engineering and utility infrastructure mapping services (page 47, Text, (0.13, 0.55, 0.87, 0.62)).", "evidence_pages": [47], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "Inframap.net", "score": 0.0} 39 | {"question_id": "free_gpt4o_4110886_2_26_8", "doc_no": "4110886", "total_pages": 76, "start_end_idx": [2, 26], "question_type": "extract", "question": "Who are the main supporters and sponsors for the research study according to the acknowledgements section?", "answer": ["Punjabi Community Health Services (PCHS)", "The Government of Ontario", "Mr. Baldev Mutta", "Ms. Manvir Bhangu", "Dr. Puneet Dhillon"], "detailed_evidences": "The main supporters and sponsors for the research study are mentioned in the acknowledgements section on page 5. The document states: \"This work was sponsored by Punjabi Community Health Services (PCHS) in cooperation with The Government of Ontario\" ((0.12, 0.16, 0.88, 0.24), page 5). Additionally, it mentions specific individuals at PCHS who supported the study: Mr. Baldev Mutta, Chief Executive Officer (CEO), Ms. Manvir Bhangu, Manager of Operations, Quality and Systems, and Dr. Puneet Dhillon, Research, Communication and Research Analyst ((0.15, 0.33, 0.63, 0.35), (0.15, 0.36, 0.8, 0.38), (0.15, 0.4, 0.82, 0.41), page 5).", "evidence_pages": [5], "evidence_sources": ["Text"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": ["Punjabi Community Health Services (PCHS)", "The Goverment of Ontario", "Mr. Baldev Mutta", "Ms. Manvir Bhangu", "Dr. 
Puneet Dhillon"], "score": 0.992} 40 | {"question_id": "free_gpt4o_4034235_26_55_8", "doc_no": "4034235", "total_pages": 90, "start_end_idx": [26, 55], "question_type": "extract", "question": "What is the function of the overall correction value of the upper thread tension?", "answer": "\"Change the tension values for all programs at once\"", "detailed_evidences": "The function of the overall correction value is to change the tension values for all programs at once. When set, the upper thread tension is adjusted based on the overall correction value during sewing, regardless of the individual program's settings. As described on page 52, the correction value is displayed in the TENSION display and applied the next time sewing is carried out ((0.54, 0.51, 0.92, 0.76)).", "evidence_pages": [52], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "to change the tension values for all programs at once", "score": 0.9433962264150944} 41 | {"question_id": "free_gemini15_pro_4193248_57_86_6", "doc_no": "4193248", "total_pages": 142, "start_end_idx": [57, 86], "question_type": "extract", "question": "What three successful examples of an integrated service approach for engaging culturally and linguistically diverse communities are noted in section 3.2.5?", "answer": ["The Family Relationship Centre Broadmeadows", "The Dandenong Project at the Federal Magistrates Court", "The Neighbourhood Justice Centre in Collingwood, Victoria"], "detailed_evidences": "Section 3.2.5 on page 67 mentions three successful examples of integrated service approaches: 1) The Family Relationship Centre Broadmeadows; 2) The Dandenong Project at the Federal Magistrates Court; and 3) The Neighbourhood Justice Centre in Collingwood, Victoria. These are listed in the paragraph starting at (0.14, 0.51, 0.85, 0.81) on page 67. The evidence source is Pure-text. ", "evidence_pages": [67], "evidence_sources": ["Text"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": ["the Family Relationship Centre Broadmeadows", "The Federal Magistrates Court component of the 'Dandenong Project'", "the Neighbourhood Justice Centre in Collingwood, Victoria"], "score": 0.6666666666666666} 42 | {"question_id": "free_gpt4o_4169843_69_85_7", "doc_no": "4169843", "total_pages": 85, "start_end_idx": [69, 85], "question_type": "extract", "question": "Give the cumulative actual total and expected total values for the occupancy of structures in floodplains.\n(A). 0/20\n(B). 20/0\n(C). 0/200\n(D). 0/2\nChoose the letter name in front of the right option from A, B, C, D.", "answer": "B", "detailed_evidences": "The cumulative actual and expected total values for occupancy of structures in floodplains is 20 and 0. This is detailed in page 69 related to the actual and expected occupancy figures (Page 69: \"0/0\", (0.31, 0.11, 0.51, 0.14), page 69). This evidence can be traced in the document's section for \"Accomplishments Performance Measures\". 
Hence, occupancy in floodplains shows 20 actual against 0 expected structures.", "evidence_pages": [69], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "B", "score": 1.0} 43 | {"question_id": "free_gpt4o_4113560_74_103_8", "doc_no": "4113560", "total_pages": 146, "start_end_idx": [74, 103], "question_type": "extract", "question": "Summarize the steps to be taken if the bike seat gives way after load.", "answer": "\"Bring the seat into the lowest position and then into the highest position to allow for compensation of the oil-gas mixture in the spring chambers, ensuring the seat locks again. Repeat once.\"", "detailed_evidences": "The document suggests steps for handling situations where the bike seat gives way after load on page 103. It advises bringing the seat into the lowest position and then into the highest position to allow for compensation of the oil-gas mixture in the spring chambers, ensuring the seat locks again. Verbatim advice includes: \"Bring the seat into the lowest position and then into the highest position and repeat once.\" ((0.19, 0.13, 0.86, 0.19), evidence source is Text). No additional steps are mentioned, affirming this action as sufficient to resolve the issue.", "evidence_pages": [103], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "Bring the seat into the lowest position and then into the highest position and repeat this once", "score": 0.0} 44 | {"question_id": "free_gemini15_pro_4000400_14_43_1", "doc_no": "4000400", "total_pages": 54, "start_end_idx": [14, 43], "question_type": "extract", "question": "What were the compensatory damages sought in the Butler v. Celotex Corp. case?", "answer": "$150000", "detailed_evidences": "The _Butler v. Celotex Corp._ case sought $150,000 in compensatory damages. This information is found in a table on page 30. (0.17, 0.28, 0.79, 0.41) ", "evidence_pages": [30], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": "$150000", "score": 1.0} 45 | {"question_id": "free_gemini15_pro_4088207_13_42_1", "doc_no": "4088207", "total_pages": 76, "start_end_idx": [13, 42], "question_type": "extract", "question": "What is the total number of shareholders as of 31.12.2020?", "answer": 485, "detailed_evidences": "The total number of shareholders as of 31.12.2020 is 485, according to the table on page 13 (0.15, 0.12, 0.9, 0.26). The number of private individual shareholders is 429, as shown in the table on page 13 (0.14, 0.29, 0.9, 0.38). ", "evidence_pages": [13], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": 485, "score": 1.0} 46 | {"question_id": "free_gpt4o_4108584_37_66_17", "doc_no": "4108584", "total_pages": 118, "start_end_idx": [37, 66], "question_type": "extract", "question": "Identify the most common growth intervals for Atlantic salmon in Norway.", "answer": ["0.1-0.2 kg", "0.2-1 kg", "1-2 kg", "2-3 kg", "3-4 kg", "4-5 kg"], "detailed_evidences": "Growth intervals for Atlantic salmon in Norway are categorized as follows: 0.1-0.2 kg, 0.2-1 kg, 1-2 kg, 2-3 kg, 3-4 kg, and 4-5 kg. 
The corresponding feed consumption is 0.08 kg, 0.75 kg, 1.00 kg, 1.05 kg, 1.10 kg, and 1.20 kg respectively [Table] ((0.12, 0.34, 0.86, 0.4), page 66).", "evidence_pages": [66], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": "4-5 kg", "score": 0.16666666666666666} 47 | {"question_id": "free_gpt4o_4093319_101_112_4", "doc_no": "4093319", "total_pages": 112, "start_end_idx": [101, 112], "question_type": "extract", "question": "What is the value-added output index for 'Pulp and Paper' in 1962?", "answer": 317, "detailed_evidences": "To find the value-added output index for 'Pulp and Paper' in 1962, refer to Table I on page 103. The index for 'Pulp and Paper' in 1962 is 317. Evidence can be found in Table I on page 103, coordinates (0.07, 0.25, 0.9, 0.84).", "evidence_pages": [103], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": 317, "score": 1.0} 48 | {"question_id": "free_gemini15_pro_4063909_28_57_8", "doc_no": "4063909", "total_pages": 68, "start_end_idx": [28, 57], "question_type": "extract", "question": "What was the total equity of WSE as at 31 December 2012(in PLN thousand)?", "answer": "384881", "detailed_evidences": "WSE's total equity was PLN 384,881 thousand as at 31 December 2012 and PLN 382,670 thousand as at 31 December 2011. This information is located in a table on page 48 (0.12, 0.48, 0.87, 0.56).", "evidence_pages": [48], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": 384881, "score": 1.0} 49 | {"question_id": "free_gemini15_pro_4063909_28_57_9", "doc_no": "4063909", "total_pages": 68, "start_end_idx": [28, 57], "question_type": "extract", "question": "What is the higher value(in PLN thousand) between 'Deferred tax assets to be recovered within 12 months' and 'Deferred tax assets to be recovered after more than 12 months' as at 31 December 2011?", "answer": "2816", "detailed_evidences": "As at 31 December 2011, deferred tax assets to be recovered within 12 months were (PLN 2,519) thousand, and those to be recovered after more than 12 months were (PLN 2,816) thousand. This data is from a table on page 39 (0.13, 0.33, 0.88, 0.42).", "evidence_pages": [39], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": 2816, "score": 1.0} 50 | {"question_id": "free_gemini15_pro_4188879_36_65_2", "doc_no": "4188879", "total_pages": 84, "start_end_idx": [36, 65], "question_type": "extract", "question": "Which districts in Mizoram had a high level of WRC production in both 2009-10 and 2014-15?", "answer": ["Champhai", "Kolasib"], "detailed_evidences": "According to Table 3 ((0.07, 0.36, 0.48, 0.69), page 36), Champhai and Kolasib districts had high WRC production in 2009-10. In 2014-15, Table 3 again shows that Kolasib and Champhai districts had high production. 
Therefore, Champhai and Kolasib are the districts with consistently high production levels across both periods.", "evidence_pages": [36], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": ["Kolasib", "Champhai"], "score": 1.0} 51 | {"question_id": "free_gpt4o_4131533_6_19_4", "doc_no": "4131533", "total_pages": 73, "start_end_idx": [6, 19], "question_type": "extract", "question": "What are the maximum sensor lengths supported by the remote modules?", "answer": "50 m", "detailed_evidences": "The \"Remote Module (Standard Length)\" supports sensor lengths from 1 m to 20 m ((0.13, 0.15, 0.86, 0.91), page 6). The \"Remote Module (Extended Length)\" supports sensor lengths from 1 m to 50 m ((0.13, 0.15, 0.86, 0.91), page 6).", "evidence_pages": [6], "evidence_sources": ["Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": "50 m", "score": 1.0} 52 | {"question_id": "free_gemini15_pro_4188879_36_65_5", "doc_no": "4188879", "total_pages": 84, "start_end_idx": [36, 65], "question_type": "extract", "question": "What is the total annual requirement of rice for Mizoram in 2014-15?", "answer": 2199263, "detailed_evidences": "Total Annual Requirement of Rice for Mizoram in 2014-15:\nAccording to Table 8 from image 7, the total annual requirement of rice for Mizoram in 2014-15 was 21,99,263 quintals.", "evidence_pages": [42], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": 2199263, "score": 1.0} 53 | {"question_id": "free_gpt4o_4086173_6_35_11", "doc_no": "4086173", "total_pages": 117, "start_end_idx": [6, 35], "question_type": "extract", "question": "What methodologies have been used in the final chapter project on job satisfaction and occupational demographics?", "answer": "linear regression analysis", "detailed_evidences": "The methodology used in the final chapter project on job satisfaction and occupational demographics is linear regression analysis. Specifically, this methodology is used to explore the demographics of the workforce and job satisfaction without determining causality. This summary can be found in Table 2 of Chapter 1 (page 13: 0.12, 0.24, 0.56, 0.26).\n", "evidence_pages": [13], "evidence_sources": ["Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": "Linear regression analysis", "score": 1.0} 54 | {"question_id": "free_gpt4o_4091919_79_89_3", "doc_no": "4091919", "total_pages": 89, "start_end_idx": [79, 89], "question_type": "extract", "question": "Which drone allows for dual control mode?", "answer": "Gleagle X3 Quadcopter", "detailed_evidences": "The 'Gleagle X3 Quadcopter' allows for dual control mode, and this feature is mentioned on page 80 in the paragraph starting at ((0.04, 0.58, 0.63, 0.69)).", "evidence_pages": [80], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "Gleagle X3 Quadcopter", "score": 1.0} 55 | {"question_id": "extract_fig2tab_4066338_71_100_8", "doc_no": "4066338", "total_pages": 130, "start_end_idx": [71, 100], "question_type": "extract_fig2tab", "question": "What's name of the figure at the page which contains a table whose name is \"Table 5.1: Gender performance using LBP texture and full face. 
Percentage values represent the class-specific accuracies and the average class-specific accuracy for each experiment. VeriLook pre-dicted no gender on 2.75% and 0.89% of male and female images respectively in the FRGC dataset. In the MORPH dataset it predicted no gender on 0.5% and 11.26% of male and female images.\"?", "answer": "Figure 5.6: Easy and hard subjects in texture gender classification on MORPH images. Subjects on the left are a subset of those correctly classified in all texture experiments. Top to bottom: Black. White. and Hispanic. Subjects on the right those who were misclassified in over 95% of experiments. The subject on the bottom was labeled incorrectly as Female.", "detailed_evidences": "", "evidence_pages": [90], "evidence_sources": ["Table", "Figure"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": "Figure 5.6: Easy and hard subjects in texture gender classification on MORPH images. Subjects on the left are a subset of those correctly classified in all texture experiments. Top to bottom: Black, White, and Hispanic. Subjects on the right those who were misclassified in over 95% of experiments. The subject on the bottom was labeled incorrectly as Female.", "score": 0.9944289693593314} 56 | {"question_id": "extract_fig2tab_4107960_34_63_5", "doc_no": "4107960", "total_pages": 86, "start_end_idx": [34, 63], "question_type": "extract_fig2tab", "question": "List names of the other tables at the page which contains a table whose name is \"Table 6-2. Ohio Lumex SRM Comparison\".", "answer": ["Table 6-3. ALSI SRM Comparison"], "detailed_evidences": "", "evidence_pages": [54], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": ["Table 6-3. 
ALSI SRM Comparison"], "score": 1.0} 57 | {"question_id": "extract_fig2tab_4057441_1_30_23", "doc_no": "4057441", "total_pages": 81, "start_end_idx": [1, 30], "question_type": "extract_fig2tab", "question": "What's name of the table at the page which contains a figure whose name is \"Figure 1 Connection Diagram for SOP16 package\".", "answer": ["16-LEAD SOP"], "detailed_evidences": "", "evidence_pages": [5], "evidence_sources": ["Figure"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": "Table 1 Pin Description for SOP16 package", "score": 0.0} 58 | {"question_id": "extract_fig2tab_4057441_40_69_5", "doc_no": "4057441", "total_pages": 81, "start_end_idx": [40, 69], "question_type": "extract_fig2tab", "question": "What's name of the table at the page which contains a figure whose name is \"Figure 97 Power-on Timing\"?", "answer": "Table 17 Power-Up Timing and Write Inhibit Threshold", "detailed_evidences": "", "evidence_pages": [61], "evidence_sources": ["Table", "Figure"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": "Table 17 Power-Up Timing and Write Inhibit Threshold", "score": 1.0} 59 | {"question_id": "extract_fig2tab_4057441_1_30_7", "doc_no": "4057441", "total_pages": 81, "start_end_idx": [1, 30], "question_type": "extract_fig2tab", "question": "List names of the tables at the page which contains a figure whose name is \"Figure 3 Connection Diagram for TFBGA24 5x5 ball array package\"?", "answer": ["Table 2 Pin Description for WSON8 package"], "detailed_evidences": "", "evidence_pages": [6], "evidence_sources": ["Table", "Figure"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": ["Table 2 Pin Description for WSON8 package"], "score": 1.0} 60 | {"question_id": "extract_fig2tab_4140425_12_41_1", "doc_no": "4140425", "total_pages": 52, "start_end_idx": [12, 41], "question_type": "extract_fig2tab", "question": "What's name of the table at the page which contains a figure whose name is \"Fig.1 Microstructure and EDS spectrum of the sample A\"?", "answer": "Table 2. Characteristics of the samples", "detailed_evidences": "", "evidence_pages": [16], "evidence_sources": ["Table", "Figure"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": "Table 2. Characteristics of the samples", "score": 1.0} 61 | {"question_id": "extract_fig2tab_4043512_42_71_4", "doc_no": "4043512", "total_pages": 101, "start_end_idx": [42, 71], "question_type": "extract_fig2tab", "question": "List names of the other tables at the page which contains a table whose name is \"Table 32. National Total Commute Trip CO2Emissions,40-ft Compressed Natural Gas\".", "answer": ["Table 31. 
Comparison of On-Road Transportation CO2Emissions and Potential Emission Reductions Between California and Maine"], "detailed_evidences": "", "evidence_pages": [61], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": ["Table 31.Comparison of On-Road Transportation CO2 Emissions and Potential Emission Reductions Between California and Maine"], "score": 0.9836065573770492} 62 | {"question_id": "extract_fig2tab_4100212_33_62_11", "doc_no": "4100212", "total_pages": 111, "start_end_idx": [33, 62], "question_type": "extract_fig2tab", "question": "List names of the other tables at the page which contains a table whose name is \"TABLE 114: Large, Variable-sized Delimiters\".", "answer": ["TABLE112: nath Delimiters", "TABLE 113: Variable-sized Delimiters"], "detailed_evidences": "", "evidence_pages": [38], "evidence_sources": ["Figure"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": ["TABLE 112: nath Delimiters", "TABLE 113: Variable-sized Delimiters"], "score": 0.9807692307692308} 63 | {"question_id": "free_gpt4o_4098399_14_43_12", "doc_no": "4098399", "total_pages": 54, "start_end_idx": [14, 43], "question_type": "count", "question": "How many workshops does South Jordan City offer annually for water education?", "answer": 2, "detailed_evidences": "South Jordan City offers two water education workshops annually, one in the spring and another in the fall. This information is detailed on page 17 ((0.2, 0.17, 0.87, 0.49), evidence source: Pure-text) and page 18 ((0.2, 0.09, 0.45, 0.33), evidence source: Pure-text). Therefore, the number of workshops offered annually is 2.", "evidence_pages": [17, 18], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning"], "pred": 2, "score": 1.0} 64 | {"question_id": "extract_fig2tab_4151603_24_53_4", "doc_no": "4151603", "total_pages": 84, "start_end_idx": [24, 53], "question_type": "extract_fig2tab", "question": "List names of the other tables at the page which contains a table whose name is \"Table 4.14. Summary of recommended stability method and limit value for compost applications\".", "answer": ["Table 4.13. Proposed relationship between self-heating class and best use ot compost"], "detailed_evidences": "", "evidence_pages": [41], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": ["Table 4.13. Proposed relationship between self-heating class and best use of compost"], "score": 0.9880952380952381} 65 | {"question_id": "free_gpt4o_4053330_32_61_1-2", "doc_no": "4053330", "total_pages": 65, "start_end_idx": [32, 61], "question_type": "count", "question": "How many characters are allowed for the \"Rendering Provider ID#\" in item 24J?", "answer": 11, "detailed_evidences": "The \"Rendering Provider ID#\" field allows for the entry of 11 characters in the shaded area and the entry of a 10-digit NPI number in the unshaded area. These specifics are detailed in the instructions on page 44 ((0.11, 0.17, 0.87, 0.14)) and on page 35 ((0.14, 0.12, 0.3, 0.13)). 
Therefore, this field allows for a maximum of 21 characters.", "evidence_pages": [35, 44], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning"], "pred": 11, "score": 1.0} 66 | {"question_id": "free_gpt4o_4035900_50_65_11-2", "doc_no": "4035900", "total_pages": 65, "start_end_idx": [50, 65], "question_type": "count", "question": "How many steps are there in the instructions about how to serve court papers by registered process server?", "answer": "4", "detailed_evidences": "The instances where depositing payment for services is mentioned include:\n1. Page 52: \"The amount you need to pay for this service\" for Process Server ((0.26, 0.69, 0.88, 0.73))\n2. Page 54: \"$200.00 deposit fee\" for Sheriff's service ((0.35, 0.56, 0.84, 0.59))\n(Evidence: Pages 52, 54)\nTherefore, the instances of depositing payment for services are [\"Process Server payment amount\", \"$200 deposit fee for Sheriff\"].", "evidence_pages": [52, 54], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning"], "pred": 4, "score": 1.0} 67 | {"question_id": "free_gpt4o_4063393_80_109_4", "doc_no": "4063393", "total_pages": 118, "start_end_idx": [80, 109], "question_type": "compare", "question": "Which has the better effectiveness between AREDS 1 and AREDS 2 supplements?\n(A). AREDS 1\n(B). AREDS 2\nChoose the letter name in front of the right option from A, B.", "answer": "AREDS 2", "detailed_evidences": "The effectiveness of patients using AREDS 1 and AREDS 2 supplements is detailed on pages 97 and 98. AREDS 1 (on page 97, (0.09, 0.43, 0.19, 0.46)) contains Beta-carotene and no lutein or zeaxanthin and is no longer recommended because it investigated early AMD with no significant benefit. AREDS 2 (on page 97, (0.13, 0.58, 0.69, 0.6)) is recommended for intermediate and advanced AMD and includes carotenoids such as lutein and zeaxanthin with improved benefits. Thus, AREDS 2 is more effective than AREDS 1 for preventing AMD progression.", "evidence_pages": [97, 98], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning"], "pred": "B", "score": 0.0} 68 | {"question_id": "free_gpt4o_4005038_38_65_14", "doc_no": "4005038", "total_pages": 91, "start_end_idx": [38, 65], "question_type": "count", "question": "How many provinces are covered by the study area according to 'Appendix C'?", "answer": 4, "detailed_evidences": "'Appendix C' on page 57 mentions that benthic taxa counts cover all four provinces in the project area ((0.12, 0.13, 0.88, 0.27), page 57). Despite the geographical limits, especially in some regions like Gaspe Bay peninsula, these provinces still provide substantial data through the CABIN dataset. 
The evidence source spans multiple pages and figures ((0.22, 0.18, 0.79, 0.5), (0.36, 0.53, 0.64, 0.55), page 63).", "evidence_pages": [57, 63], "evidence_sources": ["Text", "Figure"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning", "MP_Figure_Reasoning"], "pred": 4, "score": 1.0} 69 | {"question_id": "free_gpt4o_4157143_6_35_10", "doc_no": "4157143", "total_pages": 87, "start_end_idx": [6, 35], "question_type": "count", "question": "What's the number of figures listed under 'GUI (Graphical User Interface) in MATLAB'?", "answer": 1, "detailed_evidences": "'Fiber Types' figures are spread across pages 8 and 9, and include:\nFigure 7 - Fiber Types: refractive index profile (page 8)\nThis totals to 1 figures.\nOn the other hand, 'GUI (Graphical User Interface) in MATLAB' figures are found distributed in:\nFigure 44 - GUI Design\nThus, each section has a comparable number of figures, adding up to 1 figures each.", "evidence_pages": [8, 9], "evidence_sources": ["Figure"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Figure_Reasoning"], "pred": 1, "score": 1.0} 70 | {"question_id": "free_gpt4o_4177240_35_56_8", "doc_no": "4177240", "total_pages": 56, "start_end_idx": [35, 56], "question_type": "compare", "question": "Which model shows less aggressive underprediction for the u component in the -45 degree case in narrow streets:\n(A) Miskam A \n(B) Fluent D\nChoose the letter name in front of the right option from A, B.", "answer": "A", "detailed_evidences": "Comparing the ability of two different models in predicting the u component for the -45 degree case in narrow streets from different panels involves panels from page 36 (panel 1) and page 37 (panel 2). For instance, on page 36, Miskam A has a u prediction formula y=192 - 0.6969y + error term ((0.08, 0.21, 0.95, 0.93)). On page 37, Fluent D shows a different prediction, y=1.322k - 0.7124y + error term ((0.09, 0.21, 0.95, 0.93)}). When comparing these models, Miskam A underpredicts less aggressively (lower negative coefficient) compared to Fluent D, which indicates stronger variability but a higher error term in its predictions. Thus, Miskam A from panel 1 seems more stable with less drastic underprediction compared to Fluent D from panel 2.", "evidence_pages": [36, 37], "evidence_sources": ["Figure"], "answer_format": "String", "task_tag": "Reasoning", "subTask": ["MP_Figure_Reasoning"], "pred": "A", "score": 1.0} 71 | {"question_id": "free_gpt4o_4060934_85_114_5", "doc_no": "4060934", "total_pages": 130, "start_end_idx": [85, 114], "question_type": "compare", "question": "Which policy received the lowest percentage of 'No' responses between GE2a, GE2b, GE3, GE4, and BE1?\n(A). GE2a\n(B). GE2b\n(C). GE3\n(D). GE4\nChoose the letter name in front of the right option from A, B, C, D.", "answer": "A", "detailed_evidences": "To identify the policy with the lowest percentage of 'No' responses: Policy GE2a on page 86 had 0.69% 'No' responses ((0.05, 0.39, 0.95, 0.45), page 86), source is Table; Policy GE2b on page 87 had 2.07% 'No' responses ((0.06, 0.42, 0.08, 0.43), page 87), source is Text; Policy GE3 on page 88 had 2.74% 'No' responses ((0.06, 0.44, 0.08, 0.45), page 88), source is Text; Policy GE4 on page 90 had 0.70% 'No' responses ((0.06, 0.44, 0.08, 0.45), page 90), source is Text; Policy BE1 on page 91 had 4.29% 'No' responses ((0.06, 0.44, 0.08, 0.45), page 91), source is Text. 
The policy with the lowest percentage of 'No' responses is GE2a with 0.69%.", "evidence_pages": [86, 87, 88, 90, 91], "evidence_sources": ["Table"], "answer_format": "String", "task_tag": "Reasoning", "subTask": ["MP_Table_Reasoning"], "pred": "A", "score": 1.0} 72 | {"question_id": "summary2title_4185072_40_69_1", "doc_no": "4185072", "total_pages": 72, "start_end_idx": [40, 69], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nThis text discusses Micro Air Vehicles (MAVs) and Nano Air Vehicles (NAVs), focusing on biomimetic MAVs that mimic the wing motion of insects like dragonflies and bees. It describes a simplified wing frame design inspired by dragonfly wings, fabricated using methods like 3D printing. The text also mentions a multi-degree of freedom mechanism for a bee-like MAV, designed to replicate the figure-eight motion of bee wings. A comparison of different wing materials (ABS, PLA, acrylic) is also provided.\n\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "Bio-Inspiration in Flapping-Wing UAVs.", "detailed_evidences": "", "evidence_pages": [42, 43], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Para_Title_Locating"], "pred": "Bio-Inspiration in Flapping-Wing UAVs", "score": 0.9736842105263158} 73 | {"question_id": "free_gemini15_pro_4181009_95_112_1", "doc_no": "4181009", "total_pages": 112, "start_end_idx": [95, 112], "question_type": "calculate", "question": "What is the total number of directories across all Canadian FTP sites listed, according to the scan dates?", "answer": 1424, "detailed_evidences": "To calculate the total number of directories, we need to sum the \"Total Dirs\" from each table across multiple pages. Page 95 shows 113 directories for NEtLine 2000 ((0.22, 0.06, 0.77, 0.17), page 95); page 97 shows 114 directories for the University of Montreal ((0.22, 0.06, 0.77, 0.17), page 97); page 99 shows 415 directories for the University of Calgary ((0.22, 0.06, 0.77, 0.17), page 99); page 101 shows 70 directories for York University ((0.22, 0.06, 0.77, 0.17), page 101); page 103 shows 93 directories for Carleton University ((0.22, 0.06, 0.77, 0.17), page 103); page 105 shows 326 directories for Simon Fraser University ((0.22, 0.06, 0.77, 0.17), page 105); page 107 shows 247 directories for Mimic.ca Archive ((0.22, 0.07, 0.77, 0.17), page 107); and page 111 shows 46 directories for another Carleton University entry ((0.22, 0.07, 0.77, 0.17), page 111). Summing these values: 113 + 114 + 415 + 70 + 93 + 326 + 247 + 46 = 1424. Therefore, the total number of directories is 1424. Evidence can be found on pages 95, 97, 99, 101, 103, 105, 107, and 111.", "evidence_pages": [95, 97, 99, 101, 103, 105, 107, 111], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Table_Reasoning"], "pred": "Not answerable", "score": 0.0} 74 | {"question_id": "summary2title_4129570_80_91_1", "doc_no": "4129570", "total_pages": 91, "start_end_idx": [80, 91], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nThe text primarily discusses the prohibition and regulation of using genetic information for determining eligibility and underwriting of life, disability-income, and long-term-care insurance. 
It highlights that genetic tests must be certified, similar to current medical practices, and points out existing state laws that restrict such uses. The document addresses concerns from the insurance industry and notes the broad international consensus on restricting genetic information usage, emphasizing the importance of protecting individual privacy.\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "ARTICLE 4: LIFE, DISABILITY-INCOME, AND LONG-TERM-CARE INSURANCE", "detailed_evidences": "", "evidence_pages": [84, 85, 86], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Para_Title_Locating"], "pred": "SECTION 401. GENETIC TESTING SUPPLIED FOR LIFE, DISABILITY-INCOME, AND LONG-TERM-CARE INSURANCE.", "score": 0.6170212765957447} 75 | {"question_id": "summary2title_4166749_17_46_1", "doc_no": "4166749", "total_pages": 52, "start_end_idx": [17, 46], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nThe provided text discusses guidelines for tapering opioid dosages to avoid withdrawal symptoms. It highlights the recommended rates and methods for reducing opioids, emphasizing individualized tapering plans and patient safety. The text advises on slow tapers, possibly pausing, and restarting based on patient readiness. It also warns against rapid detoxification due to high risks, especially during pregnancy. Collaboration with mental health providers and specialists is recommended, along with monitoring for opioid use disorder and considering naloxone for overdose prevention.\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "Considerations for Tapering Opioids", "detailed_evidences": "", "evidence_pages": [28], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Para_Title_Locating"], "pred": "Recommendations", "score": 0.0} 76 | {"question_id": "summary2title_4127644_8_37_3", "doc_no": "4127644", "total_pages": 98, "start_end_idx": [8, 37], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nVistra, an electricity provider, considers its services critical during the COVID-19 pandemic and national emergency. While prioritizing employee and community well-being, Vistra maintains operations. Their Q3 2020 financial statements, incorporating COVID-19's impact, show no material adverse effects. The CARES Act, signed March 27, 2020, has tax implications detailed in Note 7.\n\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "COVID-19 Pandemic", "detailed_evidences": "", "evidence_pages": [12], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Para_Title_Locating"], "pred": "VISTRA CORP. NOTES TO CONDENSED CONSOLIDATED FINANCIAL STATEMENTS", "score": 0.0} 77 | {"question_id": "free_gpt4o_4068263_5_24_14", "doc_no": "4068263", "total_pages": 60, "start_end_idx": [5, 24], "question_type": "calculate", "question": "Calculate the difference in U-Value between Standard and Extreme Casement Units?", "answer": 0.04, "detailed_evidences": "The document provides U-Values for various casement units on page 5. The U-Value for the Standard Casement Unit is 0.27, and for the Extreme Casement Unit, it is 0.23 ((0.08, 0.54, 0.56, 0.7), page 5). 
The difference in U-Value between these two units is calculated as 0.27 - 0.23, which equals 0.04.", "evidence_pages": [5], "evidence_sources": ["Figure"], "answer_format": "Float", "task_tag": "Reasoning", "subTask": ["SP_Figure_Reasoning"], "pred": 0.04, "score": 1.0} 78 | {"question_id": "free_gemini15_pro_4016274_26_55_11", "doc_no": "4016274", "total_pages": 71, "start_end_idx": [26, 55], "question_type": "extract", "question": "In the Client Agreement, what clause addresses the topic of \"Notices\"? ", "answer": "Clause 17", "detailed_evidences": "Clause 17, \"NOTICES\", covers the topic of communications. This can be verified by the index on page 30 and the clause title on pages 41 and 42. ", "evidence_pages": [30, 41, 42], "evidence_sources": ["Layout", "Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Layout_Understanding", "MP_Table_Understanding"], "pred": "Notices", "score": 0.0} 79 | {"question_id": "free_gpt4o_4076912_11_40_6", "doc_no": "4076912", "total_pages": 103, "start_end_idx": [11, 40], "question_type": "extract", "question": "List the top four domains of basic research with the highest number of animal uses in 2018.", "answer": ["Ethology / animal behavior / animal biology", "Nervous system", "Immune system", "Oncology"], "detailed_evidences": "The top four domains of basic research with the highest number of animal uses in 2018 were:\n1. Ethology / animal behavior / animal biology: 1,061,647 uses\n2. Nervous system: 901,510 uses\n3. Immune system: 773,265 uses\n4. Oncology: 556,952 uses\nThis information can be found on pages 34 and 35. Detailed data is present in table 15 on page 35 with coordinates ((0.12, 0.5, 0.88, 0.81)) and Figure 17 on page 34 with coordinates ((0.04, 0.1, 0.95, 0.69)).", "evidence_pages": [34, 35], "evidence_sources": ["Table", "Figure"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Table_Understanding", "MP_Figure_Understanding"], "pred": ["Ethology/Animal Behaviour/Animal Biology", "Nervous System", "Immune System", "Oncology"], "score": 0.9709302325581395} 80 | {"question_id": "free_gpt4o_4083860_40_69_3", "doc_no": "4083860", "total_pages": 69, "start_end_idx": [40, 69], "question_type": "extract", "question": "List all percentages for the confidence that the organization would address concerns about unsafe clinical practice for 'Your org' and 'Average' categories.", "answer": ["49.7%", "59.1%", "54.8%", "58.5%"], "detailed_evidences": "To find the relevant data regarding the confidence that the organization would address concerns about unsafe clinical practice, look at pages 43 and 44.\n\nOn page 43:\n- Your org: 49.7%, Average: 59.1% (0.35, 0.8, 0.66, 0.94)\n\nOn page 44:\n- Your org: 54.8%, Average: 58.5% (0.04, 0.8, 0.34, 0.95)\n\nSo the percentages for 'Your org' and 'Average' concerning confidence in addressing concerns are found on pages 43 (49.7% and 59.1%) and 44 (54.8% and 58.5%).", "evidence_pages": [43, 44], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Table_Understanding"], "pred": ["49.7%", "59.1%"], "score": 0.0} 81 | {"question_id": "free_gemini15_pro_4139178_20_49_5", "doc_no": "4139178", "total_pages": 98, "start_end_idx": [20, 49], "question_type": "extract", "question": "The specific industries excluded by the SRI Balanced option do not include fossil fuels, yes or no?", "answer": "no", "detailed_evidences": "The SRI-Balanced Option, as described on pages 29 and 30, excludes investments in fossil fuels, 
companies breaching human rights or labor rights standards, companies with poor ESG policies, controversial weapons, Uncertified palm oil, Predatory lending, for-profit detention, gambling, tobacco production, and live animal export. See (0.09, 0.66, 0.26, 0.81) and (0.51, 0.21, 0.91, 0.50) for details.", "evidence_pages": [29, 30], "evidence_sources": ["Text", "Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Table_Understanding"], "pred": "no", "score": 1.0} 82 | {"question_id": "free_gpt4o_4032186_71_100_9", "doc_no": "4032186", "total_pages": 121, "start_end_idx": [71, 100], "question_type": "extract", "question": "Between 2013 and 2019, the purchasing managers' indices for China's manufacturing and services sectors showed volatility and an overall upward trend,yes or no?", "answer": "no", "detailed_evidences": "On page 77, the chart titled 'China: Purchasing Managers' Indices' shows the performance and variance for the years 2013 to 2015. The indices tracked are Caixin Manufacturing, Caixin Services, and Official. The indices show a downward trend, with fluctuations, reflecting the economic slowdown in the period. The chart coordinates are (0.53, 0.19, 0.95, 0.89).", "evidence_pages": [77], "evidence_sources": ["Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Figure_Understanding"], "pred": "no", "score": 1.0} 83 | {"question_id": "free_gpt4o_4115594_31_60_19", "doc_no": "4115594", "total_pages": 63, "start_end_idx": [31, 60], "question_type": "extract", "question": "Which public institutions did participants in the self-assessment tool receive training from?", "answer": "Public Education/Training Centers", "detailed_evidences": "8 of the participants in the self-assessment tool received training from Public Education/Training Centers (page 50, (0.15, 0.70, 0.72), evidence source: Picture).", "evidence_pages": [50], "evidence_sources": ["Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Figure_Understanding"], "pred": "Culinary Sector", "score": 0.0} 84 | {"question_id": "free_gemini15_pro_4074494_35_56_10", "doc_no": "4074494", "total_pages": 56, "start_end_idx": [35, 56], "question_type": "extract", "question": "What year was the map \"A topographical map of the north part of New York Island, exhibiting the plan of Fort Washington\" published?", "answer": 1777, "detailed_evidences": "The map was published in 1777 and created by William Faden. This information can be found on page 46, starting at (0.06, 0.65, 0.48, 0.79) and illustrated in Figure 1 on the same page. 
", "evidence_pages": [46], "evidence_sources": ["Text", "Figure"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding", "SP_Figure_Understanding"], "pred": 1777, "score": 1.0} 85 | {"question_id": "summary2title_4084402_91_102_1", "doc_no": "4084402", "total_pages": 102, "start_end_idx": [91, 102], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nThe given text outlines essential safety and etiquette guidelines for bikers, including slowing down for pedestrians, signaling when overtaking, respecting wildlife, nature, and property, choosing routes wisely, wearing protective gear, not traveling alone, and adhering to unwritten rules such as giving precedence, signaling overtaking, and remaining within sight of group members.\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "Behaviour code", "detailed_evidences": "", "evidence_pages": [99, 100], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Para_Title_Locating"], "pred": "Behaviour code", "score": 1.0} 86 | {"question_id": "free_gemini15_pro_4061601_47_71_8", "doc_no": "4061601", "total_pages": 71, "start_end_idx": [47, 71], "question_type": "calculate", "question": "What was the total fair value of options that vested in 2016, 2015, and 2014, in millions of Canadian dollars?", "answer": 312, "detailed_evidences": "Page 48 shows the fair value of options that vested as $126 million in 2016, $91 million in 2015, and $95 million in 2014. The total is $126 + $91 + $95 = $312 million. ", "evidence_pages": [48], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["SP_Table_Reasoning"], "pred": 312, "score": 1.0} 87 | {"question_id": "free_gemini15_pro_4156965_49_64_3", "doc_no": "4156965", "total_pages": 64, "start_end_idx": [49, 64], "question_type": "calculate", "question": "What is the aggregate compensation of Key Management Personnel for the years ended 30/06/15 and 30/06/14?", "answer": "$1660110", "detailed_evidences": "The table on page 49 details the aggregate compensation of Key Management Personnel. For the year ended 30/06/15, the total was $719,484.(0.15, 0.64, 0.88, 0.79) For the year ended 30/06/14, it was $940,626.(0.15, 0.64, 0.88, 0.79) The total for both years is $719,484 + $940,626 = $1,660,110. ", "evidence_pages": [49], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["SP_Table_Reasoning"], "pred": "$1660110", "score": 1.0} 88 | {"question_id": "free_gemini15_pro_4063909_28_57_6", "doc_no": "4063909", "total_pages": 68, "start_end_idx": [28, 57], "question_type": "calculate", "question": "How much did the total carrying amount of WSE's investments in its subsidiaries as at 31 December 2012 increase compared to 2011(in PLN thousands)?", "answer": "213582", "detailed_evidences": "The total carrying amount of WSE's investments in its subsidiaries as at 31 December 2012 was PLN 250,497 thousand and PLN 36,915 thousand as at 31 December 2011. 
This data is from the tables on page 36 (0.09, 0.43, 0.93, 0.63) and page 36 (0.11, 0.67, 0.92, 0.85) respectively.", "evidence_pages": [36], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["SP_Table_Reasoning"], "pred": 213582, "score": 1.0} 89 | {"question_id": "free_gpt4o_4034595_7", "doc_no": "4034595", "total_pages": 52, "start_end_idx": [10, 37], "question_type": "extract", "question": "What are the environmental concerns mentioned in the Arctic Community Concerns section?", "answer": ["More Spill Planning/Response", "Subsistence Hunter Safety", "Food Quality & Availability"], "detailed_evidences": "The section \"Arctic Community Concerns\" on pages 15 and 17 mentions several environmental concerns:\nOn page 15, concerns include \"More Spill Planning/Response\" ((0.04, 0.71, 0.75)), \"Subsistence Hunter Safety\" ((0.04, 0.76, 0.80)), and \"Food Quality & Availability\" ((0.04, 0.81, 0.86)).", "evidence_pages": [15, 17], "evidence_sources": ["Text", "Layout"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": ["Increased Arctic activity", "Spills may lead to lost food resources", "Gov’t response capabilities appear inadequate", "Not enough CG infrastructure present", "Traditional knowledge not fully understood"], "score": 0.0} 90 | {"question_id": "free_gpt4o_4096779_58_87_10", "doc_no": "4096779", "total_pages": 91, "start_end_idx": [58, 87], "question_type": "extract", "question": "Which section of copyright law was pivotal in Atari Games Corp. v. Nintendo ?", "answer": "Section 117(1)", "detailed_evidences": "Section 117(1) of the copyright law played a pivotal role in Atari Games Corp. v. Nintendo, relating to the legitimacy of intermediate copying activities. The section allows copying as an essential step in software utilization but was contested regarding reverse engineering claims in this case. The significance is outlined on page 65 ((0.07, 0.07, 0.29, 0.74)) and page 66 ((0.06, 0.07, 0.67, 0.85)), with evidence sources being Title and Pure-text, respectively.", "evidence_pages": [65, 66], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": "Section 117", "score": 1.0} 91 | {"question_id": "free_gpt4o_4027862_66_95_3", "doc_no": "4027862", "total_pages": 102, "start_end_idx": [66, 95], "question_type": "extract", "question": "What is the total local paddy production area(in hectares) developed in 2020?", "answer": 852.97, "detailed_evidences": "The total local paddy production area developed in 2020 is calculated using data from pages 71 and 74. On page 71, it's mentioned that the total developed area was 852.97 hectares. 
Page 74 also highlights developed areas with specific characteristics like irrigated and non-irrigated, but the overall developed area mentioned is consistent at 852.97 hectares.\n\nEvidence references:\n- \"PADDY RICE SUPPORT SCHEME FOR 2019/2020 SEASON\" ((0.15, 0.13, 0.22), page 74).\n- \"PADDY INDUSTRY 2020 Local Production 3.864 mt Area Developed 852.97 ha\" ((0.12, 0.07, 0.22), page 71).\n\nHence, the total local paddy production area developed in 2020 is 852.97 hectares.", "evidence_pages": [71, 74], "evidence_sources": ["Layout"], "answer_format": "Float", "task_tag": "Understanding", "subTask": ["MP_Layout_Understanding"], "pred": 852.97, "score": 1.0} 92 | {"question_id": "free_gpt4o_4169843_69_85_9", "doc_no": "4169843", "total_pages": 85, "start_end_idx": [69, 85], "question_type": "extract", "question": "Which organization is responsible for the Big Creek OV - Flood Damage Reduction Project?", "answer": "Shelby County Government", "detailed_evidences": "All responsible organizations across the different projects mentioned are as follows:\n1. Wolf River OV- Increased Greenway Use: Shelby County Government.\n2. Wolf River OV- Attracting New Business: Shelby County Government.\n3. Wolf River OV- Increased Wetlands: Shelby County Government.\n4. Wolf River OV- Flood Cost Reduction: Shelby County Government.\n5. Big Creek OV- Increased Greenway Use: City of Millington/Millington Parks Department.\n6. Big Creek OV - Increased Housing Value: Division of Planning and Development.\n7. Big Creek OV -Tree Mitigation: Shelby County Public Works.\n8. Big Creek OV- Decreased Flood Damage: Shelby County Government.\n\nAll these organizations are specified on pages within respective project descriptions, such as page 70 ((0.06, 0.26, 0.7, 0.29)), page 74 ((0.06, 0.26, 0.7, 0.29)), and page 78 ((0.07, 0.26, 0.74, 0.29)).", "evidence_pages": [70, 72, 74, 76, 78, 80, 82, 84], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": "Shelby County Government2", "score": 0.96} 93 | {"question_id": "free_gpt4o_4071554_7_36_4", "doc_no": "4071554", "total_pages": 51, "start_end_idx": [7, 36], "question_type": "extract", "question": "List all the sessions chaired by members from Western Ontario?", "answer": "[\"Geographies in the Global South I-West Africa Chair\", \"Environmental Discourse\"]", "detailed_evidences": "Western Ontario members chair the following sessions listed in the document:\n\n\"Geographies in the Global South I-West Africa Chair\" chaired by Godwin Arku (Western Ontario), listed on page 8. Evidence coordinates are (0.13, 0.65, 0.61, 0.7).\n\"Environmental Discourse\" chaired by Dan Shrubsole (Western Ontario), listed on page 9. 
Evidence coordinates are (0.17, 0.75, 0.55, 0.78).\n\nThese are the sessions chaired by members from Western Ontario.", "evidence_pages": [8, 9, 31, 32], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Layout_Understanding"], "pred": ["W14 Geographies in the Global South I – West Africa", "W16 Environmental Discourse"], "score": 0.7984749455337691} 94 | {"question_id": "free_gpt4o_4027862_66_95_6-2", "doc_no": "4027862", "total_pages": 102, "start_end_idx": [66, 95], "question_type": "extract", "question": "What's the most valuable fruit in retail in 2020?", "answer": "\"Banana\"", "detailed_evidences": "The top three types of fruits with the highest retail value in 2020 are \"Banana\" with a retail value of 16,530,000 B$ (page 66, evidence: Table titled \"FRUITS INDUSTRY 2011-2020\" (0.24, 0.13, 0.74, 0.17)), \"Papaya\" with a retail value of 14,530,000 B$ (page 67, evidence: table titled \"LOCAL FRUITS PRODUCTION 2020\" (0.17, 0.16, 0.83, 0.36)), and \"Watermelon\" with a retail value of 12,530,000 B$ (page 69, evidence: Table titled \"15 MAJOR TYPE OF IMPORT TROPICAL FRUITS 2020 (BY QUANTITY)\" (0.16, 0.25, 0.85, 0.83)).", "evidence_pages": [66, 67, 69], "evidence_sources": ["Layout", "Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Layout_Understanding", "MP_Table_Understanding"], "pred": "Banana", "score": 1.0} 95 | {"question_id": "summary2tab_4063624_72_75_6", "doc_no": "4063624", "total_pages": 76, "start_end_idx": [72, 75], "question_type": "summary2tab", "question": "Which tables emphasize the role of human resource management in the recruitment and evaluation process?\nSelect table names from the doc that best answer the question, do not alter or analyze the table names themselves.", "answer": ["Molimovasobjasnitekojavrstaobukebibilanajkorisnijazaposlenimazaduzenimzakadrovskeposlove/upravljanjeljudskimresursima.", "Molimovasdaobrazlozite,vasimrijecima,dalibi bilopotrebnorevidirati iunaprijeditibilokojiaspektpostupkaocjenjivanjasluzbenika."], "detailed_evidences": "", "evidence_pages": [72, 75], "evidence_sources": ["Layout", "Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Table_Locating"], "pred": "Not answerable", "score": 0.0} 96 | {"question_id": "summary2tab_4096873_61_66_4", "doc_no": "4096873", "total_pages": 87, "start_end_idx": [61, 66], "question_type": "summary2tab", "question": "Which tables provide detailed instructions on serving legal documents to the other party?\nSelect table names from the doc that best answer the question, do not alter or analyze the table names themselves.", "answer": ["DEFAULT TIMETABLE"], "detailed_evidences": "", "evidence_pages": [66, 61, 62], "evidence_sources": ["Layout", "Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Table_Locating"], "pred": "Not Answerable", "score": 0.0} 97 | {"question_id": "summary2tab_4028159_23_30_3", "doc_no": "4028159", "total_pages": 62, "start_end_idx": [23, 30], "question_type": "summary2tab", "question": "Which tables offer a comprehensive checklist for evaluating potential environmental impacts of the project?\nSelect table names from the doc that best answer the question, do not alter or analyze the table names themselves.", "answer": ["InitialStudy Checklist", "Initial Study Checklist"], "detailed_evidences": "", "evidence_pages": [30, 23], "evidence_sources": ["Layout", "Table"], "answer_format": "List", "task_tag": "Locating", "subTask": 
["Cross_Table_Locating"], "pred": ["Table 2 Initial Study Checklist"], "score": 0.7258064516129032} 98 | {"question_id": "summary2tab_4013723_3_4_2", "doc_no": "4013723", "total_pages": 149, "start_end_idx": [3, 4], "question_type": "summary2tab", "question": "From which tables can we learn about the health benefits of plant carbohydrates and their role in colonic health?\nSelect table names from the doc that best answer the question, do not alter or analyze the table names themselves.", "answer": ["Table 1 Dietary benefit from properties of plant carbohydrates in the diet", "Table 2 Direct and indirect effects of bulk provided by fermentation-resistant polysaccharide associations in the large intestine"], "detailed_evidences": "", "evidence_pages": [3, 4], "evidence_sources": ["Layout", "Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Table_Locating"], "pred": ["Table 1 Dietary benefit from properties of plant carbohydrates in the diet"], "score": 0.5} 99 | {"question_id": "summary2tab_4165048_33_39_2", "doc_no": "4165048", "total_pages": 74, "start_end_idx": [33, 39], "question_type": "summary2tab", "question": "From which tables can we learn about the use of non-online sources by practitioners, non-university researchers, and university researchers?\nSelect table names from the doc that best answer the question, do not alter or analyze the table names themselves.", "answer": ["Table 4.2. Use of non-online sources among Practitioners (valid answers = 42/46), Non-University Researchers (valid answers=24/25), and University Researchers (valid answers=26/28). \"Ranked as the most important\" refers to the proportion of respondents who ranked the source as 1st,2nd or 3rd. The sources are ranked by decreasing importance for Practitioners. Grey cells outline the 3 most important for each category.", "Table 4.7 (a) Comparison of the preferred online tools used by practitioners to share information and those used by researchers to access it; (b) Comparison of the preferred non-online tools used by practitioners to share information and those used by researchers to access it. Personal emails were not included in the answers offered in the question on information access, hence the impossibility of comparison."], "detailed_evidences": "", "evidence_pages": [33, 39], "evidence_sources": ["Layout", "Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Table_Locating"], "pred": ["Table 4.2. Use of non-online sources among Practitioners (valid answers 42/46). Non-University Researchers (valid answers 24/25), and University Researchers (valid answers 26/28)."], "score": 0.0} 100 | {"question_id": "free_gpt4o_4031704_61_71_4", "doc_no": "4031704", "total_pages": 71, "start_end_idx": [61, 71], "question_type": "count", "question": "How many team members did \"Spike It Up Sports Track Club A\" confirm for the 4x100 Meter Relay 17-18 year old Men?", "answer": 4, "detailed_evidences": "For the 4x100 Meter Relay 17-18 year old Men, \"Spike It Up Sports Track Club A\" confirmed 4 team members. 
The team members are Brian Smart Jr, Jake Mc Entyre, Ronaldo Cherry, and Jarion Hayes((0.09, 0.75, 0.96, 0.79), page 66), evidence sources are 'multicolumn'.", "evidence_pages": [66], "evidence_sources": ["Layout"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["SP_Layout_Reasoning"], "pred": 4, "score": 1.0} 101 | {"question_id": "free_gpt4o_4056678_18", "doc_no": "4056678", "total_pages": 70, "start_end_idx": [56, 63], "question_type": "compare", "question": "Is the definition of \"trip\" consistent across different sections of the document? Choose yes or no.", "answer": "Yes", "detailed_evidences": "The definition of \"trip\" appears consistently across different sections of the document. On page 56, under the title \"Important meanings in this section,\" \"trip\" is defined as \"travel during the period of insurance\" ((0.53, 0.57, 0.59), page 56). Similarly, on page 58, under the title \"Important meanings in this section,\" \"trip\" is again defined as \"travel during the period of insurance\" ((0.53, 0.48, 0.50), page 58). This consistency is maintained on page 62, where \"trip\" is defined identically as \"travel during the period of insurance\" ((0.53, 0.43, 0.45), page 62). Thus, the definition of \"trip\" remains unchanged across the document.", "evidence_pages": [56, 58, 62], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning", "MP_Layout_Reasoning"], "pred": "yes", "score": 1.0} 102 | {"question_id": "free_gpt4o_4155062_5_34_7", "doc_no": "4155062", "total_pages": 64, "start_end_idx": [5, 34], "question_type": "count", "question": "How many types of infoments does the Vivaro have?", "answer": 2, "detailed_evidences": "The Vivaro offers various infotainment features across its different models and variants. 
These include:\n - Infotainment system with well-connected entertainment [12] [Text] (0.05, 0.25, 0.44, 0.4).\n - Stay in touch with high levels of infotainment and communication [12] [Pure-text] (0.11, 0.71, 0.45, 0.73).\n\nTherefore, there are altogether 2 different types of infotainment features mentioned.", "evidence_pages": [12], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["SP_Text_Reasoning"], "pred": 1, "score": 0.0} 103 | -------------------------------------------------------------------------------- /scripts/calculate_metrics.sh: -------------------------------------------------------------------------------- 1 | python utils/calculate_metrics.py \ 2 | --results_file evaluation_results/api_models/results_gemini15_pro.jsonl -------------------------------------------------------------------------------- /scripts/calculate_metrics_fine_grained.sh: -------------------------------------------------------------------------------- 1 | python utils/calculate_metrics_fine_grained.py \ 2 | --results_file evaluation_results/api_models/results_gemini15_pro.jsonl -------------------------------------------------------------------------------- /scripts/eval_api_models.sh: -------------------------------------------------------------------------------- 1 | python eval/api_models/eval_api_models.py \ 2 | --qa_file data/LongDocURL.jsonl \ 3 | --results_file evaluation_results/api_models/results_gpt4o.jsonl \ 4 | --process_mode serial \ 5 | --image_prefix /mnt/achao/Downloads/pdf_pngs/4000-4999 \ 6 | --model_name gpt4o -------------------------------------------------------------------------------- /utils/__pycache__/utils_score_v3.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dengc2023/LongDocURL/84cd1ac4d314f24719e5e5e8e093d336a8c39faf/utils/__pycache__/utils_score_v3.cpython-311.pyc -------------------------------------------------------------------------------- /utils/calculate_metrics.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | 4 | import sys 5 | import pathlib 6 | sys.path.append(str(pathlib.Path(__file__).absolute().parent.parent)) 7 | 8 | from utils.utils_score_v3 import eval_score 9 | 10 | # modify: allow multiple preds and return score dict 11 | def calculate_accuracy(answers: list, annotations: list, answer_formats: list, multiple_preds: list = None): 12 | total_scores = 0.0 13 | score_list = [] 14 | for pred_ans, annotation, answer_format, multiple_pred in zip(answers, annotations, answer_formats, multiple_preds): 15 | if pred_ans == "Fail to extract": 16 | score_v3 = 0.0 17 | elif not multiple_pred: 18 | score_v3 = eval_score(annotation, pred_ans, answer_format) 19 | else: 20 | score_v3 = max([eval_score(annotation, item, answer_format) for item in pred_ans]) 21 | 22 | score_list.append(score_v3) 23 | total_scores += score_v3 24 | 25 | generalized_score = total_scores / len(answers) 26 | 27 | return generalized_score, score_list 28 | 29 | 30 | 31 | if __name__ == "__main__": 32 | parser = argparse.ArgumentParser() 33 | parser.add_argument('--results_file', type=str, default="") 34 | 35 | args = parser.parse_args() 36 | 37 | with open(args.results_file, "r", encoding="utf-8") as rf: 38 | samples = [json.loads(_.strip()) for _ in rf.readlines()] 39 | 40 | for sample in samples: 41 | assert "pred" in sample 42 | 43 | answers = [_["pred"] for _ in samples] 44 | annotations 
= [_["answer"] for _ in samples] 45 | answer_formats = [_["answer_format"] for _ in samples] 46 | 47 | # modify 48 | multiple_preds = [True if "multiple_pred" in _ else False for _ in samples] # for modified_answer case 49 | # multiple_preds = [False for _ in samples] # for non-modified_answer case 50 | 51 | generalized_score, score_list = calculate_accuracy(answers, annotations, answer_formats, multiple_preds) # calculate on size of successful samples 52 | rectified_generalized_score = generalized_score * len(answers) / 2325 # calculate on size of 2325 53 | 54 | # Update the score field in samples with scores from score_list 55 | for sample, score in zip(samples, score_list): 56 | sample["score"] = score 57 | 58 | # Write the updated results to a new file in the same directory 59 | output_file = args.results_file.replace(".jsonl", "_scored.jsonl") 60 | with open(output_file, "w", encoding="utf-8") as wf: 61 | for sample in samples: 62 | wf.write(json.dumps(sample) + "\n") 63 | print(f"Scores updated and saved to: {output_file}") 64 | 65 | 66 | print("--------------------------------------") 67 | print("Avg. acc: {}".format(generalized_score)) 68 | print("Rectified Avg. acc: {}".format(rectified_generalized_score)) 69 | -------------------------------------------------------------------------------- /utils/calculate_metrics_fine_grained.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | 4 | import sys 5 | import pathlib 6 | sys.path.append(str(pathlib.Path(__file__).absolute().parent.parent)) 7 | 8 | from utils.utils_score_v3 import eval_score 9 | 10 | def calculate_accuracy_fine_grained(samples, score_dict): 11 | for sample in samples: 12 | pred_ans, annotation, answer_format, multiple_pred = sample["pred"], sample["answer"], sample["answer_format"], True if "multiple_pred" in sample else False 13 | if pred_ans == "Fail to extract": 14 | score_v3 = 0.0 15 | elif not multiple_pred: 16 | score_v3 = eval_score(annotation, pred_ans, answer_format) 17 | else: 18 | score_v3 = max([eval_score(annotation, item, answer_format) for item in pred_ans]) 19 | sample["score_v3"] = score_v3 20 | 21 | # Main_Task 22 | for sample in samples: 23 | score_dict["Main_Task"][sample["task_tag"]] += sample["score_v3"] 24 | 25 | # Element_Type 26 | for sample in samples: 27 | for evidence_source in sample["evidence_sources"]: 28 | if evidence_source in ["Text", "Layout", "Figure", "Table"]: 29 | score_dict["Element_Type"][evidence_source] += sample["score_v3"] 30 | 31 | # Evidence_Pages 32 | for sample in samples: 33 | if len(sample["evidence_pages"]) > 1: 34 | score_dict["Evidence_Pages"]["Multi_Page"] += sample["score_v3"] 35 | elif len(sample["evidence_pages"]) == 1: 36 | score_dict["Evidence_Pages"]["Single_Page"] += sample["score_v3"] 37 | 38 | # Num_of_Element_Types 39 | for sample in samples: 40 | if len(sample["evidence_sources"]) > 1: 41 | score_dict["Num_of_Element_Types"]["Cross_Element"] += sample["score_v3"] 42 | 43 | # Fine_Grained 44 | for sample in samples: 45 | sub_score_dict = score_dict["Fine_Grained"][sample["task_tag"]] 46 | if sample["task_tag"] in ["Understanding", "Reasoning"]: 47 | if len(sample["evidence_pages"]) > 1: 48 | sub_sub_score_dict = sub_score_dict["Multi_Page"] 49 | elif len(sample["evidence_pages"]) == 1: 50 | sub_sub_score_dict = sub_score_dict["Single_Page"] 51 | 52 | for evidence_source in sample["evidence_sources"]: 53 | if evidence_source in ["Text", "Layout", "Figure", "Table"]: 54 | 
sub_sub_score_dict[evidence_source] += sample["score_v3"] 55 | 56 | if len(sample["evidence_pages"]) > 1: 57 | sub_score_dict["Multi_Page"] = sub_sub_score_dict 58 | elif len(sample["evidence_pages"]) == 1: 59 | sub_score_dict["Single_Page"] = sub_sub_score_dict 60 | 61 | elif sample["task_tag"] in ["Locating"]: 62 | sub_sub_score_dict = sub_score_dict["Cross_Element"] 63 | if sample["question_type"] == "topic2title": 64 | sub_sub_score_dict["Cross_Title"] += sample["score_v3"] 65 | elif sample["question_type"] == "summary2title": 66 | sub_sub_score_dict["Para_Title"] += sample["score_v3"] 67 | elif sample["question_type"] == "summary2tab": 68 | sub_sub_score_dict["Cross_Table"] += sample["score_v3"] 69 | elif sample["question_type"] == "extract_fig2tab": 70 | sub_sub_score_dict["Figure_Table"] += sample["score_v3"] 71 | 72 | sub_score_dict["Cross_Element"] = sub_sub_score_dict 73 | 74 | score_dict["Fine_Grained"][sample["task_tag"]] = sub_score_dict 75 | 76 | 77 | return score_dict 78 | 79 | 80 | 81 | if __name__ == "__main__": 82 | parser = argparse.ArgumentParser() 83 | parser.add_argument('--results_file', type=str, default="evaluation_results/api_models/results_gemini15_pro.jsonl") 84 | parser.add_argument('--score_sample_file', type=str, default="evaluation_results/scores_sample_fine_grained.json") 85 | 86 | args = parser.parse_args() 87 | 88 | with open(args.results_file, "r", encoding="utf-8") as rf: 89 | samples = [json.loads(_.strip()) for _ in rf.readlines()] 90 | 91 | with open(args.score_sample_file, "r", encoding="utf-8") as rf: 92 | _ = json.load(rf) 93 | score_dict, sample_cnt_dict = _["scores"], _["sample_cnt"] 94 | 95 | for sample in samples: 96 | assert "pred" in sample 97 | 98 | score_dict = calculate_accuracy_fine_grained(samples, score_dict) 99 | 100 | def generalize_score_dict(score_dict, sample_cnt_dict): 101 | for key, value in score_dict.items(): 102 | if isinstance(value, dict): 103 | generalize_score_dict(value, sample_cnt_dict[key]) 104 | score_dict[key] = value 105 | else: 106 | score_dict[key] /= sample_cnt_dict[key] 107 | 108 | generalize_score_dict(score_dict, sample_cnt_dict) 109 | 110 | print("--------------------------------------------------------------") 111 | print(score_dict) 112 | -------------------------------------------------------------------------------- /utils/extract_ccpdf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import fitz 4 | from PIL import Image 5 | import sys 6 | from tqdm import tqdm 7 | import argparse 8 | 9 | 10 | def get_pdf_filename(pdf_paths_txt, **kwargs): 11 | with open(pdf_paths_txt, "r", encoding="utf-8") as file: 12 | files = [line.strip() for line in file.readlines()] 13 | files = [(filename, filename.split("/")[-2], filename.split("/")[-1][:-4]) for filename in files] 14 | print("pdf file cnt: ", len(files)) 15 | if "start_id" in kwargs and "end_id" in kwargs: 16 | start_id = int(kwargs.pop("start_id")) 17 | end_id = int(kwargs.pop("end_id")) 18 | return files[start_id:end_id] 19 | else: 20 | return files 21 | 22 | 23 | def save_png(page, pp_no, doc_no, dir, zip_no): 24 | ddir = f"{dir}/{zip_no}" 25 | if not os.path.exists(ddir): 26 | os.makedirs(ddir) 27 | png_path = ddir + f"/{doc_no}_{pp_no}.png" 28 | pix = page.get_pixmap() 29 | pix.save(png_path) 30 | 31 | return png_path 32 | 33 | 34 | def resize(img_size, rectangle): 35 | ori_w, ori_h = img_size 36 | w_rate = ori_w # / 224 37 | h_rate = ori_h # / 224 38 | return [round(rectangle[0]/w_rate, 
3), round(rectangle[1]/h_rate, 3), round(rectangle[2]/w_rate, 3), round(rectangle[3]/h_rate, 3)] 39 | 40 | 41 | def extracts(pdf_path, img_save_dir, json_dir): 42 | doc_no = pdf_path.split("/")[-1][:7] # 4000001.pdf 43 | zip_no = doc_no[:4] 44 | filename = pdf_path 45 | try: 46 | pages = [] 47 | doc = fitz.open(filename) 48 | for pp in doc: 49 | pages.append(pp) 50 | except: 51 | print("error: ", filename.split('/')[-1]) 52 | return 53 | 54 | lines = [] 55 | for idx, page in enumerate(pages): 56 | try: 57 | png_path = save_png(page, idx, doc_no, img_save_dir, zip_no) 58 | my_image = Image.open(png_path) 59 | except: 60 | print("img error: ", filename.split('/')[-1]) 61 | continue 62 | 63 | word_lists = page.get_text("words") 64 | if not word_lists: 65 | continue 66 | 67 | for dd in word_lists: 68 | coordi = resize(my_image.size, list(dd[:4])) 69 | lines.append({"coordi": coordi, 70 | "word": dd[4].encode('utf-8', 'ignore').decode('utf-8'), 71 | "line_no": dd[5], 72 | "block_no": dd[6], 73 | "word_no": dd[7], 74 | "page_no": idx}) 75 | 76 | my_image.close() 77 | 78 | out_data = { 79 | "zip_no": zip_no, 80 | "doc_no": doc_no, 81 | "pdf_path": filename, 82 | "img_size": my_image.size, 83 | "contents": lines 84 | } 85 | 86 | if not os.path.exists(os.path.join(json_dir, zip_no)): 87 | os.makedirs(os.path.join(json_dir, zip_no)) 88 | with open(os.path.join(json_dir, zip_no, f"{doc_no}.json"), 'w') as file: 89 | json.dump(out_data, file, ensure_ascii=False) 90 | 91 | 92 | if __name__ == '__main__': 93 | parser = argparse.ArgumentParser() 94 | parser.add_argument("--pdf_dir", type=str, default="/mnt/achao/Downloads/ccpdf_zip/") 95 | parser.add_argument("--img_save_dir", type=str, default="/mnt/achao/Downloads/pdf_pngs/") 96 | parser.add_argument("--json_dir", type=str, default="/mnt/achao/Downloads/pdf_jsons/") 97 | args = parser.parse_args() 98 | 99 | pdf_dir = args.pdf_dir 100 | img_save_dir = args.img_save_dir 101 | json_dir = args.json_dir 102 | pdf_paths = [os.path.join(pdf_dir, pdf_name) for pdf_name in os.listdir(pdf_dir)] 103 | 104 | for pdf_path in tqdm(pdf_paths, desc="Processing ", total=len(pdf_paths), unit="item"): 105 | extracts(pdf_path, img_save_dir, json_dir) 106 | -------------------------------------------------------------------------------- /utils/run_extract_ccpdf.sh: -------------------------------------------------------------------------------- 1 | python utils/extract_ccpdf.py \ 2 | --pdf_dir /mnt/achao/Downloads/ccpdf_zip/ \ 3 | --img_save_dir /mnt/achao/Downloads/pdf_pngs/ \ 4 | --json_dir /mnt/achao/Downloads/pdf_jsons/ -------------------------------------------------------------------------------- /utils/utils_score_v3.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | from math import isclose 4 | from collections import defaultdict 5 | 6 | 7 | def levenshtein_distance(s1, s2): 8 | if len(s1) > len(s2): 9 | s1, s2 = s2, s1 10 | 11 | distances = range(len(s1) + 1) 12 | for i2, c2 in enumerate(s2): 13 | distances_ = [i2 + 1] 14 | for i1, c1 in enumerate(s1): 15 | if c1 == c2: 16 | distances_.append(distances[i1]) 17 | else: 18 | distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1]))) 19 | distances = distances_ 20 | return distances[-1] 21 | 22 | 23 | def anls_compute(groundtruth, prediction, threshold=0.5): 24 | dist = levenshtein_distance(groundtruth, prediction) 25 | length = max(len(groundtruth.upper()), len(prediction.upper())) 26 | value = 0.0 if length == 0 else 

def is_float_equal(reference, prediction, include_percentage: bool = False, is_close: bool = False) -> bool:
    def get_precision(gt_ans: float) -> int:
        precision = 3
        if '.' in str(gt_ans):
            precision = len(str(gt_ans).split('.')[-1])
        return precision

    reference = float(str(reference).strip().rstrip("%").strip())
    try:
        prediction = float(str(prediction).strip().rstrip("%").strip())
    except:
        return False

    if include_percentage:
        gt_result = [reference / 100, reference, reference * 100]
    else:
        gt_result = [reference]
    for item in gt_result:
        try:
            if is_close:
                if isclose(item, prediction, rel_tol=0.01):
                    return True
            precision = max(min(get_precision(prediction), get_precision(item)), 2)
            if round(prediction, precision) == round(item, precision):
                return True
        except Exception:
            continue
    return False


def get_clean_string(s):
    s = str(s).lower().strip()
    s = s.replace(",", "")
    if s.endswith("kg"):
        s = s.rstrip("kg").strip()
    if s.endswith("mm"):
        s = s.rstrip("mm").strip()
    if s.endswith("m"):
        s = s.rstrip("m").strip()
    if s.endswith("meters"):
        s = s.rstrip("meters").strip()
    if s.endswith("acres"):
        s = s.rstrip("acres").strip()
    if s.endswith("minutes"):
        s = s.rstrip("minutes").strip()
    if s.endswith("mile"):
        s = s.rstrip("mile").strip()
    if s.endswith("miles"):
        s = s.rstrip("miles").strip()
    if s.endswith("million"):
        s = s.rstrip("million").strip()
    if s.endswith("thousand"):
        s = s.rstrip("thousand").strip()
    if s.endswith("billion"):
        s = s.rstrip("billion").strip()
    # remove parenthesis
    s = re.sub(r'\s*\([^)]*\)', '', s).strip()
    # remove quotes
    s = re.sub(r"^['\"]|['\"]$", "", s).strip()
    s = s.strip().lstrip("$").strip()
    s = s.strip().lstrip("£").strip()
    s = s.strip().rstrip("%").strip()
    return s
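# Illustrative behaviour (examples added for documentation; not part of the upstream file):
#   get_clean_string("$1,234 million")    -> "1234"
#   get_clean_string('"Page 12 (draft)"') -> "page 12"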

def is_exact_match(s):
    flag = False
    # Website
    if "https://" in s:
        flag = True
    # code file
    if s.endswith(".py") or s.endswith("ipynb"):
        flag = True
    if s.startswith("page"):
        flag = True
    # telephone number
    if re.fullmatch(r'\b\d+(-\d+|\s\d+)?\b', s):
        flag = True
    # time
    if "a.m." in s or "p.m." in s:
        flag = True
    # YYYY-MM-DD
    if re.fullmatch(r'\b\d{4}[-\s]\d{2}[-\s]\d{2}\b', s):
        flag = True
    # YYYY-MM
    if re.fullmatch(r'\b\d{4}[-\s]\d{2}\b', s):
        flag = True
    # Email address
    if re.fullmatch(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', s):
        flag = True
    return flag


def isfloat(num):
    try:
        float(num)
        return True
    except Exception as e:
        return False


def eval_score(gt, pred, answer_type):
    if answer_type=="Integer":
        try:
            gt = get_clean_string(str(gt))
            if len(re.findall(r"\d+,\s*\d+", gt, re.DOTALL)) > 0:  # deal with Integer value formatted as "96,395"
                gt = "".join([_.strip() for _ in gt.split(",")])
            gt = int(gt)
        except:
            gt = gt
        try:
            pred = get_clean_string(str(pred))
            if len(re.findall(r"\d+,\s*\d+", pred, re.DOTALL)) > 0:  # deal with Integer value formatted as "96,395"
                pred = "".join([_.strip() for _ in pred.split(",")])
            pred = int(pred)
        except:
            pred = ""
        score = (gt==pred)
    elif answer_type=="Float":
        gt = get_clean_string(str(gt))
        pred = get_clean_string(str(pred))

        if len(re.findall(r"\d+,\s*\d+", gt, re.DOTALL)) > 0:  # deal with Integer value formatted as "96,395"
            gt = "".join([_.strip() for _ in gt.split(",")])
        try:
            gt = float(gt)
        except:
            gt = gt

        if len(re.findall(r"\d+,\s*\d+", pred, re.DOTALL)) > 0:  # deal with Integer value formatted as "96,395"
            pred = "".join([_.strip() for _ in pred.split(",")])
        try:
            pred = float(pred)
        except:
            pred = str(pred)

        try:
            score = is_float_equal(gt, pred, include_percentage=True, is_close=True)
        except:
            score = 0

    elif answer_type in ["String", "None"]:
        gt = get_clean_string(gt)
        pred = get_clean_string(pred)
        if is_exact_match(gt):
            score = (gt==pred)
        else:
            score = anls_compute(gt, pred)
    else:
        if isinstance(gt, str) and gt.startswith("["):
            try:
                gt = eval(gt)
            except:
                gt = gt
        if not isinstance(gt, list):
            gt = [gt]
        if isinstance(pred, str) and pred.startswith("["):
            try:
                pred = eval(pred)
            except:
                pred = pred
        if not isinstance(pred, list):
            pred = [pred]
        if isinstance(gt[0], dict):
            gt = ["-".join([str(value) for key,value in _.items()]) for _ in gt]
        if isinstance(pred[0], dict):
            pred = ["-".join([str(value) for key,value in _.items()]) for _ in pred]

        print(len(gt), len(pred))
        print(gt, pred)
        def cal_score_v3(gt, pred):
            gt = [get_clean_string(a) for a in gt]
            pred = [get_clean_string(a) for a in pred]
            if isfloat(gt[0]) or is_exact_match(gt[0]):
                score_v3 = ("-".join(gt)=="-".join(pred))
            else:
                greedy_scores = [max([anls_compute(str(gt_v), str(pred_v)) for pred_v in pred]) for gt_v in gt]
                score_v3 = sum(greedy_scores) / len(gt) * min(1, len(gt) / len(pred)) ** 0.5
            return score_v3

        score_v3 = cal_score_v3(gt, pred)

    score_v3 = score if answer_type in ["Integer", "Float", "String", "None"] else score_v3

    return float(score_v3)
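# Illustrative behaviour (examples added for documentation; not part of the upstream file):
#   eval_score(5, "5", "Integer")         -> 1.0  (both sides are cleaned and compared as integers)
#   eval_score("12.5%", "0.125", "Float") -> 1.0  (include_percentage also accepts the /100 form)
#   eval_score("Health Department", "health department", "String") -> 1.0  (normalised, then ANLS)
#   For list-type answers, a greedy per-item ANLS is averaged and scaled by
#   min(1, len(gt) / len(pred)) ** 0.5 to penalise over-long predictions.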

def eval_acc_and_f1(samples):
    # Accuracy / F1 over an already-loaded list of result samples; show_results relies on this helper.
    evaluated_samples = [sample for sample in samples if "score_v3" in sample]
    if not evaluated_samples:
        return 0.0, 0.0

    acc = sum([sample["score_v3"] for sample in evaluated_samples])/len(evaluated_samples)
    try:
        recall = sum([sample["score_v3"] for sample in evaluated_samples if sample["answer"]!="Not answerable"])/len([sample for sample in evaluated_samples if sample["answer"]!="Not answerable"])
        precision = sum([sample["score_v3"] for sample in evaluated_samples if sample["answer"]!="Not answerable"])/len([sample for sample in evaluated_samples if sample["pred"]!="Not answerable"])
        f1 = 2*recall*precision/(recall+precision) if (recall+precision)>0.0 else 0.0
    except:
        f1 = 0.0

    return acc, f1


def calculate_acc_and_f1(results_file):
    samples = [json.loads(_.strip()) for _ in open(results_file, "r", encoding="utf-8").readlines()]
    return eval_acc_and_f1(samples)


def show_results(samples, show_path=None):
    for sample in samples:
        sample["evidence_pages"] = eval(sample["evidence_pages"])
        sample["evidence_sources"] = eval(sample["evidence_sources"])

    with open(show_path, 'w') as f:
        acc, f1 = eval_acc_and_f1(samples)
        f.write("Overall Acc: {} | Question Number: {}\n".format(acc, len(samples)))
        f.write("Overall F1-score: {} | Question Number: {}\n".format(f1, len(samples)))
        f.write("-----------------------\n")

        #####################
        acc_single_page, _ = eval_acc_and_f1([sample for sample in samples if len(sample["evidence_pages"])==1])
        acc_multi_page, _ = eval_acc_and_f1([sample for sample in samples if len(sample["evidence_pages"])!=1 and sample["answer"]!="Not answerable"])
        acc_neg, _ = eval_acc_and_f1([sample for sample in samples if sample["answer"]=="Not answerable"])

        f.write("Single-page | Accuracy: {} | Question Number: {}\n".format(
            acc_single_page, len([sample for sample in samples if len(sample["evidence_pages"])==1])
        ))
        f.write("Cross-page | Accuracy: {} | Question Number: {}\n".format(
            acc_multi_page, len([sample for sample in samples if len(sample["evidence_pages"])!=1 and sample["answer"]!="Not answerable"])
        ))
        f.write("Unanswerable | Accuracy: {} | Question Number: {}\n".format(
            acc_neg, len([sample for sample in samples if sample["answer"]=="Not answerable"])
        ))
        f.write("-----------------------\n")

        #####################
        source_sample_dict, document_type_dict = defaultdict(list), defaultdict(list)
        for sample in samples:
            for answer_source in sample["evidence_sources"]:
                source_sample_dict[answer_source].append(sample)
            document_type_dict[sample["doc_type"]].append(sample)
        for type, sub_samples in source_sample_dict.items():
            f.write(
                "Evidence Sources: {} | Accuracy: {} | Question Number: {}\n".format(type, eval_acc_and_f1(sub_samples)[0], len(sub_samples))
            )

        f.write("-----------------------\n")
        for type, sub_samples in document_type_dict.items():
            f.write(
                "Document Type: {} | Accuracy: {} | Question Number: {}\n".format(type, eval_acc_and_f1(sub_samples)[0], len(sub_samples))
            )
--------------------------------------------------------------------------------