├── README.md ├── config └── api_config.json ├── data └── LongDocURL.jsonl ├── eval ├── __pycache__ │ └── utils_api.cpython-311.pyc ├── api_models │ ├── __pycache__ │ │ ├── model.cpython-311.pyc │ │ └── pure_ocr_utils.cpython-311.pyc │ ├── eval_api_models.py │ ├── model.py │ └── pure_ocr_utils.py ├── extract_concise_answer_from_vllm.py ├── prompt_for_answer_extraction.md └── utils_api.py ├── evaluation_results ├── api_models │ ├── results_gemini15_pro.jsonl │ └── results_gpt4o.jsonl └── scores_sample_fine_grained.json ├── examples ├── example_docmind.json └── example_structured_text.txt ├── human_baselines ├── baseline_1.jsonl └── baseline_2.jsonl ├── scripts ├── calculate_metrics.sh ├── calculate_metrics_fine_grained.sh └── eval_api_models.sh └── utils ├── __pycache__ └── utils_score_v3.cpython-311.pyc ├── calculate_metrics.py ├── calculate_metrics_fine_grained.py ├── extract_ccpdf.py ├── run_extract_ccpdf.sh └── utils_score_v3.py /README.md: -------------------------------------------------------------------------------- 1 | # Welcome to LongDocURL! 2 | Repository for the paper "LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating". 3 | 4 | **Paper**: [paper](https://arxiv.org/pdf/2412.18424) 5 | 6 | **Blog Website**: [longdocurl.github.io](https://longdocurl.github.io/) 7 | 8 | **Dataset**: [dataset](https://huggingface.co/datasets/dengchao/LongDocURL/) 9 | 10 | ## About LongDocURL 11 | The LongDocURL benchmark is specifically designed to assess models' ability to understand long documents. 12 | We collect 2,325 high-quality question-answering pairs covering 396 PDF-formatted documents and more than 33,000 pages, significantly surpassing existing benchmarks in scale. 13 | Our open dataset can be found at [LongDocURL](https://huggingface.co/datasets/dengchao/LongDocURL/). You can refer to the [Blog Website](https://longdocurl.github.io/) for more information. 14 | 15 | ## Evaluation 16 | 17 | **1. (Optional) Download & Extract PDFs** 18 | 19 | Download the PDFs and the QA file (.jsonl) from [LongDocURL](https://huggingface.co/datasets/dengchao/LongDocURL/). Run the following command to extract the PDFs into PNGs and JSON files (using PyMuPDF). 20 | 21 | ```bash 22 | bash utils/run_extract_ccpdf.sh 23 | ``` 24 | 25 | Images will be organized as follows: 26 | ```markdown 27 | ├── 4000 28 | │ └── 4000001.png 29 | └── 4001 30 | ├── 4001001.png 31 | └── 4001002.png 32 | ``` 33 | 34 | **2. Other Configurations** 35 | - `api_key`: update `config/api_config.json`; used to extract the short answer from the detailed response. 36 | - `qa_jsonl`: update `data/LongDocURL.jsonl`, downloaded from [LongDocURL](https://huggingface.co/datasets/dengchao/LongDocURL/). 37 | - `api_models`: default `gpt4o-2024-05-13` for extracting the short answer. If you use our code to evaluate proprietary models, please check and modify `eval/api_models/model.py`. 38 | 39 | **3. Evaluating API Models** 40 | ```bash 41 | bash scripts/eval_api_models.sh 42 | ``` 43 | 44 | Options to note: 45 | - `process_mode`: default `serial`. Set `parallel` if parallel execution is needed. The default number of parallel processes is 8. 46 | - `image_prefix`: default `None`. Add an image prefix when needed to build proper image paths. 47 | - `model_name`: the model abbreviation is mapped to the actual model class defined in `eval/api_models/model.py`. 48 | 49 | **4. 
Calculate Metrics** 50 | 51 | To calculate the final generalized accuracy: 52 | ```bash 53 | bash scripts/calculate_metrics.sh 54 | ``` 55 | To calculate generalized accuracy in a more fine-grained way like `evaluation_results/scores_sample_fine_grained.json`: 56 | ```bash 57 | bash scripts/calculate_metrics_fine_grained.sh 58 | ``` 59 | 60 | ## 🏆 Leaderboard 🏆 61 | 62 | | Model | Size | Understanding | Reasoning | Locating | Total | 63 | |---------------------------|--------|----------------|-------------|------------|-------| 64 | | GPT-4o-24-05-13 🥇 | - | 68.6 | 59.9 | 59.6 | 64.5 | 65 | | Gemini-1.5-Pro 🥈 | - | 55.7 | 43.4 | 46.4 | 50.9 | 66 | | Qwen-VL-Max 🥉 | - | 58.8 | 43.9 | 36.0 | 49.5 | 67 | | Qwen2-VL | 7B | 36.9 | 24.8 | 22.6 | 30.6 | 68 | | LLaVA-OneVision-Chat | 7B | 30.5 | 19.0 | 18.7 | 25.0 | 69 | | LLaVA-Next-Interleave-DPO | 7B | 21.6 | 13.9 | 7.6 | 16.2 | 70 | | Llama-3.2 | 11B | 12.9 | 9.4 | 2.7 | 9.2 | 71 | -------------------------------------------------------------------------------- /config/api_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "gpt4o": { 3 | "access_key": "", 4 | "base_url": "" 5 | } 6 | } -------------------------------------------------------------------------------- /data/LongDocURL.jsonl: -------------------------------------------------------------------------------- 1 | {"question_id": "free_gemini15_pro_4055915_42_71_8", "doc_no": "4055915", "total_pages": 84, "start_end_idx": [42, 71], "question_type": "calculate", "question": "How many judging criterias are there for judging the homemade chocolate cake competition?", "answer": 5, "detailed_evidences": "The judging criteria for the Homemade Chocolate Cake Contest are listed on page 54, (0.08, 0.26, 0.92, 0.52). 
The criteria include flavor (30 points), texture (25 points), inside characteristics (20 points), outside characteristics (15 points), and frosting (10 points).", "evidence_pages": [54], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "images": ["/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_41.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_42.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_43.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_44.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_45.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_46.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_47.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_48.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_49.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_50.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_51.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_52.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_53.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_54.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_55.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_56.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_57.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_58.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_59.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_60.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_61.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_62.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_63.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_64.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_65.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_66.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_67.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_68.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_69.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4055/4055915_70.png"], "pdf_path": "/data/oss_bucket_0/achao.dc/public_datasets/ccpdf_zip/4000-4999/4055915.pdf"} 2 | {"question_id": "summary2title_4127644_8_37_8", "doc_no": "4127644", "total_pages": 98, "start_end_idx": [8, 37], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nThe text describes a merger between Vistra and Dynegy, where Dynegy merged with and into Vistra, with Vistra continuing as the surviving corporation. 
The merger was intended to qualify as a tax-free reorganization under the IRC, so that none of Vistra, Dynegy or any of the Dynegy stockholders would recognize any gain or loss in the transaction, except that Dynegy stockholders could recognize a gain or loss with respect to cash received in lieu of fractional shares of Vistra's common stock.\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "Dywegy.Merger Transaction", "detailed_evidences": "", "evidence_pages": [15, 16], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "images": ["/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_7.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_8.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_9.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_10.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_11.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_12.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_13.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_14.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_15.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_16.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_17.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_18.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_19.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_20.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_21.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_22.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_23.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_24.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_25.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_26.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_27.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_28.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_29.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_30.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_31.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_32.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_33.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_34.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_35.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4127/4127644_36.png"], "pdf_path": "/data/oss_bucket_0/achao.dc/public_datasets/ccpdf_zip/4000-4999/4127644.pdf"} 3 | {"question_id": "extract_fig2tab_4151603_24_53_2", "doc_no": "4151603", "total_pages": 84, "start_end_idx": [24, 53], "question_type": "extract_fig2tab", "question": "List names of the figures at the page which contains a table whose 
name is \"Table 4.16. Stability database of digestate in Belgium\".", "answer": ["Table 4.15. Irish digestate samples tested for RBP and OUR for this study", "Figure 4.2. Relationship between RBP and OUR in Irish digestate samples(n=5)."], "detailed_evidences": "", "evidence_pages": [42], "evidence_sources": ["Table", "Figure"], "answer_format": "List", "task_tag": "Locating", "images": ["/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_23.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_24.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_25.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_26.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_27.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_28.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_29.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_30.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_31.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_32.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_33.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_34.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_35.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_36.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_37.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_38.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_39.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_40.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_41.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_42.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_43.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_44.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_45.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_46.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_47.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_48.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_49.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_50.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_51.png", "/data/oss_bucket_0/achao.dc/public_datasets/pdf_pngs/4000-4999/4151/4151603_52.png"], "pdf_path": "/data/oss_bucket_0/achao.dc/public_datasets/ccpdf_zip/4000-4999/4151603.pdf"} 4 | -------------------------------------------------------------------------------- /eval/__pycache__/utils_api.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dengc2023/LongDocURL/84cd1ac4d314f24719e5e5e8e093d336a8c39faf/eval/__pycache__/utils_api.cpython-311.pyc -------------------------------------------------------------------------------- /eval/api_models/__pycache__/model.cpython-311.pyc: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/dengc2023/LongDocURL/84cd1ac4d314f24719e5e5e8e093d336a8c39faf/eval/api_models/__pycache__/model.cpython-311.pyc -------------------------------------------------------------------------------- /eval/api_models/__pycache__/pure_ocr_utils.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dengc2023/LongDocURL/84cd1ac4d314f24719e5e5e8e093d336a8c39faf/eval/api_models/__pycache__/pure_ocr_utils.cpython-311.pyc -------------------------------------------------------------------------------- /eval/api_models/eval_api_models.py: -------------------------------------------------------------------------------- 1 | # TODO: 2 | import sys 3 | import pathlib 4 | sys.path.append(str(pathlib.Path(__file__).absolute().parent.parent.parent)) 5 | 6 | import argparse 7 | import os 8 | from io import BytesIO 9 | 10 | import oss2 11 | import base64 12 | import json 13 | from tqdm import tqdm 14 | import requests 15 | import re 16 | import time 17 | from multiprocessing import Pool 18 | import datetime 19 | from openai import OpenAI 20 | 21 | from eval.utils_api import * 22 | from utils.utils_score_v3 import * 23 | from model import Gemini15ProInferencer, GPT4oInferencer, QwenVLMaxInferencer, O1PreviewInferencer, QwenMaxInferencer 24 | from pure_ocr_utils import * 25 | 26 | system_prompt = "You are an expert in visual document question-answering, please answer our questions based on the given images.\n" 27 | 28 | # TODO 29 | project_prefix = "/mnt/workspace/Projects/CodeLib/LongDocURL/" 30 | 31 | config_file = os.path.join(project_prefix, "config/api_config.json") 32 | extractor_prompt_path = os.path.join(project_prefix, "eval/prompt_for_answer_extraction.md") 33 | 34 | with open(config_file, "r", encoding="utf-8") as rf: 35 | config = json.load(rf) 36 | client = OpenAI(api_key=config["gpt4o"]["access_key"], base_url=config["gpt4o"]["base_url"]) 37 | 38 | model_name2inferencer = {"gpt4o": "GPT4oInferencer", "gemini15_pro": "Gemini15ProInferencer", "qwen_vl_max": "QwenVLMaxInferencer", \ 39 | "o1_preview": "O1PreviewInferencer", "qwen_max": "QwenMaxInferencer"} 40 | 41 | prompt_sign = True 42 | 43 | def preprocess(input_datapath, output_datapath, image_prefix=None): 44 | dataset = read_jsonl_file(input_datapath) 45 | print("dataset cnt: ", len(dataset)) 46 | 47 | if os.path.exists(output_datapath): 48 | output_dataset = read_jsonl_file(output_datapath) 49 | dataset = delete_generate_dataset(dataset, output_dataset) 50 | 51 | if image_prefix is not None: 52 | for _ in dataset: 53 | for i, image_path in enumerate(_["images"]): 54 | _["images"][i] = os.path.join(image_prefix, "/".join(image_path.split("/")[-2:])) 55 | 56 | print("dataset cnt need to do: ", len(dataset)) 57 | 58 | return dataset 59 | 60 | def read_jsonl_file(file_path): 61 | data = [] 62 | with open(file_path, "r", encoding="utf-8") as jsonl_file: 63 | for i, line in enumerate(jsonl_file): 64 | data_dict = json.loads(line.strip()) 65 | if 'question_id' not in data_dict: 66 | data_dict['question_id'] = i 67 | data.append(data_dict) 68 | return data 69 | 70 | def call_llm(prompt, urls, temperature=0.1, seed=42, max_tokens=4096): 71 | msgs = get_msg_format(prompt, urls) 72 | response = None 73 | max_try = 2 74 | while response is None and max_try > 0: 75 | try: 76 | # TODO 77 | completion = client.chat.completions.create(model="gpt-4o-0513", messages=msgs, 
temperature=0.) 78 | # completion = client.chat.completions.create(model="gpt-4o", messages=msgs, temperature=0.) 79 | response = completion.choices[0].message.content 80 | except Exception as e: 81 | print(f"error with {e}, response = {response}") 82 | max_try -= 1 83 | response = None 84 | 85 | return response 86 | 87 | def delete_generate_dataset(dataset, output_dataset): 88 | finished_question_id_set = set([sample['question_id'] for sample in output_dataset]) 89 | unfinished_dataset = [sample for sample in dataset if sample['question_id'] not in finished_question_id_set] 90 | return unfinished_dataset 91 | 92 | def eval_per_record(args): 93 | print("--------------------------------------") 94 | case, output_datapath, model_name = args 95 | 96 | inferencer = eval(model_name2inferencer[model_name])() 97 | 98 | question = case["question"] 99 | prompt = system_prompt + "Following is our question: \n" + f"{question}" + "\n" 100 | 101 | result = inferencer.infer(prompt, case["images"]) 102 | 103 | if result is None: 104 | return 105 | 106 | # extract concise answer 107 | with open(extractor_prompt_path) as f: 108 | extractor_prompt = f.read() 109 | prompt = system_prompt + extractor_prompt + "\nQuestion: " + question + "\nAnalysis: " + result 110 | extractor_result = call_llm(prompt, None) 111 | try: 112 | import re 113 | concise_answer = re.findall(r"(.*?)", extractor_result, re.DOTALL)[0] 114 | answer_format = re.findall(r"(.*?)", extractor_result, re.DOTALL)[0] 115 | except: 116 | concise_answer = "Fail to extract" 117 | answer_format = "None" 118 | 119 | # calculate scores 120 | try: 121 | # pred_ans = eval(concise_answer) 122 | pred_ans = eval(concise_answer) if not isinstance(eval(concise_answer), set) else list(eval(concise_answer)) 123 | except: 124 | pred_ans = concise_answer 125 | if pred_ans == "Fail to extract": 126 | score_v3 = 0.0 127 | else: 128 | score_v3 = eval_score(case["answer"], pred_ans, case["answer_format"]) 129 | 130 | case["detailed_response"] = result 131 | case["pred"] = pred_ans 132 | case["score_v3"] = score_v3 133 | 134 | print("\n\n") 135 | print("Question: {}".format(case["question"])) 136 | print("Response: {}".format(case["pred"])) 137 | 138 | print("Gt: {}\tPred: {}\tScore_v3: {}".format(case["answer"], case["pred"], case["score_v3"])) 139 | 140 | if result is not None: # Check if result is not None 141 | try: # not json serialable 142 | with open(output_datapath, "a") as output_review_file: 143 | output_review_file.write(json.dumps(case, ensure_ascii=False) + "\n") 144 | except Exception as e: 145 | print("error: ", e) 146 | else: 147 | print("error") 148 | 149 | 150 | def evaluate(dataset, output_datapath, model_name="gpt4o", process_mode="serial", extra_infos=None): 151 | 152 | if os.path.exists(output_datapath): 153 | output_dataset = read_jsonl_file(output_datapath) 154 | dataset = delete_generate_dataset(dataset, output_dataset) 155 | 156 | print("dataset cnt: ", len(dataset)) 157 | if not len(dataset): 158 | return 159 | 160 | args_list = [] 161 | for case in dataset: 162 | args_list.append((case, output_datapath, model_name)) 163 | 164 | start_time = datetime.datetime.now() 165 | print("job start time:", start_time) 166 | 167 | if process_mode == "serial": 168 | for args in args_list: 169 | eval_per_record(args) 170 | elif process_mode == "parallel": 171 | with Pool(processes=8) as pool: # You can adjust the number of processes as needed 172 | list(tqdm(pool.imap(eval_per_record, args_list), total=len(args_list))) 173 | else: 174 | print("process 
mode error!") 175 | 176 | 177 | if __name__ == "__main__": 178 | parser = argparse.ArgumentParser() 179 | parser.add_argument('--qa_file', type=str, default="data/LongDocURL.jsonl") 180 | parser.add_argument('--results_file', type=str, default="evaluation_results/api_models/results_gpt4o.jsonl") 181 | parser.add_argument('--process_mode', type=str, default="serial") # serial/parallel 182 | # parser.add_argument('--input_format', type=str, default="e2e") # e2e/ocr 183 | parser.add_argument('--image_prefix', type=str, default="/mnt/achao/Downloads/pdf_pngs/4000-4999") 184 | parser.add_argument('--model_name', type=str, default="gpt4o") # gemini15_pro/claude35_sonnet/qwen_vl_max/gpt4o 185 | 186 | args = parser.parse_args() 187 | 188 | input_datapath = args.qa_file 189 | output_datapath = args.results_file 190 | 191 | # load data 192 | # dataset = preprocess(input_datapath, output_datapath) 193 | # if image paths are not modified in .jsonl file, add image prefix when executed 194 | dataset = preprocess(input_datapath, output_datapath, image_prefix=args.image_prefix) 195 | 196 | try_cnt = 2 197 | while try_cnt: 198 | try_cnt -= 1 199 | try: 200 | evaluate(dataset, output_datapath, model_name=args.model_name, process_mode=args.process_mode) 201 | except Exception as e: 202 | print(f"An error occurred: {e}") 203 | print("Restarting script...") 204 | time.sleep(1) 205 | 206 | acc, f1, = calculate_acc_and_f1(output_datapath) 207 | print("--------------------------------------") 208 | print("Avg acc: {}".format(acc)) 209 | print("Avg f1: {}".format(f1)) -------------------------------------------------------------------------------- /eval/api_models/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import base64 3 | from io import BytesIO 4 | from transformers import AutoModelForCausalLM, AutoTokenizer, Blip2Processor, Blip2ForConditionalGeneration, BitsAndBytesConfig 5 | from PIL import Image 6 | from abc import ABC, abstractmethod 7 | from openai import OpenAI 8 | import requests 9 | import os 10 | from typing import Union 11 | import oss2 12 | import json 13 | 14 | # TODO 15 | project_prefix = "/mnt/workspace/Projects/CodeLib/LongDocURL/" 16 | config_file = os.path.join(project_prefix, "config/api_config.json") 17 | 18 | 19 | class APIInferencer(ABC): 20 | def __init__(self): 21 | pass 22 | # uncomment if oss paths are used 23 | # self.bucket = self.get_alimama_oss_bucket() 24 | 25 | def get_alimama_oss_bucket(self): 26 | # TODO 27 | endpoint = '' 28 | access_key_id = '' 29 | access_key_secret = '' 30 | bucket_name = '' 31 | bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name) 32 | return bucket 33 | 34 | @abstractmethod 35 | def infer(self, prompt: str, image_path: str) -> str: 36 | pass 37 | 38 | def load_client(self): 39 | with open(config_file, "r", encoding="utf-8") as rf: 40 | config = json.load(rf) 41 | return OpenAI(api_key=config["gpt4o"]["access_key"], base_url=config["gpt4o"]["base_url"]) 42 | 43 | def cleanup(self): 44 | if hasattr(self, 'client'): 45 | del self.client 46 | 47 | def encode_image_to_base64(self, image_path: str) -> str: 48 | if 'https' in image_path: 49 | response = requests.get(image_path) 50 | img = BytesIO(response.content) 51 | return base64.b64encode(img.read()).decode('utf-8') 52 | 53 | if image_path.startswith('oss://'): 54 | return base64.b64encode(self.bucket.get_object(image_path[6:].split("/", 1)[1]).read()).decode("utf-8") 55 | 56 | with open(image_path, "rb") 
as image_file: 57 | return base64.b64encode(image_file.read()).decode('utf-8') 58 | 59 | def get_correct_response(self, model_name: str, prompt: str, image_path: Union[list, str]) -> str: 60 | response = self.model_chat(model_name, prompt, image_path) 61 | return response 62 | 63 | def model_chat(self, model_name: str, prompt: str, image_path: str) -> str: 64 | client = self.load_client() 65 | messages = [ 66 | { 67 | "role": "user", 68 | "content": self.build_message_content(prompt, image_path) 69 | } 70 | ] 71 | max_try = 2 72 | response = None 73 | while response is None and max_try > 0: 74 | try: 75 | completion = client.chat.completions.create(model=model_name, messages=messages, temperature=0.) 76 | response = completion.choices[0].message.content 77 | except Exception as e: 78 | print("exception: ", e) 79 | max_try -= 1 80 | return response 81 | 82 | def build_message_content(self, prompt: str, image_path: str): 83 | content = [{"type": "text", "text": prompt}] 84 | if image_path is None: 85 | return content 86 | if isinstance(image_path, str): 87 | image_paths = [image_path] 88 | elif isinstance(image_path, Union[list, tuple]): 89 | image_paths = image_path 90 | base64_images = [self.encode_image_to_base64(image_path) for image_path in image_paths] 91 | for i, base64_image in enumerate(base64_images): 92 | content += [ 93 | {"type": "text", "text": f"Below is the {i+1}-th image (total {len(base64_images)} images).\n"}, 94 | { 95 | "type": "image_url", 96 | "image_url": { 97 | "url": f"data:image/png;base64,{base64_image}" 98 | }, 99 | }, 100 | ] 101 | return content 102 | 103 | class QwenMaxInferencer(APIInferencer): 104 | def infer(self, prompt: str, image_path: str) -> str: 105 | response = self.get_correct_response('qwen_max', prompt, image_path) 106 | return response 107 | 108 | 109 | class O1PreviewInferencer(APIInferencer): 110 | def infer(self, prompt: str, image_path: str) -> str: 111 | response = self.get_correct_response('o1-preview-0912', prompt, image_path) 112 | return response 113 | 114 | class GPT4oInferencer(APIInferencer): 115 | def infer(self, prompt: str, image_path: str) -> str: 116 | response = self.get_correct_response('gpt-4o-0513', prompt, image_path) 117 | # response = self.get_correct_response('gpt-4o', prompt, image_path) 118 | return response 119 | 120 | class Gemini15ProInferencer(APIInferencer): 121 | def infer(self, prompt: str, image_path: str) -> str: 122 | response = self.get_correct_response('gemini-1.5-pro', prompt, image_path) 123 | return response 124 | 125 | 126 | class QwenVLMaxInferencer(APIInferencer): 127 | def infer(self, prompt: str, image_path: str) -> str: 128 | response = self.get_correct_response('qwen-vl-max', prompt, image_path) 129 | return response 130 | -------------------------------------------------------------------------------- /eval/api_models/pure_ocr_utils.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def record2text_with_layout(record): 5 | text = "" 6 | img_width, img_height = record['docInfo']['pages'][0]['imageWidth'], record['docInfo']['pages'][0]['imageHeight'] 7 | for item in record['layouts']: 8 | _type, sub_type = item['type'], item['subType'] 9 | item_text = item['text'] 10 | x1y1 = item['pos'][0]; x2y2 = item['pos'][2] 11 | text += f"(type: {_type}, sub_type: {sub_type}, box: {tuple(float(f'{_:.2f}') for _ in (x1y1['x']/img_width, x1y1['y']/img_height, x2y2['x']/img_width, x2y2['y']/img_height))})" + f" {item_text}\n" 12 | 13 | return text 14 | 
15 | def record2text(record): 16 | text = "" 17 | for item in record['layouts']: 18 | item_text = item['text'] 19 | text += f"{item_text}\n" 20 | 21 | return text 22 | 23 | 24 | def get_pure_ocr_prompt_docmind(doc_no: str, **kwargs): 25 | zip_no = doc_no[:4] 26 | json_path = "/mnt/achao/Downloads/pdf_jsons/{}/{}_docmind_results.json" 27 | record = json.load(open(json_path.format(zip_no, doc_no), "r", encoding="utf-8"))['contents'] 28 | ocr_text_template = "page_no: {}\n{}\n\n" 29 | 30 | start_page = kwargs.pop("start_page", 0); end_page = kwargs.pop("end_page", start_page+1) 31 | if "extra_infos" in kwargs and "with_layout" in kwargs["extra_infos"] and kwargs["extra_infos"]["with_layout"]: 32 | ocr_texts = [record2text_with_layout(record[f"page_{idx}"]) for idx in range(start_page, end_page+1) if f"page_{idx}" in record] 33 | else: 34 | ocr_texts = [record2text(record[f"page_{idx}"]) for idx in range(start_page, end_page+1) if f"page_{idx}" in record] 35 | pages_used = end_page - start_page + 1 36 | print("number of pages used: ", end_page - start_page + 1) 37 | 38 | ocr_prompt = "\n\n" 39 | for page_no, ocr_text in zip(range(start_page, end_page+1), ocr_texts): 40 | ocr_prompt += ocr_text_template.format(page_no+1, ocr_text) # why page_no+1? 41 | 42 | return ocr_prompt 43 | 44 | def get_pure_ocr_prompt_pymupdf(doc_no: str, **kwargs): 45 | zip_no = doc_no[:4] 46 | json_path = "/mnt/achao/Downloads/pdf_jsons/{}/{}_line_level.json" 47 | ocr_texts_doc = json.load(open(json_path.format(zip_no, doc_no), "r", encoding="utf-8"))["pages_str"] # list 48 | ocr_text_template = "page_no: {}\n{}\n\n" 49 | 50 | start_page = kwargs.pop("start_page", 0); end_page = kwargs.pop("end_page", start_page+1) 51 | 52 | pages_used = end_page - start_page + 1 53 | print("number of pages used: ", end_page - start_page + 1) 54 | 55 | ocr_prompt = "\n\n" 56 | for page_no, ocr_texts_page in zip(range(start_page, end_page+1), ocr_texts_doc): 57 | ocr_prompt += ocr_text_template.format(page_no+1, ocr_texts_page) # why page_no+1? 
58 | 59 | return ocr_prompt -------------------------------------------------------------------------------- /eval/extract_concise_answer_from_vllm.py: -------------------------------------------------------------------------------- 1 | # # TODO: 2 | # import sys 3 | # sys.path.append("/root_dir/LongDocURL/") 4 | 5 | import argparse 6 | import os 7 | from io import BytesIO 8 | 9 | import oss2 10 | import base64 11 | import json 12 | from tqdm import tqdm 13 | import requests 14 | import re 15 | import time 16 | from multiprocessing import Pool 17 | import datetime 18 | from openai import OpenAI 19 | 20 | from eval.utils_api import * 21 | from utils.utils_score_v3 import * 22 | 23 | system_prompt = "You are an expert in visual document question-answering, please answer our questions based on the given images.\n" 24 | 25 | # TODO 26 | project_prefix = "/root_dir/LongDocURL/" 27 | config_file = os.path.join(project_prefix, "config/api_config.json") 28 | extractor_prompt_path = os.path.join(project_prefix, "eval/prompt_for_answer_extraction.md") 29 | 30 | with open(config_file, "r", encoding="utf-8") as rf: 31 | config = json.load(rf) 32 | client = OpenAI(api_key=config["gpt4o"]["access_key"], base_url=config["gpt4o"]["base_url"]) 33 | 34 | 35 | def call_llm(prompt, urls, temperature=0.1, seed=42, max_tokens=4096): 36 | msgs = get_msg_format(prompt, urls) 37 | response = None 38 | max_try = 6 39 | while response is None and max_try > 0: 40 | try: 41 | completion = client.chat.completions.create(model="gpt-4o-0513", messages=msgs, temperature=0.) 42 | response = completion.choices[0].message.content 43 | except Exception as e: 44 | print(f"error with {e}, response = {response}") 45 | max_try -= 1 46 | response = None 47 | 48 | return response 49 | 50 | 51 | def delete_generated_dataset(records, output_datapath): 52 | finished_sample_ids = set([json.loads(_.strip())["question_id"] for _ in open(output_datapath, "r", encoding="utf-8").readlines()]) if os.path.exists(output_datapath) else set() 53 | return [_ for _ in records if _["question_id"] not in finished_sample_ids] 54 | 55 | 56 | def extract_per_record(args): 57 | case, result, output_datapath = args 58 | question = case["question"] 59 | print(case["question_id"]) 60 | 61 | # extract concise answer 62 | with open(extractor_prompt_path) as f: 63 | extractor_prompt = f.read() 64 | prompt = system_prompt + extractor_prompt + "\nQuestion: " + question + "\nAnalysis: " + result 65 | extractor_result = call_llm(prompt, None) 66 | try: 67 | import re 68 | concise_answer = re.findall(r"(.*?)", extractor_result, re.DOTALL)[0] 69 | answer_format = re.findall(r"(.*?)", extractor_result, re.DOTALL)[0] 70 | except: 71 | concise_answer = "Fail to extract" 72 | answer_format = "None" 73 | 74 | # calculate scores 75 | try: 76 | # pred_ans = eval(concise_answer) 77 | pred_ans = eval(concise_answer) if not isinstance(eval(concise_answer), set) else list(eval(concise_answer)) 78 | except: 79 | pred_ans = concise_answer 80 | if pred_ans == "Fail to extract": 81 | score_v3 = 0.0 82 | else: 83 | score_v3 = eval_score(case["answer"], pred_ans, case["answer_format"]) 84 | 85 | case["pred"] = pred_ans 86 | case["score_v3"] = score_v3 87 | 88 | print("\n\n") 89 | print("Question: {}".format(case["question"])) 90 | print("Response: {}".format(case["pred"])) 91 | print("Gt: {}\tPred: {}\tScore_v3: {}".format(case["answer"], case["pred"], case["score_v3"])) 92 | 93 | try: 94 | with open(output_datapath, "a") as output_review_file: 95 | 
output_review_file.write(json.dumps(case, ensure_ascii=False) + "\n") 96 | except Exception as e: 97 | print("error: ", e) 98 | print("error: ", case["question_id"]) 99 | 100 | 101 | 102 | def extract_answers(records, output_datapath): 103 | 104 | records = delete_generated_dataset(records, output_datapath) 105 | args_list = [] 106 | for record in records: 107 | args_list.append((record, record["detailed_response"], output_datapath)) 108 | 109 | with Pool(processes=1) as pool: 110 | list(tqdm(pool.imap(extract_per_record, args_list), total=len(args_list))) 111 | 112 | 113 | # def run_test(): 114 | # # case = {} 115 | # question = case["question"] 116 | # result = case["detailed_response"] 117 | # output_datapath = "" 118 | # args = (case, question, result, output_datapath) 119 | # extract_per_record(args) 120 | 121 | 122 | if __name__ == "__main__": 123 | parser = argparse.ArgumentParser() 124 | parser.add_argument("--qa_file", type=str, default="./evaluation_results/api_models/results_detailed.jsonl") 125 | parser.add_argument("--results_file", type=str, default="./evaluation_results/api_models/results_extracted.jsonl") 126 | args = parser.parse_args() 127 | 128 | with open(args.qa_file, "r", encoding="utf-8") as rf: 129 | records = [json.loads(_.strip()) for i, _ in enumerate(rf.readlines())] 130 | 131 | extract_answers(records, args.results_file) 132 | 133 | # run_test() 134 | 135 | 136 | -------------------------------------------------------------------------------- /eval/prompt_for_answer_extraction.md: -------------------------------------------------------------------------------- 1 | Given the question and analysis, you are tasked to extract answers with required formats from the free-form analysis. 2 | - Your extracted answers should be one of the following formats: (1) Integer, (2) Float, (3) String and (4) List. If you find the analysis the question can not be answered from the given documents, type "Not answerable". Exception: If the analysis only tells you that it can not read/understand the images or documents, type "Fail to answer". 3 | - Answers may not meet visibility requirements within questions, such as, one question requires to find answers 'between 7-th to 20-th images' but invisible information in 'image 23' is included in analysis. So, these invisible components should be deprecated and removed from final answers. 4 | - Please use and tokens at the start and end of the extracted answer. For example, if the extracted answer is number 3, the format is 3. 5 | - Please use and tokens at the start and end of the answer format. For example, if the answer format is List, the format is List. 6 | - Please make your response as concise as possible. Also note that your response should be formatted as below: 7 | ``` 8 | Extracted answer: [answer] 9 | Answer format: [answer format] 10 | ``` 11 | 12 | Please read the following example, then extract the answer from the model response and type it at the end of the prompt. 13 | 14 | --- 15 | Question: List the primary questions asked about the services in this report. 16 | Analysis: The primary questions asked about the services in the report for The Limes Residential Home are: \n\n1. Is the service safe? \n\n2. Is the service effective? \n\n3. Is the service caring? \n\n4. Is the service responsive? \n\n5. Is the service well-led? 
17 | Extracted answer: ['Is the service safe?', 'Is the service effective?', 'Is the service caring?', 'Is the service responsive?', 'Is the service well-led?'] 18 | Answer format: List 19 | 20 | --- 21 | Question: How many regulations of the HSCA 2008 are breached in all according to this report? 22 | Analysis: According to the report, the provider breached 10 Health and Social Care Act 2008 (Regulated Activities) Regulations in total. Here are the specifics: 23 | 24 | 1. Regulation 13: Safeguarding service users from abuse and improper treatment 25 | 26 | 2. Regulation 12: Safe care and treatment 27 | 28 | 3. Regulation 18: Staffing 29 | 30 | 4. Regulation 11: Need for consent 31 | 32 | 5. Regulation 10: Dignity and respect 33 | 34 | 6. Regulation 9: Person-centred care 35 | 36 | 7. Regulation 17: Good governance 37 | 38 | 8. Regulation 18 (CQC Registration Regulations 2009): Notification of other incidents 39 | 40 | 9. Regulation 18: Failure to maintain an accurate and up-to-date care plan 41 | 42 | 10. Regulation 11: Failure to implement the Mental Capacity Act 2005 code of practice effectively 43 | 44 | These breaches involve issues concerning staffing, safeguarding, medicines management, dignity and respect, consent, care planning, governance, and failure to notify the CQC of incidents. 45 | Extracted answer: 10 46 | Answer format: Integer 47 | 48 | --- 49 | Question: According to the survey, what is the percentage of Chinese who are paying more or about the same attention to politics after Trump’s election? 50 | Analysis: The survey provided does not specify the percentage of Chinese individuals specifically who are paying more or about the same attention to politics after Trump’s election. The report focuses primarily on American demographics and does not include specific details about the Chinese population in relation to this question. If you need information about a different demographic or a summary of the findings from the American demographic, I can certainly help with that! 51 | Extracted answer: Not answerable 52 | Answer format: None 53 | 54 | --- 55 | Question: How many quotations from male respondents over 50 years old are included in this report? 56 | Analysis: The image you’ve provided appears to be a screenshot of a document with multiple charts. However, the text is too small and blurry to read accurately. If you can provide a clearer image or more context, I might be able to help you with your question. 
57 | Extracted answer: Fail to answer 58 | Answer format: None 59 | 60 | --- -------------------------------------------------------------------------------- /eval/utils_api.py: -------------------------------------------------------------------------------- 1 | import os 2 | from io import BytesIO 3 | import oss2 4 | import base64 5 | import json 6 | import requests 7 | import re 8 | from typing import Union 9 | 10 | # TODO 11 | def get_alimama_oss_bucket(): 12 | endpoint = '' 13 | access_key_id = '' 14 | access_key_secret = '' 15 | bucket_name = '' 16 | bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name) 17 | return bucket 18 | 19 | # # uncomment if oss paths are used 20 | # bucket = get_alimama_oss_bucket() 21 | 22 | 23 | def encode_image_to_base64(image_path): 24 | if 'https' in image_path: 25 | response = requests.get(image_path) 26 | img = BytesIO(response.content) 27 | return base64.b64encode(img.read()).decode('utf-8') 28 | 29 | if image_path.startswith('oss://'): 30 | return base64.b64encode(bucket.get_object(image_path[6:].split("/", 1)[1]).read()).decode("utf-8") 31 | 32 | with open(image_path, "rb") as image_file: 33 | return base64.b64encode(image_file.read()).decode('utf-8') 34 | 35 | 36 | def get_msg_format(prompt, img_urls): 37 | content = [{"type": "text", "text": prompt}] 38 | if img_urls is not None: 39 | if isinstance(img_urls, str): 40 | base64_images = [encode_image_to_base64(img_urls)] 41 | elif isinstance(img_urls, Union[list, tuple]): 42 | base64_images = [encode_image_to_base64(img_url) for img_url in img_urls] 43 | 44 | for i, base64_image in enumerate(base64_images): 45 | content += [ 46 | {"type": "text", "text": f"Below is the {i+1}-th image (total {len(base64_images)} images).\n"}, 47 | { 48 | "type": "image_url", 49 | "image_url": { 50 | "url": f"data:image/png;base64,{base64_image}" 51 | }, 52 | }, 53 | ] 54 | messages = [ 55 | { 56 | "role": "user", 57 | "content": content 58 | } 59 | ] 60 | return messages 61 | 62 | 63 | -------------------------------------------------------------------------------- /evaluation_results/api_models/results_gemini15_pro.jsonl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dengc2023/LongDocURL/84cd1ac4d314f24719e5e5e8e093d336a8c39faf/evaluation_results/api_models/results_gemini15_pro.jsonl -------------------------------------------------------------------------------- /evaluation_results/api_models/results_gpt4o.jsonl: -------------------------------------------------------------------------------- 1 | {"question_id": "free_gemini15_pro_4055915_42_71_8", "doc_no": "4055915", "total_pages": 84, "start_end_idx": [42, 71], "question_type": "calculate", "question": "How many judging criterias are there for judging the homemade chocolate cake competition?", "answer": 5, "detailed_evidences": "The judging criteria for the Homemade Chocolate Cake Contest are listed on page 54, (0.08, 0.26, 0.92, 0.52). 
The criteria include flavor (30 points), texture (25 points), inside characteristics (20 points), outside characteristics (15 points), and frosting (10 points).", "evidence_pages": [54], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "images": ["/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_41.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_42.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_43.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_44.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_45.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_46.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_47.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_48.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_49.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_50.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_51.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_52.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_53.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_54.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_55.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_56.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_57.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_58.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_59.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_60.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_61.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_62.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_63.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_64.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_65.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_66.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_67.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_68.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_69.png", "/mnt/achao/Downloads/pdf_pngs/4000-4999/4055/4055915_70.png"], "pdf_path": "/data/oss_bucket_0/achao.dc/public_datasets/ccpdf_zip/4000-4999/4055915.pdf", "detailed_response": "There are five judging criteria for the homemade chocolate cake competition. They are:\n\n1. Flavor (aroma, taste, good balance of flavorings) - 30 points\n2. Texture (moist and tender crumb; not soggy or dry) - 25 points\n3. Inside Characteristics (even grain, evenly baked, free from air pockets) - 20 points\n4. Outside Characteristics (consistent shape/size/surface; overall appeal) - 15 points\n5. 
Frosting (taste, texture, even color) - 10 points", "pred": 5, "score_v3": 1.0} 2 | -------------------------------------------------------------------------------- /evaluation_results/scores_sample_fine_grained.json: -------------------------------------------------------------------------------- 1 | { 2 | "scores":{ 3 | "Main_Task":{ 4 | "Understanding": 0.0, 5 | "Reasoning": 0.0, 6 | "Locating": 0.0 7 | }, 8 | "Element_Type":{ 9 | "Text": 0.0, 10 | "Layout": 0.0, 11 | "Figure": 0.0, 12 | "Table": 0.0 13 | }, 14 | "Evidence_Pages":{ 15 | "Single_Page": 0.0, 16 | "Multi_Page": 0.0 17 | }, 18 | "Num_of_Element_Types":{ 19 | "Cross_Element": 0.0 20 | }, 21 | "Fine_Grained":{ 22 | "Understanding":{ 23 | "Single_Page":{ 24 | "Text": 0.0, 25 | "Layout": 0.0, 26 | "Figure": 0.0, 27 | "Table": 0.0 28 | }, 29 | "Multi_Page":{ 30 | "Text": 0.0, 31 | "Layout": 0.0, 32 | "Figure": 0.0, 33 | "Table": 0.0 34 | } 35 | }, 36 | "Reasoning":{ 37 | "Single_Page":{ 38 | "Text": 0.0, 39 | "Layout": 0.0, 40 | "Figure": 0.0, 41 | "Table": 0.0 42 | }, 43 | "Multi_Page":{ 44 | "Text": 0.0, 45 | "Layout": 0.0, 46 | "Figure": 0.0, 47 | "Table": 0.0 48 | } 49 | }, 50 | "Locating":{ 51 | "Cross_Element":{ 52 | "Cross_Title": 0.0, 53 | "Cross_Table": 0.0, 54 | "Para_Title": 0.0, 55 | "Figure_Table": 0.0 56 | } 57 | } 58 | } 59 | }, 60 | "sample_cnt":{ 61 | "Main_Task":{ 62 | "Understanding": 1243, 63 | "Reasoning": 387, 64 | "Locating": 695 65 | }, 66 | "Element_Type":{ 67 | "Text": 994, 68 | "Layout": 779, 69 | "Figure": 556, 70 | "Table": 871 71 | }, 72 | "Evidence_Pages":{ 73 | "Single_Page": 1093, 74 | "Multi_Page": 1230 75 | }, 76 | "Num_of_Element_Types":{ 77 | "Cross_Element": 862 78 | }, 79 | "Fine_Grained":{ 80 | "Understanding":{ 81 | "Single_Page":{ 82 | "Text": 259, 83 | "Layout": 91, 84 | "Figure": 94, 85 | "Table": 263 86 | }, 87 | "Multi_Page":{ 88 | "Text": 443, 89 | "Layout": 172, 90 | "Figure": 174, 91 | "Table": 115 92 | } 93 | }, 94 | "Reasoning":{ 95 | "Single_Page":{ 96 | "Text": 40, 97 | "Layout": 12, 98 | "Figure": 28, 99 | "Table": 98 100 | }, 101 | "Multi_Page":{ 102 | "Text": 115, 103 | "Layout": 40, 104 | "Figure": 85, 105 | "Table": 69 106 | } 107 | }, 108 | "Locating":{ 109 | "Cross_Element":{ 110 | "Cross_Title": 201, 111 | "Cross_Table": 126, 112 | "Para_Title": 137, 113 | "Figure_Table": 231 114 | } 115 | } 116 | } 117 | } 118 | 119 | } -------------------------------------------------------------------------------- /examples/example_docmind.json: -------------------------------------------------------------------------------- 1 | { 2 | "record_id": "4020103_1", 3 | "data": { 4 | "logics": { 5 | "docTree": [ 6 | { 7 | "backlink": { 8 | "上级": [ 9 | "ROOT" 10 | ] 11 | }, 12 | "level": 0, 13 | "link": { 14 | "下级": [], 15 | "包含": [] 16 | }, 17 | "uniqueId": "a73142f82a0bf25f1f9a033332428d81" 18 | }, 19 | { 20 | "backlink": { 21 | "上级": [ 22 | "ROOT" 23 | ] 24 | }, 25 | "level": 0, 26 | "link": { 27 | "下级": [], 28 | "包含": [] 29 | }, 30 | "uniqueId": "bc31341899f848a372c4e5312d15ffcf" 31 | }, 32 | { 33 | "backlink": { 34 | "上级": [ 35 | "ROOT" 36 | ] 37 | }, 38 | "level": 0, 39 | "link": { 40 | "下级": [], 41 | "包含": [] 42 | }, 43 | "uniqueId": "309eed9853f2ff2d0d548760e8661858" 44 | }, 45 | { 46 | "backlink": { 47 | "上级": [ 48 | "bc31341899f848a372c4e5312d15ffcf" 49 | ] 50 | }, 51 | "level": 1, 52 | "link": { 53 | "下级": [], 54 | "包含": [] 55 | }, 56 | "uniqueId": "647a777b4a1fa759eff5d8f8a76720a3" 57 | }, 58 | { 59 | "backlink": { 60 | "上级": [ 61 | "bc31341899f848a372c4e5312d15ffcf" 62 | ] 63 
| }, 64 | "level": 1, 65 | "link": { 66 | "下级": [], 67 | "包含": [] 68 | }, 69 | "uniqueId": "dcce8b595a53d275c769634a69cfa2b0" 70 | }, 71 | { 72 | "backlink": { 73 | "上级": [ 74 | "bc31341899f848a372c4e5312d15ffcf" 75 | ] 76 | }, 77 | "level": 1, 78 | "link": { 79 | "下级": [], 80 | "包含": [] 81 | }, 82 | "uniqueId": "16f675dd06680c65c3d88ace0635389e" 83 | }, 84 | { 85 | "backlink": { 86 | "上级": [ 87 | "bc31341899f848a372c4e5312d15ffcf" 88 | ] 89 | }, 90 | "level": 1, 91 | "link": { 92 | "下级": [], 93 | "包含": [] 94 | }, 95 | "uniqueId": "4aafb975055e208a637931dc387e5040" 96 | } 97 | ], 98 | "paragraphKVs": [] 99 | }, 100 | "docInfo": { 101 | "pages": [ 102 | { 103 | "imageWidth": 432, 104 | "fileName": null, 105 | "sheetName": "", 106 | "fileIndex": 0, 107 | "pageIdCurDoc": 0, 108 | "source": null, 109 | "pdfParseResult": null, 110 | "imageHeight": 648, 111 | "isBlankPage": null, 112 | "sourceFileType": "cover", 113 | "imageUrl": "http://docmind-api-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/publicDocStructure/docmind-20240827-e89d28a1fb2d49639f0b4a1a613138ab/0.png?Expires=1724793864&OSSAccessKeyId=LTAI5tQL9bqLHC5HYKV68DA9&Signature=dxBVmKA3KZQR421uzb8dW1sG8Pw%3D", 114 | "angle": 0.0, 115 | "sourceImagePath": null, 116 | "excelParseResult": null, 117 | "fileUrl": null, 118 | "wordParseResult": null, 119 | "imageType": "JPEG", 120 | "imageStorageKey": "docmind-20240827-e89d28a1fb2d49639f0b4a1a613138ab/0.png", 121 | "pageIdAllDocs": 0, 122 | "imageParseResult": "http://docmind-api-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/ocr?Expires=1724793864&OSSAccessKeyId=LTAI5tQL9bqLHC5HYKV68DA9&Signature=1PrwDsB9eWxmvmz90%2FwqRIDfsxc%3D" 123 | } 124 | ], 125 | "docType": "png", 126 | "pageCountEstimate": 1 127 | }, 128 | "styles": [ 129 | { 130 | "fontName": "Verdana", 131 | "charScale": 1.0, 132 | "color": "000000", 133 | "underline": false, 134 | "styleId": 0, 135 | "deleteLine": false, 136 | "fontSize": 12, 137 | "bold": true, 138 | "italic": false 139 | }, 140 | { 141 | "fontName": "微软雅黑", 142 | "charScale": 1.0, 143 | "color": "000000", 144 | "underline": false, 145 | "styleId": 1, 146 | "deleteLine": false, 147 | "fontSize": 58, 148 | "bold": true, 149 | "italic": false 150 | }, 151 | { 152 | "fontName": "微软雅黑", 153 | "charScale": 1.0, 154 | "color": "000000", 155 | "underline": false, 156 | "styleId": 2, 157 | "deleteLine": false, 158 | "fontSize": 25, 159 | "bold": true, 160 | "italic": false 161 | }, 162 | { 163 | "fontName": "微软雅黑", 164 | "charScale": 1.0, 165 | "color": "000000", 166 | "underline": false, 167 | "styleId": 3, 168 | "deleteLine": false, 169 | "fontSize": 55, 170 | "bold": true, 171 | "italic": false 172 | }, 173 | { 174 | "fontName": "微软雅黑", 175 | "charScale": 1.0, 176 | "color": "000000", 177 | "underline": false, 178 | "styleId": 4, 179 | "deleteLine": false, 180 | "fontSize": 49, 181 | "bold": true, 182 | "italic": false 183 | }, 184 | { 185 | "fontName": "微软雅黑", 186 | "charScale": 1.0, 187 | "color": "000000", 188 | "underline": false, 189 | "styleId": 5, 190 | "deleteLine": false, 191 | "fontSize": 24, 192 | "bold": true, 193 | "italic": false 194 | }, 195 | { 196 | "fontName": "微软雅黑", 197 | "charScale": 1.0, 198 | "color": "000000", 199 | "underline": false, 200 | "styleId": 6, 201 | "deleteLine": false, 202 | "fontSize": 21, 203 | "bold": true, 204 | "italic": false 205 | }, 206 | { 207 | "fontName": "Arial_Narrow", 208 | "charScale": 1.0, 209 | "color": "000000", 210 | "underline": false, 211 | "styleId": 7, 212 | "deleteLine": false, 213 | "fontSize": 15, 214 | "bold": true, 
215 | "italic": false 216 | }, 217 | { 218 | "fontName": "Verdana", 219 | "charScale": 1.0, 220 | "color": "000000", 221 | "underline": false, 222 | "styleId": 8, 223 | "deleteLine": false, 224 | "fontSize": 11, 225 | "bold": true, 226 | "italic": false 227 | }, 228 | { 229 | "fontName": "Verdana", 230 | "charScale": 1.0, 231 | "color": "000000", 232 | "underline": false, 233 | "styleId": 9, 234 | "deleteLine": false, 235 | "fontSize": 13, 236 | "bold": true, 237 | "italic": false 238 | }, 239 | { 240 | "fontName": "微软雅黑", 241 | "charScale": 1.0, 242 | "color": "000000", 243 | "underline": false, 244 | "styleId": 10, 245 | "deleteLine": false, 246 | "fontSize": 13, 247 | "bold": true, 248 | "italic": false 249 | }, 250 | { 251 | "fontName": "Arial_Narrow", 252 | "charScale": 1.0, 253 | "color": "000000", 254 | "underline": false, 255 | "styleId": 11, 256 | "deleteLine": false, 257 | "fontSize": 17, 258 | "bold": true, 259 | "italic": false 260 | }, 261 | { 262 | "fontName": "微软雅黑", 263 | "charScale": 1.0, 264 | "color": "000000", 265 | "underline": false, 266 | "styleId": 12, 267 | "deleteLine": false, 268 | "fontSize": 17, 269 | "bold": true, 270 | "italic": false 271 | }, 272 | { 273 | "fontName": "Verdana", 274 | "charScale": 1.0, 275 | "color": "000000", 276 | "underline": false, 277 | "styleId": 13, 278 | "deleteLine": false, 279 | "fontSize": 15, 280 | "bold": true, 281 | "italic": false 282 | }, 283 | { 284 | "fontName": "Times_New_Roman", 285 | "charScale": 1.0, 286 | "color": "000000", 287 | "underline": false, 288 | "styleId": 14, 289 | "deleteLine": false, 290 | "fontSize": 13, 291 | "bold": true, 292 | "italic": false 293 | }, 294 | { 295 | "fontName": "Times_New_Roman", 296 | "charScale": 1.0, 297 | "color": "000000", 298 | "underline": false, 299 | "styleId": 15, 300 | "deleteLine": false, 301 | "fontSize": 10, 302 | "bold": true, 303 | "italic": false 304 | }, 305 | { 306 | "fontName": "Arial_Narrow", 307 | "charScale": 1.0, 308 | "color": "000000", 309 | "underline": false, 310 | "styleId": 16, 311 | "deleteLine": false, 312 | "fontSize": 13, 313 | "bold": true, 314 | "italic": false 315 | } 316 | ], 317 | "layouts": [ 318 | { 319 | "firstLinesChars": 0, 320 | "pos": [ 321 | { 322 | "x": 9, 323 | "y": 3 324 | }, 325 | { 326 | "x": 331, 327 | "y": 3 328 | }, 329 | { 330 | "x": 331, 331 | "y": 12 332 | }, 333 | { 334 | "x": 9, 335 | "y": 12 336 | } 337 | ], 338 | "blocks": [ 339 | { 340 | "pos": [ 341 | { 342 | "x": 7, 343 | "y": 1 344 | }, 345 | { 346 | "x": 331, 347 | "y": 1 348 | }, 349 | { 350 | "x": 331, 351 | "y": 13 352 | }, 353 | { 354 | "x": 7, 355 | "y": 13 356 | } 357 | ], 358 | "styleId": 0, 359 | "text": "Innovations in the Food System: Exploring the Future of Fcod: Proceedings of a Workshop" 360 | } 361 | ], 362 | "index": 0, 363 | "subType": "page_header", 364 | "lineHeight": 0, 365 | "text": "Innovations in the Food System: Exploring the Future of Fcod: Proceedings of a Workshop", 366 | "alignment": "center", 367 | "type": "head", 368 | "pageNum": [ 369 | 0 370 | ], 371 | "uniqueId": "16b4f10a0114aae6d5a47d8b78259659" 372 | }, 373 | { 374 | "firstLinesChars": 0, 375 | "pos": [ 376 | { 377 | "x": 119, 378 | "y": 64 379 | }, 380 | { 381 | "x": 314, 382 | "y": 64 383 | }, 384 | { 385 | "x": 314, 386 | "y": 190 387 | }, 388 | { 389 | "x": 119, 390 | "y": 190 391 | } 392 | ], 393 | "blocks": [ 394 | { 395 | "pos": [ 396 | { 397 | "x": 119, 398 | "y": 63 399 | }, 400 | { 401 | "x": 311, 402 | "y": 63 403 | }, 404 | { 405 | "x": 311, 406 | "y": 109 407 | }, 408 | { 
409 | "x": 119, 410 | "y": 109 411 | } 412 | ], 413 | "styleId": 1, 414 | "text": "INNOVATIONS" 415 | }, 416 | { 417 | "pos": [ 418 | { 419 | "x": 199, 420 | "y": 109 421 | }, 422 | { 423 | "x": 231, 424 | "y": 109 425 | }, 426 | { 427 | "x": 231, 428 | "y": 124 429 | }, 430 | { 431 | "x": 199, 432 | "y": 124 433 | } 434 | ], 435 | "styleId": 2, 436 | "text": " in the" 437 | }, 438 | { 439 | "pos": [ 440 | { 441 | "x": 114, 442 | "y": 128 443 | }, 444 | { 445 | "x": 195, 446 | "y": 128 447 | }, 448 | { 449 | "x": 195, 450 | "y": 170 451 | }, 452 | { 453 | "x": 114, 454 | "y": 170 455 | } 456 | ], 457 | "styleId": 3, 458 | "text": "FOOD" 459 | }, 460 | { 461 | "pos": [ 462 | { 463 | "x": 195, 464 | "y": 127 465 | }, 466 | { 467 | "x": 317, 468 | "y": 127 469 | }, 470 | { 471 | "x": 317, 472 | "y": 170 473 | }, 474 | { 475 | "x": 195, 476 | "y": 170 477 | } 478 | ], 479 | "styleId": 4, 480 | "text": "SYSTEM" 481 | }, 482 | { 483 | "pos": [ 484 | { 485 | "x": 117, 486 | "y": 175 487 | }, 488 | { 489 | "x": 314, 490 | "y": 175 491 | }, 492 | { 493 | "x": 314, 494 | "y": 195 495 | }, 496 | { 497 | "x": 117, 498 | "y": 195 499 | } 500 | ], 501 | "styleId": 5, 502 | "text": " EXPLORING THE FUTURE OF FOOD" 503 | } 504 | ], 505 | "index": 1, 506 | "subType": "none", 507 | "lineHeight": 3, 508 | "text": "INNOVATIONS in the FOODSYSTEM EXPLORING THE FUTURE OF FOOD", 509 | "alignment": "center", 510 | "type": "title", 511 | "pageNum": [ 512 | 0 513 | ], 514 | "uniqueId": "a73142f82a0bf25f1f9a033332428d81" 515 | }, 516 | { 517 | "firstLinesChars": 0, 518 | "pos": [ 519 | { 520 | "x": 112, 521 | "y": 220 522 | }, 523 | { 524 | "x": 320, 525 | "y": 220 526 | }, 527 | { 528 | "x": 320, 529 | "y": 234 530 | }, 531 | { 532 | "x": 112, 533 | "y": 234 534 | } 535 | ], 536 | "blocks": [ 537 | { 538 | "pos": [ 539 | { 540 | "x": 110, 541 | "y": 218 542 | }, 543 | { 544 | "x": 323, 545 | "y": 218 546 | }, 547 | { 548 | "x": 323, 549 | "y": 236 550 | }, 551 | { 552 | "x": 110, 553 | "y": 236 554 | } 555 | ], 556 | "styleId": 6, 557 | "text": "PROCEEDINGS OF A WORKSHOP" 558 | } 559 | ], 560 | "index": 2, 561 | "subType": "none", 562 | "lineHeight": 0, 563 | "text": "PROCEEDINGS OF A WORKSHOP", 564 | "alignment": "center", 565 | "type": "text", 566 | "pageNum": [ 567 | 0 568 | ], 569 | "uniqueId": "bc31341899f848a372c4e5312d15ffcf" 570 | }, 571 | { 572 | "firstLinesChars": 0, 573 | "pos": [ 574 | { 575 | "x": 133, 576 | "y": 288 577 | }, 578 | { 579 | "x": 298, 580 | "y": 288 581 | }, 582 | { 583 | "x": 298, 584 | "y": 322 585 | }, 586 | { 587 | "x": 133, 588 | "y": 322 589 | } 590 | ], 591 | "blocks": [ 592 | { 593 | "pos": [ 594 | { 595 | "x": 134, 596 | "y": 287 597 | }, 598 | { 599 | "x": 298, 600 | "y": 287 601 | }, 602 | { 603 | "x": 298, 604 | "y": 299 605 | }, 606 | { 607 | "x": 134, 608 | "y": 299 609 | } 610 | ], 611 | "styleId": 7, 612 | "text": "Melissa Maitin-Shepard, Rapporteur" 613 | }, 614 | { 615 | "pos": [ 616 | { 617 | "x": 187, 618 | "y": 312 619 | }, 620 | { 621 | "x": 243, 622 | "y": 312 623 | }, 624 | { 625 | "x": 243, 626 | "y": 322 627 | }, 628 | { 629 | "x": 187, 630 | "y": 322 631 | } 632 | ], 633 | "styleId": 8, 634 | "text": " Food Forum" 635 | } 636 | ], 637 | "index": 3, 638 | "subType": "none", 639 | "lineHeight": 15, 640 | "text": "Melissa Maitin-Shepard, Rapporteur Food Forum", 641 | "alignment": "center", 642 | "type": "text", 643 | "pageNum": [ 644 | 0 645 | ], 646 | "uniqueId": "309eed9853f2ff2d0d548760e8661858" 647 | }, 648 | { 649 | "firstLinesChars": 0, 650 | "pos": [ 651 | { 
652 | "x": 157, 653 | "y": 336 654 | }, 655 | { 656 | "x": 274, 657 | "y": 336 658 | }, 659 | { 660 | "x": 274, 661 | "y": 346 662 | }, 663 | { 664 | "x": 157, 665 | "y": 346 666 | } 667 | ], 668 | "blocks": [ 669 | { 670 | "pos": [ 671 | { 672 | "x": 157, 673 | "y": 336 674 | }, 675 | { 676 | "x": 274, 677 | "y": 336 678 | }, 679 | { 680 | "x": 274, 681 | "y": 346 682 | }, 683 | { 684 | "x": 157, 685 | "y": 346 686 | } 687 | ], 688 | "styleId": 9, 689 | "text": "Food and Nutrition Board" 690 | } 691 | ], 692 | "index": 4, 693 | "subType": "para", 694 | "lineHeight": 0, 695 | "text": "Food and Nutrition Board", 696 | "alignment": "center", 697 | "type": "text", 698 | "pageNum": [ 699 | 0 700 | ], 701 | "uniqueId": "647a777b4a1fa759eff5d8f8a76720a3" 702 | }, 703 | { 704 | "firstLinesChars": 0, 705 | "pos": [ 706 | { 707 | "x": 150, 708 | "y": 360 709 | }, 710 | { 711 | "x": 283, 712 | "y": 360 713 | }, 714 | { 715 | "x": 283, 716 | "y": 371 717 | }, 718 | { 719 | "x": 150, 720 | "y": 371 721 | } 722 | ], 723 | "blocks": [ 724 | { 725 | "pos": [ 726 | { 727 | "x": 148, 728 | "y": 360 729 | }, 730 | { 731 | "x": 282, 732 | "y": 360 733 | }, 734 | { 735 | "x": 282, 736 | "y": 370 737 | }, 738 | { 739 | "x": 148, 740 | "y": 370 741 | } 742 | ], 743 | "styleId": 10, 744 | "text": "Health and Medicine Division" 745 | } 746 | ], 747 | "index": 5, 748 | "subType": "para", 749 | "lineHeight": 0, 750 | "text": "Health and Medicine Division", 751 | "alignment": "center", 752 | "type": "text", 753 | "pageNum": [ 754 | 0 755 | ], 756 | "uniqueId": "dcce8b595a53d275c769634a69cfa2b0" 757 | }, 758 | { 759 | "firstLinesChars": 47, 760 | "pos": [ 761 | { 762 | "x": 106, 763 | "y": 395 764 | }, 765 | { 766 | "x": 325, 767 | "y": 395 768 | }, 769 | { 770 | "x": 325, 771 | "y": 423 772 | }, 773 | { 774 | "x": 106, 775 | "y": 423 776 | } 777 | ], 778 | "blocks": [ 779 | { 780 | "pos": [ 781 | { 782 | "x": 151, 783 | "y": 393 784 | }, 785 | { 786 | "x": 282, 787 | "y": 393 788 | }, 789 | { 790 | "x": 282, 791 | "y": 407 792 | }, 793 | { 794 | "x": 151, 795 | "y": 407 796 | } 797 | ], 798 | "styleId": 11, 799 | "text": "The National Academies of" 800 | }, 801 | { 802 | "pos": [ 803 | { 804 | "x": 104, 805 | "y": 410 806 | }, 807 | { 808 | "x": 327, 809 | "y": 410 810 | }, 811 | { 812 | "x": 327, 813 | "y": 424 814 | }, 815 | { 816 | "x": 104, 817 | "y": 424 818 | } 819 | ], 820 | "styleId": 12, 821 | "text": " SCIENCES· ENGINEERING ·MEDICINE" 822 | } 823 | ], 824 | "index": 6, 825 | "subType": "para", 826 | "lineHeight": 4, 827 | "text": "The National Academies of SCIENCES· ENGINEERING ·MEDICINE", 828 | "alignment": "center", 829 | "type": "text", 830 | "pageNum": [ 831 | 0 832 | ], 833 | "uniqueId": "16f675dd06680c65c3d88ace0635389e" 834 | }, 835 | { 836 | "firstLinesChars": 0, 837 | "pos": [ 838 | { 839 | "x": 143, 840 | "y": 539 841 | }, 842 | { 843 | "x": 289, 844 | "y": 539 845 | }, 846 | { 847 | "x": 289, 848 | "y": 578 849 | }, 850 | { 851 | "x": 143, 852 | "y": 578 853 | } 854 | ], 855 | "blocks": [ 856 | { 857 | "pos": [ 858 | { 859 | "x": 136, 860 | "y": 538 861 | }, 862 | { 863 | "x": 295, 864 | "y": 538 865 | }, 866 | { 867 | "x": 295, 868 | "y": 549 869 | }, 870 | { 871 | "x": 136, 872 | "y": 549 873 | } 874 | ], 875 | "styleId": 13, 876 | "text": "THE NATIONAL ACADEMIES PRESS" 877 | }, 878 | { 879 | "pos": [ 880 | { 881 | "x": 182, 882 | "y": 554 883 | }, 884 | { 885 | "x": 250, 886 | "y": 554 887 | }, 888 | { 889 | "x": 250, 890 | "y": 564 891 | }, 892 | { 893 | "x": 182, 894 | "y": 564 895 | } 896 
| ], 897 | "styleId": 14, 898 | "text": " Washington,DC" 899 | }, 900 | { 901 | "pos": [ 902 | { 903 | "x": 185, 904 | "y": 569 905 | }, 906 | { 907 | "x": 245, 908 | "y": 569 909 | }, 910 | { 911 | "x": 245, 912 | "y": 577 913 | }, 914 | { 915 | "x": 185, 916 | "y": 577 917 | } 918 | ], 919 | "styleId": 15, 920 | "text": " www.nap.edu" 921 | } 922 | ], 923 | "index": 7, 924 | "subType": "para", 925 | "lineHeight": 5, 926 | "text": "THE NATIONAL ACADEMIES PRESS Washington,DC www.nap.edu", 927 | "alignment": "center", 928 | "type": "text", 929 | "pageNum": [ 930 | 0 931 | ], 932 | "uniqueId": "4aafb975055e208a637931dc387e5040" 933 | }, 934 | { 935 | "firstLinesChars": 0, 936 | "pos": [ 937 | { 938 | "x": 81, 939 | "y": 630 940 | }, 941 | { 942 | "x": 350, 943 | "y": 630 944 | }, 945 | { 946 | "x": 350, 947 | "y": 640 948 | }, 949 | { 950 | "x": 81, 951 | "y": 640 952 | } 953 | ], 954 | "blocks": [ 955 | { 956 | "pos": [ 957 | { 958 | "x": 79, 959 | "y": 629 960 | }, 961 | { 962 | "x": 351, 963 | "y": 629 964 | }, 965 | { 966 | "x": 351, 967 | "y": 640 968 | }, 969 | { 970 | "x": 79, 971 | "y": 640 972 | } 973 | ], 974 | "styleId": 16, 975 | "text": "Copyright National Academy of Sciences. All rights reserved." 976 | } 977 | ], 978 | "index": 8, 979 | "subType": "none", 980 | "lineHeight": 0, 981 | "text": "Copyright National Academy of Sciences. All rights reserved.", 982 | "alignment": "center", 983 | "type": "text", 984 | "pageNum": [ 985 | 0 986 | ], 987 | "uniqueId": "3c8b07209763ad8f3feb01ff1d77555e" 988 | } 989 | ], 990 | "version": "1.0.0" 991 | } 992 | } 993 | -------------------------------------------------------------------------------- /examples/example_structured_text.txt: -------------------------------------------------------------------------------- 1 | (type: head, sub_type: page_header, box: (0.02, 0.0, 0.77, 0.02)) Innovations in the Food System: Exploring the Future of Fcod: Proceedings of a Workshop 2 | (type: title, sub_type: none, box: (0.28, 0.1, 0.73, 0.29)) INNOVATIONS in the FOODSYSTEM EXPLORING THE FUTURE OF FOOD 3 | (type: text, sub_type: none, box: (0.26, 0.34, 0.74, 0.36)) PROCEEDINGS OF A WORKSHOP 4 | (type: text, sub_type: none, box: (0.31, 0.44, 0.69, 0.5)) Melissa Maitin-Shepard, Rapporteur Food Forum 5 | (type: text, sub_type: para, box: (0.36, 0.52, 0.63, 0.53)) Food and Nutrition Board 6 | (type: text, sub_type: para, box: (0.35, 0.56, 0.66, 0.57)) Health and Medicine Division 7 | (type: text, sub_type: para, box: (0.25, 0.61, 0.75, 0.65)) The National Academies of SCIENCES· ENGINEERING ·MEDICINE 8 | (type: text, sub_type: para, box: (0.33, 0.83, 0.67, 0.89)) THE NATIONAL ACADEMIES PRESS Washington,DC www.nap.edu 9 | (type: text, sub_type: none, box: (0.19, 0.97, 0.81, 0.99)) Copyright National Academy of Sciences. All rights reserved. 10 | -------------------------------------------------------------------------------- /human_baselines/baseline_1.jsonl: -------------------------------------------------------------------------------- 1 | {"question_id": "free_gpt4o_4033497_24_51_15", "doc_no": "4033497", "total_pages": 51, "start_end_idx": [24, 51], "question_type": "extract", "question": "What should be installed in a high-traffic area to control passage and allow plant material to grow?", "answer": "A fence", "detailed_evidences": "The key recommendations for erosion and shore access as stated in the document are:\n1. Establish an Erosion and Control Plan.\n2. Re-grade, replant, and add obstacles to control access.\n3. 
Place fencing in high traffic areas to control access and allow plant material to grow.\n4. Develop a policy with the Town's legal counsel to assess liability surrounding shore access.\nThese recommendations are detailed on page 49 under the “Erosion/Safety/Shore Access” section ([Pure-text, Title], (0.03, 0.19, 0.25, 0.24), [49]).", "evidence_pages": [49], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding", "SP_Layout_Understanding"], "pred": "fencing", "score": 0.0} 2 | {"question_id": "free_gpt4o_4153452_61_90_9", "doc_no": "4153452", "total_pages": 128, "start_end_idx": [61, 90], "question_type": "extract", "question": "What is the name of the transfer agent and registrar for the Company’s Common Stock?", "answer": "BankBoston, N.A.", "detailed_evidences": "The name of the transfer agent and registrar responsible for the Company’s Common Stock is stated under related section of the document.\n\nOn page 66, under \"TRANSFER AGENT\" section, it mentions:\n\"The transfer agent and registrar for the Common Stock of the Company is BankBoston,N.A.\" ((0.06, 0.16, 0.20)).\n\nThus, the transfer agent and registrar for the Company’s Common Stock is BankBoston, N.A.\n\nEvidence is sourced from [Pure-text, Title], page [66].\n", "evidence_pages": [66], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding", "SP_Layout_Understanding"], "pred": "BankBoston, N.A.", "score": 1.0} 3 | {"question_id": "free_gpt4o_4027862_74_102_11", "doc_no": "4027862", "total_pages": 102, "start_end_idx": [74, 102], "question_type": "extract", "question": "What was the percentage of total import for cut flowers in 2020?", "answer": "95.3%", "detailed_evidences": "The percentage of total import for cut flowers in 2020 was 95.3% ((0.15, 0.28, 0.86, 0.58), page 85).", "evidence_pages": [85], "evidence_sources": ["Layout"], "answer_format": "Float", "task_tag": "Understanding", "subTask": ["SP_Layout_Understanding"], "pred": "95.3%", "score": 1.0} 4 | {"question_id": "topic2title_4082886_66_66_7", "doc_no": "4082886", "total_pages": 72, "start_end_idx": [66, 66], "question_type": "topic2title", "question": "Which sections provide insights into Energoprojekt's strategic vision for the future?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["THEFUTUREOFENERSOPROJEST", "STRATEGICVISION INTO THE FUTURE OF ENERGOPROJEKT"], "detailed_evidences": "", "evidence_pages": [66], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": "THE FUTURE OF ENERGOPROJEKT: STRATEGIC VISION INTO THE FUTURE OF ENERGOPROJEKT", "score": 0.3076923076923077} 5 | {"question_id": "topic2title_4026569_45_48_4", "doc_no": "4026569", "total_pages": 83, "start_end_idx": [45, 48], "question_type": "topic2title", "question": "From which sections can we learn about the methodologies and recommendations for site preparation and grading?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["SITE PREPARATION AND GRADING", "Compaction Standard and Methodology"], "detailed_evidences": "", "evidence_pages": [45, 48], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["SITE PREPARATION AND GRADING"], "score": 0.5} 6 | 
{"question_id": "topic2title_4081036_8_29_3", "doc_no": "4081036", "total_pages": 68, "start_end_idx": [8, 29], "question_type": "topic2title", "question": "Which sections would you refer to for understanding the company's accounting principles and recent changes in accounting standards?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["Significant Accounting Policies", "Recently Issued Accounting Pronouncements"], "detailed_evidences": "", "evidence_pages": [8, 29], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["Recently Issued Accounting Pronouncements", "Pending Accounting Pronouncements"], "score": 0.5} 7 | {"question_id": "topic2title_4031221_13_13_2", "doc_no": "4031221", "total_pages": 52, "start_end_idx": [13, 13], "question_type": "topic2title", "question": "Where can we find details about the financial obligations and debts of a campaign?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["SCHEDULE D: LIABILITIES", "SCHEDULE D: LIABILITIES"], "detailed_evidences": "", "evidence_pages": [13], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["Form CPF M 102: Campaign Finance Report"], "score": 0.0} 8 | {"question_id": "topic2title_4026569_1_19_1", "doc_no": "4026569", "total_pages": 83, "start_end_idx": [1, 19], "question_type": "topic2title", "question": "Which sections provide detailed information on the geological and soil characteristics of the Newport Banning Ranch development site?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["APPENDIX B GEOLOGY AND SOILS", "SITE GEOLOGY"], "detailed_evidences": "", "evidence_pages": [1, 19], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["SITE GEOLOGY"], "score": 0.5} 9 | {"question_id": "topic2title_4153185_3_4_1", "doc_no": "4153185", "total_pages": 53, "start_end_idx": [3, 4], "question_type": "topic2title", "question": "Which sections provide insights into the objectives and structure of Child Labour Monitoring and Remediation Systems (CLMRS) in the cocoa sector?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["What is a Child Labour Monitoring and Remediation System?"], "detailed_evidences": "", "evidence_pages": [3, 4], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["Objectives, scope and structure of this review"], "score": 0.0} 10 | {"question_id": "topic2title_4081036_22_22_9", "doc_no": "4081036", "total_pages": 68, "start_end_idx": [22, 22], "question_type": "topic2title", "question": "Which sections discuss the company's forward-looking statements and overall business overview?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["Fonswnd-Looking Statements", "Business Overview"], "detailed_evidences": "", "evidence_pages": [22], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["Forward-Looking Statements", "Business Overview"], "score": 0.9230769230769231} 11 | {"question_id": 
"topic2title_4153185_15_20_4", "doc_no": "4153185", "total_pages": 53, "start_end_idx": [15, 20], "question_type": "topic2title", "question": "From which sections can we learn about the data collection and analysis methods used in the review?\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": ["Data sources and methodology overview"], "detailed_evidences": "", "evidence_pages": [15, 20], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Title_Locating"], "pred": ["Data sources and methodology overview"], "score": 1.0} 12 | {"question_id": "free_gpt4o_4094684_68_97_21", "doc_no": "4094684", "total_pages": 113, "start_end_idx": [68, 97], "question_type": "extract", "question": "Changes made to the construction drawings for Tomillo involved adding specific details like elastic in the back of the pants, a flat lining, and ensuring multiple pockets were incorporated, yes or no?", "answer": "yes", "detailed_evidences": "Changes made to the construction drawings for Tomillo involved adding specific details like elastic in the back of the pants, a flat lining, and ensuring multiple pockets were incorporated. These modifications ensured functionality on stage, as described in figures on pages 78 and 79. ", "evidence_pages": [78, 79], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "yes", "score": 1.0} 13 | {"question_id": "free_gpt4o_4131533_52_68_14", "doc_no": "4131533", "total_pages": 73, "start_end_idx": [52, 68], "question_type": "extract", "question": "A NaN (Not a Number) value is used to Indicate when the system cannot calculate a strain or temperature value for a particular gage, yes or no?", "answer": "yes", "detailed_evidences": "In the ODiSI protocol messages, the JSON value 'null' is used to indicate a NaN (Not a Number) value. This condition occurs specifically when the ODiSI system cannot calculate a strain or temperature value for a particular gage. The JSON null value serves to denote these uncalculated or invalid data points in the Measurement Messages. This is mentioned across multiple pages, with detailed explanations on pages 63 ((0.15, 0.68, 0.88, 0.75), (0.14, 0.85, 0.88, 0.9)) and 65 ((0.14, 0.17, 0.88, 0.24), (0.15, 0.32, 0.88, 0.37)).", "evidence_pages": [63, 65], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": "yes", "score": 1.0} 14 | {"question_id": "free_gpt4o_4048779_16_45_5", "doc_no": "4048779", "total_pages": 52, "start_end_idx": [16, 45], "question_type": "extract", "question": "What is the prerequisite for 'English, Grade 12, University Preparation'?", "answer": "\"English, Grade 11, University Preparation (ENG3U)\"", "detailed_evidences": "The prerequisite for 'English, Grade 12, University Preparation (ENG4U)' is 'English, Grade 11, University Preparation (ENG3U)' ((0.12, 0.83, 0.5, 0.85), page 27). The prerequisite for 'Mathematics, Grade 12, University Preparation' courses are:\n1. 'Advanced Functions (MHF4U)': 'Grade 12, University Preparation'\n2. 'Calculus and Vectors (MCV4U)': 'Grade 12 Advanced Functions, University Preparation' (to be taken prior to or concurrently) ((0.12, 0.13, 0.86, 0.17), page 29). 
\nEvidence collected from the Title and Text parts of the given pages.", "evidence_pages": [27, 29], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": "English, Grade 11, University Preparation", "score": 1.0} 15 | {"question_id": "free_gpt4o_4174181_15_44_18", "doc_no": "4174181", "total_pages": 122, "start_end_idx": [15, 44], "question_type": "extract", "question": "List all technical reviewers who contributed to the book.", "answer": ["Chris McCoy", "Benjamin Taylor"], "detailed_evidences": "The technical reviewers who contributed to the book are:\nChris McCoy\nBenjamin Taylor\nThis information is found on pages 18 and 19 ((0.15, 0.66, 0.88)).", "evidence_pages": [18, 19], "evidence_sources": ["Text"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": ["Chris McCoy", "Benjamin Taylor"], "score": 1.0} 16 | {"question_id": "free_gemini15_pro_4025531_16_45_1", "doc_no": "4025531", "total_pages": 69, "start_end_idx": [16, 45], "question_type": "extract", "question": "What is the last update date of the explosion action plan BG3115?", "answer": "12 February 2016", "detailed_evidences": "According to the text on page 17, (0.15, 0.24, 0.45, 0.26), the explosion action plan BG3115 was last updated on 12 February 2016. The same page, (0.15, 0.27, 0.43, 0.28), indicates that the update was done by the person answering the question (referred to as \"you\"). ", "evidence_pages": [16, 17], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": "12 February 2016", "score": 1.0} 17 | {"question_id": "free_gpt4o_4120884_42_71_17-2", "doc_no": "4120884", "total_pages": 143, "start_end_idx": [42, 71], "question_type": "extract", "question": "What is the orientation of the scattering vector (Q) for hoop strain measurements?", "answer": "The scattering vector (Q) should be tangential to the cylinder bore and perpendicular to the specified planes.", "detailed_evidences": "For neutron diffraction measurements in hoop strain orientation (Picture, Pure-text):\n1. Align the engine block so that the scattering vector (Q) is tangential to the cylinder bore, perpendicular to the specified planes.\n2. Position the engine block on the spectrometer table, as shown in Figure 28(a) and (b) on page 68.\n3. Allow the monochromated incident beam to penetrate the bulk material where the sampling volume is formed at the intersection of the incident and diffracted beams.\n4. Translate the block in X, Y, and Z directions to acquire strain measurements from top to bottom of the cylinder bridge.\nThis method is detailed on pages 68 to 70.", "evidence_pages": [68, 69, 70], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "radial", "score": 0.0} 18 | {"question_id": "free_gpt4o_4041821_2_31_3", "doc_no": "4041821", "total_pages": 79, "start_end_idx": [2, 31], "question_type": "extract", "question": "How many residential dwellings are estimated to be part of the development?", "answer": 165, "detailed_evidences": "The estimated number of residential dwellings is 165. This information can be found in the section titled \"Population and Residential Densities\" on page 8 ((0.13, 0.36, 0.88, 0.65), page 8), evidence source is Figure. 
Additionally, it is also discussed under section 4.2.1 DWELLING TARGET on page 10 ((0.52, 0.84, 0.61, 0.85), page 10), evidence source is Pure-text. Therefore, the estimated residential dwellings are 165, and this information is provided in sections \"Population and Residential Densities\" and \"4.2.1 DWELLING TARGET\" on pages [8, 10].", "evidence_pages": [8, 10], "evidence_sources": ["Text", "Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Table_Understanding"], "pred": 140, "score": 0.0} 19 | {"question_id": "free_gpt4o_4065289_60_89_9", "doc_no": "4065289", "total_pages": 91, "start_end_idx": [60, 89], "question_type": "extract", "question": "Was there any significant excess for any cancer type associated with aviation electronics technicians in the Groves et al., (2002) study? Choose yes or no?", "answer": "no", "detailed_evidences": "There was no significant excess for any cancer type, including lymphoid malignancies, brain cancer, or testicular cancer, associated with aviation electronics technicians in the Groves et al (2002) study. This is indicated on pages 61 ((0.23, 0.13, 0.18)) and 62 ((0.23, 0.14, 0.18)). Evidence source is Pure-texts spread across these two pages.", "evidence_pages": [61, 62], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": "no", "score": 1.0} 20 | {"question_id": "free_gpt4o_4076912_11_40_4", "doc_no": "4076912", "total_pages": 103, "start_end_idx": [11, 40], "question_type": "extract", "question": "Determine the percentage of animals used in research for the purposes of regulatory use in 2018.", "answer": 17.9, "detailed_evidences": "In 2018, 17.9% of animals were used for regulatory purposes. This information can be found on pages 28 and 31. The percentage is mentioned in Figure 14 on page 28 with coordinates ((0.15, 0.39, 0.87, 0.81)), and further details are provided in the text on page 31 with coordinates ((0.09, 0.56, 0.91, 0.57)).", "evidence_pages": [28, 31], "evidence_sources": ["Text", "Figure"], "answer_format": "Float", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "17.9%", "score": 1.0} 21 | {"question_id": "free_gpt4o_4142297_76_95_13", "doc_no": "4142297", "total_pages": 95, "start_end_idx": [76, 95], "question_type": "extract", "question": "What exposure time(in ms) was determined optimal for ArcLight measurements?", "answer": "100", "detailed_evidences": "The optimal exposure time determined for ArcLight measurements was 100 ms, based on the need for clarity and avoidance of background interferences. Reducing the time to 20 ms achieved higher recording speed but required the cells to be very bright to avoid dimming impacts ((0.12, 0.62, 0.88, 0.87), pages [76, 77]).", "evidence_pages": [76, 77], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": 25, "score": 0.0} 22 | {"question_id": "free_gpt4o_4118675_34_63_10", "doc_no": "4118675", "total_pages": 73, "start_end_idx": [34, 63], "question_type": "extract", "question": "What is the land use type for soils PE011 and PE020?", "answer": ["cultivated pasture without irrigation", "high-level arable farming"], "detailed_evidences": "The land use type for soil PE011 is detailed on page 36, and it is cultivated pasture without irrigation ((0.06, 0.54, 0.95)). 
For soil PE020, the land use type is high-level arable farming, as mentioned on page 54 ((0.05, 0.06, 0.6)).", "evidence_pages": [36, 54], "evidence_sources": ["Text"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": ["cultivated pasture; no irrigation; Rotation: not relevent; Inprovenents: none", "high level arable farming; Crops: annual crops; continuously irrigated; Rotation: continuous crop rotation; Improvements: terracing"], "score": 0.0} 23 | {"question_id": "free_gpt4o_4118675_34_63_1", "doc_no": "4118675", "total_pages": 73, "start_end_idx": [34, 63], "question_type": "extract", "question": "What are the Print dates for the soils in PE010 and PE011?", "answer": "22/03/95", "detailed_evidences": "The Print date for the reference soil PE010 can be found on page 35, with coordinates ((0.71, 0.06, 0.94, 0.08)), and belongs to the \"Text\" type. The Print date for the reference soil PE011 is on page 36, located at coordinates ((0.71, 0.08, 0.95, 0.09)), it also belongs to the \"Text\" type. The Print dates for PE010 and PE011 are both 22/03/95.", "evidence_pages": [35, 36], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": "22/03/95", "score": 1.0} 24 | {"question_id": "free_gemini15_pro_4185929_30_59_1", "doc_no": "4185929", "total_pages": 75, "start_end_idx": [30, 59], "question_type": "extract", "question": "What percentage of carbon has been lost from the world's soils due to poor farming practices, according to the provided text?", "answer": 80, "detailed_evidences": "The text mentions that up to 80 percent of carbon from the world's soils has been lost due to poor farming and ranching practices, especially modern industrial agriculture. This information is found across multiple pages. Specifically, page 30 contains the statement \"industrial agriculture- have led to the loss of up to 80 percent of carbon from the world's soils,\" (0.11, 0.41, 0.87, 0.67). This is corroborated by similar statements on pages 34 (0.11, 0.41, 0.88, 0.61), 36 (0.11, 0.42, 0.61, 0.46)(0.11, 0.47, 0.87, 0.6), 48 (0.11, 0.41, 0.86, 0.67), and 56 (0.11, 0.41, 0.87, 0.67). ", "evidence_pages": [30, 34, 36, 48, 56], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": 80, "score": 1.0} 25 | {"question_id": "free_gpt4o_4069340_65_91_25", "doc_no": "4069340", "total_pages": 96, "start_end_idx": [65, 91], "question_type": "extract", "question": "The camera is considered not omnipotent because it cannot fully cover every possible shooting point, yes or no?", "answer": "no", "detailed_evidences": "The consequences of omniscience and omnipotence in cameras are that concepts no longer signify the world but rather the universe signifies the program within cameras. This bi-univocal relationship means that cameras are considered omnipotent in their photographic universe due to their comprehensive coverage of every possible photographic point. This shift leads to an absurd omniscience where cameras know everything and can do everything within their programmed universe, making human concepts redundant. 
This discussion is found on page 68 ((0.19, 0.16, 0.85, 0.79)) and further elaborated on page 69 ((0.13, 0.14, 0.78, 0.53)).", "evidence_pages": [68, 69], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding"], "pred": "no", "score": 1.0} 26 | {"question_id": "free_gpt4o_4048779_16_45_8", "doc_no": "4048779", "total_pages": 52, "start_end_idx": [16, 45], "question_type": "extract", "question": "How many courses related to health and physical education are there in 9th grade?", "answer": 1, "detailed_evidences": "The 'Health and Physical Education' section lists the following courses by grade:\n1. Grade 9: \n - Health and Physical Education (Healthy Active Living) (PPL1O)\n2. Grade 10:\n - Health and Physical Education (Healthy Active Living) (PPL2O)\n3. Grade 12:\n - Health and Physical Education (Healthy Active Living) (PPL4O)\n - Introductory Kinesiology (PSK4U)\n\nCounting these, there are 1 course listed for Grade 9, 1 for Grade 10, and 2 for Grade 12, making a total of 4 courses. Evidence present on pages 38 and 39.", "evidence_pages": [38, 39], "evidence_sources": ["Text", "Layout"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": 1, "score": 1.0} 27 | {"question_id": "free_gpt4o_4176503_99_112_8-2", "doc_no": "4176503", "total_pages": 112, "start_end_idx": [99, 112], "question_type": "extract", "question": "Jim Schultz approved of Lockport's security system, yes or no?", "answer": "no", "detailed_evidences": "Jim Shultz raised several arguments against Lockport's security system:\n1. The system posed significant privacy risks for students and teachers.\n2. The community was not properly consulted.\n3. Potential conflicts of interest between the district and the security consultant.\n4. Effectiveness of the system was questionable.\nHe also actively wrote columns in the local newspaper, organized petitions, and called for public hearings to challenge the lack of transparency and efficacy of the implemented security system. (pages 107, 108) Evidence: Text from [Pure-text, Corner-note] coordinates (0.11, 0.49, 0.89, 0.59), page 107 and (0.11, 0.08, 0.88, 0.4), page 108.", "evidence_pages": [107, 108], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": "no", "score": 1.0} 28 | {"question_id": "free_gpt4o_4049875_8_27_13", "doc_no": "4049875", "total_pages": 81, "start_end_idx": [8, 27], "question_type": "extract", "question": "What is the governing body that regulates the Responsible Entity of the Fund?", "answer": "Australian Securities and Investments Commission (ASIC)", "detailed_evidences": "The governing body that regulates the Responsible Entity of the Fund is the Australian Securities and Investments Commission (ASIC). 
This information is located on page 10, coordinates ((0.08, 0.09, 0.92, 0.7)), specifying \"The Fund is a managed investment scheme registered with ASIC under the Corporations Act.\" Another supporting reference is found on page 19 in the text coordinates ((0.08, 0.22, 0.33, 0.32)) stating the involvement of operators and networks regulated by ASIC.", "evidence_pages": [10, 19], "evidence_sources": ["Text", "Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Table_Understanding"], "pred": "The Trust Company (RE Services) Limited", "score": 0.0} 29 | {"question_id": "free_gpt4o_4120884_42_71_11-2", "doc_no": "4120884", "total_pages": 143, "start_end_idx": [42, 71], "question_type": "extract", "question": "Enumerate the experimental analysis steps on the engine block castings as noted in the project outline.", "answer": ["Analysis of cylinder profiles in service-tested blocks", "Microstructure analysis", "Thermal analysis", "Mechanical properties testing", "Residual stress/strain evolution"], "detailed_evidences": "The experimental analysis steps on the engine block castings noted in the project outline (Pure-text) include:\n1. Analysis of cylinder profiles in service-tested blocks.\n2. Microstructure analysis.\n3. Thermal analysis.\n4. Mechanical properties testing.\n5. Residual stress/strain evolution, as illustrated in the flow chart on page 56 (Picture).", "evidence_pages": [55, 56], "evidence_sources": ["Text", "Figure"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": ["Analysis of the cylinder profiles in service tested block", "Microstructure", "Thermal analysis", "Mechanical properties", "Residual stress/strain evolution"], "score": 0.8455140850627318} 30 | {"question_id": "free_gpt4o_4120884_42_71_6", "doc_no": "4120884", "total_pages": 143, "start_end_idx": [42, 71], "question_type": "extract", "question": "The specific volume change associated with aging precipitates in Al-Cu alloys is mentioned, yes or no?", "answer": "yes", "detailed_evidences": "The specific volume change associated with aging precipitates in Al-Cu alloys is discussed on pages 51 and 52. According to the text and corresponding figure, the unit linear dimensional change varies significantly with different precipitates. For instance, precipitation of the \\( \\theta' \\) phase results in the largest specific volume change, being approximately three times larger than the equilibrium \\( \\theta \\) phase ([Pure-text] (0.11, 0.49, 0.88, 0.8), [Figure] (0.31, 0.1, 0.67, 0.37), [51, 52]). 
Hence, the specific volume change with aging precipitates can achieve the most considerable expansion with the \\( \\theta' \\) phase.", "evidence_pages": [51, 52], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "yes", "score": 1.0} 31 | {"question_id": "free_gpt4o_4115594_31_60_5", "doc_no": "4115594", "total_pages": 63, "start_end_idx": [31, 60], "question_type": "extract", "question": "In the migrant survey, how many people took German courses?", "answer": 8, "detailed_evidences": "From the migrant survey, 8 participants attended German language courses, provided by public education/training centers, NGOs, and e-learning platforms (page 35, (0.12, 0.38, 0.68), evidence source: Picture).", "evidence_pages": [35, 36], "evidence_sources": ["Figure"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["MP_Figure_Understanding"], "pred": 8, "score": 1.0} 32 | {"question_id": "free_gpt4o_4013909_54_76_2", "doc_no": "4013909", "total_pages": 76, "start_end_idx": [54, 76], "question_type": "extract", "question": "Where were the experiments conducted with CyberSpaceship One?", "answer": "GNC laboratory", "detailed_evidences": "The experiments conducted using CyberShip I were performed in the GNC laboratory at the Department of Engineering Cybernetics, NTNU. This information can be subdivided across multiple pages. On page 65, it states, \"The first experiments were performed in the GNC laboratory at the Department of Engineering Cybernetics, NTNU using CyberShip I which is an offshore supply vessel scale 1:70\"((0.12, 0.33, 0.4, 0.51), page 65), this leads to pictures and further explanations of the experiments across page 65 and page 66.", "evidence_pages": [65, 66], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "the GNC laboratory", "score": 0.7777777777777778} 33 | {"question_id": "free_gpt4o_4013909_54_76_4", "doc_no": "4013909", "total_pages": 76, "start_end_idx": [54, 76], "question_type": "extract", "question": "Does the combination of compass and angular velocity sensor improve heading retention? Choose yes or no.", "answer": "yes", "detailed_evidences": "The results of comparing 'y' values and estimates using the compass and angular rate sensor show improvements in course-keeping capabilities. On page 74, it mentions, \"It is seen that the WF motion components are quite well removed from the estimate of resulting in good course-keeping capabilities\"((0.13, 0.26, 0.45, 0.36), page 74). The figures on page 70 show the results graphically, comparing 'y=v+y_w' and estimated 'y_r'((0.26, 0.18, 0.76, 0.9), page 70). 
Thus, both textual and graphical comparisons across these pages indicate better course-keeping when combining the observations from multiple sensors.", "evidence_pages": [70, 74], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "yes", "score": 1.0} 34 | {"question_id": "free_gpt4o_4142297_65_94_12", "doc_no": "4142297", "total_pages": 95, "start_end_idx": [65, 94], "question_type": "extract", "question": "What was the main reason for no correlation between electrical and optical signals in the experiment discussed in Figure 3.11?", "answer": "High optical noise and bleaching effects", "detailed_evidences": "The primary reason for the lack of correlation between electrical and optical signals in the experiment (Figure 3.11) is the high optical noise and bleaching effects. The noise arises due to fluctuations, especially when photon count is low, dropping SNR. Despite the membrane depolarization and action potentials, significant reductions in fluorescence were not recorded. The bleaching rate was noted at 9% per 70 s with 100% intensity and 100 ms exposure, contributing significantly to the variance in results (pages 71, 72, 73).", "evidence_pages": [71, 72, 73], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Figure_Understanding"], "pred": "the trade-off between the recording settings for Andor Zyla sCMOS camera and the sensitivity of the measurements we would have liked to achieve", "score": 0.0} 35 | {"question_id": "free_gpt4o_4014524_22_51_8", "doc_no": "4014524", "total_pages": 55, "start_end_idx": [22, 51], "question_type": "extract", "question": "What percentage of Ward 3 residents had no physical activity in the last 30 days?", "answer": 38, "detailed_evidences": "Thirty-eight percent (38%) of Ward 7 residents had no physical activity in the last 30 days as mentioned in the document related to the Parks, Recreation, and Open Space Element. However, the document provided does not specify the percentage for Ward 3. You may need to review the other images in the set of 30 provided to find specific information about Ward 3.", "evidence_pages": [43], "evidence_sources": ["Text", "Layout"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding", "SP_Layout_Understanding"], "pred": "6%", "score": 0.0} 36 | {"question_id": "free_gemini15_pro_4139004_25_54_8", "doc_no": "4139004", "total_pages": 57, "start_end_idx": [25, 54], "question_type": "extract", "question": "What is the radius of the circle (Ic) used for error calculations in Example 1 of the Neumann case (Figure 24)?\n(A). 5A/3\n(B). 10A/3\n(C). 5λ/3\n(D). 
10λ/3\nChoose the letter name in front of the right option from A, B, C, D.", "answer": "C", "detailed_evidences": "The text on page 41, with coordinates (0.21, 0.63, 0.79, 0.75), describes Example 1 in the Neumann case (Figure 24, (0.25, 0.28, 0.78, 0.55), page 41), stating that the errors are calculated on the circle Ic with radius 5λ/3.", "evidence_pages": [41], "evidence_sources": ["Text", "Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding", "SP_Figure_Understanding"], "pred": "C", "score": 1.0} 37 | {"question_id": "free_gpt4o_4183202_75_104_5", "doc_no": "4183202", "total_pages": 120, "start_end_idx": [75, 104], "question_type": "extract", "question": "Orkhon II is steward of Xa Deshret, yes or no?", "answer": "yes", "detailed_evidences": "The steward of Xa Deshret is Orkhon II. This title is mentioned in the context of description about leadership: 'Orkhon Il is steward of Xa Deshret and has ruled it well for nearly three decades' ((0.14, 0.51, 0.52, 0.69), page 80). The importance of their role lies in 'protecting the remains of the Sphinx Empire from the savagery of the Vermilion Tribes and the ever-growing menace of Gorhgijesk and Rhaz al-Khali' ((0.14, 0.51, 0.52, 0.69), page 80). This role maintains governance and defends against external threats.", "evidence_pages": [80], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "no", "score": 0.0} 38 | {"question_id": "free_gpt4o_4091930_40_58_11", "doc_no": "4091930", "total_pages": 58, "start_end_idx": [40, 58], "question_type": "extract", "question": "What is the URL for InfraMap.net?", "answer": "\"800.863.3748\"", "detailed_evidences": "The URL for InfraMap.net provided in the document is \"800.863.3748\" (page 47, Text, (0.22, 0.9, 0.79, 0.93)). They provide professional subsurface utility engineering and utility infrastructure mapping services (page 47, Text, (0.13, 0.55, 0.87, 0.62)).", "evidence_pages": [47], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "Inframap.net", "score": 0.0} 39 | {"question_id": "free_gpt4o_4110886_2_26_8", "doc_no": "4110886", "total_pages": 76, "start_end_idx": [2, 26], "question_type": "extract", "question": "Who are the main supporters and sponsors for the research study according to the acknowledgements section?", "answer": ["Punjabi Community Health Services (PCHS)", "The Government of Ontario", "Mr. Baldev Mutta", "Ms. Manvir Bhangu", "Dr. Puneet Dhillon"], "detailed_evidences": "The main supporters and sponsors for the research study are mentioned in the acknowledgements section on page 5. The document states: \"This work was sponsored by Punjabi Community Health Services (PCHS) in cooperation with The Government of Ontario\" ((0.12, 0.16, 0.88, 0.24), page 5). Additionally, it mentions specific individuals at PCHS who supported the study: Mr. Baldev Mutta, Chief Executive Officer (CEO), Ms. Manvir Bhangu, Manager of Operations, Quality and Systems, and Dr. Puneet Dhillon, Research, Communication and Research Analyst ((0.15, 0.33, 0.63, 0.35), (0.15, 0.36, 0.8, 0.38), (0.15, 0.4, 0.82, 0.41), page 5).", "evidence_pages": [5], "evidence_sources": ["Text"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": ["Punjabi Community Health Services (PCHS)", "The Goverment of Ontario", "Mr. Baldev Mutta", "Ms. Manvir Bhangu", "Dr. 
Puneet Dhillon"], "score": 0.992} 40 | {"question_id": "free_gpt4o_4034235_26_55_8", "doc_no": "4034235", "total_pages": 90, "start_end_idx": [26, 55], "question_type": "extract", "question": "What is the function of the overall correction value of the upper thread tension?", "answer": "\"Change the tension values for all programs at once\"", "detailed_evidences": "The function of the overall correction value is to change the tension values for all programs at once. When set, the upper thread tension is adjusted based on the overall correction value during sewing, regardless of the individual program's settings. As described on page 52, the correction value is displayed in the TENSION display and applied the next time sewing is carried out ((0.54, 0.51, 0.92, 0.76)).", "evidence_pages": [52], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "to change the tension values for all programs at once", "score": 0.9433962264150944} 41 | {"question_id": "free_gemini15_pro_4193248_57_86_6", "doc_no": "4193248", "total_pages": 142, "start_end_idx": [57, 86], "question_type": "extract", "question": "What three successful examples of an integrated service approach for engaging culturally and linguistically diverse communities are noted in section 3.2.5?", "answer": ["The Family Relationship Centre Broadmeadows", "The Dandenong Project at the Federal Magistrates Court", "The Neighbourhood Justice Centre in Collingwood, Victoria"], "detailed_evidences": "Section 3.2.5 on page 67 mentions three successful examples of integrated service approaches: 1) The Family Relationship Centre Broadmeadows; 2) The Dandenong Project at the Federal Magistrates Court; and 3) The Neighbourhood Justice Centre in Collingwood, Victoria. These are listed in the paragraph starting at (0.14, 0.51, 0.85, 0.81) on page 67. The evidence source is Pure-text. ", "evidence_pages": [67], "evidence_sources": ["Text"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": ["the Family Relationship Centre Broadmeadows", "The Federal Magistrates Court component of the 'Dandenong Project'", "the Neighbourhood Justice Centre in Collingwood, Victoria"], "score": 0.6666666666666666} 42 | {"question_id": "free_gpt4o_4169843_69_85_7", "doc_no": "4169843", "total_pages": 85, "start_end_idx": [69, 85], "question_type": "extract", "question": "Give the cumulative actual total and expected total values for the occupancy of structures in floodplains.\n(A). 0/20\n(B). 20/0\n(C). 0/200\n(D). 0/2\nChoose the letter name in front of the right option from A, B, C, D.", "answer": "B", "detailed_evidences": "The cumulative actual and expected total values for occupancy of structures in floodplains is 20 and 0. This is detailed in page 69 related to the actual and expected occupancy figures (Page 69: \"0/0\", (0.31, 0.11, 0.51, 0.14), page 69). This evidence can be traced in the document's section for \"Accomplishments Performance Measures\". 
Hence, occupancy in floodplains shows 20 actual against 0 expected structures.", "evidence_pages": [69], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "B", "score": 1.0} 43 | {"question_id": "free_gpt4o_4113560_74_103_8", "doc_no": "4113560", "total_pages": 146, "start_end_idx": [74, 103], "question_type": "extract", "question": "Summarize the steps to be taken if the bike seat gives way after load.", "answer": "\"Bring the seat into the lowest position and then into the highest position to allow for compensation of the oil-gas mixture in the spring chambers, ensuring the seat locks again. Repeat once.\"", "detailed_evidences": "The document suggests steps for handling situations where the bike seat gives way after load on page 103. It advises bringing the seat into the lowest position and then into the highest position to allow for compensation of the oil-gas mixture in the spring chambers, ensuring the seat locks again. Verbatim advice includes: \"Bring the seat into the lowest position and then into the highest position and repeat once.\" ((0.19, 0.13, 0.86, 0.19), evidence source is Text). No additional steps are mentioned, affirming this action as sufficient to resolve the issue.", "evidence_pages": [103], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "Bring the seat into the lowest position and then into the highest position and repeat this once", "score": 0.0} 44 | {"question_id": "free_gemini15_pro_4000400_14_43_1", "doc_no": "4000400", "total_pages": 54, "start_end_idx": [14, 43], "question_type": "extract", "question": "What were the compensatory damages sought in the Butler v. Celotex Corp. case?", "answer": "$150000", "detailed_evidences": "The _Butler v. Celotex Corp._ case sought $150,000 in compensatory damages. This information is found in a table on page 30. (0.17, 0.28, 0.79, 0.41) ", "evidence_pages": [30], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": "$150000", "score": 1.0} 45 | {"question_id": "free_gemini15_pro_4088207_13_42_1", "doc_no": "4088207", "total_pages": 76, "start_end_idx": [13, 42], "question_type": "extract", "question": "What is the total number of shareholders as of 31.12.2020?", "answer": 485, "detailed_evidences": "The total number of shareholders as of 31.12.2020 is 485, according to the table on page 13 (0.15, 0.12, 0.9, 0.26). The number of private individual shareholders is 429, as shown in the table on page 13 (0.14, 0.29, 0.9, 0.38). ", "evidence_pages": [13], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": 485, "score": 1.0} 46 | {"question_id": "free_gpt4o_4108584_37_66_17", "doc_no": "4108584", "total_pages": 118, "start_end_idx": [37, 66], "question_type": "extract", "question": "Identify the most common growth intervals for Atlantic salmon in Norway.", "answer": ["0.1-0.2 kg", "0.2-1 kg", "1-2 kg", "2-3 kg", "3-4 kg", "4-5 kg"], "detailed_evidences": "Growth intervals for Atlantic salmon in Norway are categorized as follows: 0.1-0.2 kg, 0.2-1 kg, 1-2 kg, 2-3 kg, 3-4 kg, and 4-5 kg. 
The corresponding feed consumption is 0.08 kg, 0.75 kg, 1.00 kg, 1.05 kg, 1.10 kg, and 1.20 kg respectively [Table] ((0.12, 0.34, 0.86, 0.4), page 66).", "evidence_pages": [66], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": "4-5 kg", "score": 0.16666666666666666} 47 | {"question_id": "free_gpt4o_4093319_101_112_4", "doc_no": "4093319", "total_pages": 112, "start_end_idx": [101, 112], "question_type": "extract", "question": "What is the value-added output index for 'Pulp and Paper' in 1962?", "answer": 317, "detailed_evidences": "To find the value-added output index for 'Pulp and Paper' in 1962, refer to Table I on page 103. The index for 'Pulp and Paper' in 1962 is 317. Evidence can be found in Table I on page 103, coordinates (0.07, 0.25, 0.9, 0.84).", "evidence_pages": [103], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": 317, "score": 1.0} 48 | {"question_id": "free_gemini15_pro_4063909_28_57_8", "doc_no": "4063909", "total_pages": 68, "start_end_idx": [28, 57], "question_type": "extract", "question": "What was the total equity of WSE as at 31 December 2012(in PLN thousand)?", "answer": "384881", "detailed_evidences": "WSE's total equity was PLN 384,881 thousand as at 31 December 2012 and PLN 382,670 thousand as at 31 December 2011. This information is located in a table on page 48 (0.12, 0.48, 0.87, 0.56).", "evidence_pages": [48], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": 384881, "score": 1.0} 49 | {"question_id": "free_gemini15_pro_4063909_28_57_9", "doc_no": "4063909", "total_pages": 68, "start_end_idx": [28, 57], "question_type": "extract", "question": "What is the higher value(in PLN thousand) between 'Deferred tax assets to be recovered within 12 months' and 'Deferred tax assets to be recovered after more than 12 months' as at 31 December 2011?", "answer": "2816", "detailed_evidences": "As at 31 December 2011, deferred tax assets to be recovered within 12 months were (PLN 2,519) thousand, and those to be recovered after more than 12 months were (PLN 2,816) thousand. This data is from a table on page 39 (0.13, 0.33, 0.88, 0.42).", "evidence_pages": [39], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": 2816, "score": 1.0} 50 | {"question_id": "free_gemini15_pro_4188879_36_65_2", "doc_no": "4188879", "total_pages": 84, "start_end_idx": [36, 65], "question_type": "extract", "question": "Which districts in Mizoram had a high level of WRC production in both 2009-10 and 2014-15?", "answer": ["Champhai", "Kolasib"], "detailed_evidences": "According to Table 3 ((0.07, 0.36, 0.48, 0.69), page 36), Champhai and Kolasib districts had high WRC production in 2009-10. In 2014-15, Table 3 again shows that Kolasib and Champhai districts had high production. 
Therefore, Champhai and Kolasib are the districts with consistently high production levels across both periods.", "evidence_pages": [36], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": ["Kolasib", "Champhai"], "score": 1.0} 51 | {"question_id": "free_gpt4o_4131533_6_19_4", "doc_no": "4131533", "total_pages": 73, "start_end_idx": [6, 19], "question_type": "extract", "question": "What are the maximum sensor lengths supported by the remote modules?", "answer": "50 m", "detailed_evidences": "The \"Remote Module (Standard Length)\" supports sensor lengths from 1 m to 20 m ((0.13, 0.15, 0.86, 0.91), page 6). The \"Remote Module (Extended Length)\" supports sensor lengths from 1 m to 50 m ((0.13, 0.15, 0.86, 0.91), page 6).", "evidence_pages": [6], "evidence_sources": ["Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": "50 m", "score": 1.0} 52 | {"question_id": "free_gemini15_pro_4188879_36_65_5", "doc_no": "4188879", "total_pages": 84, "start_end_idx": [36, 65], "question_type": "extract", "question": "What is the total annual requirement of rice for Mizoram in 2014-15?", "answer": 2199263, "detailed_evidences": "Total Annual Requirement of Rice for Mizoram in 2014-15:\nAccording to Table 8 from image 7, the total annual requirement of rice for Mizoram in 2014-15 was 21,99,263 quintals.", "evidence_pages": [42], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": 2199263, "score": 1.0} 53 | {"question_id": "free_gpt4o_4086173_6_35_11", "doc_no": "4086173", "total_pages": 117, "start_end_idx": [6, 35], "question_type": "extract", "question": "What methodologies have been used in the final chapter project on job satisfaction and occupational demographics?", "answer": "linear regression analysis", "detailed_evidences": "The methodology used in the final chapter project on job satisfaction and occupational demographics is linear regression analysis. Specifically, this methodology is used to explore the demographics of the workforce and job satisfaction without determining causality. This summary can be found in Table 2 of Chapter 1 (page 13: 0.12, 0.24, 0.56, 0.26).\n", "evidence_pages": [13], "evidence_sources": ["Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Table_Understanding"], "pred": "Linear regression analysis", "score": 1.0} 54 | {"question_id": "free_gpt4o_4091919_79_89_3", "doc_no": "4091919", "total_pages": 89, "start_end_idx": [79, 89], "question_type": "extract", "question": "Which drone allows for dual control mode?", "answer": "Gleagle X3 Quadcopter", "detailed_evidences": "The 'Gleagle X3 Quadcopter' allows for dual control mode, and this feature is mentioned on page 80 in the paragraph starting at ((0.04, 0.58, 0.63, 0.69)).", "evidence_pages": [80], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding"], "pred": "Gleagle X3 Quadcopter", "score": 1.0} 55 | {"question_id": "extract_fig2tab_4066338_71_100_8", "doc_no": "4066338", "total_pages": 130, "start_end_idx": [71, 100], "question_type": "extract_fig2tab", "question": "What's name of the figure at the page which contains a table whose name is \"Table 5.1: Gender performance using LBP texture and full face. 
Percentage values represent the class-specific accuracies and the average class-specific accuracy for each experiment. VeriLook pre-dicted no gender on 2.75% and 0.89% of male and female images respectively in the FRGC dataset. In the MORPH dataset it predicted no gender on 0.5% and 11.26% of male and female images.\"?", "answer": "Figure 5.6: Easy and hard subjects in texture gender classification on MORPH images. Subjects on the left are a subset of those correctly classified in all texture experiments. Top to bottom: Black. White. and Hispanic. Subjects on the right those who were misclassified in over 95% of experiments. The subject on the bottom was labeled incorrectly as Female.", "detailed_evidences": "", "evidence_pages": [90], "evidence_sources": ["Table", "Figure"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": "Figure 5.6: Easy and hard subjects in texture gender classification on MORPH images. Subjects on the left are a subset of those correctly classified in all texture experiments. Top to bottom: Black, White, and Hispanic. Subjects on the right those who were misclassified in over 95% of experiments. The subject on the bottom was labeled incorrectly as Female.", "score": 0.9944289693593314} 56 | {"question_id": "extract_fig2tab_4107960_34_63_5", "doc_no": "4107960", "total_pages": 86, "start_end_idx": [34, 63], "question_type": "extract_fig2tab", "question": "List names of the other tables at the page which contains a table whose name is \"Table 6-2. Ohio Lumex SRM Comparison\".", "answer": ["Table 6-3. ALSI SRM Comparison"], "detailed_evidences": "", "evidence_pages": [54], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": ["Table 6-3. 
ALSI SRM Comparison"], "score": 1.0} 57 | {"question_id": "extract_fig2tab_4057441_1_30_23", "doc_no": "4057441", "total_pages": 81, "start_end_idx": [1, 30], "question_type": "extract_fig2tab", "question": "What's name of the table at the page which contains a figure whose name is \"Figure 1 Connection Diagram for SOP16 package\".", "answer": ["16-LEAD SOP"], "detailed_evidences": "", "evidence_pages": [5], "evidence_sources": ["Figure"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": "Table 1 Pin Description for SOP16 package", "score": 0.0} 58 | {"question_id": "extract_fig2tab_4057441_40_69_5", "doc_no": "4057441", "total_pages": 81, "start_end_idx": [40, 69], "question_type": "extract_fig2tab", "question": "What's name of the table at the page which contains a figure whose name is \"Figure 97 Power-on Timing\"?", "answer": "Table 17 Power-Up Timing and Write Inhibit Threshold", "detailed_evidences": "", "evidence_pages": [61], "evidence_sources": ["Table", "Figure"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": "Table 17 Power-Up Timing and Write Inhibit Threshold", "score": 1.0} 59 | {"question_id": "extract_fig2tab_4057441_1_30_7", "doc_no": "4057441", "total_pages": 81, "start_end_idx": [1, 30], "question_type": "extract_fig2tab", "question": "List names of the tables at the page which contains a figure whose name is \"Figure 3 Connection Diagram for TFBGA24 5x5 ball array package\"?", "answer": ["Table 2 Pin Description for WSON8 package"], "detailed_evidences": "", "evidence_pages": [6], "evidence_sources": ["Table", "Figure"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": ["Table 2 Pin Description for WSON8 package"], "score": 1.0} 60 | {"question_id": "extract_fig2tab_4140425_12_41_1", "doc_no": "4140425", "total_pages": 52, "start_end_idx": [12, 41], "question_type": "extract_fig2tab", "question": "What's name of the table at the page which contains a figure whose name is \"Fig.1 Microstructure and EDS spectrum of the sample A\"?", "answer": "Table 2. Characteristics of the samples", "detailed_evidences": "", "evidence_pages": [16], "evidence_sources": ["Table", "Figure"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": "Table 2. Characteristics of the samples", "score": 1.0} 61 | {"question_id": "extract_fig2tab_4043512_42_71_4", "doc_no": "4043512", "total_pages": 101, "start_end_idx": [42, 71], "question_type": "extract_fig2tab", "question": "List names of the other tables at the page which contains a table whose name is \"Table 32. National Total Commute Trip CO2Emissions,40-ft Compressed Natural Gas\".", "answer": ["Table 31. 
Comparison of On-Road Transportation CO2Emissions and Potential Emission Reductions Between California and Maine"], "detailed_evidences": "", "evidence_pages": [61], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": ["Table 31.Comparison of On-Road Transportation CO2 Emissions and Potential Emission Reductions Between California and Maine"], "score": 0.9836065573770492} 62 | {"question_id": "extract_fig2tab_4100212_33_62_11", "doc_no": "4100212", "total_pages": 111, "start_end_idx": [33, 62], "question_type": "extract_fig2tab", "question": "List names of the other tables at the page which contains a table whose name is \"TABLE 114: Large, Variable-sized Delimiters\".", "answer": ["TABLE112: nath Delimiters", "TABLE 113: Variable-sized Delimiters"], "detailed_evidences": "", "evidence_pages": [38], "evidence_sources": ["Figure"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": ["TABLE 112: nath Delimiters", "TABLE 113: Variable-sized Delimiters"], "score": 0.9807692307692308} 63 | {"question_id": "free_gpt4o_4098399_14_43_12", "doc_no": "4098399", "total_pages": 54, "start_end_idx": [14, 43], "question_type": "count", "question": "How many workshops does South Jordan City offer annually for water education?", "answer": 2, "detailed_evidences": "South Jordan City offers two water education workshops annually, one in the spring and another in the fall. This information is detailed on page 17 ((0.2, 0.17, 0.87, 0.49), evidence source: Pure-text) and page 18 ((0.2, 0.09, 0.45, 0.33), evidence source: Pure-text). Therefore, the number of workshops offered annually is 2.", "evidence_pages": [17, 18], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning"], "pred": 2, "score": 1.0} 64 | {"question_id": "extract_fig2tab_4151603_24_53_4", "doc_no": "4151603", "total_pages": 84, "start_end_idx": [24, 53], "question_type": "extract_fig2tab", "question": "List names of the other tables at the page which contains a table whose name is \"Table 4.14. Summary of recommended stability method and limit value for compost applications\".", "answer": ["Table 4.13. Proposed relationship between self-heating class and best use ot compost"], "detailed_evidences": "", "evidence_pages": [41], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Figure_Table_Locating"], "pred": ["Table 4.13. Proposed relationship between self-heating class and best use of compost"], "score": 0.9880952380952381} 65 | {"question_id": "free_gpt4o_4053330_32_61_1-2", "doc_no": "4053330", "total_pages": 65, "start_end_idx": [32, 61], "question_type": "count", "question": "How many characters are allowed for the \"Rendering Provider ID#\" in item 24J?", "answer": 11, "detailed_evidences": "The \"Rendering Provider ID#\" field allows for the entry of 11 characters in the shaded area and the entry of a 10-digit NPI number in the unshaded area. These specifics are detailed in the instructions on page 44 ((0.11, 0.17, 0.87, 0.14)) and on page 35 ((0.14, 0.12, 0.3, 0.13)). 
Therefore, this field allows for a maximum of 21 characters.", "evidence_pages": [35, 44], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning"], "pred": 11, "score": 1.0} 66 | {"question_id": "free_gpt4o_4035900_50_65_11-2", "doc_no": "4035900", "total_pages": 65, "start_end_idx": [50, 65], "question_type": "count", "question": "How many steps are there in the instructions about how to serve court papers by registered process server?", "answer": "4", "detailed_evidences": "The instances where depositing payment for services is mentioned include:\n1. Page 52: \"The amount you need to pay for this service\" for Process Server ((0.26, 0.69, 0.88, 0.73))\n2. Page 54: \"$200.00 deposit fee\" for Sheriff's service ((0.35, 0.56, 0.84, 0.59))\n(Evidence: Pages 52, 54)\nTherefore, the instances of depositing payment for services are [\"Process Server payment amount\", \"$200 deposit fee for Sheriff\"].", "evidence_pages": [52, 54], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning"], "pred": 4, "score": 1.0} 67 | {"question_id": "free_gpt4o_4063393_80_109_4", "doc_no": "4063393", "total_pages": 118, "start_end_idx": [80, 109], "question_type": "compare", "question": "Which has the better effectiveness between AREDS 1 and AREDS 2 supplements?\n(A). AREDS 1\n(B). AREDS 2\nChoose the letter name in front of the right option from A, B.", "answer": "AREDS 2", "detailed_evidences": "The effectiveness of patients using AREDS 1 and AREDS 2 supplements is detailed on pages 97 and 98. AREDS 1 (on page 97, (0.09, 0.43, 0.19, 0.46)) contains Beta-carotene and no lutein or zeaxanthin and is no longer recommended because it investigated early AMD with no significant benefit. AREDS 2 (on page 97, (0.13, 0.58, 0.69, 0.6)) is recommended for intermediate and advanced AMD and includes carotenoids such as lutein and zeaxanthin with improved benefits. Thus, AREDS 2 is more effective than AREDS 1 for preventing AMD progression.", "evidence_pages": [97, 98], "evidence_sources": ["Text"], "answer_format": "String", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning"], "pred": "B", "score": 0.0} 68 | {"question_id": "free_gpt4o_4005038_38_65_14", "doc_no": "4005038", "total_pages": 91, "start_end_idx": [38, 65], "question_type": "count", "question": "How many provinces are covered by the study area according to 'Appendix C'?", "answer": 4, "detailed_evidences": "'Appendix C' on page 57 mentions that benthic taxa counts cover all four provinces in the project area ((0.12, 0.13, 0.88, 0.27), page 57). Despite the geographical limits, especially in some regions like Gaspe Bay peninsula, these provinces still provide substantial data through the CABIN dataset. 
The evidence source spans multiple pages and figures ((0.22, 0.18, 0.79, 0.5), (0.36, 0.53, 0.64, 0.55), page 63).", "evidence_pages": [57, 63], "evidence_sources": ["Text", "Figure"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning", "MP_Figure_Reasoning"], "pred": 4, "score": 1.0} 69 | {"question_id": "free_gpt4o_4157143_6_35_10", "doc_no": "4157143", "total_pages": 87, "start_end_idx": [6, 35], "question_type": "count", "question": "What's the number of figures listed under 'GUI (Graphical User Interface) in MATLAB'?", "answer": 1, "detailed_evidences": "'Fiber Types' figures are spread across pages 8 and 9, and include:\nFigure 7 - Fiber Types: refractive index profile (page 8)\nThis totals to 1 figures.\nOn the other hand, 'GUI (Graphical User Interface) in MATLAB' figures are found distributed in:\nFigure 44 - GUI Design\nThus, each section has a comparable number of figures, adding up to 1 figures each.", "evidence_pages": [8, 9], "evidence_sources": ["Figure"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Figure_Reasoning"], "pred": 1, "score": 1.0} 70 | {"question_id": "free_gpt4o_4177240_35_56_8", "doc_no": "4177240", "total_pages": 56, "start_end_idx": [35, 56], "question_type": "compare", "question": "Which model shows less aggressive underprediction for the u component in the -45 degree case in narrow streets:\n(A) Miskam A \n(B) Fluent D\nChoose the letter name in front of the right option from A, B.", "answer": "A", "detailed_evidences": "Comparing the ability of two different models in predicting the u component for the -45 degree case in narrow streets from different panels involves panels from page 36 (panel 1) and page 37 (panel 2). For instance, on page 36, Miskam A has a u prediction formula y=192 - 0.6969y + error term ((0.08, 0.21, 0.95, 0.93)). On page 37, Fluent D shows a different prediction, y=1.322k - 0.7124y + error term ((0.09, 0.21, 0.95, 0.93)}). When comparing these models, Miskam A underpredicts less aggressively (lower negative coefficient) compared to Fluent D, which indicates stronger variability but a higher error term in its predictions. Thus, Miskam A from panel 1 seems more stable with less drastic underprediction compared to Fluent D from panel 2.", "evidence_pages": [36, 37], "evidence_sources": ["Figure"], "answer_format": "String", "task_tag": "Reasoning", "subTask": ["MP_Figure_Reasoning"], "pred": "A", "score": 1.0} 71 | {"question_id": "free_gpt4o_4060934_85_114_5", "doc_no": "4060934", "total_pages": 130, "start_end_idx": [85, 114], "question_type": "compare", "question": "Which policy received the lowest percentage of 'No' responses between GE2a, GE2b, GE3, GE4, and BE1?\n(A). GE2a\n(B). GE2b\n(C). GE3\n(D). GE4\nChoose the letter name in front of the right option from A, B, C, D.", "answer": "A", "detailed_evidences": "To identify the policy with the lowest percentage of 'No' responses: Policy GE2a on page 86 had 0.69% 'No' responses ((0.05, 0.39, 0.95, 0.45), page 86), source is Table; Policy GE2b on page 87 had 2.07% 'No' responses ((0.06, 0.42, 0.08, 0.43), page 87), source is Text; Policy GE3 on page 88 had 2.74% 'No' responses ((0.06, 0.44, 0.08, 0.45), page 88), source is Text; Policy GE4 on page 90 had 0.70% 'No' responses ((0.06, 0.44, 0.08, 0.45), page 90), source is Text; Policy BE1 on page 91 had 4.29% 'No' responses ((0.06, 0.44, 0.08, 0.45), page 91), source is Text. 
The policy with the lowest percentage of 'No' responses is GE2a with 0.69%.", "evidence_pages": [86, 87, 88, 90, 91], "evidence_sources": ["Table"], "answer_format": "String", "task_tag": "Reasoning", "subTask": ["MP_Table_Reasoning"], "pred": "A", "score": 1.0} 72 | {"question_id": "summary2title_4185072_40_69_1", "doc_no": "4185072", "total_pages": 72, "start_end_idx": [40, 69], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nThis text discusses Micro Air Vehicles (MAVs) and Nano Air Vehicles (NAVs), focusing on biomimetic MAVs that mimic the wing motion of insects like dragonflies and bees. It describes a simplified wing frame design inspired by dragonfly wings, fabricated using methods like 3D printing. The text also mentions a multi-degree of freedom mechanism for a bee-like MAV, designed to replicate the figure-eight motion of bee wings. A comparison of different wing materials (ABS, PLA, acrylic) is also provided.\n\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "Bio-Inspiration in Flapping-Wing UAVs.", "detailed_evidences": "", "evidence_pages": [42, 43], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Para_Title_Locating"], "pred": "Bio-Inspiration in Flapping-Wing UAVs", "score": 0.9736842105263158} 73 | {"question_id": "free_gemini15_pro_4181009_95_112_1", "doc_no": "4181009", "total_pages": 112, "start_end_idx": [95, 112], "question_type": "calculate", "question": "What is the total number of directories across all Canadian FTP sites listed, according to the scan dates?", "answer": 1424, "detailed_evidences": "To calculate the total number of directories, we need to sum the \"Total Dirs\" from each table across multiple pages. Page 95 shows 113 directories for NEtLine 2000 ((0.22, 0.06, 0.77, 0.17), page 95); page 97 shows 114 directories for the University of Montreal ((0.22, 0.06, 0.77, 0.17), page 97); page 99 shows 415 directories for the University of Calgary ((0.22, 0.06, 0.77, 0.17), page 99); page 101 shows 70 directories for York University ((0.22, 0.06, 0.77, 0.17), page 101); page 103 shows 93 directories for Carleton University ((0.22, 0.06, 0.77, 0.17), page 103); page 105 shows 326 directories for Simon Fraser University ((0.22, 0.06, 0.77, 0.17), page 105); page 107 shows 247 directories for Mimic.ca Archive ((0.22, 0.07, 0.77, 0.17), page 107); and page 111 shows 46 directories for another Carleton University entry ((0.22, 0.07, 0.77, 0.17), page 111). Summing these values: 113 + 114 + 415 + 70 + 93 + 326 + 247 + 46 = 1424. Therefore, the total number of directories is 1424. Evidence can be found on pages 95, 97, 99, 101, 103, 105, 107, and 111.", "evidence_pages": [95, 97, 99, 101, 103, 105, 107, 111], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["MP_Table_Reasoning"], "pred": "Not answerable", "score": 0.0} 74 | {"question_id": "summary2title_4129570_80_91_1", "doc_no": "4129570", "total_pages": 91, "start_end_idx": [80, 91], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nThe text primarily discusses the prohibition and regulation of using genetic information for determining eligibility and underwriting of life, disability-income, and long-term-care insurance. 
It highlights that genetic tests must be certified, similar to current medical practices, and points out existing state laws that restrict such uses. The document addresses concerns from the insurance industry and notes the broad international consensus on restricting genetic information usage, emphasizing the importance of protecting individual privacy.\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "ARTICLE 4: LIFE, DISABILITY-INCOME, AND LONG-TERM-CARE INSURANCE", "detailed_evidences": "", "evidence_pages": [84, 85, 86], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Para_Title_Locating"], "pred": "SECTION 401. GENETIC TESTING SUPPLIED FOR LIFE, DISABILITY-INCOME, AND LONG-TERM-CARE INSURANCE.", "score": 0.6170212765957447} 75 | {"question_id": "summary2title_4166749_17_46_1", "doc_no": "4166749", "total_pages": 52, "start_end_idx": [17, 46], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nThe provided text discusses guidelines for tapering opioid dosages to avoid withdrawal symptoms. It highlights the recommended rates and methods for reducing opioids, emphasizing individualized tapering plans and patient safety. The text advises on slow tapers, possibly pausing, and restarting based on patient readiness. It also warns against rapid detoxification due to high risks, especially during pregnancy. Collaboration with mental health providers and specialists is recommended, along with monitoring for opioid use disorder and considering naloxone for overdose prevention.\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "Considerations for Tapering Opioids", "detailed_evidences": "", "evidence_pages": [28], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Para_Title_Locating"], "pred": "Recommendations", "score": 0.0} 76 | {"question_id": "summary2title_4127644_8_37_3", "doc_no": "4127644", "total_pages": 98, "start_end_idx": [8, 37], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nVistra, an electricity provider, considers its services critical during the COVID-19 pandemic and national emergency. While prioritizing employee and community well-being, Vistra maintains operations. Their Q3 2020 financial statements, incorporating COVID-19's impact, show no material adverse effects. The CARES Act, signed March 27, 2020, has tax implications detailed in Note 7.\n\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "COVID-19 Pandemic", "detailed_evidences": "", "evidence_pages": [12], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Para_Title_Locating"], "pred": "VISTRA CORP. NOTES TO CONDENSED CONSOLIDATED FINANCIAL STATEMENTS", "score": 0.0} 77 | {"question_id": "free_gpt4o_4068263_5_24_14", "doc_no": "4068263", "total_pages": 60, "start_end_idx": [5, 24], "question_type": "calculate", "question": "Calculate the difference in U-Value between Standard and Extreme Casement Units?", "answer": 0.04, "detailed_evidences": "The document provides U-Values for various casement units on page 5. The U-Value for the Standard Casement Unit is 0.27, and for the Extreme Casement Unit, it is 0.23 ((0.08, 0.54, 0.56, 0.7), page 5). 
The difference in U-Value between these two units is calculated as 0.27 - 0.23, which equals 0.04.", "evidence_pages": [5], "evidence_sources": ["Figure"], "answer_format": "Float", "task_tag": "Reasoning", "subTask": ["SP_Figure_Reasoning"], "pred": 0.04, "score": 1.0} 78 | {"question_id": "free_gemini15_pro_4016274_26_55_11", "doc_no": "4016274", "total_pages": 71, "start_end_idx": [26, 55], "question_type": "extract", "question": "In the Client Agreement, what clause addresses the topic of \"Notices\"? ", "answer": "Clause 17", "detailed_evidences": "Clause 17, \"NOTICES\", covers the topic of communications. This can be verified by the index on page 30 and the clause title on pages 41 and 42. ", "evidence_pages": [30, 41, 42], "evidence_sources": ["Layout", "Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Layout_Understanding", "MP_Table_Understanding"], "pred": "Notices", "score": 0.0} 79 | {"question_id": "free_gpt4o_4076912_11_40_6", "doc_no": "4076912", "total_pages": 103, "start_end_idx": [11, 40], "question_type": "extract", "question": "List the top four domains of basic research with the highest number of animal uses in 2018.", "answer": ["Ethology / animal behavior / animal biology", "Nervous system", "Immune system", "Oncology"], "detailed_evidences": "The top four domains of basic research with the highest number of animal uses in 2018 were:\n1. Ethology / animal behavior / animal biology: 1,061,647 uses\n2. Nervous system: 901,510 uses\n3. Immune system: 773,265 uses\n4. Oncology: 556,952 uses\nThis information can be found on pages 34 and 35. Detailed data is present in table 15 on page 35 with coordinates ((0.12, 0.5, 0.88, 0.81)) and Figure 17 on page 34 with coordinates ((0.04, 0.1, 0.95, 0.69)).", "evidence_pages": [34, 35], "evidence_sources": ["Table", "Figure"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Table_Understanding", "MP_Figure_Understanding"], "pred": ["Ethology/Animal Behaviour/Animal Biology", "Nervous System", "Immune System", "Oncology"], "score": 0.9709302325581395} 80 | {"question_id": "free_gpt4o_4083860_40_69_3", "doc_no": "4083860", "total_pages": 69, "start_end_idx": [40, 69], "question_type": "extract", "question": "List all percentages for the confidence that the organization would address concerns about unsafe clinical practice for 'Your org' and 'Average' categories.", "answer": ["49.7%", "59.1%", "54.8%", "58.5%"], "detailed_evidences": "To find the relevant data regarding the confidence that the organization would address concerns about unsafe clinical practice, look at pages 43 and 44.\n\nOn page 43:\n- Your org: 49.7%, Average: 59.1% (0.35, 0.8, 0.66, 0.94)\n\nOn page 44:\n- Your org: 54.8%, Average: 58.5% (0.04, 0.8, 0.34, 0.95)\n\nSo the percentages for 'Your org' and 'Average' concerning confidence in addressing concerns are found on pages 43 (49.7% and 59.1%) and 44 (54.8% and 58.5%).", "evidence_pages": [43, 44], "evidence_sources": ["Table"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Table_Understanding"], "pred": ["49.7%", "59.1%"], "score": 0.0} 81 | {"question_id": "free_gemini15_pro_4139178_20_49_5", "doc_no": "4139178", "total_pages": 98, "start_end_idx": [20, 49], "question_type": "extract", "question": "The specific industries excluded by the SRI Balanced option do not include fossil fuels, yes or no?", "answer": "no", "detailed_evidences": "The SRI-Balanced Option, as described on pages 29 and 30, excludes investments in fossil fuels, 
companies breaching human rights or labor rights standards, companies with poor ESG policies, controversial weapons, Uncertified palm oil, Predatory lending, for-profit detention, gambling, tobacco production, and live animal export. See (0.09, 0.66, 0.26, 0.81) and (0.51, 0.21, 0.91, 0.50) for details.", "evidence_pages": [29, 30], "evidence_sources": ["Text", "Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Table_Understanding"], "pred": "no", "score": 1.0} 82 | {"question_id": "free_gpt4o_4032186_71_100_9", "doc_no": "4032186", "total_pages": 121, "start_end_idx": [71, 100], "question_type": "extract", "question": "Between 2013 and 2019, the purchasing managers' indices for China's manufacturing and services sectors showed volatility and an overall upward trend,yes or no?", "answer": "no", "detailed_evidences": "On page 77, the chart titled 'China: Purchasing Managers' Indices' shows the performance and variance for the years 2013 to 2015. The indices tracked are Caixin Manufacturing, Caixin Services, and Official. The indices show a downward trend, with fluctuations, reflecting the economic slowdown in the period. The chart coordinates are (0.53, 0.19, 0.95, 0.89).", "evidence_pages": [77], "evidence_sources": ["Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Figure_Understanding"], "pred": "no", "score": 1.0} 83 | {"question_id": "free_gpt4o_4115594_31_60_19", "doc_no": "4115594", "total_pages": 63, "start_end_idx": [31, 60], "question_type": "extract", "question": "Which public institutions did participants in the self-assessment tool receive training from?", "answer": "Public Education/Training Centers", "detailed_evidences": "8 of the participants in the self-assessment tool received training from Public Education/Training Centers (page 50, (0.15, 0.70, 0.72), evidence source: Picture).", "evidence_pages": [50], "evidence_sources": ["Figure"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["SP_Figure_Understanding"], "pred": "Culinary Sector", "score": 0.0} 84 | {"question_id": "free_gemini15_pro_4074494_35_56_10", "doc_no": "4074494", "total_pages": 56, "start_end_idx": [35, 56], "question_type": "extract", "question": "What year was the map \"A topographical map of the north part of New York Island, exhibiting the plan of Fort Washington\" published?", "answer": 1777, "detailed_evidences": "The map was published in 1777 and created by William Faden. This information can be found on page 46, starting at (0.06, 0.65, 0.48, 0.79) and illustrated in Figure 1 on the same page. 
", "evidence_pages": [46], "evidence_sources": ["Text", "Figure"], "answer_format": "Integer", "task_tag": "Understanding", "subTask": ["SP_Text_Understanding", "SP_Figure_Understanding"], "pred": 1777, "score": 1.0} 85 | {"question_id": "summary2title_4084402_91_102_1", "doc_no": "4084402", "total_pages": 102, "start_end_idx": [91, 102], "question_type": "summary2title", "question": "Which section best matches the follwing description: \nThe given text outlines essential safety and etiquette guidelines for bikers, including slowing down for pedestrians, signaling when overtaking, respecting wildlife, nature, and property, choosing routes wisely, wearing protective gear, not traveling alone, and adhering to unwritten rules such as giving precedence, signaling overtaking, and remaining within sight of group members.\nSelect titles from the doc that best answer the question, do not alter or analyze the titles themselves.", "answer": "Behaviour code", "detailed_evidences": "", "evidence_pages": [99, 100], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Locating", "subTask": ["Para_Title_Locating"], "pred": "Behaviour code", "score": 1.0} 86 | {"question_id": "free_gemini15_pro_4061601_47_71_8", "doc_no": "4061601", "total_pages": 71, "start_end_idx": [47, 71], "question_type": "calculate", "question": "What was the total fair value of options that vested in 2016, 2015, and 2014, in millions of Canadian dollars?", "answer": 312, "detailed_evidences": "Page 48 shows the fair value of options that vested as $126 million in 2016, $91 million in 2015, and $95 million in 2014. The total is $126 + $91 + $95 = $312 million. ", "evidence_pages": [48], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["SP_Table_Reasoning"], "pred": 312, "score": 1.0} 87 | {"question_id": "free_gemini15_pro_4156965_49_64_3", "doc_no": "4156965", "total_pages": 64, "start_end_idx": [49, 64], "question_type": "calculate", "question": "What is the aggregate compensation of Key Management Personnel for the years ended 30/06/15 and 30/06/14?", "answer": "$1660110", "detailed_evidences": "The table on page 49 details the aggregate compensation of Key Management Personnel. For the year ended 30/06/15, the total was $719,484.(0.15, 0.64, 0.88, 0.79) For the year ended 30/06/14, it was $940,626.(0.15, 0.64, 0.88, 0.79) The total for both years is $719,484 + $940,626 = $1,660,110. ", "evidence_pages": [49], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["SP_Table_Reasoning"], "pred": "$1660110", "score": 1.0} 88 | {"question_id": "free_gemini15_pro_4063909_28_57_6", "doc_no": "4063909", "total_pages": 68, "start_end_idx": [28, 57], "question_type": "calculate", "question": "How much did the total carrying amount of WSE's investments in its subsidiaries as at 31 December 2012 increase compared to 2011(in PLN thousands)?", "answer": "213582", "detailed_evidences": "The total carrying amount of WSE's investments in its subsidiaries as at 31 December 2012 was PLN 250,497 thousand and PLN 36,915 thousand as at 31 December 2011. 
This data is from the tables on page 36 (0.09, 0.43, 0.93, 0.63) and page 36 (0.11, 0.67, 0.92, 0.85) respectively.", "evidence_pages": [36], "evidence_sources": ["Table"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["SP_Table_Reasoning"], "pred": 213582, "score": 1.0} 89 | {"question_id": "free_gpt4o_4034595_7", "doc_no": "4034595", "total_pages": 52, "start_end_idx": [10, 37], "question_type": "extract", "question": "What are the environmental concerns mentioned in the Arctic Community Concerns section?", "answer": ["More Spill Planning/Response", "Subsistence Hunter Safety", "Food Quality & Availability"], "detailed_evidences": "The section \"Arctic Community Concerns\" on pages 15 and 17 mentions several environmental concerns:\nOn page 15, concerns include \"More Spill Planning/Response\" ((0.04, 0.71, 0.75)), \"Subsistence Hunter Safety\" ((0.04, 0.76, 0.80)), and \"Food Quality & Availability\" ((0.04, 0.81, 0.86)).", "evidence_pages": [15, 17], "evidence_sources": ["Text", "Layout"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": ["Increased Arctic activity", "Spills may lead to lost food resources", "Gov’t response capabilities appear inadequate", "Not enough CG infrastructure present", "Traditional knowledge not fully understood"], "score": 0.0} 90 | {"question_id": "free_gpt4o_4096779_58_87_10", "doc_no": "4096779", "total_pages": 91, "start_end_idx": [58, 87], "question_type": "extract", "question": "Which section of copyright law was pivotal in Atari Games Corp. v. Nintendo ?", "answer": "Section 117(1)", "detailed_evidences": "Section 117(1) of the copyright law played a pivotal role in Atari Games Corp. v. Nintendo, relating to the legitimacy of intermediate copying activities. The section allows copying as an essential step in software utilization but was contested regarding reverse engineering claims in this case. The significance is outlined on page 65 ((0.07, 0.07, 0.29, 0.74)) and page 66 ((0.06, 0.07, 0.67, 0.85)), with evidence sources being Title and Pure-text, respectively.", "evidence_pages": [65, 66], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": "Section 117", "score": 1.0} 91 | {"question_id": "free_gpt4o_4027862_66_95_3", "doc_no": "4027862", "total_pages": 102, "start_end_idx": [66, 95], "question_type": "extract", "question": "What is the total local paddy production area(in hectares) developed in 2020?", "answer": 852.97, "detailed_evidences": "The total local paddy production area developed in 2020 is calculated using data from pages 71 and 74. On page 71, it's mentioned that the total developed area was 852.97 hectares. 
Page 74 also highlights developed areas with specific characteristics like irrigated and non-irrigated, but the overall developed area mentioned is consistent at 852.97 hectares.\n\nEvidence references:\n- \"PADDY RICE SUPPORT SCHEME FOR 2019/2020 SEASON\" ((0.15, 0.13, 0.22), page 74).\n- \"PADDY INDUSTRY 2020 Local Production 3.864 mt Area Developed 852.97 ha\" ((0.12, 0.07, 0.22), page 71).\n\nHence, the total local paddy production area developed in 2020 is 852.97 hectares.", "evidence_pages": [71, 74], "evidence_sources": ["Layout"], "answer_format": "Float", "task_tag": "Understanding", "subTask": ["MP_Layout_Understanding"], "pred": 852.97, "score": 1.0} 92 | {"question_id": "free_gpt4o_4169843_69_85_9", "doc_no": "4169843", "total_pages": 85, "start_end_idx": [69, 85], "question_type": "extract", "question": "Which organization is responsible for the Big Creek OV - Flood Damage Reduction Project?", "answer": "Shelby County Government", "detailed_evidences": "All responsible organizations across the different projects mentioned are as follows:\n1. Wolf River OV- Increased Greenway Use: Shelby County Government.\n2. Wolf River OV- Attracting New Business: Shelby County Government.\n3. Wolf River OV- Increased Wetlands: Shelby County Government.\n4. Wolf River OV- Flood Cost Reduction: Shelby County Government.\n5. Big Creek OV- Increased Greenway Use: City of Millington/Millington Parks Department.\n6. Big Creek OV - Increased Housing Value: Division of Planning and Development.\n7. Big Creek OV -Tree Mitigation: Shelby County Public Works.\n8. Big Creek OV- Decreased Flood Damage: Shelby County Government.\n\nAll these organizations are specified on pages within respective project descriptions, such as page 70 ((0.06, 0.26, 0.7, 0.29)), page 74 ((0.06, 0.26, 0.7, 0.29)), and page 78 ((0.07, 0.26, 0.74, 0.29)).", "evidence_pages": [70, 72, 74, 76, 78, 80, 82, 84], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Text_Understanding", "MP_Layout_Understanding"], "pred": "Shelby County Government2", "score": 0.96} 93 | {"question_id": "free_gpt4o_4071554_7_36_4", "doc_no": "4071554", "total_pages": 51, "start_end_idx": [7, 36], "question_type": "extract", "question": "List all the sessions chaired by members from Western Ontario?", "answer": "[\"Geographies in the Global South I-West Africa Chair\", \"Environmental Discourse\"]", "detailed_evidences": "Western Ontario members chair the following sessions listed in the document:\n\n\"Geographies in the Global South I-West Africa Chair\" chaired by Godwin Arku (Western Ontario), listed on page 8. Evidence coordinates are (0.13, 0.65, 0.61, 0.7).\n\"Environmental Discourse\" chaired by Dan Shrubsole (Western Ontario), listed on page 9. 
Evidence coordinates are (0.17, 0.75, 0.55, 0.78).\n\nThese are the sessions chaired by members from Western Ontario.", "evidence_pages": [8, 9, 31, 32], "evidence_sources": ["Layout"], "answer_format": "List", "task_tag": "Understanding", "subTask": ["MP_Layout_Understanding"], "pred": ["W14 Geographies in the Global South I – West Africa", "W16 Environmental Discourse"], "score": 0.7984749455337691} 94 | {"question_id": "free_gpt4o_4027862_66_95_6-2", "doc_no": "4027862", "total_pages": 102, "start_end_idx": [66, 95], "question_type": "extract", "question": "What's the most valuable fruit in retail in 2020?", "answer": "\"Banana\"", "detailed_evidences": "The top three types of fruits with the highest retail value in 2020 are \"Banana\" with a retail value of 16,530,000 B$ (page 66, evidence: Table titled \"FRUITS INDUSTRY 2011-2020\" (0.24, 0.13, 0.74, 0.17)), \"Papaya\" with a retail value of 14,530,000 B$ (page 67, evidence: table titled \"LOCAL FRUITS PRODUCTION 2020\" (0.17, 0.16, 0.83, 0.36)), and \"Watermelon\" with a retail value of 12,530,000 B$ (page 69, evidence: Table titled \"15 MAJOR TYPE OF IMPORT TROPICAL FRUITS 2020 (BY QUANTITY)\" (0.16, 0.25, 0.85, 0.83)).", "evidence_pages": [66, 67, 69], "evidence_sources": ["Layout", "Table"], "answer_format": "String", "task_tag": "Understanding", "subTask": ["MP_Layout_Understanding", "MP_Table_Understanding"], "pred": "Banana", "score": 1.0} 95 | {"question_id": "summary2tab_4063624_72_75_6", "doc_no": "4063624", "total_pages": 76, "start_end_idx": [72, 75], "question_type": "summary2tab", "question": "Which tables emphasize the role of human resource management in the recruitment and evaluation process?\nSelect table names from the doc that best answer the question, do not alter or analyze the table names themselves.", "answer": ["Molimovasobjasnitekojavrstaobukebibilanajkorisnijazaposlenimazaduzenimzakadrovskeposlove/upravljanjeljudskimresursima.", "Molimovasdaobrazlozite,vasimrijecima,dalibi bilopotrebnorevidirati iunaprijeditibilokojiaspektpostupkaocjenjivanjasluzbenika."], "detailed_evidences": "", "evidence_pages": [72, 75], "evidence_sources": ["Layout", "Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Table_Locating"], "pred": "Not answerable", "score": 0.0} 96 | {"question_id": "summary2tab_4096873_61_66_4", "doc_no": "4096873", "total_pages": 87, "start_end_idx": [61, 66], "question_type": "summary2tab", "question": "Which tables provide detailed instructions on serving legal documents to the other party?\nSelect table names from the doc that best answer the question, do not alter or analyze the table names themselves.", "answer": ["DEFAULT TIMETABLE"], "detailed_evidences": "", "evidence_pages": [66, 61, 62], "evidence_sources": ["Layout", "Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Table_Locating"], "pred": "Not Answerable", "score": 0.0} 97 | {"question_id": "summary2tab_4028159_23_30_3", "doc_no": "4028159", "total_pages": 62, "start_end_idx": [23, 30], "question_type": "summary2tab", "question": "Which tables offer a comprehensive checklist for evaluating potential environmental impacts of the project?\nSelect table names from the doc that best answer the question, do not alter or analyze the table names themselves.", "answer": ["InitialStudy Checklist", "Initial Study Checklist"], "detailed_evidences": "", "evidence_pages": [30, 23], "evidence_sources": ["Layout", "Table"], "answer_format": "List", "task_tag": "Locating", "subTask": 
["Cross_Table_Locating"], "pred": ["Table 2 Initial Study Checklist"], "score": 0.7258064516129032} 98 | {"question_id": "summary2tab_4013723_3_4_2", "doc_no": "4013723", "total_pages": 149, "start_end_idx": [3, 4], "question_type": "summary2tab", "question": "From which tables can we learn about the health benefits of plant carbohydrates and their role in colonic health?\nSelect table names from the doc that best answer the question, do not alter or analyze the table names themselves.", "answer": ["Table 1 Dietary benefit from properties of plant carbohydrates in the diet", "Table 2 Direct and indirect effects of bulk provided by fermentation-resistant polysaccharide associations in the large intestine"], "detailed_evidences": "", "evidence_pages": [3, 4], "evidence_sources": ["Layout", "Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Table_Locating"], "pred": ["Table 1 Dietary benefit from properties of plant carbohydrates in the diet"], "score": 0.5} 99 | {"question_id": "summary2tab_4165048_33_39_2", "doc_no": "4165048", "total_pages": 74, "start_end_idx": [33, 39], "question_type": "summary2tab", "question": "From which tables can we learn about the use of non-online sources by practitioners, non-university researchers, and university researchers?\nSelect table names from the doc that best answer the question, do not alter or analyze the table names themselves.", "answer": ["Table 4.2. Use of non-online sources among Practitioners (valid answers = 42/46), Non-University Researchers (valid answers=24/25), and University Researchers (valid answers=26/28). \"Ranked as the most important\" refers to the proportion of respondents who ranked the source as 1st,2nd or 3rd. The sources are ranked by decreasing importance for Practitioners. Grey cells outline the 3 most important for each category.", "Table 4.7 (a) Comparison of the preferred online tools used by practitioners to share information and those used by researchers to access it; (b) Comparison of the preferred non-online tools used by practitioners to share information and those used by researchers to access it. Personal emails were not included in the answers offered in the question on information access, hence the impossibility of comparison."], "detailed_evidences": "", "evidence_pages": [33, 39], "evidence_sources": ["Layout", "Table"], "answer_format": "List", "task_tag": "Locating", "subTask": ["Cross_Table_Locating"], "pred": ["Table 4.2. Use of non-online sources among Practitioners (valid answers 42/46). Non-University Researchers (valid answers 24/25), and University Researchers (valid answers 26/28)."], "score": 0.0} 100 | {"question_id": "free_gpt4o_4031704_61_71_4", "doc_no": "4031704", "total_pages": 71, "start_end_idx": [61, 71], "question_type": "count", "question": "How many team members did \"Spike It Up Sports Track Club A\" confirm for the 4x100 Meter Relay 17-18 year old Men?", "answer": 4, "detailed_evidences": "For the 4x100 Meter Relay 17-18 year old Men, \"Spike It Up Sports Track Club A\" confirmed 4 team members. 
The team members are Brian Smart Jr, Jake Mc Entyre, Ronaldo Cherry, and Jarion Hayes((0.09, 0.75, 0.96, 0.79), page 66), evidence sources are 'multicolumn'.", "evidence_pages": [66], "evidence_sources": ["Layout"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["SP_Layout_Reasoning"], "pred": 4, "score": 1.0} 101 | {"question_id": "free_gpt4o_4056678_18", "doc_no": "4056678", "total_pages": 70, "start_end_idx": [56, 63], "question_type": "compare", "question": "Is the definition of \"trip\" consistent across different sections of the document? Choose yes or no.", "answer": "Yes", "detailed_evidences": "The definition of \"trip\" appears consistently across different sections of the document. On page 56, under the title \"Important meanings in this section,\" \"trip\" is defined as \"travel during the period of insurance\" ((0.53, 0.57, 0.59), page 56). Similarly, on page 58, under the title \"Important meanings in this section,\" \"trip\" is again defined as \"travel during the period of insurance\" ((0.53, 0.48, 0.50), page 58). This consistency is maintained on page 62, where \"trip\" is defined identically as \"travel during the period of insurance\" ((0.53, 0.43, 0.45), page 62). Thus, the definition of \"trip\" remains unchanged across the document.", "evidence_pages": [56, 58, 62], "evidence_sources": ["Text", "Layout"], "answer_format": "String", "task_tag": "Reasoning", "subTask": ["MP_Text_Reasoning", "MP_Layout_Reasoning"], "pred": "yes", "score": 1.0} 102 | {"question_id": "free_gpt4o_4155062_5_34_7", "doc_no": "4155062", "total_pages": 64, "start_end_idx": [5, 34], "question_type": "count", "question": "How many types of infoments does the Vivaro have?", "answer": 2, "detailed_evidences": "The Vivaro offers various infotainment features across its different models and variants. 
These include:\n - Infotainment system with well-connected entertainment [12] [Text] (0.05, 0.25, 0.44, 0.4).\n - Stay in touch with high levels of infotainment and communication [12] [Pure-text] (0.11, 0.71, 0.45, 0.73).\n\nTherefore, there are altogether 2 different types of infotainment features mentioned.", "evidence_pages": [12], "evidence_sources": ["Text"], "answer_format": "Integer", "task_tag": "Reasoning", "subTask": ["SP_Text_Reasoning"], "pred": 1, "score": 0.0} 103 | -------------------------------------------------------------------------------- /scripts/calculate_metrics.sh: -------------------------------------------------------------------------------- 1 | python utils/calculate_metrics.py \ 2 | --results_file evaluation_results/api_models/results_gemini15_pro.jsonl -------------------------------------------------------------------------------- /scripts/calculate_metrics_fine_grained.sh: -------------------------------------------------------------------------------- 1 | python utils/calculate_metrics_fine_grained.py \ 2 | --results_file evaluation_results/api_models/results_gemini15_pro.jsonl -------------------------------------------------------------------------------- /scripts/eval_api_models.sh: -------------------------------------------------------------------------------- 1 | python eval/api_models/eval_api_models.py \ 2 | --qa_file data/LongDocURL.jsonl \ 3 | --results_file evaluation_results/api_models/results_gpt4o.jsonl \ 4 | --process_mode serial \ 5 | --image_prefix /mnt/achao/Downloads/pdf_pngs/4000-4999 \ 6 | --model_name gpt4o -------------------------------------------------------------------------------- /utils/__pycache__/utils_score_v3.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dengc2023/LongDocURL/84cd1ac4d314f24719e5e5e8e093d336a8c39faf/utils/__pycache__/utils_score_v3.cpython-311.pyc -------------------------------------------------------------------------------- /utils/calculate_metrics.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | 4 | import sys 5 | import pathlib 6 | sys.path.append(str(pathlib.Path(__file__).absolute().parent.parent)) 7 | 8 | from utils.utils_score_v3 import eval_score 9 | 10 | # modify: allow multiple preds and return score dict 11 | def calculate_accuracy(answers: list, annotations: list, answer_formats: list, multiple_preds: list = None): 12 | total_scores = 0.0 13 | score_list = [] 14 | for pred_ans, annotation, answer_format, multiple_pred in zip(answers, annotations, answer_formats, multiple_preds): 15 | if pred_ans == "Fail to extract": 16 | score_v3 = 0.0 17 | elif not multiple_pred: 18 | score_v3 = eval_score(annotation, pred_ans, answer_format) 19 | else: 20 | score_v3 = max([eval_score(annotation, item, answer_format) for item in pred_ans]) 21 | 22 | score_list.append(score_v3) 23 | total_scores += score_v3 24 | 25 | generalized_score = total_scores / len(answers) 26 | 27 | return generalized_score, score_list 28 | 29 | 30 | 31 | if __name__ == "__main__": 32 | parser = argparse.ArgumentParser() 33 | parser.add_argument('--results_file', type=str, default="") 34 | 35 | args = parser.parse_args() 36 | 37 | with open(args.results_file, "r", encoding="utf-8") as rf: 38 | samples = [json.loads(_.strip()) for _ in rf.readlines()] 39 | 40 | for sample in samples: 41 | assert "pred" in sample 42 | 43 | answers = [_["pred"] for _ in samples] 44 | annotations 
= [_["answer"] for _ in samples] 45 | answer_formats = [_["answer_format"] for _ in samples] 46 | 47 | # modify 48 | multiple_preds = [True if "multiple_pred" in _ else False for _ in samples] # for modified_answer case 49 | # multiple_preds = [False for _ in samples] # for non-modified_answer case 50 | 51 | generalized_score, score_list = calculate_accuracy(answers, annotations, answer_formats, multiple_preds) # calculate on size of successful samples 52 | rectified_generalized_score = generalized_score * len(answers) / 2325 # calculate on size of 2325 53 | 54 | # Update the score field in samples with scores from score_list 55 | for sample, score in zip(samples, score_list): 56 | sample["score"] = score 57 | 58 | # Write the updated results to a new file in the same directory 59 | output_file = args.results_file.replace(".jsonl", "_scored.jsonl") 60 | with open(output_file, "w", encoding="utf-8") as wf: 61 | for sample in samples: 62 | wf.write(json.dumps(sample) + "\n") 63 | print(f"Scores updated and saved to: {output_file}") 64 | 65 | 66 | print("--------------------------------------") 67 | print("Avg. acc: {}".format(generalized_score)) 68 | print("Rectified Avg. acc: {}".format(rectified_generalized_score)) 69 | -------------------------------------------------------------------------------- /utils/calculate_metrics_fine_grained.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | 4 | import sys 5 | import pathlib 6 | sys.path.append(str(pathlib.Path(__file__).absolute().parent.parent)) 7 | 8 | from utils.utils_score_v3 import eval_score 9 | 10 | def calculate_accuracy_fine_grained(samples, score_dict): 11 | for sample in samples: 12 | pred_ans, annotation, answer_format, multiple_pred = sample["pred"], sample["answer"], sample["answer_format"], True if "multiple_pred" in sample else False 13 | if pred_ans == "Fail to extract": 14 | score_v3 = 0.0 15 | elif not multiple_pred: 16 | score_v3 = eval_score(annotation, pred_ans, answer_format) 17 | else: 18 | score_v3 = max([eval_score(annotation, item, answer_format) for item in pred_ans]) 19 | sample["score_v3"] = score_v3 20 | 21 | # Main_Task 22 | for sample in samples: 23 | score_dict["Main_Task"][sample["task_tag"]] += sample["score_v3"] 24 | 25 | # Element_Type 26 | for sample in samples: 27 | for evidence_source in sample["evidence_sources"]: 28 | if evidence_source in ["Text", "Layout", "Figure", "Table"]: 29 | score_dict["Element_Type"][evidence_source] += sample["score_v3"] 30 | 31 | # Evidence_Pages 32 | for sample in samples: 33 | if len(sample["evidence_pages"]) > 1: 34 | score_dict["Evidence_Pages"]["Multi_Page"] += sample["score_v3"] 35 | elif len(sample["evidence_pages"]) == 1: 36 | score_dict["Evidence_Pages"]["Single_Page"] += sample["score_v3"] 37 | 38 | # Num_of_Element_Types 39 | for sample in samples: 40 | if len(sample["evidence_sources"]) > 1: 41 | score_dict["Num_of_Element_Types"]["Cross_Element"] += sample["score_v3"] 42 | 43 | # Fine_Grained 44 | for sample in samples: 45 | sub_score_dict = score_dict["Fine_Grained"][sample["task_tag"]] 46 | if sample["task_tag"] in ["Understanding", "Reasoning"]: 47 | if len(sample["evidence_pages"]) > 1: 48 | sub_sub_score_dict = sub_score_dict["Multi_Page"] 49 | elif len(sample["evidence_pages"]) == 1: 50 | sub_sub_score_dict = sub_score_dict["Single_Page"] 51 | 52 | for evidence_source in sample["evidence_sources"]: 53 | if evidence_source in ["Text", "Layout", "Figure", "Table"]: 54 | 
sub_sub_score_dict[evidence_source] += sample["score_v3"] 55 | 56 | if len(sample["evidence_pages"]) > 1: 57 | sub_score_dict["Multi_Page"] = sub_sub_score_dict 58 | elif len(sample["evidence_pages"]) == 1: 59 | sub_score_dict["Single_Page"] = sub_sub_score_dict 60 | 61 | elif sample["task_tag"] in ["Locating"]: 62 | sub_sub_score_dict = sub_score_dict["Cross_Element"] 63 | if sample["question_type"] == "topic2title": 64 | sub_sub_score_dict["Cross_Title"] += sample["score_v3"] 65 | elif sample["question_type"] == "summary2title": 66 | sub_sub_score_dict["Para_Title"] += sample["score_v3"] 67 | elif sample["question_type"] == "summary2tab": 68 | sub_sub_score_dict["Cross_Table"] += sample["score_v3"] 69 | elif sample["question_type"] == "extract_fig2tab": 70 | sub_sub_score_dict["Figure_Table"] += sample["score_v3"] 71 | 72 | sub_score_dict["Cross_Element"] = sub_sub_score_dict 73 | 74 | score_dict["Fine_Grained"][sample["task_tag"]] = sub_score_dict 75 | 76 | 77 | return score_dict 78 | 79 | 80 | 81 | if __name__ == "__main__": 82 | parser = argparse.ArgumentParser() 83 | parser.add_argument('--results_file', type=str, default="evaluation_results/api_models/results_gemini15_pro.jsonl") 84 | parser.add_argument('--score_sample_file', type=str, default="evaluation_results/scores_sample_fine_grained.json") 85 | 86 | args = parser.parse_args() 87 | 88 | with open(args.results_file, "r", encoding="utf-8") as rf: 89 | samples = [json.loads(_.strip()) for _ in rf.readlines()] 90 | 91 | with open(args.score_sample_file, "r", encoding="utf-8") as rf: 92 | _ = json.load(rf) 93 | score_dict, sample_cnt_dict = _["scores"], _["sample_cnt"] 94 | 95 | for sample in samples: 96 | assert "pred" in sample 97 | 98 | score_dict = calculate_accuracy_fine_grained(samples, score_dict) 99 | 100 | def generalize_score_dict(score_dict, sample_cnt_dict): 101 | for key, value in score_dict.items(): 102 | if isinstance(value, dict): 103 | generalize_score_dict(value, sample_cnt_dict[key]) 104 | score_dict[key] = value 105 | else: 106 | score_dict[key] /= sample_cnt_dict[key] 107 | 108 | generalize_score_dict(score_dict, sample_cnt_dict) 109 | 110 | print("--------------------------------------------------------------") 111 | print(score_dict) 112 | -------------------------------------------------------------------------------- /utils/extract_ccpdf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import fitz 4 | from PIL import Image 5 | import sys 6 | from tqdm import tqdm 7 | import argparse 8 | 9 | 10 | def get_pdf_filename(pdf_paths_txt, **kwargs): 11 | with open(pdf_paths_txt, "r", encoding="utf-8") as file: 12 | files = [line.strip() for line in file.readlines()] 13 | files = [(filename, filename.split("/")[-2], filename.split("/")[-1][:-4]) for filename in files] 14 | print("pdf file cnt: ", len(files)) 15 | if "start_id" in kwargs and "end_id" in kwargs: 16 | start_id = int(kwargs.pop("start_id")) 17 | end_id = int(kwargs.pop("end_id")) 18 | return files[start_id:end_id] 19 | else: 20 | return files 21 | 22 | 23 | def save_png(page, pp_no, doc_no, dir, zip_no): 24 | ddir = f"{dir}/{zip_no}" 25 | if not os.path.exists(ddir): 26 | os.makedirs(ddir) 27 | png_path = ddir + f"/{doc_no}_{pp_no}.png" 28 | pix = page.get_pixmap() 29 | pix.save(png_path) 30 | 31 | return png_path 32 | 33 | 34 | def resize(img_size, rectangle): 35 | ori_w, ori_h = img_size 36 | w_rate = ori_w # / 224 37 | h_rate = ori_h # / 224 38 | return [round(rectangle[0]/w_rate, 
3), round(rectangle[1]/h_rate, 3), round(rectangle[2]/w_rate, 3), round(rectangle[3]/h_rate, 3)] 39 | 40 | 41 | def extracts(pdf_path, img_save_dir, json_dir): 42 | doc_no = pdf_path.split("/")[-1][:7] # 4000001.pdf 43 | zip_no = doc_no[:4] 44 | filename = pdf_path 45 | try: 46 | pages = [] 47 | doc = fitz.open(filename) 48 | for pp in doc: 49 | pages.append(pp) 50 | except: 51 | print("error: ", filename.split('/')[-1]) 52 | return 53 | 54 | lines = [] 55 | for idx, page in enumerate(pages): 56 | try: 57 | png_path = save_png(page, idx, doc_no, img_save_dir, zip_no) 58 | my_image = Image.open(png_path) 59 | except: 60 | print("img error: ", filename.split('/')[-1]) 61 | continue 62 | 63 | word_lists = page.get_text("words") 64 | if not word_lists: 65 | continue 66 | 67 | for dd in word_lists: 68 | coordi = resize(my_image.size, list(dd[:4])) 69 | lines.append({"coordi": coordi, 70 | "word": dd[4].encode('utf-8', 'ignore').decode('utf-8'), 71 | "line_no": dd[5], 72 | "block_no": dd[6], 73 | "word_no": dd[7], 74 | "page_no": idx}) 75 | 76 | my_image.close() 77 | 78 | out_data = { 79 | "zip_no": zip_no, 80 | "doc_no": doc_no, 81 | "pdf_path": filename, 82 | "img_size": my_image.size, 83 | "contents": lines 84 | } 85 | 86 | if not os.path.exists(os.path.join(json_dir, zip_no)): 87 | os.makedirs(os.path.join(json_dir, zip_no)) 88 | with open(os.path.join(json_dir, zip_no, f"{doc_no}.json"), 'w') as file: 89 | json.dump(out_data, file, ensure_ascii=False) 90 | 91 | 92 | if __name__ == '__main__': 93 | parser = argparse.ArgumentParser() 94 | parser.add_argument("--pdf_dir", type=str, default="/mnt/achao/Downloads/ccpdf_zip/") 95 | parser.add_argument("--img_save_dir", type=str, default="/mnt/achao/Downloads/pdf_pngs/") 96 | parser.add_argument("--json_dir", type=str, default="/mnt/achao/Downloads/pdf_jsons/") 97 | args = parser.parse_args() 98 | 99 | pdf_dir = args.pdf_dir 100 | img_save_dir = args.img_save_dir 101 | json_dir = args.json_dir 102 | pdf_paths = [os.path.join(pdf_dir, pdf_name) for pdf_name in os.listdir(pdf_dir)] 103 | 104 | for pdf_path in tqdm(pdf_paths, desc="Processing ", total=len(pdf_paths), unit="item"): 105 | extracts(pdf_path, img_save_dir, json_dir) 106 | -------------------------------------------------------------------------------- /utils/run_extract_ccpdf.sh: -------------------------------------------------------------------------------- 1 | python utils/extract_ccpdf.py \ 2 | --pdf_dir /mnt/achao/Downloads/ccpdf_zip/ \ 3 | --img_save_dir /mnt/achao/Downloads/pdf_pngs/ \ 4 | --json_dir /mnt/achao/Downloads/pdf_jsons/ -------------------------------------------------------------------------------- /utils/utils_score_v3.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | from math import isclose 4 | from collections import defaultdict 5 | 6 | 7 | def levenshtein_distance(s1, s2): 8 | if len(s1) > len(s2): 9 | s1, s2 = s2, s1 10 | 11 | distances = range(len(s1) + 1) 12 | for i2, c2 in enumerate(s2): 13 | distances_ = [i2 + 1] 14 | for i1, c1 in enumerate(s1): 15 | if c1 == c2: 16 | distances_.append(distances[i1]) 17 | else: 18 | distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1]))) 19 | distances = distances_ 20 | return distances[-1] 21 | 22 | 23 | def anls_compute(groundtruth, prediction, threshold=0.5): 24 | dist = levenshtein_distance(groundtruth, prediction) 25 | length = max(len(groundtruth.upper()), len(prediction.upper())) 26 | value = 0.0 if length == 0 else 

def is_float_equal(reference, prediction, include_percentage: bool = False, is_close: bool = False) -> bool:
    def get_precision(gt_ans: float) -> int:
        precision = 3
        if '.' in str(gt_ans):
            precision = len(str(gt_ans).split('.')[-1])
        return precision

    reference = float(str(reference).strip().rstrip("%").strip())
    try:
        prediction = float(str(prediction).strip().rstrip("%").strip())
    except:
        return False

    if include_percentage:
        gt_result = [reference / 100, reference, reference * 100]
    else:
        gt_result = [reference]
    for item in gt_result:
        try:
            if is_close:
                if isclose(item, prediction, rel_tol=0.01):
                    return True
            precision = max(min(get_precision(prediction), get_precision(item)), 2)
            if round(prediction, precision) == round(item, precision):
                return True
        except Exception:
            continue
    return False


def get_clean_string(s):
    s = str(s).lower().strip()
    s = s.replace(",", "")
    if s.endswith("kg"):
        s = s.rstrip("kg").strip()
    if s.endswith("mm"):
        s = s.rstrip("mm").strip()
    if s.endswith("m"):
        s = s.rstrip("m").strip()
    if s.endswith("meters"):
        s = s.rstrip("meters").strip()
    if s.endswith("acres"):
        s = s.rstrip("acres").strip()
    if s.endswith("minutes"):
        s = s.rstrip("minutes").strip()
    if s.endswith("mile"):
        s = s.rstrip("mile").strip()
    if s.endswith("miles"):
        s = s.rstrip("miles").strip()
    if s.endswith("million"):
        s = s.rstrip("million").strip()
    if s.endswith("thousand"):
        s = s.rstrip("thousand").strip()
    if s.endswith("billion"):
        s = s.rstrip("billion").strip()
    # remove parenthesis
    s = re.sub(r'\s*\([^)]*\)', '', s).strip()
    # remove quotes
    s = re.sub(r"^['\"]|['\"]$", "", s).strip()
    s = s.strip().lstrip("$").strip()
    s = s.strip().lstrip("£").strip()
    s = s.strip().rstrip("%").strip()
    return s
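# Illustrative behaviour (examples added for documentation; not part of the upstream file):
#   get_clean_string("$1,234 million")    -> "1234"
#   get_clean_string('"Page 12 (draft)"') -> "page 12"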

def is_exact_match(s):
    flag = False
    # Website
    if "https://" in s:
        flag = True
    # code file
    if s.endswith(".py") or s.endswith("ipynb"):
        flag = True
    if s.startswith("page"):
        flag = True
    # telephone number
    if re.fullmatch(r'\b\d+(-\d+|\s\d+)?\b', s):
        flag = True
    # time
    if "a.m." in s or "p.m." in s:
        flag = True
    # YYYY-MM-DD
    if re.fullmatch(r'\b\d{4}[-\s]\d{2}[-\s]\d{2}\b', s):
        flag = True
    # YYYY-MM
    if re.fullmatch(r'\b\d{4}[-\s]\d{2}\b', s):
        flag = True
    # Email address
    if re.fullmatch(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', s):
        flag = True
    return flag


def isfloat(num):
    try:
        float(num)
        return True
    except Exception as e:
        return False


def eval_score(gt, pred, answer_type):
    if answer_type=="Integer":
        try:
            gt = get_clean_string(str(gt))
            if len(re.findall(r"\d+,\s*\d+", gt, re.DOTALL)) > 0:  # deal with Integer value formatted as "96,395"
                gt = "".join([_.strip() for _ in gt.split(",")])
            gt = int(gt)
        except:
            gt = gt
        try:
            pred = get_clean_string(str(pred))
            if len(re.findall(r"\d+,\s*\d+", pred, re.DOTALL)) > 0:  # deal with Integer value formatted as "96,395"
                pred = "".join([_.strip() for _ in pred.split(",")])
            pred = int(pred)
        except:
            pred = ""
        score = (gt==pred)
    elif answer_type=="Float":
        gt = get_clean_string(str(gt))
        pred = get_clean_string(str(pred))

        if len(re.findall(r"\d+,\s*\d+", gt, re.DOTALL)) > 0:  # deal with Integer value formatted as "96,395"
            gt = "".join([_.strip() for _ in gt.split(",")])
        try:
            gt = float(gt)
        except:
            gt = gt

        if len(re.findall(r"\d+,\s*\d+", pred, re.DOTALL)) > 0:  # deal with Integer value formatted as "96,395"
            pred = "".join([_.strip() for _ in pred.split(",")])
        try:
            pred = float(pred)
        except:
            pred = str(pred)

        try:
            score = is_float_equal(gt, pred, include_percentage=True, is_close=True)
        except:
            score = 0

    elif answer_type in ["String", "None"]:
        gt = get_clean_string(gt)
        pred = get_clean_string(pred)
        if is_exact_match(gt):
            score = (gt==pred)
        else:
            score = anls_compute(gt, pred)
    else:
        if isinstance(gt, str) and gt.startswith("["):
            try:
                gt = eval(gt)
            except:
                gt = gt
        if not isinstance(gt, list):
            gt = [gt]
        if isinstance(pred, str) and pred.startswith("["):
            try:
                pred = eval(pred)
            except:
                pred = pred
        if not isinstance(pred, list):
            pred = [pred]
        if isinstance(gt[0], dict):
            gt = ["-".join([str(value) for key,value in _.items()]) for _ in gt]
        if isinstance(pred[0], dict):
            pred = ["-".join([str(value) for key,value in _.items()]) for _ in pred]

        print(len(gt), len(pred))
        print(gt, pred)
        def cal_score_v3(gt, pred):
            gt = [get_clean_string(a) for a in gt]
            pred = [get_clean_string(a) for a in pred]
            if isfloat(gt[0]) or is_exact_match(gt[0]):
                score_v3 = ("-".join(gt)=="-".join(pred))
            else:
                greedy_scores = [max([anls_compute(str(gt_v), str(pred_v)) for pred_v in pred]) for gt_v in gt]
                score_v3 = sum(greedy_scores) / len(gt) * min(1, len(gt) / len(pred)) ** 0.5
            return score_v3

        score_v3 = cal_score_v3(gt, pred)

    score_v3 = score if answer_type in ["Integer", "Float", "String", "None"] else score_v3

    return float(score_v3)
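# Illustrative behaviour (examples added for documentation; not part of the upstream file):
#   eval_score(5, "5", "Integer")         -> 1.0  (both sides are cleaned and compared as integers)
#   eval_score("12.5%", "0.125", "Float") -> 1.0  (include_percentage also accepts the /100 form)
#   eval_score("Health Department", "health department", "String") -> 1.0  (normalised, then ANLS)
#   For list-type answers, a greedy per-item ANLS is averaged and scaled by
#   min(1, len(gt) / len(pred)) ** 0.5 to penalise over-long predictions.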

def eval_acc_and_f1(samples):
    # Accuracy / F1 over an already-loaded list of result samples; show_results relies on this helper.
    evaluated_samples = [sample for sample in samples if "score_v3" in sample]
    if not evaluated_samples:
        return 0.0, 0.0

    acc = sum([sample["score_v3"] for sample in evaluated_samples])/len(evaluated_samples)
    try:
        recall = sum([sample["score_v3"] for sample in evaluated_samples if sample["answer"]!="Not answerable"])/len([sample for sample in evaluated_samples if sample["answer"]!="Not answerable"])
        precision = sum([sample["score_v3"] for sample in evaluated_samples if sample["answer"]!="Not answerable"])/len([sample for sample in evaluated_samples if sample["pred"]!="Not answerable"])
        f1 = 2*recall*precision/(recall+precision) if (recall+precision)>0.0 else 0.0
    except:
        f1 = 0.0

    return acc, f1


def calculate_acc_and_f1(results_file):
    samples = [json.loads(_.strip()) for _ in open(results_file, "r", encoding="utf-8").readlines()]
    return eval_acc_and_f1(samples)


def show_results(samples, show_path=None):
    for sample in samples:
        sample["evidence_pages"] = eval(sample["evidence_pages"])
        sample["evidence_sources"] = eval(sample["evidence_sources"])

    with open(show_path, 'w') as f:
        acc, f1 = eval_acc_and_f1(samples)
        f.write("Overall Acc: {} | Question Number: {}\n".format(acc, len(samples)))
        f.write("Overall F1-score: {} | Question Number: {}\n".format(f1, len(samples)))
        f.write("-----------------------\n")

        #####################
        acc_single_page, _ = eval_acc_and_f1([sample for sample in samples if len(sample["evidence_pages"])==1])
        acc_multi_page, _ = eval_acc_and_f1([sample for sample in samples if len(sample["evidence_pages"])!=1 and sample["answer"]!="Not answerable"])
        acc_neg, _ = eval_acc_and_f1([sample for sample in samples if sample["answer"]=="Not answerable"])

        f.write("Single-page | Accuracy: {} | Question Number: {}\n".format(
            acc_single_page, len([sample for sample in samples if len(sample["evidence_pages"])==1])
        ))
        f.write("Cross-page | Accuracy: {} | Question Number: {}\n".format(
            acc_multi_page, len([sample for sample in samples if len(sample["evidence_pages"])!=1 and sample["answer"]!="Not answerable"])
        ))
        f.write("Unanswerable | Accuracy: {} | Question Number: {}\n".format(
            acc_neg, len([sample for sample in samples if sample["answer"]=="Not answerable"])
        ))
        f.write("-----------------------\n")

        #####################
        source_sample_dict, document_type_dict = defaultdict(list), defaultdict(list)
        for sample in samples:
            for answer_source in sample["evidence_sources"]:
                source_sample_dict[answer_source].append(sample)
            document_type_dict[sample["doc_type"]].append(sample)
        for type, sub_samples in source_sample_dict.items():
            f.write(
                "Evidence Sources: {} | Accuracy: {} | Question Number: {}\n".format(type, eval_acc_and_f1(sub_samples)[0], len(sub_samples))
            )

        f.write("-----------------------\n")
        for type, sub_samples in document_type_dict.items():
            f.write(
                "Document Type: {} | Accuracy: {} | Question Number: {}\n".format(type, eval_acc_and_f1(sub_samples)[0], len(sub_samples))
            )
--------------------------------------------------------------------------------