├── .gitignore
├── CHANGELOG.md
├── LICENSE
├── README.md
├── docs
│   ├── 2023-annual-report.pdf
│   ├── PRML.pdf
│   ├── Regulation Best Interest_Interpretive release.pdf
│   ├── Regulation Best Interest_proposed rule.pdf
│   ├── earthmover.pdf
│   ├── four-lectures.pdf
│   └── q1-fy25-earnings.pdf
├── pageindex
│   ├── __init__.py
│   ├── config.yaml
│   ├── page_index.py
│   └── utils.py
├── requirements.txt
├── results
│   ├── 2023-annual-report_structure.json
│   ├── PRML_structure.json
│   ├── Regulation Best Interest_Interpretive release_structure.json
│   ├── Regulation Best Interest_proposed rule_structure.json
│   ├── earthmover_structure.json
│   ├── four-lectures_structure.json
│   └── q1-fy25-earnings_structure.json
└── run_pageindex.py

/.gitignore:
--------------------------------------------------------------------------------
.ipynb_checkpoints
__pycache__
files
index
temp/*
chroma-collections.parquet
chroma-embeddings.parquet
.DS_Store
.env*
notebook
SDK/*
log/*
logs/
parts/*
json_results/*
--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
# Change Log
All notable changes to this project will be documented in this file.

## Beta - 2025-04-23

### Fixed
- [x] Fixed a bug introduced on April 18 where `start_index` was incorrectly passed.

## Beta - 2025-04-03

### Added
- [x] Add node_id, node summary
- [x] Add document description

### Changed
- [x] Change "child_nodes" -> "nodes" to simplify the structure
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 Vectify AI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
### ⚠️ Bug Fix Notice

A bug introduced on **April 18** has now been fixed.

If you pulled the repo between **April 18–23**, please update to the latest version:

```bash
git pull origin main
```

Thanks for your understanding 🙏


# 📄 PageIndex

Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we truly need in retrieval is **relevance**, and that requires **reasoning**. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.

🧠 **Reasoning-based RAG** offers a better alternative: enabling LLMs to *think* and *reason* their way to the most relevant document sections. Inspired by AlphaGo, we use *tree search* to perform structured document retrieval.

**[PageIndex](https://vectify.ai/pageindex)** is a *document indexing system* that builds *search tree structures* from long documents, making them ready for reasoning-based RAG.

You can self-host it with this open-source repo, or try our ☁️ [Cloud service](https://pageindex.vectify.ai/) — no setup required, with advanced features like OCR for complex PDFs.

Built by [Vectify AI](https://vectify.ai/pageindex).

---

# **⭐ What is PageIndex**

PageIndex transforms lengthy PDF documents into a semantic **tree structure**, similar to a *"table of contents"* but optimized for use with Large Language Models (LLMs).
It’s ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.

### ✅ Key Features

- **Hierarchical Tree Structure**
  Enables LLMs to traverse documents logically — like an intelligent, LLM-optimized table of contents.

- **Precise Page Referencing**
  Each node carries a summary and the physical indices of its start and end pages, allowing pinpoint retrieval.

- **Chunk-Free Segmentation**
  No arbitrary chunking. Nodes follow the natural structure of the document.

- **Scales to Massive Documents**
  Designed to handle hundreds or even thousands of pages with ease.

### 📦 PageIndex Format

Here is an example output. See more [example documents](https://github.com/VectifyAI/PageIndex/tree/main/docs) and [generated trees](https://github.com/VectifyAI/PageIndex/tree/main/results).

```
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
```
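For example, here's a minimal sketch of how such a tree can be walked in code (assuming a result file with a top-level `structure` list, like the files in `results/`):

```python
import json

def iter_nodes(node, depth=0):
    """Yield every node of a PageIndex tree together with its depth."""
    yield node, depth
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)

with open("results/2023-annual-report_structure.json") as f:
    result = json.load(f)

for root in result["structure"]:
    for node, depth in iter_nodes(root):
        print("  " * depth + f"{node['node_id']}  {node['title']}  "
              f"(pages {node['start_index']}-{node['end_index']})")
```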
---

# 🚀 Package Usage

Follow these steps to generate a PageIndex tree from a PDF document.

### 1. Install dependencies

```bash
pip3 install -r requirements.txt
```

### 2. Set your OpenAI API key

Create a `.env` file in the root directory and add your API key:

```bash
CHATGPT_API_KEY=your_openai_key_here
```

### 3. Run PageIndex on your PDF

```bash
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
```

You can customize the processing with additional optional arguments:

```
--model                    OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages          Pages to check for table of contents (default: 20)
--max-pages-per-node       Max pages per node (default: 10)
--max-tokens-per-node      Max tokens per node (default: 20000)
--if-add-node-id           Add node ID (yes/no, default: yes)
--if-add-node-summary      Add node summary (yes/no, default: no)
--if-add-doc-description   Add doc description (yes/no, default: yes)
```

---

# ☁️ Cloud API (Beta)

Don’t want to host it yourself? Try our [hosted API](https://pageindex.vectify.ai/) for PageIndex. The hosted version uses our custom OCR model to recognize PDFs more accurately, providing a better tree structure for complex documents.

You can also explore results visually with our [web Dashboard](https://pageindex.ai/files) — no coding needed.

Leave your email in [this form](https://ii2abc2jejf.typeform.com/to/meB40zV0) to receive 1,000 pages for free.

---

# 📈 Case Study: Mafin 2.5

[Mafin 2.5](https://vectify.ai/blog/Mafin2.5) is a state-of-the-art reasoning-based RAG model designed specifically for financial document analysis. Built on top of **PageIndex**, it achieved an impressive **98.7% accuracy** on the [FinanceBench](https://github.com/VectifyAI/Mafin2.5-FinanceBench) benchmark — significantly outperforming traditional vector-based RAG systems.

PageIndex’s hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.

👉 See the full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) for detailed comparisons and performance metrics.

---

# 🧠 Reasoning-Based RAG with PageIndex

Use PageIndex to build **reasoning-based retrieval systems** without relying on semantic similarity. Great for domain-specific tasks where nuance matters.

### 🔖 Preprocessing Workflow Example
1. Process documents using PageIndex to generate tree structures.
2. Store the tree structures and their corresponding document IDs in a database table.
3. Store the contents of each node in a separate table, indexed by node ID and tree ID.

### 🔖 Reasoning-Based RAG Framework Example
1. Query Preprocessing:
   - Analyze the query to identify the required knowledge
2. Document Selection:
   - Search for relevant documents and their IDs
   - Fetch the corresponding tree structures from the database
3. Node Selection:
   - Search through tree structures to identify relevant nodes
4. LLM Generation:
   - Fetch the corresponding contents of the selected nodes from the database
   - Format and extract the relevant information
   - Send the assembled context along with the original query to the LLM
   - Generate contextually informed responses (a sketch of steps 3–4 follows below)
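Below is a minimal sketch of steps 3–4, using hypothetical helpers: `build_prompt` produces a node-selection prompt like the one shown in the next section, `fetch_node_text` looks node contents up in your node table, and `llm` wraps your model client.

```python
import json

def retrieve_and_answer(question, tree, build_prompt, fetch_node_text, llm):
    """Select relevant nodes with the LLM, then answer from their contents."""
    # Step 3: ask the model which nodes are relevant
    # (assumes the model returns the JSON format the prompt requests).
    selection = json.loads(llm(build_prompt(question, tree)))
    # Step 4: assemble the selected nodes' contents into a context and generate.
    context = "\n\n".join(fetch_node_text(node_id) for node_id in selection["node_list"])
    return llm(f"Answer the question using only this context.\n\n"
               f"Context:\n{context}\n\nQuestion: {question}")
```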
### 🔖 Example Prompt for Node Selection

```python
prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Question: {question}

Document tree structure: {structure}

Reply in the following JSON format:
{{
    "thinking": <your thinking process>,
    "node_list": [node_id1, node_id2, ...]
}}
"""
```

For more examples, see the [API dashboard](https://pageindex.vectify.ai/).

---

# 🛤 Roadmap

- [x] Detailed examples of document selection, node selection, and RAG pipelines (due 2025/04/14)
- [ ] Integration of reasoning-based retrieval and semantic-based retrieval (due 2025/04/21)
- [ ] Introduction to efficient tree search methods
- [ ] Technical report on the design of PageIndex

---

# 🚧 Notice

This project is in early beta development, and all progress will remain open and transparent. We welcome you to raise issues, reach out with questions, or contribute directly to the project.

Due to the diverse structures of PDF documents, you may encounter instability during usage. For a more accurate and stable version with a leading OCR integration, please try our [hosted API for PageIndex](https://pageindex.vectify.ai/). Leave your email in [this form](https://ii2abc2jejf.typeform.com/to/meB40zV0) to receive 1,000 pages for free.

Together, let's push forward the revolution of reasoning-based RAG systems.

---

# 📬 Contact Us

Need customized support for your documents or reasoning-based RAG system?
205 | 206 | :loudspeaker: [Join our Discord](https://discord.com/invite/nnyyEdT2RG) 207 | 208 | :envelope: [Leave us a message](https://ii2abc2jejf.typeform.com/to/meB40zV0) 209 | -------------------------------------------------------------------------------- /docs/2023-annual-report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/VectifyAI/PageIndex/8d3cd94fd07c4fa25fc777c0f7a6ec8d4d9b8fa9/docs/2023-annual-report.pdf -------------------------------------------------------------------------------- /docs/PRML.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/VectifyAI/PageIndex/8d3cd94fd07c4fa25fc777c0f7a6ec8d4d9b8fa9/docs/PRML.pdf -------------------------------------------------------------------------------- /docs/Regulation Best Interest_Interpretive release.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/VectifyAI/PageIndex/8d3cd94fd07c4fa25fc777c0f7a6ec8d4d9b8fa9/docs/Regulation Best Interest_Interpretive release.pdf -------------------------------------------------------------------------------- /docs/Regulation Best Interest_proposed rule.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/VectifyAI/PageIndex/8d3cd94fd07c4fa25fc777c0f7a6ec8d4d9b8fa9/docs/Regulation Best Interest_proposed rule.pdf -------------------------------------------------------------------------------- /docs/earthmover.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/VectifyAI/PageIndex/8d3cd94fd07c4fa25fc777c0f7a6ec8d4d9b8fa9/docs/earthmover.pdf -------------------------------------------------------------------------------- /docs/four-lectures.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/VectifyAI/PageIndex/8d3cd94fd07c4fa25fc777c0f7a6ec8d4d9b8fa9/docs/four-lectures.pdf -------------------------------------------------------------------------------- /docs/q1-fy25-earnings.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/VectifyAI/PageIndex/8d3cd94fd07c4fa25fc777c0f7a6ec8d4d9b8fa9/docs/q1-fy25-earnings.pdf -------------------------------------------------------------------------------- /pageindex/__init__.py: -------------------------------------------------------------------------------- 1 | from .page_index import * -------------------------------------------------------------------------------- /pageindex/config.yaml: -------------------------------------------------------------------------------- 1 | model: "gpt-4o-2024-11-20" 2 | toc_check_page_num: 20 3 | max_page_num_each_node: 10 4 | max_token_num_each_node: 20000 5 | if_add_node_id: "yes" 6 | if_add_node_summary: "no" 7 | if_add_doc_description: "yes" 8 | if_add_node_text: "no" -------------------------------------------------------------------------------- /pageindex/page_index.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import copy 4 | import math 5 | import random 6 | import re 7 | from .utils import * 8 | import os 9 | from concurrent.futures import ThreadPoolExecutor, as_completed 10 | 11 | 12 | ################### check title in page 
#########################################################
async def check_title_appearance(item, page_list, start_index=1, model=None):
    title = item['title']
    if 'physical_index' not in item or item['physical_index'] is None:
        return {'list_index': item.get('list_index'), 'answer': 'no', 'title': title, 'page_number': None}

    page_number = item['physical_index']
    page_text = page_list[page_number-start_index][0]

    prompt = f"""
    Your job is to check if the given section appears or starts in the given page_text.

    Note: do fuzzy matching, ignore any space inconsistency in the page_text.

    The given section title is {title}.
    The given page_text is {page_text}.

    Reply format:
    {{
        "thinking": <your thinking>,
        "answer": "yes or no" (yes if the section appears or starts in the page_text, no otherwise)
    }}
    Directly return the final JSON structure. Do not output anything else."""

    response = await ChatGPT_API_async(model=model, prompt=prompt)
    response = extract_json(response)
    if 'answer' in response:
        answer = response['answer']
    else:
        answer = 'no'
    return {'list_index': item['list_index'], 'answer': answer, 'title': title, 'page_number': page_number}


async def check_title_appearance_in_start(title, page_text, model=None, logger=None):
    prompt = f"""
    You will be given the current section title and the current page_text.
    Your job is to check if the current section starts at the beginning of the given page_text.
    If there is other content before the current section title, then the current section does not start at the beginning of the given page_text.
    If the current section title is the first content in the given page_text, then the current section starts at the beginning of the given page_text.

    Note: do fuzzy matching, ignore any space inconsistency in the page_text.

    The given section title is {title}.
    The given page_text is {page_text}.

    Reply format:
    {{
        "thinking": <your thinking>,
        "start_begin": "yes or no" (yes if the section starts at the beginning of the page_text, no otherwise)
    }}
    Directly return the final JSON structure. Do not output anything else."""

    response = await ChatGPT_API_async(model=model, prompt=prompt)
    response = extract_json(response)
    if logger:
        logger.info(f"Response: {response}")
    return response.get("start_begin", "no")
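# Illustrative usage sketch (hypothetical helper, not part of the original module):
# running a single title check outside the async pipeline, e.g. to debug one TOC entry.
# Assumes `page_list` comes from get_page_tokens() and CHATGPT_API_KEY is set.
def debug_check_single_title(item, page_list, model="gpt-4o-2024-11-20"):
    """Synchronously run check_title_appearance for one TOC item, e.g.
    item = {'list_index': 0, 'title': 'Financial Stability', 'physical_index': 21}."""
    return asyncio.run(check_title_appearance(item, page_list, start_index=1, model=model))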
async def check_title_appearance_in_start_concurrent(structure, page_list, model=None, logger=None):
    if logger:
        logger.info("Checking title appearance in start concurrently")

    # skip items without physical_index
    for item in structure:
        if item.get('physical_index') is None:
            item['appear_start'] = 'no'

    # only for items with valid physical_index
    tasks = []
    valid_items = []
    for item in structure:
        if item.get('physical_index') is not None:
            page_text = page_list[item['physical_index'] - 1][0]
            tasks.append(check_title_appearance_in_start(item['title'], page_text, model=model, logger=logger))
            valid_items.append(item)

    results = await asyncio.gather(*tasks, return_exceptions=True)
    for item, result in zip(valid_items, results):
        if isinstance(result, Exception):
            if logger:
                logger.error(f"Error checking start for {item['title']}: {result}")
            item['appear_start'] = 'no'
        else:
            item['appear_start'] = result

    return structure


def toc_detector_single_page(content, model=None):
    prompt = f"""
    Your job is to detect if there is a table of contents provided in the given text.

    Given text: {content}

    Return the following JSON format:
    {{
        "thinking": <your thinking>,
        "toc_detected": "<yes or no>",
    }}

    Directly return the final JSON structure. Do not output anything else.
    Please note: abstract, summary, notation list, figure list, table list, etc. are not tables of contents."""

    response = ChatGPT_API(model=model, prompt=prompt)
    # print('response', response)
    json_content = extract_json(response)
    return json_content['toc_detected']


def check_if_toc_extraction_is_complete(content, toc, model=None):
    prompt = f"""
    You are given a partial document and a table of contents.
    Your job is to check if the table of contents is complete, i.e. whether it contains all the main sections of the partial document.

    Reply format:
    {{
        "thinking": <your thinking>,
        "completed": "yes" or "no"
    }}
    Directly return the final JSON structure. Do not output anything else."""

    prompt = prompt + '\n Document:\n' + content + '\n Table of contents:\n' + toc
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    return json_content['completed']


def check_if_toc_transformation_is_complete(content, toc, model=None):
    prompt = f"""
    You are given a raw table of contents and a cleaned table of contents.
    Your job is to check if the cleaned table of contents is complete.

    Reply format:
    {{
        "thinking": <your thinking>,
        "completed": "yes" or "no"
    }}
    Directly return the final JSON structure. Do not output anything else."""

    prompt = prompt + '\n Raw Table of contents:\n' + content + '\n Cleaned Table of contents:\n' + toc
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    return json_content['completed']
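# Illustrative sketch (hypothetical helper, not in the original file): the detectors
# above all follow the same contract -- build a prompt, call the model once, and pull
# a single field out of its JSON reply. A generic wrapper for that pattern:
def ask_json_field(prompt, field, default='no', model=None):
    """Call the model once and extract one field from its JSON answer."""
    response = ChatGPT_API(model=model, prompt=prompt)
    return extract_json(response).get(field, default)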
def extract_toc_content(content, model=None):
    prompt = f"""
    Your job is to extract the full table of contents from the given text, replace ... with :

    Given text: {content}

    Directly return the full table of contents content. Do not output anything else."""

    response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)

    if_complete = check_if_toc_transformation_is_complete(content, response, model)
    if if_complete == "yes" and finish_reason == "finished":
        return response

    chat_history = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    prompt = f"""please continue the generation of the table of contents, directly output the remaining part of the structure"""
    new_response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt, chat_history=chat_history)
    response = response + new_response
    if_complete = check_if_toc_transformation_is_complete(content, response, model)

    while not (if_complete == "yes" and finish_reason == "finished"):
        chat_history = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
        prompt = f"""please continue the generation of the table of contents, directly output the remaining part of the structure"""
        new_response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt, chat_history=chat_history)
        response = response + new_response
        if_complete = check_if_toc_transformation_is_complete(content, response, model)

        # Bail out after a bounded number of continuation attempts to prevent infinite loops
        if len(chat_history) > 5:
            raise Exception('Failed to complete table of contents after maximum retries')

    return response

def detect_page_index(toc_content, model=None):
    print('start detect_page_index')
    prompt = f"""
    You will be given a table of contents.

    Your job is to detect if there are page numbers/indices given within the table of contents.

    Given text: {toc_content}

    Reply format:
    {{
        "thinking": <your thinking>,
        "page_index_given_in_toc": "<yes or no>"
    }}
    Directly return the final JSON structure. Do not output anything else."""

    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    return json_content['page_index_given_in_toc']
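# Illustrative usage sketch (hypothetical helper, not in the original file): chaining
# the two steps above -- extract a clean TOC from the detected TOC pages, then ask
# whether it carries page numbers. Assumes `page_list` from get_page_tokens() and a
# `toc_page_list` of page indices such as the one produced by find_toc_pages() later
# in this file.
def sketch_probe_toc(page_list, toc_page_list, model=None):
    raw_toc = "".join(page_list[i][0] for i in toc_page_list)
    cleaned_toc = extract_toc_content(raw_toc, model=model)
    return cleaned_toc, detect_page_index(cleaned_toc, model=model)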
def toc_extractor(page_list, toc_page_list, model):
    def transform_dots_to_colon(text):
        text = re.sub(r'\.{5,}', ': ', text)
        # Handle dots separated by spaces
        text = re.sub(r'(?:\. ){5,}\.?', ': ', text)
        return text

    toc_content = ""
    for page_index in toc_page_list:
        toc_content += page_list[page_index][0]
    toc_content = transform_dots_to_colon(toc_content)
    has_page_index = detect_page_index(toc_content, model=model)

    return {
        "toc_content": toc_content,
        "page_index_given_in_toc": has_page_index
    }


def toc_index_extractor(toc, content, model=None):
    print('start toc_index_extractor')
    toc_extractor_prompt = """
    You are given a table of contents in a json format and several pages of a document, your job is to add the physical_index to the table of contents in the json format.

    The provided pages contain tags like <physical_index_X> and <physical_index_X> to indicate the physical location of the page X.

    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.

    The response should be in the following JSON format:
    [
        {
            "structure": <structure index, "x.x.x" or None> (string),
            "title": <title of the section>,
            "physical_index": "<physical_index_X>" (keep the format)
        },
        ...
    ]

    Only add the physical_index to the sections that are in the provided pages.
    If the section is not in the provided pages, do not add the physical_index to it.
    Directly return the final JSON structure. Do not output anything else."""

    prompt = toc_extractor_prompt + '\nTable of contents:\n' + str(toc) + '\nDocument pages:\n' + content
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    return json_content


def toc_transformer(toc_content, model=None):
    print('start toc_transformer')
    init_prompt = """
    You are given a table of contents. Your job is to transform the whole table of contents into a JSON structure that includes table_of_contents.

    structure is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.

    The response should be in the following JSON format:
    {
        table_of_contents: [
            {
                "structure": <structure index, "x.x.x" or None> (string),
                "title": <title of the section>,
                "page": <page number or None>,
            },
            ...
        ],
    }
    You should transform the full table of contents in one go.
    Directly return the final JSON structure, do not output anything else.
""" 290 | 291 | prompt = init_prompt + '\n Given table of contents\n:' + toc_content 292 | last_complete, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt) 293 | if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model) 294 | if if_complete == "yes" and finish_reason == "finished": 295 | last_complete = extract_json(last_complete) 296 | cleaned_response=convert_page_to_int(last_complete['table_of_contents']) 297 | return cleaned_response 298 | 299 | last_complete = get_json_content(last_complete) 300 | while not (if_complete == "yes" and finish_reason == "finished"): 301 | position = last_complete.rfind('}') 302 | if position != -1: 303 | last_complete = last_complete[:position+2] 304 | prompt = f""" 305 | Your task is to continue the table of contents json structure, directly output the remaining part of the json structure. 306 | The response should be in the following JSON format: 307 | 308 | The raw table of contents json structure is: 309 | {toc_content} 310 | 311 | The incomplete transformed table of contents json structure is: 312 | {last_complete} 313 | 314 | Please continue the json structure, directly output the remaining part of the json structure.""" 315 | 316 | new_complete, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt) 317 | 318 | if new_complete.startswith('```json'): 319 | new_complete = get_json_content(new_complete) 320 | last_complete = last_complete+new_complete 321 | 322 | if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model) 323 | 324 | 325 | last_complete = json.loads(last_complete) 326 | 327 | cleaned_response=convert_page_to_int(last_complete['table_of_contents']) 328 | return cleaned_response 329 | 330 | 331 | 332 | 333 | def find_toc_pages(start_page_index, page_list, opt, logger=None): 334 | print('start find_toc_pages') 335 | last_page_is_yes = False 336 | toc_page_list = [] 337 | i = start_page_index 338 | 339 | while i < len(page_list): 340 | # Only check beyond max_pages if we're still finding TOC pages 341 | if i >= opt.toc_check_page_num and not last_page_is_yes: 342 | break 343 | detected_result = toc_detector_single_page(page_list[i][0],model=opt.model) 344 | if detected_result == 'yes': 345 | if logger: 346 | logger.info(f'Page {i} has toc') 347 | toc_page_list.append(i) 348 | last_page_is_yes = True 349 | elif detected_result == 'no' and last_page_is_yes: 350 | if logger: 351 | logger.info(f'Found the last page with toc: {i-1}') 352 | break 353 | i += 1 354 | 355 | if not toc_page_list and logger: 356 | logger.info('No toc found') 357 | 358 | return toc_page_list 359 | 360 | def remove_page_number(data): 361 | if isinstance(data, dict): 362 | data.pop('page_number', None) 363 | for key in list(data.keys()): 364 | if 'nodes' in key: 365 | remove_page_number(data[key]) 366 | elif isinstance(data, list): 367 | for item in data: 368 | remove_page_number(item) 369 | return data 370 | 371 | def extract_matching_page_pairs(toc_page, toc_physical_index, start_page_index): 372 | pairs = [] 373 | for phy_item in toc_physical_index: 374 | for page_item in toc_page: 375 | if phy_item.get('title') == page_item.get('title'): 376 | physical_index = phy_item.get('physical_index') 377 | if physical_index is not None and int(physical_index) >= start_page_index: 378 | pairs.append({ 379 | 'title': phy_item.get('title'), 380 | 'page': page_item.get('page'), 381 | 'physical_index': physical_index 382 | }) 383 | return pairs 384 | 385 | 386 | def 
calculate_page_offset(pairs): 387 | differences = [] 388 | for pair in pairs: 389 | try: 390 | physical_index = pair['physical_index'] 391 | page_number = pair['page'] 392 | difference = physical_index - page_number 393 | differences.append(difference) 394 | except (KeyError, TypeError): 395 | continue 396 | 397 | if not differences: 398 | return None 399 | 400 | difference_counts = {} 401 | for diff in differences: 402 | difference_counts[diff] = difference_counts.get(diff, 0) + 1 403 | 404 | most_common = max(difference_counts.items(), key=lambda x: x[1])[0] 405 | 406 | return most_common 407 | 408 | def add_page_offset_to_toc_json(data, offset): 409 | for i in range(len(data)): 410 | if data[i].get('page') is not None and isinstance(data[i]['page'], int): 411 | data[i]['physical_index'] = data[i]['page'] + offset 412 | del data[i]['page'] 413 | 414 | return data 415 | 416 | 417 | 418 | def page_list_to_group_text(page_contents, token_lengths, max_tokens=20000, overlap_page=1): 419 | num_tokens = sum(token_lengths) 420 | 421 | if num_tokens <= max_tokens: 422 | # merge all pages into one text 423 | page_text = "".join(page_contents) 424 | return [page_text] 425 | 426 | subsets = [] 427 | current_subset = [] 428 | current_token_count = 0 429 | 430 | expected_parts_num = math.ceil(num_tokens / max_tokens) 431 | average_tokens_per_part = math.ceil(((num_tokens / expected_parts_num) + max_tokens) / 2) 432 | 433 | for i, (page_content, page_tokens) in enumerate(zip(page_contents, token_lengths)): 434 | if current_token_count + page_tokens > average_tokens_per_part: 435 | 436 | subsets.append(''.join(current_subset)) 437 | # Start new subset from overlap if specified 438 | overlap_start = max(i - overlap_page, 0) 439 | current_subset = page_contents[overlap_start:i] 440 | current_token_count = sum(token_lengths[overlap_start:i]) 441 | 442 | # Add current page to the subset 443 | current_subset.append(page_content) 444 | current_token_count += page_tokens 445 | 446 | # Add the last subset if it contains any pages 447 | if current_subset: 448 | subsets.append(''.join(current_subset)) 449 | 450 | print('divide page_list to groups', len(subsets)) 451 | return subsets 452 | 453 | def add_page_number_to_toc(part, structure, model=None): 454 | fill_prompt_seq = """ 455 | You are given an JSON structure of a document and a partial part of the document. Your task is to check if the title that is described in the structure is started in the partial given document. 456 | 457 | The provided text contains tags like <physical_index_X> and <physical_index_X> to indicate the physical location of the page X. 458 | 459 | If the full target section starts in the partial given document, insert the given JSON structure with the "start": "yes", and "start_index": "<physical_index_X>". 460 | 461 | If the full target section does not start in the partial given document, insert "start": "no", "start_index": None. 462 | 463 | The response should be in the following format. 464 | [ 465 | { 466 | "structure": <structure index, "x.x.x" or None> (string), 467 | "title": <title of the section>, 468 | "start": "<yes or no>", 469 | "physical_index": "<physical_index_X> (keep the format)" or None 470 | }, 471 | ... 472 | ] 473 | The given structure contains the result of the previous part, you need to fill the result of the current part, do not change the previous result. 474 | Directly return the final JSON structure. 
Do not output anything else.""" 475 | 476 | prompt = fill_prompt_seq + f"\n\nCurrent Partial Document:\n{part}\n\nGiven Structure\n{json.dumps(structure, indent=2)}\n" 477 | current_json_raw = ChatGPT_API(model=model, prompt=prompt) 478 | json_result = extract_json(current_json_raw) 479 | 480 | for item in json_result: 481 | if 'start' in item: 482 | del item['start'] 483 | return json_result 484 | 485 | 486 | def remove_first_physical_index_section(text): 487 | """ 488 | Removes the first section between <physical_index_X> and <physical_index_X> tags, 489 | and returns the remaining text. 490 | """ 491 | pattern = r'<physical_index_\d+>.*?<physical_index_\d+>' 492 | match = re.search(pattern, text, re.DOTALL) 493 | if match: 494 | # Remove the first matched section 495 | return text.replace(match.group(0), '', 1) 496 | return text 497 | 498 | ### add verify completeness 499 | def generate_toc_continue(toc_content, part, model="gpt-4o-2024-11-20"): 500 | print('start generate_toc_continue') 501 | prompt = """ 502 | You are an expert in extracting hierarchical tree structure. 503 | You are given a tree structure of the previous part and the text of the current part. 504 | Your task is to continue the tree structure from the previous part to include the current part. 505 | 506 | The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc. 507 | 508 | For the title, you need to extract the original title from the text, only fix the space inconsistency. 509 | 510 | The provided text contains tags like <physical_index_X> and <physical_index_X> to indicate the start and end of page X. \ 511 | 512 | For the physical_index, you need to extract the physical index of the start of the section from the text. Keep the <physical_index_X> format. 513 | 514 | The response should be in the following format. 515 | [ 516 | { 517 | "structure": <structure index, "x.x.x"> (string), 518 | "title": <title of the section, keep the original title>, 519 | "physical_index": "<physical_index_X> (keep the format)" 520 | }, 521 | ... 522 | ] 523 | 524 | Directly return the additional part of the final JSON structure. Do not output anything else.""" 525 | 526 | prompt = prompt + '\nGiven text\n:' + part + '\nPrevious tree structure\n:' + json.dumps(toc_content, indent=2) 527 | response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt) 528 | if finish_reason == 'finished': 529 | return extract_json(response) 530 | else: 531 | raise Exception(f'finish reason: {finish_reason}') 532 | 533 | ### add verify completeness 534 | def generate_toc_init(part, model=None): 535 | print('start generate_toc_init') 536 | prompt = """ 537 | You are an expert in extracting hierarchical tree structure, your task is to generate the tree structure of the document. 538 | 539 | The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc. 540 | 541 | For the title, you need to extract the original title from the text, only fix the space inconsistency. 542 | 543 | The provided text contains tags like <physical_index_X> and <physical_index_X> to indicate the start and end of page X. 
544 | 545 | For the physical_index, you need to extract the physical index of the start of the section from the text. Keep the <physical_index_X> format. 546 | 547 | The response should be in the following format. 548 | [ 549 | {{ 550 | "structure": <structure index, "x.x.x"> (string), 551 | "title": <title of the section, keep the original title>, 552 | "physical_index": "<physical_index_X> (keep the format)" 553 | }}, 554 | 555 | ], 556 | 557 | 558 | Directly return the final JSON structure. Do not output anything else.""" 559 | 560 | prompt = prompt + '\nGiven text\n:' + part 561 | response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt) 562 | 563 | if finish_reason == 'finished': 564 | return extract_json(response) 565 | else: 566 | raise Exception(f'finish reason: {finish_reason}') 567 | 568 | def process_no_toc(page_list, start_index=1, model=None, logger=None): 569 | page_contents=[] 570 | token_lengths=[] 571 | for page_index in range(start_index, start_index+len(page_list)): 572 | page_text = f"<physical_index_{page_index}>\n{page_list[page_index-start_index][0]}\n<physical_index_{page_index}>\n\n" 573 | page_contents.append(page_text) 574 | token_lengths.append(count_tokens(page_text, model)) 575 | group_texts = page_list_to_group_text(page_contents, token_lengths) 576 | logger.info(f'len(group_texts): {len(group_texts)}') 577 | 578 | toc_with_page_number= generate_toc_init(group_texts[0], model) 579 | for group_text in group_texts[1:]: 580 | toc_with_page_number_additional = generate_toc_continue(toc_with_page_number, group_text, model) 581 | toc_with_page_number.extend(toc_with_page_number_additional) 582 | logger.info(f'generate_toc: {toc_with_page_number}') 583 | 584 | toc_with_page_number = convert_physical_index_to_int(toc_with_page_number) 585 | logger.info(f'convert_physical_index_to_int: {toc_with_page_number}') 586 | 587 | return toc_with_page_number 588 | 589 | def process_toc_no_page_numbers(toc_content, toc_page_list, page_list, start_index=1, model=None, logger=None): 590 | page_contents=[] 591 | token_lengths=[] 592 | toc_content = toc_transformer(toc_content, model) 593 | logger.info(f'toc_transformer: {toc_content}') 594 | for page_index in range(start_index, start_index+len(page_list)): 595 | page_text = f"<physical_index_{page_index}>\n{page_list[page_index-start_index][0]}\n<physical_index_{page_index}>\n\n" 596 | page_contents.append(page_text) 597 | token_lengths.append(count_tokens(page_text, model)) 598 | 599 | group_texts = page_list_to_group_text(page_contents, token_lengths) 600 | logger.info(f'len(group_texts): {len(group_texts)}') 601 | 602 | toc_with_page_number=copy.deepcopy(toc_content) 603 | for group_text in group_texts: 604 | toc_with_page_number = add_page_number_to_toc(group_text, toc_with_page_number, model) 605 | logger.info(f'add_page_number_to_toc: {toc_with_page_number}') 606 | 607 | toc_with_page_number = convert_physical_index_to_int(toc_with_page_number) 608 | logger.info(f'convert_physical_index_to_int: {toc_with_page_number}') 609 | 610 | return toc_with_page_number 611 | 612 | 613 | 614 | def process_toc_with_page_numbers(toc_content, toc_page_list, page_list, toc_check_page_num=None, model=None, logger=None): 615 | toc_with_page_number = toc_transformer(toc_content, model) 616 | logger.info(f'toc_with_page_number: {toc_with_page_number}') 617 | 618 | toc_no_page_number = remove_page_number(copy.deepcopy(toc_with_page_number)) 619 | 620 | start_page_index = toc_page_list[-1] + 1 621 | main_content = "" 
622 | for page_index in range(start_page_index, min(start_page_index + toc_check_page_num, len(page_list))): 623 | main_content += f"<physical_index_{page_index+1}>\n{page_list[page_index][0]}\n<physical_index_{page_index+1}>\n\n" 624 | 625 | toc_with_physical_index = toc_index_extractor(toc_no_page_number, main_content, model) 626 | logger.info(f'toc_with_physical_index: {toc_with_physical_index}') 627 | 628 | toc_with_physical_index = convert_physical_index_to_int(toc_with_physical_index) 629 | logger.info(f'toc_with_physical_index: {toc_with_physical_index}') 630 | 631 | matching_pairs = extract_matching_page_pairs(toc_with_page_number, toc_with_physical_index, start_page_index) 632 | logger.info(f'matching_pairs: {matching_pairs}') 633 | 634 | offset = calculate_page_offset(matching_pairs) 635 | logger.info(f'offset: {offset}') 636 | 637 | toc_with_page_number = add_page_offset_to_toc_json(toc_with_page_number, offset) 638 | logger.info(f'toc_with_page_number: {toc_with_page_number}') 639 | 640 | toc_with_page_number = process_none_page_numbers(toc_with_page_number, page_list, model=model) 641 | logger.info(f'toc_with_page_number: {toc_with_page_number}') 642 | 643 | return toc_with_page_number 644 | 645 | 646 | 647 | ##check if needed to process none page numbers 648 | def process_none_page_numbers(toc_items, page_list, start_index=1, model=None): 649 | for i, item in enumerate(toc_items): 650 | if "physical_index" not in item: 651 | # logger.info(f"fix item: {item}") 652 | # Find previous physical_index 653 | prev_physical_index = 0 # Default if no previous item exists 654 | for j in range(i - 1, -1, -1): 655 | if toc_items[j].get('physical_index') is not None: 656 | prev_physical_index = toc_items[j]['physical_index'] 657 | break 658 | 659 | # Find next physical_index 660 | next_physical_index = -1 # Default if no next item exists 661 | for j in range(i + 1, len(toc_items)): 662 | if toc_items[j].get('physical_index') is not None: 663 | next_physical_index = toc_items[j]['physical_index'] 664 | break 665 | 666 | page_contents = [] 667 | for page_index in range(prev_physical_index, next_physical_index+1): 668 | page_text = f"<physical_index_{page_index}>\n{page_list[page_index-start_index][0]}\n<physical_index_{page_index}>\n\n" 669 | page_contents.append(page_text) 670 | 671 | item_copy = copy.deepcopy(item) 672 | del item_copy['page'] 673 | result = add_page_number_to_toc(page_contents, item_copy, model) 674 | if isinstance(result[0]['physical_index'], str) and result[0]['physical_index'].startswith('<physical_index'): 675 | item['physical_index'] = int(result[0]['physical_index'].split('_')[-1].rstrip('>').strip()) 676 | del item['page'] 677 | 678 | return toc_items 679 | 680 | 681 | 682 | 683 | def check_toc(page_list, opt=None): 684 | toc_page_list = find_toc_pages(start_page_index=0, page_list=page_list, opt=opt) 685 | if len(toc_page_list) == 0: 686 | print('no toc found') 687 | return {'toc_content': None, 'toc_page_list': [], 'page_index_given_in_toc': 'no'} 688 | else: 689 | print('toc found') 690 | toc_json = toc_extractor(page_list, toc_page_list, opt.model) 691 | 692 | if toc_json['page_index_given_in_toc'] == 'yes': 693 | print('index found') 694 | return {'toc_content': toc_json['toc_content'], 'toc_page_list': toc_page_list, 'page_index_given_in_toc': 'yes'} 695 | else: 696 | current_start_index = toc_page_list[-1] + 1 697 | 698 | while (toc_json['page_index_given_in_toc'] == 'no' and 699 | current_start_index < len(page_list) and 700 | current_start_index < 
opt.toc_check_page_num): 701 | 702 | additional_toc_pages = find_toc_pages( 703 | start_page_index=current_start_index, 704 | page_list=page_list, 705 | opt=opt 706 | ) 707 | 708 | if len(additional_toc_pages) == 0: 709 | break 710 | 711 | additional_toc_json = toc_extractor(page_list, additional_toc_pages, opt.model) 712 | if additional_toc_json['page_index_given_in_toc'] == 'yes': 713 | print('index found') 714 | return {'toc_content': additional_toc_json['toc_content'], 'toc_page_list': additional_toc_pages, 'page_index_given_in_toc': 'yes'} 715 | 716 | else: 717 | current_start_index = additional_toc_pages[-1] + 1 718 | print('index not found') 719 | return {'toc_content': toc_json['toc_content'], 'toc_page_list': toc_page_list, 'page_index_given_in_toc': 'no'} 720 | 721 | 722 | 723 | 724 | 725 | 726 | ################### fix incorrect toc ######################################################### 727 | def single_toc_item_index_fixer(section_title, content, model="gpt-4o-2024-11-20"): 728 | tob_extractor_prompt = """ 729 | You are given a section title and several pages of a document, your job is to find the physical index of the start page of the section in the partial document. 730 | 731 | The provided pages contains tags like <physical_index_X> and <physical_index_X> to indicate the physical location of the page X. 732 | 733 | Reply in a JSON format: 734 | { 735 | "thinking": <explain which page, started and closed by <physical_index_X>, contains the start of this section>, 736 | "physical_index": "<physical_index_X>" (keep the format) 737 | } 738 | Directly return the final JSON structure. Do not output anything else.""" 739 | 740 | prompt = tob_extractor_prompt + '\nSection Title:\n' + str(section_title) + '\nDocument pages:\n' + content 741 | response = ChatGPT_API(model=model, prompt=prompt) 742 | json_content = extract_json(response) 743 | return convert_physical_index_to_int(json_content['physical_index']) 744 | 745 | 746 | 747 | async def fix_incorrect_toc(toc_with_page_number, page_list, incorrect_results, start_index=1, model=None, logger=None): 748 | print(f'start fix_incorrect_toc with {len(incorrect_results)} incorrect results') 749 | incorrect_indices = {result['list_index'] for result in incorrect_results} 750 | 751 | end_index = len(page_list) + start_index - 1 752 | 753 | incorrect_results_and_range_logs = [] 754 | # Helper function to process and check a single incorrect item 755 | async def process_and_check_item(incorrect_item): 756 | list_index = incorrect_item['list_index'] 757 | # Find the previous correct item 758 | prev_correct = None 759 | for i in range(list_index-1, -1, -1): 760 | if i not in incorrect_indices: 761 | prev_correct = toc_with_page_number[i]['physical_index'] 762 | break 763 | # If no previous correct item found, use start_index 764 | if prev_correct is None: 765 | prev_correct = start_index - 1 766 | 767 | # Find the next correct item 768 | next_correct = None 769 | for i in range(list_index+1, len(toc_with_page_number)): 770 | if i not in incorrect_indices: 771 | next_correct = toc_with_page_number[i]['physical_index'] 772 | break 773 | # If no next correct item found, use end_index 774 | if next_correct is None: 775 | next_correct = end_index 776 | 777 | incorrect_results_and_range_logs.append({ 778 | 'list_index': list_index, 779 | 'title': incorrect_item['title'], 780 | 'prev_correct': prev_correct, 781 | 'next_correct': next_correct 782 | }) 783 | 784 | page_contents=[] 785 | for page_index in range(prev_correct, next_correct+1): 786 | 
page_text = f"<physical_index_{page_index}>\n{page_list[page_index-start_index][0]}\n<physical_index_{page_index}>\n\n" 787 | page_contents.append(page_text) 788 | content_range = ''.join(page_contents) 789 | 790 | physical_index_int = single_toc_item_index_fixer(incorrect_item['title'], content_range, model) 791 | 792 | # Check if the result is correct 793 | check_item = incorrect_item.copy() 794 | check_item['physical_index'] = physical_index_int 795 | check_result = await check_title_appearance(check_item, page_list, start_index, model) 796 | 797 | return { 798 | 'list_index': list_index, 799 | 'title': incorrect_item['title'], 800 | 'physical_index': physical_index_int, 801 | 'is_valid': check_result['answer'] == 'yes' 802 | } 803 | 804 | # Process incorrect items concurrently 805 | tasks = [ 806 | process_and_check_item(item) 807 | for item in incorrect_results 808 | ] 809 | results = await asyncio.gather(*tasks, return_exceptions=True) 810 | for item, result in zip(incorrect_results, results): 811 | if isinstance(result, Exception): 812 | print(f"Processing item {item} generated an exception: {result}") 813 | continue 814 | results = [result for result in results if not isinstance(result, Exception)] 815 | 816 | # Update the toc_with_page_number with the fixed indices and check for any invalid results 817 | invalid_results = [] 818 | for result in results: 819 | if result['is_valid']: 820 | toc_with_page_number[result['list_index']]['physical_index'] = result['physical_index'] 821 | else: 822 | invalid_results.append({ 823 | 'list_index': result['list_index'], 824 | 'title': result['title'], 825 | 'physical_index': result['physical_index'], 826 | }) 827 | 828 | logger.info(f'incorrect_results_and_range_logs: {incorrect_results_and_range_logs}') 829 | logger.info(f'invalid_results: {invalid_results}') 830 | 831 | return toc_with_page_number, invalid_results 832 | 833 | 834 | 835 | async def fix_incorrect_toc_with_retries(toc_with_page_number, page_list, incorrect_results, start_index=1, max_attempts=3, model=None, logger=None): 836 | print('start fix_incorrect_toc') 837 | fix_attempt = 0 838 | current_toc = toc_with_page_number 839 | current_incorrect = incorrect_results 840 | 841 | while current_incorrect: 842 | print(f"Fixing {len(current_incorrect)} incorrect results") 843 | 844 | current_toc, current_incorrect = await fix_incorrect_toc(current_toc, page_list, current_incorrect, start_index, model, logger) 845 | 846 | fix_attempt += 1 847 | if fix_attempt >= max_attempts: 848 | logger.info("Maximum fix attempts reached") 849 | break 850 | 851 | return current_toc, current_incorrect 852 | 853 | 854 | 855 | 856 | ################### verify toc ######################################################### 857 | async def verify_toc(page_list, list_result, start_index=1, N=None, model=None): 858 | print('start verify_toc') 859 | # Find the last non-None physical_index 860 | last_physical_index = None 861 | for item in reversed(list_result): 862 | if item.get('physical_index') is not None: 863 | last_physical_index = item['physical_index'] 864 | break 865 | 866 | # Early return if we don't have valid physical indices 867 | if last_physical_index is None or last_physical_index < len(page_list)/2: 868 | return 0, [] 869 | 870 | # Determine which items to check 871 | if N is None: 872 | print('check all items') 873 | sample_indices = range(0, len(list_result)) 874 | else: 875 | N = min(N, len(list_result)) 876 | print(f'check {N} items') 877 | sample_indices = random.sample(range(0, 
len(list_result)), N) 878 | 879 | # Prepare items with their list indices 880 | indexed_sample_list = [] 881 | for idx in sample_indices: 882 | item = list_result[idx] 883 | item_with_index = item.copy() 884 | item_with_index['list_index'] = idx # Add the original index in list_result 885 | indexed_sample_list.append(item_with_index) 886 | 887 | # Run checks concurrently 888 | tasks = [ 889 | check_title_appearance(item, page_list, start_index, model) 890 | for item in indexed_sample_list 891 | ] 892 | results = await asyncio.gather(*tasks) 893 | 894 | # Process results 895 | correct_count = 0 896 | incorrect_results = [] 897 | for result in results: 898 | if result['answer'] == 'yes': 899 | correct_count += 1 900 | else: 901 | incorrect_results.append(result) 902 | 903 | # Calculate accuracy 904 | checked_count = len(results) 905 | accuracy = correct_count / checked_count if checked_count > 0 else 0 906 | print(f"accuracy: {accuracy*100:.2f}%") 907 | return accuracy, incorrect_results 908 | 909 | 910 | 911 | 912 | 913 | ################### main process ######################################################### 914 | async def meta_processor(page_list, mode=None, toc_content=None, toc_page_list=None, start_index=1, opt=None, logger=None): 915 | print(mode) 916 | print(f'start_index: {start_index}') 917 | 918 | if mode == 'process_toc_with_page_numbers': 919 | toc_with_page_number = process_toc_with_page_numbers(toc_content, toc_page_list, page_list, toc_check_page_num=opt.toc_check_page_num, model=opt.model, logger=logger) 920 | elif mode == 'process_toc_no_page_numbers': 921 | toc_with_page_number = process_toc_no_page_numbers(toc_content, toc_page_list, page_list, model=opt.model, logger=logger) 922 | else: 923 | toc_with_page_number = process_no_toc(page_list, start_index=start_index, model=opt.model, logger=logger) 924 | 925 | toc_with_page_number = [item for item in toc_with_page_number if item.get('physical_index') is not None] 926 | accuracy, incorrect_results = await verify_toc(page_list, toc_with_page_number, start_index=start_index, model=opt.model) 927 | 928 | logger.info({ 929 | 'mode': 'process_toc_with_page_numbers', 930 | 'accuracy': accuracy, 931 | 'incorrect_results': incorrect_results 932 | }) 933 | if accuracy == 1.0 and len(incorrect_results) == 0: 934 | return toc_with_page_number 935 | if accuracy > 0.6 and len(incorrect_results) > 0: 936 | toc_with_page_number, incorrect_results = await fix_incorrect_toc_with_retries(toc_with_page_number, page_list, incorrect_results,start_index=start_index, max_attempts=3, model=opt.model, logger=logger) 937 | return toc_with_page_number 938 | else: 939 | if mode == 'process_toc_with_page_numbers': 940 | return await meta_processor(page_list, mode='process_toc_no_page_numbers', toc_content=toc_content, toc_page_list=toc_page_list, start_index=start_index, opt=opt, logger=logger) 941 | elif mode == 'process_toc_no_page_numbers': 942 | return await meta_processor(page_list, mode='process_no_toc', start_index=start_index, opt=opt, logger=logger) 943 | else: 944 | raise Exception('Processing failed') 945 | 946 | 947 | async def process_large_node_recursively(node, page_list, opt=None, logger=None): 948 | node_page_list = page_list[node['start_index']-1:node['end_index']] 949 | token_num = sum([page[1] for page in node_page_list]) 950 | 951 | if node['end_index'] - node['start_index'] > opt.max_page_num_each_node and token_num >= opt.max_token_num_each_node: 952 | print('large node:', node['title'], 'start_index:', node['start_index'], 
'end_index:', node['end_index'], 'token_num:', token_num) 953 | 954 | node_toc_tree = await meta_processor(node_page_list, mode='process_no_toc', start_index=node['start_index'], opt=opt, logger=logger) 955 | node_toc_tree = await check_title_appearance_in_start_concurrent(node_toc_tree, page_list, model=opt.model, logger=logger) 956 | 957 | if node['title'].strip() == node_toc_tree[0]['title'].strip(): 958 | node['nodes'] = post_processing(node_toc_tree[1:], node['end_index']) 959 | node['end_index'] = node_toc_tree[1]['start_index'] 960 | else: 961 | node['nodes'] = post_processing(node_toc_tree, node['end_index']) 962 | node['end_index'] = node_toc_tree[0]['start_index'] 963 | 964 | if 'nodes' in node and node['nodes']: 965 | tasks = [ 966 | process_large_node_recursively(child_node, page_list, opt, logger=logger) 967 | for child_node in node['nodes'] 968 | ] 969 | await asyncio.gather(*tasks) 970 | 971 | return node 972 | 973 | async def tree_parser(page_list, opt, doc=None, logger=None): 974 | check_toc_result = check_toc(page_list, opt) 975 | logger.info(check_toc_result) 976 | 977 | if check_toc_result.get("toc_content") and check_toc_result["toc_content"].strip() and check_toc_result["page_index_given_in_toc"] == "yes": 978 | toc_with_page_number = await meta_processor( 979 | page_list, 980 | mode='process_toc_with_page_numbers', 981 | start_index=1, 982 | toc_content=check_toc_result['toc_content'], 983 | toc_page_list=check_toc_result['toc_page_list'], 984 | opt=opt, 985 | logger=logger) 986 | else: 987 | toc_with_page_number = await meta_processor( 988 | page_list, 989 | mode='process_no_toc', 990 | start_index=1, 991 | opt=opt, 992 | logger=logger) 993 | 994 | toc_with_page_number = add_preface_if_needed(toc_with_page_number) 995 | toc_with_page_number = await check_title_appearance_in_start_concurrent(toc_with_page_number, page_list, model=opt.model, logger=logger) 996 | toc_tree = post_processing(toc_with_page_number, len(page_list)) 997 | tasks = [ 998 | process_large_node_recursively(node, page_list, opt, logger=logger) 999 | for node in toc_tree 1000 | ] 1001 | await asyncio.gather(*tasks) 1002 | 1003 | return toc_tree 1004 | 1005 | 1006 | def page_index_main(doc, opt=None): 1007 | logger = JsonLogger(doc) 1008 | 1009 | is_valid_pdf = ( 1010 | (isinstance(doc, str) and os.path.isfile(doc) and doc.lower().endswith(".pdf")) or 1011 | isinstance(doc, BytesIO) 1012 | ) 1013 | if not is_valid_pdf: 1014 | raise ValueError("Unsupported input type. 
1015 | 
1016 |     print('Parsing PDF...')
1017 |     page_list = get_page_tokens(doc)
1018 | 
1019 |     logger.info({'total_page_number': len(page_list)})
1020 |     logger.info({'total_token': sum([page[1] for page in page_list])})
1021 | 
1022 |     structure = asyncio.run(tree_parser(page_list, opt, doc=doc, logger=logger))
1023 |     if opt.if_add_node_id == 'yes':
1024 |         write_node_id(structure)
1025 |     if opt.if_add_node_summary == 'yes':
1026 |         add_node_text(structure, page_list)
1027 |         asyncio.run(generate_summaries_for_structure(structure, model=opt.model))
1028 |         remove_structure_text(structure)
1029 |     if opt.if_add_node_text == 'yes':
1030 |         add_node_text_with_labels(structure, page_list)
1031 |     if opt.if_add_doc_description == 'yes':
1032 |         doc_description = generate_doc_description(structure, model=opt.model)
1033 |         return {
1034 |             'doc_name': get_pdf_name(doc),
1035 |             'doc_description': doc_description,
1036 |             'structure': structure,
1037 |         }
1038 |     return {
1039 |         'doc_name': get_pdf_name(doc),
1040 |         'structure': structure,
1041 |     }
1042 | 
1043 | 
1044 | def page_index(doc, model=None, toc_check_page_num=None, max_page_num_each_node=None, max_token_num_each_node=None,
1045 |                if_add_node_id=None, if_add_node_summary=None, if_add_doc_description=None, if_add_node_text=None):
1046 | 
1047 |     user_opt = {
1048 |         arg: value for arg, value in locals().items()
1049 |         if arg != "doc" and value is not None
1050 |     }
1051 |     opt = ConfigLoader().load(user_opt)
1052 |     return page_index_main(doc, opt)
1053 | 
1054 | 
1055 | 
1056 | 
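# --- Illustrative usage sketch (not part of the original file) ---
# Minimal end-to-end call of the entry point above. The import path and the
# sample PDF are assumptions based on the repo layout, and CHATGPT_API_KEY
# is expected in the environment (e.g. via a .env file):
#
#     from pageindex.page_index import page_index
#
#     result = page_index("docs/2023-annual-report.pdf",
#                         if_add_node_id="yes",
#                         if_add_node_summary="yes")
#     print(result["structure"][0]["title"])   # e.g. "Preface"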
--------------------------------------------------------------------------------
/pageindex/utils.py:
--------------------------------------------------------------------------------
  1 | import tiktoken
  2 | import openai
  3 | import logging
  4 | import os
  5 | from datetime import datetime
  6 | import time
  7 | import json
  8 | import PyPDF2
  9 | import copy
 10 | import asyncio
 11 | import pymupdf
 12 | from io import BytesIO
 13 | from dotenv import load_dotenv
 14 | load_dotenv()
 15 | import re
 16 | import yaml
 17 | from pathlib import Path
 18 | from types import SimpleNamespace as config
 19 | 
 20 | CHATGPT_API_KEY = os.getenv("CHATGPT_API_KEY")
 21 | 
 22 | 
 23 | def count_tokens(text, model):
 24 |     enc = tiktoken.encoding_for_model(model)
 25 |     tokens = enc.encode(text)
 26 |     return len(tokens)
 27 | 
 28 | def ChatGPT_API_with_finish_reason(model, prompt, api_key=CHATGPT_API_KEY, chat_history=None):
 29 |     max_retries = 10
 30 |     client = openai.OpenAI(api_key=api_key)
 31 |     for i in range(max_retries):
 32 |         try:
 33 |             if chat_history:
 34 |                 messages = chat_history
 35 |                 messages.append({"role": "user", "content": prompt})
 36 |             else:
 37 |                 messages = [{"role": "user", "content": prompt}]
 38 | 
 39 |             response = client.chat.completions.create(
 40 |                 model=model,
 41 |                 messages=messages,
 42 |                 temperature=0,
 43 |             )
 44 |             if response.choices[0].finish_reason == "length":
 45 |                 return response.choices[0].message.content, "max_output_reached"
 46 |             else:
 47 |                 return response.choices[0].message.content, "finished"
 48 | 
 49 |         except Exception as e:
 50 |             print('************* Retrying *************')
 51 |             logging.error(f"Error: {e}")
 52 |             if i < max_retries - 1:
 53 |                 time.sleep(1)  # Wait for 1 second before retrying
 54 |             else:
 55 |                 logging.error('Max retries reached for prompt: ' + prompt)
 56 |                 return "Error"
 57 | 
 58 | 
 59 | 
 60 | def ChatGPT_API(model, prompt, api_key=CHATGPT_API_KEY, chat_history=None):
 61 |     max_retries = 10
 62 |     client = openai.OpenAI(api_key=api_key)
 63 |     for i in range(max_retries):
 64 |         try:
 65 |             if chat_history:
 66 |                 messages = chat_history
 67 |                 messages.append({"role": "user", "content": prompt})
 68 |             else:
 69 |                 messages = [{"role": "user", "content": prompt}]
 70 | 
 71 |             response = client.chat.completions.create(
 72 |                 model=model,
 73 |                 messages=messages,
 74 |                 temperature=0,
 75 |             )
 76 | 
 77 |             return response.choices[0].message.content
 78 |         except Exception as e:
 79 |             print('************* Retrying *************')
 80 |             logging.error(f"Error: {e}")
 81 |             if i < max_retries - 1:
 82 |                 time.sleep(1)  # Wait for 1 second before retrying
 83 |             else:
 84 |                 logging.error('Max retries reached for prompt: ' + prompt)
 85 |                 return "Error"
 86 | 
 87 | 
 88 | async def ChatGPT_API_async(model, prompt, api_key=CHATGPT_API_KEY):
 89 |     max_retries = 10
 90 |     client = openai.AsyncOpenAI(api_key=api_key)
 91 |     for i in range(max_retries):
 92 |         try:
 93 |             messages = [{"role": "user", "content": prompt}]
 94 |             response = await client.chat.completions.create(
 95 |                 model=model,
 96 |                 messages=messages,
 97 |                 temperature=0,
 98 |             )
 99 |             return response.choices[0].message.content
100 |         except Exception as e:
101 |             print('************* Retrying *************')
102 |             logging.error(f"Error: {e}")
103 |             if i < max_retries - 1:
104 |                 await asyncio.sleep(1)  # Wait for 1 second before retrying
105 |             else:
106 |                 logging.error('Max retries reached for prompt: ' + prompt)
107 |                 return "Error"
108 | 
109 | def get_json_content(response):
110 |     start_idx = response.find("```json")
111 |     if start_idx != -1:
112 |         start_idx += 7
113 |         response = response[start_idx:]
114 | 
115 |     end_idx = response.rfind("```")
116 |     if end_idx != -1:
117 |         response = response[:end_idx]
118 | 
119 |     json_content = response.strip()
120 |     return json_content
121 | 
122 | 
123 | def extract_json(content):
124 |     try:
125 |         # First, try to extract JSON enclosed within ```json and ```
126 |         start_idx = content.find("```json")
127 |         if start_idx != -1:
128 |             start_idx += 7  # Adjust index to start after the delimiter
129 |             end_idx = content.rfind("```")
130 |             json_content = content[start_idx:end_idx].strip()
131 |         else:
132 |             # If no delimiters, assume entire content could be JSON
133 |             json_content = content.strip()
134 | 
135 |         # Clean up common issues that might cause parsing errors
136 |         json_content = json_content.replace('None', 'null')  # Replace Python None with JSON null
137 |         json_content = json_content.replace('\n', ' ').replace('\r', ' ')  # Remove newlines
138 |         json_content = ' '.join(json_content.split())  # Normalize whitespace
139 | 
140 |         # Attempt to parse and return the JSON object
141 |         return json.loads(json_content)
142 |     except json.JSONDecodeError as e:
143 |         logging.error(f"Failed to extract JSON: {e}")
144 |         # Try to clean up the content further if initial parsing fails
145 |         try:
146 |             # Remove any trailing commas before closing brackets/braces
147 |             json_content = json_content.replace(',]', ']').replace(',}', '}')
148 |             return json.loads(json_content)
149 |         except:
150 |             logging.error("Failed to parse JSON even after cleanup")
151 |             return {}
152 |     except Exception as e:
153 |         logging.error(f"Unexpected error while extracting JSON: {e}")
154 |         return {}
155 | 
156 | def write_node_id(data, node_id=0):
157 |     if isinstance(data, dict):
158 |         data['node_id'] = str(node_id).zfill(4)
159 |         node_id += 1
160 |         for key in list(data.keys()):
161 |             if 'nodes' in key:
162 |                 node_id = write_node_id(data[key], node_id)
163 |     elif isinstance(data, list):
164 |         for index in range(len(data)):
165 |             node_id = write_node_id(data[index], 
node_id) 166 | return node_id 167 | 168 | def get_nodes(structure): 169 | if isinstance(structure, dict): 170 | structure_node = copy.deepcopy(structure) 171 | structure_node.pop('nodes', None) 172 | nodes = [structure_node] 173 | for key in list(structure.keys()): 174 | if 'nodes' in key: 175 | nodes.extend(get_nodes(structure[key])) 176 | return nodes 177 | elif isinstance(structure, list): 178 | nodes = [] 179 | for item in structure: 180 | nodes.extend(get_nodes(item)) 181 | return nodes 182 | 183 | def structure_to_list(structure): 184 | if isinstance(structure, dict): 185 | nodes = [] 186 | nodes.append(structure) 187 | if 'nodes' in structure: 188 | nodes.extend(structure_to_list(structure['nodes'])) 189 | return nodes 190 | elif isinstance(structure, list): 191 | nodes = [] 192 | for item in structure: 193 | nodes.extend(structure_to_list(item)) 194 | return nodes 195 | 196 | 197 | def get_leaf_nodes(structure): 198 | if isinstance(structure, dict): 199 | if not structure['nodes']: 200 | structure_node = copy.deepcopy(structure) 201 | structure_node.pop('nodes', None) 202 | return [structure_node] 203 | else: 204 | leaf_nodes = [] 205 | for key in list(structure.keys()): 206 | if 'nodes' in key: 207 | leaf_nodes.extend(get_leaf_nodes(structure[key])) 208 | return leaf_nodes 209 | elif isinstance(structure, list): 210 | leaf_nodes = [] 211 | for item in structure: 212 | leaf_nodes.extend(get_leaf_nodes(item)) 213 | return leaf_nodes 214 | 215 | def is_leaf_node(data, node_id): 216 | # Helper function to find the node by its node_id 217 | def find_node(data, node_id): 218 | if isinstance(data, dict): 219 | if data.get('node_id') == node_id: 220 | return data 221 | for key in data.keys(): 222 | if 'nodes' in key: 223 | result = find_node(data[key], node_id) 224 | if result: 225 | return result 226 | elif isinstance(data, list): 227 | for item in data: 228 | result = find_node(item, node_id) 229 | if result: 230 | return result 231 | return None 232 | 233 | # Find the node with the given node_id 234 | node = find_node(data, node_id) 235 | 236 | # Check if the node is a leaf node 237 | if node and not node.get('nodes'): 238 | return True 239 | return False 240 | 241 | def get_last_node(structure): 242 | return structure[-1] 243 | 244 | 245 | def extract_text_from_pdf(pdf_path): 246 | pdf_reader = PyPDF2.PdfReader(pdf_path) 247 | ###return text not list 248 | text="" 249 | for page_num in range(len(pdf_reader.pages)): 250 | page = pdf_reader.pages[page_num] 251 | text+=page.extract_text() 252 | return text 253 | 254 | def get_pdf_title(pdf_path): 255 | pdf_reader = PyPDF2.PdfReader(pdf_path) 256 | meta = pdf_reader.metadata 257 | title = meta.title if meta and meta.title else 'Untitled' 258 | return title 259 | 260 | def get_text_of_pages(pdf_path, start_page, end_page, tag=True): 261 | pdf_reader = PyPDF2.PdfReader(pdf_path) 262 | text = "" 263 | for page_num in range(start_page-1, end_page): 264 | page = pdf_reader.pages[page_num] 265 | page_text = page.extract_text() 266 | if tag: 267 | text += f"<start_index_{page_num+1}>\n{page_text}\n<end_index_{page_num+1}>\n" 268 | else: 269 | text += page_text 270 | return text 271 | 272 | def get_first_start_page_from_text(text): 273 | start_page = -1 274 | start_page_match = re.search(r'<start_index_(\d+)>', text) 275 | if start_page_match: 276 | start_page = int(start_page_match.group(1)) 277 | return start_page 278 | 279 | def get_last_start_page_from_text(text): 280 | start_page = -1 281 | # Find all matches of start_index tags 282 | 
start_page_matches = re.finditer(r'<start_index_(\d+)>', text) 283 | # Convert iterator to list and get the last match if any exist 284 | matches_list = list(start_page_matches) 285 | if matches_list: 286 | start_page = int(matches_list[-1].group(1)) 287 | return start_page 288 | 289 | 290 | def sanitize_filename(filename, replacement='-'): 291 | # In Linux, only '/' and '\0' (null) are invalid in filenames. 292 | # Null can't be represented in strings, so we only handle '/'. 293 | return filename.replace('/', replacement) 294 | 295 | def get_pdf_name(pdf_path): 296 | # Extract PDF name 297 | if isinstance(pdf_path, str): 298 | pdf_name = os.path.basename(pdf_path) 299 | elif isinstance(pdf_path, BytesIO): 300 | pdf_reader = PyPDF2.PdfReader(pdf_path) 301 | meta = pdf_reader.metadata 302 | pdf_name = meta.title if meta and meta.title else 'Untitled' 303 | pdf_name = sanitize_filename(pdf_name) 304 | return pdf_name 305 | 306 | 307 | class JsonLogger: 308 | def __init__(self, file_path): 309 | # Extract PDF name for logger name 310 | pdf_name = get_pdf_name(file_path) 311 | 312 | current_time = datetime.now().strftime("%Y%m%d_%H%M%S") 313 | self.filename = f"{pdf_name}_{current_time}.json" 314 | os.makedirs("./logs", exist_ok=True) 315 | # Initialize empty list to store all messages 316 | self.log_data = [] 317 | 318 | def log(self, level, message, **kwargs): 319 | if isinstance(message, dict): 320 | self.log_data.append(message) 321 | else: 322 | self.log_data.append({'message': message}) 323 | # Add new message to the log data 324 | 325 | # Write entire log data to file 326 | with open(self._filepath(), "w") as f: 327 | json.dump(self.log_data, f, indent=2) 328 | 329 | def info(self, message, **kwargs): 330 | self.log("INFO", message, **kwargs) 331 | 332 | def error(self, message, **kwargs): 333 | self.log("ERROR", message, **kwargs) 334 | 335 | def debug(self, message, **kwargs): 336 | self.log("DEBUG", message, **kwargs) 337 | 338 | def exception(self, message, **kwargs): 339 | kwargs["exception"] = True 340 | self.log("ERROR", message, **kwargs) 341 | 342 | def _filepath(self): 343 | return os.path.join("logs", self.filename) 344 | 345 | 346 | 347 | 348 | def list_to_tree(data): 349 | def get_parent_structure(structure): 350 | """Helper function to get the parent structure code""" 351 | if not structure: 352 | return None 353 | parts = str(structure).split('.') 354 | return '.'.join(parts[:-1]) if len(parts) > 1 else None 355 | 356 | # First pass: Create nodes and track parent-child relationships 357 | nodes = {} 358 | root_nodes = [] 359 | 360 | for item in data: 361 | structure = item.get('structure') 362 | node = { 363 | 'title': item.get('title'), 364 | 'start_index': item.get('start_index'), 365 | 'end_index': item.get('end_index'), 366 | 'nodes': [] 367 | } 368 | 369 | nodes[structure] = node 370 | 371 | # Find parent 372 | parent_structure = get_parent_structure(structure) 373 | 374 | if parent_structure: 375 | # Add as child to parent if parent exists 376 | if parent_structure in nodes: 377 | nodes[parent_structure]['nodes'].append(node) 378 | else: 379 | root_nodes.append(node) 380 | else: 381 | # No parent, this is a root node 382 | root_nodes.append(node) 383 | 384 | # Helper function to clean empty children arrays 385 | def clean_node(node): 386 | if not node['nodes']: 387 | del node['nodes'] 388 | else: 389 | for child in node['nodes']: 390 | clean_node(child) 391 | return node 392 | 393 | # Clean and return the tree 394 | return [clean_node(node) for node in root_nodes] 
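# Illustrative sketch (not from the repo): list_to_tree infers nesting from
# the dotted "structure" codes attached to each TOC item, e.g. the flat list
#
#     [{'structure': '1',   'title': 'A', 'start_index': 1, 'end_index': 3},
#      {'structure': '1.1', 'title': 'B', 'start_index': 1, 'end_index': 2}]
#
# becomes a single root node 'A' whose 'nodes' list contains 'B', because
# '1.1' strips to the parent code '1'.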
395 | 
396 | def add_preface_if_needed(data):
397 |     if not isinstance(data, list) or not data:
398 |         return data
399 | 
400 |     if data[0]['physical_index'] is not None and data[0]['physical_index'] > 1:
401 |         preface_node = {
402 |             "structure": "0",
403 |             "title": "Preface",
404 |             "physical_index": 1,
405 |         }
406 |         data.insert(0, preface_node)
407 |     return data
408 | 
409 | 
410 | 
411 | def get_page_tokens(pdf_path, model="gpt-4o-2024-11-20", pdf_parser="PyPDF2"):
412 |     enc = tiktoken.encoding_for_model(model)
413 |     if pdf_parser == "PyPDF2":
414 |         pdf_reader = PyPDF2.PdfReader(pdf_path)
415 |         page_list = []
416 |         for page_num in range(len(pdf_reader.pages)):
417 |             page = pdf_reader.pages[page_num]
418 |             page_text = page.extract_text()
419 |             token_length = len(enc.encode(page_text))
420 |             page_list.append((page_text, token_length))
421 |         return page_list
422 |     elif pdf_parser == "PyMuPDF":
423 |         if isinstance(pdf_path, BytesIO):
424 |             pdf_stream = pdf_path
425 |             doc = pymupdf.open(stream=pdf_stream, filetype="pdf")
426 |         elif isinstance(pdf_path, str) and os.path.isfile(pdf_path) and pdf_path.lower().endswith(".pdf"):
427 |             doc = pymupdf.open(pdf_path)
        else:
            # Neither a BytesIO stream nor a valid .pdf path: fail fast
            # instead of leaving `doc` undefined below.
            raise ValueError("Unsupported input type. Expected a PDF file path or BytesIO object.")
428 |         page_list = []
429 |         for page in doc:
430 |             page_text = page.get_text()
431 |             token_length = len(enc.encode(page_text))
432 |             page_list.append((page_text, token_length))
433 |         return page_list
434 |     else:
435 |         raise ValueError(f"Unsupported PDF parser: {pdf_parser}")
436 | 
437 | 
438 | 
439 | def get_text_of_pdf_pages(pdf_pages, start_page, end_page):
440 |     text = ""
441 |     for page_num in range(start_page-1, end_page):
442 |         text += pdf_pages[page_num][0]
443 |     return text
444 | 
445 | def get_text_of_pdf_pages_with_labels(pdf_pages, start_page, end_page):
446 |     text = ""
447 |     for page_num in range(start_page-1, end_page):
448 |         text += f"<physical_index_{page_num+1}>\n{pdf_pages[page_num][0]}\n<physical_index_{page_num+1}>\n"
449 |     return text
450 | 
451 | def get_number_of_pages(pdf_path):
452 |     pdf_reader = PyPDF2.PdfReader(pdf_path)
453 |     num = len(pdf_reader.pages)
454 |     return num
455 | 
456 | 
457 | 
458 | def post_processing(structure, end_physical_index):
459 |     # First convert physical_index to start_index/end_index in the flat list
460 |     for i, item in enumerate(structure):
461 |         item['start_index'] = item.get('physical_index')
462 |         if i < len(structure) - 1:
463 |             if structure[i + 1].get('appear_start') == 'yes':
464 |                 item['end_index'] = structure[i + 1]['physical_index']-1
465 |             else:
466 |                 item['end_index'] = structure[i + 1]['physical_index']
467 |         else:
468 |             item['end_index'] = end_physical_index
469 |     tree = list_to_tree(structure)
470 |     if len(tree) != 0:
471 |         return tree
472 |     else:
473 |         # Fallback: tree construction failed, so return the flat list with helper keys removed
474 |         for node in structure:
475 |             node.pop('appear_start', None)
476 |             node.pop('physical_index', None)
477 |         return structure
478 | 
479 | def clean_structure_post(data):
480 |     if isinstance(data, dict):
481 |         data.pop('page_number', None)
482 |         data.pop('start_index', None)
483 |         data.pop('end_index', None)
484 |         if 'nodes' in data:
485 |             clean_structure_post(data['nodes'])
486 |     elif isinstance(data, list):
487 |         for section in data:
488 |             clean_structure_post(section)
489 |     return data
490 | 
491 | 
492 | def remove_structure_text(data):
493 |     if isinstance(data, dict):
494 |         data.pop('text', None)
495 |         if 'nodes' in data:
496 |             remove_structure_text(data['nodes'])
497 |     elif isinstance(data, list):
498 |         for item in data:
499 |             remove_structure_text(item)
500 |     return data
501 | 
502 | 
503 | def check_token_limit(structure, limit=110000):
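    # Note: the 110000-token default is presumably chosen to leave headroom
    # under the ~128k context window of the gpt-4o family used elsewhere in
    # this file (an assumption, not documented in the repo); oversized nodes
    # are only reported here, not split.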
504 |     node_list = structure_to_list(structure)
505 |     for node in node_list:
506 |         num_tokens = count_tokens(node['text'], model='gpt-4o')
507 |         if num_tokens > limit:
508 |             print(f"Node ID: {node['node_id']} has {num_tokens} tokens")
509 |             print("Start Index:", node['start_index'])
510 |             print("End Index:", node['end_index'])
511 |             print("Title:", node['title'])
512 |             print("\n")
513 | 
514 | 
515 | def convert_physical_index_to_int(data):
516 |     if isinstance(data, list):
517 |         for i in range(len(data)):
518 |             # Check if item is a dictionary and has 'physical_index' key
519 |             if isinstance(data[i], dict) and 'physical_index' in data[i]:
520 |                 if isinstance(data[i]['physical_index'], str):
521 |                     if data[i]['physical_index'].startswith('<physical_index_'):
522 |                         data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].rstrip('>').strip())
523 |                     elif data[i]['physical_index'].startswith('physical_index_'):
524 |                         data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].strip())
525 |     elif isinstance(data, str):
526 |         if data.startswith('<physical_index_'):
527 |             data = int(data.split('_')[-1].rstrip('>').strip())
528 |         elif data.startswith('physical_index_'):
529 |             data = int(data.split('_')[-1].strip())
530 |         # Check data is int
531 |         if isinstance(data, int):
532 |             return data
533 |         else:
534 |             return None
535 |     return data
536 | 
537 | 
538 | def convert_page_to_int(data):
539 |     for item in data:
540 |         if 'page' in item and isinstance(item['page'], str):
541 |             try:
542 |                 item['page'] = int(item['page'])
543 |             except ValueError:
544 |                 # Keep original value if conversion fails
545 |                 pass
546 |     return data
547 | 
548 | 
549 | def add_node_text(node, pdf_pages):
550 |     if isinstance(node, dict):
551 |         start_page = node.get('start_index')
552 |         end_page = node.get('end_index')
553 |         node['text'] = get_text_of_pdf_pages(pdf_pages, start_page, end_page)
554 |         if 'nodes' in node:
555 |             add_node_text(node['nodes'], pdf_pages)
556 |     elif isinstance(node, list):
557 |         for index in range(len(node)):
558 |             add_node_text(node[index], pdf_pages)
559 |     return
560 | 
561 | 
562 | def add_node_text_with_labels(node, pdf_pages):
563 |     if isinstance(node, dict):
564 |         start_page = node.get('start_index')
565 |         end_page = node.get('end_index')
566 |         node['text'] = get_text_of_pdf_pages_with_labels(pdf_pages, start_page, end_page)
567 |         if 'nodes' in node:
568 |             add_node_text_with_labels(node['nodes'], pdf_pages)
569 |     elif isinstance(node, list):
570 |         for index in range(len(node)):
571 |             add_node_text_with_labels(node[index], pdf_pages)
572 |     return
573 | 
574 | 
575 | async def generate_node_summary(node, model=None):
576 |     prompt = f"""You are given a part of a document. Your task is to generate a description of this partial document, summarizing the main points it covers.
577 | 
578 | Partial Document Text: {node['text']}
579 | 
580 | Directly return the description, do not include any other text.
581 | """
582 |     response = await ChatGPT_API_async(model, prompt)
583 |     return response
584 | 
585 | 
586 | async def generate_summaries_for_structure(structure, model=None):
587 |     nodes = structure_to_list(structure)
588 |     tasks = [generate_node_summary(node, model=model) for node in nodes]
589 |     summaries = await asyncio.gather(*tasks)
590 | 
591 |     for node, summary in zip(nodes, summaries):
592 |         node['summary'] = summary
593 |     return structure
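# Note: the gather above fans out one API call per node concurrently, so a
# large tree can issue hundreds of simultaneous requests; transient failures
# and rate-limit errors are absorbed only by the retry loop inside
# ChatGPT_API_async.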
594 | 
595 | 
596 | def generate_doc_description(structure, model=None):
597 |     prompt = f"""You are an expert in generating descriptions for a document.
598 | You are given a structure of a document. Your task is to generate a one-sentence description for the document, which makes it easy to distinguish the document from other documents.
599 | 
600 | Document Structure: {structure}
601 | 
602 | Directly return the description, do not include any other text.
603 | """
604 |     response = ChatGPT_API(model, prompt)
605 |     return response
606 | 
607 | 
608 | class ConfigLoader:
609 |     def __init__(self, default_path: str = None):
610 |         if default_path is None:
611 |             default_path = Path(__file__).parent / "config.yaml"
612 |         self._default_dict = self._load_yaml(default_path)
613 | 
614 |     @staticmethod
615 |     def _load_yaml(path):
616 |         with open(path, "r", encoding="utf-8") as f:
617 |             return yaml.safe_load(f) or {}
618 | 
619 |     def _validate_keys(self, user_dict):
620 |         unknown_keys = set(user_dict) - set(self._default_dict)
621 |         if unknown_keys:
622 |             raise ValueError(f"Unknown config keys: {unknown_keys}")
623 | 
624 |     def load(self, user_opt=None) -> config:
625 |         """
626 |         Load the configuration, merging user options with default values.
627 |         """
628 |         if user_opt is None:
629 |             user_dict = {}
630 |         elif isinstance(user_opt, config):
631 |             user_dict = vars(user_opt)
632 |         elif isinstance(user_opt, dict):
633 |             user_dict = user_opt
634 |         else:
635 |             raise TypeError("user_opt must be dict, config(SimpleNamespace) or None")
636 | 
637 |         self._validate_keys(user_dict)
638 |         merged = {**self._default_dict, **user_dict}
639 |         return config(**merged)
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | openai==1.70.0
2 | pymupdf==1.25.5
3 | PyPDF2==3.0.1
4 | python-dotenv==1.1.0
5 | tiktoken==0.7.0
6 | pyyaml==6.0.2
7 | 
--------------------------------------------------------------------------------
/results/2023-annual-report_structure.json:
--------------------------------------------------------------------------------
 1 | {
 2 |     "doc_name": "2023-annual-report.pdf",
 3 |     "structure": [
 4 |         {
 5 |             "title": "Preface",
 6 |             "start_index": 1,
 7 |             "end_index": 4,
 8 |             "node_id": "0000"
 9 |         },
10 |         {
11 |             "title": "About the Federal Reserve",
12 |             "start_index": 5,
13 |             "end_index": 6,
14 |             "node_id": "0001"
15 |         },
16 |         {
17 |             "title": "Overview",
18 |             "start_index": 7,
19 |             "end_index": 8,
20 |             "node_id": "0002"
21 |         },
22 |         {
23 |             "title": "Monetary Policy and Economic Developments",
24 |             "start_index": 9,
25 |             "end_index": 9,
26 |             "nodes": [
27 |                 {
28 |                     "title": "March 2024 Summary",
29 |                     "start_index": 9,
30 |                     "end_index": 14,
31 |                     "node_id": "0004"
32 |                 },
33 |                 {
34 |                     "title": "June 2023 Summary",
35 |                     "start_index": 15,
36 |                     "end_index": 20,
37 |                     "node_id": "0005"
38 |                 }
39 |             ],
40 |             "node_id": "0003"
41 |         },
42 |         {
43 |             "title": "Financial Stability",
44 |             "start_index": 21,
45 |             "end_index": 21,
46 |             "nodes": [
47 |                 {
48 |                     "title": 
"Monitoring Financial Vulnerabilities", 49 | "start_index": 22, 50 | "end_index": 28, 51 | "node_id": "0007" 52 | }, 53 | { 54 | "title": "Domestic and International Cooperation and Coordination", 55 | "start_index": 28, 56 | "end_index": 31, 57 | "node_id": "0008" 58 | } 59 | ], 60 | "node_id": "0006" 61 | }, 62 | { 63 | "title": "Supervision and Regulation", 64 | "start_index": 31, 65 | "end_index": 31, 66 | "nodes": [ 67 | { 68 | "title": "Supervised and Regulated Institutions", 69 | "start_index": 32, 70 | "end_index": 35, 71 | "node_id": "0010" 72 | }, 73 | { 74 | "title": "Supervisory Developments", 75 | "start_index": 35, 76 | "end_index": 54, 77 | "node_id": "0011" 78 | }, 79 | { 80 | "title": "Regulatory Developments", 81 | "start_index": 55, 82 | "end_index": 59, 83 | "node_id": "0012" 84 | } 85 | ], 86 | "node_id": "0009" 87 | }, 88 | { 89 | "title": "Payment System and Reserve Bank Oversight", 90 | "start_index": 59, 91 | "end_index": 59, 92 | "nodes": [ 93 | { 94 | "title": "Payment Services to Depository and Other Institutions", 95 | "start_index": 60, 96 | "end_index": 65, 97 | "node_id": "0014" 98 | }, 99 | { 100 | "title": "Currency and Coin", 101 | "start_index": 66, 102 | "end_index": 68, 103 | "node_id": "0015" 104 | }, 105 | { 106 | "title": "Fiscal Agency and Government Depository Services", 107 | "start_index": 69, 108 | "end_index": 72, 109 | "node_id": "0016" 110 | }, 111 | { 112 | "title": "Evolutions and Improvements to the System", 113 | "start_index": 72, 114 | "end_index": 75, 115 | "node_id": "0017" 116 | }, 117 | { 118 | "title": "Oversight of Federal Reserve Banks", 119 | "start_index": 75, 120 | "end_index": 81, 121 | "node_id": "0018" 122 | }, 123 | { 124 | "title": "Pro Forma Financial Statements for Federal Reserve Priced Services", 125 | "start_index": 82, 126 | "end_index": 88, 127 | "node_id": "0019" 128 | } 129 | ], 130 | "node_id": "0013" 131 | }, 132 | { 133 | "title": "Consumer and Community Affairs", 134 | "start_index": 89, 135 | "end_index": 89, 136 | "nodes": [ 137 | { 138 | "title": "Consumer Compliance Supervision", 139 | "start_index": 89, 140 | "end_index": 101, 141 | "node_id": "0021" 142 | }, 143 | { 144 | "title": "Consumer Laws and Regulations", 145 | "start_index": 101, 146 | "end_index": 102, 147 | "node_id": "0022" 148 | }, 149 | { 150 | "title": "Consumer Research and Analysis of Emerging Issues and Policy", 151 | "start_index": 102, 152 | "end_index": 105, 153 | "node_id": "0023" 154 | }, 155 | { 156 | "title": "Community Development", 157 | "start_index": 105, 158 | "end_index": 106, 159 | "node_id": "0024" 160 | } 161 | ], 162 | "node_id": "0020" 163 | }, 164 | { 165 | "title": "Appendixes", 166 | "start_index": 107, 167 | "end_index": 109, 168 | "node_id": "0025" 169 | }, 170 | { 171 | "title": "Federal Reserve System Organization", 172 | "start_index": 109, 173 | "end_index": 109, 174 | "nodes": [ 175 | { 176 | "title": "Board of Governors", 177 | "start_index": 109, 178 | "end_index": 116, 179 | "node_id": "0027" 180 | }, 181 | { 182 | "title": "Federal Open Market Committee", 183 | "start_index": 117, 184 | "end_index": 118, 185 | "node_id": "0028" 186 | }, 187 | { 188 | "title": "Board of Governors Advisory Councils", 189 | "start_index": 119, 190 | "end_index": 122, 191 | "node_id": "0029" 192 | }, 193 | { 194 | "title": "Federal Reserve Banks and Branches", 195 | "start_index": 123, 196 | "end_index": 146, 197 | "node_id": "0030" 198 | } 199 | ], 200 | "node_id": "0026" 201 | }, 202 | { 203 | "title": "Minutes of Federal 
Open Market Committee Meetings", 204 | "start_index": 147, 205 | "end_index": 147, 206 | "nodes": [ 207 | { 208 | "title": "Meeting Minutes", 209 | "start_index": 147, 210 | "end_index": 149, 211 | "node_id": "0032" 212 | } 213 | ], 214 | "node_id": "0031" 215 | }, 216 | { 217 | "title": "Federal Reserve System Audits", 218 | "start_index": 149, 219 | "end_index": 149, 220 | "nodes": [ 221 | { 222 | "title": "Office of Inspector General Activities", 223 | "start_index": 149, 224 | "end_index": 151, 225 | "node_id": "0034" 226 | }, 227 | { 228 | "title": "Government Accountability Office Reviews", 229 | "start_index": 151, 230 | "end_index": 152, 231 | "node_id": "0035" 232 | } 233 | ], 234 | "node_id": "0033" 235 | }, 236 | { 237 | "title": "Federal Reserve System Budgets", 238 | "start_index": 153, 239 | "end_index": 153, 240 | "nodes": [ 241 | { 242 | "title": "System Budgets Overview", 243 | "start_index": 153, 244 | "end_index": 157, 245 | "node_id": "0037" 246 | }, 247 | { 248 | "title": "Board of Governors Budgets", 249 | "start_index": 157, 250 | "end_index": 163, 251 | "node_id": "0038" 252 | }, 253 | { 254 | "title": "Federal Reserve Banks Budgets", 255 | "start_index": 163, 256 | "end_index": 169, 257 | "node_id": "0039" 258 | }, 259 | { 260 | "title": "Currency Budget", 261 | "start_index": 169, 262 | "end_index": 174, 263 | "node_id": "0040" 264 | } 265 | ], 266 | "node_id": "0036" 267 | }, 268 | { 269 | "title": "Record of Policy Actions of the Board of Governors", 270 | "start_index": 175, 271 | "end_index": 175, 272 | "nodes": [ 273 | { 274 | "title": "Rules and Regulations", 275 | "start_index": 175, 276 | "end_index": 176, 277 | "node_id": "0042" 278 | }, 279 | { 280 | "title": "Policy Statements and Other Actions", 281 | "start_index": 177, 282 | "end_index": 181, 283 | "node_id": "0043" 284 | }, 285 | { 286 | "title": "Discount Rates for Depository Institutions in 2023", 287 | "start_index": 181, 288 | "end_index": 183, 289 | "node_id": "0044" 290 | }, 291 | { 292 | "title": "The Board of Governors and the Government Performance and Results Act", 293 | "start_index": 184, 294 | "end_index": 184, 295 | "node_id": "0045" 296 | } 297 | ], 298 | "node_id": "0041" 299 | }, 300 | { 301 | "title": "Litigation", 302 | "start_index": 185, 303 | "end_index": 185, 304 | "nodes": [ 305 | { 306 | "title": "Pending", 307 | "start_index": 185, 308 | "end_index": 186, 309 | "node_id": "0047" 310 | }, 311 | { 312 | "title": "Resolved", 313 | "start_index": 186, 314 | "end_index": 187, 315 | "node_id": "0048" 316 | } 317 | ], 318 | "node_id": "0046" 319 | }, 320 | { 321 | "title": "Statistical Tables", 322 | "start_index": 187, 323 | "end_index": 187, 324 | "nodes": [ 325 | { 326 | "title": "Federal Reserve open market transactions, 2023", 327 | "start_index": 187, 328 | "end_index": 187, 329 | "nodes": [ 330 | { 331 | "title": "Type of security and transaction", 332 | "start_index": 187, 333 | "end_index": 188, 334 | "node_id": "0051" 335 | }, 336 | { 337 | "title": "Federal agency obligations", 338 | "start_index": 188, 339 | "end_index": 188, 340 | "node_id": "0052" 341 | }, 342 | { 343 | "title": "Mortgage-backed securities", 344 | "start_index": 188, 345 | "end_index": 188, 346 | "node_id": "0053" 347 | }, 348 | { 349 | "title": "Temporary transactions", 350 | "start_index": 188, 351 | "end_index": 188, 352 | "node_id": "0054" 353 | } 354 | ], 355 | "node_id": "0050" 356 | }, 357 | { 358 | "title": "Federal Reserve Bank holdings of U.S. 
Treasury and federal agency securities, December 31, 2021\u201323", 359 | "start_index": 189, 360 | "end_index": 189, 361 | "nodes": [ 362 | { 363 | "title": "By remaining maturity", 364 | "start_index": 189, 365 | "end_index": 189, 366 | "node_id": "0056" 367 | }, 368 | { 369 | "title": "By type", 370 | "start_index": 189, 371 | "end_index": 190, 372 | "node_id": "0057" 373 | }, 374 | { 375 | "title": "By issuer", 376 | "start_index": 190, 377 | "end_index": 190, 378 | "node_id": "0058" 379 | } 380 | ], 381 | "node_id": "0055" 382 | }, 383 | { 384 | "title": "Reserve requirements of depository institutions, December 31, 2023", 385 | "start_index": 191, 386 | "end_index": 191, 387 | "node_id": "0059" 388 | }, 389 | { 390 | "title": "Banking offices and banks affiliated with bank holding companies in the United States, December 31, 2022 and 2023", 391 | "start_index": 192, 392 | "end_index": 192, 393 | "node_id": "0060" 394 | }, 395 | { 396 | "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023", 397 | "start_index": 193, 398 | "end_index": 196, 399 | "node_id": "0061" 400 | }, 401 | { 402 | "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983", 403 | "start_index": 197, 404 | "end_index": 200, 405 | "node_id": "0062" 406 | }, 407 | { 408 | "title": "Principal assets and liabilities of insured commercial banks, by class of bank, June 30, 2023 and 2022", 409 | "start_index": 201, 410 | "end_index": 201, 411 | "node_id": "0063" 412 | }, 413 | { 414 | "title": "Initial margin requirements under Regulations T, U, and X", 415 | "start_index": 202, 416 | "end_index": 203, 417 | "node_id": "0064" 418 | }, 419 | { 420 | "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022", 421 | "start_index": 203, 422 | "end_index": 209, 423 | "node_id": "0065" 424 | }, 425 | { 426 | "title": "Statement of condition of the Federal Reserve Banks, December 31, 2023 and 2022", 427 | "start_index": 209, 428 | "end_index": 210, 429 | "node_id": "0066" 430 | }, 431 | { 432 | "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023", 433 | "start_index": 210, 434 | "end_index": 212, 435 | "nodes": [ 436 | { 437 | "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued", 438 | "start_index": 212, 439 | "end_index": 214, 440 | "node_id": "0068" 441 | } 442 | ], 443 | "node_id": "0067" 444 | }, 445 | { 446 | "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023", 447 | "start_index": 214, 448 | "end_index": 215, 449 | "nodes": [ 450 | { 451 | "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued", 452 | "start_index": 215, 453 | "end_index": 216, 454 | "node_id": "0070" 455 | }, 456 | { 457 | "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued", 458 | "start_index": 216, 459 | "end_index": 217, 460 | "node_id": "0071" 461 | }, 462 | { 463 | "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued", 464 | "start_index": 217, 465 | "end_index": 217, 466 | "node_id": "0072" 467 | } 468 | ], 469 | "node_id": "0069" 470 | }, 471 | { 472 | "title": "Operations in principal departments of the Federal Reserve Banks, 2020\u201323", 473 | "start_index": 218, 474 | "end_index": 218, 475 | "node_id": "0073" 476 | }, 477 | { 478 | "title": "Number and annual salaries 
of officers and employees of the Federal Reserve Banks, December 31, 2023", 479 | "start_index": 219, 480 | "end_index": 220, 481 | "node_id": "0074" 482 | }, 483 | { 484 | "title": "Acquisition costs and net book value of the premises of the Federal Reserve Banks and Branches, December 31, 2023", 485 | "start_index": 220, 486 | "end_index": 222, 487 | "node_id": "0075" 488 | } 489 | ], 490 | "node_id": "0049" 491 | } 492 | ] 493 | } -------------------------------------------------------------------------------- /results/Regulation Best Interest_Interpretive release_structure.json: -------------------------------------------------------------------------------- 1 | { 2 | "doc_name": "Regulation Best Interest_Interpretive release.pdf", 3 | "doc_description": "A detailed analysis of the SEC's interpretation of the \"solely incidental\" prong of the broker-dealer exclusion under the Investment Advisers Act of 1940, including its historical context, application guidance, economic implications, and regulatory considerations.", 4 | "structure": [ 5 | { 6 | "title": "Preface", 7 | "start_index": 1, 8 | "end_index": 2, 9 | "node_id": "0000", 10 | "summary": "The partial document outlines an interpretation by the Securities and Exchange Commission (SEC) regarding the \"solely incidental\" prong of the broker-dealer exclusion under the Investment Advisers Act of 1940. It clarifies that brokers or dealers providing advisory services that are incidental to their primary business and for which they receive no special compensation are excluded from the definition of \"investment adviser\" under the Act. The document includes a historical and legislative context, the scope of the \"solely incidental\" prong, guidance on its application, and economic considerations related to the interpretation. It also provides contact information for further inquiries and specifies the effective date of the interpretation as July 12, 2019." 11 | }, 12 | { 13 | "title": "Introduction", 14 | "start_index": 2, 15 | "end_index": 6, 16 | "node_id": "0001", 17 | "summary": "The partial document discusses the regulation of investment advisers under the Advisers Act, specifically focusing on the \"broker-dealer exclusion,\" which exempts brokers and dealers from being classified as investment advisers under certain conditions. Key points include:\n\n1. **Introduction to the Advisers Act**: Overview of the regulation of investment advisers and the broker-dealer exclusion, which applies when advisory services are \"solely incidental\" to brokerage business and no special compensation is received.\n\n2. **Historical Context and Legislative History**: Examination of the historical practices of broker-dealers providing investment advice, distinguishing between auxiliary advice as part of brokerage services and separate advisory services.\n\n3. **Interpretation of the Solely Incidental Prong**: Clarification of the \"solely incidental\" condition of the broker-dealer exclusion, including its application to activities like investment discretion and account monitoring.\n\n4. **Economic Considerations**: Discussion of the potential economic effects of the interpretation and application of the broker-dealer exclusion.\n\n5. **Regulatory Developments**: Reference to the Commission's 2018 proposals, including Regulation Best Interest (Reg. BI), the Proposed Fiduciary Interpretation, and the Relationship Summary Proposal, aimed at enhancing standards of conduct and investor understanding.\n\n6. 
**Public Comments and Feedback**: Summary of public comments on the scope and interpretation of the broker-dealer exclusion, highlighting disagreements and requests for clarification on the \"solely incidental\" prong.\n\n7. **Adoption of Interpretation**: The Commission's adoption of an interpretation to confirm and clarify its position on the \"solely incidental\" prong, complementing related rules and forms to improve investor understanding of broker-dealer and adviser relationships." 18 | }, 19 | { 20 | "title": "Interpretation and Application", 21 | "start_index": 6, 22 | "end_index": 8, 23 | "nodes": [ 24 | { 25 | "title": "Historical Context and Legislative History", 26 | "start_index": 8, 27 | "end_index": 10, 28 | "node_id": "0003", 29 | "summary": "The partial document discusses the historical context and legislative development of the Investment Advisers Act of 1940. It highlights the findings of a congressional study conducted by the SEC between 1935 and 1939, which identified issues with distinguishing legitimate investment counselors from unregulated \"tipster\" organizations and problems in the organization and operation of investment counsel institutions. The document explains how these findings led to the passage of the Advisers Act, which broadly defined \"investment adviser\" and established regulatory oversight for those providing investment advice for compensation. It also addresses the exclusion of certain professionals, such as broker-dealers, from the definition of \"investment adviser\" if their advice is incidental to their primary business and not specially compensated. Additionally, the document explores the scope of the \"solely incidental\" prong of the broker-dealer exclusion, referencing interpretations and rules by the SEC, including a 2005 rule regarding fee-based brokerage accounts." 30 | }, 31 | { 32 | "title": "Scope of the Solely Incidental Prong of the Broker-Dealer Exclusion", 33 | "start_index": 10, 34 | "end_index": 14, 35 | "node_id": "0004", 36 | "summary": "The partial document discusses the \"broker-dealer exclusion\" under the Investment Advisers Act, specifically focusing on the \"solely incidental\" prong. It examines the scope of this exclusion, emphasizing that investment advice provided by broker-dealers is considered \"solely incidental\" if it is connected to and reasonably related to their primary business of effecting securities transactions. The document references historical interpretations, court rulings (e.g., Financial Planning Association v. SEC and Thomas v. Metropolitan Life Insurance Company), and legislative history to clarify this standard. It highlights that the frequency or importance of advice does not determine whether it meets the \"solely incidental\" standard, but rather its relationship to the broker-dealer's primary business. The document also provides guidance on applying this interpretation to specific practices, such as exercising investment discretion and account monitoring, noting that certain discretionary activities may fall outside the scope of the exclusion." 37 | }, 38 | { 39 | "title": "Guidance on Applying the Interpretation of the Solely Incidental Prong", 40 | "start_index": 14, 41 | "end_index": 22, 42 | "node_id": "0005", 43 | "summary": "The partial document provides guidance on the application of the \"solely incidental\" prong of the broker-dealer exclusion under the Advisers Act. 
It focuses on two key areas: (1) the exercise of investment discretion by broker-dealers over customer accounts and (2) account monitoring. The document discusses the Commission's interpretation that unlimited investment discretion is not \"solely incidental\" to a broker-dealer's business, as it indicates a primarily advisory relationship. However, temporary or limited discretion in specific scenarios (e.g., cash management, tax-loss sales, or margin requirements) may be consistent with the \"solely incidental\" prong. It also addresses account monitoring, stating that agreed-upon periodic monitoring for buy, sell, or hold recommendations may align with the broker-dealer exclusion, while continuous monitoring or advisory-like services would not. The document includes examples, refinements to prior interpretations, and considerations for broker-dealers to adopt policies ensuring compliance. It concludes with economic considerations, highlighting the potential impact on broker-dealers, customers, and the financial advice market." 44 | } 45 | ], 46 | "node_id": "0002", 47 | "summary": "The partial document discusses the historical context and legislative history of the Advisers Act of 1940, focusing on the roles of broker-dealers in providing investment advice. It highlights two distinct ways broker-dealers offered advice: as part of traditional brokerage services with fixed commissions and as separate advisory services for a fee. The document examines the concept of \"brokerage house advice,\" detailing the types of information and services provided, such as market analyses, tax information, and investment recommendations. It also references a congressional study conducted between 1935 and 1939, which identified issues with distinguishing legitimate investment counselors from \"tipster\" organizations and problems in the organization and operation of investment counsel institutions. These findings led to the enactment of the Advisers Act, which broadly defined \"investment adviser\" to regulate those providing investment advice for compensation. The document also references various reports, hearings, and literature that informed the development of the Act." 48 | }, 49 | { 50 | "title": "Economic Considerations", 51 | "start_index": 22, 52 | "end_index": 22, 53 | "nodes": [ 54 | { 55 | "title": "Background", 56 | "start_index": 22, 57 | "end_index": 23, 58 | "node_id": "0007", 59 | "summary": "The partial document discusses the U.S. Securities and Exchange Commission's (SEC) interpretation of the \"solely incidental\" prong of the broker-dealer exclusion, clarifying its understanding without creating new legal obligations. It examines the potential economic effects of this interpretation on broker-dealers, their associated persons, customers, and the broader financial advice market. The document provides background data on broker-dealers, including their assets, customer accounts, and dual registration as investment advisers. It highlights compliance costs for broker-dealers to align with the interpretation and notes the limited circumstances under which broker-dealers exercise temporary or limited investment discretion. The document also references the lack of data received during the Reg. BI Proposal to analyze the economic impact further." 
60 | }, 61 | { 62 | "title": "Potential Economic Effects", 63 | "start_index": 23, 64 | "end_index": 28, 65 | "node_id": "0008", 66 | "summary": "The partial document discusses the economic effects and regulatory implications of the SEC's interpretation of the \"solely incidental\" prong of the broker-dealer exclusion from the definition of an investment adviser. Key points include:\n\n1. **Compliance Costs**: Broker-dealers currently incur costs to align their practices with the \"solely incidental\" prong, and the interpretation may lead to additional costs for evaluating and adjusting practices.\n\n2. **Impact on Broker-Dealer Practices**: Broker-dealers providing advisory services beyond the scope of the interpretation may need to adjust their practices, potentially resulting in reduced services, loss of customers, or a shift to advisory accounts.\n\n3. **Market Effects**: The interpretation could lead to decreased competition, increased fees, and a diminished number of broker-dealers offering commission-based services. It may also shift demand from broker-dealers to investment advisers.\n\n4. **Regulatory Adjustments**: Broker-dealers may choose to register as investment advisers, incurring new compliance costs, or migrate customers to advisory accounts of affiliates.\n\n5. **Potential Benefits**: Some broker-dealers may expand limited discretionary services or monitoring activities, benefiting investors with more efficient access to these services.\n\n6. **Regulatory Arbitrage Risks**: The interpretation raises concerns about regulatory arbitrage, though these risks may be mitigated by enhanced standards of conduct for broker-dealers.\n\n7. **Amendments to Regulations**: The document includes amendments to the Code of Federal Regulations, adding an interpretive release regarding the \"solely incidental\" prong, dated June 5, 2019." 67 | } 68 | ], 69 | "node_id": "0006", 70 | "summary": "The partial document discusses the SEC's interpretation of the \"solely incidental\" prong of the broker-dealer exclusion, clarifying that it does not impose new legal obligations but may have economic implications if broker-dealer practices deviate from this interpretation. It provides background on the potential effects on broker-dealers, their associated persons, customers, and the broader financial advice market. The document includes data on the number of registered broker-dealers, their customer accounts, total assets, and the prevalence of dual registrants (firms registered as both broker-dealers and investment advisers) as of December 2018." 
71 | } 72 | ] 73 | } -------------------------------------------------------------------------------- /results/earthmover_structure.json: -------------------------------------------------------------------------------- 1 | { 2 | "doc_name": "earthmover.pdf", 3 | "structure": [ 4 | { 5 | "title": "Earth Mover\u2019s Distance based Similarity Search at Scale", 6 | "start_index": 1, 7 | "end_index": 1, 8 | "node_id": "0000" 9 | }, 10 | { 11 | "title": "ABSTRACT", 12 | "start_index": 1, 13 | "end_index": 1, 14 | "node_id": "0001" 15 | }, 16 | { 17 | "title": "INTRODUCTION", 18 | "start_index": 1, 19 | "end_index": 2, 20 | "node_id": "0002" 21 | }, 22 | { 23 | "title": "PRELIMINARIES", 24 | "start_index": 2, 25 | "end_index": 2, 26 | "nodes": [ 27 | { 28 | "title": "Computing the EMD", 29 | "start_index": 3, 30 | "end_index": 3, 31 | "node_id": "0004" 32 | }, 33 | { 34 | "title": "Filter-and-Refinement Framework", 35 | "start_index": 3, 36 | "end_index": 4, 37 | "node_id": "0005" 38 | } 39 | ], 40 | "node_id": "0003" 41 | }, 42 | { 43 | "title": "SCALING UP SSP", 44 | "start_index": 4, 45 | "end_index": 5, 46 | "node_id": "0006" 47 | }, 48 | { 49 | "title": "BOOSTING THE REFINEMENT PHASE", 50 | "start_index": 5, 51 | "end_index": 5, 52 | "nodes": [ 53 | { 54 | "title": "Analysis of EMD Calculation", 55 | "start_index": 5, 56 | "end_index": 6, 57 | "node_id": "0008" 58 | }, 59 | { 60 | "title": "Progressive Bounding", 61 | "start_index": 6, 62 | "end_index": 6, 63 | "node_id": "0009" 64 | }, 65 | { 66 | "title": "Sensitivity to Refinement Order", 67 | "start_index": 6, 68 | "end_index": 7, 69 | "node_id": "0010" 70 | }, 71 | { 72 | "title": "Dynamic Refinement Ordering", 73 | "start_index": 7, 74 | "end_index": 8, 75 | "node_id": "0011" 76 | }, 77 | { 78 | "title": "Running Upper Bound", 79 | "start_index": 8, 80 | "end_index": 8, 81 | "node_id": "0012" 82 | } 83 | ], 84 | "node_id": "0007" 85 | }, 86 | { 87 | "title": "EXPERIMENTAL EVALUATION", 88 | "start_index": 8, 89 | "end_index": 9, 90 | "nodes": [ 91 | { 92 | "title": "Performance Improvement", 93 | "start_index": 9, 94 | "end_index": 10, 95 | "node_id": "0014" 96 | }, 97 | { 98 | "title": "Scalability Experiments", 99 | "start_index": 10, 100 | "end_index": 11, 101 | "node_id": "0015" 102 | }, 103 | { 104 | "title": "Parameter Tuning in DRO", 105 | "start_index": 11, 106 | "end_index": 12, 107 | "node_id": "0016" 108 | } 109 | ], 110 | "node_id": "0013" 111 | }, 112 | { 113 | "title": "RELATED WORK", 114 | "start_index": 12, 115 | "end_index": 12, 116 | "node_id": "0017" 117 | }, 118 | { 119 | "title": "CONCLUSION", 120 | "start_index": 12, 121 | "end_index": 12, 122 | "node_id": "0018" 123 | }, 124 | { 125 | "title": "ACKNOWLEDGMENT", 126 | "start_index": 12, 127 | "end_index": 12, 128 | "node_id": "0019" 129 | }, 130 | { 131 | "title": "REFERENCES", 132 | "start_index": 12, 133 | "end_index": 12, 134 | "node_id": "0020" 135 | } 136 | ] 137 | } -------------------------------------------------------------------------------- /results/four-lectures_structure.json: -------------------------------------------------------------------------------- 1 | { 2 | "doc_name": "four-lectures.pdf", 3 | "structure": [ 4 | { 5 | "title": "Preface", 6 | "start_index": 1, 7 | "end_index": 1, 8 | "node_id": "0000" 9 | }, 10 | { 11 | "title": "ML at a Glance", 12 | "start_index": 2, 13 | "end_index": 2, 14 | "nodes": [ 15 | { 16 | "title": "An ML session", 17 | "start_index": 2, 18 | "end_index": 3, 19 | "node_id": "0002" 20 | }, 21 | { 22 | "title": "Types 
and Values", 23 | "start_index": 3, 24 | "end_index": 4, 25 | "node_id": "0003" 26 | }, 27 | { 28 | "title": "Recursive Functions", 29 | "start_index": 4, 30 | "end_index": 4, 31 | "node_id": "0004" 32 | }, 33 | { 34 | "title": "Raising Exceptions", 35 | "start_index": 4, 36 | "end_index": 5, 37 | "node_id": "0005" 38 | }, 39 | { 40 | "title": "Structures", 41 | "start_index": 5, 42 | "end_index": 6, 43 | "node_id": "0006" 44 | }, 45 | { 46 | "title": "Signatures", 47 | "start_index": 6, 48 | "end_index": 7, 49 | "node_id": "0007" 50 | }, 51 | { 52 | "title": "Coercive Signature Matching", 53 | "start_index": 7, 54 | "end_index": 8, 55 | "node_id": "0008" 56 | }, 57 | { 58 | "title": "Functor Declaration", 59 | "start_index": 8, 60 | "end_index": 9, 61 | "node_id": "0009" 62 | }, 63 | { 64 | "title": "Functor Application", 65 | "start_index": 9, 66 | "end_index": 9, 67 | "node_id": "0010" 68 | }, 69 | { 70 | "title": "Summary", 71 | "start_index": 9, 72 | "end_index": 9, 73 | "node_id": "0011" 74 | } 75 | ], 76 | "node_id": "0001" 77 | }, 78 | { 79 | "title": "Programming with ML Modules", 80 | "start_index": 10, 81 | "end_index": 10, 82 | "nodes": [ 83 | { 84 | "title": "Introduction", 85 | "start_index": 10, 86 | "end_index": 11, 87 | "node_id": "0013" 88 | }, 89 | { 90 | "title": "Signatures", 91 | "start_index": 11, 92 | "end_index": 12, 93 | "node_id": "0014" 94 | }, 95 | { 96 | "title": "Structures", 97 | "start_index": 12, 98 | "end_index": 13, 99 | "node_id": "0015" 100 | }, 101 | { 102 | "title": "Functors", 103 | "start_index": 13, 104 | "end_index": 14, 105 | "node_id": "0016" 106 | }, 107 | { 108 | "title": "Substructures", 109 | "start_index": 14, 110 | "end_index": 15, 111 | "node_id": "0017" 112 | }, 113 | { 114 | "title": "Sharing", 115 | "start_index": 15, 116 | "end_index": 16, 117 | "node_id": "0018" 118 | }, 119 | { 120 | "title": "Building the System", 121 | "start_index": 16, 122 | "end_index": 17, 123 | "node_id": "0019" 124 | }, 125 | { 126 | "title": "Separate Compilation", 127 | "start_index": 17, 128 | "end_index": 18, 129 | "node_id": "0020" 130 | }, 131 | { 132 | "title": "Good Style", 133 | "start_index": 18, 134 | "end_index": 18, 135 | "node_id": "0021" 136 | }, 137 | { 138 | "title": "Bad Style", 139 | "start_index": 18, 140 | "end_index": 19, 141 | "node_id": "0022" 142 | } 143 | ], 144 | "node_id": "0012" 145 | }, 146 | { 147 | "title": "The Static Semantics of Modules", 148 | "start_index": 20, 149 | "end_index": 20, 150 | "nodes": [ 151 | { 152 | "title": "Elaboration", 153 | "start_index": 20, 154 | "end_index": 21, 155 | "node_id": "0024" 156 | }, 157 | { 158 | "title": "Names", 159 | "start_index": 21, 160 | "end_index": 21, 161 | "node_id": "0025" 162 | }, 163 | { 164 | "title": "Decorating Structures", 165 | "start_index": 21, 166 | "end_index": 21, 167 | "node_id": "0026" 168 | }, 169 | { 170 | "title": "Decorating Signatures", 171 | "start_index": 22, 172 | "end_index": 23, 173 | "node_id": "0027" 174 | }, 175 | { 176 | "title": "Signature Instantiation", 177 | "start_index": 23, 178 | "end_index": 24, 179 | "node_id": "0028" 180 | }, 181 | { 182 | "title": "Signature Matching", 183 | "start_index": 24, 184 | "end_index": 25, 185 | "node_id": "0029" 186 | }, 187 | { 188 | "title": "Signature Constraints", 189 | "start_index": 25, 190 | "end_index": 25, 191 | "node_id": "0030" 192 | }, 193 | { 194 | "title": "Decorating Functors", 195 | "start_index": 26, 196 | "end_index": 26, 197 | "node_id": "0031" 198 | }, 199 | { 200 | "title": "External 
Sharing", 201 | "start_index": 26, 202 | "end_index": 27, 203 | "node_id": "0032" 204 | }, 205 | { 206 | "title": "Functors with Arguments", 207 | "start_index": 27, 208 | "end_index": 28, 209 | "node_id": "0033" 210 | }, 211 | { 212 | "title": "Sharing Between Argument and Result", 213 | "start_index": 28, 214 | "end_index": 28, 215 | "node_id": "0034" 216 | }, 217 | { 218 | "title": "Explicit Result Signatures", 219 | "start_index": 28, 220 | "end_index": 29, 221 | "node_id": "0035" 222 | } 223 | ], 224 | "node_id": "0023" 225 | }, 226 | { 227 | "title": "Implementing an Interpreter in ML", 228 | "start_index": 30, 229 | "end_index": 32, 230 | "nodes": [ 231 | { 232 | "title": "Version 1: The Bare Typechecker", 233 | "start_index": 32, 234 | "end_index": 33, 235 | "node_id": "0037" 236 | }, 237 | { 238 | "title": "Version 2: Adding Lists and Polymorphism", 239 | "start_index": 33, 240 | "end_index": 37, 241 | "node_id": "0038" 242 | }, 243 | { 244 | "title": "Version 3: A Different Implementation of Types", 245 | "start_index": 37, 246 | "end_index": 39, 247 | "node_id": "0039" 248 | }, 249 | { 250 | "title": "Version 4: Introducing Variables and Let", 251 | "start_index": 39, 252 | "end_index": 43, 253 | "node_id": "0040" 254 | }, 255 | { 256 | "title": "Acknowledgement", 257 | "start_index": 43, 258 | "end_index": 43, 259 | "node_id": "0041" 260 | } 261 | ], 262 | "node_id": "0036" 263 | }, 264 | { 265 | "title": "Appendix A: The Bare Interpreter", 266 | "start_index": 44, 267 | "end_index": 44, 268 | "nodes": [ 269 | { 270 | "title": "Syntax", 271 | "start_index": 44, 272 | "end_index": 44, 273 | "node_id": "0043" 274 | }, 275 | { 276 | "title": "Parsing", 277 | "start_index": 44, 278 | "end_index": 45, 279 | "node_id": "0044" 280 | }, 281 | { 282 | "title": "Environments", 283 | "start_index": 45, 284 | "end_index": 45, 285 | "node_id": "0045" 286 | }, 287 | { 288 | "title": "Evaluation", 289 | "start_index": 45, 290 | "end_index": 46, 291 | "node_id": "0046" 292 | }, 293 | { 294 | "title": "Type Checking", 295 | "start_index": 46, 296 | "end_index": 46, 297 | "node_id": "0047" 298 | }, 299 | { 300 | "title": "The Interpreter", 301 | "start_index": 46, 302 | "end_index": 47, 303 | "node_id": "0048" 304 | }, 305 | { 306 | "title": "The Evaluator", 307 | "start_index": 47, 308 | "end_index": 48, 309 | "node_id": "0049" 310 | }, 311 | { 312 | "title": "The Typechecker", 313 | "start_index": 48, 314 | "end_index": 49, 315 | "node_id": "0050" 316 | }, 317 | { 318 | "title": "The Basics", 319 | "start_index": 50, 320 | "end_index": 52, 321 | "node_id": "0051" 322 | } 323 | ], 324 | "node_id": "0042" 325 | }, 326 | { 327 | "title": "Appendix B: Files", 328 | "start_index": 53, 329 | "end_index": 53, 330 | "node_id": "0052" 331 | } 332 | ] 333 | } -------------------------------------------------------------------------------- /results/q1-fy25-earnings_structure.json: -------------------------------------------------------------------------------- 1 | { 2 | "doc_name": "q1-fy25-earnings.pdf", 3 | "doc_description": "A comprehensive financial report detailing The Walt Disney Company's first-quarter fiscal 2025 performance, including revenue growth, segment highlights, guidance for fiscal 2025, and key financial metrics such as adjusted EPS, operating income, and cash flow.", 4 | "structure": [ 5 | { 6 | "title": "THE WALT DISNEY COMPANY REPORTS FIRST QUARTER EARNINGS FOR FISCAL 2025", 7 | "start_index": 1, 8 | "end_index": 1, 9 | "nodes": [ 10 | { 11 | "title": "Financial Results for the 
Quarter", 12 | "start_index": 1, 13 | "end_index": 1, 14 | "nodes": [ 15 | { 16 | "title": "Key Points", 17 | "start_index": 1, 18 | "end_index": 1, 19 | "node_id": "0002", 20 | "summary": "The partial document outlines The Walt Disney Company's financial performance for the first fiscal quarter of 2025, ending December 28, 2024. Key points include:\n\n1. **Financial Results**: \n - Revenue increased by 5% to $24.7 billion.\n - Income before taxes rose by 27% to $3.7 billion.\n - Diluted EPS grew by 35% to $1.40.\n - Total segment operating income increased by 31% to $5.1 billion, with adjusted EPS up 44% to $1.76.\n\n2. **Entertainment Segment**:\n - Operating income increased by $0.8 billion to $1.7 billion.\n - Direct-to-Consumer operating income rose by $431 million to $293 million, with advertising revenue (excluding Disney+ Hotstar in India) up 16%.\n - Disney+ and Hulu subscriptions increased by 0.9 million, while Disney+ subscribers decreased by 0.7 million.\n - Content sales/licensing income grew by $536 million, driven by the success of *Moana 2*.\n\n3. **Sports Segment**:\n - Operating income increased by $350 million to $247 million.\n - Domestic ESPN advertising revenue grew by 15%.\n\n4. **Experiences Segment**:\n - Operating income remained at $3.1 billion, with a 6 percentage-point adverse impact due to Hurricanes Milton and Helene and pre-opening expenses for the Disney Treasure.\n - Domestic Parks & Experiences income declined by 5%, while International Parks & Experiences income increased by 28%." 21 | } 22 | ], 23 | "node_id": "0001", 24 | "summary": "The partial document is a report from The Walt Disney Company detailing its financial performance for the first fiscal quarter of 2025, ending December 28, 2024. Key points include:\n\n1. **Financial Performance**:\n - Revenue increased by 5% to $24.7 billion.\n - Income before taxes rose by 27% to $3.7 billion.\n - Diluted EPS grew by 35% to $1.40.\n - Total segment operating income increased by 31% to $5.1 billion, with adjusted EPS up 44% to $1.76.\n\n2. **Segment Highlights**:\n - **Entertainment**: Operating income increased by $0.8 billion to $1.7 billion. Direct-to-Consumer income rose by $431 million, though advertising revenue declined 2% (up 16% excluding Disney+ Hotstar in India). Disney+ and Hulu subscriptions increased slightly, while Disney+ subscribers decreased by 0.7 million. Content sales/licensing income grew, driven by the success of *Moana 2*.\n - **Sports**: Operating income increased by $350 million to $247 million, with ESPN domestic advertising revenue up 15%.\n - **Experiences**: Operating income remained at $3.1 billion, with adverse impacts from hurricanes and pre-opening expenses for the Disney Treasure. Domestic Parks & Experiences income declined by 5%, while International Parks & Experiences income rose by 28%.\n\n3. **Additional Notes**:\n - Non-GAAP financial measures are used for certain metrics.\n - Disney+ Hotstar in India saw a significant decline in advertising revenue compared to the previous year." 25 | }, 26 | { 27 | "title": "Guidance and Outlook", 28 | "start_index": 2, 29 | "end_index": 2, 30 | "nodes": [ 31 | { 32 | "title": "Star India deconsolidated in Q1", 33 | "start_index": 2, 34 | "end_index": 2, 35 | "node_id": "0004", 36 | "summary": "The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. 
It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For fiscal 2025, the company projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes strong Q1 results, including box office success, improved profitability in streaming, advancements in ESPN\u2019s digital strategy, and continued investments in the Experiences segment, expressing confidence in Disney's growth strategy." 37 | }, 38 | { 39 | "title": "Q2 Fiscal 2025", 40 | "start_index": 2, 41 | "end_index": 2, 42 | "node_id": "0005", 43 | "summary": "The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes Disney's strong start to the fiscal year, citing achievements in box office performance, improved streaming profitability, ESPN's digital strategy, and the enduring appeal of the Experiences segment." 44 | }, 45 | { 46 | "title": "Fiscal Year 2025", 47 | "start_index": 2, 48 | "end_index": 2, 49 | "node_id": "0006", 50 | "summary": "The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes Disney's creative and financial strength, strong box office performance, improved streaming profitability, advancements in ESPN's digital strategy, and continued global investments in the Experiences segment." 51 | } 52 | ], 53 | "node_id": "0003", 54 | "summary": "The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes strong Q1 results, including box office success, improved profitability in streaming, advancements in ESPN\u2019s digital strategy, and continued investment in global experiences." 
55 | }, 56 | { 57 | "title": "Message From Our CEO", 58 | "start_index": 2, 59 | "end_index": 2, 60 | "node_id": "0007", 61 | "summary": "The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes strong Q1 results, including box office success, improved profitability in streaming, advancements in ESPN\u2019s digital strategy, and continued investment in global experiences." 62 | } 63 | ], 64 | "node_id": "0000", 65 | "summary": "The partial document is a report from The Walt Disney Company detailing its financial performance for the first fiscal quarter of 2025, ending December 28, 2024. Key points include:\n\n1. **Financial Results**: \n - Revenue increased by 5% to $24.7 billion. \n - Income before taxes rose by 27% to $3.7 billion. \n - Diluted EPS grew by 35% to $1.40. \n - Total segment operating income increased by 31% to $5.1 billion, and adjusted EPS rose by 44% to $1.76. \n\n2. **Entertainment Segment**: \n - Operating income increased by $0.8 billion to $1.7 billion. \n - Direct-to-Consumer operating income rose by $431 million to $293 million, with advertising revenue up 16% (excluding Disney+ Hotstar in India). \n - Disney+ and Hulu subscriptions increased by 0.9 million, while Disney+ subscribers decreased by 0.7 million. \n - Content sales/licensing income grew by $536 million, driven by the success of *Moana 2*. \n\n3. **Sports Segment**: \n - Operating income increased by $350 million to $247 million. \n - Domestic ESPN advertising revenue grew by 15%. \n\n4. **Experiences Segment**: \n - Operating income remained at $3.1 billion, with a 6 percentage-point adverse impact due to Hurricanes Milton and Helene and pre-opening expenses for the Disney Treasure. \n - Domestic Parks & Experiences income declined by 5%, while International Parks & Experiences income increased by 28%. \n\nThe report also includes non-GAAP financial measures and notes the impact of Disney+ Hotstar's advertising revenue in India." 66 | }, 67 | { 68 | "title": "SUMMARIZED FINANCIAL RESULTS", 69 | "start_index": 3, 70 | "end_index": 3, 71 | "nodes": [ 72 | { 73 | "title": "SUMMARIZED SEGMENT FINANCIAL RESULTS", 74 | "start_index": 3, 75 | "end_index": 3, 76 | "node_id": "0009", 77 | "summary": "The partial document provides a summarized overview of financial results for the first quarter of fiscal years 2025 and 2024. Key points include:\n\n1. **Overall Financial Performance**:\n - Revenues increased by 5% from $23,549 million in 2024 to $24,690 million in 2025.\n - Income before income taxes rose by 27%.\n - Total segment operating income grew by 31%.\n - Diluted EPS increased by 35%, and diluted EPS excluding certain items rose by 44%.\n - Cash provided by operations increased by 47%, while free cash flow decreased by 17%.\n\n2. 
**Segment Financial Results**:\n - Revenue growth was observed in the Entertainment segment (9%) and Experiences segment (3%), while Sports revenue remained flat.\n - Segment operating income for Entertainment increased significantly by 95%, while Sports shifted from a loss to a positive income. Experiences segment operating income remained stable.\n\n3. **Non-GAAP Measures**:\n - The document highlights the use of non-GAAP financial measures such as total segment operating income, diluted EPS excluding certain items, and free cash flow, with references to further details and reconciliations provided elsewhere in the report." 78 | } 79 | ], 80 | "node_id": "0008", 81 | "summary": "The partial document provides a summarized overview of financial results for the first quarter of fiscal years 2025 and 2024. Key points include:\n\n1. **Overall Financial Performance**:\n - Revenues increased by 5% from $23,549 million in 2024 to $24,690 million in 2025.\n - Income before income taxes rose by 27%.\n - Total segment operating income grew by 31%.\n - Diluted EPS increased by 35%, and diluted EPS excluding certain items rose by 44%.\n - Cash provided by operations increased by 47%, while free cash flow decreased by 17%.\n\n2. **Segment Financial Results**:\n - Revenue growth was observed in the Entertainment segment (9%) and Experiences segment (3%), while Sports revenue remained flat.\n - Segment operating income for Entertainment increased significantly by 95%, while Sports shifted from a loss to a positive income. Experiences segment operating income remained stable.\n\n3. **Non-GAAP Measures**:\n - The document highlights the use of non-GAAP financial measures such as total segment operating income, diluted EPS excluding certain items, and free cash flow, with references to further details and reconciliations provided in later sections." 82 | }, 83 | { 84 | "title": "DISCUSSION OF FIRST QUARTER SEGMENT RESULTS", 85 | "start_index": 4, 86 | "end_index": 4, 87 | "nodes": [ 88 | { 89 | "title": "Star India", 90 | "start_index": 4, 91 | "end_index": 4, 92 | "node_id": "0011", 93 | "summary": "The partial document discusses the first-quarter segment results, focusing on the Star India joint venture formed between the Company and Reliance Industries Limited (RIL) on November 14, 2024. The joint venture combines Star-branded entertainment and sports television channels, Disney+ Hotstar, and certain RIL-controlled media businesses, with RIL holding a 56% controlling interest, the Company holding 37%, and a third-party investment company holding 7%. The Company now recognizes its 37% share of the joint venture\u2019s results under \"Equity in the income of investees.\" Additionally, the document provides financial results for the Entertainment segment, showing a 9% increase in total revenues and a 95% increase in operating income compared to the prior-year quarter. The growth in operating income is attributed to improved results in Content Sales/Licensing and Direct-to-Consumer, partially offset by a decline in Linear Networks." 94 | }, 95 | { 96 | "title": "Entertainment", 97 | "start_index": 4, 98 | "end_index": 4, 99 | "nodes": [ 100 | { 101 | "title": "Linear Networks", 102 | "start_index": 5, 103 | "end_index": 5, 104 | "node_id": "0013", 105 | "summary": "The partial document provides financial performance details for Linear Networks and Direct-to-Consumer segments for the quarters ending December 28, 2024, and December 30, 2023. Key points include:\n\n1. 
**Linear Networks**:\n - Revenue decreased by 7%, with domestic revenue remaining flat and international revenue declining by 31%.\n - Operating income decreased by 11%, with domestic income stable and international income dropping by 39%.\n - Domestic operating income was impacted by higher programming costs (due to the 2023 guild strikes), lower affiliate revenue (fewer subscribers), lower technology costs, and higher advertising revenue (driven by political advertising but offset by lower viewership).\n - International operating income decline was attributed to the Star India Transaction.\n - Equity income from investees decreased due to lower income from A+E Television Networks, reduced advertising and affiliate revenue, and the absence of a prior-year gain from an investment sale.\n\n2. **Direct-to-Consumer**:\n - Revenue increased by 9%, driven by higher subscription revenue due to increased pricing and more subscribers, partially offset by unfavorable foreign exchange impacts.\n - Operating income improved significantly, moving from a loss in the prior year to a profit, reflecting subscription revenue growth." 106 | }, 107 | { 108 | "title": "Direct-to-Consumer", 109 | "start_index": 5, 110 | "end_index": 7, 111 | "node_id": "0014", 112 | "summary": "The partial document provides a financial performance overview of various segments for the quarter ended December 28, 2024, compared to the prior-year quarter. Key points include:\n\n1. **Linear Networks**:\n - Revenue decreased by 7%, with domestic revenue flat and international revenue down 31%.\n - Operating income decreased by 11%, with domestic income flat and international income down 39%, primarily due to the Star India transaction.\n - Equity income from investees declined by 29%, driven by lower income from A+E Television Networks and the absence of a prior-year gain on an investment sale.\n\n2. **Direct-to-Consumer (DTC)**:\n - Revenue increased by 9%, and operating income improved significantly from a loss of $138 million to a profit of $293 million.\n - Growth was driven by higher subscription revenue due to pricing increases and more subscribers, partially offset by higher costs and lower advertising revenue.\n - Key metrics showed slight changes in Disney+ and Hulu subscriber numbers, with increases in average monthly revenue per paid subscriber due to pricing adjustments.\n\n3. **Content Sales/Licensing and Other**:\n - Revenue increased by 34%, and operating income improved significantly, driven by strong theatrical performance, particularly from \"Moana 2,\" and contributions from \"Mufasa: The Lion King.\"\n\n4. **Sports**:\n - ESPN revenue grew by 8%, with domestic and international segments showing increases, while Star India revenue dropped by 90%.\n - Operating income for ESPN improved by 15%, while Star India shifted from a loss to a small profit.\n\nThe document highlights revenue trends, operating income changes, and key drivers for each segment, including programming costs, subscriber growth, pricing adjustments, and content performance." 113 | }, 114 | { 115 | "title": "Content Sales/Licensing and Other", 116 | "start_index": 7, 117 | "end_index": 7, 118 | "node_id": "0015", 119 | "summary": "The partial document discusses the financial performance of Disney's streaming services, content sales, and sports segment. Key points include:\n\n1. 
**Disney+ Revenue**: Domestic and international Disney+ average monthly revenue per paid subscriber increased due to pricing hikes, partially offset by promotional offerings. International revenue also benefited from higher advertising revenue.\n\n2. **Hulu Revenue**: Hulu SVOD Only revenue remained stable, with pricing increases offsetting lower advertising revenue. Hulu Live TV + SVOD revenue increased due to pricing hikes.\n\n3. **Content Sales/Licensing**: Revenue and operating income improved significantly, driven by strong theatrical distribution results, particularly from \"Moana 2,\" and contributions from \"Mufasa: The Lion King.\"\n\n4. **Sports Revenue**: ESPN domestic and international revenues grew, while Star India revenue declined sharply. Operating income for ESPN improved, with domestic income slightly down and international losses reduced. Star India showed a notable recovery in operating income." 120 | } 121 | ], 122 | "node_id": "0012", 123 | "summary": "The partial document discusses the first-quarter segment results, focusing on the Star India joint venture formed between the Company and Reliance Industries Limited (RIL) on November 14, 2024. The joint venture combines Star-branded entertainment and sports television channels and the Disney+ Hotstar service in India, with RIL holding a 56% controlling interest, the Company holding 37%, and a third-party investment company holding 7%. The Company now recognizes its 37% share of the joint venture\u2019s results under \u201cEquity in the income of investees.\u201d Additionally, the document provides financial results for the Entertainment segment, showing a 9% increase in total revenues compared to the prior year, driven by growth in Direct-to-Consumer and Content Sales/Licensing and Other, despite a decline in Linear Networks. Operating income increased by 95%, primarily due to improved results in Content Sales/Licensing and Other and Direct-to-Consumer, partially offset by a decrease in Linear Networks." 124 | }, 125 | { 126 | "title": "Sports", 127 | "start_index": 7, 128 | "end_index": 7, 129 | "nodes": [ 130 | { 131 | "title": "Domestic ESPN", 132 | "start_index": 8, 133 | "end_index": 8, 134 | "node_id": "0017", 135 | "summary": "The partial document discusses the financial performance of ESPN, including domestic and international operations, as well as Star India, for the current quarter compared to the prior-year quarter. Key points include:\n\n1. **Domestic ESPN**: \n - Decrease in operating results due to higher programming and production costs, primarily from expanded college football programming rights and changes in the College Football Playoff (CFP) format.\n - Increase in advertising revenue due to higher rates.\n - Revenue from sub-licensing CFP programming rights.\n - Affiliate revenue remained comparable, with rate increases offset by fewer subscribers.\n\n2. **International ESPN**: \n - Decrease in operating loss driven by higher fees from the Entertainment segment for Disney+ sports content.\n - Increased programming and production costs due to higher soccer rights costs.\n - Lower affiliate revenue due to fewer subscribers.\n\n3. **Star India**: \n - Improved operating results due to the absence of significant cricket events in the current quarter compared to the prior-year quarter, which included the ICC Cricket World Cup.\n\n4. 
**Key Metrics for ESPN+**:\n - Paid subscribers decreased from 25.6 million to 24.9 million.\n - Average monthly revenue per paid subscriber increased from $5.94 to $6.36, driven by pricing increases and higher advertising revenue." 136 | }, 137 | { 138 | "title": "International ESPN", 139 | "start_index": 8, 140 | "end_index": 8, 141 | "node_id": "0018", 142 | "summary": "The partial document discusses the financial performance of ESPN, including domestic and international operations, as well as Star India, for the current quarter compared to the prior-year quarter. Key points include:\n\n1. **Domestic ESPN**: \n - Decrease in operating results due to higher programming and production costs, primarily from expanded college football programming rights and changes in the College Football Playoff (CFP) format.\n - Increase in advertising revenue due to higher rates.\n - Revenue from sub-licensing CFP programming rights.\n - Affiliate revenue remained comparable, with rate increases offset by fewer subscribers.\n\n2. **International ESPN**: \n - Decrease in operating loss driven by higher fees from the Entertainment segment for Disney+ sports content.\n - Increased programming and production costs due to higher soccer rights costs.\n - Lower affiliate revenue due to fewer subscribers.\n\n3. **Star India**: \n - Improved operating results due to the absence of significant cricket events in the current quarter compared to the ICC Cricket World Cup in the prior-year quarter.\n\n4. **Key Metrics for ESPN+**:\n - Paid subscribers decreased from 25.6 million to 24.9 million.\n - Average monthly revenue per paid subscriber increased from $5.94 to $6.36, driven by pricing increases and higher advertising revenue." 143 | }, 144 | { 145 | "title": "Star India", 146 | "start_index": 8, 147 | "end_index": 8, 148 | "node_id": "0019", 149 | "summary": "The partial document discusses the financial performance of ESPN, including domestic and international operations, as well as Star India, for a specific quarter. Key points include:\n\n1. **Domestic ESPN**: \n - Decrease in operating results due to higher programming and production costs, primarily from expanded college football programming rights, including additional College Football Playoff (CFP) games under a revised format.\n - Increase in advertising revenue due to higher rates.\n - Revenue from sub-licensing CFP programming rights.\n - Affiliate revenue remained comparable to the prior year due to effective rate increases offset by fewer subscribers.\n\n2. **International ESPN**: \n - Decrease in operating loss driven by higher fees from the Entertainment segment for sports content on Disney+.\n - Increased programming and production costs due to higher soccer rights costs.\n - Lower affiliate revenue due to fewer subscribers.\n\n3. **Star India**: \n - Improvement in operating results due to the absence of significant cricket events in the current quarter compared to the prior year, which included the ICC Cricket World Cup.\n\n4. **Key Metrics for ESPN+**:\n - Paid subscribers decreased from 25.6 million to 24.9 million.\n - Average monthly revenue per paid subscriber increased from $5.94 to $6.36, driven by pricing increases and higher advertising revenue." 150 | } 151 | ], 152 | "node_id": "0016", 153 | "summary": "The partial document discusses the financial performance of Disney's streaming services, content sales, and sports segment. Key points include:\n\n1. 
**Disney+ Revenue**: Domestic and international Disney+ average monthly revenue per paid subscriber increased due to pricing hikes, partially offset by promotional offerings. International revenue also benefited from higher advertising revenue.\n\n2. **Hulu Revenue**: Hulu SVOD Only revenue remained stable, with pricing increases offsetting lower advertising revenue. Hulu Live TV + SVOD revenue increased due to pricing hikes.\n\n3. **Content Sales/Licensing**: Revenue and operating income improved significantly, driven by strong theatrical performance, particularly from \"Moana 2,\" and contributions from \"Mufasa: The Lion King.\"\n\n4. **Sports Revenue**: ESPN domestic and international revenues grew, while Star India revenue declined sharply. Operating income for ESPN improved, with domestic income slightly down and international income showing significant recovery. Star India showed a notable turnaround in operating income." 154 | }, 155 | { 156 | "title": "Experiences", 157 | "start_index": 9, 158 | "end_index": 9, 159 | "node_id": "0020", 160 | "summary": "The partial document provides financial performance details for the Parks & Experiences segment, including revenues and operating income for domestic and international operations, as well as consumer products. It highlights a 3% increase in total revenue and stable operating income compared to the prior year. Domestic parks and experiences were negatively impacted by hurricanes, leading to lower volumes and higher costs, despite increased guest spending. International parks and experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings. The document also notes increased corporate expenses due to a legal settlement and a $143 million loss related to the Star India Transaction." 161 | } 162 | ], 163 | "node_id": "0010", 164 | "summary": "The partial document discusses the first-quarter segment results, focusing on the Star India joint venture formed between the Company and Reliance Industries Limited (RIL) on November 14, 2024. The joint venture combines Star-branded entertainment and sports television channels, Disney+ Hotstar, and certain RIL-controlled media businesses, with RIL holding a 56% controlling interest, the Company holding 37%, and a third-party investment company holding 7%. The Company now recognizes its 37% share of the joint venture\u2019s results under \"Equity in the income of investees.\" Additionally, the document provides financial results for the Entertainment segment, showing a 9% increase in total revenues and a 95% increase in operating income compared to the prior-year quarter. The growth in operating income is attributed to improved results in Content Sales/Licensing and Direct-to-Consumer, partially offset by a decline in Linear Networks." 165 | }, 166 | { 167 | "title": "OTHER FINANCIAL INFORMATION", 168 | "start_index": 9, 169 | "end_index": 9, 170 | "nodes": [ 171 | { 172 | "title": "Corporate and Unallocated Shared Expenses", 173 | "start_index": 9, 174 | "end_index": 9, 175 | "node_id": "0022", 176 | "summary": "The partial document provides a financial overview of revenues and operating income for Parks & Experiences, including Domestic, International, and Consumer Products segments, comparing the quarters ending December 28, 2024, and December 30, 2023. It highlights a 3% increase in overall revenue and stable operating income. 
Domestic Parks and Experiences were negatively impacted by Hurricanes Milton and Helene, leading to closures, cancellations, higher costs, and lower attendance, despite increased guest spending. International Parks and Experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings, offset by higher costs. The document also notes a $152 million increase in corporate and unallocated shared expenses due to a legal settlement and a $143 million loss related to the Star India Transaction." 177 | }, 178 | { 179 | "title": "Restructuring and Impairment Charges", 180 | "start_index": 9, 181 | "end_index": 9, 182 | "node_id": "0023", 183 | "summary": "The partial document provides financial performance details for the Parks & Experiences segment, including revenues and operating income for domestic and international operations, as well as consumer products. It highlights a 3% increase in overall revenue and stable operating income compared to the prior year. Domestic parks and experiences were negatively impacted by hurricanes, leading to lower volumes and higher costs, despite increased guest spending. International parks and experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings, though costs also rose. Additionally, corporate and unallocated shared expenses increased due to a legal settlement, and a $143 million loss was recorded related to the Star India Transaction." 184 | }, 185 | { 186 | "title": "Interest Expense, net", 187 | "start_index": 10, 188 | "end_index": 10, 189 | "node_id": "0024", 190 | "summary": "The partial document provides a financial analysis of interest expense, net, equity in the income of investees, and income taxes for the quarters ending December 28, 2024, and December 30, 2023. Key points include:\n\n1. **Interest Expense, Net**: A decrease in interest expense due to lower average rates and debt balances, partially offset by reduced capitalized interest. Interest income and investment income declined due to lower cash balances, pension-related costs, and investment losses compared to prior-year gains.\n\n2. **Equity in the Income of Investees**: A $89 million decrease in income from investees, primarily due to lower income from A+E and losses from the India joint venture.\n\n3. **Income Taxes**: An increase in the effective income tax rate from 25.1% to 27.8%, driven by a non-cash tax charge related to the Star India Transaction, partially offset by favorable adjustments related to prior years, lower foreign tax rates, and a comparison to unfavorable prior-year effects of employee share-based awards." 191 | }, 192 | { 193 | "title": "Equity in the Income of Investees", 194 | "start_index": 10, 195 | "end_index": 10, 196 | "node_id": "0025", 197 | "summary": "The partial document provides a financial analysis of interest expense, net, equity in the income of investees, and income taxes for the quarters ended December 28, 2024, and December 30, 2023. It highlights a decrease in net interest expense due to lower average rates and debt balances, offset by reduced capitalized interest. Interest income and investment income declined due to lower cash balances, pension-related costs, and investment losses. Equity income from investees decreased significantly, driven by lower income from A+E and losses from the India joint venture. 
The effective income tax rate increased due to a non-cash tax charge related to the Star India Transaction, partially offset by favorable adjustments related to prior years, lower foreign tax rates, and a comparison to unfavorable prior-year effects." 198 | }, 199 | { 200 | "title": "Income Taxes", 201 | "start_index": 10, 202 | "end_index": 10, 203 | "node_id": "0026", 204 | "summary": "The partial document provides a financial analysis of interest expense, net, equity in the income of investees, and income taxes for the quarters ended December 28, 2024, and December 30, 2023. It highlights a decrease in net interest expense due to lower average rates and debt balances, offset by reduced capitalized interest. Interest income and investment income declined due to lower cash balances, pension-related costs, and investment losses. Equity income from investees dropped significantly, driven by lower income from A+E and losses from the India joint venture. The effective income tax rate increased due to a non-cash tax charge related to the Star India Transaction, partially offset by favorable adjustments related to prior years, lower foreign tax rates, and a comparison to unfavorable prior-year effects." 205 | }, 206 | { 207 | "title": "Noncontrolling Interests", 208 | "start_index": 11, 209 | "end_index": 11, 210 | "node_id": "0027", 211 | "summary": "The partial document covers two main points:\n\n1. **Noncontrolling Interests**: It discusses the net income attributable to noncontrolling interests, which decreased by 63% compared to the prior-year quarter. The decrease is attributed to the prior-year accretion of NBC Universal\u2019s interest in Hulu. The calculation of net income attributable to noncontrolling interests is based on income after royalties, management fees, financing costs, and income taxes.\n\n2. **Cash from Operations**: It details cash provided by operations and free cash flow, showing an increase in cash provided by operations by $1.0 billion to $3.2 billion in the current quarter. The increase is driven by lower tax payments, higher operating income at Entertainment, and higher film and television production spending, along with the timing of payments for sports rights. Free cash flow decreased by $147 million compared to the prior-year quarter." 212 | }, 213 | { 214 | "title": "Cash from Operations", 215 | "start_index": 11, 216 | "end_index": 11, 217 | "node_id": "0028", 218 | "summary": "The partial document covers two main points:\n\n1. **Noncontrolling Interests**: It discusses the net income attributable to noncontrolling interests, which decreased by 63% in the quarter ended December 28, 2024, compared to the prior-year quarter. The decrease is attributed to the prior-year accretion of NBC Universal\u2019s interest in Hulu. The calculation of net income attributable to noncontrolling interests includes royalties, management fees, financing costs, and income taxes.\n\n2. **Cash from Operations**: It details cash provided by operations and free cash flow for the quarter ended December 28, 2024, compared to the prior-year quarter. Cash provided by operations increased by $1.0 billion, driven by lower tax payments, higher operating income at Entertainment, and higher film and television production spending, along with the timing of payments for sports rights. Free cash flow decreased by $147 million due to increased investments in parks, resorts, and other property." 
219 | }, 220 | { 221 | "title": "Capital Expenditures", 222 | "start_index": 12, 223 | "end_index": 12, 224 | "node_id": "0029", 225 | "summary": "The partial document provides details on capital expenditures and depreciation expenses for parks, resorts, and other properties. It highlights an increase in capital expenditures from $1.3 billion to $2.5 billion, primarily due to higher spending on cruise ship fleet expansion in the Experiences segment. The document also breaks down investments and depreciation expenses by category (Entertainment, Sports, Domestic and International Experiences, and Corporate) for the quarters ending December 28, 2024, and December 30, 2023. Depreciation expenses increased from $823 million to $909 million, with detailed figures provided for each segment." 226 | }, 227 | { 228 | "title": "Depreciation Expense", 229 | "start_index": 12, 230 | "end_index": 12, 231 | "node_id": "0030", 232 | "summary": "The partial document provides details on capital expenditures and depreciation expenses for parks, resorts, and other properties. It highlights an increase in capital expenditures from $1.3 billion to $2.5 billion, primarily due to higher spending on cruise ship fleet expansion in the Experiences segment. The breakdown of investments and depreciation expenses is provided for Entertainment, Sports, Domestic and International Experiences, and Corporate segments for the quarters ending December 28, 2024, and December 30, 2023. Depreciation expenses also increased from $823 million to $909 million, with detailed segment-wise allocations." 233 | } 234 | ], 235 | "node_id": "0021", 236 | "summary": "The partial document provides a financial overview of revenues and operating income for Parks & Experiences, including Domestic, International, and Consumer Products segments, comparing the quarters ending December 28, 2024, and December 30, 2023. It highlights a 3% increase in total revenue and stable operating income. Domestic Parks and Experiences were negatively impacted by Hurricanes Milton and Helene, leading to closures, cancellations, higher costs, and lower attendance, despite increased guest spending. International Parks and Experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings, offset by increased costs. The document also notes a rise in corporate and unallocated shared expenses due to a legal settlement and a $143 million loss related to the Star India Transaction." 237 | }, 238 | { 239 | "title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF INCOME", 240 | "start_index": 13, 241 | "end_index": 13, 242 | "node_id": "0031", 243 | "summary": "The partial document provides a condensed consolidated statement of income for The Walt Disney Company for the quarters ended December 28, 2024, and December 30, 2023. It includes details on revenues, costs and expenses, restructuring and impairment charges, net interest expense, equity in the income of investees, income before income taxes, income taxes, and net income. It also breaks down net income attributable to noncontrolling interests and The Walt Disney Company. Additionally, it provides earnings per share (diluted and basic) and the weighted average number of shares outstanding (diluted and basic) for both periods." 
244 | }, 245 | { 246 | "title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED BALANCE SHEETS", 247 | "start_index": 14, 248 | "end_index": 14, 249 | "node_id": "0032", 250 | "summary": "The partial document is a condensed consolidated balance sheet for The Walt Disney Company, comparing financial data as of December 28, 2024, and September 28, 2024. It details the company's assets, liabilities, and equity. Key points include:\n\n1. **Assets**: Breakdown of current assets (cash, receivables, inventories, content advances, and other assets), produced and licensed content costs, investments, property (attractions, buildings, equipment, projects in progress, and land), intangible assets, goodwill, and other assets. Total assets increased slightly from $196.2 billion to $197 billion.\n\n2. **Liabilities**: Includes current liabilities (accounts payable, borrowings, deferred revenue), long-term borrowings, deferred income taxes, and other long-term liabilities. Total liabilities remained relatively stable.\n\n3. **Equity**: Details Disney shareholders' equity, including common stock, retained earnings, accumulated other comprehensive loss, and treasury stock. Noncontrolling interests are also included. Total equity increased from $105.5 billion to $106.7 billion.\n\n4. **Overall Financial Position**: The balance sheet reflects a stable financial position with slight changes in assets, liabilities, and equity over the period." 251 | }, 252 | { 253 | "title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS", 254 | "start_index": 15, 255 | "end_index": 15, 256 | "node_id": "0033", 257 | "summary": "The partial document provides a condensed consolidated statement of cash flows for The Walt Disney Company for the quarters ended December 28, 2024, and December 30, 2023. It details cash flow activities categorized into operating, investing, and financing activities. Key points include:\n\n1. **Operating Activities**: Net income increased from $2,151 million in 2023 to $2,644 million in 2024. Other significant changes include variations in depreciation, deferred taxes, equity income, content costs, and changes in operating assets and liabilities, resulting in cash provided by operations of $3,205 million in 2024 compared to $2,185 million in 2023.\n\n2. **Investing Activities**: Investments in parks, resorts, and other properties increased significantly in 2024 ($2,466 million) compared to 2023 ($1,299 million), leading to higher cash used in investing activities.\n\n3. **Financing Activities**: The company saw a net cash outflow in financing activities, including commercial paper borrowings, stock repurchases, and debt reduction. In 2024, cash used in financing activities was $997 million, a significant improvement from $8,006 million in 2023.\n\n4. **Exchange Rate Impact**: Exchange rates negatively impacted cash in 2024 by $153 million, compared to a positive impact of $79 million in 2023.\n\n5. **Overall Cash Position**: The company\u2019s cash, cash equivalents, and restricted cash decreased from $14,235 million at the beginning of the 2023 period to $5,582 million at the end of the 2024 period." 258 | }, 259 | { 260 | "title": "DTC PRODUCT DESCRIPTIONS AND KEY DEFINITIONS", 261 | "start_index": 16, 262 | "end_index": 16, 263 | "node_id": "0034", 264 | "summary": "The partial document provides an overview of Disney's Direct-to-Consumer (DTC) product offerings, key definitions, and metrics. 
It details the availability of Disney+, ESPN+, and Hulu as standalone services or bundled offerings in the U.S., including Hulu Live TV + SVOD, which incorporates Disney+ and ESPN+. It explains the global reach of Disney+ in over 150 countries and the various purchase channels, including websites, third-party platforms, and wholesale arrangements. The document defines \"paid subscribers\" as those generating subscription revenue, excluding extra member add-ons, and outlines how subscribers are counted for multi-product offerings. It also describes the calculation of average monthly revenue per paid subscriber for Hulu, ESPN+, and Disney+, including revenue components like subscription fees, advertising, and add-ons, while noting differences in revenue allocation and the impact of wholesale arrangements on average revenue." 265 | }, 266 | { 267 | "title": "NON-GAAP FINANCIAL MEASURES", 268 | "start_index": 17, 269 | "end_index": 17, 270 | "nodes": [ 271 | { 272 | "title": "Diluted EPS excluding certain items", 273 | "start_index": 17, 274 | "end_index": 18, 275 | "node_id": "0036", 276 | "summary": "The partial document discusses the use of non-GAAP financial measures, specifically diluted EPS excluding certain items (adjusted EPS), total segment operating income, and free cash flow. It explains that these measures are not defined by GAAP but are important for evaluating the company's performance. The document highlights that these measures should be reviewed alongside comparable GAAP measures and may not be directly comparable to similar measures from other companies. It provides details on the adjustments made to diluted EPS, including the exclusion of certain items affecting comparability and amortization of TFCF and Hulu intangible assets, to better reflect operational performance. The document also includes a reconciliation table comparing reported diluted EPS to adjusted EPS for specific quarters, showing the impact of excluded items such as restructuring charges and intangible asset amortization. Additionally, it notes the challenges in providing forward-looking GAAP measures due to unpredictable factors." 277 | }, 278 | { 279 | "title": "Total segment operating income", 280 | "start_index": 19, 281 | "end_index": 20, 282 | "node_id": "0037", 283 | "summary": "The partial document focuses on the evaluation of the company's performance through two key financial metrics: total segment operating income and free cash flow. It explains that total segment operating income is used to assess the performance of operating segments separately from non-operational factors, providing insights into operational results. A reconciliation table is provided, showing the calculation of total segment operating income for two quarters, highlighting changes in various components such as corporate expenses, restructuring charges, and interest expenses. Additionally, the document discusses free cash flow as a measure of cash available for purposes beyond capital expenditures, such as debt servicing, acquisitions, and shareholder returns. A summary of consolidated cash flows and a reconciliation of cash provided by operations to free cash flow are presented, comparing figures for two quarters and highlighting changes in cash flow components." 
284 | }, 285 | { 286 | "title": "Free cash flow", 287 | "start_index": 20, 288 | "end_index": 20, 289 | "node_id": "0038", 290 | "summary": "The partial document provides a reconciliation of the company's consolidated cash provided by operations to free cash flow for the quarters ended December 28, 2024, and December 30, 2023. It highlights a $1,020 million increase in cash provided by operations, a $1,167 million increase in investments in parks, resorts, and other property, and a $147 million decrease in free cash flow." 291 | } 292 | ], 293 | "node_id": "0035", 294 | "summary": "The partial document discusses the use of non-GAAP financial measures by the company, including diluted EPS excluding certain items (adjusted EPS), total segment operating income, and free cash flow. It explains that these measures are not defined by GAAP but are important for evaluating the company's performance. The document emphasizes that these measures should be reviewed alongside comparable GAAP measures and may not be directly comparable to similar measures from other companies. It highlights the company's inability to provide forward-looking GAAP measures or reconciliations due to uncertainties in predicting significant items. Additionally, the document details the rationale for excluding certain items and amortization of TFCF and Hulu intangible assets from diluted EPS to enhance comparability and provide a clearer evaluation of operational performance, particularly given the significant impact of the 2019 TFCF and Hulu acquisition." 295 | }, 296 | { 297 | "title": "FORWARD-LOOKING STATEMENTS", 298 | "start_index": 21, 299 | "end_index": 21, 300 | "node_id": "0039", 301 | "summary": "The partial document outlines the inclusion of forward-looking statements in an earnings release, emphasizing that these statements are based on management's views and assumptions about future events and business performance. It highlights that actual results may differ materially due to various factors, including company actions (e.g., restructuring, strategic initiatives, cost rationalization), external developments (e.g., economic conditions, competition, consumer behavior, regulatory changes, technological advancements, labor market activities, and natural disasters), and their potential impacts on operations, profitability, content performance, advertising markets, and taxation. The document also references additional risk factors and analyses detailed in the company's filings with the SEC, such as annual and quarterly reports." 302 | }, 303 | { 304 | "title": "PREPARED EARNINGS REMARKS AND CONFERENCE CALL INFORMATION", 305 | "start_index": 22, 306 | "end_index": 22, 307 | "node_id": "0040", 308 | "summary": "The partial document provides information about The Walt Disney Company's prepared management remarks and a conference call scheduled for February 5, 2025, at 8:30 AM EST/5:30 AM PST, accessible via a live webcast on their investor website. It also mentions that a replay of the webcast will be available on the site. Additionally, contact details for Corporate Communications (David Jefferson) and Investor Relations (Carlos Gomez) are provided." 
309 | }
310 |   ]
311 | }
--------------------------------------------------------------------------------
/run_pageindex.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | import json
 3 | import os
 4 | 
 5 | from pageindex import *
 6 | 
 7 | if __name__ == "__main__":
 8 |     # Set up argument parser
 9 |     parser = argparse.ArgumentParser(description='Process a PDF document and generate its structure')
10 |     parser.add_argument('--pdf_path', type=str, required=True, help='Path to the PDF file')
11 |     parser.add_argument('--model', type=str, default='gpt-4o-2024-11-20', help='Model to use')
12 |     parser.add_argument('--toc-check-pages', type=int, default=20,
13 |                         help='Number of pages to check for a table of contents')
14 |     parser.add_argument('--max-pages-per-node', type=int, default=10,
15 |                         help='Maximum number of pages per node')
16 |     parser.add_argument('--max-tokens-per-node', type=int, default=20000,
17 |                         help='Maximum number of tokens per node')
18 |     parser.add_argument('--if-add-node-id', type=str, default='yes',
19 |                         help='Whether to add a node id to each node')
20 |     parser.add_argument('--if-add-node-summary', type=str, default='no',
21 |                         help='Whether to add a summary to each node')
22 |     parser.add_argument('--if-add-doc-description', type=str, default='yes',
23 |                         help='Whether to add a description to the document')
24 |     args = parser.parse_args()
25 | 
26 |     # Configure options
27 |     opt = config(
28 |         model=args.model,
29 |         toc_check_page_num=args.toc_check_pages,
30 |         max_page_num_each_node=args.max_pages_per_node,
31 |         max_token_num_each_node=args.max_tokens_per_node,
32 |         if_add_node_id=args.if_add_node_id,
33 |         if_add_node_summary=args.if_add_node_summary,
34 |         if_add_doc_description=args.if_add_doc_description
35 |     )
36 | 
37 |     # Process the PDF and build the tree structure
38 |     toc_with_page_number = page_index_main(args.pdf_path, opt)
39 |     print('Parsing done, saving to file...')
40 | 
41 |     # Save results
42 |     pdf_name = os.path.splitext(os.path.basename(args.pdf_path))[0]
43 |     os.makedirs('./results', exist_ok=True)
44 | 
45 |     with open(f'./results/{pdf_name}_structure.json', 'w', encoding='utf-8') as f:
46 |         json.dump(toc_with_page_number, f, indent=2)
--------------------------------------------------------------------------------
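
A generated `*_structure.json` file is plain JSON, so the resulting tree can be walked with a few lines of Python. The sketch below is illustrative and not part of the repo: it loads one of the files from `results/` and prints an indented outline with page ranges, assuming only the node fields shown in the example outputs above (`title`, `start_index`, `end_index`, `node_id`, and an optional nested `nodes` list).

```python
import json

def print_outline(nodes, depth=0):
    """Recursively print each node's title, page range, and id."""
    for node in nodes:
        print("  " * depth
              + f"{node['title']} (pp. {node['start_index']}-{node['end_index']}, "
              + f"node {node.get('node_id', '?')})")
        # Leaf nodes have no "nodes" key, so default to an empty list.
        print_outline(node.get("nodes", []), depth + 1)

with open("./results/q1-fy25-earnings_structure.json", encoding="utf-8") as f:
    tree = json.load(f)

print(tree["doc_name"])
print_outline(tree["structure"])
```

Because every node carries its physical start/end page index, the same traversal can serve as a lookup table for fetching the exact page span of any section.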