├── README.md
├── config.ini
├── requirements.txt
└── src
    ├── app.py
    ├── llm_extractor.py
    ├── llm_summarizer.py
    ├── main.py
    ├── pdf_parser.py
    └── utils.py

/README.md:
--------------------------------------------------------------------------------
# PDF Parsing

![Python](https://img.shields.io/badge/Python-3.9-blue) ![PyPDF2](https://img.shields.io/badge/PyPDF2-3.0.1-blue) ![PyMuPDF](https://img.shields.io/badge/PyMuPDF-1.23.3-blue) ![Langchain](https://img.shields.io/badge/Langchain-0.0.285-blue) ![Rwkv](https://img.shields.io/badge/RWKV-0.8.12-blue) ![ChatGLM2](https://img.shields.io/badge/ChatGLM-2-blue) ![Pandas](https://img.shields.io/badge/Pandas-2.1.0-blue) ![Ninja](https://img.shields.io/badge/Ninja-1.11.1-blue) ![Streamlit](https://img.shields.io/badge/Streamlit-1.26.0-blue)

## Introduction

This project parses a PDF and structures it into the following parts:
- Text
  - The main title, section titles, and the text content of each section
- Images
  - Each image and its caption
- Tables
  - Each table and its caption
- References
  - The reference entries

Two parts of the project rely on large language models:
- [RWKV-Raven-7B](https://huggingface.co/BlinkDL/rwkv-4-raven) summarizes the PDF.
- [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B) extracts information from the references,
structuring each reference into a dictionary with the keys "author", "title", and "year".

The project also includes a PDF question-answering example.

The main files are:
- ```src/pdf_parser.py```: all PDF-parsing code.
- ```src/llm_summarizer.py```: LLM-based summarization.
- ```src/llm_extractor.py```: LLM-based information extraction from references.
- ```src/main.py```: example code showing how to use the features in ```src/pdf_parser.py```.
- ```src/utils.py```: utility functions.
- ```src/app.py```: a PDF question-answering example built with ```streamlit``` and ```langchain```.
- ```config.ini```: paths to the model files and their tokenizer files.


## Usage

See ```src/main.py``` for complete PDF-parsing examples and ```src/app.py``` for PDF question answering.


**Initialization**

First instantiate the parser, passing it the path of the PDF file to parse.

```
from pdf_parser import PDFParser

pdf_path = '/home/data/gpt-4.pdf'
parser = PDFParser(pdf_path)
```

**Extracting text: the title, section names, and section content**

```
import json

parser.extract_text()
# Path of the output file
json_file_path = '/home/text/sections.json'
with open(json_file_path, 'w') as json_file:
    json.dump(parser.text.section, json_file)
```

**Extracting images and their captions**

```
import logging

parser.extract_images()

for image in parser.images:
    # Save each image to a file
    image_filename = f"/home/image/image_{image.page_num}_{image.title[:10]}.png"
    with open(image_filename, "wb") as image_file:
        logging.info(image.title)
        logging.info(image.page_num)
        image_file.write(image.image_data)
```

**Extracting tables and their captions**

```
parser.extract_tables()
for i, table in enumerate(parser.tables):
    csv_filename = f"/home/table/table_{i}_{table.page_num}_{table.title[:10]}.csv"
    table.table_data.to_csv(csv_filename)
```

**Extracting references**

```
parser.extract_references()
with open('/home/reference/references.txt', 'w') as fp:
    for ref in parser.references:
        fp.write("%s\n" % ref.ref)
```

**Extracting reference metadata: author, title, and year**

```
import json
from llm_extractor import LLMExtractor
from tqdm import tqdm

parser.extract_references()

llm_extractor = LLMExtractor()
for i, ref in enumerate(tqdm(parser.references)):
    json_ref = llm_extractor.extract_reference(ref.ref)
    if json_ref and len(json_ref) > 0:
        with open(f'/home/reference/{i}.json', 'w') as outfile:
            json.dump(json_ref, outfile)
```
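Each ```{i}.json``` file contains the dictionary described above. For illustration, the output for the few-shot example used in ```src/llm_extractor.py``` looks like this (actual model output may vary):

```
{
    "author": "Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell",
    "title": "Language models are few-shot learners",
    "year": "2020"
}
```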
**Extracting a summary**

```
from llm_summarizer import LLMSummarizer

llm_summarizer = LLMSummarizer()
summary = llm_summarizer.summarize(pdf_path)
```

**Running PDF question answering**

```
streamlit run app.py --server.fileWatcherType none
```


The large language models used in this project are [RWKV-Raven-7B](https://huggingface.co/BlinkDL/rwkv-4-raven) and
[ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B).
Set the model file paths and tokenizer file paths in ```config.ini```.
Its contents are shown below; note that ```%%``` is how ```configparser``` escapes a literal ```%``` inside a value.

```
[LLM]
rwkv_model_path=/data/model/rwkv_model/RWKV-4-Raven-7B-v12-Eng49%%-Chn49%%-Jpn1%%-Other1%%-20230530-ctx8192.pth
rwkv_tokenizer_path=/data/model/rwkv_model/20B_tokenizer.json
chatglm2_6b_path=/data/model/chatglm2-6b
```
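To verify that the configuration resolves correctly, the helper in ```src/utils.py``` can be called directly (a minimal sketch, assuming it is run from the repository root):

```
import sys
sys.path.append('src')
from utils import get_config_variable

# configparser unescapes '%%' to '%', so the returned path contains
# single '%' characters: ...Eng49%-Chn49%-Jpn1%-Other1%-...
print(get_config_variable('config.ini', 'LLM', 'rwkv_model_path'))
```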
## Summary

Given the time constraints, the current PDF parsing still has room for improvement.
- Table parsing: table extraction turned out to be challenging. The current library is [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/), and quite a few tables are still extracted incorrectly. The plan is to try multimodal frameworks such as [LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm), [table-transformer](https://github.com/microsoft/table-transformer), and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/table/README.md).
- Chart parsing: LLM-based libraries such as [DePlot](https://huggingface.co/docs/transformers/main/model_doc/deplot) could be tried for parsing charts.
- Due to time constraints, two different LLMs are used: ```RWKV-Raven-7B``` for summarization and ```ChatGLM2-6B``` for information extraction. A single model could handle both.
- Once the PDF is structured, an LLM can answer questions over the structured parts. To make question answering efficient, the structured content should be chunked and stored in a vector database (see the sketch below).
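A minimal sketch of that last point, reusing the ```parser``` object from the Usage section and the same splitter, embedding model, and FAISS store that ```src/app.py``` already uses (the embedding-model path and the query are placeholders):

```
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Chunk each extracted section, keeping some overlap for context,
# and prefix every chunk with its section title.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = []
for title, content in parser.text.section.items():
    for chunk in splitter.split_text(content):
        chunks.append(f"{title}\n{chunk}")

embeddings = HuggingFaceEmbeddings(model_name='/data/model/all-MiniLM-L6-v2')
vector_store = FAISS.from_texts(chunks, embedding=embeddings)
docs = vector_store.similarity_search("What is GPT-4?", k=3)
```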
--------------------------------------------------------------------------------
/config.ini:
--------------------------------------------------------------------------------
[LLM]
rwkv_model_path=/data/model/rwkv_model/RWKV-4-Raven-7B-v12-Eng49%%-Chn49%%-Jpn1%%-Other1%%-20230530-ctx8192.pth
rwkv_tokenizer_path=/data/model/rwkv_model/20B_tokenizer.json
chatglm2_6b_path=/data/model/chatglm2-6b

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
PyPDF2==3.0.1
PyMuPDF==1.23.3
pandas==2.1.0
langchain[llm]
pypdf==3.15.5
ninja==1.11.1
rwkv==0.8.12
transformers==4.30.2
cpm_kernels==1.0.11
torch==1.13.1
streamlit==1.26.0
streamlit-extras==0.3.2
sentence-transformers==2.2.2
--------------------------------------------------------------------------------
/src/app.py:
--------------------------------------------------------------------------------
import os
from typing import List, Optional
import pickle
import streamlit as st
from streamlit_extras.add_vertical_space import add_vertical_space
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.llms.base import LLM
from langchain.llms import RWKV
from transformers import AutoTokenizer, AutoModel, AutoConfig
from utils import get_config_variable


os.environ["RWKV_CUDA_ON"] = '1'
os.environ["RWKV_JIT_ON"] = '1'

# Start the app with:
# streamlit run app.py --server.fileWatcherType none


class GLM(LLM):
    """Custom LangChain LLM wrapping ChatGLM2-6B."""
    max_token: int = 2048
    temperature: float = 0.1
    top_p: float = 0.9
    tokenizer: object = None
    model: object = None
    history_len: int = 1024

    def __init__(self):
        super().__init__()

    @property
    def _llm_type(self) -> str:
        return "GLM"

    def load_model(self, llm_device="gpu", model_name_or_path=None):
        model_config = AutoConfig.from_pretrained(
            model_name_or_path, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name_or_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(model_name_or_path,
                                               config=model_config,
                                               trust_remote_code=True).half().cuda()

    def _call(self, prompt: str, history: List[str] = [], stop: Optional[List[str]] = None):
        response, _ = self.model.chat(
            self.tokenizer, prompt,
            history=history[-self.history_len:] if self.history_len > 0 else [],
            max_length=self.max_token, temperature=self.temperature,
            top_p=self.top_p)
        return response


# Sidebar contents
with st.sidebar:
    st.title('🤗🤗💬 LLM Chat App')
    st.markdown('''
    ## About
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B)
    ''')
    add_vertical_space(5)
    st.write('Made by Kai Chen')


@st.cache_resource
def load_llm_chatglm():
    """Load the ChatGLM2 model."""
    ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
    config_file = f'{ROOT_DIR[:-3]}config.ini'  # config path (ROOT_DIR minus the trailing 'src')
    model_path = get_config_variable(config_file, 'LLM', 'chatglm2_6b_path')
    llm = GLM()
    llm.load_model(model_name_or_path=model_path)
    return llm


@st.cache_resource
def load_llm_rwkv():
    """Load the RWKV model."""
    strategy = "cuda fp16i8 *20 -> cuda fp16"
    ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
    config_file = f'{ROOT_DIR[:-3]}config.ini'  # config path (ROOT_DIR minus the trailing 'src')
    model_path = get_config_variable(config_file, 'LLM', 'rwkv_model_path')
    tokens_path = get_config_variable(
        config_file, 'LLM', 'rwkv_tokenizer_path')
    model = RWKV(model=model_path,
                 strategy=strategy,
                 tokens_path=tokens_path)
    return model


def main():
    st.header("Chat with PDF 💬")
    # Upload a PDF file
    pdf = st.file_uploader("Upload your PDF", type='pdf')
    if pdf is not None:
        pdf_reader = PdfReader(pdf)

        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text=text)

        # Embeddings: cache the vector store on disk, keyed by the PDF name
        store_name = pdf.name[:-4]
        st.write(f'{store_name}')

        if os.path.exists(f"{store_name}.pkl"):
            with open(f"{store_name}.pkl", "rb") as f:
                VectorStore = pickle.load(f)
            # Embeddings loaded from disk
        else:
            # Alternative: embeddings = OpenAIEmbeddings()
            embeddings = HuggingFaceEmbeddings(
                model_name='/data/model/all-MiniLM-L6-v2')
            VectorStore = FAISS.from_texts(chunks, embedding=embeddings)
            with open(f"{store_name}.pkl", "wb") as f:
                pickle.dump(VectorStore, f)

        # Accept user questions/queries
        query = st.text_input("Ask questions about your PDF file:")
        st.write(query)

        if query:
            docs = VectorStore.similarity_search(query=query, k=3)
            st.write(docs)

            # Alternatives: llm = OpenAI() or llm = load_llm_rwkv()
            llm = load_llm_chatglm()
            chain = load_qa_chain(llm=llm, chain_type="stuff")
            with get_openai_callback() as cb:
                response = chain.run(input_documents=docs, question=query)
                print(cb)
            st.write(response)


if __name__ == '__main__':
    main()
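
# Note: the GLM wrapper above can also be exercised outside Streamlit.
# A sketch (assumes chatglm2_6b_path in config.ini points at a local
# ChatGLM2-6B copy):
#
#   llm = load_llm_chatglm()
#   print(llm("用一句话总结这篇论文。"))  # "Summarize this paper in one sentence."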
--------------------------------------------------------------------------------
/src/llm_extractor.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Author: Kai Chen
# Email: chenkai0210@hotmail.com
# Date: 2023-09
# Description: LLM-based information extraction from input text.

import os
import json
from transformers import AutoTokenizer, AutoModel
from utils import get_config_variable


class LLMExtractor:
    """
    Extract information from text with a large language model.
    The model used here is ChatGLM2-6B:
    https://github.com/THUDM/ChatGLM2-6B/tree/main
    """

    def __init__(self):
        ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
        config_file = f'{ROOT_DIR[:-3]}config.ini'  # config path (ROOT_DIR minus the trailing 'src')
        self.model_path = get_config_variable(
            config_file, 'LLM', 'chatglm2_6b_path')
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            self.model_path, trust_remote_code=True, device='cuda').half().cuda()

    @staticmethod
    def get_prompt(content):
        data = {
            "author": "Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell",
            "title": "Language models are few-shot learners",
            "year": "2020"
        }

        # ensure_ascii=False so non-ASCII characters survive
        json_string = json.dumps(data, ensure_ascii=False)

        # The prompt is kept in Chinese because ChatGLM2-6B is a
        # Chinese-centric model; it asks for the "author", "title", and
        # "year" entities as JSON, with one worked example.
        prompt = f"""
从输入的文字中,提取"信息"(keyword,content),包括:"author"、"title"、"year"的实体,输出json格式内容。
请只输出json,不需要输出其他内容。

例如:
输入
[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal,Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models arefew-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

输出
{json_string}

输入
{content}

输出
以下是提取的实体及其JSON格式内容:
"""

        return prompt

    def extract_reference(self, ref: str) -> dict:
        prompt = LLMExtractor.get_prompt(ref)
        if self.model:
            response, history = self.model.chat(
                self.tokenizer, prompt, history=[])
            try:
                return json.loads(response)
            except json.JSONDecodeError:
                # TODO: analyze and recover from malformed model output
                pass
        return None
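

if __name__ == '__main__':
    # Minimal smoke test (a sketch; assumes the ChatGLM2-6B weights
    # configured in config.ini are available). The input is the same
    # reference the prompt uses as its worked example.
    extractor = LLMExtractor()
    sample = ("[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, "
              "Jared D. Kaplan, et al. Language models are few-shot learners. "
              "Advances in Neural Information Processing Systems, 33:1877-1901, 2020.")
    print(extractor.extract_reference(sample))
    # Expected shape: {"author": "...", "title": "Language models are
    # few-shot learners", "year": "2020"}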
--------------------------------------------------------------------------------
/src/llm_summarizer.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Author: Kai Chen
# Email: chenkai0210@hotmail.com
# Date: 2023-09
# Description: LLM-based summarization of a PDF.

import os
os.environ["RWKV_CUDA_ON"] = '1'
os.environ["RWKV_JIT_ON"] = '1'

from utils import get_config_variable
from langchain.prompts import PromptTemplate
from langchain.document_loaders import PyPDFLoader
from langchain.chains import LLMChain
from langchain.llms import RWKV


class LLMSummarizer:
    """
    Summarize a PDF with a large language model.
    The model used here is RWKV-4 Raven:
    https://huggingface.co/BlinkDL/rwkv-4-raven
    """

    def __init__(self):
        self.strategy = "cuda fp16i8 *20 -> cuda fp16"

        ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
        config_file = f'{ROOT_DIR[:-3]}config.ini'  # config path (ROOT_DIR minus the trailing 'src')
        self.model_path = get_config_variable(config_file, 'LLM', 'rwkv_model_path')
        self.tokens_path = get_config_variable(
            config_file, 'LLM', 'rwkv_tokenizer_path')
        self.model = RWKV(model=self.model_path,
                          strategy=self.strategy,
                          tokens_path=self.tokens_path)

        self.task = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.
# Instruction:
Write a concise summary of the following:
{text}
# Response:
CONCISE SUMMARY:
"""
        self.prompt = PromptTemplate(
            input_variables=["text"],
            template=self.task,
        )
        self.chain = LLMChain(llm=self.model, prompt=self.prompt)

    def summarize(self, pdf_path):
        loader = PyPDFLoader(pdf_path)
        # Summarize using the first 500 characters of the first page
        data = loader.load()[0]
        instruction = data.page_content[:500]
        summary = self.chain.run(instruction)
        return summary
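

if __name__ == '__main__':
    # Minimal smoke test (a sketch; assumes the RWKV weights configured in
    # config.ini and the sample PDF below both exist).
    summarizer = LLMSummarizer()
    print(summarizer.summarize('/home/data/gpt-4.pdf'))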
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Author: Kai Chen
# Email: chenkai0210@hotmail.com
# Date: 2023-09
# Description: demonstrates how to use PDFParser.

import os
import logging
import json
from pdf_parser import PDFParser
from llm_summarizer import LLMSummarizer
from llm_extractor import LLMExtractor
from tqdm import tqdm

if __name__ == '__main__':
    ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
    logging.basicConfig(filename=f'{ROOT_DIR[:-3]}basic.log',
                        encoding='utf-8',
                        level=logging.INFO,
                        filemode='w',
                        format='%(process)d-%(levelname)s-%(message)s')
    logging.getLogger().addHandler(logging.StreamHandler())

    logging.info('** pdf parsing **')
    # 1 Create the PDFParser object
    # https://cdn.openai.com/papers/gpt-4.pdf
    pdf_path = f'{ROOT_DIR[:-3]}data/gpt-4.pdf'
    parser = PDFParser(pdf_path)

    # 2 Text: title, section names, and the content of each section
    logging.info('== extract text ==')
    parser.extract_text()
    logging.info('-- title --')
    logging.info(parser.text.title)
    logging.info('-- section --')
    for title, section in parser.text.section.items():
        logging.info(title)
    # Save the section dictionary as a JSON file
    json_file_path = f"{ROOT_DIR[:-3]}temp/json/sections.json"
    with open(json_file_path, 'w') as json_file:
        json.dump(parser.text.section, json_file)

    # 3 Images
    logging.info('== extract image ==')
    parser.extract_images()
    for image in parser.images:
        # Save each image to a file
        image_filename = f"{ROOT_DIR[:-3]}temp/image/image_{image.page_num}_{image.title[:10]}.png"
        with open(image_filename, "wb") as image_file:
            logging.info(image.title)
            logging.info(image.page_num)
            image_file.write(image.image_data)

    # 4 Tables and their captions
    logging.info('== extract table ==')
    parser.extract_tables()
    for i, table in enumerate(parser.tables):
        logging.info(table.title)
        csv_filename = f"{ROOT_DIR[:-3]}temp/table/table_{i}_{table.page_num}_{table.title[:10]}.csv"
        table.table_data.to_csv(csv_filename)

    # 5 References
    logging.info('== extract references ==')
    parser.extract_references()
    logging.info(len(parser.references))
    with open(f'{ROOT_DIR[:-3]}temp/reference/references.txt', 'w') as fp:
        for ref in parser.references:
            # Write each entry on its own line
            fp.write("%s\n" % ref.ref)

    # 6 Summary
    logging.info('== summarizing (LLM) ==')
    llm_summarizer = LLMSummarizer()
    parser.text.summary = llm_summarizer.summarize(pdf_path)
    logging.info(parser.text.summary)

    # 7 Use an LLM to structure the references into author, title, and year
    logging.info('== extract info (author, title, year) from references ==')
    llm_extractor = LLMExtractor()
    extracted_ref = 0
    for i, ref in enumerate(tqdm(parser.references)):
        json_ref = llm_extractor.extract_reference(ref.ref)
        if json_ref and len(json_ref) > 0:
            with open(f'{ROOT_DIR[:-3]}temp/reference/{i}.json', 'w') as outfile:
                json.dump(json_ref, outfile)
            extracted_ref += 1
        if extracted_ref > 5:
            break
--------------------------------------------------------------------------------
/src/pdf_parser.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Author: Kai Chen
# Email: chenkai0210@hotmail.com
# Date: 2023-09
# Description: parses a PDF and structures it into text, images, tables, and references.

import pandas as pd  # structured tables
import fitz          # PyMuPDF, PDF operations
import PyPDF2        # PDF operations
from typing import List


class Text:
    """Text class wrapping the extracted text content."""

    def __init__(self,
                 title: str = None,
                 section: dict = None,
                 summary: str = None):
        """
        Parameters:
        - title: str, the document title
        - section: dict, key: section name, value: text content of the section
        - summary: str, the summary
        """
        self.title = title
        # Avoid a shared mutable default across instances
        self.section = section if section is not None else {}
        self.summary = summary


class PDFImage:
    """Image class wrapping an extracted image."""

    def __init__(self,
                 title: str,
                 image_data: object,
                 page_num: int):
        """
        Parameters:
        - title: str, the image caption
        - image_data: the image data, e.g. a byte stream or a file path
        - page_num: int, the page the image appears on
        """
        self.title = title
        self.image_data = image_data
        self.page_num = page_num


class Table:
    """Table class wrapping an extracted table."""

    def __init__(self,
                 title: str,
                 table_data: pd.DataFrame,
                 page_num: int):
        """
        Parameters:
        - title: str, the table caption
        - table_data: the table data, e.g. a Pandas DataFrame
        - page_num: int, the page the table appears on
        """
        self.title = title
        self.table_data = table_data
        self.page_num = page_num


class Reference:
    """Reference class wrapping an extracted reference entry."""

    def __init__(self, ref: str):
        """
        Parameters:
        - ref: str, the reference entry
        """
        self.ref = ref

class PDFOutliner:
    """
    Collects the titles of all sections of a given PDF.
    Lightly modified; the core algorithm comes from:
    https://github.com/beaverden/pdftoc/tree/main
    """

    def __init__(self):
        self.titles = []  # one entry per section title

    def get_tree_pages(self, root, info, depth=0):
        """
        Recursively iterate the outline tree.
        Find the page pointed to by each outline item, look up its
        assigned physical order id, and decrement it by the padding
        if necessary.
        """
        if isinstance(root, dict):
            page = root['/Page'].get_object()
            t = root['/Title']
            title = t
            if isinstance(t, PyPDF2.generic.ByteStringObject):
                title = t.original_bytes.decode('utf8')
            title = title.strip()
            title = title.replace('\n', '')
            title = title.replace('\r', '')
            page_num = info['all_pages'].get(id(page), 0)
            if page_num == 0:
                # TODO: logging
                print('Not found page number for /Page!', page)
            elif page_num < info['padding']:
                page_num = 0
            else:
                page_num -= info['padding']
            self.titles.append(title)
            return
        for elem in root:
            self.get_tree_pages(elem, info, depth + 1)

    def recursive_numbering(self, obj, info):
        """
        Recursively iterate through all the pages in order and
        assign each a physical order number.
        """
        if obj['/Type'] == '/Page':
            obj_id = id(obj)
            if obj_id not in info['all_pages']:
                info['all_pages'][obj_id] = info['current_page_id']
                info['current_page_id'] += 1
            return
        elif obj['/Type'] == '/Pages':
            for page in obj['/Kids']:
                self.recursive_numbering(page.get_object(), info)

    def create_text_outline(self, pdf_path, page_number_padding):
        with open(pdf_path, 'rb') as file:
            fileReader = PyPDF2.PdfReader(file)

            info = {
                'all_pages': {},
                'current_page_id': 1,
                'padding': page_number_padding
            }

            pages = fileReader.trailer['/Root']['/Pages'].get_object()
            self.recursive_numbering(pages, info)
            self.get_tree_pages(fileReader.outline, info, 0)
        return


class PDFParser:
    """PDF parser that extracts the text, images, tables, and references of a PDF."""

    def __init__(self, pdf_path: str):
        """
        Parameters:
        - pdf_path: str, path to the PDF file
        """
        self.pdf_path = pdf_path
        self.doc = fitz.open(self.pdf_path)  # PyMuPDF fitz.Document
        self.text = Text()     # Text, the text content
        self.images = []       # list of all images (PDFImage)
        self.tables = []       # list of all tables (Table)
        self.references = []   # list of all references (Reference)

    def extract_title(self):
        """
        Get the PDF title.
        """
        doc = self.doc
        first_page = doc.load_page(0)  # load the first page
        # Extract the text of the first page
        text = first_page.get_text()
        # Split it into lines
        lines = text.split('\n')
        # The first line is taken as the title
        first_line = lines[0].strip()
        self.text.title = first_line
        return

    def extract_sections_content(self,
                                 doc: fitz.Document,
                                 section_titles: List[str]):
        """
        Extract the text content of each section, given the list of section names.
        Parameters:
        - doc: the opened PDF document
        - section_titles: list of all section names

        Returns:
        - a dict whose keys are section names and whose values are the section contents.
        """
        sections_content = {}  # dict of section name -> content
        # Strip leading numbering from the section names
        filtered_section_titles = [PDFParser.remove_leading_digits(
            title).strip() for title in section_titles]
        # For each section name, walk over all text lines; once a line contains
        # the section name, start collecting lines into that section's content,
        # and stop as soon as a line contains the next section's name.
        for i, section_title in enumerate(filtered_section_titles):
            section_found = False
            section_content = ""
            scan_page = True
            for page_num in range(len(doc)):
                page = doc[page_num]
                page_text = page.get_text()
                for line in page_text.split('\n'):
                    # Stop at the next section's title
                    if i+1 < len(filtered_section_titles) and filtered_section_titles[i+1].lower() in line.lower():
                        scan_page = False
                        break
                    if section_title.lower() in line.lower():
                        section_found = True
                    elif section_found:
                        # The target title was found; collect the section content
                        section_content += line + "\n"
                if not scan_page:
                    break

            if section_found:
                sections_content[section_titles[i]] = section_content

        return sections_content

    @staticmethod
    def remove_leading_digits(text: str):
        """
        Remove digits at the start of the input text.
        """
        while text and text[0].isdigit():
            text = text[1:]  # drop the first character
        return text
    def extract_text(self):
        """
        Extract the text content of the PDF.
        """
        # 1 Get the title
        self.extract_title()
        # 2 Get the section names
        outliner = PDFOutliner()
        outliner.create_text_outline(self.pdf_path, 0)
        # 3 Get the text content under each section
        self.text.section = self.extract_sections_content(
            self.doc, outliner.titles)
        return

    def extract_images(self, fig_caption_start: str = 'Figure'):
        """
        Extract the images of the PDF together with their captions.
        fig_caption_start: str, the word image captions start with
        """
        doc = self.doc

        for page_num in range(len(doc)):
            page = doc[page_num]
            # Extract the page's text blocks
            blocks = page.get_text('blocks')
            # Match each image with its caption by distance: among the text
            # blocks that start with the caption keyword, the one closest
            # (squared Euclidean distance) to the image is taken as its caption.
            for img in page.get_images(full=True):
                xref = img[0]
                base_image = doc.extract_image(xref)
                rects = page.get_image_rects(xref)
                if not rects:
                    continue
                x0, y0, x1, y1 = rects[0]
                related_text = "untitled"
                min_dist = float('inf')
                for block in blocks:
                    block_x0, block_y0, block_x1, block_y1, block_text = block[:5]
                    if block_text.strip().startswith(fig_caption_start):
                        # Squared Euclidean distance between top-left corners
                        dist = (x0 - block_x0)**2 + (y0 - block_y0)**2
                        if dist < min_dist:
                            min_dist = dist
                            related_text = block_text.strip()

                image_data = base_image["image"]
                image = PDFImage(related_text, image_data, page_num)
                self.images.append(image)

    def extract_tables(self, tab_caption_start: str = 'Table'):
        """
        Extract the tables of the PDF together with their captions.
        tab_caption_start: str, the word table captions start with
        """
        doc = self.doc
        for num in range(len(doc)):
            page = doc[num]
            # Extract the page's text blocks
            blocks = page.get_text('blocks')
            # Extract the tables
            tables = page.find_tables()
            # Match each table with its caption by distance: among the text
            # blocks that start with the caption keyword, the one closest
            # (squared Euclidean distance) to the table is taken as its caption.
            for table in tables:
                x0, y0, x1, y1 = table.bbox
                df = table.to_pandas()
                related_text = "untitled"
                min_dist = float('inf')
                for block in blocks:
                    block_x0, block_y0, block_x1, block_y1, block_text = block[:5]
                    if block_text.strip().startswith(tab_caption_start):
                        # Squared Euclidean distance between top-left corners
                        dist = (x0 - block_x0)**2 + (y0 - block_y0)**2
                        if dist < min_dist:
                            min_dist = dist
                            related_text = block_text.strip()
                self.tables.append(Table(title=related_text,
                                         table_data=df,
                                         page_num=num))

    def extract_references(self):
        """
        Extract the references of the PDF.
        """
        doc = self.doc
        page_num = len(doc)
        ref_list = []
        found = False
        for num, page in enumerate(doc):
            if found:
                break
            content = page.get_text('blocks')
            for pc in content:
                txt = ''.join(list(pc[4:-2]))  # keep only the text field of the block
                if 'References' in txt or 'REFERENCES' in txt or 'referenCes' in txt:
                    # Collect all text blocks from this page to the end,
                    # then stop scanning to avoid collecting them twice
                    for rpn in range(num, page_num):
                        ref_page = doc[rpn]
                        ref_content = ref_page.get_text('blocks')
                        for refc in ref_content:
                            ref_list.extend(list(refc[4:-2]))
                    found = True
                    break
        # Skip everything up to and including the "References" heading block
        index = 0
        for i, ref in enumerate(ref_list):
            if 'References' in ref or 'REFERENCES' in ref or 'referenCes' in ref:
                index = i
                break
        if index + 1 < len(ref_list):
            index += 1
        self.references = [Reference(ref.replace('\n', ''))
                           for ref in ref_list[index:] if len(ref) > 10]
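
if __name__ == '__main__':
    # Toy illustration of the caption-matching heuristic used by
    # extract_images()/extract_tables(): among text blocks whose text starts
    # with the caption keyword, pick the one closest (squared Euclidean
    # distance between top-left corners) to the image or table.
    # The coordinates below are made up for the demo.
    demo_blocks = [
        (72.0, 500.0, 300.0, 512.0, "Figure 1: Model overview", 0, 0),
        (72.0, 700.0, 300.0, 712.0, "Figure 2: Training loss", 1, 0),
        (72.0, 100.0, 300.0, 112.0, "Some body text", 2, 0),
    ]
    x0, y0 = 80.0, 520.0  # top-left corner of the image rectangle
    title, min_dist = "untitled", float('inf')
    for bx0, by0, _bx1, _by1, text, _bno, _btype in demo_blocks:
        if text.strip().startswith('Figure'):
            dist = (x0 - bx0) ** 2 + (y0 - by0) ** 2
            if dist < min_dist:
                min_dist, title = dist, text
    print(title)  # -> Figure 1: Model overview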
--------------------------------------------------------------------------------
/src/utils.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Author: Kai Chen
# Email: chenkai0210@hotmail.com
# Date: 2023-09
# Description: utility functions.

import configparser


def get_config_variable(config_file: str, section: str, variable_name: str):
    """
    Read a variable from a config file.

    Args:
        config_file (str): path to the config file
        section (str): section name within the config file
        variable_name (str): name of the variable to read

    Returns:
        str: the value of the variable, or None if it is not found
    """
    config = configparser.ConfigParser()
    config.read(config_file)

    if config.has_section(section):
        if config.has_option(section, variable_name):
            return config.get(section, variable_name)

    return None
--------------------------------------------------------------------------------