├── README.md
├── config.ini
├── requirements.txt
└── src
    ├── app.py
    ├── llm_extractor.py
    ├── llm_summarizer.py
    ├── main.py
    ├── pdf_parser.py
    └── utils.py

/README.md:
--------------------------------------------------------------------------------
# PDF Parsing

![Python](https://img.shields.io/badge/Python-3.9-blue) ![PyPDF2](https://img.shields.io/badge/PyPDF2-3.0.1-blue) ![PyMuPDF](https://img.shields.io/badge/PyMuPDF-1.23.3-blue) ![Langchain](https://img.shields.io/badge/Langchain-0.0.285-blue) ![Rwkv](https://img.shields.io/badge/RWKV-0.8.12-blue) ![ChatGLM2](https://img.shields.io/badge/ChatGLM-2-blue) ![Pandas](https://img.shields.io/badge/Pandas-2.1.0-blue) ![Ninja](https://img.shields.io/badge/Ninja-1.11.1-blue) ![Streamlit](https://img.shields.io/badge/Streamlit-1.26.0-blue)

## Introduction

This project parses a PDF and structures it into the following parts:
- Text
  - The main title, section titles, and the text content of each section
- Images
  - Each image and its caption
- Tables
  - Each table and its caption
- References
  - The reference entries

Two parts of the project rely on large language models:
- [RWKV-Raven-7B](https://huggingface.co/BlinkDL/rwkv-4-raven) summarizes the PDF.
- [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B) extracts information from the references,
structuring each reference into a dictionary with the keys "author", "title", and "year".

The project also includes a PDF question-answering example.

The main files are:
- ```src/pdf_parser.py```: all PDF-parsing code.
- ```src/llm_summarizer.py```: LLM-based summarization.
- ```src/llm_extractor.py```: LLM-based information extraction from references.
- ```src/main.py```: example code showing how to use the features in ```src/pdf_parser.py```.
- ```src/utils.py```: utility functions.
- ```src/app.py```: a PDF question-answering example built with ```streamlit``` and ```langchain```.
- ```config.ini```: paths to the model files and their tokenizer files.


## Usage

See ```src/main.py``` for complete PDF-parsing examples and ```src/app.py``` for PDF question answering.


**Initialization**

First instantiate the parser, passing it the path of the PDF file to parse.

```
from pdf_parser import PDFParser

pdf_path = '/home/data/gpt-4.pdf'
parser = PDFParser(pdf_path)
```

**Extracting text: the title, section names, and section content**

```
import json

parser.extract_text()
# Path of the output file
json_file_path = '/home/text/sections.json'
with open(json_file_path, 'w') as json_file:
    json.dump(parser.text.section, json_file)
```

**Extracting images and their captions**

```
import logging

parser.extract_images()

for image in parser.images:
    # Save each image to a file
    image_filename = f"/home/image/image_{image.page_num}_{image.title[:10]}.png"
    with open(image_filename, "wb") as image_file:
        logging.info(image.title)
        logging.info(image.page_num)
        image_file.write(image.image_data)
```

**Extracting tables and their captions**

```
parser.extract_tables()
for i, table in enumerate(parser.tables):
    csv_filename = f"/home/table/table_{i}_{table.page_num}_{table.title[:10]}.csv"
    table.table_data.to_csv(csv_filename)
```

**Extracting references**

```
parser.extract_references()
with open('/home/reference/references.txt', 'w') as fp:
    for ref in parser.references:
        fp.write("%s\n" % ref.ref)
```

**Extracting reference metadata: author, title, and year**

```
import json
from llm_extractor import LLMExtractor
from tqdm import tqdm

parser.extract_references()

llm_extractor = LLMExtractor()
for i, ref in enumerate(tqdm(parser.references)):
    json_ref = llm_extractor.extract_reference(ref.ref)
    if json_ref and len(json_ref) > 0:
        with open(f'/home/reference/{i}.json', 'w') as outfile:
            json.dump(json_ref, outfile)
```
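Each ```{i}.json``` file contains the dictionary described above. For illustration, the output for the few-shot example used in ```src/llm_extractor.py``` looks like this (actual model output may vary):

```
{
    "author": "Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell",
    "title": "Language models are few-shot learners",
    "year": "2020"
}
```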
**Extracting a summary**

```
from llm_summarizer import LLMSummarizer

llm_summarizer = LLMSummarizer()
summary = llm_summarizer.summarize(pdf_path)
```

**Running PDF question answering**

```
streamlit run app.py --server.fileWatcherType none
```


The large language models used in this project are [RWKV-Raven-7B](https://huggingface.co/BlinkDL/rwkv-4-raven) and
[ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B).
Set the model file paths and tokenizer file paths in ```config.ini```.
Its contents are shown below; note that ```%%``` is how ```configparser``` escapes a literal ```%``` inside a value.

```
[LLM]
rwkv_model_path=/data/model/rwkv_model/RWKV-4-Raven-7B-v12-Eng49%%-Chn49%%-Jpn1%%-Other1%%-20230530-ctx8192.pth
rwkv_tokenizer_path=/data/model/rwkv_model/20B_tokenizer.json
chatglm2_6b_path=/data/model/chatglm2-6b
```
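To verify that the configuration resolves correctly, the helper in ```src/utils.py``` can be called directly (a minimal sketch, assuming it is run from the repository root):

```
import sys
sys.path.append('src')
from utils import get_config_variable

# configparser unescapes '%%' to '%', so the returned path contains
# single '%' characters: ...Eng49%-Chn49%-Jpn1%-Other1%-...
print(get_config_variable('config.ini', 'LLM', 'rwkv_model_path'))
```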
## Summary

Given the time constraints, the current PDF parsing still has room for improvement.
- Table parsing: table extraction turned out to be challenging. The current library is [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/), and quite a few tables are still extracted incorrectly. The plan is to try multimodal frameworks such as [LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm), [table-transformer](https://github.com/microsoft/table-transformer), and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/table/README.md).
- Chart parsing: LLM-based libraries such as [DePlot](https://huggingface.co/docs/transformers/main/model_doc/deplot) could be tried for parsing charts.
- Due to time constraints, two different LLMs are used: ```RWKV-Raven-7B``` for summarization and ```ChatGLM2-6B``` for information extraction. A single model could handle both.
- Once the PDF is structured, an LLM can answer questions over the structured parts. To make question answering efficient, the structured content should be chunked and stored in a vector database (see the sketch below).
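A minimal sketch of that last point, reusing the ```parser``` object from the Usage section and the same splitter, embedding model, and FAISS store that ```src/app.py``` already uses (the embedding-model path and the query are placeholders):

```
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Chunk each extracted section, keeping some overlap for context,
# and prefix every chunk with its section title.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = []
for title, content in parser.text.section.items():
    for chunk in splitter.split_text(content):
        chunks.append(f"{title}\n{chunk}")

embeddings = HuggingFaceEmbeddings(model_name='/data/model/all-MiniLM-L6-v2')
vector_store = FAISS.from_texts(chunks, embedding=embeddings)
docs = vector_store.similarity_search("What is GPT-4?", k=3)
```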
--------------------------------------------------------------------------------
/config.ini:
--------------------------------------------------------------------------------
[LLM]
rwkv_model_path=/data/model/rwkv_model/RWKV-4-Raven-7B-v12-Eng49%%-Chn49%%-Jpn1%%-Other1%%-20230530-ctx8192.pth
rwkv_tokenizer_path=/data/model/rwkv_model/20B_tokenizer.json
chatglm2_6b_path=/data/model/chatglm2-6b

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
PyPDF2==3.0.1
PyMuPDF==1.23.3
pandas==2.1.0
langchain[llm]
pypdf==3.15.5
ninja==1.11.1
rwkv==0.8.12
transformers==4.30.2
cpm_kernels==1.0.11
torch==1.13.1
streamlit==1.26.0
streamlit-extras==0.3.2
sentence-transformers==2.2.2
--------------------------------------------------------------------------------
/src/app.py:
--------------------------------------------------------------------------------
import os
from typing import List, Optional
import pickle
import streamlit as st
from streamlit_extras.add_vertical_space import add_vertical_space
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.llms.base import LLM
from langchain.llms import RWKV
from transformers import AutoTokenizer, AutoModel, AutoConfig
from utils import get_config_variable


os.environ["RWKV_CUDA_ON"] = '1'
os.environ["RWKV_JIT_ON"] = '1'

# Start the app with:
# streamlit run app.py --server.fileWatcherType none


class GLM(LLM):
    """Custom LangChain LLM wrapping ChatGLM2-6B."""
    max_token: int = 2048
    temperature: float = 0.1
    top_p: float = 0.9
    tokenizer: object = None
    model: object = None
    history_len: int = 1024

    def __init__(self):
        super().__init__()

    @property
    def _llm_type(self) -> str:
        return "GLM"

    def load_model(self, llm_device="gpu", model_name_or_path=None):
        model_config = AutoConfig.from_pretrained(
            model_name_or_path, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name_or_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(model_name_or_path,
                                               config=model_config,
                                               trust_remote_code=True).half().cuda()

    def _call(self, prompt: str, history: List[str] = [], stop: Optional[List[str]] = None):
        response, _ = self.model.chat(
            self.tokenizer, prompt,
            history=history[-self.history_len:] if self.history_len > 0 else [],
            max_length=self.max_token, temperature=self.temperature,
            top_p=self.top_p)
        return response


# Sidebar contents
with st.sidebar:
    st.title('🤗🤗💬 LLM Chat App')
    st.markdown('''
    ## About
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B)
    ''')
    add_vertical_space(5)
    st.write('Made by Kai Chen')


@st.cache_resource
def load_llm_chatglm():
    """Load the ChatGLM2 model."""
    ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
    config_file = f'{ROOT_DIR[:-3]}config.ini'  # config path (ROOT_DIR minus the trailing 'src')
    model_path = get_config_variable(config_file, 'LLM', 'chatglm2_6b_path')
    llm = GLM()
    llm.load_model(model_name_or_path=model_path)
    return llm


@st.cache_resource
def load_llm_rwkv():
    """Load the RWKV model."""
    strategy = "cuda fp16i8 *20 -> cuda fp16"
    ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
    config_file = f'{ROOT_DIR[:-3]}config.ini'  # config path (ROOT_DIR minus the trailing 'src')
    model_path = get_config_variable(config_file, 'LLM', 'rwkv_model_path')
    tokens_path = get_config_variable(
        config_file, 'LLM', 'rwkv_tokenizer_path')
    model = RWKV(model=model_path,
                 strategy=strategy,
                 tokens_path=tokens_path)
    return model


def main():
    st.header("Chat with PDF 💬")
    # Upload a PDF file
    pdf = st.file_uploader("Upload your PDF", type='pdf')
    if pdf is not None:
        pdf_reader = PdfReader(pdf)

        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text=text)

        # Embeddings: cache the vector store on disk, keyed by the PDF name
        store_name = pdf.name[:-4]
        st.write(f'{store_name}')

        if os.path.exists(f"{store_name}.pkl"):
            with open(f"{store_name}.pkl", "rb") as f:
                VectorStore = pickle.load(f)
            # Embeddings loaded from disk
        else:
            # Alternative: embeddings = OpenAIEmbeddings()
            embeddings = HuggingFaceEmbeddings(
                model_name='/data/model/all-MiniLM-L6-v2')
            VectorStore = FAISS.from_texts(chunks, embedding=embeddings)
            with open(f"{store_name}.pkl", "wb") as f:
                pickle.dump(VectorStore, f)

        # Accept user questions/queries
        query = st.text_input("Ask questions about your PDF file:")
        st.write(query)

        if query:
            docs = VectorStore.similarity_search(query=query, k=3)
            st.write(docs)

            # Alternatives: llm = OpenAI() or llm = load_llm_rwkv()
            llm = load_llm_chatglm()
            chain = load_qa_chain(llm=llm, chain_type="stuff")
            with get_openai_callback() as cb:
                response = chain.run(input_documents=docs, question=query)
                print(cb)
            st.write(response)


if __name__ == '__main__':
    main()
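
# Note: the GLM wrapper above can also be exercised outside Streamlit.
# A sketch (assumes chatglm2_6b_path in config.ini points at a local
# ChatGLM2-6B copy):
#
#   llm = load_llm_chatglm()
#   print(llm("用一句话总结这篇论文。"))  # "Summarize this paper in one sentence."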
--------------------------------------------------------------------------------
/src/llm_extractor.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Author: Kai Chen
# Email: chenkai0210@hotmail.com
# Date: 2023-09
# Description: LLM-based information extraction from input text.

import os
import json
from transformers import AutoTokenizer, AutoModel
from utils import get_config_variable


class LLMExtractor:
    """
    Extract information from text with a large language model.
    The model used here is ChatGLM2-6B:
    https://github.com/THUDM/ChatGLM2-6B/tree/main
    """

    def __init__(self):
        ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
        config_file = f'{ROOT_DIR[:-3]}config.ini'  # config path (ROOT_DIR minus the trailing 'src')
        self.model_path = get_config_variable(
            config_file, 'LLM', 'chatglm2_6b_path')
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            self.model_path, trust_remote_code=True, device='cuda').half().cuda()

    @staticmethod
    def get_prompt(content):
        data = {
            "author": "Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell",
            "title": "Language models are few-shot learners",
            "year": "2020"
        }

        # ensure_ascii=False so non-ASCII characters survive
        json_string = json.dumps(data, ensure_ascii=False)

        # The prompt is kept in Chinese because ChatGLM2-6B is a
        # Chinese-centric model; it asks for the "author", "title", and
        # "year" entities as JSON, with one worked example.
        prompt = f"""
从输入的文字中,提取"信息"(keyword,content),包括:"author"、"title"、"year"的实体,输出json格式内容。
请只输出json,不需要输出其他内容。

例如:
输入
[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal,Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models arefew-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

输出
{json_string}

输入
{content}

输出
以下是提取的实体及其JSON格式内容:
"""

        return prompt

    def extract_reference(self, ref: str) -> dict:
        prompt = LLMExtractor.get_prompt(ref)
        if self.model:
            response, history = self.model.chat(
                self.tokenizer, prompt, history=[])
            try:
                return json.loads(response)
            except json.JSONDecodeError:
                # TODO: analyze and recover from malformed model output
                pass
        return None
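

if __name__ == '__main__':
    # Minimal smoke test (a sketch; assumes the ChatGLM2-6B weights
    # configured in config.ini are available). The input is the same
    # reference the prompt uses as its worked example.
    extractor = LLMExtractor()
    sample = ("[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, "
              "Jared D. Kaplan, et al. Language models are few-shot learners. "
              "Advances in Neural Information Processing Systems, 33:1877-1901, 2020.")
    print(extractor.extract_reference(sample))
    # Expected shape: {"author": "...", "title": "Language models are
    # few-shot learners", "year": "2020"}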
--------------------------------------------------------------------------------
/src/llm_summarizer.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Author: Kai Chen
# Email: chenkai0210@hotmail.com
# Date: 2023-09
# Description: LLM-based summarization of a PDF.

import os
os.environ["RWKV_CUDA_ON"] = '1'
os.environ["RWKV_JIT_ON"] = '1'

from utils import get_config_variable
from langchain.prompts import PromptTemplate
from langchain.document_loaders import PyPDFLoader
from langchain.chains import LLMChain
from langchain.llms import RWKV


class LLMSummarizer:
    """
    Summarize a PDF with a large language model.
    The model used here is RWKV-4 Raven:
    https://huggingface.co/BlinkDL/rwkv-4-raven
    """

    def __init__(self):
        self.strategy = "cuda fp16i8 *20 -> cuda fp16"

        ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
        config_file = f'{ROOT_DIR[:-3]}config.ini'  # config path (ROOT_DIR minus the trailing 'src')
        self.model_path = get_config_variable(config_file, 'LLM', 'rwkv_model_path')
        self.tokens_path = get_config_variable(
            config_file, 'LLM', 'rwkv_tokenizer_path')
        self.model = RWKV(model=self.model_path,
                          strategy=self.strategy,
                          tokens_path=self.tokens_path)

        self.task = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.
# Instruction:
Write a concise summary of the following:
{text}
# Response:
CONCISE SUMMARY:
"""
        self.prompt = PromptTemplate(
            input_variables=["text"],
            template=self.task,
        )
        self.chain = LLMChain(llm=self.model, prompt=self.prompt)

    def summarize(self, pdf_path):
        loader = PyPDFLoader(pdf_path)
        # Summarize using the first 500 characters of the first page
        data = loader.load()[0]
        instruction = data.page_content[:500]
        summary = self.chain.run(instruction)
        return summary
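

if __name__ == '__main__':
    # Minimal smoke test (a sketch; assumes the RWKV weights configured in
    # config.ini and the sample PDF below both exist).
    summarizer = LLMSummarizer()
    print(summarizer.summarize('/home/data/gpt-4.pdf'))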
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Author: Kai Chen
# Email: chenkai0210@hotmail.com
# Date: 2023-09
# Description: demonstrates how to use PDFParser.

import os
import logging
import json
from pdf_parser import PDFParser
from llm_summarizer import LLMSummarizer
from llm_extractor import LLMExtractor
from tqdm import tqdm

if __name__ == '__main__':
    ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
    logging.basicConfig(filename=f'{ROOT_DIR[:-3]}basic.log',
                        encoding='utf-8',
                        level=logging.INFO,
                        filemode='w',
                        format='%(process)d-%(levelname)s-%(message)s')
    logging.getLogger().addHandler(logging.StreamHandler())

    logging.info('** pdf parsing **')
    # 1 Create the PDFParser object
    # https://cdn.openai.com/papers/gpt-4.pdf
    pdf_path = f'{ROOT_DIR[:-3]}data/gpt-4.pdf'
    parser = PDFParser(pdf_path)

    # 2 Text: title, section names, and the content of each section
    logging.info('== extract text ==')
    parser.extract_text()
    logging.info('-- title --')
    logging.info(parser.text.title)
    logging.info('-- section --')
    for title, section in parser.text.section.items():
        logging.info(title)
    # Save the section dictionary as a JSON file
    json_file_path = f"{ROOT_DIR[:-3]}temp/json/sections.json"
    with open(json_file_path, 'w') as json_file:
        json.dump(parser.text.section, json_file)

    # 3 Images
    logging.info('== extract image ==')
    parser.extract_images()
    for image in parser.images:
        # Save each image to a file
        image_filename = f"{ROOT_DIR[:-3]}temp/image/image_{image.page_num}_{image.title[:10]}.png"
        with open(image_filename, "wb") as image_file:
            logging.info(image.title)
            logging.info(image.page_num)
            image_file.write(image.image_data)

    # 4 Tables and their captions
    logging.info('== extract table ==')
    parser.extract_tables()
    for i, table in enumerate(parser.tables):
        logging.info(table.title)
        csv_filename = f"{ROOT_DIR[:-3]}temp/table/table_{i}_{table.page_num}_{table.title[:10]}.csv"
        table.table_data.to_csv(csv_filename)

    # 5 References
    logging.info('== extract references ==')
    parser.extract_references()
    logging.info(len(parser.references))
    with open(f'{ROOT_DIR[:-3]}temp/reference/references.txt', 'w') as fp:
        for ref in parser.references:
            # Write each entry on its own line
            fp.write("%s\n" % ref.ref)

    # 6 Summary
    logging.info('== summarizing (LLM) ==')
    llm_summarizer = LLMSummarizer()
    parser.text.summary = llm_summarizer.summarize(pdf_path)
    logging.info(parser.text.summary)

    # 7 Use an LLM to structure the references into author, title, and year
    logging.info('== extract info (author, title, year) from references ==')
    llm_extractor = LLMExtractor()
    extracted_ref = 0
    for i, ref in enumerate(tqdm(parser.references)):
        json_ref = llm_extractor.extract_reference(ref.ref)
        if json_ref and len(json_ref) > 0:
            with open(f'{ROOT_DIR[:-3]}temp/reference/{i}.json', 'w') as outfile:
                json.dump(json_ref, outfile)
            extracted_ref += 1
        if extracted_ref > 5:
            break
--------------------------------------------------------------------------------
/src/pdf_parser.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Author: Kai Chen
# Email: chenkai0210@hotmail.com
# Date: 2023-09
# Description: parses a PDF and structures it into text, images, tables, and references.

import pandas as pd  # structured tables
import fitz          # PyMuPDF, PDF operations
import PyPDF2        # PDF operations
from typing import List


class Text:
    """Text class wrapping the extracted text content."""

    def __init__(self,
                 title: str = None,
                 section: dict = None,
                 summary: str = None):
        """
        Parameters:
        - title: str, the document title
        - section: dict, key: section name, value: text content of the section
        - summary: str, the summary
        """
        self.title = title
        # Avoid a shared mutable default across instances
        self.section = section if section is not None else {}
        self.summary = summary


class PDFImage:
    """Image class wrapping an extracted image."""

    def __init__(self,
                 title: str,
                 image_data: object,
                 page_num: int):
        """
        Parameters:
        - title: str, the image caption
        - image_data: the image data, e.g. a byte stream or a file path
        - page_num: int, the page the image appears on
        """
        self.title = title
        self.image_data = image_data
        self.page_num = page_num


class Table:
    """Table class wrapping an extracted table."""

    def __init__(self,
                 title: str,
                 table_data: pd.DataFrame,
                 page_num: int):
        """
        Parameters:
        - title: str, the table caption
        - table_data: the table data, e.g. a Pandas DataFrame
        - page_num: int, the page the table appears on
        """
        self.title = title
        self.table_data = table_data
        self.page_num = page_num


class Reference:
    """Reference class wrapping an extracted reference entry."""

    def __init__(self, ref: str):
        """
        Parameters:
        - ref: str, the reference entry
        """
        self.ref = ref

class PDFOutliner:
    """
    Collects the titles of all sections of a given PDF.
    Lightly modified; the core algorithm comes from:
    https://github.com/beaverden/pdftoc/tree/main
    """

    def __init__(self):
        self.titles = []  # one entry per section title

    def get_tree_pages(self, root, info, depth=0):
        """
        Recursively iterate the outline tree.
        Find the page pointed to by each outline item, look up its
        assigned physical order id, and decrement it by the padding
        if necessary.
        """
        if isinstance(root, dict):
            page = root['/Page'].get_object()
            t = root['/Title']
            title = t
            if isinstance(t, PyPDF2.generic.ByteStringObject):
                title = t.original_bytes.decode('utf8')
            title = title.strip()
            title = title.replace('\n', '')
            title = title.replace('\r', '')
            page_num = info['all_pages'].get(id(page), 0)
            if page_num == 0:
                # TODO: logging
                print('Not found page number for /Page!', page)
            elif page_num < info['padding']:
                page_num = 0
            else:
                page_num -= info['padding']
            self.titles.append(title)
            return
        for elem in root:
            self.get_tree_pages(elem, info, depth + 1)

    def recursive_numbering(self, obj, info):
        """
        Recursively iterate through all the pages in order and
        assign each a physical order number.
        """
        if obj['/Type'] == '/Page':
            obj_id = id(obj)
            if obj_id not in info['all_pages']:
                info['all_pages'][obj_id] = info['current_page_id']
                info['current_page_id'] += 1
            return
        elif obj['/Type'] == '/Pages':
            for page in obj['/Kids']:
                self.recursive_numbering(page.get_object(), info)

    def create_text_outline(self, pdf_path, page_number_padding):
        with open(pdf_path, 'rb') as file:
            fileReader = PyPDF2.PdfReader(file)

            info = {
                'all_pages': {},
                'current_page_id': 1,
                'padding': page_number_padding
            }

            pages = fileReader.trailer['/Root']['/Pages'].get_object()
            self.recursive_numbering(pages, info)
            self.get_tree_pages(fileReader.outline, info, 0)
        return


class PDFParser:
    """PDF parser that extracts the text, images, tables, and references of a PDF."""

    def __init__(self, pdf_path: str):
        """
        Parameters:
        - pdf_path: str, path to the PDF file
        """
        self.pdf_path = pdf_path
        self.doc = fitz.open(self.pdf_path)  # PyMuPDF fitz.Document
        self.text = Text()     # Text, the text content
        self.images = []       # list of all images (PDFImage)
        self.tables = []       # list of all tables (Table)
        self.references = []   # list of all references (Reference)

    def extract_title(self):
        """
        Get the PDF title.
        """
        doc = self.doc
        first_page = doc.load_page(0)  # load the first page
        # Extract the text of the first page
        text = first_page.get_text()
        # Split it into lines
        lines = text.split('\n')
        # The first line is taken as the title
        first_line = lines[0].strip()
        self.text.title = first_line
        return

    def extract_sections_content(self,
                                 doc: fitz.Document,
                                 section_titles: List[str]):
        """
        Extract the text content of each section, given the list of section names.
        Parameters:
        - doc: the opened PDF document
        - section_titles: list of all section names

        Returns:
        - a dict whose keys are section names and whose values are the section contents.
        """
        sections_content = {}  # dict of section name -> content
        # Strip leading numbering from the section names
        filtered_section_titles = [PDFParser.remove_leading_digits(
            title).strip() for title in section_titles]
        # For each section name, walk over all text lines; once a line contains
        # the section name, start collecting lines into that section's content,
        # and stop as soon as a line contains the next section's name.
        for i, section_title in enumerate(filtered_section_titles):
            section_found = False
            section_content = ""
            scan_page = True
            for page_num in range(len(doc)):
                page = doc[page_num]
                page_text = page.get_text()
                for line in page_text.split('\n'):
                    # Stop at the next section's title
                    if i+1 < len(filtered_section_titles) and filtered_section_titles[i+1].lower() in line.lower():
                        scan_page = False
                        break
                    if section_title.lower() in line.lower():
                        section_found = True
                    elif section_found:
                        # The target title was found; collect the section content
                        section_content += line + "\n"
                if not scan_page:
                    break

            if section_found:
                sections_content[section_titles[i]] = section_content

        return sections_content

    @staticmethod
    def remove_leading_digits(text: str):
        """
        Remove digits at the start of the input text.
        """
        while text and text[0].isdigit():
            text = text[1:]  # drop the first character
        return text
    def extract_text(self):
        """
        Extract the text content of the PDF.
        """
        # 1 Get the title
        self.extract_title()
        # 2 Get the section names
        outliner = PDFOutliner()
        outliner.create_text_outline(self.pdf_path, 0)
        # 3 Get the text content under each section
        self.text.section = self.extract_sections_content(
            self.doc, outliner.titles)
        return

    def extract_images(self, fig_caption_start: str = 'Figure'):
        """
        Extract the images of the PDF together with their captions.
        fig_caption_start: str, the word image captions start with
        """
        doc = self.doc

        for page_num in range(len(doc)):
            page = doc[page_num]
            # Extract the page's text blocks
            blocks = page.get_text('blocks')
            # Match each image with its caption by distance: among the text
            # blocks that start with the caption keyword, the one closest
            # (squared Euclidean distance) to the image is taken as its caption.
            for img in page.get_images(full=True):
                xref = img[0]
                base_image = doc.extract_image(xref)
                rects = page.get_image_rects(xref)
                if not rects:
                    continue
                x0, y0, x1, y1 = rects[0]
                related_text = "untitled"
                min_dist = float('inf')
                for block in blocks:
                    block_x0, block_y0, block_x1, block_y1, block_text = block[:5]
                    if block_text.strip().startswith(fig_caption_start):
                        # Squared Euclidean distance between top-left corners
                        dist = (x0 - block_x0)**2 + (y0 - block_y0)**2
                        if dist < min_dist:
                            min_dist = dist
                            related_text = block_text.strip()

                image_data = base_image["image"]
                image = PDFImage(related_text, image_data, page_num)
                self.images.append(image)

    def extract_tables(self, tab_caption_start: str = 'Table'):
        """
        Extract the tables of the PDF together with their captions.
        tab_caption_start: str, the word table captions start with
        """
        doc = self.doc
        for num in range(len(doc)):
            page = doc[num]
            # Extract the page's text blocks
            blocks = page.get_text('blocks')
            # Extract the tables
            tables = page.find_tables()
            # Match each table with its caption by distance: among the text
            # blocks that start with the caption keyword, the one closest
            # (squared Euclidean distance) to the table is taken as its caption.
            for table in tables:
                x0, y0, x1, y1 = table.bbox
                df = table.to_pandas()
                related_text = "untitled"
                min_dist = float('inf')
                for block in blocks:
                    block_x0, block_y0, block_x1, block_y1, block_text = block[:5]
                    if block_text.strip().startswith(tab_caption_start):
                        # Squared Euclidean distance between top-left corners
                        dist = (x0 - block_x0)**2 + (y0 - block_y0)**2
                        if dist < min_dist:
                            min_dist = dist
                            related_text = block_text.strip()
                self.tables.append(Table(title=related_text,
                                         table_data=df,
                                         page_num=num))

    def extract_references(self):
        """
        Extract the references of the PDF.
        """
        doc = self.doc
        page_num = len(doc)
        ref_list = []
        found = False
        for num, page in enumerate(doc):
            if found:
                break
            content = page.get_text('blocks')
            for pc in content:
                txt = ''.join(list(pc[4:-2]))  # keep only the text field of the block
                if 'References' in txt or 'REFERENCES' in txt or 'referenCes' in txt:
                    # Collect all text blocks from this page to the end,
                    # then stop scanning to avoid collecting them twice
                    for rpn in range(num, page_num):
                        ref_page = doc[rpn]
                        ref_content = ref_page.get_text('blocks')
                        for refc in ref_content:
                            ref_list.extend(list(refc[4:-2]))
                    found = True
                    break
        # Skip everything up to and including the "References" heading block
        index = 0
        for i, ref in enumerate(ref_list):
            if 'References' in ref or 'REFERENCES' in ref or 'referenCes' in ref:
                index = i
                break
        if index + 1 < len(ref_list):
            index += 1
        self.references = [Reference(ref.replace('\n', ''))
                           for ref in ref_list[index:] if len(ref) > 10]
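
if __name__ == '__main__':
    # Toy illustration of the caption-matching heuristic used by
    # extract_images()/extract_tables(): among text blocks whose text starts
    # with the caption keyword, pick the one closest (squared Euclidean
    # distance between top-left corners) to the image or table.
    # The coordinates below are made up for the demo.
    demo_blocks = [
        (72.0, 500.0, 300.0, 512.0, "Figure 1: Model overview", 0, 0),
        (72.0, 700.0, 300.0, 712.0, "Figure 2: Training loss", 1, 0),
        (72.0, 100.0, 300.0, 112.0, "Some body text", 2, 0),
    ]
    x0, y0 = 80.0, 520.0  # top-left corner of the image rectangle
    title, min_dist = "untitled", float('inf')
    for bx0, by0, _bx1, _by1, text, _bno, _btype in demo_blocks:
        if text.strip().startswith('Figure'):
            dist = (x0 - bx0) ** 2 + (y0 - by0) ** 2
            if dist < min_dist:
                min_dist, title = dist, text
    print(title)  # -> Figure 1: Model overview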
--------------------------------------------------------------------------------
/src/utils.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Author: Kai Chen
# Email: chenkai0210@hotmail.com
# Date: 2023-09
# Description: utility functions.

import configparser


def get_config_variable(config_file: str, section: str, variable_name: str):
    """
    Read a variable from a config file.

    Args:
        config_file (str): path to the config file
        section (str): section name within the config file
        variable_name (str): name of the variable to read

    Returns:
        str: the value of the variable, or None if it is not found
    """
    config = configparser.ConfigParser()
    config.read(config_file)

    if config.has_section(section):
        if config.has_option(section, variable_name):
            return config.get(section, variable_name)

    return None
--------------------------------------------------------------------------------