├── .gitignore
├── input
│   └── TestPDFfile.pdf
├── output
│   └── TestPDFfile_explained.pdf
├── requirements.txt
├── env.txt
├── LICENSE
├── main.py
├── utils.py
├── PDFReader.html
├── creator.py
├── chat_request.py
└── README.md

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.env
.idea
__pycache__

--------------------------------------------------------------------------------
/input/TestPDFfile.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Agent-QG/PDFInterpreter/HEAD/input/TestPDFfile.pdf

--------------------------------------------------------------------------------
/output/TestPDFfile_explained.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Agent-QG/PDFInterpreter/HEAD/output/TestPDFfile_explained.pdf

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
aiohttp==3.8.4
aiosignal==1.3.1
async-timeout==4.0.2
attrs==23.1.0
certifi==2022.12.7
charset-normalizer==3.1.0
frozenlist==1.3.3
idna==3.4
markdown2==2.4.8
multidict==6.0.4
openai==0.27.4
pdfkit==1.0.0
PyPDF2==3.0.1
python-dotenv==1.0.0
requests==2.28.2
tqdm==4.65.0
typing_extensions==4.5.0
urllib3==1.26.15
yarl==1.9.1

--------------------------------------------------------------------------------
/env.txt:
--------------------------------------------------------------------------------
# Please enter your OpenAI API key
OPENAI_API_KEY='YOUR-KEY'

# Please enter your target language
LANGUAGE='Chinese'

# Please enter the model: 'gpt-3.5-turbo' or 'gpt-4' (make sure you have GPT-4 access)
MODEL='gpt-3.5-turbo'

# You can leave this blank.
# If you encounter a wkhtmltopdf error, please enter the path to wkhtmltopdf here
WKHTMLTOPDFPATH=''

# You can change your prompt here if you want. No need to write 'Answer in <language> and Markdown'.
PROMPT="Please explain and analyse the following content. Your answer should include two parts: 'Explain', which is the explanation of this part, and 'Analyse', which is the analysis of this part."

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 Agent-QG

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import os

from chat_request import Chat_gpt
from dotenv import load_dotenv

from creator import Creator
from utils import markdown_to_pdf

load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
language = os.getenv("LANGUAGE")
model = os.getenv("MODEL")
wkhtmltopdf_path = os.getenv("WKHTMLTOPDFPATH")
prompt = os.getenv("PROMPT")


def main():
    input_folder = "input"
    output_folder = "output"

    for root, _, filenames in os.walk(input_folder):
        for filename in filenames:
            if filename.endswith(".pdf"):
                pdf_path = os.path.join(root, filename)
                relative_path = os.path.relpath(root, input_folder)
                output_subfolder = os.path.join(output_folder, relative_path)

                os.makedirs(output_subfolder, exist_ok=True)

                output_pdf = os.path.join(
                    output_subfolder, os.path.splitext(filename)[0] + "_explained.pdf"
                )
                chat_gpt = Chat_gpt(api_key=api_key, language=language, model=model, prompt=prompt)
                creator = Creator(pdf_path, chat_gpt)
                parsed_list = creator.process()
                markdown_interpretations = "\n\n---\n\n".join(parsed_list)
                markdown_to_pdf(markdown_interpretations, output_pdf, wkhtmltopdf_path)


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
import re
import PyPDF2
import markdown2
import pdfkit


def read_pdf(file_path):
    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        pages = []
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            content = page.extract_text()
            # Strip a trailing page number, if any
            content = re.sub(r'\s*\d+\s*$', '',
                             content)

            pages.append(content)

    return pages


def split_text(text, max_tokens):
    sentences = re.split(r'(?
[The remainder of utils.py — the rest of split_text, plus the fix_markdown_issues and markdown_to_pdf helpers imported by creator.py and main.py — is not recoverable from this dump.]

--------------------------------------------------------------------------------
/PDFReader.html:
--------------------------------------------------------------------------------
[HTML markup not recoverable from this dump; the page title is "Double PDF Reader" — a viewer that displays two PDFs side by side.]
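The functions split_text, fix_markdown_issues, and markdown_to_pdf are imported from utils by creator.py and main.py, but their bodies did not survive in this dump. The following is a hypothetical sketch of the two pure-Python helpers; the sentence-split regex, the character-based chunking (the original's max_tokens may have counted real tokens), and the fence-closing heuristic are all assumptions, not the author's original code:

```python
import re


def split_text(text, max_tokens):
    """Split text into chunks, breaking on sentence boundaries.

    Treats max_tokens as an approximate character budget (assumption).
    """
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= max_tokens:
            # Sentence still fits in the current chunk
            current = (current + " " + sentence).strip()
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def fix_markdown_issues(content):
    """Close an unterminated code fence so joined pages render cleanly."""
    if content.count("```") % 2 == 1:
        content += "\n```"
    return content
```

markdown_to_pdf presumably converts the Markdown with `markdown2.markdown()` and renders it with `pdfkit.from_string()`, passing `configuration=pdfkit.configuration(wkhtmltopdf=...)` when WKHTMLTOPDFPATH is set; it is omitted from this sketch because it requires the wkhtmltopdf binary.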
--------------------------------------------------------------------------------
/creator.py:
--------------------------------------------------------------------------------
import threading
from utils import read_pdf, split_text, fix_markdown_issues
from tqdm import tqdm
import os


class Creator:

    def __init__(self, file_path, chat_gpt):
        self.file_path = file_path
        self.file_name = os.path.basename(file_path)
        self.chat_gpt = chat_gpt

    def process(self):
        pdf_contents = read_pdf(self.file_path)

        # Used to reassemble responses in page order
        sequential_mapping = {}
        result_list = []

        # Total number of requests issued
        target_num = 0

        for page_num, page_content in enumerate(pdf_contents):
            sequential_mapping[page_num] = []

            # Split the page into paragraphs
            paragraphs = page_content.split("\n\n")

            for paragraph in paragraphs:
                # Skip empty paragraphs
                if len(paragraph.strip()) > 0:

                    # Split long paragraphs into sub-paragraphs to stay under the token limit
                    sub_paragraphs = split_text(paragraph, 3000)

                    for sub_paragraph in sub_paragraphs:
                        t = threading.Thread(target=self.chat_gpt.go_thread, args=(sub_paragraph, target_num,))
                        t.start()
                        sequential_mapping[page_num].append(target_num)
                        result_list.append(None)
                        target_num += 1

        done_request = 0
        with tqdm(total=target_num, position=0, desc=self.file_name) as pbar:
            while done_request != target_num:
                while not self.chat_gpt.content_queue.empty():
                    pos, content = self.chat_gpt.content_queue.get()
                    content = fix_markdown_issues(content)
                    result_list[pos] = content
                    done_request += 1
                    pbar.update(1)

        parsed_pdf = []

        for page_num in range(len(pdf_contents)):
            parsed_page = []
            for index in sequential_mapping[page_num]:
                parsed_page.append(result_list[index])
            parsed_page.append(f"[p{page_num + 1}]")
            parsed_pdf.append("\n\n".join(parsed_page))

        return parsed_pdf

--------------------------------------------------------------------------------
/chat_request.py:
--------------------------------------------------------------------------------
import openai
import time
from queue import Queue
import threading


class Chat_gpt:

    def __init__(self, api_key, language, model, prompt):
        openai.api_key = api_key
        self.language = language
        self.model = model
        self.prompt = prompt
        self.content_queue = Queue()

        # Caps the number of concurrent requests; irrelevant if you don't use threading
        self.sem = threading.Semaphore(12)

        # Set while a rate-limit error is in effect
        self.rate_limit = False

    # Threaded entry point
    def go_thread(self, text, tag_num):
        with self.sem:
            self.process(text, tag_num)

    # Single request, with retries
    def process(self, text, tag_num=0, max_tokens=3000, max_retries=5, retry_delay=5):
        language = self.language
        model = self.model
        retries = 0

        # Marks the thread that first observed the rate limit
        first_find_threading_error = False

        while retries < max_retries:
            try:
                response = openai.ChatCompletion.create(
                    model=model,
                    messages=[{"role": "system", "content": "You are an AI assistant."},
                              {"role": "user",
                               "content": f"{self.prompt} Present the answer in {language} Markdown format. Thank you:\n{text}"}],
                    max_tokens=max_tokens,
                    n=1,
                    temperature=0.5,
                )
                self.content_queue.put((tag_num, response['choices'][0]['message']['content'].strip()))

                self.rate_limit = False
                first_find_threading_error = False

                break
            except openai.error.RateLimitError as e:

                if not self.rate_limit:
                    first_find_threading_error = True
                    self.rate_limit = True
                else:
                    # Other threads wait here until the first thread clears the rate limit
                    while self.rate_limit and not first_find_threading_error:
                        time.sleep(retry_delay)

                if retries < max_retries - 1:
                    retries += 1
                    time.sleep(retry_delay)
                else:
                    raise e
            except openai.error.InvalidRequestError as e:
                if "maximum context length" in str(e):
                    # Shrink the completion budget and retry
                    max_tokens -= 500
                    if max_tokens < 1:
                        print("This page is too long")
                        return "TOO LONG!"
                else:
                    raise e

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# PDFInterpreter

This is a Python script that parses PDF files and uses OpenAI GPT to explain and analyze their text. The script first reads a PDF file and breaks its content down into individual pages. It then processes the text of each page, splitting it into paragraphs, and sends those paragraphs to OpenAI GPT to obtain explanations and analyses. Finally, it converts the GPT responses from Markdown back into a PDF file. Place the PDF files you want to parse in the `input` folder; the processed PDF files will be saved in the `output` folder. This version of the script uses multithreading to speed up the processing of a single PDF file.
这是一个用于解析PDF文件并使用OpenAI GPT进行文本解释和分析的Python脚本。脚本首先读取PDF文件,将其内容分解为单独的页面。然后,对每个页面的文本进行处理,将其分解为段落。接下来,将段落发送到OpenAI GPT以获得对文本的解释和分析。最后,将GPT的响应转换为Markdown格式,并将其转换回PDF文件。您可以在 input 文件夹中放置需要解析的PDF文件,处理后的PDF文件将保存在 output 文件夹中。此版本的脚本采用多线程来加快单个PDF文件的处理速度。

## Project Structure

- `input/` - Folder for your input PDF files.
- `output/` - Folder where the processed PDF files are saved.
- `main.py` - Main script to run the project.
- `chat_request.py` - Contains the Chat_gpt class, responsible for making API requests to the GPT model and managing concurrent threads for faster PDF processing.
- `creator.py` - Contains the Creator class, responsible for processing a single PDF file by extracting text, splitting it into chunks, and using the Chat_gpt class to obtain explanations and analyses.
- `utils.py` - Utility functions for reading PDF files, splitting text, fixing Markdown issues, and converting Markdown to PDF.
- `.env` - File that stores environment variables such as the API key, language, GPT model, and wkhtmltopdf path.
- `PDFReader.html` - A PDF reader for viewing two PDFs at the same time.

## Set Up
1. Download the repository
2. Install the requirements
```shell
pip install -r requirements.txt
```
3. Install wkhtmltopdf: go to https://wkhtmltopdf.org/downloads.html and download `wkhtmltopdf`
4. Set your OpenAI API key in env.txt (you can get a key at https://platform.openai.com/account/api-keys)
5. Set your target language and model in env.txt (Chinese and gpt-3.5-turbo are the defaults)
6. Rename the file 'env.txt' to '.env'
## Usage
1. Copy your PDF files to the `input/` folder
2. Start
```shell
python main.py
```
## Read your PDFs
By opening PDFReader.html, you can read two PDFs at the same time, which is convenient for comparing them.
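The setup steps above can be run as a single sequence on macOS/Linux (hypothetical session; `my_paper.pdf` stands in for your own file, and `env.txt` must be edited before renaming):

```shell
git clone https://github.com/Agent-QG/PDFInterpreter.git
cd PDFInterpreter
pip install -r requirements.txt
# Edit env.txt first: set OPENAI_API_KEY, LANGUAGE, MODEL
mv env.txt .env
cp my_paper.pdf input/
python main.py
```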
## General errors
- Please make sure you have installed `wkhtmltopdf`
- If you encounter a wkhtmltopdf error, add the path to your wkhtmltopdf executable in the .env file
- On macOS, the default path is `/usr/local/bin/wkhtmltopdf`. You can find your wkhtmltopdf path with
```shell
which wkhtmltopdf
```
- On Windows, the default path is `C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe`. You can find your wkhtmltopdf path with (assuming it is installed on `C:\`)
```shell
dir /s /b C:\ | findstr /i wkhtmltopdf.exe
```
- If you still encounter an error with wkhtmltopdf, consider granting administrator privileges to both wkhtmltopdf and the project.
- If you encounter any other issues, please feel free to open an issue on GitHub or contact me directly at my.qgong@gmail.com.

## 未来发展
- 适配GPT-4模型和图片读取功能:我们正在努力将GPT-4模型整合到项目中以提高解析和生成结果的质量。
- 开发网页版:我们计划开发一个网页版并且添加更丰富的功能,让用户能够更方便地在线使用本工具。

## Future Development

- Adapting to the GPT-4 model and image reading: we are working on integrating the GPT-4 model into the project to improve the quality of parsing and generated results.
- Developing a web version: we plan to develop a web-based version of this tool with richer features, making it more accessible and convenient to use online.
--------------------------------------------------------------------------------