├── .gitignore
├── input
    └── TestPDFfile.pdf
├── output
    └── TestPDFfile_explained.pdf
├── requirements.txt
├── env.txt
├── LICENSE
├── main.py
├── utils.py
├── PDFReader.html
├── creator.py
├── chat_request.py
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | .env
3 | .idea
4 | __pycache__


--------------------------------------------------------------------------------
/input/TestPDFfile.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Agent-QG/PDFInterpreter/HEAD/input/TestPDFfile.pdf


--------------------------------------------------------------------------------
/output/TestPDFfile_explained.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Agent-QG/PDFInterpreter/HEAD/output/TestPDFfile_explained.pdf


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | aiohttp==3.8.4
 2 | aiosignal==1.3.1
 3 | async-timeout==4.0.2
 4 | attrs==23.1.0
 5 | certifi==2022.12.7
 6 | charset-normalizer==3.1.0
 7 | frozenlist==1.3.3
 8 | idna==3.4
 9 | markdown2==2.4.8
10 | multidict==6.0.4
11 | openai==0.27.4
12 | pdfkit==1.0.0
13 | PyPDF2==3.0.1
14 | python-dotenv==1.0.0
15 | requests==2.28.2
16 | tqdm==4.65.0
17 | typing_extensions==4.5.0
18 | urllib3==1.26.15
19 | yarl==1.9.1
20 | 


--------------------------------------------------------------------------------
/env.txt:
--------------------------------------------------------------------------------
 1 | #Please enter your openai api key
 2 | OPENAI_API_KEY='YOUR-KEY'
 3 | 
 4 | #Please enter your target language
 5 | LANGUAGE='Chinese'
 6 | 
 7 | #Please enter model, 'gpt-3.5-turbo' or 'gpt-4' (make sure you have gpt-4 model access)
 8 | MODEL='gpt-3.5-turbo'
 9 | 
10 | #You can leave this blank. If you encounter wkhtmltopdf error please enter it
11 | WKHTMLTOPDFPATH=''
12 | 
13 | #You can change your prompt here if you want. No need to write 'Answer in language and markdown
14 | PROMPT="Please explain and analyse the following content.Your answer should include two parts-'Explain' which is the explanation of this part and 'Analyse' which is the analysis of this part."
15 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2023 Agent-QG
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | 
 3 | from chat_request import Chat_gpt
 4 | from dotenv import load_dotenv
 5 | 
 6 | from creator import Creator
 7 | from utils import markdown_to_pdf
 8 | 
 9 | load_dotenv()
10 | api_key = os.getenv("OPENAI_API_KEY")
11 | language = os.getenv("LANGUAGE")
12 | model = os.getenv("MODEL")
13 | wkhtmltopdf_path = os.getenv(r"WKHTMLTOPDFPATH")
14 | prompt=os.getenv(r"PROMPT")
15 | 
16 | def main():
17 |     input_folder = "input"
18 |     output_folder = "output"
19 | 
20 |     for root, _, filenames in os.walk(input_folder):
21 |         for filename in filenames:
22 |             if filename.endswith(".pdf"):
23 |                 pdf_path = os.path.join(root, filename)
24 |                 relative_path = os.path.relpath(root, input_folder)
25 |                 output_subfolder = os.path.join(output_folder, relative_path)
26 | 
27 |                 os.makedirs(output_subfolder, exist_ok=True)
28 | 
29 |                 output_pdf = os.path.join(
30 |                     output_subfolder, os.path.splitext(filename)[0] + "_explained.pdf"
31 |                 )
32 |                 chat_gpt = Chat_gpt(api_key=api_key, language=language, model=model, prompt=prompt)
33 |                 creator = Creator(pdf_path, chat_gpt)
34 |                 parsed_list = creator.process()
35 |                 markdown_interpretations = "\n\n---\n\n".join(parsed_list)
36 |                 markdown_to_pdf(markdown_interpretations, output_pdf, wkhtmltopdf_path)
37 | 
38 | if __name__ == "__main__":
39 |     main()
40 | 


--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
 1 | import re
 2 | import PyPDF2
 3 | import markdown2
 4 | import pdfkit
 5 | 
 6 | def read_pdf(file_path):
 7 |     with open(file_path, "rb") as file:
 8 |         reader = PyPDF2.PdfReader(file)
 9 |         pages = []
10 |         for page_num in range(len(reader.pages)):
11 |             page = reader.pages[page_num]
12 |             content = page.extract_text()
13 |             content = re.sub(r'\s*\d+\s*$', '', content)
14 | 
15 |             pages.append(content)
16 | 
17 |     return pages
18 | 
19 | 
20 | 
21 | def split_text(text, max_tokens):
22 |     sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
23 |     chunks = []
24 |     current_chunk = []
25 | 
26 |     for sentence in sentences:
27 |         if len(current_chunk) + len(sentence.split()) <= max_tokens:
28 |             current_chunk.append(sentence)
29 |         else:
30 |             chunks.append(' '.join(current_chunk))
31 |             current_chunk = [sentence]
32 | 
33 |     if current_chunk:
34 |         chunks.append(' '.join(current_chunk))
35 | 
36 |     return chunks
37 | 
38 | 
39 | def fix_markdown_issues(text):
40 |     text = text.replace("--", "—")
41 |     text = re.sub(r'(?<!\\)(~~)', r'\\\1', text)
42 | 
43 |     return text
44 | 
45 | def markdown_to_pdf(markdown_text, output_path,wkhtmltopdf_path):
46 |     html = markdown2.markdown(markdown_text)
47 |     options = {
48 |         'encoding': "UTF-8"
49 |     }
50 |     if wkhtmltopdf_path:
51 |         config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)
52 |         pdfkit.from_string(html, output_path, options=options, configuration=config)
53 |     else:
54 |         pdfkit.from_string(html, output_path, options=options)
55 | 


--------------------------------------------------------------------------------
/PDFReader.html:
--------------------------------------------------------------------------------
 1 | <!DOCTYPE html>
 2 | <html lang="zh">
 3 | <head>
 4 |     <meta charset="UTF-8">
 5 |     <meta name="viewport" content="width=device-width, initial-scale=1.0">
 6 |     <title>Double PDF Reader</title>
 7 |     <style>
 8 |         body {
 9 |             display: flex;
10 |             flex-direction: column;
11 |             justify-content: center;
12 |             align-items: center;
13 |             height: 100vh;
14 |             margin: 0;
15 |         }
16 |         .pdf-container {
17 |             display: flex;
18 |             justify-content: space-around;
19 |             width: 100%;
20 |         }
21 |         iframe {
22 |             width: 45%;
23 |             height: 80vh;
24 |             border: 1px solid #ccc;
25 |         }
26 |     </style>
27 | </head>
28 | <body>
29 |     <div class="pdf-container">
30 |         <iframe id="pdf1" src="" type="application/pdf"></iframe>
31 |         <iframe id="pdf2" src="" type="application/pdf"></iframe>
32 |     </div>
33 |     <div>
34 |         <input type="file" id="input1" accept=".pdf">
35 |         <input type="file" id="input2" accept=".pdf">
36 |     </div>
37 |     <script>
38 |         document.getElementById('input1').addEventListener('change', function() {
39 |             const file = this.files[0];
40 |             if (file) {
41 |                 const url = URL.createObjectURL(file);
42 |                 document.getElementById('pdf1').src = url;
43 |             }
44 |         });
45 | 
46 |         document.getElementById('input2').addEventListener('change', function() {
47 |             const file = this.files[0];
48 |             if (file) {
49 |                 const url = URL.createObjectURL(file);
50 |                 document.getElementById('pdf2').src = url;
51 |             }
52 |         });
53 |     </script>
54 | </body>
55 | </html>
56 | 


--------------------------------------------------------------------------------
/creator.py:
--------------------------------------------------------------------------------
 1 | import threading
 2 | from utils import read_pdf, split_text, fix_markdown_issues
 3 | from tqdm import tqdm
 4 | import os
 5 | 
 6 | class Creator:
 7 | 
 8 |     def __init__(self, file_path, chat_gpt):
 9 |         self.file_path = file_path
10 |         self.file_name = os.path.basename(file_path)  # Add this line
11 |         self.chat_gpt = chat_gpt
12 | 
13 | 
14 |     def process(self):
15 |         pdf_contents = read_pdf(self.file_path)
16 | 
17 |         # for organize report order
18 |         sequential_mapping = {}
19 |         result_list = []
20 | 
21 |         target_num = 0
22 | 
23 |         for page_num, page_content in enumerate(pdf_contents):
24 |             # every request target number
25 | 
26 |             sequential_mapping[page_num] = []
27 | 
28 |             # split to paragraphs
29 |             paragraphs = page_content.split("\n\n")
30 | 
31 |             for paragraph in paragraphs:
32 |                 # judge whether this paragraph efficient
33 |                 if len(paragraph.strip()) > 0:
34 | 
35 |                     # split paragraph to sub_paragraphs to avoid too long
36 |                     sub_paragraphs = split_text(paragraph, 3000)
37 | 
38 |                     for sub_paragraph in sub_paragraphs:
39 |                         t = threading.Thread(target=self.chat_gpt.go_thread, args=(sub_paragraph, target_num,))
40 |                         t.start()
41 |                         sequential_mapping[page_num].append(target_num)
42 |                         result_list.append(None)
43 |                         target_num += 1
44 | 
45 |         done_request = 0
46 |         with tqdm(total=target_num, position=0, desc=self.file_name) as pbar:
47 |             while done_request != target_num:
48 |                 while not self.chat_gpt.content_queue.empty():
49 |                     pos, content = self.chat_gpt.content_queue.get()
50 |                     content = fix_markdown_issues(content)
51 |                     result_list[pos] = content
52 |                     done_request += 1
53 |                     pbar.update(1)
54 | 
55 |         parsed_pdf = []
56 | 
57 |         for page_num in range(len(pdf_contents)):
58 |             parsed_page = []
59 |             for index in sequential_mapping[page_num]:
60 |                 parsed_page.append(result_list[index])
61 |             parsed_page.append(f"[p{page_num + 1}]")
62 |             parsed_pdf.append("\n\n".join(parsed_page))
63 | 
64 |         return parsed_pdf
65 | 


--------------------------------------------------------------------------------
/chat_request.py:
--------------------------------------------------------------------------------
 1 | import openai
 2 | import time
 3 | from queue import Queue
 4 | import threading
 5 | 
 6 | 
 7 | class Chat_gpt:
 8 | 
 9 |     def __init__(self, api_key, language, model, prompt):
10 |         openai.api_key = api_key
11 |         self.language = language
12 |         self.model = model
13 |         self.prompt = prompt
14 |         self.content_queue = Queue()
15 | 
16 |         # about threading, if you don't use threading, ignore it.
17 |         self.sem = threading.Semaphore(12)
18 | 
19 |         # rate limit
20 |         self.rate_limit = False
21 | 
22 |     # multiple
23 |     def go_thread(self, text, tag_num):
24 |         with self.sem:
25 |             self.process(text, tag_num)
26 | 
27 |     # single
28 |     def process(self, text, tag_num=0, max_tokens=3000, max_retries=5, retry_delay=5):
29 |         language = self.language
30 |         model = self.model
31 |         retries = 0
32 | 
33 |         # first find rate limit threading
34 |         first_find_threading_error = False
35 | 
36 |         while retries < max_retries:
37 |             try:
38 |                 response = openai.ChatCompletion.create(
39 |                     model=model,
40 |                     messages=[{"role": "system", "content": "You are an AI assistant."},
41 |                               {"role": "user",
42 |                                "content": f"{self.prompt} Present the answer in {language} Markdown format. Thank you.：\n{text}"}],
43 |                     max_tokens=max_tokens,
44 |                     n=1,
45 |                     temperature=0.5,
46 |                 )
47 |                 self.content_queue.put((tag_num, response['choices'][0]['message']['content'].strip()))
48 | 
49 |                 self.rate_limit = False
50 |                 first_find_threading_error = False
51 | 
52 |                 break
53 |             except openai.error.RateLimitError as e:
54 | 
55 |                 if not self.rate_limit:
56 |                     first_find_threading_error = True
57 |                     self.rate_limit = True
58 |                 else:
59 |                     while self.rate_limit and not first_find_threading_error:
60 |                         time.sleep(retry_delay)
61 | 
62 |                 if retries < max_retries - 1:
63 |                     retries += 1
64 |                     time.sleep(retry_delay)
65 |                 else:
66 |                     raise e
67 |             except openai.error.InvalidRequestError as e:
68 |                 if "maximum context length" in str(e):
69 |                     max_tokens -= 500
70 |                     if max_tokens < 1:
71 |                         print("This page is too long")
72 |                         return "TOO LONG!"
73 |                 else:
74 |                     raise e
75 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # PDFInterpreter
 2 | This is a Python script for parsing PDF files and interpreting and analyzing their text using OpenAI GPT. The script first reads the PDF file, breaking its content down into individual pages. Then, it processes the text of each page, breaking it down into paragraphs. Next, it sends the paragraphs to OpenAI GPT to obtain interpretations and analyses of the text. Finally, it converts the GPT responses to Markdown format and turns them back into PDF files. You can place PDF files you want to parse in the input folder, and the processed PDF files will be saved in the output folder. This version of the script employs multithreading to speed up the processing of a single PDF file.
 3 | 
 4 | 这是一个用于解析PDF文件并使用OpenAI GPT进行文本解释和分析的Python脚本。脚本首先读取PDF文件，将其内容分解为单独的页面。然后，对每个页面的文本进行处理，将其分解为段落。接下来，将段落发送到OpenAI GPT以获得对文本的解释和分析。最后，将GPT的响应转换为Markdown格式，并将其转换回PDF文件。您可以在 input 文件夹中放置需要解析的PDF文件，处理后的PDF文件将保存在 output 文件夹中。此版本的脚本采用多线程来加快单个PDF文件的处理速度。
 5 | 
 6 | ## Project Structure
 7 | 
 8 | - `input/` - Folder to place your input PDF files.
 9 | - `output/` - Folder where the processed PDF files will be saved.
10 | - `main.py` - Main script to run the project.
11 | - `chat_request.py` - Contains the Chat_gpt class, responsible for making API requests to the GPT model and managing concurrent threads for faster PDF processing.
12 | - `creator.py` - Contains the Creator class, responsible for processing a single PDF file by extracting text, splitting it into chunks, and utilizing the Chat_gpt class to obtain explanations and analysis.
13 | - `utils.py` - A utility file containing various helper functions such as reading PDF files, splitting text, fixing Markdown issues, and converting Markdown to PDF.
14 | - `.env` - File to store environment variables such as API key, language, GPT model, and wkhtmltopdf path.
15 | - `PDFReader.html` - A PDF reader to read two PDFs at the same time.
16 | 
17 | ## Set Up
18 | 1. Download the repository
19 | 2. Install the requirements
20 | ```shell
21 | pip install -r requirements.txt
22 | ```
23 | 3.  Install wkhtmltopdf
24 | 
25 | - Go to https://wkhtmltopdf.org/downloads.html to download `wkhtmltopdf`
26 | 
27 | 5.  Change your openai api key in env.txt (You can get your key on https://platform.openai.com/account/api-keys)
28 | 6.  Change your target language and model in env.txt (Chinese and gpt-3.5-turbo are set as default)
29 | 7.  Rename the file 'env.txt' to '.env'
30 | ## Usage
31 | 1. Copy your PDF files to `input/` folder
32 | 2. Start
33 | ```shell
34 | python main.py
35 | ```
36 | ## Read your PDFs
37 | By running PDFReader.html, you can read two PDFs at the same time, which is convenient for comparing PDFs.
38 | 
39 | ## General errors
40 | - Please make sure you have installed `wkhtmltopdf`
41 | - If you encounter a wkhtmltopdf error, please add the path to your wkhtmltopdf path in the .env file
42 | - For MacOS, the default path is `/usr/local/bin/wkhtmltopdf`. You can get your wkhtmltopdf path by 
43 | ```shell
44 | which wkhtmltopdf
45 | ```
46 | - For Windows, the default path is `C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe`. You can get your wkhtmltopdf path by 
47 | ```shell
48 | dir /s /b C:\ | findstr /i wkhtmltopdf.exe
49 | ```
50 | (I assume you installed it in C:\)
51 | - If you still encounter an error with wkhtmltopdf, consider granting administrator privileges to both your wkhtmltopdf and the project.
52 | - If you encounter any other issues, please feel free to open an issue on GitHub or contact me directly at my.qgong@gmail.com.
53 | 
54 | ## 未来发展
55 | - 适配GPT-4模型和图片读取功能：我们正在努力将GPT-4模型整合到项目中以提高解析和生成结果的质量。
56 | - 开发网页版：我们计划开发一个网页版并且添加更丰富的功能，让用户能够更方便地在线使用本工具。
57 | 
58 | ## Future Development
59 | 
60 | - Adapting to GPT-4 Model and Image Reading Capability: We are working on integrating the GPT-4 model into the project to improve the quality of parsing and generated results.
61 | - Developing a Web Version: We plan to develop a web-based version of this tool and add more functions to make it more accessible and convenient for users to use online.
62 | 


--------------------------------------------------------------------------------