├── LICENSE ├── README.md ├── generate_embedding.py ├── gif.gif ├── requirements.txt ├── run.py ├── static ├── css │ └── styles.css ├── js │ └── script.js └── send-icon.png └── templates └── index.html /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ChatPDFLike 2 | 3 | An end-to-end document question-answering application using large language model APIs. 4 | 5 | **Note**: This project is not affiliated with or endorsed by [ChatPDF](https://www.chatpdf.com/). This is an independent project attempting to replicate similar functionality. 6 | 7 | ## Overview 8 | 9 | ChatPDF-Like is a web application that allows users to upload PDF documents and interact with them using natural language queries. The application leverages large language models (LLMs) like OpenAI's GPT-3.5 Turbo to understand the content of the PDF and provide concise and accurate answers to user questions. 10 | 11 | ## Features 12 | 13 | - **PDF Document Upload**: Upload local PDF files or provide a URL to a PDF document. 14 | - **Natural Language Interaction**: Ask questions about the content of the PDF in natural language. 15 | - **Relevant Answers**: Receive concise answers based on the content of the document. 
16 | - **Source References**: View sources (sections of the PDF) that were used to generate the answer. 17 | - **Multiple LLM Providers**: Support for both OpenAI and Ollama models. 18 | - **Web Interface**: Simple and intuitive web interface built with Flask and JavaScript. 19 | 20 | ## How It Works 21 | 22 | The application follows these main steps: 23 | 24 | 1. **Text Extraction and Processing**: 25 | - The PDF is parsed using `PyPDF2`. 26 | - Text is extracted from each page, and large pieces of text are split into manageable chunks. 27 | 28 | 2. **Embedding Generation**: 29 | - For each text chunk, an embedding vector is generated using the selected embedding model (e.g., OpenAI's `text-embedding-ada-002`). 30 | - These embeddings represent the semantic meaning of the text chunks and are stored for similarity calculations. 31 | 32 | 3. **User Query Handling**: 33 | - When a user asks a question, an embedding vector for the query is generated using the same embedding model. 34 | 35 | 4. **Similarity Search**: 36 | - The application computes the cosine similarity between the query embedding and the text chunk embeddings. 37 | - The most relevant text chunks are selected based on the highest similarity scores. 38 | 39 | 5. **Prompt Construction**: 40 | - A prompt is created for the language model, incorporating the user's question and the most relevant text chunks. 41 | 42 | 6. **Answer Generation**: 43 | - The prompt is sent to the language model (e.g., OpenAI's GPT-3.5 Turbo). 44 | - The model generates an answer to the user's question based on the provided context. 45 | 46 | 7. **Response Display**: 47 | - The answer is displayed to the user in the web interface. 48 | - References to the source text chunks are also provided for transparency. 49 | 50 | ## Getting Started 51 | 52 | ### Prerequisites 53 | 54 | - **Python**: Version 3.6 or higher is required. 
55 | - **API Keys**: 56 | - **OpenAI API Key**: Required to use OpenAI's models for embeddings and answer generation. 57 | - **Ollama API Key**: Optional. Required if you want to use Ollama models. 58 | 59 | ### Installation 60 | 61 | 1. **Clone the Repository** 62 | 63 | ```bash 64 | git clone https://github.com/Ulov888/chatpdflike.git 65 | cd chatpdflike 66 | ``` 67 | 68 | 2. **Install Dependencies** 69 | 70 | Using `pip`, install the required packages: 71 | 72 | ```bash 73 | pip install -r requirements.txt 74 | ``` 75 | 76 | ### API Keys 77 | 78 | To use OpenAI's API: 79 | 80 | 1. Sign up for an API key at [OpenAI](https://platform.openai.com/account/api-keys). 81 | 2. Set the `OPENAI_API_KEY` environment variable: 82 | 83 | ```bash 84 | export OPENAI_API_KEY="your_openai_api_key" 85 | ``` 86 | 87 | To use Ollama's API (if desired): 88 | 89 | 1. Obtain an API key from Ollama. 90 | 2. Set the `OLLAMA_API_KEY` environment variable: 91 | 92 | ```bash 93 | export OLLAMA_API_KEY="your_ollama_api_key" 94 | ``` 95 | 96 | ## Usage 97 | 98 | 1. **Start the Application** 99 | 100 | Run the Flask application: 101 | 102 | ```bash 103 | python run.py 104 | ``` 105 | 106 | By default, the server runs on `http://0.0.0.0:8080`. 107 | 108 | 2. **Access the Web Interface** 109 | 110 | Open a web browser and navigate to `http://localhost:8080`. 111 | 112 | 3. **Upload a PDF Document** 113 | 114 | You can either: 115 | 116 | - Click on "Upload PDF" to select and upload a PDF file from your computer. 117 | - Enter a URL to a PDF document and click "Submit". 118 | 119 | 4. **Interact with the PDF** 120 | 121 | - Once the PDF is processed, you can ask questions about its content using the chat interface on the right side of the screen. 122 | - Type your question in the input box and press "Send". 123 | 124 | 5. **View Answers** 125 | 126 | - The application's response will appear below your question. 127 | - Source references (e.g., page numbers and excerpts) are provided for context. 
128 | 129 |  130 | 131 | ## Customization 132 | 133 | ### Prompt Strategies 134 | 135 | The behavior of the language model can be customized by modifying the prompt strategies in `generate_embedding.py`, specifically in the `create_prompt` method of the `Chatbot` class. 136 | 137 | Strategies include: 138 | 139 | - **Paper**: For summarizing scientific papers. 140 | - **Handbook**: For summarizing financial handbooks (answers in Chinese). 141 | - **Contract**: For understanding contracts (answers in Chinese). 142 | - **Default**: General-purpose strategy (answers in Chinese). 143 | 144 | To select a strategy, you can modify the `strategy` parameter when calling `create_prompt`. 145 | 146 | ### Language and Output 147 | 148 | The application is currently configured to provide answers in Chinese for some strategies. You can modify the prompts to change the language or adjust the behavior of the model. 149 | 150 | ## Limitations 151 | 152 | - **OpenAI API Costs**: Using OpenAI's API will incur costs based on usage. Make sure to monitor your API usage to avoid unexpected charges. 153 | - **PDF Parsing**: The application uses `PyPDF2`, which may not handle all PDFs perfectly. Complex PDFs with unusual formatting may not parse correctly. 154 | - **Embedding Limits**: The maximum token limit for embeddings may restrict the size of text chunks or the maximum length of the prompt. 155 | - **Model Responses**: The quality and accuracy of the answers depend on the performance of the language model and the relevance of the retrieved text chunks. 156 | 157 | ## Contributing 158 | 159 | Contributions are welcome! If you have any suggestions or improvements, feel free to submit an issue or pull request. 160 | 161 | ## License 162 | 163 | This project is licensed under the [Apache License](LICENSE). 
164 | -------------------------------------------------------------------------------- /generate_embedding.py: -------------------------------------------------------------------------------- 1 | import logging as logger 2 | import ollama 3 | import openai 4 | import os 5 | import os 6 | import pandas as pd 7 | from flask_cors import CORS 8 | from openai.embeddings_utils import get_embedding, cosine_similarity 9 | 10 | openai.api_key = os.getenv('OPENAI_API_KEY') 11 | 12 | class Chatbot(): 13 | def parse_paper(self, pdf): 14 | logger.info("Parsing paper") 15 | number_of_pages = len(pdf.pages) 16 | logger.info(f"Total number of pages: {number_of_pages}") 17 | paper_text = [] 18 | for i in range(number_of_pages): 19 | page = pdf.pages[i] 20 | page_text = [] 21 | 22 | def visitor_body(text, cm, tm, fontDict, fontSize): 23 | x = tm[4] 24 | y = tm[5] 25 | # ignore header/footer 26 | if (y > 50 and y < 720) and (len(text.strip()) > 1): 27 | page_text.append({ 28 | 'fontsize': fontSize, 29 | 'text': text.strip().replace('\x03', ''), 30 | 'x': x, 31 | 'y': y 32 | }) 33 | 34 | _ = page.extract_text(visitor_text=visitor_body) 35 | 36 | blob_font_size = None 37 | blob_text = '' 38 | processed_text = [] 39 | 40 | for t in page_text: 41 | if t['fontsize'] == blob_font_size: 42 | blob_text += f" {t['text']}" 43 | if len(blob_text) >= 2000: 44 | processed_text.append({ 45 | 'fontsize': blob_font_size, 46 | 'text': blob_text, 47 | 'page': i 48 | }) 49 | blob_font_size = None 50 | blob_text = '' 51 | else: 52 | if blob_font_size is not None and len(blob_text) >= 1: 53 | processed_text.append({ 54 | 'fontsize': blob_font_size, 55 | 'text': blob_text, 56 | 'page': i 57 | }) 58 | blob_font_size = t['fontsize'] 59 | blob_text = t['text'] 60 | paper_text += processed_text 61 | logger.info("Done parsing paper") 62 | return paper_text 63 | 64 | def paper_df(self, pdf): 65 | logger.info('Creating dataframe') 66 | filtered_pdf= [] 67 | for row in pdf: 68 | if len(row['text']) < 30: 69 | 
continue 70 | if len(row['text']) > 8000: 71 | row['text'] = row['text'][:8000] 72 | filtered_pdf.append(row) 73 | df = pd.DataFrame(filtered_pdf) 74 | # remove elements with identical df[text] and df[page] values 75 | df = df.drop_duplicates(subset=['text', 'page'], keep='first') 76 | df['length'] = df['text'].apply(lambda x: len(x)) 77 | logger.info('Done creating df') 78 | return df 79 | 80 | def calculate_embeddings(self, df): 81 | logger.info('Calculating embeddings') 82 | embedding_model = "text-embedding-ada-002" 83 | embeddings = df.text.apply([lambda x: get_embedding(x, engine=embedding_model)]) 84 | df["embeddings"] = embeddings 85 | logger.info('Done calculating embeddings') 86 | return df 87 | 88 | def search_embeddings(self, df, query, n=2, pprint=True): 89 | query_embedding = get_embedding( 90 | query, 91 | engine="text-embedding-ada-002" 92 | ) 93 | df["similarity"] = df.embeddings.apply(lambda x: cosine_similarity(x, query_embedding)) 94 | 95 | results = df.sort_values("similarity", ascending=False, ignore_index=True) 96 | results = results.head(n) 97 | global sources 98 | sources = [] 99 | for i in range(n): 100 | # append the page number and the text as a dict to the sources list 101 | sources.append({'Page '+str(results.iloc[i]['page']): results.iloc[i]['text'][:150]+'...'}) 102 | print(sources) 103 | return results.head(n) 104 | 105 | def create_prompt(self, df, user_input, strategy=None): 106 | result = self.search_embeddings(df, user_input) 107 | if strategy == "paper": 108 | prompt = """You are a large language model whose expertise is reading and summarizing scientific papers. 109 | You are given a query and a series of text embeddings from a paper in order of their cosine similarity to the query. 110 | You must take the given embeddings and return a very detailed summary of the paper that answers the query. 
111 | Given the question: """+ user_input + """ 112 | 113 | and the following embeddings as data: 114 | 115 | 1.""" + str(result.iloc[0]['text']) + """ 116 | 2.""" + str(result.iloc[1]['text']) + """ 117 | 118 | Return a concise and accurate answer:""" 119 | elif strategy == "handbook": 120 | prompt = """You are a large language model whose expertise is reading and summarizing financial handbook. 121 | You are given a query and a series of text embeddings from a handbook in order of their cosine similarity to the query. 122 | You must take the given embeddings and return a very detailed answer in Chinese of the handbook that answers the query. 123 | If not necessary, your answer please use the original text as much as possible. 124 | You should also ensure that your response is written in clear and concise Chinese, using appropriate grammar and vocabulary. 125 | Additionally, your response should focus on answering the specific query provided.. 126 | Given the question: """+ user_input + """ 127 | and the following embeddings as data: 128 | 129 | 1.""" + str(result.iloc[0]['text']) + """ 130 | 2.""" + str(result.iloc[1]['text']) + """ 131 | 132 | Return a concise and accurate answer:""" 133 | elif strategy == "contract": 134 | prompt = """As a large language model specializing in reading and summarizing, your task is to read a query and a sequence of text inputs sorted by their cosine similarity to the query. 135 | Your goal is to provide a Chinese answer to the query using the given padding. If possible, please use the original text of your answer. 136 | Please ensure that your response adheres to the terms of the agreement. Your response should focus on addressing the specific query provided, 137 | providing relevant information and details based on the input texts' content. You should also strive for clarity and conciseness in your response, 138 | summarizing key points while maintaining accuracy and relevance. 
Please note that you should prioritize understanding the context and meaning 139 | behind both the query and input texts before generating a response. 140 | Given the question: """+ user_input + """ 141 | and the following embeddings as data: 142 | 143 | 1.""" + str(result.iloc[0]['text']) + """ 144 | 2.""" + str(result.iloc[1]['text']) + """ 145 | 146 | Return a concise and accurate answer:""" 147 | else: 148 | prompt = """As a language model specialized in reading and summarizing documents, your task is to provide a concise answer in Chinese based on a given query and a series of text embeddings from the document. 149 | The embeddings are provided in order of their cosine similarity to the query. Your response should use as much original text as possible. 150 | Your answer should be highly concise and accurate, providing relevant information that directly answers the query. 151 | You should also ensure that your response is written in clear and concise Chinese, using appropriate grammar and vocabulary. 152 | Please note that you must use the provided text embeddings to generate your response, which means you will need to understand how they relate to the original document. 153 | Additionally, your response should focus on answering the specific query provided.. 
154 | Given the question: """+ user_input + """ 155 | 156 | and the following embeddings as data: 157 | 158 | 1.""" + str(result.iloc[0]['text']) + """ 159 | 2.""" + str(result.iloc[1]['text']) + """ 160 | 161 | Return a concise and accurate answer:""" 162 | logger.info('Done creating prompt') 163 | return prompt 164 | 165 | def response(self, df, prompt): 166 | logger.info('Sending request to GPT-3') 167 | prompt = self.create_prompt(df, prompt) 168 | r = openai.ChatCompletion.create(model="gpt-3.5-turbo", 169 | messages=[{"role": "user", "content": prompt}, 170 | ]) 171 | answer = r.choices[0]['message']['content'] 172 | logger.info('Done sending request to GPT-3') 173 | response = {'answer': answer, 'sources': sources} 174 | return response 175 | 176 | 177 | class OllamaChatbot(Chatbot): 178 | 179 | def __init__(self): 180 | self.ollama_api_key = os.getenv('OLLAMA_API_KEY') 181 | ollama.api_key = self.ollama_api_key 182 | 183 | def get_ollama_embedding(self, text): 184 | response = ollama.embed(model='llama3.1', input=text) 185 | return response['embedding'] 186 | 187 | def calculate_ollama_embeddings(self, df): 188 | logger.info('Calculating embeddings using Ollama') 189 | embeddings = df.text.apply(lambda x: self.get_ollama_embedding(x)) 190 | df["embeddings"] = embeddings 191 | logger.info('Done calculating embeddings') 192 | return df 193 | 194 | def ollama_response(self, df, prompt): 195 | logger.info('Sending request to Ollama') 196 | response = ollama.chat(model='llama3.1', messages=[{"role": "user", "content": prompt}]) 197 | answer = response['message']['content'] 198 | logger.info('Done sending request to Ollama') 199 | response = {'answer': answer, 'sources': sources} 200 | return response 201 | -------------------------------------------------------------------------------- /gif.gif: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Ulov888/chatpdflike/26dedce74609d48eccc80bede66fe9cc3c871f89/gif.gif -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | flask 2 | PyPDF2 3 | pandas 4 | openai 5 | requests 6 | flask-cors 7 | matplotlib 8 | scipy 9 | plotly 10 | google-cloud-storage 11 | gunicorn==20.1.0 12 | scikit-learn==0.24.1 13 | ollama 14 | -------------------------------------------------------------------------------- /run.py: -------------------------------------------------------------------------------- 1 | from flask import Flask, request, render_template 2 | from flask_cors import CORS 3 | from generate_embedding import Chatbot, OllamaChatbot 4 | from io import BytesIO 5 | from PyPDF2 import PdfReader 6 | import requests 7 | 8 | 9 | app = Flask(__name__) 10 | CORS(app) 11 | 12 | @app.route("/", methods=["GET", "POST"]) 13 | def index(): 14 | return render_template("index.html") 15 | 16 | 17 | @app.route("/process_pdf", methods=['POST']) 18 | def process_pdf(): 19 | print("Processing pdf") 20 | file = request.data 21 | pdf = PdfReader(BytesIO(file)) 22 | chatbot = Chatbot() # Default to OpenAI 23 | if request.args.get('provider') == 'ollama': 24 | chatbot = OllamaChatbot() 25 | paper_text = chatbot.parse_paper(pdf) 26 | global df 27 | df = chatbot.paper_df(paper_text) 28 | df = chatbot.calculate_embeddings(df) 29 | print("Done processing pdf") 30 | return {'answer': ''} 31 | 32 | 33 | @app.route("/download_pdf", methods=['POST']) 34 | def download_pdf(): 35 | chatbot = Chatbot() # Default to OpenAI 36 | if request.args.get('provider') == 'ollama': 37 | chatbot = OllamaChatbot() 38 | url = request.json['url'] 39 | r = requests.get(str(url)) 40 | print(r.headers) 41 | pdf = PdfReader(BytesIO(r.content)) 42 | paper_text = chatbot.parse_paper(pdf) 43 | global df 44 | df = chatbot.paper_df(paper_text) 45 | df = 
chatbot.calculate_embeddings(df) 46 | print("Done processing pdf") 47 | return {'key': ''} 48 | 49 | 50 | @app.route("/reply", methods=['POST']) 51 | def reply(): 52 | chatbot = Chatbot() # Default to OpenAI 53 | if request.args.get('provider') == 'ollama': 54 | chatbot = OllamaChatbot() 55 | query = request.json['query'] 56 | query = str(query) 57 | prompt = chatbot.create_prompt(df, query) 58 | response = chatbot.response(df, prompt) 59 | print(response) 60 | return response, 200 61 | 62 | 63 | if __name__ == '__main__': 64 | app.run(host='0.0.0.0', port=8080, debug=True) 65 | -------------------------------------------------------------------------------- /static/css/styles.css: -------------------------------------------------------------------------------- 1 | html { 2 | height: 100%; 3 | position: fixed; 4 | } 5 | 6 | body { 7 | margin: 0; 8 | position: fixed; 9 | background-color: white; 10 | display: flex; 11 | flex-direction: column; 12 | align-items: center; 13 | height: 100%; 14 | width: 100%; 15 | } 16 | 17 | .upload-btn { 18 | display: inline-block; 19 | padding: 12px 30px; 20 | background-color: #4285F4; 21 | color: #fff; 22 | font-size: 16px; 23 | font-weight: 500; 24 | border-radius: 4px; 25 | cursor: pointer; 26 | transition: background-color 0.3s ease; 27 | } 28 | 29 | input[type="file"] { 30 | display: none; 31 | } 32 | 33 | form { 34 | margin: auto 0; 35 | } 36 | 37 | input[type="text"] { 38 | border: none; 39 | outline: none; 40 | } 41 | 42 | #container { 43 | display: none; 44 | width: 100%; 45 | height: 100vh; 46 | flex-direction: row; 47 | } 48 | 49 | .pdf-viewer { 50 | width: 100%; 51 | height: 100vh; 52 | display: none; 53 | } 54 | 55 | #chat { 56 | width: 100%; 57 | } 58 | 59 | #chat p { 60 | color: white; 61 | margin: 10px; 62 | padding: 10px; 63 | border-radius: 10px; 64 | width: fit-content; 65 | text-align: left; 66 | word-break: break-all; 67 | font-family: Roboto; 68 | font-size: 18px; 69 | font-weight: lighter; 70 | } 71 | 
/* Pill-shaped query bar centred under the heading. */
.search-container {
  display: flex;
  align-items: center;
  /*margin: 0 auto;*/
  max-width: 600px;
  height: 50px;
  border: 2px solid #ccc;
  border-radius: 25px;
  overflow: hidden;
  padding: 0 10px;
  background-color: #fff;
  box-shadow: 0px 2px 5px rgba(0, 0, 0, 0.2);
  margin-top: 150px;
  margin-bottom: 20px;
}

.search-container input[type="text"] {
  flex: 1;
  border: none;
  font-size: 18px;
  height: 100%;
  padding: 0 15px;
  background-color: transparent;
  color: #666;
  font-weight: bold;
  outline: none;
}

.search-container input[type="text"]::placeholder {
  color: #ccc;
  font-weight: normal;
}

.search-container button {
  border: none;
  background-color: #4285F4;
  color: #fff;
  font-size: 18px;
  cursor: pointer;
  padding: 10px 15px;
  border-radius: 25px;
  box-shadow: 0px 2px 5px rgba(0, 0, 0, 0.2);
  transition: background-color 0.2s ease-in-out;
}

.search-container button:hover {
  background-color: #2962FF;
}

/* Narrow screens: full-width, flat search bar. */
@media only screen and (max-width: 600px) {
  .search-container {
    max-width: 100%;
    margin: 0 10px;
    height: auto;
    border-radius: 0;
    box-shadow: none;
  }

  .search-container input[type="text"] {
    font-size: 16px;
    padding: 10px 15px;
  }

  .search-container button {
    font-size: 16px;
    padding: 10px 12px;
    border-radius: 20px;
  }
}



/* "Upload PDF" label styled as a button (the real file input is hidden
   via input[type="file"] { display: none; }). */
.file-upload {
  display: inline-block;
  padding: 12px 30px;
  background-color: #4285F4;
  color: #fff;
  font-size: 16px;
  font-weight: 500;
  border-radius: 4px;
  cursor: pointer;
  transition: background-color 0.3s ease;
}

.file-upload:hover {
  background-color: #3367D6;
}

-------------------------------------------------------------------------------- /static/js/script.js: -------------------------------------------------------------------------------- 1 | document.addEventListener("DOMContentLoaded", function() { 2 | // This file contains the JavaScript code for the web app 3 | 4 | const input = document.querySelector("input[type='file']"); 5 | var uploadBtn = document.querySelector(".upload-btn"); 6 | const viewer = document.querySelector("#pdf-viewer"); 7 | const container = document.querySelector("#container"); 8 | var x = document.querySelector("input[name='pdf-url']"); 9 | const form = document.querySelector("form"); 10 | const p = document.querySelector("p"); 11 | const up = document.querySelector("#up"); 12 | const y = document.querySelector("#url"); 13 | const send = document.querySelector("#send"); 14 | 15 | 16 | send.addEventListener("click", function(event) { 17 | event.preventDefault(); 18 | const message = document.querySelector("input[name='chat']").value; 19 | // if the message is empty, do nothing 20 | if (message === "") { 21 | return; 22 | } 23 | const chat = document.querySelector("#chat"); 24 | const query = document.createElement("p"); 25 | query.innerHTML = message; 26 | chat.appendChild(query); 27 | 28 | const loading = document.createElement("p"); 29 | loading.style.color = "lightgray"; 30 | loading.style.fontSize = "14px"; 31 | loading.innerHTML = "Loading..."; 32 | chat.appendChild(loading); 33 | 34 | // call the endpoint /reply with the message and get the reply. 
35 | fetch('/reply', { 36 | method: 'POST', 37 | body: JSON.stringify({'query': message, 'key': window.key}), 38 | headers: { 39 | 'Content-Type': 'application/json' 40 | } 41 | }) 42 | .then(response => response.json()) 43 | // Append the reply to #chat as a simple paragraph without any styling 44 | .then(data => { 45 | console.log(data.answer); 46 | chat.removeChild(loading); 47 | 48 | const reply = document.createElement("p"); 49 | reply.style.color = "lightgray"; 50 | reply.style.marginBottom = "0px"; 51 | reply.style.paddingTop = "0px"; 52 | reply.innerHTML = data.answer; 53 | chat.appendChild(reply); 54 | chat.scrollTop = chat.scrollHeight; 55 | 56 | const sources = data.sources; 57 | console.log(sources) 58 | // console.log(typeof JSON.parse(sources)) 59 | sources.forEach(function(source) { 60 | for (var page in source) { 61 | var p = document.createElement("p"); 62 | p.style.color = "gray"; 63 | p.style.fontSize = "12px"; 64 | p.style.fontWeight = "bold"; 65 | p.style.marginTop = "0px"; 66 | p.style.marginBottom = "0px"; 67 | p.style.paddingTop = "0px"; 68 | p.style.paddingBottom = "5px"; 69 | p.innerHTML = page + ": " + "'"+source[page];+"'" 70 | chat.appendChild(p); 71 | } 72 | }); 73 | }) 74 | .catch(error => { 75 | chat.removeChild(loading); 76 | console.error(error); 77 | 78 | const errorMessage = document.createElement("p"); 79 | errorMessage.style.color = "red"; 80 | errorMessage.style.marginBottom = "0px"; 81 | errorMessage.style.paddingTop = "0px"; 82 | errorMessage.innerHTML = "Error: Request to OpenAI failed. 
Please try again."; 83 | chat.appendChild(errorMessage); 84 | chat.scrollTop = chat.scrollHeight; 85 | }); 86 | document.querySelector("input[name='chat']").value = ""; 87 | }); 88 | 89 | x.addEventListener("focus", function() { 90 | if (this.value === "Enter URL") { 91 | this.value = ""; 92 | this.style.color = "black"; 93 | } 94 | }); 95 | 96 | y.addEventListener("submit", function(event) { 97 | event.preventDefault(); 98 | const url = this.elements["pdf-url"].value; 99 | if (url === "") { 100 | return; 101 | } 102 | // if the url does not end with .pdf, make x.value = "Error: URL does not end with .pdf" 103 | if (!url.endsWith(".pdf")) { 104 | x.value = "Error: URL does not end with .pdf"; 105 | return; 106 | } 107 | x.value = "Loading..."; 108 | console.log(url); 109 | fetch(url) 110 | .then(response => response.blob()) 111 | .then(pdfBlob => { 112 | console.log(pdfBlob); 113 | const pdfUrl = URL.createObjectURL(pdfBlob); 114 | pdfjsLib.getDocument(pdfUrl).promise.then(pdfDoc => { 115 | viewer.src = pdfUrl; 116 | uploadBtn.style.display = "none"; 117 | form.style.display = "none"; 118 | form.style.marginTop = "0px"; 119 | p.style.display = "none"; 120 | up.style.display = "none"; 121 | container.style.display = "flex"; 122 | viewer.style.display = "block"; 123 | }); 124 | }) 125 | .catch(error => { 126 | console.error(error); 127 | }); 128 | var loading = document.createElement("p"); 129 | loading.style.color = "lightgray"; 130 | loading.style.fontSize = "14px"; 131 | loading.innerHTML = "Calculating embeddings..."; 132 | chat.appendChild(loading); 133 | 134 | // Make a POST request to the server 'myserver/download-pdf' with the URL 135 | fetch('/download_pdf', { 136 | method: 'POST', 137 | body: JSON.stringify({'url': url}), 138 | headers: { 139 | 'Content-Type': 'application/json', 140 | 'Access-Control-Allow-Origin': '*', 141 | 'Access-Control-Allow-Methods': 'GET, POST, PUT, DELETE, PATCH, OPTIONS', 142 | 'Access-Control-Allow-Headers': 'Content-Type, 
Authorization' 143 | } 144 | }) 145 | .then(response => response.json()) 146 | // Append the reply to #chat as a simple paragraph without any styling 147 | .then(data => { 148 | chat.removeChild(loading); 149 | window.key = data.key; 150 | }) 151 | .catch(error => { 152 | uploadBtn.innerHTML = "Error: Request to server failed. Please try again. Check the URL if there is https:// at the beginning. If not, add it."; 153 | x.innerHTML = "Error: Request to server failed. Please try again. Check the URL if there is https:// at the beginning. If not, add it."; 154 | console.error(error); 155 | }); 156 | }); 157 | 158 | input.addEventListener("change", async function() { 159 | const file = this.files[0]; 160 | const fileArrayBuffer = await file.arrayBuffer(); 161 | console.log(fileArrayBuffer); 162 | 163 | var loading = document.createElement("p"); 164 | loading.style.color = "lightgray"; 165 | loading.style.fontSize = "14px"; 166 | loading.innerHTML = "Calculating embeddings..."; 167 | chat.appendChild(loading); 168 | 169 | // Make a post request to /process_pdf with the file 170 | fetch('/process_pdf', { 171 | method: 'POST', 172 | body: fileArrayBuffer, 173 | headers: { 174 | 'Content-Type': 'application/pdf', 175 | 'Content-Length': fileArrayBuffer.byteLength, 176 | 'Access-Control-Allow-Origin': '*', 177 | 'Access-Control-Allow-Methods': 'GET, POST, PUT, DELETE, PATCH, OPTIONS', 178 | 'Access-Control-Allow-Headers': 'Content-Type, Authorization' 179 | } 180 | }) 181 | .then(response => response.json()) 182 | // Append the reply to #chat as a simple paragraph without any styling 183 | .then(data => { 184 | chat.removeChild(loading); 185 | window.key = data.key; 186 | }) 187 | .catch(error => { 188 | loading.innerHTML = "Error: Processing the pdf failed due to excess load. Please try again later. Check the URL if there is https:// at the beginning. 
If not, add it."; 189 | console.error(error); 190 | }); 191 | 192 | pdfjsLib.getDocument(fileArrayBuffer).promise.then(pdfDoc => { 193 | viewer.src = URL.createObjectURL(file); 194 | uploadBtn.style.display = "none"; 195 | form.style.display = "none"; 196 | form.style.marginTop = "0px"; 197 | p.style.display = "none"; 198 | up.style.display = "none"; 199 | container.style.display = "flex"; 200 | viewer.style.display = "block"; 201 | }).catch(error => { 202 | console.error(error); 203 | }); 204 | }); 205 | }); 206 | -------------------------------------------------------------------------------- /static/send-icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Ulov888/chatpdflike/26dedce74609d48eccc80bede66fe9cc3c871f89/static/send-icon.png -------------------------------------------------------------------------------- /templates/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 |
4 | 5 | 6 | 7 | 8 | 9 | 13 | 14 |or
15 | 16 | 20 | 21 |