├── .gitattributes ├── .gitignore ├── LICENSE ├── README.md ├── document_index.faiss ├── local_genai_search.py ├── metadata.json └── requirements.txt /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .venv/* -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 imanoop7 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Local GenAI Search 2 | 3 | Local GenAI Search is an AI-powered document search and question-answering system that works with your local files. It uses advanced natural language processing techniques to understand and answer questions based on the content of your documents. 4 | 5 | ## Features 6 | 7 | - Index and search through various document types (PDF, DOCX, PPTX, TXT) 8 | - Extract and process images from documents 9 | - Use AI to generate answers to questions based on document content 10 | - User-friendly Streamlit interface 11 | - Local processing for data privacy 12 | 13 | ## Requirements 14 | 15 | - Python 3.7+ 16 | - Ollama (for running local AI models) 17 | 18 | ## Installation 19 | 20 | 1. Clone this repository: ``` 21 | git clone https://github.com/imanoop7/Generative-Search-Engine-For-Local-Files-v2 22 | cd Generative-Search-Engine-For-Local-Files-v2 ``` 23 | 24 | 2. Create a virtual environment and activate it: ``` 25 | python -m venv .venv 26 | source .venv/bin/activate # On Windows, use `.venv\Scripts\activate` ``` 27 | 28 | 3. Install the required packages: ``` 29 | pip install -r requirements.txt ``` 30 | 31 | 4. Install Ollama by following the instructions at [https://ollama.ai/](https://ollama.ai/) 32 | 33 | 5. Pull the required models: ``` 34 | ollama pull tinyllama 35 | ollama pull llava ``` 36 | 37 | ## Usage 38 | 39 | 1. Start the Streamlit app: ``` 40 | streamlit run local_genai_search.py ``` 41 | 42 | 2. 
Open your web browser and go to the URL displayed in the terminal (usually `http://localhost:8501`). 43 | 44 | 3. Enter the path to your documents folder in the text input field. 45 | 46 | 4. Click the "Index Documents" button to process and index your documents. This may take some time depending on the number and size of your documents. 47 | 48 | 5. Once indexing is complete, you can start asking questions about your documents in the "Ask a Question" section. 49 | 50 | 6. Click "Search and Answer" to get AI-generated answers based on your document content. 51 | 52 | ## How it Works 53 | 54 | 1. Document Indexing: The system reads and processes various document types, extracting text and images. It then creates embeddings for the text content using a pre-trained sentence transformer model. 55 | 56 | 2. Semantic Search: When you ask a question, the system converts it into an embedding and searches for the most similar content in your indexed documents. 57 | 58 | 3. Answer Generation: The system uses the Ollama API to generate an answer based on the question and the most relevant document content found during the search. 59 | 60 | 4. Image Processing: For documents containing images, the system uses the LLaVA model to generate descriptions, which are then included in the search index. 61 | 62 | ## Troubleshooting 63 | 64 | If you encounter any issues: 65 | 66 | 1. Make sure all required packages are installed correctly. 67 | 2. Ensure Ollama is running and the required models (tinyllama and llava) are downloaded. 68 | 3. Check the console output for any error messages. 69 | 70 | ## Contributing 71 | 72 | Contributions are welcome! Please feel free to submit a Pull Request. 73 | 74 | ## License 75 | 76 | This project is licensed under the MIT License - see the LICENSE file for details. 
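## Appendix: Retrieval in a Nutshell

The "How it Works" section above describes the core loop: embed document chunks, add them to a FAISS index, then embed each question and look up the nearest chunks. The sketch below illustrates that loop in isolation. It reuses the embedding model and index type from `local_genai_search.py`, but the toy chunks and surrounding code are illustrative only; the real app adds chunking, metadata tracking, image description, and Ollama-based answer generation on top.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Same embedding model and index type that local_genai_search.py uses
model = SentenceTransformer('sentence-transformers/msmarco-bert-base-dot-v5')
index = faiss.IndexFlatIP(768)  # inner-product index over 768-dim embeddings

# Toy "document chunks" standing in for your indexed files
chunks = [
    "FAISS is a library for efficient similarity search.",
    "Ollama runs large language models on your own machine.",
]
index.add(np.array(model.encode(chunks)))

# A question is embedded the same way, then matched against the index
scores, ids = index.search(np.array(model.encode(["What runs models locally?"])), k=1)
print(chunks[ids[0][0]], scores[0][0])
```

In the app, the retrieved chunks (plus LLaVA descriptions of any extracted images) become the context that `tinyllama` cites in its answer.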
77 | 78 | ## Acknowledgments 79 | 80 | - Streamlit for the amazing web app framework 81 | - Sentence Transformers for text embeddings 82 | - Ollama for local AI model inference 83 | - FAISS for efficient similarity search 84 | -------------------------------------------------------------------------------- /document_index.faiss: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/imanoop7/Generative-Search-Engine-For-Local-Files-v2/a321421590b833b39cd69f83f13da24606c35b21/document_index.faiss -------------------------------------------------------------------------------- /local_genai_search.py: -------------------------------------------------------------------------------- 1 | import os 2 | import faiss 3 | import numpy as np 4 | from sentence_transformers import SentenceTransformer 5 | import PyPDF2 6 | import docx 7 | from pptx import Presentation 8 | import json 9 | import streamlit as st 10 | import re 11 | import ollama 12 | from streamlit_lottie import st_lottie 13 | import requests 14 | from PIL import Image 15 | import io 16 | import base64 17 | import fitz  # PyMuPDF for PDF image extraction 18 | from docx import Document 19 | from pptx import Presentation 20 | from pptx.enum.shapes import MSO_SHAPE_TYPE 21 | import tempfile 22 | 23 | print("Starting the application...") 24 | 25 | # Image extraction helpers (defined before they are used below) 26 | def extract_images_from_pdf(file_path): 27 | images = [] 28 | doc = fitz.open(file_path) 29 | for page in doc: 30 | image_list = page.get_images(full=True) 31 | for img_index, img in enumerate(image_list): 32 | xref = img[0] 33 | base_image = doc.extract_image(xref) 34 | image_bytes = base_image["image"] 35 | image = Image.open(io.BytesIO(image_bytes)) 36 | images.append(image) 37 | return images 38 | 39 | def extract_images_from_docx(file_path): 40 | images = [] 41 | doc = Document(file_path) 42 | for rel in doc.part.rels.values(): 43 | if "image" in rel.target_ref: 44 | image_data = rel.target_part.blob 45 | image = Image.open(io.BytesIO(image_data)) 46 | images.append(image) 47 | return images 48 | 49 | def extract_images_from_pptx(file_path): 50 | images = [] 51 | prs = Presentation(file_path) 52 | for slide in prs.slides: 53 | for shape in slide.shapes: 54 | if shape.shape_type == MSO_SHAPE_TYPE.PICTURE: 55 | image = shape.image 56 | image_bytes = image.blob 57 | image = Image.open(io.BytesIO(image_bytes)) 58 | images.append(image) 59 | return images 60 | 61 | # Global variables 62 | model = SentenceTransformer('sentence-transformers/msmarco-bert-base-dot-v5') 63 | dimension = 768 64 | index = faiss.IndexFlatIP(dimension) 65 | metadata = [] 66 | 67 | print(f"Initialized model and FAISS index with dimension {dimension}") 68 | 69 | # Document reading functions 70 | def read_pdf(file_path): 71 | print(f"Reading PDF: {file_path}") 72 | with open(file_path, 'rb') as file: 73 | reader = PyPDF2.PdfReader(file) 74 | return ' '.join([(page.extract_text() or '') for page in reader.pages])  # extract_text() can return None on image-only pages 75 | 76 | def read_docx(file_path): 77 | print(f"Reading DOCX: {file_path}") 78 | doc = docx.Document(file_path) 79 | return ' '.join([para.text for para in doc.paragraphs]) 80 | 81 | def read_pptx(file_path): 82 | print(f"Reading PPTX: {file_path}") 83 | prs = Presentation(file_path) 84 | return ' '.join([shape.text for slide in prs.slides for shape in slide.shapes if hasattr(shape, 'text')]) 85 | 86 | def chunk_text(text, chunk_size=500, overlap=50): 87 | print(f"Chunking text of length {len(text)} with chunk size {chunk_size} and overlap {overlap}")
88 | words = text.split() 89 | chunks = [] 90 | for i in range(0, len(words), chunk_size - overlap): 91 | chunk = ' '.join(words[i:i + chunk_size]) 92 | chunks.append(chunk) 93 | print(f"Created {len(chunks)} chunks") 94 | return chunks 95 | 96 | def process_image(image): 97 | print("Processing image") 98 | with tempfile.NamedTemporaryFile(delete=False, suffix='.png') as temp_file: 99 | image.save(temp_file, format='PNG') 100 | return temp_file.name 101 | 102 | # Indexing function 103 | def index_documents(directory): 104 | print(f"Indexing documents in directory: {directory}") 105 | global metadata 106 | documents = [] 107 | metadata.clear()  # rebuild metadata from scratch, in lockstep with documents 108 | file_count = 0 109 | for root, _, files in os.walk(directory): 110 | for file in files: 111 | file_path = os.path.join(root, file) 112 | print(f"Processing file: {file_path}") 113 | content = "" 114 | images = [] 115 | 116 | if file.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp')): 117 | img = Image.open(file_path)  # keep the image open; a with-block would close it before process_image() saves it 118 | images.append(img) 119 | elif file.lower().endswith('.pdf'): 120 | content = read_pdf(file_path) 121 | images = extract_images_from_pdf(file_path) 122 | elif file.lower().endswith('.docx'): 123 | content = read_docx(file_path) 124 | images = extract_images_from_docx(file_path) 125 | elif file.lower().endswith('.pptx'): 126 | content = read_pptx(file_path) 127 | images = extract_images_from_pptx(file_path) 128 | elif file.lower().endswith('.txt'): 129 | with open(file_path, 'r', encoding='utf-8') as f: 130 | content = f.read() 131 | 132 | if content: 133 | chunks = chunk_text(content) 134 | for i, chunk in enumerate(chunks): 135 | documents.append(chunk) 136 | metadata.append({"path": file_path, "chunk_id": i, "type": "text"}) 137 | 138 | for i, img in enumerate(images): 139 | img_content = process_image(img) 140 | documents.append(img_content) 141 | metadata.append({"path": file_path, "chunk_id": i, "type": "image"}) 142 | file_count += 1 143 | 144 | if file_count == 0: 145 | print(f"No files found in the directory: {directory}") 146 | return 147 | 148 | print(f"Encoding {len(documents)} document chunks and images") 149 | embeddings = [] 150 | for doc, meta in zip(documents, metadata):  # documents and metadata are parallel lists 151 | if meta["type"] == "text": 152 | embeddings.append(model.encode([doc])[0]) 153 | else: 154 | # For images, doc is a temp-file path; use LLaVA to generate a description and then encode it 155 | image_description = llava_generate("Describe this image in detail.", doc) 156 | embeddings.append(model.encode([image_description])[0]) 157 | # Delete the temporary file after processing 158 | os.unlink(doc) 159 | 160 | print("Adding embeddings to FAISS index") 161 | index.add(np.array(embeddings)) 162 | 163 | # Save index and metadata 164 | print("Saving FAISS index and metadata") 165 | index_path = "document_index.faiss" 166 | metadata_path = "metadata.json" 167 | faiss.write_index(index, index_path) 168 | with open(metadata_path, "w") as f: 169 | json.dump(metadata, f) 170 | 171 | print(f"Indexed {len(documents)} document chunks and images.") 172 | print(f"Index saved to: {os.path.abspath(index_path)}") 173 | print(f"Metadata saved to: {os.path.abspath(metadata_path)}") 174 | 175 | # Sanity check: confirm the metadata file was actually written 176 | if not os.path.exists(metadata_path): 177 | print(f"Warning: metadata file was not created at {metadata_path}") 178 | print(f"Current directory contents: {os.listdir('.')}") 179 | 180 | # Function to read document chunk 181 | def read_document_chunk(file_path, chunk_id, chunk_type): 182 | print(f"Reading document chunk: {file_path}, chunk_id: {chunk_id}, type: {chunk_type}") 183 | 
try: 184 | if chunk_type == "text": 185 | content = "" 186 | if file_path.endswith('.pdf'): 187 | content = read_pdf(file_path) 188 | elif file_path.endswith('.docx'): 189 | content = read_docx(file_path) 190 | elif file_path.endswith('.pptx'): 191 | content = read_pptx(file_path) 192 | elif file_path.endswith('.txt'): 193 | with open(file_path, 'r', encoding='utf-8') as f: 194 | content = f.read() 195 | 196 | chunks = chunk_text(content) 197 | return chunks[chunk_id] if chunk_id < len(chunks) else "" 198 | elif chunk_type == "image": 199 | if file_path.endswith('.pdf'): 200 | images = extract_images_from_pdf(file_path) 201 | elif file_path.endswith('.docx'): 202 | images = extract_images_from_docx(file_path) 203 | elif file_path.endswith('.pptx'): 204 | images = extract_images_from_pptx(file_path) 205 | else: 206 | images = [Image.open(file_path)] 207 | 208 | if chunk_id < len(images): 209 | return process_image(images[chunk_id]) 210 | else: 211 | return None 212 | else: 213 | print(f"Unknown chunk type: {chunk_type}") 214 | return "" 215 | except Exception as e: 216 | print(f"Error reading document chunk: {e}") 217 | return "" 218 | 219 | # Search function 220 | def semantic_search(query, k=10, query_type='text'): 221 | print(f"Performing semantic search for query type: {query_type}, k={k}") 222 | 223 | if query_type == 'text': 224 | query_vector = model.encode([query])[0] 225 | elif query_type == 'image': 226 | return image_search(query, k) 227 | 228 | distances, indices = index.search(np.array([query_vector]), k) 229 | 230 | results = [] 231 | for i, idx in enumerate(indices[0]): 232 | meta = metadata[idx] 233 | content_type = meta.get("type", "text") # Default to 'text' if 'type' is missing 234 | content = read_document_chunk(meta["path"], meta["chunk_id"], content_type) 235 | results.append({ 236 | "id": int(idx), 237 | "path": meta["path"], 238 | "content": content, 239 | "type": content_type, 240 | "score": float(distances[0][i]) 241 | }) 242 | 243 | print(f"Found {len(results)} search results") 244 | return results 245 | 246 | def image_search(image_path, k=10): 247 | # This function will handle image-based search using LLaVA 248 | base64_image = encode_image_to_base64(image_path) 249 | prompt = "Describe this image in detail." 250 | 251 | response = ollama.generate(model='llava', prompt=prompt, images=[base64_image]) 252 | image_description = response['response'] 253 | 254 | # Now use the image description to perform a text-based search 255 | return semantic_search(image_description, k, query_type='text') 256 | 257 | # Answer generation function 258 | def generate_answer(query, context): 259 | print(f"Generating answer for query: '{query}'") 260 | prompt = f"""Answer the user's question using the documents given in the context. In the context are documents that should contain an answer. Please always reference the document ID (in square brackets, for example [0], [1]) of the document that was used to make a claim. Use as many citations and documents as necessary to answer the question. 
261 | 262 | Context: 263 | {context} 264 | 265 | Question: {query} 266 | 267 | Answer: (Remember to use document references like [0], [1], etc.)""" 268 | 269 | print("Sending prompt to Ollama") 270 | response = ollama.generate(model='tinyllama', prompt=prompt) 271 | print("Received response from Ollama") 272 | print(f"Raw response: {response['response']}") 273 | return response['response'] 274 | 275 | def load_lottieurl(url: str): 276 | r = requests.get(url) 277 | if r.status_code != 200: 278 | return None 279 | return r.json() 280 | 281 | def encode_image_to_base64(image_path): 282 | with open(image_path, "rb") as image_file: 283 | return base64.b64encode(image_file.read()).decode('utf-8') 284 | 285 | def llava_generate(prompt, image_path): 286 | if image_path: 287 | base64_image = encode_image_to_base64(image_path) 288 | response = ollama.generate(model='llava', prompt=prompt, images=[base64_image]) 289 | else: 290 | response = ollama.generate(model='llava', prompt=prompt) 291 | return response['response'] 292 | 293 | def generate_answer_with_image(query, context, image_path): 294 | base64_image = encode_image_to_base64(image_path) 295 | prompt = f"""Answer the user's question using the documents and image given in the context. Please reference the document ID (in square brackets) when using information from the text documents. 296 | 297 | Context: 298 | {context} 299 | 300 | Question: {query} 301 | 302 | Answer:""" 303 | 304 | response = ollama.generate(model='llava', prompt=prompt, images=[base64_image]) 305 | return response['response'] 306 | 307 | # Streamlit UI 308 | def main(): 309 | print("Starting Streamlit UI") 310 | 311 | # Page config 312 | st.set_page_config(page_title="Local GenAI Search", page_icon="🔍", layout="wide") 313 | 314 | # Custom CSS 315 | st.markdown(""" 316 | 357 | """, unsafe_allow_html=True) 358 | 359 | # Title and animation 360 | st.markdown('
<h1>Local GenAI Search 🔍</h1>
', unsafe_allow_html=True) 361 | st.markdown('
<p>Explore your documents with the power of AI!</p>
', unsafe_allow_html=True) 362 | 363 | col1, col2, col3 = st.columns([1, 2, 1]) 364 | with col2: 365 | lottie_url = "https://assets5.lottiefiles.com/packages/lf20_fcfjwiyb.json" 366 | lottie_json = load_lottieurl(lottie_url) 367 | if lottie_json: st_lottie(lottie_json, height=200, key="coding")  # skip the animation if the download failed 368 | 369 | # Input for documents path 370 | documents_path = st.text_input("📁 Enter the path to your documents folder:", placeholder="e.g. /path/to/your/documents") 371 | if documents_path and not os.path.exists(documents_path): 372 | st.error(f"The specified path does not exist: {documents_path}") 373 | 374 | # Check if documents are indexed 375 | if not os.path.exists("document_index.faiss") or not os.path.exists("metadata.json"): 376 | st.warning("⚠️ Documents are not indexed or metadata is missing. Please run the indexing process.") 377 | if st.button("🚀 Index Documents", key="index_button"): 378 | with st.spinner("Indexing documents... This may take a while."): 379 | print(f"Indexing documents in {documents_path}") 380 | if os.path.exists(documents_path): 381 | index_documents(documents_path) 382 | if os.path.exists("document_index.faiss") and os.path.exists("metadata.json"): 383 | st.success("✅ Indexing complete!") 384 | st.rerun() 385 | else: 386 | st.error("Indexing failed. Please check the application logs.") 387 | else: 388 | st.error(f"The specified documents path does not exist: {documents_path}") 389 | 390 | # Load index and metadata if not already loaded 391 | global index, metadata 392 | if len(metadata) == 0: 393 | print("Loading FAISS index and metadata") 394 | index_path = "document_index.faiss" 395 | metadata_path = "metadata.json" 396 | try: 397 | if not os.path.exists(index_path): 398 | raise FileNotFoundError(f"Index file not found: {index_path}") 399 | if not os.path.exists(metadata_path): 400 | raise FileNotFoundError(f"Metadata file not found: {metadata_path}") 401 | 402 | index = faiss.read_index(index_path) 403 | with open(metadata_path, "r") as f: 404 | metadata = json.load(f) 405 | print(f"Loaded index with {index.ntotal} vectors and {len(metadata)} metadata entries") 406 | except FileNotFoundError as e: 407 | st.error(f"Error: {str(e)}. Please run the indexing process first.") 408 | st.error(f"Current working directory: {os.getcwd()}") 409 | st.error(f"Files in current directory: {os.listdir('.')}") 410 | return 411 | except json.JSONDecodeError: 412 | st.error(f"Error reading metadata file: {metadata_path}. Please run the indexing process again.") 413 | return 414 | except Exception as e: 415 | st.error(f"Unexpected error: {str(e)}. 
Please check the application logs.") 416 | return 417 | 418 | st.markdown("---") 419 | st.markdown("## 🤔 Ask a Question") 420 | question = st.text_input("What would you like to know about your documents?", "") 421 | 422 | col1, col2, col3 = st.columns([1, 1, 1]) 423 | with col2: 424 | search_button = st.button("🔍 Search and Answer", key="search_button") 425 | 426 | if search_button: 427 | if question: 428 | with st.spinner("🕵️‍♀️ Searching and generating answer..."): 429 | search_results = semantic_search(question) 430 | context = "\n\n".join([f"{i}: {result['content']}" for i, result in enumerate(search_results)]) 431 | answer = generate_answer(question, context) 432 | 433 | # Display answer and referenced documents 434 | st.markdown("### 🤖 AI Answer:") 435 | st.info(answer) 436 | 437 | # Display referenced documents 438 | st.markdown("### 📚 Referenced Documents:") 439 | rege = re.compile(r"\[Document\s+[0-9]+\]|\[[0-9]+\]") 440 | referenced_ids_raw = re.findall(r'\b\d+\b', ' '.join(rege.findall(answer))) 441 | referenced_ids = [int(s) for s in referenced_ids_raw] 442 | 443 | print(f"Raw answer: {answer}") 444 | print(f"Regex matches: {rege.findall(answer)}") 445 | print(f"Referenced IDs (raw): {referenced_ids_raw}") 446 | print(f"Referenced IDs: {referenced_ids}") 447 | 448 | if not referenced_ids: 449 | st.warning("No specific document references found in the answer.") 450 | 451 | print(f"Displaying {len(referenced_ids)} referenced documents") 452 | for doc_id in referenced_ids: 453 | if doc_id < len(search_results): 454 | doc = search_results[doc_id] 455 | with st.expander(f"📄 Document {doc_id} - {os.path.basename(doc['path'])}"): 456 | st.write(doc['content']) 457 | col1, col2 = st.columns([3, 1]) 458 | with col2: 459 | with open(doc['path'], 'rb') as f: 460 | st.download_button("⬇️ Download file", f, file_name=os.path.basename(doc['path'])) 461 | else: 462 | st.warning(f"Referenced document ID {doc_id} is out of range.") 463 | else: 464 | st.warning("⚠️ Please enter a question before clicking 'Search and Answer'.") 465 | 466 | st.markdown("---") 467 | st.markdown("### 🌟 Made with love by Anoop Maurya") 468 | 469 | if __name__ == "__main__": 470 | main() 471 | print("Application finished") 472 | 473 | -------------------------------------------------------------------------------- /metadata.json: -------------------------------------------------------------------------------- 1 | [{"path": "file\\2210.03629v3.pdf", "chunk_id": 0, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 1, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 2, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 3, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 4, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 5, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 6, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 7, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 8, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 9, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 10, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 11, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 12, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 13, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 14, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 15, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 
16, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 17, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 18, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 19, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 20, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 21, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 22, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 23, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 24, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 25, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 26, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 27, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 28, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 29, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 30, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 31, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 32, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 33, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 34, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 35, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 36, "type": "text"}, {"path": "file\\2210.03629v3.pdf", "chunk_id": 37, "type": "text"}] -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | faiss-cpu 2 | sentence-transformers 3 | PyPDF2 4 | python-docx 5 | python-pptx 6 | streamlit 7 | ollama 8 | numpy 9 | streamlit-lottie 10 | requests 11 | Pillow 12 | PyMuPDF 13 | --------------------------------------------------------------------------------