├── .gitignore ├── requirements.txt ├── TODO.md ├── docker-compose.yml ├── example.env ├── CHANGELOG.md ├── README.md └── PyPlexitas.py /.gitignore: -------------------------------------------------------------------------------- 1 | .venv 2 | .env 3 | token.json 4 | credentials.json -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | langchain==0.2.0 2 | langchain-community==0.2.0 3 | langchain_openai==0.1.7 4 | requests==2.31.0 5 | qdrant-client 6 | lxml 7 | cssselect 8 | python-dotenv 9 | google-auth 10 | google-auth-oauthlib 11 | google-auth-httplib2 12 | google-api-python-client -------------------------------------------------------------------------------- /TODO.md: -------------------------------------------------------------------------------- 1 | # TODO 2 | 3 | 1. **Speed Optimizations** 4 | - Identify and optimize slow-performing segments of the code. 5 | 6 | 2. **Better Configurable Parameters** 7 | - Review current configuration parameters and identify areas for improvement. 8 | - Introduce new parameters to enhance configurability and flexibility. 9 | - Ensure documentation is updated to reflect changes in configuration options. 10 | 11 | 3. **Gmail Improvement** 12 | - Investigate issues and areas of enhancement within existing Gmail integration. 13 | - Improve the reliability and performance of Gmail-related features. 14 | 15 | -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | services: 2 | qdrant: 3 | image: qdrant/qdrant:latest 4 | restart: always 5 | container_name: qdrant 6 | ports: 7 | - 6333:6333 8 | - 6334:6334 9 | expose: 10 | - 6333 11 | - 6334 12 | - 6335 13 | configs: 14 | - source: qdrant_config 15 | target: /qdrant/config/production.yaml 16 | volumes: 17 | - qdrant_data_volume:/qdrant_data 18 | networks: 19 | - PyPlexitas 20 | 21 | configs: 22 | qdrant_config: 23 | content: | 24 | log_level: INFO 25 | 26 | volumes: 27 | qdrant_data_volume: 28 | driver: local 29 | 30 | networks: 31 | PyPlexitas: 32 | driver: bridge -------------------------------------------------------------------------------- /example.env: -------------------------------------------------------------------------------- 1 | # OpenAI config 2 | OPENAI_API_KEY="" 3 | # OPENAI_BASE_URL="" # Leave blank for default 4 | EMBEDDING_MODEL_NAME="text-embedding-ada-002" 5 | CHAT_MODEL_NAME="gpt-4o" 6 | 7 | # Ollama config 8 | USE_OLLAMA=false 9 | # OLLAMA_API_KEY="ollama" 10 | OLLAMA_BASE_URL="http://localhost:11434/v1" 11 | OLLAMA_EMBEDDING_MODEL_NAME="llama3" 12 | OLLAMA_CHAT_MODEL_NAME="llama3" 13 | 14 | # Google config: see https://cse.google.com/cse/create/new and https://developers.google.com/custom-search/v1/overview for the API key and CX 15 | GOOGLE_API_KEY="" 16 | GOOGLE_CX="" 17 | # GOOGLE_ENDPOINT="" # Leave blank for default 18 | 19 | # Bing config 20 | BING_SUBSCRIPTION_KEY="" 21 | # BING_ENDPOINT="" # Leave blank for default -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | ## [Unreleased] 4 | 5 | ## [v0.2.0] - 2024-05-22 6 | 7 | ### Added 8 | - Optional LLM query parameter `--llm-query`.
9 | - New TODO.md file outlining tasks related to speed optimizations, configurable parameters review, and Gmail integration improvements. 10 | - Integration of Gmail search functionality into the main function, allowing users to search and fetch emails from Gmail using a specified query and result count, including email content retrieval. 11 | 12 | ### Changed 13 | - Updated `README.md` to reflect the new optional LLM query parameter and increased default max-tokens limit. 14 | - Updated default constants for chunk size and max tokens. 15 | - Enhanced `fetch_url_content` function with a timeout parameter. 16 | 17 | ### Fixed 18 | - Use correct query variable for embedding calculation in PyPlexitas. 19 | - Added missing Google libraries to `requirements.txt`. 20 | 21 | ### Removed 22 | - Removed token.json and credentials.json from version control by adding them to `.gitignore`. 23 | 24 | ## [v0.1.0] - 2024-05-20 25 | 26 | ### Added 27 | - Initial project files and configurations. 28 | - Docker Compose configuration for Qdrant container including services, configs, volumes, and networks. 29 | - Functionality to load environment variables from `.env` file and handle missing Bing API key gracefully. 30 | - Embedding selection logic based on the value of `USE_OLLAMA`. 31 | - Logging and debug statements throughout the codebase. 32 | - Google search functionality with the necessary endpoint and parameters. 33 | 34 | ### Changed 35 | - Refactored imports in `PyPlexitas.py` to include both OpenAI and OpenAIEmbeddings from `langchain_openai`. 36 | - Updated logging configuration to set the logging level to DEBUG for more detailed logging information. 37 | - Enhanced LLMAgent class by adding an optional debug parameter to print document content for debugging purposes. 38 | - Adjusted the argparse description to reflect the project name PyPlexitas. 39 | - Implemented collection handling in `vector_client`, including deleting the collection if it already exists before recreating it with the appropriate configurations. 40 | 41 | ### Fixed 42 | - Corrected API key and model names in environment configuration. 43 | - Set global User-Agent for all API requests. 44 | - Added missing import statement for `hashlib` in `hash_string` function. 45 | 46 | ### Documentation 47 | - Added `README.md` for PyPlexitas including features, installation instructions, configuration details, and usage guidelines. 48 | - Updated `README.md` formatting and added license information. 49 | 50 | ### Chore 51 | - Improved logging and print messages for search operations, including emoji indicators. 52 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🌟 PyPlexitas 2 | 3 | PyPlexitas is a Python script designed as an open-source alternative to Perplexity AI: it provides users with detailed answers to their queries by searching the web, extracting relevant content, and using advanced language models to generate responses. 4 | 5 | The script first takes a user’s query and uses a search engine (Bing or Google) or Gmail to find relevant web pages or emails. It then scrapes the content from these pages or emails, splits the text into manageable chunks, and generates vector embeddings for these chunks. Vector embeddings are mathematical representations of text that allow for efficient searching and comparison of content.
These embeddings are stored in a database, enabling quick retrieval of relevant information based on the user’s query. 6 | 7 | Once the content is processed and stored, the script uses a language model to generate a detailed answer to the user’s query, using the information extracted from the web pages. This response is designed to be accurate and informative, drawing directly from the content found during the search process. 8 | 9 | **Example:** 10 | ```bash 11 | python PyPlexitas.py -q "When will the next model GPT-5 be released" -s 10 --engine google 12 | ``` 13 | 14 | Expected Output: 15 | ``` 16 | Searching for 🔎: When will the next model GPT-5 be released using google 17 | Starting Google search ⏳ 18 | Google search returned 🔗: 10 results 19 | From domains 🌐: mashable.com www.reddit.com www.tomsguide.com www.datacamp.com medium.com www.standard.co.uk www.theverge.com arstechnica.com 20 | Scraping content from search results... 21 | Embedding content ✨ 22 | Total embeddings 📊: 10 23 | Total chunks processed 🧩: 7 24 | 25 | Answering your query: When will the next model GPT-5 be released 🙋 26 | 27 | The release date for GPT-5 is currently expected to be sometime in mid-2024, likely during the summer, according to a report from Business Insider [1][2]. OpenAI representatives have not provided a specific release date, and the timeline may be subject to change depending on the duration of safety testing and other factors [1][2]. OpenAI CEO Sam Altman has indicated that a major AI model will be released this year, but it is unclear whether it will be called GPT-5 or something else [1]. 28 | 29 | ### Sources 30 | 1. Benj Edwards - https://arstechnica.com/information-technology/2024/03/gpt-5-might-arrive-this-summer-as-a-materially-better-update-to-chatgpt/ 31 | 2. Saqib Shah - https://www.standard.co.uk/tech/openai-chatgpt-5-release-date-b1076129.html 32 | ``` 33 | 34 | ## Table of Contents 35 | - [Features](#features) 36 | - [Installation](#installation) 37 | - [Configuration](#configuration) 38 | - [Getting API Keys](#getting-api-keys) 39 | - [Usage](#usage) 40 | - [Project Structure](#project-structure) 41 | - [Contributing](#contributing) 42 | - [License](#license) 43 | 44 | ## Features 45 | - **Web Search**: Perform web searches using Bing, Google, or Gmail APIs. 46 | - **Content Scraping**: Scrape content from search results. 47 | - **Embedding Generation**: Generate embeddings for content using OpenAI or Ollama models. 48 | - **Question Answering**: Answer questions based on the scraped content. 49 | - **Vector Database**: Use Qdrant for storing and querying embeddings. 50 | 51 | ## Installation 52 | 1. Clone the repository: 53 | ```bash 54 | git clone https://github.com/dkruyt/PyPlexitas.git 55 | cd PyPlexitas 56 | ``` 57 | 58 | 2. Install the required Python packages: 59 | ```bash 60 | pip install -r requirements.txt 61 | ``` 62 | 63 | 3. Set up the Qdrant service using Docker: 64 | ```bash 65 | docker-compose up -d 66 | ``` 67 | 68 | ## Configuration 69 | Configure your environment variables by creating a `.env` file in the project root. Use the provided `example.env` as a template: 70 | ```bash 71 | cp example.env .env 72 | ``` 73 | Fill in your API keys and other necessary details in the `.env` file: 74 | - `OPENAI_API_KEY` 75 | - `GOOGLE_API_KEY` 76 | - `GOOGLE_CX` 77 | - `BING_SUBSCRIPTION_KEY` 78 | 79 | ## Getting API Keys 80 | 81 | ### OpenAI API Key 82 | 1. Sign up or log in to your [OpenAI account](https://www.openai.com/). 83 | 2. 
Go to the API section and generate a new API key. 84 | 3. Copy the API key and add it to the `OPENAI_API_KEY` field in your `.env` file. 85 | 86 | ### Google Custom Search API Key and CX 87 | 1. Go to the [Google Cloud Console](https://console.cloud.google.com/). 88 | 2. Create a new project or select an existing project. 89 | 3. Enable the Custom Search API in the API & Services library. 90 | 4. Go to the [Credentials](https://console.cloud.google.com/apis/credentials) page and create an API key. 91 | 5. Copy the API key and add it to the `GOOGLE_API_KEY` field in your `.env` file. 92 | 6. To get the Custom Search Engine (CX) ID, go to the [Custom Search Engine](https://cse.google.com/cse/) page. 93 | 7. Create a new search engine or select an existing one. 94 | 8. Copy the Search Engine ID (CX) and add it to the `GOOGLE_CX` field in your `.env` file. 95 | 96 | ### Bing Search API Key 97 | 1. Sign up or log in to your [Microsoft Azure account](https://azure.microsoft.com/). 98 | 2. Create a new Azure resource for Bing Search v7. 99 | 3. Go to the Keys and Endpoint section to find your API key. 100 | 4. Copy the API key and add it to the `BING_SUBSCRIPTION_KEY` field in your `.env` file. 101 | 102 | ### Gmail API Key 103 | 1. Go to the [Google Cloud Console](https://console.cloud.google.com/). 104 | 2. Create a new project or select an existing project. 105 | 3. Enable the Gmail API in the API & Services library. 106 | 4. Go to the [Credentials](https://console.cloud.google.com/apis/credentials) page and create OAuth 2.0 credentials. 107 | 5. Download the credentials file and save it as `credentials.json` in the project root. 108 | 109 | ## Usage 110 | Run the PyPlexitas script with your query: 111 | ```bash 112 | python PyPlexitas.py -q "Your search query" -s 10 --engine bing 113 | ``` 114 | Options: 115 | - `-q, --query`: Search query (required) 116 | - `--llm-query`: Optional LLM query for answering (defaults to the search query) 117 | - `-s, --search`: Number of search results to parse (default: 10) 118 | - `--engine`: Search engine to use (`bing`, `google`, or `gmail`, default: `bing`) 119 | - `-l, --log-level`: Set the logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`, default: `ERROR`) 120 | - `-t, --max-tokens`: Maximum token limit for model input (default: 16000) 121 | - `--quiet`: Suppress print messages 122 | 123 | ## Project Structure 124 | - `PyPlexitas.py`: Main script for running the application. 125 | - `example.env`: Example configuration file for environment variables. 126 | - `docker-compose.yml`: Docker Compose configuration for Qdrant. 127 | - `requirements.txt`: List of required Python packages. 128 | - `README.md`: Project documentation. 129 | 130 | ## Contributing 131 | Contributions are welcome! Please fork the repository and submit a pull request for any improvements or bug fixes. 132 | 133 | ## License 134 | This project is licensed under the GPL 3.0 License. See the [LICENSE](LICENSE) file for details. 
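## Appendix: Pipeline Sketch

For readers who want to see how the pieces described above fit together, below is a minimal, illustrative sketch of the chunk → embed → store → retrieve loop, written against the same libraries PyPlexitas already uses (`langchain_openai` and `qdrant-client`). It is not the project's own code: it assumes `OPENAI_API_KEY` is set in `.env` and that the Qdrant container from `docker-compose.yml` is running on `localhost:6333`, and names such as `demo_chunks` and `demo_embeddings` are made up for the example.

```python
# Minimal sketch of embed -> store -> retrieve with Qdrant (illustrative only).
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.http import models

load_dotenv()  # picks up OPENAI_API_KEY from .env

# Pretend these are chunks scraped from two web pages.
demo_chunks = [
    "GPT-5 is expected sometime in mid-2024, according to press reports.",
    "OpenAI has not confirmed an official release date for its next model.",
]

embeddings = OpenAIEmbeddings()
vectors = [embeddings.embed_query(chunk) for chunk in demo_chunks]

client = QdrantClient(host="localhost", port=6333)
collection = "demo_embeddings"
if client.collection_exists(collection):
    client.delete_collection(collection)
client.create_collection(
    collection_name=collection,
    vectors_config=models.VectorParams(size=len(vectors[0]), distance=models.Distance.COSINE),
)

# Store one point per chunk; the point id doubles as the chunk id.
client.upsert(
    collection_name=collection,
    points=[models.PointStruct(id=i, vector=v) for i, v in enumerate(vectors)],
)

# Embed the question and retrieve the closest chunks; their text would then be
# handed to the chat model as context for the final answer.
hits = client.search(
    collection_name=collection,
    query_vector=embeddings.embed_query("When will GPT-5 be released?"),
    limit=2,
)
print([demo_chunks[hit.id] for hit in hits])
```

PyPlexitas performs the same steps per search result in `generate_upsert_embeddings` and `process_chunk`, then hands the retrieved chunks to the chat model in `answer_question_stream`.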
135 | -------------------------------------------------------------------------------- /PyPlexitas.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import json 4 | import hashlib 5 | import logging 6 | import argparse 7 | import asyncio 8 | from aiohttp import ClientSession, ClientError, ClientSSLError 9 | from typing import List, Dict, Optional 10 | from urllib.parse import urlparse 11 | import base64 12 | import google.auth 13 | from google.auth.transport.requests import Request as GoogleRequest 14 | from google.oauth2.credentials import Credentials 15 | from google_auth_oauthlib.flow import InstalledAppFlow 16 | from googleapiclient.discovery import build 17 | 18 | import aiohttp 19 | from lxml import html 20 | from langchain_community.llms import Ollama 21 | from langchain_community.embeddings import OllamaEmbeddings 22 | from langchain_openai import OpenAIEmbeddings, ChatOpenAI 23 | from langchain.prompts import PromptTemplate 24 | from langchain.chains import ConversationChain 25 | from langchain.chains.question_answering import load_qa_chain 26 | from langchain.memory import ConversationBufferMemory 27 | from langchain.docstore.document import Document 28 | from qdrant_client import QdrantClient 29 | from qdrant_client.http import models 30 | from dotenv import load_dotenv 31 | 32 | # Load environment variables from .env file 33 | load_dotenv() 34 | 35 | # Constants 36 | # Default endpoint for Bing Search API 37 | DEFAULT_BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search" 38 | # Default endpoint for Google Custom Search API 39 | DEFAULT_GOOGLE_ENDPOINT = "https://www.googleapis.com/customsearch/v1" 40 | # Default base URL for OpenAI API 41 | DEFAULT_OPENAI_BASE_URL = "https://api.openai.com/v1" 42 | # Default chunk size for processing text data 43 | CHUNK_SIZE = 800 44 | # Dimension of the embedding vectors 45 | DIMENSION = 1536 46 | # Default maximum number of tokens for input/output in API requests 47 | DEFAULT_MAX_TOKENS = 16000 48 | # User-agent defined globally, to be used in all API requests; can be helpful in bypassing logins and paywalls 49 | USER_AGENT = "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 50 | 51 | # Initialize logger 52 | logger = logging.getLogger(__name__) 53 | 54 | # Request class 55 | class Request: 56 | def __init__(self, query: str): 57 | self.query = query 58 | self.search_map: Dict[str, SearchResult] = {} 59 | self.chunk_id_chunk_map: Dict[int, str] = {} 60 | self.chunk_id_to_search_id: Dict[int, str] = {} 61 | 62 | def add_search_result(self, search_result: "SearchResult"): 63 | url_hash = hash_string(search_result.url) 64 | self.search_map[url_hash] = search_result 65 | 66 | def add_webpage_content(self, url_hash: str, content: str): 67 | if (search := self.search_map.get(url_hash, None)) is not None: 68 | search.content = content 69 | logger.debug(f"Added content for URL hash {url_hash}") 70 | 71 | def add_id_to_chunk(self, chunk: str, search_result_id: str, chunk_id: int): 72 | self.chunk_id_chunk_map[chunk_id] = chunk 73 | self.chunk_id_to_search_id[chunk_id] = search_result_id 74 | logger.debug(f"Chunk added for search result ID {search_result_id} with chunk ID {chunk_id}") 75 | 76 | def get_chunks(self, ids: List[int]) -> List["Chunk"]: 77 | chunks = [] 78 | for chunk_id in ids: 79 | chunk_content =
self.chunk_id_chunk_map.get(chunk_id) 80 | search_id = self.chunk_id_to_search_id.get(chunk_id) 81 | search_result = self.search_map.get(search_id) 82 | if search_result and chunk_content: 83 | chunk = Chunk( 84 | content=chunk_content, 85 | name=search_result.name, 86 | url=search_result.url, 87 | ) 88 | chunks.append(chunk) 89 | return chunks 90 | 91 | # SearchResult class 92 | class SearchResult: 93 | def __init__(self, name: str, url: str, content: Optional[str] = None): 94 | self.name = name 95 | self.url = url 96 | self.content = content 97 | 98 | # Chunk class 99 | class Chunk: 100 | def __init__(self, content: str, name: str, url: str): 101 | self.content = content 102 | self.name = name 103 | self.url = url 104 | 105 | # Function to hash strings, typically used for URL hashing or similar purposes. 106 | def hash_string(input_string: str) -> str: 107 | return hashlib.sha256(input_string.encode()).hexdigest() 108 | 109 | # Function to clean text by removing extra whitespace and normalizing spaces. 110 | def clean_text(text: str) -> str: 111 | return re.sub(r"\s+", " ", text) 112 | 113 | # Function to perform web search using Bing's search API. 114 | async def fetch_web_pages_bing(request: Request, search_count: int, verbose: bool): 115 | if verbose: print("Starting Bing search ⏳") 116 | 117 | query = request.query 118 | mkt = "en-US" 119 | count_str = str(search_count) 120 | 121 | logger.debug(f"Search parameters - Query: {query}, Market: {mkt}, Count: {count_str}") 122 | 123 | params = { 124 | "mkt": mkt, 125 | "q": query, 126 | "count": count_str, 127 | } 128 | 129 | bing_endpoint = os.getenv("BING_ENDPOINT", DEFAULT_BING_ENDPOINT) 130 | bing_api_key = os.getenv("BING_SUBSCRIPTION_KEY") 131 | if not bing_api_key: 132 | logger.error("Bing API key is missing. Please set BING_SUBSCRIPTION_KEY in your environment variables.") 133 | return 134 | 135 | headers = { 136 | "Ocp-Apim-Subscription-Key": bing_api_key, 137 | "User-Agent": USER_AGENT, # Using the global variable 138 | } 139 | 140 | async with aiohttp.ClientSession() as session: 141 | async with session.get(bing_endpoint, params=params, headers=headers) as response: 142 | if response.status == 200: 143 | json_data = await response.json() 144 | logger.info(f"Bing search returned: {len(json_data['webPages']['value'])} results") 145 | if verbose: print(f"Bing search returned 🔗: {len(json_data['webPages']['value'])} results") 146 | 147 | for wp in json_data["webPages"]["value"]: 148 | search_result = SearchResult( 149 | name=wp["name"], 150 | url=wp["url"], 151 | ) 152 | request.add_search_result(search_result) 153 | 154 | logger.debug(f"JSON result from Bing: {json.dumps(json_data, indent=2)}") 155 | else: 156 | logger.error(f"Request failed with status code: {response.status}") 157 | raise Exception(f"Request failed with status code: {response.status}") 158 | 159 | # Function to perform web search using Google's search API. 160 | async def fetch_web_pages_google(request: Request, search_count: int, verbose: bool): 161 | if verbose: print("Starting Google search ⏳") 162 | 163 | query = request.query 164 | 165 | google_endpoint = os.getenv("GOOGLE_ENDPOINT", DEFAULT_GOOGLE_ENDPOINT) 166 | google_api_key = os.getenv("GOOGLE_API_KEY") 167 | google_cx = os.getenv("GOOGLE_CX") 168 | if not google_api_key or not google_cx: 169 | logger.error("Google API key or CX is missing.
Please set GOOGLE_API_KEY and GOOGLE_CX in your environment variables.") 170 | return 171 | 172 | params = { 173 | "key": google_api_key, 174 | "cx": google_cx, 175 | "q": query, 176 | "num": search_count 177 | } 178 | 179 | async with aiohttp.ClientSession() as session: 180 | async with session.get(google_endpoint, params=params) as response: 181 | if response.status == 200: 182 | json_data = await response.json() 183 | logger.info(f"Google search returned: {len(json_data['items'])} results") 184 | if verbose: print(f"Google search returned 🔗: {len(json_data['items'])} results") 185 | 186 | for item in json_data["items"]: 187 | search_result = SearchResult( 188 | name=item["title"], 189 | url=item["link"], 190 | ) 191 | request.add_search_result(search_result) 192 | 193 | logger.debug(f"JSON result from Google: {json.dumps(json_data, indent=2)}") 194 | else: 195 | logger.error(f"Request failed with status code: {response.status}") 196 | raise Exception(f"Request failed with status code: {response.status}") 197 | 198 | # Function to scrape content from given URLs by making HTTP requests. 199 | async def fetch_url_content(session: ClientSession, url: str, max_retries: int = 3, retry_delay: int = 1, timeout: int = 5) -> str: 200 | logger.info(f"Scraping content from URL: {url}") 201 | retries = 0 202 | while retries < max_retries: 203 | try: 204 | async with session.get(url, timeout=timeout) as response: 205 | if response.status == 200: 206 | full_text = await response.text() 207 | document = html.fromstring(full_text) 208 | 209 | selectors = [ 210 | "article", # good for blog posts/articles 211 | "div.main-content", # a more specific div that usually holds main content 212 | "body", # generic selector 213 | ] 214 | 215 | main_text = "" 216 | for selector in selectors: 217 | elements = document.cssselect(selector) 218 | if elements: 219 | main_text = "\n".join([clean_text(element.text_content()) for element in elements if element.text_content()]) 220 | break 221 | 222 | logger.debug(f"Extracted content: '{main_text}' from URL: {url}") 223 | if not main_text.strip(): 224 | logger.warning(f"No content extracted from URL: {url}") 225 | return main_text 226 | else: 227 | logger.warning(f"Request failed with status code: {response.status}") 228 | return "" 229 | except (ClientError, ClientSSLError, asyncio.TimeoutError) as e: 230 | logger.warning(f"Error occurred while scraping URL: {url}. Error: {str(e)}") 231 | retries += 1 232 | await asyncio.sleep(retry_delay) 233 | logger.error(f"Failed to scrape content from URL: {url} after {max_retries} retries.") 234 | return "" 235 | 236 | # Function to handle the URL processing by scraping content from each URL found in the search results. 237 | async def process_urls(request: Request): 238 | logger.info("Processing URLs to scrape content...") 239 | async with aiohttp.ClientSession(headers={"User-Agent": USER_AGENT}) as session: 240 | tasks = [] 241 | for search_result in request.search_map.values(): 242 | url = search_result.url 243 | task = asyncio.create_task(fetch_url_content(session, url)) 244 | tasks.append(task) 245 | 246 | webpage_contents = await asyncio.gather(*tasks) 247 | 248 | for search_result, content in zip(request.search_map.values(), webpage_contents): 249 | url_hash = hash_string(search_result.url) 250 | logger.debug(f"Adding webpage content for URL hash {url_hash}. 
Content length: {len(content)}") 251 | request.add_webpage_content(url_hash, content) 252 | 253 | # Embedding generation and upsert 254 | async def insert_embedding(vector_client: QdrantClient, embedding: List[float], chunk_id: int): 255 | logger.debug(f"Inserting embedding for chunk ID {chunk_id}") 256 | vector_client.upsert( 257 | collection_name="embeddings", 258 | points=[ 259 | models.PointStruct( 260 | id=chunk_id, 261 | vector=embedding, 262 | ) 263 | ], 264 | ) 265 | 266 | # Function to generate embeddings for web page content and upsert them into the vector database. 267 | # This function iterates over the search results, chunks the content, and processes each chunk to create and store embeddings. 268 | async def generate_upsert_embeddings(request: Request, vector_client: QdrantClient) -> int: 269 | logger.info("Generating and upserting embeddings...") 270 | tasks = [] 271 | shared_counter = 0 272 | 273 | for url_hash, search_result in request.search_map.items(): 274 | content = search_result.content or "" 275 | logger.debug(f"Content length for URL hash {url_hash}: {len(content)}") 276 | 277 | if not content.strip(): 278 | logger.warning(f"Skipping URL {search_result.url} due to empty or non-relevant content") 279 | continue 280 | 281 | # Split content into chunks 282 | chunks = [] 283 | words = content.split() 284 | for i in range(0, len(words), CHUNK_SIZE): 285 | chunk = " ".join(words[i:i+CHUNK_SIZE]) 286 | chunks.append(chunk) 287 | 288 | if not chunks: 289 | chunks = [content] 290 | 291 | logger.info(f"Chunked content into {len(chunks)} chunks for url: {search_result.url}") 292 | logger.info(f"Generating embedding for url: {search_result.url}") 293 | 294 | for chunk in chunks: 295 | task = asyncio.create_task(process_chunk(request, vector_client, shared_counter, url_hash, chunk)) 296 | tasks.append(task) 297 | shared_counter += 1 298 | 299 | await asyncio.gather(*tasks) 300 | return shared_counter 301 | 302 | # Function to process a content chunk, generate its embedding, and upsert the embedding into the vector database. 
303 | async def process_chunk(request: Request, vector_client: QdrantClient, shared_counter: int, url_hash: str, chunk: str): 304 | logger.debug(f"Processing chunk with ID {shared_counter} for URL hash {url_hash}") 305 | if os.getenv("USE_OLLAMA", "false").lower() == "true": 306 | embeddings = OllamaEmbeddings(model=os.getenv("OLLAMA_EMBEDDING_MODEL_NAME", "llama3")) 307 | else: 308 | embeddings = OpenAIEmbeddings(model=os.getenv("EMBEDDING_MODEL_NAME", "text-embedding-ada-002")) 309 | embedding = embeddings.embed_query(chunk) 310 | 311 | chunk_id = shared_counter 312 | 313 | request.add_id_to_chunk(chunk, url_hash, chunk_id) 314 | logger.debug(f"Processed and added chunk with ID {chunk_id} for URL hash {url_hash}") 315 | 316 | await insert_embedding(vector_client, embedding, chunk_id) 317 | 318 | # LLM agent 319 | class LLMAgent: 320 | def __init__(self): 321 | base_url = os.getenv("OPENAI_BASE_URL", DEFAULT_OPENAI_BASE_URL) 322 | api_key = os.environ.get("OPENAI_API_KEY", "") 323 | use_ollama = os.getenv("USE_OLLAMA", "false").lower() == "true" 324 | 325 | embedding_model_name = os.getenv("OLLAMA_EMBEDDING_MODEL_NAME", "llama3") if use_ollama else os.getenv("EMBEDDING_MODEL_NAME", "text-embedding-ada-002")  # honor the embedding model configured in .env 326 | chat_model_name = os.getenv("CHAT_MODEL_NAME", "gpt-4o") 327 | ollama_chat_model_name = os.getenv("OLLAMA_CHAT_MODEL_NAME", "llama3") 328 | 329 | self.local_mode = use_ollama 330 | if self.local_mode: 331 | print(f"Using Ollama model 🧠 for embeddings: {embedding_model_name} and chat: {ollama_chat_model_name}") 332 | self.embeddings = OllamaEmbeddings(model=embedding_model_name) 333 | self.llm = Ollama(model=ollama_chat_model_name) 334 | else: 335 | print(f"Using OpenAI model 🧠 for embeddings and chat: {chat_model_name}") 336 | self.embeddings = OpenAIEmbeddings(openai_api_key=api_key, model=embedding_model_name) 337 | self.llm = ChatOpenAI(openai_api_key=api_key, model_name=chat_model_name) 338 | 339 | logger.info(f"LLM Agent initialized. Local mode: {self.local_mode}") 340 | 341 | def chunk_to_documents(self, chunks: List[Chunk], max_tokens: int) -> List[Document]: 342 | documents = [] 343 | current_tokens = 0 344 | 345 | for chunk in chunks: 346 | chunk_tokens = len(chunk.content.split()) # Simple token estimation using word count 347 | if current_tokens + chunk_tokens > max_tokens: 348 | break 349 | documents.append(Document(page_content=chunk.content, metadata={"name": chunk.name, "url": chunk.url})) 350 | current_tokens += chunk_tokens 351 | 352 | logger.debug(f"Converted {len(chunks)} chunks into {len(documents)} documents") 353 | return documents 354 | 355 | async def answer_question_stream(self, llm_query: str, chunks: List[Chunk], max_tokens: int): 356 | print(f"\nAnswering your query: {llm_query} 🙋\n") 357 | documents = self.chunk_to_documents(chunks, max_tokens) 358 | logger.debug(f"Documents metadata:\n{[doc.metadata for doc in documents]}") 359 | prompt = PromptTemplate( 360 | input_variables=["context", "question"], 361 | template=""" 362 | CONTEXT: 363 | {context} 364 | 365 | QUESTION: 366 | {question} 367 | 368 | INSTRUCTIONS: 369 | You are a helpful AI assistant that helps users answer questions using the provided context. If the answer is not in the context, say you don't know rather than making up an answer. 370 | 371 | Please provide a detailed answer to the question above only using the context provided. 372 | Include in-text citations like this [1] for each significant fact or statement at the end of the sentence. 373 | At the end of your response, list all sources in a citation section with the format: [citation number] Name - URL.
374 | """, 375 | ) 376 | chain = load_qa_chain(self.llm, chain_type="stuff", prompt=prompt) 377 | result = chain.invoke({"input_documents": documents, "question": llm_query}, return_only_outputs=True) 378 | logger.debug(f"Generated answer: {result['output_text']}") 379 | print(result["output_text"]) 380 | 381 | # Function to authenticate and create a Gmail API client 382 | def authenticate_gmail(): 383 | creds = None 384 | SCOPES = ['https://www.googleapis.com/auth/gmail.readonly'] 385 | token_path = 'token.json' 386 | creds_path = 'credentials.json' 387 | 388 | # The token.json file stores the user's access and refresh tokens and is created automatically when the authorization flow completes for the first time. 389 | if os.path.exists(token_path): 390 | creds = Credentials.from_authorized_user_file(token_path, SCOPES) 391 | 392 | # If there are no (valid) credentials available, let the user log in. 393 | if not creds or not creds.valid: 394 | if creds and creds.expired and creds.refresh_token: 395 | creds.refresh(GoogleRequest()) 396 | else: 397 | flow = InstalledAppFlow.from_client_secrets_file(creds_path, SCOPES) 398 | creds = flow.run_local_server(port=0) 399 | # Save the credentials for the next run 400 | with open(token_path, 'w') as token: 401 | token.write(creds.to_json()) 402 | 403 | return build('gmail', 'v1', credentials=creds) 404 | 405 | # Function to search for emails in Gmail 406 | async def fetch_emails_gmail(query: str, max_results: int) -> List[Dict[str, str]]: 407 | service = authenticate_gmail() 408 | results = service.users().messages().list(userId='me', q=query, maxResults=max_results).execute() 409 | messages = results.get('messages', []) 410 | 411 | emails = [] 412 | for message in messages: 413 | msg = service.users().messages().get(userId='me', id=message['id']).execute() 414 | payload = msg.get('payload', {}) 415 | headers = payload.get('headers', []) 416 | subject = next((header['value'] for header in headers if header['name'] == 'Subject'), 'No Subject') 417 | snippet = msg.get('snippet', '') 418 | email_data = { 419 | 'id': message['id'], 420 | 'subject': subject, 421 | 'snippet': snippet 422 | } 423 | emails.append(email_data) 424 | 425 | return emails 426 | 427 | # Function to download the full content of an email 428 | async def fetch_email_content_gmail(email_id: str) -> str: 429 | service = authenticate_gmail() 430 | msg = service.users().messages().get(userId='me', id=email_id, format='full').execute() 431 | payload = msg.get('payload', {}) 432 | parts = payload.get('parts', []) 433 | body = "" 434 | 435 | for part in parts: 436 | if part['mimeType'] == 'text/plain': 437 | body = extract_body_from_part(part) 438 | break 439 | 440 | return body 441 | 442 | # Function to remove HTTP links from email content 443 | def strip_http_links(content: str) -> str: 444 | return re.sub(r'http[s]?://\S+', '', content) 445 | 446 | def extract_body_from_part(part): 447 | body = "" 448 | if 'body' in part and 'data' in part['body']: 449 | body = base64.urlsafe_b64decode(part['body']['data']).decode('utf-8') 450 | elif 'parts' in part: 451 | for sub_part in part['parts']: 452 | body += extract_body_from_part(sub_part) 453 | return body 454 | 455 | # Integrate Gmail search into main function 456 | async def main(): 457 | parser = argparse.ArgumentParser(description="PyPlexitas - Open source CLI alternative to Perplexity AI by Dennis Kruyt") 458 | parser.add_argument("-q", "--query", type=str, required=True, help="Search Query") 459 | parser.add_argument("--llm-query", 
type=str, help="Optional LLM Query") 460 | parser.add_argument("-s", "--search", type=int, default=10, help="Number of search results to parse") 461 | parser.add_argument("--engine", type=str, choices=['bing', 'google', 'gmail'], default='bing', help="Search engine to use (bing, google, gmail)") 462 | parser.add_argument("-l", "--log-level", type=str, default="ERROR", help="Set the logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)") 463 | parser.add_argument("-t", "--max-tokens", type=int, default=DEFAULT_MAX_TOKENS, help="Maximum token limit for model input") 464 | parser.add_argument("--quiet", action='store_true', help="Suppress print messages") 465 | args = parser.parse_args() 466 | 467 | logger.setLevel(args.log_level.upper()) 468 | logging.basicConfig(level=args.log_level.upper()) 469 | 470 | logger.info("Application started") 471 | logger.debug(f"Received arguments: {args}") 472 | 473 | query = args.query 474 | llm_query = args.llm_query or query 475 | search_count = args.search 476 | max_tokens = args.max_tokens 477 | search_engine = args.engine 478 | verbose = not args.quiet 479 | 480 | logger.info(f"Searching for: {query} using {search_engine}") 481 | if verbose: print(f"Searching for 🔎: {query} using {search_engine}") 482 | request = Request(query) 483 | llm_agent = LLMAgent() 484 | 485 | if search_engine == 'gmail': 486 | emails = await fetch_emails_gmail(query, search_count) 487 | for email in emails: 488 | email_id = email['id'] 489 | subject = email['subject'] 490 | snippet = email['snippet'] 491 | content = await fetch_email_content_gmail(email_id) 492 | stripped_content = strip_http_links(content) 493 | 494 | search_result = SearchResult(name=subject, url=f"gmail://{email_id}", content=stripped_content) 495 | request.add_search_result(search_result) 496 | else: 497 | # Fetch search results 498 | logger.info("Fetching search results...") 499 | if search_engine == 'bing': 500 | await fetch_web_pages_bing(request, search_count, verbose) 501 | else: 502 | await fetch_web_pages_google(request, search_count, verbose) 503 | 504 | if search_engine != 'gmail': 505 | # Extract unique base names from URLs 506 | if verbose: 507 | unique_basenames = set(urlparse(search_result.url).netloc for search_result in request.search_map.values()) 508 | 509 | print("From domains 🌐: ", end="") 510 | for basename in unique_basenames: 511 | print(basename, " ", end="") 512 | print("") 513 | 514 | # Scrape content 515 | logger.info("Scraping content from search results...") 516 | if verbose: print(f"Scraping content from search results...") 517 | await process_urls(request) 518 | 519 | # Generate and upsert embeddings 520 | logger.info("Embedding content 📥") 521 | if verbose: print(f"Embedding content ✨") 522 | dimension = len(llm_agent.embeddings.embed_query(llm_query)) 523 | vector_client = QdrantClient(host="localhost", port=6333) 524 | 525 | collection_name = "embeddings" 526 | if vector_client.collection_exists(collection_name): 527 | vector_client.delete_collection(collection_name) 528 | 529 | vector_client.create_collection( 530 | collection_name=collection_name, 531 | vectors_config=models.VectorParams(size=dimension, distance=models.Distance.COSINE), 532 | ) 533 | 534 | total_chunks = await generate_upsert_embeddings(request, vector_client) 535 | 536 | if verbose: 537 | print(f"Total embeddings 📊: {len(request.search_map)}") 538 | print(f"Total chunks processed 🧩: {total_chunks}") 539 | 540 | # Search across embeddings 541 | prompt_embedding = llm_agent.embeddings.embed_query(llm_query) 
542 | search_result = vector_client.search( 543 | collection_name="embeddings", 544 | query_vector=prompt_embedding, 545 | limit=10, 546 | ) 547 | chunk_ids = [result.id for result in search_result] 548 | chunks = request.get_chunks(chunk_ids) 549 | 550 | # Answer the question 551 | await llm_agent.answer_question_stream(llm_query, chunks, max_tokens) 552 | 553 | if __name__ == "__main__": 554 | asyncio.run(main()) 555 | 556 | --------------------------------------------------------------------------------