├── .gitignore ├── requirements.txt ├── TODO.md ├── docker-compose.yml ├── example.env ├── CHANGELOG.md ├── README.md └── PyPlexitas.py /.gitignore: -------------------------------------------------------------------------------- 1 | .venv 2 | .env 3 | token.json 4 | credentials.json -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | langchain==0.2.0 2 | langchain-community==0.2.0 3 | langchain_openai==0.1.7 4 | requests==2.31.0 5 | qdrant-client 6 | lxml 7 | cssselect 8 | python-dotenv 9 | google-auth 10 | google-auth-oauthlib 11 | google-auth-httplib2 12 | google-api-python-client -------------------------------------------------------------------------------- /TODO.md: -------------------------------------------------------------------------------- 1 | # TODO 2 | 3 | 1. **Speed Optimizations** 4 | - Identify and optimize slow-performing segments of the code. 5 | 6 | 2. **Better Configurable Parameters** 7 | - Review current configuration parameters and identify areas for improvement. 8 | - Introduce new parameters to enhance configurability and flexibility. 9 | - Ensure documentation is updated to reflect changes in configuration options. 10 | 11 | 3. **Gmail Improvement** 12 | - Investigate issues and areas of enhancement within existing Gmail integration. 13 | - Improve the reliability and performance of Gmail-related features. 14 | 15 | -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | services: 2 | qdrant: 3 | image: qdrant/qdrant:latest 4 | restart: always 5 | container_name: qdrant 6 | ports: 7 | - 6333:6333 8 | - 6334:6334 9 | expose: 10 | - 6333 11 | - 6334 12 | - 6335 13 | configs: 14 | - source: qdrant_config 15 | target: /qdrant/config/production.yaml 16 | volumes: 17 | - qdrant_data_volume:/qdrant_data 18 | networks: 19 | - PyPlexitas 20 | 21 | configs: 22 | qdrant_config: 23 | content: | 24 | log_level: INFO 25 | 26 | volumes: 27 | qdrant_data_volume: 28 | driver: local 29 | 30 | networks: 31 | PyPlexitas: 32 | driver: bridge -------------------------------------------------------------------------------- /example.env: -------------------------------------------------------------------------------- 1 | # OpenAI config 2 | OPENAI_API_KEY="" 3 | # OPENAI_BASE_URL="" # Leave blank for default 4 | EMBEDDING_MODEL_NAME="text-embedding-ada-002" 5 | CHAT_MODEL_NAME="gpt-4o" 6 | 7 | # Ollama config 8 | USE_OLLAMA=false 9 | # OLLAMA_API_KEY="ollama" 10 | OLLAMA_BASE_URL="http://localhost:11434/v1" 11 | OLLAMA_EMBEDDING_MODEL_NAME="llama3" 12 | OLLAMA_CHAT_MODEL_NAME="llama3" 13 | 14 | # Google config: see https://cse.google.com/cse/create/new and https://developers.google.com/custom-search/v1/overview for the API key and CX 15 | GOOGLE_API_KEY="" 16 | GOOGLE_CX="" 17 | # GOOGLE_ENDPOINT="" # Leave blank for default 18 | 19 | # Bing config 20 | BING_SUBSCRIPTION_KEY="" 21 | # BING_ENDPOINT="" # Leave blank for default -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | ## [Unreleased] 4 | 5 | ## [v0.2.0] - 2024-05-22 6 | 7 | ### Added 8 | - Optional LLM query parameter `--llm-query`.
9 | - New TODO.md file outlining tasks related to speed optimizations, configurable parameters review, and Gmail integration improvements. 10 | - Integration of Gmail search functionality into the main function, allowing users to search and fetch emails from Gmail using a specified query and result count, including email content retrieval. 11 | 12 | ### Changed 13 | - Updated `README.md` to reflect the new optional LLM query parameter and increased default max-tokens limit. 14 | - Updated default constants for chunk size and max tokens. 15 | - Enhanced `fetch_url_content` function with a timeout parameter. 16 | 17 | ### Fixed 18 | - Use correct query variable for embedding calculation in PyPlexitas. 19 | - Added missing Google libraries to `requirements.txt`. 20 | 21 | ### Removed 22 | - Removed token.json and credentials.json from version control by adding them to `.gitignore`. 23 | 24 | ## [v0.1.0] - 2024-05-20 25 | 26 | ### Added 27 | - Initial project files and configurations. 28 | - Docker Compose configuration for Qdrant container including services, configs, volumes, and networks. 29 | - Functionality to load environment variables from `.env` file and handle missing Bing API key gracefully. 30 | - Embedding selection logic based on the value of `USE_OLLAMA`. 31 | - Logging and debug statements throughout the codebase. 32 | - Google search functionality with the necessary endpoint and parameters. 33 | 34 | ### Changed 35 | - Refactored imports in `PyPlexitas.py` to include both OpenAI and OpenAIEmbeddings from `langchain_openai`. 36 | - Updated logging configuration to set the logging level to DEBUG for more detailed logging information. 37 | - Enhanced LLMAgent class by adding an optional debug parameter to print document content for debugging purposes. 38 | - Adjusted the argparse description to reflect the project name PyPlexitas. 39 | - Implemented collection handling in `vector_client`, including deleting the collection if it already exists before recreating it with the appropriate configurations. 40 | 41 | ### Fixed 42 | - Corrected API key and model names in environment configuration. 43 | - Set global User-Agent for all API requests. 44 | - Added missing import statement for `hashlib` in `hash_string` function. 45 | 46 | ### Documentation 47 | - Added `README.md` for PyPlexitas including features, installation instructions, configuration details, and usage guidelines. 48 | - Updated `README.md` formatting and added license information. 49 | 50 | ### Chore 51 | - Improved logging and print messages for search operations, including emoji indicators. 52 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🌟 PyPlexitas 2 | 3 | PyPlexitas is a Python script designed as an open-source alternative to Perplexity AI: it provides users with detailed answers to their queries by searching the web, extracting relevant content, and using advanced language models to generate responses. 4 | 5 | The script first takes a user’s query and uses a search engine (Bing or Google) or Gmail to find relevant web pages or emails. It then scrapes the content from these pages or emails, splits the text into manageable chunks, and generates vector embeddings for these chunks. Vector embeddings are mathematical representations of text that allow for efficient searching and comparison of content.
These embeddings are stored in a database, enabling quick retrieval of relevant information based on the user’s query. 6 | 7 | Once the content is processed and stored, the script uses a language model to generate a detailed answer to the user’s query, using the information extracted from the web pages. This response is designed to be accurate and informative, drawing directly from the content found during the search process. 8 | 9 | **Example:** 10 | ```bash 11 | python PyPlexitas.py -q "When will the next model GPT-5 be released" -s 10 --engine google 12 | ``` 13 | 14 | Expected Output: 15 | ``` 16 | Searching for 🔎: When will the next model GPT-5 be released using google 17 | Starting Google search ⏳ 18 | Google search returned 🔗: 10 results 19 | From domains 🌐: mashable.com www.reddit.com www.tomsguide.com www.datacamp.com medium.com www.standard.co.uk www.theverge.com arstechnica.com 20 | Scraping content from search results... 21 | Embedding content ✨ 22 | Total embeddings 📊: 10 23 | Total chunks processed 🧩: 7 24 | 25 | Answering your query: When will the next model GPT-5 be released 🙋 26 | 27 | The release date for GPT-5 is currently expected to be sometime in mid-2024, likely during the summer, according to a report from Business Insider [1][2]. OpenAI representatives have not provided a specific release date, and the timeline may be subject to change depending on the duration of safety testing and other factors [1][2]. OpenAI CEO Sam Altman has indicated that a major AI model will be released this year, but it is unclear whether it will be called GPT-5 or something else [1]. 28 | 29 | ### Sources 30 | 1. Benj Edwards - https://arstechnica.com/information-technology/2024/03/gpt-5-might-arrive-this-summer-as-a-materially-better-update-to-chatgpt/ 31 | 2. Saqib Shah - https://www.standard.co.uk/tech/openai-chatgpt-5-release-date-b1076129.html 32 | ``` 33 | 34 | ## Table of Contents 35 | - [Features](#features) 36 | - [Installation](#installation) 37 | - [Configuration](#configuration) 38 | - [Getting API Keys](#getting-api-keys) 39 | - [Usage](#usage) 40 | - [Project Structure](#project-structure) 41 | - [Contributing](#contributing) 42 | - [License](#license) 43 | 44 | ## Features 45 | - **Web Search**: Perform web searches using Bing, Google, or Gmail APIs. 46 | - **Content Scraping**: Scrape content from search results. 47 | - **Embedding Generation**: Generate embeddings for content using OpenAI or Ollama models. 48 | - **Question Answering**: Answer questions based on the scraped content. 49 | - **Vector Database**: Use Qdrant for storing and querying embeddings. 50 | 51 | ## Installation 52 | 1. Clone the repository: 53 | ```bash 54 | git clone https://github.com/dkruyt/PyPlexitas.git 55 | cd PyPlexitas 56 | ``` 57 | 58 | 2. Install the required Python packages: 59 | ```bash 60 | pip install -r requirements.txt 61 | ``` 62 | 63 | 3. Set up the Qdrant service using Docker: 64 | ```bash 65 | docker-compose up -d 66 | ``` 67 | 68 | ## Configuration 69 | Configure your environment variables by creating a `.env` file in the project root. Use the provided `example.env` as a template: 70 | ```bash 71 | cp example.env .env 72 | ``` 73 | Fill in your API keys and other necessary details in the `.env` file: 74 | - `OPENAI_API_KEY` 75 | - `GOOGLE_API_KEY` 76 | - `GOOGLE_CX` 77 | - `BING_SUBSCRIPTION_KEY` 78 | 79 | ## Getting API Keys 80 | 81 | ### OpenAI API Key 82 | 1. Sign up or log in to your [OpenAI account](https://www.openai.com/). 83 | 2. 
Go to the API section and generate a new API key. 84 | 3. Copy the API key and add it to the `OPENAI_API_KEY` field in your `.env` file. 85 | 86 | ### Google Custom Search API Key and CX 87 | 1. Go to the [Google Cloud Console](https://console.cloud.google.com/). 88 | 2. Create a new project or select an existing project. 89 | 3. Enable the Custom Search API in the API & Services library. 90 | 4. Go to the [Credentials](https://console.cloud.google.com/apis/credentials) page and create an API key. 91 | 5. Copy the API key and add it to the `GOOGLE_API_KEY` field in your `.env` file. 92 | 6. To get the Custom Search Engine (CX) ID, go to the [Custom Search Engine](https://cse.google.com/cse/) page. 93 | 7. Create a new search engine or select an existing one. 94 | 8. Copy the Search Engine ID (CX) and add it to the `GOOGLE_CX` field in your `.env` file. 95 | 96 | ### Bing Search API Key 97 | 1. Sign up or log in to your [Microsoft Azure account](https://azure.microsoft.com/). 98 | 2. Create a new Azure resource for Bing Search v7. 99 | 3. Go to the Keys and Endpoint section to find your API key. 100 | 4. Copy the API key and add it to the `BING_SUBSCRIPTION_KEY` field in your `.env` file. 101 | 102 | ### Gmail API Key 103 | 1. Go to the [Google Cloud Console](https://console.cloud.google.com/). 104 | 2. Create a new project or select an existing project. 105 | 3. Enable the Gmail API in the API & Services library. 106 | 4. Go to the [Credentials](https://console.cloud.google.com/apis/credentials) page and create OAuth 2.0 credentials. 107 | 5. Download the credentials file and save it as `credentials.json` in the project root. 108 | 109 | ## Usage 110 | Run the PyPlexitas script with your query: 111 | ```bash 112 | python PyPlexitas.py -q "Your search query" -s 10 --engine bing 113 | ``` 114 | Options: 115 | - `-q, --query`: Search query (required) 116 | - `--llm-query`: Optional LLM query for answering (defaults to the search query) 117 | - `-s, --search`: Number of search results to parse (default: 10) 118 | - `--engine`: Search engine to use (`bing`, `google`, or `gmail`, default: `bing`) 119 | - `-l, --log-level`: Set the logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`, default: `ERROR`) 120 | - `-t, --max-tokens`: Maximum token limit for model input (default: 16000) 121 | - `--quiet`: Suppress print messages 122 | 123 | ## Project Structure 124 | - `PyPlexitas.py`: Main script for running the application. 125 | - `example.env`: Example configuration file for environment variables. 126 | - `docker-compose.yml`: Docker Compose configuration for Qdrant. 127 | - `requirements.txt`: List of required Python packages. 128 | - `README.md`: Project documentation. 129 | 130 | ## Contributing 131 | Contributions are welcome! Please fork the repository and submit a pull request for any improvements or bug fixes. 132 | 133 | ## License 134 | This project is licensed under the GPL 3.0 License. See the [LICENSE](LICENSE) file for details. 
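## Appendix: Pipeline Sketch

For readers who want to see how the pieces described above fit together, below is a minimal, illustrative sketch of the chunk → embed → store → retrieve loop, written against the same libraries PyPlexitas already uses (`langchain_openai` and `qdrant-client`). It is not the project's own code: it assumes `OPENAI_API_KEY` is set in `.env` and that the Qdrant container from `docker-compose.yml` is running on `localhost:6333`, and names such as `demo_chunks` and `demo_embeddings` are made up for the example.

```python
# Minimal sketch of embed -> store -> retrieve with Qdrant (illustrative only).
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.http import models

load_dotenv()  # picks up OPENAI_API_KEY from .env

# Pretend these are chunks scraped from two web pages.
demo_chunks = [
    "GPT-5 is expected sometime in mid-2024, according to press reports.",
    "OpenAI has not confirmed an official release date for its next model.",
]

embeddings = OpenAIEmbeddings()
vectors = [embeddings.embed_query(chunk) for chunk in demo_chunks]

client = QdrantClient(host="localhost", port=6333)
collection = "demo_embeddings"
if client.collection_exists(collection):
    client.delete_collection(collection)
client.create_collection(
    collection_name=collection,
    vectors_config=models.VectorParams(size=len(vectors[0]), distance=models.Distance.COSINE),
)

# Store one point per chunk; the point id doubles as the chunk id.
client.upsert(
    collection_name=collection,
    points=[models.PointStruct(id=i, vector=v) for i, v in enumerate(vectors)],
)

# Embed the question and retrieve the closest chunks; their text would then be
# handed to the chat model as context for the final answer.
hits = client.search(
    collection_name=collection,
    query_vector=embeddings.embed_query("When will GPT-5 be released?"),
    limit=2,
)
print([demo_chunks[hit.id] for hit in hits])
```

PyPlexitas performs the same steps per search result in `generate_upsert_embeddings` and `process_chunk`, then hands the retrieved chunks to the chat model in `answer_question_stream`.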
135 | -------------------------------------------------------------------------------- /PyPlexitas.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import json 4 | import hashlib 5 | import logging 6 | import argparse 7 | import asyncio 8 | from aiohttp import ClientSession, ClientError, ClientSSLError 9 | from typing import List, Dict, Optional 10 | from urllib.parse import urlparse 11 | import base64 12 | import google.auth 13 | from google.auth.transport.requests import Request as GoogleRequest 14 | from google.oauth2.credentials import Credentials 15 | from google_auth_oauthlib.flow import InstalledAppFlow 16 | from googleapiclient.discovery import build 17 | 18 | import aiohttp 19 | from lxml import html 20 | from langchain_community.llms import Ollama 21 | from langchain_community.embeddings import OllamaEmbeddings 22 | from langchain_openai import OpenAIEmbeddings, ChatOpenAI 23 | from langchain.prompts import PromptTemplate 24 | from langchain.chains import ConversationChain 25 | from langchain.chains.question_answering import load_qa_chain 26 | from langchain.memory import ConversationBufferMemory 27 | from langchain.docstore.document import Document 28 | from qdrant_client import QdrantClient 29 | from qdrant_client.http import models 30 | from dotenv import load_dotenv 31 | 32 | # Load environment variables from .env file 33 | load_dotenv() 34 | 35 | # Constants 36 | # Default endpoint for Bing Search API 37 | DEFAULT_BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search" 38 | # Default endpoint for Google Custom Search API 39 | DEFAULT_GOOGLE_ENDPOINT = "https://www.googleapis.com/customsearch/v1" 40 | # Default base URL for OpenAI API 41 | DEFAULT_OPENAI_BASE_URL = "https://api.openai.com/v1" 42 | # Default chunk size for processing text data 43 | CHUNK_SIZE = 800 44 | # Dimension of the embedding vectors 45 | DIMENSION = 1536 46 | # Default maximum number of tokens for input/output in API requests 47 | DEFAULT_MAX_TOKENS = 16000 48 | # User-agent defined globally, to be used in all API requests; can be helpful in bypassing logins and paywalls 49 | USER_AGENT = "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 50 | 51 | # Initialize logger 52 | logger = logging.getLogger(__name__) 53 | 54 | # Request class 55 | class Request: 56 | def __init__(self, query: str): 57 | self.query = query 58 | self.search_map: Dict[str, SearchResult] = {} 59 | self.chunk_id_chunk_map: Dict[int, str] = {} 60 | self.chunk_id_to_search_id: Dict[int, str] = {} 61 | 62 | def add_search_result(self, search_result: "SearchResult"): 63 | url_hash = hash_string(search_result.url) 64 | self.search_map[url_hash] = search_result 65 | 66 | def add_webpage_content(self, url_hash: str, content: str): 67 | if (search := self.search_map.get(url_hash, None)) is not None: 68 | search.content = content 69 | logger.debug(f"Added content for URL hash {url_hash}") 70 | 71 | def add_id_to_chunk(self, chunk: str, search_result_id: str, chunk_id: int): 72 | self.chunk_id_chunk_map[chunk_id] = chunk 73 | self.chunk_id_to_search_id[chunk_id] = search_result_id 74 | logger.debug(f"Chunk added for search result ID {search_result_id} with chunk ID {chunk_id}") 75 | 76 | def get_chunks(self, ids: List[int]) -> List["Chunk"]: 77 | chunks = [] 78 | for chunk_id in ids: 79 | chunk_content =
self.chunk_id_chunk_map.get(chunk_id) 80 | search_id = self.chunk_id_to_search_id.get(chunk_id) 81 | search_result = self.search_map.get(search_id) 82 | if search_result and chunk_content: 83 | chunk = Chunk( 84 | content=chunk_content, 85 | name=search_result.name, 86 | url=search_result.url, 87 | ) 88 | chunks.append(chunk) 89 | return chunks 90 | 91 | # SearchResult class 92 | class SearchResult: 93 | def __init__(self, name: str, url: str, content: Optional[str] = None): 94 | self.name = name 95 | self.url = url 96 | self.content = content 97 | 98 | # Chunk class 99 | class Chunk: 100 | def __init__(self, content: str, name: str, url: str): 101 | self.content = content 102 | self.name = name 103 | self.url = url 104 | 105 | # Function to hash strings, typically used for URL hashing or similar purposes. 106 | def hash_string(input_string: str) -> str: 107 | return hashlib.sha256(input_string.encode()).hexdigest() 108 | 109 | # Function to clean text by removing extra whitespace and normalizing spaces. 110 | def clean_text(text: str) -> str: 111 | return re.sub(r"\s+", " ", text) 112 | 113 | # Function to perform web search using Bing's search API. 114 | async def fetch_web_pages_bing(request: Request, search_count: int, verbose: bool): 115 | if verbose: print("Starting Bing search ⏳") 116 | 117 | query = request.query 118 | mkt = "en-US" 119 | count_str = str(search_count) 120 | 121 | logger.debug(f"Search parameters - Query: {query}, Market: {mkt}, Count: {count_str}") 122 | 123 | params = { 124 | "mkt": mkt, 125 | "q": query, 126 | "count": count_str, 127 | } 128 | 129 | bing_endpoint = os.getenv("BING_ENDPOINT", DEFAULT_BING_ENDPOINT) 130 | bing_api_key = os.getenv("BING_SUBSCRIPTION_KEY") 131 | if not bing_api_key: 132 | logger.error("Bing API key is missing. Please set BING_SUBSCRIPTION_KEY in your environment variables.") 133 | return 134 | 135 | headers = { 136 | "Ocp-Apim-Subscription-Key": bing_api_key, 137 | "User-Agent": USER_AGENT, # Using the global variable 138 | } 139 | 140 | async with aiohttp.ClientSession() as session: 141 | async with session.get(bing_endpoint, params=params, headers=headers) as response: 142 | if response.status == 200: 143 | json_data = await response.json() 144 | logger.info(f"Bing search returned: {len(json_data['webPages']['value'])} results") 145 | if verbose: print(f"Bing search returned 🔗: {len(json_data['webPages']['value'])} results") 146 | 147 | for wp in json_data["webPages"]["value"]: 148 | search_result = SearchResult( 149 | name=wp["name"], 150 | url=wp["url"], 151 | ) 152 | request.add_search_result(search_result) 153 | 154 | logger.debug(f"JSON result from Bing: {json.dumps(json_data, indent=2)}") 155 | else: 156 | logger.error(f"Request failed with status code: {response.status}") 157 | raise Exception(f"Request failed with status code: {response.status}") 158 | 159 | # Function to perform web search using Google's search API. 160 | async def fetch_web_pages_google(request: Request, search_count: int, verbose: bool): 161 | if verbose: print("Starting Google search ⏳") 162 | 163 | query = request.query 164 | 165 | google_endpoint = os.getenv("GOOGLE_ENDPOINT", DEFAULT_GOOGLE_ENDPOINT) 166 | google_api_key = os.getenv("GOOGLE_API_KEY") 167 | google_cx = os.getenv("GOOGLE_CX") 168 | if not google_api_key or not google_cx: 169 | logger.error("Google API key or CX is missing.
Please set GOOGLE_API_KEY and GOOGLE_CX in your environment variables.") 170 | return 171 | 172 | params = { 173 | "key": google_api_key, 174 | "cx": google_cx, 175 | "q": query, 176 | "num": search_count 177 | } 178 | 179 | async with aiohttp.ClientSession() as session: 180 | async with session.get(google_endpoint, params=params) as response: 181 | if response.status == 200: 182 | json_data = await response.json() 183 | logger.info(f"Google search returned: {len(json_data['items'])} results") 184 | if verbose: print(f"Google search returned 🔗: {len(json_data['items'])} results") 185 | 186 | for item in json_data["items"]: 187 | search_result = SearchResult( 188 | name=item["title"], 189 | url=item["link"], 190 | ) 191 | request.add_search_result(search_result) 192 | 193 | logger.debug(f"JSON result from Google: {json.dumps(json_data, indent=2)}") 194 | else: 195 | logger.error(f"Request failed with status code: {response.status}") 196 | raise Exception(f"Request failed with status code: {response.status}") 197 | 198 | # Function to scrape content from given URLs by making HTTP requests. 199 | async def fetch_url_content(session: ClientSession, url: str, max_retries: int = 3, retry_delay: int = 1, timeout: int = 5) -> str: 200 | logger.info(f"Scraping content from URL: {url}") 201 | retries = 0 202 | while retries < max_retries: 203 | try: 204 | async with session.get(url, timeout=timeout) as response: 205 | if response.status == 200: 206 | full_text = await response.text() 207 | document = html.fromstring(full_text) 208 | 209 | selectors = [ 210 | "article", # good for blog posts/articles 211 | "div.main-content", # a more specific div that usually holds main content 212 | "body", # generic selector 213 | ] 214 | 215 | main_text = "" 216 | for selector in selectors: 217 | elements = document.cssselect(selector) 218 | if elements: 219 | main_text = "\n".join([clean_text(element.text_content()) for element in elements if element.text_content()]) 220 | break 221 | 222 | logger.debug(f"Extracted content: '{main_text}' from URL: {url}") 223 | if not main_text.strip(): 224 | logger.warning(f"No content extracted from URL: {url}") 225 | return main_text 226 | else: 227 | logger.warning(f"Request failed with status code: {response.status}") 228 | return "" 229 | except (ClientError, ClientSSLError, asyncio.TimeoutError) as e: 230 | logger.warning(f"Error occurred while scraping URL: {url}. Error: {str(e)}") 231 | retries += 1 232 | await asyncio.sleep(retry_delay) 233 | logger.error(f"Failed to scrape content from URL: {url} after {max_retries} retries.") 234 | return "" 235 | 236 | # Function to handle the URL processing by scraping content from each URL found in the search results. 237 | async def process_urls(request: Request): 238 | logger.info("Processing URLs to scrape content...") 239 | async with aiohttp.ClientSession(headers={"User-Agent": USER_AGENT}) as session: 240 | tasks = [] 241 | for search_result in request.search_map.values(): 242 | url = search_result.url 243 | task = asyncio.create_task(fetch_url_content(session, url)) 244 | tasks.append(task) 245 | 246 | webpage_contents = await asyncio.gather(*tasks) 247 | 248 | for search_result, content in zip(request.search_map.values(), webpage_contents): 249 | url_hash = hash_string(search_result.url) 250 | logger.debug(f"Adding webpage content for URL hash {url_hash}. 
Content length: {len(content)}") 251 | request.add_webpage_content(url_hash, content) 252 | 253 | # Embedding generation and upsert 254 | async def insert_embedding(vector_client: QdrantClient, embedding: List[float], chunk_id: int): 255 | logger.debug(f"Inserting embedding for chunk ID {chunk_id}") 256 | vector_client.upsert( 257 | collection_name="embeddings", 258 | points=[ 259 | models.PointStruct( 260 | id=chunk_id, 261 | vector=embedding, 262 | ) 263 | ], 264 | ) 265 | 266 | # Function to generate embeddings for web page content and upsert them into the vector database. 267 | # This function iterates over the search results, chunks the content, and processes each chunk to create and store embeddings. 268 | async def generate_upsert_embeddings(request: Request, vector_client: QdrantClient) -> int: 269 | logger.info("Generating and upserting embeddings...") 270 | tasks = [] 271 | shared_counter = 0 272 | 273 | for url_hash, search_result in request.search_map.items(): 274 | content = search_result.content or "" 275 | logger.debug(f"Content length for URL hash {url_hash}: {len(content)}") 276 | 277 | if not content.strip(): 278 | logger.warning(f"Skipping URL {search_result.url} due to empty or non-relevant content") 279 | continue 280 | 281 | # Split content into chunks 282 | chunks = [] 283 | words = content.split() 284 | for i in range(0, len(words), CHUNK_SIZE): 285 | chunk = " ".join(words[i:i+CHUNK_SIZE]) 286 | chunks.append(chunk) 287 | 288 | if not chunks: 289 | chunks = [content] 290 | 291 | logger.info(f"Chunked content into {len(chunks)} chunks for url: {search_result.url}") 292 | logger.info(f"Generating embedding for url: {search_result.url}") 293 | 294 | for chunk in chunks: 295 | task = asyncio.create_task(process_chunk(request, vector_client, shared_counter, url_hash, chunk)) 296 | tasks.append(task) 297 | shared_counter += 1 298 | 299 | await asyncio.gather(*tasks) 300 | return shared_counter 301 | 302 | # Function to process a content chunk, generate its embedding, and upsert the embedding into the vector database. 
303 | async def process_chunk(request: Request, vector_client: QdrantClient, shared_counter: int, url_hash: str, chunk: str): 304 | logger.debug(f"Processing chunk with ID {shared_counter} for URL hash {url_hash}") 305 | if os.getenv("USE_OLLAMA", "false").lower() == "true": 306 | embeddings = OllamaEmbeddings(model=os.getenv("OLLAMA_EMBEDDING_MODEL_NAME", "llama3")) 307 | else: 308 | embeddings = OpenAIEmbeddings(model=os.getenv("EMBEDDING_MODEL_NAME", "text-embedding-ada-002")) 309 | embedding = embeddings.embed_query(chunk) 310 | 311 | chunk_id = shared_counter 312 | 313 | request.add_id_to_chunk(chunk, url_hash, chunk_id) 314 | logger.debug(f"Processed and added chunk with ID {chunk_id} for URL hash {url_hash}") 315 | 316 | await insert_embedding(vector_client, embedding, chunk_id) 317 | 318 | # LLM agent 319 | class LLMAgent: 320 | def __init__(self): 321 | base_url = os.getenv("OPENAI_BASE_URL", DEFAULT_OPENAI_BASE_URL) 322 | api_key = os.environ.get("OPENAI_API_KEY", "") 323 | use_ollama = os.getenv("USE_OLLAMA", "false").lower() == "true" 324 | 325 | embedding_model_name = os.getenv("OLLAMA_EMBEDDING_MODEL_NAME", "llama3") if use_ollama else os.getenv("EMBEDDING_MODEL_NAME", "text-embedding-ada-002")  # honor the embedding model configured in .env 326 | chat_model_name = os.getenv("CHAT_MODEL_NAME", "gpt-4o") 327 | ollama_chat_model_name = os.getenv("OLLAMA_CHAT_MODEL_NAME", "llama3") 328 | 329 | self.local_mode = use_ollama 330 | if self.local_mode: 331 | print(f"Using Ollama model 🧠 for embeddings: {embedding_model_name} and chat: {ollama_chat_model_name}") 332 | self.embeddings = OllamaEmbeddings(model=embedding_model_name) 333 | self.llm = Ollama(model=ollama_chat_model_name) 334 | else: 335 | print(f"Using OpenAI model 🧠 for embeddings and chat: {chat_model_name}") 336 | self.embeddings = OpenAIEmbeddings(openai_api_key=api_key, model=embedding_model_name) 337 | self.llm = ChatOpenAI(openai_api_key=api_key, model_name=chat_model_name) 338 | 339 | logger.info(f"LLM Agent initialized. Local mode: {self.local_mode}") 340 | 341 | def chunk_to_documents(self, chunks: List[Chunk], max_tokens: int) -> List[Document]: 342 | documents = [] 343 | current_tokens = 0 344 | 345 | for chunk in chunks: 346 | chunk_tokens = len(chunk.content.split()) # Simple token estimation using word count 347 | if current_tokens + chunk_tokens > max_tokens: 348 | break 349 | documents.append(Document(page_content=chunk.content, metadata={"name": chunk.name, "url": chunk.url})) 350 | current_tokens += chunk_tokens 351 | 352 | logger.debug(f"Converted {len(chunks)} chunks into {len(documents)} documents") 353 | return documents 354 | 355 | async def answer_question_stream(self, llm_query: str, chunks: List[Chunk], max_tokens: int): 356 | print(f"\nAnswering your query: {llm_query} 🙋\n") 357 | documents = self.chunk_to_documents(chunks, max_tokens) 358 | logger.debug(f"Documents metadata:\n{[doc.metadata for doc in documents]}") 359 | prompt = PromptTemplate( 360 | input_variables=["context", "question"], 361 | template=""" 362 | CONTEXT: 363 | {context} 364 | 365 | QUESTION: 366 | {question} 367 | 368 | INSTRUCTIONS: 369 | You are a helpful AI assistant that helps users answer questions using the provided context. If the answer is not in the context, say you don't know rather than making up an answer. 370 | 371 | Please provide a detailed answer to the question above only using the context provided. 372 | Include in-text citations like this [1] for each significant fact or statement at the end of the sentence. 373 | At the end of your response, list all sources in a citation section with the format: [citation number] Name - URL.
374 | """, 375 | ) 376 | chain = load_qa_chain(self.llm, chain_type="stuff", prompt=prompt) 377 | result = chain.invoke({"input_documents": documents, "question": llm_query}, return_only_outputs=True) 378 | logger.debug(f"Generated answer: {result['output_text']}") 379 | print(result["output_text"]) 380 | 381 | # Function to authenticate and create a Gmail API client 382 | def authenticate_gmail(): 383 | creds = None 384 | SCOPES = ['https://www.googleapis.com/auth/gmail.readonly'] 385 | token_path = 'token.json' 386 | creds_path = 'credentials.json' 387 | 388 | # The token.json file stores the user's access and refresh tokens and is created automatically when the authorization flow completes for the first time. 389 | if os.path.exists(token_path): 390 | creds = Credentials.from_authorized_user_file(token_path, SCOPES) 391 | 392 | # If there are no (valid) credentials available, let the user log in. 393 | if not creds or not creds.valid: 394 | if creds and creds.expired and creds.refresh_token: 395 | creds.refresh(GoogleRequest()) 396 | else: 397 | flow = InstalledAppFlow.from_client_secrets_file(creds_path, SCOPES) 398 | creds = flow.run_local_server(port=0) 399 | # Save the credentials for the next run 400 | with open(token_path, 'w') as token: 401 | token.write(creds.to_json()) 402 | 403 | return build('gmail', 'v1', credentials=creds) 404 | 405 | # Function to search for emails in Gmail 406 | async def fetch_emails_gmail(query: str, max_results: int) -> List[Dict[str, str]]: 407 | service = authenticate_gmail() 408 | results = service.users().messages().list(userId='me', q=query, maxResults=max_results).execute() 409 | messages = results.get('messages', []) 410 | 411 | emails = [] 412 | for message in messages: 413 | msg = service.users().messages().get(userId='me', id=message['id']).execute() 414 | payload = msg.get('payload', {}) 415 | headers = payload.get('headers', []) 416 | subject = next((header['value'] for header in headers if header['name'] == 'Subject'), 'No Subject') 417 | snippet = msg.get('snippet', '') 418 | email_data = { 419 | 'id': message['id'], 420 | 'subject': subject, 421 | 'snippet': snippet 422 | } 423 | emails.append(email_data) 424 | 425 | return emails 426 | 427 | # Function to download the full content of an email 428 | async def fetch_email_content_gmail(email_id: str) -> str: 429 | service = authenticate_gmail() 430 | msg = service.users().messages().get(userId='me', id=email_id, format='full').execute() 431 | payload = msg.get('payload', {}) 432 | parts = payload.get('parts', []) 433 | body = "" 434 | 435 | for part in parts: 436 | if part['mimeType'] == 'text/plain': 437 | body = extract_body_from_part(part) 438 | break 439 | 440 | return body 441 | 442 | # Function to remove HTTP links from email content 443 | def strip_http_links(content: str) -> str: 444 | return re.sub(r'http[s]?://\S+', '', content) 445 | 446 | def extract_body_from_part(part): 447 | body = "" 448 | if 'body' in part and 'data' in part['body']: 449 | body = base64.urlsafe_b64decode(part['body']['data']).decode('utf-8') 450 | elif 'parts' in part: 451 | for sub_part in part['parts']: 452 | body += extract_body_from_part(sub_part) 453 | return body 454 | 455 | # Integrate Gmail search into main function 456 | async def main(): 457 | parser = argparse.ArgumentParser(description="PyPlexitas - Open source CLI alternative to Perplexity AI by Dennis Kruyt") 458 | parser.add_argument("-q", "--query", type=str, required=True, help="Search Query") 459 | parser.add_argument("--llm-query", 
type=str, help="Optional LLM Query") 460 | parser.add_argument("-s", "--search", type=int, default=10, help="Number of search results to parse") 461 | parser.add_argument("--engine", type=str, choices=['bing', 'google', 'gmail'], default='bing', help="Search engine to use (bing, google, gmail)") 462 | parser.add_argument("-l", "--log-level", type=str, default="ERROR", help="Set the logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)") 463 | parser.add_argument("-t", "--max-tokens", type=int, default=DEFAULT_MAX_TOKENS, help="Maximum token limit for model input") 464 | parser.add_argument("--quiet", action='store_true', help="Suppress print messages") 465 | args = parser.parse_args() 466 | 467 | logger.setLevel(args.log_level.upper()) 468 | logging.basicConfig(level=args.log_level.upper()) 469 | 470 | logger.info("Application started") 471 | logger.debug(f"Received arguments: {args}") 472 | 473 | query = args.query 474 | llm_query = args.llm_query or query 475 | search_count = args.search 476 | max_tokens = args.max_tokens 477 | search_engine = args.engine 478 | verbose = not args.quiet 479 | 480 | logger.info(f"Searching for: {query} using {search_engine}") 481 | if verbose: print(f"Searching for 🔎: {query} using {search_engine}") 482 | request = Request(query) 483 | llm_agent = LLMAgent() 484 | 485 | if search_engine == 'gmail': 486 | emails = await fetch_emails_gmail(query, search_count) 487 | for email in emails: 488 | email_id = email['id'] 489 | subject = email['subject'] 490 | snippet = email['snippet'] 491 | content = await fetch_email_content_gmail(email_id) 492 | stripped_content = strip_http_links(content) 493 | 494 | search_result = SearchResult(name=subject, url=f"gmail://{email_id}", content=stripped_content) 495 | request.add_search_result(search_result) 496 | else: 497 | # Fetch search results 498 | logger.info("Fetching search results...") 499 | if search_engine == 'bing': 500 | await fetch_web_pages_bing(request, search_count, verbose) 501 | else: 502 | await fetch_web_pages_google(request, search_count, verbose) 503 | 504 | if search_engine != 'gmail': 505 | # Extract unique base names from URLs 506 | if verbose: 507 | unique_basenames = set(urlparse(search_result.url).netloc for search_result in request.search_map.values()) 508 | 509 | print("From domains 🌐: ", end="") 510 | for basename in unique_basenames: 511 | print(basename, " ", end="") 512 | print("") 513 | 514 | # Scrape content 515 | logger.info("Scraping content from search results...") 516 | if verbose: print(f"Scraping content from search results...") 517 | await process_urls(request) 518 | 519 | # Generate and upsert embeddings 520 | logger.info("Embedding content 📥") 521 | if verbose: print(f"Embedding content ✨") 522 | dimension = len(llm_agent.embeddings.embed_query(llm_query)) 523 | vector_client = QdrantClient(host="localhost", port=6333) 524 | 525 | collection_name = "embeddings" 526 | if vector_client.collection_exists(collection_name): 527 | vector_client.delete_collection(collection_name) 528 | 529 | vector_client.create_collection( 530 | collection_name=collection_name, 531 | vectors_config=models.VectorParams(size=dimension, distance=models.Distance.COSINE), 532 | ) 533 | 534 | total_chunks = await generate_upsert_embeddings(request, vector_client) 535 | 536 | if verbose: 537 | print(f"Total embeddings 📊: {len(request.search_map)}") 538 | print(f"Total chunks processed 🧩: {total_chunks}") 539 | 540 | # Search across embeddings 541 | prompt_embedding = llm_agent.embeddings.embed_query(llm_query) 
542 | search_result = vector_client.search( 543 | collection_name="embeddings", 544 | query_vector=prompt_embedding, 545 | limit=10, 546 | ) 547 | chunk_ids = [result.id for result in search_result] 548 | chunks = request.get_chunks(chunk_ids) 549 | 550 | # Answer the question 551 | await llm_agent.answer_question_stream(llm_query, chunks, max_tokens) 552 | 553 | if __name__ == "__main__": 554 | asyncio.run(main()) 555 | 556 | --------------------------------------------------------------------------------