├── vault.txt
├── .env
├── requirements.txt
├── config.yaml
├── LICENSE
├── README.md
├── localrag_no_rewrite.py
├── upload.py
├── emailrag2.py
├── collect_emails.py
└── localrag.py
--------------------------------------------------------------------------------
/vault.txt:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.env:
--------------------------------------------------------------------------------
GMAIL_USERNAME=
GMAIL_PASSWORD=
OUTLOOK_USERNAME=
OUTLOOK_PASSWORD=
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
openai
torch
PyPDF2
ollama
pyyaml
beautifulsoup4
lxml
python-dotenv
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
vault_file: "vault.txt"
embeddings_file: "vault_embeddings.json"
ollama_model: "llama3"
top_k: 7
system_message: "You are a helpful assistant that is an expert at extracting the most useful information from a given text"

ollama_api:
  base_url: "http://localhost:11434/v1"
  api_key: "llama3"
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Kris

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SuperEasy 100% Local RAG with Ollama + Email RAG

### YouTube Tutorials
- https://www.youtube.com/watch?v=Oe-7dGDyzPM
- https://www.youtube.com/watch?v=vFGng_3hDRk
### Latest YouTube Updated Features
[![Latest features video](https://img.youtube.com/vi/0X7raD1kISQ/0.jpg)](https://www.youtube.com/watch?v=0X7raD1kISQ)
### Setup
1. git clone https://github.com/AllAboutAI-YT/easy-local-rag.git
2. cd easy-local-rag
3. pip install -r requirements.txt
4. Install Ollama (https://ollama.com/download)
5. ollama pull llama3 (or another model of your choice)
6. ollama pull mxbai-embed-large
7. run upload.py (accepts .pdf, .txt, and .json files)
8. run localrag.py (with query rewriting)
9. run localrag_no_rewrite.py (without query rewriting)

### Email RAG Setup
1. git clone https://github.com/AllAboutAI-YT/easy-local-rag.git
2. cd easy-local-rag
3. pip install -r requirements.txt
4. Install Ollama (https://ollama.com/download)
5. ollama pull llama3 (or another model of your choice)
6. ollama pull mxbai-embed-large
7. Set YOUR email logins in .env (for Gmail, create an app password; see the video)
8. python collect_emails.py to download your emails
9. python emailrag2.py to talk to your emails

### Latest Updates
- Added Email RAG support (v1.3)
- upload.py (v1.2)
  - replaced \n\n with \n as the chunk separator
- New embeddings model, mxbai-embed-large from Ollama (v1.2)
- Query-rewrite function to improve retrieval on vague questions (v1.2)
- Pick your model from the CLI (v1.1)
  - python localrag.py --model mistral (llama3 is the default)
- Talk in a true loop with conversation history (v1.1)

### My YouTube Channel
https://www.youtube.com/c/AllAboutAI

### What is RAG?
RAG (Retrieval-Augmented Generation) enhances the capabilities of LLMs by combining their powerful language understanding with targeted retrieval of relevant information from external sources, often using embeddings in a vector store, leading to more accurate, trustworthy, and versatile AI-powered applications.
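Concretely, that is all the scripts in this repo do: embed every line of `vault.txt` with `mxbai-embed-large`, embed the query the same way, and keep the most similar lines as context. A minimal sketch of that retrieval step, assuming Ollama is running locally, the embedding model has been pulled, and `vault.txt` already has content (the query string is illustrative):

```python
import torch
import ollama

# Embed each vault line once (the scripts do this at startup).
with open("vault.txt", encoding="utf-8") as f:
    vault_lines = f.readlines()
vault_embeddings = torch.tensor([
    ollama.embeddings(model="mxbai-embed-large", prompt=line)["embedding"]
    for line in vault_lines
])

# Embed the query and keep the top-k most similar vault lines by cosine similarity.
query = "What does the report say about costs?"  # illustrative query
query_embedding = torch.tensor(
    ollama.embeddings(model="mxbai-embed-large", prompt=query)["embedding"]
)
scores = torch.cosine_similarity(query_embedding.unsqueeze(0), vault_embeddings)
top_indices = torch.topk(scores, k=min(3, len(scores)))[1].tolist()
print([vault_lines[i].strip() for i in top_indices])
```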
### What is Ollama?
Ollama is an open-source platform that simplifies the process of running powerful LLMs locally on your own machine, giving users more control and flexibility in their AI projects. https://www.ollama.com
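The chat scripts below do not call a cloud API: they talk to the local Ollama server through its OpenAI-compatible endpoint. A minimal sketch using the same settings as config.yaml (the `api_key` field is required by the client library but, to the best of our knowledge, not validated by Ollama):

```python
from openai import OpenAI

# Point the OpenAI client at the local Ollama server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="llama3")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Say hello."}],  # illustrative message
)
print(response.choices[0].message.content)
```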
--------------------------------------------------------------------------------
/localrag_no_rewrite.py:
--------------------------------------------------------------------------------
import torch
import ollama
import os
from openai import OpenAI
import argparse

# ANSI escape codes for colors
PINK = '\033[95m'
CYAN = '\033[96m'
YELLOW = '\033[93m'
NEON_GREEN = '\033[92m'
RESET_COLOR = '\033[0m'

# Function to open a file and return its contents as a string
def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

# Function to get relevant context from the vault based on user input
def get_relevant_context(rewritten_input, vault_embeddings, vault_content, top_k=3):
    if vault_embeddings.nelement() == 0:  # Check if the tensor has any elements
        return []
    # Encode the rewritten input
    input_embedding = ollama.embeddings(model='mxbai-embed-large', prompt=rewritten_input)["embedding"]
    # Compute cosine similarity between the input and vault embeddings
    cos_scores = torch.cosine_similarity(torch.tensor(input_embedding).unsqueeze(0), vault_embeddings)
    # Adjust top_k if it's greater than the number of available scores
    top_k = min(top_k, len(cos_scores))
    # Sort the scores and get the top-k indices
    top_indices = torch.topk(cos_scores, k=top_k)[1].tolist()
    # Get the corresponding context from the vault
    relevant_context = [vault_content[idx].strip() for idx in top_indices]
    return relevant_context

# Function to interact with the Ollama model
def ollama_chat(user_input, system_message, vault_embeddings, vault_content, ollama_model, conversation_history):
    # Get relevant context from the vault (use the tensor passed in as a parameter,
    # rather than reaching for the module-level vault_embeddings_tensor)
    relevant_context = get_relevant_context(user_input, vault_embeddings, vault_content, top_k=3)
    if relevant_context:
        # Convert list to a single string with newlines between items
        context_str = "\n".join(relevant_context)
        print("Context Pulled from Documents: \n\n" + CYAN + context_str + RESET_COLOR)
    else:
        print(CYAN + "No relevant context found." + RESET_COLOR)

    # Prepare the user's input by prepending the relevant context
    user_input_with_context = user_input
    if relevant_context:
        user_input_with_context = context_str + "\n\n" + user_input

    # Append the user's input to the conversation history
    conversation_history.append({"role": "user", "content": user_input_with_context})

    # Create a message history including the system message and the conversation history
    messages = [
        {"role": "system", "content": system_message},
        *conversation_history
    ]

    # Send the completion request to the Ollama model
    response = client.chat.completions.create(
        model=ollama_model,
        messages=messages
    )

    # Append the model's response to the conversation history
    conversation_history.append({"role": "assistant", "content": response.choices[0].message.content})

    # Return the content of the response from the model
    return response.choices[0].message.content

# Parse command-line arguments
parser = argparse.ArgumentParser(description="Ollama Chat")
parser.add_argument("--model", default="dolphin-llama3", help="Ollama model to use (default: dolphin-llama3)")
args = parser.parse_args()

# Configuration for the Ollama API client
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='dolphin-llama3'
)

# Load the vault content
vault_content = []
if os.path.exists("vault.txt"):
    with open("vault.txt", "r", encoding='utf-8') as vault_file:
        vault_content = vault_file.readlines()

# Generate embeddings for the vault content using Ollama
vault_embeddings = []
for content in vault_content:
    response = ollama.embeddings(model='mxbai-embed-large', prompt=content)
    vault_embeddings.append(response["embedding"])

# Convert to tensor and print embeddings
vault_embeddings_tensor = torch.tensor(vault_embeddings)
print("Embeddings for each line in the vault:")
print(vault_embeddings_tensor)

# Conversation loop
conversation_history = []
system_message = "You are a helpful assistant that is an expert at extracting the most useful information from a given text"

while True:
    user_input = input(YELLOW + "Ask a question about your documents (or type 'quit' to exit): " + RESET_COLOR)
    if user_input.lower() == 'quit':
        break

    response = ollama_chat(user_input, system_message, vault_embeddings_tensor, vault_content, args.model, conversation_history)
    print(NEON_GREEN + "Response: \n\n" + response + RESET_COLOR)
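# Example usage (illustrative; the default model comes from the argparse flag above):
#   python localrag_no_rewrite.py --model llama3
# Requires a running Ollama server and a populated vault.txt (see upload.py).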
--------------------------------------------------------------------------------
/upload.py:
--------------------------------------------------------------------------------
import os
import tkinter as tk
from tkinter import filedialog
import PyPDF2
import re
import json

# Function to convert PDF to text and append to vault.txt
def convert_pdf_to_text():
    file_path = filedialog.askopenfilename(filetypes=[("PDF Files", "*.pdf")])
    if file_path:
        with open(file_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            num_pages = len(pdf_reader.pages)
            text = ''
            for page_num in range(num_pages):
                page = pdf_reader.pages[page_num]
                if page.extract_text():
                    text += page.extract_text() + " "

        # Normalize whitespace and clean up text
        text = re.sub(r'\s+', ' ', text).strip()

        # Split text into chunks by sentences, respecting a maximum chunk size
        sentences = re.split(r'(?<=[.!?]) +', text)  # split on spaces following sentence-ending punctuation
        chunks = []
        current_chunk = ""
        for sentence in sentences:
            # Check if the current sentence plus the current chunk exceeds the limit
            if len(current_chunk) + len(sentence) + 1 < 1000:  # +1 for the space
                current_chunk += (sentence + " ").strip()
            else:
                # When the chunk exceeds 1000 characters, store it and start a new one
                chunks.append(current_chunk)
                current_chunk = sentence + " "
        if current_chunk:  # Don't forget the last chunk!
            chunks.append(current_chunk)
        with open("vault.txt", "a", encoding="utf-8") as vault_file:
            for chunk in chunks:
                # Write each chunk to its own line (a single newline separates chunks)
                vault_file.write(chunk.strip() + "\n")
        print("PDF content appended to vault.txt with each chunk on a separate line.")

# Function to upload a text file and append to vault.txt
def upload_txtfile():
    file_path = filedialog.askopenfilename(filetypes=[("Text Files", "*.txt")])
    if file_path:
        with open(file_path, 'r', encoding="utf-8") as txt_file:
            text = txt_file.read()

        # Normalize whitespace and clean up text
        text = re.sub(r'\s+', ' ', text).strip()

        # Split text into chunks by sentences, respecting a maximum chunk size
        sentences = re.split(r'(?<=[.!?]) +', text)  # split on spaces following sentence-ending punctuation
        chunks = []
        current_chunk = ""
        for sentence in sentences:
            # Check if the current sentence plus the current chunk exceeds the limit
            if len(current_chunk) + len(sentence) + 1 < 1000:  # +1 for the space
                current_chunk += (sentence + " ").strip()
            else:
                # When the chunk exceeds 1000 characters, store it and start a new one
                chunks.append(current_chunk)
                current_chunk = sentence + " "
        if current_chunk:  # Don't forget the last chunk!
            chunks.append(current_chunk)
        with open("vault.txt", "a", encoding="utf-8") as vault_file:
            for chunk in chunks:
                # Write each chunk to its own line (a single newline separates chunks)
                vault_file.write(chunk.strip() + "\n")
        print("Text file content appended to vault.txt with each chunk on a separate line.")

# Function to upload a JSON file and append to vault.txt
def upload_jsonfile():
    file_path = filedialog.askopenfilename(filetypes=[("JSON Files", "*.json")])
    if file_path:
        with open(file_path, 'r', encoding="utf-8") as json_file:
            data = json.load(json_file)

        # Flatten the JSON data into a single string
        text = json.dumps(data, ensure_ascii=False)

        # Normalize whitespace and clean up text
        text = re.sub(r'\s+', ' ', text).strip()

        # Split text into chunks by sentences, respecting a maximum chunk size
        sentences = re.split(r'(?<=[.!?]) +', text)  # split on spaces following sentence-ending punctuation
        chunks = []
        current_chunk = ""
        for sentence in sentences:
            # Check if the current sentence plus the current chunk exceeds the limit
            if len(current_chunk) + len(sentence) + 1 < 1000:  # +1 for the space
                current_chunk += (sentence + " ").strip()
            else:
                # When the chunk exceeds 1000 characters, store it and start a new one
                chunks.append(current_chunk)
                current_chunk = sentence + " "
        if current_chunk:  # Don't forget the last chunk!
            chunks.append(current_chunk)
        with open("vault.txt", "a", encoding="utf-8") as vault_file:
            for chunk in chunks:
                # Write each chunk to its own line (a single newline separates chunks)
                vault_file.write(chunk.strip() + "\n")
        print("JSON file content appended to vault.txt with each chunk on a separate line.")

# Create the main window
root = tk.Tk()
root.title("Upload .pdf, .txt, or .json")

# Create a button to open the file dialog for PDF
pdf_button = tk.Button(root, text="Upload PDF", command=convert_pdf_to_text)
pdf_button.pack(pady=10)

# Create a button to open the file dialog for text file
txt_button = tk.Button(root, text="Upload Text File", command=upload_txtfile)
txt_button.pack(pady=10)

# Create a button to open the file dialog for JSON file
json_button = tk.Button(root, text="Upload JSON File", command=upload_jsonfile)
json_button.pack(pady=10)

# Run the main event loop
root.mainloop()
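# Example usage (illustrative): running the script opens a small Tkinter window
# with one upload button per file type; each upload appends chunked text to vault.txt.
#   python upload.py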
--------------------------------------------------------------------------------
/emailrag2.py:
--------------------------------------------------------------------------------
import torch
import ollama
import os
import json
from openai import OpenAI
import argparse
import yaml

# ANSI escape codes for colors
PINK = '\033[95m'
CYAN = '\033[96m'
YELLOW = '\033[93m'
NEON_GREEN = '\033[92m'
RESET_COLOR = '\033[0m'

def load_config(config_file):
    print("Loading configuration...")
    try:
        with open(config_file, 'r') as file:
            return yaml.safe_load(file)
    except FileNotFoundError:
        print(f"Configuration file '{config_file}' not found.")
        exit(1)

def open_file(filepath):
    print("Opening file...")
    try:
        with open(filepath, 'r', encoding='utf-8') as infile:
            return infile.read()
    except FileNotFoundError:
        print(f"File '{filepath}' not found.")
        return None

def load_or_generate_embeddings(vault_content, embeddings_file):
    if os.path.exists(embeddings_file):
        print(f"Loading embeddings from '{embeddings_file}'...")
        try:
            with open(embeddings_file, "r", encoding="utf-8") as file:
                return torch.tensor(json.load(file))
        except json.JSONDecodeError:
            # Fall through and regenerate rather than returning an empty tensor
            print(f"Invalid JSON format in embeddings file '{embeddings_file}'. Regenerating...")
    else:
        print("No embeddings found. Generating new embeddings...")
    embeddings = generate_embeddings(vault_content)
    save_embeddings(embeddings, embeddings_file)
    return torch.tensor(embeddings)

def generate_embeddings(vault_content):
    print("Generating embeddings...")
    embeddings = []
    for content in vault_content:
        try:
            response = ollama.embeddings(model='mxbai-embed-large', prompt=content)
            embeddings.append(response["embedding"])
        except Exception as e:
            # Note: skipping a failed line leaves embeddings misaligned with vault_content
            print(f"Error generating embeddings: {str(e)}")
    return embeddings

def save_embeddings(embeddings, embeddings_file):
    print(f"Saving embeddings to '{embeddings_file}'...")
    try:
        with open(embeddings_file, "w", encoding="utf-8") as file:
            json.dump(embeddings, file)
    except Exception as e:
        print(f"Error saving embeddings: {str(e)}")

def get_relevant_context(rewritten_input, vault_embeddings, vault_content, top_k):
    print("Retrieving relevant context...")
    if vault_embeddings.nelement() == 0:
        return []
    try:
        input_embedding = ollama.embeddings(model='mxbai-embed-large', prompt=rewritten_input)["embedding"]
        cos_scores = torch.cosine_similarity(torch.tensor(input_embedding).unsqueeze(0), vault_embeddings)
        top_k = min(top_k, len(cos_scores))
        top_indices = torch.topk(cos_scores, k=top_k)[1].tolist()
        return [vault_content[idx].strip() for idx in top_indices]
    except Exception as e:
        print(f"Error getting relevant context: {str(e)}")
        return []

def ollama_chat(user_input, system_message, vault_embeddings, vault_content, ollama_model, conversation_history, top_k, client):
    relevant_context = get_relevant_context(user_input, vault_embeddings, vault_content, top_k)
    if relevant_context:
        context_str = "\n".join(relevant_context)
        print("Context Pulled from Documents: \n\n" + CYAN + context_str + RESET_COLOR)
    else:
        print("No relevant context found.")

    user_input_with_context = user_input
    if relevant_context:
        user_input_with_context = context_str + "\n\n" + user_input

    conversation_history.append({"role": "user", "content": user_input_with_context})
    messages = [{"role": "system", "content": system_message}, *conversation_history]

    try:
        response = client.chat.completions.create(
            model=ollama_model,
            messages=messages
        )
        conversation_history.append({"role": "assistant", "content": response.choices[0].message.content})
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error in Ollama chat: {str(e)}")
        return "An error occurred while processing your request."

def main():
    parser = argparse.ArgumentParser(description="Ollama Chat")
    parser.add_argument("--config", default="config.yaml", help="Path to the configuration file")
    parser.add_argument("--clear-cache", action="store_true", help="Clear the embeddings cache")
    parser.add_argument("--model", help="Model to use for embeddings and responses")

    args = parser.parse_args()

    config = load_config(args.config)

    if args.clear_cache and os.path.exists(config["embeddings_file"]):
        print(f"Clearing embeddings cache at '{config['embeddings_file']}'...")
        os.remove(config["embeddings_file"])

    if args.model:
        config["ollama_model"] = args.model

    vault_content = []
    if os.path.exists(config["vault_file"]):
        print(f"Loading content from vault '{config['vault_file']}'...")
        with open(config["vault_file"], "r", encoding='utf-8') as vault_file:
            vault_content = vault_file.readlines()

    vault_embeddings_tensor = load_or_generate_embeddings(vault_content, config["embeddings_file"])

    client = OpenAI(
        base_url=config["ollama_api"]["base_url"],
        api_key=config["ollama_api"]["api_key"]
    )

    conversation_history = []
    system_message = config["system_message"]

    while True:
        user_input = input(YELLOW + "Ask a question about your documents (or type 'quit' to exit): " + RESET_COLOR)
        if user_input.lower() == 'quit':
            break
        response = ollama_chat(user_input, system_message, vault_embeddings_tensor, vault_content, config["ollama_model"], conversation_history, config["top_k"], client)
        print(NEON_GREEN + "Response: \n\n" + response + RESET_COLOR)

if __name__ == "__main__":
    main()
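# Example usage (illustrative; flags come from the argparse setup in main()):
#   python emailrag2.py --config config.yaml --model llama3
#   python emailrag2.py --clear-cache   # drop cached embeddings and regenerate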
--------------------------------------------------------------------------------
/collect_emails.py:
--------------------------------------------------------------------------------
import imaplib
import email
from email import policy
from email.parser import BytesParser
from datetime import datetime, timedelta
import os
import re
import argparse
from bs4 import BeautifulSoup
import lxml  # parser backend used by BeautifulSoup below
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env file

def chunk_text(text, max_length=1000):
    # Normalize Unicode characters to the closest ASCII representation
    text = text.encode('ascii', 'ignore').decode('ascii')

    # Remove sequences of '>' used in email threads
    text = re.sub(r'\s*(?:>\s*){2,}', ' ', text)

    # Remove sequences of dashes, underscores, or non-breaking spaces
    text = re.sub(r'-{3,}', ' ', text)
    text = re.sub(r'_{3,}', ' ', text)
    text = re.sub(r'\s{2,}', ' ', text)  # Collapse multiple spaces into one

    # Replace URLs with a single space, or remove them
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Normalize whitespace to single spaces, strip leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Split text into sentences while preserving punctuation
    sentences = re.split(r'(?<=[.!?]) +', text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 < max_length:
            current_chunk += (sentence + " ").strip()
        else:
            chunks.append(current_chunk)
            current_chunk = sentence + " "
    if current_chunk:
        chunks.append(current_chunk)

    return chunks

def save_chunks_to_vault(chunks):
    vault_path = "vault.txt"
    with open(vault_path, "a", encoding="utf-8") as vault_file:
        for chunk in chunks:
            vault_file.write(chunk.strip() + "\n")

def get_text_from_html(html_content):
    soup = BeautifulSoup(html_content, 'lxml')
    return soup.get_text()

def save_plain_text_content(email_bytes, email_id):
    msg = BytesParser(policy=policy.default).parsebytes(email_bytes)
    text_content = ""
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                text_content += part.get_payload(decode=True).decode(part.get_content_charset('utf-8'))
            elif part.get_content_type() == 'text/html':
                html_content = part.get_payload(decode=True).decode(part.get_content_charset('utf-8'))
                text_content += get_text_from_html(html_content)
    else:
        if msg.get_content_type() == 'text/plain':
            text_content = msg.get_payload(decode=True).decode(msg.get_content_charset('utf-8'))
        elif msg.get_content_type() == 'text/html':
            text_content = get_text_from_html(msg.get_payload(decode=True).decode(msg.get_content_charset('utf-8')))

    chunks = chunk_text(text_content)
    save_chunks_to_vault(chunks)
    return text_content

def search_and_process_emails(imap_client, email_source, search_keyword, start_date, end_date):
    search_criteria = 'ALL'
    if start_date and end_date:
        search_criteria = f'(SINCE "{start_date}" BEFORE "{end_date}")'
    if search_keyword:
        search_criteria += f' BODY "{search_keyword}"'  # Ensure the correct combination of conditions

    print(f"Using search criteria for {email_source}: {search_criteria}")
    typ, data = imap_client.search(None, search_criteria)
    if typ == 'OK':
        email_ids = data[0].split()
        print(f"Found {len(email_ids)} emails matching criteria in {email_source}.")

        for num in email_ids:
            typ, email_data = imap_client.fetch(num, '(RFC822)')
            if typ == 'OK':
                email_id = num.decode('utf-8')
                print(f"Downloading and processing email ID: {email_id} from {email_source}")
                save_plain_text_content(email_data[0][1], email_id)
            else:
                print(f"Failed to fetch email ID: {num.decode('utf-8')} from {email_source}")
    else:
        print(f"Failed to find emails with given criteria in {email_source}. No emails found.")
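# For example (illustrative values), running with --startdate 01.01.2024,
# --enddate 01.02.2024 and --keyword invoice builds the IMAP search criteria:
#   (SINCE "01-Jan-2024" BEFORE "01-Feb-2024") BODY "invoice"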
Details: {e}") 121 | return 122 | elif args.startdate or args.enddate: 123 | print("Both start date and end date must be provided together.") 124 | return 125 | 126 | # Retrieve email credentials from environment variables 127 | gmail_username = os.getenv('GMAIL_USERNAME') 128 | gmail_password = os.getenv('GMAIL_PASSWORD') 129 | outlook_username = os.getenv('OUTLOOK_USERNAME') 130 | outlook_password = os.getenv('OUTLOOK_PASSWORD') 131 | 132 | # Connect to Gmail's IMAP server 133 | M = imaplib.IMAP4_SSL('imap.gmail.com') 134 | M.login(gmail_username, gmail_password) 135 | M.select('inbox') 136 | 137 | # Connect to Outlook IMAP server 138 | H = imaplib.IMAP4_SSL('imap-mail.outlook.com') 139 | H.login(outlook_username, outlook_password) 140 | H.select('inbox') 141 | 142 | # Search and process emails from Gmail and Outlook 143 | search_and_process_emails(M, "Gmail", args.keyword, start_date, end_date) 144 | search_and_process_emails(H, "Outlook", args.keyword, start_date, end_date) 145 | 146 | M.logout() 147 | H.logout() 148 | 149 | if __name__ == "__main__": 150 | main() 151 | -------------------------------------------------------------------------------- /localrag.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import ollama 3 | import os 4 | from openai import OpenAI 5 | import argparse 6 | import json 7 | 8 | # ANSI escape codes for colors 9 | PINK = '\033[95m' 10 | CYAN = '\033[96m' 11 | YELLOW = '\033[93m' 12 | NEON_GREEN = '\033[92m' 13 | RESET_COLOR = '\033[0m' 14 | 15 | # Function to open a file and return its contents as a string 16 | def open_file(filepath): 17 | with open(filepath, 'r', encoding='utf-8') as infile: 18 | return infile.read() 19 | 20 | # Function to get relevant context from the vault based on user input 21 | def get_relevant_context(rewritten_input, vault_embeddings, vault_content, top_k=3): 22 | if vault_embeddings.nelement() == 0: # Check if the tensor has any elements 23 | return [] 24 | # Encode the rewritten input 25 | input_embedding = ollama.embeddings(model='mxbai-embed-large', prompt=rewritten_input)["embedding"] 26 | # Compute cosine similarity between the input and vault embeddings 27 | cos_scores = torch.cosine_similarity(torch.tensor(input_embedding).unsqueeze(0), vault_embeddings) 28 | # Adjust top_k if it's greater than the number of available scores 29 | top_k = min(top_k, len(cos_scores)) 30 | # Sort the scores and get the top-k indices 31 | top_indices = torch.topk(cos_scores, k=top_k)[1].tolist() 32 | # Get the corresponding context from the vault 33 | relevant_context = [vault_content[idx].strip() for idx in top_indices] 34 | return relevant_context 35 | 36 | def rewrite_query(user_input_json, conversation_history, ollama_model): 37 | user_input = json.loads(user_input_json)["Query"] 38 | context = "\n".join([f"{msg['role']}: {msg['content']}" for msg in conversation_history[-2:]]) 39 | prompt = f"""Rewrite the following query by incorporating relevant context from the conversation history. 40 | The rewritten query should: 41 | 42 | - Preserve the core intent and meaning of the original query 43 | - Expand and clarify the query to make it more specific and informative for retrieving relevant context 44 | - Avoid introducing new topics or queries that deviate from the original query 45 | - DONT EVER ANSWER the Original query, but instead focus on rephrasing and expanding it into a new query 46 | 47 | Return ONLY the rewritten query text, without any additional formatting or explanations. 
--------------------------------------------------------------------------------
/localrag.py:
--------------------------------------------------------------------------------
import torch
import ollama
import os
from openai import OpenAI
import argparse
import json

# ANSI escape codes for colors
PINK = '\033[95m'
CYAN = '\033[96m'
YELLOW = '\033[93m'
NEON_GREEN = '\033[92m'
RESET_COLOR = '\033[0m'

# Function to open a file and return its contents as a string
def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

# Function to get relevant context from the vault based on user input
def get_relevant_context(rewritten_input, vault_embeddings, vault_content, top_k=3):
    if vault_embeddings.nelement() == 0:  # Check if the tensor has any elements
        return []
    # Encode the rewritten input
    input_embedding = ollama.embeddings(model='mxbai-embed-large', prompt=rewritten_input)["embedding"]
    # Compute cosine similarity between the input and vault embeddings
    cos_scores = torch.cosine_similarity(torch.tensor(input_embedding).unsqueeze(0), vault_embeddings)
    # Adjust top_k if it's greater than the number of available scores
    top_k = min(top_k, len(cos_scores))
    # Sort the scores and get the top-k indices
    top_indices = torch.topk(cos_scores, k=top_k)[1].tolist()
    # Get the corresponding context from the vault
    relevant_context = [vault_content[idx].strip() for idx in top_indices]
    return relevant_context

def rewrite_query(user_input_json, conversation_history, ollama_model):
    user_input = json.loads(user_input_json)["Query"]
    context = "\n".join([f"{msg['role']}: {msg['content']}" for msg in conversation_history[-2:]])
    prompt = f"""Rewrite the following query by incorporating relevant context from the conversation history.
The rewritten query should:

- Preserve the core intent and meaning of the original query
- Expand and clarify the query to make it more specific and informative for retrieving relevant context
- Avoid introducing new topics or queries that deviate from the original query
- DO NOT EVER ANSWER the original query; instead focus on rephrasing and expanding it into a new query

Return ONLY the rewritten query text, without any additional formatting or explanations.

Conversation History:
{context}

Original query: [{user_input}]

Rewritten query:
"""
    response = client.chat.completions.create(
        model=ollama_model,
        messages=[{"role": "system", "content": prompt}],
        max_tokens=200,
        n=1,
        temperature=0.1,
    )
    rewritten_query = response.choices[0].message.content.strip()
    return json.dumps({"Rewritten Query": rewritten_query})

def ollama_chat(user_input, system_message, vault_embeddings, vault_content, ollama_model, conversation_history):
    conversation_history.append({"role": "user", "content": user_input})

    if len(conversation_history) > 1:
        query_json = {
            "Query": user_input,
            "Rewritten Query": ""
        }
        rewritten_query_json = rewrite_query(json.dumps(query_json), conversation_history, ollama_model)
        rewritten_query_data = json.loads(rewritten_query_json)
        rewritten_query = rewritten_query_data["Rewritten Query"]
        print(PINK + "Original Query: " + user_input + RESET_COLOR)
        print(PINK + "Rewritten Query: " + rewritten_query + RESET_COLOR)
    else:
        rewritten_query = user_input

    relevant_context = get_relevant_context(rewritten_query, vault_embeddings, vault_content)
    if relevant_context:
        context_str = "\n".join(relevant_context)
        print("Context Pulled from Documents: \n\n" + CYAN + context_str + RESET_COLOR)
    else:
        print(CYAN + "No relevant context found." + RESET_COLOR)

    user_input_with_context = user_input
    if relevant_context:
        user_input_with_context = user_input + "\n\nRelevant Context:\n" + context_str

    conversation_history[-1]["content"] = user_input_with_context

    messages = [
        {"role": "system", "content": system_message},
        *conversation_history
    ]

    response = client.chat.completions.create(
        model=ollama_model,
        messages=messages,
        max_tokens=2000,
    )

    conversation_history.append({"role": "assistant", "content": response.choices[0].message.content})

    return response.choices[0].message.content

# Parse command-line arguments
print(NEON_GREEN + "Parsing command-line arguments..." + RESET_COLOR)
parser = argparse.ArgumentParser(description="Ollama Chat")
parser.add_argument("--model", default="llama3", help="Ollama model to use (default: llama3)")
args = parser.parse_args()

# Configuration for the Ollama API client
print(NEON_GREEN + "Initializing Ollama API client..." + RESET_COLOR)
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='llama3'
)

# Load the vault content
print(NEON_GREEN + "Loading vault content..." + RESET_COLOR)
vault_content = []
if os.path.exists("vault.txt"):
    with open("vault.txt", "r", encoding='utf-8') as vault_file:
        vault_content = vault_file.readlines()

# Generate embeddings for the vault content using Ollama
print(NEON_GREEN + "Generating embeddings for the vault content..." + RESET_COLOR)
vault_embeddings = []
for content in vault_content:
    response = ollama.embeddings(model='mxbai-embed-large', prompt=content)
    vault_embeddings.append(response["embedding"])

# Convert to tensor and print embeddings
print("Converting embeddings to tensor...")
vault_embeddings_tensor = torch.tensor(vault_embeddings)
print("Embeddings for each line in the vault:")
print(vault_embeddings_tensor)

# Conversation loop
print("Starting conversation loop...")
conversation_history = []
system_message = "You are a helpful assistant that is an expert at extracting the most useful information from a given text. Also bring in extra relevant information to the user query from outside the given context."

while True:
    user_input = input(YELLOW + "Ask a query about your documents (or type 'quit' to exit): " + RESET_COLOR)
    if user_input.lower() == 'quit':
        break

    response = ollama_chat(user_input, system_message, vault_embeddings_tensor, vault_content, args.model, conversation_history)
    print(NEON_GREEN + "Response: \n\n" + response + RESET_COLOR)
--------------------------------------------------------------------------------