├── vault.txt
├── .env
├── requirements.txt
├── config.yaml
├── LICENSE
├── README.md
├── localrag_no_rewrite.py
├── upload.py
├── emailrag2.py
├── collect_emails.py
└── localrag.py
--------------------------------------------------------------------------------
/vault.txt:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.env:
--------------------------------------------------------------------------------
GMAIL_USERNAME=
GMAIL_PASSWORD=
OUTLOOK_USERNAME=
OUTLOOK_PASSWORD=
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
openai
torch
PyPDF2
ollama
pyyaml
beautifulsoup4
lxml
python-dotenv
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
vault_file: "vault.txt"
embeddings_file: "vault_embeddings.json"
ollama_model: "llama3"
top_k: 7
system_message: "You are a helpful assistant that is an expert at extracting the most useful information from a given text"

ollama_api:
  base_url: "http://localhost:11434/v1"
  api_key: "llama3"
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Kris

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SuperEasy 100% Local RAG with Ollama + Email RAG

### YouTube Tutorials
- https://www.youtube.com/watch?v=Oe-7dGDyzPM
- https://www.youtube.com/watch?v=vFGng_3hDRk
### Latest YouTube Updated Features
[![Latest features video](https://img.youtube.com/vi/0X7raD1kISQ/0.jpg)](https://www.youtube.com/watch?v=0X7raD1kISQ)
### Setup
1. git clone https://github.com/AllAboutAI-YT/easy-local-rag.git
2. cd easy-local-rag
3. pip install -r requirements.txt
4. Install Ollama (https://ollama.com/download)
5. ollama pull llama3 (or another model of your choice)
6. ollama pull mxbai-embed-large
7. run upload.py (accepts .pdf, .txt, and .json files)
8. run localrag.py (with query rewriting)
9. run localrag_no_rewrite.py (without query rewriting)

### Email RAG Setup
1. git clone https://github.com/AllAboutAI-YT/easy-local-rag.git
2. cd easy-local-rag
3. pip install -r requirements.txt
4. Install Ollama (https://ollama.com/download)
5. ollama pull llama3 (or another model of your choice)
6. ollama pull mxbai-embed-large
7. Set YOUR email logins in .env (for Gmail, create an app password; see the video)
8. python collect_emails.py to download your emails
9. python emailrag2.py to talk to your emails

### Latest Updates
- Added Email RAG support (v1.3)
- upload.py (v1.2)
  - replaced \n\n with \n as the chunk separator
- New embeddings model, mxbai-embed-large from Ollama (v1.2)
- Query-rewrite function to improve retrieval on vague questions (v1.2)
- Pick your model from the CLI (v1.1)
  - python localrag.py --model mistral (llama3 is the default)
- Talk in a true loop with conversation history (v1.1)

### My YouTube Channel
https://www.youtube.com/c/AllAboutAI

### What is RAG?
RAG (Retrieval-Augmented Generation) enhances the capabilities of LLMs by combining their powerful language understanding with targeted retrieval of relevant information from external sources, often using embeddings in a vector store, leading to more accurate, trustworthy, and versatile AI-powered applications.
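Concretely, that is all the scripts in this repo do: embed every line of `vault.txt` with `mxbai-embed-large`, embed the query the same way, and keep the most similar lines as context. A minimal sketch of that retrieval step, assuming Ollama is running locally, the embedding model has been pulled, and `vault.txt` already has content (the query string is illustrative):

```python
import torch
import ollama

# Embed each vault line once (the scripts do this at startup).
with open("vault.txt", encoding="utf-8") as f:
    vault_lines = f.readlines()
vault_embeddings = torch.tensor([
    ollama.embeddings(model="mxbai-embed-large", prompt=line)["embedding"]
    for line in vault_lines
])

# Embed the query and keep the top-k most similar vault lines by cosine similarity.
query = "What does the report say about costs?"  # illustrative query
query_embedding = torch.tensor(
    ollama.embeddings(model="mxbai-embed-large", prompt=query)["embedding"]
)
scores = torch.cosine_similarity(query_embedding.unsqueeze(0), vault_embeddings)
top_indices = torch.topk(scores, k=min(3, len(scores)))[1].tolist()
print([vault_lines[i].strip() for i in top_indices])
```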
### What is Ollama?
Ollama is an open-source platform that simplifies the process of running powerful LLMs locally on your own machine, giving users more control and flexibility in their AI projects. https://www.ollama.com
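The chat scripts below do not call a cloud API: they talk to the local Ollama server through its OpenAI-compatible endpoint. A minimal sketch using the same settings as config.yaml (the `api_key` field is required by the client library but, to the best of our knowledge, not validated by Ollama):

```python
from openai import OpenAI

# Point the OpenAI client at the local Ollama server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="llama3")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Say hello."}],  # illustrative message
)
print(response.choices[0].message.content)
```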
--------------------------------------------------------------------------------
/localrag_no_rewrite.py:
--------------------------------------------------------------------------------
import torch
import ollama
import os
from openai import OpenAI
import argparse

# ANSI escape codes for colors
PINK = '\033[95m'
CYAN = '\033[96m'
YELLOW = '\033[93m'
NEON_GREEN = '\033[92m'
RESET_COLOR = '\033[0m'

# Function to open a file and return its contents as a string
def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

# Function to get relevant context from the vault based on user input
def get_relevant_context(rewritten_input, vault_embeddings, vault_content, top_k=3):
    if vault_embeddings.nelement() == 0:  # Check if the tensor has any elements
        return []
    # Encode the rewritten input
    input_embedding = ollama.embeddings(model='mxbai-embed-large', prompt=rewritten_input)["embedding"]
    # Compute cosine similarity between the input and vault embeddings
    cos_scores = torch.cosine_similarity(torch.tensor(input_embedding).unsqueeze(0), vault_embeddings)
    # Adjust top_k if it's greater than the number of available scores
    top_k = min(top_k, len(cos_scores))
    # Sort the scores and get the top-k indices
    top_indices = torch.topk(cos_scores, k=top_k)[1].tolist()
    # Get the corresponding context from the vault
    relevant_context = [vault_content[idx].strip() for idx in top_indices]
    return relevant_context

# Function to interact with the Ollama model
def ollama_chat(user_input, system_message, vault_embeddings, vault_content, ollama_model, conversation_history):
    # Get relevant context from the vault (use the tensor passed in as a parameter,
    # rather than reaching for the module-level vault_embeddings_tensor)
    relevant_context = get_relevant_context(user_input, vault_embeddings, vault_content, top_k=3)
    if relevant_context:
        # Convert list to a single string with newlines between items
        context_str = "\n".join(relevant_context)
        print("Context Pulled from Documents: \n\n" + CYAN + context_str + RESET_COLOR)
    else:
        print(CYAN + "No relevant context found." + RESET_COLOR)

    # Prepare the user's input by prepending the relevant context
    user_input_with_context = user_input
    if relevant_context:
        user_input_with_context = context_str + "\n\n" + user_input

    # Append the user's input to the conversation history
    conversation_history.append({"role": "user", "content": user_input_with_context})

    # Create a message history including the system message and the conversation history
    messages = [
        {"role": "system", "content": system_message},
        *conversation_history
    ]

    # Send the completion request to the Ollama model
    response = client.chat.completions.create(
        model=ollama_model,
        messages=messages
    )

    # Append the model's response to the conversation history
    conversation_history.append({"role": "assistant", "content": response.choices[0].message.content})

    # Return the content of the response from the model
    return response.choices[0].message.content

# Parse command-line arguments
parser = argparse.ArgumentParser(description="Ollama Chat")
parser.add_argument("--model", default="dolphin-llama3", help="Ollama model to use (default: dolphin-llama3)")
args = parser.parse_args()

# Configuration for the Ollama API client
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='dolphin-llama3'
)

# Load the vault content
vault_content = []
if os.path.exists("vault.txt"):
    with open("vault.txt", "r", encoding='utf-8') as vault_file:
        vault_content = vault_file.readlines()

# Generate embeddings for the vault content using Ollama
vault_embeddings = []
for content in vault_content:
    response = ollama.embeddings(model='mxbai-embed-large', prompt=content)
    vault_embeddings.append(response["embedding"])

# Convert to tensor and print embeddings
vault_embeddings_tensor = torch.tensor(vault_embeddings)
print("Embeddings for each line in the vault:")
print(vault_embeddings_tensor)

# Conversation loop
conversation_history = []
system_message = "You are a helpful assistant that is an expert at extracting the most useful information from a given text"

while True:
    user_input = input(YELLOW + "Ask a question about your documents (or type 'quit' to exit): " + RESET_COLOR)
    if user_input.lower() == 'quit':
        break

    response = ollama_chat(user_input, system_message, vault_embeddings_tensor, vault_content, args.model, conversation_history)
    print(NEON_GREEN + "Response: \n\n" + response + RESET_COLOR)
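# Example usage (illustrative; the default model comes from the argparse flag above):
#   python localrag_no_rewrite.py --model llama3
# Requires a running Ollama server and a populated vault.txt (see upload.py).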
--------------------------------------------------------------------------------
/upload.py:
--------------------------------------------------------------------------------
import os
import tkinter as tk
from tkinter import filedialog
import PyPDF2
import re
import json

# Function to convert PDF to text and append to vault.txt
def convert_pdf_to_text():
    file_path = filedialog.askopenfilename(filetypes=[("PDF Files", "*.pdf")])
    if file_path:
        with open(file_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            num_pages = len(pdf_reader.pages)
            text = ''
            for page_num in range(num_pages):
                page = pdf_reader.pages[page_num]
                if page.extract_text():
                    text += page.extract_text() + " "

        # Normalize whitespace and clean up text
        text = re.sub(r'\s+', ' ', text).strip()

        # Split text into chunks by sentences, respecting a maximum chunk size
        sentences = re.split(r'(?<=[.!?]) +', text)  # split on spaces following sentence-ending punctuation
        chunks = []
        current_chunk = ""
        for sentence in sentences:
            # Check if the current sentence plus the current chunk exceeds the limit
            if len(current_chunk) + len(sentence) + 1 < 1000:  # +1 for the space
                current_chunk += (sentence + " ").strip()
            else:
                # When the chunk exceeds 1000 characters, store it and start a new one
                chunks.append(current_chunk)
                current_chunk = sentence + " "
        if current_chunk:  # Don't forget the last chunk!
            chunks.append(current_chunk)
        with open("vault.txt", "a", encoding="utf-8") as vault_file:
            for chunk in chunks:
                # Write each chunk to its own line (a single newline separates chunks)
                vault_file.write(chunk.strip() + "\n")
        print("PDF content appended to vault.txt with each chunk on a separate line.")

# Function to upload a text file and append to vault.txt
def upload_txtfile():
    file_path = filedialog.askopenfilename(filetypes=[("Text Files", "*.txt")])
    if file_path:
        with open(file_path, 'r', encoding="utf-8") as txt_file:
            text = txt_file.read()

        # Normalize whitespace and clean up text
        text = re.sub(r'\s+', ' ', text).strip()

        # Split text into chunks by sentences, respecting a maximum chunk size
        sentences = re.split(r'(?<=[.!?]) +', text)  # split on spaces following sentence-ending punctuation
        chunks = []
        current_chunk = ""
        for sentence in sentences:
            # Check if the current sentence plus the current chunk exceeds the limit
            if len(current_chunk) + len(sentence) + 1 < 1000:  # +1 for the space
                current_chunk += (sentence + " ").strip()
            else:
                # When the chunk exceeds 1000 characters, store it and start a new one
                chunks.append(current_chunk)
                current_chunk = sentence + " "
        if current_chunk:  # Don't forget the last chunk!
            chunks.append(current_chunk)
        with open("vault.txt", "a", encoding="utf-8") as vault_file:
            for chunk in chunks:
                # Write each chunk to its own line (a single newline separates chunks)
                vault_file.write(chunk.strip() + "\n")
        print("Text file content appended to vault.txt with each chunk on a separate line.")

# Function to upload a JSON file and append to vault.txt
def upload_jsonfile():
    file_path = filedialog.askopenfilename(filetypes=[("JSON Files", "*.json")])
    if file_path:
        with open(file_path, 'r', encoding="utf-8") as json_file:
            data = json.load(json_file)

        # Flatten the JSON data into a single string
        text = json.dumps(data, ensure_ascii=False)

        # Normalize whitespace and clean up text
        text = re.sub(r'\s+', ' ', text).strip()

        # Split text into chunks by sentences, respecting a maximum chunk size
        sentences = re.split(r'(?<=[.!?]) +', text)  # split on spaces following sentence-ending punctuation
        chunks = []
        current_chunk = ""
        for sentence in sentences:
            # Check if the current sentence plus the current chunk exceeds the limit
            if len(current_chunk) + len(sentence) + 1 < 1000:  # +1 for the space
                current_chunk += (sentence + " ").strip()
            else:
                # When the chunk exceeds 1000 characters, store it and start a new one
                chunks.append(current_chunk)
                current_chunk = sentence + " "
        if current_chunk:  # Don't forget the last chunk!
            chunks.append(current_chunk)
        with open("vault.txt", "a", encoding="utf-8") as vault_file:
            for chunk in chunks:
                # Write each chunk to its own line (a single newline separates chunks)
                vault_file.write(chunk.strip() + "\n")
        print("JSON file content appended to vault.txt with each chunk on a separate line.")

# Create the main window
root = tk.Tk()
root.title("Upload .pdf, .txt, or .json")

# Create a button to open the file dialog for PDF
pdf_button = tk.Button(root, text="Upload PDF", command=convert_pdf_to_text)
pdf_button.pack(pady=10)

# Create a button to open the file dialog for text file
txt_button = tk.Button(root, text="Upload Text File", command=upload_txtfile)
txt_button.pack(pady=10)

# Create a button to open the file dialog for JSON file
json_button = tk.Button(root, text="Upload JSON File", command=upload_jsonfile)
json_button.pack(pady=10)

# Run the main event loop
root.mainloop()
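# Example usage (illustrative): running the script opens a small Tkinter window
# with one upload button per file type; each upload appends chunked text to vault.txt.
#   python upload.py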
--------------------------------------------------------------------------------
/emailrag2.py:
--------------------------------------------------------------------------------
import torch
import ollama
import os
import json
from openai import OpenAI
import argparse
import yaml

# ANSI escape codes for colors
PINK = '\033[95m'
CYAN = '\033[96m'
YELLOW = '\033[93m'
NEON_GREEN = '\033[92m'
RESET_COLOR = '\033[0m'

def load_config(config_file):
    print("Loading configuration...")
    try:
        with open(config_file, 'r') as file:
            return yaml.safe_load(file)
    except FileNotFoundError:
        print(f"Configuration file '{config_file}' not found.")
        exit(1)

def open_file(filepath):
    print("Opening file...")
    try:
        with open(filepath, 'r', encoding='utf-8') as infile:
            return infile.read()
    except FileNotFoundError:
        print(f"File '{filepath}' not found.")
        return None

def load_or_generate_embeddings(vault_content, embeddings_file):
    if os.path.exists(embeddings_file):
        print(f"Loading embeddings from '{embeddings_file}'...")
        try:
            with open(embeddings_file, "r", encoding="utf-8") as file:
                return torch.tensor(json.load(file))
        except json.JSONDecodeError:
            # Fall through and regenerate rather than returning an empty tensor
            print(f"Invalid JSON format in embeddings file '{embeddings_file}'. Regenerating...")
    else:
        print("No embeddings found. Generating new embeddings...")
    embeddings = generate_embeddings(vault_content)
    save_embeddings(embeddings, embeddings_file)
    return torch.tensor(embeddings)

def generate_embeddings(vault_content):
    print("Generating embeddings...")
    embeddings = []
    for content in vault_content:
        try:
            response = ollama.embeddings(model='mxbai-embed-large', prompt=content)
            embeddings.append(response["embedding"])
        except Exception as e:
            # Note: skipping a failed line leaves embeddings misaligned with vault_content
            print(f"Error generating embeddings: {str(e)}")
    return embeddings

def save_embeddings(embeddings, embeddings_file):
    print(f"Saving embeddings to '{embeddings_file}'...")
    try:
        with open(embeddings_file, "w", encoding="utf-8") as file:
            json.dump(embeddings, file)
    except Exception as e:
        print(f"Error saving embeddings: {str(e)}")

def get_relevant_context(rewritten_input, vault_embeddings, vault_content, top_k):
    print("Retrieving relevant context...")
    if vault_embeddings.nelement() == 0:
        return []
    try:
        input_embedding = ollama.embeddings(model='mxbai-embed-large', prompt=rewritten_input)["embedding"]
        cos_scores = torch.cosine_similarity(torch.tensor(input_embedding).unsqueeze(0), vault_embeddings)
        top_k = min(top_k, len(cos_scores))
        top_indices = torch.topk(cos_scores, k=top_k)[1].tolist()
        return [vault_content[idx].strip() for idx in top_indices]
    except Exception as e:
        print(f"Error getting relevant context: {str(e)}")
        return []

def ollama_chat(user_input, system_message, vault_embeddings, vault_content, ollama_model, conversation_history, top_k, client):
    relevant_context = get_relevant_context(user_input, vault_embeddings, vault_content, top_k)
    if relevant_context:
        context_str = "\n".join(relevant_context)
        print("Context Pulled from Documents: \n\n" + CYAN + context_str + RESET_COLOR)
    else:
        print("No relevant context found.")

    user_input_with_context = user_input
    if relevant_context:
        user_input_with_context = context_str + "\n\n" + user_input

    conversation_history.append({"role": "user", "content": user_input_with_context})
    messages = [{"role": "system", "content": system_message}, *conversation_history]

    try:
        response = client.chat.completions.create(
            model=ollama_model,
            messages=messages
        )
        conversation_history.append({"role": "assistant", "content": response.choices[0].message.content})
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error in Ollama chat: {str(e)}")
        return "An error occurred while processing your request."

def main():
    parser = argparse.ArgumentParser(description="Ollama Chat")
    parser.add_argument("--config", default="config.yaml", help="Path to the configuration file")
    parser.add_argument("--clear-cache", action="store_true", help="Clear the embeddings cache")
    parser.add_argument("--model", help="Model to use for embeddings and responses")

    args = parser.parse_args()

    config = load_config(args.config)

    if args.clear_cache and os.path.exists(config["embeddings_file"]):
        print(f"Clearing embeddings cache at '{config['embeddings_file']}'...")
        os.remove(config["embeddings_file"])

    if args.model:
        config["ollama_model"] = args.model

    vault_content = []
    if os.path.exists(config["vault_file"]):
        print(f"Loading content from vault '{config['vault_file']}'...")
        with open(config["vault_file"], "r", encoding='utf-8') as vault_file:
            vault_content = vault_file.readlines()

    vault_embeddings_tensor = load_or_generate_embeddings(vault_content, config["embeddings_file"])

    client = OpenAI(
        base_url=config["ollama_api"]["base_url"],
        api_key=config["ollama_api"]["api_key"]
    )

    conversation_history = []
    system_message = config["system_message"]

    while True:
        user_input = input(YELLOW + "Ask a question about your documents (or type 'quit' to exit): " + RESET_COLOR)
        if user_input.lower() == 'quit':
            break
        response = ollama_chat(user_input, system_message, vault_embeddings_tensor, vault_content, config["ollama_model"], conversation_history, config["top_k"], client)
        print(NEON_GREEN + "Response: \n\n" + response + RESET_COLOR)

if __name__ == "__main__":
    main()
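# Example usage (illustrative; flags come from the argparse setup in main()):
#   python emailrag2.py --config config.yaml --model llama3
#   python emailrag2.py --clear-cache   # drop cached embeddings and regenerate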
--------------------------------------------------------------------------------
/collect_emails.py:
--------------------------------------------------------------------------------
import imaplib
import email
from email import policy
from email.parser import BytesParser
from datetime import datetime, timedelta
import os
import re
import argparse
from bs4 import BeautifulSoup
import lxml  # parser backend used by BeautifulSoup below
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env file

def chunk_text(text, max_length=1000):
    # Normalize Unicode characters to the closest ASCII representation
    text = text.encode('ascii', 'ignore').decode('ascii')

    # Remove sequences of '>' used in email threads
    text = re.sub(r'\s*(?:>\s*){2,}', ' ', text)

    # Remove sequences of dashes, underscores, or non-breaking spaces
    text = re.sub(r'-{3,}', ' ', text)
    text = re.sub(r'_{3,}', ' ', text)
    text = re.sub(r'\s{2,}', ' ', text)  # Collapse multiple spaces into one

    # Replace URLs with a single space, or remove them
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Normalize whitespace to single spaces, strip leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Split text into sentences while preserving punctuation
    sentences = re.split(r'(?<=[.!?]) +', text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 < max_length:
            current_chunk += (sentence + " ").strip()
        else:
            chunks.append(current_chunk)
            current_chunk = sentence + " "
    if current_chunk:
        chunks.append(current_chunk)

    return chunks

def save_chunks_to_vault(chunks):
    vault_path = "vault.txt"
    with open(vault_path, "a", encoding="utf-8") as vault_file:
        for chunk in chunks:
            vault_file.write(chunk.strip() + "\n")

def get_text_from_html(html_content):
    soup = BeautifulSoup(html_content, 'lxml')
    return soup.get_text()

def save_plain_text_content(email_bytes, email_id):
    msg = BytesParser(policy=policy.default).parsebytes(email_bytes)
    text_content = ""
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                text_content += part.get_payload(decode=True).decode(part.get_content_charset('utf-8'))
            elif part.get_content_type() == 'text/html':
                html_content = part.get_payload(decode=True).decode(part.get_content_charset('utf-8'))
                text_content += get_text_from_html(html_content)
    else:
        if msg.get_content_type() == 'text/plain':
            text_content = msg.get_payload(decode=True).decode(msg.get_content_charset('utf-8'))
        elif msg.get_content_type() == 'text/html':
            text_content = get_text_from_html(msg.get_payload(decode=True).decode(msg.get_content_charset('utf-8')))

    chunks = chunk_text(text_content)
    save_chunks_to_vault(chunks)
    return text_content

def search_and_process_emails(imap_client, email_source, search_keyword, start_date, end_date):
    search_criteria = 'ALL'
    if start_date and end_date:
        search_criteria = f'(SINCE "{start_date}" BEFORE "{end_date}")'
    if search_keyword:
        search_criteria += f' BODY "{search_keyword}"'  # Ensure the correct combination of conditions

    print(f"Using search criteria for {email_source}: {search_criteria}")
    typ, data = imap_client.search(None, search_criteria)
    if typ == 'OK':
        email_ids = data[0].split()
        print(f"Found {len(email_ids)} emails matching criteria in {email_source}.")

        for num in email_ids:
            typ, email_data = imap_client.fetch(num, '(RFC822)')
            if typ == 'OK':
                email_id = num.decode('utf-8')
                print(f"Downloading and processing email ID: {email_id} from {email_source}")
                save_plain_text_content(email_data[0][1], email_id)
            else:
                print(f"Failed to fetch email ID: {num.decode('utf-8')} from {email_source}")
    else:
        print(f"Failed to find emails with given criteria in {email_source}. No emails found.")
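# For example (illustrative values), running with --startdate 01.01.2024,
# --enddate 01.02.2024 and --keyword invoice builds the IMAP search criteria:
#   (SINCE "01-Jan-2024" BEFORE "01-Feb-2024") BODY "invoice"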
Details: {e}") 121 | return 122 | elif args.startdate or args.enddate: 123 | print("Both start date and end date must be provided together.") 124 | return 125 | 126 | # Retrieve email credentials from environment variables 127 | gmail_username = os.getenv('GMAIL_USERNAME') 128 | gmail_password = os.getenv('GMAIL_PASSWORD') 129 | outlook_username = os.getenv('OUTLOOK_USERNAME') 130 | outlook_password = os.getenv('OUTLOOK_PASSWORD') 131 | 132 | # Connect to Gmail's IMAP server 133 | M = imaplib.IMAP4_SSL('imap.gmail.com') 134 | M.login(gmail_username, gmail_password) 135 | M.select('inbox') 136 | 137 | # Connect to Outlook IMAP server 138 | H = imaplib.IMAP4_SSL('imap-mail.outlook.com') 139 | H.login(outlook_username, outlook_password) 140 | H.select('inbox') 141 | 142 | # Search and process emails from Gmail and Outlook 143 | search_and_process_emails(M, "Gmail", args.keyword, start_date, end_date) 144 | search_and_process_emails(H, "Outlook", args.keyword, start_date, end_date) 145 | 146 | M.logout() 147 | H.logout() 148 | 149 | if __name__ == "__main__": 150 | main() 151 | -------------------------------------------------------------------------------- /localrag.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import ollama 3 | import os 4 | from openai import OpenAI 5 | import argparse 6 | import json 7 | 8 | # ANSI escape codes for colors 9 | PINK = '\033[95m' 10 | CYAN = '\033[96m' 11 | YELLOW = '\033[93m' 12 | NEON_GREEN = '\033[92m' 13 | RESET_COLOR = '\033[0m' 14 | 15 | # Function to open a file and return its contents as a string 16 | def open_file(filepath): 17 | with open(filepath, 'r', encoding='utf-8') as infile: 18 | return infile.read() 19 | 20 | # Function to get relevant context from the vault based on user input 21 | def get_relevant_context(rewritten_input, vault_embeddings, vault_content, top_k=3): 22 | if vault_embeddings.nelement() == 0: # Check if the tensor has any elements 23 | return [] 24 | # Encode the rewritten input 25 | input_embedding = ollama.embeddings(model='mxbai-embed-large', prompt=rewritten_input)["embedding"] 26 | # Compute cosine similarity between the input and vault embeddings 27 | cos_scores = torch.cosine_similarity(torch.tensor(input_embedding).unsqueeze(0), vault_embeddings) 28 | # Adjust top_k if it's greater than the number of available scores 29 | top_k = min(top_k, len(cos_scores)) 30 | # Sort the scores and get the top-k indices 31 | top_indices = torch.topk(cos_scores, k=top_k)[1].tolist() 32 | # Get the corresponding context from the vault 33 | relevant_context = [vault_content[idx].strip() for idx in top_indices] 34 | return relevant_context 35 | 36 | def rewrite_query(user_input_json, conversation_history, ollama_model): 37 | user_input = json.loads(user_input_json)["Query"] 38 | context = "\n".join([f"{msg['role']}: {msg['content']}" for msg in conversation_history[-2:]]) 39 | prompt = f"""Rewrite the following query by incorporating relevant context from the conversation history. 40 | The rewritten query should: 41 | 42 | - Preserve the core intent and meaning of the original query 43 | - Expand and clarify the query to make it more specific and informative for retrieving relevant context 44 | - Avoid introducing new topics or queries that deviate from the original query 45 | - DONT EVER ANSWER the Original query, but instead focus on rephrasing and expanding it into a new query 46 | 47 | Return ONLY the rewritten query text, without any additional formatting or explanations. 
--------------------------------------------------------------------------------
/localrag.py:
--------------------------------------------------------------------------------
import torch
import ollama
import os
from openai import OpenAI
import argparse
import json

# ANSI escape codes for colors
PINK = '\033[95m'
CYAN = '\033[96m'
YELLOW = '\033[93m'
NEON_GREEN = '\033[92m'
RESET_COLOR = '\033[0m'

# Function to open a file and return its contents as a string
def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

# Function to get relevant context from the vault based on user input
def get_relevant_context(rewritten_input, vault_embeddings, vault_content, top_k=3):
    if vault_embeddings.nelement() == 0:  # Check if the tensor has any elements
        return []
    # Encode the rewritten input
    input_embedding = ollama.embeddings(model='mxbai-embed-large', prompt=rewritten_input)["embedding"]
    # Compute cosine similarity between the input and vault embeddings
    cos_scores = torch.cosine_similarity(torch.tensor(input_embedding).unsqueeze(0), vault_embeddings)
    # Adjust top_k if it's greater than the number of available scores
    top_k = min(top_k, len(cos_scores))
    # Sort the scores and get the top-k indices
    top_indices = torch.topk(cos_scores, k=top_k)[1].tolist()
    # Get the corresponding context from the vault
    relevant_context = [vault_content[idx].strip() for idx in top_indices]
    return relevant_context

def rewrite_query(user_input_json, conversation_history, ollama_model):
    user_input = json.loads(user_input_json)["Query"]
    context = "\n".join([f"{msg['role']}: {msg['content']}" for msg in conversation_history[-2:]])
    prompt = f"""Rewrite the following query by incorporating relevant context from the conversation history.
The rewritten query should:

- Preserve the core intent and meaning of the original query
- Expand and clarify the query to make it more specific and informative for retrieving relevant context
- Avoid introducing new topics or queries that deviate from the original query
- DO NOT EVER ANSWER the original query; instead focus on rephrasing and expanding it into a new query

Return ONLY the rewritten query text, without any additional formatting or explanations.

Conversation History:
{context}

Original query: [{user_input}]

Rewritten query:
"""
    response = client.chat.completions.create(
        model=ollama_model,
        messages=[{"role": "system", "content": prompt}],
        max_tokens=200,
        n=1,
        temperature=0.1,
    )
    rewritten_query = response.choices[0].message.content.strip()
    return json.dumps({"Rewritten Query": rewritten_query})

def ollama_chat(user_input, system_message, vault_embeddings, vault_content, ollama_model, conversation_history):
    conversation_history.append({"role": "user", "content": user_input})

    if len(conversation_history) > 1:
        query_json = {
            "Query": user_input,
            "Rewritten Query": ""
        }
        rewritten_query_json = rewrite_query(json.dumps(query_json), conversation_history, ollama_model)
        rewritten_query_data = json.loads(rewritten_query_json)
        rewritten_query = rewritten_query_data["Rewritten Query"]
        print(PINK + "Original Query: " + user_input + RESET_COLOR)
        print(PINK + "Rewritten Query: " + rewritten_query + RESET_COLOR)
    else:
        rewritten_query = user_input

    relevant_context = get_relevant_context(rewritten_query, vault_embeddings, vault_content)
    if relevant_context:
        context_str = "\n".join(relevant_context)
        print("Context Pulled from Documents: \n\n" + CYAN + context_str + RESET_COLOR)
    else:
        print(CYAN + "No relevant context found." + RESET_COLOR)

    user_input_with_context = user_input
    if relevant_context:
        user_input_with_context = user_input + "\n\nRelevant Context:\n" + context_str

    conversation_history[-1]["content"] = user_input_with_context

    messages = [
        {"role": "system", "content": system_message},
        *conversation_history
    ]

    response = client.chat.completions.create(
        model=ollama_model,
        messages=messages,
        max_tokens=2000,
    )

    conversation_history.append({"role": "assistant", "content": response.choices[0].message.content})

    return response.choices[0].message.content

# Parse command-line arguments
print(NEON_GREEN + "Parsing command-line arguments..." + RESET_COLOR)
parser = argparse.ArgumentParser(description="Ollama Chat")
parser.add_argument("--model", default="llama3", help="Ollama model to use (default: llama3)")
args = parser.parse_args()

# Configuration for the Ollama API client
print(NEON_GREEN + "Initializing Ollama API client..." + RESET_COLOR)
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='llama3'
)

# Load the vault content
print(NEON_GREEN + "Loading vault content..." + RESET_COLOR)
vault_content = []
if os.path.exists("vault.txt"):
    with open("vault.txt", "r", encoding='utf-8') as vault_file:
        vault_content = vault_file.readlines()

# Generate embeddings for the vault content using Ollama
print(NEON_GREEN + "Generating embeddings for the vault content..." + RESET_COLOR)
vault_embeddings = []
for content in vault_content:
    response = ollama.embeddings(model='mxbai-embed-large', prompt=content)
    vault_embeddings.append(response["embedding"])

# Convert to tensor and print embeddings
print("Converting embeddings to tensor...")
vault_embeddings_tensor = torch.tensor(vault_embeddings)
print("Embeddings for each line in the vault:")
print(vault_embeddings_tensor)

# Conversation loop
print("Starting conversation loop...")
conversation_history = []
system_message = "You are a helpful assistant that is an expert at extracting the most useful information from a given text. Also bring in extra relevant information to the user query from outside the given context."

while True:
    user_input = input(YELLOW + "Ask a query about your documents (or type 'quit' to exit): " + RESET_COLOR)
    if user_input.lower() == 'quit':
        break

    response = ollama_chat(user_input, system_message, vault_embeddings_tensor, vault_content, args.model, conversation_history)
    print(NEON_GREEN + "Response: \n\n" + response + RESET_COLOR)
--------------------------------------------------------------------------------