├── .gitignore
├── LICENSE
├── README.md
├── data
│   └── .gitkeep
├── requirements.txt
├── retro_cartoon_robot.jpg
└── src
    ├── ingestion.py
    ├── main.py
    └── retrieval.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.DS_Store
data/*
!data/.gitkeep
chroma_db/*
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 Javier Orraca-Deatcu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Vector DB and RAG Maker

This is a small, focused vector database and Retrieval-Augmented Generation (RAG) system developed with Python and made for R users. It is designed to ground LLM responses to challenging prompts in your local R technical documentation.

![](retro_cartoon_robot.jpg)

## Overview

This system:
1. Ingests R documentation (`.md`, `.R`, `.Rmd`, `.qmd`) from a single directory
2. Creates vector embeddings of the content and stores them in a `chromadb` vector database
3. Provides a query interface to ask questions about the R packages
4. Uses an LLM to generate responses based on retrieved context
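
In practice, those four steps boil down to two commands, covered in detail under Usage below:

```bash
# Build the vector database from your R files, then ask it a question
python src/main.py ingest --content-dir ./data --output-dir ./chroma_db
python src/main.py query --db-path ./chroma_db/r_knowledge_base --question "How do I use dplyr's filter function?"
```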

## Requirements

- Python 3.8+
- Required Python packages (see `requirements.txt`):
  - langchain
  - langchain-community
  - langchain-anthropic (for Claude 3.7 Sonnet integration)
  - langchain-text-splitters
  - langchain-huggingface
  - langchain-chroma
  - chromadb
  - sentence-transformers
  - pydantic
  - anthropic

The `argparse` and `glob` modules are also used, but they ship with the Python standard library and do not need to be installed.

## Installation

1. Clone this repository:
```bash
git clone https://github.com/JavOrraca/Vector-DB-and-RAG-Maker.git
cd Vector-DB-and-RAG-Maker
```

2. Install the required dependencies:
```bash
pip install -r requirements.txt
```

3. Set your Anthropic API key:
```bash
export ANTHROPIC_API_KEY=your-api-key
```

## Usage

### Ingesting R-related Files

First, ingest all your R-related files from a single directory:

```bash
python src/main.py ingest --content-dir ./data --output-dir ./chroma_db
```

This command will:
- Find all `.md`, `.R`, `.Rmd`, and `.qmd` files in the specified directory
- Process them appropriately based on file type
- Store them in a single unified Chroma vector database

### Querying the System

After ingestion, you can query the system:

```bash
# Interactive mode
python src/main.py query --db-path ./chroma_db/r_knowledge_base

# Single query mode
python src/main.py query --db-path ./chroma_db/r_knowledge_base --question "How do I use dplyr's filter function?"
```

## Supported File Types

The system can ingest and process the following file types:

- **Markdown (`.md`)** - Documentation, READMEs, etc.
- **R Files (`.R`)** - R source code files
- **R Markdown (`.Rmd`)** - Mixed R code and markdown
- **Quarto (`.qmd`)** - A next-generation technical publishing framework and the successor to `.Rmd`

All files are processed appropriately based on their type and structure. If you need support for additional file types, please reach out to Javier.

## System Components

### Ingestion

The ingestion pipeline:
1. Recursively finds all supported files in the specified directory
2. Processes each file type appropriately:
   - Splits markdown files by headers and then into chunks
   - Splits R code files into chunks
   - Handles `.Rmd` and `.qmd` files intelligently, attempting to parse them as markdown first
3. Creates vector embeddings for each chunk
4. Stores the embeddings in a unified Chroma vector database

### Retrieval

The retrieval system:
1. Takes a user question
2. Searches the vector database for relevant context
3. Combines the results
4. Sends the most relevant context to an LLM
5. Returns the LLM's response

## Customization

### Embedding Models

By default, the system uses the `sentence-transformers/all-MiniLM-L6-v2` model for embeddings. You can modify this in the code to use other models.

### LLM

The system is configured to use Anthropic's Claude 3.7 Sonnet, but you can modify it to use other LLMs supported by LangChain; both this swap and the embedding-model swap are sketched below.
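
For example, here is a minimal sketch of both swaps, assuming you install `langchain-openai` (not a project dependency) and have an `OPENAI_API_KEY` set; any other LangChain-supported chat and embedding models would work the same way:

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI  # assumed extra dependency, not in requirements.txt

# Alternative embedding model, used in src/ingestion.py and src/retrieval.py;
# ingestion and retrieval must share one model, so re-ingest after changing it
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# Alternative LLM, used in src/retrieval.py
llm = ChatOpenAI(temperature=0, model="gpt-4o")
```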

### Chunking Parameters

You can adjust the chunking parameters in the code to better suit your needs (see the sketch after this list):
- `chunk_size`: The maximum size of each text chunk, in characters
- `chunk_overlap`: The number of characters shared between consecutive chunks
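
As a rough illustration, both splitters in `src/ingestion.py` are constructed as shown below; these values are the project defaults, and long-form documents may warrant a larger `chunk_size`:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Defaults used throughout src/ingestion.py; tune for your corpus
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200,  # characters shared between consecutive chunks
)
```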
--------------------------------------------------------------------------------
/data/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JavOrraca/Vector-DB-and-RAG-Maker/ea4969eb7704cbf74cb7108947c45898ef5f4b44/data/.gitkeep
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
langchain
langchain-community
langchain-anthropic
langchain-text-splitters
langchain-huggingface
langchain-chroma
chromadb
sentence-transformers
pydantic
anthropic
--------------------------------------------------------------------------------
/retro_cartoon_robot.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JavOrraca/Vector-DB-and-RAG-Maker/ea4969eb7704cbf74cb7108947c45898ef5f4b44/retro_cartoon_robot.jpg
--------------------------------------------------------------------------------
/src/ingestion.py:
--------------------------------------------------------------------------------
import os
import glob
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

def ingest_all_r_files(directory_path, collection_name="r_knowledge_base", output_dir="./chroma_db"):
    """
    Ingest all R-related files (.R, .Rmd, .qmd, .md) from a single directory,
    split them into appropriate chunks, and store the chunks in a vector database.

    Args:
        directory_path: Path to directory containing files to ingest
        collection_name: Name of the collection in the vector database
        output_dir: Base directory to store the vector database
    """
    # Find all relevant files by type
    markdown_files = glob.glob(os.path.join(directory_path, "**/*.md"), recursive=True)
    r_files = glob.glob(os.path.join(directory_path, "**/*.R"), recursive=True)
    rmd_files = glob.glob(os.path.join(directory_path, "**/*.Rmd"), recursive=True)
    qmd_files = glob.glob(os.path.join(directory_path, "**/*.qmd"), recursive=True)

    # Headers to split markdown on
    headers_to_split_on = [
        ("#", "header1"),
        ("##", "header2"),
        ("###", "header3"),
        ("####", "header4")
    ]

    # Initialize splitters
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )

    code_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""]
    )

    # Initialize embedding model
    embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Process all files
    documents = []

    # Process markdown files
    print(f"Processing {len(markdown_files)} markdown (.md) files...")
    for file_path in markdown_files:
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                content = f.read()
        except UnicodeDecodeError:
            print(f"Warning: Could not read {file_path} due to encoding issues. Skipping.")
            continue

        # Get relative path for metadata
        rel_path = os.path.relpath(file_path, directory_path)

        # Split by headers first
        md_docs = markdown_splitter.split_text(content)
        for doc in md_docs:
            doc.metadata["source"] = rel_path
            doc.metadata["file_type"] = "markdown"

        # Further split by size if needed
        docs = text_splitter.split_documents(md_docs)
        documents.extend(docs)

    # Process R files
    print(f"Processing {len(r_files)} R (.R) files...")
    for file_path in r_files:
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                content = f.read()
        except UnicodeDecodeError:
            print(f"Warning: Could not read {file_path} due to encoding issues. Skipping.")
            continue

        # Get relative path for metadata
        rel_path = os.path.relpath(file_path, directory_path)

        # Split text into chunks
        chunks = code_splitter.create_documents(
            texts=[content],
            metadatas=[{"source": rel_path, "file_type": "R", "language": "R"}]
        )
        documents.extend(chunks)

    # Process Rmd files
    print(f"Processing {len(rmd_files)} R Markdown (.Rmd) files...")
    for file_path in rmd_files:
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                content = f.read()
        except UnicodeDecodeError:
            print(f"Warning: Could not read {file_path} due to encoding issues. Skipping.")
            continue

        # Get relative path for metadata
        rel_path = os.path.relpath(file_path, directory_path)

        # Try to split by headers first (since Rmd is markdown-based)
        try:
            md_docs = markdown_splitter.split_text(content)
            for doc in md_docs:
                doc.metadata["source"] = rel_path
                doc.metadata["file_type"] = "Rmd"

            # Further split by size if needed
            docs = text_splitter.split_documents(md_docs)
            documents.extend(docs)
        except Exception as e:
            # Fall back to regular splitting if header parsing fails
            print(f"Warning: Markdown parsing failed for {file_path} ({e}), using regular chunking")
            chunks = text_splitter.create_documents(
                texts=[content],
                metadatas=[{"source": rel_path, "file_type": "Rmd"}]
            )
            documents.extend(chunks)

    # Process Quarto files
    print(f"Processing {len(qmd_files)} Quarto (.qmd) files...")
    for file_path in qmd_files:
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                content = f.read()
        except UnicodeDecodeError:
            print(f"Warning: Could not read {file_path} due to encoding issues. Skipping.")
            continue

        # Get relative path for metadata
        rel_path = os.path.relpath(file_path, directory_path)

        # Try to split by headers first (since qmd is markdown-based)
        try:
            md_docs = markdown_splitter.split_text(content)
            for doc in md_docs:
                doc.metadata["source"] = rel_path
                doc.metadata["file_type"] = "qmd"

            # Further split by size if needed
            docs = text_splitter.split_documents(md_docs)
            documents.extend(docs)
        except Exception as e:
            # Fall back to regular splitting if header parsing fails
            print(f"Warning: Markdown parsing failed for {file_path} ({e}), using regular chunking")
            chunks = text_splitter.create_documents(
                texts=[content],
                metadatas=[{"source": rel_path, "file_type": "qmd"}]
            )
            documents.extend(chunks)

    # Store in vector database
    print(f"Creating vector database with {len(documents)} document chunks...")
    # Use os.path.join for proper path handling
    persist_dir = os.path.join(output_dir, collection_name)
    db = Chroma.from_documents(
        documents=documents,
        embedding=embedding_model,
        persist_directory=persist_dir
    )

    print(f"Vector database created and persisted to {persist_dir}")
    return db

# For backward compatibility
def ingest_markdown_files(directory_path, collection_name="r_packages_docs", output_dir="./chroma_db"):
    """
    Legacy function. Use ingest_all_r_files instead.
    """
    print("Warning: This function is deprecated. Use ingest_all_r_files instead.")

    # Find all .md files in the directory
    md_files = glob.glob(os.path.join(directory_path, "**/*.md"), recursive=True)

    # Headers to split markdown on
    headers_to_split_on = [
        ("#", "header1"),
        ("##", "header2"),
        ("###", "header3"),
        ("####", "header4")
    ]

    # Initialize splitters
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )

    # Initialize embedding model
    embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Process each file
    documents = []
    for file_path in md_files:
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                content = f.read()
        except UnicodeDecodeError:
            print(f"Warning: Could not read {file_path} due to encoding issues. Skipping.")
            continue

        # Get relative path for metadata
        rel_path = os.path.relpath(file_path, directory_path)

        # Split by headers first
        md_docs = markdown_splitter.split_text(content)
        for doc in md_docs:
            doc.metadata["source"] = rel_path
            doc.metadata["file_type"] = "markdown"

        # Further split by size if needed
        docs = text_splitter.split_documents(md_docs)
        documents.extend(docs)

    # Store in vector database
    # Use os.path.join for proper path handling
    persist_dir = os.path.join(output_dir, collection_name)
    db = Chroma.from_documents(
        documents=documents,
        embedding=embedding_model,
        persist_directory=persist_dir
    )

    return db

# For backward compatibility
def ingest_r_files(directory_path, collection_name="r_packages_code", output_dir="./chroma_db"):
    """
    Legacy function. Use ingest_all_r_files instead.
    """
    print("Warning: This function is deprecated. Use ingest_all_r_files instead.")

    # Find all .R files in the directory
    r_files = glob.glob(os.path.join(directory_path, "**/*.R"), recursive=True)

    # Initialize splitter for code
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""]
    )

    # Initialize embedding model
    embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Process each file
    documents = []
    for file_path in r_files:
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                content = f.read()
        except UnicodeDecodeError:
            print(f"Warning: Could not read {file_path} due to encoding issues. Skipping.")
            continue

        # Get relative path for metadata
        rel_path = os.path.relpath(file_path, directory_path)

        # Split text into chunks
        chunks = text_splitter.create_documents(
            texts=[content],
            metadatas=[{"source": rel_path, "file_type": "R", "language": "R"}]
        )
        documents.extend(chunks)

    # Store in vector database
    # Use os.path.join for proper path handling
    persist_dir = os.path.join(output_dir, collection_name)
    db = Chroma.from_documents(
        documents=documents,
        embedding=embedding_model,
        persist_directory=persist_dir
    )

    return db

if __name__ == "__main__":
    # Example usage
    # Replace with your actual path
    r_content_path = "../data/r_content"
    output_dir = "./chroma_db"

    # Use the new unified function
    db = ingest_all_r_files(r_content_path, output_dir=output_dir)

    print("Ingested all R-related files into Chroma DB")
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
import os
import argparse
from ingestion import ingest_all_r_files
from retrieval import RPackageRagSystem

def main():
    parser = argparse.ArgumentParser(description="R Package RAG System")
    subparsers = parser.add_subparsers(dest="command", help="Command to run")

    # Ingest command
    ingest_parser = subparsers.add_parser("ingest", help="Ingest R package files")
    ingest_parser.add_argument("--content-dir", required=True, help="Directory containing all R-related files (.md, .R, .Rmd, .qmd)")
    ingest_parser.add_argument("--output-dir", default="./chroma_db", help="Directory to store vector database")
    ingest_parser.add_argument("--collection-name", default="r_knowledge_base", help="Name for the vector database collection")

    # Query command
    query_parser = subparsers.add_parser("query", help="Query the RAG system")
    query_parser.add_argument("--db-path", required=True, help="Path to vector database")
    query_parser.add_argument("--question", help="Question to ask (if not provided, enters interactive mode)")
    query_parser.add_argument("--api-key", help="Anthropic API key (if not set as env var)")

    args = parser.parse_args()

    if args.command == "ingest":
        # Create output directory if it doesn't exist
        os.makedirs(args.output_dir, exist_ok=True)
        os.makedirs(os.path.join(args.output_dir, args.collection_name), exist_ok=True)

        db_path = os.path.join(args.output_dir, args.collection_name)

        print(f"Ingesting all R-related files from {args.content_dir}...")
        db = ingest_all_r_files(
            directory_path=args.content_dir,
            collection_name=args.collection_name,
            output_dir=args.output_dir
        )

        print(f"Ingestion complete. Vector database stored in: {db_path}")

    elif args.command == "query":
        # Get API key from args or environment
        api_key = args.api_key or os.environ.get("ANTHROPIC_API_KEY")
        if not api_key:
            print("Warning: No Anthropic API key provided. Set it with --api-key or the ANTHROPIC_API_KEY environment variable.")

        # Initialize RAG system with the same DB for both docs and code (we combined them)
        rag = RPackageRagSystem(args.db_path, args.db_path, api_key)

        if args.question:
            # Single question mode
            answer = rag.query(args.question)
            print(answer)
        else:
            # Interactive mode
            rag.interactive_mode()

    else:
        parser.print_help()

if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/src/retrieval.py:
--------------------------------------------------------------------------------
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import PromptTemplate

class RPackageRagSystem:
    """
    RAG system for querying R package documentation and code.
    """

    def __init__(self, docs_db_path, code_db_path, anthropic_api_key=None):
        """
        Initialize the RAG system with paths to the vector databases.

        Args:
            docs_db_path: Path to the Chroma DB for markdown documentation
            code_db_path: Path to the Chroma DB for R code
            anthropic_api_key: Optional Anthropic API key
        """
        # Load embedding model
        self.embedding_model = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )

        # Load vector databases
        self.docs_db = Chroma(
            persist_directory=docs_db_path,
            embedding_function=self.embedding_model
        )

        self.code_db = Chroma(
            persist_directory=code_db_path,
            embedding_function=self.embedding_model
        )

        # Initialize LLM (Claude 3.7 Sonnet)
        self.llm = ChatAnthropic(
            temperature=0,
            model="claude-3-7-sonnet-20250219",
            anthropic_api_key=anthropic_api_key
        )

        # Create prompt template
        self.prompt_template = PromptTemplate(
            input_variables=["context", "question"],
            template="""
You are an expert R programmer and data scientist. Use the provided context about R packages
to answer the user's question. The context includes both documentation and code from various R packages.

Context:
{context}

Question:
{question}

Answer:
"""
        )
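
    # How the query-time weighting below works (illustrative numbers): Chroma's
    # similarity_search_with_score returns a distance, where lower means more
    # similar. With doc_weight = 0.7, a documentation chunk at distance 0.30 is
    # adjusted to 0.30 / 0.7, about 0.43, while a code chunk at the same
    # distance is adjusted to 0.30 / 0.3 = 1.00, so the documentation chunk
    # ranks ahead of the code chunk.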
    def query(self, question, k=5, doc_weight=0.7):
        """
        Query the RAG system with a question.

        Args:
            question: User's question about R packages
            k: Number of documents to retrieve from each database
            doc_weight: Weight to give documentation vs code (between 0 and 1, exclusive)

        Returns:
            Answer from the LLM
        """
        # Retrieve relevant documentation
        docs_results = self.docs_db.similarity_search_with_score(question, k=k)

        # Retrieve relevant code
        code_results = self.code_db.similarity_search_with_score(question, k=k)

        # Combine results with weighting; scores are distances (lower is more
        # similar), so dividing by the weight favors the higher-weighted source
        combined_results = []

        # Add documentation with its weight
        for doc, score in docs_results:
            combined_results.append((doc, score / doc_weight))

        # Add code with its weight
        for doc, score in code_results:
            combined_results.append((doc, score / (1 - doc_weight)))

        # Sort by adjusted distance (ascending) and take the top k*2
        combined_results.sort(key=lambda x: x[1])
        top_results = combined_results[:k*2]

        # Extract documents
        context_docs = [item[0] for item in top_results]

        # Build context string
        context_str = "\n\n".join([
            f"[Source: {doc.metadata.get('source', 'Unknown')} | Type: {doc.metadata.get('file_type', 'Unknown')}]\n{doc.page_content}"
            for doc in context_docs
        ])

        # Fill the prompt with the weighted context and send it to the LLM
        prompt = self.prompt_template.format(context=context_str, question=question)
        response = self.llm.invoke(prompt)

        return response.content

    def interactive_mode(self):
        """
        Start an interactive session for querying the RAG system.
        """
        print("R Package RAG System - Interactive Mode (powered by Claude 3.7 Sonnet)")
        print("Type 'exit' to quit")

        while True:
            question = input("\nEnter your question: ")

            if question.lower() == 'exit':
                break

            try:
                answer = self.query(question)
                print("\nAnswer:")
                print(answer)
            except Exception as e:
                print(f"Error: {e}")

if __name__ == "__main__":
    # Example usage: the unified knowledge base built by src/main.py serves as
    # both the docs and code databases, since ingestion combines them
    docs_db_path = "./chroma_db/r_knowledge_base"
    code_db_path = "./chroma_db/r_knowledge_base"

    # Initialize RAG system
    rag = RPackageRagSystem(docs_db_path, code_db_path)

    # Start interactive mode
    rag.interactive_mode()
--------------------------------------------------------------------------------