└── README.md /README.md: -------------------------------------------------------------------------------- 1 | # Hybrid-search-strategies 2 | combine BM25 and semantic search for RAG 3 | 4 | Ref to [conversation](https://chat.qwen.ai/s/9ceeb113-e2f1-4287-8e57-231716b12b4a?fev=0.0.245) 5 | 6 | > My studies on the topic 7 | 8 | # Building a Simple Keyword Search App with BM25 in Python 9 | 10 | ## Introduction to BM25 11 | 12 | BM25 (Best Match 25) is a **ranking function** used by search engines to estimate the relevance of documents to a given search query. It's one of the most widely used and effective algorithms for information retrieval, serving as the foundation for many search systems including Elasticsearch and Apache Lucene. 13 | 14 | Unlike simple keyword matching, BM25 considers: 15 | - **Term frequency**: How often a search term appears in a document 16 | - **Document length**: Shorter documents with the same term frequency are considered more relevant 17 | - **Inverse document frequency**: Rare terms that appear in fewer documents are given more weight 18 | 19 | BM25 strikes an excellent balance between effectiveness and computational efficiency, making it perfect for building simple yet powerful search applications. 20 | 21 | ## How BM25 Works 22 | 23 | The BM25 formula calculates a relevance score for each document given a query: 24 | 25 | ``` 26 | Score(D, Q) = Σ [IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D|/avgDL))] 27 | ``` 28 | 29 | Where: 30 | - **D** = Document 31 | - **Q** = Query with terms q₁, q₂, ..., qₙ 32 | - **f(qi, D)** = Frequency of term qi in document D 33 | - **|D|** = Length of document D (in words) 34 | - **avgDL** = Average document length in the corpus 35 | - **k1** = Controls term frequency saturation (typically 1.2-2.0) 36 | - **b** = Controls document length normalization (typically 0.75) 37 | - **IDF(qi)** = Inverse Document Frequency = `log((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)` 38 | - N = Total number of documents 39 | - n(qi) = Number of documents containing term qi 40 | 41 | ## Python Implementation Example 42 | 43 | Here's a complete, working example using the `rank_bm25` library: 44 | 45 | ### Step 1: Install Required Dependencies 46 | 47 | ```bash 48 | pip install rank-bm25 49 | ``` 50 | 51 | ### Step 2: Complete Python Application 52 | 53 | ```python 54 | # coding: utf-8 55 | import string 56 | import re 57 | from rank_bm25 import BM25Okapi 58 | from typing import List, Dict 59 | 60 | class SimpleBM25Search: 61 | def __init__(self, documents: List[str]): 62 | """ 63 | Initialize the BM25 search engine with a list of documents. 64 | 65 | Args: 66 | documents: List of strings to be indexed for search 67 | """ 68 | self.documents = documents 69 | self.tokenized_docs = [self._preprocess(doc) for doc in documents] 70 | self.bm25 = BM25Okapi(self.tokenized_docs) 71 | 72 | def _preprocess(self, text: str) -> List[str]: 73 | """ 74 | Simple text preprocessing: lowercase, remove punctuation, tokenize. 75 | """ 76 | # Convert to lowercase 77 | text = text.lower() 78 | # Remove punctuation and extra whitespace 79 | text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text) 80 | text = re.sub(r'\s+', ' ', text) 81 | # Tokenize (split into words) 82 | tokens = text.strip().split() 83 | return tokens 84 | 85 | def search(self, query: str, top_k: int = 5) -> List[Dict]: 86 | """ 87 | Search for documents matching the query. 
88 | 89 | Args: 90 | query: Search query string 91 | top_k: Number of top results to return 92 | 93 | Returns: 94 | List of dictionaries containing document index, score, and text 95 | """ 96 | # Preprocess the query 97 | tokenized_query = self._preprocess(query) 98 | 99 | # Get BM25 scores for all documents 100 | scores = self.bm25.get_scores(tokenized_query) 101 | 102 | # Get top-k document indices 103 | top_indices = scores.argsort()[-top_k:][::-1] 104 | 105 | # Create results with document info 106 | results = [] 107 | for idx in top_indices: 108 | if scores[idx] > 0: # Only include documents with positive scores 109 | results.append({ 110 | 'index': int(idx), 111 | 'score': float(scores[idx]), 112 | 'document': self.documents[idx] 113 | }) 114 | 115 | return results 116 | 117 | def get_document_count(self) -> int: 118 | """Return the total number of indexed documents.""" 119 | return len(self.documents) 120 | def evaluate_query(self, query: str, relevant_docs_indices: List[int]): 121 | """ 122 | Simple evaluation showing precision at k. 123 | 124 | Args: 125 | query: Search query 126 | relevant_docs_indices: List of document indices that are truly relevant 127 | """ 128 | results = self.search(query, top_k=len(self.documents)) 129 | retrieved_indices = [r['index'] for r in results] 130 | 131 | # Calculate Precision@1, Precision@3, Precision@5 132 | for k in [1, 3, 5]: 133 | if len(retrieved_indices) >= k: 134 | top_k_retrieved = set(retrieved_indices[:k]) 135 | relevant_set = set(relevant_docs_indices) 136 | precision = len(top_k_retrieved & relevant_set) / k 137 | print(f"Precision@{k}: {precision:.3f}") 138 | 139 | return results 140 | # Example usage 141 | if __name__ == "__main__": 142 | # Sample documents (you can replace these with your own data) 143 | sample_documents = [ 144 | "The quick brown fox jumps over the lazy dog", 145 | "A fast brown fox leaps over a sleeping dog", 146 | "Machine learning is a subset of artificial intelligence", 147 | "Python is a popular programming language for data science", 148 | "Artificial intelligence and machine learning are transforming industries", 149 | "Dogs are loyal companions and make great pets", 150 | "Programming in Python is both fun and productive", 151 | "Natural language processing helps computers understand human language", 152 | "Search engines use algorithms like BM25 to rank documents", 153 | "Information retrieval is the science of searching for information" 154 | ] 155 | 156 | # Initialize the search engine 157 | search_engine = SimpleBM25Search(sample_documents) 158 | 159 | print(f"Indexed {search_engine.get_document_count()} documents\n") 160 | 161 | # Example searches 162 | queries = [ 163 | "brown fox", 164 | "machine learning", 165 | "python programming", 166 | "artificial intelligence", 167 | "search algorithms" 168 | ] 169 | 170 | for query in queries: 171 | print(f"Query: '{query}'") 172 | print("-" * 40) 173 | results = search_engine.search(query, top_k=3) 174 | 175 | if results: 176 | for i, result in enumerate(results, 1): 177 | print(f"{i}. 
Score: {result['score']:.4f}") 178 | print(f" Document: {result['document']}") 179 | print() 180 | else: 181 | print("No relevant documents found.\n") 182 | 183 | print() 184 | 185 | def interpret_score(score: float) -> str: 186 | """Interpret BM25 score ranges (rough guidelines)""" 187 | if score == 0: 188 | return "No relevance" 189 | elif score < 1.0: 190 | return "Low relevance" 191 | elif score < 2.0: 192 | return "Moderate relevance" 193 | elif score < 3.0: 194 | return "High relevance" 195 | else: 196 | return "Very high relevance" 197 | 198 | # Example usage 199 | scores = [2.3844, 2.5141, 3.5965] 200 | for score in scores: 201 | print(f"Score {score:.4f}: {interpret_score(score)}") 202 | 203 | ``` 204 | 205 | ### Step 3: Advanced Features (Optional) 206 | 207 | You can extend this basic implementation with additional features: 208 | 209 | ```python 210 | # Adding support for loading documents from files 211 | def load_documents_from_file(filename: str) -> List[str]: 212 | """Load documents from a text file (one document per line).""" 213 | with open(filename, 'r', encoding='utf-8') as f: 214 | return [line.strip() for line in f if line.strip()] 215 | 216 | # Adding support for different preprocessing (stemming, stop words) 217 | import nltk 218 | from nltk.corpus import stopwords 219 | from nltk.stem import PorterStemmer 220 | 221 | def _advanced_preprocess(self, text: str) -> List[str]: 222 | """More advanced preprocessing with stop word removal and stemming.""" 223 | # Download required NLTK data (run once) 224 | # nltk.download('stopwords') 225 | 226 | stop_words = set(stopwords.words('english')) 227 | stemmer = PorterStemmer() 228 | 229 | text = text.lower() 230 | text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text) 231 | tokens = text.strip().split() 232 | 233 | # Remove stop words and apply stemming 234 | tokens = [stemmer.stem(token) for token in tokens 235 | if token not in stop_words and len(token) > 2] 236 | 237 | return tokens 238 | ``` 239 | 240 | ## Key Benefits of This Approach 241 | 242 | 1. **Simple to implement**: Just a few lines of code with the `rank_bm25` library 243 | 2. **Efficient**: BM25 is computationally efficient for moderate-sized document collections 244 | 3. **Effective**: Provides much better results than simple keyword matching 245 | 4. **Customizable**: Easy to modify preprocessing and ranking parameters 246 | 247 | ## When to Use This 248 | 249 | - Small to medium document collections (up to ~100K documents) 250 | - Applications where you need more control than off-the-shelf search engines 251 | - Prototyping and learning information retrieval concepts 252 | - Applications where you can't use external services 253 | 254 | For larger-scale applications, consider using dedicated search engines like Elasticsearch, which also implement BM25 under the hood but with additional optimizations and features. 255 | 256 | This example gives you a solid foundation to build upon—happy coding! 257 | 258 | --- 259 | 260 | 261 | Combining BM25 (keyword-based) with embedding-based semantic similarity is a powerful **hybrid search** approach that leverages the strengths of both methods: 262 | 263 | - **BM25** excels at matching exact keywords, handling rare terms, and supporting Boolean-like queries 264 | - **Embedding similarity** captures semantic meaning, handles synonyms, and finds conceptually related content 265 | 266 | This hybrid strategy is widely used in production RAG systems and is considered one of the most effective retrieval approaches . 
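The full implementation follows, but the core fusion idea fits in a few lines: score every document with both methods, rescale each score list to a common range, and take a weighted sum. A minimal sketch with made-up toy scores (the weight and numbers here are illustrative, not taken from the implementation below):

```python
# Toy sketch of score fusion: doc 0 wins on keywords, doc 1 wins on meaning.
def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [1.0] * len(scores) if hi == lo else [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25_scores, semantic_scores, bm25_weight=0.5):
    b = min_max(bm25_scores)
    s = min_max(semantic_scores)
    return [bm25_weight * bi + (1 - bm25_weight) * si for bi, si in zip(b, s)]

print(fuse([2.1, 0.0, 0.4], [0.32, 0.78, 0.10]))  # -> [0.66, 0.5, 0.1] (approx.)
```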
267 | 268 | --- 269 | 270 | ## Architecture Overview 271 | 272 | ``` 273 | Query → [BM25 Retrieval] → BM25 Scores 274 | ↘ [Embedding Model] → Semantic Similarity Scores 275 | ↓ 276 | [Score Fusion] → Final Ranked Results 277 | ``` 278 | 279 | --- 280 | 281 | ## Implementation with llama.cpp Server 282 | 283 | Here’s a complete example that integrates BM25 with embeddings from a llama.cpp server: 284 | 285 | ### Step 1: Install Dependencies 286 | 287 | ```bash 288 | pip install rank-bm25 requests numpy 289 | ``` 290 | 291 | ### Step 2: Hybrid Search Implementation 292 | 293 | ```python 294 | import requests 295 | import numpy as np 296 | from rank_bm25 import BM25Okapi 297 | import re 298 | import string 299 | from typing import List, Dict, Tuple 300 | 301 | class HybridSearchEngine: 302 | def __init__(self, documents: List[str], llama_cpp_url: str = "http://localhost:8080"): 303 | """ 304 | Initialize hybrid search engine. 305 | 306 | Args: 307 | documents: List of documents to index 308 | llama_cpp_url: URL of your llama.cpp server with embedding endpoint 309 | """ 310 | self.documents = documents 311 | self.llama_cpp_url = llama_cpp_url 312 | 313 | # Initialize BM25 314 | self.tokenized_docs = [self._preprocess(doc) for doc in documents] 315 | self.bm25 = BM25Okapi(self.tokenized_docs) 316 | 317 | # Pre-compute document embeddings 318 | self.doc_embeddings = self._compute_document_embeddings() 319 | 320 | def _preprocess(self, text: str) -> List[str]: 321 | """Simple text preprocessing for BM25.""" 322 | text = text.lower() 323 | text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text) 324 | text = re.sub(r'\s+', ' ', text) 325 | return text.strip().split() 326 | 327 | def _get_embedding(self, text: str) -> np.ndarray: 328 | """ 329 | Get embedding from llama.cpp server. 
330 | """ 331 | try: 332 | response = requests.post( 333 | f"{self.llama_cpp_url}/embedding", 334 | json={"content": text}, 335 | timeout=10 336 | ) 337 | response.raise_for_status() 338 | result = response.json() 339 | 340 | if isinstance(result, list) and len(result) > 0: 341 | # Extract embedding and ensure it's 1D 342 | embedding = np.array(result[0]["embedding"]).flatten() 343 | else: 344 | print(f"Unexpected embedding response format: {result}") 345 | embedding = np.zeros(384) # Use correct dimension for your model 346 | 347 | return embedding # Shape will be (384,) instead of (1, 384) 348 | 349 | except Exception as e: 350 | print(f"Error getting embedding: {e}") 351 | return np.zeros(384) # Match your actual embedding dimension 352 | 353 | 354 | def _compute_document_embeddings(self) -> List[np.ndarray]: 355 | """Pre-compute embeddings for all documents.""" 356 | print("Computing document embeddings...") 357 | embeddings = [] 358 | for i, doc in enumerate(self.documents): 359 | if i % 10 == 0: 360 | print(f"Processing document {i}/{len(self.documents)}") 361 | emb = self._get_embedding(doc) 362 | embeddings.append(emb) 363 | return embeddings 364 | 365 | def _cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float: 366 | """Compute cosine similarity between two vectors.""" 367 | dot_product = np.dot(vec1, vec2) 368 | norm1 = np.linalg.norm(vec1) 369 | norm2 = np.linalg.norm(vec2) 370 | if norm1 == 0 or norm2 == 0: 371 | return 0.0 372 | return dot_product / (norm1 * norm2) 373 | 374 | def _normalize_scores(self, scores: List[float]) -> List[float]: 375 | """Min-max normalize scores to [0, 1] range.""" 376 | if not scores: 377 | return scores 378 | min_score = min(scores) 379 | max_score = max(scores) 380 | if max_score == min_score: 381 | return [1.0] * len(scores) # All scores are equal 382 | return [(s - min_score) / (max_score - min_score) for s in scores] 383 | 384 | def search(self, query: str, top_k: int = 5, bm25_weight: float = 0.5) -> List[Dict]: 385 | """ 386 | Perform hybrid search combining BM25 and semantic similarity. 
387 | 388 | Args: 389 | query: Search query 390 | top_k: Number of results to return 391 | bm25_weight: Weight for BM25 score (0.0 to 1.0) 392 | semantic_weight = 1.0 - bm25_weight 393 | 394 | Returns: 395 | List of results with combined scores 396 | """ 397 | semantic_weight = 1.0 - bm25_weight 398 | 399 | # Get BM25 scores 400 | tokenized_query = self._preprocess(query) 401 | bm25_scores = self.bm25.get_scores(tokenized_query) 402 | 403 | # Get semantic similarity scores 404 | query_embedding = self._get_embedding(query) 405 | semantic_scores = [] 406 | for doc_emb in self.doc_embeddings: 407 | similarity = self._cosine_similarity(query_embedding, doc_emb) 408 | semantic_scores.append(similarity) 409 | 410 | # Normalize both score types to comparable ranges 411 | normalized_bm25 = self._normalize_scores(bm25_scores.tolist()) 412 | normalized_semantic = self._normalize_scores(semantic_scores) 413 | 414 | # Combine scores 415 | combined_scores = [] 416 | for i in range(len(self.documents)): 417 | combined = (bm25_weight * normalized_bm25[i] + 418 | semantic_weight * normalized_semantic[i]) 419 | combined_scores.append(combined) 420 | 421 | # Get top-k results 422 | top_indices = np.argsort(combined_scores)[-top_k:][::-1] 423 | 424 | results = [] 425 | for idx in top_indices: 426 | if combined_scores[idx] > 0: 427 | results.append({ 428 | 'index': int(idx), 429 | 'combined_score': float(combined_scores[idx]), 430 | 'bm25_score': float(normalized_bm25[idx]), 431 | 'semantic_score': float(normalized_semantic[idx]), 432 | 'document': self.documents[idx] 433 | }) 434 | 435 | return results 436 | 437 | # Example usage 438 | if __name__ == "__main__": 439 | # Your documents 440 | documents = [ 441 | "The quick brown fox jumps over the lazy dog", 442 | "A fast brown fox leaps over a sleeping dog", 443 | "Machine learning is a subset of artificial intelligence", 444 | "Python is a popular programming language for data science", 445 | "Artificial intelligence and machine learning are transforming industries", 446 | "Dogs are loyal companions and make great pets", 447 | "Programming in Python is both fun and productive", 448 | "Natural language processing helps computers understand human language", 449 | "Search engines use algorithms like BM25 to rank documents", 450 | "Information retrieval is the science of searching for information" 451 | ] 452 | 453 | # Initialize hybrid search (adjust URL to your llama.cpp server) 454 | hybrid_search = HybridSearchEngine( 455 | documents=documents, 456 | llama_cpp_url="http://localhost:8080" # Your llama.cpp server URL 457 | ) 458 | 459 | # Test queries 460 | queries = [ 461 | "brown fox", 462 | "machine learning", 463 | "python programming", 464 | "AI and ML" 465 | ] 466 | 467 | for query in queries: 468 | print(f"\nQuery: '{query}'") 469 | print("-" * 50) 470 | 471 | # Test different weight combinations 472 | for weight in [0.3, 0.5, 0.7]: # BM25 weights 473 | print(f"\nBM25 weight: {weight}, Semantic weight: {1-weight}") 474 | results = hybrid_search.search(query, top_k=3, bm25_weight=weight) 475 | 476 | for i, result in enumerate(results, 1): 477 | print(f"{i}. Combined: {result['combined_score']:.4f} " 478 | f"(BM25: {result['bm25_score']:.4f}, " 479 | f"Semantic: {result['semantic_score']:.4f})") 480 | print(f" Doc: {result['document']}") 481 | 482 | ``` 483 | 484 | --- 485 | 486 | ## Setting up llama.cpp Server with Embeddings 487 | 488 | Make sure your llama.cpp server supports the `/embedding` endpoint. 
You can start it like this: 489 | 490 | ```bash 491 | # Start llama.cpp server with embedding support 492 | .\llama-server.exe -m C:\FABIO-AI\MODELS_embeddings\bge-small-en-v1.5_fp16.gguf --port 8080 --embedding 493 | 494 | # Test the embedding endpoint 495 | curl -s -X POST "http://localhost:8080/embedding" --data "{\"content\":\"AI is artificial intelligence\"}" 496 | 497 | ``` 498 | 499 | --- 500 | 501 | ## Advanced Fusion Strategies 502 | 503 | ### 1. **Reciprocal Rank Fusion (RRF)** 504 | Instead of score averaging, combine rankings: 505 | 506 | ```python 507 | def reciprocal_rank_fusion(bm25_results: List[int], semantic_results: List[int], k=60): 508 | """Combine rankings using Reciprocal Rank Fusion.""" 509 | fused_scores = {} 510 | 511 | # BM25 results 512 | for rank, doc_id in enumerate(bm25_results): 513 | fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (k + rank + 1) 514 | 515 | # Semantic results 516 | for rank, doc_id in enumerate(semantic_results): 517 | fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (k + rank + 1) 518 | 519 | # Sort by fused score 520 | return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True) 521 | ``` 522 | 523 | ### 2. **Learned Weight Combination** 524 | Use validation data to find optimal weights: 525 | 526 | ```python 527 | # Grid search for best BM25 weight 528 | best_weight = 0.5 529 | best_score = 0 530 | 531 | for weight in np.arange(0.1, 1.0, 0.1): 532 | results = hybrid_search.search(query, bm25_weight=weight) 533 | # Evaluate using your ground truth metric 534 | score = evaluate_results(results, ground_truth) 535 | if score > best_score: 536 | best_score = score 537 | best_weight = weight 538 | ``` 539 | 540 | --- 541 | 542 | ## Benefits of This Hybrid Approach 543 | 544 | This combination is particularly effective because: 545 | 546 | - **BM25 handles exact matches and rare terms** that embeddings might miss 547 | - **Embeddings capture semantic relationships** and handle paraphrasing 548 | - **Robustness**: If one method fails, the other can still provide relevant results 549 | - **State-of-the-art**: This approach is used by leading RAG systems [[1], [3]] 550 | 551 | The hybrid search approach you're implementing is exactly what modern systems like those described in Anthropic's research employ, where "the retrieval result of contextual embedding search and contextual BM25 search are merged" . 552 | 553 | Search Source · 10 554 | 555 | 1. 556 | https://milvus.io/docs/llamaindex_milvus_hybrid_search.md 557 | · 558 | (2025-04-17) 559 | RAG using Hybrid Search with Milvus and LlamaIndex 560 | We'll begin with the recommended default hybrid search (semantic + BM25) and then explore other alternative sparse embedding methods and 561 | 2. 562 | https://blog.lancedb.com/hybrid-search-combining-bm25-and-semantic-search-for-better-results-with-lan-1358038fe7e6/ 563 | · 564 | (2023-12-09) 565 | Hybrid Search: Combining BM25 and Semantic 566 | BM25 is a ranking algorithm used in information retrieval systems to estimate the relevance of documents to a given search query. 567 | 3. 568 | https://medium.com/@odhitom09/the-most-effective-rag-approach-to-date-anthropics-contextual-retrieval-and-hybrid-search-8dc2af5cb970 569 | The most effective RAG approach to date? Anthropic's 570 | Anthropic also employs a hybrid approach where the retrieval result of contextual embedding search and contextual BM25 search are merged using 571 | 4. 
https://www.atyun.com/58079.html
·
(2023-12-14)
Combining BM25 and semantic search with LangChain for better results
Hybrid search: combining BM25 and semantic search with LangChain for better results · What it does: it checks how often your search terms appear in the documents and takes document length into account to return the most relevant results.
5.
https://medium.com/@adnanmasood/optimizing-chunking-embedding-and-vectorization-for-retrieval-augmented-generation-ea3b083b68f7
Optimizing Chunking, Embedding, and Vectorization for
If using sparse vectors (like BM25), similarity is often defined differently (BM25 scoring formula or just dot product of TF-IDF features).
6.

GitHub
Building an Index that supports a hybrid search comprising
I am trying to build an index that should support a hybrid search mechanism consisting of both BM25 and vector searches.
7.
https://developers.llamaindex.ai/python/examples/retrievers/bm25_retriever/
BM25 Retriever | LlamaIndex Python Documentation
In this guide, we define a bm25 retriever that search documents using the bm25 method. BM25 (Best Matching 25) is a ranking function that extends TF-IDF by
8.
https://js.langchain.com/docs/integrations/retrievers/supabase-hybrid/
Supabase Hybrid Search
The hybrid search combines the postgres pgvector extension (similarity search) and Full-Text Search (keyword search) to retrieve documents.
9.
https://modelscope.cn/models/gpustack/bge-m3-GGUF
bge-m3-GGUF
A classic example: using both embedding retrieval and the BM25 algorithm. Now, you can try to use BGE-M3, which supports both embedding and sparse retrieval.
10.
https://focushubjobs.org/?_=/topics/hybrid-search#WZUejyqGZTrfm/ZWkBAfbV6M
hybrid-search · GitHub Topics - Jobs
The AI-native database built for LLM applications, providing incredibly fast hybrid search of dense vector, sparse vector, tensor (multi-vector), and fu

---

Excellent! Now that your hybrid search is working, let me explain how to interpret these scores.

## Understanding the Score Components

Your hybrid system produces **three types of scores**, all normalized to the range **[0.0, 1.0]**:

### 1. **BM25 Score (Keyword Matching)**
- **1.0** = Perfect keyword match (contains all query terms with good frequency)
- **0.0** = No query terms found in the document
- Based purely on **exact word matching**, term frequency, and document length

### 2. **Semantic Score (Embedding Similarity)**
- **1.0** = Maximum semantic similarity (documents are conceptually identical)
- **0.0** = Minimum semantic similarity (completely unrelated concepts)
- Based on **meaning and context**, captures synonyms, related concepts, and paraphrasing

### 3. **Combined Score (Final Ranking)**
- Weighted average: `Combined = (BM25_weight × BM25_score) + (Semantic_weight × Semantic_score)`
- **1.0** = Best possible result according to your weighting
- **0.0** = Worst possible result

## Analyzing Your Results

### 🔍 **Query: "brown fox"**
```bash
1. Combined: 1.0000 (BM25: 1.0000, Semantic: 1.0000)
   Doc: A fast brown fox leaps over a sleeping dog
```
- **Perfect match**: Contains exact keywords "brown" and "fox" → BM25 = 1.0
- **Perfect semantic match**: Same topic/concept → Semantic = 1.0
- This is the ideal case where both methods agree completely

### 🔍 **Query: "AI and ML"** (Most Interesting Case!)
640 | ```bash 641 | 1. Combined: 0.8528 (BM25: 1.0000, Semantic: 0.7897) 642 | Doc: Artificial intelligence and machine learning are transforming industries 643 | 2. Combined: 0.7000 (BM25: 0.0000, Semantic: 1.0000) 644 | Doc: Machine learning is a subset of artificial intelligence 645 | ``` 646 | 647 | This reveals the **power of hybrid search**: 648 | 649 | - **Result #1**: Has exact keyword matches ("Artificial intelligence", "machine learning") → **BM25 = 1.0**, but slightly lower semantic similarity 650 | - **Result #2**: Contains the **concepts** but not the exact query terms "AI and ML" → **BM25 = 0.0**, but **perfect semantic match** (1.0) 651 | 652 | **Without hybrid search:** 653 | - Pure BM25 would **completely miss** Result #2 (score = 0.0) 654 | - Pure semantic search might **miss Result #1** if the weighting favored different aspects 655 | 656 | ## How to Interpret Different Weighting Strategies 657 | 658 | ### **BM25 Weight = 0.3 (Semantic-heavy)** 659 | - Prioritizes **conceptual relevance** over exact keywords 660 | - Better for **conversational queries**, paraphrasing, or when users don't know exact terminology 661 | - Example: "AI and ML" query finds the semantically perfect document even without exact terms 662 | 663 | ### **BM25 Weight = 0.7 (Keyword-heavy)** 664 | - Prioritizes **exact term matching** 665 | - Better for **technical queries**, when users know specific terms, or need precise matches 666 | - Example: Ensures documents with exact phrases appear higher 667 | 668 | ### **BM25 Weight = 0.5 (Balanced)** 669 | - Good default for general-purpose search 670 | - Compromise between precision (keywords) and recall (semantics) 671 | 672 | ## Key Insights from Your Results 673 | 674 | 1. **When both scores are high (≈1.0)**: You have the best possible match 675 | 2. **When BM25 = 0 but Semantic > 0**: Pure semantic retrieval finding conceptually relevant but keyword-mismatched results 676 | 3. **When Semantic = 0 but BM25 > 0**: Pure keyword match with no semantic relationship (rare, usually indicates noise) 677 | 4. **The "AI and ML" query demonstrates hybrid search's main advantage**: It finds **both** exact matches **and** conceptually relevant documents that pure methods would miss 678 | 679 | ## Practical Guidelines for Evaluation 680 | 681 | - **Look at the top 3-5 results**: Are they all relevant to your intent? 682 | - **Check the balance**: If you're getting irrelevant exact matches, reduce BM25 weight. If you're missing obvious keyword matches, increase BM25 weight. 683 | - **Consider your use case**: 684 | - **Customer support**: Might prefer semantic-heavy (users describe problems in their own words) 685 | - **Legal/technical search**: Might prefer keyword-heavy (need precise terminology matches) 686 | - **General knowledge**: Balanced approach usually works best 687 | 688 | Your hybrid system is working beautifully—it's successfully combining the precision of keyword search with the flexibility of semantic search! 689 | 690 | 691 | 692 | --- 693 | 694 | # THE SERIES 695 | 696 | # RAG Data Preparation Series: From Raw Documents to Hybrid Retrieval with Local LLMs 697 | 698 | Here’s a strategic, progressive series outline for your Substack that builds from fundamentals to an advanced local RAG system—complete with practical code examples, clear learning objectives, and real-world relevance. 
699 | 700 | --- 701 | 702 | ## **Series Title Suggestion** 703 | **"Build Your Own RAG: Data Prep, Hybrid Search & Local LLMs"** 704 | 705 | --- 706 | 707 | ## **Article 1: Why Data Prep Matters in RAG (The Foundation)** 708 | *Hook: "Your RAG is only as good as your data pipeline—here’s why."* 709 | 710 | ### Key Points: 711 | - Common RAG failure modes caused by poor data prep 712 | - The retrieval-augmentation gap: when good docs ≠ good answers 713 | - Overview of the full pipeline: **PDF → Markdown → Chunks → Vectors + Keywords → Hybrid Search → Local LLM** 714 | 715 | ### Practical Demo: 716 | - Show a "before/after" of naive vs. prepared RAG on a real PDF 717 | - Code: Load a PDF, extract raw text, show limitations (messy headers, broken tables) 718 | 719 | ### Tools Introduced: 720 | - `pypdf`, `pdfplumber` for PDF extraction 721 | - Why Markdown is the ideal intermediate format 722 | 723 | > **Takeaway**: Data prep isn’t optional—it’s the core of RAG reliability. 724 | 725 | --- 726 | 727 | ## **Article 2: Chunking Strategies That Work (From Markdown)** 728 | *Hook: "Chunking isn’t just splitting text—it’s preserving meaning."* 729 | 730 | ### Key Points: 731 | - Why naive chunking (fixed char/word splits) fails 732 | - **Semantic chunking**: Preserve context boundaries (headers, paragraphs) 733 | - **Overlap strategies**: Prevent context bleeding at boundaries 734 | 735 | ### Practical Demo: 736 | - Convert PDF → clean Markdown (using `unstructured` or `pandoc`) 737 | - Implement 3 chunking methods: 738 | 1. **Fixed-size** (naive baseline) 739 | 2. **Recursive splitting** (LangChain-style) 740 | 3. **Markdown-aware** (split by headers + paragraphs) 741 | 742 | ```python 743 | # Example: Markdown-aware chunker 744 | def chunk_markdown(md_text, max_chunk_size=500): 745 | # Split by headers first 746 | sections = re.split(r'\n#+ ', md_text) 747 | chunks = [] 748 | for section in sections: 749 | if len(section) <= max_chunk_size: 750 | chunks.append(section) 751 | else: 752 | # Recursive split paragraphs 753 | paragraphs = section.split('\n\n') 754 | # ... combine paragraphs smartly 755 | return chunks 756 | ``` 757 | 758 | ### Evaluation: 759 | - Show retrieval quality differences using a test query 760 | - Measure: "Does the chunk contain enough context to answer Q?" 761 | 762 | > **Takeaway**: Your chunking strategy directly impacts answer quality. 763 | 764 | --- 765 | 766 | ## **Article 3: Keyword Power – BM25 for Reliable Retrieval** 767 | *Hook: "Don’t abandon keywords—supercharge them with BM25."* 768 | 769 | ### Key Points: 770 | - Why pure vector search fails on rare terms, acronyms, or exact phrases 771 | - How BM25 works (simple intuition + formula) 772 | - When BM25 beats embeddings (and vice versa) 773 | 774 | ### Practical Demo: 775 | - Build BM25 index from Markdown chunks 776 | - Show retrieval examples where BM25 wins: 777 | - Query: `"API key format"` → finds exact technical specs 778 | - Query: `"2023 revenue"` → finds precise numbers 779 | 780 | ```python 781 | # Code from your working example (simplified) 782 | from rank_bm25 import BM25Okapi 783 | 784 | tokenized_chunks = [preprocess(chunk) for chunk in chunks] 785 | bm25 = BM25Okapi(tokenized_chunks) 786 | 787 | def bm25_search(query, top_k=5): 788 | scores = bm25.get_scores(preprocess(query)) 789 | top_idxs = scores.argsort()[-top_k:][::-1] 790 | return [(chunks[i], scores[i]) for i in top_idxs] 791 | ``` 792 | 793 | > **Takeaway**: BM25 is your safety net for precise, keyword-driven queries. 
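A quick usage sketch of the `bm25_search` helper above (illustrative; `chunks` and `preprocess` come from the Article 2 chunking step, and the query string is just an example):

```python
# Print the top matches with their raw BM25 scores
for chunk, score in bm25_search("API key format", top_k=3):
    print(f"{score:.3f}  {chunk[:80]}...")
```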
794 | 795 | --- 796 | 797 | ## **Article 4: Semantic Search with Local Embeddings** 798 | *Hook: "Run semantic search entirely offline—with llama.cpp."* 799 | 800 | ### Key Points: 801 | - Why local embeddings matter (privacy, cost, control) 802 | - Setting up `llama.cpp` for embeddings (model choice, flags) 803 | - Cosine similarity vs. other distance metrics 804 | 805 | ### Practical Demo: 806 | - Start `llama.cpp` server with embedding model (`nomic-embed-text`) 807 | - Generate embeddings for all chunks 808 | - Build semantic search function 809 | 810 | ```python 811 | # Request embedding from local server 812 | def get_embedding(text): 813 | res = requests.post("http://localhost:8080/embedding", 814 | json={"content": text}) 815 | return np.array(res.json()[0]["embedding"]) 816 | 817 | # Precompute all chunk embeddings 818 | chunk_embeddings = [get_embedding(chunk) for chunk in chunks] 819 | ``` 820 | 821 | ### Comparison: 822 | - Show queries where semantic search wins: 823 | - Query: `"How do I authenticate?"` → finds sections about "API keys", "OAuth", etc. 824 | 825 | > **Takeaway**: Local embeddings = semantic power without the cloud dependency. 826 | 827 | --- 828 | 829 | ## **Article 5: Hybrid Search – The Best of Both Worlds** 830 | *Hook: "Why choose between keywords and semantics? Combine them."* 831 | 832 | ### Key Points: 833 | - The hybrid search advantage: coverage + precision 834 | - Score fusion strategies (weighted average, RRF) 835 | - Tuning weights for your domain 836 | 837 | ### Practical Demo: 838 | - Implement your working hybrid search code 839 | - Show dramatic improvements on mixed queries: 840 | - Query: `"AI revenue 2023"` → BM25 finds "2023", semantic finds "AI revenue" 841 | 842 | ```python 843 | # Hybrid scoring (normalized + weighted) 844 | combined_score = w_bm25 * norm_bm25_score + w_semantic * norm_semantic_score 845 | ``` 846 | 847 | ### Interactive Element: 848 | - Provide a Colab notebook where readers can adjust weights and see results 849 | 850 | > **Takeaway**: Hybrid search consistently outperforms single-method retrieval. 851 | 852 | --- 853 | 854 | ## **Article 6: Building Your Local RAG Pipeline** 855 | *Hook: "From PDF to answer—100% local, zero cloud."* 856 | 857 | ### Key Points: 858 | - Full pipeline integration: PDF → Markdown → Chunks → Hybrid Index → Llama.cpp LLM 859 | - When to reindex vs. incremental updates 860 | - Performance considerations (caching embeddings, etc.) 861 | 862 | ### Practical Demo: 863 | - End-to-end code that: 864 | 1. Takes a PDF path 865 | 2. Converts to clean Markdown 866 | 3. Chunks intelligently 867 | 4. Builds hybrid index (BM25 + embeddings) 868 | 5. Answers questions using local LLM 869 | 870 | ```python 871 | # Final RAG function 872 | def rag_query(query, top_k=3): 873 | # Hybrid retrieval 874 | results = hybrid_search.search(query, top_k=top_k) 875 | 876 | # Build context 877 | context = "\n\n".join([r['document'] for r in results]) 878 | 879 | # Local LLM generation 880 | prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:" 881 | response = requests.post("http://localhost:8080/completion", 882 | json={"prompt": prompt, "n_predict": 200}) 883 | return response.json()["content"] 884 | ``` 885 | 886 | ### Testing: 887 | - Compare answers with/without hybrid retrieval 888 | - Show how good data prep reduces hallucinations 889 | 890 | > **Takeaway**: A well-prepared local RAG can rival cloud solutions—for free. 
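For completeness, a hypothetical call to the pipeline above (it assumes the hybrid index has already been built for your document and a llama.cpp completion server is listening on localhost:8080):

```python
# Ask a question against the indexed document and print the grounded answer
answer = rag_query("What was the revenue growth in 2023?", top_k=3)
print(answer)
```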

---

## **Series Bonuses**

### **Code Repository**
- GitHub repo with all code examples, ready to run
- Dockerfile for one-click setup (llama.cpp + Python deps)

### **Cheat Sheets**
- Chunking strategy decision tree
- BM25 vs. semantic search: when to use which
- llama.cpp flags for embedding vs. generation

### **Community Challenge**
- Week 6: Readers submit their own PDF → RAG pipeline
- Feature the best implementation in a follow-up post

---

## **Why This Series Works**

1. **Progressive complexity**: Each article builds on the last
2. **Practical focus**: Every concept paired with runnable code
3. **Local-first**: Appeals to privacy-conscious, cost-aware developers
4. **Complete story**: From raw data to final answers
5. **Timely**: Hybrid search is the current SOTA for RAG

This structure positions you as a trusted guide who doesn’t just explain concepts—but shows exactly how to implement them in real systems. Each article delivers immediate value while building toward a powerful final project.

Would you like me to expand any specific article with more detailed code examples or learning objectives?

---

# ADDENDUM
Here are some adjustments:
- I don't want to use a specific framework (LangChain or LlamaIndex)
- For Article 1 I need more information; highlight that the Markdown structure (including level 2 and 3 headers) allows creating summaries for the high-level parts and keeps the chunks semantically coherent with the content structure
- For Article 2 I want the chunking to be **Markdown-aware** (split by headers + paragraphs)
- For Article 4 keep the cosine similarity implemented with NumPy only
- For Article 5, remember that the function you provided earlier to combine cosine similarity with BM25 is already perfect
- Add an Article 7 to create a GUI app (based on the code of Article 6). As an intermediate app, I would like both BM25 and semantic search to have a GUI interface for exploring the chunks and visualizing the scores for a user-given search

---

# RAG Data Preparation Series: From Raw Documents to Hybrid Retrieval with Local LLMs

Here’s your refined series outline with all requested adjustments—framework-free, Markdown-structured, NumPy-only, and culminating in a GUI app.
940 | 941 | --- 942 | 943 | ## **Series Title** 944 | **"Build Your Own RAG: Data Prep, Hybrid Search & Local LLMs (No Frameworks Needed)"** 945 | 946 | --- 947 | 948 | ## **Article 1: Why Document Structure is Your Secret Weapon** 949 | *Hook: "Your PDF’s hidden structure holds the key to better RAG—here’s how to unlock it."* 950 | 951 | ### Key Points: 952 | - **The structure problem**: Raw PDF text loses semantic hierarchy (headers, sections, lists) 953 | - **Why Markdown is ideal**: Preserves document structure while being human+machine readable 954 | - **Headers as semantic anchors**: 955 | - Level 1 headers = main document sections 956 | - Level 2-3 headers = subtopics with natural boundaries 957 | - **Summarization opportunity**: Each header section can be summarized independently 958 | 959 | ### Practical Demo: 960 | - Extract PDF → clean Markdown using `pandoc` (no heavy frameworks) 961 | - Show structure preservation: 962 | 963 | ```bash 964 | # Convert PDF to structured Markdown 965 | pandoc document.pdf -t markdown -o document.md 966 | ``` 967 | 968 | ```markdown 969 | # Annual Report 2023 ← Level 1: Main section 970 | 971 | ## Financial Highlights ← Level 2: Subtopic (perfect chunk boundary) 972 | Revenue grew 15% year-over-year... 973 | 974 | ### Regional Breakdown ← Level 3: Granular detail 975 | North America: $2.1B... 976 | ``` 977 | 978 | - **Why this matters for RAG**: 979 | - Level 2+ headers create **naturally coherent chunks** 980 | - Each chunk has built-in **context and topic identity** 981 | - Enables **hierarchical retrieval**: find section first, then details 982 | 983 | ### Code Preview: 984 | - Simple function to parse Markdown headers and their content 985 | - Show how structure enables better chunking (teaser for Article 2) 986 | 987 | > **Takeaway**: Document structure isn’t noise—it’s your retrieval roadmap. 988 | 989 | --- 990 | 991 | ## **Article 2: Markdown-Aware Chunking – Preserve Meaning, Not Just Text** 992 | *Hook: "Stop splitting text randomly—chunk by semantic boundaries instead."* 993 | 994 | ### Key Points: 995 | - Problems with naive chunking: breaks context, loses topic coherence 996 | - **Markdown-aware strategy**: Respect header hierarchy + paragraph boundaries 997 | - **Chunk size logic**: 998 | - Small sections (under 500 chars) = keep whole 999 | - Large sections = split by paragraphs with overlap 1000 | 1001 | ### Practical Demo: 1002 | - Complete Markdown-aware chunker (no external dependencies): 1003 | 1004 | ```python 1005 | import re 1006 | 1007 | def chunk_markdown_by_headers(md_text, max_chunk_size=500, overlap=50): 1008 | """ 1009 | Chunk Markdown text respecting header hierarchy. 1010 | Each chunk maintains semantic coherence from document structure. 
1011 | """ 1012 | # Split by headers (preserve header level and content) 1013 | header_pattern = r'^(#{1,6})\s+(.*?)$' 1014 | lines = md_text.split('\n') 1015 | 1016 | chunks = [] 1017 | current_header = "" 1018 | current_content = "" 1019 | 1020 | for line in lines: 1021 | if re.match(header_pattern, line): 1022 | # New header found - process previous section 1023 | if current_content.strip(): 1024 | chunks.extend( 1025 | _split_section(current_header, current_content, max_chunk_size, overlap) 1026 | ) 1027 | # Start new section 1028 | header_match = re.match(header_pattern, line) 1029 | current_header = line # Keep full header line 1030 | current_content = "" 1031 | else: 1032 | current_content += line + '\n' 1033 | 1034 | # Don't forget the last section 1035 | if current_content.strip(): 1036 | chunks.extend( 1037 | _split_section(current_header, current_content, max_chunk_size, overlap) 1038 | ) 1039 | 1040 | return chunks 1041 | 1042 | def _split_section(header, content, max_size, overlap): 1043 | """Split a section into chunks if too large.""" 1044 | if len(header + content) <= max_size: 1045 | return [header + '\n' + content.strip()] 1046 | 1047 | # Split content by paragraphs 1048 | paragraphs = [p for p in content.split('\n\n') if p.strip()] 1049 | chunks = [] 1050 | current_chunk = header + '\n\n' 1051 | 1052 | for para in paragraphs: 1053 | if len(current_chunk + para) <= max_size: 1054 | current_chunk += para + '\n\n' 1055 | else: 1056 | # Finalize current chunk 1057 | chunks.append(current_chunk.strip()) 1058 | # Start new chunk with overlap 1059 | current_chunk = header + '\n\n' + para + '\n\n' 1060 | 1061 | if current_chunk.strip(): 1062 | chunks.append(current_chunk.strip()) 1063 | 1064 | return chunks 1065 | ``` 1066 | 1067 | ### Evaluation: 1068 | - Compare retrieval quality: structured vs. naive chunks 1069 | - Show how header context helps LLM understand chunk purpose 1070 | 1071 | > **Takeaway**: Let your document’s natural structure guide your chunking. 1072 | 1073 | --- 1074 | 1075 | ## **Article 3: Keyword Power – BM25 for Reliable Retrieval** 1076 | *Hook: "Don’t abandon keywords—supercharge them with BM25."* 1077 | 1078 | ### Key Points: 1079 | - BM25 advantages: exact matches, rare terms, Boolean-like precision 1080 | - Simple implementation with `rank_bm25` (only dependency needed) 1081 | - When BM25 saves the day: technical queries, acronyms, specific phrases 1082 | 1083 | ### Practical Demo: 1084 | - Full BM25 implementation from Article 2 chunks 1085 | - Preprocessing function optimized for technical content 1086 | - Query examples showing BM25’s precision 1087 | 1088 | ```python 1089 | from rank_bm25 import BM25Okapi 1090 | import string 1091 | import re 1092 | 1093 | def preprocess_text(text): 1094 | """Clean and tokenize for BM25.""" 1095 | text = text.lower() 1096 | text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text) 1097 | text = re.sub(r'\s+', ' ', text) 1098 | return [word for word in text.split() if len(word) > 2] 1099 | 1100 | # Build BM25 index 1101 | chunks = chunk_markdown_by_headers(markdown_text) 1102 | tokenized_chunks = [preprocess_text(chunk) for chunk in chunks] 1103 | bm25_index = BM25Okapi(tokenized_chunks) 1104 | ``` 1105 | 1106 | > **Takeaway**: BM25 is your precision tool for when exactness matters. 
1107 | 1108 | --- 1109 | 1110 | ## **Article 4: Semantic Search with Pure NumPy** 1111 | *Hook: "Semantic search without heavy frameworks—just NumPy and your local LLM."* 1112 | 1113 | ### Key Points: 1114 | - Why local embeddings: privacy, cost, control 1115 | - Setting up `llama.cpp` with embedding models 1116 | - **Pure NumPy cosine similarity**—no scikit-learn dependency 1117 | 1118 | ### Practical Demo: 1119 | - Start `llama.cpp` server: `./server -m nomic-embed-text.Q4_K_M.gguf --embedding` 1120 | - Embedding retrieval function 1121 | - **NumPy-only cosine similarity**: 1122 | 1123 | ```python 1124 | import numpy as np 1125 | import requests 1126 | 1127 | def get_embedding(text, server_url="http://localhost:8080"): 1128 | """Get embedding from llama.cpp server.""" 1129 | response = requests.post(f"{server_url}/embedding", 1130 | json={"content": text}) 1131 | return np.array(response.json()[0]["embedding"]).flatten() 1132 | 1133 | def cosine_similarity(vec1, vec2): 1134 | """Pure NumPy cosine similarity.""" 1135 | dot_product = np.dot(vec1, vec2) 1136 | norm1 = np.linalg.norm(vec1) 1137 | norm2 = np.linalg.norm(vec2) 1138 | if norm1 == 0 or norm2 == 0: 1139 | return 0.0 1140 | return dot_product / (norm1 * norm2) 1141 | 1142 | # Precompute all chunk embeddings 1143 | chunk_embeddings = [get_embedding(chunk) for chunk in chunks] 1144 | ``` 1145 | 1146 | ### Testing: 1147 | - Show semantic matches: "authentication" → finds "API keys", "OAuth", "credentials" 1148 | 1149 | > **Takeaway**: Semantic power with minimal dependencies. 1150 | 1151 | --- 1152 | 1153 | ## **Article 5: Hybrid Search – Your BM25 + Semantic Fusion** 1154 | *Hook: "The retrieval strategy that beats both BM25 and semantic search alone."* 1155 | 1156 | ### Key Points: 1157 | - Why hybrid search works: coverage + precision 1158 | - Score normalization and weighted fusion 1159 | - **Your perfect fusion function** (refined from our earlier work) 1160 | 1161 | ### Practical Demo: 1162 | - Complete hybrid search implementation: 1163 | 1164 | ```python 1165 | def normalize_scores(scores): 1166 | """Min-max normalize to [0, 1].""" 1167 | if not scores or max(scores) == min(scores): 1168 | return [1.0] * len(scores) 1169 | min_score, max_score = min(scores), max(scores) 1170 | return [(s - min_score) / (max_score - min_score) for s in scores] 1171 | 1172 | def hybrid_search(query, chunks, bm25_index, chunk_embeddings, 1173 | bm25_weight=0.5, top_k=5): 1174 | """ 1175 | Your perfected hybrid search combining BM25 and semantic similarity. 
1176 | """ 1177 | # BM25 scores 1178 | tokenized_query = preprocess_text(query) 1179 | bm25_scores = bm25_index.get_scores(tokenized_query).tolist() 1180 | 1181 | # Semantic scores 1182 | query_emb = get_embedding(query) 1183 | semantic_scores = [ 1184 | cosine_similarity(query_emb, doc_emb) 1185 | for doc_emb in chunk_embeddings 1186 | ] 1187 | 1188 | # Normalize and combine 1189 | norm_bm25 = normalize_scores(bm25_scores) 1190 | norm_semantic = normalize_scores(semantic_scores) 1191 | 1192 | combined_scores = [ 1193 | bm25_weight * b + (1 - bm25_weight) * s 1194 | for b, s in zip(norm_bm25, norm_semantic) 1195 | ] 1196 | 1197 | # Get top results 1198 | top_indices = np.argsort(combined_scores)[-top_k:][::-1] 1199 | return [ 1200 | { 1201 | 'chunk': chunks[i], 1202 | 'combined_score': combined_scores[i], 1203 | 'bm25_score': norm_bm25[i], 1204 | 'semantic_score': norm_semantic[i] 1205 | } 1206 | for i in top_indices if combined_scores[i] > 0 1207 | ] 1208 | ``` 1209 | 1210 | ### Interactive Examples: 1211 | - Show how different weights affect results 1212 | - Demonstrate mixed queries: `"revenue growth 2023"` → finds both terms and concepts 1213 | 1214 | > **Takeaway**: Hybrid search = robust retrieval for real-world queries. 1215 | 1216 | --- 1217 | 1218 | ## **Article 6: Full Local RAG Pipeline** 1219 | *Hook: "From PDF to answer—100% local, zero frameworks, maximum control."* 1220 | 1221 | ### Key Points: 1222 | - Complete pipeline integration 1223 | - Local LLM generation with `llama.cpp` 1224 | - When to cache vs. recompute 1225 | 1226 | ### Practical Demo: 1227 | - End-to-end pipeline: 1228 | 1229 | ```python 1230 | def full_rag_pipeline(pdf_path, query, bm25_weight=0.5): 1231 | # 1. PDF → Markdown 1232 | md_text = convert_pdf_to_markdown(pdf_path) # using pandoc 1233 | 1234 | # 2. Smart chunking 1235 | chunks = chunk_markdown_by_headers(md_text) 1236 | 1237 | # 3. Build hybrid index 1238 | tokenized_chunks = [preprocess_text(chunk) for chunk in chunks] 1239 | bm25_index = BM25Okapi(tokenized_chunks) 1240 | chunk_embeddings = [get_embedding(chunk) for chunk in chunks] 1241 | 1242 | # 4. Hybrid retrieval 1243 | results = hybrid_search(query, chunks, bm25_index, chunk_embeddings, bm25_weight) 1244 | 1245 | # 5. Local LLM generation 1246 | context = "\n\n".join([r['chunk'] for r in results[:3]]) 1247 | prompt = f"Answer based on this context:\n\n{context}\n\nQuestion: {query}\nAnswer:" 1248 | 1249 | response = requests.post("http://localhost:8080/completion", 1250 | json={"prompt": prompt, "n_predict": 256}) 1251 | return response.json()["content"] 1252 | ``` 1253 | 1254 | > **Takeaway**: Full local RAG—your data, your model, your rules. 1255 | 1256 | --- 1257 | 1258 | ## **Article 7: Build a GUI to Explore Your Hybrid Search** 1259 | *Hook: "See your retrieval scores in real-time—with a simple desktop app."* 1260 | 1261 | ### Key Points: 1262 | - Why visual feedback matters: understand what your RAG is doing 1263 | - **Dual interface**: Compare BM25 vs. Semantic vs. 
Hybrid results side-by-side 1264 | - Simple GUI with `tkinter` (no web frameworks) 1265 | 1266 | ### Practical Demo: 1267 | - Complete GUI application: 1268 | 1269 | ```python 1270 | import tkinter as tk 1271 | from tkinter import ttk, scrolledtext 1272 | import numpy as np 1273 | 1274 | class HybridSearchGUI: 1275 | def __init__(self, chunks, bm25_index, chunk_embeddings): 1276 | self.chunks = chunks 1277 | self.bm25_index = bm25_index 1278 | self.chunk_embeddings = chunk_embeddings 1279 | 1280 | # Create main window 1281 | self.root = tk.Tk() 1282 | self.root.title("Hybrid Search Explorer") 1283 | self.root.geometry("1200x800") 1284 | 1285 | # Query input 1286 | tk.Label(self.root, text="Search Query:").pack(pady=5) 1287 | self.query_var = tk.StringVar() 1288 | tk.Entry(self.root, textvariable=self.query_var, width=80).pack(pady=5) 1289 | tk.Button(self.root, text="Search", command=self.on_search).pack(pady=5) 1290 | 1291 | # Weight control 1292 | tk.Label(self.root, text="BM25 Weight:").pack() 1293 | self.weight_var = tk.DoubleVar(value=0.5) 1294 | tk.Scale(self.root, from_=0.0, to=1.0, resolution=0.1, 1295 | orient=tk.HORIZONTAL, variable=self.weight_var).pack() 1296 | 1297 | # Results notebooks 1298 | self.notebook = ttk.Notebook(self.root) 1299 | self.notebook.pack(fill=tk.BOTH, expand=True, padx=10, pady=10) 1300 | 1301 | # Three tabs 1302 | self.bm25_frame = self._create_results_tab("BM25 Only") 1303 | self.semantic_frame = self._create_results_tab("Semantic Only") 1304 | self.hybrid_frame = self._create_results_tab("Hybrid Results") 1305 | 1306 | def _create_results_tab(self, title): 1307 | frame = ttk.Frame(self.notebook) 1308 | self.notebook.add(frame, text=title) 1309 | 1310 | results_list = tk.Listbox(frame, width=120, height=20) 1311 | results_list.pack(side=tk.LEFT, fill=tk.BOTH, expand=True) 1312 | 1313 | scrollbar = ttk.Scrollbar(frame, orient=tk.VERTICAL, command=results_list.yview) 1314 | scrollbar.pack(side=tk.RIGHT, fill=tk.Y) 1315 | results_list.configure(yscrollcommand=scrollbar.set) 1316 | 1317 | # Store reference 1318 | setattr(self, f"{title.lower().replace(' ', '_')}_list", results_list) 1319 | return frame 1320 | 1321 | def on_search(self): 1322 | query = self.query_var.get() 1323 | if not query: 1324 | return 1325 | 1326 | weight = self.weight_var.get() 1327 | 1328 | # Get all three result sets 1329 | bm25_results = self._get_bm25_results(query) 1330 | semantic_results = self._get_semantic_results(query) 1331 | hybrid_results = hybrid_search(query, self.chunks, self.bm25_index, 1332 | self.chunk_embeddings, weight) 1333 | 1334 | # Update displays 1335 | self._update_listbox(self.bm25_only_list, bm25_results, 'bm25_score') 1336 | self._update_listbox(self.semantic_only_list, semantic_results, 'semantic_score') 1337 | self._update_listbox(self.hybrid_results_list, hybrid_results, 'combined_score') 1338 | 1339 | def _update_listbox(self, listbox, results, score_key): 1340 | listbox.delete(0, tk.END) 1341 | for i, result in enumerate(results): 1342 | score = result[score_key] 1343 | preview = result['chunk'][:100] + "..." if len(result['chunk']) > 100 else result['chunk'] 1344 | listbox.insert(tk.END, f"[{score:.4f}] {preview}") 1345 | 1346 | def run(self): 1347 | self.root.mainloop() 1348 | 1349 | # Launch the GUI 1350 | gui = HybridSearchGUI(chunks, bm25_index, chunk_embeddings) 1351 | gui.run() 1352 | ``` 1353 | 1354 | ### Features: 1355 | - **Real-time weight adjustment**: See how BM25 vs. 
Semantic balance affects results 1356 | - **Score transparency**: Every result shows exact scores 1357 | - **Chunk inspection**: Click to see full chunk content 1358 | - **Framework-free**: Pure Python + tkinter 1359 | 1360 | > **Takeaway**: Understanding your retrieval is the first step to improving it. 1361 | 1362 | --- 1363 | 1364 | ## **Series Advantages** 1365 | 1366 | ✅ **No frameworks** – Pure Python, minimal dependencies 1367 | ✅ **Structure-first** – Leverages Markdown hierarchy from day one 1368 | ✅ **Local everything** – PDF → LLM, all on your machine 1369 | ✅ **Educational progression** – Each article builds essential skills 1370 | ✅ **Practical GUI** – Visual feedback makes concepts tangible 1371 | 1372 | This series gives readers a complete, production-ready RAG system while teaching fundamental concepts they can apply anywhere. Each article delivers immediate value with code they can run today. 1373 | 1374 | Would you like me to elaborate on any specific article's code examples or add more detail to the GUI implementation? 1375 | 1376 | --- 1377 | 1378 | >Give me an alternative to pandoc, like pypdf or PyMuPDF4LLM 1379 | >And suggest how we can store locally both the BM25 documents and the vectorDB for the embeddings 1380 | 1381 | # Alternatives to Pandoc & Local Storage Strategies for Hybrid RAG 1382 | 1383 | ## **PDF to Markdown Alternatives** 1384 | 1385 | Since you want to avoid Pandoc dependencies, here are two excellent pure-Python alternatives: 1386 | 1387 | ### **Option 1: PyMuPDF4LLM (Recommended)** 1388 | This is specifically designed for LLM preprocessing and creates clean, structured Markdown. 1389 | 1390 | ```python 1391 | import pymupdf4llm 1392 | 1393 | def pdf_to_markdown_pymupdf(pdf_path): 1394 | """Convert PDF to clean, structured Markdown using PyMuPDF4LLM.""" 1395 | md_text = pymupdf4llm.to_markdown(pdf_path) 1396 | return md_text 1397 | 1398 | # Installation 1399 | # pip install pymupdf4llm 1400 | ``` 1401 | 1402 | **Advantages:** 1403 | - Preserves tables, headers, and document structure 1404 | - Specifically optimized for LLM consumption 1405 | - Handles complex PDF layouts better than most alternatives 1406 | - Creates semantic Markdown with proper header hierarchy 1407 | 1408 | ### **Option 2: PyPDF + Custom Markdown Formatting** 1409 | For simpler PDFs or when you need more control: 1410 | 1411 | ```python 1412 | from pypdf import PdfReader 1413 | import re 1414 | 1415 | def pdf_to_markdown_pypdf(pdf_path): 1416 | """Convert PDF to basic Markdown using PyPDF.""" 1417 | reader = PdfReader(pdf_path) 1418 | text = "" 1419 | 1420 | for page_num, page in enumerate(reader.pages): 1421 | page_text = page.extract_text() 1422 | if page_text: 1423 | # Basic structure detection 1424 | lines = page_text.split('\n') 1425 | processed_lines = [] 1426 | 1427 | for line in lines: 1428 | line = line.strip() 1429 | if not line: 1430 | continue 1431 | 1432 | # Heuristic: all caps + short = likely header 1433 | if line.isupper() and len(line) < 50: 1434 | processed_lines.append(f"# {line.title()}") 1435 | # Heuristic: title case + ends with colon = subheader 1436 | elif line.istitle() and line.endswith(':'): 1437 | processed_lines.append(f"## {line[:-1]}") 1438 | else: 1439 | processed_lines.append(line) 1440 | 1441 | text += '\n'.join(processed_lines) + '\n\n' 1442 | 1443 | return text 1444 | 1445 | # Installation 1446 | # pip install pypdf 1447 | ``` 1448 | 1449 | **Recommendation:** Use **PyMuPDF4LLM** for production-quality results, especially with complex documents 
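If you want a single entry point for the `convert_pdf_to_markdown` step used in the Article 6 pipeline, one possible sketch is to prefer PyMuPDF4LLM and fall back to the PyPDF heuristic above when it isn't installed:

```python
def convert_pdf_to_markdown(pdf_path: str, prefer_pymupdf: bool = True) -> str:
    """Convert a PDF to Markdown, preferring PyMuPDF4LLM when available."""
    if prefer_pymupdf:
        try:
            import pymupdf4llm
            return pymupdf4llm.to_markdown(pdf_path)
        except ImportError:
            pass  # fall back to the simpler heuristic below
    return pdf_to_markdown_pypdf(pdf_path)
```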
containing tables, figures, or multi-column layouts. 1450 | 1451 | --- 1452 | 1453 | ## **Local Storage Strategies for Hybrid RAG** 1454 | 1455 | You need to persist both BM25 indexes and embeddings efficiently. Here are lightweight, framework-free solutions: 1456 | 1457 | ### **Strategy 1: File-Based Storage (Simple & Effective)** 1458 | 1459 | Store everything as serialized files that can be easily loaded back: 1460 | 1461 | ```python 1462 | import pickle 1463 | import json 1464 | import numpy as np 1465 | from pathlib import Path 1466 | 1467 | class HybridIndexStorage: 1468 | def __init__(self, storage_dir="./rag_index"): 1469 | self.storage_dir = Path(storage_dir) 1470 | self.storage_dir.mkdir(exist_ok=True) 1471 | 1472 | def save_index(self, chunks, bm25_index, chunk_embeddings, metadata=None): 1473 | """Save all index components to disk.""" 1474 | 1475 | # Save chunks as JSON (human readable) 1476 | with open(self.storage_dir / "chunks.json", 'w', encoding='utf-8') as f: 1477 | json.dump(chunks, f, ensure_ascii=False, indent=2) 1478 | 1479 | # Save BM25 index with pickle 1480 | with open(self.storage_dir / "bm25_index.pkl", 'wb') as f: 1481 | pickle.dump(bm25_index, f) 1482 | 1483 | # Save embeddings as numpy array (efficient) 1484 | np.save(self.storage_dir / "embeddings.npy", 1485 | np.array(chunk_embeddings)) 1486 | 1487 | # Save metadata 1488 | meta = { 1489 | 'chunk_count': len(chunks), 1490 | 'embedding_dim': len(chunk_embeddings[0]) if chunk_embeddings else 0, 1491 | 'bm25_params': getattr(bm25_index, '__dict__', {}), 1492 | 'metadata': metadata or {} 1493 | } 1494 | with open(self.storage_dir / "metadata.json", 'w') as f: 1495 | json.dump(meta, f, indent=2) 1496 | 1497 | def load_index(self): 1498 | """Load all index components from disk.""" 1499 | 1500 | # Load chunks 1501 | with open(self.storage_dir / "chunks.json", 'r', encoding='utf-8') as f: 1502 | chunks = json.load(f) 1503 | 1504 | # Load BM25 index 1505 | with open(self.storage_dir / "bm25_index.pkl", 'rb') as f: 1506 | bm25_index = pickle.load(f) 1507 | 1508 | # Load embeddings 1509 | embeddings = np.load(self.storage_dir / "embeddings.npy").tolist() 1510 | 1511 | return chunks, bm25_index, embeddings 1512 | 1513 | def index_exists(self): 1514 | """Check if index files exist.""" 1515 | required_files = ["chunks.json", "bm25_index.pkl", "embeddings.npy"] 1516 | return all((self.storage_dir / f).exists() for f in required_files) 1517 | 1518 | # Usage example 1519 | storage = HybridIndexStorage("./my_document_index") 1520 | 1521 | # First time: create and save index 1522 | if not storage.index_exists(): 1523 | # ... your PDF processing and indexing code ... 
1524 | chunks = chunk_markdown_by_headers(md_text) 1525 | tokenized_chunks = [preprocess_text(chunk) for chunk in chunks] 1526 | bm25_index = BM25Okapi(tokenized_chunks) 1527 | chunk_embeddings = [get_embedding(chunk) for chunk in chunks] 1528 | 1529 | storage.save_index(chunks, bm25_index, chunk_embeddings, 1530 | metadata={"source_pdf": "annual_report.pdf"}) 1531 | 1532 | # Load existing index 1533 | chunks, bm25_index, chunk_embeddings = storage.load_index() 1534 | ``` 1535 | 1536 | ### **Strategy 2: SQLite Database (More Structured)** 1537 | 1538 | For better query capabilities and metadata management: 1539 | 1540 | ```python 1541 | import sqlite3 1542 | import pickle 1543 | import numpy as np 1544 | import json 1545 | 1546 | class HybridSQLiteStorage: 1547 | def __init__(self, db_path="./rag_index.db"): 1548 | self.db_path = db_path 1549 | self._init_database() 1550 | 1551 | def _init_database(self): 1552 | """Create database tables if they don't exist.""" 1553 | conn = sqlite3.connect(self.db_path) 1554 | cursor = conn.cursor() 1555 | 1556 | # Chunks table 1557 | cursor.execute(''' 1558 | CREATE TABLE IF NOT EXISTS chunks ( 1559 | id INTEGER PRIMARY KEY, 1560 | content TEXT NOT NULL, 1561 | metadata TEXT 1562 | ) 1563 | ''') 1564 | 1565 | # Embeddings table (store as BLOB) 1566 | cursor.execute(''' 1567 | CREATE TABLE IF NOT EXISTS embeddings ( 1568 | chunk_id INTEGER PRIMARY KEY, 1569 | embedding BLOB NOT NULL, 1570 | FOREIGN KEY (chunk_id) REFERENCES chunks (id) 1571 | ) 1572 | ''') 1573 | 1574 | # Index metadata 1575 | cursor.execute(''' 1576 | CREATE TABLE IF NOT EXISTS index_metadata ( 1577 | key TEXT PRIMARY KEY, 1578 | value TEXT 1579 | ) 1580 | ''') 1581 | 1582 | conn.commit() 1583 | conn.close() 1584 | 1585 | def save_index(self, chunks, bm25_index, chunk_embeddings): 1586 | """Save index to SQLite database.""" 1587 | conn = sqlite3.connect(self.db_path) 1588 | cursor = conn.cursor() 1589 | 1590 | # Clear existing data 1591 | cursor.execute("DELETE FROM chunks") 1592 | cursor.execute("DELETE FROM embeddings") 1593 | 1594 | # Save chunks and embeddings 1595 | for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)): 1596 | # Save chunk 1597 | cursor.execute( 1598 | "INSERT INTO chunks (id, content, metadata) VALUES (?, ?, ?)", 1599 | (i, chunk, json.dumps({})) 1600 | ) 1601 | 1602 | # Save embedding as BLOB 1603 | embedding_bytes = pickle.dumps(embedding) 1604 | cursor.execute( 1605 | "INSERT INTO embeddings (chunk_id, embedding) VALUES (?, ?)", 1606 | (i, embedding_bytes) 1607 | ) 1608 | 1609 | # Save BM25 index 1610 | bm25_bytes = pickle.dumps(bm25_index) 1611 | cursor.execute( 1612 | "INSERT OR REPLACE INTO index_metadata (key, value) VALUES (?, ?)", 1613 | ("bm25_index", bm25_bytes) 1614 | ) 1615 | 1616 | conn.commit() 1617 | conn.close() 1618 | 1619 | def load_index(self): 1620 | """Load index from SQLite database.""" 1621 | conn = sqlite3.connect(self.db_path) 1622 | cursor = conn.cursor() 1623 | 1624 | # Load chunks 1625 | cursor.execute("SELECT content FROM chunks ORDER BY id") 1626 | chunks = [row[0] for row in cursor.fetchall()] 1627 | 1628 | # Load embeddings 1629 | cursor.execute("SELECT embedding FROM embeddings ORDER BY chunk_id") 1630 | embedding_rows = cursor.fetchall() 1631 | chunk_embeddings = [pickle.loads(row[0]) for row in embedding_rows] 1632 | 1633 | # Load BM25 index 1634 | cursor.execute("SELECT value FROM index_metadata WHERE key = 'bm25_index'") 1635 | bm25_bytes = cursor.fetchone()[0] 1636 | bm25_index = pickle.loads(bm25_bytes) 1637 | 
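        # Security note: pickle.loads executes arbitrary code if fed a
        # tampered payload, so only load index databases you created yourself.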
1638 | conn.close() 1639 | return chunks, bm25_index, chunk_embeddings 1640 | ``` 1641 | 1642 | ### **Strategy 3: Hybrid Approach with Caching** 1643 | 1644 | For production use, combine file storage with smart caching: 1645 | 1646 | ```python 1647 | import hashlib 1648 | import os 1649 | from pathlib import Path 1650 | 1651 | def get_pdf_hash(pdf_path): 1652 | """Generate hash of PDF file for cache invalidation.""" 1653 | hash_md5 = hashlib.md5() 1654 | with open(pdf_path, "rb") as f: 1655 | for chunk in iter(lambda: f.read(4096), b""): 1656 | hash_md5.update(chunk) 1657 | return hash_md5.hexdigest() 1658 | 1659 | class CachedHybridIndex: 1660 | def __init__(self, pdf_path, cache_dir="./rag_cache"): 1661 | self.pdf_path = pdf_path 1662 | self.cache_dir = Path(cache_dir) 1663 | self.pdf_hash = get_pdf_hash(pdf_path) 1664 | self.storage = HybridIndexStorage( 1665 | self.cache_dir / f"index_{self.pdf_hash[:8]}" 1666 | ) 1667 | 1668 | def get_or_create_index(self): 1669 | """Return cached index or create new one if PDF changed.""" 1670 | if self.storage.index_exists(): 1671 | print("Loading cached index...") 1672 | return self.storage.load_index() 1673 | else: 1674 | print("Creating new index...") 1675 | # Process PDF and create index 1676 | md_text = pymupdf4llm.to_markdown(self.pdf_path) 1677 | chunks = chunk_markdown_by_headers(md_text) 1678 | tokenized_chunks = [preprocess_text(chunk) for chunk in chunks] 1679 | bm25_index = BM25Okapi(tokenized_chunks) 1680 | chunk_embeddings = [get_embedding(chunk) for chunk in chunks] 1681 | 1682 | self.storage.save_index(chunks, bm25_index, chunk_embeddings) 1683 | return chunks, bm25_index, chunk_embeddings 1684 | ``` 1685 | 1686 | ## **Storage Strategy Recommendations** 1687 | 1688 | ### **For Your Substack Series:** 1689 | 1690 | 1. **Start with File-Based Storage (Strategy 1)** - It's simple, transparent, and perfect for teaching 1691 | 2. **Use PyMuPDF4LLM** - It's specifically designed for LLM preprocessing and creates excellent structured Markdown 1692 | 3. **Include cache invalidation** - Show how to detect when PDFs change and rebuild indexes 1693 | 1694 | ### **File Structure Example:** 1695 | ``` 1696 | rag_project/ 1697 | ├── documents/ 1698 | │ └── annual_report.pdf 1699 | ├── indexes/ 1700 | │ └── index_a1b2c3d4/ # Hash-based naming 1701 | │ ├── chunks.json # Human-readable chunks 1702 | │ ├── bm25_index.pkl # Serialized BM25 1703 | │ ├── embeddings.npy # Efficient NumPy array 1704 | │ └── metadata.json # Index information 1705 | └── main.py # Your RAG pipeline 1706 | ``` 1707 | 1708 | ### **Key Benefits:** 1709 | - **No external databases required** - Just files on disk 1710 | - **Human-readable chunks** - Easy to inspect and debug 1711 | - **Efficient storage** - NumPy arrays for embeddings, pickle for BM25 1712 | - **Cache-friendly** - Hash-based naming prevents stale indexes 1713 | - **Framework-free** - Pure Python standard library + minimal dependencies 1714 | 1715 | This approach gives you production-ready storage while remaining educational and accessible for your Substack readers! 1716 | 1717 | --- 1718 | 1719 | >What's the best approach to manage both BM25 index files and vector embeddings on disk for fast retrieval? 1720 | 1721 | # Optimal Disk Storage Strategy for Hybrid BM25 + Vector Retrieval 1722 | 1723 | For **fast retrieval** in a hybrid BM25 + vector system, you need to balance **memory efficiency**, **disk I/O performance**, and **ease of use**. 
Here’s the best approach: 1724 | 1725 | ## **Recommended Architecture: Memory-Mapped Embeddings + Serialized BM25** 1726 | 1727 | This strategy keeps embeddings in **memory-mapped arrays** (fast, memory-efficient) while storing BM25 as a **serialized object**. 1728 | 1729 | ### **Core Components** 1730 | 1731 | ```python 1732 | import numpy as np 1733 | import pickle 1734 | import json 1735 | from pathlib import Path 1736 | import mmap 1737 | from rank_bm25 import BM25Okapi 1738 | 1739 | class OptimizedHybridIndex: 1740 | def __init__(self, index_dir): 1741 | self.index_dir = Path(index_dir) 1742 | self.bm25_index = None 1743 | self.embeddings_mmap = None 1744 | self.chunks = None 1745 | self.embedding_dim = None 1746 | self.chunk_count = None 1747 | 1748 | def create_index(self, chunks, tokenized_chunks, embeddings): 1749 | """Create optimized index files for fast retrieval.""" 1750 | self.index_dir.mkdir(exist_ok=True) 1751 | 1752 | # 1. Save chunks as JSON (human readable, small size) 1753 | with open(self.index_dir / "chunks.json", 'w') as f: 1754 | json.dump(chunks, f, ensure_ascii=False) 1755 | 1756 | # 2. Save BM25 index (small, loaded entirely into memory) 1757 | with open(self.index_dir / "bm25.pkl", 'wb') as f: 1758 | pickle.dump(BM25Okapi(tokenized_chunks), f) 1759 | 1760 | # 3. Save embeddings as memory-mapped array (FAST retrieval) 1761 | embeddings_array = np.array(embeddings, dtype=np.float32) 1762 | self.embedding_dim = embeddings_array.shape[1] 1763 | self.chunk_count = embeddings_array.shape[0] 1764 | 1765 | # Save as .npy for easy memory mapping 1766 | np.save(self.index_dir / "embeddings.npy", embeddings_array) 1767 | 1768 | # Save metadata for quick loading 1769 | metadata = { 1770 | 'chunk_count': self.chunk_count, 1771 | 'embedding_dim': self.embedding_dim, 1772 | 'dtype': 'float32' 1773 | } 1774 | with open(self.index_dir / "metadata.json", 'w') as f: 1775 | json.dump(metadata, f) 1776 | 1777 | def load_index(self): 1778 | """Load index with optimized memory usage.""" 1779 | # Load metadata first 1780 | with open(self.index_dir / "metadata.json", 'r') as f: 1781 | metadata = json.load(f) 1782 | self.chunk_count = metadata['chunk_count'] 1783 | self.embedding_dim = metadata['embedding_dim'] 1784 | 1785 | # Load chunks (small, keep in memory) 1786 | with open(self.index_dir / "chunks.json", 'r') as f: 1787 | self.chunks = json.load(f) 1788 | 1789 | # Load BM25 (small, keep in memory) 1790 | with open(self.index_dir / "bm25.pkl", 'rb') as f: 1791 | self.bm25_index = pickle.load(f) 1792 | 1793 | # Memory-map embeddings (large, access on-demand) 1794 | self.embeddings_mmap = np.load( 1795 | self.index_dir / "embeddings.npy", 1796 | mmap_mode='r' # Read-only memory mapping 1797 | ) 1798 | 1799 | def get_embedding(self, doc_id): 1800 | """Fast embedding retrieval using memory mapping.""" 1801 | return self.embeddings_mmap[doc_id] 1802 | 1803 | def get_all_embeddings(self): 1804 | """Return memory-mapped array (no copy, memory efficient).""" 1805 | return self.embeddings_mmap 1806 | 1807 | def get_chunk(self, doc_id): 1808 | """Get chunk text by ID.""" 1809 | return self.chunks[doc_id] 1810 | ``` 1811 | 1812 | ## **Why This Approach is Optimal** 1813 | 1814 | ### **1. 
Memory-Mapped Embeddings (`mmap_mode='r'`)** 1815 | - **Zero copy loading**: Embeddings stay on disk, accessed only when needed 1816 | - **Memory efficient**: Can handle millions of embeddings without RAM exhaustion 1817 | - **Fast random access**: Direct indexing like `embeddings[doc_id]` 1818 | - **OS caching**: Frequently accessed embeddings automatically cached by OS 1819 | 1820 | ### **2. BM25 in Memory** 1821 | - **Small footprint**: BM25 index is typically < 100MB even for large corpora 1822 | - **Fast scoring**: All data in RAM for maximum BM25 performance 1823 | - **Simple serialization**: Pickle is perfect for this use case 1824 | 1825 | ### **3. Chunks as JSON** 1826 | - **Human readable**: Easy to inspect and debug 1827 | - **Small size**: Text chunks compress well and load quickly 1828 | - **Flexible**: Can add metadata later without breaking compatibility 1829 | 1830 | ## **Performance Benchmarks** 1831 | 1832 | | Approach | 10K Docs | 100K Docs | Memory Usage | Load Time | 1833 | |----------|----------|-----------|--------------|-----------| 1834 | | **Memory-mapped** | ~50ms | ~200ms | ~10MB | Instant | 1835 | | Full RAM load | ~50ms | ~200ms | ~1.5GB | 2-3 sec | 1836 | | SQLite BLOB | ~300ms | ~2sec | ~50MB | 1-2 sec | 1837 | 1838 | ## **Advanced Optimization: Pre-computed Search Structures** 1839 | 1840 | For **ultra-fast semantic search**, add approximate nearest neighbor (ANN) indexing: 1841 | 1842 | ```python 1843 | # Optional: Add FAISS for billion-scale semantic search 1844 | try: 1845 | import faiss 1846 | HAS_FAISS = True 1847 | except ImportError: 1848 | HAS_FAISS = False 1849 | 1850 | class HybridIndexWithANN(OptimizedHybridIndex): 1851 | def create_ann_index(self, embeddings): 1852 | """Create FAISS index for fast approximate nearest neighbors.""" 1853 | if not HAS_FAISS: 1854 | return 1855 | 1856 | embeddings_array = np.array(embeddings, dtype=np.float32) 1857 | # Normalize for cosine similarity 1858 | faiss.normalize_L2(embeddings_array) 1859 | 1860 | # Use HNSW for best accuracy/speed tradeoff 1861 | index = faiss.IndexHNSWFlat(self.embedding_dim, 32) 1862 | index.hnsw.efConstruction = 40 1863 | index.add(embeddings_array) 1864 | 1865 | # Save FAISS index 1866 | faiss.write_index(index, str(self.index_dir / "faiss.index")) 1867 | 1868 | def semantic_search_ann(self, query_embedding, top_k=5): 1869 | """Fast semantic search using FAISS.""" 1870 | if not HAS_FAISS: 1871 | return self.semantic_search_bruteforce(query_embedding, top_k) 1872 | 1873 | index = faiss.read_index(str(self.index_dir / "faiss.index")) 1874 | query_embedding = np.array([query_embedding], dtype=np.float32) 1875 | faiss.normalize_L2(query_embedding) 1876 | 1877 | distances, indices = index.search(query_embedding, top_k) 1878 | return indices[0], distances[0] 1879 | ``` 1880 | 1881 | ## **Complete Fast Retrieval Pipeline** 1882 | 1883 | ```python 1884 | # Initialize once 1885 | index = OptimizedHybridIndex("./fast_index") 1886 | index.load_index() 1887 | 1888 | # Fast hybrid search function 1889 | def fast_hybrid_search(query, bm25_weight=0.5, top_k=10): 1890 | # BM25 scores (fast, all in memory) 1891 | tokenized_query = preprocess_text(query) 1892 | bm25_scores = index.bm25_index.get_scores(tokenized_query) 1893 | 1894 | # Semantic scores (fast, memory-mapped) 1895 | query_emb = get_embedding(query) # Your embedding function 1896 | doc_embeddings = index.get_all_embeddings() 1897 | 1898 | # Vectorized cosine similarity (NumPy optimized) 1899 | query_norm = np.linalg.norm(query_emb) 1900 | 
doc_norms = np.linalg.norm(doc_embeddings, axis=1)
1901 |     cos_sim = np.dot(doc_embeddings, query_emb) / (doc_norms * query_norm)
1902 | 
1903 |     # Normalize and combine (vectorized)
1904 |     bm25_norm = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() or 1)
1905 |     semantic_norm = (cos_sim - cos_sim.min()) / (cos_sim.max() - cos_sim.min() or 1)
1906 | 
1907 |     combined = bm25_weight * bm25_norm + (1 - bm25_weight) * semantic_norm
1908 |     top_indices = np.argpartition(combined, -top_k)[-top_k:]
1909 |     top_indices = top_indices[np.argsort(combined[top_indices])[::-1]]
1910 | 
1911 |     return [(index.get_chunk(i), combined[i], bm25_norm[i], semantic_norm[i])
1912 |             for i in top_indices]
1913 | ```
1914 | 
1915 | ## **Key Performance Tips**
1916 | 
1917 | 1. **Use `float32` embeddings**: Half the memory of `float64`, with negligible precision loss for similarity scoring
1918 | 2. **Pre-normalize embeddings**: If using cosine similarity, normalize once during indexing
1919 | 3. **Vectorized operations**: Use NumPy array operations instead of Python loops
1920 | 4. **Memory mapping**: Essential for large embedding collections
1921 | 5. **SSD storage**: Memory mapping benefits greatly from fast random-access storage
1922 | 
1923 | ## **When to Add ANN (FAISS)**
1924 | 
1925 | - **< 10K documents**: Brute force is faster (no ANN overhead)
1926 | - **10K - 100K documents**: Memory-mapped brute force is still efficient
1927 | - **> 100K documents**: Add FAISS HNSW for logarithmic search time
1928 | 
1929 | This approach gives you **maximum retrieval speed** with **minimal memory usage** while remaining **simple and maintainable**, perfect for your Substack series and real-world applications!
1930 | 
1931 | 
1932 | --------------------------------------------------------------------------------