└── README.md /README.md: -------------------------------------------------------------------------------- 1 | # Hybrid-search-strategies 2 | combine BM25 and semantic search for RAG 3 | 4 | Ref to [conversation](https://chat.qwen.ai/s/9ceeb113-e2f1-4287-8e57-231716b12b4a?fev=0.0.245) 5 | 6 | > My studies on the topic 7 | 8 | # Building a Simple Keyword Search App with BM25 in Python 9 | 10 | ## Introduction to BM25 11 | 12 | BM25 (Best Match 25) is a **ranking function** used by search engines to estimate the relevance of documents to a given search query. It's one of the most widely used and effective algorithms for information retrieval, serving as the foundation for many search systems including Elasticsearch and Apache Lucene. 13 | 14 | Unlike simple keyword matching, BM25 considers: 15 | - **Term frequency**: How often a search term appears in a document 16 | - **Document length**: Shorter documents with the same term frequency are considered more relevant 17 | - **Inverse document frequency**: Rare terms that appear in fewer documents are given more weight 18 | 19 | BM25 strikes an excellent balance between effectiveness and computational efficiency, making it perfect for building simple yet powerful search applications. 20 | 21 | ## How BM25 Works 22 | 23 | The BM25 formula calculates a relevance score for each document given a query: 24 | 25 | ``` 26 | Score(D, Q) = Σ [IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D|/avgDL))] 27 | ``` 28 | 29 | Where: 30 | - **D** = Document 31 | - **Q** = Query with terms q₁, q₂, ..., qₙ 32 | - **f(qi, D)** = Frequency of term qi in document D 33 | - **|D|** = Length of document D (in words) 34 | - **avgDL** = Average document length in the corpus 35 | - **k1** = Controls term frequency saturation (typically 1.2-2.0) 36 | - **b** = Controls document length normalization (typically 0.75) 37 | - **IDF(qi)** = Inverse Document Frequency = `log((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)` 38 | - N = Total number of documents 39 | - n(qi) = Number of documents containing term qi 40 | 41 | ## Python Implementation Example 42 | 43 | Here's a complete, working example using the `rank_bm25` library: 44 | 45 | ### Step 1: Install Required Dependencies 46 | 47 | ```bash 48 | pip install rank-bm25 49 | ``` 50 | 51 | ### Step 2: Complete Python Application 52 | 53 | ```python 54 | # coding: utf-8 55 | import string 56 | import re 57 | from rank_bm25 import BM25Okapi 58 | from typing import List, Dict 59 | 60 | class SimpleBM25Search: 61 | def __init__(self, documents: List[str]): 62 | """ 63 | Initialize the BM25 search engine with a list of documents. 64 | 65 | Args: 66 | documents: List of strings to be indexed for search 67 | """ 68 | self.documents = documents 69 | self.tokenized_docs = [self._preprocess(doc) for doc in documents] 70 | self.bm25 = BM25Okapi(self.tokenized_docs) 71 | 72 | def _preprocess(self, text: str) -> List[str]: 73 | """ 74 | Simple text preprocessing: lowercase, remove punctuation, tokenize. 75 | """ 76 | # Convert to lowercase 77 | text = text.lower() 78 | # Remove punctuation and extra whitespace 79 | text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text) 80 | text = re.sub(r'\s+', ' ', text) 81 | # Tokenize (split into words) 82 | tokens = text.strip().split() 83 | return tokens 84 | 85 | def search(self, query: str, top_k: int = 5) -> List[Dict]: 86 | """ 87 | Search for documents matching the query. 
88 | 89 | Args: 90 | query: Search query string 91 | top_k: Number of top results to return 92 | 93 | Returns: 94 | List of dictionaries containing document index, score, and text 95 | """ 96 | # Preprocess the query 97 | tokenized_query = self._preprocess(query) 98 | 99 | # Get BM25 scores for all documents 100 | scores = self.bm25.get_scores(tokenized_query) 101 | 102 | # Get top-k document indices 103 | top_indices = scores.argsort()[-top_k:][::-1] 104 | 105 | # Create results with document info 106 | results = [] 107 | for idx in top_indices: 108 | if scores[idx] > 0: # Only include documents with positive scores 109 | results.append({ 110 | 'index': int(idx), 111 | 'score': float(scores[idx]), 112 | 'document': self.documents[idx] 113 | }) 114 | 115 | return results 116 | 117 | def get_document_count(self) -> int: 118 | """Return the total number of indexed documents.""" 119 | return len(self.documents) 120 | def evaluate_query(self, query: str, relevant_docs_indices: List[int]): 121 | """ 122 | Simple evaluation showing precision at k. 123 | 124 | Args: 125 | query: Search query 126 | relevant_docs_indices: List of document indices that are truly relevant 127 | """ 128 | results = self.search(query, top_k=len(self.documents)) 129 | retrieved_indices = [r['index'] for r in results] 130 | 131 | # Calculate Precision@1, Precision@3, Precision@5 132 | for k in [1, 3, 5]: 133 | if len(retrieved_indices) >= k: 134 | top_k_retrieved = set(retrieved_indices[:k]) 135 | relevant_set = set(relevant_docs_indices) 136 | precision = len(top_k_retrieved & relevant_set) / k 137 | print(f"Precision@{k}: {precision:.3f}") 138 | 139 | return results 140 | # Example usage 141 | if __name__ == "__main__": 142 | # Sample documents (you can replace these with your own data) 143 | sample_documents = [ 144 | "The quick brown fox jumps over the lazy dog", 145 | "A fast brown fox leaps over a sleeping dog", 146 | "Machine learning is a subset of artificial intelligence", 147 | "Python is a popular programming language for data science", 148 | "Artificial intelligence and machine learning are transforming industries", 149 | "Dogs are loyal companions and make great pets", 150 | "Programming in Python is both fun and productive", 151 | "Natural language processing helps computers understand human language", 152 | "Search engines use algorithms like BM25 to rank documents", 153 | "Information retrieval is the science of searching for information" 154 | ] 155 | 156 | # Initialize the search engine 157 | search_engine = SimpleBM25Search(sample_documents) 158 | 159 | print(f"Indexed {search_engine.get_document_count()} documents\n") 160 | 161 | # Example searches 162 | queries = [ 163 | "brown fox", 164 | "machine learning", 165 | "python programming", 166 | "artificial intelligence", 167 | "search algorithms" 168 | ] 169 | 170 | for query in queries: 171 | print(f"Query: '{query}'") 172 | print("-" * 40) 173 | results = search_engine.search(query, top_k=3) 174 | 175 | if results: 176 | for i, result in enumerate(results, 1): 177 | print(f"{i}. 
Score: {result['score']:.4f}") 178 | print(f" Document: {result['document']}") 179 | print() 180 | else: 181 | print("No relevant documents found.\n") 182 | 183 | print() 184 | 185 | def interpret_score(score: float) -> str: 186 | """Interpret BM25 score ranges (rough guidelines)""" 187 | if score == 0: 188 | return "No relevance" 189 | elif score < 1.0: 190 | return "Low relevance" 191 | elif score < 2.0: 192 | return "Moderate relevance" 193 | elif score < 3.0: 194 | return "High relevance" 195 | else: 196 | return "Very high relevance" 197 | 198 | # Example usage 199 | scores = [2.3844, 2.5141, 3.5965] 200 | for score in scores: 201 | print(f"Score {score:.4f}: {interpret_score(score)}") 202 | 203 | ``` 204 | 205 | ### Step 3: Advanced Features (Optional) 206 | 207 | You can extend this basic implementation with additional features: 208 | 209 | ```python 210 | # Adding support for loading documents from files 211 | def load_documents_from_file(filename: str) -> List[str]: 212 | """Load documents from a text file (one document per line).""" 213 | with open(filename, 'r', encoding='utf-8') as f: 214 | return [line.strip() for line in f if line.strip()] 215 | 216 | # Adding support for different preprocessing (stemming, stop words) 217 | import nltk 218 | from nltk.corpus import stopwords 219 | from nltk.stem import PorterStemmer 220 | 221 | def _advanced_preprocess(self, text: str) -> List[str]: 222 | """More advanced preprocessing with stop word removal and stemming.""" 223 | # Download required NLTK data (run once) 224 | # nltk.download('stopwords') 225 | 226 | stop_words = set(stopwords.words('english')) 227 | stemmer = PorterStemmer() 228 | 229 | text = text.lower() 230 | text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text) 231 | tokens = text.strip().split() 232 | 233 | # Remove stop words and apply stemming 234 | tokens = [stemmer.stem(token) for token in tokens 235 | if token not in stop_words and len(token) > 2] 236 | 237 | return tokens 238 | ``` 239 | 240 | ## Key Benefits of This Approach 241 | 242 | 1. **Simple to implement**: Just a few lines of code with the `rank_bm25` library 243 | 2. **Efficient**: BM25 is computationally efficient for moderate-sized document collections 244 | 3. **Effective**: Provides much better results than simple keyword matching 245 | 4. **Customizable**: Easy to modify preprocessing and ranking parameters 246 | 247 | ## When to Use This 248 | 249 | - Small to medium document collections (up to ~100K documents) 250 | - Applications where you need more control than off-the-shelf search engines 251 | - Prototyping and learning information retrieval concepts 252 | - Applications where you can't use external services 253 | 254 | For larger-scale applications, consider using dedicated search engines like Elasticsearch, which also implement BM25 under the hood but with additional optimizations and features. 255 | 256 | This example gives you a solid foundation to build upon—happy coding! 257 | 258 | --- 259 | 260 | 261 | Combining BM25 (keyword-based) with embedding-based semantic similarity is a powerful **hybrid search** approach that leverages the strengths of both methods: 262 | 263 | - **BM25** excels at matching exact keywords, handling rare terms, and supporting Boolean-like queries 264 | - **Embedding similarity** captures semantic meaning, handles synonyms, and finds conceptually related content 265 | 266 | This hybrid strategy is widely used in production RAG systems and is considered one of the most effective retrieval approaches . 
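The full implementation follows, but the core fusion idea fits in a few lines: score every document with both methods, rescale each score list to a common range, and take a weighted sum. A minimal sketch with made-up toy scores (the weight and numbers here are illustrative, not taken from the implementation below):

```python
# Toy sketch of score fusion: doc 0 wins on keywords, doc 1 wins on meaning.
def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [1.0] * len(scores) if hi == lo else [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25_scores, semantic_scores, bm25_weight=0.5):
    b = min_max(bm25_scores)
    s = min_max(semantic_scores)
    return [bm25_weight * bi + (1 - bm25_weight) * si for bi, si in zip(b, s)]

print(fuse([2.1, 0.0, 0.4], [0.32, 0.78, 0.10]))  # -> [0.66, 0.5, 0.1] (approx.)
```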
267 | 268 | --- 269 | 270 | ## Architecture Overview 271 | 272 | ``` 273 | Query → [BM25 Retrieval] → BM25 Scores 274 | ↘ [Embedding Model] → Semantic Similarity Scores 275 | ↓ 276 | [Score Fusion] → Final Ranked Results 277 | ``` 278 | 279 | --- 280 | 281 | ## Implementation with llama.cpp Server 282 | 283 | Here’s a complete example that integrates BM25 with embeddings from a llama.cpp server: 284 | 285 | ### Step 1: Install Dependencies 286 | 287 | ```bash 288 | pip install rank-bm25 requests numpy 289 | ``` 290 | 291 | ### Step 2: Hybrid Search Implementation 292 | 293 | ```python 294 | import requests 295 | import numpy as np 296 | from rank_bm25 import BM25Okapi 297 | import re 298 | import string 299 | from typing import List, Dict, Tuple 300 | 301 | class HybridSearchEngine: 302 | def __init__(self, documents: List[str], llama_cpp_url: str = "http://localhost:8080"): 303 | """ 304 | Initialize hybrid search engine. 305 | 306 | Args: 307 | documents: List of documents to index 308 | llama_cpp_url: URL of your llama.cpp server with embedding endpoint 309 | """ 310 | self.documents = documents 311 | self.llama_cpp_url = llama_cpp_url 312 | 313 | # Initialize BM25 314 | self.tokenized_docs = [self._preprocess(doc) for doc in documents] 315 | self.bm25 = BM25Okapi(self.tokenized_docs) 316 | 317 | # Pre-compute document embeddings 318 | self.doc_embeddings = self._compute_document_embeddings() 319 | 320 | def _preprocess(self, text: str) -> List[str]: 321 | """Simple text preprocessing for BM25.""" 322 | text = text.lower() 323 | text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text) 324 | text = re.sub(r'\s+', ' ', text) 325 | return text.strip().split() 326 | 327 | def _get_embedding(self, text: str) -> np.ndarray: 328 | """ 329 | Get embedding from llama.cpp server. 
330 | """ 331 | try: 332 | response = requests.post( 333 | f"{self.llama_cpp_url}/embedding", 334 | json={"content": text}, 335 | timeout=10 336 | ) 337 | response.raise_for_status() 338 | result = response.json() 339 | 340 | if isinstance(result, list) and len(result) > 0: 341 | # Extract embedding and ensure it's 1D 342 | embedding = np.array(result[0]["embedding"]).flatten() 343 | else: 344 | print(f"Unexpected embedding response format: {result}") 345 | embedding = np.zeros(384) # Use correct dimension for your model 346 | 347 | return embedding # Shape will be (384,) instead of (1, 384) 348 | 349 | except Exception as e: 350 | print(f"Error getting embedding: {e}") 351 | return np.zeros(384) # Match your actual embedding dimension 352 | 353 | 354 | def _compute_document_embeddings(self) -> List[np.ndarray]: 355 | """Pre-compute embeddings for all documents.""" 356 | print("Computing document embeddings...") 357 | embeddings = [] 358 | for i, doc in enumerate(self.documents): 359 | if i % 10 == 0: 360 | print(f"Processing document {i}/{len(self.documents)}") 361 | emb = self._get_embedding(doc) 362 | embeddings.append(emb) 363 | return embeddings 364 | 365 | def _cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float: 366 | """Compute cosine similarity between two vectors.""" 367 | dot_product = np.dot(vec1, vec2) 368 | norm1 = np.linalg.norm(vec1) 369 | norm2 = np.linalg.norm(vec2) 370 | if norm1 == 0 or norm2 == 0: 371 | return 0.0 372 | return dot_product / (norm1 * norm2) 373 | 374 | def _normalize_scores(self, scores: List[float]) -> List[float]: 375 | """Min-max normalize scores to [0, 1] range.""" 376 | if not scores: 377 | return scores 378 | min_score = min(scores) 379 | max_score = max(scores) 380 | if max_score == min_score: 381 | return [1.0] * len(scores) # All scores are equal 382 | return [(s - min_score) / (max_score - min_score) for s in scores] 383 | 384 | def search(self, query: str, top_k: int = 5, bm25_weight: float = 0.5) -> List[Dict]: 385 | """ 386 | Perform hybrid search combining BM25 and semantic similarity. 
387 | 388 | Args: 389 | query: Search query 390 | top_k: Number of results to return 391 | bm25_weight: Weight for BM25 score (0.0 to 1.0) 392 | semantic_weight = 1.0 - bm25_weight 393 | 394 | Returns: 395 | List of results with combined scores 396 | """ 397 | semantic_weight = 1.0 - bm25_weight 398 | 399 | # Get BM25 scores 400 | tokenized_query = self._preprocess(query) 401 | bm25_scores = self.bm25.get_scores(tokenized_query) 402 | 403 | # Get semantic similarity scores 404 | query_embedding = self._get_embedding(query) 405 | semantic_scores = [] 406 | for doc_emb in self.doc_embeddings: 407 | similarity = self._cosine_similarity(query_embedding, doc_emb) 408 | semantic_scores.append(similarity) 409 | 410 | # Normalize both score types to comparable ranges 411 | normalized_bm25 = self._normalize_scores(bm25_scores.tolist()) 412 | normalized_semantic = self._normalize_scores(semantic_scores) 413 | 414 | # Combine scores 415 | combined_scores = [] 416 | for i in range(len(self.documents)): 417 | combined = (bm25_weight * normalized_bm25[i] + 418 | semantic_weight * normalized_semantic[i]) 419 | combined_scores.append(combined) 420 | 421 | # Get top-k results 422 | top_indices = np.argsort(combined_scores)[-top_k:][::-1] 423 | 424 | results = [] 425 | for idx in top_indices: 426 | if combined_scores[idx] > 0: 427 | results.append({ 428 | 'index': int(idx), 429 | 'combined_score': float(combined_scores[idx]), 430 | 'bm25_score': float(normalized_bm25[idx]), 431 | 'semantic_score': float(normalized_semantic[idx]), 432 | 'document': self.documents[idx] 433 | }) 434 | 435 | return results 436 | 437 | # Example usage 438 | if __name__ == "__main__": 439 | # Your documents 440 | documents = [ 441 | "The quick brown fox jumps over the lazy dog", 442 | "A fast brown fox leaps over a sleeping dog", 443 | "Machine learning is a subset of artificial intelligence", 444 | "Python is a popular programming language for data science", 445 | "Artificial intelligence and machine learning are transforming industries", 446 | "Dogs are loyal companions and make great pets", 447 | "Programming in Python is both fun and productive", 448 | "Natural language processing helps computers understand human language", 449 | "Search engines use algorithms like BM25 to rank documents", 450 | "Information retrieval is the science of searching for information" 451 | ] 452 | 453 | # Initialize hybrid search (adjust URL to your llama.cpp server) 454 | hybrid_search = HybridSearchEngine( 455 | documents=documents, 456 | llama_cpp_url="http://localhost:8080" # Your llama.cpp server URL 457 | ) 458 | 459 | # Test queries 460 | queries = [ 461 | "brown fox", 462 | "machine learning", 463 | "python programming", 464 | "AI and ML" 465 | ] 466 | 467 | for query in queries: 468 | print(f"\nQuery: '{query}'") 469 | print("-" * 50) 470 | 471 | # Test different weight combinations 472 | for weight in [0.3, 0.5, 0.7]: # BM25 weights 473 | print(f"\nBM25 weight: {weight}, Semantic weight: {1-weight}") 474 | results = hybrid_search.search(query, top_k=3, bm25_weight=weight) 475 | 476 | for i, result in enumerate(results, 1): 477 | print(f"{i}. Combined: {result['combined_score']:.4f} " 478 | f"(BM25: {result['bm25_score']:.4f}, " 479 | f"Semantic: {result['semantic_score']:.4f})") 480 | print(f" Doc: {result['document']}") 481 | 482 | ``` 483 | 484 | --- 485 | 486 | ## Setting up llama.cpp Server with Embeddings 487 | 488 | Make sure your llama.cpp server supports the `/embedding` endpoint. 
You can start it like this: 489 | 490 | ```bash 491 | # Start llama.cpp server with embedding support 492 | .\llama-server.exe -m C:\FABIO-AI\MODELS_embeddings\bge-small-en-v1.5_fp16.gguf --port 8080 --embedding 493 | 494 | # Test the embedding endpoint 495 | curl -s -X POST "http://localhost:8080/embedding" --data "{\"content\":\"AI is artificial intelligence\"}" 496 | 497 | ``` 498 | 499 | --- 500 | 501 | ## Advanced Fusion Strategies 502 | 503 | ### 1. **Reciprocal Rank Fusion (RRF)** 504 | Instead of score averaging, combine rankings: 505 | 506 | ```python 507 | def reciprocal_rank_fusion(bm25_results: List[int], semantic_results: List[int], k=60): 508 | """Combine rankings using Reciprocal Rank Fusion.""" 509 | fused_scores = {} 510 | 511 | # BM25 results 512 | for rank, doc_id in enumerate(bm25_results): 513 | fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (k + rank + 1) 514 | 515 | # Semantic results 516 | for rank, doc_id in enumerate(semantic_results): 517 | fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (k + rank + 1) 518 | 519 | # Sort by fused score 520 | return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True) 521 | ``` 522 | 523 | ### 2. **Learned Weight Combination** 524 | Use validation data to find optimal weights: 525 | 526 | ```python 527 | # Grid search for best BM25 weight 528 | best_weight = 0.5 529 | best_score = 0 530 | 531 | for weight in np.arange(0.1, 1.0, 0.1): 532 | results = hybrid_search.search(query, bm25_weight=weight) 533 | # Evaluate using your ground truth metric 534 | score = evaluate_results(results, ground_truth) 535 | if score > best_score: 536 | best_score = score 537 | best_weight = weight 538 | ``` 539 | 540 | --- 541 | 542 | ## Benefits of This Hybrid Approach 543 | 544 | This combination is particularly effective because: 545 | 546 | - **BM25 handles exact matches and rare terms** that embeddings might miss 547 | - **Embeddings capture semantic relationships** and handle paraphrasing 548 | - **Robustness**: If one method fails, the other can still provide relevant results 549 | - **State-of-the-art**: This approach is used by leading RAG systems [[1], [3]] 550 | 551 | The hybrid search approach you're implementing is exactly what modern systems like those described in Anthropic's research employ, where "the retrieval result of contextual embedding search and contextual BM25 search are merged" . 552 | 553 | Search Source · 10 554 | 555 | 1. 556 | https://milvus.io/docs/llamaindex_milvus_hybrid_search.md 557 | · 558 | (2025-04-17) 559 | RAG using Hybrid Search with Milvus and LlamaIndex 560 | We'll begin with the recommended default hybrid search (semantic + BM25) and then explore other alternative sparse embedding methods and 561 | 2. 562 | https://blog.lancedb.com/hybrid-search-combining-bm25-and-semantic-search-for-better-results-with-lan-1358038fe7e6/ 563 | · 564 | (2023-12-09) 565 | Hybrid Search: Combining BM25 and Semantic 566 | BM25 is a ranking algorithm used in information retrieval systems to estimate the relevance of documents to a given search query. 567 | 3. 568 | https://medium.com/@odhitom09/the-most-effective-rag-approach-to-date-anthropics-contextual-retrieval-and-hybrid-search-8dc2af5cb970 569 | The most effective RAG approach to date? Anthropic's 570 | Anthropic also employs a hybrid approach where the retrieval result of contextual embedding search and contextual BM25 search are merged using 571 | 4. 
https://www.atyun.com/58079.html
·
(2023-12-14)
Combining BM25 and semantic search with LangChain for better results
Hybrid search: combining BM25 and semantic search with LangChain for better results · What it does: it checks how often your search terms appear in the documents and takes document length into account to return the most relevant results.
5.
https://medium.com/@adnanmasood/optimizing-chunking-embedding-and-vectorization-for-retrieval-augmented-generation-ea3b083b68f7
Optimizing Chunking, Embedding, and Vectorization for
If using sparse vectors (like BM25), similarity is often defined differently (BM25 scoring formula or just dot product of TF-IDF features).
6.

GitHub
Building an Index that supports a hybrid search comprising
I am trying to build an index that should support a hybrid search mechanism consisting of both BM25 and vector searches.
7.
https://developers.llamaindex.ai/python/examples/retrievers/bm25_retriever/
BM25 Retriever | LlamaIndex Python Documentation
In this guide, we define a bm25 retriever that search documents using the bm25 method. BM25 (Best Matching 25) is a ranking function that extends TF-IDF by
8.
https://js.langchain.com/docs/integrations/retrievers/supabase-hybrid/
Supabase Hybrid Search
The hybrid search combines the postgres pgvector extension (similarity search) and Full-Text Search (keyword search) to retrieve documents.
9.
https://modelscope.cn/models/gpustack/bge-m3-GGUF
bge-m3-GGUF
A classic example: using both embedding retrieval and the BM25 algorithm. Now, you can try to use BGE-M3, which supports both embedding and sparse retrieval.
10.
https://focushubjobs.org/?_=/topics/hybrid-search#WZUejyqGZTrfm/ZWkBAfbV6M
hybrid-search · GitHub Topics - Jobs
The AI-native database built for LLM applications, providing incredibly fast hybrid search of dense vector, sparse vector, tensor (multi-vector), and fu

---

Excellent! Now that your hybrid search is working, let me explain how to interpret these scores.

## Understanding the Score Components

Your hybrid system produces **three types of scores**, all normalized to the range **[0.0, 1.0]**:

### 1. **BM25 Score (Keyword Matching)**
- **1.0** = Perfect keyword match (contains all query terms with good frequency)
- **0.0** = No query terms found in the document
- Based purely on **exact word matching**, term frequency, and document length

### 2. **Semantic Score (Embedding Similarity)**
- **1.0** = Maximum semantic similarity (documents are conceptually identical)
- **0.0** = Minimum semantic similarity (completely unrelated concepts)
- Based on **meaning and context**, captures synonyms, related concepts, and paraphrasing

### 3. **Combined Score (Final Ranking)**
- Weighted average: `Combined = (BM25_weight × BM25_score) + (Semantic_weight × Semantic_score)`
- **1.0** = Best possible result according to your weighting
- **0.0** = Worst possible result

## Analyzing Your Results

### 🔍 **Query: "brown fox"**
```bash
1. Combined: 1.0000 (BM25: 1.0000, Semantic: 1.0000)
   Doc: A fast brown fox leaps over a sleeping dog
```
- **Perfect match**: Contains exact keywords "brown" and "fox" → BM25 = 1.0
- **Perfect semantic match**: Same topic/concept → Semantic = 1.0
- This is the ideal case where both methods agree completely

### 🔍 **Query: "AI and ML"** (Most Interesting Case!)
640 | ```bash 641 | 1. Combined: 0.8528 (BM25: 1.0000, Semantic: 0.7897) 642 | Doc: Artificial intelligence and machine learning are transforming industries 643 | 2. Combined: 0.7000 (BM25: 0.0000, Semantic: 1.0000) 644 | Doc: Machine learning is a subset of artificial intelligence 645 | ``` 646 | 647 | This reveals the **power of hybrid search**: 648 | 649 | - **Result #1**: Has exact keyword matches ("Artificial intelligence", "machine learning") → **BM25 = 1.0**, but slightly lower semantic similarity 650 | - **Result #2**: Contains the **concepts** but not the exact query terms "AI and ML" → **BM25 = 0.0**, but **perfect semantic match** (1.0) 651 | 652 | **Without hybrid search:** 653 | - Pure BM25 would **completely miss** Result #2 (score = 0.0) 654 | - Pure semantic search might **miss Result #1** if the weighting favored different aspects 655 | 656 | ## How to Interpret Different Weighting Strategies 657 | 658 | ### **BM25 Weight = 0.3 (Semantic-heavy)** 659 | - Prioritizes **conceptual relevance** over exact keywords 660 | - Better for **conversational queries**, paraphrasing, or when users don't know exact terminology 661 | - Example: "AI and ML" query finds the semantically perfect document even without exact terms 662 | 663 | ### **BM25 Weight = 0.7 (Keyword-heavy)** 664 | - Prioritizes **exact term matching** 665 | - Better for **technical queries**, when users know specific terms, or need precise matches 666 | - Example: Ensures documents with exact phrases appear higher 667 | 668 | ### **BM25 Weight = 0.5 (Balanced)** 669 | - Good default for general-purpose search 670 | - Compromise between precision (keywords) and recall (semantics) 671 | 672 | ## Key Insights from Your Results 673 | 674 | 1. **When both scores are high (≈1.0)**: You have the best possible match 675 | 2. **When BM25 = 0 but Semantic > 0**: Pure semantic retrieval finding conceptually relevant but keyword-mismatched results 676 | 3. **When Semantic = 0 but BM25 > 0**: Pure keyword match with no semantic relationship (rare, usually indicates noise) 677 | 4. **The "AI and ML" query demonstrates hybrid search's main advantage**: It finds **both** exact matches **and** conceptually relevant documents that pure methods would miss 678 | 679 | ## Practical Guidelines for Evaluation 680 | 681 | - **Look at the top 3-5 results**: Are they all relevant to your intent? 682 | - **Check the balance**: If you're getting irrelevant exact matches, reduce BM25 weight. If you're missing obvious keyword matches, increase BM25 weight. 683 | - **Consider your use case**: 684 | - **Customer support**: Might prefer semantic-heavy (users describe problems in their own words) 685 | - **Legal/technical search**: Might prefer keyword-heavy (need precise terminology matches) 686 | - **General knowledge**: Balanced approach usually works best 687 | 688 | Your hybrid system is working beautifully—it's successfully combining the precision of keyword search with the flexibility of semantic search! 689 | 690 | 691 | 692 | --- 693 | 694 | # THE SERIES 695 | 696 | # RAG Data Preparation Series: From Raw Documents to Hybrid Retrieval with Local LLMs 697 | 698 | Here’s a strategic, progressive series outline for your Substack that builds from fundamentals to an advanced local RAG system—complete with practical code examples, clear learning objectives, and real-world relevance. 
699 | 700 | --- 701 | 702 | ## **Series Title Suggestion** 703 | **"Build Your Own RAG: Data Prep, Hybrid Search & Local LLMs"** 704 | 705 | --- 706 | 707 | ## **Article 1: Why Data Prep Matters in RAG (The Foundation)** 708 | *Hook: "Your RAG is only as good as your data pipeline—here’s why."* 709 | 710 | ### Key Points: 711 | - Common RAG failure modes caused by poor data prep 712 | - The retrieval-augmentation gap: when good docs ≠ good answers 713 | - Overview of the full pipeline: **PDF → Markdown → Chunks → Vectors + Keywords → Hybrid Search → Local LLM** 714 | 715 | ### Practical Demo: 716 | - Show a "before/after" of naive vs. prepared RAG on a real PDF 717 | - Code: Load a PDF, extract raw text, show limitations (messy headers, broken tables) 718 | 719 | ### Tools Introduced: 720 | - `pypdf`, `pdfplumber` for PDF extraction 721 | - Why Markdown is the ideal intermediate format 722 | 723 | > **Takeaway**: Data prep isn’t optional—it’s the core of RAG reliability. 724 | 725 | --- 726 | 727 | ## **Article 2: Chunking Strategies That Work (From Markdown)** 728 | *Hook: "Chunking isn’t just splitting text—it’s preserving meaning."* 729 | 730 | ### Key Points: 731 | - Why naive chunking (fixed char/word splits) fails 732 | - **Semantic chunking**: Preserve context boundaries (headers, paragraphs) 733 | - **Overlap strategies**: Prevent context bleeding at boundaries 734 | 735 | ### Practical Demo: 736 | - Convert PDF → clean Markdown (using `unstructured` or `pandoc`) 737 | - Implement 3 chunking methods: 738 | 1. **Fixed-size** (naive baseline) 739 | 2. **Recursive splitting** (LangChain-style) 740 | 3. **Markdown-aware** (split by headers + paragraphs) 741 | 742 | ```python 743 | # Example: Markdown-aware chunker 744 | def chunk_markdown(md_text, max_chunk_size=500): 745 | # Split by headers first 746 | sections = re.split(r'\n#+ ', md_text) 747 | chunks = [] 748 | for section in sections: 749 | if len(section) <= max_chunk_size: 750 | chunks.append(section) 751 | else: 752 | # Recursive split paragraphs 753 | paragraphs = section.split('\n\n') 754 | # ... combine paragraphs smartly 755 | return chunks 756 | ``` 757 | 758 | ### Evaluation: 759 | - Show retrieval quality differences using a test query 760 | - Measure: "Does the chunk contain enough context to answer Q?" 761 | 762 | > **Takeaway**: Your chunking strategy directly impacts answer quality. 763 | 764 | --- 765 | 766 | ## **Article 3: Keyword Power – BM25 for Reliable Retrieval** 767 | *Hook: "Don’t abandon keywords—supercharge them with BM25."* 768 | 769 | ### Key Points: 770 | - Why pure vector search fails on rare terms, acronyms, or exact phrases 771 | - How BM25 works (simple intuition + formula) 772 | - When BM25 beats embeddings (and vice versa) 773 | 774 | ### Practical Demo: 775 | - Build BM25 index from Markdown chunks 776 | - Show retrieval examples where BM25 wins: 777 | - Query: `"API key format"` → finds exact technical specs 778 | - Query: `"2023 revenue"` → finds precise numbers 779 | 780 | ```python 781 | # Code from your working example (simplified) 782 | from rank_bm25 import BM25Okapi 783 | 784 | tokenized_chunks = [preprocess(chunk) for chunk in chunks] 785 | bm25 = BM25Okapi(tokenized_chunks) 786 | 787 | def bm25_search(query, top_k=5): 788 | scores = bm25.get_scores(preprocess(query)) 789 | top_idxs = scores.argsort()[-top_k:][::-1] 790 | return [(chunks[i], scores[i]) for i in top_idxs] 791 | ``` 792 | 793 | > **Takeaway**: BM25 is your safety net for precise, keyword-driven queries. 
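A quick usage sketch of the `bm25_search` helper above (illustrative; `chunks` and `preprocess` come from the Article 2 chunking step, and the query string is just an example):

```python
# Print the top matches with their raw BM25 scores
for chunk, score in bm25_search("API key format", top_k=3):
    print(f"{score:.3f}  {chunk[:80]}...")
```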
794 | 795 | --- 796 | 797 | ## **Article 4: Semantic Search with Local Embeddings** 798 | *Hook: "Run semantic search entirely offline—with llama.cpp."* 799 | 800 | ### Key Points: 801 | - Why local embeddings matter (privacy, cost, control) 802 | - Setting up `llama.cpp` for embeddings (model choice, flags) 803 | - Cosine similarity vs. other distance metrics 804 | 805 | ### Practical Demo: 806 | - Start `llama.cpp` server with embedding model (`nomic-embed-text`) 807 | - Generate embeddings for all chunks 808 | - Build semantic search function 809 | 810 | ```python 811 | # Request embedding from local server 812 | def get_embedding(text): 813 | res = requests.post("http://localhost:8080/embedding", 814 | json={"content": text}) 815 | return np.array(res.json()[0]["embedding"]) 816 | 817 | # Precompute all chunk embeddings 818 | chunk_embeddings = [get_embedding(chunk) for chunk in chunks] 819 | ``` 820 | 821 | ### Comparison: 822 | - Show queries where semantic search wins: 823 | - Query: `"How do I authenticate?"` → finds sections about "API keys", "OAuth", etc. 824 | 825 | > **Takeaway**: Local embeddings = semantic power without the cloud dependency. 826 | 827 | --- 828 | 829 | ## **Article 5: Hybrid Search – The Best of Both Worlds** 830 | *Hook: "Why choose between keywords and semantics? Combine them."* 831 | 832 | ### Key Points: 833 | - The hybrid search advantage: coverage + precision 834 | - Score fusion strategies (weighted average, RRF) 835 | - Tuning weights for your domain 836 | 837 | ### Practical Demo: 838 | - Implement your working hybrid search code 839 | - Show dramatic improvements on mixed queries: 840 | - Query: `"AI revenue 2023"` → BM25 finds "2023", semantic finds "AI revenue" 841 | 842 | ```python 843 | # Hybrid scoring (normalized + weighted) 844 | combined_score = w_bm25 * norm_bm25_score + w_semantic * norm_semantic_score 845 | ``` 846 | 847 | ### Interactive Element: 848 | - Provide a Colab notebook where readers can adjust weights and see results 849 | 850 | > **Takeaway**: Hybrid search consistently outperforms single-method retrieval. 851 | 852 | --- 853 | 854 | ## **Article 6: Building Your Local RAG Pipeline** 855 | *Hook: "From PDF to answer—100% local, zero cloud."* 856 | 857 | ### Key Points: 858 | - Full pipeline integration: PDF → Markdown → Chunks → Hybrid Index → Llama.cpp LLM 859 | - When to reindex vs. incremental updates 860 | - Performance considerations (caching embeddings, etc.) 861 | 862 | ### Practical Demo: 863 | - End-to-end code that: 864 | 1. Takes a PDF path 865 | 2. Converts to clean Markdown 866 | 3. Chunks intelligently 867 | 4. Builds hybrid index (BM25 + embeddings) 868 | 5. Answers questions using local LLM 869 | 870 | ```python 871 | # Final RAG function 872 | def rag_query(query, top_k=3): 873 | # Hybrid retrieval 874 | results = hybrid_search.search(query, top_k=top_k) 875 | 876 | # Build context 877 | context = "\n\n".join([r['document'] for r in results]) 878 | 879 | # Local LLM generation 880 | prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:" 881 | response = requests.post("http://localhost:8080/completion", 882 | json={"prompt": prompt, "n_predict": 200}) 883 | return response.json()["content"] 884 | ``` 885 | 886 | ### Testing: 887 | - Compare answers with/without hybrid retrieval 888 | - Show how good data prep reduces hallucinations 889 | 890 | > **Takeaway**: A well-prepared local RAG can rival cloud solutions—for free. 
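For completeness, a hypothetical call to the pipeline above (it assumes the hybrid index has already been built for your document and a llama.cpp completion server is listening on localhost:8080):

```python
# Ask a question against the indexed document and print the grounded answer
answer = rag_query("What was the revenue growth in 2023?", top_k=3)
print(answer)
```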

---

## **Series Bonuses**

### **Code Repository**
- GitHub repo with all code examples, ready to run
- Dockerfile for one-click setup (llama.cpp + Python deps)

### **Cheat Sheets**
- Chunking strategy decision tree
- BM25 vs. semantic search: when to use which
- llama.cpp flags for embedding vs. generation

### **Community Challenge**
- Week 6: Readers submit their own PDF → RAG pipeline
- Feature the best implementation in a follow-up post

---

## **Why This Series Works**

1. **Progressive complexity**: Each article builds on the last
2. **Practical focus**: Every concept paired with runnable code
3. **Local-first**: Appeals to privacy-conscious, cost-aware developers
4. **Complete story**: From raw data to final answers
5. **Timely**: Hybrid search is the current SOTA for RAG

This structure positions you as a trusted guide who doesn’t just explain concepts—but shows exactly how to implement them in real systems. Each article delivers immediate value while building toward a powerful final project.

Would you like me to expand any specific article with more detailed code examples or learning objectives?

---

# ADDENDUM
Here are some adjustments:
- I don't want to use a specific framework (LangChain or LlamaIndex)
- For Article 1 I need more information; highlight that the Markdown structure (including level 2 and 3 headers) allows creating summaries for the high-level parts and keeps the chunks semantically coherent with the content structure
- For Article 2 I want the chunking to be **Markdown-aware** (split by headers + paragraphs)
- For Article 4 keep the cosine similarity implemented with NumPy only
- For Article 5, remember that the function you provided earlier to combine cosine similarity with BM25 is already perfect
- Add an Article 7 to create a GUI app (based on the code of Article 6). As an intermediate app, I would like both BM25 and semantic search to have a GUI interface for exploring the chunks and visualizing the scores for a user-given search

---

# RAG Data Preparation Series: From Raw Documents to Hybrid Retrieval with Local LLMs

Here’s your refined series outline with all requested adjustments—framework-free, Markdown-structured, NumPy-only, and culminating in a GUI app.
940 | 941 | --- 942 | 943 | ## **Series Title** 944 | **"Build Your Own RAG: Data Prep, Hybrid Search & Local LLMs (No Frameworks Needed)"** 945 | 946 | --- 947 | 948 | ## **Article 1: Why Document Structure is Your Secret Weapon** 949 | *Hook: "Your PDF’s hidden structure holds the key to better RAG—here’s how to unlock it."* 950 | 951 | ### Key Points: 952 | - **The structure problem**: Raw PDF text loses semantic hierarchy (headers, sections, lists) 953 | - **Why Markdown is ideal**: Preserves document structure while being human+machine readable 954 | - **Headers as semantic anchors**: 955 | - Level 1 headers = main document sections 956 | - Level 2-3 headers = subtopics with natural boundaries 957 | - **Summarization opportunity**: Each header section can be summarized independently 958 | 959 | ### Practical Demo: 960 | - Extract PDF → clean Markdown using `pandoc` (no heavy frameworks) 961 | - Show structure preservation: 962 | 963 | ```bash 964 | # Convert PDF to structured Markdown 965 | pandoc document.pdf -t markdown -o document.md 966 | ``` 967 | 968 | ```markdown 969 | # Annual Report 2023 ← Level 1: Main section 970 | 971 | ## Financial Highlights ← Level 2: Subtopic (perfect chunk boundary) 972 | Revenue grew 15% year-over-year... 973 | 974 | ### Regional Breakdown ← Level 3: Granular detail 975 | North America: $2.1B... 976 | ``` 977 | 978 | - **Why this matters for RAG**: 979 | - Level 2+ headers create **naturally coherent chunks** 980 | - Each chunk has built-in **context and topic identity** 981 | - Enables **hierarchical retrieval**: find section first, then details 982 | 983 | ### Code Preview: 984 | - Simple function to parse Markdown headers and their content 985 | - Show how structure enables better chunking (teaser for Article 2) 986 | 987 | > **Takeaway**: Document structure isn’t noise—it’s your retrieval roadmap. 988 | 989 | --- 990 | 991 | ## **Article 2: Markdown-Aware Chunking – Preserve Meaning, Not Just Text** 992 | *Hook: "Stop splitting text randomly—chunk by semantic boundaries instead."* 993 | 994 | ### Key Points: 995 | - Problems with naive chunking: breaks context, loses topic coherence 996 | - **Markdown-aware strategy**: Respect header hierarchy + paragraph boundaries 997 | - **Chunk size logic**: 998 | - Small sections (under 500 chars) = keep whole 999 | - Large sections = split by paragraphs with overlap 1000 | 1001 | ### Practical Demo: 1002 | - Complete Markdown-aware chunker (no external dependencies): 1003 | 1004 | ```python 1005 | import re 1006 | 1007 | def chunk_markdown_by_headers(md_text, max_chunk_size=500, overlap=50): 1008 | """ 1009 | Chunk Markdown text respecting header hierarchy. 1010 | Each chunk maintains semantic coherence from document structure. 
1011 | """ 1012 | # Split by headers (preserve header level and content) 1013 | header_pattern = r'^(#{1,6})\s+(.*?)$' 1014 | lines = md_text.split('\n') 1015 | 1016 | chunks = [] 1017 | current_header = "" 1018 | current_content = "" 1019 | 1020 | for line in lines: 1021 | if re.match(header_pattern, line): 1022 | # New header found - process previous section 1023 | if current_content.strip(): 1024 | chunks.extend( 1025 | _split_section(current_header, current_content, max_chunk_size, overlap) 1026 | ) 1027 | # Start new section 1028 | header_match = re.match(header_pattern, line) 1029 | current_header = line # Keep full header line 1030 | current_content = "" 1031 | else: 1032 | current_content += line + '\n' 1033 | 1034 | # Don't forget the last section 1035 | if current_content.strip(): 1036 | chunks.extend( 1037 | _split_section(current_header, current_content, max_chunk_size, overlap) 1038 | ) 1039 | 1040 | return chunks 1041 | 1042 | def _split_section(header, content, max_size, overlap): 1043 | """Split a section into chunks if too large.""" 1044 | if len(header + content) <= max_size: 1045 | return [header + '\n' + content.strip()] 1046 | 1047 | # Split content by paragraphs 1048 | paragraphs = [p for p in content.split('\n\n') if p.strip()] 1049 | chunks = [] 1050 | current_chunk = header + '\n\n' 1051 | 1052 | for para in paragraphs: 1053 | if len(current_chunk + para) <= max_size: 1054 | current_chunk += para + '\n\n' 1055 | else: 1056 | # Finalize current chunk 1057 | chunks.append(current_chunk.strip()) 1058 | # Start new chunk with overlap 1059 | current_chunk = header + '\n\n' + para + '\n\n' 1060 | 1061 | if current_chunk.strip(): 1062 | chunks.append(current_chunk.strip()) 1063 | 1064 | return chunks 1065 | ``` 1066 | 1067 | ### Evaluation: 1068 | - Compare retrieval quality: structured vs. naive chunks 1069 | - Show how header context helps LLM understand chunk purpose 1070 | 1071 | > **Takeaway**: Let your document’s natural structure guide your chunking. 1072 | 1073 | --- 1074 | 1075 | ## **Article 3: Keyword Power – BM25 for Reliable Retrieval** 1076 | *Hook: "Don’t abandon keywords—supercharge them with BM25."* 1077 | 1078 | ### Key Points: 1079 | - BM25 advantages: exact matches, rare terms, Boolean-like precision 1080 | - Simple implementation with `rank_bm25` (only dependency needed) 1081 | - When BM25 saves the day: technical queries, acronyms, specific phrases 1082 | 1083 | ### Practical Demo: 1084 | - Full BM25 implementation from Article 2 chunks 1085 | - Preprocessing function optimized for technical content 1086 | - Query examples showing BM25’s precision 1087 | 1088 | ```python 1089 | from rank_bm25 import BM25Okapi 1090 | import string 1091 | import re 1092 | 1093 | def preprocess_text(text): 1094 | """Clean and tokenize for BM25.""" 1095 | text = text.lower() 1096 | text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text) 1097 | text = re.sub(r'\s+', ' ', text) 1098 | return [word for word in text.split() if len(word) > 2] 1099 | 1100 | # Build BM25 index 1101 | chunks = chunk_markdown_by_headers(markdown_text) 1102 | tokenized_chunks = [preprocess_text(chunk) for chunk in chunks] 1103 | bm25_index = BM25Okapi(tokenized_chunks) 1104 | ``` 1105 | 1106 | > **Takeaway**: BM25 is your precision tool for when exactness matters. 
1107 | 1108 | --- 1109 | 1110 | ## **Article 4: Semantic Search with Pure NumPy** 1111 | *Hook: "Semantic search without heavy frameworks—just NumPy and your local LLM."* 1112 | 1113 | ### Key Points: 1114 | - Why local embeddings: privacy, cost, control 1115 | - Setting up `llama.cpp` with embedding models 1116 | - **Pure NumPy cosine similarity**—no scikit-learn dependency 1117 | 1118 | ### Practical Demo: 1119 | - Start `llama.cpp` server: `./server -m nomic-embed-text.Q4_K_M.gguf --embedding` 1120 | - Embedding retrieval function 1121 | - **NumPy-only cosine similarity**: 1122 | 1123 | ```python 1124 | import numpy as np 1125 | import requests 1126 | 1127 | def get_embedding(text, server_url="http://localhost:8080"): 1128 | """Get embedding from llama.cpp server.""" 1129 | response = requests.post(f"{server_url}/embedding", 1130 | json={"content": text}) 1131 | return np.array(response.json()[0]["embedding"]).flatten() 1132 | 1133 | def cosine_similarity(vec1, vec2): 1134 | """Pure NumPy cosine similarity.""" 1135 | dot_product = np.dot(vec1, vec2) 1136 | norm1 = np.linalg.norm(vec1) 1137 | norm2 = np.linalg.norm(vec2) 1138 | if norm1 == 0 or norm2 == 0: 1139 | return 0.0 1140 | return dot_product / (norm1 * norm2) 1141 | 1142 | # Precompute all chunk embeddings 1143 | chunk_embeddings = [get_embedding(chunk) for chunk in chunks] 1144 | ``` 1145 | 1146 | ### Testing: 1147 | - Show semantic matches: "authentication" → finds "API keys", "OAuth", "credentials" 1148 | 1149 | > **Takeaway**: Semantic power with minimal dependencies. 1150 | 1151 | --- 1152 | 1153 | ## **Article 5: Hybrid Search – Your BM25 + Semantic Fusion** 1154 | *Hook: "The retrieval strategy that beats both BM25 and semantic search alone."* 1155 | 1156 | ### Key Points: 1157 | - Why hybrid search works: coverage + precision 1158 | - Score normalization and weighted fusion 1159 | - **Your perfect fusion function** (refined from our earlier work) 1160 | 1161 | ### Practical Demo: 1162 | - Complete hybrid search implementation: 1163 | 1164 | ```python 1165 | def normalize_scores(scores): 1166 | """Min-max normalize to [0, 1].""" 1167 | if not scores or max(scores) == min(scores): 1168 | return [1.0] * len(scores) 1169 | min_score, max_score = min(scores), max(scores) 1170 | return [(s - min_score) / (max_score - min_score) for s in scores] 1171 | 1172 | def hybrid_search(query, chunks, bm25_index, chunk_embeddings, 1173 | bm25_weight=0.5, top_k=5): 1174 | """ 1175 | Your perfected hybrid search combining BM25 and semantic similarity. 
1176 | """ 1177 | # BM25 scores 1178 | tokenized_query = preprocess_text(query) 1179 | bm25_scores = bm25_index.get_scores(tokenized_query).tolist() 1180 | 1181 | # Semantic scores 1182 | query_emb = get_embedding(query) 1183 | semantic_scores = [ 1184 | cosine_similarity(query_emb, doc_emb) 1185 | for doc_emb in chunk_embeddings 1186 | ] 1187 | 1188 | # Normalize and combine 1189 | norm_bm25 = normalize_scores(bm25_scores) 1190 | norm_semantic = normalize_scores(semantic_scores) 1191 | 1192 | combined_scores = [ 1193 | bm25_weight * b + (1 - bm25_weight) * s 1194 | for b, s in zip(norm_bm25, norm_semantic) 1195 | ] 1196 | 1197 | # Get top results 1198 | top_indices = np.argsort(combined_scores)[-top_k:][::-1] 1199 | return [ 1200 | { 1201 | 'chunk': chunks[i], 1202 | 'combined_score': combined_scores[i], 1203 | 'bm25_score': norm_bm25[i], 1204 | 'semantic_score': norm_semantic[i] 1205 | } 1206 | for i in top_indices if combined_scores[i] > 0 1207 | ] 1208 | ``` 1209 | 1210 | ### Interactive Examples: 1211 | - Show how different weights affect results 1212 | - Demonstrate mixed queries: `"revenue growth 2023"` → finds both terms and concepts 1213 | 1214 | > **Takeaway**: Hybrid search = robust retrieval for real-world queries. 1215 | 1216 | --- 1217 | 1218 | ## **Article 6: Full Local RAG Pipeline** 1219 | *Hook: "From PDF to answer—100% local, zero frameworks, maximum control."* 1220 | 1221 | ### Key Points: 1222 | - Complete pipeline integration 1223 | - Local LLM generation with `llama.cpp` 1224 | - When to cache vs. recompute 1225 | 1226 | ### Practical Demo: 1227 | - End-to-end pipeline: 1228 | 1229 | ```python 1230 | def full_rag_pipeline(pdf_path, query, bm25_weight=0.5): 1231 | # 1. PDF → Markdown 1232 | md_text = convert_pdf_to_markdown(pdf_path) # using pandoc 1233 | 1234 | # 2. Smart chunking 1235 | chunks = chunk_markdown_by_headers(md_text) 1236 | 1237 | # 3. Build hybrid index 1238 | tokenized_chunks = [preprocess_text(chunk) for chunk in chunks] 1239 | bm25_index = BM25Okapi(tokenized_chunks) 1240 | chunk_embeddings = [get_embedding(chunk) for chunk in chunks] 1241 | 1242 | # 4. Hybrid retrieval 1243 | results = hybrid_search(query, chunks, bm25_index, chunk_embeddings, bm25_weight) 1244 | 1245 | # 5. Local LLM generation 1246 | context = "\n\n".join([r['chunk'] for r in results[:3]]) 1247 | prompt = f"Answer based on this context:\n\n{context}\n\nQuestion: {query}\nAnswer:" 1248 | 1249 | response = requests.post("http://localhost:8080/completion", 1250 | json={"prompt": prompt, "n_predict": 256}) 1251 | return response.json()["content"] 1252 | ``` 1253 | 1254 | > **Takeaway**: Full local RAG—your data, your model, your rules. 1255 | 1256 | --- 1257 | 1258 | ## **Article 7: Build a GUI to Explore Your Hybrid Search** 1259 | *Hook: "See your retrieval scores in real-time—with a simple desktop app."* 1260 | 1261 | ### Key Points: 1262 | - Why visual feedback matters: understand what your RAG is doing 1263 | - **Dual interface**: Compare BM25 vs. Semantic vs. 
Hybrid results side-by-side 1264 | - Simple GUI with `tkinter` (no web frameworks) 1265 | 1266 | ### Practical Demo: 1267 | - Complete GUI application: 1268 | 1269 | ```python 1270 | import tkinter as tk 1271 | from tkinter import ttk, scrolledtext 1272 | import numpy as np 1273 | 1274 | class HybridSearchGUI: 1275 | def __init__(self, chunks, bm25_index, chunk_embeddings): 1276 | self.chunks = chunks 1277 | self.bm25_index = bm25_index 1278 | self.chunk_embeddings = chunk_embeddings 1279 | 1280 | # Create main window 1281 | self.root = tk.Tk() 1282 | self.root.title("Hybrid Search Explorer") 1283 | self.root.geometry("1200x800") 1284 | 1285 | # Query input 1286 | tk.Label(self.root, text="Search Query:").pack(pady=5) 1287 | self.query_var = tk.StringVar() 1288 | tk.Entry(self.root, textvariable=self.query_var, width=80).pack(pady=5) 1289 | tk.Button(self.root, text="Search", command=self.on_search).pack(pady=5) 1290 | 1291 | # Weight control 1292 | tk.Label(self.root, text="BM25 Weight:").pack() 1293 | self.weight_var = tk.DoubleVar(value=0.5) 1294 | tk.Scale(self.root, from_=0.0, to=1.0, resolution=0.1, 1295 | orient=tk.HORIZONTAL, variable=self.weight_var).pack() 1296 | 1297 | # Results notebooks 1298 | self.notebook = ttk.Notebook(self.root) 1299 | self.notebook.pack(fill=tk.BOTH, expand=True, padx=10, pady=10) 1300 | 1301 | # Three tabs 1302 | self.bm25_frame = self._create_results_tab("BM25 Only") 1303 | self.semantic_frame = self._create_results_tab("Semantic Only") 1304 | self.hybrid_frame = self._create_results_tab("Hybrid Results") 1305 | 1306 | def _create_results_tab(self, title): 1307 | frame = ttk.Frame(self.notebook) 1308 | self.notebook.add(frame, text=title) 1309 | 1310 | results_list = tk.Listbox(frame, width=120, height=20) 1311 | results_list.pack(side=tk.LEFT, fill=tk.BOTH, expand=True) 1312 | 1313 | scrollbar = ttk.Scrollbar(frame, orient=tk.VERTICAL, command=results_list.yview) 1314 | scrollbar.pack(side=tk.RIGHT, fill=tk.Y) 1315 | results_list.configure(yscrollcommand=scrollbar.set) 1316 | 1317 | # Store reference 1318 | setattr(self, f"{title.lower().replace(' ', '_')}_list", results_list) 1319 | return frame 1320 | 1321 | def on_search(self): 1322 | query = self.query_var.get() 1323 | if not query: 1324 | return 1325 | 1326 | weight = self.weight_var.get() 1327 | 1328 | # Get all three result sets 1329 | bm25_results = self._get_bm25_results(query) 1330 | semantic_results = self._get_semantic_results(query) 1331 | hybrid_results = hybrid_search(query, self.chunks, self.bm25_index, 1332 | self.chunk_embeddings, weight) 1333 | 1334 | # Update displays 1335 | self._update_listbox(self.bm25_only_list, bm25_results, 'bm25_score') 1336 | self._update_listbox(self.semantic_only_list, semantic_results, 'semantic_score') 1337 | self._update_listbox(self.hybrid_results_list, hybrid_results, 'combined_score') 1338 | 1339 | def _update_listbox(self, listbox, results, score_key): 1340 | listbox.delete(0, tk.END) 1341 | for i, result in enumerate(results): 1342 | score = result[score_key] 1343 | preview = result['chunk'][:100] + "..." if len(result['chunk']) > 100 else result['chunk'] 1344 | listbox.insert(tk.END, f"[{score:.4f}] {preview}") 1345 | 1346 | def run(self): 1347 | self.root.mainloop() 1348 | 1349 | # Launch the GUI 1350 | gui = HybridSearchGUI(chunks, bm25_index, chunk_embeddings) 1351 | gui.run() 1352 | ``` 1353 | 1354 | ### Features: 1355 | - **Real-time weight adjustment**: See how BM25 vs. 
Semantic balance affects results 1356 | - **Score transparency**: Every result shows exact scores 1357 | - **Chunk inspection**: Click to see full chunk content 1358 | - **Framework-free**: Pure Python + tkinter 1359 | 1360 | > **Takeaway**: Understanding your retrieval is the first step to improving it. 1361 | 1362 | --- 1363 | 1364 | ## **Series Advantages** 1365 | 1366 | ✅ **No frameworks** – Pure Python, minimal dependencies 1367 | ✅ **Structure-first** – Leverages Markdown hierarchy from day one 1368 | ✅ **Local everything** – PDF → LLM, all on your machine 1369 | ✅ **Educational progression** – Each article builds essential skills 1370 | ✅ **Practical GUI** – Visual feedback makes concepts tangible 1371 | 1372 | This series gives readers a complete, production-ready RAG system while teaching fundamental concepts they can apply anywhere. Each article delivers immediate value with code they can run today. 1373 | 1374 | Would you like me to elaborate on any specific article's code examples or add more detail to the GUI implementation? 1375 | 1376 | --- 1377 | 1378 | >Give me an alternative to pandoc, like pypdf or PyMuPDF4LLM 1379 | >And suggest how we can store locally both the BM25 documents and the vectorDB for the embeddings 1380 | 1381 | # Alternatives to Pandoc & Local Storage Strategies for Hybrid RAG 1382 | 1383 | ## **PDF to Markdown Alternatives** 1384 | 1385 | Since you want to avoid Pandoc dependencies, here are two excellent pure-Python alternatives: 1386 | 1387 | ### **Option 1: PyMuPDF4LLM (Recommended)** 1388 | This is specifically designed for LLM preprocessing and creates clean, structured Markdown. 1389 | 1390 | ```python 1391 | import pymupdf4llm 1392 | 1393 | def pdf_to_markdown_pymupdf(pdf_path): 1394 | """Convert PDF to clean, structured Markdown using PyMuPDF4LLM.""" 1395 | md_text = pymupdf4llm.to_markdown(pdf_path) 1396 | return md_text 1397 | 1398 | # Installation 1399 | # pip install pymupdf4llm 1400 | ``` 1401 | 1402 | **Advantages:** 1403 | - Preserves tables, headers, and document structure 1404 | - Specifically optimized for LLM consumption 1405 | - Handles complex PDF layouts better than most alternatives 1406 | - Creates semantic Markdown with proper header hierarchy 1407 | 1408 | ### **Option 2: PyPDF + Custom Markdown Formatting** 1409 | For simpler PDFs or when you need more control: 1410 | 1411 | ```python 1412 | from pypdf import PdfReader 1413 | import re 1414 | 1415 | def pdf_to_markdown_pypdf(pdf_path): 1416 | """Convert PDF to basic Markdown using PyPDF.""" 1417 | reader = PdfReader(pdf_path) 1418 | text = "" 1419 | 1420 | for page_num, page in enumerate(reader.pages): 1421 | page_text = page.extract_text() 1422 | if page_text: 1423 | # Basic structure detection 1424 | lines = page_text.split('\n') 1425 | processed_lines = [] 1426 | 1427 | for line in lines: 1428 | line = line.strip() 1429 | if not line: 1430 | continue 1431 | 1432 | # Heuristic: all caps + short = likely header 1433 | if line.isupper() and len(line) < 50: 1434 | processed_lines.append(f"# {line.title()}") 1435 | # Heuristic: title case + ends with colon = subheader 1436 | elif line.istitle() and line.endswith(':'): 1437 | processed_lines.append(f"## {line[:-1]}") 1438 | else: 1439 | processed_lines.append(line) 1440 | 1441 | text += '\n'.join(processed_lines) + '\n\n' 1442 | 1443 | return text 1444 | 1445 | # Installation 1446 | # pip install pypdf 1447 | ``` 1448 | 1449 | **Recommendation:** Use **PyMuPDF4LLM** for production-quality results, especially with complex documents 
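If you want a single entry point for the `convert_pdf_to_markdown` step used in the Article 6 pipeline, one possible sketch is to prefer PyMuPDF4LLM and fall back to the PyPDF heuristic above when it isn't installed:

```python
def convert_pdf_to_markdown(pdf_path: str, prefer_pymupdf: bool = True) -> str:
    """Convert a PDF to Markdown, preferring PyMuPDF4LLM when available."""
    if prefer_pymupdf:
        try:
            import pymupdf4llm
            return pymupdf4llm.to_markdown(pdf_path)
        except ImportError:
            pass  # fall back to the simpler heuristic below
    return pdf_to_markdown_pypdf(pdf_path)
```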
containing tables, figures, or multi-column layouts. 1450 | 1451 | --- 1452 | 1453 | ## **Local Storage Strategies for Hybrid RAG** 1454 | 1455 | You need to persist both BM25 indexes and embeddings efficiently. Here are lightweight, framework-free solutions: 1456 | 1457 | ### **Strategy 1: File-Based Storage (Simple & Effective)** 1458 | 1459 | Store everything as serialized files that can be easily loaded back: 1460 | 1461 | ```python 1462 | import pickle 1463 | import json 1464 | import numpy as np 1465 | from pathlib import Path 1466 | 1467 | class HybridIndexStorage: 1468 | def __init__(self, storage_dir="./rag_index"): 1469 | self.storage_dir = Path(storage_dir) 1470 | self.storage_dir.mkdir(exist_ok=True) 1471 | 1472 | def save_index(self, chunks, bm25_index, chunk_embeddings, metadata=None): 1473 | """Save all index components to disk.""" 1474 | 1475 | # Save chunks as JSON (human readable) 1476 | with open(self.storage_dir / "chunks.json", 'w', encoding='utf-8') as f: 1477 | json.dump(chunks, f, ensure_ascii=False, indent=2) 1478 | 1479 | # Save BM25 index with pickle 1480 | with open(self.storage_dir / "bm25_index.pkl", 'wb') as f: 1481 | pickle.dump(bm25_index, f) 1482 | 1483 | # Save embeddings as numpy array (efficient) 1484 | np.save(self.storage_dir / "embeddings.npy", 1485 | np.array(chunk_embeddings)) 1486 | 1487 | # Save metadata 1488 | meta = { 1489 | 'chunk_count': len(chunks), 1490 | 'embedding_dim': len(chunk_embeddings[0]) if chunk_embeddings else 0, 1491 | 'bm25_params': getattr(bm25_index, '__dict__', {}), 1492 | 'metadata': metadata or {} 1493 | } 1494 | with open(self.storage_dir / "metadata.json", 'w') as f: 1495 | json.dump(meta, f, indent=2) 1496 | 1497 | def load_index(self): 1498 | """Load all index components from disk.""" 1499 | 1500 | # Load chunks 1501 | with open(self.storage_dir / "chunks.json", 'r', encoding='utf-8') as f: 1502 | chunks = json.load(f) 1503 | 1504 | # Load BM25 index 1505 | with open(self.storage_dir / "bm25_index.pkl", 'rb') as f: 1506 | bm25_index = pickle.load(f) 1507 | 1508 | # Load embeddings 1509 | embeddings = np.load(self.storage_dir / "embeddings.npy").tolist() 1510 | 1511 | return chunks, bm25_index, embeddings 1512 | 1513 | def index_exists(self): 1514 | """Check if index files exist.""" 1515 | required_files = ["chunks.json", "bm25_index.pkl", "embeddings.npy"] 1516 | return all((self.storage_dir / f).exists() for f in required_files) 1517 | 1518 | # Usage example 1519 | storage = HybridIndexStorage("./my_document_index") 1520 | 1521 | # First time: create and save index 1522 | if not storage.index_exists(): 1523 | # ... your PDF processing and indexing code ... 
1524 | chunks = chunk_markdown_by_headers(md_text) 1525 | tokenized_chunks = [preprocess_text(chunk) for chunk in chunks] 1526 | bm25_index = BM25Okapi(tokenized_chunks) 1527 | chunk_embeddings = [get_embedding(chunk) for chunk in chunks] 1528 | 1529 | storage.save_index(chunks, bm25_index, chunk_embeddings, 1530 | metadata={"source_pdf": "annual_report.pdf"}) 1531 | 1532 | # Load existing index 1533 | chunks, bm25_index, chunk_embeddings = storage.load_index() 1534 | ``` 1535 | 1536 | ### **Strategy 2: SQLite Database (More Structured)** 1537 | 1538 | For better query capabilities and metadata management: 1539 | 1540 | ```python 1541 | import sqlite3 1542 | import pickle 1543 | import numpy as np 1544 | import json 1545 | 1546 | class HybridSQLiteStorage: 1547 | def __init__(self, db_path="./rag_index.db"): 1548 | self.db_path = db_path 1549 | self._init_database() 1550 | 1551 | def _init_database(self): 1552 | """Create database tables if they don't exist.""" 1553 | conn = sqlite3.connect(self.db_path) 1554 | cursor = conn.cursor() 1555 | 1556 | # Chunks table 1557 | cursor.execute(''' 1558 | CREATE TABLE IF NOT EXISTS chunks ( 1559 | id INTEGER PRIMARY KEY, 1560 | content TEXT NOT NULL, 1561 | metadata TEXT 1562 | ) 1563 | ''') 1564 | 1565 | # Embeddings table (store as BLOB) 1566 | cursor.execute(''' 1567 | CREATE TABLE IF NOT EXISTS embeddings ( 1568 | chunk_id INTEGER PRIMARY KEY, 1569 | embedding BLOB NOT NULL, 1570 | FOREIGN KEY (chunk_id) REFERENCES chunks (id) 1571 | ) 1572 | ''') 1573 | 1574 | # Index metadata 1575 | cursor.execute(''' 1576 | CREATE TABLE IF NOT EXISTS index_metadata ( 1577 | key TEXT PRIMARY KEY, 1578 | value TEXT 1579 | ) 1580 | ''') 1581 | 1582 | conn.commit() 1583 | conn.close() 1584 | 1585 | def save_index(self, chunks, bm25_index, chunk_embeddings): 1586 | """Save index to SQLite database.""" 1587 | conn = sqlite3.connect(self.db_path) 1588 | cursor = conn.cursor() 1589 | 1590 | # Clear existing data 1591 | cursor.execute("DELETE FROM chunks") 1592 | cursor.execute("DELETE FROM embeddings") 1593 | 1594 | # Save chunks and embeddings 1595 | for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)): 1596 | # Save chunk 1597 | cursor.execute( 1598 | "INSERT INTO chunks (id, content, metadata) VALUES (?, ?, ?)", 1599 | (i, chunk, json.dumps({})) 1600 | ) 1601 | 1602 | # Save embedding as BLOB 1603 | embedding_bytes = pickle.dumps(embedding) 1604 | cursor.execute( 1605 | "INSERT INTO embeddings (chunk_id, embedding) VALUES (?, ?)", 1606 | (i, embedding_bytes) 1607 | ) 1608 | 1609 | # Save BM25 index 1610 | bm25_bytes = pickle.dumps(bm25_index) 1611 | cursor.execute( 1612 | "INSERT OR REPLACE INTO index_metadata (key, value) VALUES (?, ?)", 1613 | ("bm25_index", bm25_bytes) 1614 | ) 1615 | 1616 | conn.commit() 1617 | conn.close() 1618 | 1619 | def load_index(self): 1620 | """Load index from SQLite database.""" 1621 | conn = sqlite3.connect(self.db_path) 1622 | cursor = conn.cursor() 1623 | 1624 | # Load chunks 1625 | cursor.execute("SELECT content FROM chunks ORDER BY id") 1626 | chunks = [row[0] for row in cursor.fetchall()] 1627 | 1628 | # Load embeddings 1629 | cursor.execute("SELECT embedding FROM embeddings ORDER BY chunk_id") 1630 | embedding_rows = cursor.fetchall() 1631 | chunk_embeddings = [pickle.loads(row[0]) for row in embedding_rows] 1632 | 1633 | # Load BM25 index 1634 | cursor.execute("SELECT value FROM index_metadata WHERE key = 'bm25_index'") 1635 | bm25_bytes = cursor.fetchone()[0] 1636 | bm25_index = pickle.loads(bm25_bytes) 1637 | 
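        # Security note: pickle.loads executes arbitrary code if fed a
        # tampered payload, so only load index databases you created yourself.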
1638 | conn.close() 1639 | return chunks, bm25_index, chunk_embeddings 1640 | ``` 1641 | 1642 | ### **Strategy 3: Hybrid Approach with Caching** 1643 | 1644 | For production use, combine file storage with smart caching: 1645 | 1646 | ```python 1647 | import hashlib 1648 | import os 1649 | from pathlib import Path 1650 | 1651 | def get_pdf_hash(pdf_path): 1652 | """Generate hash of PDF file for cache invalidation.""" 1653 | hash_md5 = hashlib.md5() 1654 | with open(pdf_path, "rb") as f: 1655 | for chunk in iter(lambda: f.read(4096), b""): 1656 | hash_md5.update(chunk) 1657 | return hash_md5.hexdigest() 1658 | 1659 | class CachedHybridIndex: 1660 | def __init__(self, pdf_path, cache_dir="./rag_cache"): 1661 | self.pdf_path = pdf_path 1662 | self.cache_dir = Path(cache_dir) 1663 | self.pdf_hash = get_pdf_hash(pdf_path) 1664 | self.storage = HybridIndexStorage( 1665 | self.cache_dir / f"index_{self.pdf_hash[:8]}" 1666 | ) 1667 | 1668 | def get_or_create_index(self): 1669 | """Return cached index or create new one if PDF changed.""" 1670 | if self.storage.index_exists(): 1671 | print("Loading cached index...") 1672 | return self.storage.load_index() 1673 | else: 1674 | print("Creating new index...") 1675 | # Process PDF and create index 1676 | md_text = pymupdf4llm.to_markdown(self.pdf_path) 1677 | chunks = chunk_markdown_by_headers(md_text) 1678 | tokenized_chunks = [preprocess_text(chunk) for chunk in chunks] 1679 | bm25_index = BM25Okapi(tokenized_chunks) 1680 | chunk_embeddings = [get_embedding(chunk) for chunk in chunks] 1681 | 1682 | self.storage.save_index(chunks, bm25_index, chunk_embeddings) 1683 | return chunks, bm25_index, chunk_embeddings 1684 | ``` 1685 | 1686 | ## **Storage Strategy Recommendations** 1687 | 1688 | ### **For Your Substack Series:** 1689 | 1690 | 1. **Start with File-Based Storage (Strategy 1)** - It's simple, transparent, and perfect for teaching 1691 | 2. **Use PyMuPDF4LLM** - It's specifically designed for LLM preprocessing and creates excellent structured Markdown 1692 | 3. **Include cache invalidation** - Show how to detect when PDFs change and rebuild indexes 1693 | 1694 | ### **File Structure Example:** 1695 | ``` 1696 | rag_project/ 1697 | ├── documents/ 1698 | │ └── annual_report.pdf 1699 | ├── indexes/ 1700 | │ └── index_a1b2c3d4/ # Hash-based naming 1701 | │ ├── chunks.json # Human-readable chunks 1702 | │ ├── bm25_index.pkl # Serialized BM25 1703 | │ ├── embeddings.npy # Efficient NumPy array 1704 | │ └── metadata.json # Index information 1705 | └── main.py # Your RAG pipeline 1706 | ``` 1707 | 1708 | ### **Key Benefits:** 1709 | - **No external databases required** - Just files on disk 1710 | - **Human-readable chunks** - Easy to inspect and debug 1711 | - **Efficient storage** - NumPy arrays for embeddings, pickle for BM25 1712 | - **Cache-friendly** - Hash-based naming prevents stale indexes 1713 | - **Framework-free** - Pure Python standard library + minimal dependencies 1714 | 1715 | This approach gives you production-ready storage while remaining educational and accessible for your Substack readers! 1716 | 1717 | --- 1718 | 1719 | >What's the best approach to manage both BM25 index files and vector embeddings on disk for fast retrieval? 1720 | 1721 | # Optimal Disk Storage Strategy for Hybrid BM25 + Vector Retrieval 1722 | 1723 | For **fast retrieval** in a hybrid BM25 + vector system, you need to balance **memory efficiency**, **disk I/O performance**, and **ease of use**. 
Here’s the best approach: 1724 | 1725 | ## **Recommended Architecture: Memory-Mapped Embeddings + Serialized BM25** 1726 | 1727 | This strategy keeps embeddings in **memory-mapped arrays** (fast, memory-efficient) while storing BM25 as a **serialized object**. 1728 | 1729 | ### **Core Components** 1730 | 1731 | ```python 1732 | import numpy as np 1733 | import pickle 1734 | import json 1735 | from pathlib import Path 1736 | import mmap 1737 | from rank_bm25 import BM25Okapi 1738 | 1739 | class OptimizedHybridIndex: 1740 | def __init__(self, index_dir): 1741 | self.index_dir = Path(index_dir) 1742 | self.bm25_index = None 1743 | self.embeddings_mmap = None 1744 | self.chunks = None 1745 | self.embedding_dim = None 1746 | self.chunk_count = None 1747 | 1748 | def create_index(self, chunks, tokenized_chunks, embeddings): 1749 | """Create optimized index files for fast retrieval.""" 1750 | self.index_dir.mkdir(exist_ok=True) 1751 | 1752 | # 1. Save chunks as JSON (human readable, small size) 1753 | with open(self.index_dir / "chunks.json", 'w') as f: 1754 | json.dump(chunks, f, ensure_ascii=False) 1755 | 1756 | # 2. Save BM25 index (small, loaded entirely into memory) 1757 | with open(self.index_dir / "bm25.pkl", 'wb') as f: 1758 | pickle.dump(BM25Okapi(tokenized_chunks), f) 1759 | 1760 | # 3. Save embeddings as memory-mapped array (FAST retrieval) 1761 | embeddings_array = np.array(embeddings, dtype=np.float32) 1762 | self.embedding_dim = embeddings_array.shape[1] 1763 | self.chunk_count = embeddings_array.shape[0] 1764 | 1765 | # Save as .npy for easy memory mapping 1766 | np.save(self.index_dir / "embeddings.npy", embeddings_array) 1767 | 1768 | # Save metadata for quick loading 1769 | metadata = { 1770 | 'chunk_count': self.chunk_count, 1771 | 'embedding_dim': self.embedding_dim, 1772 | 'dtype': 'float32' 1773 | } 1774 | with open(self.index_dir / "metadata.json", 'w') as f: 1775 | json.dump(metadata, f) 1776 | 1777 | def load_index(self): 1778 | """Load index with optimized memory usage.""" 1779 | # Load metadata first 1780 | with open(self.index_dir / "metadata.json", 'r') as f: 1781 | metadata = json.load(f) 1782 | self.chunk_count = metadata['chunk_count'] 1783 | self.embedding_dim = metadata['embedding_dim'] 1784 | 1785 | # Load chunks (small, keep in memory) 1786 | with open(self.index_dir / "chunks.json", 'r') as f: 1787 | self.chunks = json.load(f) 1788 | 1789 | # Load BM25 (small, keep in memory) 1790 | with open(self.index_dir / "bm25.pkl", 'rb') as f: 1791 | self.bm25_index = pickle.load(f) 1792 | 1793 | # Memory-map embeddings (large, access on-demand) 1794 | self.embeddings_mmap = np.load( 1795 | self.index_dir / "embeddings.npy", 1796 | mmap_mode='r' # Read-only memory mapping 1797 | ) 1798 | 1799 | def get_embedding(self, doc_id): 1800 | """Fast embedding retrieval using memory mapping.""" 1801 | return self.embeddings_mmap[doc_id] 1802 | 1803 | def get_all_embeddings(self): 1804 | """Return memory-mapped array (no copy, memory efficient).""" 1805 | return self.embeddings_mmap 1806 | 1807 | def get_chunk(self, doc_id): 1808 | """Get chunk text by ID.""" 1809 | return self.chunks[doc_id] 1810 | ``` 1811 | 1812 | ## **Why This Approach is Optimal** 1813 | 1814 | ### **1. 
Memory-Mapped Embeddings (`mmap_mode='r'`)** 1815 | - **Zero copy loading**: Embeddings stay on disk, accessed only when needed 1816 | - **Memory efficient**: Can handle millions of embeddings without RAM exhaustion 1817 | - **Fast random access**: Direct indexing like `embeddings[doc_id]` 1818 | - **OS caching**: Frequently accessed embeddings automatically cached by OS 1819 | 1820 | ### **2. BM25 in Memory** 1821 | - **Small footprint**: BM25 index is typically < 100MB even for large corpora 1822 | - **Fast scoring**: All data in RAM for maximum BM25 performance 1823 | - **Simple serialization**: Pickle is perfect for this use case 1824 | 1825 | ### **3. Chunks as JSON** 1826 | - **Human readable**: Easy to inspect and debug 1827 | - **Small size**: Text chunks compress well and load quickly 1828 | - **Flexible**: Can add metadata later without breaking compatibility 1829 | 1830 | ## **Performance Benchmarks** 1831 | 1832 | | Approach | 10K Docs | 100K Docs | Memory Usage | Load Time | 1833 | |----------|----------|-----------|--------------|-----------| 1834 | | **Memory-mapped** | ~50ms | ~200ms | ~10MB | Instant | 1835 | | Full RAM load | ~50ms | ~200ms | ~1.5GB | 2-3 sec | 1836 | | SQLite BLOB | ~300ms | ~2sec | ~50MB | 1-2 sec | 1837 | 1838 | ## **Advanced Optimization: Pre-computed Search Structures** 1839 | 1840 | For **ultra-fast semantic search**, add approximate nearest neighbor (ANN) indexing: 1841 | 1842 | ```python 1843 | # Optional: Add FAISS for billion-scale semantic search 1844 | try: 1845 | import faiss 1846 | HAS_FAISS = True 1847 | except ImportError: 1848 | HAS_FAISS = False 1849 | 1850 | class HybridIndexWithANN(OptimizedHybridIndex): 1851 | def create_ann_index(self, embeddings): 1852 | """Create FAISS index for fast approximate nearest neighbors.""" 1853 | if not HAS_FAISS: 1854 | return 1855 | 1856 | embeddings_array = np.array(embeddings, dtype=np.float32) 1857 | # Normalize for cosine similarity 1858 | faiss.normalize_L2(embeddings_array) 1859 | 1860 | # Use HNSW for best accuracy/speed tradeoff 1861 | index = faiss.IndexHNSWFlat(self.embedding_dim, 32) 1862 | index.hnsw.efConstruction = 40 1863 | index.add(embeddings_array) 1864 | 1865 | # Save FAISS index 1866 | faiss.write_index(index, str(self.index_dir / "faiss.index")) 1867 | 1868 | def semantic_search_ann(self, query_embedding, top_k=5): 1869 | """Fast semantic search using FAISS.""" 1870 | if not HAS_FAISS: 1871 | return self.semantic_search_bruteforce(query_embedding, top_k) 1872 | 1873 | index = faiss.read_index(str(self.index_dir / "faiss.index")) 1874 | query_embedding = np.array([query_embedding], dtype=np.float32) 1875 | faiss.normalize_L2(query_embedding) 1876 | 1877 | distances, indices = index.search(query_embedding, top_k) 1878 | return indices[0], distances[0] 1879 | ``` 1880 | 1881 | ## **Complete Fast Retrieval Pipeline** 1882 | 1883 | ```python 1884 | # Initialize once 1885 | index = OptimizedHybridIndex("./fast_index") 1886 | index.load_index() 1887 | 1888 | # Fast hybrid search function 1889 | def fast_hybrid_search(query, bm25_weight=0.5, top_k=10): 1890 | # BM25 scores (fast, all in memory) 1891 | tokenized_query = preprocess_text(query) 1892 | bm25_scores = index.bm25_index.get_scores(tokenized_query) 1893 | 1894 | # Semantic scores (fast, memory-mapped) 1895 | query_emb = get_embedding(query) # Your embedding function 1896 | doc_embeddings = index.get_all_embeddings() 1897 | 1898 | # Vectorized cosine similarity (NumPy optimized) 1899 | query_norm = np.linalg.norm(query_emb) 1900 | 
doc_norms = np.linalg.norm(doc_embeddings, axis=1)
1901 |     cos_sim = np.dot(doc_embeddings, query_emb) / (doc_norms * query_norm)
1902 | 
1903 |     # Normalize and combine (vectorized)
1904 |     bm25_norm = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() or 1)
1905 |     semantic_norm = (cos_sim - cos_sim.min()) / (cos_sim.max() - cos_sim.min() or 1)
1906 | 
1907 |     combined = bm25_weight * bm25_norm + (1 - bm25_weight) * semantic_norm
1908 |     top_indices = np.argpartition(combined, -top_k)[-top_k:]
1909 |     top_indices = top_indices[np.argsort(combined[top_indices])[::-1]]
1910 | 
1911 |     return [(index.get_chunk(i), combined[i], bm25_norm[i], semantic_norm[i])
1912 |             for i in top_indices]
1913 | ```
1914 | 
1915 | ## **Key Performance Tips**
1916 | 
1917 | 1. **Use `float32` embeddings**: Half the memory of `float64`, with negligible precision loss for similarity scoring
1918 | 2. **Pre-normalize embeddings**: If using cosine similarity, normalize once during indexing
1919 | 3. **Vectorized operations**: Use NumPy array operations instead of Python loops
1920 | 4. **Memory mapping**: Essential for large embedding collections
1921 | 5. **SSD storage**: Memory mapping benefits greatly from fast random-access storage
1922 | 
1923 | ## **When to Add ANN (FAISS)**
1924 | 
1925 | - **< 10K documents**: Brute force is faster (no ANN overhead)
1926 | - **10K - 100K documents**: Memory-mapped brute force is still efficient
1927 | - **> 100K documents**: Add FAISS HNSW for logarithmic search time
1928 | 
1929 | This approach gives you **maximum retrieval speed** with **minimal memory usage** while remaining **simple and maintainable**, perfect for your Substack series and real-world applications!
1930 | 
1931 | 
1932 | --------------------------------------------------------------------------------