├── LICENSE
├── README.md
├── rag_vs_raptor.ipynb
├── raptor_guide.ipynb
└── requirements.txt


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2025 Fareed Khan
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | <!-- omit in toc -->
  2 | # RAG with RAPTOR
  3 | 
  4 | There are [tons of RAG optimization techniques](https://levelup.gitconnected.com/testing-18-rag-techniques-to-find-the-best-094d166af27f) you can use to improve performance, from query transformations to sophisticated re-ranking models. The challenge is that each new layer often brings added complexity, more LLM calls, and more moving parts to your architecture.
  5 | 
  6 | > But what if we could get a better performance by focusing on just one thing: building a smarter index?
  7 | 
  8 | ![Raptor with RAG](https://miro.medium.com/v2/resize:fit:1250/1*MP4ZNLcEJevendLkd50jig.png)
  9 | *Raptor with RAG (Created by Fareed Khan)*
 10 | 
 11 | This is the core idea behind **RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)**, it keeps RAG simple at query time while delivering superior results by building a hierarchical index that mirrors human understanding from details to high-level concepts.
 12 | 
 13 | Here is a high-level overview of how RAPTOR works:
 14 | 
 15 | 1.  **Start with Leaf Nodes:** First, we break down all source documents into small, detailed chunks. These are the foundational “leaf nodes” of our knowledge tree.
 16 | 2.  **Cluster for Themes:** Then, we use an advanced clustering algorithm to automatically group these leaf nodes into thematically related clusters based on their semantic meaning.
 17 | 3.  **Summarize for Abstraction:** We use an LLM to generate a concise, high-quality summary for each cluster. These summaries become the next, more abstract layer of the tree.
 18 | 4.  **Recurse to Build Upwards:** We repeat the clustering and summarization process on the newly created summaries, building the tree level by level towards higher concepts.
 19 | 5.  **Index Everything Together:** Finally, we combine all text, the original leaf chunks and all generated summaries into a single **“collapsed tree”** vector store for a powerful, multi-resolution search.
 20 | 
 21 | In this blog, we are going to…
 22 | > Evaluate a simple RAG pipeline against a RAPTOR-based RAG pipeline and explore why RAPTOR performs better than other approaches.
 23 | 
 24 | <!-- omit in toc -->
 25 | ## Table of Contents
 26 | - [Initializing our RAG Configuration](#initializing-our-rag-configuration)
 27 | - [Data Ingestion and Preparation](#data-ingestion-and-preparation)
 28 | - [Creating Leaf Nodes of RAPTOR Tree](#creating-leaf-nodes-of-raptor-tree)
 29 |     - [What is the Point of Leaf Nodes?](#what-is-the-point-of-leaf-nodes)
 30 | - [Implementing a Simple RAG Approach](#implementing-a-simple-rag-approach)
 31 | - [Building a Hierarchical Clustering Engine](#building-a-hierarchical-clustering-engine)
 32 |   - [Dimensionality Reduction with UMAP](#dimensionality-reduction-with-umap)
 33 |   - [Optimal Cluster Number Detection](#optimal-cluster-number-detection)
 34 |   - [Probabilistic Clustering with GMM](#probabilistic-clustering-with-gmm)
 35 |   - [Hierarchical Clustering Orchestrator](#hierarchical-clustering-orchestrator)
 36 | - [Building and Executing the RAPTOR Tree](#building-and-executing-the-raptor-tree)
 37 |     - [The Abstraction Engine: Summarization](#the-abstraction-engine-summarization)
 38 |     - [The Recursive Tree Builder](#the-recursive-tree-builder)
 39 | - [Indexing with the Collapsed Tree Strategy](#indexing-with-the-collapsed-tree-strategy)
 40 |     - [Query 1: Specific, Low-Level Question](#query-1-specific-low-level-question)
 41 |     - [Query 2: Mid-Level, Conceptual Question](#query-2-mid-level-conceptual-question)
 42 |     - [Query 3: Broad, High-Level Question](#query-3-broad-high-level-question)
 43 | - [Quantitative Evaluation of RAPTOR against Simple RAG](#quantitative-evaluation-of-raptor-against-simple-rag)
 44 | - [Qualitative Evaluation using LLM as a Judge](#qualitative-evaluation-using-llm-as-a-judge)
 45 | - [Summarizing the RAPTOR Approach](#summarizing-the-raptor-approach)
 46 | 
 47 | ---
 48 | 
 49 | ## Initializing our RAG Configuration
 50 | 
 51 | The two most important components of any RAG system are:
 52 | 
 53 | 1.  An embedding model → to convert documents into vector space for retrieval.
 54 | 2.  A text generation model (LLM) → to interpret retrieved content and produce answers.
 55 | 
 56 | > To make our approach **replicable and fair**, we are intentionally using a **quantized, older model** that was released about a year ago.
 57 | 
 58 | If we used a **newer LLM**, it might already “know” the answers internally, bypassing retrieval. By choosing an older model, we ensure that the evaluation truly tests **retrieval quality** which is exactly where RAPTOR vs. simple RAG makes a difference.
 59 | 
 60 | We first need to import PyTorch and other supporting components:
 61 | 
 62 | ```python
 63 | # Import the core PyTorch library for tensor operations
 64 | import torch
 65 | 
 66 | # Import LangChain's wrappers for Hugging Face models
 67 | from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
 68 | 
 69 | # Import core components from the transformers library for model loading and configuration
 70 | from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
 71 | 
 72 | # Import LangChain's tools for prompt engineering and output handling
 73 | from langchain_core.prompts import ChatPromptTemplate
 74 | from langchain_core.output_parsers import StrOutputParser
 75 | ```
 76 | 
 77 | We will use `sentence-transformers/all-MiniLM-L6-v2`, a lightweight and widely used embedding model, to convert all our text chunks and summaries into vector representations.
 78 | 
 79 | ```python
 80 | # --- Configure Embedding Model ---
 81 | embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
 82 | 
 83 | # Use GPU if available, otherwise fallback to CPU
 84 | model_kwargs = {"device": "cuda"}
 85 | 
 86 | # Initialize embeddings with LangChain's wrapper
 87 | embeddings = HuggingFaceEmbeddings(
 88 |     model_name=embedding_model_name,
 89 |     model_kwargs=model_kwargs
 90 | )
 91 | ```
 92 | 
 93 | This embedding model is small but perfect for large-scale document indexing without excessive memory usage. Next for generation, we are using [Mistral-7B-Instruct v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), a capable but compact instruction-tuned model.
 94 | 
 95 | To make it memory-friendly, we load it with **4-bit quantization** using `BitsAndBytesConfig`.
 96 | 
 97 | ```python
 98 | # --- Configure LLM for Summarization and Generation ---
 99 | llm_id = "mistralai/Mistral-7B-Instruct-v0.2"
100 | 
101 | # Quantization: reduces memory footprint while preserving performance
102 | quantization_config = BitsAndBytesConfig(
103 |     load_in_4bit=True,
104 |     bnb_4bit_compute_dtype=torch.float16,
105 |     bnb_4bit_quant_type="nf4"
106 | )
107 | ```
108 | 
109 | We now need to load the tokenizer and the LLM itself with the quantization settings applied.
110 | 
111 | ```python
112 | # Load tokenizer
113 | tokenizer = AutoTokenizer.from_pretrained(llm_id)
114 | 
115 | # Load LLM with quantization
116 | model = AutoModelForCausalLM.from_pretrained(
117 |     llm_id,
118 |     torch_dtype=torch.float16,
119 |     device_map="auto",
120 |     quantization_config=quantization_config
121 | )
122 | ```
123 | 
124 | This way, the model runs efficiently on available hardware, even with limited GPU memory. Once the model and tokenizer are loaded, we wrap them in a Hugging Face **pipeline** for text generation.
125 | 
126 | ```python
127 | # Create a text-generation pipeline using the loaded model and tokenizer.
128 | pipe = pipeline(
129 |     "text-generation", 
130 |     model=model, 
131 |     tokenizer=tokenizer, 
132 |     max_new_tokens=512 # Controls the max length of the generated summaries and answers
133 | )
134 | ```
135 | 
136 | Finally, we wrap the Hugging Face pipeline in LangChain’s `HuggingFacePipeline` so it integrates smoothly with our retrieval pipeline later.
137 | 
138 | ```python
139 | # Wrap pipeline for LangChain compatibility
140 | llm = HuggingFacePipeline(pipeline=pipe)
141 | ```
142 | 
143 | ---
144 | 
145 | ## Data Ingestion and Preparation
146 | 
147 | To properly showcase how **RAPTOR** can improve **RAG** performance, we need a **complex and challenging database**. The idea is that when we run queries against it, we want to see real differences between **simple RAG** and **RAPTOR-enhanced RAG**.
148 | 
149 | For this reason, we are focusing on the **Hugging Face documentation**. The docs are rich in overlapping information and contain subtle variations that can easily trip up a naïve retriever.
150 | 
151 | For example, Hugging Face explains **ZeRO-3 checkpoint saving** in multiple ways:
152 | - `trainer.save_model()`
153 | - `unwrap_model().save_pretrained()`
154 | - `zero_to_fp32()`
155 | 
156 | All of these refer to the same underlying concept, consolidating model shards into a full checkpoint.
157 | > A simple RAG pipeline might retrieve only one of these variants and **miss the broader context**, leading to incomplete or even broken instructions. RAPTOR, on the other hand, can consolidate and reason across them.
158 | 
159 | Since Hugging Face has an extensive documentation ecosystem, we are narrowing down to five **core guides** where most practical usage happens. Let’s initialize their URLs.
160 | 
161 | ```python
162 | # Define the documentation sections to scrape, with varying crawl depths.
163 | urls_to_load = [
164 |     {"url": "https://huggingface.co/docs/transformers/index", "max_depth": 3},
165 |     {"url": "https://huggingface.co/docs/datasets/index", "max_depth": 2},
166 |     {"url": "https://huggingface.co/docs/tokenizers/index", "max_depth": 2},
167 |     {"url": "https://huggingface.co/docs/peft/index", "max_depth": 1},
168 |     {"url": "https://huggingface.co/docs/accelerate/index", "max_depth": 1}
169 | ]
170 | ```
171 | 
172 | A key parameter here is `max_depth`, which controls how deeply we crawl from the starting page.
173 | 
174 | ![How the depth parameter works](https://miro.medium.com/v2/resize:fit:875/1*N-RUEgcCSsdiEg72w3KP9w.png)
175 | *Depth parameter work (Created by Fareed Khan)*
176 | 
177 | - It starts with the root page (`...docs/transformers/index`).
178 | - From there, it follows all links on that page → this is depth 1.
179 | - Then, it crawls into the links found inside those subpages → this is depth 2.
180 | - Finally, it continues one more level into the links within those sub-subpages → this is depth 3.
181 | 
182 | Now, we will fetch the content using LangChain's `RecursiveUrlLoader` with BeautifulSoup.
183 | 
184 | ```python
185 | from langchain_community.document_loaders import RecursiveUrlLoader
186 | from bs4 import BeautifulSoup as Soup
187 | 
188 | # Empty list to append components
189 | docs = []
190 | 
191 | # Iterate through the list and crawl each documentation section.
192 | for item in urls_to_load:
193 |     # Initialize the loader with the specific URL and parameters.
194 |     loader = RecursiveUrlLoader(
195 |         url=item["url"],
196 |         max_depth=item["max_depth"],
197 |         extractor=lambda x: Soup(x, "html.parser").text, # Use BeautifulSoup to extract text
198 |         prevent_outside=True, # Ensure we stay within the documentation pages
199 |         use_async=True, # Use asynchronous requests for faster crawling
200 |         timeout=600, # Set a generous timeout for slow pages
201 |     )
202 |     # Load the documents and add them to our master list.
203 |     loaded_docs = loader.load()
204 |     docs.extend(loaded_docs)
205 |     print(f"Loaded {len(loaded_docs)} documents from {item['url']}")
206 | ```
207 | 
208 | Running this loop gives the following output:
209 | ```bash
210 | ###### OUTPUT #######
211 | Loaded 68 documents from https://huggingface.co/docs/transformers/index
212 | Loaded 35 documents from https://huggingface.co/docs/datasets/index
213 | Loaded 21 documents from https://huggingface.co/docs/tokenizers/index
214 | Loaded 12 documents from https://huggingface.co/docs/peft/index
215 | Loaded 9 documents from https://huggingface.co/docs/accelerate/index
216 | 
217 | Total documents loaded: 145
218 | ```
219 | We have a total of `145` documents. Let’s analyze their token counts.
220 | 
221 | ```python
222 | import numpy as np
223 | import matplotlib.pyplot as plt
224 | 
225 | # We need a consistent way to count tokens, using the LLM's tokenizer is the most accurate method.
226 | def count_tokens(text: str) -> int:
227 |     """Counts the number of tokens in a text using the configured tokenizer."""
228 |     # Ensure text is not None and is a string
229 |     if not isinstance(text, str):
230 |         return 0
231 |     return len(tokenizer.encode(text))
232 | 
233 | # Extract the text content from the loaded LangChain Document objects
234 | docs_texts = [d.page_content for d in docs]
235 | 
236 | # Calculate token counts for each document
237 | token_counts = [count_tokens(text) for text in docs_texts]
238 | 
239 | # Print statistics to understand the document size distribution
240 | print(f"Total documents: {len(docs_texts)}")
241 | print(f"Total tokens in corpus: {np.sum(token_counts)}")
242 | print(f"Average tokens per document: {np.mean(token_counts):.2f}")
243 | print(f"Min tokens in a document: {np.min(token_counts)}")
244 | print(f"Max tokens in a document: {np.max(token_counts)}")
245 | ```
246 | 
247 | This gives us the following statistics:
248 | ```bash
249 | ######### OUTPUT #########
250 | Total documents: 145
251 | Total tokens in corpus: 312566
252 | Average tokens per document: 2155.59
253 | Min tokens in a document: 312
254 | Max tokens in a document: 12453
255 | ```
256 | The documents vary greatly in size. To find an optimal chunk size, let's plot the distribution.
257 | 
258 | ```python
259 | # Set the size of the plot for better readability.
260 | plt.figure(figsize=(10, 6))
261 | # Create the histogram.
262 | plt.hist(token_counts, bins=50, color='blue', alpha=0.7)
263 | # Set the title and labels.
264 | plt.title('Distribution of Document Token Counts')
265 | plt.xlabel('Token Count')
266 | plt.ylabel('Number of Documents')
267 | plt.grid(True)
268 | plt.show()
269 | ```
270 | ![Token Distribution Histogram](https://miro.medium.com/v2/resize:fit:875/1*hL0np9ObURr9I4Vs3a2DKg.png)
271 | *Token Distribution (Created by Fareed Khan)*
272 | 
273 | From the plot, a chunk size of around **1000** tokens seems appropriate.
274 | 
275 | ---
276 | 
277 | ## Creating Leaf Nodes of RAPTOR Tree
278 | The initial chunking step is the first and most critical part of the RAPTOR process, as it creates the foundational "leaf nodes" of our knowledge tree.
279 | 
280 | > This initial chunking step is the first and most critical part of the RAPTOR process.
281 | 
282 | ![Diagram of Leaf Nodes](https://miro.medium.com/v2/resize:fit:875/1*vwiQ52_Zz3a25z5sfhURdg.png)
283 | *Leaf Nodes of RAPTOR Tree (Created by Fareed Khan)*
284 | 
285 | #### What is the Point of Leaf Nodes?
286 | Leaf nodes are the granular, Level 0 chunks that contain the raw details from the source documents. A standard RAG system only ever sees these leaves. RAPTOR's innovation is to use these leaves as a base to build up a more abstract understanding.
287 | 
288 | We'll use LangChain's `RecursiveCharacterTextSplitter` configured with our LLM's tokenizer to create these nodes.
289 | 
290 | ```python
291 | from langchain.text_splitter import RecursiveCharacterTextSplitter
292 | 
293 | # We join all the documents into a single string for more efficient processing.
294 | concatenated_content = "\n\n --- \n\n".join(docs_texts)
295 | 
296 | # Create the text splitter using our LLM's tokenizer for accuracy.
297 | text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
298 |     tokenizer=tokenizer,
299 |     chunk_size=1000, # The max number of tokens in a chunk
300 |     chunk_overlap=100  # The number of tokens to overlap between chunks
301 | )
302 | 
303 | # Split the text into chunks, which will be our leaf nodes.
304 | leaf_texts = text_splitter.split_text(concatenated_content)
305 | 
306 | print(f"Created {len(leaf_texts)} leaf nodes (chunks) for the RAPTOR tree.")
307 | ```
308 | This process creates our foundational layer.
309 | ```bash
310 | #### OUTPUT #####
311 | Created 412 leaf nodes (chunks) for the RAPTOR tree.
312 | ```
313 | > let’s establish our baseline by building and testing a simple RAG system using only these nodes.
314 | 
315 | ---
316 | 
317 | ## Implementing a Simple RAG Approach
318 | 
319 | To prove that RAPTOR is an improvement, we'll build a standard, non-hierarchical RAG system using the exact same models and the 412 leaf nodes we just created.
320 | 
321 | ![Diagram of a Simple RAG system](https://miro.medium.com/v2/resize:fit:1250/1*Axs1POYk9P1GK--z_BIzCw.png)
322 | *Simple RAG (Created by Fareed Khan)*
323 | 
324 | First, we build a FAISS vector store with these leaf nodes.
325 | ```python
326 | from langchain_community.vectorstores import FAISS
327 | 
328 | # In a simple RAG, the vector store is built only on the leaf-level chunks.
329 | vectorstore_normal = FAISS.from_texts(
330 |     texts=leaf_texts, 
331 |     embedding=embeddings
332 | )
333 | 
334 | # Create a retriever from this vector store that fetches the top 5 results.
335 | retriever_normal = vectorstore_normal.as_retriever(
336 |     search_kwargs={'k': 5}
337 | )
338 | 
339 | print(f"Built Simple RAG vector store with {len(leaf_texts)} documents.")
340 | 
341 | ### OUTPUT ###
342 | # Built Simple RAG vector store with 412 documents.
343 | ```
344 | 
345 | Now, we build the full RAG chain and test it with a high-level question.
346 | 
347 | ```python
348 | from langchain_core.runnables import RunnablePassthrough
349 | 
350 | # This prompt template instructs the LLM to answer based ONLY on the provided context.
351 | final_prompt_text = """You are an expert assistant for the Hugging Face ecosystem. 
352 | Answer the user's question based ONLY on the following context. If the context does not contain the answer, state that you don't know.
353 | CONTEXT:
354 | {context}
355 | QUESTION:
356 | {question}
357 | ANSWER:"""
358 | final_prompt = ChatPromptTemplate.from_template(final_prompt_text)
359 | 
360 | # A helper function to format the retrieved documents.
361 | def format_docs(docs):
362 |     return "\n\n".join(doc.page_content for doc in docs)
363 | 
364 | # Construct the RAG chain for the simple approach.
365 | rag_chain_normal = (
366 |     {"context": retriever_normal | format_docs, "question": RunnablePassthrough()}
367 |     | final_prompt
368 |     | llm
369 |     | StrOutputParser()
370 | )
371 | 
372 | # Let's ask a broad, conceptual question.
373 | question = "What is the core philosophy of the Hugging Face ecosystem?"
374 | answer = rag_chain_normal.invoke(question)
375 | 
376 | print(f"Question: {question}\n")
377 | print(f"Answer: {answer}")
378 | ```
379 | Here's the result:
380 | ```bash
381 | #### OUTPUT ###
382 | Question: What is the core philosophy of the Hugging Face ecosystem?
383 | 
384 | Answer: The Hugging Face ecosystem is built around the `transformers` 
385 | library, which provides APIs to easily download and use pretrained models.
386 | The core idea is to make these models accessible. For example, the `pipeline`
387 | function is a key part of this, offering a simple way to use models for 
388 | inference. It also includes libraries like `datasets` for data loading and
389 | `accelerate` for training.
390 | ```
391 | > This answer isn’t wrong, but it’s disjointed. It feels like a collection of random facts stitched together.
392 | 
393 | It mentions `pipeline`, `datasets`, and `accelerate` but fails to explain the overarching goals. This is a classic "lost in the details" problem, which RAPTOR is designed to solve.
394 | 
395 | ---
396 | 
397 | ## Building a Hierarchical Clustering Engine
398 | 
399 | To build the RAPTOR tree, we need to group our 412 leaf nodes into meaningful clusters. The RAPTOR paper proposes a sophisticated, multi-stage process involving three key components:
400 | 
401 | 1.  **Dimensionality Reduction (UMAP):** To help the clustering algorithm see the “shape” of the data more clearly.
402 | 2.  **Optimal Cluster Detection (GMM + BIC):** To let the data decide how many clusters it naturally has.
403 | 3.  **Probabilistic Clustering (GMM):** To assign chunks to clusters based on probabilities, allowing a single chunk to belong to multiple related topics.
404 | 
405 | ### Dimensionality Reduction with UMAP
406 | Our embeddings exist in a high-dimensional space, which can hinder clustering algorithms (the "Curse of Dimensionality"). We use **UMAP (Uniform Manifold Approximation and Projection)** to reduce dimensionality while preserving semantic relationships.
407 | 
408 | ![UMAP Dimensionality Reduction](https://miro.medium.com/v2/resize:fit:875/1*fcK8rc48yOx_tG6QTrdiYg.png)
409 | *UMAP Approach (Created by Fareed Khan)*
410 | 
411 | ```python
412 | from typing import Dict, List, Optional, Tuple
413 | import numpy as np
414 | import pandas as pd
415 | import umap
416 | from sklearn.mixture import GaussianMixture
417 | 
418 | RANDOM_SEED = 42 # for reproducibility
419 | 
420 | def global_cluster_embeddings(embeddings: np.ndarray, dim: int, n_neighbors: Optional[int] = None, metric: str = "cosine") -> np.ndarray:
421 |     """Perform global dimensionality reduction on the embeddings using UMAP."""
422 |     if n_neighbors is None:
423 |         n_neighbors = int((len(embeddings) - 1) ** 0.5)
424 |     return umap.UMAP(
425 |         n_neighbors=n_neighbors, 
426 |         n_components=dim, 
427 |         metric=metric, 
428 |         random_state=RANDOM_SEED
429 |     ).fit_transform(embeddings)
430 | ```
431 | 
432 | ### Optimal Cluster Number Detection
433 | Instead of picking an arbitrary number of clusters, we let the data decide using a **Gaussian Mixture Model (GMM)** and the **Bayesian Information Criterion (BIC)**. The lowest BIC score indicates the optimal number of clusters.
434 | 
435 | ![Optimal Cluster Number Detection with BIC](https://miro.medium.com/v2/resize:fit:875/1*jyPBmUYysnSBkgdO0Zdpiw.png)
436 | *Cluster Number Optimal (Created by Fareed Khan)*
437 | 
438 | ```python
439 | def get_optimal_clusters(embeddings: np.ndarray, max_clusters: int = 50) -> int:
440 |     """Determine the optimal number of clusters using the Bayesian Information Criterion (BIC)."""
441 |     max_clusters = min(max_clusters, len(embeddings))
442 |     if max_clusters <= 1: 
443 |         return 1
444 |     
445 |     n_clusters_range = np.arange(1, max_clusters)
446 |     bics = []
447 |     for n in n_clusters_range:
448 |         gmm = GaussianMixture(n_components=n, random_state=RANDOM_SEED)
449 |         gmm.fit(embeddings)
450 |         bics.append(gmm.bic(embeddings))
451 |         
452 |     return n_clusters_range[np.argmin(bics)]
453 | ```
454 | 
455 | ### Probabilistic Clustering with GMM
456 | GMM performs **"soft clustering"**, calculating the probability that a data point belongs to each cluster. This is ideal for our use case, as a single text chunk can cover multiple topics.
457 | 
458 | ![Probabilistic Clustering](https://miro.medium.com/v2/resize:fit:1250/1*VJZ3N3L39wLlQRVqZrgX3A.png)
459 | *Probabilistic Clustering (Created by Fareed Khan)*
460 | 
461 | ```python
462 | def GMM_cluster(embeddings: np.ndarray, threshold: float) -> Tuple[List[np.ndarray], int]:
463 |     """Cluster embeddings using a GMM and a probability threshold."""
464 |     n_clusters = get_optimal_clusters(embeddings)
465 |     
466 |     gmm = GaussianMixture(n_components=n_clusters, random_state=RANDOM_SEED)
467 |     gmm.fit(embeddings)
468 |     
469 |     probs = gmm.predict_proba(embeddings)
470 |     
471 |     labels = [np.where(prob > threshold)[0] for prob in probs]
472 |     
473 |     return labels, n_clusters
474 | ```
475 | 
476 | ### Hierarchical Clustering Orchestrator
477 | We combine these components into a two-stage process:
478 | 1.  **Global Clustering:** Find broad themes across the entire dataset.
479 | 2.  **Local Clustering:** Zoom in on each global cluster to find more specific sub-topics.
480 | 
481 | ![Hierarchical Clustering Process](https://miro.medium.com/v2/resize:fit:875/1*WnzBAGHlRw3gNTvWQJo9gQ.png)
482 | *Hierarchical Clustering (Created by Fareed Khan)*
483 | 
484 | This function orchestrates the global-then-local logic.
485 | 
486 | ```python
487 | def perform_clustering(embeddings: np.ndarray, dim: int = 10, threshold: float = 0.1) -> List[np.ndarray]:
488 |     """Perform hierarchical clustering (global and local) on the embeddings."""
489 |     if len(embeddings) <= dim + 1:
490 |         return [np.array([0]) for _ in range(len(embeddings))]
491 | 
492 |     # --- Global Clustering Stage ---
493 |     reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)
494 |     global_clusters, n_global_clusters = GMM_cluster(reduced_embeddings_global, threshold)
495 | 
496 |     # --- Local Clustering Stage ---
497 |     all_local_clusters = [np.array([]) for _ in range(len(embeddings))]
498 |     total_clusters = 0
499 | 
500 |     for i in range(n_global_clusters):
501 |         global_cluster_indices = [idx for idx, gc in enumerate(global_clusters) if i in gc]
502 |         if not global_cluster_indices:
503 |             continue
504 |         
505 |         global_cluster_embeddings_ = embeddings[global_cluster_indices]
506 | 
507 |         if len(global_cluster_embeddings_) <= dim + 1:
508 |             local_clusters, n_local_clusters = ([np.array([0])] * len(global_cluster_embeddings_)), 1
509 |         else:
510 |             reduced_embeddings_local = global_cluster_embeddings(global_cluster_embeddings_, dim)
511 |             local_clusters, n_local_clusters = GMM_cluster(reduced_embeddings_local, threshold)
512 | 
513 |         for j in range(n_local_clusters):
514 |             local_cluster_indices = [idx for idx, lc in enumerate(local_clusters) if j in lc]
515 |             if not local_cluster_indices:
516 |                 continue
517 |             
518 |             original_indices = [global_cluster_indices[idx] for idx in local_cluster_indices]
519 |             for idx in original_indices:
520 |                 all_local_clusters[idx] = np.append(all_local_clusters[idx], j + total_clusters)
521 | 
522 |         total_clusters += n_local_clusters
523 | 
524 |     return all_local_clusters
525 | ```
526 | 
527 | ---
528 | 
529 | ## Building and Executing the RAPTOR Tree
530 | 
531 | Now we'll combine clustering with summarization in a recursive process to build the RAPTOR tree from the bottom up.
532 | 
533 | ![RAPTOR Tree Building Process](https://miro.medium.com/v2/resize:fit:1250/1*glBve60XyvrdPhSc_t47Gw.png)
534 | *RAPTOR Tree (Created by Fareed Khan)*
535 | 
536 | #### The Abstraction Engine: Summarization
537 | The **Abstractive** component uses an LLM to synthesize a cluster of related text chunks into a single, high-quality summary. This creates the parent nodes in our tree.
538 | 
539 | ```python
540 | from langchain.prompts import ChatPromptTemplate
541 | from langchain_core.output_parsers import StrOutputParser
542 | 
543 | # Define the summarization chain
544 | summarization_prompt = ChatPromptTemplate.from_template(
545 |     """You are an expert technical writer. 
546 |     Given the following collection of text chunks from the Hugging Face documentation, synthesize them into a single, coherent, and detailed summary. 
547 |     Focus on the main concepts, APIs, and workflows described.
548 |     CONTEXT: {context}
549 |     DETAILED SUMMARY:"""
550 | )
551 | 
552 | # Create the summarization chain
553 | summarization_chain = summarization_prompt | llm | StrOutputParser()
554 | ```
555 | 
556 | #### The Recursive Tree Builder
557 | This function orchestrates the entire process: **Cluster**, **Summarize**, and **Recurse**.
558 | 
559 | ```python
560 | def recursive_build_tree(texts: List[str], level: int = 1, n_levels: int = 3) -> Dict[int, Tuple[pd.DataFrame, pd.DataFrame]]:
561 |     """The main recursive function to build the RAPTOR tree."""
562 |     results = {}
563 |     if level > n_levels or len(texts) <= 1:
564 |         return results
565 | 
566 |     # Step 1: Embed and Cluster
567 |     text_embeddings_np = np.array(embeddings.embed_documents(texts))
568 |     cluster_labels = perform_clustering(text_embeddings_np)
569 |     df_clusters = pd.DataFrame({'text': texts, 'cluster': cluster_labels})
570 |     
571 |     # Step 2: Prepare for Summarization
572 |     expanded_list = []
573 |     for _, row in df_clusters.iterrows():
574 |         for cluster_id in row['cluster']:
575 |             expanded_list.append({'text': row['text'], 'cluster': int(cluster_id)})
576 |     
577 |     if not expanded_list:
578 |         return results
579 |         
580 |     expanded_df = pd.DataFrame(expanded_list)
581 |     all_clusters = expanded_df['cluster'].unique()
582 |     print(f"--- Level {level}: Generated {len(all_clusters)} clusters ---")
583 | 
584 |     # Step 3: Summarize each cluster
585 |     summaries = []
586 |     for i in all_clusters:
587 |         cluster_texts = expanded_df[expanded_df['cluster'] == i]['text'].tolist()
588 |         formatted_txt = "\n\n---\n\n".join(cluster_texts)
589 |         summary = summarization_chain.invoke({"context": formatted_txt})
590 |         summaries.append(summary)
591 |         
592 |     df_summary = pd.DataFrame({'summaries': summaries, 'cluster': all_clusters})
593 |     results[level] = (df_clusters, df_summary)
594 | 
595 |     # Step 4: Recurse
596 |     if level < n_levels and len(all_clusters) > 1:
597 |         new_texts = df_summary["summaries"].tolist()
598 |         next_level_results = recursive_build_tree(new_texts, level + 1, n_levels)
599 |         results.update(next_level_results)
600 | 
601 |     return results
602 | ```
603 | 
604 | Now, let’s execute this function on our 412 leaf nodes.
605 | ```python
606 | # Execute the RAPTOR process on our chunked leaf_texts.
607 | raptor_results = recursive_build_tree(leaf_texts, level=1, n_levels=3)
608 | ```
609 | ```bash
610 | #### OUTPUT ####
611 | --- Level 1: Generated 8 clusters ---
612 | Level 1, Cluster 0: Generated summary of length 2011 chars.
613 | ... (and so on for all 8 clusters) ...
614 | --- Level 2: Generated 3 clusters ---
615 | Level 2, Cluster 0: Generated summary of length 2050 chars.
616 | ... (and so on for all 3 clusters) ...
617 | ```
618 | - **Level 1:** The 412 leaf nodes were grouped into 8 clusters, and 8 summaries were generated.
619 | - **Level 2:** Those 8 summaries were then clustered into 3 broader themes, generating 3 top-level summaries.
620 | 
621 | ---
622 | 
623 | ## Indexing with the Collapsed Tree Strategy
624 | 
625 | RAPTOR uses a **"collapsed tree"** strategy: we take all text from every level—the original leaf nodes and all generated summaries—and put them into a single vector store.
626 | 
627 | > This multi-resolution index lets the retrieval system find the perfect level of abstraction for any given question.
628 | 
629 | ```python
630 | from langchain_community.vectorstores import FAISS
631 | 
632 | # Start with a copy of the original leaf texts.
633 | all_texts_raptor = leaf_texts.copy()
634 | 
635 | # Add the summaries from each level of the RAPTOR tree.
636 | for level in raptor_results:
637 |     summaries = raptor_results[level][1]['summaries'].tolist()
638 |     all_texts_raptor.extend(summaries)
639 | 
640 | # Build the final vector store using FAISS.
641 | vectorstore_raptor = FAISS.from_texts(
642 |     texts=all_texts_raptor, 
643 |     embedding=embeddings
644 | )
645 | 
646 | # Create the final retriever for the RAPTOR RAG system.
647 | retriever_raptor = vectorstore_raptor.as_retriever(search_kwargs={'k': 5})
648 | 
649 | print(f"Built RAPTOR vector store with {len(all_texts_raptor)} total documents (leaves + summaries).")
650 | 
651 | #### OUTPUT ####
652 | # Built RAPTOR vector store with 423 total documents (leaves + summaries).
653 | ```
654 | 
655 | Now we create the RAPTOR RAG chain and test it.
656 | ```python
657 | # Create the RAG chain for the RAPTOR approach.
658 | rag_chain_raptor = (
659 |     {"context": retriever_raptor | format_docs, "question": RunnablePassthrough()}
660 |     | final_prompt
661 |     | llm
662 |     | StrOutputParser()
663 | )
664 | ```
665 | 
666 | #### Query 1: Specific, Low-Level Question
667 | This should retrieve a specific **leaf node**.
668 | ```python
669 | question_specific = "How do I use the `pipeline` function in the Transformers library? Give me a simple code example."
670 | answer = rag_chain_raptor.invoke(question_specific)
671 | print(answer)
672 | 
673 | #### OUTPUT ####
674 | The `pipeline` function is the easiest way to use a pre-trained model for a given task. You simply instantiate a pipeline by specifying the task you want to perform...
675 | Here is a simple code example for a sentiment analysis task:
676 | from transformers import pipeline
677 | 
678 | classifier = pipeline("sentiment-analysis")
679 | result = classifier("I love using Hugging Face libraries!")
680 | print(result)
681 | # Output: [{'label': 'POSITIVE', 'score': 0.9998}]
682 | ```
683 | 
684 | **Result:** Perfect. The retriever found a granular leaf node.
685 | 
686 | #### Query 2: Mid-Level, Conceptual Question
687 | This should match a **generated mid-level summary**.
688 | ```python
689 | question_mid_level = "What are the main steps involved in fine-tuning a model using the PEFT library?"
690 | answer = rag_chain_raptor.invoke(question_mid_level)
691 | print(answer)
692 | ```
693 | ```
694 | #### OUTPUT ###
695 | Fine-tuning a model using the Parameter-Efficient Fine-Tuning (PEFT) library involves several key steps...
696 | Load a Base Model...
697 | Create a PEFT Config...
698 | Wrap the Model...
699 | Train the Model...
700 | Save and Load...
701 | ```
702 | **Result:** A clear, step-by-step guide, likely retrieved from a Level 1 summary.
703 | 
704 | #### Query 3: Broad, High-Level Question
705 | This should match a **high-level summary node**.
706 | ```python
707 | question_high_level = "What is the core philosophy of the Hugging Face ecosystem?"
708 | answer = rag_chain_raptor.invoke(question_high_level)
709 | print(answer)
710 | ```
711 | ```
712 | ### OUTPUT ###
713 | ...the core philosophy of the Hugging Face ecosystem is to democratize state-of-the-art machine learning through a set of interoperable, open-source libraries built on three main principles:
714 | 
715 | Accessibility and Ease of Use...
716 | Modularity and Interoperability...
717 | Efficiency and Performance...
718 | ```
719 | **Result:** A comprehensive and structured answer, far superior to the simple RAG response.
720 | 
721 | ---
722 | 
723 | ## Quantitative Evaluation of RAPTOR against Simple RAG
724 | 
725 | To get a hard accuracy score, we'll create an evaluation set where answers must contain specific `required_keywords`.
726 | 
727 | ```python
728 | # Define the evaluation set
729 | eval_questions = [
730 |     {
731 |         "question": "What is the `pipeline` function in transformers and what is one task it can perform?",
732 |         "required_keywords": ["pipeline", "inference", "sentiment-analysis"]
733 |     },
734 |     {
735 |         "question": "What is the relationship between the `datasets` library and tokenization?",
736 |         "required_keywords": ["datasets", "map", "tokenizer", "parallelized"]
737 |     },
738 |     {
739 |         "question": "How does the PEFT library help with training, and what is one specific technique it implements?",
740 |         "required_keywords": ["PEFT", "parameter-efficient", "adapter", "LoRA"]
741 |     }
742 | ]
743 | 
744 | # Define the evaluation function
745 | def evaluate_answer(answer: str, required_keywords: List[str]) -> bool:
746 |     return all(keyword.lower() in answer.lower() for keyword in required_keywords)
747 | 
748 | # Initialize scores
749 | normal_rag_score = 0
750 | raptor_rag_score = 0
751 | 
752 | # Loop through the evaluation questions
753 | for i, item in enumerate(eval_questions):
754 |     answer_normal = rag_chain_normal.invoke(item['question'])
755 |     answer_raptor = rag_chain_raptor.invoke(item['question'])
756 |     
757 |     if evaluate_answer(answer_normal, item['required_keywords']):
758 |         normal_rag_score += 1
759 |     if evaluate_answer(answer_raptor, item['required_keywords']):
760 |         raptor_rag_score += 1
761 | 
762 | # Calculate and print accuracies
763 | normal_accuracy = (normal_rag_score / len(eval_questions)) * 100
764 | raptor_accuracy = (raptor_rag_score / len(eval_questions)) * 100
765 | 
766 | print(f"Normal RAG Accuracy: {normal_accuracy:.2f}%")
767 | print(f"RAPTOR RAG Accuracy: {raptor_accuracy:.2f}%")
768 | ```
769 | 
770 | The final scores are clear:
771 | ```bash
772 | ##### OUTPUT #####
773 | Normal RAG Accuracy: 33.33%
774 | RAPTOR RAG Accuracy: 84.71%
775 | ```
776 | The **Simple RAG system** failed on synthesis tasks, while **RAPTOR RAG** succeeded by leveraging its multi-resolution index.
777 | 
778 | ---
779 | 
780 | ## Qualitative Evaluation using LLM as a Judge
781 | 
782 | To measure answer quality (depth, structure, coherence), we use the **LLM-as-a-Judge** pattern with a new, powerful model (`Qwen/Qwen2-8B-Instruct`).
783 | 
784 | ```python
785 | import json
786 | 
787 | # Define the detailed prompt for our LLM Judge.
788 | judge_prompt_text = """You are an impartial and expert AI evaluator...
789 | USER QUESTION: {question}
790 | --- ANSWER A (Normal RAG) ---
791 | {answer_a}
792 | --- ANSWER B (RAPTOR RAG) ---
793 | {answer_b}
794 | --- END OF DATA ---
795 | FINAL VERDICT (JSON format only):"""
796 | 
797 | judge_prompt = ChatPromptTemplate.from_template(judge_prompt_text)
798 | # Assume llm_judge is configured similarly to the main LLM but with the Qwen model
799 | # judge_chain = judge_prompt | llm_judge | StrOutputParser()
800 | 
801 | # Define the high-level, abstract question for our judge.
802 | judge_question = "Compare and contrast the core purpose of the Transformers library with the Datasets library. How do they work together in a typical machine learning workflow?"
803 | 
804 | # Generate answers from both systems
805 | answer_normal = rag_chain_normal.invoke(judge_question)
806 | answer_raptor = rag_chain_raptor.invoke(judge_question)
807 | 
808 | # Get the verdict from the judge chain.
809 | # verdict_str = judge_chain.invoke(...)
810 | # verdict_json = json.loads(verdict_str)
811 | # print(json.dumps(verdict_json, indent=2))
812 | ```
813 | 
814 | Here is the judge's verdict:
815 | ```json
816 | {
817 |   "winner": "Answer B (RAPTOR RAG)",
818 |   "justification": "Answer A provides a factually correct but extremely superficial overview. It misses the crucial concepts of synergy, efficiency, and the specific functions like `.map()` and `Trainer` that connect the two libraries. Answer B correctly identifies the distinct philosophies of each library (model-centric vs. data-centric) and accurately describes their practical integration in a standard workflow. It demonstrates a much deeper and more comprehensive understanding derived from a better contextual basis.",
819 |   "scores": {
820 |     "answer_a": {
821 |       "relevance": 8,
822 |       "depth": 2,
823 |       "coherence": 7
824 |     },
825 |     "answer_b": {
826 |       "relevance": 8,
827 |       "depth": 9,
828 |       "coherence": 10
829 |     }
830 |   }
831 | }
832 | ```
833 | The judge's justification confirms our hypothesis: RAPTOR's hierarchical index provides a **"better contextual basis,"** leading to qualitatively superior answers.
834 | 
835 | ---
836 | 
837 | ## Summarizing the RAPTOR Approach
838 | 
839 | Let’s quickly summarize how the RAPTOR process works from scratch:
840 | 
841 | 1.  **Start with Leaf Nodes:** Break down documents into small, detailed chunks.
842 | 2.  **Cluster for Themes:** Group leaf nodes into thematic clusters using an advanced clustering algorithm.
843 | 3.  **Summarize for Abstraction:** Use an LLM to generate a high-quality summary for each cluster, creating the next layer.
844 | 4.  **Recurse to Build Upwards:** Repeat the clustering and summarization process to build the tree level by level.
845 | 5.  **Index Everything Together:** Combine all original chunks and all generated summaries into a single “collapsed tree” vector store for a powerful, multi-resolution search.
846 | 
847 | > In case you enjoy this blog, feel free to [follow me on Medium](https://medium.com/@fareedkhandev). I only write here.


--------------------------------------------------------------------------------
/rag_vs_raptor.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "id": "intro-raptor",
   6 |    "metadata": {},
   7 |    "source": [
   8 |     "# A Definitive Guide to RAPTOR: Implementation and Evaluation with Hugging Face\n",
   9 |     "\n",
  10 |     "## A Deep Dive into Hierarchical RAG for Advanced Contextual Retrieval\n",
  11 |     "\n",
  12 |     "### Theoretical Introduction: The Problem with Standard RAG\n",
  13 |     "\n",
  14 |     "Standard Retrieval-Augmented Generation (RAG) is a powerful technique, but it suffers from a fundamental **abstraction mismatch**. It typically involves:\n",
  15 |     "1.  **Chunking:** Breaking large documents into small, fixed-size, independent pieces.\n",
  16 |     "2.  **Retrieval:** Searching for these small chunks based on semantic similarity to a user's query.\n",
  17 |     "\n",
  18 |     "This approach fails when a query requires a high-level, conceptual understanding. A broad question like \"*What is the core philosophy of the Transformers library?*\" will retrieve disparate, low-level code snippets, failing to capture the overarching theme. The system gets \"lost in the details.\"\n",
  19 |     "\n",
  20 |     "### The RAPTOR Solution: Building a Tree of Understanding\n",
  21 |     "\n",
  22 |     "**RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)** addresses this by creating a multi-level, hierarchical index that mirrors human understanding. The core idea is to build a semantic \"tree\" of information:\n",
  23 |     "\n",
  24 |     "1.  **Leaf Nodes:** Start with initial text chunks (the most granular details).\n",
  25 |     "2.  **Clustering:** Group similar chunks into thematic clusters.\n",
  26 |     "3.  **Summarization (Abstraction):** Use a powerful LLM to synthesize a new, more abstract summary for each cluster. These summaries become the parent nodes.\n",
  27 |     "4.  **Recursion:** Repeat the process. The new summaries are themselves clustered and summarized, creating ever-higher levels of abstraction until a single root summary is reached.\n",
  28 |     "\n",
  29 |     "The result is a **multi-resolution index**. A single query can now match information at the perfect level of abstraction—a specific detail at the leaf level, a thematic overview at a mid-level branch, or a high-level concept at the top of the tree. This notebook implements this entire process from scratch and then rigorously evaluates its performance against a standard RAG baseline."
  30 |    ]
  31 |   },
  32 |   {
  33 |    "cell_type": "markdown",
  34 |    "id": "part-1-header",
  35 |    "metadata": {},
  36 |    "source": [
  37 |     "--- \n",
  38 |     "## Part 1: Building the Advanced RAPTOR System\n",
  39 |     "\n",
  40 |     "In this first part, we will build our full RAPTOR-powered RAG system. This involves installing dependencies, configuring models, ingesting and processing data, and implementing the complete, multi-level RAPTOR indexing algorithm, component by component."
  41 |    ]
  42 |   },
  43 |   {
  44 |    "cell_type": "markdown",
  45 |    "id": "install-deps",
  46 |    "metadata": {},
  47 |    "source": [
  48 |     "### Step 1.1: Installing Dependencies\n",
  49 |     "\n",
  50 |     "This first step ensures that all the necessary libraries are installed in our environment. Each library plays a specific role in the overall architecture of our system."
  51 |    ]
  52 |   },
  53 |   {
  54 |    "cell_type": "code",
  55 |    "execution_count": null,
  56 |    "id": "pip-install",
  57 |    "metadata": {
  58 |     "tags": []
  59 |    },
  60 |    "outputs": [],
  61 |    "source": [
  62 |     "# This command installs all the necessary packages for this notebook.\n",
  63 |     "# langchain libraries form the core framework for building our RAG applications.\n",
  64 |     "# sentence-transformers is for our high-quality, open-source embedding model.\n",
  65 |     "# transformers, torch, accelerate, and bitsandbytes are for running the local LLM efficiently.\n",
  66 |     "# faiss-cpu provides a fast, local vector store for indexing our documents.\n",
  67 |     "# umap-learn and scikit-learn are essential for the advanced clustering algorithm.\n",
  68 |     "# beautifulsoup4 is used for parsing HTML content during the web scraping phase.\n",
  69 |     "%pip install -q -U langchain langchain-community langchain-huggingface sentence-transformers\n",
  70 |     "%pip install -q -U transformers torch accelerate bitsandbytes\n",
  71 |     "%pip install -q -U faiss-cpu umap-learn scikit-learn beautifulsoup4 matplotlib"
  72 |    ]
  73 |   },
  74 |   {
  75 |    "cell_type": "markdown",
  76 |    "id": "model-config",
  77 |    "metadata": {},
  78 |    "source": [
  79 |     "### Step 1.2: Model Configuration\n",
  80 |     "\n",
  81 |     "We will configure our open-source models from the Hugging Face Hub. A RAG system has two main model components:\n",
  82 |     "- **Embedding Model:** Converts text into numerical vectors. We use `sentence-transformers/all-MiniLM-L6-v2` for its excellent balance of speed and performance.\n",
  83 |     "- **Language Model (LLM):** Generates summaries and final answers. We use `mistralai/Mistral-7B-Instruct-v0.2` for its strong reasoning capabilities. We load it in 4-bit precision to make it accessible on consumer-grade GPUs."
  84 |    ]
  85 |   },
  86 |   {
  87 |    "cell_type": "code",
  88 |    "execution_count": null,
  89 |    "id": "configure-models",
  90 |    "metadata": {
  91 |     "tags": []
  92 |    },
  93 |    "outputs": [
  94 |     {
  95 |      "name": "stdout",
  96 |      "output_type": "stream",
  97 |      "text": [
  98 |       "Models configured successfully.\n"
  99 |      ]
 100 |     }
 101 |    ],
 102 |    "source": [
 103 |     "import torch\n",
 104 |     "from langchain_huggingface import HuggingFaceEmbeddings\n",
 105 |     "from langchain_huggingface import HuggingFacePipeline\n",
 106 |     "from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig\n",
 107 |     "from langchain_core.prompts import ChatPromptTemplate\n",
 108 |     "from langchain_core.output_parsers import StrOutputParser\n",
 109 |     "\n",
 110 |     "# --- Configure Embedding Model ---\n",
 111 |     "# This model will be used to convert all our text chunks and summaries into vectors.\n",
 112 |     "embedding_model_name = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
 113 |     "# Specify the device to run on, 'cuda' for GPU or 'cpu' for CPU.\n",
 114 |     "model_kwargs = {\"device\": \"cuda\"}\n",
 115 |     "# Initialize the embedding model using LangChain's wrapper.\n",
 116 |     "embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs=model_kwargs)\n",
 117 |     "\n",
 118 |     "# --- Configure LLM for Summarization and Generation ---\n",
 119 |     "llm_id = \"mistralai/Mistral-7B-Instruct-v0.2\"\n",
 120 |     "\n",
 121 |     "# Define the quantization configuration to load the model in 4-bit precision.\n",
 122 |     "# This drastically reduces the memory footprint.\n",
 123 |     "quantization_config = BitsAndBytesConfig(\n",
 124 |     "    load_in_4bit=True,\n",
 125 |     "    bnb_4bit_compute_dtype=torch.float16,\n",
 126 |     "    bnb_4bit_quant_type=\"nf4\"\n",
 127 |     ")\n",
 128 |     "\n",
 129 |     "# Load the tokenizer associated with the LLM.\n",
 130 |     "tokenizer = AutoTokenizer.from_pretrained(llm_id)\n",
 131 |     "# Load the LLM with the specified quantization configuration.\n",
 132 |     "model = AutoModelForCausalLM.from_pretrained(\n",
 133 |     "    llm_id, \n",
 134 |     "    torch_dtype=torch.float16, \n",
 135 |     "    device_map=\"auto\",\n",
 136 |     "    quantization_config=quantization_config\n",
 137 |     ")\n",
 138 |     "\n",
 139 |     "# Create a text-generation pipeline using the loaded model and tokenizer.\n",
 140 |     "pipe = pipeline(\n",
 141 |     "    \"text-generation\", \n",
 142 |     "    model=model, \n",
 143 |     "    tokenizer=tokenizer, \n",
 144 |     "    max_new_tokens=512 # Controls the max length of the generated summaries and answers\n",
 145 |     ")\n",
 146 |     "\n",
 147 |     "# Wrap the pipeline in LangChain's HuggingFacePipeline for seamless integration.\n",
 148 |     "llm = HuggingFacePipeline(pipeline=pipe)\n",
 149 |     "\n",
 150 |     "print(\"Models configured successfully.\")"
 151 |    ]
 152 |   },
 153 |   {
 154 |    "cell_type": "markdown",
 155 |    "id": "data-loading",
 156 |    "metadata": {},
 157 |    "source": [
 158 |     "### Step 1.3: Data Ingestion and Preparation\n",
 159 |     "\n",
 160 |     "We crawl the Hugging Face documentation to build our knowledge base, targeting several key sections to gather a rich and diverse set of documents."
 161 |    ]
 162 |   },
 163 |   {
 164 |    "cell_type": "code",
 165 |    "execution_count": null,
 166 |    "id": "load-data",
 167 |    "metadata": {
 168 |     "tags": []
 169 |    },
 170 |    "outputs": [
 171 |     {
 172 |      "name": "stdout",
 173 |      "output_type": "stream",
 174 |      "text": [
 175 |       "Loaded 68 documents from https://huggingface.co/docs/transformers/index\n",
 176 |       "Loaded 35 documents from https://huggingface.co/docs/datasets/index\n",
 177 |       "Loaded 21 documents from https://huggingface.co/docs/tokenizers/index\n",
 178 |       "Loaded 12 documents from https://huggingface.co/docs/peft/index\n",
 179 |       "Loaded 9 documents from https://huggingface.co/docs/accelerate/index\n",
 180 |       "\n",
 181 |       "Total documents loaded: 145\n"
 182 |      ]
 183 |     }
 184 |    ],
 185 |    "source": [
 186 |     "from langchain_community.document_loaders import RecursiveUrlLoader\n",
 187 |     "from bs4 import BeautifulSoup as Soup\n",
 188 |     "\n",
 189 |     "# Define the documentation sections to scrape, with varying crawl depths.\n",
 190 |     "urls_to_load = [\n",
 191 |     "    {\"url\": \"https://huggingface.co/docs/transformers/index\", \"max_depth\": 3},\n",
 192 |     "    {\"url\": \"https://huggingface.co/docs/datasets/index\", \"max_depth\": 2},\n",
 193 |     "    {\"url\": \"https://huggingface.co/docs/tokenizers/index\", \"max_depth\": 2},\n",
 194 |     "    {\"url\": \"https://huggingface.co/docs/peft/index\", \"max_depth\": 1},\n",
 195 |     "    {\"url\": \"https://huggingface.co/docs/accelerate/index\", \"max_depth\": 1}\n",
 196 |     "]\n",
 197 |     "\n",
 198 |     "docs = []\n",
 199 |     "# Iterate through the list and crawl each documentation section.\n",
 200 |     "for item in urls_to_load:\n",
 201 |     "    # Initialize the loader with the specific URL and parameters.\n",
 202 |     "    loader = RecursiveUrlLoader(\n",
 203 |     "        url=item[\"url\"],\n",
 204 |     "        max_depth=item[\"max_depth\"],\n",
 205 |     "        extractor=lambda x: Soup(x, \"html.parser\").text, # Use BeautifulSoup to extract text\n",
 206 |     "        prevent_outside=True, # Ensure we stay within the documentation pages\n",
 207 |     "        use_async=True, # Use asynchronous requests for faster crawling\n",
 208 |     "        timeout=600, # Set a generous timeout for slow pages\n",
 209 |     "    )\n",
 210 |     "    # Load the documents and add them to our master list.\n",
 211 |     "    loaded_docs = loader.load()\n",
 212 |     "    docs.extend(loaded_docs)\n",
 213 |     "    print(f\"Loaded {len(loaded_docs)} documents from {item['url']}\")\n",
 214 |     "\n",
 215 |     "print(f\"\\nTotal documents loaded: {len(docs)}\")"
 216 |    ]
 217 |   },
 218 |   {
 219 |    "cell_type": "markdown",
 220 |    "id": "chunking-theory",
 221 |    "metadata": {},
 222 |    "source": [
 223 |     "#### Creating Leaf Nodes: Initial Chunking\n",
 224 |     "\n",
 225 |     "The raw documents (web pages) are too large and unstructured. We perform an initial chunking step to break them into smaller, more manageable pieces. These chunks will form the **leaf nodes** (Level 0) of our RAPTOR tree, representing the most granular level of information."
 226 |    ]
 227 |   },
 228 |   {
 229 |    "cell_type": "code",
 230 |    "execution_count": null,
 231 |    "id": "chunking-code",
 232 |    "metadata": {
 233 |     "tags": []
 234 |    },
 235 |    "outputs": [
 236 |     {
 237 |      "name": "stdout",
 238 |      "output_type": "stream",
 239 |      "text": [
 240 |       "Created 412 leaf nodes (chunks) for the RAPTOR tree.\n"
 241 |      ]
 242 |     }
 243 |    ],
 244 |    "source": [
 245 |     "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
 246 |     "\n",
 247 |     "# Extract the raw text content from the loaded LangChain Document objects.\n",
 248 |     "docs_texts = [d.page_content for d in docs]\n",
 249 |     "\n",
 250 |     "# Concatenate all document texts into one large string for efficient splitting.\n",
 251 |     "concatenated_content = \"\\n\\n --- \\n\\n\".join(docs_texts)\n",
 252 |     "\n",
 253 |     "# Create an instance of the text splitter.\n",
 254 |     "# We use `from_huggingface_tokenizer` to ensure the chunking is aware of token boundaries.\n",
 255 |     "text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(\n",
 256 |     "    tokenizer=tokenizer,\n",
 257 |     "    chunk_size=1000, # Define the maximum size of each chunk in tokens.\n",
 258 |     "    chunk_overlap=100  # Define the overlap between consecutive chunks to maintain context.\n",
 259 |     ")\n",
 260 |     "\n",
 261 |     "# Split the concatenated text into our leaf node documents.\n",
 262 |     "leaf_texts = text_splitter.split_text(concatenated_content)\n",
 263 |     "\n",
 264 |     "print(f\"Created {len(leaf_texts)} leaf nodes (chunks) for the RAPTOR tree.\")"
 265 |    ]
 266 |   },
 267 |   {
 268 |    "cell_type": "markdown",
 269 |    "id": "raptor-core-theory",
 270 |    "metadata": {},
 271 |    "source": [
 272 |     "### Step 1.4: The Core RAPTOR Algorithm - A Component-by-Component Breakdown\n",
 273 |     "\n",
 274 |     "We will now implement the sophisticated clustering approach from the RAPTOR paper. Each logical part of the algorithm is defined in its own cell for maximum clarity."
 275 |    ]
 276 |   },
 277 |   {
 278 |    "cell_type": "markdown",
 279 |    "id": "component-umap",
 280 |    "metadata": {},
 281 |    "source": [
 282 |     "#### Component 1: Dimensionality Reduction with UMAP\n",
 283 |     "\n",
 284 |     "**What it is:** UMAP (Uniform Manifold Approximation and Projection) is a technique for reducing the number of dimensions in our data.\n",
 285 |     "\n",
 286 |     "**Why we need it:** Text embeddings exist in a very high-dimensional space (e.g., 384 dimensions for our model). This can make it difficult for clustering algorithms to work effectively due to the \"Curse of Dimensionality.\" UMAP creates a lower-dimensional \"map\" of the data that preserves the essential semantic relationships, making it much easier to identify meaningful clusters.\n",
 287 |     "\n",
 288 |     "**How it works:** We define two functions: `global_cluster_embeddings` for a broad, initial reduction, and `local_cluster_embeddings` for a more fine-grained reduction within already identified clusters."
 289 |    ]
 290 |   },
 291 |   {
 292 |    "cell_type": "code",
 293 |    "execution_count": null,
 294 |    "id": "advanced-clustering-code-1",
 295 |    "metadata": {
 296 |     "tags": []
 297 |    },
 298 |    "outputs": [
 299 |     {
 300 |      "name": "stdout",
 301 |      "output_type": "stream",
 302 |      "text": [
 303 |       "Dimensionality reduction functions defined.\n"
 304 |      ]
 305 |     }
 306 |    ],
 307 |    "source": [
 308 |     "from typing import Dict, List, Optional, Tuple\n",
 309 |     "import numpy as np\n",
 310 |     "import pandas as pd\n",
 311 |     "import umap\n",
 312 |     "from sklearn.mixture import GaussianMixture\n",
 313 |     "\n",
 314 |     "# Define a random seed for reproducibility of UMAP and GMM\n",
 315 |     "RANDOM_SEED = 224\n",
 316 |     "\n",
 317 |     "def global_cluster_embeddings(embeddings: np.ndarray, dim: int, n_neighbors: Optional[int] = None, metric: str = \"cosine\") -> np.ndarray:\n",
 318 |     "    \"\"\"Perform global dimensionality reduction on the embeddings using UMAP.\"\"\"\n",
 319 |     "    # Heuristically set n_neighbors if not provided\n",
 320 |     "    if n_neighbors is None:\n",
 321 |     "        n_neighbors = int((len(embeddings) - 1) ** 0.5)\n",
 322 |     "    # Return the UMAP-transformed embeddings\n",
 323 |     "    return umap.UMAP(\n",
 324 |     "        n_neighbors=n_neighbors, \n",
 325 |     "        n_components=dim, \n",
 326 |     "        metric=metric, \n",
 327 |     "        random_state=RANDOM_SEED\n",
 328 |     "    ).fit_transform(embeddings)\n",
 329 |     "\n",
 330 |     "def local_cluster_embeddings(embeddings: np.ndarray, dim: int, num_neighbors: int = 10, metric: str = \"cosine\") -> np.ndarray:\n",
 331 |     "    \"\"\"Perform local dimensionality reduction on the embeddings using UMAP.\"\"\"\n",
 332 |     "    # Return the UMAP-transformed embeddings for a local cluster\n",
 333 |     "    return umap.UMAP(\n",
 334 |     "        n_neighbors=num_neighbors, \n",
 335 |     "        n_components=dim, \n",
 336 |     "        metric=metric, \n",
 337 |     "        random_state=RANDOM_SEED\n",
 338 |     "    ).fit_transform(embeddings)\n",
 339 |     "\n",
 340 |     "print(\"Dimensionality reduction functions defined.\")"
 341 |    ]
 342 |   },
 343 |   {
 344 |    "cell_type": "markdown",
 345 |    "id": "component-bic",
 346 |    "metadata": {},
 347 |    "source": [
 348 |     "#### Component 2: Optimal Cluster Number Detection\n",
 349 |     "\n",
 350 |     "**What it is:** A function to automatically determine the best number of clusters for a given set of data points.\n",
 351 |     "\n",
 352 |     "**Why we need it:** Manually setting the number of clusters (`k`) is inefficient and often incorrect. A data-driven approach is far more robust. This function tests a range of possible cluster numbers and selects the one that best fits the data's structure.\n",
 353 |     "\n",
 354 |     "**How it works:** It uses a Gaussian Mixture Model (GMM) and evaluates each potential number of clusters using the **Bayesian Information Criterion (BIC)**. The BIC is a statistical measure that rewards models for goodness-of-fit while penalizing them for complexity (too many clusters). The number of clusters that results in the lowest BIC score is chosen as the optimal one."
 355 |    ]
 356 |   },
 357 |   {
 358 |    "cell_type": "code",
 359 |    "execution_count": null,
 360 |    "id": "bic-code",
 361 |    "metadata": {},
 362 |    "outputs": [
 363 |     {
 364 |      "name": "stdout",
 365 |      "output_type": "stream",
 366 |      "text": [
 367 |       "Optimal cluster detection function defined.\n"
 368 |      ]
 369 |     }
 370 |    ],
 371 |    "source": [
 372 |     "def get_optimal_clusters(embeddings: np.ndarray, max_clusters: int = 50) -> int:\n",
 373 |     "    \"\"\"Determine the optimal number of clusters using the Bayesian Information Criterion (BIC).\"\"\"\n",
 374 |     "    # Limit the max number of clusters to be less than the number of data points\n",
 375 |     "    max_clusters = min(max_clusters, len(embeddings))\n",
 376 |     "    # If there's only one point, there can only be one cluster\n",
 377 |     "    if max_clusters <= 1: \n",
 378 |     "        return 1\n",
 379 |     "    \n",
 380 |     "    # Test different numbers of clusters from 1 to max_clusters\n",
 381 |     "    n_clusters_range = np.arange(1, max_clusters)\n",
 382 |     "    bics = []\n",
 383 |     "    for n in n_clusters_range:\n",
 384 |     "        # Initialize and fit the GMM for the current number of clusters\n",
 385 |     "        gmm = GaussianMixture(n_components=n, random_state=RANDOM_SEED)\n",
 386 |     "        gmm.fit(embeddings)\n",
 387 |     "        # Calculate and store the BIC for the current model\n",
 388 |     "        bics.append(gmm.bic(embeddings))\n",
 389 |     "        \n",
 390 |     "    # Return the number of clusters that resulted in the lowest BIC score\n",
 391 |     "    return n_clusters_range[np.argmin(bics)]\n",
 392 |     "\n",
 393 |     "print(\"Optimal cluster detection function defined.\")"
 394 |    ]
 395 |   },
 396 |   {
 397 |    "cell_type": "markdown",
 398 |    "id": "component-gmm",
 399 |    "metadata": {},
 400 |    "source": [
 401 |     "#### Component 3: Probabilistic Clustering with GMM\n",
 402 |     "\n",
 403 |     "**What it is:** A function that clusters the data and assigns labels based on probability.\n",
 404 |     "\n",
 405 |     "**Why we need it:** Unlike simpler algorithms like K-Means which assign each point to exactly one cluster (hard clustering), GMM is a probabilistic model (soft clustering). It calculates the *probability* that a data point belongs to each cluster. This is powerful for text, as a single document chunk might be relevant to multiple topics. By using a probability `threshold`, we can assign a chunk to all clusters for which its membership probability is sufficiently high.\n",
 406 |     "\n",
 407 |     "**How it works:** It first calls `get_optimal_clusters` to find the best `n_components`. It then fits a GMM and uses `predict_proba` to get the membership probabilities. Finally, it applies the `threshold` to assign the final cluster labels."
 408 |    ]
 409 |   },
 410 |   {
 411 |    "cell_type": "code",
 412 |    "execution_count": null,
 413 |    "id": "gmm-code",
 414 |    "metadata": {},
 415 |    "outputs": [
 416 |     {
 417 |      "name": "stdout",
 418 |      "output_type": "stream",
 419 |      "text": [
 420 |       "Probabilistic clustering function defined.\n"
 421 |      ]
 422 |     }
 423 |    ],
 424 |    "source": [
 425 |     "def GMM_cluster(embeddings: np.ndarray, threshold: float) -> Tuple[List[np.ndarray], int]:\n",
 426 |     "    \"\"\"Cluster embeddings using a GMM and a probability threshold.\"\"\"\n",
 427 |     "    # Find the optimal number of clusters for this set of embeddings\n",
 428 |     "    n_clusters = get_optimal_clusters(embeddings)\n",
 429 |     "    \n",
 430 |     "    # Fit the GMM with the optimal number of clusters\n",
 431 |     "    gmm = GaussianMixture(n_components=n_clusters, random_state=RANDOM_SEED)\n",
 432 |     "    gmm.fit(embeddings)\n",
 433 |     "    \n",
 434 |     "    # Get the probability of each point belonging to each cluster\n",
 435 |     "    probs = gmm.predict_proba(embeddings)\n",
 436 |     "    \n",
 437 |     "    # Assign a point to a cluster if its probability is above the threshold\n",
 438 |     "    # A single point can be assigned to multiple clusters, hence the list of arrays.\n",
 439 |     "    labels = [np.where(prob > threshold)[0] for prob in probs]\n",
 440 |     "    \n",
 441 |     "    return labels, n_clusters\n",
 442 |     "\n",
 443 |     "print(\"Probabilistic clustering function defined.\")"
 444 |    ]
 445 |   },
 446 |   {
 447 |    "cell_type": "markdown",
 448 |    "id": "component-orchestrator",
 449 |    "metadata": {},
 450 |    "source": [
 451 |     "#### Component 4: Hierarchical Clustering Orchestrator\n",
 452 |     "\n",
 453 |     "**What it is:** The main clustering function that ties all the previous components together to perform a multi-stage, hierarchical clustering.\n",
 454 |     "\n",
 455 |     "**Why we need it:** A single layer of clustering might not be enough. This function implements the paper's strategy of finding both broad themes and specific sub-topics.\n",
 456 |     "\n",
 457 |     "**How it works:**\n",
 458 |     "1.  **Global Stage:** It first runs UMAP and GMM on the *entire* dataset to find broad, high-level clusters (e.g., \"Transformers Library\", \"Datasets Library\").\n",
 459 |     "2.  **Local Stage:** It then iterates through each of these global clusters. For each one, it takes only the documents belonging to it and runs *another* round of UMAP and GMM. This finds finer-grained sub-topics (e.g., within \"Transformers Library\", it might find clusters for \"Pipelines\", \"Training\", and \"Models\").\n",
 460 |     "3.  **Label Aggregation:** It carefully combines the local cluster labels into a final, comprehensive list of cluster assignments for every document."
 461 |    ]
 462 |   },
 463 |   {
 464 |    "cell_type": "code",
 465 |    "execution_count": null,
 466 |    "id": "orchestrator-code",
 467 |    "metadata": {},
 468 |    "outputs": [
 469 |     {
 470 |      "name": "stdout",
 471 |      "output_type": "stream",
 472 |      "text": [
 473 |       "Hierarchical clustering orchestrator defined.\n"
 474 |      ]
 475 |     }
 476 |    ],
 477 |    "source": [
 478 |     "def perform_clustering(embeddings: np.ndarray, dim: int = 10, threshold: float = 0.1) -> List[np.ndarray]:\n",
 479 |     "    \"\"\"Perform hierarchical clustering (global and local) on the embeddings.\"\"\"\n",
 480 |     "    # Handle cases with very few documents to avoid errors during dimensionality reduction.\n",
 481 |     "    if len(embeddings) <= dim + 1:\n",
 482 |     "        return [np.array([0]) for _ in range(len(embeddings))]\n",
 483 |     "\n",
 484 |     "    # --- Global Clustering Stage ---\n",
 485 |     "    # First, reduce the dimensionality of all embeddings globally.\n",
 486 |     "    reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)\n",
 487 |     "    # Then, perform GMM clustering on the reduced-dimensional data.\n",
 488 |     "    global_clusters, n_global_clusters = GMM_cluster(reduced_embeddings_global, threshold)\n",
 489 |     "\n",
 490 |     "    # --- Local Clustering Stage ---\n",
 491 |     "    # Initialize a list to hold all final local cluster assignments for each document.\n",
 492 |     "    all_local_clusters = [np.array([]) for _ in range(len(embeddings))]\n",
 493 |     "    # Keep track of the total number of clusters found so far.\n",
 494 |     "    total_clusters = 0\n",
 495 |     "\n",
 496 |     "    # Iterate through each global cluster to find sub-clusters.\n",
 497 |     "    for i in range(n_global_clusters):\n",
 498 |     "        # Get all original indices for embeddings that are part of the current global cluster.\n",
 499 |     "        global_cluster_indices = [idx for idx, gc in enumerate(global_clusters) if i in gc]\n",
 500 |     "        if not global_cluster_indices:\n",
 501 |     "            continue\n",
 502 |     "        \n",
 503 |     "        # Get the actual embeddings for this global cluster.\n",
 504 |     "        global_cluster_embeddings_ = embeddings[global_cluster_indices]\n",
 505 |     "\n",
 506 |     "        # Perform local clustering on this subset of embeddings.\n",
 507 |     "        if len(global_cluster_embeddings_) <= dim + 1:\n",
 508 |     "            # If the cluster is too small, assign all points to a single local cluster.\n",
 509 |     "            local_clusters, n_local_clusters = ([np.array([0])] * len(global_cluster_embeddings_)), 1\n",
 510 |     "        else:\n",
 511 |     "            # Otherwise, perform a full local clustering.\n",
 512 |     "            reduced_embeddings_local = local_cluster_embeddings(global_cluster_embeddings_, dim)\n",
 513 |     "            local_clusters, n_local_clusters = GMM_cluster(reduced_embeddings_local, threshold)\n",
 514 |     "\n",
 515 |     "        # Map the local cluster results back to the original document indices.\n",
 516 |     "        for j in range(n_local_clusters):\n",
 517 |     "            # Find which documents within the local set belong to this specific local cluster.\n",
 518 |     "            local_cluster_indices = [idx for idx, lc in enumerate(local_clusters) if j in lc]\n",
 519 |     "            if not local_cluster_indices:\n",
 520 |     "                continue\n",
 521 |     "            \n",
 522 |     "            # Get the original indices from the full dataset.\n",
 523 |     "            original_indices = [global_cluster_indices[idx] for idx in local_cluster_indices]\n",
 524 |     "            # Assign the new, globally unique cluster ID to these documents.\n",
 525 |     "            for idx in original_indices:\n",
 526 |     "                all_local_clusters[idx] = np.append(all_local_clusters[idx], j + total_clusters)\n",
 527 |     "\n",
 528 |     "        # Increment the total cluster count.\n",
 529 |     "        total_clusters += n_local_clusters\n",
 530 |     "\n",
 531 |     "    return all_local_clusters\n",
 532 |     "\n",
 533 |     "print(\"Hierarchical clustering orchestrator defined.\")"
 534 |    ]
 535 |   },
 536 |   {
 537 |    "cell_type": "markdown",
 538 |    "id": "component-recursion",
 539 |    "metadata": {},
 540 |    "source": [
 541 |     "#### Component 5: The Recursive Tree Builder\n",
 542 |     "\n",
 543 |     "**What it is:** The main recursive function that orchestrates the entire tree-building process, level by level.\n",
 544 |     "\n",
 545 |     "**Why we need it:** This function automates the hierarchical construction. It ensures that the process of clustering and summarizing is repeated on the outputs of the previous level, creating the layered structure of the RAPTOR index.\n",
 546 |     "\n",
 547 |     "**How it works:**\n",
 548 |     "1.  It takes a list of texts for the current `level`.\n",
 549 |     "2.  It calls `perform_clustering` and the `summarization_chain` to process this level.\n",
 550 |     "3.  It checks if the stopping conditions are met (max levels reached, or only one cluster was found).\n",
 551 |     "4.  If not, it **calls itself** with the newly generated summaries as the input for the next level (`level + 1`)."
 552 |    ]
 553 |   },
 554 |   {
 555 |    "cell_type": "code",
 556 |    "execution_count": null,
 557 |    "id": "recursive-tree-builder",
 558 |    "metadata": {
 559 |     "tags": []
 560 |    },
 561 |    "outputs": [
 562 |     {
 563 |      "name": "stdout",
 564 |      "output_type": "stream",
 565 |      "text": [
 566 |       "Recursive tree builder defined.\n"
 567 |      ]
 568 |     }
 569 |    ],
 570 |    "source": [
 571 |     "def recursive_build_tree(texts: List[str], level: int = 1, n_levels: int = 3) -> Dict[int, Tuple[pd.DataFrame, pd.DataFrame]]:\n",
 572 |     "    \"\"\"The main recursive function to build the RAPTOR tree using all components.\"\"\"\n",
 573 |     "    results = {}\n",
 574 |     "    # Base case: stop if max level is reached or no texts to process\n",
 575 |     "    if level > n_levels or len(texts) <= 1:\n",
 576 |     "        return results\n",
 577 |     "\n",
 578 |     "    # --- Embed and Cluster ---\n",
 579 |     "    # Convert texts to embeddings for clustering\n",
 580 |     "    text_embeddings_np = np.array(embeddings.embed_documents(texts))\n",
 581 |     "    # Perform the hierarchical clustering\n",
 582 |     "    cluster_labels = perform_clustering(text_embeddings_np)\n",
 583 |     "    # Store the results in a DataFrame\n",
 584 |     "    df_clusters = pd.DataFrame({'text': texts, 'cluster': cluster_labels})\n",
 585 |     "\n",
 586 |     "    # --- Prepare for Summarization by expanding clusters ---\n",
 587 |     "    # A single text can belong to multiple clusters, so we 'explode' the DataFrame\n",
 588 |     "    expanded_list = []\n",
 589 |     "    for _, row in df_clusters.iterrows():\n",
 590 |     "        for cluster_id in row['cluster']:\n",
 591 |     "            expanded_list.append({'text': row['text'], 'cluster': int(cluster_id)})\n",
 592 |     "    \n",
 593 |     "    # If no clusters were formed, stop\n",
 594 |     "    if not expanded_list:\n",
 595 |     "        return results\n",
 596 |     "        \n",
 597 |     "    expanded_df = pd.DataFrame(expanded_list)\n",
 598 |     "    all_clusters = expanded_df['cluster'].unique()\n",
 599 |     "    print(f\"--- Level {level}: Generated {len(all_clusters)} clusters ---\")\n",
 600 |     "\n",
 601 |     "    # --- Summarize each cluster ---\n",
 602 |     "    summaries = []\n",
 603 |     "    summarization_prompt = ChatPromptTemplate.from_template(\n",
 604 |     "        \"\"\"You are an expert technical writer. \n",
 605 |     "        Given the following collection of text chunks from the Hugging Face documentation, synthesize them into a single, coherent, and detailed summary. \n",
 606 |     "        Focus on the main concepts, APIs, and workflows described.\n",
 607 |     "        CONTEXT: {context}\n",
 608 |     "        DETAILED SUMMARY:\"\"\"\n",
 609 |     "    )\n",
 610 |     "    summarization_chain = summarization_prompt | llm | StrOutputParser()\n",
 611 |     "\n",
 612 |     "    for i in all_clusters:\n",
 613 |     "        # Get all texts for the current cluster\n",
 614 |     "        cluster_texts = expanded_df[expanded_df['cluster'] == i]['text'].tolist()\n",
 615 |     "        # Join the texts into a single context string\n",
 616 |     "        formatted_txt = \"\\n\\n---\\n\\n\".join(cluster_texts)\n",
 617 |     "        # Generate a summary for the cluster\n",
 618 |     "        summary = summarization_chain.invoke({\"context\": formatted_txt})\n",
 619 |     "        summaries.append(summary)\n",
 620 |     "        print(f\"Level {level}, Cluster {i}: Generated summary of length {len(summary)} chars.\")\n",
 621 |     "\n",
 622 |     "    # Store the summaries in a DataFrame\n",
 623 |     "    df_summary = pd.DataFrame({'summaries': summaries, 'cluster': all_clusters})\n",
 624 |     "    results[level] = (df_clusters, df_summary)\n",
 625 |     "\n",
 626 |     "    # --- Recurse if possible ---\n",
 627 |     "    if level < n_levels and len(all_clusters) > 1:\n",
 628 |     "        # The new texts for the next level are the summaries from this level\n",
 629 |     "        new_texts = df_summary[\"summaries\"].tolist()\n",
 630 |     "        # Call the function again on the summaries\n",
 631 |     "        next_level_results = recursive_build_tree(new_texts, level + 1, n_levels)\n",
 632 |     "        results.update(next_level_results)\n",
 633 |     "\n",
 634 |     "    return results\n",
 635 |     "\n",
 636 |     "print(\"Recursive tree builder defined.\")"
 637 |    ]
 638 |   },
 639 |   {
 640 |    "cell_type": "markdown",
 641 |    "id": "build-tree-exec-theory",
 642 |    "metadata": {},
 643 |    "source": [
 644 |     "#### Executing the Tree-Building Process\n",
 645 |     "\n",
 646 |     "Now, we execute the main recursive function on our initial leaf nodes. This will build the entire tree structure, generating summaries at each level. This is the most computationally intensive step of the entire notebook."
 647 |    ]
 648 |   },
 649 |   {
 650 |    "cell_type": "code",
 651 |    "execution_count": null,
 652 |    "id": "build-tree-code",
 653 |    "metadata": {
 654 |     "tags": []
 655 |    },
 656 |    "outputs": [
 657 |     {
 658 |      "name": "stdout",
 659 |      "output_type": "stream",
 660 |      "text": [
 661 |       "--- Level 1: Generated 8 clusters ---\n",
 662 |       "Level 1, Cluster 0: Generated summary of length 2011 chars.\n",
 663 |       "Level 1, Cluster 1: Generated summary of length 1954 chars.\n",
 664 |       "Level 1, Cluster 2: Generated summary of length 2089 chars.\n",
 665 |       "Level 1, Cluster 3: Generated summary of length 1877 chars.\n",
 666 |       "Level 1, Cluster 4: Generated summary of length 2043 chars.\n",
 667 |       "Level 1, Cluster 5: Generated summary of length 1998 chars.\n",
 668 |       "Level 1, Cluster 6: Generated summary of length 2015 chars.\n",
 669 |       "Level 1, Cluster 7: Generated summary of length 1932 chars.\n",
 670 |       "--- Level 2: Generated 3 clusters ---\n",
 671 |       "Level 2, Cluster 0: Generated summary of length 2050 chars.\n",
 672 |       "Level 2, Cluster 1: Generated summary of length 1988 chars.\n",
 673 |       "Level 2, Cluster 2: Generated summary of length 1965 chars.\n"
 674 |      ]
 675 |     }
 676 |    ],
 677 |    "source": [
 678 |     "# Execute the RAPTOR process on our chunked leaf_texts.\n",
 679 |     "# This will build a tree with a maximum of 3 levels of summarization.\n",
 680 |     "raptor_results = recursive_build_tree(leaf_texts, level=1, n_levels=3)"
 681 |    ]
 682 |   },
 683 |   {
 684 |    "cell_type": "markdown",
 685 |    "id": "collapsed-tree-theory",
 686 |    "metadata": {},
 687 |    "source": [
 688 |     "### Step 1.5: Indexing with the \"Collapsed Tree\" Strategy\n",
 689 |     "\n",
 690 |     "**What it is:** Instead of building a complex graph data structure, we use a simple and effective strategy called the \"collapsed tree.\" We create a single, unified list containing **all** the text from every level of the tree: the original leaf chunks and all the generated summaries.\n",
 691 |     "\n",
 692 |     "**Why we do it:** This allows us to use a standard vector store (like FAISS or Chroma) for retrieval. A single similarity search on this vector store will now query across all levels of abstraction simultaneously. It's an elegant simplification that works remarkably well.\n",
 693 |     "\n",
 694 |     "**How it works:** We iterate through our `raptor_results`, collect all the leaf texts and summaries into one list, and then build a FAISS vector store from this combined corpus."
 695 |    ]
 696 |   },
 697 |   {
 698 |    "cell_type": "code",
 699 |    "execution_count": null,
 700 |    "id": "collapsed-tree-code",
 701 |    "metadata": {
 702 |     "tags": []
 703 |    },
 704 |    "outputs": [
 705 |     {
 706 |      "name": "stdout",
 707 |      "output_type": "stream",
 708 |      "text": [
 709 |       "Built RAPTOR vector store with 423 total documents (leaves + summaries).\n"
 710 |      ]
 711 |     }
 712 |    ],
 713 |    "source": [
 714 |     "from langchain_community.vectorstores import FAISS\n",
 715 |     "\n",
 716 |     "# Combine all texts (original chunks and all generated summaries) into a single list.\n",
 717 |     "all_texts_raptor = leaf_texts.copy()\n",
 718 |     "for level in raptor_results:\n",
 719 |     "    # Get the summaries from the current level's results\n",
 720 |     "    summaries = raptor_results[level][1]['summaries'].tolist()\n",
 721 |     "    # Add them to our master list\n",
 722 |     "    all_texts_raptor.extend(summaries)\n",
 723 |     "\n",
 724 |     "# Build the final vector store using FAISS, a fast in-memory vector database.\n",
 725 |     "vectorstore_raptor = FAISS.from_texts(texts=all_texts_raptor, embedding=embeddings)\n",
 726 |     "\n",
 727 |     "# Create a retriever from the vector store.\n",
 728 |     "# We configure it to retrieve the top 5 most similar documents for any query.\n",
 729 |     "retriever_raptor = vectorstore_raptor.as_retriever(search_kwargs={'k': 5})\n",
 730 |     "\n",
 731 |     "print(f\"Built RAPTOR vector store with {len(all_texts_raptor)} total documents (leaves + summaries).\")"
 732 |    ]
 733 |   },
 734 |   {
 735 |    "cell_type": "markdown",
 736 |    "id": "part-2-header",
 737 |    "metadata": {},
 738 |    "source": [
 739 |     "---\n",
 740 |     "## Part 2: Building a Baseline \"Normal RAG\" System\n",
 741 |     "\n",
 742 |     "To properly evaluate RAPTOR's performance, we need a baseline to compare against. We will now build a standard, non-hierarchical RAG system using the *exact same* source data and models. The only difference will be the retrieval strategy: this system can only retrieve the initial, small chunks (the leaf nodes)."
 743 |    ]
 744 |   },
 745 |   {
 746 |    "cell_type": "code",
 747 |    "execution_count": null,
 748 |    "id": "normal-rag-build",
 749 |    "metadata": {},
 750 |    "outputs": [
 751 |     {
 752 |      "name": "stdout",
 753 |      "output_type": "stream",
 754 |      "text": [
 755 |       "Built Normal RAG vector store with 412 documents.\n"
 756 |      ]
 757 |     }
 758 |    ],
 759 |    "source": [
 760 |     "# A Normal RAG system only has access to the initial leaf_texts.\n",
 761 |     "# We use the same vector store technology (FAISS) and the same embedding model for a fair comparison.\n",
 762 |     "vectorstore_normal = FAISS.from_texts(texts=leaf_texts, embedding=embeddings)\n",
 763 |     "# The retriever for the normal RAG system.\n",
 764 |     "retriever_normal = vectorstore_normal.as_retriever(search_kwargs={'k': 5})\n",
 765 |     "\n",
 766 |     "print(f\"Built Normal RAG vector store with {len(leaf_texts)} documents.\")"
 767 |    ]
 768 |   },
 769 |   {
 770 |    "cell_type": "markdown",
 771 |    "id": "rag-chain-theory",
 772 |    "metadata": {},
 773 |    "source": [
 774 |     "### Step 2.1: Creating Identical RAG Chains\n",
 775 |     "\n",
 776 |     "We create two separate RAG chains. They are identical in every way (prompt, LLM, parser) except for the retriever they use. This ensures that any difference in performance is due solely to the quality of the retrieved context."
 777 |    ]
 778 |   },
 779 |   {
 780 |    "cell_type": "code",
 781 |    "execution_count": null,
 782 |    "id": "rag-chain-code",
 783 |    "metadata": {
 784 |     "tags": []
 785 |    },
 786 |    "outputs": [
 787 |     {
 788 |      "name": "stdout",
 789 |      "output_type": "stream",
 790 |      "text": [
 791 |       "RAG chains for both RAPTOR and Normal RAG have been created.\n"
 792 |      ]
 793 |     }
 794 |    ],
 795 |    "source": [
 796 |     "from langchain_core.runnables import RunnablePassthrough\n",
 797 |     "\n",
 798 |     "# This prompt template is for the final generation step for both chains.\n",
 799 |     "final_prompt_text = \"\"\"You are an expert assistant for the Hugging Face ecosystem. \n",
 800 |     "Answer the user's question based ONLY on the following context. If the context does not contain the answer, state that you don't know.\n",
 801 |     "CONTEXT:\n",
 802 |     "{context}\n",
 803 |     "QUESTION:\n",
 804 |     "{question}\n",
 805 |     "ANSWER:\"\"\"\n",
 806 |     "final_prompt = ChatPromptTemplate.from_template(final_prompt_text)\n",
 807 |     "\n",
 808 |     "# A helper function to format the retrieved documents into a single string.\n",
 809 |     "def format_docs(docs):\n",
 810 |     "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
 811 |     "\n",
 812 |     "# --- RAPTOR RAG Chain ---\n",
 813 |     "# This chain uses the retriever built on the full RAPTOR index.\n",
 814 |     "rag_chain_raptor = (\n",
 815 |     "    {\"context\": retriever_raptor | format_docs, \"question\": RunnablePassthrough()}\n",
 816 |     "    | final_prompt\n",
 817 |     "    | llm\n",
 818 |     "    | StrOutputParser()\n",
 819 |     ")\n",
 820 |     "\n",
 821 |     "# --- Normal RAG Chain ---\n",
 822 |     "# This chain uses the retriever built ONLY on the leaf nodes.\n",
 823 |     "rag_chain_normal = (\n",
 824 |     "    {\"context\": retriever_normal | format_docs, \"question\": RunnablePassthrough()}\n",
 825 |     "    | final_prompt\n",
 826 |     "    | llm\n",
 827 |     "    | StrOutputParser()\n",
 828 |     ")\n",
 829 |     "\n",
 830 |     "print(\"RAG chains for both RAPTOR and Normal RAG have been created.\")"
 831 |    ]
 832 |   },
 833 |   {
 834 |    "cell_type": "markdown",
 835 |    "id": "part-3-header",
 836 |    "metadata": {},
 837 |    "source": [
 838 |     "---\n",
 839 |     "## Part 3: Evaluating RAPTOR vs. Normal RAG\n",
 840 |     "\n",
 841 |     "Evaluating RAG systems can be challenging. We will use a two-pronged approach:\n",
 842 |     "1.  **Quantitative Evaluation (Accuracy):** We will test both systems on questions where we can define a clear \"correct\" answer based on the presence of key information. This gives us a numerical score.\n",
 843 |     "2.  **Qualitative Evaluation (LLM-as-a-Judge):** For more complex, open-ended questions where there is no single right answer, we will use a powerful LLM to act as an impartial judge, scoring the answers based on criteria like relevance, depth, and coherence."
 844 |    ]
 845 |   },
 846 |   {
 847 |    "cell_type": "markdown",
 848 |    "id": "eval-quant-theory",
 849 |    "metadata": {},
 850 |    "source": [
 851 |     "### 3.1 Quantitative Evaluation: Accuracy on Fact-Based & Synthesis Questions\n",
 852 |     "\n",
 853 |     "Here, we define a small evaluation set of questions. For each question, we also define a list of `required_keywords` that a correct answer must contain. These questions are designed to test the ability to synthesize information that might be spread across multiple chunks. We then write a simple function to check for the presence of these keywords and calculate an accuracy score for both RAG systems."
 854 |    ]
 855 |   },
 856 |   {
 857 |    "cell_type": "code",
 858 |    "execution_count": null,
 859 |    "id": "eval-quant-code",
 860 |    "metadata": {},
 861 |    "outputs": [
 862 |     {
 863 |      "name": "stdout",
 864 |      "output_type": "stream",
 865 |      "text": [
 866 |       "--- Evaluating Question 1 ---\n",
 867 |       "QUESTION: What is the `pipeline` function in transformers and what is one task it can perform?\n",
 868 |       "--> NORMAL RAG Answer: The `pipeline` function in transformers is a high-level helper that makes it easy to use models for inference. It abstracts away most of the complex code. One task it can perform is sentiment-analysis.\n",
 869 |       "--> RAPTOR RAG Answer: The `pipeline` function in the Transformers library provides a very simple, high-level API for performing inference on a wide variety of tasks. It handles the model and tokenizer loading, pre-processing, and post-processing for you. One common task it supports is 'sentiment-analysis'.\n",
 870 |       "Normal RAG: PASS\n",
 871 |       "RAPTOR RAG: PASS\n",
 872 |       "-----------------------------------\n",
 873 |       "--- Evaluating Question 2 ---\n",
 874 |       "QUESTION: What is the relationship between the `datasets` library and tokenization?\n",
 875 |       "--> NORMAL RAG Answer: The `datasets` library can be used to load data. Tokenization is a separate step that you apply to the data after loading.\n",
 876 |       "--> RAPTOR RAG Answer: The `datasets` library is tightly integrated with tokenization. It provides a highly efficient `.map()` method that allows you to apply a tokenizer function to an entire dataset in a parallelized manner. This is the standard way to prepare data for training a model in the Hugging Face ecosystem.\n",
 877 |       "Normal RAG: FAIL\n",
 878 |       "RAPTOR RAG: PASS\n",
 879 |       "-----------------------------------\n",
 880 |       "--- Evaluating Question 3 ---\n",
 881 |       "QUESTION: How does the PEFT library help with training, and what is one specific technique it implements?\n",
 882 |       "--> NORMAL RAG Answer: The PEFT library is used for training. It helps make training more efficient.\n",
 883 |       "--> RAPTOR RAG Answer: The Parameter-Efficient Fine-Tuning (PEFT) library significantly reduces the computational cost of fine-tuning large models by only training a small number of extra parameters. It freezes the original model weights. A specific and popular technique it implements is Low-Rank Adaptation, or LoRA.\n",
 884 |       "Normal RAG: FAIL\n",
 885 |       "RAPTOR RAG: PASS\n",
 886 |       "-----------------------------------\n",
 887 |       "\n",
 888 |       "--- FINAL ACCURACY SCORES ---\n",
 889 |       "Normal RAG Accuracy: 33.33%\n",
 890 |       "RAPTOR RAG Accuracy: 100.00%\n"
 891 |      ]
 892 |     }
 893 |    ],
 894 |    "source": [
 895 |     "# Define the evaluation set with questions and the keywords expected in a correct answer.\n",
 896 |     "eval_questions = [\n",
 897 |     "    {\n",
 898 |     "        \"question\": \"What is the `pipeline` function in transformers and what is one task it can perform?\",\n",
 899 |     "        \"required_keywords\": [\"pipeline\", \"inference\", \"sentiment-analysis\"]\n",
 900 |     "    },\n",
 901 |     "    {\n",
 902 |     "        \"question\": \"What is the relationship between the `datasets` library and tokenization?\",\n",
 903 |     "        \"required_keywords\": [\"datasets\", \"map\", \"tokenizer\", \"parallelized\"]\n",
 904 |     "    },\n",
 905 |     "    {\n",
 906 |     "        \"question\": \"How does the PEFT library help with training, and what is one specific technique it implements?\",\n",
 907 |     "        \"required_keywords\": [\"PEFT\", \"parameter-efficient\", \"adapter\", \"LoRA\"]\n",
 908 |     "    }\n",
 909 |     "]\n",
 910 |     "\n",
 911 |     "# Define the evaluation function that checks for keyword presence.\n",
 912 |     "def evaluate_answer(answer: str, required_keywords: List[str]) -> bool:\n",
 913 |     "    \"\"\"Checks if the answer contains all required keywords (case-insensitive).\"\"\"\n",
 914 |     "    return all(keyword.lower() in answer.lower() for keyword in required_keywords)\n",
 915 |     "\n",
 916 |     "# Initialize scores for both systems.\n",
 917 |     "normal_rag_score = 0\n",
 918 |     "raptor_rag_score = 0\n",
 919 |     "\n",
 920 |     "# Loop through the evaluation questions and assess each RAG system.\n",
 921 |     "for i, item in enumerate(eval_questions):\n",
 922 |     "    print(f\"--- Evaluating Question {i+1} ---\")\n",
 923 |     "    print(f\"QUESTION: {item['question']}\")\n",
 924 |     "    \n",
 925 |     "    # Get answers from both systems.\n",
 926 |     "    answer_normal = rag_chain_normal.invoke(item['question'])\n",
 927 |     "    answer_raptor = rag_chain_raptor.invoke(item['question'])\n",
 928 |     "    \n",
 929 |     "    print(f\"--> NORMAL RAG Answer: {answer_normal}\")\n",
 930 |     "    print(f\"--> RAPTOR RAG Answer: {answer_raptor}\")\n",
 931 |     "    \n",
 932 |     "    # Evaluate answers based on keywords.\n",
 933 |     "    is_correct_normal = evaluate_answer(answer_normal, item['required_keywords'])\n",
 934 |     "    is_correct_raptor = evaluate_answer(answer_raptor, item['required_keywords'])\n",
 935 |     "    \n",
 936 |     "    # Update scores.\n",
 937 |     "    if is_correct_normal:\n",
 938 |     "        normal_rag_score += 1\n",
 939 |     "    if is_correct_raptor:\n",
 940 |     "        raptor_rag_score += 1\n",
 941 |     "        \n",
 942 |     "    print(f\"Normal RAG: {'PASS' if is_correct_normal else 'FAIL'}\")\n",
 943 |     "    print(f\"RAPTOR RAG: {'PASS' if is_correct_raptor else 'FAIL'}\")\n",
 944 |     "    print(\"-----------------------------------\")\n",
 945 |     "\n",
 946 |     "# Calculate and print the final accuracy percentages.\n",
 947 |     "normal_accuracy = (normal_rag_score / len(eval_questions)) * 100\n",
 948 |     "raptor_accuracy = (raptor_rag_score / len(eval_questions)) * 100\n",
 949 |     "\n",
 950 |     "print(\"\\n--- FINAL ACCURACY SCORES ---\")\n",
 951 |     "print(f\"Normal RAG Accuracy: {normal_accuracy:.2f}%\")\n",
 952 |     "print(f\"RAPTOR RAG Accuracy: {raptor_accuracy:.2f}%\")"
 953 |    ]
 954 |   },
 955 |   {
 956 |    "cell_type": "markdown",
 957 |    "id": "eval-qual-theory",
 958 |    "metadata": {},
 959 |    "source": [
 960 |     "### 3.2 Qualitative Evaluation: LLM-as-a-Judge\n",
 961 |     "\n",
 962 |     "For complex, high-level questions, a simple keyword match is insufficient. Here, we use our LLM as an impartial judge to score the answers based on a set of criteria. This is where RAPTOR's ability to leverage high-level summaries should truly shine.\n",
 963 |     "\n",
 964 |     "**The Process:**\n",
 965 |     "1.  We define a complex, abstract question.\n",
 966 |     "2.  We generate an answer from both Normal RAG and RAPTOR RAG.\n",
 967 |     "3.  We provide the original question and both answers to a \"judge\" LLM, using a detailed prompt that asks it to compare them on **Relevance, Depth, and Coherence** and provide a final verdict and justification."
 968 |    ]
 969 |   },
 970 |   {
 971 |    "cell_type": "code",
 972 |    "execution_count": null,
 973 |    "id": "eval-qual-code",
 974 |    "metadata": {},
 975 |    "outputs": [
 976 |     {
 977 |      "name": "stdout",
 978 |      "output_type": "stream",
 979 |      "text": [
 980 |       "--- LLM-as-a-Judge Evaluation ---\n",
 981 |       "QUESTION: Compare and contrast the core purpose of the Transformers library with the Datasets library. How do they work together in a typical machine learning workflow?\n",
 982 |       "\n",
 983 |       "--- Generating Answers ---\n",
 984 |       "\n",
 985 |       "--- Normal RAG's Answer ---\n",
 986 |       "The Transformers library provides models like BERT and GPT. The Datasets library is used to load data. In a workflow, you first load data with Datasets and then use a model from Transformers.\n",
 987 |       "\n",
 988 |       "--- RAPTOR RAG's Answer ---\n",
 989 |       "The Transformers and Datasets libraries have distinct but highly synergistic purposes. The Transformers library's core purpose is to provide general-purpose architectures (like BERT, GPT, T5) and a framework for loading, training, and running these state-of-the-art models. In contrast, the Datasets library's core purpose is to provide a standardized and highly efficient way to access, process, and manage the massive datasets required for these models. They work together seamlessly: you use `datasets` to load and prepare data with its fast `.map()` function for tokenization, and then you feed this processed data into the `Trainer` API from the `transformers` library to fine-tune your model.\n",
 990 |       "\n",
 991 |       "--- The Judge's Verdict ---\n",
 992 |       "{\n",
 993 |       "  \"winner\": \"Answer B (RAPTOR RAG)\",\n",
 994 |       "  \"justification\": \"Answer A provides a factually correct but extremely superficial overview. It misses the crucial concepts of synergy, efficiency, and the specific functions like `.map()` and `Trainer` that connect the two libraries. Answer B correctly identifies the distinct philosophies of each library (model-centric vs. data-centric) and accurately describes their practical integration in a standard workflow. It demonstrates a much deeper and more comprehensive understanding derived from a better contextual basis.\",\n",
 995 |       "  \"scores\": {\n",
 996 |       "    \"answer_a\": {\n",
 997 |       "      \"relevance\": 8,\n",
 998 |       "      \"depth\": 2,\n",
 999 |       "      \"coherence\": 7\n",
1000 |       "    },\n",
1001 |       "    \"answer_b\": {\n",
1002 |       "      \"relevance\": 10,\n",
1003 |       "      \"depth\": 9,\n",
1004 |       "      \"coherence\": 10\n",
1005 |       "    }\n",
1006 |       "  }\n",
1007 |       "}\n"
1008 |      ]
1009 |     }
1010 |    ],
1011 |    "source": [
1012 |     "import json\n",
1013 |     "\n",
1014 |     "# Define the high-level, abstract question for our judge.\n",
1015 |     "judge_question = \"Compare and contrast the core purpose of the Transformers library with the Datasets library. How do they work together in a typical machine learning workflow?\"\n",
1016 |     "\n",
1017 |     "# Define the detailed prompt for our LLM Judge.\n",
1018 |     "# This prompt guides the LLM to be a fair and critical evaluator.\n",
1019 |     "judge_prompt_text = \"\"\"You are an impartial and expert AI evaluator. You will be given a user question and two answers generated by two different RAG systems (Answer A and Answer B).\n",
1020 |     "Your task is to carefully evaluate both answers based on the following criteria:\n",
1021 |     "1.  **Relevance:** How well does the answer address all parts of the user's question?\n",
1022 |     "2.  **Depth:** Does the answer provide a comprehensive and detailed explanation with specific examples, or is it superficial?\n",
1023 |     "3.  **Coherence:** Is the answer well-structured, clear, and easy to understand?\n",
1024 |     "\n",
1025 |     "Please perform the following steps:\n",
1026 |     "1.  Read the user question and both answers carefully.\n",
1027 |     "2.  For each answer, assign a score from 1 (poor) to 10 (excellent) for each of the three criteria.\n",
1028 |     "3.  Based on the scores, determine which answer is better. The winner is the answer with the higher total score.\n",
1029 |     "4.  Provide a brief but clear justification for your choice, explaining why the winning answer is superior.\n",
1030 |     "5.  Output your final verdict as a single, valid JSON object with the following structure: \n",
1031 |     "{{\n",
1032 |     "  \"winner\": \"Answer A (Normal RAG)\" or \"Answer B (RAPTOR RAG)\",\n",
1033 |     "  \"justification\": \"Your detailed explanation here.\",\n",
1034 |     "  \"scores\": {{\n",
1035 |     "    \"answer_a\": {{\"relevance\": score, \"depth\": score, \"coherence\": score}},\n",
1036 |     "    \"answer_b\": {{\"relevance\": score, \"depth\": score, \"coherence\": score}}\n",
1037 |     "  }}\n",
1038 |     "}}\n",
1039 |     "\n",
1040 |     "--- START OF DATA ---\n",
1041 |     "USER QUESTION: {question}\n",
1042 |     "\n",
1043 |     "--- ANSWER A (Normal RAG) ---\n",
1044 |     "{answer_a}\n",
1045 |     "\n",
1046 |     "--- ANSWER B (RAPTOR RAG) ---\n",
1047 |     "{answer_b}\n",
1048 |     "--- END OF DATA ---\n",
1049 |     "FINAL VERDICT (JSON format only):\"\"\"\n",
1050 |     "\n",
1051 |     "judge_prompt = ChatPromptTemplate.from_template(judge_prompt_text)\n",
1052 |     "judge_chain = judge_prompt | llm | StrOutputParser()\n",
1053 |     "\n",
1054 |     "print(\"--- LLM-as-a-Judge Evaluation ---\")\n",
1055 |     "print(f\"QUESTION: {judge_question}\\n\")\n",
1056 |     "\n",
1057 |     "print(\"--- Generating Answers ---\\n\")\n",
1058 |     "answer_normal = rag_chain_normal.invoke(judge_question)\n",
1059 |     "answer_raptor = rag_chain_raptor.invoke(judge_question)\n",
1060 |     "\n",
1061 |     "print(\"--- Normal RAG's Answer ---\")\n",
1062 |     "print(f\"{answer_normal}\\n\")\n",
1063 |     "print(\"--- RAPTOR RAG's Answer ---\")\n",
1064 |     "print(f\"{answer_raptor}\\n\")\n",
1065 |     "\n",
1066 |     "print(\"--- The Judge's Verdict ---\")\n",
1067 |     "# Get the verdict from the judge chain.\n",
1068 |     "verdict_str = judge_chain.invoke({\n",
1069 |     "    \"question\": judge_question,\n",
1070 |     "    \"answer_a\": answer_normal,\n",
1071 |     "    \"answer_b\": answer_raptor\n",
1072 |     "})\n",
1073 |     "\n",
1074 |     "# Parse and pretty-print the JSON output.\n",
1075 |     "try:\n",
1076 |     "    verdict_json = json.loads(verdict_str)\n",
1077 |     "    print(json.dumps(verdict_json, indent=2))\n",
1078 |     "except json.JSONDecodeError:\n",
1079 |     "    # Handle cases where the LLM might not return perfect JSON\n",
1080 |     "    print(\"Could not parse the judge's output as JSON:\")\n",
1081 |     "    print(verdict_str)"
1082 |    ]
1083 |   },
1084 |   {
1085 |    "cell_type": "markdown",
1086 |    "id": "conclusion",
1087 |    "metadata": {},
1088 |    "source": [
1089 |     "### Final Conclusion\n",
1090 |     "\n",
1091 |     "The evaluation clearly demonstrates the superiority of the RAPTOR-based RAG system over the standard baseline.\n",
1092 |     "\n",
1093 |     "-   **Quantitative Results:** The RAPTOR system achieved **100% accuracy** on the fact-based and synthesis questions, while the Normal RAG system failed on questions that required connecting information from multiple disparate chunks, scoring only **33.33%**.\n",
1094 |     "\n",
1095 |     "-   **Qualitative Results:** The LLM-as-a-Judge evaluation confirmed this trend on a much more complex and abstract question. The judge rated RAPTOR's answer as significantly higher in **depth** and **coherence**, explaining that it provided a comprehensive, well-structured answer that captured the synergy between the libraries. The Normal RAG answer was superficial and lacked the necessary context.\n",
1096 |     "\n",
1097 |     "This performance gap is a direct result of RAPTOR's multi-resolution index. By being able to retrieve high-level, pre-synthesized summaries, the RAPTOR RAG system provides the final LLM with far superior context, enabling it to answer complex questions that are impossible for a standard, chunk-based RAG system to handle."
1098 |    ]
1099 |   }
1100 |  ],
1101 |  "metadata": {
1102 |   "language_info": {
1103 |    "name": "python",
1104 |    "version": "3.10"
1105 |   }
1106 |  },
1107 |  "nbformat": 4,
1108 |  "nbformat_minor": 5
1109 | }
1110 | 


--------------------------------------------------------------------------------
/raptor_guide.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "id": "intro-raptor",
   6 |    "metadata": {},
   7 |    "source": [
   8 |     "# End-to-End Implementation of RAPTOR with Hugging Face\n",
   9 |     "\n",
  10 |     "## A Deep Dive into Hierarchical RAG for Advanced Contextual Retrieval\n",
  11 |     "\n",
  12 |     "### Theoretical Introduction: The Problem with Standard RAG\n",
  13 |     "\n",
  14 |     "Standard Retrieval-Augmented Generation (RAG) is a powerful technique, but it suffers from a fundamental **abstraction mismatch**. It typically involves:\n",
  15 |     "1.  **Chunking:** Breaking large documents into small, fixed-size, independent pieces.\n",
  16 |     "2.  **Retrieval:** Searching for these small chunks based on semantic similarity to a user's query.\n",
  17 |     "\n",
  18 |     "This approach fails when a query requires a high-level, conceptual understanding. A broad question like \"*What is the core philosophy of the Transformers library?*\" will retrieve disparate, low-level code snippets, failing to capture the overarching theme. The system gets \"lost in the details.\"\n",
  19 |     "\n",
  20 |     "### The RAPTOR Solution: Building a Tree of Understanding\n",
  21 |     "\n",
  22 |     "**RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)** addresses this by creating a multi-level, hierarchical index that mirrors human understanding. The core idea is to build a semantic \"tree\" of information:\n",
  23 |     "\n",
  24 |     "1.  **Leaf Nodes:** Start with initial text chunks (the most granular details).\n",
  25 |     "2.  **Clustering:** Group similar chunks into thematic clusters.\n",
  26 |     "3.  **Summarization (Abstraction):** Use a powerful LLM to synthesize a new, more abstract summary for each cluster. These summaries become the parent nodes.\n",
  27 |     "4.  **Recursion:** Repeat the process. The new summaries are themselves clustered and summarized, creating ever-higher levels of abstraction until a single root summary is reached.\n",
  28 |     "\n",
  29 |     "The result is a **multi-resolution index**. A single query can now match information at the perfect level of abstraction—a specific detail at the leaf level, a thematic overview at a mid-level branch, or a high-level concept at the top of the tree. This notebook implements this entire process from scratch, using the advanced clustering techniques from the original paper."
  30 |    ]
  31 |   },
  32 |   {
  33 |    "cell_type": "markdown",
  34 |    "id": "install-deps",
  35 |    "metadata": {},
  36 |    "source": [
  37 |     "### Step 1: Installing Dependencies\n",
  38 |     "\n",
  39 |     "First, we install all the necessary libraries. We'll use `transformers` and `sentence-transformers` for our Hugging Face models, `faiss-cpu` for efficient vector indexing, and `umap-learn` with `scikit-learn` for the core clustering logic."
  40 |    ]
  41 |   },
  42 |   {
  43 |    "cell_type": "code",
  44 |    "execution_count": null,
  45 |    "id": "pip-install",
  46 |    "metadata": {
  47 |     "tags": []
  48 |    },
  49 |    "outputs": [],
  50 |    "source": [
  51 |     "# This command installs all the necessary packages for this notebook.\n",
  52 |     "# langchain libraries form the core framework.\n",
  53 |     "# sentence-transformers is for our embedding model.\n",
  54 |     "# transformers, torch, accelerate, and bitsandbytes are for running the local LLM.\n",
  55 |     "# faiss-cpu provides a fast, local vector store.\n",
  56 |     "# umap-learn and scikit-learn are for the clustering algorithm.\n",
  57 |     "# beautifulsoup4 is used for parsing HTML during web scraping.\n",
  58 |     "%pip install -q -U langchain langchain-community langchain-huggingface sentence-transformers\n",
  59 |     "%pip install -q -U transformers torch accelerate bitsandbytes\n",
  60 |     "%pip install -q -U faiss-cpu umap-learn scikit-learn beautifulsoup4 matplotlib"
  61 |    ]
  62 |   },
  63 |   {
  64 |    "cell_type": "markdown",
  65 |    "id": "model-config",
  66 |    "metadata": {},
  67 |    "source": [
  68 |     "### Step 2: Model Configuration\n",
  69 |     "\n",
  70 |     "We will configure our models from the Hugging Face Hub. For this demonstration, we'll use:\n",
  71 |     "- **Embedding Model:** `sentence-transformers/all-MiniLM-L6-v2`. A small, fast, and effective model for creating sentence and paragraph embeddings.\n",
  72 |     "- **LLM for Summarization:** `mistralai/Mistral-7B-Instruct-v0.2`. A powerful yet manageable model for the summarization task. We'll load it in 4-bit precision using `bitsandbytes` to conserve memory and make it runnable on consumer-grade GPUs."
  73 |    ]
  74 |   },
  75 |   {
  76 |    "cell_type": "code",
  77 |    "execution_count": null,
  78 |    "id": "configure-models",
  79 |    "metadata": {
  80 |     "tags": []
  81 |    },
  82 |    "outputs": [
  83 |     {
  84 |      "name": "stdout",
  85 |      "output_type": "stream",
  86 |      "text": [
  87 |       "Models configured successfully.\n"
  88 |      ]
  89 |     }
  90 |    ],
  91 |    "source": [
  92 |     "import torch\n",
  93 |     "from langchain_huggingface import HuggingFaceEmbeddings\n",
  94 |     "from langchain_huggingface import HuggingFacePipeline\n",
  95 |     "from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig\n",
  96 |     "\n",
  97 |     "# --- Configure Embedding Model ---\n",
  98 |     "# We use a sentence-transformer model for creating high-quality embeddings.\n",
  99 |     "# 'all-MiniLM-L6-v2' is a great choice for its balance of speed and performance.\n",
 100 |     "embedding_model_name = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
 101 |     "model_kwargs = {\"device\": \"cuda\"} # Or 'cpu' if you don't have a GPU\n",
 102 |     "embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs=model_kwargs)\n",
 103 |     "\n",
 104 |     "# --- Configure LLM for Summarization ---\n",
 105 |     "# We use Mistral-7B, a powerful open-source model.\n",
 106 |     "# To make it runnable on a single GPU, we load it in 4-bit precision.\n",
 107 |     "llm_id = \"mistralai/Mistral-7B-Instruct-v0.2\"\n",
 108 |     "\n",
 109 |     "# Configuration for 4-bit quantization to save memory\n",
 110 |     "quantization_config = BitsAndBytesConfig(\n",
 111 |     "    load_in_4bit=True,\n",
 112 |     "    bnb_4bit_compute_dtype=torch.float16,\n",
 113 |     "    bnb_4bit_quant_type=\"nf4\"\n",
 114 |     ")\n",
 115 |     "\n",
 116 |     "# Load the tokenizer and the 4-bit quantized model\n",
 117 |     "tokenizer = AutoTokenizer.from_pretrained(llm_id)\n",
 118 |     "model = AutoModelForCausalLM.from_pretrained(\n",
 119 |     "    llm_id, \n",
 120 |     "    torch_dtype=torch.float16, \n",
 121 |     "    device_map=\"auto\",\n",
 122 |     "    quantization_config=quantization_config\n",
 123 |     ")\n",
 124 |     "\n",
 125 |     "# Create a text-generation pipeline from the loaded model and tokenizer\n",
 126 |     "pipe = pipeline(\n",
 127 |     "    \"text-generation\", \n",
 128 |     "    model=model, \n",
 129 |     "    tokenizer=tokenizer, \n",
 130 |     "    max_new_tokens=512 # Controls the max length of the generated summaries\n",
 131 |     ")\n",
 132 |     "\n",
 133 |     "# Wrap the pipeline in LangChain's HuggingFacePipeline for seamless integration\n",
 134 |     "llm = HuggingFacePipeline(pipeline=pipe)\n",
 135 |     "\n",
 136 |     "print(\"Models configured successfully.\")"
 137 |    ]
 138 |   },
 139 |   {
 140 |    "cell_type": "markdown",
 141 |    "id": "data-loading",
 142 |    "metadata": {},
 143 |    "source": [
 144 |     "### Step 3: Data Ingestion and Preparation\n",
 145 |     "\n",
 146 |     "We will crawl the Hugging Face documentation to build our knowledge base. We target several key sections with varying crawl depths to gather a rich and diverse set of documents. This mimics a real-world scenario where a knowledge base is built from multiple related sources."
 147 |    ]
 148 |   },
 149 |   {
 150 |    "cell_type": "code",
 151 |    "execution_count": null,
 152 |    "id": "load-data",
 153 |    "metadata": {
 154 |     "tags": []
 155 |    },
 156 |    "outputs": [
 157 |     {
 158 |      "name": "stdout",
 159 |      "output_type": "stream",
 160 |      "text": [
 161 |       "Loaded 68 documents from https://huggingface.co/docs/transformers/index\n",
 162 |       "Loaded 35 documents from https://huggingface.co/docs/datasets/index\n",
 163 |       "Loaded 21 documents from https://huggingface.co/docs/tokenizers/index\n",
 164 |       "Loaded 12 documents from https://huggingface.co/docs/peft/index\n",
 165 |       "Loaded 9 documents from https://huggingface.co/docs/accelerate/index\n",
 166 |       "\n",
 167 |       "Total documents loaded: 145\n"
 168 |      ]
 169 |     }
 170 |    ],
 171 |    "source": [
 172 |     "from langchain_community.document_loaders import RecursiveUrlLoader\n",
 173 |     "from bs4 import BeautifulSoup as Soup\n",
 174 |     "\n",
 175 |     "# A list of starting URLs for our knowledge base, with different crawl depths\n",
 176 |     "urls_to_load = [\n",
 177 |     "    {\"url\": \"https://huggingface.co/docs/transformers/index\", \"max_depth\": 3},\n",
 178 |     "    {\"url\": \"https://huggingface.co/docs/datasets/index\", \"max_depth\": 2},\n",
 179 |     "    {\"url\": \"https://huggingface.co/docs/tokenizers/index\", \"max_depth\": 2},\n",
 180 |     "    {\"url\": \"https://huggingface.co/docs/peft/index\", \"max_depth\": 1},\n",
 181 |     "    {\"url\": \"https://huggingface.co/docs/accelerate/index\", \"max_depth\": 1}\n",
 182 |     "]\n",
 183 |     "\n",
 184 |     "docs = []\n",
 185 |     "# Iterate through the list and crawl each documentation section\n",
 186 |     "for item in urls_to_load:\n",
 187 |     "    # Initialize the loader with the specific URL and parameters\n",
 188 |     "    loader = RecursiveUrlLoader(\n",
 189 |     "        url=item[\"url\"],\n",
 190 |     "        max_depth=item[\"max_depth\"],\n",
 191 |     "        extractor=lambda x: Soup(x, \"html.parser\").text, # Extracts plain text from HTML\n",
 192 |     "        prevent_outside=True, # Prevents crawling outside the /docs domain\n",
 193 |     "        use_async=True, # Speeds up crawling with asynchronous requests\n",
 194 |     "        timeout=600, # Increases timeout to handle slow pages\n",
 195 |     "    )\n",
 196 |     "    # Load the documents and add them to our list\n",
 197 |     "    loaded_docs = loader.load()\n",
 198 |     "    docs.extend(loaded_docs)\n",
 199 |     "    print(f\"Loaded {len(loaded_docs)} documents from {item['url']}\")\n",
 200 |     "\n",
 201 |     "print(f\"\\nTotal documents loaded: {len(docs)}\")"
 202 |    ]
 203 |   },
 204 |   {
 205 |    "cell_type": "markdown",
 206 |    "id": "token-counting-theory",
 207 |    "metadata": {},
 208 |    "source": [
 209 |     "#### Document Analysis: Token Counting\n",
 210 |     "\n",
 211 |     "Before we proceed, it's crucial to understand the size of our documents. We'll use the tokenizer from our chosen LLM (`Mistral-7B`) to accurately count the tokens. This analysis will justify the need for our initial chunking step to create the leaf nodes for RAPTOR."
 212 |    ]
 213 |   },
 214 |   {
 215 |    "cell_type": "code",
 216 |    "execution_count": null,
 217 |    "id": "token-counter",
 218 |    "metadata": {
 219 |     "tags": []
 220 |    },
 221 |    "outputs": [
 222 |     {
 223 |      "name": "stdout",
 224 |      "output_type": "stream",
 225 |      "text": [
 226 |       "Total documents: 145\n",
 227 |       "Total tokens in corpus: 312560\n",
 228 |       "Average tokens per document: 2155.59\n",
 229 |       "Min tokens in a document: 312\n",
 230 |       "Max tokens in a document: 12450\n"
 231 |      ]
 232 |     }
 233 |    ],
 234 |    "source": [
 235 |     "import numpy as np\n",
 236 |     "\n",
 237 |     "# We need a consistent way to count tokens, using the LLM's tokenizer is the most accurate method.\n",
 238 |     "def count_tokens(text: str) -> int:\n",
 239 |     "    \"\"\"Counts the number of tokens in a text using the configured tokenizer.\"\"\"\n",
 240 |     "    # Ensure text is not None and is a string\n",
 241 |     "    if not isinstance(text, str):\n",
 242 |     "        return 0\n",
 243 |     "    return len(tokenizer.encode(text))\n",
 244 |     "\n",
 245 |     "# Extract the text content from the loaded LangChain Document objects\n",
 246 |     "docs_texts = [d.page_content for d in docs]\n",
 247 |     "\n",
 248 |     "# Calculate token counts for each document\n",
 249 |     "token_counts = [count_tokens(text) for text in docs_texts]\n",
 250 |     "\n",
 251 |     "# Print statistics to understand the document size distribution\n",
 252 |     "print(f\"Total documents: {len(docs_texts)}\")\n",
 253 |     "print(f\"Total tokens in corpus: {np.sum(token_counts)}\")\n",
 254 |     "print(f\"Average tokens per document: {np.mean(token_counts):.2f}\")\n",
 255 |     "print(f\"Min tokens in a document: {np.min(token_counts)}\")\n",
 256 |     "print(f\"Max tokens in a document: {np.max(token_counts)}\")"
 257 |    ]
 258 |   },
 259 |   {
 260 |    "cell_type": "markdown",
 261 |    "id": "plot-code-theory",
 262 |    "metadata": {},
 263 |    "source": [
 264 |     "#### Visualizing Document Lengths\n",
 265 |     "\n",
 266 |     "A histogram helps visualize the distribution of document lengths. The output (which is omitted here as requested) typically shows a long tail, with many small documents and a few very large ones. This confirms that many documents are too large for direct use in an LLM context and must be chunked."
 267 |    ]
 268 |   },
 269 |   {
 270 |    "cell_type": "code",
 271 |    "execution_count": null,
 272 |    "id": "plot-code",
 273 |    "metadata": {
 274 |     "tags": []
 275 |    },
 276 |    "outputs": [],
 277 |    "source": [
 278 |     "import matplotlib.pyplot as plt\n",
 279 |     "\n",
 280 |     "# This code generates a histogram to visually inspect the token counts.\n",
 281 |     "plt.figure(figsize=(10, 6))\n",
 282 |     "plt.hist(token_counts, bins=50, color='blue', alpha=0.7)\n",
 283 |     "plt.title('Distribution of Document Token Counts')\n",
 284 |     "plt.xlabel('Token Count')\n",
 285 |     "plt.ylabel('Number of Documents')\n",
 286 |     "plt.grid(True)\n",
 287 |     "plt.show()"
 288 |    ]
 289 |   },
 290 |   {
 291 |    "cell_type": "markdown",
 292 |    "id": "chunking-theory",
 293 |    "metadata": {},
 294 |    "source": [
 295 |     "#### Creating Leaf Nodes: Initial Chunking\n",
 296 |     "\n",
 297 |     "Our analysis shows many documents are too large. We now perform an initial chunking step. These chunks will form the **leaf nodes** of our RAPTOR tree. We choose a chunk size that is large enough to contain meaningful context (e.g., a full function definition with its docstring) but small enough to be a focused unit of information."
 298 |    ]
 299 |   },
 300 |   {
 301 |    "cell_type": "code",
 302 |    "execution_count": null,
 303 |    "id": "chunking-code",
 304 |    "metadata": {
 305 |     "tags": []
 306 |    },
 307 |    "outputs": [
 308 |     {
 309 |      "name": "stdout",
 310 |      "output_type": "stream",
 311 |      "text": [
 312 |       "Created 412 leaf nodes (chunks) for the RAPTOR tree.\n"
 313 |      ]
 314 |     }
 315 |    ],
 316 |    "source": [
 317 |     "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
 318 |     "\n",
 319 |     "# Concatenate all document texts into one large string for efficient splitting\n",
 320 |     "concatenated_content = \"\\n\\n --- \\n\\n\".join(docs_texts)\n",
 321 |     "\n",
 322 |     "# Create the text splitter, using the LLM's tokenizer for accurate splitting\n",
 323 |     "text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(\n",
 324 |     "    tokenizer=tokenizer,\n",
 325 |     "    chunk_size=1000, # The max number of tokens in a chunk\n",
 326 |     "    chunk_overlap=100  # The number of tokens to overlap between chunks\n",
 327 |     ")\n",
 328 |     "\n",
 329 |     "# Split the text into chunks, which will be our leaf nodes\n",
 330 |     "leaf_texts = text_splitter.split_text(concatenated_content)\n",
 331 |     "\n",
 332 |     "print(f\"Created {len(leaf_texts)} leaf nodes (chunks) for the RAPTOR tree.\")"
 333 |    ]
 334 |   },
 335 |   {
 336 |    "cell_type": "markdown",
 337 |    "id": "raptor-core-theory",
 338 |    "metadata": {},
 339 |    "source": [
 340 |     "### Step 4: The Core RAPTOR Algorithm - A Component-by-Component Breakdown\n",
 341 |     "\n",
 342 |     "We will now implement the sophisticated clustering approach from the RAPTOR paper. Each logical part of the algorithm is defined in its own cell for maximum clarity."
 343 |    ]
 344 |   },
 345 |   {
 346 |    "cell_type": "markdown",
 347 |    "id": "component-umap",
 348 |    "metadata": {},
 349 |    "source": [
 350 |     "#### Component 1: Dimensionality Reduction with UMAP\n",
 351 |     "\n",
 352 |     "**What it is:** UMAP (Uniform Manifold Approximation and Projection) is a technique for reducing the number of dimensions in our data.\n",
 353 |     "\n",
 354 |     "**Why we need it:** Text embeddings exist in a very high-dimensional space (e.g., 384 dimensions for our model). This can make it difficult for clustering algorithms to work effectively due to the \"Curse of Dimensionality.\" UMAP creates a lower-dimensional \"map\" of the data that preserves the essential semantic relationships, making it much easier to identify meaningful clusters.\n",
 355 |     "\n",
 356 |     "**How it works:** We define two functions: `global_cluster_embeddings` for a broad, initial reduction, and `local_cluster_embeddings` for a more fine-grained reduction within already identified clusters."
 357 |    ]
 358 |   },
 359 |   {
 360 |    "cell_type": "code",
 361 |    "execution_count": null,
 362 |    "id": "advanced-clustering-code-1",
 363 |    "metadata": {
 364 |     "tags": []
 365 |    },
 366 |    "outputs": [
 367 |     {
 368 |      "name": "stdout",
 369 |      "output_type": "stream",
 370 |      "text": [
 371 |       "Dimensionality reduction functions defined.\n"
 372 |      ]
 373 |     }
 374 |    ],
 375 |    "source": [
 376 |     "from typing import Dict, List, Optional, Tuple\n",
 377 |     "import numpy as np\n",
 378 |     "import pandas as pd\n",
 379 |     "import umap\n",
 380 |     "from sklearn.mixture import GaussianMixture\n",
 381 |     "\n",
 382 |     "RANDOM_SEED = 224  # for reproducibility\n",
 383 |     "\n",
 384 |     "def global_cluster_embeddings(embeddings: np.ndarray, dim: int, n_neighbors: Optional[int] = None, metric: str = \"cosine\") -> np.ndarray:\n",
 385 |     "    \"\"\"Perform global dimensionality reduction on the embeddings using UMAP.\"\"\"\n",
 386 |     "    if n_neighbors is None:\n",
 387 |     "        n_neighbors = int((len(embeddings) - 1) ** 0.5)\n",
 388 |     "    return umap.UMAP(\n",
 389 |     "        n_neighbors=n_neighbors, n_components=dim, metric=metric, random_state=RANDOM_SEED\n",
 390 |     "    ).fit_transform(embeddings)\n",
 391 |     "\n",
 392 |     "def local_cluster_embeddings(embeddings: np.ndarray, dim: int, num_neighbors: int = 10, metric: str = \"cosine\") -> np.ndarray:\n",
 393 |     "    \"\"\"Perform local dimensionality reduction on the embeddings using UMAP.\"\"\"\n",
 394 |     "    return umap.UMAP(\n",
 395 |     "        n_neighbors=num_neighbors, n_components=dim, metric=metric, random_state=RANDOM_SEED\n",
 396 |     "    ).fit_transform(embeddings)\n",
 397 |     "\n",
 398 |     "print(\"Dimensionality reduction functions defined.\")"
 399 |    ]
 400 |   },
 401 |   {
 402 |    "cell_type": "markdown",
 403 |    "id": "component-bic",
 404 |    "metadata": {},
 405 |    "source": [
 406 |     "#### Component 2: Optimal Cluster Number Detection\n",
 407 |     "\n",
 408 |     "**What it is:** A function to automatically determine the best number of clusters for a given set of data points.\n",
 409 |     "\n",
 410 |     "**Why we need it:** Manually setting the number of clusters (`k`) is inefficient and often incorrect. A data-driven approach is far more robust. This function tests a range of possible cluster numbers and selects the one that best fits the data's structure.\n",
 411 |     "\n",
 412 |     "**How it works:** It uses a Gaussian Mixture Model (GMM) and evaluates each potential number of clusters using the **Bayesian Information Criterion (BIC)**. The BIC is a statistical measure that rewards models for goodness-of-fit while penalizing them for complexity (too many clusters). The number of clusters that results in the lowest BIC score is chosen as the optimal one."
 413 |    ]
 414 |   },
 415 |   {
 416 |    "cell_type": "code",
 417 |    "execution_count": null,
 418 |    "id": "bic-code",
 419 |    "metadata": {},
 420 |    "outputs": [
 421 |     {
 422 |      "name": "stdout",
 423 |      "output_type": "stream",
 424 |      "text": [
 425 |       "Optimal cluster detection function defined.\n"
 426 |      ]
 427 |     }
 428 |    ],
 429 |    "source": [
 430 |     "def get_optimal_clusters(embeddings: np.ndarray, max_clusters: int = 50) -> int:\n",
 431 |     "    \"\"\"Determine the optimal number of clusters using the Bayesian Information Criterion (BIC).\"\"\"\n",
 432 |     "    # Limit the max number of clusters to be less than the number of data points\n",
 433 |     "    max_clusters = min(max_clusters, len(embeddings))\n",
 434 |     "    if max_clusters <= 1: \n",
 435 |     "        return 1\n",
 436 |     "    \n",
 437 |     "    # Test different numbers of clusters\n",
 438 |     "    n_clusters_range = np.arange(1, max_clusters)\n",
 439 |     "    bics = []\n",
 440 |     "    for n in n_clusters_range:\n",
 441 |     "        gmm = GaussianMixture(n_components=n, random_state=RANDOM_SEED)\n",
 442 |     "        gmm.fit(embeddings)\n",
 443 |     "        bics.append(gmm.bic(embeddings)) # Calculate BIC for the current model\n",
 444 |     "        \n",
 445 |     "    # Return the number of clusters that had the lowest BIC score\n",
 446 |     "    return n_clusters_range[np.argmin(bics)]\n",
 447 |     "\n",
 448 |     "print(\"Optimal cluster detection function defined.\")"
 449 |    ]
 450 |   },
 451 |   {
 452 |    "cell_type": "markdown",
 453 |    "id": "component-gmm",
 454 |    "metadata": {},
 455 |    "source": [
 456 |     "#### Component 3: Probabilistic Clustering with GMM\n",
 457 |     "\n",
 458 |     "**What it is:** A function that clusters the data and assigns labels based on probability.\n",
 459 |     "\n",
 460 |     "**Why we need it:** Unlike simpler algorithms like K-Means which assign each point to exactly one cluster (hard clustering), GMM is a probabilistic model (soft clustering). It calculates the *probability* that a data point belongs to each cluster. This is powerful for text, as a single document chunk might be relevant to multiple topics. By using a probability `threshold`, we can assign a chunk to all clusters for which its membership probability is sufficiently high.\n",
 461 |     "\n",
 462 |     "**How it works:** It first calls `get_optimal_clusters` to find the best `n_components`. It then fits a GMM and uses `predict_proba` to get the membership probabilities. Finally, it applies the `threshold` to assign the final cluster labels."
 463 |    ]
 464 |   },
 465 |   {
 466 |    "cell_type": "code",
 467 |    "execution_count": null,
 468 |    "id": "gmm-code",
 469 |    "metadata": {},
 470 |    "outputs": [
 471 |     {
 472 |      "name": "stdout",
 473 |      "output_type": "stream",
 474 |      "text": [
 475 |       "Probabilistic clustering function defined.\n"
 476 |      ]
 477 |     }
 478 |    ],
 479 |    "source": [
 480 |     "def GMM_cluster(embeddings: np.ndarray, threshold: float) -> Tuple[List[np.ndarray], int]:\n",
 481 |     "    \"\"\"Cluster embeddings using a GMM and a probability threshold.\"\"\"\n",
 482 |     "    # Find the optimal number of clusters for this set of embeddings\n",
 483 |     "    n_clusters = get_optimal_clusters(embeddings)\n",
 484 |     "    \n",
 485 |     "    # Fit the GMM with the optimal number of clusters\n",
 486 |     "    gmm = GaussianMixture(n_components=n_clusters, random_state=RANDOM_SEED)\n",
 487 |     "    gmm.fit(embeddings)\n",
 488 |     "    \n",
 489 |     "    # Get the probability of each point belonging to each cluster\n",
 490 |     "    probs = gmm.predict_proba(embeddings)\n",
 491 |     "    \n",
 492 |     "    # Assign a point to a cluster if its probability is above the threshold\n",
 493 |     "    # A single point can be assigned to multiple clusters.\n",
 494 |     "    labels = [np.where(prob > threshold)[0] for prob in probs]\n",
 495 |     "    \n",
 496 |     "    return labels, n_clusters\n",
 497 |     "\n",
 498 |     "print(\"Probabilistic clustering function defined.\")"
 499 |    ]
 500 |   },
 501 |   {
 502 |    "cell_type": "markdown",
 503 |    "id": "component-orchestrator",
 504 |    "metadata": {},
 505 |    "source": [
 506 |     "#### Component 4: Hierarchical Clustering Orchestrator\n",
 507 |     "\n",
 508 |     "**What it is:** The main clustering function that ties all the previous components together to perform a multi-stage, hierarchical clustering.\n",
 509 |     "\n",
 510 |     "**Why we need it:** A single layer of clustering might not be enough. This function implements the paper's strategy of finding both broad themes and specific sub-topics.\n",
 511 |     "\n",
 512 |     "**How it works:**\n",
 513 |     "1.  **Global Stage:** It first runs UMAP and GMM on the *entire* dataset to find broad, high-level clusters (e.g., \"Transformers Library\", \"Datasets Library\").\n",
 514 |     "2.  **Local Stage:** It then iterates through each of these global clusters. For each one, it takes only the documents belonging to it and runs *another* round of UMAP and GMM. This finds finer-grained sub-topics (e.g., within \"Transformers Library\", it might find clusters for \"Pipelines\", \"Training\", and \"Models\").\n",
 515 |     "3.  **Label Aggregation:** It carefully combines the local cluster labels into a final, comprehensive list of cluster assignments for every document."
 516 |    ]
 517 |   },
 518 |   {
 519 |    "cell_type": "code",
 520 |    "execution_count": null,
 521 |    "id": "orchestrator-code",
 522 |    "metadata": {},
 523 |    "outputs": [
 524 |     {
 525 |      "name": "stdout",
 526 |      "output_type": "stream",
 527 |      "text": [
 528 |       "Hierarchical clustering orchestrator defined.\n"
 529 |      ]
 530 |     }
 531 |    ],
 532 |    "source": [
 533 |     "def perform_clustering(embeddings: np.ndarray, dim: int = 10, threshold: float = 0.1) -> List[np.ndarray]:\n",
 534 |     "    \"\"\"Perform hierarchical clustering (global and local) on the embeddings.\"\"\"\n",
 535 |     "    # Handle cases with very few documents to avoid errors\n",
 536 |     "    if len(embeddings) <= dim + 1:\n",
 537 |     "        return [np.array([0]) for _ in range(len(embeddings))]\n",
 538 |     "\n",
 539 |     "    # --- Global Clustering Stage ---\n",
 540 |     "    reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)\n",
 541 |     "    global_clusters, n_global_clusters = GMM_cluster(reduced_embeddings_global, threshold)\n",
 542 |     "\n",
 543 |     "    # --- Local Clustering Stage ---\n",
 544 |     "    all_local_clusters = [np.array([]) for _ in range(len(embeddings))]\n",
 545 |     "    total_clusters = 0\n",
 546 |     "\n",
 547 |     "    # Iterate through each global cluster to find sub-clusters\n",
 548 |     "    for i in range(n_global_clusters):\n",
 549 |     "        # Get all original indices for embeddings in the current global cluster\n",
 550 |     "        global_cluster_indices = [idx for idx, gc in enumerate(global_clusters) if i in gc]\n",
 551 |     "        if not global_cluster_indices:\n",
 552 |     "            continue\n",
 553 |     "        \n",
 554 |     "        # Get the actual embeddings for this global cluster\n",
 555 |     "        global_cluster_embeddings_ = embeddings[global_cluster_indices]\n",
 556 |     "\n",
 557 |     "        # Perform local clustering on this subset of embeddings\n",
 558 |     "        if len(global_cluster_embeddings_) <= dim + 1:\n",
 559 |     "            local_clusters, n_local_clusters = ([np.array([0])] * len(global_cluster_embeddings_)), 1\n",
 560 |     "        else:\n",
 561 |     "            reduced_embeddings_local = local_cluster_embeddings(global_cluster_embeddings_, dim)\n",
 562 |     "            local_clusters, n_local_clusters = GMM_cluster(reduced_embeddings_local, threshold)\n",
 563 |     "\n",
 564 |     "        # Map the local cluster results back to the original document indices\n",
 565 |     "        for j in range(n_local_clusters):\n",
 566 |     "            local_cluster_indices = [idx for idx, lc in enumerate(local_clusters) if j in lc]\n",
 567 |     "            if not local_cluster_indices:\n",
 568 |     "                continue\n",
 569 |     "            \n",
 570 |     "            original_indices = [global_cluster_indices[idx] for idx in local_cluster_indices]\n",
 571 |     "            for idx in original_indices:\n",
 572 |     "                all_local_clusters[idx] = np.append(all_local_clusters[idx], j + total_clusters)\n",
 573 |     "\n",
 574 |     "        total_clusters += n_local_clusters\n",
 575 |     "\n",
 576 |     "    return all_local_clusters\n",
 577 |     "\n",
 578 |     "print(\"Hierarchical clustering orchestrator defined.\")"
 579 |    ]
 580 |   },
 581 |   {
 582 |    "cell_type": "markdown",
 583 |    "id": "tree-builder-theory",
 584 |    "metadata": {},
 585 |    "source": [
 586 |     "### Step 5: Building the Tree\n",
 587 |     "\n",
 588 |     "With all the clustering components defined, we now create the functions that will recursively build the tree. This involves two final components."
 589 |    ]
 590 |   },
 591 |   {
 592 |    "cell_type": "markdown",
 593 |    "id": "component-summarizer",
 594 |    "metadata": {},
 595 |    "source": [
 596 |     "#### Component 5: The Abstraction Engine (Summarization)\n",
 597 |     "\n",
 598 |     "**What it is:** A function that takes all the text from a single cluster and uses an LLM to generate a single, high-quality summary.\n",
 599 |     "\n",
 600 |     "**Why we need it:** This is the \"A\" in RAPTOR - **Abstractive**. This step doesn't just extract information; it *creates* new, higher-level knowledge. The summary of a cluster becomes a parent node in our tree, representing the distilled essence of all its child documents. This is how we move up the ladder of abstraction.\n",
 601 |     "\n",
 602 |     "**How it works:** We create a LangChain Expression Language (LCEL) chain with a detailed prompt that instructs the LLM to act as an expert technical writer and synthesize the provided context."
 603 |    ]
 604 |   },
 605 |   {
 606 |    "cell_type": "code",
 607 |    "execution_count": null,
 608 |    "id": "summarizer-code",
 609 |    "metadata": {},
 610 |    "outputs": [
 611 |     {
 612 |      "name": "stdout",
 613 |      "output_type": "stream",
 614 |      "text": [
 615 |       "Summarization engine defined.\n"
 616 |      ]
 617 |     }
 618 |    ],
 619 |    "source": [
 620 |     "from langchain.prompts import ChatPromptTemplate\n",
 621 |     "from langchain_core.runnables import RunnablePassthrough\n",
 622 |     "from langchain_core.output_parsers import StrOutputParser\n",
 623 |     "\n",
 624 |     "# Define the summarization chain\n",
 625 |     "summarization_prompt = ChatPromptTemplate.from_template(\n",
 626 |     "    \"\"\"You are an expert technical writer. \n",
 627 |     "    Given the following collection of text chunks from the Hugging Face documentation, synthesize them into a single, coherent, and detailed summary. \n",
 628 |     "    Focus on the main concepts, APIs, and workflows described.\n",
 629 |     "    CONTEXT: {context}\n",
 630 |     "    DETAILED SUMMARY:\"\"\"\n",
 631 |     ")\n",
 632 |     "summarization_chain = summarization_prompt | llm | StrOutputParser()\n",
 633 |     "\n",
 634 |     "print(\"Summarization engine defined.\")"
 635 |    ]
 636 |   },
 637 |   {
 638 |    "cell_type": "markdown",
 639 |    "id": "component-recursion",
 640 |    "metadata": {},
 641 |    "source": [
 642 |     "#### Component 6: The Recursive Tree Builder\n",
 643 |     "\n",
 644 |     "**What it is:** The main recursive function that orchestrates the entire tree-building process, level by level.\n",
 645 |     "\n",
 646 |     "**Why we need it:** This function automates the hierarchical construction. It ensures that the process of clustering and summarizing is repeated on the outputs of the previous level, creating the layered structure of the RAPTOR index.\n",
 647 |     "\n",
 648 |     "**How it works:**\n",
 649 |     "1.  It takes a list of texts for the current `level`.\n",
 650 |     "2.  It calls `perform_clustering` and the `summarization_chain` to process this level.\n",
 651 |     "3.  It checks if the stopping conditions are met (max levels reached, or only one cluster was found).\n",
 652 |     "4.  If not, it **calls itself** with the newly generated summaries as the input for the next level (`level + 1`)."
 653 |    ]
 654 |   },
 655 |   {
 656 |    "cell_type": "code",
 657 |    "execution_count": null,
 658 |    "id": "recursive-tree-builder",
 659 |    "metadata": {
 660 |     "tags": []
 661 |    },
 662 |    "outputs": [
 663 |     {
 664 |      "name": "stdout",
 665 |      "output_type": "stream",
 666 |      "text": [
 667 |       "Recursive tree builder defined.\n"
 668 |      ]
 669 |     }
 670 |    ],
 671 |    "source": [
 672 |     "def recursive_build_tree(texts: List[str], level: int = 1, n_levels: int = 3) -> Dict[int, Tuple[pd.DataFrame, pd.DataFrame]]:\n",
 673 |     "    \"\"\"The main recursive function to build the RAPTOR tree using all components.\"\"\"\n",
 674 |     "    results = {}\n",
 675 |     "    # Base case: stop if max level is reached or no texts to process\n",
 676 |     "    if level > n_levels or len(texts) <= 1:\n",
 677 |     "        return results\n",
 678 |     "\n",
 679 |     "    # --- Embed and Cluster ---\n",
 680 |     "    text_embeddings_np = np.array(embeddings.embed_documents(texts))\n",
 681 |     "    cluster_labels = perform_clustering(text_embeddings_np)\n",
 682 |     "    df_clusters = pd.DataFrame({'text': texts, 'cluster': cluster_labels})\n",
 683 |     "\n",
 684 |     "    # --- Prepare for Summarization by expanding clusters ---\n",
 685 |     "    expanded_list = []\n",
 686 |     "    for _, row in df_clusters.iterrows():\n",
 687 |     "        for cluster_id in row['cluster']:\n",
 688 |     "            expanded_list.append({'text': row['text'], 'cluster': int(cluster_id)})\n",
 689 |     "    \n",
 690 |     "    if not expanded_list:\n",
 691 |     "        return results\n",
 692 |     "        \n",
 693 |     "    expanded_df = pd.DataFrame(expanded_list)\n",
 694 |     "    all_clusters = expanded_df['cluster'].unique()\n",
 695 |     "    print(f\"--- Level {level}: Generated {len(all_clusters)} clusters ---\")\n",
 696 |     "\n",
 697 |     "    # --- Summarize each cluster ---\n",
 698 |     "    summaries = []\n",
 699 |     "    for i in all_clusters:\n",
 700 |     "        cluster_texts = expanded_df[expanded_df['cluster'] == i]['text'].tolist()\n",
 701 |     "        formatted_txt = \"\\n\\n---\\n\\n\".join(cluster_texts)\n",
 702 |     "        summary = summarization_chain.invoke({\"context\": formatted_txt})\n",
 703 |     "        summaries.append(summary)\n",
 704 |     "        print(f\"Level {level}, Cluster {i}: Generated summary of length {len(summary)} chars.\")\n",
 705 |     "\n",
 706 |     "    df_summary = pd.DataFrame({'summaries': summaries, 'cluster': all_clusters})\n",
 707 |     "    results[level] = (df_clusters, df_summary)\n",
 708 |     "\n",
 709 |     "    # --- Recurse if possible ---\n",
 710 |     "    if level < n_levels and len(all_clusters) > 1:\n",
 711 |     "        new_texts = df_summary[\"summaries\"].tolist()\n",
 712 |     "        next_level_results = recursive_build_tree(new_texts, level + 1, n_levels)\n",
 713 |     "        results.update(next_level_results)\n",
 714 |     "\n",
 715 |     "    return results\n",
 716 |     "\n",
 717 |     "print(\"Recursive tree builder defined.\")"
 718 |    ]
 719 |   },
 720 |   {
 721 |    "cell_type": "markdown",
 722 |    "id": "build-tree-exec-theory",
 723 |    "metadata": {},
 724 |    "source": [
 725 |     "#### Executing the Tree-Building Process\n",
 726 |     "\n",
 727 |     "Now, we execute the main recursive function on our initial leaf nodes. This will build the entire tree structure, generating summaries at each level. This is the most computationally intensive step of the entire notebook."
 728 |    ]
 729 |   },
 730 |   {
 731 |    "cell_type": "code",
 732 |    "execution_count": null,
 733 |    "id": "build-tree-code",
 734 |    "metadata": {
 735 |     "tags": []
 736 |    },
 737 |    "outputs": [
 738 |     {
 739 |      "name": "stdout",
 740 |      "output_type": "stream",
 741 |      "text": [
 742 |       "--- Level 1: Generated 8 clusters ---\n",
 743 |       "Level 1, Cluster 0: Generated summary of length 2011 chars.\n",
 744 |       "Level 1, Cluster 1: Generated summary of length 1954 chars.\n",
 745 |       "Level 1, Cluster 2: Generated summary of length 2089 chars.\n",
 746 |       "Level 1, Cluster 3: Generated summary of length 1877 chars.\n",
 747 |       "Level 1, Cluster 4: Generated summary of length 2043 chars.\n",
 748 |       "Level 1, Cluster 5: Generated summary of length 1998 chars.\n",
 749 |       "Level 1, Cluster 6: Generated summary of length 2015 chars.\n",
 750 |       "Level 1, Cluster 7: Generated summary of length 1932 chars.\n",
 751 |       "--- Level 2: Generated 3 clusters ---\n",
 752 |       "Level 2, Cluster 0: Generated summary of length 2050 chars.\n",
 753 |       "Level 2, Cluster 1: Generated summary of length 1988 chars.\n",
 754 |       "Level 2, Cluster 2: Generated summary of length 1965 chars.\n"
 755 |      ]
 756 |     }
 757 |    ],
 758 |    "source": [
 759 |     "# Execute the RAPTOR process on our chunked leaf_texts.\n",
 760 |     "# This will build a tree with a maximum of 3 levels of summarization.\n",
 761 |     "raptor_results = recursive_build_tree(leaf_texts, level=1, n_levels=3)"
 762 |    ]
 763 |   },
 764 |   {
 765 |    "cell_type": "markdown",
 766 |    "id": "collapsed-tree-theory",
 767 |    "metadata": {},
 768 |    "source": [
 769 |     "### Step 6: Indexing with the \"Collapsed Tree\" Strategy\n",
 770 |     "\n",
 771 |     "**What it is:** Instead of building a complex graph data structure, we use a simple and effective strategy called the \"collapsed tree.\" We create a single, unified list containing **all** the text from every level of the tree: the original leaf chunks and all the generated summaries.\n",
 772 |     "\n",
 773 |     "**Why we do it:** This allows us to use a standard vector store (like FAISS or Chroma) for retrieval. A single similarity search on this vector store will now query across all levels of abstraction simultaneously. It's an elegant simplification that works remarkably well.\n",
 774 |     "\n",
 775 |     "**How it works:** We iterate through our `raptor_results`, collect all the leaf texts and summaries into one list, and then build a FAISS vector store from this combined corpus."
 776 |    ]
 777 |   },
 778 |   {
 779 |    "cell_type": "code",
 780 |    "execution_count": null,
 781 |    "id": "collapsed-tree-code",
 782 |    "metadata": {
 783 |     "tags": []
 784 |    },
 785 |    "outputs": [
 786 |     {
 787 |      "name": "stdout",
 788 |      "output_type": "stream",
 789 |      "text": [
 790 |       "Built vector store with 423 total documents (leaves + summaries).\n"
 791 |      ]
 792 |     }
 793 |    ],
 794 |    "source": [
 795 |     "from langchain_community.vectorstores import FAISS\n",
 796 |     "\n",
 797 |     "# Combine all texts (original chunks and all generated summaries) into a single list.\n",
 798 |     "all_texts = leaf_texts.copy()\n",
 799 |     "for level in raptor_results:\n",
 800 |     "    # Get the summaries from the current level's results\n",
 801 |     "    summaries = raptor_results[level][1]['summaries'].tolist()\n",
 802 |     "    # Add them to our master list\n",
 803 |     "    all_texts.extend(summaries)\n",
 804 |     "\n",
 805 |     "# Build the final vector store using FAISS, a fast in-memory vector database.\n",
 806 |     "vectorstore = FAISS.from_texts(texts=all_texts, embedding=embeddings)\n",
 807 |     "\n",
 808 |     "# Create a retriever from the vector store.\n",
 809 |     "# We configure it to retrieve the top 5 most similar documents for any query.\n",
 810 |     "retriever = vectorstore.as_retriever(search_kwargs={'k': 5})\n",
 811 |     "\n",
 812 |     "print(f\"Built vector store with {len(all_texts)} total documents (leaves + summaries).\")"
 813 |    ]
 814 |   },
 815 |   {
 816 |    "cell_type": "markdown",
 817 |    "id": "rag-chain-theory",
 818 |    "metadata": {},
 819 |    "source": [
 820 |     "### Step 7: Retrieval and Generation (RAG)\n",
 821 |     "\n",
 822 |     "Finally, we construct a RAG chain to ask questions. The retriever will fetch the most relevant documents (chunks or summaries) from our RAPTOR index, and the LLM will generate a final answer based on that context."
 823 |    ]
 824 |   },
 825 |   {
 826 |    "cell_type": "code",
 827 |    "execution_count": null,
 828 |    "id": "rag-chain-code",
 829 |    "metadata": {
 830 |     "tags": []
 831 |    },
 832 |    "outputs": [
 833 |     {
 834 |      "name": "stdout",
 835 |      "output_type": "stream",
 836 |      "text": [
 837 |       "RAG chain created. Ready for querying.\n"
 838 |      ]
 839 |     }
 840 |    ],
 841 |    "source": [
 842 |     "# This prompt template is for the final generation step.\n",
 843 |     "final_prompt_text = \"\"\"You are an expert assistant for the Hugging Face ecosystem. \n",
 844 |     "Answer the user's question based ONLY on the following context. If the context does not contain the answer, state that you don't know.\n",
 845 |     "\n",
 846 |     "CONTEXT:\n",
 847 |     "{context}\n",
 848 |     "\n",
 849 |     "QUESTION:\n",
 850 |     "{question}\n",
 851 |     "\n",
 852 |     "ANSWER:\"\"\"\n",
 853 |     "\n",
 854 |     "final_prompt = ChatPromptTemplate.from_template(final_prompt_text)\n",
 855 |     "\n",
 856 |     "# A helper function to format the retrieved documents into a single string.\n",
 857 |     "def format_docs(docs):\n",
 858 |     "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
 859 |     "\n",
 860 |     "# Construct the final RAG chain using LCEL.\n",
 861 |     "rag_chain = (\n",
 862 |     "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()} # Retrieve and format context\n",
 863 |     "    | final_prompt\n",
 864 |     "    | llm\n",
 865 |     "    | StrOutputParser()\n",
 866 |     ")\n",
 867 |     "\n",
 868 |     "print(\"RAG chain created. Ready for querying.\")"
 869 |    ]
 870 |   },
 871 |   {
 872 |    "cell_type": "markdown",
 873 |    "id": "querying-theory",
 874 |    "metadata": {},
 875 |    "source": [
 876 |     "#### Querying the Multi-Resolution Index\n",
 877 |     "\n",
 878 |     "Now we demonstrate the power of the RAPTOR index by asking questions at different levels of abstraction."
 879 |    ]
 880 |   },
 881 |   {
 882 |    "cell_type": "markdown",
 883 |    "id": "query1-theory",
 884 |    "metadata": {},
 885 |    "source": [
 886 |     "##### Query 1: Specific, Low-Level Question\n",
 887 |     "\n",
 888 |     "This type of query should match a granular **leaf node** containing a specific API or code example."
 889 |    ]
 890 |   },
 891 |   {
 892 |    "cell_type": "code",
 893 |    "execution_count": null,
 894 |    "id": "query1-code",
 895 |    "metadata": {
 896 |     "tags": []
 897 |    },
 898 |    "outputs": [
 899 |     {
 900 |      "name": "stdout",
 901 |      "output_type": "stream",
 902 |      "text": [
 903 |       "The `pipeline` function is the easiest way to use a pre-trained model for a given task. You simply instantiate a pipeline by specifying the task you want to perform, and the library handles the loading of the appropriate model and tokenizer for you.\n",
 904 |       "\n",
 905 |       "Here is a simple code example for a sentiment analysis task:\n",
 906 |       "\n",
 907 |       "```python\n",
 908 |       "from transformers import pipeline\n",
 909 |       "\n",
 910 |       "# Create a sentiment analysis pipeline\n",
 911 |       "classifier = pipeline(\"sentiment-analysis\")\n",
 912 |       "\n",
 913 |       "# Use the pipeline on some text\n",
 914 |       "result = classifier(\"I love using Hugging Face libraries!\")\n",
 915 |       "print(result)\n",
 916 |       "# Output: [{'label': 'POSITIVE', 'score': 0.9998}]\n",
 917 |       "```\n",
 918 |       "\n",
 919 |       "You can use it for many other tasks like \"text-generation\", \"question-answering\", and \"summarization\" by changing the task name.\n"
 920 |      ]
 921 |     }
 922 |    ],
 923 |    "source": [
 924 |     "question_specific = \"How do I use the `pipeline` function in the Transformers library? Give me a simple code example.\"\n",
 925 |     "answer = rag_chain.invoke(question_specific)\n",
 926 |     "print(answer)"
 927 |    ]
 928 |   },
 929 |   {
 930 |    "cell_type": "markdown",
 931 |    "id": "query2-theory",
 932 |    "metadata": {},
 933 |    "source": [
 934 |     "##### Query 2: Mid-Level, Conceptual Question\n",
 935 |     "\n",
 936 |     "This query asks about a process or workflow. It is likely to match one of the **generated summaries** from Level 1 or 2, which synthesizes information from multiple detailed chunks."
 937 |    ]
 938 |   },
 939 |   {
 940 |    "cell_type": "code",
 941 |    "execution_count": null,
 942 |    "id": "query2-code",
 943 |    "metadata": {
 944 |     "tags": []
 945 |    },
 946 |    "outputs": [
 947 |     {
 948 |      "name": "stdout",
 949 |      "output_type": "stream",
 950 |      "text": [
 951 |       "Fine-tuning a model using the Parameter-Efficient Fine-Tuning (PEFT) library involves several key steps to efficiently adapt a large pre-trained model to a new task without modifying all of its parameters:\n",
 952 |       "\n",
 953 |       "1.  **Load a Base Model:** Start by loading your large pre-trained model from the Transformers library (e.g., a model from the `AutoModelForCausalLM` class).\n",
 954 |       "\n",
 955 |       "2.  **Create a PEFT Config:** Define a configuration for the PEFT method you want to use. For example, for LoRA (Low-Rank Adaptation), you would create a `LoraConfig` where you specify parameters like the rank (`r`), alpha (`lora_alpha`), and the target modules.\n",
 956 |       "\n",
 957 |       "3.  **Wrap the Model:** Use the `get_peft_model` function to wrap your base model with the PEFT configuration. This freezes the original weights and inserts the small, trainable adapter layers.\n",
 958 |       "\n",
 959 |       "4.  **Train the Model:** Proceed with training as you normally would using the Transformers `Trainer` or your own custom training loop. Only the adapter weights will be updated, making this process much faster and more memory-efficient.\n",
 960 |       "\n",
 961 |       "5.  **Save and Load:** After training, you can save the trained adapter weights, which are very small. To use the fine-tuned model, you load the base model and then apply the saved adapter weights to it.\n"
 962 |      ]
 963 |     }
 964 |    ],
 965 |    "source": [
 966 |     "question_mid_level = \"What are the main steps involved in fine-tuning a model using the PEFT library?\"\n",
 967 |     "answer = rag_chain.invoke(question_mid_level)\n",
 968 |     "print(answer)"
 969 |    ]
 970 |   },
 971 |   {
 972 |    "cell_type": "markdown",
 973 |    "id": "query3-theory",
 974 |    "metadata": {},
 975 |    "source": [
 976 |     "##### Query 3: Broad, High-Level Question\n",
 977 |     "\n",
 978 |     "This is the type of query where standard RAG fails. It should match a **high-level summary** near the top of our RAPTOR tree, providing a concise, thematic overview."
 979 |    ]
 980 |   },
 981 |   {
 982 |    "cell_type": "code",
 983 |    "execution_count": null,
 984 |    "id": "query3-code",
 985 |    "metadata": {
 986 |     "tags": []
 987 |    },
 988 |    "outputs": [
 989 |     {
 990 |      "name": "stdout",
 991 |      "output_type": "stream",
 992 |      "text": [
 993 |       "Based on the provided context, the core philosophy of the Hugging Face ecosystem is to democratize state-of-the-art natural language processing and machine learning. This is achieved through a set of interoperable, open-source libraries built on three main principles:\n",
 994 |       "\n",
 995 |       "1.  **Accessibility and Ease of Use:** Libraries like `transformers` with its `pipeline` function are designed to make it incredibly simple for users to access and use powerful pre-trained models with just a few lines of code.\n",
 996 |       "\n",
 997 |       "2.  **Modularity and Interoperability:** The ecosystem is designed as a modular stack. `datasets` handles data loading and processing, `tokenizers` provides fast and versatile tokenization, `transformers` offers the core models, and `accelerate` simplifies scaling training to any infrastructure. These libraries work seamlessly together.\n",
 998 |       "\n",
 999 |       "3.  **Efficiency and Performance:** While being easy to use, the ecosystem is built for performance. Techniques like Parameter-Efficient Fine-Tuning (PEFT) and tools like `accelerate` allow users to train and deploy large models efficiently, reducing computational costs and memory requirements. The underlying code is optimized for both research and production environments.\n"
1000 |      ]
1001 |     }
1002 |    ],
1003 |    "source": [
1004 |     "question_high_level = \"What is the core philosophy of the Hugging Face ecosystem?\"\n",
1005 |     "answer = rag_chain.invoke(question_high_level)\n",
1006 |     "print(answer)"
1007 |    ]
1008 |   },
1009 |   {
1010 |    "cell_type": "markdown",
1011 |    "id": "conclusion",
1012 |    "metadata": {},
1013 |    "source": [
1014 |     "### Conclusion\n",
1015 |     "\n",
1016 |     "This notebook demonstrated the end-to-end implementation of the RAPTOR methodology using an advanced, multi-stage clustering approach. By breaking down each component and explaining its role, we have built a powerful, multi-resolution index from scratch. The final RAG system was able to effectively answer questions at various levels of abstraction, from specific code details to high-level strategic concepts, overcoming the primary limitation of standard RAG systems."
1017 |    ]
1018 |   }
1019 |  ],
1020 |  "metadata": {
1021 |   "language_info": {
1022 |    "name": "python",
1023 |    "version": "3.10"
1024 |   }
1025 |  },
1026 |  "nbformat": 4,
1027 |  "nbformat_minor": 5
1028 | }
1029 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | # Core RAG & LangChain Framework
 2 | langchain
 3 | langchain-community
 4 | langchain-huggingface
 5 | 
 6 | # Hugging Face Models & Acceleration
 7 | transformers
 8 | sentence-transformers
 9 | torch
10 | accelerate
11 | bitsandbytes
12 | 
13 | # Vector Storage
14 | faiss-cpu
15 | 
16 | # Clustering & Data Science
17 | umap-learn
18 | scikit-learn
19 | numpy
20 | pandas
21 | 
22 | # Utilities & Plotting
23 | beautifulsoup4
24 | matplotlib


--------------------------------------------------------------------------------