├── .dockerignore ├── .env.example ├── .gitattributes ├── .gitignore ├── Dockerfile ├── LICENSE ├── README.md ├── crawled_pages.sql ├── pyproject.toml ├── src │   ├── crawl4ai_mcp.py │   └── utils.py └── uv.lock /.dockerignore: -------------------------------------------------------------------------------- 1 | crawl4ai_mcp.egg-info 2 | __pycache__ 3 | .venv 4 | .env -------------------------------------------------------------------------------- /.env.example: -------------------------------------------------------------------------------- 1 | # The transport for the MCP server - either 'sse' or 'stdio' (defaults to sse if left empty) 2 | TRANSPORT= 3 | 4 | # Host to bind to if using sse as the transport (leave empty if using stdio) 5 | HOST= 6 | 7 | # Port to listen on if using sse as the transport (leave empty if using stdio) 8 | PORT= 9 | 10 | # Get your OpenAI API Key by following these instructions - 11 | # https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key 12 | # This is for the embedding model - text-embedding-3-small will be used 13 | OPENAI_API_KEY= 14 | 15 | # The LLM you want to use for summaries and contextual embeddings 16 | # Generally this is a very cheap and fast LLM like gpt-4.1-nano 17 | MODEL_CHOICE= 18 | 19 | # RAG strategies - set these to "true" or "false" (default to "false") 20 | # USE_CONTEXTUAL_EMBEDDINGS: Enhances embeddings with contextual information for better retrieval 21 | USE_CONTEXTUAL_EMBEDDINGS=false 22 | 23 | # USE_HYBRID_SEARCH: Combines vector similarity search with keyword search for better results 24 | USE_HYBRID_SEARCH=false 25 | 26 | # USE_AGENTIC_RAG: Enables code example extraction, storage, and specialized code search functionality 27 | USE_AGENTIC_RAG=false 28 | 29 | # USE_RERANKING: Applies cross-encoder reranking to improve search result relevance 30 | USE_RERANKING=false 31 | 32 | # Set your Supabase URL and Service Key to connect the server to your vector database. 33 | # Get your SUPABASE_URL from the API section of your Supabase project settings - 34 | # https://supabase.com/dashboard/project/<your project ID>/settings/api 35 | SUPABASE_URL= 36 | 37 | # Get your SUPABASE_SERVICE_KEY from the API section of your Supabase project settings - 38 | # https://supabase.com/dashboard/project/<your project ID>/settings/api 39 | # On this page it is called the service_role secret. 40 | SUPABASE_SERVICE_KEY= -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .env 2 | .venv 3 | __pycache__ 4 | crawl4ai_mcp.egg-info -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.12-slim 2 | 3 | ARG PORT=8051 4 | 5 | WORKDIR /app 6 | 7 | # Install uv 8 | RUN pip install uv 9 | 10 | # Copy the MCP server files 11 | COPY . . 12 | 13 | # Install packages directly to the system (no virtual environment) 14 | # Combining commands to reduce Docker layers 15 | RUN uv pip install --system -e . 
&& \ 16 | crawl4ai-setup 17 | 18 | EXPOSE ${PORT} 19 | 20 | # Command to run the MCP server 21 | CMD ["uv", "run", "src/crawl4ai_mcp.py"] -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Cole Medin 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

Crawl4AI RAG MCP Server
2 | 
3 | 
4 | Web Crawling and RAG Capabilities for AI Agents and AI Coding Assistants
5 | 
6 | 7 | A powerful implementation of the [Model Context Protocol (MCP)](https://modelcontextprotocol.io) integrated with [Crawl4AI](https://crawl4ai.com) and [Supabase](https://supabase.com/) for providing AI agents and AI coding assistants with advanced web crawling and RAG capabilities. 8 | 9 | With this MCP server, you can scrape anything and then use that knowledge anywhere for RAG. 10 | 11 | The primary goal is to bring this MCP server into [Archon](https://github.com/coleam00/Archon) as I evolve it to be more of a knowledge engine for AI coding assistants to build AI agents. This first version of the Crawl4AI/RAG MCP server will be improved upon greatly soon, especially making it more configurable so you can use different embedding models and run everything locally with Ollama. 12 | 13 | ## Overview 14 | 15 | This MCP server provides tools that enable AI agents to crawl websites, store content in a vector database (Supabase), and perform RAG over the crawled content. It follows the best practices for building MCP servers based on the [Mem0 MCP server template](https://github.com/coleam00/mcp-mem0/) I provided on my channel previously. 16 | 17 | The server includes several advanced RAG strategies that can be enabled to enhance retrieval quality: 18 | - **Contextual Embeddings** for enriched semantic understanding 19 | - **Hybrid Search** combining vector and keyword search 20 | - **Agentic RAG** for specialized code example extraction 21 | - **Reranking** for improved result relevance using cross-encoder models 22 | 23 | See the [Configuration section](#configuration) below for details on how to enable and configure these strategies. 24 | 25 | ## Vision 26 | 27 | The Crawl4AI RAG MCP server is just the beginning. Here's where we're headed: 28 | 29 | 1. **Integration with Archon**: Building this system directly into [Archon](https://github.com/coleam00/Archon) to create a comprehensive knowledge engine for AI coding assistants to build better AI agents. 30 | 31 | 2. **Multiple Embedding Models**: Expanding beyond OpenAI to support a variety of embedding models, including the ability to run everything locally with Ollama for complete control and privacy. 32 | 33 | 3. **Advanced RAG Strategies**: Implementing sophisticated retrieval techniques like contextual retrieval, late chunking, and others to move beyond basic "naive lookups" and significantly enhance the power and precision of the RAG system, especially as it integrates with Archon. 34 | 35 | 4. **Enhanced Chunking Strategy**: Implementing a Context 7-inspired chunking approach that focuses on examples and creates distinct, semantically meaningful sections for each chunk, improving retrieval precision. 36 | 37 | 5. **Performance Optimization**: Increasing crawling and indexing speed to make it more realistic to "quickly" index new documentation to then leverage it within the same prompt in an AI coding assistant. 
38 | 39 | ## Features 40 | 41 | - **Smart URL Detection**: Automatically detects and handles different URL types (regular webpages, sitemaps, text files) 42 | - **Recursive Crawling**: Follows internal links to discover content 43 | - **Parallel Processing**: Efficiently crawls multiple pages simultaneously 44 | - **Content Chunking**: Intelligently splits content by headers and size for better processing 45 | - **Vector Search**: Performs RAG over crawled content, optionally filtering by data source for precision 46 | - **Source Retrieval**: Retrieve sources available for filtering to guide the RAG process 47 | 48 | ## Tools 49 | 50 | The server provides essential web crawling and search tools: 51 | 52 | ### Core Tools (Always Available) 53 | 54 | 1. **`crawl_single_page`**: Quickly crawl a single web page and store its content in the vector database 55 | 2. **`smart_crawl_url`**: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively) 56 | 3. **`get_available_sources`**: Get a list of all available sources (domains) in the database 57 | 4. **`perform_rag_query`**: Search for relevant content using semantic search with optional source filtering 58 | 59 | ### Conditional Tools 60 | 61 | 5. **`search_code_examples`** (requires `USE_AGENTIC_RAG=true`): Search specifically for code examples and their summaries from crawled documentation. This tool provides targeted code snippet retrieval for AI coding assistants. 62 | 63 | ## Prerequisites 64 | 65 | - [Docker/Docker Desktop](https://www.docker.com/products/docker-desktop/) if running the MCP server as a container (recommended) 66 | - [Python 3.12+](https://www.python.org/downloads/) if running the MCP server directly through uv 67 | - [Supabase](https://supabase.com/) (database for RAG) 68 | - [OpenAI API key](https://platform.openai.com/api-keys) (for generating embeddings) 69 | 70 | ## Installation 71 | 72 | ### Using Docker (Recommended) 73 | 74 | 1. Clone this repository: 75 | ```bash 76 | git clone https://github.com/coleam00/mcp-crawl4ai-rag.git 77 | cd mcp-crawl4ai-rag 78 | ``` 79 | 80 | 2. Build the Docker image: 81 | ```bash 82 | docker build -t mcp/crawl4ai-rag --build-arg PORT=8051 . 83 | ``` 84 | 85 | 3. Create a `.env` file based on the configuration section below 86 | 87 | ### Using uv directly (no Docker) 88 | 89 | 1. Clone this repository: 90 | ```bash 91 | git clone https://github.com/coleam00/mcp-crawl4ai-rag.git 92 | cd mcp-crawl4ai-rag 93 | ``` 94 | 95 | 2. Install uv if you don't have it: 96 | ```bash 97 | pip install uv 98 | ``` 99 | 100 | 3. Create and activate a virtual environment: 101 | ```bash 102 | uv venv 103 | .venv\Scripts\activate 104 | # on Mac/Linux: source .venv/bin/activate 105 | ``` 106 | 107 | 4. Install dependencies: 108 | ```bash 109 | uv pip install -e . 110 | crawl4ai-setup 111 | ``` 112 | 113 | 5. Create a `.env` file based on the configuration section below 114 | 115 | ## Database Setup 116 | 117 | Before running the server, you need to set up the database with the pgvector extension: 118 | 119 | 1. Go to the SQL Editor in your Supabase dashboard (create a new project first if necessary) 120 | 121 | 2. Create a new query and paste the contents of `crawled_pages.sql` 122 | 123 | 3. 
Run the query to create the necessary tables and functions 124 | 125 | ## Configuration 126 | 127 | Create a `.env` file in the project root with the following variables: 128 | 129 | ``` 130 | # MCP Server Configuration 131 | HOST=0.0.0.0 132 | PORT=8051 133 | TRANSPORT=sse 134 | 135 | # OpenAI API Configuration 136 | OPENAI_API_KEY=your_openai_api_key 137 | 138 | # LLM for summaries and contextual embeddings 139 | MODEL_CHOICE=gpt-4.1-nano 140 | 141 | # RAG Strategies (set to "true" or "false", default to "false") 142 | USE_CONTEXTUAL_EMBEDDINGS=false 143 | USE_HYBRID_SEARCH=false 144 | USE_AGENTIC_RAG=false 145 | USE_RERANKING=false 146 | 147 | # Supabase Configuration 148 | SUPABASE_URL=your_supabase_project_url 149 | SUPABASE_SERVICE_KEY=your_supabase_service_key 150 | ``` 151 | 152 | ### RAG Strategy Options 153 | 154 | The Crawl4AI RAG MCP server supports four powerful RAG strategies that can be enabled independently: 155 | 156 | #### 1. **USE_CONTEXTUAL_EMBEDDINGS** 157 | When enabled, this strategy enhances each chunk's embedding with additional context from the entire document. The system passes both the full document and the specific chunk to an LLM (configured via `MODEL_CHOICE`) to generate enriched context that gets embedded alongside the chunk content. 158 | 159 | - **When to use**: Enable this when you need high-precision retrieval where context matters, such as technical documentation where terms might have different meanings in different sections. 160 | - **Trade-offs**: Slower indexing due to LLM calls for each chunk, but significantly better retrieval accuracy. 161 | - **Cost**: Additional LLM API calls during indexing. 162 | 163 | #### 2. **USE_HYBRID_SEARCH** 164 | Combines traditional keyword search with semantic vector search to provide more comprehensive results. The system performs both searches in parallel and intelligently merges results, prioritizing documents that appear in both result sets. 165 | 166 | - **When to use**: Enable this when users might search using specific technical terms, function names, or when exact keyword matches are important alongside semantic understanding. 167 | - **Trade-offs**: Slightly slower search queries but more robust results, especially for technical content. 168 | - **Cost**: No additional API costs, just computational overhead. 169 | 170 | #### 3. **USE_AGENTIC_RAG** 171 | Enables specialized code example extraction and storage. When crawling documentation, the system identifies code blocks (≥300 characters), extracts them with surrounding context, generates summaries, and stores them in a separate vector database table specifically designed for code search. 172 | 173 | - **When to use**: Essential for AI coding assistants that need to find specific code examples, implementation patterns, or usage examples from documentation. 174 | - **Trade-offs**: Significantly slower crawling due to code extraction and summarization, requires more storage space. 175 | - **Cost**: Additional LLM API calls for summarizing each code example. 176 | - **Benefits**: Provides a dedicated `search_code_examples` tool that AI agents can use to find specific code implementations. 177 | 178 | #### 4. **USE_RERANKING** 179 | Applies cross-encoder reranking to search results after initial retrieval. Uses a lightweight cross-encoder model (`cross-encoder/ms-marco-MiniLM-L-6-v2`) to score each result against the original query, then reorders results by relevance. 
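
For reference, the reranking step is essentially what the server's `rerank_results` helper in `src/crawl4ai_mcp.py` does; a minimal sketch with `sentence-transformers`:

```python
from sentence_transformers import CrossEncoder

# Same lightweight model the server loads when USE_RERANKING=true
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do I configure hybrid search?"
candidates = ["...chunk returned by the initial search...", "...another chunk..."]

# Score each (query, document) pair, then reorder candidates by relevance
scores = model.predict([[query, doc] for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
```
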
180 | 181 | - **When to use**: Enable this when search precision is critical and you need the most relevant results at the top. Particularly useful for complex queries where semantic similarity alone might not capture query intent. 182 | - **Trade-offs**: Adds ~100-200ms to search queries depending on result count, but significantly improves result ordering. 183 | - **Cost**: No additional API costs - uses a local model that runs on CPU. 184 | - **Benefits**: Better result relevance, especially for complex queries. Works with both regular RAG search and code example search. 185 | 186 | ### Recommended Configurations 187 | 188 | **For general documentation RAG:** 189 | ``` 190 | USE_CONTEXTUAL_EMBEDDINGS=false 191 | USE_HYBRID_SEARCH=true 192 | USE_AGENTIC_RAG=false 193 | USE_RERANKING=true 194 | ``` 195 | 196 | **For AI coding assistant with code examples:** 197 | ``` 198 | USE_CONTEXTUAL_EMBEDDINGS=true 199 | USE_HYBRID_SEARCH=true 200 | USE_AGENTIC_RAG=true 201 | USE_RERANKING=true 202 | ``` 203 | 204 | **For fast, basic RAG:** 205 | ``` 206 | USE_CONTEXTUAL_EMBEDDINGS=false 207 | USE_HYBRID_SEARCH=true 208 | USE_AGENTIC_RAG=false 209 | USE_RERANKING=false 210 | ``` 211 | 212 | ## Running the Server 213 | 214 | ### Using Docker 215 | 216 | ```bash 217 | docker run --env-file .env -p 8051:8051 mcp/crawl4ai-rag 218 | ``` 219 | 220 | ### Using Python 221 | 222 | ```bash 223 | uv run src/crawl4ai_mcp.py 224 | ``` 225 | 226 | The server will start and listen on the configured host and port. 227 | 228 | ## Integration with MCP Clients 229 | 230 | ### SSE Configuration 231 | 232 | Once you have the server running with SSE transport, you can connect to it using this configuration: 233 | 234 | ```json 235 | { 236 | "mcpServers": { 237 | "crawl4ai-rag": { 238 | "transport": "sse", 239 | "url": "http://localhost:8051/sse" 240 | } 241 | } 242 | } 243 | ``` 244 | 245 | > **Note for Windsurf users**: Use `serverUrl` instead of `url` in your configuration: 246 | > ```json 247 | > { 248 | > "mcpServers": { 249 | > "crawl4ai-rag": { 250 | > "transport": "sse", 251 | > "serverUrl": "http://localhost:8051/sse" 252 | > } 253 | > } 254 | > } 255 | > ``` 256 | > 257 | > **Note for Docker users**: Use `host.docker.internal` instead of `localhost` if your client is running in a different container. This will apply if you are using this MCP server within n8n! 
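
As a quick smoke test once the server is running over SSE, a minimal client sketch might look like the following. This assumes the MCP Python SDK's `sse_client`/`ClientSession` helpers; adapt it to whatever MCP client library you use.

```python
# Minimal sketch, assuming the MCP Python SDK's SSE client helpers.
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client


async def main():
    async with sse_client("http://localhost:8051/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List the tools exposed by the server
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Crawl a page, then run a RAG query over what was stored
            await session.call_tool("crawl_single_page", {"url": "https://example.com"})
            result = await session.call_tool(
                "perform_rag_query", {"query": "example domain", "match_count": 5}
            )
            print(result.content)


asyncio.run(main())
```
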
258 | 259 | ### Stdio Configuration 260 | 261 | Add this server to your MCP configuration for Claude Desktop, Windsurf, or any other MCP client: 262 | 263 | ```json 264 | { 265 | "mcpServers": { 266 | "crawl4ai-rag": { 267 | "command": "python", 268 | "args": ["path/to/crawl4ai-mcp/src/crawl4ai_mcp.py"], 269 | "env": { 270 | "TRANSPORT": "stdio", 271 | "OPENAI_API_KEY": "your_openai_api_key", 272 | "SUPABASE_URL": "your_supabase_url", 273 | "SUPABASE_SERVICE_KEY": "your_supabase_service_key" 274 | } 275 | } 276 | } 277 | } 278 | ``` 279 | 280 | ### Docker with Stdio Configuration 281 | 282 | ```json 283 | { 284 | "mcpServers": { 285 | "crawl4ai-rag": { 286 | "command": "docker", 287 | "args": ["run", "--rm", "-i", 288 | "-e", "TRANSPORT", 289 | "-e", "OPENAI_API_KEY", 290 | "-e", "SUPABASE_URL", 291 | "-e", "SUPABASE_SERVICE_KEY", 292 | "mcp/crawl4ai"], 293 | "env": { 294 | "TRANSPORT": "stdio", 295 | "OPENAI_API_KEY": "your_openai_api_key", 296 | "SUPABASE_URL": "your_supabase_url", 297 | "SUPABASE_SERVICE_KEY": "your_supabase_service_key" 298 | } 299 | } 300 | } 301 | } 302 | ``` 303 | 304 | ## Building Your Own Server 305 | 306 | This implementation provides a foundation for building more complex MCP servers with web crawling capabilities. To build your own: 307 | 308 | 1. Add your own tools by creating methods with the `@mcp.tool()` decorator 309 | 2. Create your own lifespan function to add your own dependencies 310 | 3. Modify the `utils.py` file for any helper functions you need 311 | 4. Extend the crawling capabilities by adding more specialized crawlers -------------------------------------------------------------------------------- /crawled_pages.sql: -------------------------------------------------------------------------------- 1 | -- Enable the pgvector extension 2 | create extension if not exists vector; 3 | 4 | -- Drop tables if they exist (to allow rerunning the script) 5 | drop table if exists crawled_pages; 6 | drop table if exists code_examples; 7 | drop table if exists sources; 8 | 9 | -- Create the sources table 10 | create table sources ( 11 | source_id text primary key, 12 | summary text, 13 | total_word_count integer default 0, 14 | created_at timestamp with time zone default timezone('utc'::text, now()) not null, 15 | updated_at timestamp with time zone default timezone('utc'::text, now()) not null 16 | ); 17 | 18 | -- Create the documentation chunks table 19 | create table crawled_pages ( 20 | id bigserial primary key, 21 | url varchar not null, 22 | chunk_number integer not null, 23 | content text not null, 24 | metadata jsonb not null default '{}'::jsonb, 25 | source_id text not null, 26 | embedding vector(1536), -- OpenAI embeddings are 1536 dimensions 27 | created_at timestamp with time zone default timezone('utc'::text, now()) not null, 28 | 29 | -- Add a unique constraint to prevent duplicate chunks for the same URL 30 | unique(url, chunk_number), 31 | 32 | -- Add foreign key constraint to sources table 33 | foreign key (source_id) references sources(source_id) 34 | ); 35 | 36 | -- Create an index for better vector similarity search performance 37 | create index on crawled_pages using ivfflat (embedding vector_cosine_ops); 38 | 39 | -- Create an index on metadata for faster filtering 40 | create index idx_crawled_pages_metadata on crawled_pages using gin (metadata); 41 | 42 | -- Create an index on source_id for faster filtering 43 | CREATE INDEX idx_crawled_pages_source_id ON crawled_pages (source_id); 44 | 45 | -- Create a function to search for 
documentation chunks 46 | create or replace function match_crawled_pages ( 47 | query_embedding vector(1536), 48 | match_count int default 10, 49 | filter jsonb DEFAULT '{}'::jsonb, 50 | source_filter text DEFAULT NULL 51 | ) returns table ( 52 | id bigint, 53 | url varchar, 54 | chunk_number integer, 55 | content text, 56 | metadata jsonb, 57 | source_id text, 58 | similarity float 59 | ) 60 | language plpgsql 61 | as $$ 62 | #variable_conflict use_column 63 | begin 64 | return query 65 | select 66 | id, 67 | url, 68 | chunk_number, 69 | content, 70 | metadata, 71 | source_id, 72 | 1 - (crawled_pages.embedding <=> query_embedding) as similarity 73 | from crawled_pages 74 | where metadata @> filter 75 | AND (source_filter IS NULL OR source_id = source_filter) 76 | order by crawled_pages.embedding <=> query_embedding 77 | limit match_count; 78 | end; 79 | $$; 80 | 81 | -- Enable RLS on the crawled_pages table 82 | alter table crawled_pages enable row level security; 83 | 84 | -- Create a policy that allows anyone to read crawled_pages 85 | create policy "Allow public read access to crawled_pages" 86 | on crawled_pages 87 | for select 88 | to public 89 | using (true); 90 | 91 | -- Enable RLS on the sources table 92 | alter table sources enable row level security; 93 | 94 | -- Create a policy that allows anyone to read sources 95 | create policy "Allow public read access to sources" 96 | on sources 97 | for select 98 | to public 99 | using (true); 100 | 101 | -- Create the code_examples table 102 | create table code_examples ( 103 | id bigserial primary key, 104 | url varchar not null, 105 | chunk_number integer not null, 106 | content text not null, -- The code example content 107 | summary text not null, -- Summary of the code example 108 | metadata jsonb not null default '{}'::jsonb, 109 | source_id text not null, 110 | embedding vector(1536), -- OpenAI embeddings are 1536 dimensions 111 | created_at timestamp with time zone default timezone('utc'::text, now()) not null, 112 | 113 | -- Add a unique constraint to prevent duplicate chunks for the same URL 114 | unique(url, chunk_number), 115 | 116 | -- Add foreign key constraint to sources table 117 | foreign key (source_id) references sources(source_id) 118 | ); 119 | 120 | -- Create an index for better vector similarity search performance 121 | create index on code_examples using ivfflat (embedding vector_cosine_ops); 122 | 123 | -- Create an index on metadata for faster filtering 124 | create index idx_code_examples_metadata on code_examples using gin (metadata); 125 | 126 | -- Create an index on source_id for faster filtering 127 | CREATE INDEX idx_code_examples_source_id ON code_examples (source_id); 128 | 129 | -- Create a function to search for code examples 130 | create or replace function match_code_examples ( 131 | query_embedding vector(1536), 132 | match_count int default 10, 133 | filter jsonb DEFAULT '{}'::jsonb, 134 | source_filter text DEFAULT NULL 135 | ) returns table ( 136 | id bigint, 137 | url varchar, 138 | chunk_number integer, 139 | content text, 140 | summary text, 141 | metadata jsonb, 142 | source_id text, 143 | similarity float 144 | ) 145 | language plpgsql 146 | as $$ 147 | #variable_conflict use_column 148 | begin 149 | return query 150 | select 151 | id, 152 | url, 153 | chunk_number, 154 | content, 155 | summary, 156 | metadata, 157 | source_id, 158 | 1 - (code_examples.embedding <=> query_embedding) as similarity 159 | from code_examples 160 | where metadata @> filter 161 | AND (source_filter IS NULL OR 
source_id = source_filter) 162 | order by code_examples.embedding <=> query_embedding 163 | limit match_count; 164 | end; 165 | $$; 166 | 167 | -- Enable RLS on the code_examples table 168 | alter table code_examples enable row level security; 169 | 170 | -- Create a policy that allows anyone to read code_examples 171 | create policy "Allow public read access to code_examples" 172 | on code_examples 173 | for select 174 | to public 175 | using (true); -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [project] 2 | name = "crawl4ai-mcp" 3 | version = "0.1.0" 4 | description = "MCP server for integrating web crawling and RAG into AI agents and AI coding assistants" 5 | readme = "README.md" 6 | requires-python = ">=3.12" 7 | dependencies = [ 8 | "crawl4ai==0.6.2", 9 | "mcp==1.7.1", 10 | "supabase==2.15.1", 11 | "openai==1.71.0", 12 | "dotenv==0.9.9", 13 | "sentence-transformers>=4.1.0", 14 | ] 15 | -------------------------------------------------------------------------------- /src/crawl4ai_mcp.py: -------------------------------------------------------------------------------- 1 | """ 2 | MCP server for web crawling with Crawl4AI. 3 | 4 | This server provides tools to crawl websites using Crawl4AI, automatically detecting 5 | the appropriate crawl method based on URL type (sitemap, txt file, or regular webpage). 6 | """ 7 | from mcp.server.fastmcp import FastMCP, Context 8 | from sentence_transformers import CrossEncoder 9 | from contextlib import asynccontextmanager 10 | from collections.abc import AsyncIterator 11 | from dataclasses import dataclass 12 | from typing import List, Dict, Any, Optional 13 | from urllib.parse import urlparse, urldefrag 14 | from xml.etree import ElementTree 15 | from dotenv import load_dotenv 16 | from supabase import Client 17 | from pathlib import Path 18 | import requests 19 | import asyncio 20 | import json 21 | import os 22 | import re 23 | import concurrent.futures 24 | 25 | from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, MemoryAdaptiveDispatcher 26 | 27 | from utils import ( 28 | get_supabase_client, 29 | add_documents_to_supabase, 30 | search_documents, 31 | extract_code_blocks, 32 | generate_code_example_summary, 33 | add_code_examples_to_supabase, 34 | update_source_info, 35 | extract_source_summary, 36 | search_code_examples 37 | ) 38 | 39 | # Load environment variables from the project root .env file 40 | project_root = Path(__file__).resolve().parent.parent 41 | dotenv_path = project_root / '.env' 42 | 43 | # Force override of existing environment variables 44 | load_dotenv(dotenv_path, override=True) 45 | 46 | # Create a dataclass for our application context 47 | @dataclass 48 | class Crawl4AIContext: 49 | """Context for the Crawl4AI MCP server.""" 50 | crawler: AsyncWebCrawler 51 | supabase_client: Client 52 | reranking_model: Optional[CrossEncoder] = None 53 | 54 | @asynccontextmanager 55 | async def crawl4ai_lifespan(server: FastMCP) -> AsyncIterator[Crawl4AIContext]: 56 | """ 57 | Manages the Crawl4AI client lifecycle. 
58 | 59 | Args: 60 | server: The FastMCP server instance 61 | 62 | Yields: 63 | Crawl4AIContext: The context containing the Crawl4AI crawler and Supabase client 64 | """ 65 | # Create browser configuration 66 | browser_config = BrowserConfig( 67 | headless=True, 68 | verbose=False 69 | ) 70 | 71 | # Initialize the crawler 72 | crawler = AsyncWebCrawler(config=browser_config) 73 | await crawler.__aenter__() 74 | 75 | # Initialize Supabase client 76 | supabase_client = get_supabase_client() 77 | 78 | # Initialize cross-encoder model for reranking if enabled 79 | reranking_model = None 80 | if os.getenv("USE_RERANKING", "false") == "true": 81 | try: 82 | reranking_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2") 83 | except Exception as e: 84 | print(f"Failed to load reranking model: {e}") 85 | reranking_model = None 86 | 87 | try: 88 | yield Crawl4AIContext( 89 | crawler=crawler, 90 | supabase_client=supabase_client, 91 | reranking_model=reranking_model 92 | ) 93 | finally: 94 | # Clean up the crawler 95 | await crawler.__aexit__(None, None, None) 96 | 97 | # Initialize FastMCP server 98 | mcp = FastMCP( 99 | "mcp-crawl4ai-rag", 100 | description="MCP server for RAG and web crawling with Crawl4AI", 101 | lifespan=crawl4ai_lifespan, 102 | host=os.getenv("HOST", "0.0.0.0"), 103 | port=os.getenv("PORT", "8051") 104 | ) 105 | 106 | def rerank_results(model: CrossEncoder, query: str, results: List[Dict[str, Any]], content_key: str = "content") -> List[Dict[str, Any]]: 107 | """ 108 | Rerank search results using a cross-encoder model. 109 | 110 | Args: 111 | model: The cross-encoder model to use for reranking 112 | query: The search query 113 | results: List of search results 114 | content_key: The key in each result dict that contains the text content 115 | 116 | Returns: 117 | Reranked list of results 118 | """ 119 | if not model or not results: 120 | return results 121 | 122 | try: 123 | # Extract content from results 124 | texts = [result.get(content_key, "") for result in results] 125 | 126 | # Create pairs of [query, document] for the cross-encoder 127 | pairs = [[query, text] for text in texts] 128 | 129 | # Get relevance scores from the cross-encoder 130 | scores = model.predict(pairs) 131 | 132 | # Add scores to results and sort by score (descending) 133 | for i, result in enumerate(results): 134 | result["rerank_score"] = float(scores[i]) 135 | 136 | # Sort by rerank score 137 | reranked = sorted(results, key=lambda x: x.get("rerank_score", 0), reverse=True) 138 | 139 | return reranked 140 | except Exception as e: 141 | print(f"Error during reranking: {e}") 142 | return results 143 | 144 | def is_sitemap(url: str) -> bool: 145 | """ 146 | Check if a URL is a sitemap. 147 | 148 | Args: 149 | url: URL to check 150 | 151 | Returns: 152 | True if the URL is a sitemap, False otherwise 153 | """ 154 | return url.endswith('sitemap.xml') or 'sitemap' in urlparse(url).path 155 | 156 | def is_txt(url: str) -> bool: 157 | """ 158 | Check if a URL is a text file. 159 | 160 | Args: 161 | url: URL to check 162 | 163 | Returns: 164 | True if the URL is a text file, False otherwise 165 | """ 166 | return url.endswith('.txt') 167 | 168 | def parse_sitemap(sitemap_url: str) -> List[str]: 169 | """ 170 | Parse a sitemap and extract URLs. 
171 | 172 | Args: 173 | sitemap_url: URL of the sitemap 174 | 175 | Returns: 176 | List of URLs found in the sitemap 177 | """ 178 | resp = requests.get(sitemap_url) 179 | urls = [] 180 | 181 | if resp.status_code == 200: 182 | try: 183 | tree = ElementTree.fromstring(resp.content) 184 | urls = [loc.text for loc in tree.findall('.//{*}loc')] 185 | except Exception as e: 186 | print(f"Error parsing sitemap XML: {e}") 187 | 188 | return urls 189 | 190 | def smart_chunk_markdown(text: str, chunk_size: int = 5000) -> List[str]: 191 | """Split text into chunks, respecting code blocks and paragraphs.""" 192 | chunks = [] 193 | start = 0 194 | text_length = len(text) 195 | 196 | while start < text_length: 197 | # Calculate end position 198 | end = start + chunk_size 199 | 200 | # If we're at the end of the text, just take what's left 201 | if end >= text_length: 202 | chunks.append(text[start:].strip()) 203 | break 204 | 205 | # Try to find a code block boundary first (```) 206 | chunk = text[start:end] 207 | code_block = chunk.rfind('```') 208 | if code_block != -1 and code_block > chunk_size * 0.3: 209 | end = start + code_block 210 | 211 | # If no code block, try to break at a paragraph 212 | elif '\n\n' in chunk: 213 | # Find the last paragraph break 214 | last_break = chunk.rfind('\n\n') 215 | if last_break > chunk_size * 0.3: # Only break if we're past 30% of chunk_size 216 | end = start + last_break 217 | 218 | # If no paragraph break, try to break at a sentence 219 | elif '. ' in chunk: 220 | # Find the last sentence break 221 | last_period = chunk.rfind('. ') 222 | if last_period > chunk_size * 0.3: # Only break if we're past 30% of chunk_size 223 | end = start + last_period + 1 224 | 225 | # Extract chunk and clean it up 226 | chunk = text[start:end].strip() 227 | if chunk: 228 | chunks.append(chunk) 229 | 230 | # Move start position for next chunk 231 | start = end 232 | 233 | return chunks 234 | 235 | def extract_section_info(chunk: str) -> Dict[str, Any]: 236 | """ 237 | Extracts headers and stats from a chunk. 238 | 239 | Args: 240 | chunk: Markdown chunk 241 | 242 | Returns: 243 | Dictionary with headers and stats 244 | """ 245 | headers = re.findall(r'^(#+)\s+(.+)$', chunk, re.MULTILINE) 246 | header_str = '; '.join([f'{h[0]} {h[1]}' for h in headers]) if headers else '' 247 | 248 | return { 249 | "headers": header_str, 250 | "char_count": len(chunk), 251 | "word_count": len(chunk.split()) 252 | } 253 | 254 | def process_code_example(args): 255 | """ 256 | Process a single code example to generate its summary. 257 | This function is designed to be used with concurrent.futures. 258 | 259 | Args: 260 | args: Tuple containing (code, context_before, context_after) 261 | 262 | Returns: 263 | The generated summary 264 | """ 265 | code, context_before, context_after = args 266 | return generate_code_example_summary(code, context_before, context_after) 267 | 268 | @mcp.tool() 269 | async def crawl_single_page(ctx: Context, url: str) -> str: 270 | """ 271 | Crawl a single web page and store its content in Supabase. 272 | 273 | This tool is ideal for quickly retrieving content from a specific URL without following links. 274 | The content is stored in Supabase for later retrieval and querying. 
275 | 276 | Args: 277 | ctx: The MCP server provided context 278 | url: URL of the web page to crawl 279 | 280 | Returns: 281 | Summary of the crawling operation and storage in Supabase 282 | """ 283 | try: 284 | # Get the crawler from the context 285 | crawler = ctx.request_context.lifespan_context.crawler 286 | supabase_client = ctx.request_context.lifespan_context.supabase_client 287 | 288 | # Configure the crawl 289 | run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=False) 290 | 291 | # Crawl the page 292 | result = await crawler.arun(url=url, config=run_config) 293 | 294 | if result.success and result.markdown: 295 | # Extract source_id 296 | parsed_url = urlparse(url) 297 | source_id = parsed_url.netloc or parsed_url.path 298 | 299 | # Chunk the content 300 | chunks = smart_chunk_markdown(result.markdown) 301 | 302 | # Prepare data for Supabase 303 | urls = [] 304 | chunk_numbers = [] 305 | contents = [] 306 | metadatas = [] 307 | total_word_count = 0 308 | 309 | for i, chunk in enumerate(chunks): 310 | urls.append(url) 311 | chunk_numbers.append(i) 312 | contents.append(chunk) 313 | 314 | # Extract metadata 315 | meta = extract_section_info(chunk) 316 | meta["chunk_index"] = i 317 | meta["url"] = url 318 | meta["source"] = source_id 319 | meta["crawl_time"] = str(asyncio.current_task().get_coro().__name__) 320 | metadatas.append(meta) 321 | 322 | # Accumulate word count 323 | total_word_count += meta.get("word_count", 0) 324 | 325 | # Create url_to_full_document mapping 326 | url_to_full_document = {url: result.markdown} 327 | 328 | # Update source information FIRST (before inserting documents) 329 | source_summary = extract_source_summary(source_id, result.markdown[:5000]) # Use first 5000 chars for summary 330 | update_source_info(supabase_client, source_id, source_summary, total_word_count) 331 | 332 | # Add documentation chunks to Supabase (AFTER source exists) 333 | add_documents_to_supabase(supabase_client, urls, chunk_numbers, contents, metadatas, url_to_full_document) 334 | 335 | # Extract and process code examples only if enabled 336 | extract_code_examples = os.getenv("USE_AGENTIC_RAG", "false") == "true" 337 | if extract_code_examples: 338 | code_blocks = extract_code_blocks(result.markdown) 339 | if code_blocks: 340 | code_urls = [] 341 | code_chunk_numbers = [] 342 | code_examples = [] 343 | code_summaries = [] 344 | code_metadatas = [] 345 | 346 | # Process code examples in parallel 347 | with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor: 348 | # Prepare arguments for parallel processing 349 | summary_args = [(block['code'], block['context_before'], block['context_after']) 350 | for block in code_blocks] 351 | 352 | # Generate summaries in parallel 353 | summaries = list(executor.map(process_code_example, summary_args)) 354 | 355 | # Prepare code example data 356 | for i, (block, summary) in enumerate(zip(code_blocks, summaries)): 357 | code_urls.append(url) 358 | code_chunk_numbers.append(i) 359 | code_examples.append(block['code']) 360 | code_summaries.append(summary) 361 | 362 | # Create metadata for code example 363 | code_meta = { 364 | "chunk_index": i, 365 | "url": url, 366 | "source": source_id, 367 | "char_count": len(block['code']), 368 | "word_count": len(block['code'].split()) 369 | } 370 | code_metadatas.append(code_meta) 371 | 372 | # Add code examples to Supabase 373 | add_code_examples_to_supabase( 374 | supabase_client, 375 | code_urls, 376 | code_chunk_numbers, 377 | code_examples, 378 | code_summaries, 379 | 
code_metadatas 380 | ) 381 | 382 | return json.dumps({ 383 | "success": True, 384 | "url": url, 385 | "chunks_stored": len(chunks), 386 | "code_examples_stored": len(code_blocks) if code_blocks else 0, 387 | "content_length": len(result.markdown), 388 | "total_word_count": total_word_count, 389 | "source_id": source_id, 390 | "links_count": { 391 | "internal": len(result.links.get("internal", [])), 392 | "external": len(result.links.get("external", [])) 393 | } 394 | }, indent=2) 395 | else: 396 | return json.dumps({ 397 | "success": False, 398 | "url": url, 399 | "error": result.error_message 400 | }, indent=2) 401 | except Exception as e: 402 | return json.dumps({ 403 | "success": False, 404 | "url": url, 405 | "error": str(e) 406 | }, indent=2) 407 | 408 | @mcp.tool() 409 | async def smart_crawl_url(ctx: Context, url: str, max_depth: int = 3, max_concurrent: int = 10, chunk_size: int = 5000) -> str: 410 | """ 411 | Intelligently crawl a URL based on its type and store content in Supabase. 412 | 413 | This tool automatically detects the URL type and applies the appropriate crawling method: 414 | - For sitemaps: Extracts and crawls all URLs in parallel 415 | - For text files (llms.txt): Directly retrieves the content 416 | - For regular webpages: Recursively crawls internal links up to the specified depth 417 | 418 | All crawled content is chunked and stored in Supabase for later retrieval and querying. 419 | 420 | Args: 421 | ctx: The MCP server provided context 422 | url: URL to crawl (can be a regular webpage, sitemap.xml, or .txt file) 423 | max_depth: Maximum recursion depth for regular URLs (default: 3) 424 | max_concurrent: Maximum number of concurrent browser sessions (default: 10) 425 | chunk_size: Maximum size of each content chunk in characters (default: 1000) 426 | 427 | Returns: 428 | JSON string with crawl summary and storage information 429 | """ 430 | try: 431 | # Get the crawler from the context 432 | crawler = ctx.request_context.lifespan_context.crawler 433 | supabase_client = ctx.request_context.lifespan_context.supabase_client 434 | 435 | # Determine the crawl strategy 436 | crawl_results = [] 437 | crawl_type = None 438 | 439 | if is_txt(url): 440 | # For text files, use simple crawl 441 | crawl_results = await crawl_markdown_file(crawler, url) 442 | crawl_type = "text_file" 443 | elif is_sitemap(url): 444 | # For sitemaps, extract URLs and crawl in parallel 445 | sitemap_urls = parse_sitemap(url) 446 | if not sitemap_urls: 447 | return json.dumps({ 448 | "success": False, 449 | "url": url, 450 | "error": "No URLs found in sitemap" 451 | }, indent=2) 452 | crawl_results = await crawl_batch(crawler, sitemap_urls, max_concurrent=max_concurrent) 453 | crawl_type = "sitemap" 454 | else: 455 | # For regular URLs, use recursive crawl 456 | crawl_results = await crawl_recursive_internal_links(crawler, [url], max_depth=max_depth, max_concurrent=max_concurrent) 457 | crawl_type = "webpage" 458 | 459 | if not crawl_results: 460 | return json.dumps({ 461 | "success": False, 462 | "url": url, 463 | "error": "No content found" 464 | }, indent=2) 465 | 466 | # Process results and store in Supabase 467 | urls = [] 468 | chunk_numbers = [] 469 | contents = [] 470 | metadatas = [] 471 | chunk_count = 0 472 | 473 | # Track sources and their content 474 | source_content_map = {} 475 | source_word_counts = {} 476 | 477 | # Process documentation chunks 478 | for doc in crawl_results: 479 | source_url = doc['url'] 480 | md = doc['markdown'] 481 | chunks = smart_chunk_markdown(md, 
chunk_size=chunk_size) 482 | 483 | # Extract source_id 484 | parsed_url = urlparse(source_url) 485 | source_id = parsed_url.netloc or parsed_url.path 486 | 487 | # Store content for source summary generation 488 | if source_id not in source_content_map: 489 | source_content_map[source_id] = md[:5000] # Store first 5000 chars 490 | source_word_counts[source_id] = 0 491 | 492 | for i, chunk in enumerate(chunks): 493 | urls.append(source_url) 494 | chunk_numbers.append(i) 495 | contents.append(chunk) 496 | 497 | # Extract metadata 498 | meta = extract_section_info(chunk) 499 | meta["chunk_index"] = i 500 | meta["url"] = source_url 501 | meta["source"] = source_id 502 | meta["crawl_type"] = crawl_type 503 | meta["crawl_time"] = str(asyncio.current_task().get_coro().__name__) 504 | metadatas.append(meta) 505 | 506 | # Accumulate word count 507 | source_word_counts[source_id] += meta.get("word_count", 0) 508 | 509 | chunk_count += 1 510 | 511 | # Create url_to_full_document mapping 512 | url_to_full_document = {} 513 | for doc in crawl_results: 514 | url_to_full_document[doc['url']] = doc['markdown'] 515 | 516 | # Update source information for each unique source FIRST (before inserting documents) 517 | with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor: 518 | source_summary_args = [(source_id, content) for source_id, content in source_content_map.items()] 519 | source_summaries = list(executor.map(lambda args: extract_source_summary(args[0], args[1]), source_summary_args)) 520 | 521 | for (source_id, _), summary in zip(source_summary_args, source_summaries): 522 | word_count = source_word_counts.get(source_id, 0) 523 | update_source_info(supabase_client, source_id, summary, word_count) 524 | 525 | # Add documentation chunks to Supabase (AFTER sources exist) 526 | batch_size = 20 527 | add_documents_to_supabase(supabase_client, urls, chunk_numbers, contents, metadatas, url_to_full_document, batch_size=batch_size) 528 | 529 | # Extract and process code examples from all documents only if enabled 530 | extract_code_examples_enabled = os.getenv("USE_AGENTIC_RAG", "false") == "true" 531 | if extract_code_examples_enabled: 532 | all_code_blocks = [] 533 | code_urls = [] 534 | code_chunk_numbers = [] 535 | code_examples = [] 536 | code_summaries = [] 537 | code_metadatas = [] 538 | 539 | # Extract code blocks from all documents 540 | for doc in crawl_results: 541 | source_url = doc['url'] 542 | md = doc['markdown'] 543 | code_blocks = extract_code_blocks(md) 544 | 545 | if code_blocks: 546 | # Process code examples in parallel 547 | with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor: 548 | # Prepare arguments for parallel processing 549 | summary_args = [(block['code'], block['context_before'], block['context_after']) 550 | for block in code_blocks] 551 | 552 | # Generate summaries in parallel 553 | summaries = list(executor.map(process_code_example, summary_args)) 554 | 555 | # Prepare code example data 556 | parsed_url = urlparse(source_url) 557 | source_id = parsed_url.netloc or parsed_url.path 558 | 559 | for i, (block, summary) in enumerate(zip(code_blocks, summaries)): 560 | code_urls.append(source_url) 561 | code_chunk_numbers.append(len(code_examples)) # Use global code example index 562 | code_examples.append(block['code']) 563 | code_summaries.append(summary) 564 | 565 | # Create metadata for code example 566 | code_meta = { 567 | "chunk_index": len(code_examples) - 1, 568 | "url": source_url, 569 | "source": source_id, 570 | "char_count": 
len(block['code']), 571 | "word_count": len(block['code'].split()) 572 | } 573 | code_metadatas.append(code_meta) 574 | 575 | # Add all code examples to Supabase 576 | if code_examples: 577 | add_code_examples_to_supabase( 578 | supabase_client, 579 | code_urls, 580 | code_chunk_numbers, 581 | code_examples, 582 | code_summaries, 583 | code_metadatas, 584 | batch_size=batch_size 585 | ) 586 | 587 | return json.dumps({ 588 | "success": True, 589 | "url": url, 590 | "crawl_type": crawl_type, 591 | "pages_crawled": len(crawl_results), 592 | "chunks_stored": chunk_count, 593 | "code_examples_stored": len(code_examples), 594 | "sources_updated": len(source_content_map), 595 | "urls_crawled": [doc['url'] for doc in crawl_results][:5] + (["..."] if len(crawl_results) > 5 else []) 596 | }, indent=2) 597 | except Exception as e: 598 | return json.dumps({ 599 | "success": False, 600 | "url": url, 601 | "error": str(e) 602 | }, indent=2) 603 | 604 | @mcp.tool() 605 | async def get_available_sources(ctx: Context) -> str: 606 | """ 607 | Get all available sources from the sources table. 608 | 609 | This tool returns a list of all unique sources (domains) that have been crawled and stored 610 | in the database, along with their summaries and statistics. This is useful for discovering 611 | what content is available for querying. 612 | 613 | Always use this tool before calling the RAG query or code example query tool 614 | with a specific source filter! 615 | 616 | Args: 617 | ctx: The MCP server provided context 618 | 619 | Returns: 620 | JSON string with the list of available sources and their details 621 | """ 622 | try: 623 | # Get the Supabase client from the context 624 | supabase_client = ctx.request_context.lifespan_context.supabase_client 625 | 626 | # Query the sources table directly 627 | result = supabase_client.from_('sources')\ 628 | .select('*')\ 629 | .order('source_id')\ 630 | .execute() 631 | 632 | # Format the sources with their details 633 | sources = [] 634 | if result.data: 635 | for source in result.data: 636 | sources.append({ 637 | "source_id": source.get("source_id"), 638 | "summary": source.get("summary"), 639 | "total_words": source.get("total_words"), 640 | "created_at": source.get("created_at"), 641 | "updated_at": source.get("updated_at") 642 | }) 643 | 644 | return json.dumps({ 645 | "success": True, 646 | "sources": sources, 647 | "count": len(sources) 648 | }, indent=2) 649 | except Exception as e: 650 | return json.dumps({ 651 | "success": False, 652 | "error": str(e) 653 | }, indent=2) 654 | 655 | @mcp.tool() 656 | async def perform_rag_query(ctx: Context, query: str, source: str = None, match_count: int = 5) -> str: 657 | """ 658 | Perform a RAG (Retrieval Augmented Generation) query on the stored content. 659 | 660 | This tool searches the vector database for content relevant to the query and returns 661 | the matching documents. Optionally filter by source domain. 662 | Get the source by using the get_available_sources tool before calling this search! 
663 | 664 | Args: 665 | ctx: The MCP server provided context 666 | query: The search query 667 | source: Optional source domain to filter results (e.g., 'example.com') 668 | match_count: Maximum number of results to return (default: 5) 669 | 670 | Returns: 671 | JSON string with the search results 672 | """ 673 | try: 674 | # Get the Supabase client from the context 675 | supabase_client = ctx.request_context.lifespan_context.supabase_client 676 | 677 | # Check if hybrid search is enabled 678 | use_hybrid_search = os.getenv("USE_HYBRID_SEARCH", "false") == "true" 679 | 680 | # Prepare filter if source is provided and not empty 681 | filter_metadata = None 682 | if source and source.strip(): 683 | filter_metadata = {"source": source} 684 | 685 | if use_hybrid_search: 686 | # Hybrid search: combine vector and keyword search 687 | 688 | # 1. Get vector search results (get more to account for filtering) 689 | vector_results = search_documents( 690 | client=supabase_client, 691 | query=query, 692 | match_count=match_count * 2, # Get double to have room for filtering 693 | filter_metadata=filter_metadata 694 | ) 695 | 696 | # 2. Get keyword search results using ILIKE 697 | keyword_query = supabase_client.from_('crawled_pages')\ 698 | .select('id, url, chunk_number, content, metadata, source_id')\ 699 | .ilike('content', f'%{query}%') 700 | 701 | # Apply source filter if provided 702 | if source and source.strip(): 703 | keyword_query = keyword_query.eq('source_id', source) 704 | 705 | # Execute keyword search 706 | keyword_response = keyword_query.limit(match_count * 2).execute() 707 | keyword_results = keyword_response.data if keyword_response.data else [] 708 | 709 | # 3. Combine results with preference for items appearing in both 710 | seen_ids = set() 711 | combined_results = [] 712 | 713 | # First, add items that appear in both searches (these are the best matches) 714 | vector_ids = {r.get('id') for r in vector_results if r.get('id')} 715 | for kr in keyword_results: 716 | if kr['id'] in vector_ids and kr['id'] not in seen_ids: 717 | # Find the vector result to get similarity score 718 | for vr in vector_results: 719 | if vr.get('id') == kr['id']: 720 | # Boost similarity score for items in both results 721 | vr['similarity'] = min(1.0, vr.get('similarity', 0) * 1.2) 722 | combined_results.append(vr) 723 | seen_ids.add(kr['id']) 724 | break 725 | 726 | # Then add remaining vector results (semantic matches without exact keyword) 727 | for vr in vector_results: 728 | if vr.get('id') and vr['id'] not in seen_ids and len(combined_results) < match_count: 729 | combined_results.append(vr) 730 | seen_ids.add(vr['id']) 731 | 732 | # Finally, add pure keyword matches if we still need more results 733 | for kr in keyword_results: 734 | if kr['id'] not in seen_ids and len(combined_results) < match_count: 735 | # Convert keyword result to match vector result format 736 | combined_results.append({ 737 | 'id': kr['id'], 738 | 'url': kr['url'], 739 | 'chunk_number': kr['chunk_number'], 740 | 'content': kr['content'], 741 | 'metadata': kr['metadata'], 742 | 'source_id': kr['source_id'], 743 | 'similarity': 0.5 # Default similarity for keyword-only matches 744 | }) 745 | seen_ids.add(kr['id']) 746 | 747 | # Use combined results 748 | results = combined_results[:match_count] 749 | 750 | else: 751 | # Standard vector search only 752 | results = search_documents( 753 | client=supabase_client, 754 | query=query, 755 | match_count=match_count, 756 | filter_metadata=filter_metadata 757 | ) 758 | 759 | # Apply 
reranking if enabled 760 | use_reranking = os.getenv("USE_RERANKING", "false") == "true" 761 | if use_reranking and ctx.request_context.lifespan_context.reranking_model: 762 | results = rerank_results(ctx.request_context.lifespan_context.reranking_model, query, results, content_key="content") 763 | 764 | # Format the results 765 | formatted_results = [] 766 | for result in results: 767 | formatted_result = { 768 | "url": result.get("url"), 769 | "content": result.get("content"), 770 | "metadata": result.get("metadata"), 771 | "similarity": result.get("similarity") 772 | } 773 | # Include rerank score if available 774 | if "rerank_score" in result: 775 | formatted_result["rerank_score"] = result["rerank_score"] 776 | formatted_results.append(formatted_result) 777 | 778 | return json.dumps({ 779 | "success": True, 780 | "query": query, 781 | "source_filter": source, 782 | "search_mode": "hybrid" if use_hybrid_search else "vector", 783 | "reranking_applied": use_reranking and ctx.request_context.lifespan_context.reranking_model is not None, 784 | "results": formatted_results, 785 | "count": len(formatted_results) 786 | }, indent=2) 787 | except Exception as e: 788 | return json.dumps({ 789 | "success": False, 790 | "query": query, 791 | "error": str(e) 792 | }, indent=2) 793 | 794 | @mcp.tool() 795 | async def search_code_examples(ctx: Context, query: str, source_id: str = None, match_count: int = 5) -> str: 796 | """ 797 | Search for code examples relevant to the query. 798 | 799 | This tool searches the vector database for code examples relevant to the query and returns 800 | the matching examples with their summaries. Optionally filter by source_id. 801 | Get the source_id by using the get_available_sources tool before calling this search! 802 | 803 | Use the get_available_sources tool first to see what sources are available for filtering. 804 | 805 | Args: 806 | ctx: The MCP server provided context 807 | query: The search query 808 | source_id: Optional source ID to filter results (e.g., 'example.com') 809 | match_count: Maximum number of results to return (default: 5) 810 | 811 | Returns: 812 | JSON string with the search results 813 | """ 814 | # Check if code example extraction is enabled 815 | extract_code_examples_enabled = os.getenv("USE_AGENTIC_RAG", "false") == "true" 816 | if not extract_code_examples_enabled: 817 | return json.dumps({ 818 | "success": False, 819 | "error": "Code example extraction is disabled. Perform a normal RAG search." 820 | }, indent=2) 821 | 822 | try: 823 | # Get the Supabase client from the context 824 | supabase_client = ctx.request_context.lifespan_context.supabase_client 825 | 826 | # Check if hybrid search is enabled 827 | use_hybrid_search = os.getenv("USE_HYBRID_SEARCH", "false") == "true" 828 | 829 | # Prepare filter if source is provided and not empty 830 | filter_metadata = None 831 | if source_id and source_id.strip(): 832 | filter_metadata = {"source": source_id} 833 | 834 | if use_hybrid_search: 835 | # Hybrid search: combine vector and keyword search 836 | 837 | # Import the search function from utils 838 | from utils import search_code_examples as search_code_examples_impl 839 | 840 | # 1. Get vector search results (get more to account for filtering) 841 | vector_results = search_code_examples_impl( 842 | client=supabase_client, 843 | query=query, 844 | match_count=match_count * 2, # Get double to have room for filtering 845 | filter_metadata=filter_metadata 846 | ) 847 | 848 | # 2. 
Get keyword search results using ILIKE on both content and summary 849 | keyword_query = supabase_client.from_('code_examples')\ 850 | .select('id, url, chunk_number, content, summary, metadata, source_id')\ 851 | .or_(f'content.ilike.%{query}%,summary.ilike.%{query}%') 852 | 853 | # Apply source filter if provided 854 | if source_id and source_id.strip(): 855 | keyword_query = keyword_query.eq('source_id', source_id) 856 | 857 | # Execute keyword search 858 | keyword_response = keyword_query.limit(match_count * 2).execute() 859 | keyword_results = keyword_response.data if keyword_response.data else [] 860 | 861 | # 3. Combine results with preference for items appearing in both 862 | seen_ids = set() 863 | combined_results = [] 864 | 865 | # First, add items that appear in both searches (these are the best matches) 866 | vector_ids = {r.get('id') for r in vector_results if r.get('id')} 867 | for kr in keyword_results: 868 | if kr['id'] in vector_ids and kr['id'] not in seen_ids: 869 | # Find the vector result to get similarity score 870 | for vr in vector_results: 871 | if vr.get('id') == kr['id']: 872 | # Boost similarity score for items in both results 873 | vr['similarity'] = min(1.0, vr.get('similarity', 0) * 1.2) 874 | combined_results.append(vr) 875 | seen_ids.add(kr['id']) 876 | break 877 | 878 | # Then add remaining vector results (semantic matches without exact keyword) 879 | for vr in vector_results: 880 | if vr.get('id') and vr['id'] not in seen_ids and len(combined_results) < match_count: 881 | combined_results.append(vr) 882 | seen_ids.add(vr['id']) 883 | 884 | # Finally, add pure keyword matches if we still need more results 885 | for kr in keyword_results: 886 | if kr['id'] not in seen_ids and len(combined_results) < match_count: 887 | # Convert keyword result to match vector result format 888 | combined_results.append({ 889 | 'id': kr['id'], 890 | 'url': kr['url'], 891 | 'chunk_number': kr['chunk_number'], 892 | 'content': kr['content'], 893 | 'summary': kr['summary'], 894 | 'metadata': kr['metadata'], 895 | 'source_id': kr['source_id'], 896 | 'similarity': 0.5 # Default similarity for keyword-only matches 897 | }) 898 | seen_ids.add(kr['id']) 899 | 900 | # Use combined results 901 | results = combined_results[:match_count] 902 | 903 | else: 904 | # Standard vector search only 905 | from utils import search_code_examples as search_code_examples_impl 906 | 907 | results = search_code_examples_impl( 908 | client=supabase_client, 909 | query=query, 910 | match_count=match_count, 911 | filter_metadata=filter_metadata 912 | ) 913 | 914 | # Apply reranking if enabled 915 | use_reranking = os.getenv("USE_RERANKING", "false") == "true" 916 | if use_reranking and ctx.request_context.lifespan_context.reranking_model: 917 | results = rerank_results(ctx.request_context.lifespan_context.reranking_model, query, results, content_key="content") 918 | 919 | # Format the results 920 | formatted_results = [] 921 | for result in results: 922 | formatted_result = { 923 | "url": result.get("url"), 924 | "code": result.get("content"), 925 | "summary": result.get("summary"), 926 | "metadata": result.get("metadata"), 927 | "source_id": result.get("source_id"), 928 | "similarity": result.get("similarity") 929 | } 930 | # Include rerank score if available 931 | if "rerank_score" in result: 932 | formatted_result["rerank_score"] = result["rerank_score"] 933 | formatted_results.append(formatted_result) 934 | 935 | return json.dumps({ 936 | "success": True, 937 | "query": query, 938 | "source_filter": 
source_id, 939 | "search_mode": "hybrid" if use_hybrid_search else "vector", 940 | "reranking_applied": use_reranking and ctx.request_context.lifespan_context.reranking_model is not None, 941 | "results": formatted_results, 942 | "count": len(formatted_results) 943 | }, indent=2) 944 | except Exception as e: 945 | return json.dumps({ 946 | "success": False, 947 | "query": query, 948 | "error": str(e) 949 | }, indent=2) 950 | 951 | async def crawl_markdown_file(crawler: AsyncWebCrawler, url: str) -> List[Dict[str, Any]]: 952 | """ 953 | Crawl a .txt or markdown file. 954 | 955 | Args: 956 | crawler: AsyncWebCrawler instance 957 | url: URL of the file 958 | 959 | Returns: 960 | List of dictionaries with URL and markdown content 961 | """ 962 | crawl_config = CrawlerRunConfig() 963 | 964 | result = await crawler.arun(url=url, config=crawl_config) 965 | if result.success and result.markdown: 966 | return [{'url': url, 'markdown': result.markdown}] 967 | else: 968 | print(f"Failed to crawl {url}: {result.error_message}") 969 | return [] 970 | 971 | async def crawl_batch(crawler: AsyncWebCrawler, urls: List[str], max_concurrent: int = 10) -> List[Dict[str, Any]]: 972 | """ 973 | Batch crawl multiple URLs in parallel. 974 | 975 | Args: 976 | crawler: AsyncWebCrawler instance 977 | urls: List of URLs to crawl 978 | max_concurrent: Maximum number of concurrent browser sessions 979 | 980 | Returns: 981 | List of dictionaries with URL and markdown content 982 | """ 983 | crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=False) 984 | dispatcher = MemoryAdaptiveDispatcher( 985 | memory_threshold_percent=70.0, 986 | check_interval=1.0, 987 | max_session_permit=max_concurrent 988 | ) 989 | 990 | results = await crawler.arun_many(urls=urls, config=crawl_config, dispatcher=dispatcher) 991 | return [{'url': r.url, 'markdown': r.markdown} for r in results if r.success and r.markdown] 992 | 993 | async def crawl_recursive_internal_links(crawler: AsyncWebCrawler, start_urls: List[str], max_depth: int = 3, max_concurrent: int = 10) -> List[Dict[str, Any]]: 994 | """ 995 | Recursively crawl internal links from start URLs up to a maximum depth. 
996 | 997 | Args: 998 | crawler: AsyncWebCrawler instance 999 | start_urls: List of starting URLs 1000 | max_depth: Maximum recursion depth 1001 | max_concurrent: Maximum number of concurrent browser sessions 1002 | 1003 | Returns: 1004 | List of dictionaries with URL and markdown content 1005 | """ 1006 | run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=False) 1007 | dispatcher = MemoryAdaptiveDispatcher( 1008 | memory_threshold_percent=70.0, 1009 | check_interval=1.0, 1010 | max_session_permit=max_concurrent 1011 | ) 1012 | 1013 | visited = set() 1014 | 1015 | def normalize_url(url): 1016 | return urldefrag(url)[0] 1017 | 1018 | current_urls = set([normalize_url(u) for u in start_urls]) 1019 | results_all = [] 1020 | 1021 | for depth in range(max_depth): 1022 | urls_to_crawl = [normalize_url(url) for url in current_urls if normalize_url(url) not in visited] 1023 | if not urls_to_crawl: 1024 | break 1025 | 1026 | results = await crawler.arun_many(urls=urls_to_crawl, config=run_config, dispatcher=dispatcher) 1027 | next_level_urls = set() 1028 | 1029 | for result in results: 1030 | norm_url = normalize_url(result.url) 1031 | visited.add(norm_url) 1032 | 1033 | if result.success and result.markdown: 1034 | results_all.append({'url': result.url, 'markdown': result.markdown}) 1035 | for link in result.links.get("internal", []): 1036 | next_url = normalize_url(link["href"]) 1037 | if next_url not in visited: 1038 | next_level_urls.add(next_url) 1039 | 1040 | current_urls = next_level_urls 1041 | 1042 | return results_all 1043 | 1044 | async def main(): 1045 | transport = os.getenv("TRANSPORT", "sse") 1046 | if transport == 'sse': 1047 | # Run the MCP server with sse transport 1048 | await mcp.run_sse_async() 1049 | else: 1050 | # Run the MCP server with stdio transport 1051 | await mcp.run_stdio_async() 1052 | 1053 | if __name__ == "__main__": 1054 | asyncio.run(main()) -------------------------------------------------------------------------------- /src/utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Utility functions for the Crawl4AI MCP server. 3 | """ 4 | import os 5 | import concurrent.futures 6 | from typing import List, Dict, Any, Optional, Tuple 7 | import json 8 | from supabase import create_client, Client 9 | from urllib.parse import urlparse 10 | import openai 11 | import re 12 | import time 13 | 14 | # Load OpenAI API key for embeddings 15 | openai.api_key = os.getenv("OPENAI_API_KEY") 16 | 17 | def get_supabase_client() -> Client: 18 | """ 19 | Get a Supabase client with the URL and key from environment variables. 20 | 21 | Returns: 22 | Supabase client instance 23 | """ 24 | url = os.getenv("SUPABASE_URL") 25 | key = os.getenv("SUPABASE_SERVICE_KEY") 26 | 27 | if not url or not key: 28 | raise ValueError("SUPABASE_URL and SUPABASE_SERVICE_KEY must be set in environment variables") 29 | 30 | return create_client(url, key) 31 | 32 | def create_embeddings_batch(texts: List[str]) -> List[List[float]]: 33 | """ 34 | Create embeddings for multiple texts in a single API call. 
35 | 36 | Args: 37 | texts: List of texts to create embeddings for 38 | 39 | Returns: 40 | List of embeddings (each embedding is a list of floats) 41 | """ 42 | if not texts: 43 | return [] 44 | 45 | max_retries = 3 46 | retry_delay = 1.0 # Start with 1 second delay 47 | 48 | for retry in range(max_retries): 49 | try: 50 | response = openai.embeddings.create( 51 | model="text-embedding-3-small", # Hardcoding embedding model for now, will change this later to be more dynamic 52 | input=texts 53 | ) 54 | return [item.embedding for item in response.data] 55 | except Exception as e: 56 | if retry < max_retries - 1: 57 | print(f"Error creating batch embeddings (attempt {retry + 1}/{max_retries}): {e}") 58 | print(f"Retrying in {retry_delay} seconds...") 59 | time.sleep(retry_delay) 60 | retry_delay *= 2 # Exponential backoff 61 | else: 62 | print(f"Failed to create batch embeddings after {max_retries} attempts: {e}") 63 | # Try creating embeddings one by one as fallback 64 | print("Attempting to create embeddings individually...") 65 | embeddings = [] 66 | successful_count = 0 67 | 68 | for i, text in enumerate(texts): 69 | try: 70 | individual_response = openai.embeddings.create( 71 | model="text-embedding-3-small", 72 | input=[text] 73 | ) 74 | embeddings.append(individual_response.data[0].embedding) 75 | successful_count += 1 76 | except Exception as individual_error: 77 | print(f"Failed to create embedding for text {i}: {individual_error}") 78 | # Add zero embedding as fallback 79 | embeddings.append([0.0] * 1536) 80 | 81 | print(f"Successfully created {successful_count}/{len(texts)} embeddings individually") 82 | return embeddings 83 | 84 | def create_embedding(text: str) -> List[float]: 85 | """ 86 | Create an embedding for a single text using OpenAI's API. 87 | 88 | Args: 89 | text: Text to create an embedding for 90 | 91 | Returns: 92 | List of floats representing the embedding 93 | """ 94 | try: 95 | embeddings = create_embeddings_batch([text]) 96 | return embeddings[0] if embeddings else [0.0] * 1536 97 | except Exception as e: 98 | print(f"Error creating embedding: {e}") 99 | # Return empty embedding if there's an error 100 | return [0.0] * 1536 101 | 102 | def generate_contextual_embedding(full_document: str, chunk: str) -> Tuple[str, bool]: 103 | """ 104 | Generate contextual information for a chunk within a document to improve retrieval. 105 | 106 | Args: 107 | full_document: The complete document text 108 | chunk: The specific chunk of text to generate context for 109 | 110 | Returns: 111 | Tuple containing: 112 | - The contextual text that situates the chunk within the document 113 | - Boolean indicating if contextual embedding was performed 114 | """ 115 | model_choice = os.getenv("MODEL_CHOICE") 116 | 117 | try: 118 | # Create the prompt for generating contextual information 119 | prompt = f""" 120 | {full_document[:25000]} 121 | 122 | Here is the chunk we want to situate within the whole document 123 | 124 | {chunk} 125 | 126 | Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. 
Answer only with the succinct context and nothing else.""" 127 | 128 | # Call the OpenAI API to generate contextual information 129 | response = openai.chat.completions.create( 130 | model=model_choice, 131 | messages=[ 132 | {"role": "system", "content": "You are a helpful assistant that provides concise contextual information."}, 133 | {"role": "user", "content": prompt} 134 | ], 135 | temperature=0.3, 136 | max_tokens=200 137 | ) 138 | 139 | # Extract the generated context 140 | context = response.choices[0].message.content.strip() 141 | 142 | # Combine the context with the original chunk 143 | contextual_text = f"{context}\n---\n{chunk}" 144 | 145 | return contextual_text, True 146 | 147 | except Exception as e: 148 | print(f"Error generating contextual embedding: {e}. Using original chunk instead.") 149 | return chunk, False 150 | 151 | def process_chunk_with_context(args): 152 | """ 153 | Process a single chunk with contextual embedding. 154 | This function is designed to be used with concurrent.futures. 155 | 156 | Args: 157 | args: Tuple containing (url, content, full_document) 158 | 159 | Returns: 160 | Tuple containing: 161 | - The contextual text that situates the chunk within the document 162 | - Boolean indicating if contextual embedding was performed 163 | """ 164 | url, content, full_document = args 165 | return generate_contextual_embedding(full_document, content) 166 | 167 | def add_documents_to_supabase( 168 | client: Client, 169 | urls: List[str], 170 | chunk_numbers: List[int], 171 | contents: List[str], 172 | metadatas: List[Dict[str, Any]], 173 | url_to_full_document: Dict[str, str], 174 | batch_size: int = 20 175 | ) -> None: 176 | """ 177 | Add documents to the Supabase crawled_pages table in batches. 178 | Deletes existing records with the same URLs before inserting to prevent duplicates. 179 | 180 | Args: 181 | client: Supabase client 182 | urls: List of URLs 183 | chunk_numbers: List of chunk numbers 184 | contents: List of document contents 185 | metadatas: List of document metadata 186 | url_to_full_document: Dictionary mapping URLs to their full document content 187 | batch_size: Size of each batch for insertion 188 | """ 189 | # Get unique URLs to delete existing records 190 | unique_urls = list(set(urls)) 191 | 192 | # Delete existing records for these URLs in a single operation 193 | try: 194 | if unique_urls: 195 | # Use the .in_() filter to delete all records with matching URLs 196 | client.table("crawled_pages").delete().in_("url", unique_urls).execute() 197 | except Exception as e: 198 | print(f"Batch delete failed: {e}. 
Trying one-by-one deletion as fallback.") 199 | # Fallback: delete records one by one 200 | for url in unique_urls: 201 | try: 202 | client.table("crawled_pages").delete().eq("url", url).execute() 203 | except Exception as inner_e: 204 | print(f"Error deleting record for URL {url}: {inner_e}") 205 | # Continue with the next URL even if one fails 206 | 207 | # Check if MODEL_CHOICE is set for contextual embeddings 208 | use_contextual_embeddings = os.getenv("USE_CONTEXTUAL_EMBEDDINGS", "false") == "true" 209 | print(f"\n\nUse contextual embeddings: {use_contextual_embeddings}\n\n") 210 | 211 | # Process in batches to avoid memory issues 212 | for i in range(0, len(contents), batch_size): 213 | batch_end = min(i + batch_size, len(contents)) 214 | 215 | # Get batch slices 216 | batch_urls = urls[i:batch_end] 217 | batch_chunk_numbers = chunk_numbers[i:batch_end] 218 | batch_contents = contents[i:batch_end] 219 | batch_metadatas = metadatas[i:batch_end] 220 | 221 | # Apply contextual embedding to each chunk if MODEL_CHOICE is set 222 | if use_contextual_embeddings: 223 | # Prepare arguments for parallel processing 224 | process_args = [] 225 | for j, content in enumerate(batch_contents): 226 | url = batch_urls[j] 227 | full_document = url_to_full_document.get(url, "") 228 | process_args.append((url, content, full_document)) 229 | 230 | # Process in parallel using ThreadPoolExecutor 231 | contextual_contents = [None] * len(process_args) # Preallocate so each result is stored at its original index 232 | with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor: 233 | # Submit all tasks and collect results 234 | future_to_idx = {executor.submit(process_chunk_with_context, arg): idx 235 | for idx, arg in enumerate(process_args)} 236 | 237 | # Process results as they complete 238 | for future in concurrent.futures.as_completed(future_to_idx): 239 | idx = future_to_idx[future] 240 | try: 241 | result, success = future.result() 242 | contextual_contents[idx] = result 243 | if success: 244 | batch_metadatas[idx]["contextual_embedding"] = True 245 | except Exception as e: 246 | print(f"Error processing chunk {idx}: {e}") 247 | # Use original content as fallback 248 | contextual_contents[idx] = batch_contents[idx] 249 | 250 | # Fill any slots that are still empty with the original content so chunks stay aligned with their URLs and metadata 251 | for k, c in enumerate(contextual_contents): 252 | if c is None: 253 | print(f"Warning: no contextual content generated for chunk {k}, using original content") 254 | contextual_contents[k] = batch_contents[k] 255 | else: 256 | # If not using contextual embeddings, use original contents 257 | contextual_contents = batch_contents 258 | 259 | # Create embeddings for the entire batch at once 260 | batch_embeddings = create_embeddings_batch(contextual_contents) 261 | 262 | batch_data = [] 263 | for j in range(len(contextual_contents)): 264 | # Extract metadata fields 265 | chunk_size = len(contextual_contents[j]) 266 | 267 | # Extract source_id from URL 268 | parsed_url = urlparse(batch_urls[j]) 269 | source_id = parsed_url.netloc or parsed_url.path 270 | 271 | # Prepare data for insertion 272 | data = { 273 | "url": batch_urls[j], 274 | "chunk_number": batch_chunk_numbers[j], 275 | "content": contextual_contents[j], # Store the contextualized content (or the original chunk if contextual embeddings are disabled) 276 | "metadata": { 277 | "chunk_size": chunk_size, 278 | **batch_metadatas[j] 279 | }, 280 | "source_id": source_id, # Add source_id field 281 | "embedding": batch_embeddings[j] # Use embedding from contextual content 282 | } 283 | 284 | batch_data.append(data) 285 | 286 | # Insert batch into Supabase with retry logic 287 | max_retries 
= 3 288 | retry_delay = 1.0 # Start with 1 second delay 289 | 290 | for retry in range(max_retries): 291 | try: 292 | client.table("crawled_pages").insert(batch_data).execute() 293 | # Success - break out of retry loop 294 | break 295 | except Exception as e: 296 | if retry < max_retries - 1: 297 | print(f"Error inserting batch into Supabase (attempt {retry + 1}/{max_retries}): {e}") 298 | print(f"Retrying in {retry_delay} seconds...") 299 | time.sleep(retry_delay) 300 | retry_delay *= 2 # Exponential backoff 301 | else: 302 | # Final attempt failed 303 | print(f"Failed to insert batch after {max_retries} attempts: {e}") 304 | # Optionally, try inserting records one by one as a last resort 305 | print("Attempting to insert records individually...") 306 | successful_inserts = 0 307 | for record in batch_data: 308 | try: 309 | client.table("crawled_pages").insert(record).execute() 310 | successful_inserts += 1 311 | except Exception as individual_error: 312 | print(f"Failed to insert individual record for URL {record['url']}: {individual_error}") 313 | 314 | if successful_inserts > 0: 315 | print(f"Successfully inserted {successful_inserts}/{len(batch_data)} records individually") 316 | 317 | def search_documents( 318 | client: Client, 319 | query: str, 320 | match_count: int = 10, 321 | filter_metadata: Optional[Dict[str, Any]] = None 322 | ) -> List[Dict[str, Any]]: 323 | """ 324 | Search for documents in Supabase using vector similarity. 325 | 326 | Args: 327 | client: Supabase client 328 | query: Query text 329 | match_count: Maximum number of results to return 330 | filter_metadata: Optional metadata filter 331 | 332 | Returns: 333 | List of matching documents 334 | """ 335 | # Create embedding for the query 336 | query_embedding = create_embedding(query) 337 | 338 | # Execute the search using the match_crawled_pages function 339 | try: 340 | # Only include filter parameter if filter_metadata is provided and not empty 341 | params = { 342 | 'query_embedding': query_embedding, 343 | 'match_count': match_count 344 | } 345 | 346 | # Only add the filter if it's actually provided and not empty 347 | if filter_metadata: 348 | params['filter'] = filter_metadata # Pass the dictionary directly, not JSON-encoded 349 | 350 | result = client.rpc('match_crawled_pages', params).execute() 351 | 352 | return result.data 353 | except Exception as e: 354 | print(f"Error searching documents: {e}") 355 | return [] 356 | 357 | 358 | def extract_code_blocks(markdown_content: str, min_length: int = 1000) -> List[Dict[str, Any]]: 359 | """ 360 | Extract code blocks from markdown content along with context. 
361 | 362 | Args: 363 | markdown_content: The markdown content to extract code blocks from 364 | min_length: Minimum length of code blocks to extract (default: 1000 characters) 365 | 366 | Returns: 367 | List of dictionaries containing code blocks and their context 368 | """ 369 | code_blocks = [] 370 | 371 | # Skip if content starts with triple backticks (edge case for files wrapped in backticks) 372 | content = markdown_content.strip() 373 | start_offset = 0 374 | if content.startswith('```'): 375 | # Skip the first triple backticks 376 | start_offset = 3 377 | print("Skipping initial triple backticks") 378 | 379 | # Find all occurrences of triple backticks 380 | backtick_positions = [] 381 | pos = start_offset 382 | while True: 383 | pos = markdown_content.find('```', pos) 384 | if pos == -1: 385 | break 386 | backtick_positions.append(pos) 387 | pos += 3 388 | 389 | # Process pairs of backticks 390 | i = 0 391 | while i < len(backtick_positions) - 1: 392 | start_pos = backtick_positions[i] 393 | end_pos = backtick_positions[i + 1] 394 | 395 | # Extract the content between backticks 396 | code_section = markdown_content[start_pos+3:end_pos] 397 | 398 | # Check if there's a language specifier on the first line 399 | lines = code_section.split('\n', 1) 400 | if len(lines) > 1: 401 | # Check if first line is a language specifier (no spaces, common language names) 402 | first_line = lines[0].strip() 403 | if first_line and not ' ' in first_line and len(first_line) < 20: 404 | language = first_line 405 | code_content = lines[1].strip() if len(lines) > 1 else "" 406 | else: 407 | language = "" 408 | code_content = code_section.strip() 409 | else: 410 | language = "" 411 | code_content = code_section.strip() 412 | 413 | # Skip if code block is too short 414 | if len(code_content) < min_length: 415 | i += 2 # Move to next pair 416 | continue 417 | 418 | # Extract context before (1000 chars) 419 | context_start = max(0, start_pos - 1000) 420 | context_before = markdown_content[context_start:start_pos].strip() 421 | 422 | # Extract context after (1000 chars) 423 | context_end = min(len(markdown_content), end_pos + 3 + 1000) 424 | context_after = markdown_content[end_pos + 3:context_end].strip() 425 | 426 | code_blocks.append({ 427 | 'code': code_content, 428 | 'language': language, 429 | 'context_before': context_before, 430 | 'context_after': context_after, 431 | 'full_context': f"{context_before}\n\n{code_content}\n\n{context_after}" 432 | }) 433 | 434 | # Move to next pair (skip the closing backtick we just processed) 435 | i += 2 436 | 437 | return code_blocks 438 | 439 | 440 | def generate_code_example_summary(code: str, context_before: str, context_after: str) -> str: 441 | """ 442 | Generate a summary for a code example using its surrounding context. 443 | 444 | Args: 445 | code: The code example 446 | context_before: Context before the code 447 | context_after: Context after the code 448 | 449 | Returns: 450 | A summary of what the code example demonstrates 451 | """ 452 | model_choice = os.getenv("MODEL_CHOICE") 453 | 454 | # Create the prompt 455 | prompt = f""" 456 | {context_before[-500:] if len(context_before) > 500 else context_before} 457 | 458 | 459 | 460 | {code[:1500] if len(code) > 1500 else code} 461 | 462 | 463 | 464 | {context_after[:500] if len(context_after) > 500 else context_after} 465 | 466 | 467 | Based on the code example and its surrounding context, provide a concise summary (2-3 sentences) that describes what this code example demonstrates and its purpose. 
Focus on the practical application and key concepts illustrated. 468 | """ 469 | 470 | try: 471 | response = openai.chat.completions.create( 472 | model=model_choice, 473 | messages=[ 474 | {"role": "system", "content": "You are a helpful assistant that provides concise code example summaries."}, 475 | {"role": "user", "content": prompt} 476 | ], 477 | temperature=0.3, 478 | max_tokens=100 479 | ) 480 | 481 | return response.choices[0].message.content.strip() 482 | 483 | except Exception as e: 484 | print(f"Error generating code example summary: {e}") 485 | return "Code example for demonstration purposes." 486 | 487 | 488 | def add_code_examples_to_supabase( 489 | client: Client, 490 | urls: List[str], 491 | chunk_numbers: List[int], 492 | code_examples: List[str], 493 | summaries: List[str], 494 | metadatas: List[Dict[str, Any]], 495 | batch_size: int = 20 496 | ): 497 | """ 498 | Add code examples to the Supabase code_examples table in batches. 499 | 500 | Args: 501 | client: Supabase client 502 | urls: List of URLs 503 | chunk_numbers: List of chunk numbers 504 | code_examples: List of code example contents 505 | summaries: List of code example summaries 506 | metadatas: List of metadata dictionaries 507 | batch_size: Size of each batch for insertion 508 | """ 509 | if not urls: 510 | return 511 | 512 | # Delete existing records for these URLs 513 | unique_urls = list(set(urls)) 514 | for url in unique_urls: 515 | try: 516 | client.table('code_examples').delete().eq('url', url).execute() 517 | except Exception as e: 518 | print(f"Error deleting existing code examples for {url}: {e}") 519 | 520 | # Process in batches 521 | total_items = len(urls) 522 | for i in range(0, total_items, batch_size): 523 | batch_end = min(i + batch_size, total_items) 524 | batch_texts = [] 525 | 526 | # Create combined texts for embedding (code + summary) 527 | for j in range(i, batch_end): 528 | combined_text = f"{code_examples[j]}\n\nSummary: {summaries[j]}" 529 | batch_texts.append(combined_text) 530 | 531 | # Create embeddings for the batch 532 | embeddings = create_embeddings_batch(batch_texts) 533 | 534 | # Check if embeddings are valid (not all zeros) 535 | valid_embeddings = [] 536 | for embedding in embeddings: 537 | if embedding and not all(v == 0.0 for v in embedding): 538 | valid_embeddings.append(embedding) 539 | else: 540 | print(f"Warning: Zero or invalid embedding detected, creating new one...") 541 | # Try to create a single embedding as fallback 542 | single_embedding = create_embedding(batch_texts[len(valid_embeddings)]) 543 | valid_embeddings.append(single_embedding) 544 | 545 | # Prepare batch data 546 | batch_data = [] 547 | for j, embedding in enumerate(valid_embeddings): 548 | idx = i + j 549 | 550 | # Extract source_id from URL 551 | parsed_url = urlparse(urls[idx]) 552 | source_id = parsed_url.netloc or parsed_url.path 553 | 554 | batch_data.append({ 555 | 'url': urls[idx], 556 | 'chunk_number': chunk_numbers[idx], 557 | 'content': code_examples[idx], 558 | 'summary': summaries[idx], 559 | 'metadata': metadatas[idx], # Store as JSON object, not string 560 | 'source_id': source_id, 561 | 'embedding': embedding 562 | }) 563 | 564 | # Insert batch into Supabase with retry logic 565 | max_retries = 3 566 | retry_delay = 1.0 # Start with 1 second delay 567 | 568 | for retry in range(max_retries): 569 | try: 570 | client.table('code_examples').insert(batch_data).execute() 571 | # Success - break out of retry loop 572 | break 573 | except Exception as e: 574 | if retry < max_retries - 1: 
575 | print(f"Error inserting batch into Supabase (attempt {retry + 1}/{max_retries}): {e}") 576 | print(f"Retrying in {retry_delay} seconds...") 577 | time.sleep(retry_delay) 578 | retry_delay *= 2 # Exponential backoff 579 | else: 580 | # Final attempt failed 581 | print(f"Failed to insert batch after {max_retries} attempts: {e}") 582 | # Optionally, try inserting records one by one as a last resort 583 | print("Attempting to insert records individually...") 584 | successful_inserts = 0 585 | for record in batch_data: 586 | try: 587 | client.table('code_examples').insert(record).execute() 588 | successful_inserts += 1 589 | except Exception as individual_error: 590 | print(f"Failed to insert individual record for URL {record['url']}: {individual_error}") 591 | 592 | if successful_inserts > 0: 593 | print(f"Successfully inserted {successful_inserts}/{len(batch_data)} records individually") 594 | print(f"Inserted batch {i//batch_size + 1} of {(total_items + batch_size - 1)//batch_size} code examples") 595 | 596 | 597 | def update_source_info(client: Client, source_id: str, summary: str, word_count: int): 598 | """ 599 | Update or insert source information in the sources table. 600 | 601 | Args: 602 | client: Supabase client 603 | source_id: The source ID (domain) 604 | summary: Summary of the source 605 | word_count: Total word count for the source 606 | """ 607 | try: 608 | # Try to update existing source 609 | result = client.table('sources').update({ 610 | 'summary': summary, 611 | 'total_word_count': word_count, 612 | 'updated_at': 'now()' 613 | }).eq('source_id', source_id).execute() 614 | 615 | # If no rows were updated, insert new source 616 | if not result.data: 617 | client.table('sources').insert({ 618 | 'source_id': source_id, 619 | 'summary': summary, 620 | 'total_word_count': word_count 621 | }).execute() 622 | print(f"Created new source: {source_id}") 623 | else: 624 | print(f"Updated source: {source_id}") 625 | 626 | except Exception as e: 627 | print(f"Error updating source {source_id}: {e}") 628 | 629 | 630 | def extract_source_summary(source_id: str, content: str, max_length: int = 500) -> str: 631 | """ 632 | Extract a summary for a source from its content using an LLM. 633 | 634 | This function uses the OpenAI API to generate a concise summary of the source content. 635 | 636 | Args: 637 | source_id: The source ID (domain) 638 | content: The content to extract a summary from 639 | max_length: Maximum length of the summary 640 | 641 | Returns: 642 | A summary string 643 | """ 644 | # Default summary if we can't extract anything meaningful 645 | default_summary = f"Content from {source_id}" 646 | 647 | if not content or len(content.strip()) == 0: 648 | return default_summary 649 | 650 | # Get the model choice from environment variables 651 | model_choice = os.getenv("MODEL_CHOICE") 652 | 653 | # Limit content length to avoid token limits 654 | truncated_content = content[:25000] if len(content) > 25000 else content 655 | 656 | # Create the prompt for generating the summary 657 | prompt = f""" 658 | {truncated_content} 659 | 660 | 661 | The above content is from the documentation for '{source_id}'. Please provide a concise summary (3-5 sentences) that describes what this library/tool/framework is about. The summary should help understand what the library/tool/framework accomplishes and the purpose. 
662 | """ 663 | 664 | try: 665 | # Call the OpenAI API to generate the summary 666 | response = openai.chat.completions.create( 667 | model=model_choice, 668 | messages=[ 669 | {"role": "system", "content": "You are a helpful assistant that provides concise library/tool/framework summaries."}, 670 | {"role": "user", "content": prompt} 671 | ], 672 | temperature=0.3, 673 | max_tokens=150 674 | ) 675 | 676 | # Extract the generated summary 677 | summary = response.choices[0].message.content.strip() 678 | 679 | # Ensure the summary is not too long 680 | if len(summary) > max_length: 681 | summary = summary[:max_length] + "..." 682 | 683 | return summary 684 | 685 | except Exception as e: 686 | print(f"Error generating summary with LLM for {source_id}: {e}. Using default summary.") 687 | return default_summary 688 | 689 | 690 | def search_code_examples( 691 | client: Client, 692 | query: str, 693 | match_count: int = 10, 694 | filter_metadata: Optional[Dict[str, Any]] = None, 695 | source_id: Optional[str] = None 696 | ) -> List[Dict[str, Any]]: 697 | """ 698 | Search for code examples in Supabase using vector similarity. 699 | 700 | Args: 701 | client: Supabase client 702 | query: Query text 703 | match_count: Maximum number of results to return 704 | filter_metadata: Optional metadata filter 705 | source_id: Optional source ID to filter results 706 | 707 | Returns: 708 | List of matching code examples 709 | """ 710 | # Create a more descriptive query for better embedding match 711 | # Since code examples are embedded with their summaries, we should make the query more descriptive 712 | enhanced_query = f"Code example for {query}\n\nSummary: Example code showing {query}" 713 | 714 | # Create embedding for the enhanced query 715 | query_embedding = create_embedding(enhanced_query) 716 | 717 | # Execute the search using the match_code_examples function 718 | try: 719 | # Only include filter parameter if filter_metadata is provided and not empty 720 | params = { 721 | 'query_embedding': query_embedding, 722 | 'match_count': match_count 723 | } 724 | 725 | # Only add the filter if it's actually provided and not empty 726 | if filter_metadata: 727 | params['filter'] = filter_metadata 728 | 729 | # Add source filter if provided 730 | if source_id: 731 | params['source_filter'] = source_id 732 | 733 | result = client.rpc('match_code_examples', params).execute() 734 | 735 | return result.data 736 | except Exception as e: 737 | print(f"Error searching code examples: {e}") 738 | return [] --------------------------------------------------------------------------------