├── .env.example ├── .gitignore ├── README.md ├── main.py ├── requirements.txt ├── static ├── index.html └── script.js └── utils ├── parse.py └── scrape.py /.env.example: -------------------------------------------------------------------------------- 1 | SBR_WEBDRIVER= 2 | GEMINI_API_KEY= -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .vercel 2 | *.log 3 | *.pyc 4 | __pycache__ 5 | 6 | # Environments 7 | .env 8 | .venv 9 | env/ 10 | venv/ 11 | ENV/ 12 | env.bak/ 13 | venv.bak/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # WebCrawlAI: AI-Powered Web Scraper 2 | 3 | This project implements a web scraping API that leverages the Gemini AI model to extract specific information from websites. It provides a user-friendly interface for defining extraction criteria and handles dynamic content and CAPTCHAs using a scraping browser. The API is deployed on Render and is designed for easy integration into various projects. 4 | 5 | ## Features 6 | 7 | * Scrapes data from websites, handling dynamic content and CAPTCHAs. 8 | * Uses Gemini AI to precisely extract the requested information. 9 | * Provides a clean JSON output of the extracted data. 10 | * Includes a user-friendly web interface for easy interaction. 11 | * Error handling and retry mechanisms for robust operation. 12 | * Event tracking using GetAnalyzr for monitoring API usage. 13 | 14 | ## Usage 15 | 16 | 1. **Access the Web Interface:** Visit [https://webcrawlai.onrender.com/](https://webcrawlai.onrender.com/) 17 | 2. **Enter the URL:** Input the website URL you want to scrape. 18 | 3. **Specify Extraction Prompt:** Provide a clear description of the data you need (e.g., "Extract all product names and prices"). 19 | 4. **Click "Extract Information":** The API will process your request, and the results will be displayed. 20 | 21 | ## Installation 22 | 23 | This project is deployed as a web application. No local installation is required for usage. However, if you wish to run the code locally, follow these steps: 24 | 25 | 1. **Clone the Repository:** 26 | ```bash 27 | git clone https://github.com/YOUR_USERNAME/WebCrawlAI.git 28 | cd WebCrawlAI 29 | ``` 30 | 2. **Install Dependencies:** 31 | ```bash 32 | pip install -r requirements.txt 33 | ``` 34 | 3. **Set Environment Variables:** Create a `.env` file (refer to `.env.example`) and populate it with your `SBR_WEBDRIVER` (Bright Data Scraping Browser URL) and `GEMINI_API_KEY` (Google Gemini API Key). 35 | 4. **Run the Application:** 36 | ```bash 37 | python main.py 38 | ``` 39 | 40 | ## Technologies Used 41 | 42 | * **Flask (3.0.0):** Web framework for building the API. 43 | * **BeautifulSoup (4.12.2):** HTML/XML parser for extracting data from web pages. 44 | * **Selenium (4.16.0):** For automating browser interactions, handling dynamic content and CAPTCHAs. 45 | * **lxml:** Fast and efficient XML and HTML processing library. 46 | * **html5lib:** For parsing HTML documents. 47 | * **python-dotenv (1.0.0):** For managing environment variables. 48 | * **google-generativeai (0.3.1):** Integrates the Gemini AI model for data parsing and extraction. 49 | * **axios:** JavaScript library for making HTTP requests (client-side). 50 | * **marked:** JavaScript library for rendering Markdown (client-side). 
51 | * **Tailwind CSS:** Utility-first CSS framework for styling (client-side). 52 | * **GetAnalyzr:** For event tracking and API usage monitoring. 53 | * **Bright Data Scraping Browser:** Provides fully-managed, headless browsers for reliable web scraping. 54 | 55 | 56 | ## API Documentation 57 | 58 | **Endpoint:** `/scrape-and-parse` 59 | 60 | **Method:** `POST` 61 | 62 | **Request Body (JSON):** 63 | 64 | ```json 65 | { 66 | "url": "https://www.example.com", 67 | "parse_description": "Extract all product names and prices" 68 | } 69 | ``` 70 | 71 | **Response (JSON):** 72 | 73 | **Success:** 74 | 75 | ```json 76 | { 77 | "success": true, 78 | "result": { 79 | "products": [ 80 | {"name": "Product A", "price": "$10"}, 81 | {"name": "Product B", "price": "$20"} 82 | ] 83 | } 84 | } 85 | ``` 86 | 87 | **Error:** 88 | 89 | ```json 90 | { 91 | "error": "An error occurred during scraping or parsing" 92 | } 93 | ``` 94 | 95 | 96 | ## Dependencies 97 | 98 | The project dependencies are listed in `requirements.txt`. Use `pip install -r requirements.txt` to install them. 99 | 100 | ## Contributing 101 | 102 | Contributions are welcome! Please open an issue or submit a pull request. 103 | 104 | ## Testing 105 | 106 | No formal testing framework is currently implemented. Testing should be added as part of future development. 107 | 108 | 109 | *README.md was made with [Etchr](https://etchr.dev)* -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | from flask import Flask, request, jsonify, send_from_directory 2 | import json 3 | from utils.scrape import ( 4 | scrape_website, 5 | extract_body_content, 6 | clean_body_content, 7 | split_dom_content, 8 | ) 9 | from utils.parse import parse_with_gemini 10 | 11 | app = Flask(__name__, static_url_path='/static') 12 | 13 | @app.route('/') 14 | def index(): 15 | return send_from_directory('static', 'index.html') 16 | 17 | @app.route('/scrape-and-parse', methods=['POST']) 18 | def scrape_and_parse(): 19 | data = request.get_json() 20 | url = data.get('url') 21 | parse_description = data.get('parse_description') 22 | 23 | if not url or not parse_description: 24 | return jsonify({'error': 'Both URL and parse_description are required'}), 400 25 | 26 | try: 27 | # Scrape the website 28 | dom_content = scrape_website(url) 29 | body_content = extract_body_content(dom_content) 30 | cleaned_content = clean_body_content(body_content) 31 | 32 | # Parse the content 33 | dom_chunks = split_dom_content(cleaned_content) 34 | result = parse_with_gemini(dom_chunks, parse_description) 35 | 36 | # Try to parse the result as JSON if it's a string 37 | try: 38 | if isinstance(result, str): 39 | result = json.loads(result) 40 | except json.JSONDecodeError: 41 | pass # Keep the result as is if it's not valid JSON 42 | 43 | return jsonify({ 44 | 'success': True, 45 | 'result': result 46 | }) 47 | except Exception as e: 48 | print(f"Error in scrape_and_parse: {str(e)}") 49 | return jsonify({'error': str(e)}), 500 50 | 51 | if __name__ == '__main__': 52 | app.run() -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ArjunCodess/WebCrawlAI/b64edabd89ed1bb7fb2716811a2b83717e314d1d/requirements.txt -------------------------------------------------------------------------------- /static/index.html: 
--------------------------------------------------------------------------------
[The markup of static/index.html was not preserved in this extraction; only stray line numbers and page text survived. Recoverable details: the page title and the heading near the top of the body are "AI Web Scraper", and, per static/script.js below, the page contains a URL input (id "url"), an extraction-prompt field (id "prompt"), an "Extract Information" button (id "extractBtn"), a results container (ids "resultsSection" and "results"), and a loading spinner (id "loadingSpinner"). Per the README, the page is styled with Tailwind CSS and loads axios and marked on the client.]
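Because the original form markup is not recoverable here, the hedged sketch below shows the equivalent of what the page's Extract button submits: a direct call to the `/scrape-and-parse` endpoint documented in the README. The base URL assumes a local `python main.py` run on Flask's default port, the target URL and prompt are placeholders, and `requests` is an extra dependency not listed in `requirements.txt`.

```python
# Hedged sketch: call the /scrape-and-parse endpoint directly, bypassing the web UI.
# Assumes the app is running locally via `python main.py` (Flask's default port 5000).
import json
import requests  # not in requirements.txt; used here for brevity

BASE_URL = "http://127.0.0.1:5000"  # assumption: local dev server; use the deployed URL otherwise

payload = {
    "url": "https://www.example.com",  # placeholder target site
    "parse_description": "Extract all product names and prices",
}

resp = requests.post(f"{BASE_URL}/scrape-and-parse", json=payload, timeout=300)
data = resp.json()

if resp.ok and data.get("success"):
    # "result" is whatever JSON structure Gemini produced for the prompt
    print(json.dumps(data["result"], indent=2))
else:
    print("Error:", data.get("error", f"HTTP {resp.status_code}"))
```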
-------------------------------------------------------------------------------- /static/script.js: -------------------------------------------------------------------------------- 1 | document.addEventListener('DOMContentLoaded', () => { 2 | const elements = { 3 | url: document.getElementById('url'), 4 | prompt: document.getElementById('prompt'), 5 | extractBtn: document.getElementById('extractBtn'), 6 | results: document.getElementById('results'), 7 | loadingSpinner: document.getElementById('loadingSpinner'), 8 | resultsSection: document.getElementById('resultsSection') 9 | }; 10 | 11 | const showLoading = () => elements.loadingSpinner.classList.remove('hidden'); 12 | const hideLoading = () => elements.loadingSpinner.classList.add('hidden'); 13 | 14 | const showError = (message) => { 15 | alert(message); 16 | hideLoading(); 17 | }; 18 | 19 | const formatResult = (result) => { 20 | try { 21 | // If result is already a JSON string, parse it 22 | const parsed = typeof result === 'string' ? JSON.parse(result) : result; 23 | return JSON.stringify(parsed, null, 2); 24 | } catch (e) { 25 | return result; // Return as is if parsing fails 26 | } 27 | }; 28 | 29 | elements.extractBtn.addEventListener('click', async () => { 30 | const url = elements.url.value.trim(); 31 | const prompt = elements.prompt.value.trim(); 32 | 33 | if (!url || !prompt) { 34 | showError('Please enter both URL and extraction prompt'); 35 | return; 36 | } 37 | 38 | showLoading(); 39 | try { 40 | const response = await axios.post('/scrape-and-parse', { 41 | url: url, 42 | parse_description: prompt 43 | }); 44 | 45 | // Only display the result part, properly formatted 46 | const formattedResult = formatResult(response.data.result); 47 | elements.results.textContent = formattedResult; 48 | elements.resultsSection.classList.remove('hidden'); 49 | 50 | // Send event tracking data 51 | const API_KEY = window.ANALYZR_API_KEY || ''; // NOTE: process.env is not available in browser scripts; assumes the key is exposed as a page global (e.g. injected when the page is served) 52 | const trackingUrl = "https://getanalyzr.vercel.app/api/events"; 53 | const headers = { 54 | "Content-Type": "application/json", 55 | "Authorization": `Bearer ${API_KEY}` 56 | }; 57 | 58 | const eventData = { 59 | name: "Information Extracted", 60 | domain: window.location.hostname || 'localhost', 61 | description: `Extracted information from URL: ${url}`, 62 | emoji: "🔍", 63 | fields: [ 64 | { 65 | name: "URL", 66 | value: url, 67 | inline: true 68 | }, 69 | { 70 | name: "Prompt", 71 | value: prompt, 72 | inline: true 73 | } 74 | ] 75 | }; 76 | 77 | try { 78 | await axios.post(trackingUrl, eventData, { headers }); 79 | console.log("Event tracking successful"); 80 | } catch (error) { 81 | console.error("Event tracking error:", error.response ?
error.response.data : error.message); 82 | } 83 | } catch (error) { 84 | showError(error.response?.data?.error || 'Failed to extract information'); 85 | } finally { 86 | hideLoading(); 87 | } 88 | }); 89 | }); -------------------------------------------------------------------------------- /utils/parse.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from dotenv import load_dotenv 4 | import google.generativeai as genai 5 | 6 | load_dotenv() 7 | 8 | # Configure Gemini API 9 | genai.configure(api_key=os.getenv('GEMINI_API_KEY')) 10 | model = genai.GenerativeModel('gemini-pro') 11 | 12 | def clean_json_response(text): 13 | """Clean the response to extract only the JSON part""" 14 | # Remove markdown code blocks if present 15 | text = text.replace('```json', '').replace('```', '').strip() 16 | 17 | # Try to find JSON content between curly braces 18 | try: 19 | start = text.index('{') 20 | end = text.rindex('}') + 1 21 | json_str = text[start:end] 22 | 23 | # Parse and re-format JSON 24 | parsed_json = json.loads(json_str) 25 | return json.dumps(parsed_json, indent=2) 26 | except (ValueError, json.JSONDecodeError) as e: 27 | return text 28 | 29 | def parse_with_gemini(dom_chunks, parse_description): 30 | prompt_template = """ 31 | Extract information from the following text content and return it as a CLEAN JSON object. 32 | 33 | Text content: {content} 34 | 35 | Instructions: 36 | 1. Extract information matching this description: {description} 37 | 2. Return ONLY a valid JSON object, no other text or markdown 38 | 3. If no information is found, return an empty JSON object {{}} 39 | 4. Ensure the JSON is properly formatted and valid 40 | 5. DO NOT include any explanatory text, code blocks, or markdown - ONLY the JSON object 41 | """ 42 | 43 | parsed_results = [] 44 | 45 | for i, chunk in enumerate(dom_chunks, start=1): 46 | try: 47 | prompt = prompt_template.format( 48 | content=chunk, 49 | description=parse_description 50 | ) 51 | 52 | response = model.generate_content(prompt) 53 | result = clean_json_response(response.text.strip()) 54 | if result and result != '{}': 55 | parsed_results.append(result) 56 | print(f"Parsed batch: {i} of {len(dom_chunks)}") 57 | except Exception as e: 58 | print(f"Error processing chunk {i}: {str(e)}") 59 | continue 60 | 61 | # Combine results if multiple chunks produced output 62 | if len(parsed_results) > 1: 63 | try: 64 | # Parse all results into Python objects 65 | json_objects = [json.loads(result) for result in parsed_results] 66 | 67 | # Merge objects if they're dictionaries 68 | if all(isinstance(obj, dict) for obj in json_objects): 69 | merged = {} 70 | for obj in json_objects: 71 | merged.update(obj) 72 | return json.dumps(merged, indent=2) 73 | 74 | # If they're lists or mixed, combine them 75 | return json.dumps(json_objects, indent=2) 76 | except json.JSONDecodeError: 77 | # If merging fails, return the first valid result 78 | return parsed_results[0] 79 | 80 | # Return the single result or empty JSON object 81 | return parsed_results[0] if parsed_results else '{}' -------------------------------------------------------------------------------- /utils/scrape.py: -------------------------------------------------------------------------------- 1 | import os 2 | from bs4 import BeautifulSoup 3 | from dotenv import load_dotenv 4 | from selenium.webdriver import ChromeOptions, Remote 5 | from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection 6 | from 
selenium.common.exceptions import WebDriverException 7 | import time 8 | 9 | load_dotenv() 10 | SBR_WEBDRIVER = os.getenv("SBR_WEBDRIVER") 11 | 12 | def create_driver(): 13 | options = ChromeOptions() 14 | options.add_argument('--no-sandbox') 15 | options.add_argument('--headless') 16 | options.add_argument('--disable-dev-shm-usage') 17 | return Remote( 18 | command_executor=SBR_WEBDRIVER, 19 | options=options 20 | ) 21 | 22 | def scrape_website(website): 23 | max_retries = 3 24 | retry_delay = 2 25 | 26 | for attempt in range(max_retries): 27 | driver = None 28 | try: 29 | print(f"Attempt {attempt + 1} of {max_retries}") 30 | print("Connecting to Scraping Browser...") 31 | 32 | driver = create_driver() 33 | driver.get(website) 34 | 35 | print("Waiting for page to load...") 36 | time.sleep(2) # Give the page some time to load 37 | 38 | print("Checking for captcha...") 39 | try: 40 | solve_res = driver.execute( 41 | "executeCdpCommand", 42 | { 43 | "cmd": "Captcha.waitForSolve", 44 | "params": {"detectTimeout": 10000}, 45 | }, 46 | ) 47 | print("Captcha solve status:", solve_res["value"]["status"]) 48 | except WebDriverException as e: 49 | print("No captcha detected or captcha handling failed:", str(e)) 50 | 51 | print("Scraping page content...") 52 | html = driver.page_source 53 | 54 | if html and len(html) > 0: 55 | return html 56 | else: 57 | raise Exception("Empty page content received") 58 | 59 | except Exception as e: 60 | print(f"Error during attempt {attempt + 1}: {str(e)}") 61 | if attempt < max_retries - 1: 62 | print(f"Retrying in {retry_delay} seconds...") 63 | time.sleep(retry_delay) 64 | else: 65 | raise Exception(f"Failed to scrape after {max_retries} attempts: {str(e)}") 66 | finally: 67 | if driver: 68 | try: 69 | driver.quit() 70 | except: 71 | pass 72 | 73 | def extract_body_content(html_content): 74 | soup = BeautifulSoup(html_content, "html.parser") 75 | body_content = soup.body 76 | if body_content: 77 | return str(body_content) 78 | return "" 79 | 80 | def clean_body_content(body_content): 81 | soup = BeautifulSoup(body_content, "html.parser") 82 | 83 | for script_or_style in soup(["script", "style"]): 84 | script_or_style.extract() 85 | 86 | # Get text or further process the content 87 | cleaned_content = soup.get_text(separator="\n") 88 | cleaned_content = "\n".join( 89 | line.strip() for line in cleaned_content.splitlines() if line.strip() 90 | ) 91 | 92 | return cleaned_content 93 | 94 | def split_dom_content(dom_content, max_length=6000): 95 | return [ 96 | dom_content[i : i + max_length] for i in range(0, len(dom_content), max_length) 97 | ] --------------------------------------------------------------------------------
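The HTML-processing helpers in utils/scrape.py can be exercised without a Scraping Browser connection or a Gemini key. The sketch below runs them on a small hard-coded snippet from the repository root; the sample HTML and the tiny `max_length` are arbitrary values chosen for illustration, not part of the project.

```python
# Minimal, offline sketch of the utils/scrape.py helpers on a hard-coded snippet.
# No SBR_WEBDRIVER or GEMINI_API_KEY is needed for this part of the pipeline;
# run from the repository root so `utils` is importable.
from utils.scrape import extract_body_content, clean_body_content, split_dom_content

sample_html = """
<html>
  <head><title>Demo</title><style>body { color: red; }</style></head>
  <body>
    <h1>Product A</h1>
    <p>Price: $10</p>
    <script>console.log('ignored');</script>
  </body>
</html>
"""

body = extract_body_content(sample_html)          # "<body>...</body>" as a string
text = clean_body_content(body)                   # scripts/styles stripped, one text fragment per line
chunks = split_dom_content(text, max_length=10)   # tiny max_length so the split is visible (default is 6000)

print(text)
print(f"{len(chunks)} chunk(s):", chunks)
```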