├── .env.example
├── .gitignore
├── README.md
├── main.py
├── requirements.txt
├── static
│   ├── index.html
│   └── script.js
└── utils
    ├── parse.py
    └── scrape.py
/.env.example:
--------------------------------------------------------------------------------
1 | SBR_WEBDRIVER=
2 | GEMINI_API_KEY=
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .vercel
2 | *.log
3 | *.pyc
4 | __pycache__
5 |
6 | # Environments
7 | .env
8 | .venv
9 | env/
10 | venv/
11 | ENV/
12 | env.bak/
13 | venv.bak/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # WebCrawlAI: AI-Powered Web Scraper
2 |
3 | This project implements a web scraping API that leverages the Gemini AI model to extract specific information from websites. It provides a user-friendly interface for defining extraction criteria and handles dynamic content and CAPTCHAs using a scraping browser. The API is deployed on Render and is designed for easy integration into various projects.
4 |
5 | ## Features
6 |
7 | * Scrapes data from websites, handling dynamic content and CAPTCHAs.
8 | * Uses Gemini AI to precisely extract the requested information.
9 | * Provides a clean JSON output of the extracted data.
10 | * Includes a user-friendly web interface for easy interaction.
11 | * Includes error handling and retry mechanisms for robust operation.
12 | * Tracks events with GetAnalyzr to monitor API usage.
13 |
14 | ## Usage
15 |
16 | 1. **Access the Web Interface:** Visit [https://webcrawlai.onrender.com/](https://webcrawlai.onrender.com/)
17 | 2. **Enter the URL:** Input the website URL you want to scrape.
18 | 3. **Specify Extraction Prompt:** Provide a clear description of the data you need (e.g., "Extract all product names and prices").
19 | 4. **Click "Extract Information":** The API will process your request, and the results will be displayed.
20 |
21 | ## Installation
22 |
23 | This project is deployed as a web application, so no local installation is required to use it. If you wish to run the code locally, follow these steps:
24 |
25 | 1. **Clone the Repository:**
26 | ```bash
27 | git clone https://github.com/YOUR_USERNAME/WebCrawlAI.git
28 | cd WebCrawlAI
29 | ```
30 | 2. **Install Dependencies:**
31 | ```bash
32 | pip install -r requirements.txt
33 | ```
34 | 3. **Set Environment Variables:** Create a `.env` file (refer to `.env.example`) and populate it with your `SBR_WEBDRIVER` (Bright Data Scraping Browser URL) and `GEMINI_API_KEY` (Google Gemini API Key). An optional way to verify these values is sketched just after this list.
35 | 4. **Run the Application:**
36 | ```bash
37 | python main.py
38 | ```
39 |
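To confirm that the variables from step 3 are visible to Python, you can run a small optional check like the one below (`check_env.py` is a hypothetical helper, not part of the repository; it only uses `python-dotenv`, which is already a dependency):

```python
# check_env.py -- optional, hypothetical sanity check (not in the repo)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

for key in ("SBR_WEBDRIVER", "GEMINI_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```
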
40 | ## Technologies Used
41 |
42 | * **Flask (3.0.0):** Web framework for building the API.
43 | * **BeautifulSoup (4.12.2):** HTML/XML parser for extracting data from web pages.
44 | * **Selenium (4.16.0):** For automating browser interactions, handling dynamic content and CAPTCHAs.
45 | * **lxml:** Fast and efficient XML and HTML processing library.
46 | * **html5lib:** For parsing HTML documents.
47 | * **python-dotenv (1.0.0):** For managing environment variables.
48 | * **google-generativeai (0.3.1):** Integrates the Gemini AI model for data parsing and extraction.
49 | * **axios:** JavaScript library for making HTTP requests (client-side).
50 | * **marked:** JavaScript library for rendering Markdown (client-side).
51 | * **Tailwind CSS:** Utility-first CSS framework for styling (client-side).
52 | * **GetAnalyzr:** For event tracking and API usage monitoring.
53 | * **Bright Data Scraping Browser:** Provides fully-managed, headless browsers for reliable web scraping.
54 |
55 |
56 | ## API Documentation
57 |
58 | **Endpoint:** `/scrape-and-parse`
59 |
60 | **Method:** `POST`
61 |
62 | **Request Body (JSON):**
63 |
64 | ```json
65 | {
66 | "url": "https://www.example.com",
67 | "parse_description": "Extract all product names and prices"
68 | }
69 | ```
70 |
71 | **Response (JSON):**
72 |
73 | **Success:**
74 |
75 | ```json
76 | {
77 | "success": true,
78 | "result": {
79 | "products": [
80 | {"name": "Product A", "price": "$10"},
81 | {"name": "Product B", "price": "$20"}
82 | ]
83 | }
84 | }
85 | ```
86 |
87 | **Error:**
88 |
89 | ```json
90 | {
91 | "error": "An error occurred during scraping or parsing"
92 | }
93 | ```
94 |
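For programmatic access you can call the endpoint from any HTTP client. Below is a minimal Python sketch using the `requests` library (not a project dependency, shown only for illustration); the base URL assumes the hosted instance, so swap in `http://127.0.0.1:5000` when running locally:

```python
import requests  # illustrative client; any HTTP library works

BASE_URL = "https://webcrawlai.onrender.com"  # or http://127.0.0.1:5000 locally

payload = {
    "url": "https://www.example.com",
    "parse_description": "Extract all product names and prices",
}

resp = requests.post(f"{BASE_URL}/scrape-and-parse", json=payload, timeout=120)
data = resp.json()

if resp.ok and data.get("success"):
    print(data["result"])
else:
    print("Error:", data.get("error", resp.status_code))
```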
95 |
96 | ## Dependencies
97 |
98 | The project dependencies are listed in `requirements.txt`. Use `pip install -r requirements.txt` to install them.
99 |
100 | ## Contributing
101 |
102 | Contributions are welcome! Please open an issue or submit a pull request.
103 |
104 | ## Testing
105 |
106 | No formal testing framework is currently implemented. Testing should be added as part of future development.
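
A reasonable starting point would be endpoint-level tests that stub out the network and Gemini calls. The sketch below is illustrative only; it assumes `pytest` is installed and that `GEMINI_API_KEY` is set so that importing `utils.parse` succeeds:

```python
# test_main.py -- illustrative sketch, not part of the current repository
import main

def test_missing_fields_returns_400():
    client = main.app.test_client()
    resp = client.post("/scrape-and-parse", json={})
    assert resp.status_code == 400

def test_happy_path(monkeypatch):
    # Replace the scraping and parsing helpers with cheap stand-ins
    monkeypatch.setattr(main, "scrape_website", lambda url: "<body>Widget $10</body>")
    monkeypatch.setattr(main, "extract_body_content", lambda html: html)
    monkeypatch.setattr(main, "clean_body_content", lambda body: "Widget $10")
    monkeypatch.setattr(main, "split_dom_content", lambda text: [text])
    monkeypatch.setattr(
        main, "parse_with_gemini",
        lambda chunks, desc: '{"products": [{"name": "Widget", "price": "$10"}]}',
    )

    client = main.app.test_client()
    resp = client.post(
        "/scrape-and-parse",
        json={"url": "https://example.com", "parse_description": "product names and prices"},
    )
    assert resp.status_code == 200
    assert resp.get_json()["success"] is True
```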
107 |
108 |
109 | *README.md was made with [Etchr](https://etchr.dev)*
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | from flask import Flask, request, jsonify, send_from_directory
2 | import json
3 | from utils.scrape import (
4 | scrape_website,
5 | extract_body_content,
6 | clean_body_content,
7 | split_dom_content,
8 | )
9 | from utils.parse import parse_with_gemini
10 |
11 | app = Flask(__name__, static_url_path='/static')
12 |
13 | @app.route('/')
14 | def index():
15 | return send_from_directory('static', 'index.html')
16 |
17 | @app.route('/scrape-and-parse', methods=['POST'])
18 | def scrape_and_parse():
19 |     data = request.get_json(silent=True) or {}  # tolerate missing or invalid JSON bodies
20 | url = data.get('url')
21 | parse_description = data.get('parse_description')
22 |
23 | if not url or not parse_description:
24 | return jsonify({'error': 'Both URL and parse_description are required'}), 400
25 |
26 | try:
27 | # Scrape the website
28 | dom_content = scrape_website(url)
29 | body_content = extract_body_content(dom_content)
30 | cleaned_content = clean_body_content(body_content)
31 |
32 | # Parse the content
33 | dom_chunks = split_dom_content(cleaned_content)
34 | result = parse_with_gemini(dom_chunks, parse_description)
35 |
36 | # Try to parse the result as JSON if it's a string
37 | try:
38 | if isinstance(result, str):
39 | result = json.loads(result)
40 | except json.JSONDecodeError:
41 | pass # Keep the result as is if it's not valid JSON
42 |
43 | return jsonify({
44 | 'success': True,
45 | 'result': result
46 | })
47 | except Exception as e:
48 | print(f"Error in scrape_and_parse: {str(e)}")
49 | return jsonify({'error': str(e)}), 500
50 |
51 | if __name__ == '__main__':
52 | app.run()
--------------------------------------------------------------------------------
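The same pipeline that `main.py` exposes over HTTP can also be driven directly from a Python shell, which is handy for debugging. A sketch (it requires `SBR_WEBDRIVER` and `GEMINI_API_KEY` to be set and makes real network calls):

```python
from utils.scrape import (
    scrape_website,
    extract_body_content,
    clean_body_content,
    split_dom_content,
)
from utils.parse import parse_with_gemini

html = scrape_website("https://www.example.com")        # Bright Data scraping browser
text = clean_body_content(extract_body_content(html))   # strip tags, scripts, styles
chunks = split_dom_content(text)                         # 6000-character chunks
print(parse_with_gemini(chunks, "Extract all product names and prices"))
```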
/requirements.txt:
--------------------------------------------------------------------------------
1 | # NOTE: reconstructed from the versions listed in README.md; the original pinned file lives at
2 | # https://raw.githubusercontent.com/ArjunCodess/WebCrawlAI/b64edabd89ed1bb7fb2716811a2b83717e314d1d/requirements.txt
3 | Flask==3.0.0
4 | beautifulsoup4==4.12.2
5 | selenium==4.16.0
6 | lxml
7 | html5lib
8 | python-dotenv==1.0.0
9 | google-generativeai==0.3.1
--------------------------------------------------------------------------------
/static/index.html:
--------------------------------------------------------------------------------
1 | <!-- NOTE: the original markup was not preserved in this dump. This file is a minimal
2 |      reconstruction: element IDs come from static/script.js and the included libraries
3 |      (Tailwind CSS, axios, marked) from README.md. Treat it as a sketch of the page. -->
4 | <!DOCTYPE html>
5 | <html lang="en">
6 | <head>
7 |     <meta charset="UTF-8">
8 |     <meta name="viewport" content="width=device-width, initial-scale=1.0">
9 |     <title>AI Web Scraper</title>
10 |     <script src="https://cdn.tailwindcss.com"></script>
11 |     <script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
12 |     <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
13 | </head>
14 | <body class="bg-gray-100 min-h-screen">
15 |     <main class="max-w-2xl mx-auto p-6">
16 |         <h1 class="text-3xl font-bold mb-6">AI Web Scraper</h1>
17 |
18 |         <label for="url" class="block mb-1">Website URL</label>
19 |         <input id="url" type="url" placeholder="https://www.example.com"
20 |                class="w-full border rounded p-2 mb-4">
21 |
22 |         <label for="prompt" class="block mb-1">Extraction prompt</label>
23 |         <textarea id="prompt" rows="3" placeholder="Extract all product names and prices"
24 |                   class="w-full border rounded p-2 mb-4"></textarea>
25 |
26 |         <button id="extractBtn" class="bg-blue-600 text-white px-4 py-2 rounded">Extract Information</button>
27 |
28 |         <div id="loadingSpinner" class="hidden mt-4">Loading...</div>
29 |
30 |         <section id="resultsSection" class="hidden mt-6">
31 |             <h2 class="text-xl font-semibold mb-2">Results</h2>
32 |             <pre id="results" class="bg-white border rounded p-4 overflow-auto"></pre>
33 |         </section>
34 |     </main>
35 |
36 |     <script src="/static/script.js"></script>
37 | </body>
38 | </html>
--------------------------------------------------------------------------------
/static/script.js:
--------------------------------------------------------------------------------
1 | document.addEventListener('DOMContentLoaded', () => {
2 | const elements = {
3 | url: document.getElementById('url'),
4 | prompt: document.getElementById('prompt'),
5 | extractBtn: document.getElementById('extractBtn'),
6 | results: document.getElementById('results'),
7 | loadingSpinner: document.getElementById('loadingSpinner'),
8 | resultsSection: document.getElementById('resultsSection')
9 | };
10 |
11 | const showLoading = () => elements.loadingSpinner.classList.remove('hidden');
12 | const hideLoading = () => elements.loadingSpinner.classList.add('hidden');
13 |
14 | const showError = (message) => {
15 | alert(message);
16 | hideLoading();
17 | };
18 |
19 | const formatResult = (result) => {
20 | try {
21 | // If result is already a JSON string, parse it
22 | const parsed = typeof result === 'string' ? JSON.parse(result) : result;
23 | return JSON.stringify(parsed, null, 2);
24 | } catch (e) {
25 | return result; // Return as is if parsing fails
26 | }
27 | };
28 |
29 | elements.extractBtn.addEventListener('click', async () => {
30 | const url = elements.url.value.trim();
31 | const prompt = elements.prompt.value.trim();
32 |
33 | if (!url || !prompt) {
34 | showError('Please enter both URL and extraction prompt');
35 | return;
36 | }
37 |
38 | showLoading();
39 | try {
40 | const response = await axios.post('/scrape-and-parse', {
41 | url: url,
42 | parse_description: prompt
43 | });
44 |
45 | // Only display the result part, properly formatted
46 | const formattedResult = formatResult(response.data.result);
47 | elements.results.textContent = formattedResult;
48 | elements.resultsSection.classList.remove('hidden');
49 |
50 | // Send event tracking data
51 |             const API_KEY = typeof process !== 'undefined' ? process.env.ANALYZR_API_KEY : undefined; // process.env does not exist in plain browser scripts; the key must be injected at build time
52 | const trackingUrl = "https://getanalyzr.vercel.app/api/events";
53 | const headers = {
54 | "Content-Type": "application/json",
55 | "Authorization": `Bearer ${API_KEY}`
56 | };
57 |
58 | const eventData = {
59 | name: "Information Extracted",
60 | domain: window.location.hostname || 'localhost',
61 | description: `Extracted information from URL: ${url}`,
62 | emoji: "🔍",
63 | fields: [
64 | {
65 | name: "URL",
66 | value: url,
67 | inline: true
68 | },
69 | {
70 | name: "Prompt",
71 | value: prompt,
72 | inline: true
73 | }
74 | ]
75 | };
76 |
77 | try {
78 | await axios.post(trackingUrl, eventData, { headers });
79 | console.log("Event tracking successful");
80 | } catch (error) {
81 | console.error("Event tracking error:", error.response ? error.response.data : error.message);
82 | }
83 | } catch (error) {
84 | showError(error.response?.data?.error || 'Failed to extract information');
85 | } finally {
86 | hideLoading();
87 | }
88 | });
89 | });
--------------------------------------------------------------------------------
/utils/parse.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | from dotenv import load_dotenv
4 | import google.generativeai as genai
5 |
6 | load_dotenv()
7 |
8 | # Configure Gemini API
9 | genai.configure(api_key=os.getenv('GEMINI_API_KEY'))
10 | model = genai.GenerativeModel('gemini-pro')
11 |
12 | def clean_json_response(text):
13 | """Clean the response to extract only the JSON part"""
14 | # Remove markdown code blocks if present
15 | text = text.replace('```json', '').replace('```', '').strip()
16 |
17 | # Try to find JSON content between curly braces
18 | try:
19 | start = text.index('{')
20 | end = text.rindex('}') + 1
21 | json_str = text[start:end]
22 |
23 | # Parse and re-format JSON
24 | parsed_json = json.loads(json_str)
25 | return json.dumps(parsed_json, indent=2)
26 | except (ValueError, json.JSONDecodeError) as e:
27 | return text
28 |
29 | def parse_with_gemini(dom_chunks, parse_description):
30 | prompt_template = """
31 | Extract information from the following text content and return it as a CLEAN JSON object.
32 |
33 | Text content: {content}
34 |
35 | Instructions:
36 | 1. Extract information matching this description: {description}
37 | 2. Return ONLY a valid JSON object, no other text or markdown
38 | 3. If no information is found, return an empty JSON object {{}}
39 | 4. Ensure the JSON is properly formatted and valid
40 | 5. DO NOT include any explanatory text, code blocks, or markdown - ONLY the JSON object
41 | """
42 |
43 | parsed_results = []
44 |
45 | for i, chunk in enumerate(dom_chunks, start=1):
46 | try:
47 | prompt = prompt_template.format(
48 | content=chunk,
49 | description=parse_description
50 | )
51 |
52 | response = model.generate_content(prompt)
53 | result = clean_json_response(response.text.strip())
54 | if result and result != '{}':
55 | parsed_results.append(result)
56 | print(f"Parsed batch: {i} of {len(dom_chunks)}")
57 | except Exception as e:
58 | print(f"Error processing chunk {i}: {str(e)}")
59 | continue
60 |
61 | # Combine results if multiple chunks produced output
62 | if len(parsed_results) > 1:
63 | try:
64 | # Parse all results into Python objects
65 | json_objects = [json.loads(result) for result in parsed_results]
66 |
67 | # Merge objects if they're dictionaries
68 | if all(isinstance(obj, dict) for obj in json_objects):
69 | merged = {}
70 | for obj in json_objects:
71 | merged.update(obj)
72 | return json.dumps(merged, indent=2)
73 |
74 | # If they're lists or mixed, combine them
75 | return json.dumps(json_objects, indent=2)
76 | except json.JSONDecodeError:
77 | # If merging fails, return the first valid result
78 | return parsed_results[0]
79 |
80 | # Return the single result or empty JSON object
81 | return parsed_results[0] if parsed_results else '{}'
--------------------------------------------------------------------------------
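Before moving on to the scraper, here is a quick illustration of what `clean_json_response` (above) does with a typical model reply. The sample reply text is made up, and importing `utils.parse` expects `GEMINI_API_KEY` to be set even though no API call is made here:

```python
from utils.parse import clean_json_response

raw_reply = 'Sure! Here is the data:\n{"products": [{"name": "Product A", "price": "$10"}]}\nHope this helps!'
print(clean_json_response(raw_reply))
# Only the JSON object between the outermost braces is kept, re-indented;
# the surrounding prose is dropped.
```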
/utils/scrape.py:
--------------------------------------------------------------------------------
1 | import os
2 | from bs4 import BeautifulSoup
3 | from dotenv import load_dotenv
4 | from selenium.webdriver import ChromeOptions, Remote
5 | from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection
6 | from selenium.common.exceptions import WebDriverException
7 | import time
8 |
9 | load_dotenv()
10 | SBR_WEBDRIVER = os.getenv("SBR_WEBDRIVER")
11 |
12 | def create_driver():
13 | options = ChromeOptions()
14 | options.add_argument('--no-sandbox')
15 | options.add_argument('--headless')
16 | options.add_argument('--disable-dev-shm-usage')
17 | return Remote(
18 | command_executor=SBR_WEBDRIVER,
19 | options=options
20 | )
21 |
22 | def scrape_website(website):
23 | max_retries = 3
24 | retry_delay = 2
25 |
26 | for attempt in range(max_retries):
27 | driver = None
28 | try:
29 | print(f"Attempt {attempt + 1} of {max_retries}")
30 | print("Connecting to Scraping Browser...")
31 |
32 | driver = create_driver()
33 | driver.get(website)
34 |
35 | print("Waiting for page to load...")
36 | time.sleep(2) # Give the page some time to load
37 |
38 | print("Checking for captcha...")
39 | try:
40 | solve_res = driver.execute(
41 | "executeCdpCommand",
42 | {
43 | "cmd": "Captcha.waitForSolve",
44 | "params": {"detectTimeout": 10000},
45 | },
46 | )
47 | print("Captcha solve status:", solve_res["value"]["status"])
48 | except WebDriverException as e:
49 | print("No captcha detected or captcha handling failed:", str(e))
50 |
51 | print("Scraping page content...")
52 | html = driver.page_source
53 |
54 | if html and len(html) > 0:
55 | return html
56 | else:
57 | raise Exception("Empty page content received")
58 |
59 | except Exception as e:
60 | print(f"Error during attempt {attempt + 1}: {str(e)}")
61 | if attempt < max_retries - 1:
62 | print(f"Retrying in {retry_delay} seconds...")
63 | time.sleep(retry_delay)
64 | else:
65 | raise Exception(f"Failed to scrape after {max_retries} attempts: {str(e)}")
66 | finally:
67 | if driver:
68 | try:
69 | driver.quit()
70 |                 except Exception:
71 | pass
72 |
73 | def extract_body_content(html_content):
74 | soup = BeautifulSoup(html_content, "html.parser")
75 | body_content = soup.body
76 | if body_content:
77 | return str(body_content)
78 | return ""
79 |
80 | def clean_body_content(body_content):
81 | soup = BeautifulSoup(body_content, "html.parser")
82 |
83 | for script_or_style in soup(["script", "style"]):
84 | script_or_style.extract()
85 |
86 | # Get text or further process the content
87 | cleaned_content = soup.get_text(separator="\n")
88 | cleaned_content = "\n".join(
89 | line.strip() for line in cleaned_content.splitlines() if line.strip()
90 | )
91 |
92 | return cleaned_content
93 |
94 | def split_dom_content(dom_content, max_length=6000):
95 | return [
96 | dom_content[i : i + max_length] for i in range(0, len(dom_content), max_length)
97 | ]
--------------------------------------------------------------------------------
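Finally, a small sanity check of how `split_dom_content` (above) chunks long pages before they are sent to Gemini; purely illustrative, and importing `utils.scrape` has no side effects beyond reading `.env`:

```python
from utils.scrape import split_dom_content

text = "x" * 15000
chunks = split_dom_content(text)        # default max_length=6000
print([len(c) for c in chunks])         # -> [6000, 6000, 3000]
```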