├── .env.example ├── .gitignore ├── README.md ├── main.py ├── requirements.txt ├── static ├── index.html └── script.js └── utils ├── parse.py └── scrape.py /.env.example: -------------------------------------------------------------------------------- 1 | SBR_WEBDRIVER= 2 | GEMINI_API_KEY= -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .vercel 2 | *.log 3 | *.pyc 4 | __pycache__ 5 | 6 | # Environments 7 | .env 8 | .venv 9 | env/ 10 | venv/ 11 | ENV/ 12 | env.bak/ 13 | venv.bak/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # WebCrawlAI: AI-Powered Web Scraper 2 | 3 | This project implements a web scraping API that leverages the Gemini AI model to extract specific information from websites. It provides a user-friendly interface for defining extraction criteria and handles dynamic content and CAPTCHAs using a scraping browser. The API is deployed on Render and is designed for easy integration into various projects. 4 | 5 | ## Features 6 | 7 | * Scrapes data from websites, handling dynamic content and CAPTCHAs. 8 | * Uses Gemini AI to precisely extract the requested information. 9 | * Provides a clean JSON output of the extracted data. 10 | * Includes a user-friendly web interface for easy interaction. 11 | * Error handling and retry mechanisms for robust operation. 12 | * Event tracking using GetAnalyzr for monitoring API usage. 13 | 14 | ## Usage 15 | 16 | 1. **Access the Web Interface:** Visit [https://webcrawlai.onrender.com/](https://webcrawlai.onrender.com/) 17 | 2. **Enter the URL:** Input the website URL you want to scrape. 18 | 3. **Specify Extraction Prompt:** Provide a clear description of the data you need (e.g., "Extract all product names and prices"). 19 | 4. **Click "Extract Information":** The API will process your request, and the results will be displayed. 20 | 21 | ## Installation 22 | 23 | This project is deployed as a web application. No local installation is required for usage. However, if you wish to run the code locally, follow these steps: 24 | 25 | 1. **Clone the Repository:** 26 | ```bash 27 | git clone https://github.com/YOUR_USERNAME/WebCrawlAI.git 28 | cd WebCrawlAI 29 | ``` 30 | 2. **Install Dependencies:** 31 | ```bash 32 | pip install -r requirements.txt 33 | ``` 34 | 3. **Set Environment Variables:** Create a `.env` file (refer to `.env.example`) and populate it with your `SBR_WEBDRIVER` (Bright Data Scraping Browser URL) and `GEMINI_API_KEY` (Google Gemini API Key). 35 | 4. **Run the Application:** 36 | ```bash 37 | python main.py 38 | ``` 39 | 40 | ## Technologies Used 41 | 42 | * **Flask (3.0.0):** Web framework for building the API. 43 | * **BeautifulSoup (4.12.2):** HTML/XML parser for extracting data from web pages. 44 | * **Selenium (4.16.0):** For automating browser interactions, handling dynamic content and CAPTCHAs. 45 | * **lxml:** Fast and efficient XML and HTML processing library. 46 | * **html5lib:** For parsing HTML documents. 47 | * **python-dotenv (1.0.0):** For managing environment variables. 48 | * **google-generativeai (0.3.1):** Integrates the Gemini AI model for data parsing and extraction. 49 | * **axios:** JavaScript library for making HTTP requests (client-side). 50 | * **marked:** JavaScript library for rendering Markdown (client-side). 
51 | * **Tailwind CSS:** Utility-first CSS framework for styling (client-side). 52 | * **GetAnalyzr:** For event tracking and API usage monitoring. 53 | * **Bright Data Scraping Browser:** Provides fully-managed, headless browsers for reliable web scraping. 54 | 55 | 56 | ## API Documentation 57 | 58 | **Endpoint:** `/scrape-and-parse` 59 | 60 | **Method:** `POST` 61 | 62 | **Request Body (JSON):** 63 | 64 | ```json 65 | { 66 | "url": "https://www.example.com", 67 | "parse_description": "Extract all product names and prices" 68 | } 69 | ``` 70 | 71 | **Response (JSON):** 72 | 73 | **Success:** 74 | 75 | ```json 76 | { 77 | "success": true, 78 | "result": { 79 | "products": [ 80 | {"name": "Product A", "price": "$10"}, 81 | {"name": "Product B", "price": "$20"} 82 | ] 83 | } 84 | } 85 | ``` 86 | 87 | **Error:** 88 | 89 | ```json 90 | { 91 | "error": "An error occurred during scraping or parsing" 92 | } 93 | ``` 94 | 95 | 96 | ## Dependencies 97 | 98 | The project dependencies are listed in `requirements.txt`. Use `pip install -r requirements.txt` to install them. 99 | 100 | ## Contributing 101 | 102 | Contributions are welcome! Please open an issue or submit a pull request. 103 | 104 | ## Testing 105 | 106 | No formal testing framework is currently implemented. Testing should be added as part of future development. 107 | 108 | 109 | *README.md was made with [Etchr](https://etchr.dev)* -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | from flask import Flask, request, jsonify, send_from_directory 2 | import json 3 | from utils.scrape import ( 4 | scrape_website, 5 | extract_body_content, 6 | clean_body_content, 7 | split_dom_content, 8 | ) 9 | from utils.parse import parse_with_gemini 10 | 11 | app = Flask(__name__, static_url_path='/static') 12 | 13 | @app.route('/') 14 | def index(): 15 | return send_from_directory('static', 'index.html') 16 | 17 | @app.route('/scrape-and-parse', methods=['POST']) 18 | def scrape_and_parse(): 19 | data = request.get_json() 20 | url = data.get('url') 21 | parse_description = data.get('parse_description') 22 | 23 | if not url or not parse_description: 24 | return jsonify({'error': 'Both URL and parse_description are required'}), 400 25 | 26 | try: 27 | # Scrape the website 28 | dom_content = scrape_website(url) 29 | body_content = extract_body_content(dom_content) 30 | cleaned_content = clean_body_content(body_content) 31 | 32 | # Parse the content 33 | dom_chunks = split_dom_content(cleaned_content) 34 | result = parse_with_gemini(dom_chunks, parse_description) 35 | 36 | # Try to parse the result as JSON if it's a string 37 | try: 38 | if isinstance(result, str): 39 | result = json.loads(result) 40 | except json.JSONDecodeError: 41 | pass # Keep the result as is if it's not valid JSON 42 | 43 | return jsonify({ 44 | 'success': True, 45 | 'result': result 46 | }) 47 | except Exception as e: 48 | print(f"Error in scrape_and_parse: {str(e)}") 49 | return jsonify({'error': str(e)}), 500 50 | 51 | if __name__ == '__main__': 52 | app.run() -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ArjunCodess/WebCrawlAI/b64edabd89ed1bb7fb2716811a2b83717e314d1d/requirements.txt -------------------------------------------------------------------------------- /static/index.html: 
--------------------------------------------------------------------------------
[The markup of static/index.html was not preserved in this extraction; only stray line numbers and page text survived. Recoverable details: the page title and the heading near the top of the body are "AI Web Scraper", and, per static/script.js below, the page contains a URL input (id "url"), an extraction-prompt field (id "prompt"), an "Extract Information" button (id "extractBtn"), a results container (ids "resultsSection" and "results"), and a loading spinner (id "loadingSpinner"). Per the README, the page is styled with Tailwind CSS and loads axios and marked on the client.]
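Because the original form markup is not recoverable here, the hedged sketch below shows the equivalent of what the page's Extract button submits: a direct call to the `/scrape-and-parse` endpoint documented in the README. The base URL assumes a local `python main.py` run on Flask's default port, the target URL and prompt are placeholders, and `requests` is an extra dependency not listed in `requirements.txt`.

```python
# Hedged sketch: call the /scrape-and-parse endpoint directly, bypassing the web UI.
# Assumes the app is running locally via `python main.py` (Flask's default port 5000).
import json
import requests  # not in requirements.txt; used here for brevity

BASE_URL = "http://127.0.0.1:5000"  # assumption: local dev server; use the deployed URL otherwise

payload = {
    "url": "https://www.example.com",  # placeholder target site
    "parse_description": "Extract all product names and prices",
}

resp = requests.post(f"{BASE_URL}/scrape-and-parse", json=payload, timeout=300)
data = resp.json()

if resp.ok and data.get("success"):
    # "result" is whatever JSON structure Gemini produced for the prompt
    print(json.dumps(data["result"], indent=2))
else:
    print("Error:", data.get("error", f"HTTP {resp.status_code}"))
```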
-------------------------------------------------------------------------------- /static/script.js: -------------------------------------------------------------------------------- 1 | document.addEventListener('DOMContentLoaded', () => { 2 | const elements = { 3 | url: document.getElementById('url'), 4 | prompt: document.getElementById('prompt'), 5 | extractBtn: document.getElementById('extractBtn'), 6 | results: document.getElementById('results'), 7 | loadingSpinner: document.getElementById('loadingSpinner'), 8 | resultsSection: document.getElementById('resultsSection') 9 | }; 10 | 11 | const showLoading = () => elements.loadingSpinner.classList.remove('hidden'); 12 | const hideLoading = () => elements.loadingSpinner.classList.add('hidden'); 13 | 14 | const showError = (message) => { 15 | alert(message); 16 | hideLoading(); 17 | }; 18 | 19 | const formatResult = (result) => { 20 | try { 21 | // If result is already a JSON string, parse it 22 | const parsed = typeof result === 'string' ? JSON.parse(result) : result; 23 | return JSON.stringify(parsed, null, 2); 24 | } catch (e) { 25 | return result; // Return as is if parsing fails 26 | } 27 | }; 28 | 29 | elements.extractBtn.addEventListener('click', async () => { 30 | const url = elements.url.value.trim(); 31 | const prompt = elements.prompt.value.trim(); 32 | 33 | if (!url || !prompt) { 34 | showError('Please enter both URL and extraction prompt'); 35 | return; 36 | } 37 | 38 | showLoading(); 39 | try { 40 | const response = await axios.post('/scrape-and-parse', { 41 | url: url, 42 | parse_description: prompt 43 | }); 44 | 45 | // Only display the result part, properly formatted 46 | const formattedResult = formatResult(response.data.result); 47 | elements.results.textContent = formattedResult; 48 | elements.resultsSection.classList.remove('hidden'); 49 | 50 | // Send event tracking data 51 | const API_KEY = window.ANALYZR_API_KEY || ''; // NOTE: process.env is not available in browser scripts; assumes the key is exposed as a page global (e.g. injected when the page is served) 52 | const trackingUrl = "https://getanalyzr.vercel.app/api/events"; 53 | const headers = { 54 | "Content-Type": "application/json", 55 | "Authorization": `Bearer ${API_KEY}` 56 | }; 57 | 58 | const eventData = { 59 | name: "Information Extracted", 60 | domain: window.location.hostname || 'localhost', 61 | description: `Extracted information from URL: ${url}`, 62 | emoji: "🔍", 63 | fields: [ 64 | { 65 | name: "URL", 66 | value: url, 67 | inline: true 68 | }, 69 | { 70 | name: "Prompt", 71 | value: prompt, 72 | inline: true 73 | } 74 | ] 75 | }; 76 | 77 | try { 78 | await axios.post(trackingUrl, eventData, { headers }); 79 | console.log("Event tracking successful"); 80 | } catch (error) { 81 | console.error("Event tracking error:", error.response ?
error.response.data : error.message); 82 | } 83 | } catch (error) { 84 | showError(error.response?.data?.error || 'Failed to extract information'); 85 | } finally { 86 | hideLoading(); 87 | } 88 | }); 89 | }); -------------------------------------------------------------------------------- /utils/parse.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from dotenv import load_dotenv 4 | import google.generativeai as genai 5 | 6 | load_dotenv() 7 | 8 | # Configure Gemini API 9 | genai.configure(api_key=os.getenv('GEMINI_API_KEY')) 10 | model = genai.GenerativeModel('gemini-pro') 11 | 12 | def clean_json_response(text): 13 | """Clean the response to extract only the JSON part""" 14 | # Remove markdown code blocks if present 15 | text = text.replace('```json', '').replace('```', '').strip() 16 | 17 | # Try to find JSON content between curly braces 18 | try: 19 | start = text.index('{') 20 | end = text.rindex('}') + 1 21 | json_str = text[start:end] 22 | 23 | # Parse and re-format JSON 24 | parsed_json = json.loads(json_str) 25 | return json.dumps(parsed_json, indent=2) 26 | except (ValueError, json.JSONDecodeError) as e: 27 | return text 28 | 29 | def parse_with_gemini(dom_chunks, parse_description): 30 | prompt_template = """ 31 | Extract information from the following text content and return it as a CLEAN JSON object. 32 | 33 | Text content: {content} 34 | 35 | Instructions: 36 | 1. Extract information matching this description: {description} 37 | 2. Return ONLY a valid JSON object, no other text or markdown 38 | 3. If no information is found, return an empty JSON object {{}} 39 | 4. Ensure the JSON is properly formatted and valid 40 | 5. DO NOT include any explanatory text, code blocks, or markdown - ONLY the JSON object 41 | """ 42 | 43 | parsed_results = [] 44 | 45 | for i, chunk in enumerate(dom_chunks, start=1): 46 | try: 47 | prompt = prompt_template.format( 48 | content=chunk, 49 | description=parse_description 50 | ) 51 | 52 | response = model.generate_content(prompt) 53 | result = clean_json_response(response.text.strip()) 54 | if result and result != '{}': 55 | parsed_results.append(result) 56 | print(f"Parsed batch: {i} of {len(dom_chunks)}") 57 | except Exception as e: 58 | print(f"Error processing chunk {i}: {str(e)}") 59 | continue 60 | 61 | # Combine results if multiple chunks produced output 62 | if len(parsed_results) > 1: 63 | try: 64 | # Parse all results into Python objects 65 | json_objects = [json.loads(result) for result in parsed_results] 66 | 67 | # Merge objects if they're dictionaries 68 | if all(isinstance(obj, dict) for obj in json_objects): 69 | merged = {} 70 | for obj in json_objects: 71 | merged.update(obj) 72 | return json.dumps(merged, indent=2) 73 | 74 | # If they're lists or mixed, combine them 75 | return json.dumps(json_objects, indent=2) 76 | except json.JSONDecodeError: 77 | # If merging fails, return the first valid result 78 | return parsed_results[0] 79 | 80 | # Return the single result or empty JSON object 81 | return parsed_results[0] if parsed_results else '{}' -------------------------------------------------------------------------------- /utils/scrape.py: -------------------------------------------------------------------------------- 1 | import os 2 | from bs4 import BeautifulSoup 3 | from dotenv import load_dotenv 4 | from selenium.webdriver import ChromeOptions, Remote 5 | from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection 6 | from 
selenium.common.exceptions import WebDriverException 7 | import time 8 | 9 | load_dotenv() 10 | SBR_WEBDRIVER = os.getenv("SBR_WEBDRIVER") 11 | 12 | def create_driver(): 13 | options = ChromeOptions() 14 | options.add_argument('--no-sandbox') 15 | options.add_argument('--headless') 16 | options.add_argument('--disable-dev-shm-usage') 17 | return Remote( 18 | command_executor=SBR_WEBDRIVER, 19 | options=options 20 | ) 21 | 22 | def scrape_website(website): 23 | max_retries = 3 24 | retry_delay = 2 25 | 26 | for attempt in range(max_retries): 27 | driver = None 28 | try: 29 | print(f"Attempt {attempt + 1} of {max_retries}") 30 | print("Connecting to Scraping Browser...") 31 | 32 | driver = create_driver() 33 | driver.get(website) 34 | 35 | print("Waiting for page to load...") 36 | time.sleep(2) # Give the page some time to load 37 | 38 | print("Checking for captcha...") 39 | try: 40 | solve_res = driver.execute( 41 | "executeCdpCommand", 42 | { 43 | "cmd": "Captcha.waitForSolve", 44 | "params": {"detectTimeout": 10000}, 45 | }, 46 | ) 47 | print("Captcha solve status:", solve_res["value"]["status"]) 48 | except WebDriverException as e: 49 | print("No captcha detected or captcha handling failed:", str(e)) 50 | 51 | print("Scraping page content...") 52 | html = driver.page_source 53 | 54 | if html and len(html) > 0: 55 | return html 56 | else: 57 | raise Exception("Empty page content received") 58 | 59 | except Exception as e: 60 | print(f"Error during attempt {attempt + 1}: {str(e)}") 61 | if attempt < max_retries - 1: 62 | print(f"Retrying in {retry_delay} seconds...") 63 | time.sleep(retry_delay) 64 | else: 65 | raise Exception(f"Failed to scrape after {max_retries} attempts: {str(e)}") 66 | finally: 67 | if driver: 68 | try: 69 | driver.quit() 70 | except: 71 | pass 72 | 73 | def extract_body_content(html_content): 74 | soup = BeautifulSoup(html_content, "html.parser") 75 | body_content = soup.body 76 | if body_content: 77 | return str(body_content) 78 | return "" 79 | 80 | def clean_body_content(body_content): 81 | soup = BeautifulSoup(body_content, "html.parser") 82 | 83 | for script_or_style in soup(["script", "style"]): 84 | script_or_style.extract() 85 | 86 | # Get text or further process the content 87 | cleaned_content = soup.get_text(separator="\n") 88 | cleaned_content = "\n".join( 89 | line.strip() for line in cleaned_content.splitlines() if line.strip() 90 | ) 91 | 92 | return cleaned_content 93 | 94 | def split_dom_content(dom_content, max_length=6000): 95 | return [ 96 | dom_content[i : i + max_length] for i in range(0, len(dom_content), max_length) 97 | ] --------------------------------------------------------------------------------
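The HTML-processing helpers in utils/scrape.py can be exercised without a Scraping Browser connection or a Gemini key. The sketch below runs them on a small hard-coded snippet from the repository root; the sample HTML and the tiny `max_length` are arbitrary values chosen for illustration, not part of the project.

```python
# Minimal, offline sketch of the utils/scrape.py helpers on a hard-coded snippet.
# No SBR_WEBDRIVER or GEMINI_API_KEY is needed for this part of the pipeline;
# run from the repository root so `utils` is importable.
from utils.scrape import extract_body_content, clean_body_content, split_dom_content

sample_html = """
<html>
  <head><title>Demo</title><style>body { color: red; }</style></head>
  <body>
    <h1>Product A</h1>
    <p>Price: $10</p>
    <script>console.log('ignored');</script>
  </body>
</html>
"""

body = extract_body_content(sample_html)          # "<body>...</body>" as a string
text = clean_body_content(body)                   # scripts/styles stripped, one text fragment per line
chunks = split_dom_content(text, max_length=10)   # tiny max_length so the split is visible (default is 6000)

print(text)
print(f"{len(chunks)} chunk(s):", chunks)
```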