├── README.md
└── images
    ├── sandbox.png
    ├── sandbox_dev_tools.png
    └── scraped_csv.png


/README.md:
--------------------------------------------------------------------------------
# How to Use ChatGPT for Web Scraping in 2024

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)

[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)

- [1. Create a ChatGPT Account](#1-create-a-chatgpt-account)
- [2. Locate the elements to scrape](#2-locate-the-elements-to-scrape)
- [3. Prepare the ChatGPT prompt](#3-prepare-the-chatgpt-prompt)
- [4. Review the code](#4-review-the-code)
- [5. Execute and test](#5-execute-and-test)
- [Tips and tricks for using ChatGPT](#tips-and-tricks-for-using-chatgpt)
  * [1. Get code editing assistance](#1-get-code-editing-assistance)
  * [2. Check for errors](#2-check-for-errors)
  * [3. Code Optimization Assistance](#3-code-optimization-assistance)
  * [4. Handle dynamic content](#4-handle-dynamic-content)
- [Overcome web scraping blocks with a dedicated API](#overcome-web-scraping-blocks-with-a-dedicated-api)


Follow this article to learn how to use [ChatGPT](https://chat.openai.com/) to develop fully functional Python web scrapers. You'll also find some important tips and tricks to improve the quality of a scraper’s code.

Before moving to the actual topic, let’s briefly introduce our demo target for this tutorial. We'll extract data from the [Oxylabs Scraping Sandbox](https://sandbox.oxylabs.io/products), a dummy e-commerce store that maintains video game listings in several categories.
Here's what the landing page of the store looks like:

![](/images/sandbox.png)

Now, let’s delve into the steps required to scrape data from this webpage using ChatGPT.


## 1. Create a ChatGPT Account

Visit ChatGPT’s [login page](https://chat.openai.com/auth/login) and click Sign up. You also have the option to sign up using your Google account. On successful sign-up, you will be redirected to the chat window. You can initiate a chat by entering your query in the text field.

## 2. Locate the elements to scrape

Before prompting ChatGPT, let’s first locate the elements we need to extract from the target page. Assume that we need only the video game **titles** and **prices**.

- Right-click one of the game titles and select `Inspect`. This will open the HTML code for this element in the Developer Tools window.

- In the Developer Tools window, right-click the element that contains the game title and select `Copy selector`. The following figure shows both steps:

![](/images/sandbox_dev_tools.png)


Write down the selector and repeat the same steps to find the selector for the price element.


## 3. Prepare the ChatGPT prompt

The prompt should be well-explained, specifying the code’s programming language, the tools and libraries to be used, element selectors, output, and any special instructions the code must comply with. Here's a sample prompt that you can use to create a web scraper using Python and BeautifulSoup:

```markdown
Write a web scraper using Python and BeautifulSoup.


Sample Target: https://sandbox.oxylabs.io/products

Rationale: Scrape the video game titles and prices of all the games on the target page.


CSS selectors are as follows:

1. Title: #__next > main > div > div > div > div:nth-child(2) > div > div:nth-child(1) > a.card-header.css-o171kl.eag3qlw2 > h4

2. Price: #__next > main > div > div > div > div:nth-child(2) > div > div:nth-child(1) > div.price-wrapper.css-li4v8k.eag3qlw4


Output: Save all the Titles and Prices for all the video games in a CSV file


Additional Instructions: Handle character encoding and remove undesirable symbols in the output CSV.
```
Notice that we have provided the CSS selectors for prices and titles that we copied in the earlier step.

> [!TIP]
> If encoding isn't handled properly, the scraped data might contain a few odd characters. This happens when the web page and the Python script interpret the character encoding differently. Adding relevant Additional Instructions to the prompt helps avoid these encoding issues.


## 4. Review the code

Once ChatGPT replies with code, it’s always a good idea to review it first. Skim the code to check that it does what you want it to do. Double-check whether it uses any libraries or packages you may wish to exclude. If you find any glitches, write back to ChatGPT and ask it to generate the code again.


## 5. Execute and test

Copy the ChatGPT-generated code and check that it runs without errors.
Here's the code that ChatGPT generated for us:

```python
import requests
from bs4 import BeautifulSoup
import csv

# URL of the target page
url = "https://sandbox.oxylabs.io/products"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# CSS selectors for title and price
title_selector = "a.card-header h4"
price_selector = "div.price-wrapper"

# Find all game titles and prices
titles = soup.select(title_selector)
prices = soup.select(price_selector)

# Create a list to store the data
data = []

# Iterate through the titles and prices
for title, price in zip(titles, prices):
    game_title = title.get_text(strip=True)
    game_price = price.get_text(strip=True)
    data.append((game_title, game_price))

# Define the CSV filename
filename = "game_data.csv"

# Write the data to a CSV file
with open(filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Price"])  # Write header
    for game in data:
        writer.writerow(game)

print(f"Data scraped successfully and saved to '{filename}'.")
```


> [!NOTE]
> Remember to install the `BeautifulSoup` library and `requests` library before running the code. This can be done by opening the terminal and running:

```bash
pip install requests beautifulsoup4
```


Here's the snippet of the output CSV file generated after executing the code:

![](/images/scraped_csv.png)

Congratulations! You've just effortlessly scraped the target website. For your convenience, we also prepared this tutorial in a [video format](https://www.youtube.com/watch?v=AUEjBzLJlE4).

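
As part of testing, it can help to check the generated CSV programmatically instead of only eyeballing it. The following is a minimal sketch that uses only the standard library. It assumes the `Title` and `Price` header written by the scraper above; the function name `find_csv_problems`, the price pattern, and the sample rows are hypothetical and only serve as an illustration.

```python
import csv
import io
import re

# Assumption: a usable price contains at least one number, e.g. "91,99 €".
PRICE_NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def find_csv_problems(csv_text):
    """Return a list of readable problems; an empty list means the data looks sane."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["Title", "Price"]:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    row_count = 0
    for row_count, row in enumerate(reader, start=1):
        # A short row yields None for missing fields, so fall back to "".
        title = (row["Title"] or "").strip()
        price = (row["Price"] or "").strip()
        if not title:
            problems.append(f"row {row_count}: empty title")
        if not PRICE_NUMBER.search(price):
            problems.append(f"row {row_count}: no number in price {price!r}")
    if row_count == 0:
        problems.append("no data rows found")
    return problems


# Hypothetical sample data, not real sandbox output.
sample = 'Title,Price\nExample Game,"91,99 €"\n,\n'
for problem in find_csv_problems(sample):
    print(problem)
```

To run the same check on the real output, read `game_data.csv` with `encoding="utf-8"` and pass its text to the function. An empty result only means that every row has a title and a number in the price; it doesn't confirm that the values match the website.
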

## Tips and tricks for using ChatGPT

### 1. Get code editing assistance

Specify the changes you want to make, such as modifying the scraped elements, boosting the effectiveness of the code, or modifying the data extraction procedure. ChatGPT can offer you additional code options or suggest modifications to improve the web scraping process.

### 2. Check for errors

To adhere to coding standards and practices, you can ask ChatGPT to review the code and provide recommendations. You can even paste your code and ask ChatGPT to lint it, for example by adding the phrase “lint the code” to the additional instructions of the prompt.

### 3. Code Optimization Assistance

When it comes to web scraping, efficiency is critical, especially when working with large datasets or challenging web scraping tasks. ChatGPT can provide tips on how to increase the performance of your code.

You can ask for advice on how to use frameworks and packages that speed up web scraping, use caching techniques, exploit concurrency or parallel processing, and minimize unnecessary network calls.

### 4. Handle dynamic content

Certain websites produce dynamic content using JavaScript libraries or AJAX requests. ChatGPT can help you navigate such complex web content. You can ask ChatGPT about techniques for getting the dynamic content from such JavaScript-rendered pages.

ChatGPT can offer suggestions on using headless browsers, parsing dynamic HTML, or even automating interactions using simulated user actions.

## Overcome web scraping blocks with a dedicated API

Be aware that using ChatGPT for web scraping has some limitations. Many websites have implemented strong security measures to block automated scrapers from accessing them.
Commonly, sites use CAPTCHAs and request rate limiting to prevent automated scraping. As a result, simple **ChatGPT-generated scrapers may fail** at these sites. However, [Web Unblocker](https://oxylabs.io/products/web-unblocker) by Oxylabs can help in these scenarios. It's a **paid proxy solution** which you can test using a **1-week free trial** by registering a free account on the [dashboard](https://dashboard.oxylabs.io/).

Web Unblocker provides features such as rotating proxies, bypassing CAPTCHAs, managing requests, utilizing a built-in headless browser, etc. Such measures can help minimize the chances of triggering automated bot detection.


--------------------------------------------------------------------------------
/images/sandbox.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/chatgpt-web-scraping/461f214e49fd9f9494da5a22de2232caa7561c60/images/sandbox.png


--------------------------------------------------------------------------------
/images/sandbox_dev_tools.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/chatgpt-web-scraping/461f214e49fd9f9494da5a22de2232caa7561c60/images/sandbox_dev_tools.png


--------------------------------------------------------------------------------
/images/scraped_csv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/chatgpt-web-scraping/461f214e49fd9f9494da5a22de2232caa7561c60/images/scraped_csv.png


--------------------------------------------------------------------------------