[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)

[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/Pds3gBmKMH)

# Automated Web Scraper With Python AutoScraper

This tutorial will show you how to automate your web scraping processes using AutoScraper – one of the several Python web scraping libraries available.

Check out a more detailed tutorial on [our blog](https://oxylabs.io/blog/automated-web-scraper-autoscraper).

* [Methods to install AutoScraper](#methods-to-install-autoscraper)
* [Scraping products with AutoScraper](#scraping-products-with-autoscraper)
    + [Scraping product category URLs](#scraping-product-category-urls)
    + [Scraping product information from a single webpage](#scraping-product-information-from-a-single-webpage)
    + [Scraping all the products on a specific category](#scraping-all-the-products-on-a-specific-category)
* [How to use AutoScraper with proxies](#how-to-use-autoscraper-with-proxies)
* [Saving and loading an AutoScraper model](#saving-and-loading-an-autoscraper-model)

## Methods to install AutoScraper

First things first, let’s install the AutoScraper library.
There are actually several ways to install and use this library, but for this tutorial, we’re going to install it from the Python Package Index (PyPI) repository using the following pip command:

```pip install autoscraper```

## Scraping products with AutoScraper

This section showcases an example of automatically scraping public data with the AutoScraper module in Python, using the [Oxylabs Scraping Sandbox](https://sandbox.oxylabs.io/products) website as a target.

The target website has three thousand products in different categories.

### Scraping product category URLs

Now, if you want to scrape the links to the category pages, you can do it with the following trivial code:

```
from autoscraper import AutoScraper

UrlToScrape = "https://sandbox.oxylabs.io/products"

WantedList = [
    "https://sandbox.oxylabs.io/products/category/nintendo",
    "https://sandbox.oxylabs.io/products/category/dreamcast"
]

Scraper = AutoScraper()
data = Scraper.build(UrlToScrape, wanted_list=WantedList)
print(data)
```

Note that the Oxylabs Sandbox uses JavaScript to load some elements dynamically, such as the category buttons. Since AutoScraper doesn’t support JavaScript rendering, you won’t be able to scrape all category links. For instance, you can access the “Xbox platform” category but not the subcategories inside it.

With that in mind, the code above first imports AutoScraper from the autoscraper library. Then, we provide the URL from which we want to scrape the information in the ```UrlToScrape```.

The ```WantedList``` assigns sample data that we want to scrape from the given subject URL. To get the category page links from the target page, you need to provide two example URLs to the ```WantedList```.
One link is a data sample of a JavaScript-rendered category button, while the other is a data sample of a static category button that doesn’t have any subcategories. Try running the code with only one category URL in the ```WantedList``` to see the difference.

```AutoScraper()``` creates an AutoScraper object that exposes the different functions of the autoscraper library. The ```Scraper.build()``` method scrapes data similar to the ```WantedList``` from the target URL.

After executing the Python script above, the ```data``` list will have the category page links available at https://sandbox.oxylabs.io/products. The output of the script should look like this:

```['https://sandbox.oxylabs.io/products/category/nintendo', 'https://sandbox.oxylabs.io/products/category/xbox-platform', 'https://sandbox.oxylabs.io/products/category/playstation-platform', 'https://sandbox.oxylabs.io/products/category/dreamcast', 'https://sandbox.oxylabs.io/products/category/pc', 'https://sandbox.oxylabs.io/products/category/stadia']```

### Scraping product information from a single webpage

Say that we want to get the title of a product along with its price; we can train and build an AutoScraper model as follows:

```
from autoscraper import AutoScraper

UrlToScrape = "https://sandbox.oxylabs.io/products/3"
WantedList = ["Super Mario Galaxy 2", "91,99 €"]

InfoScraper = AutoScraper()
InfoScraper.build(UrlToScrape, wanted_list=WantedList)
```

The script above feeds a URL of a product page and a sample of the required information from that page to the AutoScraper model. The ```build()``` method learns the rules for scraping information and prepares our ```InfoScraper``` for future use.

Now, let’s apply this ```InfoScraper``` to a different product’s URL and see if it returns the desired information.

```
another_product_url = "https://sandbox.oxylabs.io/products/39"

data = InfoScraper.get_result_similar(another_product_url)
print(data)
```

Output:

```['Super Mario 64', '91,99 €']```

The script above applies ```InfoScraper``` to ```another_product_url``` and prints the ```data```. Depending on the target website you want to scrape, you may want to use the ```get_result_exact()``` function instead of ```get_result_similar()```. This ensures that AutoScraper returns the product title and price in the exact order defined by the ```WantedList```.

Additionally, it’s very important to provide a ```UrlToScrape``` that doesn’t contain duplicate data that may match unwanted elements. Consider this example:

```
from autoscraper import AutoScraper

UrlToScrape = "https://sandbox.oxylabs.io/products/1"
WantedList = ["The Legend of Zelda: Ocarina of Time", "91,99 €"]

InfoScraper = AutoScraper()
InfoScraper.build(UrlToScrape, wanted_list=WantedList)

another_product_url = "https://sandbox.oxylabs.io/products/39"

data = InfoScraper.get_result_exact(another_product_url)
print(data)
```

Here, the ```UrlToScrape``` page shows the price ```91,99 €``` twice, as highlighted in the screenshot:

![](screenshot_1.png)

Hence, the code also matches an unwanted element with the price ```91,99 €``` and additionally returns the price of a related product, like this:

```['Super Mario 64', '87,99 €', '91,99 €']```

One way to solve this problem is to use the ```grouped=True``` parameter to return the data points with their corresponding AutoScraper rule names. Next, use the ```keep_rules()``` function and pass the rules you want to keep.
In our code, we have to turn the ```data``` dictionary into a ```list``` to access and pass the first and the last rules, which contain the accurate product title and price:

```
from autoscraper import AutoScraper

UrlToScrape = "https://sandbox.oxylabs.io/products/1"
WantedList = ["The Legend of Zelda: Ocarina of Time", "91,99 €"]

InfoScraper = AutoScraper()
InfoScraper.build(UrlToScrape, wanted_list=WantedList)

another_product_url = "https://sandbox.oxylabs.io/products/39"

data = InfoScraper.get_result_exact(another_product_url, grouped=True)
print(data)
print()

InfoScraper.keep_rules([list(data)[0], list(data)[-1]])
filtered_data = InfoScraper.get_result_exact(another_product_url)
print(filtered_data)
```

Please note that this method may return incorrect results if the actual price of the product isn’t the last item returned in the ```data``` object.

### Scraping all the products on a specific category

Install the [pandas](https://pypi.org/project/pandas/) and [openpyxl](https://pypi.org/project/openpyxl/) libraries via the terminal, which we’ll use to save the data to an Excel file:

```pip install pandas openpyxl```

Then, use the following Python script:

```
# ProductsByCategoryScraper.py
from autoscraper import AutoScraper
import pandas as pd

# ProductUrlScraper section
Playstation_5_Category = "https://sandbox.oxylabs.io/products/category/playstation-platform/playstation-5"
WantedList = ["https://sandbox.oxylabs.io/products/246"]
Product_Url_Scraper = AutoScraper()
Product_Url_Scraper.build(Playstation_5_Category, wanted_list=WantedList)

# ProductInfoScraper section
Product_Page_Url = "https://sandbox.oxylabs.io/products/246"
WantedList = ["Ratchet & Clank: Rift Apart", "87,99 €"]

Product_Info_Scraper = AutoScraper()
Product_Info_Scraper.build(Product_Page_Url, wanted_list=WantedList)

# Scraping info of each product and storing it in an Excel file
Products_Url_List = Product_Url_Scraper.get_result_similar(Playstation_5_Category)
Products_Info_List = []
for Url in Products_Url_List:
    product_info = Product_Info_Scraper.get_result_exact(Url)
    Products_Info_List.append(product_info)
df = pd.DataFrame(Products_Info_List, columns=["Title", "Price"])
df.to_excel("products_playstation_5.xlsx", index=False)
```

The script above has three main parts: two sections for building the scrapers, and a third that scrapes data from all the products in the Playstation 5 category and saves it to an Excel file.

For this step, we’ve built ```Product_Url_Scraper``` to scrape all the similar product links on the [Playstation 5 category](https://sandbox.oxylabs.io/products/category/playstation-platform/playstation-5) page. These thirteen links are stored in the ```Products_Url_List```. Now, for each URL in the ```Products_Url_List```, we apply the ```Product_Info_Scraper``` and append the scraped information to the ```Products_Info_List```. Finally, the ```Products_Info_List``` is converted to a data frame and then exported as an Excel file for future use.

Output:

![](screenshot_2.png)

The output reflects the initial goal – scraping the titles and prices of all the products in the Playstation 5 category.

We now know how to combine multiple AutoScraper models to scrape data in bulk. You can reformulate the script above to scrape all the products from all the categories and save them in a separate Excel file for each category.
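That reformulation could be sketched as follows. This is only a sketch: it reuses the two scrapers built in the script above, hard-codes a couple of category URLs scraped in the first section, and the ```category_filename()``` helper is our own, not part of AutoScraper:

```python
# Sketch: scrape every category into its own Excel file.
# Assumes the same sandbox pages and scraper-building steps as above.
from urllib.parse import urlparse

def category_filename(category_url):
    """Derive a file name such as 'products_nintendo.xlsx' from a category URL."""
    slug = urlparse(category_url).path.rstrip("/").split("/")[-1]
    return f"products_{slug}.xlsx"

if __name__ == "__main__":
    import pandas as pd
    from autoscraper import AutoScraper

    # Build the two scrapers exactly as in the script above.
    Product_Url_Scraper = AutoScraper()
    Product_Url_Scraper.build(
        "https://sandbox.oxylabs.io/products/category/playstation-platform/playstation-5",
        wanted_list=["https://sandbox.oxylabs.io/products/246"],
    )
    Product_Info_Scraper = AutoScraper()
    Product_Info_Scraper.build(
        "https://sandbox.oxylabs.io/products/246",
        wanted_list=["Ratchet & Clank: Rift Apart", "87,99 €"],
    )

    # A subset of the category links scraped in the first section.
    Category_Urls = [
        "https://sandbox.oxylabs.io/products/category/nintendo",
        "https://sandbox.oxylabs.io/products/category/dreamcast",
    ]
    for Category_Url in Category_Urls:
        Products_Url_List = Product_Url_Scraper.get_result_similar(Category_Url)
        Products_Info_List = [
            Product_Info_Scraper.get_result_exact(Url) for Url in Products_Url_List
        ]
        df = pd.DataFrame(Products_Info_List, columns=["Title", "Price"])
        df.to_excel(category_filename(Category_Url), index=False)
```

Keep in mind that a URL scraper trained on one category page may need retraining if another category’s page layout differs.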

## How to use AutoScraper with proxies

The ```build()```, ```get_result_similar()```, and ```get_result_exact()``` functions of AutoScraper accept request-related arguments in the ```request_args``` parameter.

Here’s what testing AutoScraper with proxy IPs looks like:

```
from autoscraper import AutoScraper

UrlToScrape = "https://ip.oxylabs.io/"
WantedList = ["YOUR_REAL_IP_ADDRESS"]

proxy = {
    "http": "proxy_endpoint",
    "https": "proxy_endpoint",
}

InfoScraper = AutoScraper()
InfoScraper.build(UrlToScrape, wanted_list=WantedList)

data = InfoScraper.get_result_similar(UrlToScrape, request_args={"proxies": proxy})
print(data)
```

Visit https://ip.oxylabs.io/, copy the displayed IP address, and paste it in place of YOUR_REAL_IP_ADDRESS in the ```WantedList```. This information will be used to tell AutoScraper what kind of data to look for.

The proxy_endpoint refers to the address of a [proxy server](https://oxylabs.io/products/paid-proxy-servers) in the correct format (e.g., ```http://customer-USERNAME:PASSWORD@pr.oxylabs.io:7777```). The script above should work fine once proper proxy endpoints are added to the ```proxy``` dictionary. Every time you run the code, it outputs a proxy IP address that should be different from your actual IP.
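Since the endpoint embeds your username and password, it’s best not to hard-code it. One option is to read it from an environment variable; a minimal sketch, where the ```OXYLABS_PROXY``` variable name and the ```proxies_from_env()``` helper are just examples of ours:

```python
import os

def proxies_from_env(var_name="OXYLABS_PROXY"):
    """Build a requests-style proxies dict from a proxy endpoint stored in an
    environment variable, e.g. http://customer-USERNAME:PASSWORD@pr.oxylabs.io:7777.
    The variable name is an arbitrary choice for this example."""
    endpoint = os.environ.get(var_name)
    if not endpoint:
        raise RuntimeError(f"Set the {var_name} environment variable to your proxy endpoint first")
    # The same endpoint serves both plain and TLS traffic, as in the snippet above.
    return {"http": endpoint, "https": endpoint}
```

You can then pass ```request_args={"proxies": proxies_from_env()}``` to any of the three functions.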

After you’ve successfully tested the proxy server connection, you can use the [proxies](https://oxylabs.io/products/residential-proxy-pool) with the initial request like so:

```
from autoscraper import AutoScraper

UrlToScrape = "https://sandbox.oxylabs.io/products/3"
WantedList = ["Super Mario Galaxy 2", "91,99 €"]

proxy = {
    "http": "proxy_endpoint",
    "https": "proxy_endpoint",
}

InfoScraper = AutoScraper()
InfoScraper.build(UrlToScrape, wanted_list=WantedList, request_args={"proxies": proxy})
# Remaining code...
```

## Saving and loading an AutoScraper model

AutoScraper provides the ability to save and load a pre-trained scraper. We can use the following line to save the InfoScraper object to a file:

```InfoScraper.save("file_name")```

Similarly, we can load a saved scraper using:

```
SavedScraper = AutoScraper()
SavedScraper.load("file_name")
```
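In practice, you can combine the two so the model is trained only once and reused on every later run. A minimal load-or-build sketch, where the ```load_or_build()``` helper and the model file name are our own, not part of the AutoScraper API:

```python
import os

def load_or_build(scraper, path, build_fn):
    """Load a previously saved model from `path` if it exists; otherwise
    train the scraper with `build_fn` and save it for the next run.
    This helper is illustrative, not part of AutoScraper itself."""
    if os.path.exists(path):
        scraper.load(path)
    else:
        build_fn(scraper)
        scraper.save(path)
    return scraper

if __name__ == "__main__":
    from autoscraper import AutoScraper

    scraper = load_or_build(
        AutoScraper(),
        "info_scraper.model",  # the file name is arbitrary
        lambda s: s.build(
            "https://sandbox.oxylabs.io/products/3",
            wanted_list=["Super Mario Galaxy 2", "91,99 €"],
        ),
    )
    print(scraper.get_result_similar("https://sandbox.oxylabs.io/products/39"))
```

The first run trains and saves the model; every subsequent run loads it from disk and skips the training request entirely.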