├── README.md
├── node
    ├── basic.js
    ├── image.js
    ├── intercept.js
    ├── package-lock.json
    ├── package.json
    └── script.js
└── python
    ├── basic.py
    ├── image.py
    ├── intercept.py
    ├── requirements.txt
    └── script.py


/README.md:
--------------------------------------------------------------------------------
# Web Scraping With Playwright

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)

[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)

- [Web Scraping With Playwright](#web-scraping-with-playwright)
  - [Support for proxies in Playwright](#support-for-proxies-in-playwright)
    - [Node.js](#nodejs)
    - [Python](#python)
    - [Node JS](#node-js)
    - [Python Code](#python-code)
  - [Basic scraping with Playwright](#basic-scraping-with-playwright)
  - [Locating elements](#locating-elements)
  - [Scraping text](#scraping-text)
  - [Scraping Images](#scraping-images)
    - [Node JS](#node-js-1)
    - [Python](#python-1)
  - [Intercepting HTTP Requests with Playwright](#intercepting-http-requests-with-playwright)
    - [Python](#python-2)
    - [Node JS](#node-js-2)

This article covers web scraping with Playwright in Node.js and Python: proxy support, basic scraping, locating elements, scraping text and images, and intercepting HTTP requests.

For a detailed explanation, see our [blog post](https://oxy.yt/erHw).

## Support for proxies in Playwright

Playwright supports the use of proxies.
Before exploring this subject further, here is a quick code snippet showing how to launch Chromium without a proxy:

### Node.js

```javascript
const { chromium } = require('playwright')
// Run inside an async function, since CommonJS has no top-level await
const browser = await chromium.launch()
```

### Python

```python
from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()

asyncio.run(main())
```

This code needs only slight modifications to fully utilize proxies.

In Node.js, the `launch` function accepts an optional `launchOptions` object. This object can carry several settings, e.g., `headless`. The other setting needed here is `proxy`, which is itself an object with properties such as `server`, `username`, and `password`. First create an object that specifies these parameters, then pass it to the `launch` method as in the example below:

### Node JS

```javascript
const playwright = require("playwright");

(async () => {
    for (const browserType of ['chromium', 'firefox', 'webkit']) {
        const launchOptions = {
            headless: false,
            proxy: {
                server: "http://pr.oxylabs.io:7777",
                username: "USERNAME",
                password: "PASSWORD"
            }
        }
        const browser = await playwright[browserType].launch(launchOptions)
        await browser.close()
    }
})()
```

In the case of Python, it’s slightly different. There’s no need to create a `LaunchOptions` object. Instead, all the values can be sent as separate parameters.
Here’s how the proxy dictionary will be sent:

### Python Code

```python
from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )
        await browser.close()

asyncio.run(main())
```

When deciding on which proxy to use, residential proxies are usually a good choice because they are less likely to trigger security alarms. Oxylabs’ Residential Proxies can help you with an extensive and stable proxy network. You can access proxies in a specific country, state, or even a city. Importantly, they integrate easily with Playwright.

## Basic scraping with Playwright

Let’s move to another topic that will cover how to get started with Playwright using Node.js and Python.

If you’re using Node.js, create a new project and install the Playwright library. This can be done using these two simple commands:

```shell
npm init -y
npm install playwright
```

A basic script that opens a dynamic page is as follows:

```javascript
const playwright = require("playwright");

(async () => {
    for (const browserType of ['chromium', 'firefox', 'webkit']) {
        const browser = await playwright[browserType].launch()
        const context = await browser.newContext()
        const page = await context.newPage()
        await page.goto("https://amazon.com")
        await page.waitForTimeout(1000)
        await browser.close()
    }
})()
```

Let’s look at the above code: the first line imports Playwright. Then, the script launches three browsers in turn, Chromium, Firefox, and WebKit. For each one, a new browser page is opened.
Afterward, the `page.goto()` function navigates to the Amazon web page. After that, the script waits for 1 second so that the page can be seen by the end user. Finally, the browser is closed.

The same code can be written in Python easily. First, install the Playwright Python library using the pip command, then install the necessary browsers using the install command:

```shell
python -m pip install playwright
playwright install
```

Note that Playwright for Python comes in two variants: synchronous and asynchronous. The following example uses the asynchronous API:

```python
from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto('https://amazon.com')
        await page.wait_for_timeout(1000)
        await browser.close()

asyncio.run(main())
```

This code is similar to the Node.js code. The biggest difference is the use of the `asyncio` library. Another difference is that the function names change from camelCase to snake_case.

In Node.js, if you want to create more than one browser context or want finer control, you can create a context object and create multiple pages in that context. This would open pages in new tabs:

```javascript
const context = await browser.newContext()
const page1 = await context.newPage()
const page2 = await context.newPage()
```

You may also want to handle page context in your code. It’s possible to get the browser context that the page belongs to using the `page.context()` function.

## Locating elements

To extract information from any element or to click any element, the first step is to locate the element. Playwright supports both CSS and XPath selectors.
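
One practical difference between the two selector types is easier to see on a small example. A CSS class selector such as `.a-spacing-base` matches any element whose class list contains that class, while an XPath test of the form `[@class="a-spacing-base"]` compares the whole attribute string. The sketch below illustrates this with Python's standard `xml.etree.ElementTree` instead of a browser; the HTML fragment and the helper name `has_class` are made up for illustration and are not part of Playwright.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: two product blocks, one of which carries an extra class.
FRAGMENT = """
<div>
  <div class="a-spacing-base">Book A</div>
  <div class="a-section a-spacing-base">Book B</div>
  <div class="a-spacing-small">Not a product</div>
</div>
"""

root = ET.fromstring(FRAGMENT)


def has_class(element, name):
    """Return True if `name` is one of the element's space-separated classes."""
    return name in element.get("class", "").split()


# XPath-style attribute test: the whole attribute must equal the string.
exact = [el.text for el in root.findall(".//*[@class='a-spacing-base']")]

# CSS-style class test: the class only has to appear in the class list.
by_token = [el.text for el in root.iter() if has_class(el, "a-spacing-base")]

print(exact)     # ['Book A']
print(by_token)  # ['Book A', 'Book B']
```

In XPath, a class-list test is commonly written as `//*[contains(concat(" ", normalize-space(@class), " "), " a-spacing-base ")]`, which behaves like the CSS class selector.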

This can be understood better with a practical example. Open the Amazon page that the scraping examples in this article use: https://www.amazon.com/b?node=17938598011

You can see that all the items are under the International Best Seller category, which has div elements with the class name "a-spacing-base".

To process the items, select all of these div elements and loop over them. They can be selected using the CSS selector:

```css
.a-spacing-base
```

Similarly, the XPath selector would be as follows:

```text
//*[@class="a-spacing-base"]
```

To use these selectors, the most common functions are as follows:

- `$eval(selector, function)`: selects the first matching element, passes it to the function, and returns the function's result;

- `$$eval(selector, function)`: same as above, except that it selects all matching elements;

- `$(selector)`: returns the first matching element (`query_selector` in Python);

- `$$(selector)`: returns all the matching elements (`query_selector_all` in Python).

These methods will work correctly with both CSS and XPath selectors.

## Scraping text

Continuing with the example of Amazon, after the page has been loaded, you can use a selector to extract all products using the `$$eval` function.

```javascript
const products = await page.$$eval('.a-spacing-base', all_products => {
    // run a loop here
})
```

Now all the elements that contain product data can be extracted in a loop:

```javascript
all_products.forEach(product => {
    const title = product.querySelector('span.a-size-base-plus').innerText
})
```

Finally, the `innerText` property can be used to extract the text from each data point.
Here’s the complete code in Node.js:

```javascript
const playwright = require("playwright");

(async () => {
    for (const browserType of ['chromium', 'firefox', 'webkit']) {
        const launchOptions = {
            headless: false,
            proxy: {
                server: "http://pr.oxylabs.io:7777",
                username: "USERNAME",
                password: "PASSWORD"
            }
        }
        const browser = await playwright[browserType].launch(launchOptions)
        const context = await browser.newContext()
        const page = await context.newPage()
        await page.goto('https://www.amazon.com/b?node=17938598011');
        // The callback runs inside the page and returns plain data to Node.js
        const products = await page.$$eval('.a-spacing-base', all_products => {
            const data = []
            all_products.forEach(product => {
                const title = product.querySelector('span.a-size-base-plus').innerText
                const price = product.querySelector('span.a-price').innerText
                const rating = product.querySelector('span.a-icon-alt').innerText
                data.push({ title, price, rating})
            });
            return data
        })
        console.log(products)
        await browser.close()
    }
})()
```

The Python code will be a bit different. Python has a function `eval_on_selector`, which is similar to the `$eval` of Node.js, but it’s not suitable for this scenario. The reason is that the second parameter still needs to be JavaScript. This can be useful in some scenarios, but in this case, it is much better to write the entire code in Python.

It would be better to use `query_selector` and `query_selector_all`, which return an element and a list of elements respectively.

```python
from playwright.async_api import async_playwright
import asyncio


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )

        page = await browser.new_page()
        await page.goto('https://www.amazon.com/b?node=17938598011')
        await page.wait_for_timeout(5000)

        all_products = await page.query_selector_all('.a-spacing-base')
        data = []
        for product in all_products:
            result = dict()
            title_el = await product.query_selector('span.a-size-base-plus')
            result['title'] = await title_el.inner_text()
            price_el = await product.query_selector('span.a-price')
            result['price'] = await price_el.inner_text()
            rating_el = await product.query_selector('span.a-icon-alt')
            result['rating'] = await rating_el.inner_text()
            data.append(result)
        print(data)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
```

The output of both the Node.js and the Python code will be the same.

## Scraping Images

Next, we will learn how to scrape images using Playwright. For this example, we will use the official Oxylabs website as an image source. If you visit https://oxylabs.io, you will notice that it contains many images. We will extract all of these images and save them in the current directory. First, let’s explore how we can accomplish this using Node JS.

### Node JS

The code will be similar to the one that we’ve written earlier. There are multiple ways to extract images using the JavaScript Playwright wrapper. In this example, we will be using two additional libraries, `https` and `fs`.
These libraries will help us make network requests to download the images and store them in the current directory. Take a look at the full source code below:

```javascript
const playwright = require("playwright")
const https = require('https')
const fs = require('fs');

(async () => {
    const launchOptions = {
        headless: false,
        proxy: {
            server: "http://pr.oxylabs.io:7777",
            username: "USERNAME",
            password: "PASSWORD"
        }
    }
    const browser = await playwright["chromium"].launch(launchOptions)
    const context = await browser.newContext()
    const page = await context.newPage()
    await page.goto('https://oxylabs.io');
    // $$eval runs inside the page, where fs and https are not available,
    // so it only collects the image URLs
    const images = await page.$$eval('img', all_images => all_images.map(image => image.src))
    // Download in Node.js; skip sources that https.get cannot fetch (e.g. data: URIs)
    images.filter(src => src.startsWith('https:')).forEach((src, index) => {
        const path = `image_${index}.svg`
        const file = fs.createWriteStream(path)
        https.get(src, function(response) {
            response.pipe(file);
        })
    })
    console.log(images)
    await browser.close()
})()
```

As you can see, we are initializing a Chromium browser instance with the Oxylabs Residential proxy just like in the previous example. After navigating to the website, we are using `$$eval` to collect the `src` URL of every image element. The callback passed to `$$eval` runs inside the browser page, so Node.js modules such as `fs` and `https` cannot be used there; the download itself happens afterwards in Node.js.

After extracting all the image URLs, we are using a `forEach` loop to iterate over them.

```javascript
images.filter(src => src.startsWith('https:')).forEach((src, index) => {
    const path = `image_${index}.svg`
    const file = fs.createWriteStream(path)
    https.get(src, function(response) {
        response.pipe(file);
    })
})
```

Inside this `forEach` loop, we are constructing the image name using the index and also the path of the image. We are using a relative path so that the images will be stored in the current directory.

We then create a `file` object by calling the `createWriteStream` method of the `fs` library. Finally, we use the `https` library to send a `GET` request to download the image using the image `src` URL. We also pipe the response that we receive directly to the file stream, which writes it to the current directory.

Once we execute this code, the script will loop through each of the images available on the oxylabs.io website and download them to our current directory.

### Python

Python’s built-in support for file I/O operations makes this task easier than in Node JS. Similar to the Node JS code, we will first extract the images using the Playwright wrapper. Just like in our Amazon example, we can use the `query_selector_all` method to extract all the image elements. After extracting the image elements, we will send a GET request to each image source URL and store the response content in the current directory. The GET requests are sent with the `requests` library, which has to be installed separately (`python -m pip install requests`).

The full source code is given below:

```python
from playwright.async_api import async_playwright
import asyncio
import requests


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )

        page = await browser.new_page()
        await page.goto('https://www.oxylabs.io')
        await page.wait_for_timeout(5000)

        all_images = await page.query_selector_all('img')
        images = []
        for i, img in enumerate(all_images):
            # Read the image URL from the element's src attribute
            image_url = await img.get_attribute("src")
            content = requests.get(image_url).content
            with open("image_{}.svg".format(i), "wb") as f:
                f.write(content)
            images.append(image_url)
        print(images)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
```
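
The script above stores every download with an `.svg` extension and passes the raw `src` value straight to `requests.get`. The sketch below shows one way to resolve relative `src` values against the page URL and to take the file extension from the URL itself, using only the standard library. The helper names `resolve_image_url` and `image_filename`, the `.bin` fallback, and the sample `src` values are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_image_url(page_url, src):
    """Turn a possibly relative `src` into an absolute http(s) URL, or None."""
    if not src:
        return None
    absolute = urljoin(page_url, src)
    # data: URIs and other schemes cannot be fetched with a plain GET request.
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


def image_filename(index, image_url, default_ext=".bin"):
    """Build `image_<index><ext>`, taking the extension from the URL path."""
    path = urlparse(image_url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    return "image_{}{}".format(index, ext or default_ext)


# Made-up src values covering the common cases.
samples = ["/img/logo.svg", "hero.PNG?v=2", "data:image/gif;base64,AAAA", None]
for i, src in enumerate(samples):
    url = resolve_image_url("https://oxylabs.io/", src)
    print(src, "->", image_filename(i, url) if url else "skipped")
```

In the scraper, `resolve_image_url(page.url, image_url)` could be called before the download, and `image_filename(i, ...)` could replace the fixed `"image_{}.svg"` name.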

## Intercepting HTTP Requests with Playwright

Now, we will explore how to intercept HTTP requests with Playwright. Interception can be used for advanced web scraping, debugging, testing, and performance optimization. For example, using Playwright we can intercept HTTP requests to abort loading images, customize headers, modify response output, etc. Let’s take a look at the examples below:

### Python

We will define a new function named `handle_route`, which Playwright will invoke to intercept the HTTP requests. The function is simple: for HTML documents, it fetches the response, updates the title in the HTML code, and sets the `content-type` header to `text/html`. All other requests are passed on with `route.fallback()`.

We will also write a lambda function that prevents images from loading. So, if we execute the script, the website will load without any images, and with both the title and the header modified. The code is given below:

```python
from playwright.async_api import async_playwright
import asyncio

async def handle_route(route) -> None:
    # Rewrite only HTML documents; hand other requests to the next handler
    if route.request.resource_type != "document":
        await route.fallback()
        return
    response = await route.fetch()
    body = await response.text()
    body = body.replace("<title>", "<title>Modified Response")
    await route.fulfill(
        response=response,
        body=body,
        headers={**response.headers, "content-type": "text/html"},
    )

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )

        page = await browser.new_page()
        # abort image loading
        await page.route("**/*.{png,jpg,jpeg,svg}", lambda route: route.abort())
        await page.route("**/*", handle_route)
        await page.goto('https://www.oxylabs.io')
        await page.wait_for_timeout(5000)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
```

Notice that we are using the `route()` method to let Playwright know which function to call when intercepting the requests. It takes two parameters: the first is a pattern to match against the request URL (a glob string here, although a regular expression also works), and the second is the handler, either a function or a lambda. With the `"**/*.{png,jpg,jpeg,svg}"` glob, we are telling Playwright to match all the URLs that end with the given extensions, i.e., PNG, JPG, JPEG, and SVG. When several routes match a request, Playwright runs the most recently registered handler first, which is why `handle_route` hands everything that is not an HTML document over with `route.fallback()`.

### Node JS

The same thing can be achieved using Node JS as well. The code is also quite similar to Python.

```javascript
const playwright = require("playwright");

(async () => {
    const launchOptions = {
        headless: false,
        proxy: {
            server: "http://pr.oxylabs.io:7777",
            username: "USERNAME",
            password: "PASSWORD"
        }
    }
    const browser = await playwright["chromium"].launch(launchOptions)
    const context = await browser.newContext()
    const page = await context.newPage()
    await page.route(/(png|jpeg|jpg|svg)$/, route => route.abort())
    await page.route('**/*', async route => {
        // Rewrite only HTML documents; hand other requests to the next handler
        if (route.request().resourceType() !== 'document') return route.fallback()
        const response = await route.fetch();
        let body = await response.text();
        body = body.replace('<title>', '<title>Modified Response: ');
        await route.fulfill({
            response,
            body,
            headers: {
                ...response.headers(),
                'content-type': 'text/html'
            }
        })
    })
    await page.goto('https://oxylabs.io');
    await browser.close()
})()
```

We are using the `page.route` method to intercept the HTTP requests and modify the response’s title and headers. We are also blocking any images from loading. This can be a handy trick to speed up page loading and improve scraping performance.
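
Both image-blocking rules above decide by looking at the end of the URL. The Node.js regular expression `/(png|jpeg|jpg|svg)$/`, for instance, does not match an address such as `logo.png?v=3`, and it does match any URL that merely ends in those letters. As a hedged sketch, the check can instead be done on the URL path alone. The function name `is_image_url`, the extension tuple, and the sample URLs are assumptions for illustration; wiring the function into a route handler is not shown here.

```python
from urllib.parse import urlparse

# Extensions taken from the patterns used in this article.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".svg")


def is_image_url(url):
    """Return True if the URL path ends with a known image extension."""
    path = urlparse(url).path.lower()  # ignore query string and fragment
    return path.endswith(IMAGE_EXTENSIONS)


# Made-up URLs: a plain image, an image with a query string, and a page.
for url in (
    "https://oxylabs.io/logo.svg",
    "https://oxylabs.io/logo.png?v=3",
    "https://oxylabs.io/pricing",
):
    print(url, is_image_url(url))
```

A handler registered for `**/*` could call a predicate like this on `route.request.url` and abort the request when it returns `True`.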

--------------------------------------------------------------------------------
/node/basic.js:
--------------------------------------------------------------------------------
const playwright = require("playwright");

(async () => {
    for (const browserType of ['chromium', 'firefox', 'webkit']) {
        const browser = await playwright[browserType].launch()
        const context = await browser.newContext()
        const page = await context.newPage()
        await page.goto("https://amazon.com")
        await page.waitForTimeout(1000)
        await browser.close()
    }
})()

--------------------------------------------------------------------------------
/node/image.js:
--------------------------------------------------------------------------------
const playwright = require("playwright")
const https = require('https')
const fs = require('fs');

(async () => {
    const launchOptions = {
        headless: false,
        proxy: {
            server: "http://pr.oxylabs.io:7777",
            username: "USERNAME",
            password: "PASSWORD"
        }
    }
    const browser = await playwright["chromium"].launch(launchOptions)
    const context = await browser.newContext()
    const page = await context.newPage()
    await page.goto('https://oxylabs.io');
    // $$eval runs inside the page, where fs and https are not available,
    // so it only collects the image URLs
    const images = await page.$$eval('img', all_images => all_images.map(image => image.src))
    // Download in Node.js; skip sources that https.get cannot fetch (e.g. data: URIs)
    images.filter(src => src.startsWith('https:')).forEach((src, index) => {
        const path = `image_${index}.svg`
        const file = fs.createWriteStream(path)
        https.get(src, function(response) {
            response.pipe(file);
        })
    })
    console.log(images)
    await browser.close()
})()

--------------------------------------------------------------------------------
/node/intercept.js:
--------------------------------------------------------------------------------
const playwright = require("playwright");

(async () => {
    const launchOptions = {
        headless: false,
        proxy: {
            server: "http://pr.oxylabs.io:7777",
            username: "USERNAME",
            password: "PASSWORD"
        }
    }
    const browser = await playwright["chromium"].launch(launchOptions)
    const context = await browser.newContext()
    const page = await context.newPage()
    await page.route(/(png|jpeg|jpg|svg)$/, route => route.abort())
    await page.route('**/*', async route => {
        // Rewrite only HTML documents; hand other requests to the next handler
        if (route.request().resourceType() !== 'document') return route.fallback()
        const response = await route.fetch();
        let body = await response.text();
        body = body.replace('<title>', '<title>Modified Response: ');
        await route.fulfill({
            response,
            body,
            headers: {
                ...response.headers(),
                'content-type': 'text/html'
            }
        })
    })
    await page.goto('https://oxylabs.io');
    await browser.close()
})()

--------------------------------------------------------------------------------
/node/package-lock.json:
--------------------------------------------------------------------------------
{
  "name": "node",
  "lockfileVersion": 2,
  "requires": true,
  "packages": {
    "": {
      "dependencies": {
        "playwright": "^1.27.0"
      }
    },
    "node_modules/playwright": {
      "version": "1.27.0",
      "resolved": "https://registry.npmjs.org/playwright/-/playwright-1.27.0.tgz",
      "integrity": "sha512-F+0+0RD03LS+KdNAMMp63OBzu+NwYYLd52pKLczuSlTsV5b/SLkUoNhSfzDFngEFOuRL2gk0LlfGW3mKiUBk6w==",
      "hasInstallScript": true,
      "dependencies": {
        "playwright-core": "1.27.0"
      },
      "bin": {
        "playwright": "cli.js"
      },
      "engines": {
        "node": ">=14"
      }
    },
    "node_modules/playwright-core": {
      "version": "1.27.0",
      "resolved": "https://registry.npmjs.org/playwright-core/-/playwright-core-1.27.0.tgz",
      "integrity": "sha512-VBKaaFUVKDo3akW+o4DwbK1ZyXh46tcSwQKPK3lruh8IJd5feu55XVZx4vOkbb2uqrNdIF51sgsadYT533SdpA==",
      "bin": {
        "playwright": "cli.js"
      },
      "engines": {
        "node": ">=14"
      }
    }
  },
  "dependencies": {
    "playwright": {
      "version": "1.27.0",
      "resolved": "https://registry.npmjs.org/playwright/-/playwright-1.27.0.tgz",
      "integrity": "sha512-F+0+0RD03LS+KdNAMMp63OBzu+NwYYLd52pKLczuSlTsV5b/SLkUoNhSfzDFngEFOuRL2gk0LlfGW3mKiUBk6w==",
      "requires": {
        "playwright-core": "1.27.0"
      }
    },
    "playwright-core": {
      "version": "1.27.0",
      "resolved": "https://registry.npmjs.org/playwright-core/-/playwright-core-1.27.0.tgz",
      "integrity": "sha512-VBKaaFUVKDo3akW+o4DwbK1ZyXh46tcSwQKPK3lruh8IJd5feu55XVZx4vOkbb2uqrNdIF51sgsadYT533SdpA=="
    }
  }
}

--------------------------------------------------------------------------------
/node/package.json:
--------------------------------------------------------------------------------
{
  "dependencies": {
    "playwright": "^1.27.0"
  }
}

--------------------------------------------------------------------------------
/node/script.js:
--------------------------------------------------------------------------------
const playwright = require("playwright");

(async () => {
    for (const browserType of ['chromium', 'firefox', 'webkit']) {
        const launchOptions = {
            headless: false,
            proxy: {
                server: "http://pr.oxylabs.io:7777",
                username: "USERNAME",
                password: "PASSWORD"
            }
        }
        const browser = await playwright[browserType].launch(launchOptions)
        const context = await browser.newContext()
        const page = await context.newPage()
        await page.goto('https://www.amazon.com/b?node=17938598011');
        // The callback runs inside the page and returns plain data to Node.js
        const products = await page.$$eval('.a-spacing-base', all_products => {
            const data = []
            all_products.forEach(product => {
                const title = product.querySelector('span.a-size-base-plus').innerText
                const price = product.querySelector('span.a-price').innerText
                const rating = product.querySelector('span.a-icon-alt').innerText
                data.push({ title, price, rating})
            });
            return data
        })
        console.log(products)
        await browser.close()
    }
})()

--------------------------------------------------------------------------------
/python/basic.py:
--------------------------------------------------------------------------------
from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto('https://amazon.com')
        await page.wait_for_timeout(1000)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

--------------------------------------------------------------------------------
/python/image.py:
--------------------------------------------------------------------------------
from playwright.async_api import async_playwright
import asyncio
import requests


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )

        page = await browser.new_page()
        await page.goto('https://www.oxylabs.io')
        await page.wait_for_timeout(5000)

        all_images = await page.query_selector_all('img')
        images = []
        for i, img in enumerate(all_images):
            # Read the image URL from the element's src attribute
            image_url = await img.get_attribute("src")
            content = requests.get(image_url).content
            with open("image_{}.svg".format(i), "wb") as f:
                f.write(content)
            images.append(image_url)
        print(images)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

--------------------------------------------------------------------------------
/python/intercept.py:
--------------------------------------------------------------------------------
from playwright.async_api import async_playwright
import asyncio

async def handle_route(route) -> None:
    # Rewrite only HTML documents; hand other requests to the next handler
    if route.request.resource_type != "document":
        await route.fallback()
        return
    response = await route.fetch()
    body = await response.text()
    body = body.replace("<title>", "<title>Modified Response")
    await route.fulfill(
        response=response,
        body=body,
        headers={**response.headers, "content-type": "text/html"},
    )

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )

        page = await browser.new_page()
        # abort image loading
        await page.route("**/*.{png,jpg,jpeg,svg}", lambda route: route.abort())
        await page.route("**/*", handle_route)
        await page.goto('https://www.oxylabs.io')
        await page.wait_for_timeout(5000)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

--------------------------------------------------------------------------------
/python/requirements.txt:
--------------------------------------------------------------------------------
playwright
requests

--------------------------------------------------------------------------------
/python/script.py:
--------------------------------------------------------------------------------
from playwright.async_api import async_playwright
import asyncio


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )

        page = await browser.new_page()
        await page.goto('https://www.amazon.com/b?node=17938598011')
        await page.wait_for_timeout(5000)

        all_products = await page.query_selector_all('.a-spacing-base')
        data = []
        for product in all_products:
            result = dict()
            title_el = await product.query_selector('span.a-size-base-plus')
            result['title'] = await title_el.inner_text()
            price_el = await product.query_selector('span.a-price')
            result['price'] = await price_el.inner_text()
            rating_el = await product.query_selector('span.a-icon-alt')
            result['rating'] = await rating_el.inner_text()
            data.append(result)
        print(data)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
--------------------------------------------------------------------------------