├── README.md
├── node
│   ├── basic.js
│   ├── image.js
│   ├── intercept.js
│   ├── package-lock.json
│   ├── package.json
│   └── script.js
└── python
    ├── basic.py
    ├── image.py
    ├── intercept.py
    ├── requirements.txt
    └── script.py
/README.md:
--------------------------------------------------------------------------------
1 | # Web Scraping With Playwright
2 |
9 | - [Web Scraping With Playwright](#web-scraping-with-playwright)
10 |   - [Support for proxies in Playwright](#support-for-proxies-in-playwright)
11 |     - [Node.js](#nodejs)
12 |     - [Python](#python)
13 |     - [Node JS](#node-js)
14 |     - [Python Code](#python-code)
15 |   - [Basic scraping with Playwright](#basic-scraping-with-playwright)
16 |   - [Locating elements](#locating-elements)
17 |   - [Scraping text](#scraping-text)
18 |   - [Scraping Images](#scraping-images)
19 |     - [Node JS](#node-js-1)
20 |     - [Python](#python-1)
21 |   - [Intercepting HTTP Requests with Playwright](#intercepting-http-requests-with-playwright)
22 |     - [Python](#python-2)
23 |     - [Node JS](#node-js-2)
24 |
25 | This article covers web scraping with Playwright in Node.js and Python: using proxies, opening pages, locating elements, scraping text and images, and intercepting HTTP requests.
26 |
27 | For a detailed explanation, see our [blog post](https://oxy.yt/erHw).
28 |
29 | ## Support for proxies in Playwright
30 |
31 | Playwright supports the use of proxies. Before exploring this subject further, here is a quick code snippet showing how to launch Chromium without a proxy:
32 |
33 | ### Node.js
34 |
35 | ```javascript
36 | const { chromium } = require('playwright')
37 | const browser = await chromium.launch()
38 | ```
39 |
40 | ### Python
41 |
42 | ```python
43 | from playwright.async_api import async_playwright
44 | import asyncio
45 | async def main():
46 |     async with async_playwright() as p:
47 |         browser = await p.chromium.launch()
48 | ```
49 |
50 | This code needs only slight modifications to fully utilize proxies.
51 |
52 | In the case of Node.js, the `launch` function accepts an optional `launchOptions` object. This object can carry several settings, e.g., `headless`. The other setting needed here is `proxy`, which is itself an object with properties such as `server`, `username`, and `password`. The first step is to create an object that specifies these parameters, and then pass it to the `launch` method, as in the example below:
53 |
54 | ### Node JS
55 |
56 | ```javascript
57 | const playwright = require("playwright");
58 |
59 | (async () => {
60 |   for (const browserType of ['chromium', 'firefox', 'webkit']) {
61 |     const launchOptions = {
62 |       headless: false,
63 |       proxy: {
64 |         server: "http://pr.oxylabs.io:7777",
65 |         username: "USERNAME",
66 |         password: "PASSWORD"
67 |       }
68 |     }
69 |     const browser = await playwright[browserType].launch(launchOptions)
70 |   }
71 | })()
72 | ```
73 |
74 | In the case of Python, it’s slightly different. There’s no need to create an object of LaunchOptions. Instead, all the values can be sent as separate parameters. Here’s how the proxy dictionary will be sent:
75 |
76 | ### Python Code
77 |
78 | ```python
79 | from playwright.async_api import async_playwright
80 | import asyncio
81 | async def main():
82 |     async with async_playwright() as p:
83 |         browser = await p.chromium.launch(
84 |             proxy={
85 |                 'server': "http://pr.oxylabs.io:7777",
86 |                 "username": "USERNAME",
87 |                 "password": "PASSWORD"
88 |             },
89 |             headless=False
90 |         )
91 | ```
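
Hard-coding credentials as in the snippets above is fine for a demo. As a minimal sketch, the same `proxy` dictionary could instead be built from environment variables. The helper `proxy_from_env` and the variable names `PROXY_SERVER`, `PROXY_USERNAME`, and `PROXY_PASSWORD` are hypothetical names chosen for this illustration; they are not part of Playwright.

```python
import os


def proxy_from_env(env=os.environ):
    """Build a proxy dict in the shape used by the launch() examples above.

    PROXY_SERVER, PROXY_USERNAME and PROXY_PASSWORD are hypothetical names
    chosen for this sketch; substitute whatever your environment provides.
    """
    proxy = {"server": env.get("PROXY_SERVER", "http://pr.oxylabs.io:7777")}
    # Only include the credentials that are actually set
    for key, var in (("username", "PROXY_USERNAME"), ("password", "PROXY_PASSWORD")):
        if env.get(var):
            proxy[key] = env[var]
    return proxy
```

The result could then be passed as `proxy=proxy_from_env()` to `p.chromium.launch()`.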
92 |
93 | When deciding which proxy to use, residential proxies are usually the better choice, as they are less likely to be flagged by anti-bot systems. Oxylabs’ Residential Proxies offer an extensive and stable proxy network, with access to proxies in a specific country, state, or even city. They also integrate easily with Playwright.
94 |
95 | ## Basic scraping with Playwright
96 |
97 | Let’s move to another topic that will cover how to get started with Playwright using Node.js and Python.
98 |
99 | If you’re using Node.js, create a new project and install the Playwright library. This can be done using these two simple commands:
100 |
101 | ```shell
102 | npm init -y
103 | npm install playwright
104 | ```
105 |
106 | A basic script that opens a dynamic page is as follows:
107 |
108 | ```javascript
109 | const playwright = require("playwright");
110 | (async () => {
111 |   for (const browserType of ['chromium', 'firefox', 'webkit']) {
112 |     const browser = await playwright[browserType].launch()
113 |     const context = await browser.newContext()
114 |     const page = await context.newPage()
115 |     await page.goto("https://amazon.com")
116 |     await page.waitForTimeout(1000)
117 |     await browser.close()
118 |   }
119 | })()
120 |
121 | ```
122 |
123 | Let’s look at the above code. The first line imports Playwright. Then the script launches each browser in turn, which lets it automate Chromium, Firefox, and WebKit. For each browser, a new page is opened, and the `page.goto()` function navigates to the Amazon web page. After that, there’s a wait of 1 second to show the page to the end user. Finally, the browser is closed.
124 |
125 | The same code can be written in Python easily. First, install the Playwright Python library using the pip command and also install the necessary browsers afterward using the install command:
126 |
127 | ```shell
128 | python -m pip install playwright
129 | playwright install
130 | ```
131 |
132 | Note that Playwright supports two variations – synchronous and asynchronous. The following example uses the asynchronous API:
133 |
134 | ```python
135 | from playwright.async_api import async_playwright
136 | import asyncio
137 | async def main():
138 |     async with async_playwright() as p:
139 |         browser = await p.chromium.launch(headless=False)
140 |         page = await browser.new_page()
141 |         await page.goto('https://amazon.com')
142 |         await page.wait_for_timeout(1000)
143 |         await browser.close()
144 | asyncio.run(main())
145 | ```
146 | This code is similar to the Node.js code. The biggest difference is the use of the `asyncio` library. Another difference is that the function names change from camelCase to snake_case.
147 |
148 | In Node JS, if you want to create more than one browser context, or if you want finer control, you can create a context object and create multiple pages in that context. This would open pages in new tabs:
149 |
150 | ```javascript
151 | const context = await browser.newContext()
152 | const page1 = await context.newPage()
153 | const page2 = await context.newPage()
154 | ```
155 |
156 | You may also want to handle page context in your code. It’s possible to get the browser context that the page belongs to using the `page.context()` function.
157 |
158 | ## Locating elements
159 |
160 | To extract information from any element or to click any element, the first step is to locate the element. Playwright supports both CSS and XPath selectors.
161 |
162 | This can be understood better with a practical example. Open the following Amazon link:
163 |
164 | https://www.amazon.com/b?node=17938598011
165 |
166 | You can see that all the items are under the International Best Seller category, which has div elements with the class name "a-spacing-base".
167 |
168 | To select all the div elements, you need to run a loop over all these elements. These div elements can be selected using the CSS selector:
169 |
170 | ```css
171 | .a-spacing-base
172 | ```
173 |
174 | Similarly, the XPath selector would be as follows:
175 |
176 | ```text
177 | //*[@class="a-spacing-base"]
178 | ```
179 |
180 | To use these selectors, the most common functions are as follows:
181 |
182 | - `$eval(selector, function)` – selects the first element, sends the element to the function, and the result of the function is returned;
183 |
184 | - `$$eval(selector, function)` – same as above, except that it selects all elements;
185 |
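186 | - `querySelector(selector)` – returns the first element (`page.$(selector)` in Node.js, `page.query_selector(selector)` in Python);
187 |
188 | - `querySelectorAll(selector)` – returns all the elements (`page.$$(selector)` in Node.js, `page.query_selector_all(selector)` in Python).
189 |
190 | These Playwright methods work with both CSS and XPath selectors. The DOM’s own `querySelector`, when called inside an `$eval` callback, accepts CSS selectors only.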
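
One caveat with the XPath shown above: `@class="a-spacing-base"` matches only when the attribute is exactly that string, whereas the CSS selector `.a-spacing-base` also matches elements that carry additional classes. Below is a minimal sketch of an XPath that behaves like the CSS class selector. `class_xpath` is a hypothetical helper written for this illustration; the `xpath=` prefix is Playwright's explicit XPath selector syntax.

```python
def class_xpath(class_name: str) -> str:
    """Return an XPath matching elements whose class list contains class_name."""
    # Pad the normalized class attribute with spaces so whole tokens are compared
    return (
        '//*[contains(concat(" ", normalize-space(@class), " "), '
        f'" {class_name} ")]'
    )


selector = "xpath=" + class_xpath("a-spacing-base")
print(selector)
```

The resulting string can be passed anywhere a selector is expected, for example to `query_selector_all`.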
191 |
192 | ## Scraping text
193 |
194 | Continuing with the example of Amazon, after the page has been loaded, you can use a selector to extract all products using the `$$eval` function.
195 |
196 | ```javascript
197 | const products = await page.$$eval('.a-spacing-base', all_products => {
198 |   // run a loop here
199 | })
200 | ```
201 |
202 | Now all the elements that contain product data can be extracted in a loop:
203 |
204 | ```javascript
205 | all_products.forEach(product => {
206 |   const title = product.querySelector('span.a-size-base-plus').innerText
207 | })
208 | ```
209 |
210 | Finally, the innerText attribute can be used to extract the data from each data point. Here’s the complete code in Node.js:
211 |
212 | ```javascript
213 | const playwright = require("playwright");
214 | (async () => {
215 |   for (const browserType of ['chromium', 'firefox', 'webkit']) {
216 |     const launchOptions = {
217 |       headless: false,
218 |       proxy: {
219 |         server: "http://pr.oxylabs.io:7777",
220 |         username: "USERNAME",
221 |         password: "PASSWORD"
222 |       }
223 |     }
224 |     const browser = await playwright[browserType].launch(launchOptions)
225 |     const context = await browser.newContext()
226 |     const page = await context.newPage()
227 |     await page.goto('https://www.amazon.com/b?node=17938598011');
228 |     const products = await page.$$eval('.a-spacing-base', all_products => {
229 |       const data = []
230 |       all_products.forEach(product => {
231 |         const title = product.querySelector('span.a-size-base-plus').innerText
232 |         const price = product.querySelector('span.a-price').innerText
233 |         const rating = product.querySelector('span.a-icon-alt').innerText
234 |         data.push({ title, price, rating})
235 |       });
236 |       return data
237 |     })
238 |     console.log(products)
239 |     await browser.close()
240 |   }
241 | })()
242 |
243 | ```
244 |
245 | The Python code will be a bit different. Python has a function `eval_on_selector`, which is similar to the `$eval` of Node.js, but it’s not suitable for this scenario. The reason is that the second parameter still needs to be JavaScript. This can be useful in certain scenarios, but in this case, it is much better to write the entire code in Python.
246 |
247 | It would be better to use `query_selector` and `query_selector_all` which will return an element and a list of elements respectively.
248 |
249 | ```python
250 | from playwright.async_api import async_playwright
251 | import asyncio
252 |
253 |
254 | async def main():
255 |     async with async_playwright() as pw:
256 |         browser = await pw.chromium.launch(
257 |             proxy={
258 |                 'server': "http://pr.oxylabs.io:7777",
259 |                 "username": "USERNAME",
260 |                 "password": "PASSWORD"
261 |             },
262 |             headless=False
263 |         )
264 |
265 |         page = await browser.new_page()
266 |         await page.goto('https://www.amazon.com/b?node=17938598011')
267 |         await page.wait_for_timeout(5000)
268 |
269 |         all_products = await page.query_selector_all('.a-spacing-base')
270 |         data = []
271 |         for product in all_products:
272 |             result = dict()
273 |             title_el = await product.query_selector('span.a-size-base-plus')
274 |             result['title'] = await title_el.inner_text()
275 |             price_el = await product.query_selector('span.a-price')
276 |             result['price'] = await price_el.inner_text()
277 |             rating_el = await product.query_selector('span.a-icon-alt')
278 |             result['rating'] = await rating_el.inner_text()
279 |             data.append(result)
280 |         print(data)
281 |         await browser.close()
282 |
283 | if __name__ == '__main__':
284 |     asyncio.run(main())
285 | ```
286 |
287 | The output of both the Node.js and the Python code will be the same.
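
Both scripts assume that every product card contains a title, a price, and a rating. In Python, `query_selector` returns `None` when nothing matches, so a card without, say, a rating would raise an error on `inner_text()`. The following is a hedged sketch of a defensive helper. `safe_inner_text` is a hypothetical name, and the `_Fake*` classes are stand-ins that exist only so the example runs without a browser.

```python
import asyncio


async def safe_inner_text(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = await parent.query_selector(selector)
    if element is None:
        return None
    return (await element.inner_text()).strip()


# Stand-ins mimicking the two element methods used above (illustration only)
class _FakeElement:
    def __init__(self, text):
        self._text = text

    async def inner_text(self):
        return self._text


class _FakeProduct:
    def __init__(self, children):
        self._children = children

    async def query_selector(self, selector):
        text = self._children.get(selector)
        return _FakeElement(text) if text is not None else None


async def demo():
    # A product card that has a price but no rating
    product = _FakeProduct({"span.a-price": " $9.99 "})
    price = await safe_inner_text(product, "span.a-price")
    rating = await safe_inner_text(product, "span.a-icon-alt")
    return price, rating


result = asyncio.run(demo())
print(result)
```

In the scraper, each `query_selector` plus `inner_text()` pair could be replaced by one call such as `await safe_inner_text(product, 'span.a-price')`.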
288 |
289 | ## Scraping Images
290 |
291 | Next, we will learn how to scrape images using Playwright. In this example, we will use the official Oxylabs website as the image source. If you visit the website, https://oxylabs.io, you will notice that it has many images. We will extract all of them and save them in our current directory. First, let’s explore how we can accomplish this using Node JS.
292 |
293 | ### Node JS
294 |
295 | The code will be similar to the one that we’ve written earlier. There are multiple ways to extract images using the JavaScript Playwright API. In this example, we will use two additional Node.js built-in modules, `https` and `fs`. They let us make network requests to download the images and store them in the current directory. Take a look at the full source code below:
296 |
297 | ```javascript
298 | const playwright = require("playwright");
299 | const https = require('https');
300 | const fs = require('fs');
301 |
302 | (async () => {
303 |   const launchOptions = {
304 |     headless: false,
305 |     proxy: {
306 |       server: "http://pr.oxylabs.io:7777",
307 |       username: "USERNAME",
308 |       password: "PASSWORD"
309 |     }
310 |   }
311 |   const browser = await playwright["chromium"].launch(launchOptions)
312 |   const context = await browser.newContext()
313 |   const page = await context.newPage()
314 |   await page.goto('https://oxylabs.io');
315 |   // $$eval runs in the browser, where fs and https are unavailable,
316 |   // so only the image URLs are collected there
317 |   const all_images = await page.$$eval('img', elements =>
318 |     elements.map(element => ({ src: element.src })).filter(image => image.src.startsWith('https://')))
319 |   // Download each image from Node.js
320 |   all_images.forEach((image, index) => {
321 |     const path = `image_${index}.svg`
322 |     const file = fs.createWriteStream(path)
323 |     https.get(image.src, function(response) {
324 |       response.pipe(file);
325 |     })
326 |   })
327 |   console.log(all_images.map(image => image.src))
328 |   await browser.close()
329 | })()
330 | ```
331 |
332 | As you can see, we initialize a Chromium browser instance with the Oxylabs Residential Proxy, just like in the previous example. After navigating to the website, we use `$$eval` to extract all the image elements.
333 |
334 | After extracting all the images, we use a `forEach` loop to iterate over every image element.
335 |
336 | ```javascript
337 | all_images.forEach((image, index) => {
338 |   const path = `image_${index}.svg`
339 |   const file = fs.createWriteStream(path)
340 |   https.get(image.src, function(response) {
341 |     response.pipe(file);
342 |   })
343 | ```
344 |
345 | Inside this `forEach` loop, we are constructing the image name using the index and also the path of the image. We are using a relative path so that the images will be stored in the current directory.
346 |
347 | We then initiate a `file` object by calling the `createWriteStream` method of the fs library. Finally, we use the https library to send a `GET` request to download the image using the image src URL. We also pipe the response that we receive directly to the file stream which will write it in the current directory.
348 |
349 | Once we execute this code, the script will loop through each of the images available on the oxylabs.io website and download them to our current directory.
350 |
351 | ### Python
352 |
353 | Python’s built-in support for file I/O operations makes this task easier than in Node JS. Similar to the Node JS code, we will first extract the images using Playwright. Just like in our Amazon example, we can use the `query_selector_all` method to extract all the image elements. After extracting the image elements, we will send a GET request to each image source URL with the `requests` library (installed separately with `python -m pip install requests`) and store the response content in the current directory.
354 |
355 | The full source code is given below:
356 |
357 | ```python
358 | from playwright.async_api import async_playwright
359 | import asyncio
360 | import requests
361 |
362 |
363 | async def main():
364 |     async with async_playwright() as pw:
365 |         browser = await pw.chromium.launch(
366 |             proxy={
367 |                 'server': "http://pr.oxylabs.io:7777",
368 |                 "username": "USERNAME",
369 |                 "password": "PASSWORD"
370 |             },
371 |             headless=False
372 |         )
373 |
374 |         page = await browser.new_page()
375 |         await page.goto('https://www.oxylabs.io')
376 |         await page.wait_for_timeout(5000)
377 |
378 |         all_images = await page.query_selector_all('img')
379 |         images = []
380 |         for i, img in enumerate(all_images):
381 |             image_url = await img.get_attribute("src")
382 |             content = requests.get(image_url).content
383 |             with open("image_{}.svg".format(i), "wb") as f:
384 |                 f.write(content)
385 |             images.append(image_url)
386 |         print(images)
387 |         await browser.close()
388 |
389 | if __name__ == '__main__':
390 |     asyncio.run(main())
391 | ```
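
The script above assumes that every `src` is an absolute URL and that every image is an SVG. Here is a small sketch of how both assumptions could be relaxed with the standard library. `resolve_image` is a hypothetical helper, and the image paths in the usage lines are made up for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_image(page_url, src, index):
    """Return (absolute_url, filename) for an <img> src attribute."""
    # Relative src values are resolved against the page URL
    absolute_url = urljoin(page_url, src)
    # Take the extension from the URL path; fall back to .bin when there is none
    extension = posixpath.splitext(urlparse(absolute_url).path)[1] or ".bin"
    return absolute_url, f"image_{index}{extension}"


# Hypothetical src values, for illustration only
print(resolve_image("https://oxylabs.io/", "/img/logo.svg?v=2", 0))
print(resolve_image("https://oxylabs.io/", "https://cdn.example.com/a/photo.png", 1))
```

In the download loop, the returned URL would go to `requests.get()` and the returned filename to `open()`.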
392 |
393 | ## Intercepting HTTP Requests with Playwright
394 |
395 | Now, we will explore how to intercept HTTP requests with Playwright. Interception is useful for advanced web scraping, debugging, testing, and performance optimization. For example, with Playwright we can intercept HTTP requests to abort image loading, customize headers, modify the response body, etc. Let’s take a look at the examples below:
396 |
397 | ### Python
398 |
399 | We will define a new function named `handle_route`, which Playwright will invoke to intercept HTTP requests. The function is simple: it fetches the response, updates the title in the HTML, and overrides the header to set `content-type: text/html`.
400 |
401 | We will also write a lambda function that prevents images from loading. So, if we execute the script, the website will load without any images, and with both the title and the header modified. The code is given below:
402 |
403 | ```python
404 | from playwright.async_api import async_playwright
405 | import asyncio
406 |
407 |
408 | async def handle_route(route) -> None:
409 |     response = await route.fetch()
410 |     body = await response.text()
411 |     body = body.replace("<title>", "<title>Modified Response")
412 |     await route.fulfill(
413 |         response=response,
414 |         body=body,
415 |         headers={**response.headers, "content-type": "text/html"},
416 |     )
417 |
418 | async def main():
419 |     async with async_playwright() as pw:
420 |         browser = await pw.chromium.launch(
421 |             proxy={
422 |                 'server': "http://pr.oxylabs.io:7777",
423 |                 "username": "USERNAME",
424 |                 "password": "PASSWORD"
425 |             },
426 |             headless=False
427 |         )
428 |
429 |         page = await browser.new_page()
430 |         # Later routes take precedence, so register the catch-all first
431 |         await page.route("**/*", handle_route)
432 |         await page.route("**/*.{png,jpg,jpeg,svg}", lambda route: route.abort())  # abort image loading
433 |         await page.goto('https://www.oxylabs.io')
434 |         await page.wait_for_timeout(5000)
435 |         await browser.close()
436 |
437 | if __name__ == '__main__':
438 |     asyncio.run(main())
439 | ```
440 |
441 | Notice that we use the `route()` method to tell Playwright which function to call when intercepting requests. It takes two parameters: the first is a URL pattern (a glob string, a regular expression, or a predicate function), and the second is the handler, either a function or a lambda. The glob `"**/*.{png,jpg,jpeg,svg}"` tells Playwright to match all URLs that end with one of the given extensions, i.e., PNG, JPG, JPEG, or SVG. When several routes match a request, the most recently registered one handles it.
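
Besides a glob string such as `"**/*.{png,jpg,jpeg,svg}"`, `page.route()` in the Python API also accepts a compiled regular expression or a predicate that receives the URL. Below is a minimal sketch of such a predicate. `is_image_url` is a hypothetical helper; unlike the glob, it looks only at the URL path, so it should also match image URLs that carry a query string or upper-case extensions.

```python
from urllib.parse import urlparse

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".svg")


def is_image_url(url: str) -> bool:
    """Return True when the URL path ends with a known image extension."""
    # Ignore the query string and compare case-insensitively
    path = urlparse(url).path.lower()
    return path.endswith(IMAGE_EXTENSIONS)


print(is_image_url("https://oxylabs.io/logo.SVG?v=3"))
print(is_image_url("https://oxylabs.io/pricing"))
```

It could then be registered with `await page.route(is_image_url, lambda route: route.abort())`.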
442 |
443 | ### Node JS
444 |
445 | The same thing can be achieved using Node JS as well. The code is also quite similar to Python.
446 |
447 | ```javascript
448 | const playwright = require("playwright");
449 | (async () => {
450 |   const launchOptions = {
451 |     headless: false,
452 |     proxy: {
453 |       server: "http://pr.oxylabs.io:7777",
454 |       username: "USERNAME",
455 |       password: "PASSWORD"
456 |     }
457 |   }
458 |   const browser = await playwright["chromium"].launch(launchOptions)
459 |   const context = await browser.newContext()
460 |   const page = await context.newPage()
461 |   await page.route('**/*', async route => {
462 |     const response = await route.fetch();
463 |     let body = await response.text();
464 |     body = body.replace('<title>', '<title>Modified Response: ');
465 |     await route.fulfill({
466 |       response,
467 |       body,
468 |       headers: {
469 |         ...response.headers(),
470 |         'content-type': 'text/html'
471 |       }
472 |     })
473 |   })
474 |   await page.route(/(png|jpeg|jpg|svg)$/, route => route.abort()) // registered last, so it takes precedence
475 |   await page.goto('https://oxylabs.io');
476 |   await browser.close()
477 | })()
478 | ```
479 |
480 | We are using the `page.route` method to intercept the HTTP requests and modify the response’s title and headers. We are also blocking any images from loading. This can be a handy trick to speed up page loading and improve scraping performance.
481 |
--------------------------------------------------------------------------------
/node/basic.js:
--------------------------------------------------------------------------------
1 | const playwright = require("playwright");
2 | (async () => {
3 |   for (const browserType of ['chromium', 'firefox', 'webkit']) {
4 |     const browser = await playwright[browserType].launch()
5 |     const context = await browser.newContext()
6 |     const page = await context.newPage()
7 |     await page.goto("https://amazon.com")
8 |     await page.waitForTimeout(1000)
9 |     await browser.close()
10 |   }
11 | })()
12 |
--------------------------------------------------------------------------------
/node/image.js:
--------------------------------------------------------------------------------
1 | const playwright = require("playwright");
2 | const https = require('https');
3 | const fs = require('fs');
4 |
5 | (async () => {
6 |   const launchOptions = {
7 |     headless: false,
8 |     proxy: {
9 |       server: "http://pr.oxylabs.io:7777",
10 |       username: "USERNAME",
11 |       password: "PASSWORD"
12 |     }
13 |   }
14 |   const browser = await playwright["chromium"].launch(launchOptions)
15 |   const context = await browser.newContext()
16 |   const page = await context.newPage()
17 |   await page.goto('https://oxylabs.io');
18 |   // $$eval runs in the browser, where fs and https are unavailable,
19 |   // so only the image URLs are collected there
20 |   const all_images = await page.$$eval('img', elements =>
21 |     elements.map(element => ({ src: element.src })).filter(image => image.src.startsWith('https://')))
22 |   // Download each image from Node.js
23 |   all_images.forEach((image, index) => {
24 |     const path = `image_${index}.svg`
25 |     const file = fs.createWriteStream(path)
26 |     https.get(image.src, function(response) {
27 |       response.pipe(file);
28 |     })
29 |   })
30 |   console.log(all_images.map(image => image.src))
31 |   await browser.close()
32 | })()
--------------------------------------------------------------------------------
/node/intercept.js:
--------------------------------------------------------------------------------
1 | const playwright = require("playwright");
2 | (async () => {
3 |   const launchOptions = {
4 |     headless: false,
5 |     proxy: {
6 |       server: "http://pr.oxylabs.io:7777",
7 |       username: "USERNAME",
8 |       password: "PASSWORD"
9 |     }
10 |   }
11 |   const browser = await playwright["chromium"].launch(launchOptions)
12 |   const context = await browser.newContext()
13 |   const page = await context.newPage()
14 |   await page.route('**/*', async route => {
15 |     const response = await route.fetch();
16 |     let body = await response.text();
17 |     body = body.replace('<title>', '<title>Modified Response: ');
18 |     await route.fulfill({
19 |       response,
20 |       body,
21 |       headers: {
22 |         ...response.headers(),
23 |         'content-type': 'text/html'
24 |       }
25 |     })
26 |   })
27 |   await page.route(/(png|jpeg|jpg|svg)$/, route => route.abort()) // registered last, so it takes precedence
28 |   await page.goto('https://oxylabs.io');
29 |   await browser.close()
30 | })()
--------------------------------------------------------------------------------
/node/package-lock.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "node",
3 | "lockfileVersion": 2,
4 | "requires": true,
5 | "packages": {
6 | "": {
7 | "dependencies": {
8 | "playwright": "^1.27.0"
9 | }
10 | },
11 | "node_modules/playwright": {
12 | "version": "1.27.0",
13 | "resolved": "https://registry.npmjs.org/playwright/-/playwright-1.27.0.tgz",
14 | "integrity": "sha512-F+0+0RD03LS+KdNAMMp63OBzu+NwYYLd52pKLczuSlTsV5b/SLkUoNhSfzDFngEFOuRL2gk0LlfGW3mKiUBk6w==",
15 | "hasInstallScript": true,
16 | "dependencies": {
17 | "playwright-core": "1.27.0"
18 | },
19 | "bin": {
20 | "playwright": "cli.js"
21 | },
22 | "engines": {
23 | "node": ">=14"
24 | }
25 | },
26 | "node_modules/playwright-core": {
27 | "version": "1.27.0",
28 | "resolved": "https://registry.npmjs.org/playwright-core/-/playwright-core-1.27.0.tgz",
29 | "integrity": "sha512-VBKaaFUVKDo3akW+o4DwbK1ZyXh46tcSwQKPK3lruh8IJd5feu55XVZx4vOkbb2uqrNdIF51sgsadYT533SdpA==",
30 | "bin": {
31 | "playwright": "cli.js"
32 | },
33 | "engines": {
34 | "node": ">=14"
35 | }
36 | }
37 | },
38 | "dependencies": {
39 | "playwright": {
40 | "version": "1.27.0",
41 | "resolved": "https://registry.npmjs.org/playwright/-/playwright-1.27.0.tgz",
42 | "integrity": "sha512-F+0+0RD03LS+KdNAMMp63OBzu+NwYYLd52pKLczuSlTsV5b/SLkUoNhSfzDFngEFOuRL2gk0LlfGW3mKiUBk6w==",
43 | "requires": {
44 | "playwright-core": "1.27.0"
45 | }
46 | },
47 | "playwright-core": {
48 | "version": "1.27.0",
49 | "resolved": "https://registry.npmjs.org/playwright-core/-/playwright-core-1.27.0.tgz",
50 | "integrity": "sha512-VBKaaFUVKDo3akW+o4DwbK1ZyXh46tcSwQKPK3lruh8IJd5feu55XVZx4vOkbb2uqrNdIF51sgsadYT533SdpA=="
51 | }
52 | }
53 | }
54 |
--------------------------------------------------------------------------------
/node/package.json:
--------------------------------------------------------------------------------
1 | {
2 | "dependencies": {
3 | "playwright": "^1.27.0"
4 | }
5 | }
6 |
--------------------------------------------------------------------------------
/node/script.js:
--------------------------------------------------------------------------------
1 | const playwright = require("playwright");
2 |
3 | (async () => {
4 |   for (const browserType of ['chromium', 'firefox', 'webkit']) {
5 |     const launchOptions = {
6 |       headless: false,
7 |       proxy: {
8 |         server: "http://pr.oxylabs.io:7777",
9 |         username: "USERNAME",
10 |         password: "PASSWORD"
11 |       }
12 |     }
13 |     const browser = await playwright[browserType].launch(launchOptions)
14 |     const context = await browser.newContext()
15 |     const page = await context.newPage()
16 |     await page.goto('https://www.amazon.com/b?node=17938598011');
17 |     const products = await page.$$eval('.a-spacing-base', all_products => {
18 |       const data = []
19 |       all_products.forEach(product => {
20 |         const title = product.querySelector('span.a-size-base-plus').innerText
21 |         const price = product.querySelector('span.a-price').innerText
22 |         const rating = product.querySelector('span.a-icon-alt').innerText
23 |         data.push({ title, price, rating})
24 |       });
25 |       return data
26 |     })
27 |     console.log(products)
28 |     await browser.close()
29 |   }
30 | })()
--------------------------------------------------------------------------------
/python/basic.py:
--------------------------------------------------------------------------------
1 | from playwright.async_api import async_playwright
2 | import asyncio
3 | async def main():
4 |     async with async_playwright() as p:
5 |         browser = await p.chromium.launch(headless=False)
6 |         page = await browser.new_page()
7 |         await page.goto('https://amazon.com')
8 |         await page.wait_for_timeout(1000)
9 |         await browser.close()
10 |
11 | if __name__ == '__main__':
12 |     asyncio.run(main())
--------------------------------------------------------------------------------
/python/image.py:
--------------------------------------------------------------------------------
1 | from playwright.async_api import async_playwright
2 | import asyncio
3 | import requests
4 |
5 |
6 | async def main():
7 |     async with async_playwright() as pw:
8 |         browser = await pw.chromium.launch(
9 |             proxy={
10 |                 'server': "http://pr.oxylabs.io:7777",
11 |                 "username": "USERNAME",
12 |                 "password": "PASSWORD"
13 |             },
14 |             headless=False
15 |         )
16 |
17 |         page = await browser.new_page()
18 |         await page.goto('https://www.oxylabs.io')
19 |         await page.wait_for_timeout(5000)
20 |
21 |         all_images = await page.query_selector_all('img')
22 |         images = []
23 |         for i, img in enumerate(all_images):
24 |             image_url = await img.get_attribute("src")
25 |             content = requests.get(image_url).content
26 |             with open("image_{}.svg".format(i), "wb") as f:
27 |                 f.write(content)
28 |             images.append(image_url)
29 |         print(images)
30 |         await browser.close()
31 |
32 | if __name__ == '__main__':
33 |     asyncio.run(main())
--------------------------------------------------------------------------------
/python/intercept.py:
--------------------------------------------------------------------------------
1 | from playwright.async_api import async_playwright
2 | import asyncio
3 |
4 |
5 | async def handle_route(route) -> None:
6 |     response = await route.fetch()
7 |     body = await response.text()
8 |     body = body.replace("<title>", "<title>Modified Response")
9 |     await route.fulfill(
10 |         response=response,
11 |         body=body,
12 |         headers={**response.headers, "content-type": "text/html"},
13 |     )
14 |
15 | async def main():
16 |     async with async_playwright() as pw:
17 |         browser = await pw.chromium.launch(
18 |             proxy={
19 |                 'server': "http://pr.oxylabs.io:7777",
20 |                 "username": "USERNAME",
21 |                 "password": "PASSWORD"
22 |             },
23 |             headless=False
24 |         )
25 |
26 |         page = await browser.new_page()
27 |         # Later routes take precedence, so register the catch-all first
28 |         await page.route("**/*", handle_route)
29 |         await page.route("**/*.{png,jpg,jpeg,svg}", lambda route: route.abort())  # abort image loading
30 |         await page.goto('https://www.oxylabs.io')
31 |         await page.wait_for_timeout(5000)
32 |         await browser.close()
33 |
34 | if __name__ == '__main__':
35 |     asyncio.run(main())
--------------------------------------------------------------------------------
/python/requirements.txt:
--------------------------------------------------------------------------------
1 | playwright
2 | requests
--------------------------------------------------------------------------------
/python/script.py:
--------------------------------------------------------------------------------
1 | from playwright.async_api import async_playwright
2 | import asyncio
3 |
4 |
5 | async def main():
6 |     async with async_playwright() as pw:
7 |         browser = await pw.chromium.launch(
8 |             proxy={
9 |                 'server': "http://pr.oxylabs.io:7777",
10 |                 "username": "USERNAME",
11 |                 "password": "PASSWORD"
12 |             },
13 |             headless=False
14 |         )
15 |
16 |         page = await browser.new_page()
17 |         await page.goto('https://www.amazon.com/b?node=17938598011')
18 |         await page.wait_for_timeout(5000)
19 |
20 |         all_products = await page.query_selector_all('.a-spacing-base')
21 |         data = []
22 |         for product in all_products:
23 |             result = dict()
24 |             title_el = await product.query_selector('span.a-size-base-plus')
25 |             result['title'] = await title_el.inner_text()
26 |             price_el = await product.query_selector('span.a-price')
27 |             result['price'] = await price_el.inner_text()
28 |             rating_el = await product.query_selector('span.a-icon-alt')
29 |             result['rating'] = await rating_el.inner_text()
30 |             data.append(result)
31 |         print(data)
32 |         await browser.close()
33 |
34 | if __name__ == '__main__':
35 |     asyncio.run(main())
--------------------------------------------------------------------------------