├── .DS_Store
├── Images
    ├── .DS_Store
    ├── Antibot
    │   └── Perimterx1.png
    └── Tools
    │   ├── Wappalizer2.png
    │   └── Wappalyzer1.png
├── Pages
    ├── .DS_Store
    ├── Antibot
    │   ├── .DS_Store
    │   ├── Akamai.md
    │   ├── Canvasfingerprint.md
    │   ├── Cloudflare.md
    │   ├── Datadome.md
    │   └── PerimeterX.md
    └── Tools
    │   ├── .DS_Store
    │   ├── Playwright.md
    │   ├── Playwright_stealth.md
    │   ├── Puppeteer.md
    │   ├── Scrapy.md
    │   ├── Scrapy_splash.md
    │   ├── Selenium.md
    │   └── Wappalyzer.md
└── README.md

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ohld/webscraping-open-project/4a4b90d6efe25103ec764af1ea6daf98bd639fc0/.DS_Store
--------------------------------------------------------------------------------
/Images/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ohld/webscraping-open-project/4a4b90d6efe25103ec764af1ea6daf98bd639fc0/Images/.DS_Store
--------------------------------------------------------------------------------
/Images/Antibot/Perimterx1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ohld/webscraping-open-project/4a4b90d6efe25103ec764af1ea6daf98bd639fc0/Images/Antibot/Perimterx1.png
--------------------------------------------------------------------------------
/Images/Tools/Wappalizer2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ohld/webscraping-open-project/4a4b90d6efe25103ec764af1ea6daf98bd639fc0/Images/Tools/Wappalizer2.png
--------------------------------------------------------------------------------
/Images/Tools/Wappalyzer1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ohld/webscraping-open-project/4a4b90d6efe25103ec764af1ea6daf98bd639fc0/Images/Tools/Wappalyzer1.png
--------------------------------------------------------------------------------
/Pages/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ohld/webscraping-open-project/4a4b90d6efe25103ec764af1ea6daf98bd639fc0/Pages/.DS_Store
--------------------------------------------------------------------------------
/Pages/Antibot/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ohld/webscraping-open-project/4a4b90d6efe25103ec764af1ea6daf98bd639fc0/Pages/Antibot/.DS_Store
--------------------------------------------------------------------------------
/Pages/Antibot/Akamai.md:
--------------------------------------------------------------------------------
1 | # Akamai Bot Manager
2 |
3 | ## What is Akamai Bot Manager?
4 | [Akamai Bot Manager](https://www.akamai.com/products/bot-manager "Akamai") detects bots using device fingerprinting, bot signatures, and IP checks.
5 |
6 | ## Our View on Akamai Bot Manager
7 |
8 | ### How to Identify Akamai Bot Manager
9 | Use [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md)
10 |
11 | ### Recommended approach to Akamai Bot Manager
12 | **BEST CHOICE**: beating a standard Akamai configuration requires good proxy rotation; there's no need for a fully rendered browser.
13 |
14 | ### Reference and interesting links
15 | [Official web page](https://www.akamai.com/products/bot-manager)
16 |
17 |
--------------------------------------------------------------------------------
/Pages/Antibot/Canvasfingerprint.md:
--------------------------------------------------------------------------------
1 | # Canvas Fingerprint
2 |
3 | ## What is Canvas Fingerprint?
4 | Canvas Fingerprint is an anti-bot technique that can be found in some installations of Cloudflare Bot Management.
5 | It consists of a JavaScript snippet that renders an image in the background while the page loads: similar hardware configurations produce the same result, which lets the anti-bot trigger a captcha when the hardware configuration resembles a server machine. As a result, the same scraper may work on a workstation but fail on a VM-hosted server.
6 |
7 |
8 | ## Our View on Canvas Fingerprint
9 |
10 | ### How to Identify Canvas Fingerprint
11 | As far as we know, Cloudflare (in some configurations) is currently the only anti-bot that uses this technique, and Wappalyzer does not detect it. A homemade bot for analyzing anti-bot measures is available on the Discord server "[Scraping Enthusiast](https://discord.gg/Y8yuF55m7j)", managed by the developers of Puppeteer; it can recognize Canvas Fingerprint by running a series of queries against the website.
12 |
13 |
14 | ### Recommended approach to Canvas Fingerprint
15 | **BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) + hosting on a workstation
16 |
17 | ### Reference and interesting links
18 | [How Canvas Fingerprint works](https://fingerprintjs.com/blog/canvas-fingerprinting/)
19 | [How Canvas Fingerprint works 2](https://browserleaks.com/canvas#how-does-it-work)
20 | [Test if a VM is detected here](https://fingerprintjs.com/products/bot-detection/)
21 |
--------------------------------------------------------------------------------
/Pages/Antibot/Cloudflare.md:
--------------------------------------------------------------------------------
1 | # Cloudflare Bot Management
2 |
3 | ## What is Cloudflare Bot Management?
4 | [Cloudflare Bot Management](https://www.cloudflare.com/en-gb/products/bot-management/ "Cloudflare") detects bots using device fingerprinting, bot signatures, and IP checks.
5 |
6 | ## Our View on Cloudflare Bot Management
7 |
8 | ### How to Identify Cloudflare Bot Management
9 | Use [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md)
10 |
11 | ### Recommended approach to Cloudflare Bot Management
12 | **BEST CHOICE**: It depends on each website's configuration, but [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) are usually enough for scraping.
13 |
14 | ### Reference and interesting links
15 | [Official web page](https://www.cloudflare.com/en-gb/products/bot-management/)
16 |
17 | [Our in-depth look at Playwright against Cloudflare](https://reanalytics.freshdesk.com/discussions/topics/28000008894)
18 |
19 | [A package for bypassing Cloudflare](https://github.com/Anorov/cloudflare-scrape): possibly obsolete, not updated in 2 years
20 |
21 | [Firefox appears to be flagged as suspicious by Cloudflare](https://brianlovin.com/hn/31459258)
22 |
--------------------------------------------------------------------------------
/Pages/Antibot/Datadome.md:
--------------------------------------------------------------------------------
1 | # Datadome
2 |
3 | ## What is Datadome?
4 | [Datadome](https://datadome.co/ "Datadome") is the so-called "State of the art protection from Scraping". They say they apply statistical and behavioral detection, can detect [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md), have implemented client-side detection, and so on.
5 |
6 | ## Our View on Datadome
7 |
8 | ### How to Identify Datadome
9 | Use [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md)
10 |
11 | ### Recommended approach to Datadome
12 | **BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) are usually enough for scraping.
13 |
14 | ### Reference and interesting links
15 | [Official web page](https://datadome.co/)
16 | [Tests made with online tools](https://blog.vanila.io/how-strong-is-the-datadome-5e9ff211384e)
17 |
18 |
--------------------------------------------------------------------------------
/Pages/Antibot/PerimeterX.md:
--------------------------------------------------------------------------------
1 | # PerimeterX
2 |
3 | ## What is PerimeterX?
4 | The [PerimeterX](https://www.perimeterx.com/products/bot-defender "Perimeterx") anti-bot system is a protection layer that some websites use to block web scraping. One current example is [ssense.com](https://www.ssense.com/).
5 |
6 | ## Our View on PerimeterX
7 |
8 | ### How to Identify PerimeterX
9 | Use [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md); you will see it in the Security tab
10 | ![PerimeterX](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Images/Antibot/Perimterx1.png)
11 |
12 | ### Recommended approach to PerimeterX
13 | While the scraper is running, a challenge like the one in the picture may be triggered after some pages, blocking the execution.
A full browser is needed to avoid triggering the captcha, together with some random mouse movements and timed pauses before moving to another page.
14 |
15 | **BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md)
16 |
17 | ### Reference and interesting links
18 | [Official web page](https://www.perimeterx.com/products/bot-defender)
19 | [How Perimeterx works](https://www.trickster.dev/post/how-does-perimeterx-bot-defender-work/)
20 |
21 |
--------------------------------------------------------------------------------
/Pages/Tools/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ohld/webscraping-open-project/4a4b90d6efe25103ec764af1ea6daf98bd639fc0/Pages/Tools/.DS_Store
--------------------------------------------------------------------------------
/Pages/Tools/Playwright.md:
--------------------------------------------------------------------------------
1 | # Playwright
2 |
3 | ## What is Playwright?
4 | [Playwright](https://playwright.dev/ "Official website") is a testing tool for web applications, first released in 2020 and also useful for web scraping.
5 |
6 | ## Our View on Playwright
7 |
8 | ### Usage Rating of Playwright
9 | **BEST CHOICE**: This is among the preferred tools we use
10 | It's the best choice when a fully rendered browser is needed to scrape a website.
11 |
12 | ### Configuration
13 | The best configuration we've found to date against anti-bot systems consists of:
14 | 1. [playwright_stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) module
15 | 2. a function to randomize mouse movement
16 | 3. a consistent combination of emulated device and browser
17 | 4.
slow_mo option to slow down the browser's operations
18 | 5. headless mode
19 |
20 | You can find our standard base configuration here.
21 |
22 | ### When to use Playwright
23 | This configuration is more compute-intensive than a simple Scrapy installation, so we use it only when a fully rendered browser is needed. Currently it works quite well against:
24 |
25 | - [Perimeterx](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/PerimeterX.md)
26 | - [Cloudflare](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Cloudflare.md)
27 | - [Datadome](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Datadome.md)
28 |
29 | ### Reference and interesting links
30 | [Official website](https://playwright.dev/)
31 |
32 | [Our tests](https://chrome.google.com/webstore/detail/wappalyzer-technology-pro/gppongmhjkpfnbhagpmjfkannfbllamg?hl=it)
33 |
34 | [Article: Making Chrome headless undetectable](https://intoli.com/blog/making-chrome-headless-undetectable/)
35 |
36 |
37 |
--------------------------------------------------------------------------------
/Pages/Tools/Playwright_stealth.md:
--------------------------------------------------------------------------------
1 | # playwright_stealth
2 |
3 | ## What is playwright_stealth?
4 | [playwright_stealth](https://github.com/AtuboDad/playwright_stealth "Official website") is a [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) module that disguises the bot so that it looks like a regular browsing session. It changes browser settings so that the session resembles a real user on a real browser rather than an automation tool.
5 |
6 | ## Our View on playwright_stealth
7 |
8 | ### Usage Rating of playwright_stealth
9 | **BEST CHOICE**: This is among the preferred tools we use
10 | At the moment it is the best module for obfuscating Playwright settings.
11 |
12 | ### When to use playwright_stealth
13 | We consider it a default option when starting a [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) scraper.
14 |
15 | ### Reference and interesting links
16 | [GitHub repository](https://github.com/AtuboDad/playwright_stealth)
17 |
18 | [Our quick test on Playwright against Cloudflare protected website](https://reanalytics.freshdesk.com/discussions/topics/28000008894)
19 |
20 |
21 |
--------------------------------------------------------------------------------
/Pages/Tools/Puppeteer.md:
--------------------------------------------------------------------------------
1 | # Puppeteer
2 |
3 | ## What is Puppeteer?
4 | [Puppeteer](https://pptr.dev/ "Official website") is a browser automation tool useful for web scraping.
5 |
6 | ## Our View on Puppeteer
7 |
8 | ### Usage Rating of Puppeteer
9 | **SECOND CHOICE**: Although not preferred, still an acceptable choice. It is controlled via JavaScript, which is not our language of choice.
10 | Its features are very similar to those of [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) (part of the Playwright team came from the Puppeteer project), so we prefer Playwright for its multi-browser support.
11 |
12 | ### When to use Puppeteer
13 | When we need a full browser to render the page and neither [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) nor [Selenium](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Selenium.md) can help. This ranking is only due to the fact that we're Python experts and have no experience in JavaScript.
14 |
15 | ### Reference and interesting links
16 | [Official website](https://pptr.dev/)
17 |
18 | [GitHub Wiki](https://github.com/berstend/puppeteer-extra/wiki)
19 |
20 | [GitHub Repository](https://github.com/puppeteer/puppeteer/)
21 |
22 |
23 |
--------------------------------------------------------------------------------
/Pages/Tools/Scrapy.md:
--------------------------------------------------------------------------------
1 | # Scrapy
2 |
3 | ## What is Scrapy?
4 | [Scrapy](https://scrapy.org/) is a Python framework for web scraping, maintained by [Zyte](https://www.zyte.com/).
5 |
6 | ## Our View on Scrapy
7 |
8 | ### Usage Rating of Scrapy
9 | **BEST CHOICE**: This is among the preferred tools we use
10 | Scrapy is our best choice for every website without a specific anti-bot tool. It's the de facto industry standard for web scraping in Python.
11 |
12 | ### Configuration
13 | With proper default headers, a small number of concurrent requests, and a delay between them, you can scrape many common websites.
14 | Inside a standard settings.py file you will find the following entries:
15 |
16 | `USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"`
17 |
18 | This option assigns a specific user agent so that your bot presents itself as a genuine user, in this case a Chrome browser.
19 |
20 | `ROBOTSTXT_OBEY = True`
21 |
22 | This option tells the scraper whether to follow the rules in the target website's robots.txt file. For fair web scraping, it should be set to True.
23 |
24 | `CONCURRENT_REQUESTS = 3`
25 |
26 | Maximum number of concurrent requests the Scrapy downloader will perform. The right value depends on the size of the target, but in our opinion it should not exceed 10, to avoid overloading the target website's servers and triggering anti-bot protection systems.
27 |
28 | `DOWNLOAD_DELAY = 1`
29 |
30 | Minimum number of seconds Scrapy waits between consecutive requests to the same website (the number of requests in flight is capped by the CONCURRENT_REQUESTS option).
31 |
32 | The standard installation can be extended with Python modules that add capabilities:
33 |
34 | * [advanced_scrapy_proxies](https://github.com/reanalytics-databoutique/advanced-scrapy-proxies): a module that handles external proxy lists, picks proxies at random, and discards the ones that don't work
35 | * [scrapy_splash](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Scrapy_splash.md): renders a page's JavaScript via an API
36 | * [selenium webdriver](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Selenium.md): for when you need a full headless browser
37 |
38 | ### Our standard and best practices
39 | Please read our standards and best practices for web scraping in Python before implementing a scraper for a new website with Scrapy.
40 |
41 | ### When to use Scrapy
42 | Whenever possible, the first attempt to scrape a website should always use a standard Scrapy configuration (unless the anti-bot analysis we performed already shows it's not enough).
43 |
44 | ### Reference and interesting links
45 | [Official website](https://scrapy.org/)
46 |
47 | [Short tutorial](https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-i-11e350bcdec0)
48 |
49 |
50 |
--------------------------------------------------------------------------------
/Pages/Tools/Scrapy_splash.md:
--------------------------------------------------------------------------------
1 | # Scrapy_splash (Scrapy module)
2 |
3 | ## What is scrapy_splash?
4 | [Splash](https://github.com/scrapy-plugins/scrapy-splash "Splash") is the JavaScript rendering engine maintained by Zyte, and scrapy_splash is the Python module that integrates it with Scrapy.
5 |
6 | ## Our View on scrapy_splash
7 |
8 | ### Usage Rating of scrapy_splash
9 | **BEST CHOICE**: This is among the preferred tools we use
10 |
11 | ### Configuration
12 | In the settings.py file of the [Scrapy](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Scrapy.md) project we need to enable the middleware, as described in the [official GitHub repository](https://github.com/scrapy-plugins/scrapy-splash).
13 | We also need to set the SPLASH_URL variable to the address of a running Splash server. You can install and run one via Docker, as explained in the GitHub repository.
14 |
15 | ### When to use scrapy_splash
16 | This Python module is needed when the website poses JavaScript challenges, typically while loading its first page.
17 | Combined with a proper Scrapy settings configuration, it is usually enough to solve reCAPTCHA JavaScript challenges.
18 |
19 | ### Reference and interesting links
20 | [GitHub repository with instructions and examples](https://github.com/scrapy-plugins/scrapy-splash)
21 |
22 | [Useful article](https://www.zyte.com/blog/handling-javascript-in-scrapy-with-splash/)
23 |
24 | [Another How-to](https://medium.com/@shahwaiz055/scrapy-splash-400a03a829bf)
25 |
26 |
--------------------------------------------------------------------------------
/Pages/Tools/Selenium.md:
--------------------------------------------------------------------------------
1 | # Selenium Webdriver
2 |
3 | ## What is Selenium Webdriver?
4 | [Selenium Webdriver](https://www.selenium.dev/documentation/overview/ "Selenium") is a web application testing suite that is also used for web scraping.
5 |
6 | ## Our View on Selenium Webdriver
7 |
8 | ### Usage Rating of Selenium Webdriver
9 | **SECOND BEST**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) is preferred to Selenium Webdriver, as it behaves more like a real user and is easier to install and configure.
10 |
11 | ### When to use Selenium Webdriver
12 | When a website requires a fully rendered browser, and [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) does not solve the issue.
13 | ### Reference and interesting links
14 | [Official website](https://www.selenium.dev/documentation/overview/)
15 |
--------------------------------------------------------------------------------
/Pages/Tools/Wappalyzer.md:
--------------------------------------------------------------------------------
1 | # Wappalyzer Chrome Extension
2 |
3 | ## What is Wappalyzer Chrome Extension?
4 | [Wappalyzer](https://chrome.google.com/webstore/detail/wappalyzer-technology-pro/ "Wappalyzer") is a Chrome browser extension that uncovers the technologies used on websites. It detects content management systems, eCommerce platforms, web servers, JavaScript frameworks, analytics tools and many more.
5 |
6 | ## Our View on Wappalyzer Chrome Extension
7 |
8 | ### Usage Rating of Wappalyzer Chrome Extension
9 | We haven't tried many tools, but this one seems to have good coverage. It is very helpful for deciding how to approach a scraping job. Under the Security tab you can see which anti-bot system the target website uses.
10 |
11 | ![Security tab](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Images/Tools/Wappalyzer1.png "Security tab")
12 |
13 | Under the Ecommerce tab you can see which commercial software, if any, was used to build the website.
14 | ![Ecommerce tab](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Images/Tools/Wappalizer2.png "Ecommerce tab")
15 |
16 | ### When to use Wappalyzer Chrome Extension
17 | Always use this tool before starting a web scraping activity, to identify which technologies are in use and go straight to a possible solution.
18 |
19 | ### Reference and interesting links
20 | [Official web page](https://www.wappalyzer.com/)
21 |
22 | [Download link](https://chrome.google.com/webstore/detail/wappalyzer-technology-pro/gppongmhjkpfnbhagpmjfkannfbllamg?hl=it)
23 |
24 | [List of applications Wappalyzer detects](https://wappalyzer.com/applications)
25 |
26 | [Firefox version](https://addons.mozilla.org/en-US/firefox/addon/wappalyzer/)
27 |
28 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Web scraping with Python open knowledge
2 | During the past several years at [Re Analytics](http://re-analytics.com/ "Re Analytics website") we've spent a lot of time finding the best practices for web scraping, to make it scalable and efficient to maintain.
3 | It's a cat-and-mouse game: you always need to be up to date on the latest developments but, at the same time, the information you need is scattered across the net.
4 | For this reason, we started to centralize all the information we collected and the best practices we developed, to build a point of reference for the Python web scraping community.
5 | Feel free to contribute to this repository: sharing knowledge increases its value for everyone.
6 |
7 | ## Why Use Best Practices
8 | Our goal is to scrape as many sites as we can, so we've always looked for the key elements that make a large-scale web scraping project successful.
At the moment they focus on scraping e-commerce websites, because that's what we've done for years, but we're open to integrating best practices from other industries.
9 | - **Resilient execution**: We want the code to be as low maintenance as possible
10 | - **Faster maintenance**: We work smarter if we find standard solutions, and do not have to decode creative creations every time.
11 | - **Regulatory compliance**: web scraping is serious business; we need to know exactly which tools are used.
12 | The following practices are always evolving, so feel free to suggest yours.
13 |
14 | ### 1. Preliminary Study
15 |
16 | #### 1.1. Technology Stack
17 | Perform a technology stack evaluation for the target website using [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md), paying attention to the "Security" block.
18 | When a technology is detected under the "Security" section, check whether this list of technologies includes a specific solution for it.
19 | #### 1.2. API Search
20 | Does the website have internal or public APIs for fetching price/product data? If so, this is the best-case scenario and we should use them to gather data.
21 | #### 1.3. JSON in HTML Search
22 | Sometimes websites embed JSON in their HTML, not only when there's an API. Finding it makes the scraper more stable.
23 | #### 1.4. Pagination
24 | How does the website handle pagination of the product catalogue? Internal services that return the catalogue's HTML are preferred over loading the full page.
25 | ### 2. Code Best Practices
26 | #### 2.1. JSON
27 | Use JSON if available (in the page's HTML or from an API). It's less prone to change.
28 | #### 2.2. XPATHS
29 | Use XPath expressions rather than CSS selectors, for clearer code.
30 | #### 2.3.
Indent using TABS
31 | Use tabs instead of spaces for indentation: files are smaller and badly indented structures are easier to spot.
32 | #### 2.4. No formatting rules in numeric fields
33 | Don't add rules for cleaning prices or other numeric fields: formats vary between countries and are not standardized, so leave this task to the post-scraping phases in the DBs.
34 | #### 2.5. Product List Page wins over Single Product Page
35 | Load as few pages as you can. Check whether all the fields you need are available on the product catalogue pages, and avoid entering single product pages where possible.
36 |
37 | ### 3. Tools
38 | #### 3.1. Headless Python scrapers
39 | - [Scrapy](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Scrapy.md)
40 | - [scrapy_splash](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Scrapy_splash.md)
41 |
42 | #### 3.2. Python scrapers with fully rendered browsers
43 | - [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md)
44 | - [playwright_stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md)
45 |
46 | #### 3.3. Non-Python scrapers with fully rendered browsers
47 | - [Puppeteer](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Puppeteer.md)
48 |
49 | ### 4. Common anti-bot software & techniques
50 | #### 4.1.
Anti-bot Software
51 | - [Akamai](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Akamai.md)
52 | - [Cloudflare](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Cloudflare.md)
53 | - [Datadome](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Datadome.md)
54 | - [PerimeterX](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/PerimeterX.md)
55 | - Forter
56 | - Riskified
57 | #### 4.2. Anti-bot Techniques
58 | - [Canvas Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Canvasfingerprint.md)
59 | - WebGL
60 | - Browser Fingerprinting
61 |
62 | ### 5. Test websites
63 | Here's a list of websites where you can test your scraper and find out how many checks it passes:
64 | - [https://bot.incolumitas.com/](https://bot.incolumitas.com/) one of the most complete sets of tests for your scrapers
65 | - [https://pixelscan.net/](https://pixelscan.net/) checks your IP and your machine
66 |
--------------------------------------------------------------------------------
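Practices 1.3 and 2.1 in the README recommend reading JSON embedded in the page HTML whenever it is available. The sketch below shows one possible way to do that with the Python standard library only, assuming the target page exposes its product data in a `<script type="application/ld+json">` tag. The names `JsonLdExtractor`, `extract_json_ld` and `SAMPLE_HTML` are hypothetical and exist only for this illustration; real pages may embed JSON in other ways.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only for JSON-LD script tags.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_json_ld(html):
    """Return every JSON-LD object found in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page fragment, used only to exercise the function.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Sample sneaker",
 "offers": {"price": "129,90", "priceCurrency": "EUR"}}
</script>
</head><body><h1>Sample sneaker</h1></body></html>
"""

products = extract_json_ld(SAMPLE_HTML)
# The price is kept as the raw string found on the page (see practice 2.4).
print(products[0]["name"], products[0]["offers"]["price"])
```

In a Scrapy spider the same function could be called on `response.text`; the price is deliberately left unparsed, in line with practice 2.4.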