├── .gitignore
├── Images
│   ├── Antibot
│   │   ├── 1200px-Open_proxy_h2g2bob.png
│   │   └── Perimterx1.png
│   └── Tools
│       ├── 1200px-Open_proxy_h2g2bob.jpg
│       ├── Wappalizer2.png
│       └── Wappalyzer1.png
├── Pages
│   ├── 1.Before Scraping
│   │   ├── Buy or Make.md
│   │   ├── Privacy and copyright.md
│   │   └── Reading Terms.md
│   ├── 3.Free Tools
│   │   ├── Crawlee.md
│   │   ├── Playwright.md
│   │   ├── Playwright_stealth.md
│   │   ├── Puppeteer.md
│   │   ├── Scrapy.md
│   │   ├── Scrapy_splash.md
│   │   ├── Selenium.md
│   │   └── Wappalyzer.md
│   ├── 4.Commercial tools
│   │   └── Proxies.md
│   └── 5.Antibot
│       ├── Akamai.md
│       ├── Akamai_WP_Passive_Fingerprinting.pdf
│       ├── Browserfingerprint.md
│       ├── Canvasfingerprint.md
│       ├── Cloudflare.md
│       ├── Datadome.md
│       ├── Devicefingerprint.md
│       ├── HttpFingerprint.md
│       ├── Kasada.md
│       ├── Passivefingerprint.md
│       ├── PerimeterX.md
│       ├── Shape.md
│       ├── TLSFingerprint.md
│       ├── TcpFingerprint.md
│       └── Webglfingerprint.md
└── README.md
/.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | -------------------------------------------------------------------------------- /Images/Antibot/1200px-Open_proxy_h2g2bob.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TheWebScrapingClub/webscraping-from-0-to-hero/fcb31693a4fc9f81be561fbbbc44afdf4aa454d1/Images/Antibot/1200px-Open_proxy_h2g2bob.png -------------------------------------------------------------------------------- /Images/Antibot/Perimterx1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TheWebScrapingClub/webscraping-from-0-to-hero/fcb31693a4fc9f81be561fbbbc44afdf4aa454d1/Images/Antibot/Perimterx1.png -------------------------------------------------------------------------------- /Images/Tools/1200px-Open_proxy_h2g2bob.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/TheWebScrapingClub/webscraping-from-0-to-hero/fcb31693a4fc9f81be561fbbbc44afdf4aa454d1/Images/Tools/1200px-Open_proxy_h2g2bob.jpg -------------------------------------------------------------------------------- /Images/Tools/Wappalizer2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TheWebScrapingClub/webscraping-from-0-to-hero/fcb31693a4fc9f81be561fbbbc44afdf4aa454d1/Images/Tools/Wappalizer2.png -------------------------------------------------------------------------------- /Images/Tools/Wappalyzer1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TheWebScrapingClub/webscraping-from-0-to-hero/fcb31693a4fc9f81be561fbbbc44afdf4aa454d1/Images/Tools/Wappalyzer1.png -------------------------------------------------------------------------------- /Pages/1.Before Scraping/Buy or Make.md: -------------------------------------------------------------------------------- 1 | # Do I really need to scrape that website, or can I buy pre-scraped data? 2 | Before scraping by yourself, check whether the data you need is already available in one of the following data marketplaces: 3 | 4 | [Databoutique.com](https://www.databoutique.com/) the first data marketplace designed for web-scraped data, recently launched 5 | 6 | [Aws Data Exchange](https://aws.amazon.com/data-exchange/) a data marketplace with 3500+ datasets 7 | -------------------------------------------------------------------------------- /Pages/1.Before Scraping/Privacy and copyright.md: -------------------------------------------------------------------------------- 1 | # Is the data I want to scrape compliant with privacy laws, or copyrighted? 
2 | **This is not legal advice; please refer to a lawyer if you're in doubt** 3 | Another aspect to consider before starting a web scraping project is the kind of data we're retrieving. 4 | ## Personal data or PII 5 | Unless you have the person's explicit consent, it is illegal to scrape an EU resident's personal data under GDPR, and this alone should be enough to stop you from gathering any personal data. It's very difficult to know, before scraping, the citizenship of the person whose data is going to be collected, and in any case there are similar rules in other countries, making the scraping of personal data prohibitive. 6 | 7 | This great article by [Zyte](https://www.zyte.com/blog/web-scraping-gdpr-compliance-guide/#:~:text=Scraping%20sensitive%20data%20means%20that,you%20should%20avoid%20scraping%20it.) explains how to stay compliant with GDPR, which applies only in Europe. 8 | 9 | ## Copyrighted Data 10 | [Unless you're OpenAI](https://www.theguardian.com/books/2023/jul/05/authors-file-a-lawsuit-against-openai-for-unlawfully-ingesting-their-books), you cannot scrape copyrighted material and hope to win a case in court. 11 | So limit your operations to what is publicly available and factual, not created by someone who can claim the data as their own. This also means not scraping and storing pictures taken by professional photographers, including not only artistic pictures but also pictures made for fashion websites. 
12 | 13 | -------------------------------------------------------------------------------- /Pages/1.Before Scraping/Reading Terms.md: -------------------------------------------------------------------------------- 1 | # Reading the Terms of Service of the website 2 | **This is not legal advice; please refer to a lawyer if you're in doubt** 3 | This great article by [Apify](https://blog.apify.com/enforceability-of-terms-of-use/) summarizes the different types of Terms of Use and when they are or are not enforceable. 4 | Basically, it depends on whether the user, or the scraper, took some active action to accept them. 5 | - Browsewrap: when the TOS is placed somewhere on the website but the user doesn't need to take any action. Not enforceable in most cases, since the user may never have seen it. 6 | - Clickwrap: when the TOS needs to be accepted with a click by the user. Generally enforceable, since the user actively accepted it, and any breach of the TOS can be punished. 7 | - Scrollwrap: similar to clickwrap, but the user also needs to scroll down the page to the end of the TOS before accepting. Enforceable in most cases too. 8 | - Sign-in-wrap: when you need to log in and somewhere in the UX you accept the TOS. Depending on the UX and on how visible the TOS is, it may be enforceable. 9 | 10 | Generally speaking, it's better not to scrape websites that require a login and an active acceptance of TOS that ban scraping, especially because the Apify study refers to US legislation; the outcomes of cases may differ in other countries. 11 | -------------------------------------------------------------------------------- /Pages/3.Free Tools/Crawlee.md: -------------------------------------------------------------------------------- 1 | # Crawlee 2 | 3 | ## What is Crawlee? 4 | 5 | [Crawlee](https://crawlee.dev/) is a Node.js library for web scraping, maintained by [Apify](https://apify.com/). 
6 | 7 | ## Our View on Crawlee 8 | 9 | ### Usage Rating of Crawlee 10 | 11 | **RISING STAR**: Crawlee is one of the most complete Node.js web scraping libraries. Despite being a relatively new player in the scraping scene, Crawlee is rapidly growing in popularity due to its extensive features and focus on mimicking real-user behavior to bypass website anti-bot protections. 12 | 13 | ### Configuration 14 | 15 | Crawlee comes with three main crawler classes: **CheerioCrawler**, **PuppeteerCrawler** and **PlaywrightCrawler**. All classes share the same interface for maximum flexibility when switching between them. 16 | 17 | #### CheerioCrawler 18 | 19 | A plain HTTP crawler that parses HTML using the Cheerio library. It's very fast and efficient, but can't handle JavaScript rendering. 20 | 21 | #### PuppeteerCrawler 22 | 23 | A headless browser crawler, controlled by the Puppeteer library. It can control Chromium or Chrome. [Puppeteer](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Puppeteer.md) is the de-facto standard in Node.js headless browser automation. 24 | 25 | #### PlaywrightCrawler 26 | 27 | Playwright can be considered the successor to Puppeteer. It can control Chromium, Chrome, Firefox, WebKit and many other browsers. If you're not already familiar with Puppeteer and you need a headless browser, we recommend you go with [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md). 28 | 29 | ### When to use Crawlee 30 | 31 | Crawlee builds on top of popular Node.js libraries such as Cheerio, Puppeteer and Playwright while adding extra functionality such as out-of-the-box anti-blocking features. 32 | 33 | This makes Crawlee a great choice for Node.js web scraping developers looking for a flexible and complete library that takes away the complexity of configuring scrapers to bypass modern anti-bot protections. 
34 | 35 | ### Reference and interesting links 36 | 37 | [Official website](https://crawlee.dev/) 38 | 39 | [Building an Amazon Scraper in Node.js with Crawlee - Written tutorial](https://developers.apify.com/academy/web-scraping-for-beginners/challenge/initializing-and-setting-up) 40 | [Building an Amazon Scraper in Node.js with Crawlee - Video tutorial](https://www.youtube.com/watch?v=yTRHomGg9uQ) 41 | -------------------------------------------------------------------------------- /Pages/3.Free Tools/Playwright.md: -------------------------------------------------------------------------------- 1 | # Playwright 2 | 3 | ## What is Playwright? 4 | [Playwright](https://playwright.dev/ "Official website") is a testing tool for web applications, also useful for web scraping, released in 2021. 5 | 6 | ## Our View on Playwright 7 | 8 | ### Usage Rating of Playwright 9 | **BEST CHOICE**: This is among the preferred tools we use. 10 | It's the best choice when a fully rendered browser is needed to scrape a website. 11 | 12 | ### Configuration 13 | The best configuration we've found to date against anti-bot systems consists of: 14 | 1. the [playwright_stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) module 15 | 2. a function to randomize mouse movement 16 | 3. a consistent combination of emulated device and browser 17 | 4. the slow_mo option to slow down the browser's actions 18 | 5. headless mode 19 | 20 | You can find our standard base configuration here. 
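A minimal sketch of such a configuration follows. It assumes `pip install playwright playwright-stealth` and `playwright install chromium`; the device profile, jitter values and URL are illustrative, not our actual base configuration:

```python
import random


def mouse_path(start, end, steps=25):
    """Jittered linear path between two points, so the cursor
    movement looks less robotic than a straight teleport."""
    (x0, y0), (x1, y1) = start, end
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append((x0 + (x1 - x0) * t + random.uniform(-3, 3),
                     y0 + (y1 - y0) * t + random.uniform(-3, 3)))
    return path


def run(url="https://example.com"):
    # Imported lazily so mouse_path stays usable without a browser install.
    from playwright.sync_api import sync_playwright
    from playwright_stealth import stealth_sync

    with sync_playwright() as p:
        device = p.devices["Desktop Chrome"]  # consistent device/browser pair
        browser = p.chromium.launch(headless=True, slow_mo=100)  # slow_mo in ms
        context = browser.new_context(**device)
        page = context.new_page()
        stealth_sync(page)  # patch common automation tells
        page.goto(url)
        for x, y in mouse_path((0, 0), (200, 300)):
            page.mouse.move(x, y)
        title = page.title()
        browser.close()
        return title
```

The `slow_mo` value and the jitter range are knobs worth tuning per target: too fast looks robotic, too slow wastes compute.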
21 | 22 | ### When to use Playwright 23 | This configuration is more compute-intensive than a simple Scrapy installation, so it is used only when a fully rendered browser is needed. It works pretty well against: 24 | 25 | - [Perimeterx](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/PerimeterX.md) 26 | - [Cloudflare](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Cloudflare.md) 27 | - [Datadome](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Datadome.md) 28 | 29 | ### Reference and interesting links 30 | [Official website](https://playwright.dev/) 31 | 32 | [Our tests](https://chrome.google.com/webstore/detail/wappalyzer-technology-pro/gppongmhjkpfnbhagpmjfkannfbllamg?hl=it) 33 | 34 | [Article: Making chrome headless undetectable](https://intoli.com/blog/making-chrome-headless-undetectable/) 35 | 36 | 37 | -------------------------------------------------------------------------------- /Pages/3.Free Tools/Playwright_stealth.md: -------------------------------------------------------------------------------- 1 | # playwright_stealth 2 | 3 | ## What is playwright_stealth? 4 | [playwright_stealth](https://github.com/AtuboDad/playwright_stealth "Official website") is a [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) module useful for obfuscating the bot and making the session look like regular navigation. It changes browser settings so the session appears much more like a real user on a real browser than an automation tool. 
5 | 6 | ## Our View on playwright_stealth 7 | 8 | ### Usage Rating of playwright_stealth 9 | **BEST CHOICE**: This is among the preferred tools we use. 10 | At the moment it is the best module for obfuscating Playwright's settings. 11 | 12 | ### When to use playwright_stealth 13 | We consider it a default option when starting a [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) scraper. 14 | 15 | ### Reference and interesting links 16 | [Github repository](https://github.com/AtuboDad/playwright_stealth) 17 | 18 | [Our quick test of Playwright against a Cloudflare-protected website](https://reanalytics.freshdesk.com/discussions/topics/28000008894) 19 | 20 | 21 | -------------------------------------------------------------------------------- /Pages/3.Free Tools/Puppeteer.md: -------------------------------------------------------------------------------- 1 | # Puppeteer 2 | 3 | ## What is Puppeteer? 4 | [Puppeteer](https://pptr.dev/ "Official website") is a browser automation tool useful for web scraping. 5 | 6 | ## Our View on Puppeteer 7 | 8 | ### Usage Rating of Puppeteer 9 | **SECOND CHOICE**: Although not preferred, still an acceptable choice. It is controlled via Javascript, which is not our language of choice. 10 | Since its features are very similar to [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) (part of the Playwright team came from the Puppeteer project), we prefer Playwright for its multiple-browser support. 11 | 12 | ### When to use Puppeteer 13 | When we need a full browser rendering the page and [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) and [Selenium](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Selenium.md) can't help. This ranking comes only from the fact that we're Python experts and don't have any experience in Javascript. 
14 | 15 | ### Reference and interesting links 16 | [Official website](https://pptr.dev/) 17 | 18 | [Github Wiki](https://github.com/berstend/puppeteer-extra/wiki) 19 | 20 | [Github Repository](https://github.com/puppeteer/puppeteer/) 21 | 22 | 23 | -------------------------------------------------------------------------------- /Pages/3.Free Tools/Scrapy.md: -------------------------------------------------------------------------------- 1 | # Scrapy 2 | 3 | ## What is Scrapy? 4 | [Scrapy](https://scrapy.org/) is a Python framework for web scraping, maintained by [Zyte](https://www.zyte.com/). 5 | 6 | ## Our View on Scrapy 7 | 8 | ### Usage Rating of Scrapy 9 | **BEST CHOICE**: This is among the preferred tools we use. 10 | Scrapy is our best choice for every website that doesn't have any particular anti-bot tool. It's the de-facto industry standard for web scraping in Python. 11 | 12 | ### Configuration 13 | With proper default headers, a small number of concurrent requests on the website and a delay between them, you can scrape many of the common websites. 14 | Inside a standard settings.py file you will find the following entries: 15 | 16 | `USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"` 17 | 18 | This option identifies your bot as a genuine user by assigning it a specific user agent, in this case a Chrome browser. 19 | 20 | `ROBOTSTXT_OBEY = True` 21 | 22 | This option indicates whether the scraper should follow the rules written in the target website's robots.txt file. For fair web scraping practice, it should be set to True. 23 | 24 | `CONCURRENT_REQUESTS = 3` 25 | 26 | The number of concurrent requests Scrapy can make to the target website. Depending on the target's size this can vary, but in our opinion it should not exceed 10, to avoid overloading the target's servers and triggering anti-bot protection systems. 
27 | 28 | `DOWNLOAD_DELAY = 1` 29 | 30 | The number of seconds of delay between requests in each thread (the thread count is set by the CONCURRENT_REQUESTS option). 31 | 32 | Its standard installation can be integrated with Python modules that extend its capabilities: 33 | 34 | * [advanced_scrapy_proxies](https://github.com/reanalytics-databoutique/advanced-scrapy-proxies): a module to handle external lists of proxies, using them randomly and deleting non-working ones 35 | * [scrapy_splash](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Scrapy_splash.md): to render javascript code in a web page via an API 36 | * [selenium webdriver](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Selenium.md): when you need a full headless browser 37 | 38 | ### Our standard and best practices 39 | Please read our standards and best practices for web scraping in Python before implementing a new website with Scrapy. 40 | 41 | ### When to use Scrapy 42 | Whenever possible, the first attempt to scrape a website should always be with a standard configuration of Scrapy (unless we already know from the anti-bot analysis we performed that it's not enough). 43 | 44 | ### Reference and interesting links 45 | [Official website](https://scrapy.org/) 46 | 47 | [Short tutorial](https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-i-11e350bcdec0) 48 | 49 | 50 | -------------------------------------------------------------------------------- /Pages/3.Free Tools/Scrapy_splash.md: -------------------------------------------------------------------------------- 1 | # Scrapy_splash (Scrapy module) 2 | 3 | ## What is scrapy_splash? 4 | [Splash](https://github.com/scrapy-plugins/scrapy-splash "Splash") is a javascript rendering engine maintained by Zyte, and scrapy_splash is the Python module that integrates it with Scrapy. 
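As a sketch, the typical settings.py fragment for enabling it, taken from the plugin's official repository, looks like this (the Splash URL assumes a local Docker instance on the default port 8050):

```python
# settings.py fragment for scrapy-splash, per the plugin's README.
# Assumes a Splash server running locally, e.g. via:
#   docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With this in place, a spider issues `scrapy_splash.SplashRequest` instead of a plain `Request` for pages that need javascript rendering.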
5 | 6 | ## Our View on scrapy_splash 7 | 8 | ### Usage Rating of scrapy_splash 9 | **BEST CHOICE**: This is among the preferred tools we use 10 | 11 | ### Configuration 12 | In the settings.py file of the [scrapy](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Scrapy.md) project we need to enable the middleware, as stated in the [official github repository](https://github.com/scrapy-plugins/scrapy-splash). 13 | We also need to declare the SPLASH_URL variable, which points to the address of a running Splash server. You can install and run it via Docker as explained in the github repository. 14 | 15 | ### When to use scrapy_splash 16 | This Python module is required when there are javascript challenges to solve on the website, typically when loading its first page. 17 | It is usually enough to solve reCaptcha javascript challenges (combined with a proper Scrapy settings configuration). 18 | 19 | ### Reference and interesting links 20 | [Github repository with instructions and examples](https://github.com/scrapy-plugins/scrapy-splash) 21 | 22 | [Useful article](https://www.zyte.com/blog/handling-javascript-in-scrapy-with-splash/) 23 | 24 | [Another How-to](https://medium.com/@shahwaiz055/scrapy-splash-400a03a829bf) 25 | 26 | -------------------------------------------------------------------------------- /Pages/3.Free Tools/Selenium.md: -------------------------------------------------------------------------------- 1 | # Selenium Webdriver 2 | 3 | ## What is Selenium Webdriver? 4 | [Selenium Webdriver](https://www.selenium.dev/documentation/overview/ "Official website") is a web application testing suite also used for web scraping. 
5 | 6 | ## Our View on Selenium Webdriver 7 | 8 | ### Usage Rating of Selenium Webdriver 9 | **SECOND BEST**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) is preferred to Selenium Webdriver, as it behaves more like a real user and it's easier to install and configure. 10 | 11 | ### When to use Selenium Webdriver 12 | When a website requires a fully rendered browser, and [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) does not solve the issue. 13 | ### Reference and interesting links 14 | [Official website](https://www.selenium.dev/documentation/overview/) 15 | -------------------------------------------------------------------------------- /Pages/3.Free Tools/Wappalyzer.md: -------------------------------------------------------------------------------- 1 | # Wappalyzer Chrome Extension 2 | 3 | ## What is Wappalyzer Chrome Extension? 4 | [Wappalyzer](https://chrome.google.com/webstore/detail/wappalyzer-technology-pro/ "Wappalyzer") is a chrome browser extension that uncovers the technologies used on websites. It detects content management systems, eCommerce platforms, web servers, JavaScript frameworks, analytics tools and many others. 5 | 6 | ## Our View on Wappalyzer Chrome Extension 7 | 8 | ### Usage Rating of Wappalyzer Chrome Extension 9 | We didn't try many tools, but this one seems to have good coverage. Very helpful for deciding how to proceed in web scraping. Under the Security tab you can find which anti-bot system is used on the target website. 10 | 11 | ![Security tab](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Images/Tools/Wappalyzer1.png "Security tab") 12 | 13 | Under the Ecommerce tab you can find which commercial software, if any, is used to build the website. 
14 | ![Ecommerce tab](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Images/Tools/Wappalizer2.png "Ecommerce tab") 15 | 16 | ### When to use Wappalyzer Chrome Extension 17 | Always use this tool before starting a web scraping activity, to identify which technologies are in use and go straight to the likely solution. 18 | 19 | ### Reference and interesting links 20 | [Official web page](https://www.wappalyzer.com/) 21 | 22 | [Download link](https://chrome.google.com/webstore/detail/wappalyzer-technology-pro/gppongmhjkpfnbhagpmjfkannfbllamg?hl=it) 23 | 24 | [List of applications Wappalyzer detects](https://wappalyzer.com/applications) 25 | 26 | [Firefox version](https://addons.mozilla.org/en-US/firefox/addon/wappalyzer/) 27 | 28 | -------------------------------------------------------------------------------- /Pages/4.Commercial tools/Proxies.md: -------------------------------------------------------------------------------- 1 | # Proxies 2 | 3 | ## What is a proxy? 4 | Generally speaking, a proxy is a server that sits between a client and the target server. 5 | ![Proxy Image from wikipedia](https://github.com/reanalytics-databoutique/webscraping-open-project/blob/main/Images/Tools/1200px-Open_proxy_h2g2bob.jpg) 6 | They are typically used in web scraping projects for the following reasons: 7 | - IP rotation: a single scraper execution connects through multiple proxies, so the target website doesn't see that requests are coming from a single source 8 | - Avoiding geoblocking: some websites behave differently in different geographical areas. 9 | 10 | ## What kind of proxies are available? 11 | Proxies can be divided into groups by their anonymity level or by the device they run on. 12 | According to the anonymity level we have: 13 | - **Transparent Proxies**: the incoming request is not altered, and the request made by the proxy specifies the original IP. 
No information about the original request or user is hidden. 14 | - **Anonymous Proxies**: an anonymous proxy does not reveal your IP address, but it does reveal that you are using a proxy server. 15 | - **High Anonymity Proxies (a.k.a. Elite Proxies)**: elite proxy servers hide both your IP address and the fact that you are using a proxy server at all. These are the most advanced proxies and offer the most security. 16 | For a more in-depth analysis, please check [this useful article](https://proxyscrape.com/blog/proxy-anonymity-levels) 17 | 18 | As said before, proxies can also be divided by the device they run on: 19 | - **Datacenter proxies** typically run on servers; if your target website uses Canvas Fingerprinting and blocks visits from these machines, they can be blocked. 20 | - **Residential proxies** instead run on devices outside datacenters (typically studios, offices and so on). They're typically less reliable, but if they run on desktops they can be used to avoid fingerprinting. 21 | - **Mobile proxies**, if the device is a mobile phone/tablet. 22 | 23 | ## Where to find proxies ready to use? 24 | There are websites that offer lists of free proxies ready to use. They are free but not very reliable, so the list should be checked frequently, and you'll probably discard a lot of entries because they don't work: 25 | - **[https://free-proxy-list.net/](https://free-proxy-list.net/)** 26 | - **[https://spys.one/en/](https://spys.one/en/)** 27 | - **[https://www.freeproxylists.net/](https://www.freeproxylists.net/)** 28 | 29 | For professional purposes there are several IP providers that offer their services, like: 30 | - **[BrightData](https://brightdata.com/lp/proxy-network)** 31 | - **[Oxylabs](https://oxylabs.io/)** 32 | - **[Zyte](https://www.zyte.com/smart-proxy-manager/)** 33 | 34 | ## How to use these proxies in my projects? 
35 | If you're using Scrapy, we developed a Python package called [advanced-scrapy-proxies](https://github.com/reanalytics-databoutique/advanced-scrapy-proxies) that, given a list of proxies (from a remote URL or a local file), handles the proxy rotation. 36 | For generic Python scripts you can instead use the proxy options of Python Requests, as [this example](https://reqbin.com/code/python/bnnyomhw/python-requests-proxy-example) shows. -------------------------------------------------------------------------------- /Pages/5.Antibot/Akamai.md: -------------------------------------------------------------------------------- 1 | # Akamai Bot Manager 2 | 3 | ## What is Akamai Bot Manager? 4 | [Akamai Bot Manager](https://www.akamai.com/products/bot-manager "Akamai") detects bots using device fingerprinting, bot signatures and IP checks. 5 | 6 | ## Our View on Akamai Bot Manager 7 | 8 | ### How to Identify Akamai Bot Manager 9 | Use the [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md) 10 | 11 | ### Recommended approach to Akamai Bot Manager 12 | 13 | **BEST CHOICE**: a standard configuration of Akamai requires a good proxy rotation to be beaten; there's no need for a fully rendered browser. If there's no need to log in, rotating datacenter proxies are usually enough. 
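A sketch of what per-request proxy rotation looks like in plain Python with Requests (the proxy addresses are hypothetical placeholders, not real endpoints):

```python
import random

# Hypothetical pool of rotating datacenter proxies; replace with the
# endpoints supplied by your proxy provider.
PROXY_POOL = [
    "http://user:pass@dc1.proxyprovider.example:8000",
    "http://user:pass@dc2.proxyprovider.example:8000",
    "http://user:pass@dc3.proxyprovider.example:8000",
]


def proxy_config(pool):
    """Pick a random proxy and build the mapping `requests` expects."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}


def fetch(url, pool=PROXY_POOL):
    # Imported lazily so proxy_config works without requests installed.
    import requests
    return requests.get(url, proxies=proxy_config(pool), timeout=10)
```

Each call to `fetch` goes out through a different proxy, so consecutive requests don't share a source IP from the target's point of view.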
14 | 15 | ### How to bypass Akamai according to The Web Scraping Club 16 | [Posts about Akamai](https://substack.thewebscraping.club/t/akamai) 17 | 18 | 19 | ### Reference and interesting links 20 | [Official web page](https://www.akamai.com/products/bot-manager) 21 | [High level description](https://www.zenrows.com/blog/bypass-akamai) 22 | -------------------------------------------------------------------------------- /Pages/5.Antibot/Akamai_WP_Passive_Fingerprinting.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TheWebScrapingClub/webscraping-from-0-to-hero/fcb31693a4fc9f81be561fbbbc44afdf4aa454d1/Pages/5.Antibot/Akamai_WP_Passive_Fingerprinting.pdf -------------------------------------------------------------------------------- /Pages/5.Antibot/Browserfingerprint.md: -------------------------------------------------------------------------------- 1 | # Browser Fingerprinting 2 | 3 | ## What is Browser Fingerprinting? 4 | Browser fingerprinting is a powerful method that websites use to collect information about your browser type and version, as well as your operating system, active plugins, time zone, language, screen resolution and various other active settings. Like other fingerprinting techniques, these datapoints are matched against a database of known profiles to see if the visitor is a real person or an automated program. 
5 | To get all these details, some of the techniques used are: 6 | - [Canvas Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Canvasfingerprint.md) 7 | - [WebGL Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Webglfingerprint.md) 8 | - [Device Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Devicefingerprint.md) 9 | 10 | ### Reference and interesting links 11 | [Browser Fingerprint Intro](https://www.avast.com/c-what-is-browser-fingerprinting) 12 | -------------------------------------------------------------------------------- /Pages/5.Antibot/Canvasfingerprint.md: -------------------------------------------------------------------------------- 1 | # Canvas Fingerprint 2 | 3 | ## What is Canvas Fingerprint? 4 | Canvas Fingerprint is an anti-bot technique that can be found in some installations of Cloudflare Bot Management. 5 | It consists of a javascript snippet that renders an image in the background when the website loads: similar hardware configurations produce the same result, which allows the anti-bot to trigger a captcha if the hardware configuration looks like a server machine. It may happen that the same scraper works on a workstation but does not work on a VM-hosted server. 6 | 7 | 8 | ## Our View on Canvas Fingerprint 9 | 10 | ### How to Identify Canvas Fingerprint 11 | As far as we know, at the moment Cloudflare (in some configurations) and reCaptcha v3 use this solution (not visible from Wappalyzer). A homemade bot for analyzing anti-bot measures can be found on the "[Scraping Enthusiast](https://discord.gg/Y8yuF55m7j)" Discord server, managed by the developers of Puppeteer; it can recognize the Canvas Fingerprint using a series of queries on the website. 
12 | 13 | 14 | ### Recommended approach to Canvas Fingerprint 15 | **BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) + hosting on a workstation 16 | 17 | ### Reference and interesting links 18 | [How Canvas Fingerprinting works](https://fingerprintjs.com/blog/canvas-fingerprinting/) 19 | [How Canvas Fingerprinting works 2](https://browserleaks.com/canvas#how-does-it-work) 20 | [Test if a VM is detected here](https://fingerprintjs.com/products/bot-detection/) 21 | -------------------------------------------------------------------------------- /Pages/5.Antibot/Cloudflare.md: -------------------------------------------------------------------------------- 1 | # Cloudflare Bot Management 2 | 3 | ## What is Cloudflare Bot Management? 4 | [Cloudflare Bot Management](https://www.cloudflare.com/products/bot-management/ "Cloudflare") detects bots using device fingerprinting, bot signatures and IP checks. 5 | 6 | ## Our View on Cloudflare Bot Management 7 | 8 | ### How to Identify Cloudflare Bot Management 9 | Use the [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md) 10 | 11 | ### Recommended approach to Cloudflare Bot Management 12 | **BEST CHOICE**: Each website can be configured with different degrees of protection. The best approach is using [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + a privacy-focused browser like Brave or an antidetect browser like Gologin. 13 | 14 | A good solution, still to be tested on our side, is to find the IP address of the target website's web server and scrape from there directly. 
15 | 16 | ### How to bypass Cloudflare according to The Web Scraping Club 17 | [Posts about Cloudflare](https://substack.thewebscraping.club/t/cloudflare) 18 | 19 | ### Reference and interesting links 20 | [Official web page](https://www.cloudflare.com/en-gb/products/bot-management/) 21 | 22 | [Our in-depth with Playwright against Cloudflare](https://reanalytics.freshdesk.com/discussions/topics/28000008894) 23 | 24 | [A package to bypass Cloudflare](https://github.com/Anorov/cloudflare-scrape): possibly obsolete, not updated in 2 years 25 | 26 | [Firefox appears to be flagged as suspicious by Cloudflare](https://brianlovin.com/hn/31459258) 27 | 28 | [High-level description](https://www.zenrows.com/blog/bypass-cloudflare#what-is-cloudflare-bot-management) 29 | 30 | -------------------------------------------------------------------------------- /Pages/5.Antibot/Datadome.md: -------------------------------------------------------------------------------- 1 | # Datadome 2 | 3 | ## What is Datadome? 4 | [Datadome](https://datadome.co/ "Datadome") is the self-described "state-of-the-art protection from scraping". They claim to apply statistical and behavioral detection, to detect [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md), to implement client-side detection, and so on. 5 | 6 | ## Our View on Datadome 7 | 8 | ### How to Identify Datadome 9 | Use [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md) 10 | 11 | ### Recommended approach to Datadome 12 | **BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + the Brave browser is a good solution.
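As a complement to Wappalyzer, a quick heuristic check can also be run on the response itself. This is an assumption on our side, not an official spec: Datadome deployments commonly set a `datadome` cookie and `x-datadome*` response headers, so their presence is a strong hint:

```python
def looks_like_datadome(headers: dict, cookies: dict) -> bool:
    """Heuristic sketch: a `datadome` cookie or `x-datadome*` response
    headers suggest the site is behind Datadome. Treat as a hint only."""
    if any(k.lower() == "datadome" for k in cookies):
        return True
    return any(k.lower().startswith("x-datadome") for k in headers)

print(looks_like_datadome({"X-DataDome-CID": "abc"}, {}))  # True
print(looks_like_datadome({"Server": "nginx"}, {}))        # False
```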
13 | 14 | ### How to bypass Datadome according to The Web Scraping Club 15 | [Posts about Datadome](https://substack.thewebscraping.club/t/datadome) 16 | 17 | ### Reference and interesting links 18 | [Official web page](https://datadome.co/) 19 | [Tests made with online tools](https://blog.vanila.io/how-strong-is-the-datadome-5e9ff211384e) 20 | 21 | -------------------------------------------------------------------------------- /Pages/5.Antibot/Devicefingerprint.md: -------------------------------------------------------------------------------- 1 | # Device Fingerprinting 2 | 3 | ## What is Device Fingerprinting? 4 | While device fingerprinting is often used synonymously with browser fingerprinting, it also refers to a particular technique that uncovers a list of all the media devices (and their IDs) on your PC. That includes internal media components such as your audio and video card, as well as any connected devices like headphones. 5 | 6 | ### Recommended approach to Device Fingerprint 7 | **BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) + hosting on a workstation if the configuration of the anti-bot is particularly strict. 8 | 9 | 10 | ### Reference and interesting links 11 | [Device Fingerprint Intro](https://www.netacea.com/glossary/device-fingerprint/) 12 | -------------------------------------------------------------------------------- /Pages/5.Antibot/HttpFingerprint.md: -------------------------------------------------------------------------------- 1 | # HTTP Fingerprinting 2 | 3 | ## What is HTTP Fingerprinting? 4 | HTTP fingerprinting analyzes the HTTP request headers, looking for any incongruence in the parameters.
5 | This means that any incongruence in the User-Agent configuration, or even in the order of the parameters in the HTTP request, can be tagged as suspicious by anti-bot systems. 6 | 7 | ### Reference and interesting links 8 | [Http Fingerprinting Whitepaper](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Akamai_WP_Passive_Fingerprinting.pdf) 9 | -------------------------------------------------------------------------------- /Pages/5.Antibot/Kasada.md: -------------------------------------------------------------------------------- 1 | # Kasada 2 | 3 | ## What is Kasada? 4 | [Kasada](https://www.kasada.io/ "Kasada") is one of the newer anti-bot solutions on the market. 5 | 6 | ## Our View on Kasada 7 | 8 | ### How to Identify Kasada 9 | Unfortunately, [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md) does not work in this case, but the websites protected by this solution show a peculiar behaviour. 10 | The first request to the website returns a 429 error (visible only from the Network tab in the browser's developer tools), then redirects to the same page, which works properly. This second request adds some elements to the response headers, like "x-kpsdk-ct". 11 | 12 | ### Recommended approach to Kasada 13 | **BEST CHOICE**: at the moment, the best approach is [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) using Firefox with the right flags.
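The manual identification check described above (a 429 on the first request, then `x-kpsdk-*` response headers) can be automated with a simple heuristic. Note that a 429 alone is also plain rate limiting, so the `x-kpsdk-*` headers are the stronger signal:

```python
def looks_like_kasada(first_status: int, headers: dict) -> bool:
    """Mirror of the manual check: `x-kpsdk-*` response headers
    (e.g. `x-kpsdk-ct`) are a strong Kasada signal; a 429 on the very
    first request is a weaker one (it can also be simple rate limiting)."""
    has_kpsdk = any(k.lower().startswith("x-kpsdk") for k in headers)
    return has_kpsdk or first_status == 429

print(looks_like_kasada(200, {"x-kpsdk-ct": "abc"}))  # True
print(looks_like_kasada(200, {"server": "nginx"}))    # False
```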
14 | 15 | ### How to bypass Kasada according to The Web Scraping Club 16 | [Posts about Kasada](https://substack.thewebscraping.club/t/kasada) 17 | 18 | ### Reference and interesting links 19 | [Official web page](https://www.kasada.io/) 20 | 21 | -------------------------------------------------------------------------------- /Pages/5.Antibot/Passivefingerprint.md: -------------------------------------------------------------------------------- 1 | # Passive Fingerprinting 2 | 3 | ## What is Passive Fingerprinting? 4 | Passive fingerprinting refers to the passive collection of attributes from a network-connecting client or server. In other words, the attributes sent when connecting to the target website determine a fingerprint. When this is compared to a database of "plausible" fingerprints, the target server can determine whether the connection is trustworthy or not. 5 | The difference between passive fingerprinting and active fingerprinting is that active fingerprinting sends data to "query" the connecting client, while passive fingerprinting does not. 6 | 7 | There are several layers of attributes that can be checked, and for each one there's a different technique: 8 | - [TCP/IP Fingerprint](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/TcpFingerprint.md) 9 | - [TLS fingerprint](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/TLSFingerprint.md) 10 | - [HTTP Fingerprint](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/HttpFingerprint.md) 11 | 12 | ## Possible solutions 13 | Generally speaking, passive fingerprinting techniques block configuration "outliers", so using a plausible, real-world setting in the scraper is the best way to avoid blocks.
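At the HTTP layer, "a plausible, real-world setting" largely means sending browser-like headers in a browser-like order (Python 3.7+ dicts preserve insertion order, and HTTP clients generally send headers in that order). The values below mimic a desktop Chrome session; the exact strings and ordering are illustrative assumptions, not a guaranteed match for any specific Chrome build:

```python
# Browser-like header set: both the values AND the insertion order matter
# to HTTP fingerprinting, so keep them together as one template.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

# The header *names* in order form the fingerprint an anti-bot compares
# against its database of plausible browser configurations.
print(list(BROWSER_LIKE_HEADERS))
```

Mixing incongruent values (e.g. a mobile User-Agent with desktop-only headers) is exactly the kind of "outlier" this page warns about.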
14 | 15 | ### Reference and interesting links 16 | [Akamai Whitepaper on passive fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Akamai_WP_Passive_Fingerprinting.pdf) 17 | 18 | -------------------------------------------------------------------------------- /Pages/5.Antibot/PerimeterX.md: -------------------------------------------------------------------------------- 1 | # PerimeterX 2 | 3 | ## What is PerimeterX? 4 | The [PerimeterX](https://www.perimeterx.com/products/bot-defender "Perimeterx") anti-bot system is a protection system some websites use to block web scraping. One example at the moment is [ssense.com](https://www.ssense.com/). 5 | 6 | ## Our View on PerimeterX 7 | 8 | ### How to Identify PerimeterX 9 | Use [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md); you will see it in the Security tab 10 | ![PerimeterX](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Images/Antibot/Perimterx1.png) 11 | 12 | ### Recommended approach to PerimeterX 13 | During the execution of the scraper, after some pages, a challenge like the one in the picture is triggered, blocking the execution. A full browser is needed to avoid triggering the captcha, together with some random mouse movements and timers before moving to another page.
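The random mouse movements and timers mentioned above can be generated with plain Python and then fed to whatever browser automation you use (for example Playwright's `page.mouse.move`, one call per waypoint). The step counts and jitter ranges below are illustrative assumptions, not tuned values:

```python
import random

def human_pause(low: float = 0.8, high: float = 2.5) -> float:
    """Random delay (in seconds) to sleep between page loads."""
    return random.uniform(low, high)

def mouse_waypoints(start, end, steps: int = 12, jitter: float = 4.0):
    """Waypoints from start to end with per-step jitter, so the cursor
    wanders toward the target instead of teleporting in one straight jump."""
    x0, y0 = start
    x1, y1 = end
    pts = []
    for i in range(1, steps + 1):
        t = i / steps  # linear interpolation plus random wobble
        pts.append((x0 + (x1 - x0) * t + random.uniform(-jitter, jitter),
                    y0 + (y1 - y0) * t + random.uniform(-jitter, jitter)))
    pts[-1] = (float(x1), float(y1))  # land exactly on the target
    return pts

for x, y in mouse_waypoints((0, 0), (640, 360)):
    pass  # in a real scraper: page.mouse.move(x, y)
```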
14 | 15 | **BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + Firefox 16 | 17 | ### How to bypass PerimeterX according to The Web Scraping Club 18 | [Posts about PerimeterX](https://substack.thewebscraping.club/t/perimeterx) 19 | 20 | ### Reference and interesting links 21 | [Official web page](https://www.perimeterx.com/products/bot-defender) 22 | [How PerimeterX works](https://www.trickster.dev/post/how-does-perimeterx-bot-defender-work/) 23 | 24 | -------------------------------------------------------------------------------- /Pages/5.Antibot/Shape.md: -------------------------------------------------------------------------------- 1 | # Shape Bot Defence (renamed F5 Distributed Cloud Bot Defense) 2 | 3 | ## What is Shape Bot Defence? 4 | [Shape Bot Defence](https://www.f5.com/cloud/products/bot-defense "Shape") detects bots using AI and machine learning applied to the bot's behaviour. 5 | 6 | ## Our View on Shape Bot Defence 7 | 8 | ### How to Identify Shape Bot Defence 9 | [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md) doesn't seem to recognize it, but we've noticed that certain websites protected by Shape stop working if opened by a browser in incognito mode with the developer tools tab open. Closing the developer tools tab, they work again. 10 | 11 | ### Recommended approach to Shape Bot Defence 12 | **BEST CHOICE**: at the moment, the best approach is [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + Firefox with the right options.
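The "right options" are not spelled out on this page, so as a hypothetical illustration only: these are real Firefox preference names often seen in stealth setups, which Playwright can pass at launch time via its `firefox_user_prefs` option. Whether this particular set is enough against Shape is an assumption to verify per site:

```python
# Hypothetical stealth preferences for Firefox (not a verified recipe).
FIREFOX_PREFS = {
    "privacy.resistFingerprinting": True,   # normalize many fingerprint surfaces
    "dom.webdriver.enabled": False,         # hide navigator.webdriver
    "media.peerconnection.enabled": False,  # avoid WebRTC IP leaks
}

# Usage with Playwright (requires `pip install playwright` and browsers):
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     browser = p.firefox.launch(firefox_user_prefs=FIREFOX_PREFS)
print(sorted(FIREFOX_PREFS))
```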
13 | 14 | ### How to bypass Shape according to The Web Scraping Club 15 | [Posts about Shape](https://substack.thewebscraping.club/t/shape) 16 | 17 | ### Reference and interesting links 18 | [Shape Bot Defence](https://www.f5.com/cloud/products/bot-defense) 19 | 20 | -------------------------------------------------------------------------------- /Pages/5.Antibot/TLSFingerprint.md: -------------------------------------------------------------------------------- 1 | # TLS Fingerprinting 2 | 3 | ## What is TLS Fingerprinting? 4 | Transport Layer Security (TLS) encrypts data sent over the Internet to ensure that eavesdroppers and hackers are unable to see what you transmit, which is particularly useful for private and sensitive information such as passwords, credit card numbers, and personal correspondence. 5 | As the first step of the communication, TLS performs a handshake between the two machines, and this exchange of information is not encrypted. Capturing elements of these packets can reveal information about the machines involved in the communication. 6 | 7 | ### Reference and interesting links 8 | [What is TLS fingerprinting](https://blog.squarelemon.com/tls-fingerprinting/) 9 | [More details on TLS fingerprinting](https://fingerprintjs.com/blog/what-is-tls-fingerprinting-transport-layer-security/) 10 | -------------------------------------------------------------------------------- /Pages/5.Antibot/TcpFingerprint.md: -------------------------------------------------------------------------------- 1 | # TCP Fingerprinting 2 | 3 | ## What is TCP Fingerprinting? 4 | TCP fingerprinting uses TCP/IP stack characteristics to identify a client connecting to a remote machine, inspecting IPv4 and IPv6 headers, TCP headers, the dynamics of the TCP handshake, and the contents of application-level payloads.
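A p0f-style tool boils those observed characteristics down to a compact signature string and matches it against a database of known OS stacks. This is a simplified sketch of the idea (the field choice and format are illustrative, not p0f's real signature syntax):

```python
def tcp_signature(ttl: int, window_size: int, options: list) -> str:
    """Sketch of a p0f-style passive signature: the initial TTL, the TCP
    window size, and the *order* of TCP options together narrow down
    which OS/network stack opened the connection."""
    # Round the observed TTL up to the nearest common initial value
    # (Linux typically starts at 64, Windows at 128, some gear at 255).
    initial_ttl = next(t for t in (64, 128, 255) if ttl <= t)
    return f"{initial_ttl}:{window_size}:{','.join(options)}"

# A Linux-like SYN and a Windows-like SYN yield different signatures:
print(tcp_signature(52, 64240, ["mss", "sackOK", "ts", "nop", "wscale"]))
# 64:64240:mss,sackOK,ts,nop,wscale
print(tcp_signature(117, 65535, ["mss", "nop", "wscale", "nop", "nop", "sackOK"]))
# 128:65535:mss,nop,wscale,nop,nop,sackOK
```

Because these attributes come from the OS kernel rather than the scraper, they leak through even when the HTTP layer looks perfectly browser-like; this is why proxies and realistic hosting matter.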
5 | 6 | ### Reference and interesting links 7 | [p0f3 tcp fingerprinting tool](https://lcamtuf.coredump.cx/p0f3/README) 8 | 9 | -------------------------------------------------------------------------------- /Pages/5.Antibot/Webglfingerprint.md: -------------------------------------------------------------------------------- 1 | # WebGL Fingerprinting 2 | 3 | ## What is WebGL Fingerprinting? 4 | Like canvas fingerprinting, this technique forces your browser to render images off-screen and then uses these images to infer information about your device's hardware and graphics system. 5 | 6 | ### Recommended approach to WebGL Fingerprint 7 | **BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) + hosting on a workstation if the configuration of the anti-bot is particularly strict. 8 | 9 | 10 | ### Reference and interesting links 11 | [WebGL Fingerprint Intro](https://webbrowsertools.com/webgl-fingerprint/) 12 | [WebGL Fingerprint Examples](https://codepen.io/jon/pen/LLPKbz) -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Web scraping from 0 to hero 2 | Originally named "Web Scraping Open Project", this repository aims to create a common knowledge base among web scraping practitioners, interesting enough for both rookies and experts in the field. 3 | Anyone can submit content if it adds value to the project. 4 | Of course, we won't accept AI-generated content or promotional and sponsored material; there are some sections dedicated to commercial tools, but they're based on user experience, not on marketing. 5 | 6 | ## Why this repository?
7 | Web scraping is becoming harder and more expensive, with anti-bots becoming more aggressive and requiring commercial tools to be bypassed. But, at the same time, the need for web data is growing exponentially, following the post-Covid-19 increase in digitalization. On top of this, AI models will need more and more data for training, and the main source is usually the web (just ask [Reddit](https://techcrunch.com/2023/07/04/reddit-braces-for-life-after-api-changes/ "Reddit API controversy") and [Twitter](https://business.twitter.com/en/blog/update-on-twitters-limited-usage.html "Twitter anti-scraping measures")). 8 | So while the challenges are increasing, there are more and more opportunities for developers who want to embark on a career as a web data engineer. 9 | In this repository we're building a silo of all the sparse and fragmented content around the web and sharing experience with tools, languages, and best practices: a great base camp for those starting now, but also a source of inspiration for experts looking for new tools and solutions. 10 | 11 | ## Who am I? 12 | I'm [Pierluigi Vinciguerra](https://www.linkedin.com/in/pierluigivinciguerra/), co-founder and CTO at [Databoutique.com](https://www.databoutique.com), and I've been working in web scraping for more than 10 years. 13 | I've always felt the need to centralize in one place the information about web scraping that is scattered around the web. At first I started taking some notes, and in 2022 I decided to share them with everyone by starting a free Substack called [The Web Scraping Club](https://substack.thewebscraping.club/), quite successful considering the niche I'm writing for, even if it's only my voice that is heard. With this repository, I want to create a chorus of web scraping experts sharing their experiences and ideas so that the whole industry can benefit from it. 14 | 15 | ## How does this repository work?
16 | This repository wants to be a central hub for information about web scraping, so, to keep it readable and ordered, this page will be used as a table of contents, with links to all the topics covered. 17 | Topics can be added by anyone as long as they are relevant and add value to the repository. 18 | I tend to use the pages for short content (about 400-500 words max) and link to external pages if longer content is needed, but that's not a rule. 19 | You can write an excerpt of a longer blog post on these pages and then link the full article. 20 | Feel free to add your contributions to this repository; sharing each other's knowledge will boost its value for everyone. 21 | 22 | Content not allowed: 23 | - **Out of scope content** 24 | - **Promotional content** 25 | - **Referral codes** 26 | - **AI-generated content** 27 | 28 | The table of contents below will be updated regularly as new topics come to mind; if an entry doesn't link to any article, it means the page does not exist yet, so feel free to add one. 29 | 30 | ## Table of contents 31 | 32 | ### 1. Before scraping a website 33 | #### 1.1 Is scraping that website legal? 34 | - [Reading the terms of service of the website](https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero/blob/main/Pages/1.Before%20Scraping/Reading%20Terms.md) 35 | - [Is the data I want to scrape compliant with privacy laws or copyrighted?](https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero/blob/main/Pages/1.Before%20Scraping/Privacy%20and%20copyright.md) 36 | - [Do I really need to scrape that website or can I buy pre-scraped data?](https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero/blob/main/Pages/1.Before%20Scraping/Buy%20or%20Make.md) 37 | #### 1.2 Preliminary website study 38 | - Does the website have an API (internal or exposed)? 39 | - Does it have some JSON inside the HTML? 40 | ### 2.
Best practices 41 | - Use JSON instead of HTML, if possible 42 | - Selectors 43 | - Data formatting 44 | - Reducing the number of requests 45 | ### 3. Free Tools 46 | #### 3.1. Headless Python scrapers 47 | - [Scrapy](https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero/blob/main/Pages/3.Free%20Tools/Scrapy.md) 48 | - [scrapy_splash](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/3.Free%20Tools/Scrapy_splash.md) 49 | 50 | #### 3.2. Python scrapers with fully rendered browsers 51 | - [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/3.Free%20Tools/Playwright.md) 52 | - [playwright_stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/3.Free%20Tools/Playwright_stealth.md) 53 | 54 | #### 3.3. Non-Python scrapers with fully rendered browsers 55 | - [Puppeteer](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/3.Free%20Tools/Puppeteer.md) 56 | 57 | #### 3.4. Non-Python full-featured web scraping libraries 58 | - [Crawlee](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/3.Free%20Tools/Crawlee.md) 59 | 60 | ### 4. Commercial Tools 61 | - Proxy solutions 62 | - Scraping API 63 | 64 | ### 5. Common anti-bot software & techniques 65 | 66 | #### 5.1.
Anti-bot Software 67 | - [Akamai](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/Akamai.md) 68 | - [Cloudflare](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/Cloudflare.md) 69 | - [Datadome](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/Datadome.md) 70 | - [PerimeterX](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/PerimeterX.md) 71 | - [Kasada](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/Kasada.md) 72 | - [F5 Shape Security](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/Shape.md) 73 | - Forter 74 | - Riskified 75 | #### 5.2. Anti-bot Techniques 76 | - [Passive fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/Passivefingerprint.md) including: 77 | - [TCP/IP Fingerprint](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/TcpFingerprint.md) 78 | - [TLS fingerprint](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/TLSFingerprint.md) 79 | - [HTTP Fingerprint](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/HttpFingerprint.md) 80 | - [Browser Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/Browserfingerprint.md) techniques including: 81 | - [Canvas Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/Canvasfingerprint.md) 82 | - [WebGL Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/Webglfingerprint.md) 83 | - [Device Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/5.Antibot/Devicefingerprint.md) 84 | 85 |
### 6. Test websites for your scraper 86 | - [https://bot.incolumitas.com/](https://bot.incolumitas.com/) one of the most complete sets of tests for your scrapers 87 | - [https://pixelscan.net/](https://pixelscan.net/) checks your IP and your machine 88 | - [https://bot.sannysoft.com/](https://bot.sannysoft.com/) another great list of tests 89 | - [https://abrahamjuliot.github.io/creepjs/](https://abrahamjuliot.github.io/creepjs/) a set of fingerprinting tests 90 | - [https://fingerprintjs.com/products/bot-detection/](https://fingerprintjs.com/products/bot-detection/) page about BotD, a JavaScript bot-detection library included in Cloudflare, where you can also test your configuration 91 | 92 | ### 7. How to make money with web scraping 93 | - Freelancing 94 | - Sell your scrapers with Apify 95 | - Sell your data on Databoutique.com 96 | 97 | 98 | --------------------------------------------------------------------------------