├── scrapoxy.json
├── scrapers
    ├── __init__.py
    ├── middlewares
    │   ├── __init__.py
    │   ├── info.py
    │   └── retry.py
    ├── pipelines
    │   ├── __init__.py
    │   └── csv.py
    ├── spiders
    │   ├── __init__.py
    │   └── trekky.py
    ├── settings.py
    ├── items.py
    └── utils.py
├── tools
    ├── deobfuscated.js
    ├── obfuscated.js
    └── deobfuscator.js
├── header.jpg
├── images
    ├── info.png
    ├── note.png
    ├── ast-ui.png
    ├── warning.png
    ├── ast-header.png
    ├── scrapoxy-proxies.png
    ├── scrapoxy-connector-run.png
    ├── scrapoxy-project-create.png
    ├── scrapoxy-project-update.png
    ├── scrapoxy-connector-create.png
    ├── chrome-network-inspector-list.png
    ├── chrome-network-inspector-list2.png
    └── chrome-network-inspector-initiator.png
├── scrapy.cfg
├── requirements.txt
├── .idea
    ├── vcs.xml
    ├── .gitignore
    ├── modules.xml
    ├── scraping-workshop.iml
    ├── runConfigurations
    │   ├── deobfuscator_js.xml
    │   └── trekky.xml
    ├── misc.xml
    └── inspectionProfiles
    │   └── Project_Default.xml
├── .editorconfig
├── package.json
├── .vscode
    ├── settings.json
    └── launch.json
├── .gitignore
├── solutions
    ├── challenge-2.py
    ├── challenge-3.py
    ├── challenge-6-1-partial.py
    ├── challenge-6-2.py
    ├── challenge-4.py
    ├── challenge-5.py
    └── challenge-7.py
├── playwright_spider.py
└── README.md
/scrapoxy.json:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/scrapers/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/tools/deobfuscated.js:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/tools/obfuscated.js:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/scrapers/middlewares/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/scrapers/pipelines/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/header.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/header.jpg
--------------------------------------------------------------------------------
/images/info.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/info.png
--------------------------------------------------------------------------------
/images/note.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/note.png
--------------------------------------------------------------------------------
/images/ast-ui.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/ast-ui.png
--------------------------------------------------------------------------------
/images/warning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/warning.png
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
1 | [settings]
2 | default = scrapers.settings
3 | 
4 | [deploy]
5 | project = scrapers
6 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pycryptodome==3.23.0
2 | scrapoxy==2.1.1
3 | Scrapy==2.13.3
4 | scrapy_playwright
--------------------------------------------------------------------------------
/images/ast-header.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/ast-header.png
--------------------------------------------------------------------------------
/images/scrapoxy-proxies.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/scrapoxy-proxies.png
--------------------------------------------------------------------------------
/images/scrapoxy-connector-run.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/scrapoxy-connector-run.png
--------------------------------------------------------------------------------
/images/scrapoxy-project-create.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/scrapoxy-project-create.png
--------------------------------------------------------------------------------
/images/scrapoxy-project-update.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/scrapoxy-project-update.png
--------------------------------------------------------------------------------
/images/scrapoxy-connector-create.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/scrapoxy-connector-create.png
--------------------------------------------------------------------------------
/images/chrome-network-inspector-list.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/chrome-network-inspector-list.png
--------------------------------------------------------------------------------
/images/chrome-network-inspector-list2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/chrome-network-inspector-list2.png
--------------------------------------------------------------------------------
/images/chrome-network-inspector-initiator.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/chrome-network-inspector-initiator.png
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
6 | 
--------------------------------------------------------------------------------
/scrapers/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 | 
--------------------------------------------------------------------------------
/.idea/.gitignore:
--------------------------------------------------------------------------------
1 | # Default ignored files
2 | /shelf/
3 | /workspace.xml
4 | # Editor-based HTTP Client requests
5 | /httpRequests/
6 | # Datasource local storage ignored files
7 | /dataSources/
8 | /dataSources.local.xml
9 | 
--------------------------------------------------------------------------------
/.idea/modules.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
6 | 
7 | 
8 | 
--------------------------------------------------------------------------------
/.idea/scraping-workshop.iml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
6 | 
7 | 
8 | 
9 | 
--------------------------------------------------------------------------------
/.idea/runConfigurations/deobfuscator_js.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
--------------------------------------------------------------------------------
/.editorconfig:
--------------------------------------------------------------------------------
1 | # Editor configuration, see https://editorconfig.org
2 | root = true
3 | 
4 | [*]
5 | charset = utf-8
6 | indent_style = space
7 | indent_size = 4
8 | insert_final_newline = true
9 | end_of_line = lf
10 | trim_trailing_whitespace = true
11 | 
12 | [*.ts]
13 | quote_type = single
14 | 
15 | [*.yml]
16 | indent_size = 2
17 | 
18 | [*.md]
19 | max_line_length = off
20 | trim_trailing_whitespace = false
21 | 
--------------------------------------------------------------------------------
/.idea/misc.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
6 | 
7 | 
8 | 
9 | 
--------------------------------------------------------------------------------
/package.json:
--------------------------------------------------------------------------------
1 | {
2 |     "name": "scraping-workshop",
3 |     "version": "1.0.0",
4 |     "description": "",
5 |     "type": "module",
6 |     "private": false,
7 |     "engines": {
8 |         "node": ">= 20.0.0",
9 |         "npm": ">= 6.0.0"
10 |     },
11 |     "scripts": {
12 |         "dev": "",
13 |         "test": ""
14 |     },
15 |     "dependencies": {
16 |         "@babel/generator": "~7.23.0",
17 |         "@babel/parser": "~7.23.0",
18 |         "@babel/traverse": "~7.23.0"
19 |     }
20 | }
21 | 
--------------------------------------------------------------------------------
/.idea/inspectionProfiles/Project_Default.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
14 | 
--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
1 | {
2 |     "editor.formatOnSave": true,
3 |     "python.linting.enabled": true,
4 |     "python.linting.pylintEnabled": true,
5 |     "python.defaultInterpreterPath": "/home/vboxguest/venv/bin/python",
6 |     "python.analysis.extraPaths": [
7 |         "${workspaceFolder}"
8 |     ],
9 |     "files.exclude": {
10 |         "**/__pycache__": true,
11 |         "**/.pytest_cache": true,
"**/.pytest_cache": true, 12 | "**/*.pyc": true 13 | }, 14 | "javascript.format.enable": true, 15 | "javascript.validate.enable": true, 16 | "[python]": { 17 | "editor.tabSize": 4 18 | }, 19 | "[javascript]": { 20 | "editor.tabSize": 2 21 | } 22 | } 23 | -------------------------------------------------------------------------------- /scrapers/settings.py: -------------------------------------------------------------------------------- 1 | BOT_NAME = "scrapers" 2 | 3 | SPIDER_MODULES = ["scrapers.spiders"] 4 | NEWSPIDER_MODULE = "scrapers.spiders" 5 | 6 | CONCURRENT_REQUESTS = 5 7 | DOWNLOAD_TIMEOUT = 10 8 | 9 | SPIDER_MIDDLEWARES = { 10 | 'scrapers.middlewares.info.InfoSpiderMiddleware': 40, 11 | } 12 | 13 | ITEM_PIPELINES = { 14 | 'scrapers.pipelines.csv.SaveToCsvPipeline': 300, 15 | } 16 | 17 | REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7" 18 | TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" 19 | FEED_EXPORT_ENCODING = "utf-8" 20 | 21 | # Prevent Scrapy from overriding Chrome's default HTTP headers. 22 | PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None 23 | -------------------------------------------------------------------------------- /scrapers/middlewares/info.py: -------------------------------------------------------------------------------- 1 | from scrapy import signals 2 | 3 | class InfoSpiderMiddleware: 4 | ###This spider middleware class logs the number of scraped items when the spider is closed.### 5 | def __init__(self, stats): 6 | self.stats = stats 7 | 8 | @classmethod 9 | def from_crawler(cls, crawler): 10 | s = cls(crawler.stats) 11 | crawler.signals.connect(s.spider_closed, signal=signals.spider_closed) 12 | return s 13 | 14 | def spider_closed(self, spider): 15 | count = self.stats.get_value("item_scraped_count", 0, spider=spider) 16 | if count > 0: 17 | if count > 1: 18 | spider.logger.info(f"\n\nWe got: {count} items\n") 19 | else: 20 | spider.logger.info("\n\nWe got: 1 item\n") 21 | -------------------------------------------------------------------------------- /.vscode/launch.json: -------------------------------------------------------------------------------- 1 | { 2 | "version": "0.2.0", 3 | "configurations": [ 4 | { 5 | "name": "deobfuscator.js", 6 | "type": "node", 7 | "request": "launch", 8 | "program": "${workspaceFolder}/tools/deobfuscator.js", 9 | "cwd": "${workspaceFolder}", 10 | "skipFiles": [ 11 | "/**" 12 | ] 13 | }, 14 | { 15 | "name": "trekky", 16 | "type": "python", 17 | "request": "launch", 18 | "module": "scrapy.cmdline", 19 | "args": [ 20 | "crawl", 21 | "trekky" 22 | ], 23 | "cwd": "${workspaceFolder}", 24 | "python": "/home/vboxguest/venv/bin/python", 25 | "env": { 26 | "PYTHONUNBUFFERED": "1" 27 | }, 28 | "console": "integratedTerminal" 29 | } 30 | ] 31 | } 32 | -------------------------------------------------------------------------------- /scrapers/items.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass, field 2 | from itemloaders.processors import TakeFirst, MapCompose, Identity 3 | from scrapy.loader import ItemLoader 4 | 5 | 6 | @dataclass 7 | class ReviewItem: 8 | rating: float = field(default=None) 9 | 10 | 11 | class ReviewItemLoader(ItemLoader): 12 | default_item_class = ReviewItem 13 | rating_in = MapCompose(str.strip, float) 14 | rating_out = TakeFirst() 15 | 16 | 17 | @dataclass 18 | class HotelItem: 19 | name: str = field(default=None) 20 | email: str = field(default=None) 21 | reviews: list[ReviewItem] = field(default=None) 22 | 23 | 24 | 
class HotelItemLoader(ItemLoader): 25 | default_input_processor = MapCompose(str.strip) 26 | default_output_processor = TakeFirst() 27 | default_item_class = HotelItem 28 | 29 | reviews_in = Identity() 30 | reviews_out = Identity() 31 | 32 | -------------------------------------------------------------------------------- /scrapers/pipelines/csv.py: -------------------------------------------------------------------------------- 1 | from itemadapter import ItemAdapter 2 | from scrapy import signals 3 | 4 | import csv 5 | 6 | 7 | class SaveToCsvPipeline: 8 | ###This pipeline class saves the scraped data to a CSV file named 'results.csv'.### 9 | _items = [] 10 | 11 | @classmethod 12 | def from_crawler(cls, crawler): 13 | s = cls() 14 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 15 | crawler.signals.connect(s.spider_closed, signal=signals.spider_closed) 16 | return s 17 | 18 | def process_item(self, item, spider): 19 | self._items.append(item) 20 | return item 21 | 22 | def spider_opened(self, spider): 23 | self._items = [] 24 | 25 | def spider_closed(self, spider): 26 | with open('results.csv', 'w', newline='') as csvfile: 27 | csvwriter = csv.DictWriter( 28 | csvfile, 29 | fieldnames=['name', 'email', 'reviews'], 30 | delimiter=',', 31 | quotechar='"', quoting=csv.QUOTE_MINIMAL 32 | ) 33 | 34 | csvwriter.writeheader() 35 | 36 | for item in self._items: 37 | csvwriter.writerow(ItemAdapter(item).asdict()) 38 | -------------------------------------------------------------------------------- /.idea/runConfigurations/trekky.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 25 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # See http://help.github.com/ignore-files/ for more about ignoring files. 2 | 3 | # compiled output 4 | /dist 5 | /tmp 6 | /out-tsc 7 | __pycache__/ 8 | *.py[cod] 9 | *$py.class 10 | 11 | # C extensions 12 | *.so 13 | 14 | # cache 15 | .nx 16 | 17 | # Distribution / packaging 18 | .Python 19 | build/ 20 | develop-eggs/ 21 | dist/ 22 | downloads/ 23 | eggs/ 24 | .eggs/ 25 | lib/ 26 | lib64/ 27 | parts/ 28 | sdist/ 29 | var/ 30 | wheels/ 31 | share/python-wheels/ 32 | *.egg-info/ 33 | .installed.cfg 34 | *.egg 35 | MANIFEST 36 | 37 | # pytype static type analyzer 38 | .pytype/ 39 | 40 | # Cython debug symbols 41 | cython_debug/ 42 | 43 | # PyInstaller 44 | # Usually these files are written by a python script from a template 45 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 46 | *.manifest 47 | *.spec 48 | 49 | # Installer logs 50 | pip-log.txt 51 | pip-delete-this-directory.txt 52 | 53 | # dependencies 54 | /node_modules 55 | 56 | # IDEs and editors 57 | /.idea/*8kqj7y1qjbninwbd95ozki 58 | !.idea/runConfigurations 59 | .project 60 | .classpath 61 | .c9/ 62 | *.launch 63 | .settings/ 64 | *.sublime-workspace 65 | 66 | # IDE - VSCode 67 | .vscode/* 68 | !.vscode/settings.json 69 | !.vscode/tasks.json 70 | !.vscode/launch.json 71 | !.vscode/extensions.json 72 | 73 | # misc 74 | /.sass-cache 75 | /connect.lock 76 | /coverage 77 | /libpeerconnection.log 78 | npm-debug.log 79 | yarn-error.log 80 | testem.log 81 | /typings 82 | yarn.lock 83 | 84 | # System Files 85 | .DS_Store 86 | Thumbs.db 87 | 88 | # Scrapy stuff: 89 | .scrapy 90 | 91 | # IPython 92 | profile_default/ 93 | ipython_config.py 94 | 95 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 96 | __pypackages__/ 97 | 98 | # Configuration 99 | .env 100 | .venv 101 | env/ 102 | venv/ 103 | ENV/ 104 | env.bak/ 105 | venv.bak/ 106 | .spyderproject 107 | .spyproject 108 | 109 | # Data 110 | results.csv 111 | -------------------------------------------------------------------------------- /scrapers/utils.py: -------------------------------------------------------------------------------- 1 | from base64 import b64encode 2 | from datetime import datetime 3 | from scrapy.spidermiddlewares.httperror import HttpError 4 | from w3lib.html import remove_tags 5 | from Crypto.Cipher import PKCS1_OAEP 6 | from Crypto.Hash import SHA256 7 | from Crypto.PublicKey import RSA 8 | 9 | import re 10 | import json 11 | 12 | 13 | def remove_whitespace(text): 14 | text = text.replace('\n', ' ').replace('\r', '') 15 | text = re.sub(r'\s+', ' ', text) 16 | return text.strip() 17 | 18 | 19 | def date_to_timestamp(date): 20 | try: 21 | return int(datetime.strptime(date.strip(), '%b %d, %Y, %I:%M %p').timestamp() * 1000) 22 | except ValueError: 23 | return None 24 | 25 | 26 | def print_failure(logger, failure): 27 | message = f"\nURL: {failure.request.url}\n\n" 28 | 29 | if failure.check(HttpError): 30 | response = failure.value.response 31 | 32 | text = remove_tags(response.text) 33 | 34 | try: 35 | if response.status == 429: 36 | error = { 37 | "message": "Too many requests", 38 | "description": response.text 39 | } 40 | else: 41 | error = json.loads(text) 42 | 43 | description = error.get('description') 44 | if description: 45 | message += f"Error: {error['message']}\n\nDetails: {description}\n" 46 | else: 47 | message += f"Error: {error['message']}\n" 48 | except json.JSONDecodeError: 49 | message += text 50 | else: 51 | message += f"Error: {failure.getErrorMessage()}\n" 52 | 53 | logger.error(f"\n{message}\n") 54 | 55 | 56 | def rsa_encrypt(message, public_key): 57 | """Use RSA public key encryption to encrypt the message.""" 58 | 59 | # Convert the public key into PEM format for use in RSA encryption. 60 | pem_key = f"-----BEGIN PUBLIC KEY-----\n{public_key}\n-----END PUBLIC KEY-----" 61 | rsa_public_key = RSA.importKey(pem_key) 62 | rsa_public_key = PKCS1_OAEP.new(rsa_public_key, hashAlgo=SHA256) 63 | 64 | # Encrypt the message 65 | message = str.encode(message) 66 | encrypted_text = rsa_public_key.encrypt(message) 67 | encrypted_text_b64 = b64encode(encrypted_text) 68 | return encrypted_text_b64 69 | -------------------------------------------------------------------------------- /scrapers/spiders/trekky.py: -------------------------------------------------------------------------------- 1 | from scrapy import Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure 4 | 5 | 6 | class TrekkySpider(Spider): 7 | """This class manages all the logic required for scraping the Trekky website. 8 | 9 | Attributes: 10 | name (str): The unique name of the spider. 11 | start_url (str): Root of the website and first URL to scrape. 
12 | custom_settings (dict): Custom settings for the scraper 13 | """ 14 | 15 | name = "trekky" 16 | 17 | start_url = "https://trekky-reviews.com/level1" 18 | 19 | custom_settings = { 20 | "DEFAULT_REQUEST_HEADERS": { 21 | "Connection": "close", 22 | }, 23 | 24 | "DOWNLOADER_MIDDLEWARES": { 25 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 26 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 27 | }, 28 | } 29 | 30 | def start_requests(self): 31 | """This method start 10 separate sessions on the homepage, one per page.""" 32 | for page in range(1, 10): 33 | yield Request( 34 | url=self.start_url, 35 | callback=self.parse, 36 | errback=self.errback, 37 | dont_filter=True, 38 | meta=dict( 39 | page=page, 40 | cookiejar="jar%d" % page, 41 | ), 42 | ) 43 | 44 | def parse(self, response): 45 | """After accessing the website's homepage, we retrieve the list of hotels in Paris from page X.""" 46 | yield Request( 47 | url=response.urljoin("cities?city=paris&page=%d" % response.meta['page']), 48 | callback=self.parse_listing, 49 | errback=self.errback, 50 | meta=response.meta, 51 | ) 52 | 53 | def parse_listing(self, response): 54 | """This method parses the list of hotels in Paris from page X.""" 55 | for el in response.css('.hotel-link'): 56 | yield response.follow( 57 | url=el, 58 | callback=self.parse_hotel, 59 | errback=self.errback, 60 | meta=response.meta, 61 | ) 62 | 63 | def parse_hotel(self, response): 64 | """This method parses hotel details such as name, email, and reviews.""" 65 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 66 | 67 | hotel = HotelItemLoader(response=response) 68 | hotel.add_css('name', '.hotel-name::text') 69 | hotel.add_css('email', '.hotel-email::text') 70 | hotel.add_value('reviews', reviews) 71 | return hotel.load_item() 72 | 73 | def get_review(self, review_el): 74 | """This method extracts rating from a review""" 75 | review = ReviewItemLoader(selector=review_el) 76 | review.add_css('rating', '.review-rating::text') 77 | return review.load_item() 78 | 79 | def errback(self, failure): 80 | """This method handles and logs errors and is invoked with each request.""" 81 | print_failure(self.logger, failure) 82 | -------------------------------------------------------------------------------- /solutions/challenge-2.py: -------------------------------------------------------------------------------- 1 | from scrapy import Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure 4 | 5 | 6 | class TrekkySpider(Spider): 7 | """This class manages all the logic required for scraping the Trekky website. 8 | 9 | Attributes: 10 | name (str): The unique name of the spider. 11 | start_url (str): Root of the website and first URL to scrape. 
12 | custom_settings (dict): Custom settings for the scraper 13 | """ 14 | 15 | name = "trekky" 16 | 17 | start_url = "https://trekky-reviews.com/level2" 18 | 19 | custom_settings = { 20 | # Add a User Agent to simulate a real browser 21 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3", 22 | 23 | "DEFAULT_REQUEST_HEADERS": { 24 | "Connection": "close", 25 | 26 | # Add headers to simulate a real browser, matching the User Agent 27 | "Sec-Ch-Ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"90\", \"Google Chrome\";v=\"90\"", 28 | "Sec-Ch-Ua-Mobile": "?0", 29 | "Sec-Ch-Ua-Platform": "\"Windows\"", 30 | }, 31 | 32 | "DOWNLOADER_MIDDLEWARES": { 33 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 34 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 35 | }, 36 | } 37 | 38 | def start_requests(self): 39 | """This method start 10 separate sessions on the homepage, one per page.""" 40 | for page in range(1, 10): 41 | yield Request( 42 | url=self.start_url, 43 | callback=self.parse, 44 | errback=self.errback, 45 | dont_filter=True, 46 | meta=dict( 47 | page=page, 48 | cookiejar="jar%d" % page, 49 | ), 50 | ) 51 | 52 | def parse(self, response): 53 | """After accessing the website's homepage, we retrieve the list of hotels in Paris from page X.""" 54 | yield Request( 55 | url=response.urljoin("cities?city=paris&page=%d" % response.meta['page']), 56 | callback=self.parse_listing, 57 | errback=self.errback, 58 | meta=response.meta, 59 | ) 60 | 61 | def parse_listing(self, response): 62 | """This method parses the list of hotels in Paris from page X.""" 63 | for el in response.css('.hotel-link'): 64 | yield response.follow( 65 | url=el, 66 | callback=self.parse_hotel, 67 | errback=self.errback, 68 | meta=response.meta, 69 | ) 70 | 71 | def parse_hotel(self, response): 72 | """This method parses hotel details such as name, email, and reviews.""" 73 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 74 | 75 | hotel = HotelItemLoader(response=response) 76 | hotel.add_css('name', '.hotel-name::text') 77 | hotel.add_css('email', '.hotel-email::text') 78 | hotel.add_value('reviews', reviews) 79 | return hotel.load_item() 80 | 81 | def get_review(self, review_el): 82 | """This method extracts rating from a review""" 83 | review = ReviewItemLoader(selector=review_el) 84 | review.add_css('rating', '.review-rating::text') 85 | return review.load_item() 86 | 87 | def errback(self, failure): 88 | """This method handles and logs errors and is invoked with each request.""" 89 | print_failure(self.logger, failure) 90 | -------------------------------------------------------------------------------- /solutions/challenge-3.py: -------------------------------------------------------------------------------- 1 | from scrapy import Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure 4 | 5 | 6 | class TrekkySpider(Spider): 7 | """This class manages all the logic required for scraping the Trekky website. 8 | 9 | Attributes: 10 | name (str): The unique name of the spider. 11 | start_url (str): Root of the website and first URL to scrape. 
12 | custom_settings (dict): Custom settings for the scraper 13 | """ 14 | 15 | name = "trekky" 16 | 17 | start_url = "https://trekky-reviews.com/level4" 18 | 19 | custom_settings = { 20 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3", 21 | 22 | "DEFAULT_REQUEST_HEADERS": { 23 | "Connection": "close", 24 | "Sec-Ch-Ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"90\", \"Google Chrome\";v=\"90\"", 25 | "Sec-Ch-Ua-Mobile": "?0", 26 | "Sec-Ch-Ua-Platform": "\"Windows\"", 27 | }, 28 | 29 | "DOWNLOADER_MIDDLEWARES": { 30 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 31 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 32 | }, 33 | 34 | "ADDONS": { 35 | 'scrapoxy.Addon': 100, 36 | }, 37 | 38 | # Set up Scrapoxy settings and credentials 39 | "SCRAPOXY_MASTER": "http://localhost:8888", 40 | "SCRAPOXY_API": "http://localhost:8890/api", 41 | "SCRAPOXY_USERNAME": "TO_FILL", 42 | "SCRAPOXY_PASSWORD": "TO_FILL", 43 | } 44 | 45 | def start_requests(self): 46 | """This method start 10 separate sessions on the homepage, one per page.""" 47 | for page in range(1, 10): 48 | yield Request( 49 | url=self.start_url, 50 | callback=self.parse, 51 | errback=self.errback, 52 | dont_filter=True, 53 | meta=dict( 54 | page=page, 55 | cookiejar="jar%d" % page, 56 | ), 57 | ) 58 | 59 | def parse(self, response): 60 | """After accessing the website's homepage, we retrieve the list of hotels in Paris from page X.""" 61 | yield Request( 62 | url=response.urljoin("cities?city=paris&page=%d" % response.meta['page']), 63 | callback=self.parse_listing, 64 | errback=self.errback, 65 | meta=response.meta, 66 | ) 67 | 68 | def parse_listing(self, response): 69 | """This method parses the list of hotels in Paris from page X.""" 70 | for el in response.css('.hotel-link'): 71 | yield response.follow( 72 | url=el, 73 | callback=self.parse_hotel, 74 | errback=self.errback, 75 | meta=response.meta, 76 | ) 77 | 78 | def parse_hotel(self, response): 79 | """This method parses hotel details such as name, email, and reviews.""" 80 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 81 | 82 | hotel = HotelItemLoader(response=response) 83 | hotel.add_css('name', '.hotel-name::text') 84 | hotel.add_css('email', '.hotel-email::text') 85 | hotel.add_value('reviews', reviews) 86 | return hotel.load_item() 87 | 88 | def get_review(self, review_el): 89 | """This method extracts rating from a review""" 90 | review = ReviewItemLoader(selector=review_el) 91 | review.add_css('rating', '.review-rating::text') 92 | return review.load_item() 93 | 94 | def errback(self, failure): 95 | """This method handles and logs errors and is invoked with each request.""" 96 | print_failure(self.logger, failure) 97 | -------------------------------------------------------------------------------- /solutions/challenge-6-1-partial.py: -------------------------------------------------------------------------------- 1 | from scrapy import FormRequest, Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure, rsa_encrypt 4 | from urllib.parse import urljoin 5 | 6 | import json 7 | 8 | 9 | def build_payload(): 10 | """Build the encrypted payload to send to the server.""" 11 | payload = json.dumps({ 12 | "KEY_1_TO_REPLACE": "VALUE_1_TO_REPLACE", 13 | "KEY_2_TO_REPLACE": "VALUE_2_TO_REPLACE", 14 | }) 15 | 16 | # The public key is extracted from the deobfuscated 
JavaScript code of the website's antibot. 17 | public_key = "TO_FILL" 18 | 19 | payload_encoded = rsa_encrypt(payload, public_key) 20 | return payload_encoded 21 | 22 | 23 | class TrekkySpider(Spider): 24 | """This class manages all the logic required for scraping the Trekky website. 25 | 26 | Attributes: 27 | name (str): The unique name of the spider. 28 | start_url (str): Root of the website and first URL to scrape. 29 | custom_settings (dict): Custom settings for the scraper 30 | """ 31 | 32 | name = "trekky" 33 | 34 | start_url = "https://trekky-reviews.com/level8" 35 | 36 | custom_settings = { 37 | "DEFAULT_REQUEST_HEADERS": { 38 | "Connection": "close", 39 | }, 40 | 41 | "DOWNLOADER_MIDDLEWARES": { 42 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 43 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 44 | }, 45 | } 46 | 47 | def start_requests(self): 48 | """This method start 10 separate sessions on the homepage, one per page.""" 49 | for page in range(1, 10): 50 | yield Request( 51 | url=self.start_url, 52 | callback=self.parse_home, 53 | errback=self.errback, 54 | dont_filter=True, 55 | meta=dict( 56 | page=page, 57 | cookiejar="jar%d" % page, 58 | ), 59 | ) 60 | 61 | def parse_home(self, response): 62 | """After accessing the website's homepage, we generate the encrypted payload and send it to the server.""" 63 | yield FormRequest( 64 | url=urljoin(self.start_url, '/Vmi6869kJM7vS70sZKXrwn5Lq0CORjRl'), 65 | formdata={ 66 | "payload": build_payload(), 67 | }, 68 | callback=self.parse, 69 | errback=self.errback, 70 | dont_filter=True, 71 | meta=response.meta, 72 | ) 73 | 74 | def parse(self, response): 75 | """Once approved, we retrieve the list of hotels in Paris from page X.""" 76 | yield Request( 77 | url=self.start_url + "/cities?city=paris&page=%d" % response.meta['page'], 78 | callback=self.parse_listing, 79 | errback=self.errback, 80 | meta=response.meta, 81 | ) 82 | 83 | def parse_listing(self, response): 84 | """This method parses the list of hotels in Paris from page X.""" 85 | for el in response.css('.hotel-link'): 86 | yield response.follow( 87 | url=el, 88 | callback=self.parse_hotel, 89 | errback=self.errback, 90 | meta=response.meta, 91 | ) 92 | 93 | def parse_hotel(self, response): 94 | """This method parses hotel details such as name, email, and reviews.""" 95 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 96 | 97 | hotel = HotelItemLoader(response=response) 98 | hotel.add_css('name', '.hotel-name::text') 99 | hotel.add_css('email', '.hotel-email::text') 100 | hotel.add_value('reviews', reviews) 101 | return hotel.load_item() 102 | 103 | def get_review(self, review_el): 104 | """This method extracts rating from a review""" 105 | review = ReviewItemLoader(selector=review_el) 106 | review.add_css('rating', '.review-rating::text') 107 | return review.load_item() 108 | 109 | def errback(self, failure): 110 | """This method handles and logs errors and is invoked with each request.""" 111 | print_failure(self.logger, failure) 112 | 113 | -------------------------------------------------------------------------------- /solutions/challenge-6-2.py: -------------------------------------------------------------------------------- 1 | from scrapy import FormRequest, Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure, rsa_encrypt 4 | from urllib.parse import urljoin 5 | 6 | import json 7 | 8 | 9 | def build_payload(): 10 | """Build the encrypted 
payload to send to the server.""" 11 | payload = json.dumps({ 12 | "vendor": "Intel", 13 | "renderer": "Intel Iris OpenGL Engine", 14 | }) 15 | 16 | # The public key is extracted from the deobfuscated JavaScript code of the website's antibot. 17 | public_key = "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEApgjwxZd4I6YnOE1GGCdnKIatX71CyGpssvAAH7udNLcBVr0WzIP1t+KZ7mDzLMyZE9MJmSsEgKidzaVRikarUQ6MUWnyJQxe8DlUNrSmK4ZrnLBD/5rVBcepZo1mPj1MdQWie4AYHUt++lLpPrXqEJ7xugSGIt7ORVGgcKO5ku5RSS1Ssy5iUhYtQo4VCb2UxYuMbpt2YF8LOaR8KtPIQENtNH2Jj7akQTna4I5lixOB0jme03lR5n94SqACUAZ+rFBDKgrC9eVWX8xdfMERxcKuD9NxFCV65tdNiH64CHWaDU13j9v2XGHKFkEORgRn+RQBintX5fEqt7GTTIzvoQIDAQAB" 18 | 19 | payload_encoded = rsa_encrypt(payload, public_key) 20 | return payload_encoded 21 | 22 | 23 | class TrekkySpider(Spider): 24 | """This class manages all the logic required for scraping the Trekky website. 25 | 26 | Attributes: 27 | name (str): The unique name of the spider. 28 | start_url (str): Root of the website and first URL to scrape. 29 | custom_settings (dict): Custom settings for the scraper 30 | """ 31 | 32 | name = "trekky" 33 | 34 | start_url = "https://trekky-reviews.com/level8" 35 | 36 | custom_settings = { 37 | "DEFAULT_REQUEST_HEADERS": { 38 | "Connection": "close", 39 | }, 40 | 41 | "DOWNLOADER_MIDDLEWARES": { 42 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 43 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 44 | }, 45 | } 46 | 47 | def start_requests(self): 48 | """This method start 10 separate sessions on the homepage, one per page.""" 49 | for page in range(1, 10): 50 | yield Request( 51 | url=self.start_url, 52 | callback=self.parse_home, 53 | errback=self.errback, 54 | dont_filter=True, 55 | meta=dict( 56 | page=page, 57 | cookiejar="jar%d" % page, 58 | ), 59 | ) 60 | 61 | def parse_home(self, response): 62 | """After accessing the website's homepage, we generate the encrypted payload and send it to the server.""" 63 | yield FormRequest( 64 | url=urljoin(self.start_url, '/Vmi6869kJM7vS70sZKXrwn5Lq0CORjRl'), 65 | formdata={ 66 | "payload": build_payload(), 67 | }, 68 | callback=self.parse, 69 | errback=self.errback, 70 | dont_filter=True, 71 | meta=response.meta, 72 | ) 73 | 74 | def parse(self, response): 75 | """Once approved, we retrieve the list of hotels in Paris from page X.""" 76 | yield Request( 77 | url=self.start_url + "/cities?city=paris&page=%d" % response.meta['page'], 78 | callback=self.parse_listing, 79 | errback=self.errback, 80 | meta=response.meta, 81 | ) 82 | 83 | def parse_listing(self, response): 84 | """This method parses the list of hotels in Paris from page X.""" 85 | for el in response.css('.hotel-link'): 86 | yield response.follow( 87 | url=el, 88 | callback=self.parse_hotel, 89 | errback=self.errback, 90 | meta=response.meta, 91 | ) 92 | 93 | def parse_hotel(self, response): 94 | """This method parses hotel details such as name, email, and reviews.""" 95 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 96 | 97 | hotel = HotelItemLoader(response=response) 98 | hotel.add_css('name', '.hotel-name::text') 99 | hotel.add_css('email', '.hotel-email::text') 100 | hotel.add_value('reviews', reviews) 101 | return hotel.load_item() 102 | 103 | def get_review(self, review_el): 104 | """This method extracts rating from a review""" 105 | review = ReviewItemLoader(selector=review_el) 106 | review.add_css('rating', '.review-rating::text') 107 | return review.load_item() 108 | 109 | def errback(self, failure): 110 | """This method handles and logs 
errors and is invoked with each request.""" 111 | print_failure(self.logger, failure) 112 | 113 | -------------------------------------------------------------------------------- /solutions/challenge-4.py: -------------------------------------------------------------------------------- 1 | from scrapy import Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure 4 | 5 | 6 | class TrekkySpider(Spider): 7 | """This class manages all the logic required for scraping the Trekky website. 8 | 9 | Attributes: 10 | name (str): The unique name of the spider. 11 | start_url (str): Root of the website and first URL to scrape. 12 | custom_settings (dict): Custom settings for the scraper 13 | """ 14 | 15 | name = "trekky" 16 | 17 | start_url = "https://trekky-reviews.com/level6" 18 | 19 | custom_settings = { 20 | "DEFAULT_REQUEST_HEADERS": { 21 | "Connection": "close", 22 | }, 23 | 24 | "DOWNLOADER_MIDDLEWARES": { 25 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 26 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 27 | }, 28 | 29 | # Replace the default Scrapy downloader with Playwright and Chrome to manage JavaScript content. 30 | "DOWNLOAD_HANDLERS": { 31 | "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", 32 | "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", 33 | }, 34 | 35 | # Set up Playwright to launch the browser in headful mode using Scrapoxy. 36 | "PLAYWRIGHT_LAUNCH_OPTIONS": { 37 | "headless": False, 38 | "proxy": { 39 | "server": "http://localhost:8888", 40 | "username": "TO_FILL", 41 | "password": "TO_FILL", 42 | }, 43 | } 44 | } 45 | 46 | def start_requests(self): 47 | """This method start 10 separate sessions on the homepage, one per page.""" 48 | for page in range(1, 10): 49 | yield Request( 50 | url=self.start_url, 51 | callback=self.parse, 52 | errback=self.errback, 53 | dont_filter=True, 54 | meta=dict( 55 | # Enable Playwright 56 | playwright=True, 57 | # Include the Playwright page object in the response 58 | playwright_include_page=True, 59 | playwright_context="context%d" % page, 60 | playwright_context_kwargs=dict( 61 | # Ignore HTTPS errors 62 | ignore_https_errors=True, 63 | ), 64 | playwright_page_goto_kwargs=dict( 65 | wait_until='networkidle', 66 | ), 67 | page=page, 68 | ), 69 | ) 70 | 71 | async def parse(self, response): 72 | """After accessing the website's homepage, we retrieve the list of hotels in Paris from page X.""" 73 | await response.meta["playwright_page"].close() 74 | del response.meta["playwright_page"] 75 | 76 | # For the next requests, skip page rendering and download only the HTML content. 
77 | response.meta["playwright_page_goto_kwargs"]["wait_until"] = 'commit' 78 | 79 | yield Request( 80 | url=response.urljoin("cities?city=paris&page=%d" % response.meta['page']), 81 | callback=self.parse_listing, 82 | errback=self.errback, 83 | meta=response.meta, 84 | ) 85 | 86 | async def parse_listing(self, response): 87 | """This method parses the list of hotels in Paris from page X.""" 88 | await response.meta["playwright_page"].close() 89 | del response.meta["playwright_page"] 90 | 91 | for el in response.css('.hotel-link'): 92 | yield response.follow( 93 | url=el, 94 | callback=self.parse_hotel, 95 | errback=self.errback, 96 | meta=response.meta, 97 | ) 98 | 99 | async def parse_hotel(self, response): 100 | """This method parses hotel details such as name, email, and reviews.""" 101 | await response.meta["playwright_page"].close() 102 | del response.meta["playwright_page"] 103 | 104 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 105 | 106 | hotel = HotelItemLoader(response=response) 107 | hotel.add_css('name', '.hotel-name::text') 108 | hotel.add_css('email', '.hotel-email::text') 109 | hotel.add_value('reviews', reviews) 110 | return hotel.load_item() 111 | 112 | def get_review(self, review_el): 113 | """This method extracts rating from a review""" 114 | review = ReviewItemLoader(selector=review_el) 115 | review.add_css('rating', '.review-rating::text') 116 | return review.load_item() 117 | 118 | async def errback(self, failure): 119 | """This method handles and logs errors and is invoked with each request.""" 120 | print_failure(self.logger, failure) 121 | page = failure.request.meta.get("playwright_page") 122 | if page: 123 | await page.close() 124 | -------------------------------------------------------------------------------- /solutions/challenge-5.py: -------------------------------------------------------------------------------- 1 | from scrapy import Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure 4 | 5 | 6 | class TrekkySpider(Spider): 7 | """This class manages all the logic required for scraping the Trekky website. 8 | 9 | Attributes: 10 | name (str): The unique name of the spider. 11 | start_url (str): Root of the website and first URL to scrape. 12 | custom_settings (dict): Custom settings for the scraper 13 | """ 14 | 15 | name = "trekky" 16 | 17 | start_url = "https://trekky-reviews.com/level7" 18 | 19 | custom_settings = { 20 | "DEFAULT_REQUEST_HEADERS": { 21 | "Connection": "close", 22 | }, 23 | 24 | "DOWNLOADER_MIDDLEWARES": { 25 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 26 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 27 | }, 28 | 29 | # Replace the default Scrapy downloader with Playwright and Chrome to manage JavaScript content. 30 | "DOWNLOAD_HANDLERS": { 31 | "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", 32 | "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", 33 | }, 34 | 35 | # Set up Playwright to launch the browser in headful mode using Scrapoxy. 
36 | "PLAYWRIGHT_LAUNCH_OPTIONS": { 37 | "headless": False, 38 | "proxy": { 39 | "server": "http://localhost:8888", 40 | "username": "TO_FILL", 41 | "password": "TO_FILL", 42 | }, 43 | } 44 | } 45 | 46 | def start_requests(self): 47 | """This method start 10 separate sessions on the homepage, one per page.""" 48 | for page in range(1, 10): 49 | yield Request( 50 | url=self.start_url, 51 | callback=self.parse, 52 | errback=self.errback, 53 | dont_filter=True, 54 | meta=dict( 55 | # Enable Playwright 56 | playwright=True, 57 | # Include the Playwright page object in the response 58 | playwright_include_page=True, 59 | playwright_context="context%d" % page, 60 | playwright_context_kwargs=dict( 61 | # Ignore HTTPS errors 62 | ignore_https_errors=True, 63 | # Sync the timezone with the proxy's location. 64 | timezone_id='America/Chicago', 65 | ), 66 | playwright_page_goto_kwargs=dict( 67 | wait_until='networkidle', 68 | ), 69 | page=page, 70 | ), 71 | ) 72 | 73 | async def parse(self, response): 74 | """After accessing the website's homepage, we retrieve the list of hotels in Paris from page X.""" 75 | await response.meta["playwright_page"].close() 76 | del response.meta["playwright_page"] 77 | 78 | # For the next requests, skip page rendering and download only the HTML content. 79 | response.meta["playwright_page_goto_kwargs"]["wait_until"] = 'commit' 80 | 81 | yield Request( 82 | url=response.urljoin("cities?city=paris&page=%d" % response.meta['page']), 83 | callback=self.parse_listing, 84 | errback=self.errback, 85 | meta=response.meta, 86 | ) 87 | 88 | async def parse_listing(self, response): 89 | """This method parses the list of hotels in Paris from page X.""" 90 | await response.meta["playwright_page"].close() 91 | del response.meta["playwright_page"] 92 | 93 | for el in response.css('.hotel-link'): 94 | yield response.follow( 95 | url=el, 96 | callback=self.parse_hotel, 97 | errback=self.errback, 98 | meta=response.meta, 99 | ) 100 | 101 | async def parse_hotel(self, response): 102 | """This method parses hotel details such as name, email, and reviews.""" 103 | await response.meta["playwright_page"].close() 104 | del response.meta["playwright_page"] 105 | 106 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 107 | 108 | hotel = HotelItemLoader(response=response) 109 | hotel.add_css('name', '.hotel-name::text') 110 | hotel.add_css('email', '.hotel-email::text') 111 | hotel.add_value('reviews', reviews) 112 | return hotel.load_item() 113 | 114 | def get_review(self, review_el): 115 | """This method extracts rating from a review""" 116 | review = ReviewItemLoader(selector=review_el) 117 | review.add_css('rating', '.review-rating::text') 118 | return review.load_item() 119 | 120 | async def errback(self, failure): 121 | """This method handles and logs errors and is invoked with each request.""" 122 | print_failure(self.logger, failure) 123 | page = failure.request.meta.get("playwright_page") 124 | if page: 125 | await page.close() 126 | -------------------------------------------------------------------------------- /tools/deobfuscator.js: -------------------------------------------------------------------------------- 1 | import {promises as fs} from 'fs'; 2 | import * as parser from '@babel/parser'; 3 | import * as t from "@babel/types"; 4 | import _traverse from '@babel/traverse'; 5 | import _generate from '@babel/generator'; 6 | 7 | const traverse = _traverse.default; 8 | const generate = _generate.default; 9 | 10 | 11 | /* 12 | * Generate and 
prettify source code from the AST 13 | */ 14 | function generateCode(ast) { 15 | return generate(ast, { 16 | comments: false, 17 | compact: false, 18 | }).code; 19 | } 20 | 21 | 22 | /* 23 | * Replace constants by the string value 24 | */ 25 | function constantUnfolding(source) { 26 | // Convert source code to an AST (Abstract Syntax Tree) 27 | const ast = parser.parse(source); 28 | 29 | // Replace variable usage by their string declaration 30 | traverse(ast, { 31 | VariableDeclaration(path) { 32 | const newDeclarations = []; 33 | for (const declaration of path.node.declarations) { 34 | if (!t.isStringLiteral(declaration.init)) { 35 | newDeclarations.push(declaration); 36 | continue; 37 | } 38 | 39 | const binding = path.scope.getBinding(declaration.id.name); 40 | if (!binding || binding.constantViolations.length > 0) { 41 | newDeclarations.push(declaration); 42 | continue; 43 | } 44 | 45 | for (const referencePath of binding.referencePaths) { 46 | referencePath.replaceWith(t.stringLiteral(declaration.init.value)); 47 | } 48 | } 49 | 50 | if (newDeclarations.length > 0) { 51 | path.node.declarations = newDeclarations; 52 | } else { 53 | path.remove(); 54 | } 55 | } 56 | }); 57 | 58 | // Replace the binding of the window usage 59 | traverse(ast, { 60 | VariableDeclaration(path) { 61 | const newDeclarations = []; 62 | for (const declaration of path.node.declarations) { 63 | if (!t.isIdentifier(declaration.init) || declaration.init.name !== 'window') { 64 | newDeclarations.push(declaration); 65 | continue; 66 | } 67 | 68 | const binding = path.scope.getBinding(declaration.id.name); 69 | if (!binding || binding.constantViolations.length > 0) { 70 | newDeclarations.push(declaration); 71 | continue; 72 | } 73 | 74 | for (const referencePath of binding.referencePaths) { 75 | referencePath.replaceWith(t.identifier('window')); 76 | } 77 | } 78 | 79 | if (newDeclarations.length > 0) { 80 | path.node.declarations = newDeclarations; 81 | } else { 82 | path.remove(); 83 | } 84 | } 85 | }); 86 | 87 | return generateCode(ast); 88 | } 89 | 90 | /* 91 | * Join binary expressions with string literals 92 | */ 93 | function stringJoin(source) { 94 | // Convert source code to an AST (Abstract Syntax Tree) 95 | const ast = parser.parse(source); 96 | 97 | function joinBinaryWithStringRecursively(node) { 98 | let left; 99 | if (t.isBinaryExpression(node.left)) { 100 | left = joinBinaryWithStringRecursively(node.left); 101 | } else { 102 | left = node.left; 103 | } 104 | 105 | let right; 106 | if (t.isBinaryExpression(node.right)) { 107 | right = joinBinaryWithStringRecursively(node.right); 108 | } else { 109 | right = node.right; 110 | } 111 | 112 | if (t.isStringLiteral(left) && t.isStringLiteral(right)) { 113 | return t.stringLiteral(left.value + right.value); 114 | } else { 115 | return node; 116 | } 117 | } 118 | 119 | // Replace binary expressions by their string concatenation 120 | // Use a recursive function 121 | traverse(ast, { 122 | BinaryExpression(path) { 123 | const node = path.node; 124 | 125 | path.replaceWith(joinBinaryWithStringRecursively(node)); 126 | } 127 | }); 128 | 129 | return generateCode(ast); 130 | } 131 | 132 | /* 133 | * Convert the string notation to the dot notation 134 | */ 135 | function convertStringNotationToDotNotation(source) { 136 | // Convert source code to an AST (Abstract Syntax Tree) 137 | const ast = parser.parse(source); 138 | 139 | traverse(ast, { 140 | MemberExpression(path) { 141 | const {node} = path; 142 | 143 | if (node.computed && 
t.isStringLiteral(node.property)) { 144 | node.property = t.identifier(node.property.value); 145 | node.computed = false; 146 | } 147 | } 148 | }); 149 | 150 | return generateCode(ast); 151 | } 152 | 153 | 154 | (async () => { 155 | let code = await fs.readFile('./tools/obfuscated.js', 'utf-8'); 156 | 157 | code = constantUnfolding(code); 158 | code = stringJoin(code); 159 | code = convertStringNotationToDotNotation(code); 160 | 161 | await fs.writeFile('./tools/deobfuscated.js', code, 'utf-8'); 162 | })() 163 | .catch(console.error); 164 | -------------------------------------------------------------------------------- /solutions/challenge-7.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | from dataclasses import dataclass, field 3 | from parsel import Selector 4 | from camoufox.async_api import AsyncCamoufox 5 | from typing import List, Iterator 6 | from urllib.parse import urljoin 7 | 8 | import asyncio 9 | import csv 10 | import logging 11 | 12 | 13 | logging.basicConfig( 14 | level=logging.INFO, 15 | format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' 16 | ) 17 | 18 | 19 | # Define data classes similar to those in scrapers/items.py 20 | @dataclass 21 | class ReviewItem: 22 | rating: float = field(default=None) 23 | 24 | 25 | @dataclass 26 | class HotelItem: 27 | name: str = field(default=None) 28 | email: str = field(default=None) 29 | reviews: List[ReviewItem] = field(default_factory=list) 30 | 31 | 32 | class TrekkyCamoufoxSpider: 33 | """Pure Camoufox implementation of the Trekky spider.""" 34 | start_url = "https://trekky-reviews.com/level9" 35 | 36 | logger = logging.getLogger(__name__) 37 | 38 | async def start(self) -> None: 39 | """Entry point to start the scraping process.""" 40 | if not self.start_url.endswith('/'): 41 | self.start_url += '/' 42 | 43 | tasks = [] 44 | # Create separate sessions 45 | for page_num in range(1, 10): 46 | task = asyncio.create_task(self.parse_homepage(page_num)) 47 | tasks.append(task) 48 | 49 | # Wait for all tasks to complete 50 | results = await asyncio.gather(*tasks, return_exceptions=True) 51 | 52 | # Process results 53 | hotels = [] 54 | for result in results: 55 | if isinstance(result, Exception): 56 | self.logger.error(f"Error occurred: {result}") 57 | elif result: 58 | hotels.extend(result) 59 | 60 | # Output results to a CSV file matching the required format 61 | with open('../results.csv', 'w', newline='') as f: 62 | writer = csv.writer(f) 63 | writer.writerow(['name', 'email', 'reviews']) 64 | for hotel in hotels: 65 | reviews_str = str([{"rating": review.rating} for review in hotel.reviews]) 66 | writer.writerow([hotel.name, hotel.email, reviews_str]) 67 | 68 | self.logger.info(f"Scraped {len(hotels)} hotels and saved to results.csv") 69 | 70 | async def parse_homepage(self, page_num) -> List[HotelItem]: 71 | """This method starts a session for the homepage and retrieves the list of hotels in Paris from page X.""" 72 | hotels = [] 73 | 74 | try: 75 | # Launch browser with camoufox 76 | async with AsyncCamoufox( 77 | headless=False, 78 | ) as browser: 79 | # Create a new context 80 | context = await browser.new_context( 81 | ignore_https_errors=True, 82 | ) 83 | 84 | # Open a new page and navigate to the homepage 85 | self.logger.info(f"Go to homepage for page {page_num}") 86 | 87 | page = await context.new_page() 88 | await page.route('**/*.{png,jpg,jpeg,svg,gif,css}', lambda route: route.abort()) 89 | await page.goto(self.start_url, 
wait_until='networkidle', timeout=60000) 90 | await page.wait_for_timeout(2000) 91 | 92 | # Navigate to the listing page and get the hotels 93 | url = urljoin(self.start_url, f"cities?city=paris&page={page_num}") 94 | async for hotel in self.parse_listing(page, url): 95 | hotels.append(hotel) 96 | 97 | # Close the page and context 98 | await page.close() 99 | await context.close() 100 | 101 | except Exception as e: 102 | self.logger.error(f"Error in session {page_num}: {e}") 103 | raise 104 | 105 | return hotels 106 | 107 | async def parse_listing(self, page, url) -> Iterator[HotelItem]: 108 | """This method parses the list of hotels.""" 109 | self.logger.info(f"Go to listing page: {url}") 110 | await page.goto(url, wait_until='networkidle', timeout=60000) 111 | 112 | # Wait longer for JavaScript to load content 113 | await page.wait_for_timeout(5000) 114 | 115 | selector = Selector(text=await page.content()) 116 | 117 | for link in selector.css('.hotel-link'): 118 | href = link.attrib.get('href') 119 | if href: 120 | hotel_url = urljoin(self.start_url, href) 121 | 122 | try: 123 | item = await self.parse_hotel(page, hotel_url) 124 | if item: 125 | yield item 126 | except Exception as e: 127 | self.logger.error(f"Error scraping hotel {hotel_url}: {e}") 128 | 129 | async def parse_hotel(self, page, url) -> HotelItem | None: 130 | """This method parses hotel details.""" 131 | try: 132 | self.logger.info(f"Go to hotel page: {url}") 133 | await page.goto(url, wait_until='commit', timeout=30000) 134 | await page.wait_for_selector('.hotel-name', timeout=10000) 135 | 136 | selector = Selector(text=await page.content()) 137 | 138 | reviews = [] 139 | for review_el in selector.css('.hotel-review'): 140 | review_item = self.get_review(review_el) 141 | if review_item: 142 | reviews.append(review_item) 143 | 144 | name = selector.css('.hotel-name::text').get() 145 | email = selector.css('.hotel-email::text').get() 146 | 147 | if name and email: 148 | return HotelItem(name=name.strip(), email=email.strip(), reviews=reviews) 149 | else: 150 | self.logger.warning(f"Incomplete hotel data extracted: name={name}, email={email}") 151 | return None 152 | 153 | except Exception as e: 154 | self.logger.error(f"Error extracting hotel info: {e}") 155 | return None 156 | 157 | def get_review(self, review_el) -> ReviewItem | None: 158 | """This method extracts rating from a review""" 159 | rating_text = review_el.css('.review-rating::text').get() 160 | try: 161 | rating = float(rating_text.strip()) if rating_text else None 162 | return ReviewItem(rating=rating) 163 | except (ValueError, TypeError): 164 | self.logger.warning(f"Invalid rating value: {rating_text}") 165 | return None 166 | 167 | 168 | 169 | 170 | async def main(): 171 | """Main entry point.""" 172 | spider = TrekkyCamoufoxSpider() 173 | await spider.start() 174 | 175 | 176 | if __name__ == "__main__": 177 | asyncio.run(main()) 178 | -------------------------------------------------------------------------------- /playwright_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | from dataclasses import dataclass, field 3 | from parsel import Selector 4 | from playwright.async_api import async_playwright 5 | from typing import List, Iterator 6 | from urllib.parse import urljoin 7 | 8 | import asyncio 9 | import csv 10 | import logging 11 | 12 | 13 | logging.basicConfig( 14 | level=logging.INFO, 15 | format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' 16 | ) 17 | 18 | 19 | 
@dataclass 20 | class ReviewItem: 21 | rating: float = field(default=None) 22 | 23 | 24 | @dataclass 25 | class HotelItem: 26 | name: str = field(default=None) 27 | email: str = field(default=None) 28 | reviews: List[ReviewItem] = field(default_factory=list) 29 | 30 | 31 | class TrekkyPlaywrightSpider: 32 | """Pure Playwright implementation of the Trekky spider.""" 33 | start_url = "https://trekky-reviews.com/level9" 34 | 35 | logger = logging.getLogger(__name__) 36 | 37 | async def start(self) -> None: 38 | """Entry point to start the scraping process.""" 39 | if not self.start_url.endswith('/'): 40 | self.start_url += '/' 41 | 42 | tasks = [] 43 | # Create separate sessions 44 | for page_num in range(1, 10): 45 | task = asyncio.create_task(self.parse_homepage(page_num)) 46 | tasks.append(task) 47 | 48 | # Wait for all tasks to complete 49 | results = await asyncio.gather(*tasks, return_exceptions=True) 50 | 51 | # Process results 52 | hotels = [] 53 | for result in results: 54 | if isinstance(result, Exception): 55 | self.logger.error(f"Error occurred: {result}") 56 | elif result: 57 | hotels.extend(result) 58 | 59 | # Output results to a CSV file matching the required format 60 | with open('results.csv', 'w', newline='') as f: 61 | writer = csv.writer(f) 62 | writer.writerow(['name', 'email', 'reviews']) 63 | for hotel in hotels: 64 | reviews_str = str([{"rating": review.rating} for review in hotel.reviews]) 65 | writer.writerow([hotel.name, hotel.email, reviews_str]) 66 | 67 | self.logger.info(f"Scraped {len(hotels)} hotels and saved to results.csv") 68 | 69 | async def parse_homepage(self, page_num) -> List[HotelItem]: 70 | """This method starts a session for the homepage and retrieves the list of hotels in Paris from page X.""" 71 | hotels = [] 72 | 73 | try: 74 | # Launch browser with playwright 75 | async with async_playwright() as p: 76 | # Launch browser 77 | browser = await p.chromium.launch(headless=False) 78 | context = await browser.new_context( 79 | ignore_https_errors=True, 80 | ) 81 | 82 | # Open a new page and navigate to the homepage 83 | self.logger.info(f"Go to homepage for page {page_num}") 84 | 85 | page = await context.new_page() 86 | await page.route('**/*.{png,jpg,jpeg,svg,gif,css}', lambda route: route.abort()) 87 | await page.goto(self.start_url, wait_until='networkidle', timeout=60000) 88 | await page.wait_for_timeout(2000) 89 | 90 | # Navigate to the listing page and get the hotels 91 | url = urljoin(self.start_url, f"cities?city=paris&page={page_num}") 92 | async for hotel in self.parse_listing(page, url): 93 | hotels.append(hotel) 94 | 95 | # Close the page and context 96 | await page.close() 97 | await context.close() 98 | await browser.close() 99 | 100 | except Exception as e: 101 | self.logger.error(f"Error in session {page_num}: {e}") 102 | raise 103 | 104 | return hotels 105 | 106 | async def parse_listing(self, page, url) -> Iterator[HotelItem]: 107 | """This method parses the list of hotels.""" 108 | self.logger.info(f"Go to listing page: {url}") 109 | await page.goto(url, wait_until='networkidle', timeout=60000) 110 | 111 | # Wait longer for JavaScript to load content 112 | await page.wait_for_timeout(5000) 113 | 114 | selector = Selector(text=await page.content()) 115 | 116 | for link in selector.css('.hotel-link'): 117 | href = link.attrib.get('href') 118 | if href: 119 | hotel_url = urljoin(self.start_url, href) 120 | 121 | try: 122 | item = await self.parse_hotel(page, hotel_url) 123 | if item: 124 | yield item 125 | except Exception as e: 126 | 
self.logger.error(f"Error scraping hotel {hotel_url}: {e}") 127 | 128 | async def parse_hotel(self, page, url) -> HotelItem | None: 129 | """This method parses hotel details.""" 130 | try: 131 | self.logger.info(f"Go to hotel page: {url}") 132 | await page.goto(url, wait_until='commit', timeout=30000) 133 | 134 | await page.wait_for_selector('.hotel-name', timeout=10000) 135 | 136 | selector = Selector(text=await page.content()) 137 | 138 | reviews = [] 139 | for review_el in selector.css('.hotel-review'): 140 | review_item = self.get_review(review_el) 141 | if review_item: 142 | reviews.append(review_item) 143 | 144 | name = selector.css('.hotel-name::text').get() 145 | email = selector.css('.hotel-email::text').get() 146 | 147 | if name and email: 148 | return HotelItem(name=name.strip(), email=email.strip(), reviews=reviews) 149 | else: 150 | self.logger.warning(f"Incomplete hotel data extracted: name={name}, email={email}") 151 | return None 152 | 153 | except Exception as e: 154 | self.logger.error(f"Error extracting hotel info: {e}") 155 | return None 156 | 157 | def get_review(self, review_el) -> ReviewItem | None: 158 | """This method extracts rating from a review""" 159 | rating_text = review_el.css('.review-rating::text').get() 160 | try: 161 | rating = float(rating_text.strip()) if rating_text else None 162 | return ReviewItem(rating=rating) 163 | except (ValueError, TypeError): 164 | self.logger.warning(f"Invalid rating value: {rating_text}") 165 | return None 166 | 167 | async def main(): 168 | """Main entry point.""" 169 | spider = TrekkyPlaywrightSpider() 170 | await spider.start() 171 | 172 | 173 | if __name__ == "__main__": 174 | asyncio.run(main()) 175 | -------------------------------------------------------------------------------- /scrapers/middlewares/retry.py: -------------------------------------------------------------------------------- 1 | """ 2 | An extension to retry failed requests that are potentially caused by temporary 3 | problems such as a connection timeout or HTTP 500 error. 4 | 5 | You can change the behaviour of this middleware by modifying the scraping settings: 6 | RETRY_TIMES - how many times to retry a failed page 7 | RETRY_HTTP_CODES - which HTTP response codes to retry 8 | 9 | Failed pages are collected on the scraping process and rescheduled at the end, 10 | once the spider has finished crawling all regular (non failed) pages. 11 | """ 12 | 13 | from __future__ import annotations 14 | 15 | import warnings 16 | from logging import Logger, getLogger 17 | from typing import TYPE_CHECKING, Any, Optional, Tuple, Type, Union 18 | 19 | from scrapy.crawler import Crawler 20 | from scrapy.exceptions import NotConfigured, ScrapyDeprecationWarning 21 | from scrapy.http import Response 22 | from scrapy.http.request import Request 23 | from scrapy.settings import BaseSettings, Settings 24 | from scrapy.spiders import Spider 25 | from scrapy.utils.misc import load_object 26 | from scrapy.utils.python import global_object_name 27 | from scrapy.utils.response import response_status_message 28 | 29 | from twisted.internet import reactor 30 | from twisted.internet.defer import Deferred 31 | 32 | if TYPE_CHECKING: 33 | # typing.Self requires Python 3.11 34 | from typing_extensions import Self 35 | 36 | retry_logger = getLogger(__name__) 37 | 38 | 39 | def backwards_compatibility_getattr(self: Any, name: str) -> Tuple[Any, ...]: 40 | if name == "EXCEPTIONS_TO_RETRY": 41 | warnings.warn( 42 | "Attribute RetryMiddleware.EXCEPTIONS_TO_RETRY is deprecated. 
" 43 | "Use the RETRY_EXCEPTIONS setting instead.", 44 | ScrapyDeprecationWarning, 45 | stacklevel=2, 46 | ) 47 | return tuple( 48 | load_object(x) if isinstance(x, str) else x 49 | for x in Settings().getlist("RETRY_EXCEPTIONS") 50 | ) 51 | raise AttributeError( 52 | f"{self.__class__.__name__!r} object has no attribute {name!r}" 53 | ) 54 | 55 | 56 | class BackwardsCompatibilityMetaclass(type): 57 | __getattr__ = backwards_compatibility_getattr 58 | 59 | 60 | def get_retry_request( 61 | request: Request, 62 | *, 63 | spider: Spider, 64 | reason: Union[str, Exception, Type[Exception]] = "unspecified", 65 | max_retry_times: Optional[int] = None, 66 | priority_adjust: Optional[int] = None, 67 | delay: int, 68 | logger: Logger = retry_logger, 69 | stats_base_key: str = "retry", 70 | ) -> Optional[Request]: 71 | """ 72 | Returns a new :class:`~scrapy.Request` object to retry the specified 73 | request, or ``None`` if retries of the specified request have been 74 | exhausted. 75 | 76 | For example, in a :class:`~scrapy.Spider` callback, you could use it as 77 | follows:: 78 | 79 | def parse(self, response): 80 | if not response.text: 81 | new_request_or_none = get_retry_request( 82 | response.request, 83 | spider=self, 84 | reason='empty', 85 | ) 86 | return new_request_or_none 87 | 88 | *spider* is the :class:`~scrapy.Spider` instance which is asking for the 89 | retry request. It is used to access the :ref:`settings ` 90 | and :ref:`stats `, and to provide extra logging context (see 91 | :func:`logging.debug`). 92 | 93 | *reason* is a string or an :class:`Exception` object that indicates the 94 | reason why the request needs to be retried. It is used to name retry stats. 95 | 96 | *max_retry_times* is a number that determines the maximum number of times 97 | that *request* can be retried. If not specified or ``None``, the number is 98 | read from the :reqmeta:`max_retry_times` meta key of the request. If the 99 | :reqmeta:`max_retry_times` meta key is not defined or ``None``, the number 100 | is read from the :setting:`RETRY_TIMES` setting. 101 | 102 | *priority_adjust* is a number that determines how the priority of the new 103 | request changes in relation to *request*. If not specified, the number is 104 | read from the :setting:`RETRY_PRIORITY_ADJUST` setting. 
105 | 106 | *logger* is the logging.Logger object to be used when logging messages 107 | 108 | *stats_base_key* is a string to be used as the base key for the 109 | retry-related job stats 110 | """ 111 | settings = spider.crawler.settings 112 | assert spider.crawler.stats 113 | stats = spider.crawler.stats 114 | retry_times = request.meta.get("retry_times", 0) + 1 115 | if max_retry_times is None: 116 | max_retry_times = request.meta.get("max_retry_times") 117 | if max_retry_times is None: 118 | max_retry_times = settings.getint("RETRY_TIMES") 119 | if retry_times <= max_retry_times: 120 | logger.debug( 121 | "Retrying %(request)s (failed %(retry_times)d times): %(reason)s", 122 | {"request": request, "retry_times": retry_times, "reason": reason}, 123 | extra={"spider": spider}, 124 | ) 125 | new_request: Request = request.copy() 126 | new_request.meta["retry_times"] = retry_times 127 | new_request.dont_filter = True 128 | if priority_adjust is None: 129 | priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST") 130 | new_request.priority = request.priority + priority_adjust 131 | 132 | new_request.meta["delay_request_by"] = delay 133 | 134 | if callable(reason): 135 | reason = reason() 136 | if isinstance(reason, Exception): 137 | reason = global_object_name(reason.__class__) 138 | 139 | stats.inc_value(f"{stats_base_key}/count") 140 | stats.inc_value(f"{stats_base_key}/reason_count/{reason}") 141 | return new_request 142 | stats.inc_value(f"{stats_base_key}/max_reached") 143 | logger.error( 144 | "Gave up retrying %(request)s (failed %(retry_times)d times): " "%(reason)s", 145 | {"request": request, "retry_times": retry_times, "reason": reason}, 146 | extra={"spider": spider}, 147 | ) 148 | return None 149 | 150 | 151 | class RetryMiddleware(metaclass=BackwardsCompatibilityMetaclass): 152 | def __init__(self, settings: BaseSettings): 153 | if not settings.getbool("RETRY_ENABLED"): 154 | raise NotConfigured 155 | self.max_retry_times = settings.getint("RETRY_TIMES") 156 | self.retry_http_codes = set( 157 | int(x) for x in settings.getlist("RETRY_HTTP_CODES") 158 | ) 159 | self.priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST") 160 | 161 | try: 162 | self.exceptions_to_retry = self.__getattribute__("EXCEPTIONS_TO_RETRY") 163 | except AttributeError: 164 | # If EXCEPTIONS_TO_RETRY is not "overridden" 165 | self.exceptions_to_retry = tuple( 166 | load_object(x) if isinstance(x, str) else x 167 | for x in settings.getlist("RETRY_EXCEPTIONS") 168 | ) 169 | 170 | @classmethod 171 | def from_crawler(cls, crawler: Crawler) -> Self: 172 | return cls(crawler.settings) 173 | 174 | def process_request(self, request, spider): 175 | delay_s = request.meta.get('delay_request_by', None) 176 | if not delay_s: 177 | return 178 | 179 | try: 180 | delay = float(delay_s) 181 | except ValueError: 182 | spider.logger.error("Invalid delay value %s on URL %s", delay_s, request.url) 183 | return 184 | 185 | deferred = Deferred() 186 | reactor.callLater(delay, deferred.callback, None) 187 | return deferred 188 | 189 | def process_response( 190 | self, request: Request, response: Response, spider: Spider 191 | ) -> Union[Request, Response]: 192 | if request.meta.get("dont_retry", False): 193 | return response 194 | if response.status in self.retry_http_codes: 195 | reason = response_status_message(response.status) 196 | return self._retry(request, reason, spider) or response 197 | return response 198 | 199 | def process_exception( 200 | self, request: Request, exception: Exception, spider: Spider 201 | ) 
-> Union[Request, Response, None]: 202 | if isinstance(exception, self.exceptions_to_retry) and not request.meta.get( 203 | "dont_retry", False 204 | ): 205 | return self._retry(request, exception, spider) 206 | return None 207 | 208 | def _retry( 209 | self, 210 | request: Request, 211 | reason: Union[str, Exception, Type[Exception]], 212 | spider: Spider, 213 | ) -> Optional[Request]: 214 | max_retry_times = request.meta.get("max_retry_times", self.max_retry_times) 215 | priority_adjust = request.meta.get("priority_adjust", self.priority_adjust) 216 | delay = int(request.meta.get('delay_request_by', 10) * 2) 217 | 218 | spider.logger.info("Delaying request on URL %s by %d seconds", request.url, delay) 219 | 220 | return get_retry_request( 221 | request, 222 | reason=reason, 223 | spider=spider, 224 | max_retry_times=max_retry_times, 225 | priority_adjust=priority_adjust, 226 | delay=delay 227 | ) 228 | 229 | __getattr__ = backwards_compatibility_getattr 230 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Fabien's WebScraping Anti-Ban Workshop 2 | 3 | ![Header](header.jpg) 4 | 5 | 6 | ## Introduction 7 | 8 | Our goal is to understand how anti-bot protections work and how to bypass them. 9 | 10 | I created a dedicated website for this workshop [https://trekky-reviews.com](https://trekky-reviews.com). 11 | This website provides a list of hotels for every city, including reviews. 12 | 13 | We will collect **name, email and reviews** for each hotel. 14 | 15 | During this workshop, we will use the following open-source software: 16 | 17 | | Framework | Description | 18 | |--------------------------------------|------------------------------------------------------------------------------| 19 | | [Scrapy](https://scrapy.org) | the leading framework for web scraping | 20 | | [Scrapoxy](https://scrapoxy.io) | the super proxies aggregator | 21 | | [Playwright](https://playwright.dev) | the latest headless browser framework that integrates seamlessly with Scrapy | 22 | | [Babel.js](https://babeljs.io) | a transpiler used for deobfuscation purposes | 23 | 24 | The scraper can be found at [scrapers/spiders/trekky.py](scrapers/spiders/trekky.py). 25 | 26 | All solutions are located in [solutions](solutions). 27 | If you have any difficulties implementing a solution, feel free to copy and paste it. 28 | However, I recommend taking some time to search and explore to get the most out of the workshop, rather than rushing through it in 10 minutes. 29 | 30 | 31 | ## Preflight Checklist 32 | 33 | ### VirtualBox (Linux and Windows) 34 | 35 | To simplify the installation process, 36 | I've pre-configured an Ubuntu virtual machine for you with 37 | all the necessary dependencies for this workshop. 38 | 39 | 40 | 41 | 44 | 51 | 52 |
42 | 43 | 45 | This virtual machine is compatible only with AMD64 architecture (Linux, Windows, and Intel-based macOS). 46 | 47 | For macOS M1 (ARM64), please manually install the dependencies. 48 | 49 | On Windows, avoid using WSL2 (it doesn't work with Playwright) 50 |
53 | 54 | You can download it **[from this link](https://bit.ly/scwsfiles).** 55 | 56 | The virtual machine is in OVA format and can be easily imported into [VirtualBox](https://www.virtualbox.org). 57 | 58 | It requires 8 GiB RAM and 2 vCPU. 59 | 60 | Click on `Import Appliance` and choose the OVA file you downloaded. 61 | 62 | Credentials are: `vboxguest / changeme`. 63 | 64 | I recommend switching the network setting from NAT to **Bridge Adapter** for improved performance. 65 | 66 | _Note: If the network is too slow, I have USB drives available with the VM._ 67 | 68 | 69 | ### Full Installation (Linux, Windows, and macOS) 70 | 71 | You can manually install the required dependencies, which include: 72 | 73 | - Python (version 3 or higher) with virtualenv 74 | - Node.js (version 20 or higher) 75 | - Docker 76 | 77 | #### Python 78 | 79 | If you need Python, I recommend using [Anaconda](https://www.anaconda.com/download). 80 | 81 | To install a Virtual Environment, run the following command: 82 | 83 | ```shell 84 | python3 -m venv venv 85 | source venv/bin/activate 86 | ``` 87 | 88 | 89 | #### Node.js 90 | 91 | To install Node.js, follow the instructions on: https://nodejs.org/en/download 92 | 93 | 94 | #### Docker 95 | 96 | To install Docker, follow the instructions for 97 | [Mac](https://docs.docker.com/desktop/install/mac-install), 98 | [Windows](https://docs.docker.com/desktop/install/windows-install) or 99 | [Linux](https://docs.docker.com/desktop/install/linux/). 100 | 101 | 102 | ## Setting up 103 | 104 | This step is necessary even if you are using the VM. 105 | 106 | 107 | ### Step 1: Clone the Repository 108 | 109 | Clone this repository: 110 | 111 | ```shell 112 | git clone https://github.com/scrapoxy/scraping-workshop.git 113 | cd scraping-workshop 114 | ``` 115 | 116 | 117 | ### Step 2: Install Python libraries 118 | 119 | Open a shell and install libraries: 120 | 121 | ```shell 122 | pip install -r requirements.txt 123 | ``` 124 | 125 | 126 | ### Step 3: Install Playwright 127 | 128 | After installing the Python libraries, run the follow command: 129 | 130 | ```shell 131 | playwright install --with-deps chromium 132 | ``` 133 | 134 | 135 | ### Step 4: Install Node.js 136 | 137 | Install Node.js from the [official website](https://nodejs.org/en/download/) or through the version management [NVM](https://github.com/nvm-sh/nvm) 138 | 139 | 140 | ### Step 5: Install Node.js libraries 141 | 142 | Open a shell and install libraries from `package.json`: 143 | 144 | ```shell 145 | npm install 146 | ``` 147 | 148 | 149 | ### Step 6: Scrapoxy 150 | 151 | Run the following command to download Scrapoxy: 152 | 153 | ```shell 154 | sudo docker pull scrapoxy/scrapoxy 155 | ``` 156 | 157 | 158 | ## Challenge 1: Run your first Scraper 159 | 160 | The URL to scrape is: [https://trekky-reviews.com/level1](https://trekky-reviews.com/level1) 161 | 162 | Our goal is to collect **names, emails, and reviews** for each hotel listed. 163 | 164 | Open the file [`scrapers/spiders/trekky.py`](scrapers/spiders/trekky.py). 165 | 166 | In Scrapy, a spider is a Python class with a unique `name` property. Here, the name is `trekky`. 167 | 168 | The spider class includes a method called `start_requests`, which defines the initial URLs to scrape. 169 | When a URL is fetched, the Scrapy engine triggers a callback function. 170 | This callback function handles the parsing of the data. 
171 | It's also possible to generate new requests from within the callback function, allowing for chaining of requests and callbacks. 172 | 173 | The structure of items is defined in the file [`scrapers/items.py`](scrapers/items.py). 174 | Each item type is represented by a dataclass containing fields and a loader: 175 | 176 | * `HotelItem`: name, email, review with the loader `HotelItemLoader` 177 | * `ReviewItem`: rating with the loader `ReviewItemLoader` 178 | 179 | To run the spider, open a terminal at the project's root directory and run the following command: 180 | 181 | ```shell 182 | scrapy crawl trekky 183 | ``` 184 | 185 | Scrapy will collect data from **50 hotels**: 186 | 187 | ```text 188 | 2024-07-05 23:11:43 [trekky] INFO: 189 | 190 | We got: 50 items 191 | 192 | ``` 193 | 194 | Check the `results.csv` file to confirm that all items were collected. 195 | 196 | 197 | ## Challenge 2: First Anti-Bot protection 198 | 199 | The URL to scrape is: [https://trekky-reviews.com/level2](https://trekky-reviews.com/level2) 200 | 201 | Update the URL in your scraper to target the new page and execute the spider: 202 | 203 | ```shell 204 | scrapy crawl trekky 205 | ``` 206 | 207 | Data collection may fail due to **an anti-bot system**. 208 | 209 | Pay attention to the **messages** explaining why access is blocked and use this information to correct the scraper. 210 | 211 | _Hint: It relates to HTTP request headers 😉_ 212 | 213 |
214 | Solution is here 215 | Open the solution 216 |
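For reference, the fix usually boils down to making the spider's requests carry browser-like headers. The snippet below is only an illustrative sketch (the header values are assumptions copied from a typical Chrome session, and it is not the content of `solutions/challenge-2.py`); the exact headers the site checks are yours to discover with the Network Inspector.

```python
import scrapy


class TrekkySpider(scrapy.Spider):
    name = "trekky"

    # Illustrative values only: copy the headers your own browser sends
    # (Network Inspector -> any request -> Request Headers).
    custom_settings = {
        "USER_AGENT": (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
        ),
        "DEFAULT_REQUEST_HEADERS": {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
        },
    }
```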
217 | 218 | 219 | ## Challenge 3: Rate limit 220 | 221 | The URL to scrape is: [https://trekky-reviews.com/level4](https://trekky-reviews.com/level4) (we will skip level3) 222 | 223 | Update the URL in your scraper to target the new page and execute the spider: 224 | 225 | ```shell 226 | scrapy crawl trekky 227 | ``` 228 | 229 | Data collection might fail due to **rate limiting** on our IP address. 230 | 231 | 232 | 233 | 236 | 241 | 242 |
234 | 235 | 237 | Please don't adjust the delay between requests or the number of concurrent requests; that is not our goal. 238 | Imagine we need to collect millions of items within a few hours, and delaying our scraping session is not an option. 239 | Instead, we will use proxies to distribute requests across multiple IP addresses. 240 |
243 | 244 | Use [Scrapoxy](https://scrapoxy.io) to bypass the rate limit with a cloud provider (not a proxy service). 245 | 246 | 247 | ### Step 1: Install Scrapoxy 248 | 249 | Follow this [guide](https://scrapoxy.io/intro/get-started) or run the following command in the project directory: 250 | 251 | ```shell 252 | sudo docker run -p 8888:8888 -p 8890:8890 -e AUTH_LOCAL_USERNAME=admin -e AUTH_LOCAL_PASSWORD=password -e BACKEND_JWT_SECRET=secret1 -e FRONTEND_JWT_SECRET=secret2 -e STORAGE_FILE_FILENAME=/scrapoxy.json -v ./scrapoxy.json:/scrapoxy.json scrapoxy/scrapoxy:latest 253 | ``` 254 | 255 | 256 | ### Step 2: Create a new project 257 | 258 | In the new project, keep the default settings and click the `Create` button: 259 | 260 | ![Scrapoxy Project Create](images/scrapoxy-project-create.png) 261 | 262 | 263 | ### Step 3: Add a Proxy Provider 264 | 265 | See the slides to set up the proxies provider account. 266 | 267 | Use **10 proxies** from the **United States of America**: 268 | 269 | ![Scrapoxy Connector Create](images/scrapoxy-connector-create.png) 270 | 271 | If you don't have an account with these cloud providers, you can create one. 272 | 273 | 274 | 275 | 278 | 282 | 283 |
276 | 277 | 279 | They typically require a credit card, and you may need to pay a nominal fee of $1 or $2 for this workshop. 280 | Such charges are common when using proxies. Don't worry; in the next challenge, I'll provide you with free credit. 281 |
284 | 285 | 286 | ### Step 4: Run the connector 287 | 288 | ![Scrapoxy Connector Run](images/scrapoxy-connector-run.png) 289 | 290 | 291 | ### Step 5: Connect Scrapoxy to the spider 292 | 293 | Follow this [guide](https://scrapoxy.io/integration/python/scrapy/guide). 294 | 295 | 296 | ### Step 6: Execute the spider 297 | 298 | Run your spider and check that Scrapoxy is handling the requests: 299 | 300 | ![Scrapoxy Proxies](images/scrapoxy-proxies.png) 301 | 302 | You should observe an increase in the count of received and sent requests. 303 | 304 |
305 | Solution is here 306 | Open the solution 307 |
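For orientation, the Scrapy side of the integration is usually just a few lines in `scrapers/settings.py`. The middleware path and setting names below follow my reading of the Scrapoxy integration guide and the credentials are placeholders (each Scrapoxy project generates its own token), so double-check everything against the guide linked in Step 5.

```python
# Sketch only: names as documented in the Scrapoxy integration guide.
# The username/password come from the project's token in the Scrapoxy UI,
# not from the web-UI login used to start the container.
DOWNLOADER_MIDDLEWARES = {
    "scrapoxy.ProxyDownloaderMiddleware": 100,
}

SCRAPOXY_MASTER = "http://localhost:8888"     # proxy endpoint
SCRAPOXY_API = "http://localhost:8890/api"    # commander API
SCRAPOXY_USERNAME = "PROJECT_TOKEN_USERNAME"  # placeholder
SCRAPOXY_PASSWORD = "PROJECT_TOKEN_PASSWORD"  # placeholder
```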
308 | 309 | 310 | ## Challenge 4: Fingerprint 311 | 312 | The URL to scrape is: [https://trekky-reviews.com/level6](https://trekky-reviews.com/level6) (we will skip level5). 313 | 314 | Update the URL in your scraper to target the new page and execute the spider: 315 | 316 | ```shell 317 | scrapy crawl trekky 318 | ``` 319 | 320 | Data collection may fail due to **fingerprinting**. 321 | 322 | Use the Network Inspector in your browser to view all requests. 323 | You will notice a new request appearing, which is a **POST request** instead of a GET request. 324 | 325 | ![Chrome Network Inspector List](images/chrome-network-inspector-list.png) 326 | 327 | Inspect the website's code to identify the **JavaScript** that triggers this request. 328 | 329 | ![Chrome Network Inspector](images/chrome-network-inspector-initiator.png) 330 | 331 | It's clear that we need to **execute JavaScript**. 332 | Simply using Scrapy to send HTTP requests is not enough. 333 | 334 | Also, it's important to maintain the **same IP address** throughout the session. 335 | Scrapoxy offers a **sticky session** feature for this purpose when using a browser. 336 | 337 | Navigate to the Project options and enable both `Intercept HTTPS requests with MITM` 338 | and `Keep the same proxy with cookie injection`: 339 | 340 | ![Scrapoxy Project Update](images/scrapoxy-project-update.png) 341 | 342 | We will use the headless framework [Playwright](https://playwright.dev) along with Scrapy's plugin [scrapy-playwright](https://github.com/scrapy-plugins/scrapy-playwright). 343 | 344 | 345 | 346 | 349 | 352 | 353 |
347 | 348 | 350 | scrapy-playwright should already be installed. 351 |
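The next paragraphs describe the required changes in detail. As a rough preview, the scrapy-playwright wiring in `settings.py` generally looks like the sketch below; the proxy port and credentials are placeholders for your Scrapoxy setup, and `solutions/challenge-4.py` remains the reference implementation.

```python
# Preview sketch of the settings described below (placeholder values).
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,  # watch what the browser is doing
    "proxy": {
        "server": "http://localhost:8888",     # Scrapoxy master
        "username": "PROJECT_TOKEN_USERNAME",  # placeholder
        "password": "PROJECT_TOKEN_PASSWORD",  # placeholder
    },
}
```

Each request then opts in with `meta={"playwright": True}`, plus the HTTPS-error option mentioned below.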
354 | 355 | Our goal is to adapt the spider to integrate Playwright. 356 | 357 | You can now completely replace the code in [solutions/challenge-4.py](solutions/challenge-4.py) 358 | due to extensive modifications needed. 359 | 360 | The updates include: 361 | 362 | * **Settings**: Updated to add a custom `DOWNLOAD_HANDLERS`. 363 | * **Playwright Configuration**: `PLAYWRIGHT_LAUNCH_OPTIONS` now: 364 | * Disables headless mode, allowing you to view Playwright’s actions. 365 | * Configures a proxy to route traffic through Scrapoxy. 366 | * **Request Metadata**: Each request now includes metadata to enable Playwright and ignore HTTPS errors (using ignore_https_errors). 367 | 368 | 369 | ## Challenge 5: Consistency 370 | 371 | The URL to scrape is: [https://trekky-reviews.com/level7](https://trekky-reviews.com/level7) 372 | 373 | Update the URL in your scraper to target the new page and execute the spider: 374 | 375 | ```shell 376 | scrapy crawl trekky 377 | ``` 378 | 379 | You will notice that data collection may fail due to **inconsistency** errors. 380 | 381 | Anti-bot checks consistency across various layers of the browser stack. 382 | 383 | Try to solve these errors! 384 | 385 | _Hint: It involves adjusting timezones 😉_ 386 | 387 |
388 | Solution is here 389 | Open the solution 390 |
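For context, the kind of adjustment the hint points at is aligning what the browser reports (timezone, locale) with the country of the proxy's exit IP, since the US proxies from Challenge 3 would otherwise contradict a European timezone. In plain Playwright this is a context option, as in the sketch below; the values are examples for US exit IPs, and how you pass the same options through scrapy-playwright is part of the exercise.

```python
import asyncio

from playwright.async_api import async_playwright


async def demo() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        # Align the reported timezone/locale with the proxy's exit country.
        # Example values only: pick a timezone matching your proxies.
        context = await browser.new_context(
            timezone_id="America/New_York",
            locale="en-US",
            ignore_https_errors=True,
        )
        page = await context.new_page()
        await page.goto("https://trekky-reviews.com/level7")
        await browser.close()


asyncio.run(demo())
```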
391 | 392 | 393 | ## Challenge 6: Deobfuscation 394 | 395 | The URL to scrape is: [https://trekky-reviews.com/level8](https://trekky-reviews.com/level8) 396 | 397 | Update the URL in your scraper to target the new page and execute the spider: 398 | 399 | ```shell 400 | scrapy crawl trekky 401 | ``` 402 | 403 | ### Step 1: Find the Anti-Bot JavaScript 404 | 405 | Use the **Network Inspector** to review all requests. 406 | Among them, you'll spot some unusual ones. 407 | By inspecting the payload, you'll notice that the content is **encrypted**: 408 | 409 | ![Chrome Network Inspector - List 2](images/chrome-network-inspector-list2.png) 410 | 411 | Inspect the website's code to find the JavaScript responsible for sending these requests. 412 | In this case, the source code is obfuscated. 413 | 414 | The obfuscated code looks like this: 415 | 416 | ```javascript 417 | var _qkrt1f=window,_uqvmii="tdN",_u5zh1i="UNM",_p949z3="on",_eu2jji="en",_vnsd5q="bto",_bi4e9="a",_f1e79r="e",_w13dld="ode",_vbg0l7="RSA-",_6uh486="ki"... 418 | ``` 419 | 420 | To understand which information is being sent and how to emulate it, we need to **deobfuscate the code**. 421 | 422 | 423 | ### Step 2: Understand the code structure 424 | 425 | To understand the structure of the code, copy/paste some of it into the website [AST Explorer](https://astexplorer.net). 426 | 427 | Don't forget to select `@babel/parser` and enable `Transform`: 428 | 429 | ![AST Explorer Header](images/ast-header.png) 430 | 431 | AST Explorer parses the source code and generates a visual tree: 432 | 433 | ![AST Explorer UI](images/ast-ui.png) 434 | 435 | 436 | 437 | 440 | 443 | 444 |
438 | 439 | 441 | For the record, I only obfuscated strings, not the code flow. 442 |
445 | 446 | 447 | ### Step 3: Deobfuscate the JavaScript 448 | 449 | Copy/paste the whole obfuscated code into `tools/obfuscated.js`. 450 | 451 | Then run the deobfuscator script: 452 | 453 | ```shell 454 | node tools/deobfuscator.js 455 | ``` 456 | 457 | This script is written specifically to deobfuscate this code. 458 | 459 | 460 | 461 | 464 | 472 | 473 |
462 | 463 | 465 | You can use online tools to deobfuscate this script, 466 | given that it's a straightforward obfuscated script. 467 | Also, GitHub Copilot 468 | can be incredibly helpful in writing AST operations, just as 469 | Claude Sonnet 3.5 470 | is valuable for deciphering complex functions. 471 |
474 | 475 | 476 | ### Step 4: Payload generation 477 | 478 | Here’s a summary of the script’s behavior: 479 | 480 | 1. It collects **WebGL information**; 481 | 2. It encrypts the data using **RSA encryption** with an obfuscated public key; 482 | 3. It sends the payload to the webserver via a **POST request**. 483 | 484 | We need to implement the same approach in our spider. 485 | 486 | Since we will be crafting the payload ourselves, there is **no need** to use Playwright anymore. 487 | We will simply send the payload **before** initiating any requests. 488 | 489 | You can now completely replace the code in [solutions/challenge-6-1-partial.py](solutions/challenge-6-1-partial.py) 490 | and fill in the missing parts. 491 | 492 |
493 | Solution is here 494 | Open the solution 495 |
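To make the partial solution easier to approach, here is the general shape of the payload step: build a WebGL-like fingerprint, encrypt it with the RSA public key recovered from the deobfuscated script, then POST the result before crawling. Everything below is a placeholder sketch; the field names, key, and padding scheme must be replaced with what you actually found during deobfuscation, and the workshop's solution files remain the reference.

```python
import base64
import json

from Crypto.Cipher import PKCS1_OAEP
from Crypto.PublicKey import RSA

# Placeholder: paste the public key extracted from the deobfuscated JavaScript.
PUBLIC_KEY_PEM = b"""-----BEGIN PUBLIC KEY-----
...key recovered during deobfuscation...
-----END PUBLIC KEY-----"""


def build_payload() -> str:
    # Placeholder fields: mirror whatever WebGL properties the script collects.
    fingerprint = {
        "webgl_vendor": "Google Inc. (Intel)",
        "webgl_renderer": "ANGLE (Intel, Mesa Intel(R) UHD Graphics, OpenGL 4.6)",
    }
    # Assumes RSA-OAEP padding; confirm against the deobfuscated code.
    cipher = PKCS1_OAEP.new(RSA.import_key(PUBLIC_KEY_PEM))
    encrypted = cipher.encrypt(json.dumps(fingerprint).encode("utf-8"))
    return base64.b64encode(encrypted).decode("ascii")
```

The spider then sends this string in a POST request (for example with `scrapy.Request(..., method="POST", body=...)`) before yielding the regular page requests.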
496 | 497 | 498 | ## Challenge 7: Playwright Detection 499 | 500 | The URL to scrape is: [https://trekky-reviews.com/level9](https://trekky-reviews.com/level9) 501 | 502 | For this challenge, use the pure Playwright scraper in [playwright_spider.py](playwright_spider.py) directly (don't use Scrapy). 503 | 504 | Run it: 505 | 506 | ```shell 507 | python playwright_spider.py 508 | ``` 509 | 510 | You will notice that data collection may fail due to **Playwright** detection. 511 | 512 | The anti-bot checks whether the CDP protocol is in use or the network inspector is open. 513 | 514 | Try replacing Playwright with another framework! 515 | 516 | _Hint: Use Camoufox 😉_ 517 | 518 |
519 | Solution is here 520 | Open the solution 521 |
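The heart of the fix is already visible in `solutions/challenge-7.py` above: the Playwright launch is swapped for Camoufox, which exposes a Playwright-compatible browser object, so the navigation and parsing code barely changes. A stripped-down sketch of that swap (the Camoufox package must be installed separately, as it is not in `requirements.txt`):

```python
import asyncio

from camoufox.async_api import AsyncCamoufox  # replaces playwright.async_api


async def demo() -> None:
    # Same structure as the Playwright version, but the browser is Camoufox.
    async with AsyncCamoufox(headless=False) as browser:
        context = await browser.new_context(ignore_https_errors=True)
        page = await context.new_page()
        await page.goto("https://trekky-reviews.com/level9")
        await context.close()


asyncio.run(demo())
```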
522 | 523 | 524 | ## Conclusion 525 | 526 | Thank you so much for participating in this workshop. 527 | 528 | Your feedback is incredibly valuable to me. 529 | Please take a moment to complete this survey; your insights will greatly assist in enhancing future workshops: 530 | 531 | 👉 [GO TO SURVEY](https://bit.ly/scwsv) 👈 532 | 533 | 534 | ## Licence 535 | 536 | WebScraping Anti-Ban Workshop © 2024 by [Fabien Vauchelles](https://www.linkedin.com/in/fabienvauchelles) is licensed under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/?ref=chooser-v1): 537 | 538 | * Credit must be given to the creator; 539 | * Only noncommercial use of your work is permitted; 540 | * No derivatives or adaptations of your work are permitted. 541 | 542 | --------------------------------------------------------------------------------