├── scrapoxy.json
├── scrapers
    ├── __init__.py
    ├── middlewares
    │   ├── __init__.py
    │   ├── info.py
    │   └── retry.py
    ├── pipelines
    │   ├── __init__.py
    │   └── csv.py
    ├── spiders
    │   ├── __init__.py
    │   └── trekky.py
    ├── settings.py
    ├── items.py
    └── utils.py
├── tools
    ├── deobfuscated.js
    ├── obfuscated.js
    └── deobfuscator.js
├── header.jpg
├── images
    ├── info.png
    ├── note.png
    ├── ast-ui.png
    ├── warning.png
    ├── ast-header.png
    ├── scrapoxy-proxies.png
    ├── scrapoxy-connector-run.png
    ├── scrapoxy-project-create.png
    ├── scrapoxy-project-update.png
    ├── scrapoxy-connector-create.png
    ├── chrome-network-inspector-list.png
    ├── chrome-network-inspector-list2.png
    └── chrome-network-inspector-initiator.png
├── scrapy.cfg
├── requirements.txt
├── .idea
    ├── vcs.xml
    ├── .gitignore
    ├── modules.xml
    ├── scraping-workshop.iml
    ├── runConfigurations
    │   ├── deobfuscator_js.xml
    │   └── trekky.xml
    ├── misc.xml
    └── inspectionProfiles
    │   └── Project_Default.xml
├── .editorconfig
├── package.json
├── .vscode
    ├── settings.json
    └── launch.json
├── .gitignore
├── solutions
    ├── challenge-2.py
    ├── challenge-3.py
    ├── challenge-6-1-partial.py
    ├── challenge-6-2.py
    ├── challenge-4.py
    ├── challenge-5.py
    └── challenge-7.py
├── playwright_spider.py
└── README.md
/scrapoxy.json:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/scrapers/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/tools/deobfuscated.js:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/tools/obfuscated.js:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/scrapers/middlewares/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/scrapers/pipelines/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/header.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/header.jpg
--------------------------------------------------------------------------------
/images/info.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/info.png
--------------------------------------------------------------------------------
/images/note.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/note.png
--------------------------------------------------------------------------------
/images/ast-ui.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/ast-ui.png
--------------------------------------------------------------------------------
/images/warning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/warning.png
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
1 | [settings]
2 | default = scrapers.settings
3 | 
4 | [deploy]
5 | project = scrapers
6 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pycryptodome==3.23.0
2 | scrapoxy==2.1.1
3 | Scrapy==2.13.3
4 | scrapy_playwright
--------------------------------------------------------------------------------
/images/ast-header.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/ast-header.png
--------------------------------------------------------------------------------
/images/scrapoxy-proxies.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/scrapoxy-proxies.png
--------------------------------------------------------------------------------
/images/scrapoxy-connector-run.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/scrapoxy-connector-run.png
--------------------------------------------------------------------------------
/images/scrapoxy-project-create.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/scrapoxy-project-create.png
--------------------------------------------------------------------------------
/images/scrapoxy-project-update.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/scrapoxy-project-update.png
--------------------------------------------------------------------------------
/images/scrapoxy-connector-create.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/scrapoxy-connector-create.png
--------------------------------------------------------------------------------
/images/chrome-network-inspector-list.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/chrome-network-inspector-list.png
--------------------------------------------------------------------------------
/images/chrome-network-inspector-list2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/chrome-network-inspector-list2.png
--------------------------------------------------------------------------------
/images/chrome-network-inspector-initiator.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapoxy/scraping-workshop/HEAD/images/chrome-network-inspector-initiator.png
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
6 | 
--------------------------------------------------------------------------------
/scrapers/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 | 
--------------------------------------------------------------------------------
/.idea/.gitignore:
--------------------------------------------------------------------------------
1 | # Default ignored files
2 | /shelf/
3 | /workspace.xml
4 | # Editor-based HTTP Client requests
5 | /httpRequests/
6 | # Datasource local storage ignored files
7 | /dataSources/
8 | /dataSources.local.xml
9 | 
--------------------------------------------------------------------------------
/.idea/modules.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
6 | 
7 | 
8 | 
--------------------------------------------------------------------------------
/.idea/scraping-workshop.iml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
6 | 
7 | 
8 | 
9 | 
--------------------------------------------------------------------------------
/.idea/runConfigurations/deobfuscator_js.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
--------------------------------------------------------------------------------
/.editorconfig:
--------------------------------------------------------------------------------
1 | # Editor configuration, see https://editorconfig.org
2 | root = true
3 | 
4 | [*]
5 | charset = utf-8
6 | indent_style = space
7 | indent_size = 4
8 | insert_final_newline = true
9 | end_of_line = lf
10 | trim_trailing_whitespace = true
11 | 
12 | [*.ts]
13 | quote_type = single
14 | 
15 | [*.yml]
16 | indent_size = 2
17 | 
18 | [*.md]
19 | max_line_length = off
20 | trim_trailing_whitespace = false
21 | 
--------------------------------------------------------------------------------
/.idea/misc.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
6 | 
7 | 
8 | 
9 | 
--------------------------------------------------------------------------------
/package.json:
--------------------------------------------------------------------------------
1 | {
2 |     "name": "scraping-workshop",
3 |     "version": "1.0.0",
4 |     "description": "",
5 |     "type": "module",
6 |     "private": false,
7 |     "engines": {
8 |         "node": ">= 20.0.0",
9 |         "npm": ">= 6.0.0"
10 |     },
11 |     "scripts": {
12 |         "dev": "",
13 |         "test": ""
14 |     },
15 |     "dependencies": {
16 |         "@babel/generator": "~7.23.0",
17 |         "@babel/parser": "~7.23.0",
18 |         "@babel/traverse": "~7.23.0"
19 |     }
20 | }
21 | 
--------------------------------------------------------------------------------
/.idea/inspectionProfiles/Project_Default.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
14 | 
--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
1 | {
2 |     "editor.formatOnSave": true,
3 |     "python.linting.enabled": true,
4 |     "python.linting.pylintEnabled": true,
5 |     "python.defaultInterpreterPath": "/home/vboxguest/venv/bin/python",
6 |     "python.analysis.extraPaths": [
7 |         "${workspaceFolder}"
8 |     ],
9 |     "files.exclude": {
10 |         "**/__pycache__": true,
11 |         "**/.pytest_cache": true,
"**/.pytest_cache": true, 12 | "**/*.pyc": true 13 | }, 14 | "javascript.format.enable": true, 15 | "javascript.validate.enable": true, 16 | "[python]": { 17 | "editor.tabSize": 4 18 | }, 19 | "[javascript]": { 20 | "editor.tabSize": 2 21 | } 22 | } 23 | -------------------------------------------------------------------------------- /scrapers/settings.py: -------------------------------------------------------------------------------- 1 | BOT_NAME = "scrapers" 2 | 3 | SPIDER_MODULES = ["scrapers.spiders"] 4 | NEWSPIDER_MODULE = "scrapers.spiders" 5 | 6 | CONCURRENT_REQUESTS = 5 7 | DOWNLOAD_TIMEOUT = 10 8 | 9 | SPIDER_MIDDLEWARES = { 10 | 'scrapers.middlewares.info.InfoSpiderMiddleware': 40, 11 | } 12 | 13 | ITEM_PIPELINES = { 14 | 'scrapers.pipelines.csv.SaveToCsvPipeline': 300, 15 | } 16 | 17 | REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7" 18 | TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" 19 | FEED_EXPORT_ENCODING = "utf-8" 20 | 21 | # Prevent Scrapy from overriding Chrome's default HTTP headers. 22 | PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None 23 | -------------------------------------------------------------------------------- /scrapers/middlewares/info.py: -------------------------------------------------------------------------------- 1 | from scrapy import signals 2 | 3 | class InfoSpiderMiddleware: 4 | ###This spider middleware class logs the number of scraped items when the spider is closed.### 5 | def __init__(self, stats): 6 | self.stats = stats 7 | 8 | @classmethod 9 | def from_crawler(cls, crawler): 10 | s = cls(crawler.stats) 11 | crawler.signals.connect(s.spider_closed, signal=signals.spider_closed) 12 | return s 13 | 14 | def spider_closed(self, spider): 15 | count = self.stats.get_value("item_scraped_count", 0, spider=spider) 16 | if count > 0: 17 | if count > 1: 18 | spider.logger.info(f"\n\nWe got: {count} items\n") 19 | else: 20 | spider.logger.info("\n\nWe got: 1 item\n") 21 | -------------------------------------------------------------------------------- /.vscode/launch.json: -------------------------------------------------------------------------------- 1 | { 2 | "version": "0.2.0", 3 | "configurations": [ 4 | { 5 | "name": "deobfuscator.js", 6 | "type": "node", 7 | "request": "launch", 8 | "program": "${workspaceFolder}/tools/deobfuscator.js", 9 | "cwd": "${workspaceFolder}", 10 | "skipFiles": [ 11 | "/**" 12 | ] 13 | }, 14 | { 15 | "name": "trekky", 16 | "type": "python", 17 | "request": "launch", 18 | "module": "scrapy.cmdline", 19 | "args": [ 20 | "crawl", 21 | "trekky" 22 | ], 23 | "cwd": "${workspaceFolder}", 24 | "python": "/home/vboxguest/venv/bin/python", 25 | "env": { 26 | "PYTHONUNBUFFERED": "1" 27 | }, 28 | "console": "integratedTerminal" 29 | } 30 | ] 31 | } 32 | -------------------------------------------------------------------------------- /scrapers/items.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass, field 2 | from itemloaders.processors import TakeFirst, MapCompose, Identity 3 | from scrapy.loader import ItemLoader 4 | 5 | 6 | @dataclass 7 | class ReviewItem: 8 | rating: float = field(default=None) 9 | 10 | 11 | class ReviewItemLoader(ItemLoader): 12 | default_item_class = ReviewItem 13 | rating_in = MapCompose(str.strip, float) 14 | rating_out = TakeFirst() 15 | 16 | 17 | @dataclass 18 | class HotelItem: 19 | name: str = field(default=None) 20 | email: str = field(default=None) 21 | reviews: list[ReviewItem] = field(default=None) 22 | 23 | 24 | 
class HotelItemLoader(ItemLoader): 25 | default_input_processor = MapCompose(str.strip) 26 | default_output_processor = TakeFirst() 27 | default_item_class = HotelItem 28 | 29 | reviews_in = Identity() 30 | reviews_out = Identity() 31 | 32 | -------------------------------------------------------------------------------- /scrapers/pipelines/csv.py: -------------------------------------------------------------------------------- 1 | from itemadapter import ItemAdapter 2 | from scrapy import signals 3 | 4 | import csv 5 | 6 | 7 | class SaveToCsvPipeline: 8 | ###This pipeline class saves the scraped data to a CSV file named 'results.csv'.### 9 | _items = [] 10 | 11 | @classmethod 12 | def from_crawler(cls, crawler): 13 | s = cls() 14 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 15 | crawler.signals.connect(s.spider_closed, signal=signals.spider_closed) 16 | return s 17 | 18 | def process_item(self, item, spider): 19 | self._items.append(item) 20 | return item 21 | 22 | def spider_opened(self, spider): 23 | self._items = [] 24 | 25 | def spider_closed(self, spider): 26 | with open('results.csv', 'w', newline='') as csvfile: 27 | csvwriter = csv.DictWriter( 28 | csvfile, 29 | fieldnames=['name', 'email', 'reviews'], 30 | delimiter=',', 31 | quotechar='"', quoting=csv.QUOTE_MINIMAL 32 | ) 33 | 34 | csvwriter.writeheader() 35 | 36 | for item in self._items: 37 | csvwriter.writerow(ItemAdapter(item).asdict()) 38 | -------------------------------------------------------------------------------- /.idea/runConfigurations/trekky.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 25 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # See http://help.github.com/ignore-files/ for more about ignoring files. 2 | 3 | # compiled output 4 | /dist 5 | /tmp 6 | /out-tsc 7 | __pycache__/ 8 | *.py[cod] 9 | *$py.class 10 | 11 | # C extensions 12 | *.so 13 | 14 | # cache 15 | .nx 16 | 17 | # Distribution / packaging 18 | .Python 19 | build/ 20 | develop-eggs/ 21 | dist/ 22 | downloads/ 23 | eggs/ 24 | .eggs/ 25 | lib/ 26 | lib64/ 27 | parts/ 28 | sdist/ 29 | var/ 30 | wheels/ 31 | share/python-wheels/ 32 | *.egg-info/ 33 | .installed.cfg 34 | *.egg 35 | MANIFEST 36 | 37 | # pytype static type analyzer 38 | .pytype/ 39 | 40 | # Cython debug symbols 41 | cython_debug/ 42 | 43 | # PyInstaller 44 | # Usually these files are written by a python script from a template 45 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 46 | *.manifest 47 | *.spec 48 | 49 | # Installer logs 50 | pip-log.txt 51 | pip-delete-this-directory.txt 52 | 53 | # dependencies 54 | /node_modules 55 | 56 | # IDEs and editors 57 | /.idea/*8kqj7y1qjbninwbd95ozki 58 | !.idea/runConfigurations 59 | .project 60 | .classpath 61 | .c9/ 62 | *.launch 63 | .settings/ 64 | *.sublime-workspace 65 | 66 | # IDE - VSCode 67 | .vscode/* 68 | !.vscode/settings.json 69 | !.vscode/tasks.json 70 | !.vscode/launch.json 71 | !.vscode/extensions.json 72 | 73 | # misc 74 | /.sass-cache 75 | /connect.lock 76 | /coverage 77 | /libpeerconnection.log 78 | npm-debug.log 79 | yarn-error.log 80 | testem.log 81 | /typings 82 | yarn.lock 83 | 84 | # System Files 85 | .DS_Store 86 | Thumbs.db 87 | 88 | # Scrapy stuff: 89 | .scrapy 90 | 91 | # IPython 92 | profile_default/ 93 | ipython_config.py 94 | 95 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 96 | __pypackages__/ 97 | 98 | # Configuration 99 | .env 100 | .venv 101 | env/ 102 | venv/ 103 | ENV/ 104 | env.bak/ 105 | venv.bak/ 106 | .spyderproject 107 | .spyproject 108 | 109 | # Data 110 | results.csv 111 | -------------------------------------------------------------------------------- /scrapers/utils.py: -------------------------------------------------------------------------------- 1 | from base64 import b64encode 2 | from datetime import datetime 3 | from scrapy.spidermiddlewares.httperror import HttpError 4 | from w3lib.html import remove_tags 5 | from Crypto.Cipher import PKCS1_OAEP 6 | from Crypto.Hash import SHA256 7 | from Crypto.PublicKey import RSA 8 | 9 | import re 10 | import json 11 | 12 | 13 | def remove_whitespace(text): 14 | text = text.replace('\n', ' ').replace('\r', '') 15 | text = re.sub(r'\s+', ' ', text) 16 | return text.strip() 17 | 18 | 19 | def date_to_timestamp(date): 20 | try: 21 | return int(datetime.strptime(date.strip(), '%b %d, %Y, %I:%M %p').timestamp() * 1000) 22 | except ValueError: 23 | return None 24 | 25 | 26 | def print_failure(logger, failure): 27 | message = f"\nURL: {failure.request.url}\n\n" 28 | 29 | if failure.check(HttpError): 30 | response = failure.value.response 31 | 32 | text = remove_tags(response.text) 33 | 34 | try: 35 | if response.status == 429: 36 | error = { 37 | "message": "Too many requests", 38 | "description": response.text 39 | } 40 | else: 41 | error = json.loads(text) 42 | 43 | description = error.get('description') 44 | if description: 45 | message += f"Error: {error['message']}\n\nDetails: {description}\n" 46 | else: 47 | message += f"Error: {error['message']}\n" 48 | except json.JSONDecodeError: 49 | message += text 50 | else: 51 | message += f"Error: {failure.getErrorMessage()}\n" 52 | 53 | logger.error(f"\n{message}\n") 54 | 55 | 56 | def rsa_encrypt(message, public_key): 57 | """Use RSA public key encryption to encrypt the message.""" 58 | 59 | # Convert the public key into PEM format for use in RSA encryption. 60 | pem_key = f"-----BEGIN PUBLIC KEY-----\n{public_key}\n-----END PUBLIC KEY-----" 61 | rsa_public_key = RSA.importKey(pem_key) 62 | rsa_public_key = PKCS1_OAEP.new(rsa_public_key, hashAlgo=SHA256) 63 | 64 | # Encrypt the message 65 | message = str.encode(message) 66 | encrypted_text = rsa_public_key.encrypt(message) 67 | encrypted_text_b64 = b64encode(encrypted_text) 68 | return encrypted_text_b64 69 | -------------------------------------------------------------------------------- /scrapers/spiders/trekky.py: -------------------------------------------------------------------------------- 1 | from scrapy import Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure 4 | 5 | 6 | class TrekkySpider(Spider): 7 | """This class manages all the logic required for scraping the Trekky website. 8 | 9 | Attributes: 10 | name (str): The unique name of the spider. 11 | start_url (str): Root of the website and first URL to scrape. 
12 | custom_settings (dict): Custom settings for the scraper 13 | """ 14 | 15 | name = "trekky" 16 | 17 | start_url = "https://trekky-reviews.com/level1" 18 | 19 | custom_settings = { 20 | "DEFAULT_REQUEST_HEADERS": { 21 | "Connection": "close", 22 | }, 23 | 24 | "DOWNLOADER_MIDDLEWARES": { 25 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 26 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 27 | }, 28 | } 29 | 30 | def start_requests(self): 31 | """This method start 10 separate sessions on the homepage, one per page.""" 32 | for page in range(1, 10): 33 | yield Request( 34 | url=self.start_url, 35 | callback=self.parse, 36 | errback=self.errback, 37 | dont_filter=True, 38 | meta=dict( 39 | page=page, 40 | cookiejar="jar%d" % page, 41 | ), 42 | ) 43 | 44 | def parse(self, response): 45 | """After accessing the website's homepage, we retrieve the list of hotels in Paris from page X.""" 46 | yield Request( 47 | url=response.urljoin("cities?city=paris&page=%d" % response.meta['page']), 48 | callback=self.parse_listing, 49 | errback=self.errback, 50 | meta=response.meta, 51 | ) 52 | 53 | def parse_listing(self, response): 54 | """This method parses the list of hotels in Paris from page X.""" 55 | for el in response.css('.hotel-link'): 56 | yield response.follow( 57 | url=el, 58 | callback=self.parse_hotel, 59 | errback=self.errback, 60 | meta=response.meta, 61 | ) 62 | 63 | def parse_hotel(self, response): 64 | """This method parses hotel details such as name, email, and reviews.""" 65 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 66 | 67 | hotel = HotelItemLoader(response=response) 68 | hotel.add_css('name', '.hotel-name::text') 69 | hotel.add_css('email', '.hotel-email::text') 70 | hotel.add_value('reviews', reviews) 71 | return hotel.load_item() 72 | 73 | def get_review(self, review_el): 74 | """This method extracts rating from a review""" 75 | review = ReviewItemLoader(selector=review_el) 76 | review.add_css('rating', '.review-rating::text') 77 | return review.load_item() 78 | 79 | def errback(self, failure): 80 | """This method handles and logs errors and is invoked with each request.""" 81 | print_failure(self.logger, failure) 82 | -------------------------------------------------------------------------------- /solutions/challenge-2.py: -------------------------------------------------------------------------------- 1 | from scrapy import Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure 4 | 5 | 6 | class TrekkySpider(Spider): 7 | """This class manages all the logic required for scraping the Trekky website. 8 | 9 | Attributes: 10 | name (str): The unique name of the spider. 11 | start_url (str): Root of the website and first URL to scrape. 
12 | custom_settings (dict): Custom settings for the scraper 13 | """ 14 | 15 | name = "trekky" 16 | 17 | start_url = "https://trekky-reviews.com/level2" 18 | 19 | custom_settings = { 20 | # Add a User Agent to simulate a real browser 21 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3", 22 | 23 | "DEFAULT_REQUEST_HEADERS": { 24 | "Connection": "close", 25 | 26 | # Add headers to simulate a real browser, matching the User Agent 27 | "Sec-Ch-Ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"90\", \"Google Chrome\";v=\"90\"", 28 | "Sec-Ch-Ua-Mobile": "?0", 29 | "Sec-Ch-Ua-Platform": "\"Windows\"", 30 | }, 31 | 32 | "DOWNLOADER_MIDDLEWARES": { 33 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 34 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 35 | }, 36 | } 37 | 38 | def start_requests(self): 39 | """This method start 10 separate sessions on the homepage, one per page.""" 40 | for page in range(1, 10): 41 | yield Request( 42 | url=self.start_url, 43 | callback=self.parse, 44 | errback=self.errback, 45 | dont_filter=True, 46 | meta=dict( 47 | page=page, 48 | cookiejar="jar%d" % page, 49 | ), 50 | ) 51 | 52 | def parse(self, response): 53 | """After accessing the website's homepage, we retrieve the list of hotels in Paris from page X.""" 54 | yield Request( 55 | url=response.urljoin("cities?city=paris&page=%d" % response.meta['page']), 56 | callback=self.parse_listing, 57 | errback=self.errback, 58 | meta=response.meta, 59 | ) 60 | 61 | def parse_listing(self, response): 62 | """This method parses the list of hotels in Paris from page X.""" 63 | for el in response.css('.hotel-link'): 64 | yield response.follow( 65 | url=el, 66 | callback=self.parse_hotel, 67 | errback=self.errback, 68 | meta=response.meta, 69 | ) 70 | 71 | def parse_hotel(self, response): 72 | """This method parses hotel details such as name, email, and reviews.""" 73 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 74 | 75 | hotel = HotelItemLoader(response=response) 76 | hotel.add_css('name', '.hotel-name::text') 77 | hotel.add_css('email', '.hotel-email::text') 78 | hotel.add_value('reviews', reviews) 79 | return hotel.load_item() 80 | 81 | def get_review(self, review_el): 82 | """This method extracts rating from a review""" 83 | review = ReviewItemLoader(selector=review_el) 84 | review.add_css('rating', '.review-rating::text') 85 | return review.load_item() 86 | 87 | def errback(self, failure): 88 | """This method handles and logs errors and is invoked with each request.""" 89 | print_failure(self.logger, failure) 90 | -------------------------------------------------------------------------------- /solutions/challenge-3.py: -------------------------------------------------------------------------------- 1 | from scrapy import Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure 4 | 5 | 6 | class TrekkySpider(Spider): 7 | """This class manages all the logic required for scraping the Trekky website. 8 | 9 | Attributes: 10 | name (str): The unique name of the spider. 11 | start_url (str): Root of the website and first URL to scrape. 
12 | custom_settings (dict): Custom settings for the scraper 13 | """ 14 | 15 | name = "trekky" 16 | 17 | start_url = "https://trekky-reviews.com/level4" 18 | 19 | custom_settings = { 20 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3", 21 | 22 | "DEFAULT_REQUEST_HEADERS": { 23 | "Connection": "close", 24 | "Sec-Ch-Ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"90\", \"Google Chrome\";v=\"90\"", 25 | "Sec-Ch-Ua-Mobile": "?0", 26 | "Sec-Ch-Ua-Platform": "\"Windows\"", 27 | }, 28 | 29 | "DOWNLOADER_MIDDLEWARES": { 30 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 31 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 32 | }, 33 | 34 | "ADDONS": { 35 | 'scrapoxy.Addon': 100, 36 | }, 37 | 38 | # Set up Scrapoxy settings and credentials 39 | "SCRAPOXY_MASTER": "http://localhost:8888", 40 | "SCRAPOXY_API": "http://localhost:8890/api", 41 | "SCRAPOXY_USERNAME": "TO_FILL", 42 | "SCRAPOXY_PASSWORD": "TO_FILL", 43 | } 44 | 45 | def start_requests(self): 46 | """This method start 10 separate sessions on the homepage, one per page.""" 47 | for page in range(1, 10): 48 | yield Request( 49 | url=self.start_url, 50 | callback=self.parse, 51 | errback=self.errback, 52 | dont_filter=True, 53 | meta=dict( 54 | page=page, 55 | cookiejar="jar%d" % page, 56 | ), 57 | ) 58 | 59 | def parse(self, response): 60 | """After accessing the website's homepage, we retrieve the list of hotels in Paris from page X.""" 61 | yield Request( 62 | url=response.urljoin("cities?city=paris&page=%d" % response.meta['page']), 63 | callback=self.parse_listing, 64 | errback=self.errback, 65 | meta=response.meta, 66 | ) 67 | 68 | def parse_listing(self, response): 69 | """This method parses the list of hotels in Paris from page X.""" 70 | for el in response.css('.hotel-link'): 71 | yield response.follow( 72 | url=el, 73 | callback=self.parse_hotel, 74 | errback=self.errback, 75 | meta=response.meta, 76 | ) 77 | 78 | def parse_hotel(self, response): 79 | """This method parses hotel details such as name, email, and reviews.""" 80 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 81 | 82 | hotel = HotelItemLoader(response=response) 83 | hotel.add_css('name', '.hotel-name::text') 84 | hotel.add_css('email', '.hotel-email::text') 85 | hotel.add_value('reviews', reviews) 86 | return hotel.load_item() 87 | 88 | def get_review(self, review_el): 89 | """This method extracts rating from a review""" 90 | review = ReviewItemLoader(selector=review_el) 91 | review.add_css('rating', '.review-rating::text') 92 | return review.load_item() 93 | 94 | def errback(self, failure): 95 | """This method handles and logs errors and is invoked with each request.""" 96 | print_failure(self.logger, failure) 97 | -------------------------------------------------------------------------------- /solutions/challenge-6-1-partial.py: -------------------------------------------------------------------------------- 1 | from scrapy import FormRequest, Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure, rsa_encrypt 4 | from urllib.parse import urljoin 5 | 6 | import json 7 | 8 | 9 | def build_payload(): 10 | """Build the encrypted payload to send to the server.""" 11 | payload = json.dumps({ 12 | "KEY_1_TO_REPLACE": "VALUE_1_TO_REPLACE", 13 | "KEY_2_TO_REPLACE": "VALUE_2_TO_REPLACE", 14 | }) 15 | 16 | # The public key is extracted from the deobfuscated 
JavaScript code of the website's antibot. 17 | public_key = "TO_FILL" 18 | 19 | payload_encoded = rsa_encrypt(payload, public_key) 20 | return payload_encoded 21 | 22 | 23 | class TrekkySpider(Spider): 24 | """This class manages all the logic required for scraping the Trekky website. 25 | 26 | Attributes: 27 | name (str): The unique name of the spider. 28 | start_url (str): Root of the website and first URL to scrape. 29 | custom_settings (dict): Custom settings for the scraper 30 | """ 31 | 32 | name = "trekky" 33 | 34 | start_url = "https://trekky-reviews.com/level8" 35 | 36 | custom_settings = { 37 | "DEFAULT_REQUEST_HEADERS": { 38 | "Connection": "close", 39 | }, 40 | 41 | "DOWNLOADER_MIDDLEWARES": { 42 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 43 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 44 | }, 45 | } 46 | 47 | def start_requests(self): 48 | """This method start 10 separate sessions on the homepage, one per page.""" 49 | for page in range(1, 10): 50 | yield Request( 51 | url=self.start_url, 52 | callback=self.parse_home, 53 | errback=self.errback, 54 | dont_filter=True, 55 | meta=dict( 56 | page=page, 57 | cookiejar="jar%d" % page, 58 | ), 59 | ) 60 | 61 | def parse_home(self, response): 62 | """After accessing the website's homepage, we generate the encrypted payload and send it to the server.""" 63 | yield FormRequest( 64 | url=urljoin(self.start_url, '/Vmi6869kJM7vS70sZKXrwn5Lq0CORjRl'), 65 | formdata={ 66 | "payload": build_payload(), 67 | }, 68 | callback=self.parse, 69 | errback=self.errback, 70 | dont_filter=True, 71 | meta=response.meta, 72 | ) 73 | 74 | def parse(self, response): 75 | """Once approved, we retrieve the list of hotels in Paris from page X.""" 76 | yield Request( 77 | url=self.start_url + "/cities?city=paris&page=%d" % response.meta['page'], 78 | callback=self.parse_listing, 79 | errback=self.errback, 80 | meta=response.meta, 81 | ) 82 | 83 | def parse_listing(self, response): 84 | """This method parses the list of hotels in Paris from page X.""" 85 | for el in response.css('.hotel-link'): 86 | yield response.follow( 87 | url=el, 88 | callback=self.parse_hotel, 89 | errback=self.errback, 90 | meta=response.meta, 91 | ) 92 | 93 | def parse_hotel(self, response): 94 | """This method parses hotel details such as name, email, and reviews.""" 95 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 96 | 97 | hotel = HotelItemLoader(response=response) 98 | hotel.add_css('name', '.hotel-name::text') 99 | hotel.add_css('email', '.hotel-email::text') 100 | hotel.add_value('reviews', reviews) 101 | return hotel.load_item() 102 | 103 | def get_review(self, review_el): 104 | """This method extracts rating from a review""" 105 | review = ReviewItemLoader(selector=review_el) 106 | review.add_css('rating', '.review-rating::text') 107 | return review.load_item() 108 | 109 | def errback(self, failure): 110 | """This method handles and logs errors and is invoked with each request.""" 111 | print_failure(self.logger, failure) 112 | 113 | -------------------------------------------------------------------------------- /solutions/challenge-6-2.py: -------------------------------------------------------------------------------- 1 | from scrapy import FormRequest, Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure, rsa_encrypt 4 | from urllib.parse import urljoin 5 | 6 | import json 7 | 8 | 9 | def build_payload(): 10 | """Build the encrypted 
payload to send to the server.""" 11 | payload = json.dumps({ 12 | "vendor": "Intel", 13 | "renderer": "Intel Iris OpenGL Engine", 14 | }) 15 | 16 | # The public key is extracted from the deobfuscated JavaScript code of the website's antibot. 17 | public_key = "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEApgjwxZd4I6YnOE1GGCdnKIatX71CyGpssvAAH7udNLcBVr0WzIP1t+KZ7mDzLMyZE9MJmSsEgKidzaVRikarUQ6MUWnyJQxe8DlUNrSmK4ZrnLBD/5rVBcepZo1mPj1MdQWie4AYHUt++lLpPrXqEJ7xugSGIt7ORVGgcKO5ku5RSS1Ssy5iUhYtQo4VCb2UxYuMbpt2YF8LOaR8KtPIQENtNH2Jj7akQTna4I5lixOB0jme03lR5n94SqACUAZ+rFBDKgrC9eVWX8xdfMERxcKuD9NxFCV65tdNiH64CHWaDU13j9v2XGHKFkEORgRn+RQBintX5fEqt7GTTIzvoQIDAQAB" 18 | 19 | payload_encoded = rsa_encrypt(payload, public_key) 20 | return payload_encoded 21 | 22 | 23 | class TrekkySpider(Spider): 24 | """This class manages all the logic required for scraping the Trekky website. 25 | 26 | Attributes: 27 | name (str): The unique name of the spider. 28 | start_url (str): Root of the website and first URL to scrape. 29 | custom_settings (dict): Custom settings for the scraper 30 | """ 31 | 32 | name = "trekky" 33 | 34 | start_url = "https://trekky-reviews.com/level8" 35 | 36 | custom_settings = { 37 | "DEFAULT_REQUEST_HEADERS": { 38 | "Connection": "close", 39 | }, 40 | 41 | "DOWNLOADER_MIDDLEWARES": { 42 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 43 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 44 | }, 45 | } 46 | 47 | def start_requests(self): 48 | """This method start 10 separate sessions on the homepage, one per page.""" 49 | for page in range(1, 10): 50 | yield Request( 51 | url=self.start_url, 52 | callback=self.parse_home, 53 | errback=self.errback, 54 | dont_filter=True, 55 | meta=dict( 56 | page=page, 57 | cookiejar="jar%d" % page, 58 | ), 59 | ) 60 | 61 | def parse_home(self, response): 62 | """After accessing the website's homepage, we generate the encrypted payload and send it to the server.""" 63 | yield FormRequest( 64 | url=urljoin(self.start_url, '/Vmi6869kJM7vS70sZKXrwn5Lq0CORjRl'), 65 | formdata={ 66 | "payload": build_payload(), 67 | }, 68 | callback=self.parse, 69 | errback=self.errback, 70 | dont_filter=True, 71 | meta=response.meta, 72 | ) 73 | 74 | def parse(self, response): 75 | """Once approved, we retrieve the list of hotels in Paris from page X.""" 76 | yield Request( 77 | url=self.start_url + "/cities?city=paris&page=%d" % response.meta['page'], 78 | callback=self.parse_listing, 79 | errback=self.errback, 80 | meta=response.meta, 81 | ) 82 | 83 | def parse_listing(self, response): 84 | """This method parses the list of hotels in Paris from page X.""" 85 | for el in response.css('.hotel-link'): 86 | yield response.follow( 87 | url=el, 88 | callback=self.parse_hotel, 89 | errback=self.errback, 90 | meta=response.meta, 91 | ) 92 | 93 | def parse_hotel(self, response): 94 | """This method parses hotel details such as name, email, and reviews.""" 95 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 96 | 97 | hotel = HotelItemLoader(response=response) 98 | hotel.add_css('name', '.hotel-name::text') 99 | hotel.add_css('email', '.hotel-email::text') 100 | hotel.add_value('reviews', reviews) 101 | return hotel.load_item() 102 | 103 | def get_review(self, review_el): 104 | """This method extracts rating from a review""" 105 | review = ReviewItemLoader(selector=review_el) 106 | review.add_css('rating', '.review-rating::text') 107 | return review.load_item() 108 | 109 | def errback(self, failure): 110 | """This method handles and logs 
errors and is invoked with each request.""" 111 | print_failure(self.logger, failure) 112 | 113 | -------------------------------------------------------------------------------- /solutions/challenge-4.py: -------------------------------------------------------------------------------- 1 | from scrapy import Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure 4 | 5 | 6 | class TrekkySpider(Spider): 7 | """This class manages all the logic required for scraping the Trekky website. 8 | 9 | Attributes: 10 | name (str): The unique name of the spider. 11 | start_url (str): Root of the website and first URL to scrape. 12 | custom_settings (dict): Custom settings for the scraper 13 | """ 14 | 15 | name = "trekky" 16 | 17 | start_url = "https://trekky-reviews.com/level6" 18 | 19 | custom_settings = { 20 | "DEFAULT_REQUEST_HEADERS": { 21 | "Connection": "close", 22 | }, 23 | 24 | "DOWNLOADER_MIDDLEWARES": { 25 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 26 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 27 | }, 28 | 29 | # Replace the default Scrapy downloader with Playwright and Chrome to manage JavaScript content. 30 | "DOWNLOAD_HANDLERS": { 31 | "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", 32 | "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", 33 | }, 34 | 35 | # Set up Playwright to launch the browser in headful mode using Scrapoxy. 36 | "PLAYWRIGHT_LAUNCH_OPTIONS": { 37 | "headless": False, 38 | "proxy": { 39 | "server": "http://localhost:8888", 40 | "username": "TO_FILL", 41 | "password": "TO_FILL", 42 | }, 43 | } 44 | } 45 | 46 | def start_requests(self): 47 | """This method start 10 separate sessions on the homepage, one per page.""" 48 | for page in range(1, 10): 49 | yield Request( 50 | url=self.start_url, 51 | callback=self.parse, 52 | errback=self.errback, 53 | dont_filter=True, 54 | meta=dict( 55 | # Enable Playwright 56 | playwright=True, 57 | # Include the Playwright page object in the response 58 | playwright_include_page=True, 59 | playwright_context="context%d" % page, 60 | playwright_context_kwargs=dict( 61 | # Ignore HTTPS errors 62 | ignore_https_errors=True, 63 | ), 64 | playwright_page_goto_kwargs=dict( 65 | wait_until='networkidle', 66 | ), 67 | page=page, 68 | ), 69 | ) 70 | 71 | async def parse(self, response): 72 | """After accessing the website's homepage, we retrieve the list of hotels in Paris from page X.""" 73 | await response.meta["playwright_page"].close() 74 | del response.meta["playwright_page"] 75 | 76 | # For the next requests, skip page rendering and download only the HTML content. 
77 | response.meta["playwright_page_goto_kwargs"]["wait_until"] = 'commit' 78 | 79 | yield Request( 80 | url=response.urljoin("cities?city=paris&page=%d" % response.meta['page']), 81 | callback=self.parse_listing, 82 | errback=self.errback, 83 | meta=response.meta, 84 | ) 85 | 86 | async def parse_listing(self, response): 87 | """This method parses the list of hotels in Paris from page X.""" 88 | await response.meta["playwright_page"].close() 89 | del response.meta["playwright_page"] 90 | 91 | for el in response.css('.hotel-link'): 92 | yield response.follow( 93 | url=el, 94 | callback=self.parse_hotel, 95 | errback=self.errback, 96 | meta=response.meta, 97 | ) 98 | 99 | async def parse_hotel(self, response): 100 | """This method parses hotel details such as name, email, and reviews.""" 101 | await response.meta["playwright_page"].close() 102 | del response.meta["playwright_page"] 103 | 104 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 105 | 106 | hotel = HotelItemLoader(response=response) 107 | hotel.add_css('name', '.hotel-name::text') 108 | hotel.add_css('email', '.hotel-email::text') 109 | hotel.add_value('reviews', reviews) 110 | return hotel.load_item() 111 | 112 | def get_review(self, review_el): 113 | """This method extracts rating from a review""" 114 | review = ReviewItemLoader(selector=review_el) 115 | review.add_css('rating', '.review-rating::text') 116 | return review.load_item() 117 | 118 | async def errback(self, failure): 119 | """This method handles and logs errors and is invoked with each request.""" 120 | print_failure(self.logger, failure) 121 | page = failure.request.meta.get("playwright_page") 122 | if page: 123 | await page.close() 124 | -------------------------------------------------------------------------------- /solutions/challenge-5.py: -------------------------------------------------------------------------------- 1 | from scrapy import Request, Spider 2 | from scrapers.items import HotelItemLoader, ReviewItemLoader 3 | from scrapers.utils import print_failure 4 | 5 | 6 | class TrekkySpider(Spider): 7 | """This class manages all the logic required for scraping the Trekky website. 8 | 9 | Attributes: 10 | name (str): The unique name of the spider. 11 | start_url (str): Root of the website and first URL to scrape. 12 | custom_settings (dict): Custom settings for the scraper 13 | """ 14 | 15 | name = "trekky" 16 | 17 | start_url = "https://trekky-reviews.com/level7" 18 | 19 | custom_settings = { 20 | "DEFAULT_REQUEST_HEADERS": { 21 | "Connection": "close", 22 | }, 23 | 24 | "DOWNLOADER_MIDDLEWARES": { 25 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 26 | 'scrapers.middlewares.retry.RetryMiddleware': 550, 27 | }, 28 | 29 | # Replace the default Scrapy downloader with Playwright and Chrome to manage JavaScript content. 30 | "DOWNLOAD_HANDLERS": { 31 | "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", 32 | "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", 33 | }, 34 | 35 | # Set up Playwright to launch the browser in headful mode using Scrapoxy. 
36 | "PLAYWRIGHT_LAUNCH_OPTIONS": { 37 | "headless": False, 38 | "proxy": { 39 | "server": "http://localhost:8888", 40 | "username": "TO_FILL", 41 | "password": "TO_FILL", 42 | }, 43 | } 44 | } 45 | 46 | def start_requests(self): 47 | """This method start 10 separate sessions on the homepage, one per page.""" 48 | for page in range(1, 10): 49 | yield Request( 50 | url=self.start_url, 51 | callback=self.parse, 52 | errback=self.errback, 53 | dont_filter=True, 54 | meta=dict( 55 | # Enable Playwright 56 | playwright=True, 57 | # Include the Playwright page object in the response 58 | playwright_include_page=True, 59 | playwright_context="context%d" % page, 60 | playwright_context_kwargs=dict( 61 | # Ignore HTTPS errors 62 | ignore_https_errors=True, 63 | # Sync the timezone with the proxy's location. 64 | timezone_id='America/Chicago', 65 | ), 66 | playwright_page_goto_kwargs=dict( 67 | wait_until='networkidle', 68 | ), 69 | page=page, 70 | ), 71 | ) 72 | 73 | async def parse(self, response): 74 | """After accessing the website's homepage, we retrieve the list of hotels in Paris from page X.""" 75 | await response.meta["playwright_page"].close() 76 | del response.meta["playwright_page"] 77 | 78 | # For the next requests, skip page rendering and download only the HTML content. 79 | response.meta["playwright_page_goto_kwargs"]["wait_until"] = 'commit' 80 | 81 | yield Request( 82 | url=response.urljoin("cities?city=paris&page=%d" % response.meta['page']), 83 | callback=self.parse_listing, 84 | errback=self.errback, 85 | meta=response.meta, 86 | ) 87 | 88 | async def parse_listing(self, response): 89 | """This method parses the list of hotels in Paris from page X.""" 90 | await response.meta["playwright_page"].close() 91 | del response.meta["playwright_page"] 92 | 93 | for el in response.css('.hotel-link'): 94 | yield response.follow( 95 | url=el, 96 | callback=self.parse_hotel, 97 | errback=self.errback, 98 | meta=response.meta, 99 | ) 100 | 101 | async def parse_hotel(self, response): 102 | """This method parses hotel details such as name, email, and reviews.""" 103 | await response.meta["playwright_page"].close() 104 | del response.meta["playwright_page"] 105 | 106 | reviews = [self.get_review(review_el) for review_el in response.css('.hotel-review')] 107 | 108 | hotel = HotelItemLoader(response=response) 109 | hotel.add_css('name', '.hotel-name::text') 110 | hotel.add_css('email', '.hotel-email::text') 111 | hotel.add_value('reviews', reviews) 112 | return hotel.load_item() 113 | 114 | def get_review(self, review_el): 115 | """This method extracts rating from a review""" 116 | review = ReviewItemLoader(selector=review_el) 117 | review.add_css('rating', '.review-rating::text') 118 | return review.load_item() 119 | 120 | async def errback(self, failure): 121 | """This method handles and logs errors and is invoked with each request.""" 122 | print_failure(self.logger, failure) 123 | page = failure.request.meta.get("playwright_page") 124 | if page: 125 | await page.close() 126 | -------------------------------------------------------------------------------- /tools/deobfuscator.js: -------------------------------------------------------------------------------- 1 | import {promises as fs} from 'fs'; 2 | import * as parser from '@babel/parser'; 3 | import * as t from "@babel/types"; 4 | import _traverse from '@babel/traverse'; 5 | import _generate from '@babel/generator'; 6 | 7 | const traverse = _traverse.default; 8 | const generate = _generate.default; 9 | 10 | 11 | /* 12 | * Generate and 
prettify source code from the AST 13 | */ 14 | function generateCode(ast) { 15 | return generate(ast, { 16 | comments: false, 17 | compact: false, 18 | }).code; 19 | } 20 | 21 | 22 | /* 23 | * Replace constants by the string value 24 | */ 25 | function constantUnfolding(source) { 26 | // Convert source code to an AST (Abstract Syntax Tree) 27 | const ast = parser.parse(source); 28 | 29 | // Replace variable usage by their string declaration 30 | traverse(ast, { 31 | VariableDeclaration(path) { 32 | const newDeclarations = []; 33 | for (const declaration of path.node.declarations) { 34 | if (!t.isStringLiteral(declaration.init)) { 35 | newDeclarations.push(declaration); 36 | continue; 37 | } 38 | 39 | const binding = path.scope.getBinding(declaration.id.name); 40 | if (!binding || binding.constantViolations.length > 0) { 41 | newDeclarations.push(declaration); 42 | continue; 43 | } 44 | 45 | for (const referencePath of binding.referencePaths) { 46 | referencePath.replaceWith(t.stringLiteral(declaration.init.value)); 47 | } 48 | } 49 | 50 | if (newDeclarations.length > 0) { 51 | path.node.declarations = newDeclarations; 52 | } else { 53 | path.remove(); 54 | } 55 | } 56 | }); 57 | 58 | // Replace the binding of the window usage 59 | traverse(ast, { 60 | VariableDeclaration(path) { 61 | const newDeclarations = []; 62 | for (const declaration of path.node.declarations) { 63 | if (!t.isIdentifier(declaration.init) || declaration.init.name !== 'window') { 64 | newDeclarations.push(declaration); 65 | continue; 66 | } 67 | 68 | const binding = path.scope.getBinding(declaration.id.name); 69 | if (!binding || binding.constantViolations.length > 0) { 70 | newDeclarations.push(declaration); 71 | continue; 72 | } 73 | 74 | for (const referencePath of binding.referencePaths) { 75 | referencePath.replaceWith(t.identifier('window')); 76 | } 77 | } 78 | 79 | if (newDeclarations.length > 0) { 80 | path.node.declarations = newDeclarations; 81 | } else { 82 | path.remove(); 83 | } 84 | } 85 | }); 86 | 87 | return generateCode(ast); 88 | } 89 | 90 | /* 91 | * Join binary expressions with string literals 92 | */ 93 | function stringJoin(source) { 94 | // Convert source code to an AST (Abstract Syntax Tree) 95 | const ast = parser.parse(source); 96 | 97 | function joinBinaryWithStringRecursively(node) { 98 | let left; 99 | if (t.isBinaryExpression(node.left)) { 100 | left = joinBinaryWithStringRecursively(node.left); 101 | } else { 102 | left = node.left; 103 | } 104 | 105 | let right; 106 | if (t.isBinaryExpression(node.right)) { 107 | right = joinBinaryWithStringRecursively(node.right); 108 | } else { 109 | right = node.right; 110 | } 111 | 112 | if (t.isStringLiteral(left) && t.isStringLiteral(right)) { 113 | return t.stringLiteral(left.value + right.value); 114 | } else { 115 | return node; 116 | } 117 | } 118 | 119 | // Replace binary expressions by their string concatenation 120 | // Use a recursive function 121 | traverse(ast, { 122 | BinaryExpression(path) { 123 | const node = path.node; 124 | 125 | path.replaceWith(joinBinaryWithStringRecursively(node)); 126 | } 127 | }); 128 | 129 | return generateCode(ast); 130 | } 131 | 132 | /* 133 | * Convert the string notation to the dot notation 134 | */ 135 | function convertStringNotationToDotNotation(source) { 136 | // Convert source code to an AST (Abstract Syntax Tree) 137 | const ast = parser.parse(source); 138 | 139 | traverse(ast, { 140 | MemberExpression(path) { 141 | const {node} = path; 142 | 143 | if (node.computed && 
t.isStringLiteral(node.property)) { 144 | node.property = t.identifier(node.property.value); 145 | node.computed = false; 146 | } 147 | } 148 | }); 149 | 150 | return generateCode(ast); 151 | } 152 | 153 | 154 | (async () => { 155 | let code = await fs.readFile('./tools/obfuscated.js', 'utf-8'); 156 | 157 | code = constantUnfolding(code); 158 | code = stringJoin(code); 159 | code = convertStringNotationToDotNotation(code); 160 | 161 | await fs.writeFile('./tools/deobfuscated.js', code, 'utf-8'); 162 | })() 163 | .catch(console.error); 164 | -------------------------------------------------------------------------------- /solutions/challenge-7.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | from dataclasses import dataclass, field 3 | from parsel import Selector 4 | from camoufox.async_api import AsyncCamoufox 5 | from typing import List, Iterator 6 | from urllib.parse import urljoin 7 | 8 | import asyncio 9 | import csv 10 | import logging 11 | 12 | 13 | logging.basicConfig( 14 | level=logging.INFO, 15 | format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' 16 | ) 17 | 18 | 19 | # Define data classes similar to those in scrapers/items.py 20 | @dataclass 21 | class ReviewItem: 22 | rating: float = field(default=None) 23 | 24 | 25 | @dataclass 26 | class HotelItem: 27 | name: str = field(default=None) 28 | email: str = field(default=None) 29 | reviews: List[ReviewItem] = field(default_factory=list) 30 | 31 | 32 | class TrekkyCamoufoxSpider: 33 | """Pure Camoufox implementation of the Trekky spider.""" 34 | start_url = "https://trekky-reviews.com/level9" 35 | 36 | logger = logging.getLogger(__name__) 37 | 38 | async def start(self) -> None: 39 | """Entry point to start the scraping process.""" 40 | if not self.start_url.endswith('/'): 41 | self.start_url += '/' 42 | 43 | tasks = [] 44 | # Create separate sessions 45 | for page_num in range(1, 10): 46 | task = asyncio.create_task(self.parse_homepage(page_num)) 47 | tasks.append(task) 48 | 49 | # Wait for all tasks to complete 50 | results = await asyncio.gather(*tasks, return_exceptions=True) 51 | 52 | # Process results 53 | hotels = [] 54 | for result in results: 55 | if isinstance(result, Exception): 56 | self.logger.error(f"Error occurred: {result}") 57 | elif result: 58 | hotels.extend(result) 59 | 60 | # Output results to a CSV file matching the required format 61 | with open('../results.csv', 'w', newline='') as f: 62 | writer = csv.writer(f) 63 | writer.writerow(['name', 'email', 'reviews']) 64 | for hotel in hotels: 65 | reviews_str = str([{"rating": review.rating} for review in hotel.reviews]) 66 | writer.writerow([hotel.name, hotel.email, reviews_str]) 67 | 68 | self.logger.info(f"Scraped {len(hotels)} hotels and saved to results.csv") 69 | 70 | async def parse_homepage(self, page_num) -> List[HotelItem]: 71 | """This method starts a session for the homepage and retrieves the list of hotels in Paris from page X.""" 72 | hotels = [] 73 | 74 | try: 75 | # Launch browser with camoufox 76 | async with AsyncCamoufox( 77 | headless=False, 78 | ) as browser: 79 | # Create a new context 80 | context = await browser.new_context( 81 | ignore_https_errors=True, 82 | ) 83 | 84 | # Open a new page and navigate to the homepage 85 | self.logger.info(f"Go to homepage for page {page_num}") 86 | 87 | page = await context.new_page() 88 | await page.route('**/*.{png,jpg,jpeg,svg,gif,css}', lambda route: route.abort()) 89 | await page.goto(self.start_url, 
wait_until='networkidle', timeout=60000) 90 | await page.wait_for_timeout(2000) 91 | 92 | # Navigate to the listing page and get the hotels 93 | url = urljoin(self.start_url, f"cities?city=paris&page={page_num}") 94 | async for hotel in self.parse_listing(page, url): 95 | hotels.append(hotel) 96 | 97 | # Close the page and context 98 | await page.close() 99 | await context.close() 100 | 101 | except Exception as e: 102 | self.logger.error(f"Error in session {page_num}: {e}") 103 | raise 104 | 105 | return hotels 106 | 107 | async def parse_listing(self, page, url) -> Iterator[HotelItem]: 108 | """This method parses the list of hotels.""" 109 | self.logger.info(f"Go to listing page: {url}") 110 | await page.goto(url, wait_until='networkidle', timeout=60000) 111 | 112 | # Wait longer for JavaScript to load content 113 | await page.wait_for_timeout(5000) 114 | 115 | selector = Selector(text=await page.content()) 116 | 117 | for link in selector.css('.hotel-link'): 118 | href = link.attrib.get('href') 119 | if href: 120 | hotel_url = urljoin(self.start_url, href) 121 | 122 | try: 123 | item = await self.parse_hotel(page, hotel_url) 124 | if item: 125 | yield item 126 | except Exception as e: 127 | self.logger.error(f"Error scraping hotel {hotel_url}: {e}") 128 | 129 | async def parse_hotel(self, page, url) -> HotelItem | None: 130 | """This method parses hotel details.""" 131 | try: 132 | self.logger.info(f"Go to hotel page: {url}") 133 | await page.goto(url, wait_until='commit', timeout=30000) 134 | await page.wait_for_selector('.hotel-name', timeout=10000) 135 | 136 | selector = Selector(text=await page.content()) 137 | 138 | reviews = [] 139 | for review_el in selector.css('.hotel-review'): 140 | review_item = self.get_review(review_el) 141 | if review_item: 142 | reviews.append(review_item) 143 | 144 | name = selector.css('.hotel-name::text').get() 145 | email = selector.css('.hotel-email::text').get() 146 | 147 | if name and email: 148 | return HotelItem(name=name.strip(), email=email.strip(), reviews=reviews) 149 | else: 150 | self.logger.warning(f"Incomplete hotel data extracted: name={name}, email={email}") 151 | return None 152 | 153 | except Exception as e: 154 | self.logger.error(f"Error extracting hotel info: {e}") 155 | return None 156 | 157 | def get_review(self, review_el) -> ReviewItem | None: 158 | """This method extracts rating from a review""" 159 | rating_text = review_el.css('.review-rating::text').get() 160 | try: 161 | rating = float(rating_text.strip()) if rating_text else None 162 | return ReviewItem(rating=rating) 163 | except (ValueError, TypeError): 164 | self.logger.warning(f"Invalid rating value: {rating_text}") 165 | return None 166 | 167 | 168 | 169 | 170 | async def main(): 171 | """Main entry point.""" 172 | spider = TrekkyCamoufoxSpider() 173 | await spider.start() 174 | 175 | 176 | if __name__ == "__main__": 177 | asyncio.run(main()) 178 | -------------------------------------------------------------------------------- /playwright_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | from dataclasses import dataclass, field 3 | from parsel import Selector 4 | from playwright.async_api import async_playwright 5 | from typing import List, Iterator 6 | from urllib.parse import urljoin 7 | 8 | import asyncio 9 | import csv 10 | import logging 11 | 12 | 13 | logging.basicConfig( 14 | level=logging.INFO, 15 | format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' 16 | ) 17 | 18 | 19 | 
@dataclass 20 | class ReviewItem: 21 | rating: float = field(default=None) 22 | 23 | 24 | @dataclass 25 | class HotelItem: 26 | name: str = field(default=None) 27 | email: str = field(default=None) 28 | reviews: List[ReviewItem] = field(default_factory=list) 29 | 30 | 31 | class TrekkyPlaywrightSpider: 32 | """Pure Playwright implementation of the Trekky spider.""" 33 | start_url = "https://trekky-reviews.com/level9" 34 | 35 | logger = logging.getLogger(__name__) 36 | 37 | async def start(self) -> None: 38 | """Entry point to start the scraping process.""" 39 | if not self.start_url.endswith('/'): 40 | self.start_url += '/' 41 | 42 | tasks = [] 43 | # Create separate sessions 44 | for page_num in range(1, 10): 45 | task = asyncio.create_task(self.parse_homepage(page_num)) 46 | tasks.append(task) 47 | 48 | # Wait for all tasks to complete 49 | results = await asyncio.gather(*tasks, return_exceptions=True) 50 | 51 | # Process results 52 | hotels = [] 53 | for result in results: 54 | if isinstance(result, Exception): 55 | self.logger.error(f"Error occurred: {result}") 56 | elif result: 57 | hotels.extend(result) 58 | 59 | # Output results to a CSV file matching the required format 60 | with open('results.csv', 'w', newline='') as f: 61 | writer = csv.writer(f) 62 | writer.writerow(['name', 'email', 'reviews']) 63 | for hotel in hotels: 64 | reviews_str = str([{"rating": review.rating} for review in hotel.reviews]) 65 | writer.writerow([hotel.name, hotel.email, reviews_str]) 66 | 67 | self.logger.info(f"Scraped {len(hotels)} hotels and saved to results.csv") 68 | 69 | async def parse_homepage(self, page_num) -> List[HotelItem]: 70 | """This method starts a session for the homepage and retrieves the list of hotels in Paris from page X.""" 71 | hotels = [] 72 | 73 | try: 74 | # Launch browser with playwright 75 | async with async_playwright() as p: 76 | # Launch browser 77 | browser = await p.chromium.launch(headless=False) 78 | context = await browser.new_context( 79 | ignore_https_errors=True, 80 | ) 81 | 82 | # Open a new page and navigate to the homepage 83 | self.logger.info(f"Go to homepage for page {page_num}") 84 | 85 | page = await context.new_page() 86 | await page.route('**/*.{png,jpg,jpeg,svg,gif,css}', lambda route: route.abort()) 87 | await page.goto(self.start_url, wait_until='networkidle', timeout=60000) 88 | await page.wait_for_timeout(2000) 89 | 90 | # Navigate to the listing page and get the hotels 91 | url = urljoin(self.start_url, f"cities?city=paris&page={page_num}") 92 | async for hotel in self.parse_listing(page, url): 93 | hotels.append(hotel) 94 | 95 | # Close the page and context 96 | await page.close() 97 | await context.close() 98 | await browser.close() 99 | 100 | except Exception as e: 101 | self.logger.error(f"Error in session {page_num}: {e}") 102 | raise 103 | 104 | return hotels 105 | 106 | async def parse_listing(self, page, url) -> Iterator[HotelItem]: 107 | """This method parses the list of hotels.""" 108 | self.logger.info(f"Go to listing page: {url}") 109 | await page.goto(url, wait_until='networkidle', timeout=60000) 110 | 111 | # Wait longer for JavaScript to load content 112 | await page.wait_for_timeout(5000) 113 | 114 | selector = Selector(text=await page.content()) 115 | 116 | for link in selector.css('.hotel-link'): 117 | href = link.attrib.get('href') 118 | if href: 119 | hotel_url = urljoin(self.start_url, href) 120 | 121 | try: 122 | item = await self.parse_hotel(page, hotel_url) 123 | if item: 124 | yield item 125 | except Exception as e: 126 | 
self.logger.error(f"Error scraping hotel {hotel_url}: {e}") 127 | 128 | async def parse_hotel(self, page, url) -> HotelItem | None: 129 | """This method parses hotel details.""" 130 | try: 131 | self.logger.info(f"Go to hotel page: {url}") 132 | await page.goto(url, wait_until='commit', timeout=30000) 133 | 134 | await page.wait_for_selector('.hotel-name', timeout=10000) 135 | 136 | selector = Selector(text=await page.content()) 137 | 138 | reviews = [] 139 | for review_el in selector.css('.hotel-review'): 140 | review_item = self.get_review(review_el) 141 | if review_item: 142 | reviews.append(review_item) 143 | 144 | name = selector.css('.hotel-name::text').get() 145 | email = selector.css('.hotel-email::text').get() 146 | 147 | if name and email: 148 | return HotelItem(name=name.strip(), email=email.strip(), reviews=reviews) 149 | else: 150 | self.logger.warning(f"Incomplete hotel data extracted: name={name}, email={email}") 151 | return None 152 | 153 | except Exception as e: 154 | self.logger.error(f"Error extracting hotel info: {e}") 155 | return None 156 | 157 | def get_review(self, review_el) -> ReviewItem | None: 158 | """This method extracts rating from a review""" 159 | rating_text = review_el.css('.review-rating::text').get() 160 | try: 161 | rating = float(rating_text.strip()) if rating_text else None 162 | return ReviewItem(rating=rating) 163 | except (ValueError, TypeError): 164 | self.logger.warning(f"Invalid rating value: {rating_text}") 165 | return None 166 | 167 | async def main(): 168 | """Main entry point.""" 169 | spider = TrekkyPlaywrightSpider() 170 | await spider.start() 171 | 172 | 173 | if __name__ == "__main__": 174 | asyncio.run(main()) 175 | -------------------------------------------------------------------------------- /scrapers/middlewares/retry.py: -------------------------------------------------------------------------------- 1 | """ 2 | An extension to retry failed requests that are potentially caused by temporary 3 | problems such as a connection timeout or HTTP 500 error. 4 | 5 | You can change the behaviour of this middleware by modifying the scraping settings: 6 | RETRY_TIMES - how many times to retry a failed page 7 | RETRY_HTTP_CODES - which HTTP response codes to retry 8 | 9 | Failed pages are collected on the scraping process and rescheduled at the end, 10 | once the spider has finished crawling all regular (non failed) pages. 11 | """ 12 | 13 | from __future__ import annotations 14 | 15 | import warnings 16 | from logging import Logger, getLogger 17 | from typing import TYPE_CHECKING, Any, Optional, Tuple, Type, Union 18 | 19 | from scrapy.crawler import Crawler 20 | from scrapy.exceptions import NotConfigured, ScrapyDeprecationWarning 21 | from scrapy.http import Response 22 | from scrapy.http.request import Request 23 | from scrapy.settings import BaseSettings, Settings 24 | from scrapy.spiders import Spider 25 | from scrapy.utils.misc import load_object 26 | from scrapy.utils.python import global_object_name 27 | from scrapy.utils.response import response_status_message 28 | 29 | from twisted.internet import reactor 30 | from twisted.internet.defer import Deferred 31 | 32 | if TYPE_CHECKING: 33 | # typing.Self requires Python 3.11 34 | from typing_extensions import Self 35 | 36 | retry_logger = getLogger(__name__) 37 | 38 | 39 | def backwards_compatibility_getattr(self: Any, name: str) -> Tuple[Any, ...]: 40 | if name == "EXCEPTIONS_TO_RETRY": 41 | warnings.warn( 42 | "Attribute RetryMiddleware.EXCEPTIONS_TO_RETRY is deprecated. 
" 43 | "Use the RETRY_EXCEPTIONS setting instead.", 44 | ScrapyDeprecationWarning, 45 | stacklevel=2, 46 | ) 47 | return tuple( 48 | load_object(x) if isinstance(x, str) else x 49 | for x in Settings().getlist("RETRY_EXCEPTIONS") 50 | ) 51 | raise AttributeError( 52 | f"{self.__class__.__name__!r} object has no attribute {name!r}" 53 | ) 54 | 55 | 56 | class BackwardsCompatibilityMetaclass(type): 57 | __getattr__ = backwards_compatibility_getattr 58 | 59 | 60 | def get_retry_request( 61 | request: Request, 62 | *, 63 | spider: Spider, 64 | reason: Union[str, Exception, Type[Exception]] = "unspecified", 65 | max_retry_times: Optional[int] = None, 66 | priority_adjust: Optional[int] = None, 67 | delay: int, 68 | logger: Logger = retry_logger, 69 | stats_base_key: str = "retry", 70 | ) -> Optional[Request]: 71 | """ 72 | Returns a new :class:`~scrapy.Request` object to retry the specified 73 | request, or ``None`` if retries of the specified request have been 74 | exhausted. 75 | 76 | For example, in a :class:`~scrapy.Spider` callback, you could use it as 77 | follows:: 78 | 79 | def parse(self, response): 80 | if not response.text: 81 | new_request_or_none = get_retry_request( 82 | response.request, 83 | spider=self, 84 | reason='empty', 85 | ) 86 | return new_request_or_none 87 | 88 | *spider* is the :class:`~scrapy.Spider` instance which is asking for the 89 | retry request. It is used to access the :ref:`settings ` 90 | and :ref:`stats `, and to provide extra logging context (see 91 | :func:`logging.debug`). 92 | 93 | *reason* is a string or an :class:`Exception` object that indicates the 94 | reason why the request needs to be retried. It is used to name retry stats. 95 | 96 | *max_retry_times* is a number that determines the maximum number of times 97 | that *request* can be retried. If not specified or ``None``, the number is 98 | read from the :reqmeta:`max_retry_times` meta key of the request. If the 99 | :reqmeta:`max_retry_times` meta key is not defined or ``None``, the number 100 | is read from the :setting:`RETRY_TIMES` setting. 101 | 102 | *priority_adjust* is a number that determines how the priority of the new 103 | request changes in relation to *request*. If not specified, the number is 104 | read from the :setting:`RETRY_PRIORITY_ADJUST` setting. 
105 | 106 | *logger* is the logging.Logger object to be used when logging messages 107 | 108 | *stats_base_key* is a string to be used as the base key for the 109 | retry-related job stats 110 | """ 111 | settings = spider.crawler.settings 112 | assert spider.crawler.stats 113 | stats = spider.crawler.stats 114 | retry_times = request.meta.get("retry_times", 0) + 1 115 | if max_retry_times is None: 116 | max_retry_times = request.meta.get("max_retry_times") 117 | if max_retry_times is None: 118 | max_retry_times = settings.getint("RETRY_TIMES") 119 | if retry_times <= max_retry_times: 120 | logger.debug( 121 | "Retrying %(request)s (failed %(retry_times)d times): %(reason)s", 122 | {"request": request, "retry_times": retry_times, "reason": reason}, 123 | extra={"spider": spider}, 124 | ) 125 | new_request: Request = request.copy() 126 | new_request.meta["retry_times"] = retry_times 127 | new_request.dont_filter = True 128 | if priority_adjust is None: 129 | priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST") 130 | new_request.priority = request.priority + priority_adjust 131 | 132 | new_request.meta["delay_request_by"] = delay 133 | 134 | if callable(reason): 135 | reason = reason() 136 | if isinstance(reason, Exception): 137 | reason = global_object_name(reason.__class__) 138 | 139 | stats.inc_value(f"{stats_base_key}/count") 140 | stats.inc_value(f"{stats_base_key}/reason_count/{reason}") 141 | return new_request 142 | stats.inc_value(f"{stats_base_key}/max_reached") 143 | logger.error( 144 | "Gave up retrying %(request)s (failed %(retry_times)d times): " "%(reason)s", 145 | {"request": request, "retry_times": retry_times, "reason": reason}, 146 | extra={"spider": spider}, 147 | ) 148 | return None 149 | 150 | 151 | class RetryMiddleware(metaclass=BackwardsCompatibilityMetaclass): 152 | def __init__(self, settings: BaseSettings): 153 | if not settings.getbool("RETRY_ENABLED"): 154 | raise NotConfigured 155 | self.max_retry_times = settings.getint("RETRY_TIMES") 156 | self.retry_http_codes = set( 157 | int(x) for x in settings.getlist("RETRY_HTTP_CODES") 158 | ) 159 | self.priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST") 160 | 161 | try: 162 | self.exceptions_to_retry = self.__getattribute__("EXCEPTIONS_TO_RETRY") 163 | except AttributeError: 164 | # If EXCEPTIONS_TO_RETRY is not "overridden" 165 | self.exceptions_to_retry = tuple( 166 | load_object(x) if isinstance(x, str) else x 167 | for x in settings.getlist("RETRY_EXCEPTIONS") 168 | ) 169 | 170 | @classmethod 171 | def from_crawler(cls, crawler: Crawler) -> Self: 172 | return cls(crawler.settings) 173 | 174 | def process_request(self, request, spider): 175 | delay_s = request.meta.get('delay_request_by', None) 176 | if not delay_s: 177 | return 178 | 179 | try: 180 | delay = float(delay_s) 181 | except ValueError: 182 | spider.logger.error("Invalid delay value %s on URL %s", delay_s, request.url) 183 | return 184 | 185 | deferred = Deferred() 186 | reactor.callLater(delay, deferred.callback, None) 187 | return deferred 188 | 189 | def process_response( 190 | self, request: Request, response: Response, spider: Spider 191 | ) -> Union[Request, Response]: 192 | if request.meta.get("dont_retry", False): 193 | return response 194 | if response.status in self.retry_http_codes: 195 | reason = response_status_message(response.status) 196 | return self._retry(request, reason, spider) or response 197 | return response 198 | 199 | def process_exception( 200 | self, request: Request, exception: Exception, spider: Spider 201 | ) 
-> Union[Request, Response, None]: 202 | if isinstance(exception, self.exceptions_to_retry) and not request.meta.get( 203 | "dont_retry", False 204 | ): 205 | return self._retry(request, exception, spider) 206 | return None 207 | 208 | def _retry( 209 | self, 210 | request: Request, 211 | reason: Union[str, Exception, Type[Exception]], 212 | spider: Spider, 213 | ) -> Optional[Request]: 214 | max_retry_times = request.meta.get("max_retry_times", self.max_retry_times) 215 | priority_adjust = request.meta.get("priority_adjust", self.priority_adjust) 216 | delay = int(request.meta.get('delay_request_by', 10) * 2) 217 | 218 | spider.logger.info("Delaying request on URL %s by %d seconds", request.url, delay) 219 | 220 | return get_retry_request( 221 | request, 222 | reason=reason, 223 | spider=spider, 224 | max_retry_times=max_retry_times, 225 | priority_adjust=priority_adjust, 226 | delay=delay 227 | ) 228 | 229 | __getattr__ = backwards_compatibility_getattr 230 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Fabien's WebScraping Anti-Ban Workshop 2 | 3 | ![Header](header.jpg) 4 | 5 | 6 | ## Introduction 7 | 8 | Our goal is to understand how anti-bot protections work and how to bypass them. 9 | 10 | I created a dedicated website for this workshop [https://trekky-reviews.com](https://trekky-reviews.com). 11 | This website provides a list of hotels for every city, including reviews. 12 | 13 | We will collect **name, email and reviews** for each hotel. 14 | 15 | During this workshop, we will use the following open-source software: 16 | 17 | | Framework | Description | 18 | |--------------------------------------|------------------------------------------------------------------------------| 19 | | [Scrapy](https://scrapy.org) | the leading framework for web scraping | 20 | | [Scrapoxy](https://scrapoxy.io) | the super proxies aggregator | 21 | | [Playwright](https://playwright.dev) | the latest headless browser framework that integrates seamlessly with Scrapy | 22 | | [Babel.js](https://babeljs.io) | a transpiler used for deobfuscation purposes | 23 | 24 | The scraper can be found at [scrapers/spiders/trekky.py](scrapers/spiders/trekky.py). 25 | 26 | All solutions are located in [solutions](solutions). 27 | If you have any difficulties implementing a solution, feel free to copy and paste it. 28 | However, I recommend taking some time to search and explore to get the most out of the workshop, rather than rushing through it in 10 minutes. 29 | 30 | 31 | ## Preflight Checklist 32 | 33 | ### VirtualBox (Linux and Windows) 34 | 35 | To simplify the installation process, 36 | I've pre-configured an Ubuntu virtual machine for you with 37 | all the necessary dependencies for this workshop. 38 | 39 | 40 | 41 | 44 | 51 | 52 |
42 | 43 | 45 | This virtual machine is compatible only with AMD64 architecture (Linux, Windows, and Intel-based macOS). 46 | 47 | For macOS M1 (ARM64), please manually install the dependencies. 48 | 49 | On Windows, avoid using WSL2 (it doesn't work with Playwright) 50 |
53 | 54 | You can download it **[from this link](https://bit.ly/scwsfiles).** 55 | 56 | The virtual machine is in OVA format and can be easily imported into [VirtualBox](https://www.virtualbox.org). 57 | 58 | It requires 8 GiB RAM and 2 vCPU. 59 | 60 | Click on `Import Appliance` and choose the OVA file you downloaded. 61 | 62 | Credentials are: `vboxguest / changeme`. 63 | 64 | I recommend switching the network setting from NAT to **Bridge Adapter** for improved performance. 65 | 66 | _Note: If the network is too slow, I have USB drives available with the VM._ 67 | 68 | 69 | ### Full Installation (Linux, Windows, and macOS) 70 | 71 | You can manually install the required dependencies, which include: 72 | 73 | - Python (version 3 or higher) with virtualenv 74 | - Node.js (version 20 or higher) 75 | - Docker 76 | 77 | #### Python 78 | 79 | If you need Python, I recommend using [Anaconda](https://www.anaconda.com/download). 80 | 81 | To install a Virtual Environment, run the following command: 82 | 83 | ```shell 84 | python3 -m venv venv 85 | source venv/bin/activate 86 | ``` 87 | 88 | 89 | #### Node.js 90 | 91 | To install Node.js, follow the instructions on: https://nodejs.org/en/download 92 | 93 | 94 | #### Docker 95 | 96 | To install Docker, follow the instructions for 97 | [Mac](https://docs.docker.com/desktop/install/mac-install), 98 | [Windows](https://docs.docker.com/desktop/install/windows-install) or 99 | [Linux](https://docs.docker.com/desktop/install/linux/). 100 | 101 | 102 | ## Setting up 103 | 104 | This step is necessary even if you are using the VM. 105 | 106 | 107 | ### Step 1: Clone the Repository 108 | 109 | Clone this repository: 110 | 111 | ```shell 112 | git clone https://github.com/scrapoxy/scraping-workshop.git 113 | cd scraping-workshop 114 | ``` 115 | 116 | 117 | ### Step 2: Install Python libraries 118 | 119 | Open a shell and install libraries: 120 | 121 | ```shell 122 | pip install -r requirements.txt 123 | ``` 124 | 125 | 126 | ### Step 3: Install Playwright 127 | 128 | After installing the Python libraries, run the follow command: 129 | 130 | ```shell 131 | playwright install --with-deps chromium 132 | ``` 133 | 134 | 135 | ### Step 4: Install Node.js 136 | 137 | Install Node.js from the [official website](https://nodejs.org/en/download/) or through the version management [NVM](https://github.com/nvm-sh/nvm) 138 | 139 | 140 | ### Step 5: Install Node.js libraries 141 | 142 | Open a shell and install libraries from `package.json`: 143 | 144 | ```shell 145 | npm install 146 | ``` 147 | 148 | 149 | ### Step 6: Scrapoxy 150 | 151 | Run the following command to download Scrapoxy: 152 | 153 | ```shell 154 | sudo docker pull scrapoxy/scrapoxy 155 | ``` 156 | 157 | 158 | ## Challenge 1: Run your first Scraper 159 | 160 | The URL to scrape is: [https://trekky-reviews.com/level1](https://trekky-reviews.com/level1) 161 | 162 | Our goal is to collect **names, emails, and reviews** for each hotel listed. 163 | 164 | Open the file [`scrapers/spiders/trekky.py`](scrapers/spiders/trekky.py). 165 | 166 | In Scrapy, a spider is a Python class with a unique `name` property. Here, the name is `trekky`. 167 | 168 | The spider class includes a method called `start_requests`, which defines the initial URLs to scrape. 169 | When a URL is fetched, the Scrapy engine triggers a callback function. 170 | This callback function handles the parsing of the data. 
171 | It's also possible to generate new requests from within the callback function, allowing for chaining of requests and callbacks. 172 | 173 | The structure of items is defined in the file [`scrapers/items.py`](scrapers/items.py). 174 | Each item type is represented by a dataclass containing fields and a loader: 175 | 176 | * `HotelItem`: name, email, review with the loader `HotelItemLoader` 177 | * `ReviewItem`: rating with the loader `ReviewItemLoader` 178 | 179 | To run the spider, open a terminal at the project's root directory and run the following command: 180 | 181 | ```shell 182 | scrapy crawl trekky 183 | ``` 184 | 185 | Scrapy will collect data from **50 hotels**: 186 | 187 | ```text 188 | 2024-07-05 23:11:43 [trekky] INFO: 189 | 190 | We got: 50 items 191 | 192 | ``` 193 | 194 | Check the `results.csv` file to confirm that all items were collected. 195 | 196 | 197 | ## Challenge 2: First Anti-Bot protection 198 | 199 | The URL to scrape is: [https://trekky-reviews.com/level2](https://trekky-reviews.com/level2) 200 | 201 | Update the URL in your scraper to target the new page and execute the spider: 202 | 203 | ```shell 204 | scrapy crawl trekky 205 | ``` 206 | 207 | Data collection may fail due to **an anti-bot system**. 208 | 209 | Pay attention to the **messages** explaining why access is blocked and use this information to correct the scraper. 210 | 211 | _Hint: It relates to HTTP request headers 😉_ 212 | 213 |
214 | Solution is here 215 | Open the solution 216 |
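For reference, the fix usually boils down to making the spider's requests carry browser-like headers. The snippet below is only an illustrative sketch (the header values are assumptions copied from a typical Chrome session, and it is not the content of `solutions/challenge-2.py`); the exact headers the site checks are yours to discover with the Network Inspector.

```python
import scrapy


class TrekkySpider(scrapy.Spider):
    name = "trekky"

    # Illustrative values only: copy the headers your own browser sends
    # (Network Inspector -> any request -> Request Headers).
    custom_settings = {
        "USER_AGENT": (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
        ),
        "DEFAULT_REQUEST_HEADERS": {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
        },
    }
```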
217 | 218 | 219 | ## Challenge 3: Rate limit 220 | 221 | The URL to scrape is: [https://trekky-reviews.com/level4](https://trekky-reviews.com/level4) (we will skip level3) 222 | 223 | Update the URL in your scraper to target the new page and execute the spider: 224 | 225 | ```shell 226 | scrapy crawl trekky 227 | ``` 228 | 229 | Data collection might fail due to **rate limiting** on our IP address. 230 | 231 | 232 | 233 | 236 | 241 | 242 |
234 | 235 | 237 | Please don't adjust the delay between requests or the number of concurrent requests; that is not our goal. 238 | Imagine we need to collect millions of items within a few hours, and delaying our scraping session is not an option. 239 | Instead, we will use proxies to distribute requests across multiple IP addresses. 240 |
243 | 244 | Use [Scrapoxy](https://scrapoxy.io) to bypass the rate limit with a cloud provider (not a proxy service). 245 | 246 | 247 | ### Step 1: Install Scrapoxy 248 | 249 | Follow this [guide](https://scrapoxy.io/intro/get-started) or run the following command in the project directory: 250 | 251 | ```shell 252 | sudo docker run -p 8888:8888 -p 8890:8890 -e AUTH_LOCAL_USERNAME=admin -e AUTH_LOCAL_PASSWORD=password -e BACKEND_JWT_SECRET=secret1 -e FRONTEND_JWT_SECRET=secret2 -e STORAGE_FILE_FILENAME=/scrapoxy.json -v ./scrapoxy.json:/scrapoxy.json scrapoxy/scrapoxy:latest 253 | ``` 254 | 255 | 256 | ### Step 2: Create a new project 257 | 258 | In the new project, keep the default settings and click the `Create` button: 259 | 260 | ![Scrapoxy Project Create](images/scrapoxy-project-create.png) 261 | 262 | 263 | ### Step 3: Add a Proxy Provider 264 | 265 | See the slides to set up the proxies provider account. 266 | 267 | Use **10 proxies** from the **United States of America**: 268 | 269 | ![Scrapoxy Connector Create](images/scrapoxy-connector-create.png) 270 | 271 | If you don't have an account with these cloud providers, you can create one. 272 | 273 | 274 | 275 | 278 | 282 | 283 |
276 | 277 | 279 | They typically require a credit card, and you may need to pay a nominal fee of $1 or $2 for this workshop. 280 | Such charges are common when using proxies. Don't worry; in the next challenge, I'll provide you with free credit. 281 |
284 | 285 | 286 | ### Step 4: Run the connector 287 | 288 | ![Scrapoxy Connector Run](images/scrapoxy-connector-run.png) 289 | 290 | 291 | ### Step 5: Connect Scrapoxy to the spider 292 | 293 | Follow this [guide](https://scrapoxy.io/integration/python/scrapy/guide). 294 | 295 | 296 | ### Step 6: Execute the spider 297 | 298 | Run your spider and check that Scrapoxy is handling the requests: 299 | 300 | ![Scrapoxy Proxies](images/scrapoxy-proxies.png) 301 | 302 | You should observe an increase in the count of received and sent requests. 303 | 304 |
305 | Solution is here 306 | Open the solution 307 |
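For orientation, the Scrapy side of the integration is usually just a few lines in `scrapers/settings.py`. The middleware path and setting names below follow my reading of the Scrapoxy integration guide and the credentials are placeholders (each Scrapoxy project generates its own token), so double-check everything against the guide linked in Step 5.

```python
# Sketch only: names as documented in the Scrapoxy integration guide.
# The username/password come from the project's token in the Scrapoxy UI,
# not from the web-UI login used to start the container.
DOWNLOADER_MIDDLEWARES = {
    "scrapoxy.ProxyDownloaderMiddleware": 100,
}

SCRAPOXY_MASTER = "http://localhost:8888"     # proxy endpoint
SCRAPOXY_API = "http://localhost:8890/api"    # commander API
SCRAPOXY_USERNAME = "PROJECT_TOKEN_USERNAME"  # placeholder
SCRAPOXY_PASSWORD = "PROJECT_TOKEN_PASSWORD"  # placeholder
```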
308 | 309 | 310 | ## Challenge 4: Fingerprint 311 | 312 | The URL to scrape is: [https://trekky-reviews.com/level6](https://trekky-reviews.com/level6) (we will skip level5). 313 | 314 | Update the URL in your scraper to target the new page and execute the spider: 315 | 316 | ```shell 317 | scrapy crawl trekky 318 | ``` 319 | 320 | Data collection may fail due to **fingerprinting**. 321 | 322 | Use the Network Inspector in your browser to view all requests. 323 | You will notice a new request appearing, which is a **POST request** instead of a GET request. 324 | 325 | ![Chrome Network Inspector List](images/chrome-network-inspector-list.png) 326 | 327 | Inspect the website's code to identify the **JavaScript** that triggers this request. 328 | 329 | ![Chrome Network Inspector](images/chrome-network-inspector-initiator.png) 330 | 331 | It's clear that we need to **execute JavaScript**. 332 | Simply using Scrapy to send HTTP requests is not enough. 333 | 334 | Also, it's important to maintain the **same IP address** throughout the session. 335 | Scrapoxy offers a **sticky session** feature for this purpose when using a browser. 336 | 337 | Navigate to the Project options and enable both `Intercept HTTPS requests with MITM` 338 | and `Keep the same proxy with cookie injection`: 339 | 340 | ![Scrapoxy Project Update](images/scrapoxy-project-update.png) 341 | 342 | We will use the headless framework [Playwright](https://playwright.dev) along with Scrapy's plugin [scrapy-playwright](https://github.com/scrapy-plugins/scrapy-playwright). 343 | 344 | 345 | 346 | 349 | 352 | 353 |
347 | 348 | 350 | scrapy-playwright should already be installed. 351 |
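The next paragraphs describe the required changes in detail. As a rough preview, the scrapy-playwright wiring in `settings.py` generally looks like the sketch below; the proxy port and credentials are placeholders for your Scrapoxy setup, and `solutions/challenge-4.py` remains the reference implementation.

```python
# Preview sketch of the settings described below (placeholder values).
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,  # watch what the browser is doing
    "proxy": {
        "server": "http://localhost:8888",     # Scrapoxy master
        "username": "PROJECT_TOKEN_USERNAME",  # placeholder
        "password": "PROJECT_TOKEN_PASSWORD",  # placeholder
    },
}
```

Each request then opts in with `meta={"playwright": True}`, plus the HTTPS-error option mentioned below.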
354 | 355 | Our goal is to adapt the spider to integrate Playwright. 356 | 357 | You can now completely replace the code in [solutions/challenge-4.py](solutions/challenge-4.py) 358 | due to extensive modifications needed. 359 | 360 | The updates include: 361 | 362 | * **Settings**: Updated to add a custom `DOWNLOAD_HANDLERS`. 363 | * **Playwright Configuration**: `PLAYWRIGHT_LAUNCH_OPTIONS` now: 364 | * Disables headless mode, allowing you to view Playwright’s actions. 365 | * Configures a proxy to route traffic through Scrapoxy. 366 | * **Request Metadata**: Each request now includes metadata to enable Playwright and ignore HTTPS errors (using ignore_https_errors). 367 | 368 | 369 | ## Challenge 5: Consistency 370 | 371 | The URL to scrape is: [https://trekky-reviews.com/level7](https://trekky-reviews.com/level7) 372 | 373 | Update the URL in your scraper to target the new page and execute the spider: 374 | 375 | ```shell 376 | scrapy crawl trekky 377 | ``` 378 | 379 | You will notice that data collection may fail due to **inconsistency** errors. 380 | 381 | Anti-bot checks consistency across various layers of the browser stack. 382 | 383 | Try to solve these errors! 384 | 385 | _Hint: It involves adjusting timezones 😉_ 386 | 387 |
388 | Solution is here 389 | Open the solution 390 |
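For context, the kind of adjustment the hint points at is aligning what the browser reports (timezone, locale) with the country of the proxy's exit IP, since the US proxies from Challenge 3 would otherwise contradict a European timezone. In plain Playwright this is a context option, as in the sketch below; the values are examples for US exit IPs, and how you pass the same options through scrapy-playwright is part of the exercise.

```python
import asyncio

from playwright.async_api import async_playwright


async def demo() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        # Align the reported timezone/locale with the proxy's exit country.
        # Example values only: pick a timezone matching your proxies.
        context = await browser.new_context(
            timezone_id="America/New_York",
            locale="en-US",
            ignore_https_errors=True,
        )
        page = await context.new_page()
        await page.goto("https://trekky-reviews.com/level7")
        await browser.close()


asyncio.run(demo())
```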
391 | 392 | 393 | ## Challenge 6: Deobfuscation 394 | 395 | The URL to scrape is: [https://trekky-reviews.com/level8](https://trekky-reviews.com/level8) 396 | 397 | Update the URL in your scraper to target the new page and execute the spider: 398 | 399 | ```shell 400 | scrapy crawl trekky 401 | ``` 402 | 403 | ### Step 1: Find the Anti-Bot JavaScript 404 | 405 | Use the **Network Inspector** to review all requests. 406 | Among them, you'll spot some unusual ones. 407 | By inspecting the payload, you'll notice that the content is **encrypted**: 408 | 409 | ![Chrome Network Inspector - List 2](images/chrome-network-inspector-list2.png) 410 | 411 | Inspect the website's code to find the JavaScript responsible for sending these requests. 412 | In this case, the source code is obfuscated. 413 | 414 | The obfuscated code looks like this: 415 | 416 | ```javascript 417 | var _qkrt1f=window,_uqvmii="tdN",_u5zh1i="UNM",_p949z3="on",_eu2jji="en",_vnsd5q="bto",_bi4e9="a",_f1e79r="e",_w13dld="ode",_vbg0l7="RSA-",_6uh486="ki"... 418 | ``` 419 | 420 | To understand which information is being sent and how to emulate it, we need to **deobfuscate the code**. 421 | 422 | 423 | ### Step 2: Understand the code structure 424 | 425 | To understand the structure of the code, copy/paste some of it into the website [AST Explorer](https://astexplorer.net). 426 | 427 | Don't forget to select `@babel/parser` and enable `Transform`: 428 | 429 | ![AST Explorer Header](images/ast-header.png) 430 | 431 | AST Explorer parses the source code and generates a visual tree: 432 | 433 | ![AST Explorer UI](images/ast-ui.png) 434 | 435 | 436 | 437 | 440 | 443 | 444 |
438 | 439 | 441 | For the record, I only obfuscated strings, not the code flow. 442 |
445 | 446 | 447 | ### Step 3: Deobfuscate the JavaScript 448 | 449 | Copy/paste the whole obfuscated code into `tools/obfuscated.js`. 450 | 451 | Then run the deobfuscator script: 452 | 453 | ```shell 454 | node tools/deobfuscator.js 455 | ``` 456 | 457 | This script is written specifically to deobfuscate this code. 458 | 459 | 460 | 461 | 464 | 472 | 473 |
462 | 463 | 465 | You can use online tools to deobfuscate this script, 466 | given that it's a straightforward obfuscated script. 467 | Also, GitHub Copilot 468 | can be incredibly helpful in writing AST operations, just as 469 | Claude Sonnet 3.5 470 | is valuable for deciphering complex functions. 471 |
474 | 475 | 476 | ### Step 4: Payload generation 477 | 478 | Here’s a summary of the script’s behavior: 479 | 480 | 1. It collects **WebGL information**; 481 | 2. It encrypts the data using **RSA encryption** with an obfuscated public key; 482 | 3. It sends the payload to the webserver via a **POST request**. 483 | 484 | We need to implement the same approach in our spider. 485 | 486 | Since we will be crafting the payload ourselves, there is **no need** to use Playwright anymore. 487 | We will simply send the payload **before** initiating any requests. 488 | 489 | You can now completely replace the code in [solutions/challenge-6-1-partial.py](solutions/challenge-6-1-partial.py) 490 | and fill in the missing parts. 491 | 492 |
493 | Solution is here 494 | Open the solution 495 |
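To make the partial solution easier to approach, here is the general shape of the payload step: build a WebGL-like fingerprint, encrypt it with the RSA public key recovered from the deobfuscated script, then POST the result before crawling. Everything below is a placeholder sketch; the field names, key, and padding scheme must be replaced with what you actually found during deobfuscation, and the workshop's solution files remain the reference.

```python
import base64
import json

from Crypto.Cipher import PKCS1_OAEP
from Crypto.PublicKey import RSA

# Placeholder: paste the public key extracted from the deobfuscated JavaScript.
PUBLIC_KEY_PEM = b"""-----BEGIN PUBLIC KEY-----
...key recovered during deobfuscation...
-----END PUBLIC KEY-----"""


def build_payload() -> str:
    # Placeholder fields: mirror whatever WebGL properties the script collects.
    fingerprint = {
        "webgl_vendor": "Google Inc. (Intel)",
        "webgl_renderer": "ANGLE (Intel, Mesa Intel(R) UHD Graphics, OpenGL 4.6)",
    }
    # Assumes RSA-OAEP padding; confirm against the deobfuscated code.
    cipher = PKCS1_OAEP.new(RSA.import_key(PUBLIC_KEY_PEM))
    encrypted = cipher.encrypt(json.dumps(fingerprint).encode("utf-8"))
    return base64.b64encode(encrypted).decode("ascii")
```

The spider then sends this string in a POST request (for example with `scrapy.Request(..., method="POST", body=...)`) before yielding the regular page requests.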
496 | 497 | 498 | ## Challenge 7: Playwright Detection 499 | 500 | The URL to scrape is: [https://trekky-reviews.com/level9](https://trekky-reviews.com/level9) 501 | 502 | For this challenge, use the pure Playwright scraper in [playwright_spider.py](playwright_spider.py) directly (don't use Scrapy). 503 | 504 | Run it: 505 | 506 | ```shell 507 | python playwright_spider.py 508 | ``` 509 | 510 | You will notice that data collection may fail due to **Playwright** detection. 511 | 512 | The anti-bot checks whether the CDP protocol is in use or the network inspector is open. 513 | 514 | Try replacing Playwright with another framework! 515 | 516 | _Hint: Use Camoufox 😉_ 517 | 518 |
519 | Solution is here 520 | Open the solution 521 |
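The heart of the fix is already visible in `solutions/challenge-7.py` above: the Playwright launch is swapped for Camoufox, which exposes a Playwright-compatible browser object, so the navigation and parsing code barely changes. A stripped-down sketch of that swap (the Camoufox package must be installed separately, as it is not in `requirements.txt`):

```python
import asyncio

from camoufox.async_api import AsyncCamoufox  # replaces playwright.async_api


async def demo() -> None:
    # Same structure as the Playwright version, but the browser is Camoufox.
    async with AsyncCamoufox(headless=False) as browser:
        context = await browser.new_context(ignore_https_errors=True)
        page = await context.new_page()
        await page.goto("https://trekky-reviews.com/level9")
        await context.close()


asyncio.run(demo())
```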
522 | 523 | 524 | ## Conclusion 525 | 526 | Thank you so much for participating in this workshop. 527 | 528 | Your feedback is incredibly valuable to me. 529 | Please take a moment to complete this survey; your insights will greatly assist in enhancing future workshops: 530 | 531 | 👉 [GO TO SURVEY](https://bit.ly/scwsv) 👈 532 | 533 | 534 | ## Licence 535 | 536 | WebScraping Anti-Ban Workshop © 2024 by [Fabien Vauchelles](https://www.linkedin.com/in/fabienvauchelles) is licensed under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/?ref=chooser-v1): 537 | 538 | * Credit must be given to the creator; 539 | * Only noncommercial use of your work is permitted; 540 | * No derivatives or adaptations of your work are permitted. 541 | 542 | --------------------------------------------------------------------------------