├── .flake8 ├── .github └── workflows │ └── tests.yml ├── .gitignore ├── LICENSE ├── README.rst ├── pyproject.toml ├── scrapy_frontera ├── __init__.py ├── components │ └── __init__.py ├── converters.py ├── core │ ├── __init__.py │ └── manager.py ├── manager.py ├── middlewares.py ├── scheduler.py ├── settings.py └── utils.py ├── setup.py ├── tests └── test_scheduler.py └── tox.ini /.flake8: -------------------------------------------------------------------------------- 1 | [flake8] 2 | max-line-length = 120 3 | -------------------------------------------------------------------------------- /.github/workflows/tests.yml: -------------------------------------------------------------------------------- 1 | name: Tests 2 | on: [push, pull_request] 3 | 4 | jobs: 5 | tests: 6 | runs-on: ubuntu-latest 7 | strategy: 8 | fail-fast: false 9 | matrix: 10 | include: 11 | - python-version: "3.8" 12 | env: 13 | TOXENV: pinned 14 | - python-version: "3.8" 15 | env: 16 | TOXENV: py 17 | - python-version: "3.9" 18 | env: 19 | TOXENV: py 20 | 21 | steps: 22 | - uses: actions/checkout@v4 23 | 24 | - name: Set up Python ${{ matrix.python-version }} 25 | uses: actions/setup-python@v4 26 | with: 27 | python-version: ${{ matrix.python-version }} 28 | 29 | - name: Run tests 30 | env: ${{ matrix.env }} 31 | run: | 32 | pip install -U tox 33 | tox 34 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | /build 3 | /*.egg-info 4 | /dist 5 | dumpers/build 6 | dumpers/*.egg-info 7 | dumpers/dist 8 | .tox 9 | venv 10 | .idea 11 | .eggs 12 | htmlcov 13 | .coverage 14 | .vim.custom 15 | docs/_build 16 | Pipfile* 17 | .env 18 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2018, Scrapinghub 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | * Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | * Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 13 | 14 | * Neither the name of exporters nor the names of its 15 | contributors may be used to endorse or promote products derived from 16 | this software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 22 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 23 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 24 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 25 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 26 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 27 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
28 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | Frontera scheduler for Scrapy 2 | ============================= 3 | 4 | A more flexible and featured `Frontera `_ scheduler for Scrapy, which doesn't force you to reimplement 5 | capabilities already present in Scrapy, so it provides: 6 | 7 | - Scrapy-handled request dupefilter 8 | - Scrapy-handled disk and memory request queues 9 | - Only sends to Frontera the requests marked to be processed by it (by setting the request meta attribute ``cf_store`` to True), thus avoiding a lot of conflicts. 10 | - Allows setting Frontera settings from the spider constructor, by loading the Frontera manager after spider instantiation. 11 | - Allows Frontera components to access the Scrapy stats manager instance by adding the STATS_MANAGER Frontera setting 12 | - Better request/response converters, fully compatible with ScrapyCloud and Scrapy 13 | - Emulates the dont_filter=True Scrapy Request flag 14 | - Frontier fingerprint is the same as the Scrapy request fingerprint (can be overridden by passing 'frontier_fingerprint' in the request meta) 15 | - Allows custom preprocessing or ignoring of requests read from the frontier before they are actually enqueued 16 | - Thoroughly tested, used and featured 17 | 18 | The result is that a crawler using this scheduler does not behave differently from a crawler that doesn't use the frontier, and 19 | the re-engineering needed to adapt a spider to work with the frontier is minimal. 20 | 21 | 22 | Versions: 23 | --------- 24 | 25 | Up to version 0.1.8, frontera==0.3.3 and Python 2 are required. Version 0.2.x requires frontera==0.7.1 and is compatible with Python 3. 26 | 27 | Installation: 28 | ------------- 29 | 30 | pip install scrapy-frontera 31 | 32 | 33 | Usage and features: 34 | ------------------- 35 | 36 | Note: In the context of this doc, a producer spider is the spider that writes requests to the frontier, and the consumer is the one that reads 37 | them from the frontier. They can be either the same spider or separate ones. 38 | 39 | In your project settings.py:: 40 | 41 | SCHEDULER = 'scrapy_frontera.scheduler.FronteraScheduler' 42 | 43 | DOWNLOADER_MIDDLEWARES = { 44 | 'scrapy_frontera.middlewares.SchedulerDownloaderMiddleware': 0, 45 | } 46 | 47 | SPIDER_MIDDLEWARES = { 48 | 'scrapy_frontera.middlewares.SchedulerSpiderMiddleware': 0, 49 | } 50 | 51 | # Set to True if you want start requests to be redirected to the frontier. 52 | # By default they go directly to the Scrapy downloader. 53 | # FRONTERA_SCHEDULER_START_REQUESTS_TO_FRONTIER = False 54 | 55 | # Redirects to the frontier the requests with the given callback names. 56 | # Important: this setting doesn't affect start requests. 57 | # FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER = [] 58 | 59 | # Spider attributes that need to be passed along with the requests redirected to the frontier. 60 | # Some previous callbacks may have generated state needed by following ones. 61 | # This setting allows that state to be transmitted between different jobs. 62 | # FRONTERA_SCHEDULER_STATE_ATTRIBUTES = [] 63 | 64 | # Map requests to a specific slot prefix by their callback name. 65 | # FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP = {} 66 | 67 | 68 | Plus the usual Frontera setup. For instance, for `hcf-backend `_:: 69 | 70 | BACKEND = 'hcf_backend.HCFBackend' 71 | HCF_PROJECT_ID = 11111 72 | 73 | (etc...)
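With the scheduler and middlewares in place, a producer spider marks individual requests for the frontier with the ``cf_store`` meta flag (described in more detail below). The following is a minimal, illustrative sketch; the spider name and URL are placeholders::

    import scrapy


    class ExampleProducerSpider(scrapy.Spider):

        name = 'example-producer'
        start_urls = ['https://example.com']

        def parse(self, response):
            for href in response.css('a::attr(href)').getall():
                # cf_store=True routes this request through the frontier;
                # without it, the request is handled as a regular Scrapy request.
                yield scrapy.Request(
                    response.urljoin(href),
                    callback=self.parse,
                    meta={'cf_store': True},
                )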
74 | 75 | You can also set up spider-specific Frontera settings via the spider class attribute dict ``frontera_settings``. Example 76 | with `hcf backend`:: 77 | 78 | class MySpider(Spider): 79 | 80 | name = 'my-producer' 81 | 82 | frontera_settings = { 83 | 'HCF_AUTH': 'xxxxxxxxxx', 84 | 'HCF_PROJECT_ID': 11111, 85 | 'HCF_PRODUCER_FRONTIER': 'myfrontier', 86 | 'HCF_PRODUCER_NUMBER_OF_SLOTS': 8, 87 | } 88 | 89 | Scrapy-frontera also accepts the spider attribute ``frontera_settings_json``. This is especially useful for consumers, which need per-job 90 | setup of the reading slot. For example, you can configure a consumer spider in this way, for usage with `hcf backend `_:: 91 | 92 | class MySpider(Spider): 93 | 94 | name = 'my-consumer' 95 | 96 | frontera_settings = { 97 | 'HCF_AUTH': 'xxxxxxxxxx', 98 | 'HCF_PROJECT_ID': 11111, 99 | 'HCF_CONSUMER_FRONTIER': 'myfrontier', 100 | } 101 | 102 | 103 | and invoke it via:: 104 | 105 | scrapy crawl my-consumer -a frontera_settings_json='{"HCF_CONSUMER_SLOT": "0"}' 106 | 107 | Settings provided through ``frontera_settings_json`` override those provided using ``frontera_settings``, which in turn override those provided in the 108 | project settings.py file. 109 | 110 | Requests will go through the Frontera pipeline only if the flag ``cf_store`` with value True is included in the request meta. If ``cf_store`` is not present 111 | or is False, requests will be processed as normal Scrapy requests. An alternative to the ``cf_store`` flag are the Scrapy settings ``FRONTERA_SCHEDULER_START_REQUESTS_TO_FRONTIER`` and ``FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER`` (see above about usage of these settings). 112 | 113 | Requests read from the frontier are directly enqueued by the scheduler. This means that they are not processed by spider middleware. Their 114 | processing entry point is the downloader middleware ``process_request()`` pipeline. But if you need to preprocess requests incoming from the frontier 115 | in the spider, you can define the spider method ``preprocess_request_from_frontier(request: scrapy.Request)``. If defined, the scheduler will invoke 116 | it before actually enqueuing each request. This method must return either None or a request (the same one it received, or another one). The return value is what 117 | will actually be enqueued, so if it is None, the request is skipped (not enqueued). 118 | 119 | If requests read from the frontier don't already have an errback defined, the scheduler will automatically assign to them the consumer spider's ``errback`` method, 120 | if it exists. This is especially useful when the consumer spider is not the same as the producer one. 121 | 122 | Another useful setting is ``FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP``. This is a dict which allows mapping requests with a specific callback to a specific slot prefix, and optionally a number of slots, different from the defaults assigned by the Frontera backend (this feature has to be supported by the specific Frontera backend you use; recent versions of hcf-backend do support it).
For example:: 123 | 124 | class MySpider(Spider): 125 | 126 | name = 'my-producer' 127 | 128 | frontera_settings = { 129 | 'HCF_AUTH': 'xxxxxxxxxx', 130 | 'HCF_PROJECT_ID': 11111, 131 | 'HCF_PRODUCER_FRONTIER': 'myfrontier', 132 | 'HCF_PRODUCER_SLOT_PREFIX': 'my-consumer', 133 | 'HCF_PRODUCER_NUMBER_OF_SLOTS': 8, 134 | } 135 | 136 | custom_settings = { 137 | 'FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP': {'parse': 'my-producer/4'}, 138 | 'FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER': ['parse', 'parse_consumer'] 139 | } 140 | 141 | def parse_consumer(self, response): 142 | assert False 143 | 144 | def parse(self, response): 145 | (...) 146 | 147 | Under this configuration, requests with callback ``parse()`` will be saved on 4 slots with prefix ``my-producer``, while requests with callback ``parse_consumer()`` will use the configuration from the HCF settings, that is, 8 slots with prefix ``my-consumer``. 148 | 149 | An integrated tutorial is available at the `shub-workflow Tutorial `_. 150 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.black] 2 | line-length = 120 3 | 4 | [tool.pytest.ini_options] 5 | filterwarnings = [ 6 | "ignore::DeprecationWarning:.*\\bfrontera\\b", 7 | ] 8 | -------------------------------------------------------------------------------- /scrapy_frontera/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '0.2.9.1' 2 | -------------------------------------------------------------------------------- /scrapy_frontera/components/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/scrapy-frontera/5b260dbea6c9a2cc4e0e0b4d9c0f5f3270d567c8/scrapy_frontera/components/__init__.py -------------------------------------------------------------------------------- /scrapy_frontera/converters.py: -------------------------------------------------------------------------------- 1 | import uuid 2 | 3 | import logging 4 | 5 | from scrapy.http.request import Request as ScrapyRequest 6 | from scrapy.http.response import Response as ScrapyResponse 7 | 8 | from w3lib.util import to_bytes, to_native_str 9 | 10 | from frontera.core.models import Request as FrontierRequest 11 | from frontera.core.models import Response as FrontierResponse 12 | from frontera.utils.converters import BaseRequestConverter, BaseResponseConverter 13 | 14 | from .utils import get_callback_name 15 | 16 | 17 | _LOG = logging.getLogger(__name__) 18 | 19 | 20 | class RequestConverter(BaseRequestConverter): 21 | """Converts between frontera and Scrapy request objects""" 22 | 23 | def __init__(self, spider): 24 | self.spider = spider 25 | crawler = spider.crawler 26 | if hasattr(crawler, "request_fingerprinter"): 27 | self.request_fingerprint = crawler.request_fingerprinter.fingerprint 28 | else: 29 | from scrapy.utils.request import request_fingerprint 30 | self.request_fingerprint = request_fingerprint 31 | 32 | def to_frontier(self, scrapy_request): 33 | """request: Scrapy > Frontier""" 34 | if isinstance(scrapy_request.cookies, dict): 35 | cookies = scrapy_request.cookies 36 | else: 37 | cookies = dict(sum([list(d.items()) for d in scrapy_request.cookies], [])) 38 | cb = scrapy_request.callback 39 | if callable(cb): 40 | cb = _find_method(self.spider, cb) 41 | eb = scrapy_request.errback 42 | if callable(eb): 43 | eb = 
_find_method(self.spider, eb) 44 | 45 | statevars = self.spider.crawler.settings.getlist("FRONTERA_SCHEDULER_STATE_ATTRIBUTES", []) 46 | meta = { 47 | b"scrapy_callback": cb, 48 | b"scrapy_cb_kwargs": scrapy_request.cb_kwargs, 49 | b"scrapy_errback": eb, 50 | b"scrapy_meta": scrapy_request.meta, 51 | b"scrapy_body": scrapy_request.body, 52 | b"spider_state": [(attr, getattr(self.spider, attr, None)) for attr in statevars], 53 | b"origin_is_frontier": True, 54 | } 55 | 56 | fingerprint_scrapy_request = scrapy_request 57 | if fingerprint_scrapy_request.dont_filter: 58 | # if dont_filter is True, we need to simulate 59 | # not filtering by generating a different fingerprint each time we see the same request. 60 | # So let's randomly alter the URL. 61 | fake_url = fingerprint_scrapy_request.url + str(uuid.uuid4()) 62 | fingerprint_scrapy_request = fingerprint_scrapy_request.replace(url=fake_url) 63 | meta[b"frontier_fingerprint"] = scrapy_request.meta.get( 64 | "frontier_fingerprint", self.request_fingerprint(fingerprint_scrapy_request) 65 | ) 66 | callback_slot_prefix_map = self.spider.crawler.settings.getdict("FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP") 67 | frontier_slot_prefix_num_slots = callback_slot_prefix_map.get(get_callback_name(scrapy_request)) 68 | if frontier_slot_prefix_num_slots: 69 | frontier_slot_prefix, *rest = frontier_slot_prefix_num_slots.split("/", 1) 70 | meta[b"frontier_slot_prefix"] = frontier_slot_prefix 71 | if rest: 72 | meta[b"frontier_number_of_slots"] = int(rest[0]) 73 | return FrontierRequest( 74 | url=scrapy_request.url, 75 | method=scrapy_request.method, 76 | headers=dict(scrapy_request.headers.items()), 77 | cookies=cookies, 78 | meta=meta, 79 | ) 80 | 81 | def from_frontier(self, frontier_request): 82 | """request: Frontier > Scrapy""" 83 | cb = frontier_request.meta.get(b"scrapy_callback", None) 84 | if cb and self.spider: 85 | cb = _get_method(self.spider, cb) 86 | eb = frontier_request.meta.get(b"scrapy_errback", None) 87 | if eb and self.spider: 88 | eb = _get_method(self.spider, eb) 89 | body = frontier_request.meta.get(b"scrapy_body", None) 90 | cb_kwargs = frontier_request.meta[b"scrapy_cb_kwargs"] 91 | meta = frontier_request.meta[b"scrapy_meta"] 92 | meta.pop("cf_store", None) 93 | for attr, val in frontier_request.meta.get(b"spider_state", []): 94 | prev_value = getattr(self.spider, attr, None) 95 | if prev_value is not None and prev_value != val: 96 | _LOG.error( 97 | "State change for attribute '%s' from '%s' to '%s' attempted by request <%s>, so crawl may lose consistency. 
\ 98 | Per request state should be propagated via request attributes.", 99 | attr, 100 | prev_value, 101 | val, 102 | frontier_request.url, 103 | ) 104 | elif prev_value != val: 105 | setattr(self.spider, attr, val) 106 | _LOG.info("State for attribute '%s' set to %s by request <%s>", attr, val, frontier_request.url) 107 | 108 | return ScrapyRequest( 109 | url=frontier_request.url, 110 | callback=cb, 111 | errback=eb, 112 | body=body, 113 | method=to_native_str(frontier_request.method), 114 | headers=frontier_request.headers, 115 | cookies=frontier_request.cookies, 116 | meta=meta, 117 | cb_kwargs=cb_kwargs, 118 | dont_filter=True, 119 | ) 120 | 121 | 122 | class ResponseConverter(BaseResponseConverter): 123 | """Converts between frontera and Scrapy response objects""" 124 | 125 | def __init__(self, spider, request_converter): 126 | self.spider = spider 127 | self._request_converter = request_converter 128 | 129 | def to_frontier(self, scrapy_response): 130 | """response: Scrapy > Frontier""" 131 | frontier_request = scrapy_response.meta.get( 132 | "frontier_request", self._request_converter.to_frontier(scrapy_response.request) 133 | ) 134 | frontier_request.meta[b"scrapy_meta"] = scrapy_response.meta 135 | return FrontierResponse( 136 | url=scrapy_response.url, 137 | status_code=scrapy_response.status, 138 | headers=dict(scrapy_response.headers.items()), 139 | body=scrapy_response.body, 140 | request=frontier_request, 141 | ) 142 | 143 | def from_frontier(self, response): 144 | """response: Frontier > Scrapy""" 145 | return ScrapyResponse( 146 | url=response.url, 147 | status=response.status_code, 148 | headers=response.headers, 149 | body=response.body, 150 | request=self._request_converter.from_frontier(response.request), 151 | ) 152 | 153 | 154 | def _find_method(obj, func): 155 | if obj and hasattr(func, "__self__") and func.__self__ is obj: 156 | return to_bytes(func.__func__.__name__) 157 | else: 158 | raise ValueError("Function %s is not a method of: %s" % (func, obj)) 159 | 160 | 161 | def _get_method(obj, name): 162 | name = to_native_str(name) 163 | try: 164 | return getattr(obj, name) 165 | except AttributeError: 166 | raise ValueError("Method %r not found in: %s" % (name, obj)) 167 | -------------------------------------------------------------------------------- /scrapy_frontera/core/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/scrapy-frontera/5b260dbea6c9a2cc4e0e0b4d9c0f5f3270d567c8/scrapy_frontera/core/__init__.py -------------------------------------------------------------------------------- /scrapy_frontera/core/manager.py: -------------------------------------------------------------------------------- 1 | from frontera.core.manager import FrontierManager as FronteraFrontierManager 2 | from frontera.settings import Settings 3 | 4 | from scrapy_frontera.settings import DEFAULT_SETTINGS 5 | 6 | 7 | class FrontierManager(FronteraFrontierManager): 8 | @classmethod 9 | def from_settings(cls, settings=None): 10 | """ 11 | Returns a :class:`FrontierManager ` instance initialized with \ 12 | the passed settings argument. If no settings is given, 13 | :ref:`frontier default settings ` are used. 
14 | """ 15 | manager_settings = Settings.object_from(settings) 16 | settings.set_from_dict(DEFAULT_SETTINGS) 17 | return cls( 18 | request_model=manager_settings.REQUEST_MODEL, 19 | response_model=manager_settings.RESPONSE_MODEL, 20 | backend=manager_settings.BACKEND, 21 | middlewares=manager_settings.MIDDLEWARES, 22 | test_mode=manager_settings.TEST_MODE, 23 | max_requests=manager_settings.MAX_REQUESTS, 24 | max_next_requests=manager_settings.MAX_NEXT_REQUESTS, 25 | auto_start=manager_settings.AUTO_START, 26 | settings=manager_settings, 27 | canonicalsolver=manager_settings.CANONICAL_SOLVER, 28 | ) 29 | -------------------------------------------------------------------------------- /scrapy_frontera/manager.py: -------------------------------------------------------------------------------- 1 | from .converters import RequestConverter, ResponseConverter 2 | 3 | from scrapy_frontera.core.manager import FrontierManager 4 | 5 | 6 | class ScrapyFrontierManager(object): 7 | 8 | spider = None 9 | 10 | def set_spider(self, spider): 11 | assert self.spider is None, 'Spider is already set. Only one spider is supported per process.' 12 | self.spider = spider 13 | self.request_converter = RequestConverter(self.spider) 14 | self.response_converter = ResponseConverter(self.spider, self.request_converter) 15 | 16 | def __init__(self, settings): 17 | self.manager = FrontierManager.from_settings(settings) 18 | 19 | def start(self): 20 | self.manager.start() 21 | 22 | def stop(self): 23 | self.manager.stop() 24 | 25 | def add_seeds(self, seeds): 26 | frontier_seeds = [self.request_converter.to_frontier(seed) for seed in seeds] 27 | self.manager.add_seeds(seeds=frontier_seeds) 28 | 29 | def get_next_requests(self, max_next_requests=0, **kwargs): 30 | frontier_requests = self.manager.get_next_requests(max_next_requests=max_next_requests, **kwargs) 31 | return [self.request_converter.from_frontier(frontier_request) for frontier_request in frontier_requests] 32 | 33 | def page_crawled(self, response): 34 | frontier_response = self.response_converter.to_frontier(response) 35 | self.manager.page_crawled(frontier_response) 36 | 37 | def links_extracted(self, request, links): 38 | frontera_request = self.request_converter.to_frontier(request) 39 | frontera_links = [self.request_converter.to_frontier(link) for link in links] 40 | self.manager.links_extracted(frontera_request, frontera_links) 41 | 42 | def request_error(self, request, error): 43 | self.manager.request_error(request=self.request_converter.to_frontier(request), 44 | error=error) 45 | -------------------------------------------------------------------------------- /scrapy_frontera/middlewares.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | class BaseSchedulerMiddleware(object): 4 | 5 | def __init__(self, crawler): 6 | self.crawler = crawler 7 | 8 | @classmethod 9 | def from_crawler(cls, crawler): 10 | return cls(crawler) 11 | 12 | @property 13 | def scheduler(self): 14 | return self.crawler.engine.slot.scheduler 15 | 16 | 17 | class SchedulerSpiderMiddleware(BaseSchedulerMiddleware): 18 | def process_spider_output(self, response, result, spider): 19 | return self.scheduler.process_spider_output(response, result, spider) 20 | 21 | def process_start_requests(self, start_requests, spider): 22 | if self.crawler.settings.getbool('FRONTERA_SCHEDULER_START_REQUESTS_TO_FRONTIER'): 23 | return [] 24 | if self.crawler.settings.getbool('FRONTERA_SCHEDULER_SKIP_START_REQUESTS'): 25 | return [] 26 | return 
start_requests 27 | 28 | 29 | class SchedulerDownloaderMiddleware(BaseSchedulerMiddleware): 30 | def process_exception(self, request, exception, spider): 31 | return self.scheduler.process_exception(request, exception, spider) 32 | -------------------------------------------------------------------------------- /scrapy_frontera/scheduler.py: -------------------------------------------------------------------------------- 1 | import json 2 | import logging 3 | 4 | from scrapy.core.scheduler import Scheduler 5 | from scrapy.http import Request 6 | 7 | from frontera.contrib.scrapy.settings_adapter import ScrapySettingsAdapter 8 | from .manager import ScrapyFrontierManager 9 | from .utils import get_callback_name 10 | 11 | 12 | LOG = logging.getLogger(__name__) 13 | 14 | 15 | class FronteraScheduler(Scheduler): 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | obj = super(FronteraScheduler, cls).from_crawler(crawler) 20 | obj.crawler = crawler 21 | obj.frontier = None 22 | return obj 23 | 24 | def next_request(self): 25 | if not self.has_pending_requests(): 26 | self._get_requests_from_backend() 27 | return super(FronteraScheduler, self).next_request() 28 | 29 | def is_frontera_request(self, request): 30 | """ 31 | Only requests whose callback is a method of the spider can be sent to the frontier. 32 | """ 33 | if request.meta.get('cf_store', False) or get_callback_name(request) in self.frontier_requests_callbacks: 34 | if request.callback is None or getattr(request.callback, '__self__', None) is self.spider: 35 | return True 36 | raise ValueError('Request <{}>: frontera request callback must be a spider method.'.format(request)) 37 | return False 38 | 39 | def process_spider_output(self, response, result, spider): 40 | links = [] 41 | for element in result: 42 | if isinstance(element, Request) and self.is_frontera_request(element): 43 | links.append(element) 44 | else: 45 | yield element 46 | self.frontier.page_crawled(response) 47 | if links: 48 | self.frontier.links_extracted(response.request, links) 49 | self.stats.inc_value('scrapyfrontera/links_extracted_count', len(links)) 50 | 51 | def process_exception(self, request, exception, spider): 52 | error_code = self._get_exception_code(exception) 53 | if self.is_frontera_request(request): 54 | self.frontier.request_error(request=request, error=error_code) 55 | 56 | def open(self, spider): 57 | super(FronteraScheduler, self).open(spider) 58 | frontera_settings = ScrapySettingsAdapter(spider.crawler.settings) 59 | frontera_settings.set_from_dict(getattr(spider, 'frontera_settings', {})) 60 | frontera_settings.set_from_dict(json.loads(getattr(spider, 'frontera_settings_json', '{}'))) 61 | frontera_settings.set('STATS_MANAGER', self.stats) 62 | self.frontier = ScrapyFrontierManager(frontera_settings) 63 | 64 | self.frontier.set_spider(spider) 65 | 66 | if self.crawler.settings.getbool('FRONTERA_SCHEDULER_START_REQUESTS_TO_FRONTIER'): 67 | self.frontier.add_seeds(spider.start_requests()) 68 | 69 | self.frontier_requests_callbacks = \ 70 | self.crawler.settings.getlist('FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER') 71 | 72 | LOG.info('Starting frontier') 73 | if not self.frontier.manager.auto_start: 74 | self.frontier.start() 75 | 76 | def close(self, reason): 77 | super(FronteraScheduler, self).close(reason) 78 | LOG.info('Finishing frontier (%s)' % reason) 79 | self.frontier.stop() 80 | return self.df.close(reason) 81 | 82 | def _get_requests_from_backend(self): 83 | if not self.frontier.manager.finished: 84 | info = self._get_downloader_info() 85 | 
requests = self.frontier.get_next_requests(key_type=info['key_type'], overused_keys=info['overused_keys']) 86 | for request in requests: 87 | if request.errback is None and hasattr(self.spider, 'errback'): 88 | request.errback = self.spider.errback 89 | request.callback = request.callback or self.spider.parse 90 | if hasattr(self.spider, 'preprocess_request_from_frontier'): 91 | request = self.spider.preprocess_request_from_frontier(request) 92 | if request is None: 93 | self.stats.inc_value('scrapyfrontera/ignored_requests_count') 94 | else: 95 | self.enqueue_request(request) 96 | self.stats.inc_value('scrapyfrontera/returned_requests_count') 97 | 98 | def _get_exception_code(self, exception): 99 | try: 100 | return exception.__class__.__name__ 101 | except Exception: 102 | return '?' 103 | 104 | def _get_downloader_info(self): 105 | downloader = self.crawler.engine.downloader 106 | info = { 107 | 'key_type': 'ip' if downloader.ip_concurrency else 'domain', 108 | 'overused_keys': [] 109 | } 110 | for key, slot in downloader.slots.items(): 111 | overused_factor = len(slot.active) / float(slot.concurrency) 112 | if overused_factor > self.frontier.manager.settings.get('OVERUSED_SLOT_FACTOR'): 113 | info['overused_keys'].append(key) 114 | return info 115 | -------------------------------------------------------------------------------- /scrapy_frontera/settings.py: -------------------------------------------------------------------------------- 1 | DEFAULT_SETTINGS = { 2 | 'MIDDLEWARES': ['frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware'], 3 | } 4 | -------------------------------------------------------------------------------- /scrapy_frontera/utils.py: -------------------------------------------------------------------------------- 1 | def get_callback_name(request): 2 | if request.callback is None: 3 | return 'parse' 4 | if hasattr(request.callback, '__func__'): 5 | return request.callback.__func__.__name__ 6 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name = 'scrapy-frontera', 5 | description = 'Featured Frontera scheduler for Scrapy', 6 | long_description = open('README.rst').read(), 7 | version = '0.2.9.1', 8 | license = 'BSD', 9 | url = 'https://github.com/scrapinghub/scrapy-frontera', 10 | maintainer = 'Scrapinghub', 11 | packages = find_packages(), 12 | install_requires=( 13 | 'frontera==0.7.1', 14 | 'scrapy>=1.7.0', 15 | ), 16 | classifiers = [ 17 | 'Development Status :: 5 - Production/Stable', 18 | 'Intended Audience :: Developers', 19 | 'License :: OSI Approved :: BSD License', 20 | 'Operating System :: OS Independent', 21 | 'Programming Language :: Python', 22 | 'Programming Language :: Python :: 3', 23 | ] 24 | ) 25 | -------------------------------------------------------------------------------- /tests/test_scheduler.py: -------------------------------------------------------------------------------- 1 | from unittest.mock import patch 2 | 3 | from twisted.trial.unittest import TestCase 4 | from twisted.internet import defer 5 | 6 | from scrapy import Request, Spider 7 | from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler 8 | from scrapy.http import Response 9 | from scrapy.settings import Settings 10 | from scrapy.utils.test import get_crawler 11 | from scrapy.crawler import CrawlerRunner 12 | 13 | 14 | TEST_SETTINGS = { 15 | 'SCHEDULER': 
'scrapy_frontera.scheduler.FronteraScheduler', 16 | 'BACKEND': 'frontera.contrib.backends.memory.FIFO', 17 | 'DOWNLOADER_MIDDLEWARES': { 18 | 'scrapy_frontera.middlewares.SchedulerDownloaderMiddleware': 0, 19 | }, 20 | 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7', 21 | 'SPIDER_MIDDLEWARES': { 22 | 'scrapy_frontera.middlewares.SchedulerSpiderMiddleware': 0, 23 | }, 24 | } 25 | 26 | 27 | class _TestSpider(Spider): 28 | name = 'test' 29 | success = False 30 | success2 = False 31 | success3 = False 32 | error = False 33 | 34 | def start_requests(self): 35 | yield Request('http://example.com') 36 | 37 | def parse(self, response): 38 | self.success = True 39 | if response.body == b'cf_store': 40 | yield Request('http://example2.com', callback=self.parse2, errback=self.errback, 41 | meta={'cf_store': True}) 42 | else: 43 | yield Request('http://example2.com', callback=self.parse2, errback=self.errback) 44 | 45 | def parse2(self, response): 46 | self.success2 = True 47 | 48 | def errback(self, failure): 49 | self.error = True 50 | response = failure.value.response 51 | if response.body == b'cf_store': 52 | yield Request('http://example3.com', callback=self.parse3, meta={'cf_store': True}) 53 | else: 54 | yield Request('http://example3.com', callback=self.parse3) 55 | 56 | def parse3(self, response): 57 | self.success3 = True 58 | 59 | 60 | class _TestSpider2(Spider): 61 | name = 'test' 62 | success = False 63 | success2 = False 64 | 65 | def start_requests(self): 66 | yield Request('http://example.com') 67 | 68 | def parse(self, response): 69 | self.success = True 70 | yield Request('http://example2.com', callback=self.parse2) 71 | 72 | def parse2(self, response): 73 | self.success2 = True 74 | 75 | 76 | class _TestSpider3(Spider): 77 | name = 'test' 78 | success = 0 79 | 80 | def start_requests(self): 81 | yield Request('http://example.com') 82 | 83 | def parse(self, response): 84 | self.success += 1 85 | yield Request('http://example2.com') 86 | 87 | 88 | class TestDownloadHandler: 89 | 90 | results = [] 91 | 92 | def set_results(self, results): 93 | for r in results: 94 | self.results.append(r) 95 | 96 | def download_request(self, request, spider): 97 | return self.results.pop(0) 98 | 99 | def close(self): 100 | pass 101 | 102 | 103 | class FronteraSchedulerTest(TestCase): 104 | 105 | def setUp(self): 106 | self.runner = CrawlerRunner() 107 | 108 | def tearDown(self): 109 | self.runner.stop() 110 | while TestDownloadHandler.results: 111 | TestDownloadHandler.results.pop() 112 | 113 | @staticmethod 114 | def setup_mocked_handler(mocked_handler, results=None): 115 | handler = TestDownloadHandler() 116 | if results: 117 | handler.set_results(results) 118 | if hasattr(HTTP11DownloadHandler, "from_crawler"): 119 | mocked_handler.from_crawler.return_value = handler 120 | else: 121 | mocked_handler.return_value = handler 122 | 123 | 124 | @defer.inlineCallbacks 125 | def test_start_requests(self): 126 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 127 | self.setup_mocked_handler( 128 | mocked_handler, 129 | [ 130 | Response(url='http://example.com'), 131 | Response(url='http://example2.com'), 132 | ], 133 | ) 134 | 135 | with patch('frontera.contrib.backends.memory.MemoryBaseBackend.links_extracted') as mocked_links_extracted: 136 | mocked_links_extracted.return_value = None 137 | settings = Settings() 138 | settings.setdict(TEST_SETTINGS, priority='cmdline') 139 | crawler = get_crawler(_TestSpider, settings) 140 | 141 | yield 
self.runner.crawl(crawler) 142 | self.assertTrue(crawler.spider.success) 143 | self.assertTrue(crawler.spider.success2) 144 | mocked_links_extracted.assert_not_called() 145 | 146 | @defer.inlineCallbacks 147 | def test_cf_store(self): 148 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 149 | self.setup_mocked_handler( 150 | mocked_handler, 151 | [ 152 | Response(url='http://example.com', body=b'cf_store'), 153 | ], 154 | ) 155 | 156 | with patch('frontera.contrib.backends.memory.MemoryDequeQueue.schedule') as mocked_schedule: 157 | mocked_schedule.return_value = None 158 | settings = Settings() 159 | settings.setdict(TEST_SETTINGS, priority='cmdline') 160 | crawler = get_crawler(_TestSpider, settings) 161 | 162 | yield self.runner.crawl(crawler) 163 | self.assertTrue(crawler.spider.success) 164 | self.assertEqual(mocked_schedule.call_count, 1) 165 | 166 | @defer.inlineCallbacks 167 | def test_callback_requests_to_frontier(self): 168 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 169 | self.setup_mocked_handler( 170 | mocked_handler, 171 | [ 172 | Response(url='http://example.com'), 173 | ], 174 | ) 175 | 176 | with patch('frontera.contrib.backends.memory.MemoryDequeQueue.schedule') as mocked_schedule: 177 | mocked_schedule.return_value = None 178 | settings = Settings() 179 | settings.setdict(TEST_SETTINGS, priority='cmdline') 180 | settings.setdict({ 181 | 'FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER': ['parse2'], 182 | }) 183 | crawler = get_crawler(_TestSpider2, settings) 184 | 185 | yield self.runner.crawl(crawler) 186 | self.assertTrue(crawler.spider.success) 187 | self.assertFalse(crawler.spider.success2) 188 | self.assertEqual(mocked_schedule.call_count, 1) 189 | 190 | @defer.inlineCallbacks 191 | def test_callback_requests_to_frontier_with_implicit_callback(self): 192 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 193 | self.setup_mocked_handler( 194 | mocked_handler, 195 | [ 196 | Response(url='http://example.com'), 197 | Response(url='http://example2.com'), 198 | ], 199 | ) 200 | 201 | with patch('frontera.contrib.backends.memory.MemoryDequeQueue.schedule') as mocked_schedule: 202 | mocked_schedule.return_value = None 203 | settings = Settings() 204 | settings.setdict(TEST_SETTINGS, priority='cmdline') 205 | settings.setdict({ 206 | 'FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER': ['parse'], 207 | }) 208 | crawler = get_crawler(_TestSpider3, settings) 209 | 210 | yield self.runner.crawl(crawler) 211 | self.assertEqual(crawler.spider.success, 1) 212 | self.assertEqual(mocked_schedule.call_count, 1) 213 | 214 | @defer.inlineCallbacks 215 | def test_callback_requests_slot_map(self): 216 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 217 | resp1 = Response(url='http://example.com') 218 | resp2 = Response(url='http://example2.com') 219 | self.setup_mocked_handler(mocked_handler, [resp1, resp2]) 220 | 221 | with patch('frontera.contrib.backends.memory.MemoryDequeQueue.schedule') as mocked_schedule: 222 | mocked_schedule.return_value = None 223 | settings = Settings() 224 | settings.setdict(TEST_SETTINGS, priority='cmdline') 225 | settings.setdict({ 226 | 'FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER': ['parse'], 227 | 'FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP': {'parse': 'myslot'}, 228 | }) 229 | crawler = get_crawler(_TestSpider3, settings) 230 | 231 | yield 
self.runner.crawl(crawler) 232 | self.assertEqual(crawler.spider.success, 1) 233 | self.assertEqual(mocked_schedule.call_count, 1) 234 | frontera_request = mocked_schedule.call_args_list[0][0][0][0][2] 235 | self.assertEqual(frontera_request.url, resp2.url) 236 | self.assertEqual(frontera_request.meta[b'frontier_slot_prefix'], 'myslot') 237 | 238 | @defer.inlineCallbacks 239 | def test_callback_requests_slot_map_with_num_slots(self): 240 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 241 | resp1 = Response(url='http://example.com') 242 | resp2 = Response(url='http://example2.com') 243 | self.setup_mocked_handler(mocked_handler, [resp1, resp2]) 244 | 245 | with patch('frontera.contrib.backends.memory.MemoryDequeQueue.schedule') as mocked_schedule: 246 | mocked_schedule.return_value = None 247 | settings = Settings() 248 | settings.setdict(TEST_SETTINGS, priority='cmdline') 249 | settings.setdict({ 250 | 'FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER': ['parse'], 251 | 'FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP': {'parse': 'myslot/5'}, 252 | }) 253 | crawler = get_crawler(_TestSpider3, settings) 254 | 255 | yield self.runner.crawl(crawler) 256 | self.assertEqual(crawler.spider.success, 1) 257 | self.assertEqual(mocked_schedule.call_count, 1) 258 | frontera_request = mocked_schedule.call_args_list[0][0][0][0][2] 259 | self.assertEqual(frontera_request.url, resp2.url) 260 | self.assertEqual(frontera_request.meta[b'frontier_slot_prefix'], 'myslot') 261 | self.assertEqual(frontera_request.meta[b'frontier_number_of_slots'], 5) 262 | 263 | @defer.inlineCallbacks 264 | def test_start_requests_to_frontier(self): 265 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 266 | self.setup_mocked_handler( 267 | mocked_handler, 268 | [ 269 | Response(url='http://example.com'), 270 | Response(url='http://example2.com'), 271 | ], 272 | ) 273 | 274 | settings = Settings() 275 | settings.setdict(TEST_SETTINGS, priority='cmdline') 276 | settings.setdict({ 277 | 'FRONTERA_SCHEDULER_START_REQUESTS_TO_FRONTIER': True, 278 | }) 279 | crawler = get_crawler(_TestSpider, settings) 280 | 281 | yield self.runner.crawl(crawler) 282 | self.assertTrue(crawler.spider.success) 283 | self.assertTrue(crawler.spider.success2) 284 | 285 | @defer.inlineCallbacks 286 | def test_start_requests_to_frontier_ii(self): 287 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 288 | self.setup_mocked_handler(mocked_handler) 289 | 290 | with patch('frontera.contrib.backends.memory.MemoryBaseBackend.add_seeds') as mocked_add_seeds: 291 | mocked_add_seeds.return_value = None 292 | settings = Settings() 293 | settings.setdict(TEST_SETTINGS, priority='cmdline') 294 | settings.setdict({ 295 | 'FRONTERA_SCHEDULER_START_REQUESTS_TO_FRONTIER': True, 296 | }) 297 | 298 | crawler = get_crawler(_TestSpider, settings) 299 | 300 | yield self.runner.crawl(crawler) 301 | self.assertEqual(mocked_add_seeds.call_count, 1) 302 | 303 | @defer.inlineCallbacks 304 | def test_start_handle_errback(self): 305 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 306 | self.setup_mocked_handler( 307 | mocked_handler, 308 | [ 309 | Response(url='http://example.com'), 310 | Response(url='http://example2.com', status=501), 311 | Response(url='http://example3.com'), 312 | ], 313 | ) 314 | 315 | settings = Settings() 316 | settings.setdict(TEST_SETTINGS, priority='cmdline') 317 | crawler = 
get_crawler(_TestSpider, settings) 318 | 319 | yield self.runner.crawl(crawler) 320 | self.assertTrue(crawler.spider.success) 321 | self.assertFalse(crawler.spider.success2) 322 | self.assertTrue(crawler.spider.error) 323 | self.assertTrue(crawler.spider.success3) 324 | 325 | @defer.inlineCallbacks 326 | def test_start_handle_errback_with_cf_store(self): 327 | """ 328 | Test that we get the expected result with errback cf_store 329 | """ 330 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 331 | self.setup_mocked_handler( 332 | mocked_handler, 333 | [ 334 | Response(url='http://example.com'), 335 | Response(url='http://example2.com', status=501, body=b'cf_store'), 336 | Response(url='http://example3.com'), 337 | ], 338 | ) 339 | 340 | settings = Settings() 341 | settings.setdict(TEST_SETTINGS, priority='cmdline') 342 | crawler = get_crawler(_TestSpider, settings) 343 | 344 | yield self.runner.crawl(crawler) 345 | self.assertTrue(crawler.spider.success) 346 | self.assertFalse(crawler.spider.success2) 347 | self.assertTrue(crawler.spider.error) 348 | self.assertTrue(crawler.spider.success3) 349 | 350 | @defer.inlineCallbacks 351 | def test_start_handle_errback_with_cf_store_ii(self): 352 | """ 353 | Test that we scheduled cf_store request on backend queue 354 | """ 355 | with patch('scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler') as mocked_handler: 356 | self.setup_mocked_handler( 357 | mocked_handler, 358 | [ 359 | Response(url='http://example.com'), 360 | Response(url='http://example2.com', status=501, body=b'cf_store'), 361 | Response(url='http://example3.com'), 362 | ], 363 | ) 364 | 365 | with patch('frontera.contrib.backends.memory.MemoryDequeQueue.schedule') as mocked_schedule: 366 | mocked_schedule.return_value = None 367 | settings = Settings() 368 | settings.setdict(TEST_SETTINGS, priority='cmdline') 369 | crawler = get_crawler(_TestSpider, settings) 370 | 371 | yield self.runner.crawl(crawler) 372 | self.assertTrue(crawler.spider.success) 373 | self.assertFalse(crawler.spider.success2) 374 | self.assertTrue(crawler.spider.error) 375 | self.assertEqual(mocked_schedule.call_count, 1) 376 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = pinned,py38,py39 3 | 4 | [testenv] 5 | deps = 6 | pytest 7 | commands = 8 | pytest {posargs:tests} 9 | 10 | [testenv:pinned] 11 | basepython = python3.8 12 | deps = 13 | {[testenv]deps} 14 | cryptography==35.0.0 15 | pyOpenSSL==21.0.0 16 | scrapy==1.7.0 17 | twisted==19.2.0 18 | zope.interface==4.7.0 19 | --------------------------------------------------------------------------------