├── aioscrapy ├── VERSION ├── core │ ├── __init__.py │ └── downloader │ │ └── handlers │ │ └── webdriver │ │ └── __init__.py ├── libs │ ├── __init__.py │ ├── spider │ │ ├── __init__.py │ │ └── urllength.py │ ├── downloader │ │ ├── __init__.py │ │ ├── defaultheaders.py │ │ ├── useragent.py │ │ ├── downloadtimeout.py │ │ └── ja3fingerprint.py │ ├── extensions │ │ └── __init__.py │ └── pipelines │ │ ├── pg.py │ │ ├── mysql.py │ │ └── redis.py ├── utils │ ├── __init__.py │ ├── spider.py │ ├── httpobj.py │ ├── decorators.py │ ├── reqser.py │ ├── ossignal.py │ ├── template.py │ ├── log.py │ └── response.py ├── templates │ ├── project │ │ ├── module │ │ │ ├── __init__.py │ │ │ ├── spiders │ │ │ │ └── __init__.py │ │ │ ├── pipelines.py.tmpl │ │ │ ├── settings.py.tmpl │ │ │ └── middlewares.py.tmpl │ │ └── aioscrapy.cfg │ └── spiders │ │ ├── basic.tmpl │ │ └── single.tmpl ├── scrapyd │ ├── __init__.py │ └── default_scrapyd.conf ├── __main__.py ├── commands │ ├── list.py │ ├── version.py │ ├── crawl.py │ ├── settings.py │ ├── runspider.py │ ├── startproject.py │ ├── __init__.py │ └── genspider.py ├── middleware │ ├── __init__.py │ ├── extension.py │ └── itempipeline.py ├── http │ ├── __init__.py │ ├── response │ │ ├── html.py │ │ ├── xml.py │ │ └── web_driver.py │ └── request │ │ └── form.py ├── __init__.py ├── queue │ ├── rabbitmq.py │ ├── memory.py │ └── redis.py └── dupefilters │ └── __init__.py ├── example ├── projectspider │ ├── redisdemo │ │ ├── __init__.py │ │ ├── spiders │ │ │ ├── __init__.py │ │ │ └── baidu.py │ │ ├── pipelines.py │ │ ├── proxy.py │ │ ├── settings.py │ │ └── middlewares.py │ ├── deploy.bat │ ├── start.py │ ├── aioscrapy.cfg │ └── setup.py └── singlespider │ ├── start.py │ ├── demo_proxy.py │ ├── demo_request_requests.py │ ├── demo_request_pyhttpx.py │ ├── demo_pipeline_csv.py │ ├── demo_proxy_pool.py │ ├── demo_proxy_self.py │ ├── demo_queue_memory.py │ ├── demo_request.py │ ├── demo_request_aiohttp.py │ ├── demo_request_httpx.py │ ├── push_task.py │ ├── demo_queue_rabbitmq.py │ ├── demo_pipeline_excel.py │ ├── demo_request_curl_cffi.py │ ├── demo_metric.py │ ├── demo_pipeline_mongo.py │ ├── demo_request_drissionpage.py │ ├── demo_queue_redis.py │ ├── demo_pipeline_pg.py │ ├── demo_pipeline_mysql.py │ ├── demo_duplicate.py │ ├── demo_request_sbcdp.py │ └── demo_request_playwright.py ├── .gitignore ├── MANIFEST.in ├── Dockerfile ├── LICENSE ├── setup.py ├── docs └── installation.md └── README.md /aioscrapy/VERSION: -------------------------------------------------------------------------------- 1 | 2.1.9 -------------------------------------------------------------------------------- /aioscrapy/core/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /aioscrapy/libs/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /aioscrapy/utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /aioscrapy/libs/spider/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /aioscrapy/libs/downloader/__init__.py: -------------------------------------------------------------------------------- 
1 | -------------------------------------------------------------------------------- /aioscrapy/libs/extensions/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /example/projectspider/redisdemo/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /aioscrapy/templates/project/module/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.log 2 | *.pyc 3 | *.egg-info/ 4 | __pycache__/ 5 | .idea/ 6 | build/ 7 | dist/ 8 | job_dir/ -------------------------------------------------------------------------------- /aioscrapy/scrapyd/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Author : conlin 3 | # @Time : 2022/6/6 4 | -------------------------------------------------------------------------------- /aioscrapy/__main__.py: -------------------------------------------------------------------------------- 1 | from aioscrapy.cmdline import execute 2 | 3 | if __name__ == '__main__': 4 | execute() 5 | -------------------------------------------------------------------------------- /example/projectspider/deploy.bat: -------------------------------------------------------------------------------- 1 | call python %cd%\deploy.py server1 2 | :: call python %cd%\deploy.py server2 3 | pause>nul 4 | -------------------------------------------------------------------------------- /example/projectspider/start.py: -------------------------------------------------------------------------------- 1 | from aioscrapy.cmdline import execute 2 | 3 | 4 | execute("aioscrapy crawl baidu".split()) 5 | -------------------------------------------------------------------------------- /aioscrapy/templates/project/aioscrapy.cfg: -------------------------------------------------------------------------------- 1 | [settings] 2 | default = ${project_name}.settings 3 | 4 | [deploy] 5 | #url = http://localhost:6800/ 6 | project = ${project_name} 7 | -------------------------------------------------------------------------------- /example/projectspider/aioscrapy.cfg: -------------------------------------------------------------------------------- 1 | [settings] 2 | default = redisdemo.settings 3 | 4 | 5 | [deploy:server1] 6 | ;url = http://localhost:6800/ 7 | url = http://192.168.234.128:6800/ 8 | project = redisdemo 9 | -------------------------------------------------------------------------------- /example/projectspider/redisdemo/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 
5 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.md 2 | include MANIFEST.in 3 | 4 | include aioscrapy/VERSION 5 | 6 | include requirements-*.txt 7 | 8 | recursive-include aioscrapy/templates * 9 | 10 | global-exclude __pycache__ *.py[cod] 11 | -------------------------------------------------------------------------------- /aioscrapy/templates/project/module/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your aioscrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /aioscrapy/templates/project/module/pipelines.py.tmpl: -------------------------------------------------------------------------------- 1 | from aioscrapy import logger 2 | 3 | 4 | class ${ProjectName}Pipeline: 5 | def process_item(self, item, spider): 6 | logger.info(f"Get item {item}" ) 7 | return item 8 | -------------------------------------------------------------------------------- /aioscrapy/core/downloader/handlers/webdriver/__init__.py: -------------------------------------------------------------------------------- 1 | from .playwright import PlaywrightDownloadHandler, PlaywrightDriver 2 | from .drissionpage import DrissionPageDownloadHandler, DrissionPageDriver 3 | from .sbcdp import SbcdpDownloadHandler, SbcdpDriver 4 | -------------------------------------------------------------------------------- /example/projectspider/redisdemo/pipelines.py: -------------------------------------------------------------------------------- 1 | # Define your item pipelines here 2 | from aioscrapy import logger 3 | 4 | 5 | class DemoPipeline: 6 | def process_item(self, item, spider): 7 | logger.info(f"From DemoPipeline: {item}") 8 | return item 9 | -------------------------------------------------------------------------------- /example/projectspider/setup.py: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapyd-deploy 2 | 3 | from setuptools import setup, find_packages 4 | 5 | setup( 6 | name = 'project', 7 | version = '1.0', 8 | packages = find_packages(), 9 | entry_points = {'aioscrapy': ['settings = redisdemo.settings']}, 10 | ) 11 | -------------------------------------------------------------------------------- /aioscrapy/commands/list.py: -------------------------------------------------------------------------------- 1 | from aioscrapy.commands import AioScrapyCommand 2 | 3 | 4 | class Command(AioScrapyCommand): 5 | 6 | requires_project = True 7 | default_settings = {'LOG_ENABLED': False} 8 | 9 | def short_desc(self): 10 | return "List available spiders" 11 | 12 | def run(self, args, opts): 13 | for s in sorted(self.crawler_process.spider_loader.list()): 14 | print(s) 15 | -------------------------------------------------------------------------------- /aioscrapy/middleware/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | from aioscrapy.middleware.downloader import DownloaderMiddlewareManager 3 | from aioscrapy.middleware.itempipeline import ItemPipelineManager 4 | from aioscrapy.middleware.spider import SpiderMiddlewareManager 5 | from aioscrapy.middleware.extension import ExtensionManager 6 | 7 | __all__ 
= ( 8 | "DownloaderMiddlewareManager", "ItemPipelineManager", 9 | "SpiderMiddlewareManager", "ExtensionManager" 10 | ) 11 | -------------------------------------------------------------------------------- /aioscrapy/templates/spiders/basic.tmpl: -------------------------------------------------------------------------------- 1 | from aioscrapy import Spider 2 | 3 | 4 | class $classname(Spider): 5 | name = '$name' 6 | custom_settings = { 7 | "CLOSE_SPIDER_ON_IDLE": True 8 | } 9 | start_urls = [] 10 | 11 | async def parse(self, response): 12 | item = { 13 | 'title': '\n'.join(response.xpath('//title/text()').extract()), 14 | } 15 | yield item 16 | 17 | 18 | if __name__ == '__main__': 19 | $classname.start() 20 | -------------------------------------------------------------------------------- /aioscrapy/commands/version.py: -------------------------------------------------------------------------------- 1 | import aioscrapy 2 | from aioscrapy.commands import AioScrapyCommand 3 | 4 | 5 | class Command(AioScrapyCommand): 6 | default_settings = { 7 | 'LOG_ENABLED': False, 8 | 'SPIDER_LOADER_WARN_ONLY': True 9 | } 10 | 11 | def syntax(self): 12 | return "[-v]" 13 | 14 | def short_desc(self): 15 | return "Print aioscrapy version" 16 | 17 | def add_options(self, parser): 18 | AioScrapyCommand.add_options(self, parser) 19 | 20 | def run(self, args, opts): 21 | print(f"aioscrapy {aioscrapy.__version__}") 22 | -------------------------------------------------------------------------------- /example/projectspider/redisdemo/proxy.py: -------------------------------------------------------------------------------- 1 | from aioscrapy.proxy.redis import AbsProxy 2 | from aioscrapy import logger 3 | 4 | 5 | # TODO: 根据实际情况重写AbsProxy部分方法 6 | class MyProxy(AbsProxy): 7 | def __init__( 8 | self, 9 | settings, 10 | ): 11 | super().__init__(settings) 12 | 13 | @classmethod 14 | async def from_crawler(cls, crawler) -> AbsProxy: 15 | settings = crawler.settings 16 | return cls( 17 | settings 18 | ) 19 | 20 | async def get(self) -> str: 21 | # TODO: 实现ip逻辑 22 | logger.warning("未实现ip代理逻辑") 23 | return 'http://127.0.0.1:7890' 24 | -------------------------------------------------------------------------------- /aioscrapy/http/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Module containing all HTTP related classes 3 | 4 | Use this module (instead of the more specific ones) when importing Headers, 5 | Request and Response outside this module. 
6 | """ 7 | 8 | from aioscrapy.http.request import Request 9 | from aioscrapy.http.request.form import FormRequest 10 | from aioscrapy.http.request.json_request import JsonRequest 11 | from aioscrapy.http.response import Response 12 | from aioscrapy.http.response.html import HtmlResponse 13 | from aioscrapy.http.response.web_driver import WebDriverResponse 14 | from aioscrapy.http.response.text import TextResponse 15 | from aioscrapy.http.response.xml import XmlResponse 16 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.11.6-alpine3.18 2 | RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \ 3 | && echo 'Asia/Shanghai' > /etc/timezone 4 | RUN sed -i 's/dl-cdn.alpinelinux.org/mirrors.ustc.edu.cn/g' /etc/apk/repositories \ 5 | && apk add --no-cache --virtual build-dependencies\ 6 | build-base \ 7 | gcc \ 8 | musl-dev \ 9 | libc-dev \ 10 | libffi-dev \ 11 | mariadb-dev \ 12 | libxslt-dev \ 13 | libpq-dev \ 14 | libstdc++ \ 15 | && pip3 install pip install aio-scrapy[all] -U -i https://mirrors.aliyun.com/pypi/simple \ 16 | && rm -rf .cache/pip3 \ 17 | && apk del build-dependencies 18 | CMD ["aioscrapy", "version"] 19 | -------------------------------------------------------------------------------- /example/singlespider/start.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | 4 | sys.path.append(os.path.dirname(os.path.dirname(os.getcwd()))) 5 | from aioscrapy.process import multi_process_run, single_process_run 6 | 7 | from demo_queue_memory import DemoMemorySpider 8 | from demo_request_httpx import DemoHttpxSpider 9 | from demo_request_playwright import DemoPlaywrightSpider 10 | 11 | if __name__ == '__main__': 12 | 13 | # # 单进程跑多爬虫 14 | # single_process_run( 15 | # (DemoMemorySpider, None), 16 | # (DemoHttpxSpider, None), 17 | # # ... 18 | # ) 19 | 20 | # 多进程跑多爬虫 21 | multi_process_run( 22 | [(DemoMemorySpider, None), (DemoHttpxSpider, None)], # 子进程进程里面跑多爬虫 23 | (DemoPlaywrightSpider, None), # 子进程进程里面跑但爬虫 24 | # ... 
25 | ) 26 | -------------------------------------------------------------------------------- /example/projectspider/redisdemo/spiders/baidu.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Request, Spider, logger 2 | 3 | 4 | class BaiduSpider(Spider): 5 | name = 'baidu' 6 | 7 | start_urls = ['https://hanyu.baidu.com/zici/s?wd=黄&query=黄'] 8 | 9 | async def parse(self, response): 10 | logger.info(response) 11 | item = { 12 | 'pingyin': response.xpath('//div[@id="pinyin"]/span/b/text()').get(), 13 | 'fan': response.xpath('//*[@id="traditional"]/span/text()').get(), 14 | } 15 | yield item 16 | 17 | new_character = response.xpath('//a[@class="img-link"]/@href').getall() 18 | for character in new_character: 19 | new_url = 'https://hanyu.baidu.com/zici' + character 20 | yield Request(new_url, callback=self.parse, dont_filter=True) 21 | 22 | 23 | if __name__ == '__main__': 24 | BaiduSpider.start() 25 | -------------------------------------------------------------------------------- /aioscrapy/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | aioscrapy - a web crawling and web scraping framework written for Python 3 | """ 4 | 5 | import pkgutil 6 | import sys 7 | 8 | # Declare top-level shortcuts 9 | from aioscrapy.spiders import Spider 10 | from aioscrapy.http import Request, FormRequest 11 | from aioscrapy.settings import Settings 12 | from aioscrapy.crawler import Crawler 13 | from aioscrapy.utils.log import logger 14 | 15 | 16 | __all__ = [ 17 | '__version__', 'version_info', 'Spider', 'Request', 'FormRequest', 'Crawler', 'Settings', 'logger' 18 | ] 19 | 20 | 21 | # aioscrapy versions 22 | __version__ = (pkgutil.get_data(__package__, "VERSION") or b"").decode("ascii").strip() 23 | version_info = tuple(int(v) if v.isdigit() else v for v in __version__.split('.')) 24 | 25 | 26 | # Check minimum required Python version 27 | if sys.version_info < (3, 9): 28 | print("aioscrapy %s requires Python 3.9+" % __version__) 29 | sys.exit(1) 30 | 31 | 32 | del pkgutil 33 | del sys 34 | -------------------------------------------------------------------------------- /aioscrapy/scrapyd/default_scrapyd.conf: -------------------------------------------------------------------------------- 1 | [scrapyd] 2 | eggs_dir = eggs 3 | logs_dir = logs 4 | items_dir = 5 | jobs_to_keep = 5 6 | dbs_dir = dbs 7 | max_proc = 0 8 | max_proc_per_cpu = 4 9 | finished_to_keep = 100 10 | poll_interval = 5.0 11 | bind_address = 127.0.0.1 12 | http_port = 6800 13 | debug = off 14 | runner = aioscrapy.scrapyd.runner 15 | application = scrapyd.app.application 16 | launcher = scrapyd.launcher.Launcher 17 | webroot = scrapyd.website.Root 18 | 19 | [services] 20 | schedule.json = scrapyd.webservice.Schedule 21 | cancel.json = scrapyd.webservice.Cancel 22 | addversion.json = scrapyd.webservice.AddVersion 23 | listprojects.json = scrapyd.webservice.ListProjects 24 | listversions.json = scrapyd.webservice.ListVersions 25 | listspiders.json = scrapyd.webservice.ListSpiders 26 | delproject.json = scrapyd.webservice.DeleteProject 27 | delversion.json = scrapyd.webservice.DeleteVersion 28 | listjobs.json = scrapyd.webservice.ListJobs 29 | daemonstatus.json = scrapyd.webservice.DaemonStatus 30 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 ConlinH 4 | 5 | 
Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /aioscrapy/commands/crawl.py: -------------------------------------------------------------------------------- 1 | from aioscrapy.commands import BaseRunSpiderCommand 2 | from aioscrapy.exceptions import UsageError 3 | 4 | 5 | class Command(BaseRunSpiderCommand): 6 | 7 | requires_project = True 8 | 9 | def syntax(self): 10 | return "[options] " 11 | 12 | def short_desc(self): 13 | return "Run a spider" 14 | 15 | def run(self, args, opts): 16 | if len(args) < 1: 17 | raise UsageError() 18 | elif len(args) > 1: 19 | raise UsageError("running 'aioscrapy crawl' with more than one spider is not supported") 20 | spname = args[0] 21 | 22 | crawl_defer = self.crawler_process.crawl(spname, **opts.spargs) 23 | 24 | if getattr(crawl_defer, 'result', None) is not None and issubclass(crawl_defer.result.type, Exception): 25 | self.exitcode = 1 26 | else: 27 | self.crawler_process.start() 28 | 29 | if ( 30 | self.crawler_process.bootstrap_failed 31 | or hasattr(self.crawler_process, 'has_exception') and self.crawler_process.has_exception 32 | ): 33 | self.exitcode = 1 34 | -------------------------------------------------------------------------------- /aioscrapy/templates/spiders/single.tmpl: -------------------------------------------------------------------------------- 1 | from aioscrapy import Spider, logger 2 | 3 | 4 | class $classname(Spider): 5 | name = '$name' 6 | custom_settings = { 7 | "CLOSE_SPIDER_ON_IDLE": True 8 | } 9 | start_urls = ["https://quotes.toscrape.com"] 10 | 11 | @staticmethod 12 | async def process_request(request, spider): 13 | """ request middleware """ 14 | pass 15 | 16 | @staticmethod 17 | async def process_response(request, response, spider): 18 | """ response middleware """ 19 | return response 20 | 21 | @staticmethod 22 | async def process_exception(request, exception, spider): 23 | """ exception middleware """ 24 | pass 25 | 26 | async def parse(self, response): 27 | for quote in response.css('div.quote'): 28 | item = { 29 | 'author': quote.xpath('span/small/text()').get(), 30 | 'text': quote.css('span.text::text').get(), 31 | } 32 | yield item 33 | 34 | async def process_item(self, item): 35 | logger.info(item) 36 | 37 | 38 | if __name__ == '__main__': 39 | $classname.start() 40 | -------------------------------------------------------------------------------- /example/singlespider/demo_proxy.py: 
-------------------------------------------------------------------------------- 1 | from aioscrapy import Request, logger, Spider 2 | 3 | 4 | class DemoProxySpider(Spider): 5 | """ 6 | Intended for tunnel proxies 7 | """ 8 | name = 'DemoProxySpider' 9 | custom_settings = dict( 10 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 11 | LOG_LEVEL='DEBUG', 12 | CLOSE_SPIDER_ON_IDLE=True, 13 | ) 14 | 15 | start_urls = ['https://quotes.toscrape.com'] 16 | 17 | @staticmethod 18 | async def process_request(request, spider): 19 | """ request middleware """ 20 | # Attach the proxy; suitable for tunnel proxies 21 | request.meta['proxy'] = 'http://127.0.0.1:7890' 22 | 23 | async def parse(self, response): 24 | for quote in response.css('div.quote'): 25 | yield { 26 | 'author': quote.xpath('span/small/text()').get(), 27 | 'text': quote.css('span.text::text').get(), 28 | } 29 | 30 | next_page = response.css('li.next a::attr("href")').get() 31 | if next_page is not None: 32 | # yield response.follow(next_page, self.parse) 33 | yield Request(f"https://quotes.toscrape.com{next_page}", callback=self.parse) 34 | 35 | async def process_item(self, item): 36 | logger.info(item) 37 | 38 | 39 | if __name__ == '__main__': 40 | DemoProxySpider.start() 41 | -------------------------------------------------------------------------------- /aioscrapy/templates/project/module/settings.py.tmpl: -------------------------------------------------------------------------------- 1 | 2 | BOT_NAME = '$project_name' 3 | 4 | SPIDER_MODULES = ['$project_name.spiders'] 5 | NEWSPIDER_MODULE = '$project_name.spiders' 6 | 7 | # SPIDER_MIDDLEWARES = { 8 | # '$project_name.middlewares.${ProjectName}SpiderMiddleware': 543, 9 | # } 10 | 11 | # DOWNLOADER_MIDDLEWARES = { 12 | # '$project_name.middlewares.${ProjectName}DownloaderMiddleware': 543, 13 | # } 14 | 15 | # EXTENSIONS = { 16 | # } 17 | 18 | # ITEM_PIPELINES = { 19 | # '$project_name.pipelines.${ProjectName}Pipeline': 300, 20 | # } 21 | 22 | DOWNLOAD_HANDLERS_TYPE = "aiohttp" # aiohttp / httpx / pyhttpx / requests / playwright; defaults to aiohttp if not set 23 | # To customize the download handler, use the Scrapy-style mapping, e.g.: 24 | # DOWNLOAD_HANDLERS={ 25 | # 'http': 'aioscrapy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler', 26 | # 'https': 'aioscrapy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler', 27 | # } 28 | 29 | # DOWNLOAD_DELAY = 3 30 | # RANDOMIZE_DOWNLOAD_DELAY = True 31 | CONCURRENT_REQUESTS = 16 32 | CONCURRENT_REQUESTS_PER_DOMAIN = 8 33 | 34 | CLOSE_SPIDER_ON_IDLE = False 35 | 36 | # LOG_FILE = './log/info.log' 37 | LOG_STDOUT = True 38 | LOG_LEVEL = 'DEBUG' 39 | 40 | SCHEDULER_QUEUE_CLASS = 'aioscrapy.queue.redis.SpiderPriorityQueue' 41 | # SCHEDULER_FLUSH_ON_START = False 42 | REDIS_ARGS = { 43 | 'queue': { 44 | 'url': 'redis://:@127.0.0.1:6379/0', 45 | 'max_connections': 2, 46 | 'timeout': None, 47 | 'retry_on_timeout': True, 48 | 'health_check_interval': 30, 49 | } 50 | } 51 | 52 | # This framework covers most of the scrapy/scrapy-redis feature set; for additional settings, refer to the Scrapy documentation 53 | -------------------------------------------------------------------------------- /example/singlespider/demo_request_requests.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Spider, logger 2 | from aioscrapy.http import Response 3 | 4 | 5 | class DemoRequestsSpider(Spider): 6 | name = 'DemoRequestsSpider' 7 | 8 | custom_settings = dict( 9 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 10 | #
DOWNLOAD_DELAY=3, 11 | # RANDOMIZE_DOWNLOAD_DELAY=True, 12 | CONCURRENT_REQUESTS=1, 13 | LOG_LEVEL='INFO', 14 | CLOSE_SPIDER_ON_IDLE=True, 15 | # DOWNLOAD_HANDLERS={ 16 | # 'http': 'aioscrapy.core.downloader.handlers.requests.RequestsDownloadHandler', 17 | # 'https': 'aioscrapy.core.downloader.handlers.requests.RequestsDownloadHandler', 18 | # }, 19 | DOWNLOAD_HANDLERS_TYPE="requests", 20 | ) 21 | 22 | start_urls = ['https://quotes.toscrape.com'] 23 | 24 | @staticmethod 25 | async def process_request(request, spider): 26 | """ request middleware """ 27 | pass 28 | 29 | @staticmethod 30 | async def process_response(request, response, spider): 31 | """ response middleware """ 32 | return response 33 | 34 | @staticmethod 35 | async def process_exception(request, exception, spider): 36 | """ exception middleware """ 37 | pass 38 | 39 | async def parse(self, response: Response): 40 | print(response.text) 41 | 42 | async def process_item(self, item): 43 | logger.info(item) 44 | 45 | 46 | if __name__ == '__main__': 47 | DemoRequestsSpider.start() 48 | -------------------------------------------------------------------------------- /aioscrapy/http/response/html.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | HTML response implementation for aioscrapy. 4 | aioscrapy的HTML响应实现。 5 | 6 | This module provides the HtmlResponse class, which is a specialized TextResponse 7 | for handling HTML content. It inherits all functionality from TextResponse 8 | but is specifically intended for HTML responses. 9 | 此模块提供了HtmlResponse类,这是一个专门用于处理HTML内容的TextResponse。 10 | 它继承了TextResponse的所有功能,但专门用于HTML响应。 11 | """ 12 | 13 | from aioscrapy.http.response.text import TextResponse 14 | 15 | 16 | class HtmlResponse(TextResponse): 17 | """ 18 | A Response subclass specifically for HTML responses. 19 | 专门用于HTML响应的Response子类。 20 | 21 | This class extends TextResponse to handle HTML content. It inherits all the 22 | functionality of TextResponse, including: 23 | 此类扩展了TextResponse以处理HTML内容。它继承了TextResponse的所有功能,包括: 24 | 25 | - Automatic encoding detection 26 | 自动编码检测 27 | - Unicode conversion 28 | Unicode转换 29 | - CSS and XPath selectors 30 | CSS和XPath选择器 31 | - JSON parsing 32 | JSON解析 33 | - Enhanced link following 34 | 增强的链接跟踪 35 | 36 | The main purpose of this class is to provide a specific type for HTML responses, 37 | which can be useful for type checking and middleware processing. 
38 | 此类的主要目的是为HTML响应提供特定类型,这对类型检查和中间件处理很有用。 39 | 40 | Example: 41 | ```python 42 | def parse(self, response): 43 | if isinstance(response, HtmlResponse): 44 | # Process HTML response 45 | title = response.css('title::text').get() 46 | else: 47 | # Handle other response types 48 | pass 49 | ``` 50 | """ 51 | pass 52 | -------------------------------------------------------------------------------- /example/singlespider/demo_request_pyhttpx.py: -------------------------------------------------------------------------------- 1 | 2 | from aioscrapy import Request, logger, Spider 3 | from aioscrapy.http import Response 4 | 5 | 6 | class DemoPyhttpxSpider(Spider): 7 | name = 'DemoPyhttpxSpider' 8 | 9 | custom_settings = dict( 10 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 11 | # DOWNLOAD_DELAY=3, 12 | # RANDOMIZE_DOWNLOAD_DELAY=True, 13 | CONCURRENT_REQUESTS=1, 14 | LOG_LEVEL='INFO', 15 | CLOSE_SPIDER_ON_IDLE=True, 16 | # DOWNLOAD_HANDLERS={ 17 | # 'http': 'aioscrapy.core.downloader.handlers.pyhttpx.PyhttpxDownloadHandler', 18 | # 'https': 'aioscrapy.core.downloader.handlers.pyhttpx.PyhttpxDownloadHandler', 19 | # }, 20 | DOWNLOAD_HANDLERS_TYPE="pyhttpx", 21 | PYHTTPX_ARGS={} # 传递给pyhttpx.HttpSession构造函数的参数 22 | ) 23 | 24 | start_urls = ['https://tls.peet.ws/api/all'] 25 | 26 | @staticmethod 27 | async def process_request(request, spider): 28 | """ request middleware """ 29 | pass 30 | 31 | @staticmethod 32 | async def process_response(request, response, spider): 33 | """ response middleware """ 34 | return response 35 | 36 | @staticmethod 37 | async def process_exception(request, exception, spider): 38 | """ exception middleware """ 39 | pass 40 | 41 | async def parse(self, response: Response): 42 | print(response.text) 43 | 44 | async def process_item(self, item): 45 | logger.info(item) 46 | 47 | 48 | if __name__ == '__main__': 49 | DemoPyhttpxSpider.start() 50 | -------------------------------------------------------------------------------- /example/singlespider/demo_pipeline_csv.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Spider, logger 2 | 3 | 4 | class DemoCsvSpider(Spider): 5 | name = 'DemoCsvSpider' 6 | custom_settings = { 7 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 8 | # 'DOWNLOAD_DELAY': 3, 9 | # 'RANDOMIZE_DOWNLOAD_DELAY': True, 10 | # 'CONCURRENT_REQUESTS': 1, 11 | # 'LOG_LEVEL': 'INFO' 12 | # 'DUPEFILTER_CLASS': 'aioscrapy.dupefilters.disk.RFPDupeFilter', 13 | "CLOSE_SPIDER_ON_IDLE": True, 14 | 15 | "ITEM_PIPELINES": { 16 | 'aioscrapy.libs.pipelines.csv.CsvPipeline': 100, 17 | }, 18 | "SAVE_CACHE_NUM": 1000, # 每次存储1000条 19 | "SAVE_CACHE_INTERVAL": 10, # 每次10秒存储一次 20 | } 21 | 22 | start_urls = ['https://quotes.toscrape.com'] 23 | 24 | @staticmethod 25 | async def process_request(request, spider): 26 | """ request middleware """ 27 | pass 28 | 29 | @staticmethod 30 | async def process_response(request, response, spider): 31 | """ response middleware """ 32 | return response 33 | 34 | @staticmethod 35 | async def process_exception(request, exception, spider): 36 | """ exception middleware """ 37 | pass 38 | 39 | async def parse(self, response): 40 | for quote in response.css('div.quote'): 41 | yield { 42 | 'author': quote.xpath('span/small/text()').get(), 43 | 'text': quote.css('span.text::text').get(), 44 | '__csv__': { 45 | 'filename': 
'article', # 文件名 或 存储的路径及文件名 如:D:\article.xlsx 46 | } 47 | } 48 | 49 | async def process_item(self, item): 50 | logger.info(item) 51 | 52 | 53 | if __name__ == '__main__': 54 | DemoCsvSpider.start() 55 | -------------------------------------------------------------------------------- /example/singlespider/demo_proxy_pool.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Request, logger, Spider 2 | 3 | 4 | class DemoRedisProxySpider(Spider): 5 | """ 6 | 适用于代理池 代理池的实现参考 https://github.com/Python3WebSpider/ProxyPool 7 | """ 8 | name = 'DemoRedisProxySpider' 9 | custom_settings = dict( 10 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 11 | LOG_LEVEL='DEBUG', 12 | CLOSE_SPIDER_ON_IDLE=True, 13 | # 代理配置 14 | USE_PROXY=True, # 是否开启代理 15 | PROXY_HANDLER='aioscrapy.proxy.redis.RedisProxy', # 代理类的加载路径 16 | PROXY_QUEUE_ALIAS='proxy', # 代理的redis队列别名 17 | PROXY_KEY='proxies:universal', # 代理的key名, 使用redis的ZSET结构存储代理 18 | PROXY_MAX_COUNT=10, # 最多缓存到内存中的代理个数 19 | PROXY_MIN_COUNT=1, # 最小缓存到内存中的代理个数 20 | REDIS_ARGS={ # 代理存放的redis位置 21 | 'proxy': { 22 | 'url': 'redis://username:password@192.168.234.128:6379/2', 23 | 'max_connections': 2, 24 | 'timeout': None, 25 | 'retry_on_timeout': True, 26 | 'health_check_interval': 30, 27 | } 28 | } 29 | ) 30 | 31 | start_urls = ['https://quotes.toscrape.com'] 32 | 33 | async def parse(self, response): 34 | for quote in response.css('div.quote'): 35 | yield { 36 | 'author': quote.xpath('span/small/text()').get(), 37 | 'text': quote.css('span.text::text').get(), 38 | } 39 | 40 | next_page = response.css('li.next a::attr("href")').get() 41 | if next_page is not None: 42 | # yield response.follow(next_page, self.parse) 43 | yield Request(f"https://quotes.toscrape.com{next_page}", callback=self.parse) 44 | 45 | async def process_item(self, item): 46 | logger.info(item) 47 | 48 | 49 | if __name__ == '__main__': 50 | DemoRedisProxySpider.start() 51 | -------------------------------------------------------------------------------- /aioscrapy/commands/settings.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | from aioscrapy.commands import AioScrapyCommand 4 | from aioscrapy.settings import BaseSettings 5 | 6 | 7 | class Command(AioScrapyCommand): 8 | 9 | requires_project = False 10 | default_settings = {'LOG_ENABLED': False, 11 | 'SPIDER_LOADER_WARN_ONLY': True} 12 | 13 | def syntax(self): 14 | return "[options]" 15 | 16 | def short_desc(self): 17 | return "Get settings values" 18 | 19 | def add_options(self, parser): 20 | AioScrapyCommand.add_options(self, parser) 21 | parser.add_option("--get", dest="get", metavar="SETTING", 22 | help="print raw setting value") 23 | parser.add_option("--getbool", dest="getbool", metavar="SETTING", 24 | help="print setting value, interpreted as a boolean") 25 | parser.add_option("--getint", dest="getint", metavar="SETTING", 26 | help="print setting value, interpreted as an integer") 27 | parser.add_option("--getfloat", dest="getfloat", metavar="SETTING", 28 | help="print setting value, interpreted as a float") 29 | parser.add_option("--getlist", dest="getlist", metavar="SETTING", 30 | help="print setting value, interpreted as a list") 31 | 32 | def run(self, args, opts): 33 | settings = self.crawler_process.settings 34 | if opts.get: 35 | s = settings.get(opts.get) 36 | if isinstance(s, BaseSettings): 37 | print(json.dumps(s.copy_to_dict())) 
38 | else: 39 | print(s) 40 | elif opts.getbool: 41 | print(settings.getbool(opts.getbool)) 42 | elif opts.getint: 43 | print(settings.getint(opts.getint)) 44 | elif opts.getfloat: 45 | print(settings.getfloat(opts.getfloat)) 46 | elif opts.getlist: 47 | print(settings.getlist(opts.getlist)) 48 | -------------------------------------------------------------------------------- /aioscrapy/http/response/xml.py: -------------------------------------------------------------------------------- 1 | """ 2 | XML response implementation for aioscrapy. 3 | aioscrapy的XML响应实现。 4 | 5 | This module provides the XmlResponse class, which is a specialized TextResponse 6 | for handling XML content. It inherits all functionality from TextResponse 7 | but is specifically intended for XML responses, with support for XML encoding 8 | declarations. 9 | 此模块提供了XmlResponse类,这是一个专门用于处理XML内容的TextResponse。 10 | 它继承了TextResponse的所有功能,但专门用于XML响应,支持XML编码声明。 11 | """ 12 | 13 | from aioscrapy.http.response.text import TextResponse 14 | 15 | 16 | class XmlResponse(TextResponse): 17 | """ 18 | A Response subclass specifically for XML responses. 19 | 专门用于XML响应的Response子类。 20 | 21 | This class extends TextResponse to handle XML content. It inherits all the 22 | functionality of TextResponse, including: 23 | 此类扩展了TextResponse以处理XML内容。它继承了TextResponse的所有功能,包括: 24 | 25 | - Automatic encoding detection (including from XML declarations) 26 | 自动编码检测(包括从XML声明中) 27 | - Unicode conversion 28 | Unicode转换 29 | - CSS and XPath selectors (particularly useful for XML) 30 | CSS和XPath选择器(对XML特别有用) 31 | - Enhanced link following 32 | 增强的链接跟踪 33 | 34 | The main purpose of this class is to provide a specific type for XML responses, 35 | which can be useful for type checking and middleware processing. 36 | 此类的主要目的是为XML响应提供特定类型,这对类型检查和中间件处理很有用。 37 | 38 | Example: 39 | ```python 40 | def parse(self, response): 41 | if isinstance(response, XmlResponse): 42 | # Process XML response 43 | items = response.xpath('//item') 44 | for item in items: 45 | yield { 46 | 'name': item.xpath('./name/text()').get(), 47 | 'value': item.xpath('./value/text()').get() 48 | } 49 | else: 50 | # Handle other response types 51 | pass 52 | ``` 53 | """ 54 | pass 55 | -------------------------------------------------------------------------------- /example/singlespider/demo_proxy_self.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Request, Spider, logger 2 | from aioscrapy.proxy.redis import AbsProxy 3 | 4 | 5 | class MyProxy(AbsProxy): 6 | """ 7 | TODO: 根据实际情况重写AbsProxy部分方法 实现自己的代理逻辑 8 | """ 9 | 10 | def __init__( 11 | self, 12 | settings, 13 | ): 14 | super().__init__(settings) 15 | 16 | @classmethod 17 | async def from_crawler(cls, crawler) -> AbsProxy: 18 | settings = crawler.settings 19 | return cls( 20 | settings 21 | ) 22 | 23 | async def get(self) -> str: 24 | # TODO: 实现ip获取逻辑 25 | logger.warning("未实现ip代理逻辑") 26 | return 'http://127.0.0.1:7890' 27 | 28 | 29 | class DemoRedisProxySpider(Spider): 30 | """ 31 | 自定义代理 自己实现代理相关逻辑 32 | """ 33 | name = 'DemoRedisProxySpider' 34 | custom_settings = dict( 35 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 36 | LOG_LEVEL='DEBUG', 37 | CLOSE_SPIDER_ON_IDLE=True, 38 | 39 | # 代理配置 40 | USE_PROXY=True, # 是否开启代理 默认为False 41 | PROXY_HANDLER='demo_proxy_self.MyProxy', # 代理类的加载路径 根据情况自己实现一个代理类(可参考RedisProxy类) 42 | ) 43 | 44 | start_urls = ['https://quotes.toscrape.com'] 45 | 46 | async def 
parse(self, response): 47 | for quote in response.css('div.quote'): 48 | yield { 49 | 'author': quote.xpath('span/small/text()').get(), 50 | 'text': quote.css('span.text::text').get(), 51 | } 52 | 53 | next_page = response.css('li.next a::attr("href")').get() 54 | if next_page is not None: 55 | # yield response.follow(next_page, self.parse) 56 | yield Request(f"https://quotes.toscrape.com{next_page}", callback=self.parse) 57 | 58 | async def process_item(self, item): 59 | logger.info(item) 60 | 61 | 62 | if __name__ == '__main__': 63 | DemoRedisProxySpider.start() 64 | -------------------------------------------------------------------------------- /example/singlespider/demo_queue_memory.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Request, logger, Spider 2 | 3 | 4 | class DemoMemorySpider(Spider): 5 | name = 'DemoMemorySpider' 6 | custom_settings = { 7 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 8 | # 'DOWNLOAD_DELAY': 3, 9 | # 'RANDOMIZE_DOWNLOAD_DELAY': True, 10 | # 'CONCURRENT_REQUESTS': 1, 11 | 'LOG_LEVEL': 'INFO', 12 | "CLOSE_SPIDER_ON_IDLE": True, 13 | 14 | 'SCHEDULER_QUEUE_CLASS': 'aioscrapy.queue.memory.SpiderPriorityQueue', 15 | # 'SCHEDULER_QUEUE_CLASS': 'aioscrapy.queue.memory.SpiderQueue', 16 | # 'SCHEDULER_QUEUE_CLASS': 'aioscrapy.queue.memory.SpiderStack', 17 | } 18 | 19 | start_urls = ['https://quotes.toscrape.com'] 20 | 21 | @staticmethod 22 | async def process_request(request, spider): 23 | """ request middleware """ 24 | pass 25 | 26 | @staticmethod 27 | async def process_response(request, response, spider): 28 | """ response middleware """ 29 | # spider.pause = True 30 | # spider.pause_time = 60 * 10 31 | return response 32 | 33 | @staticmethod 34 | async def process_exception(request, exception, spider): 35 | """ exception middleware """ 36 | pass 37 | 38 | async def parse(self, response): 39 | for quote in response.css('div.quote'): 40 | yield { 41 | 'author': quote.xpath('span/small/text()').get(), 42 | 'text': quote.css('span.text::text').get(), 43 | } 44 | 45 | next_page = response.css('li.next a::attr("href")').get() 46 | if next_page is not None: 47 | # yield response.follow(next_page, self.parse) 48 | yield Request(f"https://quotes.toscrape.com{next_page}", callback=self.parse) 49 | 50 | async def process_item(self, item): 51 | logger.info(item) 52 | 53 | 54 | if __name__ == '__main__': 55 | DemoMemorySpider.start() 56 | -------------------------------------------------------------------------------- /example/singlespider/demo_request.py: -------------------------------------------------------------------------------- 1 | import json 2 | from aioscrapy import Spider 3 | from aioscrapy.http import Response, Request, FormRequest, JsonRequest 4 | 5 | 6 | class DemoRequestSpider(Spider): 7 | name = 'DemoRequestSpider' 8 | 9 | custom_settings = dict( 10 | CONCURRENT_REQUESTS=1, 11 | LOG_LEVEL='INFO', 12 | CLOSE_SPIDER_ON_IDLE=True, 13 | DOWNLOAD_HANDLERS_TYPE="aiohttp", 14 | # DOWNLOAD_HANDLERS_TYPE="curl_cffi", 15 | # DOWNLOAD_HANDLERS_TYPE="requests", 16 | # DOWNLOAD_HANDLERS_TYPE="httpx", 17 | # DOWNLOAD_HANDLERS_TYPE="pyhttpx", 18 | ) 19 | 20 | start_urls = [] 21 | 22 | async def start_requests(self): 23 | # 发送get请求 24 | # 等价于 requests.get("https://httpbin.org/get", params=dict(test="test")) 25 | yield Request("https://httpbin.org/get?a=a", params=dict(b="b")) 26 | 27 | # 发送post form请求 28 | # 等价于 
requests.post("https://httpbin.org/post", data=dict(test="test")) 29 | yield Request("https://httpbin.org/post", method='POST', data=dict(test="test")) # 写法一 (推荐) 30 | yield FormRequest("https://httpbin.org/post", formdata=dict(test="test")) # 写法二 31 | yield Request("https://httpbin.org/post", method='POST', body="test=test", # 写法三 (不推荐) 32 | headers={'Content-Type': 'application/x-www-form-urlencoded'}) 33 | 34 | # 发送post json请求 35 | # 等价于 requests.post("https://httpbin.org/post", json=dict(test="test")) 36 | yield Request("https://httpbin.org/post", method='POST', json=dict(test="test")) # 写法一 (推荐) 37 | yield JsonRequest("https://httpbin.org/post", data=dict(test="test")) # 写法二 38 | yield Request("https://httpbin.org/post", method='POST', # 写法三 (不推荐) 39 | body=json.dumps(dict(test="test")), 40 | headers={'Content-Type': 'application/json'}) 41 | 42 | async def parse(self, response: Response): 43 | print(response.text) 44 | 45 | 46 | if __name__ == '__main__': 47 | DemoRequestSpider.start() 48 | -------------------------------------------------------------------------------- /example/singlespider/demo_request_aiohttp.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Request, logger, Spider 2 | 3 | 4 | class DemoAiohttpSpider(Spider): 5 | name = 'DemoAiohttpSpider' 6 | custom_settings = dict( 7 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36", 8 | # DOWNLOAD_DELAY=3, 9 | # RANDOMIZE_DOWNLOAD_DELAY=True, 10 | # CONCURRENT_REQUESTS=1, 11 | LOG_LEVEL='INFO', 12 | CLOSE_SPIDER_ON_IDLE=True, 13 | HTTPERROR_ALLOW_ALL=True, 14 | # DOWNLOAD_HANDLERS={ 15 | # 'http': 'aioscrapy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler', 16 | # 'https': 'aioscrapy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler', 17 | # }, 18 | DOWNLOAD_HANDLERS_TYPE="aiohttp", 19 | AIOHTTP_ARGS={} # 传递给aiohttp.ClientSession的参数 20 | ) 21 | 22 | start_urls = ['https://quotes.toscrape.com'] 23 | 24 | @staticmethod 25 | async def process_request(request, spider): 26 | """ request middleware """ 27 | pass 28 | 29 | @staticmethod 30 | async def process_response(request, response, spider): 31 | """ response middleware """ 32 | return response 33 | 34 | @staticmethod 35 | async def process_exception(request, exception, spider): 36 | """ exception middleware """ 37 | pass 38 | 39 | async def parse(self, response): 40 | for quote in response.css('div.quote'): 41 | yield { 42 | 'author': quote.xpath('span/small/text()').get(), 43 | 'text': quote.css('span.text::text').get(), 44 | } 45 | 46 | next_page = response.css('li.next a::attr("href")').get() 47 | if next_page is not None: 48 | # yield response.follow(next_page, self.parse) 49 | yield Request(f"https://quotes.toscrape.com/{next_page}", callback=self.parse) 50 | 51 | async def process_item(self, item): 52 | logger.info(item) 53 | 54 | 55 | if __name__ == '__main__': 56 | DemoAiohttpSpider.start() 57 | -------------------------------------------------------------------------------- /example/singlespider/demo_request_httpx.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Request, logger, Spider 2 | 3 | 4 | class DemoHttpxSpider(Spider): 5 | name = 'DemoHttpxSpider' 6 | custom_settings = dict( 7 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36", 8 | # DOWNLOAD_DELAY=3, 9 | # 
RANDOMIZE_DOWNLOAD_DELAY=True, 10 | # CONCURRENT_REQUESTS=1, 11 | LOG_LEVEL='INFO', 12 | CLOSE_SPIDER_ON_IDLE=True, 13 | # DOWNLOAD_HANDLERS={ 14 | # 'http': 'aioscrapy.core.downloader.handlers.httpx.HttpxDownloadHandler', 15 | # 'https': 'aioscrapy.core.downloader.handlers.httpx.HttpxDownloadHandler', 16 | # }, 17 | DOWNLOAD_HANDLERS_TYPE="httpx", 18 | HTTPX_ARGS={'http2': True}, # 传递给httpx.AsyncClient构造函数的参数 19 | FIX_HTTPX_HEADER=True, # 修复响应中的非标准HTTP头 20 | ) 21 | 22 | start_urls = ['https://quotes.toscrape.com'] 23 | 24 | @staticmethod 25 | async def process_request(request, spider): 26 | """ request middleware """ 27 | pass 28 | 29 | @staticmethod 30 | async def process_response(request, response, spider): 31 | """ response middleware """ 32 | return response 33 | 34 | @staticmethod 35 | async def process_exception(request, exception, spider): 36 | """ exception middleware """ 37 | pass 38 | 39 | async def parse(self, response): 40 | for quote in response.css('div.quote'): 41 | yield { 42 | 'author': quote.xpath('span/small/text()').get(), 43 | 'text': quote.css('span.text::text').get(), 44 | } 45 | 46 | next_page = response.css('li.next a::attr("href")').get() 47 | if next_page is not None: 48 | # yield response.follow(next_page, self.parse) 49 | yield Request(f"https://quotes.toscrape.com/{next_page}", callback=self.parse) 50 | 51 | async def process_item(self, item): 52 | logger.info(item) 53 | 54 | 55 | if __name__ == '__main__': 56 | DemoHttpxSpider.start() 57 | -------------------------------------------------------------------------------- /example/singlespider/push_task.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | 3 | from aioscrapy import Request 4 | from aioscrapy.db import db_manager 5 | 6 | 7 | async def push_redis_task(): 8 | from aioscrapy.queue.redis import SpiderPriorityQueue 9 | 10 | try: 11 | await db_manager.from_dict({ 12 | 'redis': { 13 | 'queue': { 14 | 'url': 'redis://192.168.234.128:6379/0', 15 | 'max_connections': 2, 16 | } 17 | } 18 | }) 19 | q = SpiderPriorityQueue.from_dict({ 20 | 'alias': 'queue', 21 | 'spider_name': 'DemoRedisSpider', 22 | 'serializer': 'aioscrapy.serializer.JsonSerializer' 23 | }) 24 | for page in range(1, 10): 25 | r = Request( 26 | f'https://quotes.toscrape.com/page/{page}/', 27 | priority=page 28 | ) 29 | await q.push(r) 30 | finally: 31 | await db_manager.close_all() 32 | 33 | 34 | async def push_rabbitmq_task(): 35 | from aioscrapy.queue.rabbitmq import SpiderPriorityQueue 36 | 37 | try: 38 | await db_manager.from_dict({ 39 | 'rabbitmq': { 40 | 'queue': { 41 | 'url': "amqp://guest:guest@192.168.234.128:5673", 42 | 'connection_max_size': 2, 43 | 'channel_max_size': 10, 44 | } 45 | } 46 | }) 47 | q = SpiderPriorityQueue.from_dict({ 48 | 'alias': 'queue', 49 | 'spider_name': 'DemoRabbitmqSpider', 50 | 'serializer': 'aioscrapy.serializer.JsonSerializer' 51 | }) 52 | for page in range(1, 10): 53 | r = Request( 54 | f'https://quotes.toscrape.com/page/{page}/', 55 | priority=page 56 | ) 57 | await q.push(r) 58 | finally: 59 | await db_manager.close_all() 60 | 61 | 62 | if __name__ == '__main__': 63 | asyncio.get_event_loop().run_until_complete(push_redis_task()) 64 | # asyncio.get_event_loop().run_until_complete(push_rabbitmq_task()) 65 | -------------------------------------------------------------------------------- /example/singlespider/demo_queue_rabbitmq.py: -------------------------------------------------------------------------------- 1 | 2 | from aioscrapy import 
Spider, logger 3 | 4 | 5 | class DemoRabbitmqSpider(Spider): 6 | name = 'DemoRabbitmqSpider' 7 | custom_settings = { 8 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 9 | # 'DOWNLOAD_DELAY': 3, 10 | # 'RANDOMIZE_DOWNLOAD_DELAY': True, 11 | 'CONCURRENT_REQUESTS': 2, 12 | "CLOSE_SPIDER_ON_IDLE": False, 13 | # 'SCHEDULER_FLUSH_ON_START': True, 14 | 15 | 'SCHEDULER_QUEUE_CLASS': 'aioscrapy.queue.rabbitmq.SpiderPriorityQueue', 16 | # DUPEFILTER_CLASS = 'aioscrapy.dupefilters.desk.RFPDupeFilter' 17 | 'SCHEDULER_SERIALIZER': 'aioscrapy.serializer.PickleSerializer', 18 | 'RABBITMQ_ARGS': { 19 | 'queue': { 20 | 'url': "amqp://guest:guest@192.168.234.128:5673", 21 | 'connection_max_size': 2, 22 | 'channel_max_size': 10, 23 | } 24 | } 25 | } 26 | 27 | start_urls = ['https://quotes.toscrape.com'] 28 | 29 | @staticmethod 30 | async def process_request(request, spider): 31 | """ request middleware """ 32 | pass 33 | 34 | @staticmethod 35 | async def process_response(request, response, spider): 36 | """ response middleware """ 37 | return response 38 | 39 | @staticmethod 40 | async def process_exception(request, exception, spider): 41 | """ exception middleware """ 42 | pass 43 | 44 | async def parse(self, response): 45 | for quote in response.css('div.quote'): 46 | yield { 47 | 'author': quote.xpath('span/small/text()').get(), 48 | 'text': quote.css('span.text::text').get(), 49 | } 50 | 51 | next_page = response.css('li.next a::attr("href")').get() 52 | if next_page is not None: 53 | yield response.follow(next_page, self.parse) 54 | 55 | async def process_item(self, item): 56 | logger.info(item) 57 | 58 | 59 | if __name__ == '__main__': 60 | DemoRabbitmqSpider.start() 61 | -------------------------------------------------------------------------------- /example/singlespider/demo_pipeline_excel.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Spider, logger 2 | 3 | 4 | class DemoExcelSpider(Spider): 5 | name = 'DemoExcelSpider' 6 | custom_settings = { 7 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 8 | # 'DOWNLOAD_DELAY': 3, 9 | # 'RANDOMIZE_DOWNLOAD_DELAY': True, 10 | # 'CONCURRENT_REQUESTS': 1, 11 | # 'LOG_LEVEL': 'INFO' 12 | # 'DUPEFILTER_CLASS': 'aioscrapy.dupefilters.disk.RFPDupeFilter', 13 | "CLOSE_SPIDER_ON_IDLE": True, 14 | 15 | "ITEM_PIPELINES": { 16 | 'aioscrapy.libs.pipelines.excel.ExcelPipeline': 100, 17 | }, 18 | "SAVE_CACHE_NUM": 1000, # 每次存储1000条 19 | "SAVE_CACHE_INTERVAL": 10, # 每次10秒存储一次 20 | } 21 | 22 | start_urls = ['https://quotes.toscrape.com'] 23 | 24 | @staticmethod 25 | async def process_request(request, spider): 26 | """ request middleware """ 27 | pass 28 | 29 | @staticmethod 30 | async def process_response(request, response, spider): 31 | """ response middleware """ 32 | return response 33 | 34 | @staticmethod 35 | async def process_exception(request, exception, spider): 36 | """ exception middleware """ 37 | pass 38 | 39 | async def parse(self, response): 40 | for quote in response.css('div.quote'): 41 | yield { 42 | 'author': quote.xpath('span/small/text()').get(), 43 | 'text': quote.css('span.text::text').get(), 44 | '__excel__': { 45 | 'filename': 'article', # 文件名 或 存储的路径及文件名 如:D:\article.xlsx 46 | 'sheet': 'sheet1', # 表格的sheet名字 不指定默认为sheet1 47 | 48 | # 'img_fields': ['img'], # 图片字段 当指定图片字段时 自行下载图片 并保存到表格里 49 | # 'img_size': (100, 100) # 
指定图片大小时 自动将图片转换为指定大小 50 | 51 | # 传递给Workbook的参数 xlsxwriter.Workbook(filename, {'strings_to_urls': False}) 52 | 'strings_to_urls': False 53 | } 54 | } 55 | 56 | async def process_item(self, item): 57 | logger.info(item) 58 | 59 | 60 | if __name__ == '__main__': 61 | DemoExcelSpider.start() 62 | -------------------------------------------------------------------------------- /example/singlespider/demo_request_curl_cffi.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Spider, logger 2 | from aioscrapy.http import Response, Request, FormRequest, JsonRequest 3 | 4 | 5 | class DemoCurlCffiSpider(Spider): 6 | name = 'DemoCurlCffiSpider' 7 | 8 | custom_settings = dict( 9 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 10 | # DOWNLOAD_DELAY=3, 11 | # RANDOMIZE_DOWNLOAD_DELAY=True, 12 | CONCURRENT_REQUESTS=1, 13 | LOG_LEVEL='INFO', 14 | CLOSE_SPIDER_ON_IDLE=True, 15 | # DOWNLOAD_HANDLERS={ 16 | # 'http': 'aioscrapy.core.downloader.handlers.curl_cffi.CurlCffiDownloadHandler', 17 | # 'https': 'aioscrapy.core.downloader.handlers.curl_cffi.CurlCffiDownloadHandler', 18 | # }, 19 | DOWNLOAD_HANDLERS_TYPE="curl_cffi", 20 | # USE_THREAD=True, # True: 使用asyncio.to_thread运行同步的curl_cffi; False: 使用curl_cffi的异步; 默认值为False 21 | CURL_CFFI_ARGS=dict(impersonate="chrome110"), # 传递给curl_cffi.AsyncSession构造函数的参数 22 | ) 23 | 24 | start_urls = ["https://quotes.toscrape.com"] 25 | 26 | async def start_requests(self): 27 | # for url in self.start_urls: 28 | # yield Request(url) 29 | yield Request("https://tools.scrapfly.io/api/fp/ja3", meta=dict(impersonate="chrome110")) 30 | yield FormRequest("https://httpbin.org/post", formdata=dict(test="test"), meta=dict(impersonate="chrome110")) 31 | yield JsonRequest("https://httpbin.org/post", data=dict(test="test"), meta=dict(impersonate="chrome110")) 32 | 33 | @staticmethod 34 | async def process_request(request, spider): 35 | """ request middleware """ 36 | pass 37 | 38 | @staticmethod 39 | async def process_response(request, response, spider): 40 | """ response middleware """ 41 | return response 42 | 43 | @staticmethod 44 | async def process_exception(request, exception, spider): 45 | """ exception middleware """ 46 | pass 47 | 48 | async def parse(self, response: Response): 49 | print(response.text) 50 | 51 | async def process_item(self, item): 52 | logger.info(item) 53 | 54 | 55 | if __name__ == '__main__': 56 | DemoCurlCffiSpider.start(use_windows_selector_eventLoop=True) 57 | -------------------------------------------------------------------------------- /aioscrapy/commands/runspider.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import inspect 4 | from importlib import import_module 5 | 6 | from aioscrapy.exceptions import UsageError 7 | from aioscrapy.commands import BaseRunSpiderCommand 8 | 9 | 10 | def iter_spider_classes(module): 11 | from aioscrapy.spiders import Spider 12 | 13 | for obj in vars(module).values(): 14 | if ( 15 | inspect.isclass(obj) 16 | and issubclass(obj, Spider) 17 | and obj.__module__ == module.__name__ 18 | and getattr(obj, 'name', None) 19 | ): 20 | yield obj 21 | 22 | 23 | def _import_file(filepath): 24 | abspath = os.path.abspath(filepath) 25 | dirname, file = os.path.split(abspath) 26 | fname, fext = os.path.splitext(file) 27 | if fext not in ('.py', '.pyw'): 28 | raise ValueError(f"Not a Python source file: {abspath}") 29 | if 
dirname: 30 | sys.path = [dirname] + sys.path 31 | try: 32 | module = import_module(fname) 33 | finally: 34 | if dirname: 35 | sys.path.pop(0) 36 | return module 37 | 38 | 39 | class Command(BaseRunSpiderCommand): 40 | 41 | requires_project = False 42 | default_settings = {'SPIDER_LOADER_WARN_ONLY': True} 43 | 44 | def syntax(self): 45 | return "[options] " 46 | 47 | def short_desc(self): 48 | return "Run a self-contained spider (without creating a project)" 49 | 50 | def long_desc(self): 51 | return "Run the spider defined in the given file" 52 | 53 | def run(self, args, opts): 54 | if len(args) != 1: 55 | raise UsageError() 56 | filename = args[0] 57 | if not os.path.exists(filename): 58 | raise UsageError(f"File not found: {filename}\n") 59 | try: 60 | module = _import_file(filename) 61 | except (ImportError, ValueError) as e: 62 | raise UsageError(f"Unable to load {filename!r}: {e}\n") 63 | spclasses = list(iter_spider_classes(module)) 64 | if not spclasses: 65 | raise UsageError(f"No spider found in file: {filename}\n") 66 | spidercls = spclasses.pop() 67 | 68 | self.crawler_process.crawl(spidercls, **opts.spargs) 69 | self.crawler_process.start() 70 | 71 | if self.crawler_process.bootstrap_failed: 72 | self.exitcode = 1 73 | -------------------------------------------------------------------------------- /aioscrapy/utils/spider.py: -------------------------------------------------------------------------------- 1 | """ 2 | Spider utility functions for aioscrapy. 3 | aioscrapy的爬虫实用函数。 4 | 5 | This module provides utility functions for working with spider classes in aioscrapy. 6 | It includes functions for discovering and iterating over spider classes. 7 | 此模块提供了用于处理aioscrapy中爬虫类的实用函数。 8 | 它包括用于发现和迭代爬虫类的函数。 9 | """ 10 | 11 | import inspect 12 | 13 | from aioscrapy.spiders import Spider 14 | 15 | 16 | def iter_spider_classes(module): 17 | """ 18 | Iterate over all valid spider classes defined in a module. 19 | 迭代模块中定义的所有有效爬虫类。 20 | 21 | This function finds all classes in the given module that: 22 | 1. Are subclasses of the Spider class 23 | 2. Are defined in the module itself (not imported) 24 | 3. Have a non-empty 'name' attribute (required for instantiation) 25 | 26 | 此函数查找给定模块中满足以下条件的所有类: 27 | 1. 是Spider类的子类 28 | 2. 在模块本身中定义(非导入) 29 | 3. 具有非空的'name'属性(实例化所必需的) 30 | 31 | The function is used by the spider loader to discover spiders in a module. 32 | 该函数被爬虫加载器用来在模块中发现爬虫。 33 | 34 | Args: 35 | module: The module object to inspect for spider classes. 36 | 要检查爬虫类的模块对象。 37 | 38 | Yields: 39 | class: Spider classes that can be instantiated. 40 | 可以实例化的爬虫类。 41 | 42 | Note: 43 | This implementation avoids importing the spider manager singleton 44 | from aioscrapy.spider.spiders, which would create circular imports. 
45 | 此实现避免从aioscrapy.spider.spiders导入爬虫管理器单例, 46 | 这会创建循环导入。 47 | """ 48 | # Iterate through all objects in the module 49 | # 迭代模块中的所有对象 50 | for obj in vars(module).values(): 51 | # Check if the object meets all criteria for a valid spider class 52 | # 检查对象是否满足有效爬虫类的所有条件 53 | if ( 54 | # Must be a class 55 | # 必须是一个类 56 | inspect.isclass(obj) 57 | # Must be a subclass of Spider 58 | # 必须是Spider的子类 59 | and issubclass(obj, Spider) 60 | # Must be defined in this module (not imported) 61 | # 必须在此模块中定义(非导入) 62 | and obj.__module__ == module.__name__ 63 | # Must have a name attribute (required for instantiation) 64 | # 必须有name属性(实例化所必需的) 65 | and getattr(obj, 'name', None) 66 | ): 67 | yield obj 68 | -------------------------------------------------------------------------------- /example/singlespider/demo_metric.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from aioscrapy import Request 4 | from aioscrapy.spiders import Spider 5 | 6 | logger = logging.getLogger(__name__) 7 | 8 | 9 | class DemoMetricSpider(Spider): 10 | name = 'DemoMetricSpider' 11 | custom_settings = dict( 12 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 13 | # DOWNLOAD_DELAY=3, 14 | # RANDOMIZE_DOWNLOAD_DELAY=True, 15 | # CONCURRENT_REQUESTS=1, 16 | LOG_LEVEL='DEBUG', 17 | CLOSE_SPIDER_ON_IDLE=True, 18 | 19 | 20 | EXTENSIONS={ 21 | 'aioscrapy.libs.extensions.metric.Metric': 0, 22 | }, 23 | 24 | # http(侵入式) 使用http协议将监控指标写入influxdb2 25 | # METRIC_INFLUXDB_URL="http://192.168.15.111:8086/api/v2/write?org=spiderman&bucket=spider-metric&precision=ns", 26 | # METRIC_INFLUXDB_TOKEN="YequFPGDEuukHUG9l8l2nlaatufGQK_UOD7MBpo3KvB8jIg5-cFa89GLXYfgk76M2sHvEtERpAXK7_fMNsBjAA==", 27 | 28 | # # log + vector(非侵入式) 将监控指标写入单独的日志文件,利用vector收集日志写入influxdb2 29 | METRIC_LOG_ARGS=dict(sink='DemoMetricSpider.metric', rotation='20MB', retention=3) 30 | ) 31 | 32 | start_urls = ['https://quotes.toscrape.com'] 33 | 34 | @staticmethod 35 | async def process_request(request, spider): 36 | """ request middleware """ 37 | pass 38 | 39 | @staticmethod 40 | async def process_response(request, response, spider): 41 | """ response middleware """ 42 | return response 43 | 44 | @staticmethod 45 | async def process_exception(request, exception, spider): 46 | """ exception middleware """ 47 | pass 48 | 49 | async def parse(self, response): 50 | for quote in response.css('div.quote'): 51 | yield { 52 | 'author': quote.xpath('span/small/text()').get(), 53 | 'text': quote.css('span.text::text').get(), 54 | } 55 | 56 | next_page = response.css('li.next a::attr("href")').get() 57 | if next_page is not None: 58 | # yield response.follow(next_page, self.parse) 59 | yield Request(f"https://quotes.toscrape.com{next_page}", callback=self.parse) 60 | 61 | # async def process_item(self, item): 62 | # print(item) 63 | 64 | 65 | if __name__ == '__main__': 66 | DemoMetricSpider.start() 67 | -------------------------------------------------------------------------------- /aioscrapy/middleware/extension.py: -------------------------------------------------------------------------------- 1 | """ 2 | The Extension Manager 3 | 扩展管理器 4 | 5 | This module provides the ExtensionManager class, which manages the loading and 6 | execution of extensions. Extensions are components that can hook into various 7 | parts of the Scrapy process to add functionality or modify behavior. 
8 | 此模块提供了ExtensionManager类,用于管理扩展的加载和执行。扩展是可以挂钩到 9 | Scrapy流程的各个部分以添加功能或修改行为的组件。 10 | 11 | Extensions are loaded from the EXTENSIONS setting and can be enabled or disabled 12 | through this setting. They can connect to signals to execute code at specific 13 | points in the crawling process. 14 | 扩展从EXTENSIONS设置加载,可以通过此设置启用或禁用。它们可以连接到信号, 15 | 以在爬取过程的特定点执行代码。 16 | """ 17 | from aioscrapy.middleware.absmanager import AbsMiddlewareManager 18 | from aioscrapy.utils.conf import build_component_list 19 | 20 | 21 | class ExtensionManager(AbsMiddlewareManager): 22 | """ 23 | Manager for extension components. 24 | 扩展组件的管理器。 25 | 26 | This class manages the loading and execution of extensions. It inherits from 27 | AbsMiddlewareManager and implements the specific behavior for extensions. 28 | Extensions are components that can hook into various parts of the Scrapy 29 | process to add functionality or modify behavior. 30 | 此类管理扩展的加载和执行。它继承自AbsMiddlewareManager,并实现了扩展的特定行为。 31 | 扩展是可以挂钩到Scrapy流程的各个部分以添加功能或修改行为的组件。 32 | 33 | Extensions typically connect to signals to execute code at specific points in 34 | the crawling process. They can be enabled or disabled through the EXTENSIONS 35 | setting. 36 | 扩展通常连接到信号,以在爬取过程的特定点执行代码。它们可以通过EXTENSIONS设置 37 | 启用或禁用。 38 | """ 39 | 40 | # Name of the component 41 | # 组件的名称 42 | component_name = 'extension' 43 | 44 | @classmethod 45 | def _get_mwlist_from_settings(cls, settings): 46 | """ 47 | Get the list of extension classes from settings. 48 | 从设置中获取扩展类列表。 49 | 50 | This method implements the abstract method from AbsMiddlewareManager. 51 | It retrieves the list of extension classes from the EXTENSIONS setting. 52 | 此方法实现了AbsMiddlewareManager中的抽象方法。它从EXTENSIONS设置中检索 53 | 扩展类列表。 54 | 55 | Args: 56 | settings: The settings object. 57 | 设置对象。 58 | 59 | Returns: 60 | list: A list of extension class paths. 
61 | 扩展类路径列表。 62 | """ 63 | # Build component list from EXTENSIONS setting 64 | # 从EXTENSIONS设置构建组件列表 65 | return build_component_list(settings.getwithbase('EXTENSIONS')) 66 | -------------------------------------------------------------------------------- /example/projectspider/redisdemo/settings.py: -------------------------------------------------------------------------------- 1 | BOT_NAME = 'redisdemo' 2 | 3 | SPIDER_MODULES = ['redisdemo.spiders'] 4 | NEWSPIDER_MODULE = 'redisdemo.spiders' 5 | 6 | # 是否配置去重及去重的方式 7 | # DUPEFILTER_CLASS = 'aioscrapy.dupefilters.disk.RFPDupeFilter' 8 | # DUPEFILTER_CLASS = 'aioscrapy.dupefilters.redis.RFPDupeFilter' 9 | # DUPEFILTER_CLASS = 'aioscrapy.dupefilters.redis.BloomDupeFilter' 10 | 11 | # 配置队列任务的序列化 12 | # SCHEDULER_SERIALIZER = 'aioscrapy.serializer.JsonSerializer' # 默认 13 | # SCHEDULER_SERIALIZER = 'aioscrapy.serializer.PickleSerializer' 14 | 15 | # 下载中间件 16 | DOWNLOADER_MIDDLEWARES = { 17 | 'redisdemo.middlewares.DemoDownloaderMiddleware': 543, 18 | } 19 | 20 | # 爬虫中间件 21 | SPIDER_MIDDLEWARES = { 22 | 'redisdemo.middlewares.DemoSpiderMiddleware': 543, 23 | } 24 | 25 | # item的处理方式 26 | ITEM_PIPELINES = { 27 | 'redisdemo.pipelines.DemoPipeline': 100, 28 | } 29 | 30 | # 扩展 31 | # EXTENSIONS = { 32 | # } 33 | 34 | # 使用什么包发送请求 35 | DOWNLOAD_HANDLERS_TYPE = "aiohttp" # aiohttp httpx pyhttpx requests playwright 不配置则默认为aiohttp 36 | # 自定义发包方式请用scrapy的形式,例如: 37 | # DOWNLOAD_HANDLERS={ 38 | # 'http': 'aioscrapy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler', 39 | # 'https': 'aioscrapy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler', 40 | # } 41 | 42 | # 下载延迟 43 | # DOWNLOAD_DELAY = 3 44 | 45 | # 随机下载延迟 46 | # RANDOMIZE_DOWNLOAD_DELAY = True 47 | 48 | # 并发数 49 | CONCURRENT_REQUESTS = 16 # 总并发的个数 默认为16 50 | CONCURRENT_REQUESTS_PER_DOMAIN = 8 # 按域名并发的个数 默认为8 51 | 52 | # 爬虫空闲时是否关闭 53 | CLOSE_SPIDER_ON_IDLE = False # 默认为True 54 | 55 | # 日志 56 | # LOG_FILE = './log/info.log' # 保存日志的位置及名称 57 | LOG_STDOUT = True # 是否将日志输出到控制台 默认为True 58 | LOG_LEVEL = 'DEBUG' # 日志等级 59 | 60 | # 用redis当作任务队列 61 | SCHEDULER_QUEUE_CLASS = 'aioscrapy.queue.redis.SpiderPriorityQueue' # 配置redis的优先级队列 62 | # SCHEDULER_FLUSH_ON_START = False # 重启爬虫时是否清空任务队列 默认False 63 | # redis 参数配置 64 | REDIS_ARGS = { 65 | # "queue" 队列存放的别名, 用于存放求情到改redis连接池中 66 | 'queue': { 67 | 'url': 'redis://:@192.168.43.165:6379/10', 68 | 'max_connections': 2, 69 | 'timeout': None, 70 | 'retry_on_timeout': True, 71 | 'health_check_interval': 30, 72 | }, 73 | 74 | # "proxy"是代理池的别名, 用以存放代理ip到改redis连接池中 75 | # 'proxy': { 76 | # 'url': 'redis://username:password@192.168.234.128:6379/2', 77 | # 'max_connections': 2, 78 | # 'timeout': None, 79 | # 'retry_on_timeout': True, 80 | # 'health_check_interval': 30, 81 | # } 82 | } 83 | 84 | # 代理配置 85 | USE_PROXY = True # 是否使用代理 默认为 False 86 | PROXY_HANDLER = 'redisdemo.proxy.MyProxy' # 代理类的加载路径 可参考aioscrapy.proxy.redis.RedisProxy代理的实现 87 | 88 | 89 | # 本框架基本实现了scrapy/scrapy-redis的功能 想要配置更多参数,请参考scrapy 90 | -------------------------------------------------------------------------------- /aioscrapy/queue/rabbitmq.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, List 2 | 3 | import aioscrapy 4 | from aioscrapy.db import db_manager 5 | from aioscrapy.queue import AbsQueue 6 | from aioscrapy.serializer import AbsSerializer 7 | from aioscrapy.utils.misc import load_object 8 | 9 | 10 | class RabbitMqPriorityQueue(AbsQueue): 11 | inc_key = 'scheduler/enqueued/rabbitmq' 12 | 13 | @classmethod 14 | 
def from_dict(cls, data: dict) -> "RabbitMqPriorityQueue": 15 | alias: str = data.get("alias", 'queue') 16 | server: aioscrapy.db.aiorabbitmq.RabbitmqExecutor = db_manager.rabbitmq.executor(alias) 17 | spider_name: str = data["spider_name"] 18 | serializer: str = data.get("serializer", "aioscrapy.serializer.JsonSerializer") 19 | serializer: AbsSerializer = load_object(serializer) 20 | return cls( 21 | server, 22 | key='%(spider)s:requests' % {'spider': spider_name}, 23 | serializer=serializer 24 | ) 25 | 26 | @classmethod 27 | async def from_spider(cls, spider: aioscrapy.Spider) -> "RabbitMqPriorityQueue": 28 | alias: str = spider.settings.get("SCHEDULER_QUEUE_ALIAS", 'queue') 29 | executor: aioscrapy.db.aiorabbitmq.RabbitmqExecutor = db_manager.rabbitmq.executor(alias) 30 | queue_key: str = spider.settings.get("SCHEDULER_QUEUE_KEY", '%(spider)s:requests') 31 | serializer: str = spider.settings.get("SCHEDULER_SERIALIZER", "aioscrapy.serializer.JsonSerializer") 32 | serializer: AbsSerializer = load_object(serializer) 33 | return cls( 34 | executor, 35 | spider, 36 | queue_key % {'spider': spider.name}, 37 | serializer=serializer 38 | ) 39 | 40 | async def len(self) -> int: 41 | return await self.container.get_message_count(self.key) 42 | 43 | async def push(self, request: aioscrapy.Request) -> None: 44 | data = self._encode_request(request) 45 | score = request.priority 46 | await self.container.publish( 47 | routing_key=self.key, 48 | body=data if isinstance(data, bytes) else data.encode(), 49 | priority=score 50 | ) 51 | 52 | async def push_batch(self, requests: List[aioscrapy.Request]) -> None: 53 | # TODO: 实现rabbitmq的批量存储 54 | for request in requests: 55 | await self.push(request) 56 | 57 | async def pop(self, count: int = 1) -> Optional[aioscrapy.Request]: 58 | result = await self.container.get_message(self.key) 59 | if result: 60 | yield await self._decode_request(result) 61 | 62 | async def clear(self) -> None: 63 | await self.container.clean_message_queue(self.key) 64 | 65 | 66 | SpiderPriorityQueue = RabbitMqPriorityQueue 67 | -------------------------------------------------------------------------------- /example/singlespider/demo_pipeline_mongo.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Spider, logger, Request 2 | 3 | 4 | class DemoMongoSpider(Spider): 5 | name = 'DemoMongoSpider' 6 | custom_settings = { 7 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 8 | # 'DOWNLOAD_DELAY': 3, 9 | # 'RANDOMIZE_DOWNLOAD_DELAY': True, 10 | # 'CONCURRENT_REQUESTS': 1, 11 | # 'LOG_LEVEL': 'INFO' 12 | # 'DUPEFILTER_CLASS': 'aioscrapy.dupefilters.disk.RFPDupeFilter', 13 | "CLOSE_SPIDER_ON_IDLE": True, 14 | "MONGO_TIMEOUT_RETRY_TIMES": 3, # mongo写入时发生NetworkTimeout错误重试次数 15 | # mongo parameter 16 | "MONGO_ARGS": { 17 | 'default': { 18 | 'host': 'mongodb://root:root@192.168.234.128:27017', 19 | 'db': 'test', 20 | } 21 | }, 22 | "ITEM_PIPELINES": { 23 | 'aioscrapy.libs.pipelines.mongo.MongoPipeline': 100, 24 | }, 25 | "SAVE_CACHE_NUM": 1000, # 每次存储1000条 26 | "SAVE_CACHE_INTERVAL": 10, # 每次10秒存储一次 27 | } 28 | 29 | start_urls = ['https://quotes.toscrape.com'] 30 | 31 | @staticmethod 32 | async def process_request(request, spider): 33 | """ request middleware """ 34 | pass 35 | 36 | @staticmethod 37 | async def process_response(request, response, spider): 38 | """ response middleware """ 39 | return response 40 | 41 | @staticmethod 42 | async def 
process_exception(request, exception, spider): 43 | """ exception middleware """ 44 | pass 45 | 46 | async def parse(self, response): 47 | for quote in response.css('div.quote'): 48 | yield { 49 | 'author': quote.xpath('span/small/text()').get(), 50 | 'text': quote.css('span.text::text').get(), 51 | '__mongo__': { 52 | 'db_alias': 'default', # 要存储的mongo, 参数“MONGO_ARGS”的key 53 | 'table_name': 'article', # 要存储的表名字 54 | # 'db_name': 'xxx', # 要存储的mongo的库名, 不指定则默认为“MONGO_ARGS”中的“db”值 55 | # 'ordered': False, # 批量写入时 单条数据异常是否影响整体写入 默认是False不影响其他数据写入 56 | } 57 | } 58 | 59 | next_page = response.css('li.next a::attr("href")').get() 60 | if next_page is not None: 61 | # yield response.follow(next_page, self.parse) 62 | yield Request(f"https://quotes.toscrape.com{next_page}", callback=self.parse) 63 | 64 | async def process_item(self, item): 65 | logger.info(item) 66 | 67 | 68 | if __name__ == '__main__': 69 | DemoMongoSpider.start() 70 | -------------------------------------------------------------------------------- /example/singlespider/demo_request_drissionpage.py: -------------------------------------------------------------------------------- 1 | import time 2 | from typing import Any 3 | 4 | 5 | 6 | from aioscrapy import Request, logger, Spider 7 | from aioscrapy.core.downloader.handlers.webdriver import DrissionPageDriver 8 | from aioscrapy.http import WebDriverResponse 9 | 10 | 11 | class DemoPlaywrightSpider(Spider): 12 | name = 'DemoPlaywrightSpider' 13 | 14 | custom_settings = dict( 15 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 16 | # DOWNLOAD_DELAY=3, 17 | # RANDOMIZE_DOWNLOAD_DELAY=True, 18 | CONCURRENT_REQUESTS=2, 19 | LOG_LEVEL='INFO', 20 | CLOSE_SPIDER_ON_IDLE=True, 21 | # DOWNLOAD_HANDLERS={ 22 | # 'http': 'aioscrapy.core.downloader.handlers.webdriver.drissionpage.DrissionPageDownloadHandler', 23 | # 'https': 'aioscrapy.core.downloader.handlers.webdriver.drissionpage.DrissionPageDownloadHandler', 24 | # }, 25 | DOWNLOAD_HANDLERS_TYPE="dp", 26 | DP_ARGS=dict( 27 | use_pool=True, # use_pool=True时 使用完driver后不销毁 重复使用 提高效率 28 | max_uses=None, # 在use_pool=True时生效,如果driver达到指定使用次数,则销毁,重新启动一个driver(处理有些driver使用次数变多则变卡的情况) 29 | headless=False, 30 | arguments=['--no-sandbox', ('--window-size', '1024,800')] 31 | ) 32 | 33 | ) 34 | 35 | start_urls = [ 36 | 'https://hanyu.baidu.com/zici/s?wd=黄&query=黄', 37 | ] 38 | 39 | @staticmethod 40 | async def process_request(request, spider): 41 | """ request middleware """ 42 | pass 43 | 44 | @staticmethod 45 | async def process_response(request, response, spider): 46 | """ response middleware """ 47 | return response 48 | 49 | @staticmethod 50 | async def process_exception(request, exception, spider): 51 | """ exception middleware """ 52 | pass 53 | 54 | async def parse(self, response: WebDriverResponse): 55 | yield { 56 | 'pingyin': response.xpath('//div[@class="pinyin-text"]/text()').get(), 57 | 'action_ret': response.get_response('action_ret'), 58 | } 59 | 60 | def process_action(self, driver: DrissionPageDriver, request: Request) -> Any: 61 | """ 62 | 该方法在异步线程中执行,不要使用异步函数 63 | """ 64 | img_bytes = driver.page.get_screenshot(as_bytes='jpeg') 65 | print(img_bytes) 66 | 67 | time.sleep(2) # 等待js渲染 68 | 69 | # TODO: 点击 选择元素等操作 70 | 71 | # 如果有需要传递回 parse函数的内容,则按如下返回,结果将在self.parse的response.cache_response中, 72 | return 'action_ret', 'process_action result' 73 | 74 | async def process_item(self, item): 75 | logger.info(item) 76 |
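    # How results flow in this demo: the ('action_ret', ...) tuple returned by process_action
    # above is cached on the response and read back in parse() via response.get_response('action_ret');
    # any other (key, value) pair returned the same way is exposed to parse() identically.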
77 | 78 | if __name__ == '__main__': 79 | DemoPlaywrightSpider.start() 80 | -------------------------------------------------------------------------------- /example/singlespider/demo_queue_redis.py: -------------------------------------------------------------------------------- 1 | 2 | from aioscrapy import Spider, logger 3 | 4 | 5 | class DemoRedisSpider(Spider): 6 | name = 'DemoRedisSpider' 7 | custom_settings = { 8 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 9 | # 'DOWNLOAD_DELAY': 3, 10 | # 'RANDOMIZE_DOWNLOAD_DELAY': True, 11 | 'CONCURRENT_REQUESTS': 2, 12 | "CLOSE_SPIDER_ON_IDLE": True, 13 | # 'LOG_FILE': 'test.log', 14 | 15 | # SCHEDULER_QUEUE_CLASS = 'aioscrapy.queue.redis.SpiderPriorityQueue' 16 | # SCHEDULER_QUEUE_CLASS = 'aioscrapy.queue.rabbitmq.SpiderPriorityQueue' 17 | 18 | # DUPEFILTER_CLASS = 'aioscrapy.dupefilters.disk.RFPDupeFilter' 19 | # DUPEFILTER_CLASS = 'aioscrapy.dupefilters.redis.RFPDupeFilter' 20 | # DUPEFILTER_CLASS = 'aioscrapy.dupefilters.redis.BloomDupeFilter' 21 | 22 | # SCHEDULER_SERIALIZER = 'aioscrapy.serializer.JsonSerializer' 23 | # SCHEDULER_SERIALIZER = 'aioscrapy.serializer.PickleSerializer' 24 | 25 | 'SCHEDULER_QUEUE_CLASS': 'aioscrapy.queue.redis.SpiderPriorityQueue', 26 | # 'DUPEFILTER_CLASS': 'aioscrapy.dupefilters.redis.RFPDupeFilter', 27 | 'SCHEDULER_SERIALIZER': 'aioscrapy.serializer.JsonSerializer', 28 | 'ENQUEUE_CACHE_NUM': 100, 29 | 'REDIS_ARGS': { 30 | 'queue': { 31 | 'url': 'redis://192.168.18.129:6379/0', 32 | 'max_connections': 2, 33 | 'timeout': None, 34 | 'retry_on_timeout': True, 35 | 'health_check_interval': 30 36 | } 37 | } 38 | } 39 | 40 | start_urls = ['https://quotes.toscrape.com'] 41 | 42 | @staticmethod 43 | async def process_request(request, spider): 44 | """ request middleware """ 45 | pass 46 | 47 | @staticmethod 48 | async def process_response(request, response, spider): 49 | """ response middleware """ 50 | return response 51 | 52 | @staticmethod 53 | async def process_exception(request, exception, spider): 54 | """ exception middleware """ 55 | pass 56 | 57 | async def parse(self, response): 58 | for quote in response.css('div.quote'): 59 | yield { 60 | 'author': quote.xpath('span/small/text()').get(), 61 | 'text': quote.css('span.text::text').get(), 62 | } 63 | 64 | next_page = response.css('li.next a::attr("href")').get() 65 | if next_page is not None: 66 | yield response.follow(next_page, self.parse, dont_filter=False) 67 | 68 | async def process_item(self, item): 69 | logger.info(item) 70 | 71 | 72 | if __name__ == '__main__': 73 | DemoRedisSpider.start() 74 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from os.path import dirname, join 2 | from setuptools import setup, find_packages 3 | 4 | with open(join(dirname(__file__), 'aioscrapy/VERSION'), 'rb') as f: 5 | version = f.read().decode('ascii').strip() 6 | 7 | install_requires = [ 8 | "aiohttp", 9 | "ujson", 10 | "w3lib>=1.17.0", 11 | "parsel>=1.5.0", 12 | "PyDispatcher>=2.0.5", 13 | 'zope.interface>=5.1.0', 14 | "redis>=4.3.1", 15 | "aiomultiprocess>=0.9.0", 16 | "loguru>=0.7.0", 17 | "anyio>=3.6.2" 18 | ] 19 | extras_require = { 20 | "all": [ 21 | "aiomysql>=0.1.1", "httpx[http2]>=0.23.0", "aio-pika>=8.1.1", 22 | "cryptography", "motor>=2.1.0", "pyhttpx>=2.10.1", "asyncpg>=0.27.0", 23 | "XlsxWriter>=3.1.2", "pillow>=9.4.0", 
"requests>=2.28.2", "curl_cffi", 24 | "sbcdp", "DrissionPage" 25 | ], 26 | "aiomysql": ["aiomysql>=0.1.1", "cryptography"], 27 | "httpx": ["httpx[http2]>=0.23.0"], 28 | "aio-pika": ["aio-pika>=8.1.1"], 29 | "mongo": ["motor>=2.1.0"], 30 | "playwright": ["playwright>=1.31.1"], 31 | "sbcdp": ["sbcdp"], 32 | "dp": ["DrissionPage"], 33 | "pyhttpx": ["pyhttpx>=2.10.4"], 34 | "curl_cffi": ["curl_cffi>=0.6.1"], 35 | "requests": ["requests>=2.28.2"], 36 | "pg": ["asyncpg>=0.27.0"], 37 | "execl": ["XlsxWriter>=3.1.2", "pillow>=9.4.0"], 38 | } 39 | 40 | setup( 41 | name='aio-scrapy', 42 | version=version, 43 | url='https://github.com/conlin-huang/aio-scrapy.git', 44 | description='A high-level Web Crawling and Web Scraping framework based on Asyncio', 45 | long_description_content_type="text/markdown", 46 | long_description=open('README.md', encoding='utf-8').read(), 47 | author='conlin', 48 | author_email="995018884@qq.com", 49 | license="MIT", 50 | packages=find_packages(exclude=('example',)), 51 | include_package_data=True, 52 | zip_safe=False, 53 | entry_points={ 54 | 'console_scripts': ['aioscrapy = aioscrapy.cmdline:execute'] 55 | }, 56 | classifiers=[ 57 | 'License :: OSI Approved :: MIT License', 58 | 'Intended Audience :: Developers', 59 | 'Operating System :: OS Independent', 60 | 'Programming Language :: Python :: 3.9', 61 | 'Programming Language :: Python :: 3.10', 62 | 'Programming Language :: Python :: 3.11', 63 | 'Topic :: Internet :: WWW/HTTP', 64 | 'Topic :: Software Development :: Libraries :: Application Frameworks', 65 | 'Topic :: Software Development :: Libraries :: Python Modules', 66 | ], 67 | python_requires='>=3.9', 68 | install_requires=install_requires, 69 | extras_require=extras_require, 70 | keywords=[ 71 | 'aio-scrapy', 72 | 'scrapy', 73 | 'aioscrapy', 74 | 'scrapy redis', 75 | 'asyncio', 76 | 'spider', 77 | ] 78 | ) 79 | -------------------------------------------------------------------------------- /example/singlespider/demo_pipeline_pg.py: -------------------------------------------------------------------------------- 1 | from aioscrapy import Request, Spider, logger 2 | 3 | 4 | class DemoPGSpider(Spider): 5 | name = 'DemoPGSpider' 6 | custom_settings = { 7 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 8 | # 'DOWNLOAD_DELAY': 3, 9 | # 'RANDOMIZE_DOWNLOAD_DELAY': True, 10 | # 'CONCURRENT_REQUESTS': 1, 11 | # 'LOG_LEVEL': 'INFO' 12 | # 'DUPEFILTER_CLASS': 'aioscrapy.dupefilters.disk.RFPDupeFilter', 13 | "CLOSE_SPIDER_ON_IDLE": True, 14 | # mongo parameter 15 | "PG_ARGS": { 16 | 'default': { 17 | 'user': 'user', 18 | 'password': 'password', 19 | 'database': 'spider_db', 20 | 'host': '127.0.0.1' 21 | } 22 | }, 23 | "ITEM_PIPELINES": { 24 | 'aioscrapy.libs.pipelines.pg.PGPipeline': 100, 25 | }, 26 | "SAVE_CACHE_NUM": 1000, # 每次存储1000条 27 | "SAVE_CACHE_INTERVAL": 10, # 每次10秒存储一次 28 | } 29 | 30 | start_urls = ['https://quotes.toscrape.com'] 31 | 32 | @staticmethod 33 | async def process_request(request, spider): 34 | """ request middleware """ 35 | pass 36 | 37 | @staticmethod 38 | async def process_response(request, response, spider): 39 | """ response middleware """ 40 | return response 41 | 42 | @staticmethod 43 | async def process_exception(request, exception, spider): 44 | """ exception middleware """ 45 | pass 46 | 47 | async def parse(self, response): 48 | for quote in response.css('div.quote'): 49 | yield { 50 | 'author': quote.xpath('span/small/text()').get(), 51 | 'text': 
quote.css('span.text::text').get(), 52 | '__pg__': { 53 | 'db_alias': 'default', # 要存储的PostgreSQL, 参数“PG_ARGS”的key 54 | 'table_name': 'spider_db.article', # 要存储的schema和表名字,用.隔开 55 | 56 | # 写入数据库的方式: 57 | # insert:普通写入 出现主键或唯一键冲突时抛出异常 58 | # update_insert:更新插入 出现on_conflict指定的冲突时,更新写入 59 | # ignore_insert:忽略写入 写入时出现冲突 丢掉该条数据 不抛出异常 60 | 'insert_type': 'update_insert', 61 | 'on_conflict': 'id', # update_insert方式下的约束 62 | } 63 | } 64 | next_page = response.css('li.next a::attr("href")').get() 65 | if next_page is not None: 66 | # yield response.follow(next_page, self.parse) 67 | yield Request(f"https://quotes.toscrape.com{next_page}", callback=self.parse) 68 | 69 | async def process_item(self, item): 70 | logger.info(item) 71 | 72 | 73 | if __name__ == '__main__': 74 | DemoPGSpider.start() 75 | -------------------------------------------------------------------------------- /example/singlespider/demo_pipeline_mysql.py: -------------------------------------------------------------------------------- 1 | 2 | from aioscrapy import Request, Spider, logger 3 | 4 | 5 | class DemoMysqlSpider(Spider): 6 | name = 'DemoMysqlSpider' 7 | custom_settings = dict( 8 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 9 | # DOWNLOAD_DELAY=3, 10 | # RANDOMIZE_DOWNLOAD_DELAY=True, 11 | # CONCURRENT_REQUESTS=1, 12 | # LOG_LEVEL='INFO', 13 | # DUPEFILTER_CLASS='aioscrapy.dupefilters.disk.RFPDupeFilter', 14 | CLOSE_SPIDER_ON_IDLE=True, 15 | # mysql parameter 16 | MYSQL_ARGS={ 17 | 'default': { 18 | 'host': '127.0.0.1', 19 | 'user': 'root', 20 | 'password': 'root', 21 | 'port': 3306, 22 | 'charset': 'utf8mb4', 23 | 'db': 'test', 24 | }, 25 | }, 26 | ITEM_PIPELINES={ 27 | 'aioscrapy.libs.pipelines.mysql.MysqlPipeline': 100, 28 | }, 29 | SAVE_CACHE_NUM=1000, # 每次存储1000条 30 | SAVE_CACHE_INTERVAL=10, # 每次10秒存储一次 31 | ) 32 | 33 | start_urls = ['https://quotes.toscrape.com'] 34 | 35 | @staticmethod 36 | async def process_request(request, spider): 37 | """ request middleware """ 38 | pass 39 | 40 | @staticmethod 41 | async def process_response(request, response, spider): 42 | """ response middleware """ 43 | return response 44 | 45 | @staticmethod 46 | async def process_exception(request, exception, spider): 47 | """ exception middleware """ 48 | pass 49 | 50 | async def parse(self, response): 51 | for quote in response.css('div.quote'): 52 | yield { 53 | 'author': quote.xpath('span/small/text()').get(), 54 | 'text': quote.css('span.text::text').get(), 55 | '__mysql__': { 56 | 'db_alias': 'default', # 要存储的mysql, 参数“MYSQL_ARGS”的key 57 | 'table_name': 'article', # 要存储的表名字 58 | 59 | # 写入数据库的方式: 默认insert方式 60 | # insert:普通写入 出现主键或唯一键冲突时抛出异常 61 | # update_insert:更新插入 出现主键或唯一键冲突时,更新写入 62 | # ignore_insert:忽略写入 写入时出现冲突 丢掉该条数据 不抛出异常 63 | 'insert_type': 'update_insert', 64 | 65 | # 更新指定字段(只能在update_insert的时候使用) 66 | # 'update_fields': ['text'] 67 | } 68 | } 69 | 70 | next_page = response.css('li.next a::attr("href")').get() 71 | if next_page is not None: 72 | # yield response.follow(next_page, self.parse) 73 | yield Request(f"https://quotes.toscrape.com{next_page}", callback=self.parse) 74 | 75 | async def process_item(self, item): 76 | logger.info(item) 77 | 78 | 79 | if __name__ == '__main__': 80 | DemoMysqlSpider.start() 81 | -------------------------------------------------------------------------------- /aioscrapy/utils/httpobj.py: -------------------------------------------------------------------------------- 1 | """ 2 | HTTP object utility functions
for aioscrapy. 3 | aioscrapy的HTTP对象实用函数。 4 | 5 | This module provides utility functions for working with HTTP objects (Request, Response) 6 | in aioscrapy. It includes functions for parsing and caching URL information to improve 7 | performance when the same URLs are processed multiple times. 8 | 此模块提供了用于处理aioscrapy中HTTP对象(Request, Response)的实用函数。 9 | 它包括用于解析和缓存URL信息的函数,以提高多次处理相同URL时的性能。 10 | """ 11 | 12 | from typing import Union 13 | from urllib.parse import urlparse, ParseResult 14 | from weakref import WeakKeyDictionary 15 | 16 | from aioscrapy.http import Request, Response 17 | 18 | 19 | # Cache for storing parsed URLs to avoid repeated parsing of the same URL 20 | # Uses WeakKeyDictionary so entries are automatically removed when the Request/Response is garbage collected 21 | # 用于存储已解析URL的缓存,以避免重复解析相同的URL 22 | # 使用WeakKeyDictionary,因此当Request/Response被垃圾回收时,条目会自动删除 23 | _urlparse_cache: "WeakKeyDictionary[Union[Request, Response], ParseResult]" = WeakKeyDictionary() 24 | 25 | 26 | def urlparse_cached(request_or_response: Union[Request, Response]) -> ParseResult: 27 | """ 28 | Parse the URL of a Request or Response object with caching. 29 | 解析Request或Response对象的URL,并进行缓存。 30 | 31 | This function parses the URL of the given Request or Response object using 32 | urllib.parse.urlparse and caches the result. If the same object is passed 33 | again, the cached result is returned instead of re-parsing the URL. 34 | 此函数使用urllib.parse.urlparse解析给定Request或Response对象的URL, 35 | 并缓存结果。如果再次传递相同的对象,则返回缓存的结果,而不是重新解析URL。 36 | 37 | The caching mechanism uses a WeakKeyDictionary, so the cache entries are 38 | automatically removed when the Request or Response objects are garbage collected. 39 | This prevents memory leaks while still providing performance benefits. 40 | 缓存机制使用WeakKeyDictionary,因此当Request或Response对象被垃圾回收时, 41 | 缓存条目会自动删除。这可以防止内存泄漏,同时仍然提供性能优势。 42 | 43 | Args: 44 | request_or_response: A Request or Response object whose URL will be parsed. 45 | 将解析其URL的Request或Response对象。 46 | 47 | Returns: 48 | ParseResult: The parsed URL components (scheme, netloc, path, params, 49 | query, fragment). 50 | 解析的URL组件(scheme, netloc, path, params, query, fragment)。 51 | 52 | Example: 53 | >>> request = Request('https://example.com/path?query=value') 54 | >>> parsed = urlparse_cached(request) 55 | >>> parsed.netloc 56 | 'example.com' 57 | >>> parsed.path 58 | '/path' 59 | >>> parsed.query 60 | 'query=value' 61 | """ 62 | # Check if this object's URL has already been parsed and cached 63 | # 检查此对象的URL是否已被解析和缓存 64 | if request_or_response not in _urlparse_cache: 65 | # If not in cache, parse the URL and store the result 66 | # 如果不在缓存中,解析URL并存储结果 67 | _urlparse_cache[request_or_response] = urlparse(request_or_response.url) 68 | 69 | # Return the cached parse result 70 | # 返回缓存的解析结果 71 | return _urlparse_cache[request_or_response] 72 | -------------------------------------------------------------------------------- /docs/installation.md: -------------------------------------------------------------------------------- 1 | # 安装指南 | Installation Guide 2 | 3 | 本页面提供了安装AioScrapy的详细说明。
4 | This page provides detailed instructions for installing AioScrapy. 5 | 6 | ## 系统要求 | System Requirements 7 | 8 | AioScrapy支持以下操作系统:
9 | AioScrapy supports the following operating systems: 10 | 11 | - Windows 12 | - Linux 13 | - macOS 14 | 15 | ## Python版本 | Python Version 16 | 17 | AioScrapy需要Python 3.9或更高版本。
18 | AioScrapy requires Python 3.9 or higher. 19 | 20 | 您可以使用以下命令检查您的Python版本:
21 | You can check your Python version with the following command: 22 | 23 | ```bash 24 | python --version 25 | ``` 26 | 27 | ## 安装方法 | Installation Methods 28 | ### 使用pip安装 | Install with pip 29 | 30 | 推荐使用pip安装AioScrapy:
31 | It is recommended to install AioScrapy using pip: 32 | 33 | ```bash 34 | pip install aio-scrapy 35 | ``` 36 | 37 | ### 安装可选依赖 | Install Optional Dependencies 38 | 39 | AioScrapy提供了多个可选依赖包,以支持不同的功能:
40 | AioScrapy provides several optional dependency packages to support different features: 41 | 42 | ```bash 43 | # 安装所有可选依赖 | Install all optional dependencies 44 | pip install aio-scrapy[all] 45 | 46 | # Redis支持已包含在默认安装中(redis为核心依赖) | Redis support is included in the default install (redis is a core dependency) 47 | pip install aio-scrapy 48 | 49 | # 安装MySQL支持 | Install MySQL support 50 | pip install aio-scrapy[aiomysql] 51 | 52 | # 安装MongoDB支持 | Install MongoDB support 53 | pip install aio-scrapy[mongo] 54 | 55 | # 安装PostgreSQL支持 | Install PostgreSQL support 56 | pip install aio-scrapy[pg] 57 | 58 | # 安装RabbitMQ支持 | Install RabbitMQ support 59 | pip install aio-scrapy[aio-pika] 60 | ``` 61 | 62 | ### 从源代码安装 | Install from Source 63 | 64 | 您也可以从源代码安装AioScrapy: 65 | You can also install AioScrapy from source: 66 | 67 | ```bash 68 | git clone https://github.com/conlin-huang/aio-scrapy.git 69 | cd aio-scrapy 70 | pip install -e . 71 | ``` 72 | 73 | ## 验证安装 | Verify Installation 74 | 75 | 安装完成后,您可以通过运行以下命令来验证安装是否成功:
76 | After installation, you can verify that the installation was successful by running the following command: 77 | 78 | ```bash 79 | aioscrapy version 80 | ``` 81 | 82 | 如果安装成功,您将看到AioScrapy的版本信息。
83 | If the installation was successful, you will see the AioScrapy version information. 84 | 85 | ## 安装不同的下载器 | Installing Different Downloaders 86 | 87 | AioScrapy支持多种HTTP客户端作为下载器。默认使用aiohttp,但您可以安装其他客户端以使用不同的下载器:
88 | AioScrapy supports multiple HTTP clients as downloaders. It uses aiohttp by default, but you can install other clients to use different downloaders: 89 | 90 | ```bash 91 | # 安装httpx支持 | Install httpx support 92 | pip install httpx 93 | 94 | # 安装playwright支持(用于JavaScript渲染) | Install playwright support (for JavaScript rendering) 95 | pip install playwright 96 | python -m playwright install 97 | 98 | # 安装pyhttpx支持 | Install pyhttpx support 99 | pip install pyhttpx 100 | 101 | # 安装curl_cffi支持 | Install curl_cffi support 102 | pip install curl_cffi 103 | 104 | # 安装DrissionPage支持 | Install DrissionPage support 105 | pip install DrissionPage 106 | ``` 107 | 108 | ### 更新AioScrapy | Updating AioScrapy 109 | 110 | 要更新到最新版本的AioScrapy,请使用:
111 | To update to the latest version of AioScrapy, use: 112 | 113 | ```bash 114 | pip install --upgrade aio-scrapy 115 | ``` 116 | -------------------------------------------------------------------------------- /aioscrapy/libs/downloader/defaultheaders.py: -------------------------------------------------------------------------------- 1 | """ 2 | DefaultHeaders Downloader Middleware 3 | 默认头部下载器中间件 4 | 5 | This middleware sets default headers for all requests, as specified in the 6 | DEFAULT_REQUEST_HEADERS setting. These headers are only set if they are not 7 | already present in the request. 8 | 此中间件为所有请求设置默认头部,如DEFAULT_REQUEST_HEADERS设置中指定的那样。 9 | 这些头部仅在请求中尚未存在时才会设置。 10 | """ 11 | 12 | from aioscrapy.utils.python import without_none_values 13 | 14 | 15 | class DefaultHeadersMiddleware: 16 | """ 17 | Middleware for setting default headers on requests. 18 | 用于在请求上设置默认头部的中间件。 19 | 20 | This middleware adds default headers to all outgoing requests, as specified in the 21 | DEFAULT_REQUEST_HEADERS setting. Headers are only added if they are not already 22 | present in the request, allowing request-specific headers to take precedence. 23 | 此中间件向所有传出请求添加默认头部,如DEFAULT_REQUEST_HEADERS设置中指定的那样。 24 | 仅当请求中尚未存在头部时才会添加头部,允许特定于请求的头部优先。 25 | """ 26 | 27 | def __init__(self, headers): 28 | """ 29 | Initialize the DefaultHeadersMiddleware. 30 | 初始化DefaultHeadersMiddleware。 31 | 32 | Args: 33 | headers: An iterable of (name, value) pairs representing the default headers. 34 | 表示默认头部的(名称, 值)对的可迭代对象。 35 | """ 36 | # Store the default headers 37 | # 存储默认头部 38 | self._headers = headers 39 | 40 | @classmethod 41 | def from_crawler(cls, crawler): 42 | """ 43 | Create a DefaultHeadersMiddleware instance from a crawler. 44 | 从爬虫创建DefaultHeadersMiddleware实例。 45 | 46 | This is the factory method used by AioScrapy to create the middleware. 47 | 这是AioScrapy用于创建中间件的工厂方法。 48 | 49 | Args: 50 | crawler: The crawler that will use this middleware. 51 | 将使用此中间件的爬虫。 52 | 53 | Returns: 54 | DefaultHeadersMiddleware: A new DefaultHeadersMiddleware instance. 55 | 一个新的DefaultHeadersMiddleware实例。 56 | """ 57 | # Get the default headers from settings, filtering out None values 58 | # 从设置获取默认头部,过滤掉None值 59 | headers = without_none_values(crawler.settings['DEFAULT_REQUEST_HEADERS']) 60 | 61 | # Create and return a new instance with the headers as (name, value) pairs 62 | # 使用作为(名称, 值)对的头部创建并返回一个新实例 63 | return cls(headers.items()) 64 | 65 | def process_request(self, request, spider): 66 | """ 67 | Process a request before it is sent to the downloader. 68 | 在请求发送到下载器之前处理它。 69 | 70 | This method adds the default headers to the request if they are not already present. 71 | 如果请求中尚未存在默认头部,此方法会将其添加到请求中。 72 | 73 | Args: 74 | request: The request being processed. 75 | 正在处理的请求。 76 | spider: The spider that generated the request. 77 | 生成请求的爬虫。 78 | 79 | Returns: 80 | None: This method does not return a response or a deferred. 
81 | 此方法不返回响应或延迟对象。 82 | """ 83 | # Add each default header to the request if it's not already set 84 | # 如果尚未设置,则将每个默认头部添加到请求中 85 | for k, v in self._headers: 86 | request.headers.setdefault(k, v) 87 | -------------------------------------------------------------------------------- /example/singlespider/demo_duplicate.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | from aioscrapy import Spider, Request, logger 4 | 5 | 6 | class DemoDuplicateSpider(Spider): 7 | name = 'DemoDuplicateSpider' 8 | custom_settings = { 9 | "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 10 | # 'DOWNLOAD_DELAY': 3, 11 | # 'RANDOMIZE_DOWNLOAD_DELAY': True, 12 | 'CONCURRENT_REQUESTS': 2, 13 | 'LOG_LEVEL': 'INFO', 14 | "CLOSE_SPIDER_ON_IDLE": True, 15 | 16 | "DUPEFILTER_INFO": True, # 是否以info级别的日志输出过滤器的信息 17 | 18 | # 'LOG_FILE': 'test.log', 19 | 20 | # 'DUPEFILTER_CLASS': 'aioscrapy.dupefilters.disk.RFPDupeFilter', # 本地文件存储指纹去重 21 | # 'DUPEFILTER_CLASS': 'aioscrapy.dupefilters.redis.RFPDupeFilter', # redis set去重 22 | # 'DUPEFILTER_CLASS': 'aioscrapy.dupefilters.redis.BloomDupeFilter', # 布隆过滤器去重 23 | 24 | # RFPDupeFilter去重的增强版: 当请求失败或解析失败时 从Set中移除指纹 25 | # 'DUPEFILTER_CLASS': 'aioscrapy.dupefilters.redis.ExRFPDupeFilter', 26 | 27 | # 布隆过滤器去重增强版:添加一个临时的Set集合缓存请求中的请求 在解析成功后再将指纹加入到布隆过滤器同时将Set中的清除 28 | 'DUPEFILTER_CLASS': 'aioscrapy.dupefilters.redis.ExBloomDupeFilter', 29 | "DUPEFILTER_SET_KEY_TTL": 60 * 3, # BloomSetDupeFilter过滤器的临时Redis Set集合的过期时间 30 | 31 | # 'SCHEDULER_QUEUE_CLASS': 'aioscrapy.queue.redis.SpiderPriorityQueue', 32 | 'SCHEDULER_SERIALIZER': 'aioscrapy.serializer.JsonSerializer', 33 | 'REDIS_ARGS': { 34 | 'queue': { 35 | 'url': 'redis://127.0.0.1:6379/10', 36 | 'max_connections': 2, 37 | 'timeout': None, 38 | 'retry_on_timeout': True, 39 | 'health_check_interval': 30 40 | } 41 | } 42 | } 43 | 44 | async def start_requests(self): 45 | yield Request( 46 | 'https://quotes.toscrape.com/page/1', 47 | dont_filter=False, 48 | fingerprint='1', # 不使用默认的指纹计算规则 而是指定去重指纹值 49 | meta=dict( 50 | dupefilter_msg="page_1" # 当Request被去重时 指定日志输出的内容为"page_1" 不设置则默认为request对象 51 | ) 52 | ) 53 | 54 | @staticmethod 55 | async def process_request(request, spider): 56 | """ request middleware """ 57 | pass 58 | 59 | @staticmethod 60 | async def process_response(request, response, spider): 61 | """ response middleware """ 62 | return response 63 | 64 | @staticmethod 65 | async def process_exception(request, exception, spider): 66 | """ exception middleware """ 67 | pass 68 | 69 | async def parse(self, response): 70 | for quote in response.css('div.quote'): 71 | yield { 72 | 'author': quote.xpath('span/small/text()').get(), 73 | 'text': quote.css('span.text::text').get(), 74 | } 75 | 76 | next_page_url = response.css('li.next a::attr("href")').get() 77 | if next_page_url is not None: 78 | page_fingerprint = ''.join(re.findall(r'page/(\d+)', next_page_url)) 79 | yield response.follow(next_page_url, self.parse, dont_filter=False, fingerprint=page_fingerprint) 80 | 81 | async def process_item(self, item): 82 | logger.info(item) 83 | 84 | 85 | if __name__ == '__main__': 86 | DemoDuplicateSpider.start() 87 | -------------------------------------------------------------------------------- /example/singlespider/demo_request_sbcdp.py: -------------------------------------------------------------------------------- 1 | import random 2 | from typing import Any 3 | 4 | from aioscrapy import Request, 
logger, Spider 5 | from aioscrapy.core.downloader.handlers.webdriver import SbcdpDriver 6 | from sbcdp import NetHttp, NetWebsocket 7 | from aioscrapy.http import WebDriverResponse 8 | 9 | 10 | class DemoSbcdpSpider(Spider): 11 | name = 'DemoSbcdpSpider' 12 | 13 | custom_settings = dict( 14 | # DOWNLOAD_DELAY=3, 15 | # RANDOMIZE_DOWNLOAD_DELAY=True, 16 | CONCURRENT_REQUESTS=1, 17 | LOG_LEVEL='INFO', 18 | CLOSE_SPIDER_ON_IDLE=True, 19 | # DOWNLOAD_HANDLERS={ 20 | # 'http': 'aioscrapy.core.downloader.handlers.webdriver.sbcdp.SbcdpDownloadHandler', 21 | # 'https': 'aioscrapy.core.downloader.handlers.webdriver.sbcdp.SbcdpDownloadHandler', 22 | # }, 23 | DOWNLOAD_HANDLERS_TYPE="sbcdp", 24 | SBCDP_ARGS=dict( 25 | use_pool=True, # use_pool=True时 使用完driver后不销毁 重复使用 提高效率 26 | max_uses=None, # 在use_pool=True时生效,如果driver达到指定使用次数,则销毁,重新启动一个driver(处理有些driver使用次数变多则变卡的情况) 27 | # ... 其它参数为sbcdp的AsyncChrome类参数 28 | ) 29 | ) 30 | 31 | start_urls = [ 32 | 'https://hanyu.baidu.com/zici/s?wd=黄&query=黄', 33 | 'https://hanyu.baidu.com/zici/s?wd=王&query=王', 34 | 'https://hanyu.baidu.com/zici/s?wd=李&query=李', 35 | ] 36 | 37 | @staticmethod 38 | async def process_request(request, spider): 39 | """ request middleware """ 40 | pass 41 | 42 | @staticmethod 43 | async def process_response(request, response, spider): 44 | """ response middleware """ 45 | return response 46 | 47 | @staticmethod 48 | async def process_exception(request, exception, spider): 49 | """ exception middleware """ 50 | pass 51 | 52 | async def parse(self, response: WebDriverResponse): 53 | yield { 54 | 'pingyin': response.xpath('//div[@class="pinyin-text"]/text()').get(), 55 | 'title': response.get_response('title'), 56 | 'event_ret': response.get_response('info_key'), 57 | } 58 | 59 | async def on_event_http(self, result: NetHttp) -> Any: 60 | """ 61 | 监听网络请求 62 | sbcdp的http_monitor_all_tabs方法中的monitor_cb参数 63 | 具体使用参考sbcdp 64 | """ 65 | # 如果事件中有需要传递回 parse函数的内容,则按如下返回,结果将在self.parse的response.cache_response中, 66 | if '/dictapp/word/detail_getworddetail' in result.url: 67 | return 'event_ret', await result.get_response_body() 68 | 69 | if 'xxx2' in result.url: 70 | return 'xxx2', {'data': 'aaa'} 71 | 72 | async def on_event_http_intercept(self, result: NetHttp) -> Any: 73 | """ 74 | 监听网络请求拦截 75 | sbcdp的http_monitor_all_tabs方法中的intercept_cb参数 76 | 具体使用参考sbcdp 77 | """ 78 | # 阻止图片的加载 79 | if result.resource_type == 'Image': 80 | return True 81 | 82 | async def process_action(self, driver: SbcdpDriver, request: Request) -> Any: 83 | """ Do some operations after function page.goto """ 84 | title = await driver.browser.get_title() 85 | 86 | # TODO: 点击 选择元素等操作 87 | 88 | # 如果有需要传递回 parse函数的内容,则按如下返回,结果将在self.parse的response.cache_response中, 89 | return 'title', title 90 | 91 | async def process_item(self, item): 92 | logger.info(item) 93 | 94 | 95 | if __name__ == '__main__': 96 | DemoSbcdpSpider.start() 97 | -------------------------------------------------------------------------------- /aioscrapy/queue/memory.py: -------------------------------------------------------------------------------- 1 | from asyncio import PriorityQueue, Queue, LifoQueue 2 | from asyncio.queues import QueueEmpty 3 | from typing import Optional 4 | 5 | import aioscrapy 6 | from aioscrapy.queue import AbsQueue 7 | from aioscrapy.serializer import AbsSerializer 8 | from aioscrapy.utils.misc import load_object 9 | 10 | 11 | class MemoryQueueBase(AbsQueue): 12 | inc_key = 'scheduler/enqueued/memory' 13 | 14 | def __init__( 15 | self, 16 | container: Queue, 17 | spider: 
Optional[aioscrapy.Spider], 18 | key: Optional[str] = None, 19 | serializer: Optional[AbsSerializer] = None, 20 | max_size: int = 0 21 | ) -> None: 22 | super().__init__(container, spider, key, serializer) 23 | self.max_size = max_size 24 | 25 | @classmethod 26 | async def from_spider(cls, spider: aioscrapy.Spider) -> "MemoryQueueBase": 27 | max_size: int = spider.settings.getint("QUEUE_MAXSIZE", 0) 28 | queue: Queue = cls.get_queue(max_size) 29 | queue_key: str = spider.settings.get("SCHEDULER_QUEUE_KEY", '%(spider)s:requests') 30 | serializer: str = spider.settings.get("SCHEDULER_SERIALIZER", "aioscrapy.serializer.PickleSerializer") 31 | serializer: AbsSerializer = load_object(serializer) 32 | return cls( 33 | queue, 34 | spider, 35 | queue_key % {'spider': spider.name}, 36 | serializer=serializer 37 | ) 38 | 39 | def len(self) -> int: 40 | """Return the length of the queue""" 41 | return self.container.qsize() 42 | 43 | @staticmethod 44 | def get_queue(max_size: int) -> Queue: 45 | raise NotImplementedError 46 | 47 | async def push(self, request) -> None: 48 | data = self._encode_request(request) 49 | await self.container.put(data) 50 | 51 | async def push_batch(self, requests) -> None: 52 | for request in requests: 53 | await self.push(request) 54 | 55 | async def pop(self, count: int = 1) -> None: 56 | for _ in range(count): 57 | try: 58 | data = self.container.get_nowait() 59 | except QueueEmpty: 60 | break 61 | yield await self._decode_request(data) 62 | 63 | async def clear(self, timeout: int = 0) -> None: 64 | self.container = self.get_queue(self.max_size) 65 | 66 | 67 | class MemoryFifoQueue(MemoryQueueBase): 68 | 69 | @staticmethod 70 | def get_queue(max_size: int) -> Queue: 71 | return Queue(max_size) 72 | 73 | 74 | class MemoryLifoQueue(MemoryFifoQueue): 75 | @staticmethod 76 | def get_queue(max_size: int) -> LifoQueue: 77 | return LifoQueue(max_size) 78 | 79 | 80 | class MemoryPriorityQueue(MemoryFifoQueue): 81 | @staticmethod 82 | def get_queue(max_size: int) -> PriorityQueue: 83 | return PriorityQueue(max_size) 84 | 85 | async def push(self, request: aioscrapy.Request) -> None: 86 | data = self._encode_request(request) 87 | score = request.priority 88 | await self.container.put((score, data)) 89 | 90 | async def pop(self, count: int = 1) -> Optional[aioscrapy.Request]: 91 | for _ in range(count): 92 | try: 93 | score, data = self.container.get_nowait() 94 | except QueueEmpty: 95 | break 96 | yield await self._decode_request(data) 97 | 98 | 99 | SpiderQueue = MemoryFifoQueue 100 | SpiderStack = MemoryLifoQueue 101 | SpiderPriorityQueue = MemoryPriorityQueue 102 | -------------------------------------------------------------------------------- /aioscrapy/utils/decorators.py: -------------------------------------------------------------------------------- 1 | """ 2 | Decorator utilities for aioscrapy. 3 | aioscrapy的装饰器实用工具。 4 | 5 | This module provides utility decorators for use in aioscrapy. 6 | It includes decorators for marking functions and methods as deprecated. 7 | 此模块提供了在aioscrapy中使用的实用装饰器。 8 | 它包括用于将函数和方法标记为已弃用的装饰器。 9 | """ 10 | 11 | import warnings 12 | from functools import wraps 13 | 14 | from aioscrapy.exceptions import AioScrapyDeprecationWarning 15 | 16 | 17 | def deprecated(use_instead=None): 18 | """ 19 | Decorator to mark functions or methods as deprecated. 20 | 用于将函数或方法标记为已弃用的装饰器。 21 | 22 | This decorator can be used to mark functions or methods as deprecated. 
23 | It will emit a warning when the decorated function is called, informing 24 | users that the function is deprecated and suggesting an alternative if provided. 25 | 此装饰器可用于将函数或方法标记为已弃用。 26 | 当调用被装饰的函数时,它将发出警告,通知用户该函数已弃用, 27 | 并在提供替代方案时建议使用替代方案。 28 | 29 | The decorator can be used in two ways: 30 | 装饰器可以通过两种方式使用: 31 | 32 | 1. With no arguments: @deprecated 33 | 2. With an argument specifying the alternative: @deprecated("new_function") 34 | 35 | Args: 36 | use_instead: Optional string indicating the alternative function or method to use. 37 | 可选的字符串,指示要使用的替代函数或方法。 38 | Defaults to None. 39 | 默认为None。 40 | 41 | Returns: 42 | callable: A decorated function that will emit a deprecation warning when called. 43 | 一个装饰后的函数,在调用时会发出弃用警告。 44 | 45 | Example: 46 | >>> @deprecated 47 | ... def old_function(): 48 | ... return "result" 49 | ... 50 | >>> @deprecated("new_improved_function") 51 | ... def another_old_function(): 52 | ... return "result" 53 | ... 54 | >>> # When called, these will emit warnings: 55 | >>> old_function() # Warning: "Call to deprecated function old_function." 56 | >>> another_old_function() # Warning: "Call to deprecated function another_old_function. Use new_improved_function instead." 57 | """ 58 | # Handle case where decorator is used without arguments: @deprecated 59 | # 处理装饰器不带参数使用的情况:@deprecated 60 | if callable(use_instead): 61 | # In this case, use_instead is actually the function being decorated 62 | # 在这种情况下,use_instead实际上是被装饰的函数 63 | func = use_instead 64 | use_instead = None 65 | 66 | # Apply the decorator directly and return the wrapped function 67 | # 直接应用装饰器并返回包装后的函数 68 | @wraps(func) 69 | def wrapped(*args, **kwargs): 70 | message = f"Call to deprecated function {func.__name__}." 71 | warnings.warn(message, category=AioScrapyDeprecationWarning, stacklevel=2) 72 | return func(*args, **kwargs) 73 | return wrapped 74 | 75 | # Handle case where decorator is used with an argument: @deprecated("new_function") 76 | # 处理装饰器带参数使用的情况:@deprecated("new_function") 77 | def deco(func): 78 | @wraps(func) 79 | def wrapped(*args, **kwargs): 80 | # Create the warning message 81 | # 创建警告消息 82 | message = f"Call to deprecated function {func.__name__}." 83 | if use_instead: 84 | message += f" Use {use_instead} instead." 85 | 86 | # Emit the deprecation warning 87 | # 发出弃用警告 88 | warnings.warn(message, category=AioScrapyDeprecationWarning, stacklevel=2) 89 | 90 | # Call the original function 91 | # 调用原始函数 92 | return func(*args, **kwargs) 93 | return wrapped 94 | 95 | return deco 96 | -------------------------------------------------------------------------------- /aioscrapy/utils/reqser.py: -------------------------------------------------------------------------------- 1 | """ 2 | Request serialization utilities for aioscrapy. 3 | aioscrapy的请求序列化实用工具。 4 | 5 | This module provides helper functions for serializing and deserializing Request objects. 6 | These functions are particularly useful for storing requests in queues, databases, 7 | or transmitting them between different processes or systems. 8 | 此模块提供了用于序列化和反序列化Request对象的辅助函数。 9 | 这些函数对于在队列、数据库中存储请求或在不同进程或系统之间传输请求特别有用。 10 | """ 11 | from typing import Optional 12 | 13 | import aioscrapy 14 | from aioscrapy.utils.request import request_from_dict as _from_dict 15 | 16 | 17 | def request_to_dict(request: "aioscrapy.Request", spider: Optional["aioscrapy.Spider"] = None) -> dict: 18 | """ 19 | Convert a Request object to a dictionary representation. 
20 | 将Request对象转换为字典表示。 21 | 22 | This function serializes a Request object into a dictionary that can be easily 23 | stored or transmitted. The dictionary contains all the necessary information 24 | to reconstruct the Request object later using request_from_dict(). 25 | 此函数将Request对象序列化为可以轻松存储或传输的字典。 26 | 该字典包含稍后使用request_from_dict()重建Request对象所需的所有信息。 27 | 28 | Args: 29 | request: The Request object to serialize. 30 | 要序列化的Request对象。 31 | spider: Optional Spider instance that may be used to customize the 32 | serialization process. Some Request subclasses may use the spider 33 | to properly serialize their attributes. 34 | 可选的Spider实例,可用于自定义序列化过程。 35 | 某些Request子类可能使用spider来正确序列化其属性。 36 | 37 | Returns: 38 | dict: A dictionary representation of the Request object. 39 | Request对象的字典表示。 40 | 41 | Example: 42 | >>> request = Request('http://example.com', callback='parse_item') 43 | >>> request_dict = request_to_dict(request, spider) 44 | >>> # The dictionary can be stored or transmitted 45 | >>> new_request = await request_from_dict(request_dict, spider) 46 | """ 47 | # Delegate to the Request object's to_dict method 48 | # 委托给Request对象的to_dict方法 49 | return request.to_dict(spider=spider) 50 | 51 | 52 | async def request_from_dict(d: dict, spider: Optional["aioscrapy.Spider"] = None) -> "aioscrapy.Request": 53 | """ 54 | Convert a dictionary representation back to a Request object. 55 | 将字典表示转换回Request对象。 56 | 57 | This function deserializes a dictionary (previously created by request_to_dict) 58 | back into a Request object. It reconstructs all the attributes and properties 59 | of the original Request, including callback and errback methods if a spider 60 | is provided. 61 | 此函数将(先前由request_to_dict创建的)字典反序列化回Request对象。 62 | 它重建原始Request的所有属性和属性,如果提供了spider, 63 | 还包括回调和错误回调方法。 64 | 65 | Args: 66 | d: The dictionary representation of a Request object. 67 | Request对象的字典表示。 68 | spider: Optional Spider instance that may be used to resolve callback 69 | and errback method names to actual methods on the spider. 70 | 可选的Spider实例,可用于将回调和错误回调方法名称 71 | 解析为spider上的实际方法。 72 | 73 | Returns: 74 | aioscrapy.Request: A reconstructed Request object. 75 | 重建的Request对象。 76 | 77 | Example: 78 | >>> request_dict = { 79 | ... 'url': 'http://example.com', 80 | ... 'callback': 'parse_item', 81 | ... 'method': 'GET' 82 | ... } 83 | >>> request = await request_from_dict(request_dict, spider) 84 | >>> request.url 85 | 'http://example.com' 86 | """ 87 | # Delegate to the imported _from_dict function from aioscrapy.utils.request 88 | # 委托给从aioscrapy.utils.request导入的_from_dict函数 89 | return await _from_dict(d, spider=spider) 90 | -------------------------------------------------------------------------------- /aioscrapy/templates/project/module/middlewares.py.tmpl: -------------------------------------------------------------------------------- 1 | 2 | from aioscrapy import signals, logger 3 | 4 | class ${ProjectName}SpiderMiddleware: 5 | # Not all methods need to be defined. If a method is not defined, 6 | # scrapy acts as if the spider middleware does not modify the 7 | # passed objects. 8 | 9 | @classmethod 10 | def from_crawler(cls, crawler): 11 | # This method is used by Scrapy to create your spiders. 12 | s = cls() 13 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 14 | return s 15 | 16 | async def process_spider_input(self, response, spider): 17 | # Called for each response that goes through the spider 18 | # middleware and into the spider. 
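        # (e.g. the response can be inspected here and an exception raised to drop it
        # before it reaches the spider callbacks)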
19 | 20 | # Should return None or raise an exception. 21 | return None 22 | 23 | async def process_spider_output(self, response, result, spider): 24 | # Called with the results returned from the Spider, after 25 | # it has processed the response. 26 | 27 | # Must return an iterable of Request, or item objects. 28 | async for i in result: 29 | yield i 30 | 31 | async def process_spider_exception(self, response, exception, spider): 32 | # Called when a spider or process_spider_input() method 33 | # (from other spider middleware) raises an exception. 34 | 35 | # Should return either None or an iterable of Request or item objects. 36 | pass 37 | 38 | async def process_start_requests(self, start_requests, spider): 39 | # Called with the start requests of the spider, and works 40 | # similarly to the process_spider_output() method, except 41 | # that it doesn’t have a response associated. 42 | 43 | # Must return only requests (not items). 44 | async for r in start_requests: 45 | yield r 46 | 47 | async def spider_opened(self, spider): 48 | logger.info('Spider opened: %s' % spider.name) 49 | 50 | 51 | class ${ProjectName}DownloaderMiddleware: 52 | # Not all methods need to be defined. If a method is not defined, 53 | # scrapy acts as if the downloader middleware does not modify the 54 | # passed objects. 55 | 56 | @classmethod 57 | def from_crawler(cls, crawler): 58 | # This method is used by Scrapy to create your spiders. 59 | s = cls() 60 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 61 | return s 62 | 63 | async def process_request(self, request, spider): 64 | # Called for each request that goes through the downloader 65 | # middleware. 66 | 67 | # Must either: 68 | # - return None: continue processing this request 69 | # - or return a Response object 70 | # - or return a Request object 71 | # - or raise IgnoreRequest: process_exception() methods of 72 | # installed downloader middleware will be called 73 | return None 74 | 75 | async def process_response(self, request, response, spider): 76 | # Called with the response returned from the downloader. 77 | 78 | # Must either; 79 | # - return a Response object 80 | # - return a Request object 81 | # - or raise IgnoreRequest 82 | return response 83 | 84 | async def process_exception(self, request, exception, spider): 85 | # Called when a download handler or a process_request() 86 | # (from other downloader middleware) raises an exception. 87 | 88 | # Must either: 89 | # - return None: continue processing this exception 90 | # - return a Response object: stops process_exception() chain 91 | # - return a Request object: stops process_exception() chain 92 | pass 93 | 94 | async def spider_opened(self, spider): 95 | logger.info('Spider opened: %s' % spider.name) 96 | -------------------------------------------------------------------------------- /aioscrapy/utils/ossignal.py: -------------------------------------------------------------------------------- 1 | """ 2 | Operating system signal utilities for aioscrapy. 3 | aioscrapy的操作系统信号实用工具。 4 | 5 | This module provides utilities for working with operating system signals in aioscrapy. 6 | It includes functions for installing signal handlers and mapping between signal 7 | numbers and their names. 
8 | 此模块提供了用于处理aioscrapy中操作系统信号的实用工具。 9 | 它包括用于安装信号处理程序以及在信号编号和其名称之间映射的函数。 10 | """ 11 | 12 | import signal 13 | 14 | 15 | # Dictionary mapping signal numbers to their names (e.g., {2: 'SIGINT', 15: 'SIGTERM'}) 16 | # 将信号编号映射到其名称的字典(例如,{2: 'SIGINT', 15: 'SIGTERM'}) 17 | signal_names = {} 18 | 19 | # Populate the signal_names dictionary by iterating through all attributes in the signal module 20 | # 通过迭代信号模块中的所有属性来填充signal_names字典 21 | for signame in dir(signal): 22 | # Only process attributes that start with 'SIG' but not 'SIG_' 23 | # (SIG_ prefixed constants are signal handlers, not signal types) 24 | # 只处理以'SIG'开头但不以'SIG_'开头的属性 25 | # (SIG_前缀的常量是信号处理程序,而不是信号类型) 26 | if signame.startswith('SIG') and not signame.startswith('SIG_'): 27 | # Get the signal number for this signal name 28 | # 获取此信号名称的信号编号 29 | signum = getattr(signal, signame) 30 | # Only add to the dictionary if it's an integer (a valid signal number) 31 | # 只有当它是整数(有效的信号编号)时才添加到字典中 32 | if isinstance(signum, int): 33 | signal_names[signum] = signame 34 | 35 | 36 | def install_shutdown_handlers(function, override_sigint=True): 37 | """ 38 | Install a function as a signal handler for common shutdown signals. 39 | 为常见的关闭信号安装函数作为信号处理程序。 40 | 41 | This function installs the provided function as a handler for common shutdown 42 | signals such as SIGTERM (terminate), SIGINT (keyboard interrupt), and SIGBREAK 43 | (Ctrl-Break on Windows). This is useful for graceful shutdown of applications. 44 | 此函数将提供的函数安装为常见关闭信号的处理程序,如SIGTERM(终止)、 45 | SIGINT(键盘中断)和SIGBREAK(Windows上的Ctrl-Break)。 46 | 这对于应用程序的优雅关闭很有用。 47 | 48 | Args: 49 | function: The function to be called when a shutdown signal is received. 50 | 当收到关闭信号时要调用的函数。 51 | This function should accept two parameters: signal number and frame. 52 | 此函数应接受两个参数:信号编号和帧。 53 | override_sigint: Whether to override an existing SIGINT handler. 54 | 是否覆盖现有的SIGINT处理程序。 55 | If False, the SIGINT handler won't be installed if there's 56 | already a custom handler in place (e.g., a debugger like Pdb). 57 | 如果为False,则在已有自定义处理程序(例如Pdb调试器)的情况下 58 | 不会安装SIGINT处理程序。 59 | Defaults to True. 60 | 默认为True。 61 | 62 | Example: 63 | >>> def handle_shutdown(signum, frame): 64 | ... print(f"Received signal {signal_names.get(signum, signum)}") 65 | ... # Perform cleanup operations 66 | ... sys.exit(0) 67 | >>> install_shutdown_handlers(handle_shutdown) 68 | """ 69 | # Always install handler for SIGTERM (terminate signal) 70 | # 始终为SIGTERM(终止信号)安装处理程序 71 | signal.signal(signal.SIGTERM, function) 72 | 73 | # Install handler for SIGINT (keyboard interrupt) if: 74 | # - The current handler is the default handler, or 75 | # - override_sigint is True (forcing override of any existing handler) 76 | # 在以下情况下为SIGINT(键盘中断)安装处理程序: 77 | # - 当前处理程序是默认处理程序,或 78 | # - override_sigint为True(强制覆盖任何现有处理程序) 79 | if signal.getsignal(signal.SIGINT) == signal.default_int_handler or override_sigint: 80 | signal.signal(signal.SIGINT, function) 81 | 82 | # Install handler for SIGBREAK (Ctrl-Break) on Windows if available 83 | # 如果可用,在Windows上为SIGBREAK(Ctrl-Break)安装处理程序 84 | if hasattr(signal, 'SIGBREAK'): 85 | signal.signal(signal.SIGBREAK, function) 86 | -------------------------------------------------------------------------------- /aioscrapy/libs/pipelines/pg.py: -------------------------------------------------------------------------------- 1 | """ 2 | PostgreSQL Pipeline for AioScrapy 3 | AioScrapy的PostgreSQL管道 4 | 5 | This module provides a pipeline for storing scraped items in a PostgreSQL database. 
6 | It extends the base database pipeline to implement PostgreSQL-specific functionality 7 | for batch inserting items. 8 | 此模块提供了一个用于将抓取的项目存储在PostgreSQL数据库中的管道。 9 | 它扩展了基本数据库管道,以实现PostgreSQL特定的批量插入项目功能。 10 | """ 11 | 12 | from aioscrapy.db import db_manager 13 | from aioscrapy.libs.pipelines import DBPipelineBase 14 | 15 | from aioscrapy.utils.log import logger 16 | 17 | 18 | class PGPipeline(DBPipelineBase): 19 | """ 20 | Pipeline for storing scraped items in a PostgreSQL database. 21 | 用于将抓取的项目存储在PostgreSQL数据库中的管道。 22 | 23 | This pipeline extends the base database pipeline to implement PostgreSQL-specific 24 | functionality for batch inserting items. It uses the database manager to handle 25 | connections and transactions. 26 | 此管道扩展了基本数据库管道,以实现PostgreSQL特定的批量插入项目功能。 27 | 它使用数据库管理器来处理连接和事务。 28 | """ 29 | 30 | @classmethod 31 | def from_settings(cls, settings): 32 | """ 33 | Create a PGPipeline instance from settings. 34 | 从设置创建PGPipeline实例。 35 | 36 | This is the factory method used by AioScrapy to create pipeline instances. 37 | It initializes the pipeline with the appropriate database type ('pg'). 38 | 这是AioScrapy用于创建管道实例的工厂方法。 39 | 它使用适当的数据库类型('pg')初始化管道。 40 | 41 | Args: 42 | settings: The AioScrapy settings object. 43 | AioScrapy设置对象。 44 | 45 | Returns: 46 | PGPipeline: A new PGPipeline instance. 47 | 一个新的PGPipeline实例。 48 | """ 49 | return cls(settings, 'pg') 50 | 51 | async def _save(self, cache_key): 52 | """ 53 | Save cached items with the given cache key to the PostgreSQL database. 54 | 将具有给定缓存键的缓存项目保存到PostgreSQL数据库。 55 | 56 | This method implements the abstract _save method from the base class. 57 | It retrieves the cached items and SQL statement for the given cache key, 58 | then executes a batch insert operation on each configured database connection. 59 | 此方法实现了基类中的抽象_save方法。 60 | 它检索给定缓存键的缓存项目和SQL语句,然后在每个配置的数据库连接上执行批量插入操作。 61 | 62 | Args: 63 | cache_key: The cache key used to retrieve the cached items, SQL statement, 64 | and other metadata needed for the database operation. 65 | 用于检索缓存项目、SQL语句和数据库操作所需的其他元数据的缓存键。 66 | """ 67 | # Get the table name from the cache 68 | # 从缓存获取表名 69 | table_name = self.table_cache[cache_key] 70 | try: 71 | # Process each database alias (connection) configured for this cache key 72 | # 处理为此缓存键配置的每个数据库别名(连接) 73 | for alias in self.db_alias_cache[cache_key]: 74 | # Get a database connection with context manager to ensure proper cleanup 75 | # 使用上下文管理器获取数据库连接,以确保正确清理 76 | async with db_manager.pg.get(alias) as conn: 77 | try: 78 | # Execute the batch insert operation 79 | # 执行批量插入操作 80 | num = await conn.executemany( 81 | self.insert_sql_cache[cache_key], self.item_cache[cache_key] 82 | ) 83 | # Log the result of the operation 84 | # 记录操作结果 85 | logger.info(f'table:{alias}->{table_name} sum:{len(self.item_cache[cache_key])} ok:{num}') 86 | except Exception as e: 87 | # Log any errors that occur during the operation 88 | # 记录操作期间发生的任何错误 89 | logger.exception(f'save data error, table:{alias}->{table_name}, err_msg:{e}') 90 | finally: 91 | # Clear the cache after processing, regardless of success or failure 92 | # 处理后清除缓存,无论成功或失败 93 | self.item_cache[cache_key] = [] 94 | -------------------------------------------------------------------------------- /aioscrapy/libs/pipelines/mysql.py: -------------------------------------------------------------------------------- 1 | """ 2 | MySQL Pipeline for AioScrapy 3 | AioScrapy的MySQL管道 4 | 5 | This module provides a pipeline for storing scraped items in a MySQL database. 
6 | It extends the base database pipeline to implement MySQL-specific functionality 7 | for batch inserting items. 8 | 此模块提供了一个用于将抓取的项目存储在MySQL数据库中的管道。 9 | 它扩展了基本数据库管道,以实现MySQL特定的批量插入项目功能。 10 | """ 11 | 12 | from aioscrapy.db import db_manager 13 | from aioscrapy.libs.pipelines import DBPipelineBase 14 | 15 | from aioscrapy.utils.log import logger 16 | 17 | 18 | class MysqlPipeline(DBPipelineBase): 19 | """ 20 | Pipeline for storing scraped items in a MySQL database. 21 | 用于将抓取的项目存储在MySQL数据库中的管道。 22 | 23 | This pipeline extends the base database pipeline to implement MySQL-specific 24 | functionality for batch inserting items. It uses the database manager to 25 | handle connections and transactions. 26 | 此管道扩展了基本数据库管道,以实现MySQL特定的批量插入项目功能。 27 | 它使用数据库管理器来处理连接和事务。 28 | """ 29 | 30 | @classmethod 31 | def from_settings(cls, settings): 32 | """ 33 | Create a MysqlPipeline instance from settings. 34 | 从设置创建MysqlPipeline实例。 35 | 36 | This is the factory method used by AioScrapy to create pipeline instances. 37 | It initializes the pipeline with the appropriate database type ('mysql'). 38 | 这是AioScrapy用于创建管道实例的工厂方法。 39 | 它使用适当的数据库类型('mysql')初始化管道。 40 | 41 | Args: 42 | settings: The AioScrapy settings object. 43 | AioScrapy设置对象。 44 | 45 | Returns: 46 | MysqlPipeline: A new MysqlPipeline instance. 47 | 一个新的MysqlPipeline实例。 48 | """ 49 | return cls(settings, 'mysql') 50 | 51 | async def _save(self, cache_key): 52 | """ 53 | Save cached items with the given cache key to the MySQL database. 54 | 将具有给定缓存键的缓存项目保存到MySQL数据库。 55 | 56 | This method implements the abstract _save method from the base class. 57 | It retrieves the cached items and SQL statement for the given cache key, 58 | then executes a batch insert operation on each configured database connection. 59 | 此方法实现了基类中的抽象_save方法。 60 | 它检索给定缓存键的缓存项目和SQL语句,然后在每个配置的数据库连接上执行批量插入操作。 61 | 62 | Args: 63 | cache_key: The cache key used to retrieve the cached items, SQL statement, 64 | and other metadata needed for the database operation. 65 | 用于检索缓存项目、SQL语句和数据库操作所需的其他元数据的缓存键。 66 | """ 67 | # Get the table name from the cache 68 | # 从缓存获取表名 69 | table_name = self.table_cache[cache_key] 70 | try: 71 | # Process each database alias (connection) configured for this cache key 72 | # 处理为此缓存键配置的每个数据库别名(连接) 73 | for alias in self.db_alias_cache[cache_key]: 74 | # Get a database connection and cursor with ping to ensure the connection is alive 75 | # 获取数据库连接和游标,并使用ping确保连接处于活动状态 76 | async with db_manager.mysql.get(alias, ping=True) as (conn, cursor): 77 | try: 78 | # Execute the batch insert operation 79 | # 执行批量插入操作 80 | num = await cursor.executemany( 81 | self.insert_sql_cache[cache_key], self.item_cache[cache_key] 82 | ) 83 | # Log the result of the operation 84 | # 记录操作结果 85 | logger.info(f'table:{alias}->{table_name} sum:{len(self.item_cache[cache_key])} ok:{num}') 86 | except Exception as e: 87 | # Log any errors that occur during the operation 88 | # 记录操作期间发生的任何错误 89 | logger.exception(f'save data error, table:{alias}->{table_name}, err_msg:{e}') 90 | finally: 91 | # Clear the cache after processing, regardless of success or failure 92 | # 处理后清除缓存,无论成功或失败 93 | self.item_cache[cache_key] = [] 94 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AioScrapy 2 | 3 | AioScrapy是一个基于Python异步IO的强大网络爬虫框架。它的设计理念源自Scrapy,但完全基于异步IO实现,提供更高的性能和更灵活的配置选项。
4 | AioScrapy is a powerful asynchronous web crawling framework built on Python's asyncio library. It is inspired by Scrapy but completely reimplemented with asynchronous IO, offering higher performance and more flexible configuration options. 5 | 6 | ## 特性 | Features 7 | 8 | - **完全异步**:基于Python的asyncio库,实现高效的并发爬取 9 | - **多种下载处理程序**:支持多种HTTP客户端,包括aiohttp、httpx、requests、pyhttpx、curl_cffi、DrissionPage、playwright和sbcdp 10 | - **灵活的中间件系统**:轻松添加自定义功能和处理逻辑 11 | - **强大的数据处理管道**:支持多种数据库存储选项 12 | - **内置信号系统**:方便的事件处理机制 13 | - **丰富的配置选项**:高度可定制的爬虫行为 14 | - **分布式爬取**:支持使用Redis和RabbitMQ进行分布式爬取 15 | - **数据库集成**:内置支持Redis、MySQL、MongoDB、PostgreSQL和RabbitMQ 16 | 17 | 18 | - **Fully Asynchronous**: Built on Python's asyncio for efficient concurrent crawling 19 | - **Multiple Download Handlers**: Support for various HTTP clients including aiohttp, httpx, requests, pyhttpx, curl_cffi, DrissionPage, playwright and sbcdp 20 | - **Flexible Middleware System**: Easily add custom functionality and processing logic 21 | - **Powerful Data Processing Pipelines**: Support for various database storage options 22 | - **Built-in Signal System**: Convenient event handling mechanism 23 | - **Rich Configuration Options**: Highly customizable crawler behavior 24 | - **Distributed Crawling**: Support for distributed crawling using Redis and RabbitMQ 25 | - **Database Integration**: Built-in support for Redis, MySQL, MongoDB, PostgreSQL, and RabbitMQ 26 | 27 | ## 安装 | Installation 28 | 29 | ### 要求 | Requirements 30 | 31 | - Python 3.9+ 32 | 33 | ### 使用pip安装 | Install with pip 34 | 35 | ```bash 36 | pip install aio-scrapy 37 | 38 | # Install the latest aio-scrapy 39 | # pip install git+https://github.com/ConlinH/aio-scrapy 40 | ``` 41 | 42 | ### 开始 | Start 43 | ```python 44 | from aioscrapy import Spider, logger 45 | 46 | 47 | class MyspiderSpider(Spider): 48 | name = 'myspider' 49 | custom_settings = { 50 | "CLOSE_SPIDER_ON_IDLE": True 51 | } 52 | start_urls = ["https://quotes.toscrape.com"] 53 | 54 | @staticmethod 55 | async def process_request(request, spider): 56 | """ request middleware """ 57 | pass 58 | 59 | @staticmethod 60 | async def process_response(request, response, spider): 61 | """ response middleware """ 62 | return response 63 | 64 | @staticmethod 65 | async def process_exception(request, exception, spider): 66 | """ exception middleware """ 67 | pass 68 | 69 | async def parse(self, response): 70 | for quote in response.css('div.quote'): 71 | item = { 72 | 'author': quote.xpath('span/small/text()').get(), 73 | 'text': quote.css('span.text::text').get(), 74 | } 75 | yield item 76 | 77 | async def process_item(self, item): 78 | logger.info(item) 79 | 80 | 81 | if __name__ == '__main__': 82 | MyspiderSpider.start() 83 | ``` 84 | 85 | ## 文档 | Documentation 86 | 87 | ## 文档目录 | Documentation Contents 88 | - [安装指南 | Installation Guide](docs/installation.md) 89 | - [快速入门 | Quick Start](docs/quickstart.md) 90 | - [核心概念 | Core Concepts](docs/concepts.md) 91 | - [爬虫指南 | Spider Guide](docs/spiders.md) 92 | - [下载器 | Downloaders](docs/downloaders.md) 93 | - [中间件 | Middlewares](docs/middlewares.md) 94 | - [管道 | Pipelines](docs/pipelines.md) 95 | - [队列 | Queues](docs/queues.md) 96 | - [请求过滤器 | Request Filters](docs/dupefilters.md) 97 | - [代理 | Proxy](docs/proxy.md) 98 | - [数据库连接 | Database Connections](docs/databases.md) 99 | - [分布式部署 | Distributed Deployment](docs/distributed.md) 100 | - [配置参考 | Settings Reference](docs/settings.md) 101 | - [API参考 | API Reference](docs/api.md) 102 | - [示例 | Example](example) 103 | 104 | ## 
许可证 | License 105 | 106 | 本项目采用MIT许可证 - 详情请查看LICENSE文件。
107 | This project is licensed under the MIT License - see the LICENSE file for details. 108 | 109 | 110 | ## 联系 | Contact 111 | QQ: 995018884
112 | WeChat: h995018884 -------------------------------------------------------------------------------- /example/projectspider/redisdemo/middlewares.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your spider middleware 2 | 3 | from aioscrapy import signals, logger 4 | 5 | 6 | class DemoDownloaderMiddleware: 7 | # Not all methods need to be defined. If a method is not defined, 8 | # scrapy acts as if the downloader middleware does not modify the 9 | # passed objects. 10 | 11 | @classmethod 12 | def from_crawler(cls, crawler): 13 | # This method is used by Scrapy to create your spiders. 14 | s = cls() 15 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 16 | return s 17 | 18 | async def process_request(self, request, spider): 19 | # Called for each request that goes through the downloader 20 | # middleware. 21 | 22 | # Must either: 23 | # - return None: continue processing this request 24 | # - or return a Response object 25 | # - or return a Request object 26 | # - or raise IgnoreRequest: process_exception() methods of 27 | # installed downloader middleware will be called 28 | request.headers = { 29 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36' 30 | } 31 | return None 32 | 33 | async def process_response(self, request, response, spider): 34 | # Called with the response returned from the downloader. 35 | 36 | # Must either; 37 | # - return a Response object 38 | # - return a Request object 39 | # - or raise IgnoreRequest 40 | if response.status in [401, 403]: 41 | return request 42 | return response 43 | 44 | async def process_exception(self, request, exception, spider): 45 | # Called when a download handler or a process_request() 46 | # (from other downloader middleware) raises an exception. 47 | 48 | # Must either: 49 | # - return None: continue processing this exception 50 | # - return a Response object: stops process_exception() chain 51 | # - return a Request object: stops process_exception() chain 52 | pass 53 | 54 | async def spider_opened(self, spider): 55 | logger.info('Spider opened: %s' % spider.name) 56 | 57 | 58 | class DemoSpiderMiddleware: 59 | # Not all methods need to be defined. If a method is not defined, 60 | # scrapy acts as if the spider middleware does not modify the 61 | # passed objects. 62 | 63 | @classmethod 64 | def from_crawler(cls, crawler): 65 | # This method is used by Scrapy to create your spiders. 66 | s = cls() 67 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 68 | return s 69 | 70 | async def process_spider_input(self, response, spider): 71 | # Called for each response that goes through the spider 72 | # middleware and into the spider. 73 | 74 | # Should return None or raise an exception. 75 | return None 76 | 77 | async def process_spider_output(self, response, result, spider): 78 | # Called with the results returned from the Spider, after 79 | # it has processed the response. 80 | 81 | # Must return an iterable of Request, or item objects. 82 | async for i in result: 83 | yield i 84 | 85 | async def process_spider_exception(self, response, exception, spider): 86 | # Called when a spider or process_spider_input() method 87 | # (from other spider middleware) raises an exception. 88 | 89 | # Should return either None or an iterable of Request or item objects. 
90 | pass 91 | 92 | async def process_start_requests(self, start_requests, spider): 93 | # Called with the start requests of the spider, and works 94 | # similarly to the process_spider_output() method, except 95 | # that it doesn’t have a response associated. 96 | 97 | # Must return only requests (not items). 98 | async for r in start_requests: 99 | yield r 100 | 101 | async def spider_opened(self, spider): 102 | logger.info('Spider opened: %s' % spider.name) 103 | -------------------------------------------------------------------------------- /aioscrapy/libs/downloader/useragent.py: -------------------------------------------------------------------------------- 1 | """ 2 | User-Agent Middleware 3 | 用户代理中间件 4 | 5 | This middleware sets the User-Agent header for all requests, using either a 6 | spider-specific user_agent attribute, or a default value from the USER_AGENT setting. 7 | 此中间件为所有请求设置User-Agent头,使用爬虫特定的user_agent属性, 8 | 或来自USER_AGENT设置的默认值。 9 | 10 | The User-Agent header is important for identifying your crawler to websites and 11 | can affect how websites respond to your requests. 12 | User-Agent头对于向网站标识您的爬虫很重要,可能会影响网站对您的请求的响应方式。 13 | """ 14 | 15 | from aioscrapy import signals 16 | 17 | 18 | class UserAgentMiddleware: 19 | """ 20 | Middleware for setting the User-Agent header on requests. 21 | 用于在请求上设置User-Agent头的中间件。 22 | 23 | This middleware allows spiders to override the default User-Agent by specifying 24 | a user_agent attribute. If no spider-specific User-Agent is defined, it uses 25 | the default value from the USER_AGENT setting. 26 | 此中间件允许爬虫通过指定user_agent属性来覆盖默认的User-Agent。 27 | 如果未定义爬虫特定的User-Agent,则使用USER_AGENT设置中的默认值。 28 | """ 29 | 30 | def __init__(self, user_agent='Scrapy'): 31 | """ 32 | Initialize the UserAgentMiddleware. 33 | 初始化UserAgentMiddleware。 34 | 35 | Args: 36 | user_agent: The default User-Agent string to use. 37 | 要使用的默认User-Agent字符串。 38 | Defaults to 'Scrapy'. 39 | 默认为'Scrapy'。 40 | """ 41 | # Store the default user agent 42 | # 存储默认用户代理 43 | self.user_agent = user_agent 44 | 45 | @classmethod 46 | def from_crawler(cls, crawler): 47 | """ 48 | Create a UserAgentMiddleware instance from a crawler. 49 | 从爬虫创建UserAgentMiddleware实例。 50 | 51 | This is the factory method used by AioScrapy to create the middleware. 52 | 这是AioScrapy用于创建中间件的工厂方法。 53 | 54 | Args: 55 | crawler: The crawler that will use this middleware. 56 | 将使用此中间件的爬虫。 57 | 58 | Returns: 59 | UserAgentMiddleware: A new UserAgentMiddleware instance. 60 | 一个新的UserAgentMiddleware实例。 61 | """ 62 | # Create a new instance with the user agent from settings 63 | # 使用来自设置的用户代理创建一个新实例 64 | o = cls(crawler.settings['USER_AGENT']) 65 | 66 | # Connect to the spider_opened signal 67 | # 连接到spider_opened信号 68 | crawler.signals.connect(o.spider_opened, signal=signals.spider_opened) 69 | 70 | # Return the new instance 71 | # 返回新实例 72 | return o 73 | 74 | def spider_opened(self, spider): 75 | """ 76 | Handle the spider_opened signal. 77 | 处理spider_opened信号。 78 | 79 | This method is called when a spider is opened. It updates the user agent 80 | with the spider's user_agent attribute if it exists. 81 | 当爬虫打开时调用此方法。如果存在,它会使用爬虫的user_agent属性更新用户代理。 82 | 83 | Args: 84 | spider: The spider that was opened. 
85 | 被打开的爬虫。 86 | """ 87 | # Update the user agent with the spider's user_agent attribute if it exists 88 | # 如果存在,则使用爬虫的user_agent属性更新用户代理 89 | self.user_agent = getattr(spider, 'user_agent', self.user_agent) 90 | 91 | def process_request(self, request, spider): 92 | """ 93 | Process a request before it is sent to the downloader. 94 | 在请求发送到下载器之前处理它。 95 | 96 | This method sets the User-Agent header in the request if it's not already set 97 | and if a user agent is configured. 98 | 如果尚未设置User-Agent头且已配置用户代理,此方法会在请求中设置它。 99 | 100 | Args: 101 | request: The request being processed. 102 | 正在处理的请求。 103 | spider: The spider that generated the request. 104 | 生成请求的爬虫。 105 | 106 | Returns: 107 | None: This method does not return a response or a deferred. 108 | 此方法不返回响应或延迟对象。 109 | """ 110 | # Set the User-Agent header in the request if it's not already set 111 | # and if a user agent is configured 112 | # 如果尚未设置User-Agent头且已配置用户代理,则在请求中设置它 113 | if self.user_agent: 114 | request.headers.setdefault('User-Agent', self.user_agent) 115 | -------------------------------------------------------------------------------- /aioscrapy/libs/downloader/downloadtimeout.py: -------------------------------------------------------------------------------- 1 | """ 2 | Download Timeout Middleware 3 | 下载超时中间件 4 | 5 | This middleware sets a default timeout for all requests, as specified in the 6 | DOWNLOAD_TIMEOUT setting or the spider's download_timeout attribute. The timeout 7 | can be overridden on a per-request basis by setting the 'download_timeout' key 8 | in the request's meta dictionary. 9 | 此中间件为所有请求设置默认超时,如DOWNLOAD_TIMEOUT设置或爬虫的download_timeout 10 | 属性中指定的那样。可以通过在请求的meta字典中设置'download_timeout'键来覆盖每个 11 | 请求的超时。 12 | """ 13 | 14 | from aioscrapy import signals 15 | 16 | 17 | class DownloadTimeoutMiddleware: 18 | """ 19 | Middleware for setting default download timeouts on requests. 20 | 用于在请求上设置默认下载超时的中间件。 21 | 22 | This middleware sets a default timeout for all outgoing requests, as specified in the 23 | DOWNLOAD_TIMEOUT setting or the spider's download_timeout attribute. The timeout 24 | can be overridden on a per-request basis by setting the 'download_timeout' key 25 | in the request's meta dictionary. 26 | 此中间件为所有传出请求设置默认超时,如DOWNLOAD_TIMEOUT设置或爬虫的download_timeout 27 | 属性中指定的那样。可以通过在请求的meta字典中设置'download_timeout'键来覆盖每个 28 | 请求的超时。 29 | """ 30 | 31 | def __init__(self, timeout=180): 32 | """ 33 | Initialize the DownloadTimeoutMiddleware. 34 | 初始化DownloadTimeoutMiddleware。 35 | 36 | Args: 37 | timeout: The default download timeout in seconds. 38 | 默认下载超时(以秒为单位)。 39 | Defaults to 180 seconds. 40 | 默认为180秒。 41 | """ 42 | # Store the default timeout 43 | # 存储默认超时 44 | self._timeout = timeout 45 | 46 | @classmethod 47 | def from_crawler(cls, crawler): 48 | """ 49 | Create a DownloadTimeoutMiddleware instance from a crawler. 50 | 从爬虫创建DownloadTimeoutMiddleware实例。 51 | 52 | This is the factory method used by AioScrapy to create the middleware. 53 | 这是AioScrapy用于创建中间件的工厂方法。 54 | 55 | Args: 56 | crawler: The crawler that will use this middleware. 57 | 将使用此中间件的爬虫。 58 | 59 | Returns: 60 | DownloadTimeoutMiddleware: A new DownloadTimeoutMiddleware instance. 
61 | 一个新的DownloadTimeoutMiddleware实例。 62 | """ 63 | # Create a new instance with the timeout from settings 64 | # 使用来自设置的超时创建一个新实例 65 | o = cls(crawler.settings.getfloat('DOWNLOAD_TIMEOUT')) 66 | 67 | # Connect to the spider_opened signal 68 | # 连接到spider_opened信号 69 | crawler.signals.connect(o.spider_opened, signal=signals.spider_opened) 70 | 71 | # Return the new instance 72 | # 返回新实例 73 | return o 74 | 75 | def spider_opened(self, spider): 76 | """ 77 | Handle the spider_opened signal. 78 | 处理spider_opened信号。 79 | 80 | This method is called when a spider is opened. It updates the default timeout 81 | with the spider's download_timeout attribute if it exists. 82 | 当爬虫打开时调用此方法。如果存在,它会使用爬虫的download_timeout属性更新默认超时。 83 | 84 | Args: 85 | spider: The spider that was opened. 86 | 被打开的爬虫。 87 | """ 88 | # Update the timeout with the spider's download_timeout attribute if it exists 89 | # 如果存在,则使用爬虫的download_timeout属性更新超时 90 | self._timeout = getattr(spider, 'download_timeout', self._timeout) 91 | 92 | def process_request(self, request, spider): 93 | """ 94 | Process a request before it is sent to the downloader. 95 | 在请求发送到下载器之前处理它。 96 | 97 | This method sets the default download timeout in the request's meta dictionary 98 | if it's not already set and if a default timeout is configured. 99 | 如果尚未设置默认下载超时且已配置默认超时,此方法会在请求的meta字典中设置它。 100 | 101 | Args: 102 | request: The request being processed. 103 | 正在处理的请求。 104 | spider: The spider that generated the request. 105 | 生成请求的爬虫。 106 | 107 | Returns: 108 | None: This method does not return a response or a deferred. 109 | 此方法不返回响应或延迟对象。 110 | """ 111 | # Set the default download timeout in the request's meta if it's not already set 112 | # and if a default timeout is configured 113 | # 如果尚未设置默认下载超时且已配置默认超时,则在请求的meta中设置它 114 | if self._timeout: 115 | request.meta.setdefault('download_timeout', self._timeout) 116 | -------------------------------------------------------------------------------- /aioscrapy/utils/template.py: -------------------------------------------------------------------------------- 1 | """ 2 | Template utility functions for aioscrapy. 3 | aioscrapy的模板实用函数。 4 | 5 | This module provides utility functions for working with templates in aioscrapy. 6 | It includes functions for rendering template files and string transformations 7 | commonly used in code generation. 8 | 此模块提供了用于处理aioscrapy中模板的实用函数。 9 | 它包括用于渲染模板文件和在代码生成中常用的字符串转换的函数。 10 | """ 11 | 12 | import os 13 | import re 14 | import string 15 | 16 | 17 | def render_templatefile(path, **kwargs): 18 | """ 19 | Render a template file with the given parameters. 20 | 使用给定参数渲染模板文件。 21 | 22 | This function reads a template file, substitutes variables using Python's 23 | string.Template, and writes the result back to the file system. If the file 24 | has a '.tmpl' extension, it will be renamed to remove this extension after 25 | rendering. 26 | 此函数读取模板文件,使用Python的string.Template替换变量, 27 | 并将结果写回文件系统。如果文件有'.tmpl'扩展名, 28 | 渲染后将重命名以删除此扩展名。 29 | 30 | The template uses the syntax defined by string.Template, where variables are 31 | marked with a $ prefix (e.g., $variable or ${variable}). 32 | 模板使用string.Template定义的语法,其中变量用$前缀标记 33 | (例如,$variable或${variable})。 34 | 35 | Args: 36 | path: Path to the template file to render. 37 | 要渲染的模板文件的路径。 38 | **kwargs: Variables to substitute in the template. 39 | 要在模板中替换的变量。 40 | 41 | Example: 42 | >>> render_templatefile('spider.py.tmpl', 43 | ... classname='MySpider', 44 | ... 
domain='example.com') 45 | 46 | Note: 47 | This function modifies the file system by: 48 | 此函数通过以下方式修改文件系统: 49 | 1. Potentially renaming the template file (if it ends with .tmpl) 50 | 可能重命名模板文件(如果以.tmpl结尾) 51 | 2. Writing the rendered content to the target file 52 | 将渲染的内容写入目标文件 53 | """ 54 | # Read the template file as UTF-8 55 | # 以UTF-8格式读取模板文件 56 | with open(path, 'rb') as fp: 57 | raw = fp.read().decode('utf8') 58 | 59 | # Substitute variables in the template 60 | # 替换模板中的变量 61 | content = string.Template(raw).substitute(**kwargs) 62 | 63 | # Determine the output path (remove .tmpl extension if present) 64 | # 确定输出路径(如果存在,则删除.tmpl扩展名) 65 | render_path = path[:-len('.tmpl')] if path.endswith('.tmpl') else path 66 | 67 | # Rename the file if it has a .tmpl extension 68 | # 如果文件有.tmpl扩展名,则重命名文件 69 | if path.endswith('.tmpl'): 70 | os.rename(path, render_path) 71 | 72 | # Write the rendered content back to the file 73 | # 将渲染的内容写回文件 74 | with open(render_path, 'wb') as fp: 75 | fp.write(content.encode('utf8')) 76 | 77 | 78 | # Regular expression pattern to match characters that are not letters or digits 79 | # Used by string_camelcase to remove invalid characters when converting to CamelCase 80 | # 匹配非字母或数字的字符的正则表达式模式 81 | # 由string_camelcase用于在转换为驼峰命名法时删除无效字符 82 | CAMELCASE_INVALID_CHARS = re.compile(r'[^a-zA-Z\d]') 83 | 84 | 85 | def string_camelcase(string): 86 | """ 87 | Convert a string to CamelCase and remove invalid characters. 88 | 将字符串转换为驼峰命名法并删除无效字符。 89 | 90 | This function converts a string to CamelCase by: 91 | 1. Capitalizing the first letter of each word (using str.title()) 92 | 2. Removing all non-alphanumeric characters (using CAMELCASE_INVALID_CHARS regex) 93 | 94 | 此函数通过以下方式将字符串转换为驼峰命名法: 95 | 1. 将每个单词的首字母大写(使用str.title()) 96 | 2. 删除所有非字母数字字符(使用CAMELCASE_INVALID_CHARS正则表达式) 97 | 98 | This is commonly used in code generation to convert variable names or 99 | identifiers from different formats (snake_case, kebab-case, etc.) to CamelCase. 100 | 这在代码生成中常用于将变量名或标识符从不同格式 101 | (snake_case、kebab-case等)转换为驼峰命名法。 102 | 103 | Args: 104 | string: The input string to convert to CamelCase. 105 | 要转换为驼峰命名法的输入字符串。 106 | 107 | Returns: 108 | str: The CamelCase version of the input string with invalid characters removed. 
109 | 输入字符串的驼峰命名法版本,已删除无效字符。 110 | 111 | Examples: 112 | >>> string_camelcase('lost-pound') 113 | 'LostPound' 114 | 115 | >>> string_camelcase('missing_images') 116 | 'MissingImages' 117 | 118 | >>> string_camelcase('hello world') 119 | 'HelloWorld' 120 | """ 121 | # Convert to title case (capitalize first letter of each word) and remove invalid chars 122 | # 转换为标题大小写(每个单词的首字母大写)并删除无效字符 123 | return CAMELCASE_INVALID_CHARS.sub('', string.title()) 124 | -------------------------------------------------------------------------------- /aioscrapy/commands/startproject.py: -------------------------------------------------------------------------------- 1 | import re 2 | import os 3 | import string 4 | from importlib import import_module 5 | from os.path import join, exists, abspath 6 | from shutil import ignore_patterns, move, copy2, copystat 7 | from stat import S_IWUSR as OWNER_WRITE_PERMISSION 8 | 9 | import aioscrapy 10 | from aioscrapy.commands import AioScrapyCommand 11 | from aioscrapy.utils.template import render_templatefile, string_camelcase 12 | from aioscrapy.exceptions import UsageError 13 | 14 | 15 | TEMPLATES_TO_RENDER = ( 16 | ('aioscrapy.cfg',), 17 | ('${project_name}', 'settings.py.tmpl'), 18 | ('${project_name}', 'pipelines.py.tmpl'), 19 | ('${project_name}', 'middlewares.py.tmpl'), 20 | ) 21 | 22 | IGNORE = ignore_patterns('*.pyc', '__pycache__', '.svn') 23 | 24 | 25 | def _make_writable(path): 26 | current_permissions = os.stat(path).st_mode 27 | os.chmod(path, current_permissions | OWNER_WRITE_PERMISSION) 28 | 29 | 30 | class Command(AioScrapyCommand): 31 | 32 | requires_project = False 33 | default_settings = {'LOG_ENABLED': False, 34 | 'SPIDER_LOADER_WARN_ONLY': True} 35 | 36 | def syntax(self): 37 | return " [project_dir]" 38 | 39 | def short_desc(self): 40 | return "Create new project" 41 | 42 | def _is_valid_name(self, project_name): 43 | def _module_exists(module_name): 44 | try: 45 | import_module(module_name) 46 | return True 47 | except ImportError: 48 | return False 49 | 50 | if not re.search(r'^[_a-zA-Z]\w*$', project_name): 51 | print('Error: Project names must begin with a letter and contain' 52 | ' only\nletters, numbers and underscores') 53 | elif _module_exists(project_name): 54 | print(f'Error: Module {project_name!r} already exists') 55 | else: 56 | return True 57 | return False 58 | 59 | def _copytree(self, src, dst): 60 | """ 61 | Since the original function always creates the directory, to resolve 62 | the issue a new function had to be created. It's a simple copy and 63 | was reduced for this case. 
64 | """ 65 | ignore = IGNORE 66 | names = os.listdir(src) 67 | ignored_names = ignore(src, names) 68 | 69 | if not os.path.exists(dst): 70 | os.makedirs(dst) 71 | 72 | for name in names: 73 | if name in ignored_names: 74 | continue 75 | 76 | srcname = os.path.join(src, name) 77 | dstname = os.path.join(dst, name) 78 | if os.path.isdir(srcname): 79 | self._copytree(srcname, dstname) 80 | else: 81 | copy2(srcname, dstname) 82 | _make_writable(dstname) 83 | 84 | copystat(src, dst) 85 | _make_writable(dst) 86 | 87 | def run(self, args, opts): 88 | if len(args) not in (1, 2): 89 | raise UsageError() 90 | 91 | project_name = args[0] 92 | project_dir = args[0] 93 | 94 | if len(args) == 2: 95 | project_dir = args[1] 96 | 97 | if exists(join(project_dir, 'aioscrapy.cfg')): 98 | self.exitcode = 1 99 | print(f'Error: aioscrapy.cfg already exists in {abspath(project_dir)}') 100 | return 101 | 102 | if not self._is_valid_name(project_name): 103 | self.exitcode = 1 104 | return 105 | 106 | self._copytree(self.templates_dir, abspath(project_dir)) 107 | move(join(project_dir, 'module'), join(project_dir, project_name)) 108 | for paths in TEMPLATES_TO_RENDER: 109 | path = join(*paths) 110 | tplfile = join(project_dir, string.Template(path).substitute(project_name=project_name)) 111 | render_templatefile(tplfile, project_name=project_name, ProjectName=string_camelcase(project_name)) 112 | print(f"New aiocrapy project '{project_name}', using template directory " 113 | f"'{self.templates_dir}', created in:") 114 | print(f" {abspath(project_dir)}\n") 115 | print("You can start your first spider with:") 116 | print(f" cd {project_dir}") 117 | print(" aioscrapy genspider example example.com") 118 | 119 | @property 120 | def templates_dir(self): 121 | return join( 122 | self.settings['TEMPLATES_DIR'] or join(aioscrapy.__path__[0], 'templates'), 123 | 'project' 124 | ) 125 | -------------------------------------------------------------------------------- /aioscrapy/http/request/form.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | Form request implementation for aioscrapy. 4 | aioscrapy的表单请求实现。 5 | 6 | This module provides the FormRequest class, which is a specialized Request 7 | that handles HTML form submission, both GET and POST methods. 8 | 此模块提供了FormRequest类,这是一个专门处理HTML表单提交的Request, 9 | 支持GET和POST方法。 10 | """ 11 | 12 | from typing import List, Optional, Tuple, Union 13 | from urllib.parse import urlencode 14 | 15 | from aioscrapy.http.request import Request 16 | from aioscrapy.utils.python import to_bytes, is_listlike 17 | 18 | # Type definition for form data, which can be a dictionary or a list of key-value tuples 19 | # 表单数据的类型定义,可以是字典或键值元组列表 20 | FormdataType = Optional[Union[dict, List[Tuple[str, str]]]] 21 | 22 | 23 | class FormRequest(Request): 24 | """ 25 | A Request that submits HTML form data. 26 | 提交HTML表单数据的Request。 27 | 28 | This class extends the base Request to handle form submissions, 29 | automatically setting the appropriate method, headers, and 30 | encoding the form data either in the URL (for GET requests) 31 | or in the body (for POST requests). 32 | 此类扩展了基本Request以处理表单提交,自动设置适当的方法、 33 | 头部,并将表单数据编码到URL中(对于GET请求)或请求体中 34 | (对于POST请求)。 35 | """ 36 | 37 | # Valid HTTP methods for form submission 38 | # 表单提交的有效HTTP方法 39 | valid_form_methods = ['GET', 'POST'] 40 | 41 | def __init__(self, *args, formdata: FormdataType = None, **kwargs) -> None: 42 | """ 43 | Initialize a FormRequest. 
44 | 初始化FormRequest。 45 | 46 | This constructor extends the base Request constructor to handle form data. 47 | If form data is provided and no method is specified, it defaults to POST. 48 | 此构造函数扩展了基本Request构造函数以处理表单数据。 49 | 如果提供了表单数据且未指定方法,则默认为POST。 50 | 51 | Args: 52 | *args: Positional arguments passed to the Request constructor. 53 | 传递给Request构造函数的位置参数。 54 | formdata: Form data to submit, either as a dict or a list of (name, value) tuples. 55 | 要提交的表单数据,可以是字典或(名称, 值)元组的列表。 56 | **kwargs: Keyword arguments passed to the Request constructor. 57 | 传递给Request构造函数的关键字参数。 58 | """ 59 | # Default to POST method if form data is provided and no method is specified 60 | # 如果提供了表单数据且未指定方法,则默认为POST方法 61 | if formdata and kwargs.get('method') is None: 62 | kwargs['method'] = 'POST' 63 | 64 | # Initialize the base Request 65 | # 初始化基本Request 66 | super().__init__(*args, **kwargs) 67 | 68 | # Process form data if provided 69 | # 如果提供了表单数据,则处理它 70 | if formdata: 71 | # Convert dict to items() iterator if necessary 72 | # 如果需要,将字典转换为items()迭代器 73 | items = formdata.items() if isinstance(formdata, dict) else formdata 74 | 75 | # URL-encode the form data 76 | # URL编码表单数据 77 | form_query: str = _urlencode(items, self.encoding) 78 | 79 | if self.method == 'POST': 80 | # For POST requests, set the Content-Type header and put form data in the body 81 | # 对于POST请求,设置Content-Type头部并将表单数据放入请求体 82 | self.headers.setdefault('Content-Type', 'application/x-www-form-urlencoded') 83 | self._set_body(form_query) 84 | else: 85 | # For GET requests, append form data to the URL 86 | # 对于GET请求,将表单数据附加到URL 87 | self._set_url(self.url + ('&' if '?' in self.url else '?') + form_query) 88 | 89 | 90 | def _urlencode(seq, enc): 91 | """ 92 | URL-encode a sequence of form data. 93 | URL编码表单数据序列。 94 | 95 | This internal function handles the encoding of form data for submission, 96 | converting keys and values to bytes using the specified encoding and 97 | properly handling list-like values. 98 | 此内部函数处理表单数据的编码以便提交,使用指定的编码将键和值转换为字节, 99 | 并正确处理类似列表的值。 100 | 101 | Args: 102 | seq: A sequence of (name, value) pairs to encode. 103 | 要编码的(名称, 值)对序列。 104 | enc: The encoding to use for converting strings to bytes. 105 | 用于将字符串转换为字节的编码。 106 | 107 | Returns: 108 | str: The URL-encoded form data string. 109 | URL编码的表单数据字符串。 110 | """ 111 | # Convert each key-value pair to bytes and handle list-like values 112 | # 将每个键值对转换为字节并处理类似列表的值 113 | values = [(to_bytes(k, enc), to_bytes(v, enc)) 114 | for k, vs in seq 115 | for v in (vs if is_listlike(vs) else [vs])] 116 | 117 | # Use urllib's urlencode with doseq=1 to properly handle sequences 118 | # 使用urllib的urlencode,doseq=1以正确处理序列 119 | return urlencode(values, doseq=1) 120 | -------------------------------------------------------------------------------- /aioscrapy/libs/pipelines/redis.py: -------------------------------------------------------------------------------- 1 | """ 2 | Redis Pipeline for AioScrapy 3 | AioScrapy的Redis管道 4 | 5 | This module provides a pipeline for storing scraped items in a redis. 6 | 此模块提供了一个用于将抓取的项目存储在Redis数据库中的管道。 7 | """ 8 | 9 | import ujson 10 | from aioscrapy.db import db_manager 11 | from aioscrapy.libs.pipelines import DBPipelineBase 12 | 13 | from aioscrapy.utils.log import logger 14 | 15 | class RedisPipeline(DBPipelineBase): 16 | """ 17 | Pipeline for storing scraped items in Redis. 18 | 用于将抓取的项目存储到Redis的管道。 19 | 20 | This pipeline extends the base database pipeline to implement Redis-specific 21 | functionality for batch inserting items. 
It uses the database manager to handle 22 | connections and operations. 23 | 此管道扩展了基本数据库管道,以实现Redis特定的批量插入项目功能。 24 | 它使用数据库管理器来处理连接和操作。 25 | """ 26 | def __init__(self, settings, db_type: str): 27 | """ 28 | Initialize the RedisPipeline instance. 29 | 初始化RedisPipeline实例。 30 | 31 | Args: 32 | settings: The AioScrapy settings object. 33 | AioScrapy设置对象。 34 | db_type: The type of database, should be 'redis'. 35 | 数据库类型,应为'redis'。 36 | """ 37 | super().__init__(settings, db_type) 38 | 39 | self.db_cache = {} # Cache of database aliases 缓存数据库别名 40 | self.key_name_cache = {} # Cache of Redis key names 缓存Redis键名 41 | self.insert_method_cache = {} # Cache of insert method names 缓存插入方法名 42 | self.item_cache = {} # Cache of items waiting to be inserted 缓存待插入的项目 43 | 44 | @classmethod 45 | def from_settings(cls, settings): 46 | """ 47 | Create a RedisPipeline instance from settings. 48 | 从设置创建RedisPipeline实例。 49 | 50 | Returns: 51 | RedisPipeline: A new RedisPipeline instance. 52 | 一个新的RedisPipeline实例。 53 | """ 54 | return cls(settings, 'redis') 55 | 56 | def parse_item_to_cache(self, item: dict, save_info: dict): 57 | """ 58 | Parse item and save info to cache for batch processing. 59 | 解析item和保存信息到缓存以便批量处理。 60 | 61 | Args: 62 | item: The item to be cached. 63 | 要缓存的项目。 64 | save_info: Information about how and where to save the item. 65 | 关于如何以及在哪里保存项目的信息。 66 | 67 | Returns: 68 | cache_key: The key used for caching. 69 | 用于缓存的键。 70 | count: Number of items currently cached under this key. 71 | 当前此键下缓存的项目数量。 72 | """ 73 | key_name = save_info.get("key_name") 74 | insert_method = save_info.get("insert_method") 75 | 76 | assert key_name is not None, "please set key_name" # key_name must be set 必须设置key_name 77 | assert insert_method is not None, "please set insert_method" # insert_method must be set 必须设置insert_method 78 | 79 | db_alias = save_info.get("db_alias", ["default"]) # Get the database aliases, defaulting to "default" 获取数据库别名,默认为"default" 80 | 81 | if isinstance(db_alias, str): 82 | db_alias = [db_alias] # Convert to a list if it is a string 如果是字符串则转为列表 83 | 84 | cache_key = "-".join([key_name, insert_method]) # Build the cache key 生成缓存键 85 | 86 | if self.db_cache.get(cache_key) is None: 87 | self.db_cache[cache_key] = db_alias # Cache the database aliases 缓存数据库别名 88 | self.key_name_cache[cache_key] = key_name # Cache the Redis key name 缓存Redis键名 89 | self.insert_method_cache[cache_key] = insert_method # Cache the insert method name 缓存插入方法名 90 | self.item_cache[cache_key] = [] # Initialize the item cache list 初始化项目缓存列表 91 | 92 | self.item_cache[cache_key].append(item) # Add the item to the cache 添加项目到缓存 93 | 94 | return cache_key, len(self.item_cache[cache_key]) # Return the cache key and current cache size 返回缓存键和当前缓存数量 95 | 96 | async def _save(self, cache_key): 97 | """ 98 | Save cached items with the given cache key to Redis. 99 | 将具有给定缓存键的缓存项目保存到Redis。 100 | 101 | Args: 102 | cache_key: The cache key used to retrieve the cached items and metadata.
103 | 用于检索缓存项目和元数据的缓存键。 104 | """ 105 | insert_method_name = self.insert_method_cache[cache_key] # Get the insert method name 获取插入方法名 106 | key_name = self.key_name_cache[cache_key] # Get the Redis key name 获取Redis键名 107 | items = self.item_cache[cache_key] # Get the list of items to be inserted 获取待插入项目列表 108 | 109 | try: 110 | for alias in self.db_cache[cache_key]: # Iterate over all database aliases 遍历所有数据库别名 111 | try: 112 | executor = db_manager.redis.executor(alias) # Get the Redis executor 获取Redis执行器 113 | insert_method = getattr(executor, insert_method_name) # Get the insert method 获取插入方法 114 | # Batch insert the items into Redis 批量插入项目到Redis 115 | result = await insert_method(key_name, *[ujson.dumps(item) for item in items]) 116 | logger.info( 117 | f"redis:{alias}->{key_name} sum:{len(items)} ok:{result}" 118 | ) # Log the insert result 记录插入结果 119 | except Exception as e: 120 | logger.exception(f'redis:push data error: {e}') # Log the exception 记录异常 121 | finally: 122 | self.item_cache[cache_key] = [] # Clear the cache regardless of success or failure 清空缓存,无论成功或失败 123 | -------------------------------------------------------------------------------- /aioscrapy/middleware/itempipeline.py: -------------------------------------------------------------------------------- 1 | """ 2 | Item Pipeline Manager Module 3 | 项目管道管理器模块 4 | 5 | This module provides the ItemPipelineManager class, which manages the execution 6 | of item pipeline components. Item pipelines are components that process items 7 | after they have been extracted by spiders, typically for cleaning, validation, 8 | persistence, or other post-processing tasks. 9 | 此模块提供了ItemPipelineManager类,用于管理项目管道组件的执行。项目管道是 10 | 在项目被爬虫提取后处理项目的组件,通常用于清洗、验证、持久化或其他后处理任务。 11 | 12 | Item pipelines are loaded from the ITEM_PIPELINES setting and are executed in 13 | the order specified by their priority values. Each pipeline component can process 14 | an item and either return it for further processing, drop it, or raise an exception. 15 | 项目管道从ITEM_PIPELINES设置加载,并按照其优先级值指定的顺序执行。每个管道组件 16 | 可以处理一个项目,并返回它以供进一步处理、丢弃它或引发异常。 17 | """ 18 | from aioscrapy.middleware.absmanager import AbsMiddlewareManager 19 | from aioscrapy.utils.conf import build_component_list 20 | 21 | 22 | class ItemPipelineManager(AbsMiddlewareManager): 23 | """ 24 | Manager for item pipeline components. 25 | 项目管道组件的管理器。 26 | 27 | This class manages the execution of item pipeline components, which process items 28 | after they have been extracted by spiders. It inherits from AbsMiddlewareManager 29 | and implements the specific behavior for item pipelines. 30 | 此类管理项目管道组件的执行,这些组件在项目被爬虫提取后进行处理。它继承自 31 | AbsMiddlewareManager,并实现了项目管道的特定行为。 32 | 33 | Item pipelines are executed in the order specified by their priority values in 34 | the ITEM_PIPELINES setting. Each pipeline can process an item and either return 35 | it for further processing, drop it, or raise an exception. 36 | 项目管道按照ITEM_PIPELINES设置中指定的优先级值顺序执行。每个管道可以处理一个 37 | 项目,并返回它以供进一步处理、丢弃它或引发异常。 38 | """ 39 | 40 | # Name of the component 41 | # 组件的名称 42 | component_name = 'item pipeline' 43 | 44 | @classmethod 45 | def _get_mwlist_from_settings(cls, settings): 46 | """ 47 | Get the list of item pipeline classes from settings. 48 | 从设置中获取项目管道类列表。 49 | 50 | This method implements the abstract method from AbsMiddlewareManager. 51 | It retrieves the list of item pipeline classes from the ITEM_PIPELINES setting. 52 | 此方法实现了AbsMiddlewareManager中的抽象方法。它从ITEM_PIPELINES设置中 53 | 检索项目管道类列表。 54 | 55 | Args: 56 | settings: The settings object. 57 | 设置对象。 58 | 59 | Returns: 60 | list: A list of item pipeline class paths.
61 | 项目管道类路径列表。 62 | """ 63 | # Build component list from ITEM_PIPELINES setting 64 | # 从ITEM_PIPELINES设置构建组件列表 65 | return build_component_list(settings.getwithbase('ITEM_PIPELINES')) 66 | 67 | def _add_middleware(self, pipe): 68 | """ 69 | Add a pipeline instance to the manager. 70 | 将管道实例添加到管理器。 71 | 72 | This method overrides the method from AbsMiddlewareManager to register 73 | the process_item method of item pipelines. It first calls the parent method 74 | to register open_spider and close_spider methods if they exist. 75 | 此方法覆盖了AbsMiddlewareManager中的方法,以注册项目管道的process_item方法。 76 | 它首先调用父方法来注册open_spider和close_spider方法(如果存在)。 77 | 78 | Args: 79 | pipe: The pipeline instance to add. 80 | 要添加的管道实例。 81 | """ 82 | # Call parent method to register open_spider and close_spider methods 83 | # 调用父方法来注册open_spider和close_spider方法 84 | super()._add_middleware(pipe) 85 | 86 | # Register process_item method if it exists 87 | # 如果存在,则注册process_item方法 88 | if hasattr(pipe, 'process_item'): 89 | self.methods['process_item'].append(pipe.process_item) 90 | 91 | async def process_item(self, item, spider): 92 | """ 93 | Process an item through all registered process_item methods. 94 | 通过所有已注册的process_item方法处理项目。 95 | 96 | This method calls each pipeline's process_item method in the order they 97 | were registered. The result of each pipeline is passed to the next one 98 | in a chain, allowing pipelines to modify the item or drop it by returning None. 99 | 此方法按照它们注册的顺序调用每个管道的process_item方法。每个管道的结果 100 | 以链式方式传递给下一个管道,允许管道修改项目或通过返回None来丢弃它。 101 | 102 | Args: 103 | item: The item to process. 104 | 要处理的项目。 105 | spider: The spider that generated the item. 106 | 生成项目的爬虫。 107 | 108 | Returns: 109 | The processed item, or None if it was dropped by a pipeline. 110 | 处理后的项目,如果被管道丢弃则为None。 111 | """ 112 | # Process the item through the chain of process_item methods 113 | # 通过process_item方法链处理项目 114 | return await self._process_chain('process_item', item, spider) 115 | -------------------------------------------------------------------------------- /aioscrapy/libs/spider/urllength.py: -------------------------------------------------------------------------------- 1 | """ 2 | Url Length Spider Middleware 3 | URL长度爬虫中间件 4 | 5 | This middleware filters out requests with URLs that exceed a configurable maximum 6 | length. It helps prevent issues with excessively long URLs that might cause problems 7 | with servers, proxies, or browsers. 8 | 此中间件过滤掉URL超过可配置最大长度的请求。它有助于防止过长的URL可能导致 9 | 服务器、代理或浏览器出现问题。 10 | """ 11 | 12 | from aioscrapy.exceptions import NotConfigured 13 | from aioscrapy.http import Request 14 | from aioscrapy.utils.log import logger 15 | 16 | 17 | class UrlLengthMiddleware: 18 | """ 19 | Spider middleware to filter out requests with excessively long URLs. 20 | 用于过滤掉URL过长的请求的爬虫中间件。 21 | 22 | This middleware checks the length of URLs in requests and filters out those 23 | that exceed a configurable maximum length. This helps prevent issues with 24 | servers, proxies, or browsers that might have trouble handling very long URLs. 25 | 此中间件检查请求中URL的长度,并过滤掉超过可配置最大长度的URL。 26 | 这有助于防止服务器、代理或浏览器在处理非常长的URL时可能遇到的问题。 27 | """ 28 | 29 | def __init__(self, maxlength): 30 | """ 31 | Initialize the URL length middleware. 32 | 初始化URL长度中间件。 33 | 34 | Args: 35 | maxlength: The maximum allowed URL length in characters. 
36 | 允许的URL最大长度(以字符为单位)。 37 | """ 38 | # Maximum allowed URL length 39 | # 允许的URL最大长度 40 | self.maxlength = maxlength 41 | 42 | @classmethod 43 | def from_settings(cls, settings): 44 | """ 45 | Create a UrlLengthMiddleware instance from settings. 46 | 从设置创建UrlLengthMiddleware实例。 47 | 48 | This is the factory method used by AioScrapy to create the middleware. 49 | 这是AioScrapy用于创建中间件的工厂方法。 50 | 51 | Args: 52 | settings: The AioScrapy settings object. 53 | AioScrapy设置对象。 54 | 55 | Returns: 56 | UrlLengthMiddleware: A new UrlLengthMiddleware instance. 57 | 一个新的UrlLengthMiddleware实例。 58 | 59 | Raises: 60 | NotConfigured: If URLLENGTH_LIMIT is not set or is zero in the settings. 61 | 如果在设置中未设置URLLENGTH_LIMIT或其值为零。 62 | """ 63 | # Get the maximum URL length from settings 64 | # 从设置获取最大URL长度 65 | maxlength = settings.getint('URLLENGTH_LIMIT') 66 | 67 | # If no maximum length is configured, disable the middleware 68 | # 如果未配置最大长度,则禁用中间件 69 | if not maxlength: 70 | raise NotConfigured 71 | 72 | # Create and return a new instance 73 | # 创建并返回一个新实例 74 | return cls(maxlength) 75 | 76 | async def process_spider_output(self, response, result, spider): 77 | """ 78 | Process the spider output to filter out requests with long URLs. 79 | 处理爬虫输出以过滤掉具有长URL的请求。 80 | 81 | This method processes each request yielded by the spider and filters out 82 | those with URLs that exceed the configured maximum length. 83 | 此方法处理爬虫产生的每个请求,并过滤掉URL超过配置的最大长度的请求。 84 | 85 | Args: 86 | response: The response being processed. 87 | 正在处理的响应。 88 | result: The result returned by the spider. 89 | 爬虫返回的结果。 90 | spider: The spider that generated the result. 91 | 生成结果的爬虫。 92 | 93 | Returns: 94 | An async generator yielding filtered requests and other items. 95 | 一个产生过滤后的请求和其他项目的异步生成器。 96 | """ 97 | def _filter(request): 98 | """ 99 | Filter function to check if a request's URL is too long. 100 | 检查请求的URL是否过长的过滤函数。 101 | 102 | Args: 103 | request: The request to check. 104 | 要检查的请求。 105 | 106 | Returns: 107 | bool: True if the request should be kept, False if it should be filtered out. 108 | 如果应保留请求,则为True;如果应过滤掉请求,则为False。 109 | """ 110 | # Check if the item is a Request and if its URL exceeds the maximum length 111 | # 检查项目是否为Request,以及其URL是否超过最大长度 112 | if isinstance(request, Request) and len(request.url) > self.maxlength: 113 | # Log the ignored request 114 | # 记录被忽略的请求 115 | logger.info( 116 | "Ignoring link (url length > %(maxlength)d): %(url)s " % { 117 | 'maxlength': self.maxlength, 'url': request.url 118 | } 119 | ) 120 | # Update statistics 121 | # 更新统计信息 122 | spider.crawler.stats.inc_value('urllength/request_ignored_count', spider=spider) 123 | # Filter out this request 124 | # 过滤掉此请求 125 | return False 126 | else: 127 | # Keep all other items 128 | # 保留所有其他项目 129 | return True 130 | 131 | # Filter the results using the _filter function 132 | # 使用_filter函数过滤结果 133 | return (r async for r in result or () if _filter(r)) 134 | -------------------------------------------------------------------------------- /aioscrapy/utils/log.py: -------------------------------------------------------------------------------- 1 | """ 2 | Logging utilities for aioscrapy. 3 | aioscrapy的日志工具。 4 | 5 | This module provides logging functionality for aioscrapy using the loguru library. 6 | It configures logging based on settings and provides a spider-aware logger. 
7 | 此模块使用loguru库为aioscrapy提供日志功能。 8 | 它根据设置配置日志记录,并提供一个感知爬虫的日志记录器。 9 | """ 10 | 11 | import asyncio 12 | import sys 13 | import warnings 14 | from typing import Type 15 | 16 | from loguru import logger as _logger 17 | 18 | from aioscrapy.settings import Settings 19 | 20 | # Remove the default stderr handler to avoid duplicate logs 21 | # 移除默认的stderr处理程序以避免重复日志 22 | for _handler in _logger._core.handlers.values(): 23 | if _handler._name == '<stderr>': 24 | _logger.remove(_handler._id) 25 | 26 | 27 | def configure_logging(spider: Type["Spider"], settings: Settings): 28 | """ 29 | Configure logging for a spider based on settings. 30 | 根据设置为爬虫配置日志记录。 31 | 32 | This function sets up logging handlers for a specific spider based on the provided settings. 33 | It can configure logging to stderr and/or to a file, with various options like log level, 34 | rotation, retention, etc. 35 | 此函数根据提供的设置为特定爬虫设置日志处理程序。 36 | 它可以配置日志记录到stderr和/或文件,具有各种选项,如日志级别、轮换、保留等。 37 | 38 | Args: 39 | spider: The spider instance for which to configure logging. 40 | 要为其配置日志记录的爬虫实例。 41 | settings: The settings object containing logging configuration. 42 | 包含日志配置的设置对象。 43 | """ 44 | # Get logging configuration from settings 45 | # 从设置中获取日志配置 46 | formatter = settings.get('LOG_FORMAT') 47 | level = settings.get('LOG_LEVEL', 'INFO') 48 | enqueue = settings.get('ENQUEUE', True) 49 | 50 | # Configure stderr logging if enabled 51 | # 如果启用,配置stderr日志记录 52 | if settings.get('LOG_STDOUT', True): 53 | _logger.add( 54 | sys.stderr, format=formatter, level=level, enqueue=enqueue, 55 | filter=lambda record: record["extra"].get("spidername") == spider.name, 56 | ) 57 | 58 | # Configure file logging if a filename is provided 59 | # 如果提供了文件名,配置文件日志记录 60 | if filename := settings.get('LOG_FILE'): 61 | rotation = settings.get('LOG_ROTATION', '20MB') 62 | retention = settings.get('LOG_RETENTION', 10) 63 | encoding = settings.get('LOG_ENCODING', 'utf-8') 64 | _logger.add( 65 | sink=filename, format=formatter, encoding=encoding, level=level, 66 | enqueue=enqueue, rotation=rotation, retention=retention, 67 | filter=lambda record: record["extra"].get("spidername") == spider.name, 68 | ) 69 | 70 | 71 | class AioScrapyLogger: 72 | """ 73 | Spider-aware logger for aioscrapy. 74 | aioscrapy的爬虫感知日志记录器。 75 | 76 | This class provides a wrapper around the loguru logger that automatically 77 | binds the current spider name to log records. This allows for filtering 78 | logs by spider name and provides context about which spider generated each log. 79 | 此类提供了loguru日志记录器的包装器,它自动将当前爬虫名称绑定到日志记录。 80 | 这允许按爬虫名称过滤日志,并提供关于哪个爬虫生成了每条日志的上下文。 81 | 82 | The logger methods (debug, info, warning, etc.) are dynamically accessed 83 | through __getattr__, so they're not explicitly defined. 84 | 日志记录器方法(debug、info、warning等)是通过__getattr__动态访问的, 85 | 因此它们没有明确定义。 86 | """ 87 | __slots__ = ( 88 | 'catch', 'complete', 'critical', 'debug', 'error', 'exception', 89 | 'info', 'log', 'patch', 'success', 'trace', 'warning' 90 | ) 91 | 92 | def __getattr__(self, method): 93 | """ 94 | Dynamically access logger methods with spider name binding. 95 | 动态访问带有爬虫名称绑定的日志记录器方法。 96 | 97 | This method intercepts attribute access to provide logger methods that 98 | automatically include the current spider name in the log context. 99 | 此方法拦截属性访问,以提供自动在日志上下文中包含当前爬虫名称的日志记录器方法。 100 | 101 | Args: 102 | method: The name of the logger method to access (e.g., 'info', 'debug'). 103 | 要访问的日志记录器方法的名称(例如,'info'、'debug')。 104 | 105 | Returns: 106 | The requested logger method, bound with the current spider name.
107 | 请求的日志记录器方法,绑定了当前爬虫名称。 108 | 109 | Note: 110 | If the current task name cannot be determined, it falls back to the 111 | original logger method without binding a spider name. 112 | 如果无法确定当前任务名称,它会回退到原始日志记录器方法,而不绑定爬虫名称。 113 | """ 114 | try: 115 | # Get the current task name as the spider name 116 | # 获取当前任务名称作为爬虫名称 117 | spider_name = asyncio.current_task().get_name() 118 | # Return the logger method bound with the spider name 119 | # 返回绑定了爬虫名称的日志记录器方法 120 | return getattr(_logger.bind(spidername=spider_name), method) 121 | except Exception as e: 122 | # Fall back to the original logger method if binding fails 123 | # 如果绑定失败,回退到原始日志记录器方法 124 | warnings.warn(f'Error on get logger: {e}') 125 | return getattr(_logger, method) 126 | 127 | 128 | # Create a singleton instance of the spider-aware logger 129 | # 创建爬虫感知日志记录器的单例实例 130 | logger = AioScrapyLogger() 131 | -------------------------------------------------------------------------------- /aioscrapy/utils/response.py: -------------------------------------------------------------------------------- 1 | """ 2 | Response utility functions for aioscrapy. 3 | aioscrapy的响应实用函数。 4 | 5 | This module provides utility functions for working with aioscrapy.http.Response objects. 6 | It includes functions for extracting base URLs and meta refresh directives from HTML responses. 7 | 此模块提供了用于处理aioscrapy.http.Response对象的实用函数。 8 | 它包括从HTML响应中提取基本URL和元刷新指令的函数。 9 | """ 10 | from typing import Iterable, Optional, Tuple, Union 11 | from weakref import WeakKeyDictionary 12 | 13 | from w3lib import html 14 | 15 | import aioscrapy 16 | from aioscrapy.http.response import Response 17 | 18 | # Cache for storing base URLs to avoid repeated parsing of the same response 19 | # 缓存存储基本URL,以避免重复解析相同的响应 20 | _baseurl_cache: "WeakKeyDictionary[Response, str]" = WeakKeyDictionary() 21 | 22 | 23 | def get_base_url(response: "aioscrapy.http.response.TextResponse") -> str: 24 | """ 25 | Extract the base URL from an HTML response. 26 | 从HTML响应中提取基本URL。 27 | 28 | This function extracts the base URL from an HTML response by looking for 29 | the <base> tag in the HTML. If found, it returns the href attribute of the 30 | base tag, resolved against the response URL. If not found, it returns the 31 | response URL. 32 | 此函数通过查找HTML中的<base>标签来从HTML响应中提取基本URL。 33 | 如果找到,它返回base标签的href属性,相对于响应URL解析。 34 | 如果未找到,它返回响应URL。 35 | 36 | The function uses a cache to avoid repeated parsing of the same response. 37 | Only the first 4KB of the response text are examined for performance reasons. 38 | 该函数使用缓存来避免重复解析相同的响应。 39 | 出于性能原因,只检查响应文本的前4KB。 40 | 41 | Args: 42 | response: The HTML response to extract the base URL from.
43 | 要从中提取基本URL的HTML响应。 44 | 45 | Returns: 46 | str: The base URL of the response, which could be either: 47 | 响应的基本URL,可能是: 48 | - The href attribute of the <base> tag, resolved against the response URL 49 | <base>标签的href属性,相对于响应URL解析 50 | - The response URL if no <base> tag is found 51 | 如果未找到<base>标签,则为响应URL 52 | """ 53 | # Check if the base URL is already cached for this response 54 | # 检查此响应的基本URL是否已缓存 55 | if response not in _baseurl_cache: 56 | # Only examine the first 4KB of the response for performance 57 | # 出于性能考虑,只检查响应的前4KB 58 | text = response.text[0:4096] 59 | # Extract the base URL using w3lib.html 60 | # 使用w3lib.html提取基本URL 61 | _baseurl_cache[response] = html.get_base_url(text, response.url, response.encoding) 62 | # Return the cached base URL 63 | # 返回缓存的基本URL 64 | return _baseurl_cache[response] 65 | 66 | 67 | # Cache for storing meta refresh directives to avoid repeated parsing of the same response 68 | # 缓存存储元刷新指令,以避免重复解析相同的响应 69 | # The cache stores either (None, None) if no meta refresh is found, or (seconds, url) if found 70 | # 如果未找到元刷新,缓存存储(None, None),如果找到,则存储(秒数, url) 71 | _metaref_cache: "WeakKeyDictionary[Response, Union[Tuple[None, None], Tuple[float, str]]]" = WeakKeyDictionary() 72 | 73 | 74 | def get_meta_refresh( 75 | response: "aioscrapy.http.response.TextResponse", 76 | ignore_tags: Optional[Iterable[str]] = ('script', 'noscript'), 77 | ) -> Union[Tuple[None, None], Tuple[float, str]]: 78 | """ 79 | Extract the meta refresh directive from an HTML response. 80 | 从HTML响应中提取元刷新指令。 81 | 82 | This function looks for the HTML meta refresh tag in the response and extracts 83 | the delay (in seconds) and the URL to redirect to. The meta refresh tag is 84 | typically used for automatic page redirection or refreshing. 85 | 此函数在响应中查找HTML元刷新标签,并提取延迟(以秒为单位)和要重定向到的URL。 86 | 元刷新标签通常用于自动页面重定向或刷新。 87 | 88 | Example of a meta refresh tag: 89 | 元刷新标签的示例: 90 | <meta http-equiv="refresh" content="5; url=http://example.com/"> 91 | 92 | The function uses a cache to avoid repeated parsing of the same response. 93 | Only the first 4KB of the response text are examined for performance reasons. 94 | 该函数使用缓存来避免重复解析相同的响应。 95 | 出于性能原因,只检查响应文本的前4KB。 96 | 97 | Args: 98 | response: The HTML response to extract the meta refresh from. 99 | 要从中提取元刷新的HTML响应。 100 | ignore_tags: HTML tags to ignore when parsing. Default is ('script', 'noscript').
101 | 解析时要忽略的HTML标签。默认为('script', 'noscript')。 102 | 103 | Returns: 104 | A tuple containing: 105 | 包含以下内容的元组: 106 | - If meta refresh is found: (delay_seconds, url) 107 | 如果找到元刷新:(延迟秒数, url) 108 | - If no meta refresh is found: (None, None) 109 | 如果未找到元刷新:(None, None) 110 | """ 111 | # Check if the meta refresh is already cached for this response 112 | # 检查此响应的元刷新是否已缓存 113 | if response not in _metaref_cache: 114 | # Only examine the first 4KB of the response for performance 115 | # 出于性能考虑,只检查响应的前4KB 116 | text = response.text[0:4096] 117 | # Extract the meta refresh using w3lib.html 118 | # 使用w3lib.html提取元刷新 119 | _metaref_cache[response] = html.get_meta_refresh( 120 | text, response.url, response.encoding, ignore_tags=ignore_tags) 121 | # Return the cached meta refresh 122 | # 返回缓存的元刷新 123 | return _metaref_cache[response] 124 | -------------------------------------------------------------------------------- /aioscrapy/libs/downloader/ja3fingerprint.py: -------------------------------------------------------------------------------- 1 | """ 2 | JA3 Fingerprint Randomization Middleware 3 | JA3指纹随机化中间件 4 | 5 | This module provides a middleware for randomizing SSL/TLS cipher suites to help 6 | avoid fingerprinting and detection when making HTTPS requests. JA3 is a method 7 | for creating SSL/TLS client fingerprints that can be used to identify specific 8 | clients regardless of the presented hostname or client IP address. 9 | 此模块提供了一个中间件,用于随机化SSL/TLS密码套件,以帮助在发出HTTPS请求时 10 | 避免指纹识别和检测。JA3是一种创建SSL/TLS客户端指纹的方法,可用于识别特定 11 | 客户端,而不考虑所呈现的主机名或客户端IP地址。 12 | 13 | By randomizing the order of cipher suites, this middleware helps to generate 14 | different JA3 fingerprints for each request, making it harder for servers to 15 | track or block the crawler based on its TLS fingerprint. 16 | 通过随机化密码套件的顺序,此中间件有助于为每个请求生成不同的JA3指纹, 17 | 使服务器更难基于其TLS指纹跟踪或阻止爬虫。 18 | """ 19 | import random 20 | 21 | 22 | # Default cipher suite string used when no custom ciphers are specified 23 | # 未指定自定义密码套件时使用的默认密码套件字符串 24 | ORIGIN_CIPHERS = ('ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:' 25 | 'DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:!aNULL:!eNULL:!MD5') 26 | 27 | 28 | class TLSCiphersMiddleware: 29 | """ 30 | SSL/TLS Fingerprint Randomization Middleware. 31 | SSL/TLS指纹随机化中间件。 32 | 33 | This middleware modifies the SSL/TLS cipher suites used in HTTPS requests 34 | to help avoid fingerprinting and detection. It can use custom cipher suites 35 | or randomize the order of the default cipher suites. 36 | 此中间件修改HTTPS请求中使用的SSL/TLS密码套件,以帮助避免指纹识别和检测。 37 | 它可以使用自定义密码套件或随机化默认密码套件的顺序。 38 | """ 39 | 40 | def __init__(self, ciphers, is_random): 41 | """ 42 | Initialize the TLS Ciphers Middleware. 43 | 初始化TLS密码套件中间件。 44 | 45 | Args: 46 | ciphers: The cipher suites to use, or 'DEFAULT' to use the default ciphers. 47 | 要使用的密码套件,或'DEFAULT'以使用默认密码套件。 48 | is_random: Whether to randomize the order of cipher suites. 49 | 是否随机化密码套件的顺序。 50 | """ 51 | # If ciphers is 'DEFAULT', set self.ciphers to None to use ORIGIN_CIPHERS later 52 | # 如果ciphers是'DEFAULT',将self.ciphers设置为None以便稍后使用ORIGIN_CIPHERS 53 | if ciphers == 'DEFAULT': 54 | self.ciphers = None 55 | else: 56 | self.ciphers = ciphers 57 | 58 | self.is_random = is_random 59 | 60 | @classmethod 61 | def from_crawler(cls, crawler): 62 | """ 63 | Create a TLSCiphersMiddleware instance from a crawler. 64 | 从爬虫创建TLSCiphersMiddleware实例。 65 | 66 | This is the factory method used by AioScrapy to create middleware instances. 
67 | 这是AioScrapy用于创建中间件实例的工厂方法。 68 | 69 | Args: 70 | crawler: The crawler that will use this middleware. 71 | 将使用此中间件的爬虫。 72 | 73 | Returns: 74 | TLSCiphersMiddleware: A new TLSCiphersMiddleware instance. 75 | 一个新的TLSCiphersMiddleware实例。 76 | """ 77 | return cls( 78 | # Get custom cipher suites from settings, or use 'DEFAULT' 79 | # 从设置获取自定义密码套件,或使用'DEFAULT' 80 | ciphers=crawler.settings.get('DOWNLOADER_CLIENT_TLS_CIPHERS', 'DEFAULT'), 81 | # Get whether to randomize cipher suites from settings 82 | # 从设置获取是否随机化密码套件 83 | is_random=crawler.settings.get('RANDOM_TLS_CIPHERS', False) 84 | ) 85 | 86 | def process_request(self, request, spider): 87 | """ 88 | Process a request before it is sent to the downloader. 89 | 在请求发送到下载器之前处理它。 90 | 91 | This method sets the TLS cipher suites for the request, optionally 92 | randomizing their order to generate different JA3 fingerprints. 93 | 此方法为请求设置TLS密码套件,可选择随机化它们的顺序以生成不同的JA3指纹。 94 | 95 | Args: 96 | request: The request being processed. 97 | 正在处理的请求。 98 | spider: The spider that generated the request. 99 | 生成请求的爬虫。 100 | 101 | Returns: 102 | None: This method returns None to continue processing the request. 103 | 此方法返回None以继续处理请求。 104 | """ 105 | # Skip if neither custom ciphers nor randomization is enabled 106 | # 如果既没有启用自定义密码套件也没有启用随机化,则跳过 107 | if not (self.ciphers or self.is_random): 108 | return 109 | 110 | # Use custom ciphers if specified, otherwise use default 111 | # 如果指定了自定义密码套件则使用它,否则使用默认值 112 | ciphers = self.ciphers or ORIGIN_CIPHERS 113 | 114 | # Randomize cipher suite order if enabled 115 | # 如果启用了随机化,则随机化密码套件顺序 116 | if self.is_random: 117 | # Split the cipher string into individual ciphers 118 | # 将密码字符串拆分为单个密码 119 | ciphers = ciphers.split(":") 120 | # Shuffle the ciphers randomly 121 | # 随机打乱密码 122 | random.shuffle(ciphers) 123 | # Join the ciphers back into a string 124 | # 将密码重新连接成字符串 125 | ciphers = ":".join(ciphers) 126 | 127 | # Set the cipher suites in the request metadata 128 | # 在请求元数据中设置密码套件 129 | request.meta['TLS_CIPHERS'] = ciphers 130 | -------------------------------------------------------------------------------- /aioscrapy/http/response/web_driver.py: -------------------------------------------------------------------------------- 1 | """ 2 | Playwright response implementation for aioscrapy. 3 | aioscrapy的WebDriverResponse响应实现。 4 | 5 | This module provides the PlaywrightResponse class, which is a specialized TextResponse 6 | for handling responses from Playwright browser automation. It adds support for 7 | browser driver management and response caching. 8 | 此模块提供了WebDriverResponse类,这是一个专门用于处理来自Playwright/DrissionPage等浏览器自动化的响应的TextResponse。 9 | 它添加了对浏览器驱动程序管理和响应缓存的支持。 10 | """ 11 | 12 | from typing import Optional, Any 13 | 14 | from aioscrapy.http.response.text import TextResponse 15 | 16 | 17 | class WebDriverResponse(TextResponse): 18 | """ 19 | A Response subclass for handling Playwright browser automation responses. 20 | 用于处理Playwright浏览器自动化响应的Response子类。 21 | 22 | This class extends TextResponse to handle responses from Playwright browser automation. 
23 | It adds support for: 24 | 此类扩展了TextResponse以处理来自Playwright浏览器自动化的响应。 25 | 它添加了对以下内容的支持: 26 | 27 | - Browser driver management 28 | 浏览器驱动程序管理 29 | - Response caching 30 | 响应缓存 31 | - Text content override 32 | 文本内容覆盖 33 | - Intercepted request data 34 | 拦截的请求数据 35 | """ 36 | 37 | def __init__( 38 | self, 39 | *args, 40 | text: str = '', 41 | cache_response: Optional[dict] = None, 42 | driver: Optional["WebDriverBase"] = None, 43 | driver_pool: Optional["WebDriverPool"] = None, 44 | intercept_request: Optional[dict] = None, 45 | **kwargs 46 | ): 47 | """ 48 | Initialize a PlaywrightResponse. 49 | 初始化PlaywrightResponse。 50 | 51 | Args: 52 | *args: Positional arguments passed to the TextResponse constructor. 53 | 传递给TextResponse构造函数的位置参数。 54 | text: The text content of the response, which can override the body's decoded text. 55 | 响应的文本内容,可以覆盖正文的解码文本。 56 | cache_response: A dictionary of cached response data. 57 | 缓存的响应数据字典。 58 | driver: The Playwright driver instance used for this response. 59 | 用于此响应的Playwright驱动程序实例。 60 | driver_pool: The WebDriverPool that manages the driver. 61 | 管理驱动程序的WebDriverPool。 62 | intercept_request: A dictionary of intercepted request data. 63 | 拦截的请求数据字典。 64 | **kwargs: Keyword arguments passed to the TextResponse constructor. 65 | 传递给TextResponse构造函数的关键字参数。 66 | """ 67 | # Store Playwright-specific attributes 68 | # 存储Playwright特定的属性 69 | self.driver = driver 70 | self.driver_pool = driver_pool 71 | self._text = text 72 | self.cache_response = cache_response or {} 73 | self.intercept_request = intercept_request 74 | 75 | # Initialize the base TextResponse 76 | # 初始化基本TextResponse 77 | super().__init__(*args, **kwargs) 78 | 79 | async def release(self): 80 | """ 81 | Release the Playwright driver back to the pool. 82 | 将Playwright驱动程序释放回池中。 83 | 84 | This method releases the driver instance back to the WebDriverPool 85 | if both the driver and pool are available. 86 | 如果驱动程序和池都可用,此方法将驱动程序实例释放回WebDriverPool。 87 | 88 | Returns: 89 | None 90 | """ 91 | self.driver_pool and self.driver and await self.driver_pool.release(self.driver) 92 | 93 | @property 94 | def text(self): 95 | """ 96 | Get the response text content. 97 | 获取响应文本内容。 98 | 99 | This property overrides the base TextResponse.text property to return 100 | the explicitly set text content if available, otherwise falls back to 101 | the decoded body text from the parent class. 102 | 此属性重写了基本TextResponse.text属性,如果可用,则返回明确设置的文本内容, 103 | 否则回退到父类的解码正文文本。 104 | 105 | Returns: 106 | str: The response text content. 107 | 响应文本内容。 108 | """ 109 | return self._text or super().text 110 | 111 | @text.setter 112 | def text(self, text): 113 | """ 114 | Set the response text content. 115 | 设置响应文本内容。 116 | 117 | This setter allows explicitly setting the text content of the response, 118 | which will override the decoded body text. 119 | 此设置器允许明确设置响应的文本内容,这将覆盖解码的正文文本。 120 | 121 | Args: 122 | text: The text content to set. 123 | 要设置的文本内容。 124 | """ 125 | self._text = text 126 | 127 | def get_response(self, key) -> Any: 128 | """ 129 | Get a value from the cached response data. 130 | 从缓存的响应数据中获取值。 131 | 132 | This method retrieves a value from the cache_response dictionary 133 | using the provided key. 134 | 此方法使用提供的键从cache_response字典中检索值。 135 | 136 | Args: 137 | key: The key to look up in the cached response data. 138 | 在缓存的响应数据中查找的键。 139 | 140 | Returns: 141 | Any: The value associated with the key, or None if the key is not found. 
142 | 与键关联的值,如果未找到键,则为None。 143 | """ 144 | return self.cache_response.get(key) 145 | -------------------------------------------------------------------------------- /aioscrapy/commands/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Base class for Aioscrapy commands 3 | """ 4 | import os 5 | from optparse import OptionGroup 6 | from typing import Any, Dict 7 | 8 | from aioscrapy.utils.conf import arglist_to_dict 9 | from aioscrapy.exceptions import UsageError 10 | 11 | 12 | class AioScrapyCommand: 13 | 14 | requires_project = False 15 | crawler_process = None 16 | 17 | # default settings to be used for this command instead of global defaults 18 | default_settings: Dict[str, Any] = {} 19 | 20 | exitcode = 0 21 | 22 | def __init__(self): 23 | self.settings = None # set in scrapy.cmdline 24 | 25 | def set_crawler(self, crawler): 26 | if hasattr(self, '_crawler'): 27 | raise RuntimeError("crawler already set") 28 | self._crawler = crawler 29 | 30 | def syntax(self): 31 | """ 32 | Command syntax (preferably one-line). Do not include command name. 33 | """ 34 | return "" 35 | 36 | def short_desc(self): 37 | """ 38 | A short description of the command 39 | """ 40 | return "" 41 | 42 | def long_desc(self): 43 | """A long description of the command. Return short description when not 44 | available. It cannot contain newlines, since contents will be formatted 45 | by optparser which removes newlines and wraps text. 46 | """ 47 | return self.short_desc() 48 | 49 | def help(self): 50 | """An extensive help for the command. It will be shown when using the 51 | "help" command. It can contain newlines, since no post-formatting will 52 | be applied to its contents. 53 | """ 54 | return self.long_desc() 55 | 56 | def add_options(self, parser): 57 | """ 58 | Populate option parse with options available for this command 59 | """ 60 | group = OptionGroup(parser, "Global Options") 61 | group.add_option("--logfile", metavar="FILE", 62 | help="log file. 
if omitted stderr will be used") 63 | group.add_option("-L", "--loglevel", metavar="LEVEL", default=None, 64 | help=f"log level (default: {self.settings['LOG_LEVEL']})") 65 | group.add_option("--nolog", action="store_true", 66 | help="disable logging completely") 67 | group.add_option("--profile", metavar="FILE", default=None, 68 | help="write python cProfile stats to FILE") 69 | group.add_option("--pidfile", metavar="FILE", 70 | help="write process ID to FILE") 71 | group.add_option("-s", "--set", action="append", default=[], metavar="NAME=VALUE", 72 | help="set/override setting (may be repeated)") 73 | group.add_option("--pdb", action="store_true", help="enable pdb on failure") 74 | 75 | parser.add_option_group(group) 76 | 77 | def process_options(self, args, opts): 78 | try: 79 | self.settings.setdict(arglist_to_dict(opts.set), 80 | priority='cmdline') 81 | except ValueError: 82 | raise UsageError("Invalid -s value, use -s NAME=VALUE", print_help=False) 83 | 84 | if opts.logfile: 85 | self.settings.set('LOG_ENABLED', True, priority='cmdline') 86 | self.settings.set('LOG_FILE', opts.logfile, priority='cmdline') 87 | 88 | if opts.loglevel: 89 | self.settings.set('LOG_ENABLED', True, priority='cmdline') 90 | self.settings.set('LOG_LEVEL', opts.loglevel, priority='cmdline') 91 | 92 | if opts.nolog: 93 | self.settings.set('LOG_ENABLED', False, priority='cmdline') 94 | 95 | if opts.pidfile: 96 | with open(opts.pidfile, "w") as f: 97 | f.write(str(os.getpid()) + os.linesep) 98 | 99 | def run(self, args, opts): 100 | """ 101 | Entry point for running commands 102 | """ 103 | raise NotImplementedError 104 | 105 | 106 | class BaseRunSpiderCommand(AioScrapyCommand): 107 | """ 108 | Common class used to share functionality between the crawl, parse and runspider commands 109 | """ 110 | def add_options(self, parser): 111 | AioScrapyCommand.add_options(self, parser) 112 | parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE", 113 | help="set spider argument (may be repeated)") 114 | parser.add_option("-o", "--output", metavar="FILE", action="append", 115 | help="append scraped items to the end of FILE (use - for stdout)") 116 | parser.add_option("-O", "--overwrite-output", metavar="FILE", action="append", 117 | help="dump scraped items into FILE, overwriting any existing file") 118 | parser.add_option("-t", "--output-format", metavar="FORMAT", 119 | help="format to use for dumping items") 120 | 121 | def process_options(self, args, opts): 122 | AioScrapyCommand.process_options(self, args, opts) 123 | try: 124 | opts.spargs = arglist_to_dict(opts.spargs) 125 | except ValueError: 126 | raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False) 127 | -------------------------------------------------------------------------------- /aioscrapy/queue/redis.py: -------------------------------------------------------------------------------- 1 | from abc import ABC 2 | from typing import Optional 3 | 4 | import aioscrapy 5 | from aioscrapy.db import db_manager 6 | from aioscrapy.queue import AbsQueue 7 | from aioscrapy.serializer import AbsSerializer 8 | from aioscrapy.utils.misc import load_object 9 | 10 | 11 | class RedisQueueBase(AbsQueue, ABC): 12 | inc_key = 'scheduler/enqueued/redis' 13 | 14 | @classmethod 15 | def from_dict(cls, data: dict) -> "RedisQueueBase": 16 | alias: str = data.get("alias", 'queue') 17 | server: aioscrapy.db.aioredis.Redis = db_manager.redis(alias) 18 | spider_name: str = data["spider_name"] 19 | serializer: str = 
data.get("serializer", "aioscrapy.serializer.JsonSerializer") 20 | serializer: AbsSerializer = load_object(serializer) 21 | return cls( 22 | server, 23 | key='%(spider)s:requests' % {'spider': spider_name}, 24 | serializer=serializer 25 | ) 26 | 27 | @classmethod 28 | async def from_spider(cls, spider: aioscrapy.Spider) -> "RedisQueueBase": 29 | alias: str = spider.settings.get("SCHEDULER_QUEUE_ALIAS", 'queue') 30 | server: aioscrapy.db.aioredis.Redis = db_manager.redis(alias) 31 | queue_key: str = spider.settings.get("SCHEDULER_QUEUE_KEY", '%(spider)s:requests') 32 | serializer: str = spider.settings.get("SCHEDULER_SERIALIZER", "aioscrapy.serializer.JsonSerializer") 33 | serializer: AbsSerializer = load_object(serializer) 34 | return cls( 35 | server, 36 | spider, 37 | queue_key % {'spider': spider.name}, 38 | serializer=serializer 39 | ) 40 | 41 | async def clear(self) -> None: 42 | """Clear queue/stack""" 43 | await self.container.delete(self.key) 44 | 45 | 46 | class RedisFifoQueue(RedisQueueBase): 47 | """Per-spider FIFO queue""" 48 | 49 | async def len(self) -> int: 50 | return await self.container.llen(self.key) 51 | 52 | async def push(self, request: aioscrapy.Request) -> None: 53 | """Push a request""" 54 | await self.container.lpush(self.key, self._encode_request(request)) 55 | 56 | async def push_batch(self, requests) -> None: 57 | async with self.container.pipeline() as pipe: 58 | for request in requests: 59 | pipe.lpush(self.key, self._encode_request(request)) 60 | await pipe.execute() 61 | 62 | async def pop(self, count: int = 1) -> Optional[aioscrapy.Request]: 63 | """Pop a request""" 64 | async with self.container.pipeline(transaction=True) as pipe: 65 | for _ in range(count): 66 | pipe.rpop(self.key) 67 | results = await pipe.execute() 68 | for result in results: 69 | if result: 70 | yield await self._decode_request(result) 71 | 72 | 73 | class RedisPriorityQueue(RedisQueueBase): 74 | """Per-spider priority queue abstraction using redis' sorted set""" 75 | 76 | async def len(self) -> int: 77 | return await self.container.zcard(self.key) 78 | 79 | async def push(self, request: aioscrapy.Request) -> None: 80 | """Push a request""" 81 | data = self._encode_request(request) 82 | score = request.priority 83 | await self.container.zadd(self.key, {data: score}) 84 | 85 | async def push_batch(self, requests) -> None: 86 | async with self.container.pipeline() as pipe: 87 | for request in requests: 88 | pipe.zadd(self.key, {self._encode_request(request): request.priority}) 89 | await pipe.execute() 90 | 91 | async def pop(self, count: int = 1) -> Optional[aioscrapy.Request]: 92 | async with self.container.pipeline(transaction=True) as pipe: 93 | stop = count - 1 if count - 1 > 0 else 0 94 | results, _ = await ( 95 | pipe.zrange(self.key, 0, stop) 96 | .zremrangebyrank(self.key, 0, stop) 97 | .execute() 98 | ) 99 | for result in results: 100 | yield await self._decode_request(result) 101 | 102 | 103 | class RedisLifoQueue(RedisQueueBase): 104 | """Per-spider LIFO queue.""" 105 | 106 | async def len(self) -> int: 107 | return await self.container.llen(self.key) 108 | 109 | async def push(self, request: aioscrapy.Request) -> None: 110 | """Push a request""" 111 | await self.container.lpush(self.key, self._encode_request(request)) 112 | 113 | async def push_batch(self, requests) -> None: 114 | async with self.container.pipeline() as pipe: 115 | for request in requests: 116 | pipe.lpush(self.key, self._encode_request(request)) 117 | await pipe.execute() 118 | 119 | async def pop(self, 
count: int = 1) -> Optional[aioscrapy.Request]: 120 | """Pop a request""" 121 | async with self.container.pipeline(transaction=True) as pipe: 122 | for _ in range(count): 123 | pipe.lpop(self.key) 124 | results = await pipe.execute() 125 | for result in results: 126 | if result: 127 | yield await self._decode_request(result) 128 | 129 | 130 | SpiderQueue = RedisFifoQueue 131 | SpiderStack = RedisLifoQueue 132 | SpiderPriorityQueue = RedisPriorityQueue 133 | -------------------------------------------------------------------------------- /example/singlespider/demo_request_playwright.py: -------------------------------------------------------------------------------- 1 | import random 2 | from typing import Any 3 | 4 | from playwright.async_api._generated import Response as EventResponse 5 | from playwright.async_api._generated import Request as EventRequest 6 | 7 | from aioscrapy import Request, logger, Spider 8 | from aioscrapy.core.downloader.handlers.webdriver import PlaywrightDriver 9 | from aioscrapy.http import WebDriverResponse 10 | 11 | 12 | class DemoPlaywrightSpider(Spider): 13 | name = 'DemoPlaywrightSpider' 14 | 15 | custom_settings = dict( 16 | USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36", 17 | # DOWNLOAD_DELAY=3, 18 | # RANDOMIZE_DOWNLOAD_DELAY=True, 19 | CONCURRENT_REQUESTS=1, 20 | LOG_LEVEL='INFO', 21 | CLOSE_SPIDER_ON_IDLE=True, 22 | # DOWNLOAD_HANDLERS={ 23 | # 'http': 'aioscrapy.core.downloader.handlers.webdriver.playwright.PlaywrightDownloadHandler', 24 | # 'https': 'aioscrapy.core.downloader.handlers.webdriver.playwright.PlaywrightDownloadHandler', 25 | # }, 26 | DOWNLOAD_HANDLERS_TYPE="playwright", 27 | PLAYWRIGHT_ARGS=dict( 28 | use_pool=True, # use_pool=True时 使用完driver后不销毁 重复使用 提供效率 29 | max_uses=None, # 在use_pool=True时生效,如果driver达到指定使用次数,则销毁,重新启动一个driver(处理有些driver使用次数变多则变卡的情况) 30 | driver_type="chromium", # chromium、firefox、webkit 31 | wait_until="networkidle", # 等待页面加载完成的事件,可选值:"commit", "domcontentloaded", "load", "networkidle" 32 | window_size=(1024, 800), 33 | # proxy='http://user:pwd@127.0.0.1:7890', 34 | browser_args=dict( 35 | executable_path=None, channel=None, args=None, ignore_default_args=None, handle_sigint=None, 36 | handle_sigterm=None, handle_sighup=None, timeout=None, env=None, headless=False, devtools=None, 37 | downloads_path=None, slow_mo=None, traces_dir=None, chromium_sandbox=None, 38 | firefox_user_prefs=None, 39 | ), 40 | context_args=dict( 41 | no_viewport=None, ignore_https_errors=None, java_script_enabled=None, 42 | bypass_csp=None, user_agent=None, locale=None, timezone_id=None, geolocation=None, permissions=None, 43 | extra_http_headers=None, offline=None, http_credentials=None, device_scale_factor=None, 44 | is_mobile=None, has_touch=None, color_scheme=None, reduced_motion=None, forced_colors=None, 45 | accept_downloads=None, default_browser_type=None, record_har_path=None, 46 | record_har_omit_content=None, record_video_dir=None, record_video_size=None, storage_state=None, 47 | base_url=None, strict_selectors=None, service_workers=None, record_har_url_filter=None, 48 | record_har_mode=None, record_har_content=None, 49 | ), 50 | ) 51 | 52 | ) 53 | 54 | start_urls = ['https://hanyu.baidu.com/zici/s?wd=黄&query=黄'] 55 | # start_urls = ["https://mall.jd.com/view_search-507915-3733265-99-1-24-1.html"] 56 | 57 | @staticmethod 58 | async def process_request(request, spider): 59 | """ request middleware """ 60 | pass 61 | 62 | @staticmethod 63 | async def 
process_response(request, response, spider): 64 | """ response middleware """ 65 | return response 66 | 67 | @staticmethod 68 | async def process_exception(request, exception, spider): 69 | """ exception middleware """ 70 | pass 71 | 72 | async def parse(self, response: WebDriverResponse): 73 | yield { 74 | 'pingyin': response.xpath('//div[@class="pinyin-text"]/text()').get(), 75 | 'img_bytes': response.get_response('img_bytes'), 76 | } 77 | 78 | new_character = response.xpath('//a[@class="img-link"]/@href').getall() 79 | for character in new_character: 80 | new_url = 'https://hanyu.baidu.com/zici' + character 81 | yield Request(new_url, callback=self.parse, dont_filter=True) 82 | 83 | async def on_event_request(self, result: EventRequest) -> Any: 84 | """ 85 | 具体使用参考playwright的page.on('request', lambda req: print(req)) 86 | """ 87 | # print(result) 88 | 89 | async def on_event_response(self, result: EventResponse) -> Any: 90 | """ 91 | 具体使用参考playwright的page.on('response', lambda res: print(res)) 92 | """ 93 | 94 | # 如果事件中有需要传递回 parse函数的内容,则按如下返回,结果将在self.parse的response.cache_response中, 95 | 96 | if 'getModuleHtml' in result.url: 97 | return 'getModuleHtml_response', result 98 | 99 | if 'xxx1' in result.url: 100 | return 'xxx_response', await result.text() 101 | 102 | if 'xxx2' in result.url: 103 | return 'xxx2', {'data': 'aaa'} 104 | 105 | async def process_action(self, driver: PlaywrightDriver, request: Request) -> Any: 106 | """ Do some operations after function page.goto """ 107 | img_bytes = await driver.page.screenshot(type="jpeg", quality=50) 108 | 109 | # TODO: 点击 选择元素等操作 110 | 111 | # 如果有需要传递回 parse函数的内容,则按如下返回,结果将在self.parse的response.cache_response中, 112 | return 'img_bytes', img_bytes 113 | 114 | async def process_item(self, item): 115 | logger.info(item) 116 | 117 | 118 | if __name__ == '__main__': 119 | DemoPlaywrightSpider.start() 120 | -------------------------------------------------------------------------------- /aioscrapy/dupefilters/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Duplicate Filter Base Module for AioScrapy 3 | AioScrapy的重复过滤器基础模块 4 | 5 | This module provides the abstract base class for duplicate filters in AioScrapy. 6 | Duplicate filters are used to avoid crawling the same URL multiple times by 7 | tracking request fingerprints. 8 | 此模块提供了AioScrapy中重复过滤器的抽象基类。 9 | 重复过滤器用于通过跟踪请求指纹来避免多次爬取相同的URL。 10 | """ 11 | 12 | from typing import Literal 13 | from abc import ABCMeta, abstractmethod 14 | 15 | from aioscrapy import Request, Spider 16 | from aioscrapy.utils.log import logger 17 | 18 | 19 | class DupeFilterBase(metaclass=ABCMeta): 20 | """ 21 | Abstract base class for request fingerprint duplicate filters. 22 | 请求指纹重复过滤器的抽象基类。 23 | 24 | This class defines the interface that all duplicate filters must implement. 25 | Duplicate filters are used to avoid crawling the same URL multiple times by 26 | tracking request fingerprints. 27 | 此类定义了所有重复过滤器必须实现的接口。 28 | 重复过滤器用于通过跟踪请求指纹来避免多次爬取相同的URL。 29 | """ 30 | 31 | @classmethod 32 | @abstractmethod 33 | def from_crawler(cls, crawler: "aioscrapy.crawler.Crawler"): 34 | """ 35 | Create a duplicate filter instance from a crawler. 36 | 从爬虫创建重复过滤器实例。 37 | 38 | This is the factory method used by AioScrapy to create the dupefilter. 39 | 这是AioScrapy用于创建重复过滤器的工厂方法。 40 | 41 | Args: 42 | crawler: The crawler that will use this dupefilter. 43 | 将使用此重复过滤器的爬虫。 44 | 45 | Returns: 46 | DupeFilterBase: A new dupefilter instance. 
47 | 一个新的重复过滤器实例。 48 | """ 49 | pass 50 | 51 | @abstractmethod 52 | async def request_seen(self, request: Request) -> bool: 53 | """ 54 | Check if a request has been seen before. 55 | 检查请求是否已经被看到过。 56 | 57 | This method checks if the request's fingerprint is in the set of seen 58 | fingerprints. If it is, the request is considered a duplicate. 59 | 此方法检查请求的指纹是否在已见过的指纹集合中。如果是,则认为请求是重复的。 60 | 61 | Args: 62 | request: The request to check. 63 | 要检查的请求。 64 | 65 | Returns: 66 | bool: True if the request has been seen before, False otherwise. 67 | 如果请求之前已经被看到过,则为True,否则为False。 68 | """ 69 | pass 70 | 71 | @abstractmethod 72 | async def close(self, reason: str = '') -> None: 73 | """ 74 | Close the dupefilter. 75 | 关闭过滤器。 76 | 77 | This method is called when the spider is closed. It should clean up 78 | any resources used by the dupefilter. 79 | 当爬虫关闭时调用此方法。它应该清理重复过滤器使用的任何资源。 80 | 81 | Args: 82 | reason: The reason why the spider was closed. 83 | 爬虫被关闭的原因。 84 | """ 85 | pass 86 | 87 | def log(self, request: Request, spider: Spider): 88 | """ 89 | Log a filtered duplicate request. 90 | 记录被过滤的重复请求。 91 | 92 | This method logs information about duplicate requests based on the 93 | logging settings (info, debug, logdupes). It also increments the 94 | dupefilter/filtered stats counter. 95 | 此方法根据日志设置(info、debug、logdupes)记录有关重复请求的信息。 96 | 它还增加dupefilter/filtered统计计数器。 97 | 98 | Args: 99 | request: The duplicate request that was filtered. 100 | 被过滤的重复请求。 101 | spider: The spider that generated the request. 102 | 生成请求的爬虫。 103 | """ 104 | # Log at INFO level if info is True 105 | # 如果info为True,则在INFO级别记录 106 | if self.info: 107 | logger.info("Filtered duplicate request: %(request)s" % { 108 | 'request': request.meta.get('dupefilter_msg') or request 109 | }) 110 | # Log at DEBUG level if debug is True 111 | # 如果debug为True,则在DEBUG级别记录 112 | elif self.debug: 113 | logger.debug("Filtered duplicate request: %(request)s" % { 114 | 'request': request.meta.get('dupefilter_msg') or request 115 | }) 116 | # Log the first duplicate at DEBUG level and disable further logging 117 | # 在DEBUG级别记录第一个重复项并禁用进一步的日志记录 118 | elif self.logdupes: 119 | msg = ("Filtered duplicate request: %(request)s" 120 | " - no more duplicates will be shown" 121 | " (see DUPEFILTER_DEBUG to show all duplicates)") 122 | logger.debug(msg % {'request': request.meta.get('dupefilter_msg') or request}) 123 | self.logdupes = False 124 | 125 | # Increment the dupefilter/filtered stats counter 126 | # 增加dupefilter/filtered统计计数器 127 | spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider) 128 | 129 | async def done( 130 | self, 131 | request: Request, 132 | done_type: Literal["request_ok", "request_err", "parse_ok", "parse_err"] 133 | ) -> None: 134 | """ 135 | Control the removal of fingerprints based on the done_type status. 136 | 根据done_type的状态控制指纹的移除。 137 | 138 | This method can be implemented by subclasses to handle the removal of 139 | fingerprints from the filter based on the status of the request processing. 140 | 子类可以实现此方法,以根据请求处理的状态处理从过滤器中移除指纹。 141 | 142 | Args: 143 | request: The request that has been processed. 144 | 已处理的请求。 145 | done_type: The status of the request processing. 146 | 请求处理的状态。 147 | Can be one of: "request_ok", "request_err", "parse_ok", "parse_err". 
148 | 可以是以下之一:"request_ok"、"request_err"、"parse_ok"、"parse_err"。 149 | """ 150 | # Default implementation does nothing 151 | # 默认实现不执行任何操作 152 | pass 153 | -------------------------------------------------------------------------------- /aioscrapy/commands/genspider.py: -------------------------------------------------------------------------------- 1 | import os 2 | import shutil 3 | import string 4 | 5 | from importlib import import_module 6 | from os.path import join, dirname, abspath, exists, splitext 7 | 8 | import aioscrapy 9 | from aioscrapy.commands import AioScrapyCommand 10 | from aioscrapy.utils.template import render_templatefile, string_camelcase 11 | from aioscrapy.exceptions import UsageError 12 | 13 | 14 | def sanitize_module_name(module_name): 15 | """Sanitize the given module name, by replacing dashes and points 16 | with underscores and prefixing it with a letter if it doesn't start 17 | with one 18 | """ 19 | module_name = module_name.replace('-', '_').replace('.', '_') 20 | if module_name[0] not in string.ascii_letters: 21 | module_name = "a" + module_name 22 | return module_name 23 | 24 | 25 | class Command(AioScrapyCommand): 26 | 27 | requires_project = False 28 | default_settings = {'LOG_ENABLED': False} 29 | 30 | def syntax(self): 31 | return "[options] " 32 | 33 | def short_desc(self): 34 | return "Generate new spider using pre-defined templates" 35 | 36 | def add_options(self, parser): 37 | AioScrapyCommand.add_options(self, parser) 38 | parser.add_option("-l", "--list", dest="list", action="store_true", 39 | help="List available templates") 40 | parser.add_option("-d", "--dump", dest="dump", metavar="TEMPLATE", 41 | help="Dump template to standard output") 42 | parser.add_option("-t", "--template", dest="template", default="basic", 43 | help="Uses a custom template.") 44 | parser.add_option("--force", dest="force", action="store_true", 45 | help="If the spider already exists, overwrite it with the template") 46 | 47 | def run(self, args, opts): 48 | if opts.list: 49 | self._list_templates() 50 | return 51 | if opts.dump: 52 | template_file = self._find_template(opts.dump) 53 | if template_file: 54 | with open(template_file, "r") as f: 55 | print(f.read()) 56 | return 57 | if not args: 58 | raise UsageError() 59 | 60 | name = args[0] 61 | module = sanitize_module_name(name) 62 | 63 | if self.settings.get('BOT_NAME') == module: 64 | print("Cannot create a spider with the same name as your project") 65 | return 66 | 67 | if not opts.force and self._spider_exists(name): 68 | return 69 | 70 | template_file = self._find_template(opts.template) 71 | if template_file: 72 | self._genspider(module, name, opts.template, template_file) 73 | 74 | def _genspider(self, module, name, template_name, template_file): 75 | """Generate the spider module, based on the given template""" 76 | capitalized_module = ''.join(s.capitalize() for s in module.split('_')) 77 | tvars = { 78 | 'project_name': self.settings.get('BOT_NAME'), 79 | 'ProjectName': string_camelcase(self.settings.get('BOT_NAME')), 80 | 'module': module, 81 | 'name': name, 82 | 'classname': f'{capitalized_module}Spider' 83 | } 84 | if self.settings.get('NEWSPIDER_MODULE'): 85 | spiders_module = import_module(self.settings['NEWSPIDER_MODULE']) 86 | spiders_dir = abspath(dirname(spiders_module.__file__)) 87 | else: 88 | spiders_module = None 89 | spiders_dir = "." 
90 | spider_file = f"{join(spiders_dir, module)}.py" 91 | shutil.copyfile(template_file, spider_file) 92 | render_templatefile(spider_file, **tvars) 93 | print(f"Created spider {name!r} using template {template_name!r} ", 94 | end=('' if spiders_module else '\n')) 95 | if spiders_module: 96 | print(f"in module:\n {spiders_module.__name__}.{module}") 97 | 98 | def _find_template(self, template): 99 | template_file = join(self.templates_dir, f'{template}.tmpl') 100 | if exists(template_file): 101 | return template_file 102 | print(f"Unable to find template: {template}\n") 103 | print('Use "aioscrapy genspider --list" to see all available templates.') 104 | 105 | def _list_templates(self): 106 | print("Available templates:") 107 | for filename in sorted(os.listdir(self.templates_dir)): 108 | if filename.endswith('.tmpl'): 109 | print(f" {splitext(filename)[0]}") 110 | 111 | def _spider_exists(self, name): 112 | if not self.settings.get('NEWSPIDER_MODULE'): 113 | # if run as a standalone command and file with same filename already exists 114 | if exists(name + ".py"): 115 | print(f"{abspath(name + '.py')} already exists") 116 | return True 117 | return False 118 | 119 | try: 120 | spidercls = self.crawler_process.spider_loader.load(name) 121 | except KeyError: 122 | pass 123 | else: 124 | # if spider with same name exists 125 | print(f"Spider {name!r} already exists in module:") 126 | print(f" {spidercls.__module__}") 127 | return True 128 | 129 | # a file with the same name exists in the target directory 130 | spiders_module = import_module(self.settings['NEWSPIDER_MODULE']) 131 | spiders_dir = dirname(spiders_module.__file__) 132 | spiders_dir_abs = abspath(spiders_dir) 133 | if exists(join(spiders_dir_abs, name + ".py")): 134 | print(f"{join(spiders_dir_abs, (name + '.py'))} already exists") 135 | return True 136 | 137 | return False 138 | 139 | @property 140 | def templates_dir(self): 141 | return join( 142 | self.settings['TEMPLATES_DIR'] or join(aioscrapy.__path__[0], 'templates'), 143 | 'spiders' 144 | ) 145 | --------------------------------------------------------------------------------
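The get_base_url and get_meta_refresh helpers from aioscrapy/utils/response.py are typically called from inside a spider callback. The sketch below shows one plausible way to use them; the MetaRefreshSpider class, its name and its start_urls are hypothetical and exist only for illustration, while Request, Spider, logger and the two helper functions come from the code listed above.

from aioscrapy import Request, Spider, logger
from aioscrapy.utils.response import get_base_url, get_meta_refresh

class MetaRefreshSpider(Spider):
    name = 'MetaRefreshSpider'            # hypothetical spider, for illustration only
    start_urls = ['https://example.com/']

    async def parse(self, response):
        # href of the <base> tag when present, otherwise response.url
        base_url = get_base_url(response)
        logger.info(f'base url: {base_url}')

        # (None, None) when the page carries no meta refresh directive
        delay, refresh_url = get_meta_refresh(response)
        if refresh_url:
            # Follow the meta-refresh target manually.
            yield Request(refresh_url, callback=self.parse, dont_filter=True)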
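The JA3 randomization performed by TLSCiphersMiddleware.process_request comes down to shuffling an OpenSSL cipher string. The following self-contained sketch reproduces that idea without importing aioscrapy; in a real project the middleware reads DOWNLOADER_CLIENT_TLS_CIPHERS and RANDOM_TLS_CIPHERS from the settings, and whether it runs at all depends on how the downloader middlewares are configured.

import random

# Same default suite string as ORIGIN_CIPHERS in aioscrapy/libs/downloader/ja3fingerprint.py.
ORIGIN_CIPHERS = ('ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:'
                  'DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:!aNULL:!eNULL:!MD5')

def randomized_ciphers(ciphers: str = ORIGIN_CIPHERS) -> str:
    """Return the same cipher suites in a shuffled order."""
    parts = ciphers.split(':')
    random.shuffle(parts)   # reordering the suites changes the resulting JA3 fingerprint
    return ':'.join(parts)

if __name__ == '__main__':
    # Each run prints a different ordering; the middleware stores its result in
    # request.meta['TLS_CIPHERS'] for the download handler to apply.
    print(randomized_ciphers())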
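DupeFilterBase only fixes the interface: a concrete filter has to implement from_crawler, request_seen and close, and should expose the info, debug and logdupes flags that DupeFilterBase.log reads. Below is a minimal in-memory sketch keyed on request.url purely for illustration; it is not the filter shipped with aioscrapy, and the DUPEFILTER_DEBUG lookup simply mirrors the setting referenced in the log message above.

from aioscrapy.dupefilters import DupeFilterBase

class MemoryDupeFilter(DupeFilterBase):
    """Illustrative in-memory duplicate filter (sketch, not the built-in implementation)."""

    def __init__(self, debug: bool = False):
        self.fingerprints = set()
        # Flags consumed by DupeFilterBase.log()
        self.info = False
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_crawler(cls, crawler):
        return cls(debug=crawler.settings.get('DUPEFILTER_DEBUG', False))

    async def request_seen(self, request) -> bool:
        fp = request.url  # simplistic fingerprint, good enough for a sketch
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

    async def close(self, reason: str = '') -> None:
        self.fingerprints.clear()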