├── .gitignore
├── README.md
├── pronhubSpider
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── spiders
│   │   ├── __init__.py
│   │   └── pronhub.py
│   └── userAgents.py
├── quickstart.py
├── requirements.txt
└── scrapy.cfg
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *.DS_Store
3 | *.log
4 | *.out
5 | *.mp4
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # pronhubSpider
2 | 
3 | An imitation of the WebHubBot project for crawling Pornhub. It is still far too slow; suggestions for speeding it up are welcome.
4 | 
5 | ![][py3x] [![GitHub forks][forks]][network] [![GitHub stars][stars]][stargazers] [![GitHub license][license]][lic_file] ![issues][issues_img]
6 | > Disclaimer: This project is intended for studying the Scrapy spider framework and the MongoDB database. It must not be used for commercial or other personal purposes; any improper use is the sole responsibility of the individual.
7 | 
8 | * The project crawls Pornhub, one of the largest adult video sites in the world, and retrieves each video's title, duration, mp4 file, cover URL and direct page URL.
9 | * This project crawls PornHub.com slowly, but with a simple structure.
10 | * In principle the project can crawl up to 5 million videos per day (in practice it does not reach this), depending on your network. Because of my slow bandwidth my results are relatively slow.
11 | * The crawler issues 10 concurrent requests at a time, which is how the speed mentioned above would be achieved. If your network is more performant you can raise the concurrency and crawl a larger number of videos per day. For the specific configuration see [pre-boot configuration].
12 | 
13 | 
14 | ## Environment, Architecture
15 | 
16 | Language: Python 3.6
17 | 
18 | Environment: Ubuntu, 4 GB RAM
19 | 
20 | Database: MongoDB
21 | 
22 | * Mainly uses the Scrapy crawling framework.
23 | * Each request is given a random Cookie, User-Agent and Tor IP drawn from the corresponding pools. (If you are not in China, Tor is not needed.)
24 | * start_requests issues one request per site category configured in the settings and crawls those categories.
25 | * Paginated listings are supported; follow-up pages are added to the crawl queue.
26 | 
27 | ## Instructions for use
28 | 
29 | ### Pre-boot configuration
30 | 
31 | * Install MongoDB and start it; no extra configuration is needed.
32 | * Install the Python dependencies (Scrapy, pymongo, requests, scrapy-splash), or run `pip install -r requirements.txt`.
33 | * Adjust the configuration as needed, e.g. the download delay, the number of concurrent requests, etc.
34 | * Install [Splash] and run it with Docker (for example `docker run -p 8050:8050 scrapinghub/splash`).
35 | 
36 | ### Start up
37 | 
38 | * cd pronhubSpider
39 | * python quickstart.py
40 | 
41 | 
42 | 
43 | ## Database description
44 | 
45 | The collection in the database that holds the data is PhRes. The following is a field description:
46 | 
47 | #### PhRes table:
48 | 
49 | video_title: The title of the video.
50 | link_url: Link to the video page on the site (used as the unique key)
51 | image_url: Video cover image link
52 | video_duration: The length of the video, in seconds
53 | quality_480p: Video 480p mp4 download path
54 | 
55 | Suggestions for improvements and bug reports are welcome; email, issues and pull requests are all fine.
56 | 
57 | [py3x]: https://img.shields.io/badge/python-3.x-brightgreen.svg
58 | [issues_img]: https://img.shields.io/github/issues/liazylee/pronhubSpider.svg
59 | [issues]: https://github.com/liazylee/pronhubSpider/issues
60 | [Splash]: https://splash.readthedocs.io/en/stable/install.html
61 | [forks]: https://img.shields.io/github/forks/liazylee/pronhubSpider.svg
62 | [network]: https://github.com/liazylee/pronhubSpider
63 | 
64 | [stars]: https://img.shields.io/github/stars/liazylee/pronhubSpider.svg
65 | [stargazers]: https://github.com/liazylee/pronhubSpider/stargazers
66 | 
67 | [license]: https://img.shields.io/badge/license-MIT-blue.svg
68 | [lic_file]: https://raw.githubusercontent.com/liazylee/pronhubSpider/master/LICENSE
69 | 
--------------------------------------------------------------------------------
/pronhubSpider/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liazylee/pronhubSpider/bd79aa78499769a95a59ade5fced9a323c2cc2bc/pronhubSpider/__init__.py
--------------------------------------------------------------------------------
/pronhubSpider/items.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Define here the models for your scraped items
4 | #
5 | # See documentation in:
6 | # https://doc.scrapy.org/en/latest/topics/items.html
7 | 
8 | from scrapy import Item, Field
9 | 
10 | 
11 | class PronhubspiderItem(Item):
12 |     # Fields collected for every video.
13 |     video_title = Field()
14 |     image_url = Field()
15 |     video_duration = Field()
16 |     quality_480p = Field()
17 |     video_views = Field()
18 |     video_rating = Field()
19 |     link_url = Field()
20 | 
--------------------------------------------------------------------------------
/pronhubSpider/middlewares.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Define here the models for your spider middleware
4 | #
5 | # See documentation in:
6 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
7 | import json
8 | import random
9 | 
10 | from scrapy import signals
11 | 
12 | from pronhubSpider import settings
13 | from pronhubSpider.userAgents import agents
14 | 
15 | 
16 | class PronhubspiderSpiderMiddleware(object):
17 |     # Not all methods need to be defined. If a method is not defined,
18 |     # scrapy acts as if the spider middleware does not modify the
19 |     # passed objects.
20 | 
21 |     @classmethod
22 |     def from_crawler(cls, crawler):
23 |         # This method is used by Scrapy to create your spiders.
24 |         s = cls()
25 |         crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
26 |         return s
27 | 
28 |     def process_spider_input(self, response, spider):
29 |         # Called for each response that goes through the spider
30 |         # middleware and into the spider.
31 | 
32 |         # Should return None or raise an exception.
33 |         return None
34 | 
35 |     def process_spider_output(self, response, result, spider):
36 |         # Called with the results returned from the Spider, after
37 |         # it has processed the response.
38 | 
39 |         # Must return an iterable of Request, dict or Item objects.
40 |         for i in result:
41 |             yield i
42 | 
43 |     def process_spider_exception(self, response, exception, spider):
44 |         # Called when a spider or process_spider_input() method
45 |         # (from other spider middleware) raises an exception.
46 | 
47 |         # Should return either None or an iterable of Response, dict
48 |         # or Item objects.
49 |         pass
50 | 
51 |     def process_start_requests(self, start_requests, spider):
52 |         # Called with the start requests of the spider, and works
53 |         # similarly to the process_spider_output() method, except
54 |         # that it doesn’t have a response associated.
55 | 
56 |         # Must return only requests (not items).
57 |         for r in start_requests:
58 |             yield r
59 | 
60 |     def spider_opened(self, spider):
61 |         spider.logger.info('Spider opened: %s' % spider.name)
62 | 
63 | 
64 | class PronhubspiderDownloaderMiddleware(object):
65 |     # Not all methods need to be defined. If a method is not defined,
66 |     # scrapy acts as if the downloader middleware does not modify the
67 |     # passed objects.
68 | 
69 |     @classmethod
70 |     def from_crawler(cls, crawler):
71 |         # This method is used by Scrapy to create your spiders.
72 |         s = cls()
73 |         crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
74 |         return s
75 | 
76 |     def process_request(self, request, spider):
77 |         # Called for each request that goes through the downloader
78 |         # middleware.
79 | 
80 |         # Must either:
81 |         # - return None: continue processing this request
82 |         # - or return a Response object
83 |         # - or return a Request object
84 |         # - or raise IgnoreRequest: process_exception() methods of
85 |         #   installed downloader middleware will be called
86 |         return None
87 | 
88 |     def process_response(self, request, response, spider):
89 |         # Called with the response returned from the downloader.
90 | 
91 |         # Must either;
92 |         # - return a Response object
93 |         # - return a Request object
94 |         # - or raise IgnoreRequest
95 |         return response
96 | 
97 |     def process_exception(self, request, exception, spider):
98 |         # Called when a download handler or a process_request()
99 |         # (from other downloader middleware) raises an exception.
100 | 
101 |         # Must either:
102 |         # - return None: continue processing this exception
103 |         # - return a Response object: stops process_exception() chain
104 |         # - return a Request object: stops process_exception() chain
105 |         pass
106 | 
107 |     def spider_opened(self, spider):
108 |         spider.logger.info('Spider opened: %s' % spider.name)
109 | 
110 | 
111 | class ProxyMiddlewares(object):
112 | 
113 |     def process_request(self, request, spider):
114 |         request.meta['proxy'] = settings.PROXY
115 | 
116 | 
117 | class CookiesMiddlewares(object):
118 |     """
119 |     Attach a fixed cookie set to every request, with a randomly generated 'bs' value.
120 |     """
121 |     cookie = {
122 |         'platform': 'pc',
123 |         'ss': '367701188698225489',
124 |         'bs': '%s',
125 |         'RNLBSERVERID': 'ded6699',
126 |         'FastPopSessionRequestNumber': '1',
127 |         'FPSRN': '1',
128 |         'performance_timing': 'home',
129 |         'RNKEY': '40859743*68067497:1190152786:3363277230:1'
130 |     }
131 | 
132 |     def process_request(self, request, spider):
133 |         bs = ''
134 |         for i in range(32):
135 |             bs += chr(random.randint(97, 122))
136 |         _cookie = json.dumps(self.cookie) % bs
137 |         request.cookies = json.loads(_cookie)
138 | 
139 | 
140 | class UserAgentMiddlewares(object):
141 |     """
142 |     Set a random User-Agent header on every request.
143 |     """
144 | 
145 |     def process_request(self, request, spider):
146 |         agent = random.choice(agents)
147 |         request.headers['User-Agent'] = agent
148 | 
149 | 
150 | 
--------------------------------------------------------------------------------
/pronhubSpider/pipelines.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Define your item pipelines here
4 | #
5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting
6 | # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
7 | import pymongo
8 | import scrapy
9 | from pymongo import IndexModel
10 | from scrapy.exceptions import DropItem
11 | from scrapy.pipelines.files import FilesPipeline
12 | 
13 | from pronhubSpider.items import PronhubspiderItem
14 | 
15 | 
16 | class OwnFilePipeline(FilesPipeline):
17 |     # Downloads the 480p mp4 and stores it under a flat file name derived from the URL.
18 |     def file_path(self, request, response=None, info=None):
19 |         url = request.url
20 |         path = url.split('?')[0].split(':')[1].replace('/', '')
21 |         return path
22 | 
23 |     def get_media_requests(self, item, info):
24 |         yield scrapy.Request(item['quality_480p'])
25 | 
26 |     def item_completed(self, results, item, info):
27 |         video_paths = [x['path'] for ok, x in results if ok]
28 |         if not video_paths:
29 |             raise DropItem('Item fail')
30 |         item['quality_480p'] = video_paths
31 |         return item
32 | 
33 | 
34 | class PronhubspiderPipeline(object):
35 |     def __init__(self):
36 |         client = pymongo.MongoClient("127.0.0.1", 27017)
37 |         db = client["PornHub"]
38 |         self.PhRes = db["PhRes"]
39 |         idx = IndexModel([('link_url', pymongo.ASCENDING)], unique=True)
40 |         self.PhRes.create_indexes([idx])
41 |         # if your existing DB has duplicate records, refer to:
42 |         # https://stackoverflow.com/questions/35707496/remove-duplicate-in-mongodb/35711737
43 | 
44 |     def process_item(self, item, spider):
45 |         # print 'MongoDBItem'
46 |         """ Check the item type and store it in MongoDB. """
47 |         if isinstance(item, PronhubspiderItem):
48 |             # print 'PornVideoItem True'
49 |             try:
50 |                 self.PhRes.update_one({'link_url': item['link_url']}, {'$set': dict(item)}, upsert=True)
51 |             except Exception as e:
52 |                 print(e)
53 |         return item
54 | 
--------------------------------------------------------------------------------
/pronhubSpider/settings.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Scrapy settings for pronhubSpider project
4 | #
5 | # For simplicity, this file contains only settings considered important or
6 | # commonly used. You can find more settings consulting the documentation:
7 | #
8 | #     https://doc.scrapy.org/en/latest/topics/settings.html
9 | #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
10 | #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
11 | 
12 | BOT_NAME = 'pronhubSpider'
13 | 
14 | SPIDER_MODULES = ['pronhubSpider.spiders']
15 | NEWSPIDER_MODULE = 'pronhubSpider.spiders'
16 | 
17 | # Crawl responsibly by identifying yourself (and your website) on the user-agent
18 | # USER_AGENT = 'pronhubSpider (+http://www.yourdomain.com)'
19 | 
20 | # Obey robots.txt rules
21 | ROBOTSTXT_OBEY = False
22 | 
23 | # Configure maximum concurrent requests performed by Scrapy (default: 16)
24 | CONCURRENT_REQUESTS = 2
25 | 
26 | # Configure a delay for requests for the same website (default: 0)
27 | # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
28 | # See also autothrottle settings and docs
29 | DOWNLOAD_DELAY = 1
30 | # The download delay setting will honor only one of:
31 | # CONCURRENT_REQUESTS_PER_DOMAIN = 16
32 | # CONCURRENT_REQUESTS_PER_IP = 16
33 | 
34 | # Disable cookies (enabled by default)
35 | # COOKIES_ENABLED = False
36 | REDIRECT_ENABLED = False
37 | # Disable Telnet Console (enabled by default)
38 | # TELNETCONSOLE_ENABLED = False
39 | 
40 | # Override the default request headers:
41 | # DEFAULT_REQUEST_HEADERS = {
42 | #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
43 | #   'Accept-Language': 'en',
44 | # }
45 | 
46 | # Enable or disable spider middlewares
47 | # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
48 | SPIDER_MIDDLEWARES = {
49 |     # 'pronhubSpider.middlewares.PronhubspiderSpiderMiddleware': 543,
50 |     # scrapy-splash's argument-deduplication filter is a spider middleware, not an item pipeline
51 |     'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
52 | }
53 | 
54 | # Enable or disable downloader middlewares
55 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
56 | DOWNLOADER_MIDDLEWARES = {
57 |     'pronhubSpider.middlewares.PronhubspiderDownloaderMiddleware': 543,
58 |     'scrapy_splash.SplashCookiesMiddleware': 723,
59 |     'scrapy_splash.SplashMiddleware': 725,
60 |     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
61 |     # 'pronhubSpider.middlewares.ProxyMiddlewares': 400,  # enable this when running from a server inside mainland China
62 |     'pronhubSpider.middlewares.UserAgentMiddlewares': 401,
63 |     'pronhubSpider.middlewares.CookiesMiddlewares': 402
64 | }
65 | 
66 | # Enable or disable extensions
67 | # See https://doc.scrapy.org/en/latest/topics/extensions.html
68 | # EXTENSIONS = {
69 | #    'scrapy.extensions.telnet.TelnetConsole': None,
70 | # }
71 | 
72 | # Configure item pipelines
73 | # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
74 | ITEM_PIPELINES = {
75 |     # OwnFilePipeline subclasses FilesPipeline, so the base FilesPipeline is not registered separately
76 |     'pronhubSpider.pipelines.OwnFilePipeline': 2,
77 |     'pronhubSpider.pipelines.PronhubspiderPipeline': 300,
78 | }
79 | 
80 | # Enable and configure the AutoThrottle extension (disabled by default)
81 | # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
82 | # AUTOTHROTTLE_ENABLED = True
83 | # The initial download delay
84 | # AUTOTHROTTLE_START_DELAY = 5
85 | # The maximum download delay to be set in case of high latencies
86 | # AUTOTHROTTLE_MAX_DELAY = 60
87 | # The average number of requests Scrapy should be sending in parallel to
88 | # each remote server
89 | # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
90 | # Enable showing throttling stats for every response received:
91 | # AUTOTHROTTLE_DEBUG = False
92 | 
93 | # Enable and configure HTTP caching (disabled by default)
94 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
95 | # HTTPCACHE_ENABLED = True
96 | # HTTPCACHE_EXPIRATION_SECS = 0
97 | # HTTPCACHE_DIR = 'httpcache'
98 | # HTTPCACHE_IGNORE_HTTP_CODES = []
99 | # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
100 | PHTYPE = ['video?c=1',
101 |           # 'recommended',
102 |           # 'video?o=ht',  # hot
103 |           # 'video?o=mv',  # Most Viewed
104 |           # 'video?o=tr',  # Top Rate
105 | 
106 |           # Examples of certain categories
107 |           # 'video?c=1',  # Category = Asian
108 |           # 'video?c=111',  # Category = Japanese
109 |           ]
110 | SPLASH_URL = 'http://127.0.0.1:8050/'  # for a server outside mainland China
111 | SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
112 | SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
113 | DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
114 | HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
115 | 
116 | PROXY = 'http://127.0.0.1:8118'  # this setup needs tor, privoxy or ssr; see https://blog.csdn.net/liazylee/article/details/93078175 for details
117 | FILES_STORE = './video'
118 | FILES_URLS_FIELD = 'field_name_for_your_files_urls'  # unused: OwnFilePipeline.get_media_requests reads item['quality_480p'] directly
119 | FILES_EXPIRES = 90
120 | DOWNLOAD_MAXSIZE = 1073741824
121 | DOWNLOAD_WARNSIZE = 1073741824
122 | DOWNLOAD_FAIL_ON_DATALOSS = False
123 | DOWNLOAD_TIMEOUT = 1800
124 | 
--------------------------------------------------------------------------------
/pronhubSpider/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 | 
--------------------------------------------------------------------------------
/pronhubSpider/spiders/pronhub.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding=utf-8
3 | import logging
4 | import re
5 | 
6 | from scrapy import Selector
7 | from scrapy_splash import SplashRequest
8 | 
9 | from pronhubSpider import settings
10 | from pronhubSpider.items import PronhubspiderItem
11 | 
12 | __auther__ = 'liazylee'
13 | # connect='li233111@gmail.com'
14 | # @Time : 6/20/19 3:29 PM
15 | # @FileName: pronhub.py
16 | # @Software: PyCharm
17 | # @project: pronhubSpider
18 | 
19 | from scrapy.spiders import CrawlSpider
20 | 
21 | # Lua script for Splash: load the page and return its flashvars object (the video metadata) as 'q480'.
22 | script = """
23 | function main(splash, args)
24 |     splash:autoload([[
25 |         function get_video(){
26 | 
27 |             return flashvars;
28 |         }
29 |     ]])
30 |     assert(splash:go(args.url))
31 |     return {
32 |         q480=splash:evaljs("get_video()"),
33 | 
34 |     }
35 | end
36 | """
37 | 
38 | 
39 | class Pronhub(CrawlSpider):
40 |     name = 'pronhub'
41 | 
42 |     logging.getLogger('requests').setLevel(logging.WARNING)
43 |     logging.basicConfig(level=logging.DEBUG,
44 |                         format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
45 |                         datefmt='%a, %d %b %Y %H:%M:%S',
46 |                         filename='tail_log.log',
47 |                         filemode='w')
48 | 
49 |     def start_requests(self):
50 |         for ph_type in settings.PHTYPE:
51 |             yield SplashRequest(url='https://www.pornhub.com/{}'.format(ph_type), callback=self.parse, args={'wait': 1})
52 | 
53 |     def parse(self, response):
54 |         selector = Selector(response)
55 |         logging.debug('request url ---->' + response.url)
56 |         response_selectors = selector.xpath('//div[@class="phimage"]')
57 |         for response_selector in response_selectors:
58 |             view_key = re.findall('viewkey=(.*?)"', response_selector.extract())
59 |             if not view_key:  # skip thumbnails that carry no viewkey
60 |                 continue
61 |             yield SplashRequest(url='https://www.pornhub.com/embed/{}'.format(view_key[0]),
62 |                                 callback=self.parse_key,
63 |                                 endpoint='execute',
64 |                                 args={'lua_source': script})
65 |         # selector_next_url=response.css('head > link:nth-child(42)')
66 |         next_url = selector.xpath('//a[@class="orangeButton" and text()="Next "]/@href').extract_first()
67 |         logging.debug('next_url:----->' + str(next_url))
68 |         if next_url:
69 |             yield SplashRequest(url='https://www.pornhub.com{}'.format(next_url), callback=self.parse, args={'wait': 1})
70 |         pass
71 | 
72 |     def parse_key(self, response):
73 |         phItem = PronhubspiderItem()
74 |         selector = response.data['q480']
75 |         # logging.info(selector)
76 |         # _ph_info = re.findall('var flashvars =(.*?),\n', selector)
77 |         # logging.debug('PH info JSON:')
78 |         # logging.debug(_ph_info)
79 |         # _ph_info_json = json.loads(_ph_info[0])
80 |         duration = selector.get('video_duration')
81 |         phItem['video_duration'] = duration
82 |         title = selector.get('video_title')
83 |         phItem['video_title'] = title
84 |         image_url = selector.get('image_url')
85 |         phItem['image_url'] = image_url
86 |         link_url = selector.get('link_url')
87 |         phItem['link_url'] = link_url
88 |         quality_480p = selector.get('mediaDefinitions')[0].get('videoUrl')
89 |         # quality_480p = _ph_info_json.get('quality_480p')
90 |         phItem['quality_480p'] = quality_480p
91 |         # logging.info('duration:' + duration + ' title:' + title + ' image_url:'
92 |         #              + image_url + ' link_url:' + link_url + ' quality_480p:' + quality_480p)
93 |         # pprint.pprint(phItem)
94 |         yield phItem
95 | 
--------------------------------------------------------------------------------
/pronhubSpider/userAgents.py:
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding=utf-8 3 | __auther__ = 'liazylee' 4 | # connect='li233111@gmail.com' 5 | # @Time : 6/20/19 4:35 PM 6 | # @FileName: userAgents.py 7 | # @Software: PyCharm 8 | # @project: pronhubSpider 9 | 10 | agents = [ 11 | "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 12 | "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)", 13 | "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5", 14 | "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9", 15 | "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7", 16 | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14", 17 | "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14", 18 | "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20", 19 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27", 20 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1", 21 | "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2", 22 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7", 23 | "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre", 24 | "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10", 25 | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)", 26 | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5", 27 | "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)", 28 | "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", 29 | "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", 30 | "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0", 31 | "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2", 32 | "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1", 33 | "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre", 34 | "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )", 35 | "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)", 36 | "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a", 37 | "Mozilla/2.02E (Win95; U)", 38 | "Mozilla/3.01Gold (Win95; I)", 39 | "Mozilla/4.8 [en] (Windows NT 5.1; U)", 40 | "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)", 41 | "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 42 | "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0", 43 | "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) 
AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 44 | "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 45 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 46 | "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 47 | "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 48 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 49 | "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 50 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 51 | "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 52 | "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3", 53 | "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 54 | "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", 55 | "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1", 56 | "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 57 | "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 58 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 59 | "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 60 | "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 61 | "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", 62 | "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3", 63 | "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", 64 | "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 65 | "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 66 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 67 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, 
like Gecko) Version/4.0 Mobile Safari/533.1",
68 |     "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
69 |     "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
70 |     "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
71 |     "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
72 |     "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
73 | ]
--------------------------------------------------------------------------------
/quickstart.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding=utf-8
3 | __auther__ = 'liazylee'
4 | # connect='li233111@gmail.com'
5 | # @Time : 6/20/19 3:37 PM
6 | # @FileName: quickstart.py
7 | # @Software: PyCharm
8 | # @project: pronhubSpider
9 | 
10 | 
11 | from scrapy import cmdline
12 | 
13 | cmdline.execute('scrapy crawl pronhub'.split())
14 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | asn1crypto==0.24.0
2 | attrs==19.1.0
3 | Automat==0.7.0
4 | cffi==1.12.3
5 | constantly==15.1.0
6 | cryptography==2.7
7 | cssselect==1.0.3
8 | hyperlink==19.0.0
9 | idna==2.8
10 | incremental==17.5.0
11 | lxml==4.9.1
12 | parsel==1.5.1
13 | pyasn1==0.4.5
14 | pyasn1-modules==0.2.5
15 | pycparser==2.19
16 | PyDispatcher==2.0.5
17 | PyHamcrest==1.9.0
18 | pymongo==3.8.0
19 | pyOpenSSL==19.0.0
20 | queuelib==1.5.0
21 | Scrapy==2.6.2
22 | scrapy-splash==0.7.2
23 | service-identity==18.1.0
24 | six==1.12.0
25 | Twisted==22.10.0
26 | w3lib==1.20.0
27 | zope.interface==4.6.0
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
1 | # Automatically created by: scrapy startproject
2 | #
3 | # For more information about the [deploy] section see:
4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html
5 | 
6 | [settings]
7 | default = pronhubSpider.settings
8 | 
9 | [deploy]
10 | #url = http://localhost:6800/
11 | project = pronhubSpider
12 | 
--------------------------------------------------------------------------------