├── .gitignore
├── README.md
├── pronhubSpider
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── spiders
│   │   ├── __init__.py
│   │   └── pronhub.py
│   └── userAgents.py
├── quickstart.py
├── requirements.txt
└── scrapy.cfg
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *.DS_Store
3 | *.log
4 | *.out
5 | *.mp4
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # pronhubSpider
2 | 
3 | An imitation of the WebHubBot project for crawling Pornhub. It is still far too slow; suggestions for speeding it up are welcome.
4 | 
5 | ![][py3x] [![GitHub forks][forks]][network] [![GitHub stars][stars]][stargazers] [![GitHub license][license]][lic_file] ![issues][issues_img]
6 | > Disclaimer: This project is intended for studying the Scrapy spider framework and the MongoDB database. It must not be used for commercial or other personal purposes; any improper use is the sole responsibility of the individual.
7 | 
8 | * The project crawls Pornhub, one of the largest adult video sites in the world, and retrieves each video's title, duration, mp4 file, cover URL and direct page URL.
9 | * This project crawls PornHub.com slowly, but with a simple structure.
10 | * In principle the project can crawl up to 5 million videos per day (in practice it does not reach this), depending on your network. Because of my slow bandwidth my results are relatively slow.
11 | * The crawler issues 10 concurrent requests at a time, which is how the speed mentioned above would be achieved. If your network is more performant you can raise the concurrency and crawl a larger number of videos per day. For the specific configuration see [pre-boot configuration].
12 | 
13 | 
14 | ## Environment, Architecture
15 | 
16 | Language: Python 3.6
17 | 
18 | Environment: Ubuntu, 4 GB RAM
19 | 
20 | Database: MongoDB
21 | 
22 | * Mainly uses the Scrapy crawling framework.
23 | * Each request is given a random Cookie, User-Agent and Tor IP drawn from the corresponding pools. (If you are not in China, Tor is not needed.)
24 | * start_requests issues one request per site category configured in the settings and crawls those categories.
25 | * Paginated listings are supported; follow-up pages are added to the crawl queue.
26 | 
27 | ## Instructions for use
28 | 
29 | ### Pre-boot configuration
30 | 
31 | * Install MongoDB and start it; no extra configuration is needed.
32 | * Install the Python dependencies (Scrapy, pymongo, requests, scrapy-splash), or run `pip install -r requirements.txt`.
33 | * Adjust the configuration as needed, e.g. the download delay, the number of concurrent requests, etc.
34 | * Install [Splash] and run it with Docker (for example `docker run -p 8050:8050 scrapinghub/splash`).
35 | 
36 | ### Start up
37 | 
38 | * cd pronhubSpider
39 | * python quickstart.py
40 | 
41 | 
42 | 
43 | ## Database description
44 | 
45 | The collection in the database that holds the data is PhRes. The following is a field description:
46 | 
47 | #### PhRes table:
48 | 
49 | video_title: The title of the video.
50 | link_url: Link to the video page on the site (used as the unique key)
51 | image_url: Video cover image link
52 | video_duration: The length of the video, in seconds
53 | quality_480p: Video 480p mp4 download path
54 | 
55 | Suggestions for improvements and bug reports are welcome; email, issues and pull requests are all fine.
56 | 
57 | [py3x]: https://img.shields.io/badge/python-3.x-brightgreen.svg
58 | [issues_img]: https://img.shields.io/github/issues/liazylee/pronhubSpider.svg
59 | [issues]: https://github.com/liazylee/pronhubSpider/issues
60 | [Splash]: https://splash.readthedocs.io/en/stable/install.html
61 | [forks]: https://img.shields.io/github/forks/liazylee/pronhubSpider.svg
62 | [network]: https://github.com/liazylee/pronhubSpider
63 | 
64 | [stars]: https://img.shields.io/github/stars/liazylee/pronhubSpider.svg
65 | [stargazers]: https://github.com/liazylee/pronhubSpider/stargazers
66 | 
67 | [license]: https://img.shields.io/badge/license-MIT-blue.svg
68 | [lic_file]: https://raw.githubusercontent.com/liazylee/pronhubSpider/master/LICENSE
69 | 
--------------------------------------------------------------------------------
/pronhubSpider/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liazylee/pronhubSpider/bd79aa78499769a95a59ade5fced9a323c2cc2bc/pronhubSpider/__init__.py
--------------------------------------------------------------------------------
/pronhubSpider/items.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Define here the models for your scraped items
4 | #
5 | # See documentation in:
6 | # https://doc.scrapy.org/en/latest/topics/items.html
7 | 
8 | from scrapy import Item, Field
9 | 
10 | 
11 | class PronhubspiderItem(Item):
12 |     # Fields collected for every video.
13 |     video_title = Field()
14 |     image_url = Field()
15 |     video_duration = Field()
16 |     quality_480p = Field()
17 |     video_views = Field()
18 |     video_rating = Field()
19 |     link_url = Field()
20 | 
--------------------------------------------------------------------------------
/pronhubSpider/middlewares.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Define here the models for your spider middleware
4 | #
5 | # See documentation in:
6 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
7 | import json
8 | import random
9 | 
10 | from scrapy import signals
11 | 
12 | from pronhubSpider import settings
13 | from pronhubSpider.userAgents import agents
14 | 
15 | 
16 | class PronhubspiderSpiderMiddleware(object):
17 |     # Not all methods need to be defined. If a method is not defined,
18 |     # scrapy acts as if the spider middleware does not modify the
19 |     # passed objects.
20 | 
21 |     @classmethod
22 |     def from_crawler(cls, crawler):
23 |         # This method is used by Scrapy to create your spiders.
24 |         s = cls()
25 |         crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
26 |         return s
27 | 
28 |     def process_spider_input(self, response, spider):
29 |         # Called for each response that goes through the spider
30 |         # middleware and into the spider.
31 | 
32 |         # Should return None or raise an exception.
33 |         return None
34 | 
35 |     def process_spider_output(self, response, result, spider):
36 |         # Called with the results returned from the Spider, after
37 |         # it has processed the response.
38 | 
39 |         # Must return an iterable of Request, dict or Item objects.
40 |         for i in result:
41 |             yield i
42 | 
43 |     def process_spider_exception(self, response, exception, spider):
44 |         # Called when a spider or process_spider_input() method
45 |         # (from other spider middleware) raises an exception.
46 | 
47 |         # Should return either None or an iterable of Response, dict
48 |         # or Item objects.
49 |         pass
50 | 
51 |     def process_start_requests(self, start_requests, spider):
52 |         # Called with the start requests of the spider, and works
53 |         # similarly to the process_spider_output() method, except
54 |         # that it doesn’t have a response associated.
55 | 
56 |         # Must return only requests (not items).
57 |         for r in start_requests:
58 |             yield r
59 | 
60 |     def spider_opened(self, spider):
61 |         spider.logger.info('Spider opened: %s' % spider.name)
62 | 
63 | 
64 | class PronhubspiderDownloaderMiddleware(object):
65 |     # Not all methods need to be defined. If a method is not defined,
66 |     # scrapy acts as if the downloader middleware does not modify the
67 |     # passed objects.
68 | 
69 |     @classmethod
70 |     def from_crawler(cls, crawler):
71 |         # This method is used by Scrapy to create your spiders.
72 |         s = cls()
73 |         crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
74 |         return s
75 | 
76 |     def process_request(self, request, spider):
77 |         # Called for each request that goes through the downloader
78 |         # middleware.
79 | 
80 |         # Must either:
81 |         # - return None: continue processing this request
82 |         # - or return a Response object
83 |         # - or return a Request object
84 |         # - or raise IgnoreRequest: process_exception() methods of
85 |         #   installed downloader middleware will be called
86 |         return None
87 | 
88 |     def process_response(self, request, response, spider):
89 |         # Called with the response returned from the downloader.
90 | 
91 |         # Must either;
92 |         # - return a Response object
93 |         # - return a Request object
94 |         # - or raise IgnoreRequest
95 |         return response
96 | 
97 |     def process_exception(self, request, exception, spider):
98 |         # Called when a download handler or a process_request()
99 |         # (from other downloader middleware) raises an exception.
100 | 
101 |         # Must either:
102 |         # - return None: continue processing this exception
103 |         # - return a Response object: stops process_exception() chain
104 |         # - return a Request object: stops process_exception() chain
105 |         pass
106 | 
107 |     def spider_opened(self, spider):
108 |         spider.logger.info('Spider opened: %s' % spider.name)
109 | 
110 | 
111 | class ProxyMiddlewares(object):
112 | 
113 |     def process_request(self, request, spider):
114 |         request.meta['proxy'] = settings.PROXY
115 | 
116 | 
117 | class CookiesMiddlewares(object):
118 |     """
119 |     Attach a fixed cookie set to every request, with a randomly generated 'bs' value.
120 |     """
121 |     cookie = {
122 |         'platform': 'pc',
123 |         'ss': '367701188698225489',
124 |         'bs': '%s',
125 |         'RNLBSERVERID': 'ded6699',
126 |         'FastPopSessionRequestNumber': '1',
127 |         'FPSRN': '1',
128 |         'performance_timing': 'home',
129 |         'RNKEY': '40859743*68067497:1190152786:3363277230:1'
130 |     }
131 | 
132 |     def process_request(self, request, spider):
133 |         bs = ''
134 |         for i in range(32):
135 |             bs += chr(random.randint(97, 122))
136 |         _cookie = json.dumps(self.cookie) % bs
137 |         request.cookies = json.loads(_cookie)
138 | 
139 | 
140 | class UserAgentMiddlewares(object):
141 |     """
142 |     Set a random User-Agent header on every request.
143 |     """
144 | 
145 |     def process_request(self, request, spider):
146 |         agent = random.choice(agents)
147 |         request.headers['User-Agent'] = agent
148 | 
149 | 
150 | 
--------------------------------------------------------------------------------
/pronhubSpider/pipelines.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Define your item pipelines here
4 | #
5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting
6 | # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
7 | import pymongo
8 | import scrapy
9 | from pymongo import IndexModel
10 | from scrapy.exceptions import DropItem
11 | from scrapy.pipelines.files import FilesPipeline
12 | 
13 | from pronhubSpider.items import PronhubspiderItem
14 | 
15 | 
16 | class OwnFilePipeline(FilesPipeline):
17 |     # Downloads the 480p mp4 and stores it under a flat file name derived from the URL.
18 |     def file_path(self, request, response=None, info=None):
19 |         url = request.url
20 |         path = url.split('?')[0].split(':')[1].replace('/', '')
21 |         return path
22 | 
23 |     def get_media_requests(self, item, info):
24 |         yield scrapy.Request(item['quality_480p'])
25 | 
26 |     def item_completed(self, results, item, info):
27 |         video_paths = [x['path'] for ok, x in results if ok]
28 |         if not video_paths:
29 |             raise DropItem('Item fail')
30 |         item['quality_480p'] = video_paths
31 |         return item
32 | 
33 | 
34 | class PronhubspiderPipeline(object):
35 |     def __init__(self):
36 |         client = pymongo.MongoClient("127.0.0.1", 27017)
37 |         db = client["PornHub"]
38 |         self.PhRes = db["PhRes"]
39 |         idx = IndexModel([('link_url', pymongo.ASCENDING)], unique=True)
40 |         self.PhRes.create_indexes([idx])
41 |         # if your existing DB has duplicate records, refer to:
42 |         # https://stackoverflow.com/questions/35707496/remove-duplicate-in-mongodb/35711737
43 | 
44 |     def process_item(self, item, spider):
45 |         # print 'MongoDBItem'
46 |         """ Check the item type and store it in MongoDB. """
47 |         if isinstance(item, PronhubspiderItem):
48 |             # print 'PornVideoItem True'
49 |             try:
50 |                 self.PhRes.update_one({'link_url': item['link_url']}, {'$set': dict(item)}, upsert=True)
51 |             except Exception as e:
52 |                 print(e)
53 |         return item
54 | 
--------------------------------------------------------------------------------
/pronhubSpider/settings.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Scrapy settings for pronhubSpider project
4 | #
5 | # For simplicity, this file contains only settings considered important or
6 | # commonly used. You can find more settings consulting the documentation:
7 | #
8 | #     https://doc.scrapy.org/en/latest/topics/settings.html
9 | #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
10 | #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
11 | 
12 | BOT_NAME = 'pronhubSpider'
13 | 
14 | SPIDER_MODULES = ['pronhubSpider.spiders']
15 | NEWSPIDER_MODULE = 'pronhubSpider.spiders'
16 | 
17 | # Crawl responsibly by identifying yourself (and your website) on the user-agent
18 | # USER_AGENT = 'pronhubSpider (+http://www.yourdomain.com)'
19 | 
20 | # Obey robots.txt rules
21 | ROBOTSTXT_OBEY = False
22 | 
23 | # Configure maximum concurrent requests performed by Scrapy (default: 16)
24 | CONCURRENT_REQUESTS = 2
25 | 
26 | # Configure a delay for requests for the same website (default: 0)
27 | # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
28 | # See also autothrottle settings and docs
29 | DOWNLOAD_DELAY = 1
30 | # The download delay setting will honor only one of:
31 | # CONCURRENT_REQUESTS_PER_DOMAIN = 16
32 | # CONCURRENT_REQUESTS_PER_IP = 16
33 | 
34 | # Disable cookies (enabled by default)
35 | # COOKIES_ENABLED = False
36 | REDIRECT_ENABLED = False
37 | # Disable Telnet Console (enabled by default)
38 | # TELNETCONSOLE_ENABLED = False
39 | 
40 | # Override the default request headers:
41 | # DEFAULT_REQUEST_HEADERS = {
42 | #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
43 | #   'Accept-Language': 'en',
44 | # }
45 | 
46 | # Enable or disable spider middlewares
47 | # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
48 | SPIDER_MIDDLEWARES = {
49 |     # 'pronhubSpider.middlewares.PronhubspiderSpiderMiddleware': 543,
50 |     # scrapy-splash's argument-deduplication filter is a spider middleware, not an item pipeline
51 |     'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
52 | }
53 | 
54 | # Enable or disable downloader middlewares
55 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
56 | DOWNLOADER_MIDDLEWARES = {
57 |     'pronhubSpider.middlewares.PronhubspiderDownloaderMiddleware': 543,
58 |     'scrapy_splash.SplashCookiesMiddleware': 723,
59 |     'scrapy_splash.SplashMiddleware': 725,
60 |     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
61 |     # 'pronhubSpider.middlewares.ProxyMiddlewares': 400,  # enable this when running from a server inside mainland China
62 |     'pronhubSpider.middlewares.UserAgentMiddlewares': 401,
63 |     'pronhubSpider.middlewares.CookiesMiddlewares': 402
64 | }
65 | 
66 | # Enable or disable extensions
67 | # See https://doc.scrapy.org/en/latest/topics/extensions.html
68 | # EXTENSIONS = {
69 | #    'scrapy.extensions.telnet.TelnetConsole': None,
70 | # }
71 | 
72 | # Configure item pipelines
73 | # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
74 | ITEM_PIPELINES = {
75 |     # OwnFilePipeline subclasses FilesPipeline, so the base FilesPipeline is not registered separately
76 |     'pronhubSpider.pipelines.OwnFilePipeline': 2,
77 |     'pronhubSpider.pipelines.PronhubspiderPipeline': 300,
78 | }
79 | 
80 | # Enable and configure the AutoThrottle extension (disabled by default)
81 | # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
82 | # AUTOTHROTTLE_ENABLED = True
83 | # The initial download delay
84 | # AUTOTHROTTLE_START_DELAY = 5
85 | # The maximum download delay to be set in case of high latencies
86 | # AUTOTHROTTLE_MAX_DELAY = 60
87 | # The average number of requests Scrapy should be sending in parallel to
88 | # each remote server
89 | # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
90 | # Enable showing throttling stats for every response received:
91 | # AUTOTHROTTLE_DEBUG = False
92 | 
93 | # Enable and configure HTTP caching (disabled by default)
94 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
95 | # HTTPCACHE_ENABLED = True
96 | # HTTPCACHE_EXPIRATION_SECS = 0
97 | # HTTPCACHE_DIR = 'httpcache'
98 | # HTTPCACHE_IGNORE_HTTP_CODES = []
99 | # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
100 | PHTYPE = ['video?c=1',
101 |           # 'recommended',
102 |           # 'video?o=ht',  # hot
103 |           # 'video?o=mv',  # Most Viewed
104 |           # 'video?o=tr',  # Top Rate
105 | 
106 |           # Examples of certain categories
107 |           # 'video?c=1',  # Category = Asian
108 |           # 'video?c=111',  # Category = Japanese
109 |           ]
110 | SPLASH_URL = 'http://127.0.0.1:8050/'  # for a server outside mainland China
111 | SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
112 | SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
113 | DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
114 | HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
115 | 
116 | PROXY = 'http://127.0.0.1:8118'  # this setup needs tor, privoxy or ssr; see https://blog.csdn.net/liazylee/article/details/93078175 for details
117 | FILES_STORE = './video'
118 | FILES_URLS_FIELD = 'field_name_for_your_files_urls'  # unused: OwnFilePipeline.get_media_requests reads item['quality_480p'] directly
119 | FILES_EXPIRES = 90
120 | DOWNLOAD_MAXSIZE = 1073741824
121 | DOWNLOAD_WARNSIZE = 1073741824
122 | DOWNLOAD_FAIL_ON_DATALOSS = False
123 | DOWNLOAD_TIMEOUT = 1800
124 | 
--------------------------------------------------------------------------------
/pronhubSpider/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 | 
--------------------------------------------------------------------------------
/pronhubSpider/spiders/pronhub.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding=utf-8
3 | import logging
4 | import re
5 | 
6 | from scrapy import Selector
7 | from scrapy_splash import SplashRequest
8 | 
9 | from pronhubSpider import settings
10 | from pronhubSpider.items import PronhubspiderItem
11 | 
12 | __auther__ = 'liazylee'
13 | # connect='li233111@gmail.com'
14 | # @Time : 6/20/19 3:29 PM
15 | # @FileName: pronhub.py
16 | # @Software: PyCharm
17 | # @project: pronhubSpider
18 | 
19 | from scrapy.spiders import CrawlSpider
20 | 
21 | # Lua script for Splash: load the page and return its flashvars object (the video metadata) as 'q480'.
22 | script = """
23 | function main(splash, args)
24 |     splash:autoload([[
25 |         function get_video(){
26 | 
27 |             return flashvars;
28 |         }
29 |     ]])
30 |     assert(splash:go(args.url))
31 |     return {
32 |         q480=splash:evaljs("get_video()"),
33 | 
34 |     }
35 | end
36 | """
37 | 
38 | 
39 | class Pronhub(CrawlSpider):
40 |     name = 'pronhub'
41 | 
42 |     logging.getLogger('requests').setLevel(logging.WARNING)
43 |     logging.basicConfig(level=logging.DEBUG,
44 |                         format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
45 |                         datefmt='%a, %d %b %Y %H:%M:%S',
46 |                         filename='tail_log.log',
47 |                         filemode='w')
48 | 
49 |     def start_requests(self):
50 |         for ph_type in settings.PHTYPE:
51 |             yield SplashRequest(url='https://www.pornhub.com/{}'.format(ph_type), callback=self.parse, args={'wait': 1})
52 | 
53 |     def parse(self, response):
54 |         selector = Selector(response)
55 |         logging.debug('request url ---->' + response.url)
56 |         response_selectors = selector.xpath('//div[@class="phimage"]')
57 |         for response_selector in response_selectors:
58 |             view_key = re.findall('viewkey=(.*?)"', response_selector.extract())
59 |             if not view_key:  # skip thumbnails that carry no viewkey
60 |                 continue
61 |             yield SplashRequest(url='https://www.pornhub.com/embed/{}'.format(view_key[0]),
62 |                                 callback=self.parse_key,
63 |                                 endpoint='execute',
64 |                                 args={'lua_source': script})
65 |         # selector_next_url=response.css('head > link:nth-child(42)')
66 |         next_url = selector.xpath('//a[@class="orangeButton" and text()="Next "]/@href').extract_first()
67 |         logging.debug('next_url:----->' + str(next_url))
68 |         if next_url:
69 |             yield SplashRequest(url='https://www.pornhub.com{}'.format(next_url), callback=self.parse, args={'wait': 1})
70 |         pass
71 | 
72 |     def parse_key(self, response):
73 |         phItem = PronhubspiderItem()
74 |         selector = response.data['q480']
75 |         # logging.info(selector)
76 |         # _ph_info = re.findall('var flashvars =(.*?),\n', selector)
77 |         # logging.debug('PH info JSON:')
78 |         # logging.debug(_ph_info)
79 |         # _ph_info_json = json.loads(_ph_info[0])
80 |         duration = selector.get('video_duration')
81 |         phItem['video_duration'] = duration
82 |         title = selector.get('video_title')
83 |         phItem['video_title'] = title
84 |         image_url = selector.get('image_url')
85 |         phItem['image_url'] = image_url
86 |         link_url = selector.get('link_url')
87 |         phItem['link_url'] = link_url
88 |         quality_480p = selector.get('mediaDefinitions')[0].get('videoUrl')
89 |         # quality_480p = _ph_info_json.get('quality_480p')
90 |         phItem['quality_480p'] = quality_480p
91 |         # logging.info('duration:' + duration + ' title:' + title + ' image_url:'
92 |         #              + image_url + ' link_url:' + link_url + ' quality_480p:' + quality_480p)
93 |         # pprint.pprint(phItem)
94 |         yield phItem
95 | 
--------------------------------------------------------------------------------
/pronhubSpider/userAgents.py:
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding=utf-8 3 | __auther__ = 'liazylee' 4 | # connect='li233111@gmail.com' 5 | # @Time : 6/20/19 4:35 PM 6 | # @FileName: userAgents.py 7 | # @Software: PyCharm 8 | # @project: pronhubSpider 9 | 10 | agents = [ 11 | "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 12 | "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)", 13 | "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5", 14 | "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9", 15 | "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7", 16 | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14", 17 | "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14", 18 | "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20", 19 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27", 20 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1", 21 | "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2", 22 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7", 23 | "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre", 24 | "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10", 25 | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)", 26 | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5", 27 | "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)", 28 | "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", 29 | "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", 30 | "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0", 31 | "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2", 32 | "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1", 33 | "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre", 34 | "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )", 35 | "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)", 36 | "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a", 37 | "Mozilla/2.02E (Win95; U)", 38 | "Mozilla/3.01Gold (Win95; I)", 39 | "Mozilla/4.8 [en] (Windows NT 5.1; U)", 40 | "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)", 41 | "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 42 | "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0", 43 | "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) 
AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 44 | "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 45 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 46 | "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 47 | "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 48 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 49 | "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 50 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 51 | "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 52 | "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3", 53 | "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 54 | "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", 55 | "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1", 56 | "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", 57 | "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 58 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 59 | "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 60 | "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 61 | "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", 62 | "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3", 63 | "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", 64 | "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 65 | "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", 66 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", 67 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, 
like Gecko) Version/4.0 Mobile Safari/533.1",
68 |     "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
69 |     "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
70 |     "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
71 |     "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
72 |     "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
73 | ]
--------------------------------------------------------------------------------
/quickstart.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding=utf-8
3 | __auther__ = 'liazylee'
4 | # connect='li233111@gmail.com'
5 | # @Time : 6/20/19 3:37 PM
6 | # @FileName: quickstart.py
7 | # @Software: PyCharm
8 | # @project: pronhubSpider
9 | 
10 | 
11 | from scrapy import cmdline
12 | 
13 | cmdline.execute('scrapy crawl pronhub'.split())
14 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | asn1crypto==0.24.0
2 | attrs==19.1.0
3 | Automat==0.7.0
4 | cffi==1.12.3
5 | constantly==15.1.0
6 | cryptography==2.7
7 | cssselect==1.0.3
8 | hyperlink==19.0.0
9 | idna==2.8
10 | incremental==17.5.0
11 | lxml==4.9.1
12 | parsel==1.5.1
13 | pyasn1==0.4.5
14 | pyasn1-modules==0.2.5
15 | pycparser==2.19
16 | PyDispatcher==2.0.5
17 | PyHamcrest==1.9.0
18 | pymongo==3.8.0
19 | pyOpenSSL==19.0.0
20 | queuelib==1.5.0
21 | Scrapy==2.6.2
22 | scrapy-splash==0.7.2
23 | service-identity==18.1.0
24 | six==1.12.0
25 | Twisted==22.10.0
26 | w3lib==1.20.0
27 | zope.interface==4.6.0
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
1 | # Automatically created by: scrapy startproject
2 | #
3 | # For more information about the [deploy] section see:
4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html
5 | 
6 | [settings]
7 | default = pronhubSpider.settings
8 | 
9 | [deploy]
10 | #url = http://localhost:6800/
11 | project = pronhubSpider
12 | 
--------------------------------------------------------------------------------