├── README.md ├── README.md~ ├── findtrip ├── .quaCity.txt.swp ├── ctrip.json ├── findtrip │ ├── __init__.py │ ├── __init__.pyc │ ├── commands │ │ ├── __init__.py │ │ ├── __init__.pyc │ │ ├── crawlall.py │ │ └── crawlall.pyc │ ├── ghostdriver.log │ ├── items.py │ ├── items.pyc │ ├── middleware.py │ ├── middleware.pyc │ ├── pipelines.py │ ├── pipelines.pyc │ ├── settings.py │ ├── settings.pyc │ ├── spiders │ │ ├── .spider.py.swp │ │ ├── __init__.py │ │ ├── __init__.pyc │ │ ├── __init__.py~ │ │ ├── airport.md~ │ │ ├── airportTable.md │ │ ├── alone │ │ │ ├── Ctrip.py │ │ │ ├── Ctrip.pyc │ │ │ ├── Quatrip.py │ │ │ ├── Quatrip.pyc │ │ │ └── ghostdriver.log │ │ ├── ghostdriver.log │ │ ├── spider.py │ │ ├── spider.pyc │ │ ├── spider_ctrip.py │ │ ├── spider_ctrip.pyc │ │ ├── washctrip.py │ │ └── washctrip.pyc │ ├── tickets.json~ │ ├── useragent.py │ └── useragent.pyc ├── ghostdriver.log ├── qua.json ├── scrapy.cfg ├── tickets1.json~ ├── tickets2.json~ ├── requirements.txt └── requirements.txt~
/README.md:
--------------------------------------------------------------------------------
1 | # Author's note: this project is no longer maintained
2 | #### The code and the libraries it depends on are very old versions; this project is no longer recommended as material for learning web scraping
3 | 
4 | ----
5 | 
6 | # Findtrip Documentation
7 | 
8 | ## Introduction
9 | Findtrip is a Scrapy-based flight ticket crawler that currently integrates China's two major flight ticket websites (去哪儿 Qua + 携程 Ctrip)
10 | 
11 | 
12 | 
13 | 
14 | 
15 | ## Installation
16 | Run the following in your home directory to clone the code locally
17 | ```
18 | git clone https://github.com/fankcoder/findtrip.git
19 | ```
20 | 
21 | For the required runtime environment, see ./requirements.txt
22 | 
23 | This program uses selenium + phantomjs to simulate browser behavior when fetching data. PhantomJS can be downloaded from the address below (Firefox also works, but it is much slower to start)
24 | 
25 | http://npm.taobao.org/dist/phantomjs
26 | 
27 | 
28 | 
29 | Data is stored in MongoDB, so MongoDB must be installed to run the crawler; download it here
30 | 
31 | https://www.mongodb.org/downloads
32 | 
33 | If you are only testing and do not need MongoDB, comment out the corresponding lines in settings.py
34 | ```
35 | '''
36 | ITEM_PIPELINES = {
37 | 'findtrip.pipelines.MongoDBPipeline': 300,
38 | }
39 | 
40 | MONGODB_HOST = 'localhost' # Change
in prod
41 | MONGODB_PORT = 27017 # Change in prod
42 | MONGODB_DATABASE = "findtrip" # Change in prod
43 | MONGODB_COLLECTION = "qua"
44 | MONGODB_USERNAME = "" # Change in prod
45 | MONGODB_PASSWORD = "" # Change in prod
46 | '''
47 | 
48 | ```
49 | 
50 | ## Running
51 | Run all of the commands below from the findtrip/ directory, at the same level as the scrapy.cfg file
52 | 
53 | To crawl 去哪儿 (Qua) alone, run in a terminal
54 | ```
55 | scrapy crawl Qua
56 | ```
57 | To crawl 携程 (Ctrip) alone, run in a terminal
58 | ```
59 | scrapy crawl Ctrip
60 | ```
61 | To crawl 去哪儿 and 携程 together in one run, type in a terminal
62 | ```
63 | scrapy crawlall
64 | ```
65 | 
66 | ## Sample JSON data
67 | 去哪儿 (Qua)
68 | ```
69 | [{"airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "company": ["China", "KN5927(73S)"], "site": "Qua", "flight_time": ["4:00", "PM", "7:00", "PM"], "passtime": ["3h"], "price": ["\u00a5", "689"]},
70 | {"airports": ["PEK", "Beijing", "RIZ", "Rizhao", "RIZ", "Rizhao", "XMN", "Xiamen"], "company": ["Shandong", "SC4678(738)", "Same", "Shandong", "SC4678(738)"], "site": "Qua", "flight_time": ["3:20", "PM", "4:50", "PM", "45m", "5:35", "PM", "8:05", "PM"], "passtime": ["1h30m", "2h30m"], "price": ["\u00a5", "712"]},...]
71 | 
72 | ```
73 | 携程 (Ctrip)
74 | ```
75 | [{"flight_time": [["10:30", "20:50"], ["12:15", "22:20"]], "price": ["\u00a5", "580"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT4", "\u5357\u4eac\u7984\u53e3\u56fd\u9645\u673a\u573aT2"], ["\u5357\u4eac\u7984\u53e3\u56fd\u9645\u673a\u573aT2", "\u9996\u90fd\u56fd\u9645\u673a\u573aT2"]], "company": ["\u4e1c\u65b9\u822a\u7a7a", "MU2891", "\u4e1c\u65b9\u822a\u7a7a", "MU728"], "site": "Ctrip"},
76 | {"flight_time": [["11:05", "17:55"], ["12:50", "19:50"]], "price": ["\u00a5", "610"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT4", "\u5408\u80a5\u65b0\u6865\u56fd\u9645\u673a\u573a"], ["\u5408\u80a5\u65b0\u6865\u56fd\u9645\u673a\u573a", "\u9996\u90fd\u56fd\u9645\u673a\u573aT2"]], "company": ["\u4e1c\u65b9\u822a\u7a7a", "MU5169", "\u4e1c\u65b9\u822a\u7a7a", "MU5171"], "site": "Ctrip"},...]
77 | ```
78 | 
--------------------------------------------------------------------------------
/findtrip/.quaCity.txt.swp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/.quaCity.txt.swp
--------------------------------------------------------------------------------
/findtrip/ctrip.json:
--------------------------------------------------------------------------------
1 | [{"flight_time": [["10:30", "20:50"], ["12:15", "22:20"]], "price": ["\u00a5", "580"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT4", "\u5357\u4eac\u7984\u53e3\u56fd\u9645\u673a\u573aT2"], ["\u5357\u4eac\u7984\u53e3\u56fd\u9645\u673a\u573aT2", "\u9996\u90fd\u56fd\u9645\u673a\u573aT2"]], "company": ["\u4e1c\u65b9\u822a\u7a7a", "MU2891", "\u4e1c\u65b9\u822a\u7a7a", "MU728"], "site": "Ctrip"},
2 | {"flight_time": [["11:05", "17:55"], ["12:50", "19:50"]], "price": ["\u00a5", "610"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT4", 
"\u5408\u80a5\u65b0\u6865\u56fd\u9645\u673a\u573a"], ["\u5408\u80a5\u65b0\u6865\u56fd\u9645\u673a\u573a", "\u9996\u90fd\u56fd\u9645\u673a\u573aT2"]], "company": ["\u4e1c\u65b9\u822a\u7a7a", "MU5169", "\u4e1c\u65b9\u822a\u7a7a", "MU5171"], "site": "Ctrip"}, 3 | {"flight_time": [["22:05"], ["00:55", "+1\u5929"]], "price": ["\u00a5", "645"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT4"], ["\u9996\u90fd\u56fd\u9645\u673a\u573aT1"]], "company": ["\u9996\u90fd\u822a\u7a7a", "JD5376"], "site": "Ctrip"}, 4 | {"flight_time": [["17:15"], ["22:15"]], "price": ["\u00a5", "660"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT4"], ["\u9996\u90fd\u56fd\u9645\u673a\u573aT3"]], "company": ["\u5c71\u4e1c\u822a\u7a7a", "SC4677"], "site": "Ctrip"}, 5 | {"flight_time": [["17:30"], ["22:25"]], "price": ["\u00a5", "672"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT3"], ["\u9996\u90fd\u56fd\u9645\u673a\u573aT2"]], "company": ["\u53a6\u95e8\u822a\u7a7a", "MF8159"], "site": "Ctrip"}, 6 | {"flight_time": [["17:30"], ["22:25"]], "price": ["\u00a5", "676"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT3"], ["\u9996\u90fd\u56fd\u9645\u673a\u573aT2"]], "company": ["\u6cb3\u5317\u822a\u7a7a", "NS8159", "(\u5171\u4eab)"], "site": "Ctrip"}, 7 | {"flight_time": [["10:30"], ["13:10"]], "price": ["\u00a5", "692"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT4"], ["\u5357\u82d1\u673a\u573a"]], "company": ["\u4e2d\u56fd\u8054\u822a", "KN5926"], "site": "Ctrip"}] -------------------------------------------------------------------------------- /findtrip/findtrip/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/__init__.py -------------------------------------------------------------------------------- /findtrip/findtrip/__init__.pyc: 
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/__init__.pyc
--------------------------------------------------------------------------------
/findtrip/findtrip/commands/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/commands/__init__.py
--------------------------------------------------------------------------------
/findtrip/findtrip/commands/__init__.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/commands/__init__.pyc
--------------------------------------------------------------------------------
/findtrip/findtrip/commands/crawlall.py:
--------------------------------------------------------------------------------
1 | from scrapy.command import ScrapyCommand
2 | from scrapy.crawler import CrawlerRunner
3 | from scrapy.utils.conf import arglist_to_dict
4 | from scrapy.exceptions import UsageError  # raised in process_options below
5 | 
6 | class Command(ScrapyCommand):
7 | 
8 |     requires_project = True
9 | 
10 |     def syntax(self):
11 |         return '[options]'
12 | 
13 |     def short_desc(self):
14 |         return 'Runs all of the spiders'
15 | 
16 |     def add_options(self, parser):
17 |         ScrapyCommand.add_options(self, parser)
18 |         parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
19 |                           help="set spider argument (may be repeated)")
20 |         parser.add_option("-o", "--output", metavar="FILE",
21 |                           help="dump scraped items into FILE (use - for stdout)")
22 |         parser.add_option("-t", "--output-format", metavar="FORMAT",
23 |                           help="format to use for dumping items with -o")
24 | 
25 |     def process_options(self, args, opts):
26 |         ScrapyCommand.process_options(self, args, opts)
27 |         try:
28 | 
opts.spargs = arglist_to_dict(opts.spargs)
29 |         except ValueError:
30 |             raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)
31 | 
32 |     def run(self, args, opts):
33 |         #settings = get_project_settings()
34 | 
35 |         spider_loader = self.crawler_process.spider_loader
36 |         for spidername in args or spider_loader.list():
37 |             print "*********crawlall spidername************" + spidername
38 |             self.crawler_process.crawl(spidername, **opts.spargs)
39 | 
40 |         self.crawler_process.start()
41 | 
--------------------------------------------------------------------------------
/findtrip/findtrip/commands/crawlall.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/commands/crawlall.pyc
--------------------------------------------------------------------------------
/findtrip/findtrip/ghostdriver.log:
--------------------------------------------------------------------------------
1 | [INFO - 2016-04-02T02:25:11.818Z] GhostDriver - Main - running on port 58314
2 | [INFO - 2016-04-02T02:25:12.836Z] Session [21036f80-f87a-11e5-ac91-b5e3bc220f1b] - page.settings - {"XSSAuditingEnabled":false,"javascriptCanCloseWindows":true,"javascriptCanOpenWindows":true,"javascriptEnabled":true,"loadImages":false,"localToRemoteUrlAccessEnabled":false,"userAgent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36","webSecurityEnabled":true}
3 | [INFO - 2016-04-02T02:25:12.836Z] Session [21036f80-f87a-11e5-ac91-b5e3bc220f1b] - page.customHeaders: - {}
4 | [INFO - 2016-04-02T02:25:12.837Z] Session [21036f80-f87a-11e5-ac91-b5e3bc220f1b] - Session.negotiatedCapabilities - 
{"browserName":"phantomjs","version":"2.1.1","driverName":"ghostdriver","driverVersion":"1.2.0","platform":"linux-unknown-64bit","javascriptEnabled":true,"takesScreenshot":true,"handlesAlerts":false,"databaseEnabled":false,"locationContextEnabled":false,"applicationCacheEnabled":false,"browserConnectionEnabled":false,"cssSelectorsEnabled":true,"webStorageEnabled":false,"rotatable":false,"acceptSslCerts":false,"nativeEvents":true,"proxy":{"proxyType":"direct"},"phantomjs.page.settings.resourceTimeout":15,"phantomjs.page.settings.loadImages":false,"phantomjs.page.settings.userAgent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36"} 5 | [INFO - 2016-04-02T02:25:12.837Z] SessionManagerReqHand - _postNewSessionCommand - New Session Created: 21036f80-f87a-11e5-ac91-b5e3bc220f1b 6 | -------------------------------------------------------------------------------- /findtrip/findtrip/items.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your scraped items 4 | # 5 | # See documentation in: 6 | # http://doc.scrapy.org/en/latest/topics/items.html 7 | 8 | import scrapy 9 | 10 | 11 | class FindtripItem(scrapy.Item): 12 | # define the fields for your item here like: 13 | # name = scrapy.Field() 14 | site = scrapy.Field() 15 | company = scrapy.Field() 16 | flight_time = scrapy.Field() 17 | airports = scrapy.Field() 18 | passtime = scrapy.Field() 19 | price = scrapy.Field() 20 | -------------------------------------------------------------------------------- /findtrip/findtrip/items.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/items.pyc -------------------------------------------------------------------------------- /findtrip/findtrip/middleware.py: 
--------------------------------------------------------------------------------
1 | #-*- coding:utf-8 -*-
2 | from selenium import webdriver
3 | from scrapy.http import HtmlResponse
4 | from lxml import etree
5 | import time
6 | from selenium.webdriver.common.action_chains import ActionChains
7 | from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
8 | from random import choice
9 | 
10 | ua_list = [
11 |     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.82 Chrome/48.0.2564.82 Safari/537.36",
12 |     "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
13 |     "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36",
14 |     "Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
15 |     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36"
16 | ]
17 | 
18 | dcap = dict(DesiredCapabilities.PHANTOMJS)
19 | dcap["phantomjs.page.settings.resourceTimeout"] = 15
20 | dcap["phantomjs.page.settings.loadImages"] = False
21 | dcap["phantomjs.page.settings.userAgent"] = choice(ua_list)
22 | #driver = webdriver.PhantomJS(executable_path='/home/icgoo/pywork/spider/phantomjs',desired_capabilities=dcap)
23 | #driver = webdriver.PhantomJS(executable_path=u'/home/fank/pywork/spider/phantomjs',desired_capabilities=dcap)
24 | driver = webdriver.PhantomJS()
25 | #driver = webdriver.Firefox()
26 | 
27 | class SeleniumMiddleware(object):
28 |     def process_request(self, request, spider):
29 |         print spider.name
30 |         if spider.name == 'Qua':
31 |             try:
32 |                 driver.get(request.url)
33 |                 driver.implicitly_wait(3)
34 |                 time.sleep(5)
35 |                 origin_page = driver.page_source # .decode('utf-8','ignore')
36 |                 origin_html = etree.HTML(origin_page)
37 |                 items = origin_html.xpath("//div[@class='m-fly-item s-oneway']")
38 | 
39 |                 for index,item in enumerate(items):
40 |                     flight_each = "//div[@id='list-box']/div["+str(index+1)+"]"
41 |                     detail_span = "//div[@class='fl-detail-nav']/ul/li[1]/span[@class='nav-label']"
42 | 
43 |                     driver.find_element_by_xpath(flight_each+detail_span).click() # the detail data is rendered by JS and only loads after the click
44 | 
45 |                 true_page = driver.page_source
46 | 
47 |                 return HtmlResponse(request.url,body = true_page,encoding = 'utf-8',request = request,)
48 |             except Exception:
49 |                 print "get Qua data failed"
50 | 
51 |         elif spider.name == 'Ctrip':
52 |             driver.get(request.url)
53 |             driver.implicitly_wait(3)
54 |             time.sleep(5)
55 |             '''
56 |             # Ctrip serves results as an infinite scroll; scrolling down loads more data, but it slows the crawl down
57 |             origin_page = driver.page_source
58 |             origin_html = etree.HTML(origin_page)
59 |             fligint_div = "//div[@id='J_flightlist2']/div"
60 |             items = origin_html.xpath(fligint_div)
61 | 
62 |             for index,item in enumerate(items):
63 |                 js="var q=document.documentElement.scrollTop=5000"
64 |                 driver.execute_script(js)
65 |                 time.sleep(2)
66 |             '''
67 | 
68 |             page = driver.page_source # .decode('utf-8','ignore')
69 | 
70 |             return HtmlResponse(request.url,body = page,encoding = 'utf-8',request = request,)
71 | 
72 |         else:
73 |             print "use middleware failed"
74 | 
75 | 
76 | # The module-level driver is shared by every request, so it must not be closed
77 | # inside process_request; call close_driver() only once the crawl is finished.
78 | def close_driver():
79 |     driver.close()
80 | 
--------------------------------------------------------------------------------
/findtrip/findtrip/middleware.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/middleware.pyc
--------------------------------------------------------------------------------
/findtrip/findtrip/pipelines.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # Define your item pipelines here
4 | #
5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting
6 | # See: 
http://doc.scrapy.org/en/latest/topics/item-pipeline.html
7 | from findtrip.spiders.washctrip import wash
8 | import pymongo
9 | from scrapy.conf import settings
10 | from scrapy import log
11 | from scrapy.exceptions import DropItem
12 | 
13 | class MongoDBPipeline(object):
14 |     def __init__(self):
15 |         connection = pymongo.MongoClient(
16 |             settings['MONGODB_HOST'],
17 |             settings['MONGODB_PORT']
18 |         )
19 |         db = connection[settings['MONGODB_DATABASE']]
20 |         self.collection = db[settings['MONGODB_COLLECTION']]
21 | 
22 |     def process_item(self, item, spider):
23 |         if item['site'] == 'Qua':
24 |             if item['company']:
25 |                 item['company'] = wash(item['company'])
26 |             if item['flight_time']:
27 |                 item['flight_time'] = wash(item['flight_time'])
28 |             if item['airports']:
29 |                 item['airports'] = wash(item['airports'])
30 |             if item['passtime']:
31 |                 item['passtime'] = wash(item['passtime'])
32 |             if item['price']:
33 |                 item['price'] = wash(item['price'])
34 |             for data in item:
35 |                 if not item[data]:
36 |                     raise DropItem("Missing data!")
37 |             self.collection.insert(dict(item))
38 |             log.msg("Question added to MongoDB database!",
39 |                     level=log.DEBUG, spider=spider)
40 |         elif item['site'] == 'Ctrip':
41 |             self.collection.insert(dict(item))
42 |             log.msg("Question added to MongoDB database!",
43 |                     level=log.DEBUG, spider=spider)
44 | 
45 |         return item
46 | '''
47 | class QuaPipeline(object):
48 |     def process_item(self, item, spider):
49 |         if item['company']:
50 |             item['company'] = wash(item['company'])
51 |         if item['flight_time']:
52 |             item['flight_time'] = wash(item['flight_time'])
53 |         if item['airports']:
54 |             item['airports'] = wash(item['airports'])
55 |         if item['passtime']:
56 |             item['passtime'] = wash(item['passtime'])
57 |         if item['price']:
58 |             item['price'] = wash(item['price'])
59 |         return item
60 | '''
61 | 
--------------------------------------------------------------------------------
/findtrip/findtrip/pipelines.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/pipelines.pyc -------------------------------------------------------------------------------- /findtrip/findtrip/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for findtrip project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used. You can find more settings consulting the documentation: 7 | # 8 | # http://doc.scrapy.org/en/latest/topics/settings.html 9 | # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 10 | # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 11 | 12 | BOT_NAME = 'findtrip' 13 | 14 | SPIDER_MODULES = ['findtrip.spiders'] 15 | NEWSPIDER_MODULE = 'findtrip.spiders' 16 | COMMANDS_MODULE = 'findtrip.commands' 17 | 18 | # Enable or disable downloader middlewares 19 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 20 | DOWNLOADER_MIDDLEWARES = { 21 | 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware' : None, 22 | 'findtrip.useragent.RandomUserAgentMiddleware' :400, 23 | 'findtrip.middleware.SeleniumMiddleware': 543 24 | } 25 | 26 | # Configure item pipelines 27 | # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 28 | ITEM_PIPELINES = { 29 | 'findtrip.pipelines.MongoDBPipeline': 300, 30 | } 31 | 32 | MONGODB_HOST = 'localhost' # Change in prod 33 | MONGODB_PORT = 27017 # Change in prod 34 | MONGODB_DATABASE = "findtrip" # Change in prod 35 | MONGODB_COLLECTION = "qua" 36 | MONGODB_USERNAME = "" # Change in prod 37 | MONGODB_PASSWORD = "" # Change in prod 38 | 39 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 40 | #USER_AGENT = 'findtrip (+http://www.yourdomain.com)' 41 | 42 | # Configure maximum concurrent requests performed by Scrapy 
(default: 16) 43 | #CONCURRENT_REQUESTS=32 44 | 45 | # Configure a delay for requests for the same website (default: 0) 46 | # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay 47 | # See also autothrottle settings and docs 48 | #DOWNLOAD_DELAY=3 49 | # The download delay setting will honor only one of: 50 | #CONCURRENT_REQUESTS_PER_DOMAIN=16 51 | #CONCURRENT_REQUESTS_PER_IP=16 52 | 53 | # Disable cookies (enabled by default) 54 | COOKIES_ENABLED=False 55 | 56 | # Disable Telnet Console (enabled by default) 57 | #TELNETCONSOLE_ENABLED=False 58 | 59 | # Override the default request headers: 60 | DEFAULT_REQUEST_HEADERS = { 61 | 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 62 | 'Accept-Language': 'en-US,en;q=0.5', 63 | 'Accept-Encoding':"gzip, deflate", 64 | } 65 | 66 | 67 | # Enable or disable spider middlewares 68 | # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 69 | #SPIDER_MIDDLEWARES = { 70 | # 'findtrip.middlewares.MyCustomSpiderMiddleware': 543, 71 | #} 72 | 73 | 74 | 75 | # Enable or disable extensions 76 | # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html 77 | #EXTENSIONS = { 78 | # 'scrapy.telnet.TelnetConsole': None, 79 | #} 80 | 81 | 82 | # Enable and configure the AutoThrottle extension (disabled by default) 83 | # See http://doc.scrapy.org/en/latest/topics/autothrottle.html 84 | # NOTE: AutoThrottle will honour the standard settings for concurrency and delay 85 | #AUTOTHROTTLE_ENABLED=True 86 | # The initial download delay 87 | #AUTOTHROTTLE_START_DELAY=5 88 | # The maximum download delay to be set in case of high latencies 89 | #AUTOTHROTTLE_MAX_DELAY=60 90 | # Enable showing throttling stats for every response received: 91 | #AUTOTHROTTLE_DEBUG=False 92 | 93 | # Enable and configure HTTP caching (disabled by default) 94 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 95 | 
#HTTPCACHE_ENABLED=True
96 | #HTTPCACHE_EXPIRATION_SECS=0
97 | #HTTPCACHE_DIR='httpcache'
98 | #HTTPCACHE_IGNORE_HTTP_CODES=[]
99 | #HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage'
100 | 
--------------------------------------------------------------------------------
/findtrip/findtrip/settings.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/settings.pyc
--------------------------------------------------------------------------------
/findtrip/findtrip/spiders/.spider.py.swp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/spiders/.spider.py.swp
--------------------------------------------------------------------------------
/findtrip/findtrip/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 | 
--------------------------------------------------------------------------------
/findtrip/findtrip/spiders/__init__.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/spiders/__init__.pyc
--------------------------------------------------------------------------------
/findtrip/findtrip/spiders/airport.md~:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/spiders/airport.md~
--------------------------------------------------------------------------------
/findtrip/findtrip/spiders/airportTable.md:
--------------------------------------------------------------------------------
1 | City | IATA code | ICAO code | Airport name | English name
2 | ----|------|----|-----|-----
3 | Beijing | PEK | ZBAA | Beijing Capital International Airport | BEIJING
4 | Shanghai | SHA | ZSSS | Shanghai Hongqiao Airport | SHANGHAIHONGQIAO
5 | Shanghai | PVG | ZSPD | Shanghai Pudong Airport | SHANGHAIPUDONG
6 | Guangzhou | CAN | ZGGG | Guangzhou Baiyun Airport | GUANGZHOU
7 | Shenzhen | SZX | ZGSZ | Shenzhen Bao'an International Airport | SHENZHEN
8 | Chengdu | CTU | ZUUU | Chengdu Shuangliu Airport | CHENGDU
9 | Xiamen | XMN | ZSAM | Xiamen Gaoqi Airport | XIAMEN
10 | Qingdao | TAO | ZSQD | Qingdao Liuting Airport | QINGDAO
11 | ... | ... | ... | ... | ... 
12 | For more airport codes, visit http://airportcode.911cha.com/
13 | 
--------------------------------------------------------------------------------
/findtrip/findtrip/spiders/alone/Ctrip.py:
--------------------------------------------------------------------------------
1 | # -*- coding:utf-8 -*-
2 | from selenium import webdriver
3 | import time
4 | from random import choice
5 | from selenium.webdriver.common.action_chains import ActionChains
6 | from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
7 | from lxml import etree
8 | 
9 | def findTrip():
10 |     url = "http://flights.ctrip.com/booking/XMN-BJS-day-1.html?DDate1=2016-04-18"
11 |     ua_list = [
12 |         "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.82 Chrome/48.0.2564.82 Safari/537.36",
13 |         "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
14 |         "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36",
15 |         "Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
16 |         "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36"
17 |     ]
18 | 
19 |     dcap = dict(DesiredCapabilities.PHANTOMJS)
20 |     dcap["phantomjs.page.settings.resourceTimeout"] = 15
21 |     dcap["phantomjs.page.settings.loadImages"] = False
22 |     dcap["phantomjs.page.settings.userAgent"] = choice(ua_list)
23 |     #driver = webdriver.PhantomJS(executable_path=u'/home/icgoo/pywork/spider/phantomjs',desired_capabilities=dcap)
24 |     #driver = webdriver.PhantomJS(executable_path=u'/home/fank/pywork/spider/phantomjs',desired_capabilities=dcap)
25 |     driver = webdriver.Firefox()
26 | 
27 |     driver.get(url)
28 |     driver.implicitly_wait(3)
29 |     time.sleep(5)
30 |     page = driver.page_source # .decode('utf-8','ignore')
31 |     html = etree.HTML(page)
32 | 
33 |     fligint_div = "//div[@id='J_flightlist2']/div"
34 | 
items = html.xpath(fligint_div)
35 |     detail = []
36 |     for index,item in enumerate(items):
37 |         flight_tr = fligint_div+'['+str(index+1)+']'+'//tr'
38 |         istrain = html.xpath(flight_tr + "//div[@class='train_flight_tit']")
39 |         if istrain:
40 |             pass # this row is a train ticket, not a flight; skip it
41 |         else:
42 |             company = html.xpath(flight_tr + "//div[@class='info-flight J_flight_no']//text()")
43 |             flight_time_from = html.xpath(flight_tr + "//td[@class='right']/div[1]//text()")
44 |             flight_time_to = html.xpath(flight_tr + "//td[@class='left']/div[1]//text()")
45 |             flight_time = [flight_time_from,flight_time_to]
46 |             airports_from = html.xpath(flight_tr + "//td[@class='right']/div[2]//text()")
47 |             airports_to = html.xpath(flight_tr + "//td[@class='left']/div[2]//text()")
48 |             airports = [airports_from,airports_to]
49 |             price = html.xpath(flight_tr + "[1]//td[@class='price middle ']/span//text()")
50 | 
51 |             detail.append(
52 |                 dict(
53 |                     company=company,
54 |                     flight_time=flight_time,
55 |                     airports=airports,
56 |                     price=price
57 |                 ))
58 |     print detail
59 |     driver.close()
60 |     return detail
61 | 
62 | if __name__ == '__main__':
63 |     findTrip()
64 | 
--------------------------------------------------------------------------------
/findtrip/findtrip/spiders/alone/Ctrip.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/spiders/alone/Ctrip.pyc
--------------------------------------------------------------------------------
/findtrip/findtrip/spiders/alone/Quatrip.py:
--------------------------------------------------------------------------------
1 | # -*- coding:utf-8 -*-
2 | from selenium import webdriver
3 | import time
4 | from random import choice
5 | from selenium.webdriver.common.action_chains import ActionChains
6 | from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
7 | from lxml import etree
8 | from washctrip import wash
9 | 
10 | def findTrip(fromCity,toCity,date): 11 | #url = "http://www.qua.com/flights/PEK-XMN/2016-04-06?from=home" 12 | #url = "http://www.qua.com/flights/PEK-XMN/2016-04-06?m=CNY&from=home" 13 | url_head = "http://www.qua.com/flights/" 14 | url_tail = "?m=CNY&from=home" 15 | url = url_head + fromCity +'-'+ toCity +'/'+ date + url_tail 16 | 17 | ua_list = [ 18 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.82 Chrome/48.0.2564.82 Safari/537.36", 19 | "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36", 20 | "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36", 21 | "Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36", 22 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36" 23 | ] 24 | 25 | dcap = dict(DesiredCapabilities.PHANTOMJS) 26 | dcap["phantomjs.page.settings.resourceTimeout"] = 15 27 | dcap["phantomjs.page.settings.loadImages"] = False 28 | dcap["phantomjs.page.settings.userAgent"] = choice(ua_list) 29 | driver = webdriver.PhantomJS(executable_path=u'/home/icgoo/pywork/spider/phantomjs',desired_capabilities=dcap) 30 | #driver = webdriver.Firefox() 31 | #driver.maximize_window() 32 | driver.get(url) 33 | driver.implicitly_wait(3) 34 | time.sleep(5) 35 | 36 | origin_page = driver.page_source # .decode('utf-8','ignore') 37 | origin_html = etree.HTML(origin_page) 38 | #items = origin_html.xpath("//div[@class='fl-detail-nav']/ul/li[1]") 39 | #items = origin_html.xpath("//div[@class='m-fly-item s-oneway']") 40 | items = origin_html.xpath("//div[@id='list-box']/div") 41 | 42 | detail = [] 43 | for index,item in enumerate(items): 44 | flight_each = "//div[@id='list-box']/div["+str(index)+"]" 45 | detail_span = "//div[@class='fl-detail-nav']/ul/li[1]/span[@class='nav-label']" 46 | detail_span = 
"//div[@class='fl-detail-nav']/ul/li[1]" 47 | f_route_div = "//div[@class='m-fl-info-bd']/div" 48 | 49 | #driver.find_element_by_xpath(flight_each+detail_span).click() # 数据由js来控制,点击后加载数据 50 | #driver.find_element_by_xpath(flight_each+"/div[2]/div[1]/ul/li[1]/span").click() # 数据由js来控制,点击后加载数据 51 | element = driver.find_element_by_xpath(flight_each+detail_span) 52 | hover = ActionChains(driver).move_to_element_with_offset(element,0,20) 53 | hover.perform() 54 | element.click() 55 | 56 | true_page = driver.page_source 57 | true_html = etree.HTML(true_page) 58 | 59 | #test = true_html.xpath(flight_each + "//div[@class='m-fl-info-bd']/div/p[2]//text()") #get airflight and company 60 | #print test 61 | company = true_html.xpath(flight_each + f_route_div + '/p[1]//text()') #get airflight and company 62 | flight_time = true_html.xpath(flight_each + f_route_div + '/p[2]//text()') 63 | airports = true_html.xpath(flight_each + f_route_div + '/p[3]//text()') 64 | passtime = true_html.xpath(flight_each + f_route_div + '/p[4]//text()') 65 | price = true_html.xpath(flight_each + "//div[@class='fl-price-box']//em//text()") 66 | 67 | company = wash(company) 68 | flight_time = wash(flight_time) 69 | airports = wash(airports) 70 | passtime = wash(passtime) 71 | 72 | detail.append( 73 | dict( 74 | company=company, 75 | flight_time=flight_time, 76 | airports=airports, 77 | passtime=passtime, 78 | price=price 79 | )) 80 | driver.close() 81 | return detail 82 | 83 | if __name__ == '__main__': 84 | fromCity = "PEK" #replace beijing by 'PEK' 85 | toCity = "XMN" 86 | date = "2016-04-06" #example 2016-04-06 87 | data = findTrip(fromCity, toCity, date) 88 | #print data[0] 89 | for each in data: 90 | print each 91 | -------------------------------------------------------------------------------- /findtrip/findtrip/spiders/alone/Quatrip.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/spiders/alone/Quatrip.pyc -------------------------------------------------------------------------------- /findtrip/findtrip/spiders/alone/ghostdriver.log: -------------------------------------------------------------------------------- 1 | [INFO - 2016-04-04T08:00:40.968Z] GhostDriver - Main - running on port 55501 2 | [INFO - 2016-04-04T08:00:41.988Z] Session [53c063c0-fa3b-11e5-90ef-d7c6ca58bc2b] - page.settings - {"XSSAuditingEnabled":false,"javascriptCanCloseWindows":true,"javascriptCanOpenWindows":true,"javascriptEnabled":true,"loadImages":false,"localToRemoteUrlAccessEnabled":false,"userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.82 Chrome/48.0.2564.82 Safari/537.36","webSecurityEnabled":true} 3 | [INFO - 2016-04-04T08:00:41.988Z] Session [53c063c0-fa3b-11e5-90ef-d7c6ca58bc2b] - page.customHeaders: - {} 4 | [INFO - 2016-04-04T08:00:41.988Z] Session [53c063c0-fa3b-11e5-90ef-d7c6ca58bc2b] - Session.negotiatedCapabilities - {"browserName":"phantomjs","version":"2.1.1","driverName":"ghostdriver","driverVersion":"1.2.0","platform":"linux-unknown-64bit","javascriptEnabled":true,"takesScreenshot":true,"handlesAlerts":false,"databaseEnabled":false,"locationContextEnabled":false,"applicationCacheEnabled":false,"browserConnectionEnabled":false,"cssSelectorsEnabled":true,"webStorageEnabled":false,"rotatable":false,"acceptSslCerts":false,"nativeEvents":true,"proxy":{"proxyType":"direct"},"phantomjs.page.settings.resourceTimeout":15,"phantomjs.page.settings.loadImages":false,"phantomjs.page.settings.userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.82 Chrome/48.0.2564.82 Safari/537.36"} 5 | [INFO - 2016-04-04T08:00:41.988Z] SessionManagerReqHand - _postNewSessionCommand - New Session Created: 53c063c0-fa3b-11e5-90ef-d7c6ca58bc2b 6 | [INFO - 
2016-04-04T13:55:40.970Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 77 | [INFO - 2016-04-04T14:00:40.970Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 78 | [INFO - 2016-04-04T14:05:40.970Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 79 | [INFO - 2016-04-04T14:10:40.969Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 80 | [INFO - 2016-04-04T14:15:40.970Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 81 | [INFO - 2016-04-04T14:20:40.970Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 82 | [INFO - 2016-04-04T14:25:40.970Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 83 | [INFO - 2016-04-04T14:30:40.970Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 84 | [INFO - 2016-04-04T14:35:40.970Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 85 | [INFO - 2016-04-04T14:40:40.970Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 86 | -------------------------------------------------------------------------------- /findtrip/findtrip/spiders/ghostdriver.log: -------------------------------------------------------------------------------- 1 | [INFO - 2016-03-30T13:00:40.532Z] GhostDriver - Main - running on port 57674 2 | [INFO - 2016-03-30T13:00:41.443Z] Session [68320d80-f677-11e5-aa15-33314816cfdb] - page.settings - 
{"XSSAuditingEnabled":false,"javascriptCanCloseWindows":true,"javascriptCanOpenWindows":true,"javascriptEnabled":true,"loadImages":false,"localToRemoteUrlAccessEnabled":false,"userAgent":"Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36","webSecurityEnabled":true} 3 | [INFO - 2016-03-30T13:00:41.444Z] Session [68320d80-f677-11e5-aa15-33314816cfdb] - page.customHeaders: - {} 4 | [INFO - 2016-03-30T13:00:41.444Z] Session [68320d80-f677-11e5-aa15-33314816cfdb] - Session.negotiatedCapabilities - {"browserName":"phantomjs","version":"2.1.1","driverName":"ghostdriver","driverVersion":"1.2.0","platform":"linux-unknown-64bit","javascriptEnabled":true,"takesScreenshot":true,"handlesAlerts":false,"databaseEnabled":false,"locationContextEnabled":false,"applicationCacheEnabled":false,"browserConnectionEnabled":false,"cssSelectorsEnabled":true,"webStorageEnabled":false,"rotatable":false,"acceptSslCerts":false,"nativeEvents":true,"proxy":{"proxyType":"direct"},"phantomjs.page.settings.resourceTimeout":15,"phantomjs.page.settings.loadImages":false,"phantomjs.page.settings.userAgent":"Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36"} 5 | [INFO - 2016-03-30T13:00:41.444Z] SessionManagerReqHand - _postNewSessionCommand - New Session Created: 68320d80-f677-11e5-aa15-33314816cfdb 6 | [INFO - 2016-03-30T13:05:05.018Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 7 | [INFO - 2016-03-30T13:10:05.018Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 8 | [INFO - 2016-03-30T13:15:05.017Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 9 | [INFO - 2016-03-30T13:20:05.018Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 10 | [INFO - 2016-03-30T13:25:05.018Z] 
SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 25 | [INFO - 2016-03-30T14:40:05.019Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 26 | [INFO - 2016-03-30T14:45:05.019Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 27 | [INFO - 2016-03-30T14:50:05.021Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 28 | [INFO - 2016-03-30T14:55:05.021Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 29 | -------------------------------------------------------------------------------- /findtrip/findtrip/spiders/spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from findtrip.items import FindtripItem 3 | 4 | class QuaSpider(scrapy.Spider): 5 | name = "Qua" 6 | start_urls = [ 7 | "http://www.qua.com/flights/PEK-XMN/2016-05-12?m=CNY&from=flight_home" 8 | ] 9 | 10 | def parse(self, response): 11 | sel = scrapy.Selector(response) 12 | dataList = sel.xpath("//div[@class='m-fly-item s-oneway']") 13 | items = [] 14 | 15 | for index,each in enumerate(dataList): 16 | flight_each = "//div[@id='list-box']/div["+str(index+1)+"]" 17 | detail_span = "//div[@class='fl-detail-nav']/ul/li[1]/span[@class='nav-label']" 18 | f_route_div = "//div[@class='m-fl-info-bd']/div" 19 | 20 | airports = sel.xpath(flight_each + f_route_div + '/p[3]//text()').extract() 21 | company = sel.xpath(flight_each + f_route_div + '/p[1]//text()').extract() 22 | flight_time = sel.xpath(flight_each + f_route_div + '/p[2]//text()').extract() 23 | passtime = sel.xpath(flight_each + f_route_div + '/p[4]//text()').extract() 24 | price = sel.xpath(flight_each + "//div[@class='fl-price-box']//em//text()").extract() 25 | 26 | item = FindtripItem() 27 | item['site'] = 'Qua' 28 | 
item['company'] = company 29 | item['flight_time'] = flight_time 30 | item['airports'] = airports 31 | item['passtime'] = passtime 32 | item['price'] = price 33 | items.append(item) 34 | return items 35 | -------------------------------------------------------------------------------- /findtrip/findtrip/spiders/spider.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/spiders/spider.pyc -------------------------------------------------------------------------------- /findtrip/findtrip/spiders/spider_ctrip.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from findtrip.items import FindtripItem 3 | 4 | class CtripSpider(scrapy.Spider): 5 | name = 'Ctrip' 6 | start_urls = [ 7 | "http://flights.ctrip.com/booking/XMN-BJS-day-1.html?DDate1=2016-04-19" 8 | ] 9 | 10 | def parse(self, response): 11 | sel = scrapy.Selector(response) 12 | flight_div = "//div[@id='J_flightlist2']/div" 13 | dataList = sel.xpath(flight_div) 14 | print dataList,len(dataList) 15 | 16 | items = [] 17 | for index,each in enumerate(dataList): 18 | flight_each = flight_div+'['+str(index+1)+']' 19 | flight_tr = flight_each+"//tr[@class='J_header_row']" 20 | istrain = sel.xpath(flight_each + "//div[@class='train_flight_tit']") 21 | 22 | if istrain: 23 | print "skipping train ticket entry" 24 | else: 25 | company = sel.xpath(flight_tr + "//div[@class='info-flight J_flight_no']//text()").extract() 26 | 27 | flight_time_from = sel.xpath(flight_tr + "//td[@class='right']/div[1]//text()").extract() 28 | flight_time_to = sel.xpath(flight_tr + "//td[@class='left']/div[1]//text()").extract() 29 | flight_time = [flight_time_from,flight_time_to] 30 | 31 | airports_from = sel.xpath(flight_tr + "//td[@class='right']/div[2]//text()").extract() 32 | airports_to = sel.xpath(flight_tr + 
"//td[@class='left']/div[2]//text()").extract() 33 | airports = [airports_from,airports_to] 34 | 35 | price_middle = sel.xpath(flight_tr + "[1]//td[@class='price middle ']/span//text()").extract() 36 | price = sel.xpath(flight_tr + "[1]//td[@class='price ']/span//text()").extract() 37 | if price_middle: 38 | price = price_middle 39 | elif price: 40 | price = price 41 | else: 42 | price = '' 43 | 44 | item = FindtripItem() 45 | item['site'] = 'Ctrip' 46 | item['company'] = company 47 | item['flight_time'] = flight_time 48 | item['airports'] = airports 49 | item['price'] = price 50 | items.append(item) 51 | 52 | return items 53 | -------------------------------------------------------------------------------- /findtrip/findtrip/spiders/spider_ctrip.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/spiders/spider_ctrip.pyc -------------------------------------------------------------------------------- /findtrip/findtrip/spiders/washctrip.py: -------------------------------------------------------------------------------- 1 | def wash(dateList): 2 | dateList = map(lambda x : x.split(), dateList) 3 | cleanList = [] 4 | for each in dateList: 5 | if each: 6 | cleanList.append(each[0]) 7 | return cleanList 8 | -------------------------------------------------------------------------------- /findtrip/findtrip/spiders/washctrip.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/spiders/washctrip.pyc -------------------------------------------------------------------------------- /findtrip/findtrip/tickets.json~: -------------------------------------------------------------------------------- 1 | [[{"passtime": ["2h25m", "1h35m"], "company": ["Hebei", "NS8160(738)", "Same", "Hebei", 
"NS8160(738)"], "airports": ["PEK", "Beijing", "HSN", "Zhoushan", "HSN", "Zhoushan", "XMN", "Xiamen"], "price": ["\u00a5", "730"], "flight_time": ["8:35", "AM", "11:00", "AM", "50m", "11:50", "AM", "1:25", "PM"]}, 2 | {"passtime": ["2h25m", "1h35m"], "company": ["Xiamen", "MF8160(738)", "Same", "Xiamen", "MF8160(738)"], "airports": ["PEK", "Beijing", "HSN", "Zhoushan", "HSN", "Zhoushan", "XMN", "Xiamen"], "price": ["\u00a5", "732"], "flight_time": ["8:35", "AM", "11:00", "AM", "50m", "11:50", "AM", "1:25", "PM"]}, 3 | {"passtime": ["3h15m"], "company": ["Hebei", "NS8104(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "733"], "flight_time": ["6:45", "AM", "10:00", "AM"]}, 4 | {"passtime": ["3h15m"], "company": ["Xiamen", "MF8104(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "733"], "flight_time": ["6:45", "AM", "10:00", "AM"]}, 5 | {"passtime": ["1h55m", "1h35m"], "company": ["China", "MU5170(320)", "Same", "China", "MU5170(320)"], "airports": ["PEK", "Beijing", "HFE", "Hefei", "HFE", "Hefei", "XMN", "Xiamen"], "price": ["\u00a5", "750"], "flight_time": ["5:10", "PM", "7:05", "PM", "45m", "7:50", "PM", "9:25", "PM"]}, 6 | {"passtime": ["3h"], "company": ["China", "KN5927(73S)"], "airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "price": ["\u00a5", "765"], "flight_time": ["4:00", "PM", "7:00", "PM"]}, 7 | {"passtime": ["2h45m"], "company": ["China", "KN5925(73E)"], "airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "price": ["\u00a5", "765"], "flight_time": ["6:45", "AM", "9:30", "AM"]}, 8 | {"passtime": ["3h"], "company": ["Air", "CA1871(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "800"], "flight_time": ["7:20", "AM", "10:20", "AM"]}, 9 | {"passtime": ["3h"], "company": ["Capital", "JD5375(320)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "806"], "flight_time": ["6:55", "AM", "9:55", "AM"]}, 10 | {"passtime": ["1h10m", "2h15m"], "company": 
["Shandong", "SC1152(738)", "Same", "Shandong", "SC1152(738)"], "airports": ["PEK", "Beijing", "TNA", "Jinan", "TNA", "Jinan", "XMN", "Xiamen"], "price": ["\u00a5", "904"], "flight_time": ["7:20", "PM", "8:30", "PM", "50m", "9:20", "PM", "11:35", "PM"]}, 11 | {"passtime": ["1h30m", "2h30m"], "company": ["Shandong", "SC4678(738)", "Same", "Shandong", "SC4678(738)"], "airports": ["PEK", "Beijing", "RIZ", "Rizhao", "RIZ", "Rizhao", "XMN", "Xiamen"], "price": ["\u00a5", "904"], "flight_time": ["3:20", "PM", "4:50", "PM", "45m", "5:35", "PM", "8:05", "PM"]}, 12 | {"passtime": ["1h10m", "2h15m"], "company": ["Air", "CA1152(738)", "Same", "Air", "CA1152(738)"], "airports": ["PEK", "Beijing", "TNA", "Jinan", "TNA", "Jinan", "XMN", "Xiamen"], "price": ["\u00a5", "930"], "flight_time": ["7:20", "PM", "8:30", "PM", "50m", "9:20", "PM", "11:35", "PM"]}, 13 | {"passtime": ["1h30m", "2h30m"], "company": ["Air", "CA4678(738)", "Same", "Air", "CA4678(738)"], "airports": ["PEK", "Beijing", "RIZ", "Rizhao", "RIZ", "Rizhao", "XMN", "Xiamen"], "price": ["\u00a5", "930"], "flight_time": ["3:20", "PM", "4:50", "PM", "45m", "5:35", "PM", "8:05", "PM"]}, 14 | {"passtime": ["3h5m"], "company": ["Air", "CA1801(73H)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "940"], "flight_time": ["8:05", "PM", "11:10", "PM"]}, 15 | {"passtime": ["3h5m"], "company": ["Air", "CA1833(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "940"], "flight_time": ["11:35", "AM", "2:40", "PM"]}, 16 | {"passtime": ["3h10m"], "company": ["Air", "CA1809(73C)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "940"], "flight_time": ["8:50", "AM", "12:00", "PM"]}, 17 | {"passtime": ["3h"], "company": ["Xiamen", "MF8170(787)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1074"], "flight_time": ["9:25", "PM", "0:25", "AM", "(+1)"]}, 18 | {"passtime": ["3h"], "company": ["Hebei", "NS8170(787)"], "airports": ["PEK", 
"Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1080"], "flight_time": ["9:25", "PM", "0:25", "AM", "(+1)"]}, 19 | {"passtime": ["3h5m"], "company": ["Air", "CA1815(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1100"], "flight_time": ["4:20", "PM", "7:25", "PM"]}, 20 | {"passtime": ["3h"], "company": ["Hainan", "HU7191(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1119"], "flight_time": ["9:10", "AM", "12:10", "PM"]}, 21 | {"passtime": ["3h10m"], "company": ["Hainan", "HU7291(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1119"], "flight_time": ["7:25", "AM", "10:35", "AM"]}, 22 | {"passtime": ["3h"], "company": ["China", "MU3849(73S)"], "airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "price": ["\u00a5", "1124"], "flight_time": ["4:00", "PM", "7:00", "PM"]}, 23 | {"passtime": ["2h45m"], "company": ["China", "MU3811(73E)"], "airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "price": ["\u00a5", "1124"], "flight_time": ["6:45", "AM", "9:30", "AM"]}, 24 | {"passtime": ["3h"], "company": ["Xiamen", "MF8106(787)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1245"], "flight_time": ["6:55", "PM", "9:55", "PM"]}, 25 | {"passtime": ["2h55m"], "company": ["Xiamen", "MF8102(787)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1245"], "flight_time": ["4:10", "PM", "7:05", "PM"]}] -------------------------------------------------------------------------------- /findtrip/findtrip/useragent.py: -------------------------------------------------------------------------------- 1 | # -*-coding:utf-8-*- 2 | 3 | from scrapy import log 4 | 5 | """避免被ban策略之一:使用useragent池。 6 | 7 | 使用注意:需在settings.py中进行相应的设置。 8 | """ 9 | 10 | import random 11 | from scrapy import signals 12 | from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware 13 | 14 | class RandomUserAgentMiddleware(UserAgentMiddleware): 15 | 16 | def __init__(self, 
settings, user_agent='Scrapy'): 17 | super(RandomUserAgentMiddleware, self).__init__() 18 | self.user_agent = user_agent 19 | user_agent_list_file = settings.get('USER_AGENT_LIST') 20 | if not user_agent_list_file: 21 | # If the USER_AGENT_LIST setting is not set, 22 | # use the default USER_AGENT or whatever was 23 | # passed to the middleware. 24 | ua = settings.get('USER_AGENT', user_agent) 25 | self.user_agent_list = [ua] 26 | else: 27 | with open(user_agent_list_file, 'r') as f: 28 | self.user_agent_list = [line.strip() for line in f.readlines()] 29 | 30 | @classmethod 31 | def from_crawler(cls, crawler): 32 | obj = cls(crawler.settings) 33 | crawler.signals.connect(obj.spider_opened, 34 | signal=signals.spider_opened) 35 | return obj 36 | 37 | def process_request(self, request, spider): 38 | user_agent = random.choice(self.user_agent_list) 39 | if user_agent: 40 | request.headers.setdefault('User-Agent', user_agent) 41 | -------------------------------------------------------------------------------- /findtrip/findtrip/useragent.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fankcoder/findtrip/4b063c717e53fde5bbbd9e39267f2142d0ce8653/findtrip/findtrip/useragent.pyc -------------------------------------------------------------------------------- /findtrip/ghostdriver.log: -------------------------------------------------------------------------------- 1 | PhantomJS is launching GhostDriver...
2 | [INFO - 2016-04-06T08:38:18.329Z] GhostDriver - Main - running on port 33936 3 | [INFO - 2016-04-06T08:38:19.274Z] Session [ea072970-fbd2-11e5-8b2c-b54983425cfa] - CONSTRUCTOR - Desired Capabilities: {"platform":"ANY","browserName":"phantomjs","version":"","javascriptEnabled":true} 4 | [INFO - 2016-04-06T08:38:19.274Z] Session [ea072970-fbd2-11e5-8b2c-b54983425cfa] - CONSTRUCTOR - Negotiated Capabilities: {"browserName":"phantomjs","version":"1.9.0","driverName":"ghostdriver","driverVersion":"1.0.3","platform":"linux-unknown-64bit","javascriptEnabled":true,"takesScreenshot":true,"handlesAlerts":false,"databaseEnabled":false,"locationContextEnabled":false,"applicationCacheEnabled":false,"browserConnectionEnabled":false,"cssSelectorsEnabled":true,"webStorageEnabled":false,"rotatable":false,"acceptSslCerts":false,"nativeEvents":true,"proxy":{"proxyType":"direct"}} 5 | [INFO - 2016-04-06T08:38:19.274Z] SessionManagerReqHand - _postNewSessionCommand - New Session Created: ea072970-fbd2-11e5-8b2c-b54983425cfa 6 | [INFO - 2016-04-06T08:42:35.647Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 7 | [INFO - 2016-04-06T08:45:14.712Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 8 | [INFO - 2016-04-06T08:50:14.793Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 9 | [INFO - 2016-04-06T08:55:14.891Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 10 | l?DDate1=2016-04-19\\\", \\\"sessionId\\\": 
\\\"ea072970-fbd2-11e5-8b2c-b54983425cfa\\\"}\",\"url\":\"/url\",\"urlParsed\":{\"anchor\":\"\",\"query\":\"\",\"file\":\"url\",\"directory\":\"/\",\"path\":\"/url\",\"relative\":\"/url\",\"port\":\"\",\"host\":\"\",\"password\":\"\",\"user\":\"\",\"userInfo\":\"\",\"authority\":\"\",\"protocol\":\"\",\"source\":\"/url\",\"queryKey\":{},\"chunks\":[\"url\"]},\"urlOriginal\":\"/session/ea072970-fbd2-11e5-8b2c-b54983425cfa/url\"}", 11 | "name": "NoSuchWindow", 12 | "errorStatusCode": 23, 13 | "errorSessionId": "ea072970-fbd2-11e5-8b2c-b54983425cfa", 14 | "errorClassName": "SessionReqHand", 15 | "errorScreenshot": "", 16 | "line": 190, 17 | "sourceId": 140576373271936, 18 | "sourceURL": ":/ghostdriver/request_handlers/request_handler.js", 19 | "stack": "NoSuchWindow: Error Message => 'Currently Window handle/name is invalid (closed?)'\n caused by Request => {\"headers\":{\"Accept\":\"application/json\",\"Accept-Encoding\":\"identity\",\"Connection\":\"close\",\"Content-Length\":\"133\",\"Content-Type\":\"application/json;charset=UTF-8\",\"Host\":\"127.0.0.1:33936\",\"User-Agent\":\"Python-urllib/2.7\"},\"httpVersion\":\"1.1\",\"method\":\"POST\",\"post\":\"{\\\"url\\\": \\\"http://flights.ctrip.com/booking/XMN-BJS-day-1.html?DDate1=2016-04-19\\\", \\\"sessionId\\\": \\\"ea072970-fbd2-11e5-8b2c-b54983425cfa\\\"}\",\"url\":\"/url\",\"urlParsed\":{\"anchor\":\"\",\"query\":\"\",\"file\":\"url\",\"directory\":\"/\",\"path\":\"/url\",\"relative\":\"/url\",\"port\":\"\",\"host\":\"\",\"password\":\"\",\"user\":\"\",\"userInfo\":\"\",\"authority\":\"\",\"protocol\":\"\",\"source\":\"/url\",\"queryKey\":{},\"chunks\":[\"url\"]},\"urlOriginal\":\"/session/ea072970-fbd2-11e5-8b2c-b54983425cfa/url\"}\n at :/ghostdriver/request_handlers/request_handler.js:190\n at :/ghostdriver/request_handlers/request_handler.js:165\n at :/ghostdriver/request_handlers/session_request_handler.js:450\n at :/ghostdriver/request_handlers/session_request_handler.js:86\n at 
:/ghostdriver/request_handlers/request_handler.js:61\n at :/ghostdriver/request_handlers/router_request_handler.js:82", 20 | "stackArray": [ 21 | { 22 | "sourceURL": ":/ghostdriver/request_handlers/request_handler.js", 23 | "line": 190 24 | }, 25 | { 26 | "sourceURL": ":/ghostdriver/request_handlers/request_handler.js", 27 | "line": 165 28 | }, 29 | { 30 | "sourceURL": ":/ghostdriver/request_handlers/session_request_handler.js", 31 | "line": 450 32 | }, 33 | { 34 | "sourceURL": ":/ghostdriver/request_handlers/session_request_handler.js", 35 | "line": 86 36 | }, 37 | { 38 | "sourceURL": ":/ghostdriver/request_handlers/request_handler.js", 39 | "line": 61 40 | }, 41 | { 42 | "sourceURL": ":/ghostdriver/request_handlers/router_request_handler.js", 43 | "line": 82 44 | } 45 | ] 46 | } 47 | [INFO - 2016-04-06T08:43:18.427Z] SessionManagerReqHand - _cleanupWindowlessSessions - Asynchronous Sessions clean-up phase starting NOW 48 | [INFO - 2016-04-06T08:43:18.427Z] SessionManagerReqHand - _cleanupWindowlessSessions - Deleted Session 'ea072970-fbd2-11e5-8b2c-b54983425cfa', because windowless 49 | -------------------------------------------------------------------------------- /findtrip/qua.json: -------------------------------------------------------------------------------- 1 | [{"airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "company": ["China", "KN5927(73S)"], "site": "Qua", "flight_time": ["4:00", "PM", "7:00", "PM"], "passtime": ["3h"], "price": ["\u00a5", "689"]}, 2 | {"airports": ["PEK", "Beijing", "RIZ", "Rizhao", "RIZ", "Rizhao", "XMN", "Xiamen"], "company": ["Shandong", "SC4678(738)", "Same", "Shandong", "SC4678(738)"], "site": "Qua", "flight_time": ["3:20", "PM", "4:50", "PM", "45m", "5:35", "PM", "8:05", "PM"], "passtime": ["1h30m", "2h30m"], "price": ["\u00a5", "712"]}, 3 | {"airports": ["PEK", "Beijing", "HSN", "Zhoushan", "HSN", "Zhoushan", "XMN", "Xiamen"], "company": ["Xiamen", "MF8160(738)", "Same", "Xiamen", "MF8160(738)"], "site": "Qua", 
"flight_time": ["8:35", "AM", "11:00", "AM", "50m", "11:50", "AM", "1:25", "PM"], "passtime": ["2h25m", "1h35m"], "price": ["\u00a5", "733"]}, 4 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Capital", "JD5375(320)"], "site": "Qua", "flight_time": ["6:55", "AM", "9:55", "AM"], "passtime": ["3h"], "price": ["\u00a5", "772"]}, 5 | {"airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "company": ["China", "KN5925(73E)"], "site": "Qua", "flight_time": ["6:45", "AM", "9:30", "AM"], "passtime": ["2h45m"], "price": ["\u00a5", "774"]}, 6 | {"airports": ["PEK", "Beijing", "HSN", "Zhoushan", "HSN", "Zhoushan", "XMN", "Xiamen"], "company": ["Hebei", "NS8160(738)", "Same", "Hebei", "NS8160(738)"], "site": "Qua", "flight_time": ["8:35", "AM", "11:00", "AM", "50m", "11:50", "AM", "1:25", "PM"], "passtime": ["2h25m", "1h35m"], "price": ["\u00a5", "787"]}, 7 | {"airports": ["PEK", "Beijing", "TNA", "Jinan", "TNA", "Jinan", "XMN", "Xiamen"], "company": ["Shandong", "SC1152(738)", "Same", "Shandong", "SC1152(738)"], "site": "Qua", "flight_time": ["7:20", "PM", "8:30", "PM", "50m", "9:20", "PM", "11:35", "PM"], "passtime": ["1h10m", "2h15m"], "price": ["\u00a5", "899"]}, 8 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Xiamen", "MF8104(738)"], "site": "Qua", "flight_time": ["6:45", "AM", "9:45", "AM"], "passtime": ["3h"], "price": ["\u00a5", "909"]}, 9 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Hebei", "NS8104(738)"], "site": "Qua", "flight_time": ["6:45", "AM", "9:45", "AM"], "passtime": ["3h"], "price": ["\u00a5", "917"]}, 10 | {"airports": ["PEK", "Beijing", "TNA", "Jinan", "TNA", "Jinan", "XMN", "Xiamen"], "company": ["Air", "CA1152(738)", "Same", "Air", "CA1152(738)"], "site": "Qua", "flight_time": ["7:20", "PM", "8:30", "PM", "50m", "9:20", "PM", "11:35", "PM"], "passtime": ["1h10m", "2h15m"], "price": ["\u00a5", "930"]}, 11 | {"airports": ["PEK", "Beijing", "RIZ", "Rizhao", "RIZ", "Rizhao", "XMN", "Xiamen"], "company": 
["Air", "CA4678(738)", "Same", "Air", "CA4678(738)"], "site": "Qua", "flight_time": ["3:20", "PM", "4:50", "PM", "45m", "5:35", "PM", "8:05", "PM"], "passtime": ["1h30m", "2h30m"], "price": ["\u00a5", "930"]}, 12 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Air", "CA1801(73H)"], "site": "Qua", "flight_time": ["8:05", "PM", "11:10", "PM"], "passtime": ["3h5m"], "price": ["\u00a5", "940"]}, 13 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Air", "CA1871(738)"], "site": "Qua", "flight_time": ["7:20", "AM", "10:20", "AM"], "passtime": ["3h"], "price": ["\u00a5", "940"]}, 14 | {"airports": ["PEK", "Beijing", "HSN", "Zhoushan", "HSN", "Zhoushan", "XMN", "Xiamen"], "company": ["China", "CZ5102(738)", "Same", "China", "CZ5102(738)"], "site": "Qua", "flight_time": ["8:35", "AM", "11:00", "AM", "50m", "11:50", "AM", "1:25", "PM"], "passtime": ["2h25m", "1h35m"], "price": ["\u00a5", "960"]}, 15 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["China", "CZ5056(738)"], "site": "Qua", "flight_time": ["6:45", "AM", "9:45", "AM"], "passtime": ["3h"], "price": ["\u00a5", "960"]}, 16 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Xiamen", "MF8170(757)"], "site": "Qua", "flight_time": ["9:25", "PM", "0:25", "AM", "(+1)"], "passtime": ["3h"], "price": ["\u00a5", "1077"]}, 17 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Hebei", "NS8170(757)"], "site": "Qua", "flight_time": ["9:25", "PM", "0:25", "AM", "(+1)"], "passtime": ["3h"], "price": ["\u00a5", "1082"]}, 18 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Air", "CA1815(738)"], "site": "Qua", "flight_time": ["4:20", "PM", "7:25", "PM"], "passtime": ["3h5m"], "price": ["\u00a5", "1100"]}, 19 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Air", "CA1809(73C)"], "site": "Qua", "flight_time": ["8:50", "AM", "12:00", "PM"], "passtime": ["3h10m"], "price": ["\u00a5", "1100"]}, 20 | {"airports": ["PEK", "Beijing", 
"XMN", "Xiamen"], "company": ["Hainan", "HU7191(767)"], "site": "Qua", "flight_time": ["9:10", "AM", "12:10", "PM"], "passtime": ["3h"], "price": ["\u00a5", "1107"]}, 21 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Hainan", "HU7291(738)"], "site": "Qua", "flight_time": ["7:25", "AM", "10:35", "AM"], "passtime": ["3h10m"], "price": ["\u00a5", "1107"]}, 22 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["China", "CZ5110(757)"], "site": "Qua", "flight_time": ["9:25", "PM", "0:25", "AM", "(+1)"], "passtime": ["3h"], "price": ["\u00a5", "1130"]}, 23 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Air", "CA1833(73G)"], "site": "Qua", "flight_time": ["11:35", "AM", "2:40", "PM"], "passtime": ["3h5m"], "price": ["\u00a5", "1240"]}, 24 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Xiamen", "MF8106(787)"], "site": "Qua", "flight_time": ["6:55", "PM", "9:55", "PM"], "passtime": ["3h"], "price": ["\u00a5", "1243"]}, 25 | {"airports": ["PEK", "Beijing", "XMN", "Xiamen"], "company": ["Xiamen", "MF8102(787)"], "site": "Qua", "flight_time": ["4:10", "PM", "7:05", "PM"], "passtime": ["2h55m"], "price": ["\u00a5", "1243"]}] -------------------------------------------------------------------------------- /findtrip/scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.org/en/latest/deploy.html 5 | 6 | [settings] 7 | default = findtrip.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = findtrip 12 | -------------------------------------------------------------------------------- /findtrip/tickets1.json~: -------------------------------------------------------------------------------- 1 | [[ -------------------------------------------------------------------------------- /findtrip/tickets2.json~: 
-------------------------------------------------------------------------------- 1 | [[[[{"passtime": ["2h25m", "1h35m"], "company": ["Hebei", "NS8160(738)", "Same", "Hebei", "NS8160(738)"], "airports": ["PEK", "Beijing", "HSN", "Zhoushan", "HSN", "Zhoushan", "XMN", "Xiamen"], "price": ["\u00a5", "730"], "flight_time": ["8:35", "AM", "11:00", "AM", "50m", "11:50", "AM", "1:25", "PM"]}, 2 | {"passtime": ["2h25m", "1h35m"], "company": ["Xiamen", "MF8160(738)", "Same", "Xiamen", "MF8160(738)"], "airports": ["PEK", "Beijing", "HSN", "Zhoushan", "HSN", "Zhoushan", "XMN", "Xiamen"], "price": ["\u00a5", "733"], "flight_time": ["8:35", "AM", "11:00", "AM", "50m", "11:50", "AM", "1:25", "PM"]}, 3 | {"passtime": ["2h10m", "1h35m"], "company": ["China", "MU5170(320)", "Same", "China", "MU5170(320)"], "airports": ["PEK", "Beijing", "HFE", "Hefei", "HFE", "Hefei", "XMN", "Xiamen"], "price": ["\u00a5", "750"], "flight_time": ["4:55", "PM", "7:05", "PM", "45m", "7:50", "PM", "9:25", "PM"]}, 4 | {"passtime": ["3h"], "company": ["China", "KN5927(73S)"], "airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "price": ["\u00a5", "765"], "flight_time": ["4:00", "PM", "7:00", "PM"]}, 5 | {"passtime": ["2h45m"], "company": ["China", "KN5925(73E)"], "airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "price": ["\u00a5", "765"], "flight_time": ["6:45", "AM", "9:30", "AM"]}, 6 | {"passtime": ["3h"], "company": ["Air", "CA1871(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "800"], "flight_time": ["7:20", "AM", "10:20", "AM"]}, 7 | {"passtime": ["3h"], "company": ["Capital", "JD5375(320)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "807"], "flight_time": ["6:55", "AM", "9:55", "AM"]}, 8 | {"passtime": ["1h10m", "2h15m"], "company": ["Shandong", "SC1152(738)", "Same", "Shandong", "SC1152(738)"], "airports": ["PEK", "Beijing", "TNA", "Jinan", "TNA", "Jinan", "XMN", "Xiamen"], "price": ["\u00a5", "904"], "flight_time": ["7:20", "PM", "8:30", "PM", 
"50m", "9:20", "PM", "11:35", "PM"]}, 9 | {"passtime": ["3h"], "company": ["Hebei", "NS8104(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "911"], "flight_time": ["6:45", "AM", "9:45", "AM"]}, 10 | {"passtime": ["3h"], "company": ["Xiamen", "MF8104(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "912"], "flight_time": ["6:45", "AM", "9:45", "AM"]}, 11 | {"passtime": ["1h10m", "2h15m"], "company": ["Air", "CA1152(738)", "Same", "Air", "CA1152(738)"], "airports": ["PEK", "Beijing", "TNA", "Jinan", "TNA", "Jinan", "XMN", "Xiamen"], "price": ["\u00a5", "930"], "flight_time": ["7:20", "PM", "8:30", "PM", "50m", "9:20", "PM", "11:35", "PM"]}, 12 | {"passtime": ["3h5m"], "company": ["Air", "CA1801(73H)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "940"], "flight_time": ["8:05", "PM", "11:10", "PM"]}, 13 | {"passtime": ["1h30m", "2h30m"], "company": ["Shandong", "SC4678(738)", "Same", "Shandong", "SC4678(738)"], "airports": ["PEK", "Beijing", "RIZ", "Rizhao", "RIZ", "Rizhao", "XMN", "Xiamen"], "price": ["\u00a5", "1054"], "flight_time": ["3:20", "PM", "4:50", "PM", "45m", "5:35", "PM", "8:05", "PM"]}, 14 | {"passtime": ["3h"], "company": ["Xiamen", "MF8170(757)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1074"], "flight_time": ["9:25", "PM", "0:25", "AM", "(+1)"]}, 15 | {"passtime": ["3h"], "company": ["Hebei", "NS8170(757)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1080"], "flight_time": ["9:25", "PM", "0:25", "AM", "(+1)"]}, 16 | {"passtime": ["3h5m"], "company": ["Air", "CA1815(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1100"], "flight_time": ["4:20", "PM", "7:25", "PM"]}, 17 | {"passtime": ["1h30m", "2h30m"], "company": ["Air", "CA4678(738)", "Same", "Air", "CA4678(738)"], "airports": ["PEK", "Beijing", "RIZ", "Rizhao", "RIZ", "Rizhao", "XMN", "Xiamen"], "price": ["\u00a5", "1100"], 
"flight_time": ["3:20", "PM", "4:50", "PM", "45m", "5:35", "PM", "8:05", "PM"]}, 18 | {"passtime": ["3h10m"], "company": ["Air", "CA1809(73C)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1100"], "flight_time": ["8:50", "AM", "12:00", "PM"]}, 19 | {"passtime": ["3h"], "company": ["Hainan", "HU7191(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1118"], "flight_time": ["9:10", "AM", "12:10", "PM"]}, 20 | {"passtime": ["3h10m"], "company": ["Hainan", "HU7291(738)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1118"], "flight_time": ["7:25", "AM", "10:35", "AM"]}, 21 | {"passtime": ["3h"], "company": ["China", "MU3849(73S)"], "airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "price": ["\u00a5", "1124"], "flight_time": ["4:00", "PM", "7:00", "PM"]}, 22 | {"passtime": ["2h45m"], "company": ["China", "MU3811(73E)"], "airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "price": ["\u00a5", "1124"], "flight_time": ["6:45", "AM", "9:30", "AM"]}, 23 | {"passtime": ["3h5m"], "company": ["Air", "CA1833(73G)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1240"], "flight_time": ["11:35", "AM", "2:40", "PM"]}, 24 | {"passtime": ["3h"], "company": ["Xiamen", "MF8106(787)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1245"], "flight_time": ["6:55", "PM", "9:55", "PM"]}, 25 | {"passtime": ["2h55m"], "company": ["Xiamen", "MF8102(787)"], "airports": ["PEK", "Beijing", "XMN", "Xiamen"], "price": ["\u00a5", "1245"], "flight_time": ["4:10", "PM", "7:05", "PM"]}] -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | scrapy==0.25.0 2 | requests==2.4.3 3 | mongodb 4 | selenium==2.52.0 5 | phantomjs==2.1.1 6 | lxml==3.4.2 7 | 8 | -------------------------------------------------------------------------------- /requirements.txt~: 
-------------------------------------------------------------------------------- 1 | scrapy==0.25.0 2 | requests==2.4.3 3 | mongodb 4 | selenium==2.52.0 5 | phantomjs==2.1.1 6 | --------------------------------------------------------------------------------
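
The docstring of `findtrip/findtrip/useragent.py` above notes that the middleware needs matching configuration in `settings.py`. A minimal sketch of that wiring follows; the middleware priority (400) and the list-file path are illustrative assumptions, not values taken from this repository, while the `USER_AGENT_LIST` setting name and the `findtrip.useragent.RandomUserAgentMiddleware` import path come from the code and project layout shown above.

```python
# settings.py (excerpt) -- hypothetical wiring for RandomUserAgentMiddleware;
# priority value and file path are assumptions, adjust to taste.
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's stock UserAgentMiddleware so that only the random
    # variant sets the User-Agent header (path as of Scrapy 0.25).
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'findtrip.useragent.RandomUserAgentMiddleware': 400,
}

# Plain-text file with one user-agent string per line; read by
# RandomUserAgentMiddleware via settings.get('USER_AGENT_LIST').
USER_AGENT_LIST = '/path/to/useragents.txt'
```

If `USER_AGENT_LIST` is unset, the middleware's `__init__` falls back to the `USER_AGENT` setting (or the `user_agent` argument), so the spider still sends a fixed header rather than failing.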