├── .gitignore ├── README.md ├── coolscrapy ├── __init__.py ├── items.py ├── middlewares.py ├── models.py ├── pipelines.py ├── run.py ├── settings.py ├── spiders │ ├── __init__.py │ ├── article_spider.py │ ├── drug_spider.py │ ├── huxiu_spider.py │ ├── joke_spider.py │ ├── js_spider.py │ ├── link_spider.py │ ├── login1_spider.py │ ├── login2_spider.py │ ├── test_spider.py │ ├── tobacco_spider.py │ └── xml_spider.py └── utils.py ├── doc ├── LICENSE ├── README.md ├── SUMMARY.md ├── assets │ ├── favicon.png │ └── logo.png ├── book.json ├── fonts │ ├── fontawesome-webfont.woff │ └── fontawesome-webfont.woff2 ├── source │ ├── images │ │ ├── scrapy.png │ │ ├── scrapy01.png │ │ ├── scrapy02.png │ │ ├── scrapy03.png │ │ └── weixin1.png │ ├── other │ │ └── about.md │ ├── part1 │ │ ├── README.md │ │ ├── scrapy-01.md │ │ └── scrapy-02.md │ ├── part2 │ │ ├── README.md │ │ ├── scrapy-03.md │ │ ├── scrapy-04.md │ │ ├── scrapy-05.md │ │ ├── scrapy-06.md │ │ ├── scrapy-07.md │ │ └── scrapy-08.md │ ├── part3 │ │ ├── README.md │ │ └── scrapy-09.md │ ├── part4 │ │ ├── README.md │ │ ├── scrapy-10.md │ │ ├── scrapy-11.md │ │ └── scrapy-12.md │ └── part5 │ │ └── README.md └── styles │ └── website.scss ├── publish_gitbook.sh └── scrapy.cfg /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | *.pyc 3 | *.log 4 | node_modules/ 5 | _book/ 6 | .project 7 | *~ 8 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Python网络爬虫Scrapy框架研究 2 | 3 | ### Scrapy1.0教程 4 | 5 | * [Scrapy笔记(1)- 入门篇](https://www.xncoding.com/2016/03/08/scrapy-01.html) 6 | * [Scrapy笔记(2)- 完整示例](https://www.xncoding.com/2016/03/10/scrapy-02.html) 7 | * [Scrapy笔记(3)- Spider详解](https://www.xncoding.com/2016/03/12/scrapy-03.html) 8 | * [Scrapy笔记(4)- Selector详解](https://www.xncoding.com/2016/03/14/scrapy-04.html) 9 | * [Scrapy笔记(5)- Item详解](https://www.xncoding.com/2016/03/16/scrapy-05.html) 10 | * [Scrapy笔记(6)- Item Pipeline](https://www.xncoding.com/2016/03/18/scrapy-06.html) 11 | * [Scrapy笔记(7)- 内置服务](https://www.xncoding.com/2016/03/19/scrapy-07.html) 12 | * [Scrapy笔记(8)- 文件与图片](https://www.xncoding.com/2016/03/20/scrapy-08.html) 13 | * [Scrapy笔记(9)- 部署](https://www.xncoding.com/2016/03/21/scrapy-09.html) 14 | * [Scrapy笔记(10)- 动态配置爬虫](https://www.xncoding.com/2016/04/10/scrapy-10.html) 15 | * [Scrapy笔记(11)- 模拟登录](https://www.xncoding.com/2016/04/12/scrapy-11.html) 16 | * [Scrapy笔记(12)- 抓取动态网站](https://www.xncoding.com/2016/04/15/scrapy-12.html) 17 | 18 | ### Wiki 19 | Scrapy是Python开发的一个快速,高层次的屏幕抓取和web抓取框架,用于抓取web站点并从页面中提取结构化的数据。 20 | Scrapy用途广泛,可以用于数据挖掘、监测和自动化测试。 21 | 22 | Scrapy吸引人的地方在于它是一个框架,任何人都可以根据需求方便的修改。它也提供了多种类型爬虫的基类, 23 | 如BaseSpider、sitemap爬虫等,还有对web2.0爬虫的支持。 24 | 25 | Scrach是抓取的意思,这个Python的爬虫框架叫Scrapy,大概也是这个意思吧,就叫它:小刮刮吧。 26 | 27 | 基于最新的Scrapy 1.0编写,已更新至Python3.6 28 | 29 | ------------------------------------------ 30 | 31 | ### 对多个内容网站的采集,主要功能实现如下: 32 | 33 | * 最新文章列表的爬取 34 | * 采集的数据放入MySQL数据库中,并且包含标题,发布日期,文章来源,链接地址等等信息 35 | * URL去重复,程序保证对于同一个链接不会爬取两次 36 | * 防止封IP策略,如果抓取太频繁了,就被被封IP,目前采用三种策略保证不会被封: 37 | 38 | * 策略1:设置download_delay下载延迟,数字设置为5秒,越大越安全 39 | * 策略2:禁止Cookie,某些网站会通过Cookie识别用户身份,禁用后使得服务器无法识别爬虫轨迹 40 | * 策略3:使用user agent池。也就是每次发送的时候随机从池中选择不一样的浏览器头信息,防止暴露爬虫身份 41 | * 策略4:使用IP池,这个需要大量的IP资源,貌似还达不到这个要求 42 | * 策略5:分布式爬取,这个是针对大型爬虫系统的,对目前而言我们还用不到。 43 | 44 | * 模拟登录后的爬取 45 | * 针对RSS源的爬取 46 | * 对于每个新的爬取目标网站,或者原来的网站格式有变动的时候,需要做到可配置, 47 | 
只修改配置文件即可,而不是修改源文件,增加一段爬虫代码,主要是用xpath配置爬取规则 48 | * 定时爬取,设置定时任务周期性爬取 49 | * 与微信公共平台的结合,给大量的订阅号随机分配最新的订阅文章。 50 | * 利用scrapy-splash执行页面javascript后的内容爬取 51 | 52 | ------------------------------------------ 53 | 54 | ## 贡献代码 55 | 56 | 1. Fork 57 | 1. 创建您的特性分支 git checkout -b my-new-feature 58 | 1. 提交您的改动 git commit -am 'Added some feature' 59 | 1. 将您的修改记录提交到远程 git 仓库 git push origin my-new-feature 60 | 1. 然后到 github 网站的该 git 远程仓库的 my-new-feature 分支下发起 Pull Request 61 | 62 | ## 许可证 63 | Copyright (c) 2014-2016 [Xiong Neng](https://www.xncoding.com/) 64 | 65 | 基于 MIT 协议发布: 66 | 67 | -------------------------------------------------------------------------------- /coolscrapy/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yidao620c/core-scrapy/3c671c6ed3ba0fdc222d43048b083f18626ed117/coolscrapy/__init__.py -------------------------------------------------------------------------------- /coolscrapy/items.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your scraped items 4 | # 5 | # See documentation in: 6 | # http://doc.scrapy.org/en/latest/topics/items.html 7 | 8 | import scrapy 9 | 10 | 11 | class Article(scrapy.Item): 12 | title = scrapy.Field() 13 | url = scrapy.Field() 14 | body = scrapy.Field() 15 | publish_time = scrapy.Field() 16 | source_site = scrapy.Field() 17 | 18 | 19 | class NewsItem(scrapy.Item): 20 | """医药网新闻Item""" 21 | crawlkey = scrapy.Field() # 关键字 22 | title = scrapy.Field() # 标题 23 | link = scrapy.Field() # 链接 24 | desc = scrapy.Field() # 简述 25 | pubdate = scrapy.Field() # 发布时间 26 | category = scrapy.Field() # 分类 27 | location = scrapy.Field() # 来源 28 | content = scrapy.Field() # 内容 29 | htmlcontent = scrapy.Field() # html内容 30 | 31 | 32 | class HuxiuItem(scrapy.Item): 33 | """虎嗅网新闻Item""" 34 | title = scrapy.Field() # 标题 35 | link = scrapy.Field() # 链接 36 | desc = scrapy.Field() # 简述 37 | published = scrapy.Field() # 发布时间 38 | 39 | 40 | class BlogItem(scrapy.Item): 41 | """博客Item""" 42 | title = scrapy.Field() # 标题 43 | link = scrapy.Field() # 链接 44 | id = scrapy.Field() # ID号 45 | published = scrapy.Field() # 发布时间 46 | updated = scrapy.Field() # 更新时间 47 | 48 | 49 | class JokeItem(scrapy.Item): 50 | """糗事百科笑话Item""" 51 | content = scrapy.Field() 52 | image_urls = scrapy.Field() 53 | images = scrapy.Field() 54 | 55 | 56 | class TobaccoItem(scrapy.Item): 57 | """烟草条形码Item""" 58 | pics = scrapy.Field() # 图片 59 | product = scrapy.Field() # 产品 60 | product_type = scrapy.Field() # 产品类型 61 | package_spec = scrapy.Field() # 包装规格 62 | reference_price = scrapy.Field() # 参考零售价格 63 | manufacturer = scrapy.Field() # 生产厂家 64 | -------------------------------------------------------------------------------- /coolscrapy/middlewares.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: 中间件集合 5 | Desc : 6 | """ 7 | import redis 8 | import random 9 | import logging 10 | from scrapy import signals 11 | from scrapy.http import Request 12 | from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware 13 | 14 | logger = logging.getLogger(__name__) 15 | 16 | 17 | class RotateUserAgentMiddleware(UserAgentMiddleware): 18 | """避免被ban策略之一:使用useragent池。 19 | 使用注意:需在settings.py中进行相应的设置。 20 | 更好的方式是使用: 21 | pip install scrapy-fake-useragent 22 | DOWNLOADER_MIDDLEWARES = { 23 | 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, 24 | 'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400, 25 | } 26 | """ 27 | def __init__(self, user_agent=''): 28 | super(RotateUserAgentMiddleware, self).__init__() 29 | self.user_agent = user_agent 30 | 31 | def process_request(self, request, spider): 32 | ua = random.choice(self.user_agent_list) 33 | if ua: 34 | # 记录当前使用的useragent 35 | logger.debug('Current UserAgent: ' + ua) 36 | request.headers.setdefault('User-Agent', ua) 37 | 38 | # the default user_agent_list composes chrome,I E,firefox,Mozilla,opera,netscape 39 | # for more visit http://www.useragentstring.com/pages/useragentstring.php 40 | user_agent_list = [ 41 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36", 42 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36", 43 | "Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36", 44 | "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.3319.102 Safari/537.36", 45 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/4E423F", 46 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1623.0 Safari/537.36", 47 | "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36", 48 | "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36", 49 | "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36", 50 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36", 51 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36", 52 | "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36", 53 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", 54 | "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", 55 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", 56 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", 57 | "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", 58 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", 59 | "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", 60 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 61 | "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 62 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 63 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", 64 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", 65 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 
Safari/536.3", 66 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 67 | "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 68 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", 69 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", 70 | "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24" 71 | ] -------------------------------------------------------------------------------- /coolscrapy/models.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: 定义数据库模型实体 5 | Desc : 6 | """ 7 | import datetime 8 | 9 | from sqlalchemy.engine.url import URL 10 | from sqlalchemy.ext.declarative import declarative_base 11 | from sqlalchemy import create_engine, Column, Integer, String, Text, DateTime 12 | from coolscrapy.settings import DATABASE 13 | 14 | 15 | def db_connect(): 16 | """ 17 | Performs database connection using database settings from settings.py. 18 | Returns sqlalchemy engine instance 19 | """ 20 | return create_engine(URL(**DATABASE)) 21 | 22 | 23 | def create_news_table(engine): 24 | """""" 25 | Base.metadata.create_all(engine) 26 | 27 | 28 | def _get_date(): 29 | return datetime.datetime.now() 30 | 31 | Base = declarative_base() 32 | 33 | 34 | class ArticleRule(Base): 35 | """自定义文章爬取规则""" 36 | __tablename__ = 'article_rule' 37 | 38 | id = Column(Integer, primary_key=True) 39 | # 规则名称 40 | name = Column(String(30)) 41 | # 运行的域名列表,逗号隔开 42 | allow_domains = Column(String(100)) 43 | # 开始URL列表,逗号隔开 44 | start_urls = Column(String(100)) 45 | # 下一页的xpath 46 | next_page = Column(String(100)) 47 | # 文章链接正则表达式(子串) 48 | allow_url = Column(String(200)) 49 | # 文章链接提取区域xpath 50 | extract_from = Column(String(200)) 51 | # 文章标题xpath 52 | title_xpath = Column(String(100)) 53 | # 文章内容xpath 54 | body_xpath = Column(Text) 55 | # 发布时间xpath 56 | publish_time_xpath = Column(String(30)) 57 | # 文章来源 58 | source_site = Column(String(30)) 59 | # 规则是否生效 60 | enable = Column(Integer) 61 | 62 | 63 | class Article(Base): 64 | """文章类""" 65 | __tablename__ = 'articles' 66 | 67 | id = Column(Integer, primary_key=True) 68 | url = Column(String(100)) 69 | title = Column(String(100)) 70 | body = Column(Text) 71 | publish_time = Column(String(30)) 72 | source_site = Column(String(30)) 73 | 74 | 75 | class News(Base): 76 | """定义新闻实体""" 77 | __tablename__ = "wqy_push_essay" 78 | # 主键 79 | id = Column(Integer, primary_key=True) 80 | # 爬虫key 81 | crawlkey = Column('crawlkey', String(30), nullable=True) 82 | # 新闻分类 83 | category = Column('category', String(40), nullable=True) 84 | # 新闻链接地址 85 | link = Column('link', String(120), nullable=True) 86 | # 新闻来源 87 | location = Column('location', String(60), nullable=True) 88 | # 发布时间 89 | pubdate = Column('pubdate', DateTime, default=_get_date) 90 | # 新闻标题 91 | title = Column('title', String(120), nullable=True) 92 | # 正文 93 | content = Column('content', Text, nullable=True) 94 | # 带html标签的正文 95 | htmlcontent = Column('htmlcontent', Text, nullable=True) 96 | 97 | 98 | class Tobacco(Base): 99 | """烟草""" 100 | __tablename__ = 't_tobacco' 101 | 102 | id = Column(Integer, primary_key=True) 103 | product_name = Column(String(32)) 104 | brand = Column(String(32)) 105 | product_type = Column(String(16)) 106 | 
package_spec = Column(String(32)) 107 | reference_price = Column(String(32)) 108 | manufacturer = Column(String(32)) 109 | pics = Column(String(255)) 110 | created_time = Column(DateTime, default=_get_date) 111 | updated_time = Column(DateTime, default=_get_date) 112 | 113 | 114 | class Barcode(Base): 115 | """烟草条形码""" 116 | __tablename__ = 't_barcode' 117 | 118 | id = Column(Integer, primary_key=True) 119 | tobacco_id = Column(Integer) 120 | barcode = Column(String(32)) 121 | btype = Column(String(32)) 122 | created_time = Column(DateTime, default=_get_date) 123 | updated_time = Column(DateTime, default=_get_date) 124 | -------------------------------------------------------------------------------- /coolscrapy/pipelines.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # centos安装MySQL-python,root用户下 5 | # yum install mysql-devel 6 | # pip install MySQL-python 7 | # 8 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 9 | # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html 10 | 11 | import datetime 12 | import redis 13 | import json 14 | import logging 15 | from contextlib import contextmanager 16 | 17 | from scrapy import signals, Request 18 | from scrapy.exporters import JsonItemExporter 19 | from scrapy.pipelines.images import ImagesPipeline 20 | from scrapy.exceptions import DropItem 21 | from sqlalchemy.orm import sessionmaker 22 | from coolscrapy.models import News, db_connect, create_news_table, Article, Tobacco, Barcode 23 | 24 | Redis = redis.StrictRedis(host='localhost', port=6379, db=0) 25 | _log = logging.getLogger(__name__) 26 | 27 | 28 | class DuplicatesPipeline(object): 29 | """Item去重复""" 30 | 31 | def process_item(self, item, spider): 32 | if Redis.exists('url:%s' % item['url']): 33 | raise DropItem("Duplicate item found: %s" % item) 34 | else: 35 | Redis.set('url:%s' % item['url'], 1) 36 | return item 37 | 38 | 39 | class FilterWordsPipeline(object): 40 | """A pipeline for filtering out items which contain certain words in their 41 | description""" 42 | 43 | # put all words in lowercase 44 | words_to_filter = ['pilgrim'] 45 | 46 | def process_item(self, item, spider): 47 | for word in self.words_to_filter: 48 | if False: 49 | raise DropItem("Contains forbidden word: %s" % word) 50 | else: 51 | return item 52 | 53 | 54 | class JsonWriterPipeline(object): 55 | """ 56 | The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. 57 | If you really want to store all scraped items into a JSON file you should use the Feed exports. 
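A feed export needs no pipeline code at all; a rough sketch using standard
Scrapy options (the spider name below is only a placeholder):

    scrapy crawl some_spider -o items.json

or set FEED_URI / FEED_FORMAT in settings.py.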
58 | """ 59 | 60 | def __init__(self): 61 | pass 62 | self.file = open('items.json', 'wb') 63 | 64 | def open_spider(self, spider): 65 | """This method is called when the spider is opened.""" 66 | _log.info('open_spider....') 67 | 68 | def process_item(self, item, spider): 69 | _log.info('process_item....') 70 | line = json.dumps(dict(item)) + "\n" 71 | self.file.write(line) 72 | return item 73 | 74 | def close_spider(self, spider): 75 | """This method is called when the spider is closed.""" 76 | _log.info('close_spider....') 77 | self.file.close() 78 | 79 | 80 | class JsonExportPipeline(object): 81 | def __init__(self): 82 | _log.info('JsonExportPipeline.init....') 83 | self.files = {} 84 | 85 | @classmethod 86 | def from_crawler(cls, crawler): 87 | _log.info('JsonExportPipeline.from_crawler....') 88 | pipeline = cls() 89 | crawler.signals.connect(pipeline.spider_opened, signals.spider_opened) 90 | crawler.signals.connect(pipeline.spider_closed, signals.spider_closed) 91 | return pipeline 92 | 93 | def spider_opened(self, spider): 94 | _log.info('JsonExportPipeline.spider_opened....') 95 | file = open('%s.json' % spider.name, 'w+b') 96 | self.files[spider] = file 97 | self.exporter = JsonItemExporter(file) 98 | self.exporter.start_exporting() 99 | 100 | def spider_closed(self, spider): 101 | _log.info('JsonExportPipeline.spider_closed....') 102 | self.exporter.finish_exporting() 103 | file = self.files.pop(spider) 104 | file.close() 105 | 106 | def process_item(self, item, spider): 107 | _log.info('JsonExportPipeline.process_item....') 108 | self.exporter.export_item(item) 109 | return item 110 | 111 | 112 | @contextmanager 113 | def session_scope(Session): 114 | """Provide a transactional scope around a series of operations.""" 115 | session = Session() 116 | session.expire_on_commit = False 117 | try: 118 | yield session 119 | session.commit() 120 | except: 121 | session.rollback() 122 | raise 123 | finally: 124 | session.close() 125 | 126 | 127 | class ArticleDataBasePipeline(object): 128 | """保存文章到数据库""" 129 | 130 | def __init__(self): 131 | engine = db_connect() 132 | create_news_table(engine) 133 | self.Session = sessionmaker(bind=engine) 134 | 135 | def open_spider(self, spider): 136 | """This method is called when the spider is opened.""" 137 | pass 138 | 139 | def process_item(self, item, spider): 140 | a = Article(url=item["url"], 141 | title=item["title"].encode("utf-8"), 142 | publish_time=item["publish_time"].encode("utf-8"), 143 | body=item["body"].encode("utf-8"), 144 | source_site=item["source_site"].encode("utf-8")) 145 | with session_scope(self.Session) as session: 146 | session.add(a) 147 | 148 | def close_spider(self, spider): 149 | pass 150 | 151 | 152 | class NewsDatabasePipeline(object): 153 | """保存新闻到数据库""" 154 | 155 | def __init__(self): 156 | """ 157 | Initializes database connection and sessionmaker. 158 | Creates deals table. 159 | """ 160 | engine = db_connect() 161 | create_news_table(engine) 162 | # 初始化对象属性Session为可调用对象 163 | self.Session = sessionmaker(bind=engine) 164 | self.recent_links = None 165 | self.nowtime = datetime.datetime.now() 166 | 167 | def open_spider(self, spider): 168 | """This method is called when the spider is opened.""" 169 | _log.info('open_spider[%s]....' 
% spider.name) 170 | session = self.Session() 171 | recent_news = session.query(News).filter( 172 | News.crawlkey == spider.name, 173 | self.nowtime - News.pubdate <= datetime.timedelta(days=30)).all() 174 | self.recent_links = [t.link for t in recent_news] 175 | _log.info(self.recent_links) 176 | 177 | def process_item(self, item, spider): 178 | """Save deals in the database. 179 | This method is called for every item pipeline component. 180 | """ 181 | # 每次获取到Item调用这个callable,获得一个新的session 182 | _log.info('mysql->%s' % item['link']) 183 | if item['link'] not in self.recent_links: 184 | with session_scope(self.Session) as session: 185 | news = News(**item) 186 | session.add(news) 187 | self.recent_links.append(item['link']) 188 | return item 189 | 190 | def close_spider(self, spider): 191 | pass 192 | 193 | 194 | class MyImagesPipeline(ImagesPipeline): 195 | """先安装:pip install Pillow""" 196 | 197 | def item_completed(self, results, item, info): 198 | image_paths = [x['path'] for ok, x in results if ok] 199 | if not image_paths: 200 | raise DropItem("Item contains no images") 201 | return item 202 | 203 | 204 | class TobaccoImagePipeline(ImagesPipeline): 205 | """先安装:pip install Pillow""" 206 | 207 | def get_media_requests(self, item, info): 208 | yield Request(item['pics']) 209 | 210 | def item_completed(self, results, item, info): 211 | image_paths = [x['path'] for ok, x in results if ok] 212 | if not image_paths: 213 | raise DropItem("Item contains no images") 214 | # 设置tobacco的pics字段 215 | item['pics'] = image_paths[0] 216 | return item 217 | 218 | 219 | class TobaccoDatabasePipeline(object): 220 | """将烟草记录保存到数据库""" 221 | 222 | def __init__(self): 223 | engine = db_connect() 224 | self.Session = sessionmaker(bind=engine) 225 | 226 | def open_spider(self, spider): 227 | """This method is called when the spider is opened.""" 228 | pass 229 | 230 | def process_item(self, item, spider): 231 | logging.info("将烟草记录保存到数据库 start....") 232 | product_vals = item['product'].split('/') 233 | # 先插入一条烟的记录 234 | tobacco = Tobacco(product_name=product_vals[0], 235 | brand=product_vals[1], 236 | product_type=item['product_type'], 237 | package_spec=item['package_spec'], 238 | reference_price=item['reference_price'], 239 | manufacturer=item['manufacturer'], 240 | pics=item['pics']) 241 | with session_scope(self.Session) as session: 242 | session.add(tobacco) 243 | logging.info("tobacco.iiiiiiiiiiiiiiiiiiiiiiiiidddddd=, {}".format(tobacco.id)) 244 | # 然后再插入二维码记录 245 | if product_vals[2]: 246 | code_vals = product_vals[2].split(':') 247 | barcode = Barcode(tobacco_id=tobacco.id, 248 | btype=code_vals[0], 249 | barcode=code_vals[1]) 250 | with session_scope(self.Session) as session: 251 | session.add(barcode) 252 | if product_vals[3]: 253 | code_vals = product_vals[3].split(':') 254 | barcode = Barcode(tobacco_id=tobacco.id, 255 | btype=code_vals[0], 256 | barcode=code_vals[1]) 257 | with session_scope(self.Session) as session: 258 | # if barcode not in session: 259 | # session.merge(barcode) 260 | session.add(barcode) 261 | logging.info("将烟草记录保存到数据库 end....") 262 | 263 | def close_spider(self, spider): 264 | pass 265 | -------------------------------------------------------------------------------- /coolscrapy/run.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: sample 5 | Desc : 6 | """ 7 | 8 | import logging 9 | from twisted.internet import reactor 10 | from scrapy.crawler import CrawlerRunner 
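# CrawlerRunner, unlike CrawlerProcess, does not start the Twisted reactor by
# itself; that is why the __main__ block below schedules one crawl per enabled
# ArticleRule and then calls reactor.run() explicitly.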
11 | from scrapy.utils.project import get_project_settings 12 | from scrapy.utils.log import configure_logging 13 | from coolscrapy.models import db_connect 14 | from coolscrapy.models import ArticleRule 15 | from sqlalchemy.orm import sessionmaker 16 | from coolscrapy.spiders.article_spider import ArticleSpider 17 | 18 | if __name__ == '__main__': 19 | settings = get_project_settings() 20 | configure_logging(settings) 21 | db = db_connect() 22 | Session = sessionmaker(bind=db) 23 | session = Session() 24 | rules = session.query(ArticleRule).filter(ArticleRule.enable == 1).all() 25 | session.close() 26 | runner = CrawlerRunner(settings) 27 | 28 | for rule in rules: 29 | # spider = ArticleSpider(rule) # instantiate every spider using rule 30 | # stop reactor when spider closes 31 | # runner.signals.connect(spider_closing, signal=signals.spider_closed) 32 | runner.crawl(ArticleSpider, rule=rule) 33 | 34 | d = runner.join() 35 | d.addBoth(lambda _: reactor.stop()) 36 | 37 | # blocks process so always keep as the last statement 38 | reactor.run() 39 | logging.info('all finished.') 40 | -------------------------------------------------------------------------------- /coolscrapy/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for coolscrapy project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used. You can find more settings consulting the documentation: 7 | # 8 | # http://doc.scrapy.org/en/latest/topics/settings.html 9 | # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 10 | # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 11 | import logging 12 | 13 | BOT_NAME = 'coolscrapy' 14 | 15 | SPIDER_MODULES = ['coolscrapy.spiders'] 16 | NEWSPIDER_MODULE = 'coolscrapy.spiders' 17 | 18 | ITEM_PIPELINES = { 19 | # 'coolscrapy.pipelines.DuplicatesPipeline': 1, 20 | # 'coolscrapy.pipelines.FilterWordsPipeline': 2, 21 | # 'coolscrapy.pipelines.JsonWriterPipeline': 3, 22 | # 'coolscrapy.pipelines.JsonExportPipeline': 4, 23 | # 'coolscrapy.pipelines.ArticleDataBasePipeline': 5, 24 | 'coolscrapy.pipelines.TobaccoImagePipeline': 6, 25 | 'coolscrapy.pipelines.TobaccoDatabasePipeline': 7, 26 | } 27 | DOWNLOADER_MIDDLEWARES = { 28 | # 这里是下载中间件 29 | 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, 30 | 'coolscrapy.middlewares.RotateUserAgentMiddleware': 400, 31 | 'scrapy_splash.SplashCookiesMiddleware': 723, 32 | 'scrapy_splash.SplashMiddleware': 725, 33 | 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 34 | } 35 | SPIDER_MIDDLEWARES = { 36 | # 这是爬虫中间件, 543是运行的优先级 37 | # 'coolscrapy.middlewares.UrlUniqueMiddleware': 543, 38 | } 39 | 40 | # 几个反正被Ban的策略设置 41 | DOWNLOAD_TIMEOUT = 20 42 | DOWNLOAD_DELAY = 5 43 | # 禁用Cookie 44 | COOKIES_ENABLES = True 45 | #COOKIES_DEBUG = True 46 | 47 | LOG_LEVEL = logging.INFO 48 | LOG_STDOUT = True 49 | LOG_FILE = "E:/logs/spider.log" 50 | LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s" 51 | 52 | 53 | # windows pip install mysqlclient 54 | # linux pip install MySQL-python 55 | DATABASE = {'drivername': 'mysql', 56 | 'host': '123.207.66.156', 57 | 'port': '3306', 58 | 'username': 'root', 59 | 'password': '******', 60 | 'database': 'test', 61 | 'query': {'charset': 'utf8'}} 62 | 63 | # 图片下载设置 64 | IMAGES_STORE = 'E:/logs/' 65 | IMAGES_EXPIRES = 30 # 30天内抓取的都不会被重抓 66 | # 图片链接前缀 67 | URL_PREFIX = 
'http://enzhico.net/pics/' 68 | 69 | # js异步加载支持 70 | SPLASH_URL = 'http://192.168.203.91:8050' 71 | DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' 72 | HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage' 73 | 74 | # 扩展-定义爬取数量 75 | # CLOSESPIDER_ITEMCOUNT = 10 76 | 77 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 78 | # USER_AGENT = 'coolscrapy (+http://www.yourdomain.com)' 79 | 80 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 81 | # CONCURRENT_REQUESTS=32 82 | 83 | # Configure a delay for requests for the same website (default: 0) 84 | # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay 85 | # See also autothrottle settings and docs 86 | # DOWNLOAD_DELAY=3 87 | # The download delay setting will honor only one of: 88 | # CONCURRENT_REQUESTS_PER_DOMAIN=16 89 | # CONCURRENT_REQUESTS_PER_IP=16 90 | 91 | # Disable cookies (enabled by default) 92 | # COOKIES_ENABLED=False 93 | 94 | # Disable Telnet Console (enabled by default) 95 | # TELNETCONSOLE_ENABLED=False 96 | 97 | # Override the default request headers: 98 | # DEFAULT_REQUEST_HEADERS = { 99 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 100 | # 'Accept-Language': 'en', 101 | # } 102 | 103 | # Enable or disable spider middlewares 104 | # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 105 | # SPIDER_MIDDLEWARES = { 106 | # 'coolscrapy.middlewares.MyCustomSpiderMiddleware': 543, 107 | # } 108 | 109 | # Enable or disable downloader middlewares 110 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 111 | # DOWNLOADER_MIDDLEWARES = { 112 | # 'coolscrapy.middlewares.MyCustomDownloaderMiddleware': 543, 113 | # } 114 | 115 | # Enable or disable extensions 116 | # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html 117 | # EXTENSIONS = { 118 | # 'scrapy.telnet.TelnetConsole': None, 119 | # } 120 | 121 | # Configure item pipelines 122 | # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 123 | # ITEM_PIPELINES = { 124 | # 'coolscrapy.pipelines.SomePipeline': 300, 125 | # } 126 | 127 | # Enable and configure the AutoThrottle extension (disabled by default) 128 | # See http://doc.scrapy.org/en/latest/topics/autothrottle.html 129 | # NOTE: AutoThrottle will honour the standard settings for concurrency and delay 130 | # AUTOTHROTTLE_ENABLED=True 131 | # The initial download delay 132 | # AUTOTHROTTLE_START_DELAY=5 133 | # The maximum download delay to be set in case of high latencies 134 | # AUTOTHROTTLE_MAX_DELAY=60 135 | # Enable showing throttling stats for every response received: 136 | # AUTOTHROTTLE_DEBUG=False 137 | 138 | # Enable and configure HTTP caching (disabled by default) 139 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 140 | # HTTPCACHE_ENABLED=True 141 | # HTTPCACHE_EXPIRATION_SECS=0 142 | # HTTPCACHE_DIR='httpcache' 143 | # HTTPCACHE_IGNORE_HTTP_CODES=[] 144 | # HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage' 145 | -------------------------------------------------------------------------------- /coolscrapy/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 
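# For reference, a minimal spider placed in this package looks roughly like the
# commented sketch below (a generic example, not one of this project's spiders):
#
#     import scrapy
#
#     class ExampleSpider(scrapy.Spider):
#         name = 'example'
#         start_urls = ['http://example.com/']
#
#         def parse(self, response):
#             yield {'title': response.xpath('//title/text()').extract_first()}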
5 | -------------------------------------------------------------------------------- /coolscrapy/spiders/article_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: sample 5 | Desc : 6 | """ 7 | 8 | from coolscrapy.utils import parse_text 9 | from scrapy.spiders import CrawlSpider, Rule 10 | from scrapy.linkextractors import LinkExtractor 11 | from coolscrapy.items import Article 12 | 13 | 14 | class ArticleSpider(CrawlSpider): 15 | name = "article" 16 | 17 | def __init__(self, rule): 18 | self.rule = rule 19 | self.name = rule.name 20 | self.allowed_domains = rule.allow_domains.split(",") 21 | self.start_urls = rule.start_urls.split(",") 22 | rule_list = [] 23 | # 添加`下一页`的规则 24 | if rule.next_page: 25 | rule_list.append(Rule(LinkExtractor(restrict_xpaths=rule.next_page))) 26 | # 添加抽取文章链接的规则 27 | rule_list.append(Rule(LinkExtractor( 28 | allow=[rule.allow_url], 29 | restrict_xpaths=[rule.extract_from]), 30 | callback='parse_item')) 31 | self.rules = tuple(rule_list) 32 | super(ArticleSpider, self).__init__() 33 | 34 | def parse_item(self, response): 35 | self.log('Hi, this is an article page! %s' % response.url) 36 | 37 | article = Article() 38 | article["url"] = response.url 39 | 40 | title = response.xpath(self.rule.title_xpath).extract() 41 | article["title"] = parse_text(title, self.rule.name, 'title') 42 | 43 | body = response.xpath(self.rule.body_xpath).extract() 44 | article["body"] = parse_text(body, self.rule.name, 'body') 45 | 46 | publish_time = response.xpath(self.rule.publish_time_xpath).extract() 47 | article["publish_time"] = parse_text(publish_time, self.rule.name, 'publish_time') 48 | 49 | article["source_site"] = self.rule.source_site 50 | 51 | return article 52 | -------------------------------------------------------------------------------- /coolscrapy/spiders/drug_spider.py: -------------------------------------------------------------------------------- 1 | # #!/usr/bin/env python 2 | # # -*- encoding: utf-8 -*- 3 | # """ 4 | # Topic: 网络爬虫 5 | # Desc : 6 | # """ 7 | from ..items import * 8 | from scrapy.spiders import Spider 9 | from scrapy.spiders import XMLFeedSpider, CrawlSpider, Rule 10 | from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor 11 | from scrapy.linkextractors import LinkExtractor 12 | from scrapy.selector import Selector, HtmlXPathSelector 13 | from scrapy.loader import ItemLoader 14 | from scrapy import Request 15 | from scrapy.exceptions import DropItem 16 | from coolscrapy.utils import * 17 | from datetime import datetime 18 | import coolscrapy.settings as setting 19 | import re 20 | import uuid 21 | import urllib 22 | import contextlib 23 | import os 24 | import logging 25 | 26 | 27 | class CnyywXMLFeedSpider(CrawlSpider): 28 | """RSS/XML源爬虫,从医药网cn-yyw.cn上面订阅行业资讯""" 29 | name = 'cnyywfeed' 30 | allowed_domains = ['cn-yyw.cn'] 31 | start_urls = [ 32 | 'http://www.cn-yyw.cn/feed/rss.php?mid=21', 33 | ] 34 | 35 | def parse(self, response): 36 | item_xpaths = response.xpath('//channel/item') 37 | for i_xpath in item_xpaths: 38 | xitem = NewsItem() 39 | xitem['crawlkey'] = self.name 40 | xitem['title'] = ltos(i_xpath.xpath('title/text()').extract()) 41 | self.log('title=%s' % xitem['title'].encode('utf-8'), logging.INFO) 42 | xitem['link'] = ltos(i_xpath.xpath('link/text()').extract()) 43 | self.log('link=%s' % xitem['link'], logging.INFO) 44 | pubdate_temp = ltos(i_xpath.xpath('pubDate/text()').extract()) 45 | 
self.log('pubdate=%s' % pubdate_temp, logging.INFO) 46 | xitem['pubdate'] = datetime.strptime(pubdate_temp, '%Y-%m-%d %H:%M:%S') 47 | self.log('((((^_^))))'.center(50, '-'), logging.INFO) 48 | yield Request(url=xitem['link'], meta={'item': xitem}, callback=self.parse_item_page) 49 | 50 | def parse_item_page(self, response): 51 | page_item = response.meta['item'] 52 | try: 53 | self.log('-------------------> link_page url=%s' % page_item['link'], logging.INFO) 54 | page_item['category'] = ltos(response.xpath( 55 | '//div[@class="pos"]/a[last()]/text()').extract()) 56 | page_item['location'] = ltos(response.xpath( 57 | '//div[@class="info"]/a/text()').extract()) 58 | content_temp = "".join([tt.strip() for tt in response.xpath( 59 | '//div[@id="article"]').extract()]) 60 | re_con_strong = re.compile(r'(\s*)') 61 | content_temp = re_con_strong.sub(r'\1', content_temp) 62 | re_start_strong = re.compile(r'', re.I) 63 | content_temp = re_start_strong.sub('

', content_temp) 64 | re_end_strong = re.compile(r'</strong>', re.I) 65 | content_temp = re_end_strong.sub('

', content_temp) 66 | page_item['content'] = filter_tags(content_temp) 67 | page_item['htmlcontent'] = page_item['content'] 68 | return page_item 69 | except: 70 | self.log('ERROR-----%s' % response.url, logging.ERROR) 71 | return None 72 | 73 | 74 | class Drug39Spider(Spider): 75 | name = "drug39" 76 | allowed_domains = ["drug.39.net"] 77 | start_urls = [ 78 | "http://drug.39.net/yjxw/yydt/index.html" 79 | ] 80 | 81 | def parse(self, response): 82 | self.log('-------------------> link_list url=%s' % response.url, logging.INFO) 83 | links = response.xpath('//div[starts-with(@class, "listbox")]/ul/li') 84 | for link in links: 85 | url = link.xpath('span[1]/a/@href').extract()[0] 86 | date_str = link.xpath('span[2]/text()').extract()[0] 87 | date_str = date_str.split(' ')[1] + ':00' 88 | self.log('+++++++++++' + date_str, logging.INFO) 89 | yield Request(url=url, meta={'ds': date_str}, callback=self.parse_item_page) 90 | 91 | def parse_item_page(self, response): 92 | dstr = response.meta['ds'] 93 | try: 94 | self.log('-------------------> link_page url=%s' % response.url, logging.INFO) 95 | item = NewsItem() 96 | item['crawlkey'] = self.name 97 | item['category'] = ltos(response.xpath( 98 | '//span[@class="art_location"]/a[last()]/text()').extract()) 99 | item['link'] = response.url 100 | item['location'] = ltos(response.xpath( 101 | '//div[@class="date"]/em[2]/a/text()' 102 | '|//div[@class="date"]/em[2]/text()').extract()) 103 | pubdate_temp = ltos(response.xpath('//div[@class="date"]/em[1]/text()').extract()) 104 | item['pubdate'] = datetime.strptime(pubdate_temp + ' ' + dstr, '%Y-%m-%d %H:%M:%S') 105 | item['title'] = ltos(response.xpath('//h1/text()').extract()) 106 | content_temp = "".join([tt.strip() for tt in response.xpath( 107 | '//div[@id="contentText"]/p').extract()]) 108 | item['content'] = filter_tags(content_temp) 109 | htmlcontent = ltos(response.xpath('//div[@id="contentText"]').extract()) 110 | item['htmlcontent'] = clean_html(htmlcontent) 111 | # 特殊构造,不作为分组 112 | # (?=...)之后的字符串需要匹配表达式才能成功匹配 113 | # (?<=...)之前的字符串需要匹配表达式才能成功匹配 114 | pat_img = re.compile(r'( link_list url=%s' % response.url, logging.INFO) 159 | links = response.xpath('//div[@class="list"]/ul/li/p/a') 160 | for link in links: 161 | url = link.xpath('@href').extract()[0] 162 | yield Request(url=url, callback=self.parse_page) 163 | 164 | def parse_page(self, response): 165 | try: 166 | self.log('-------------------> link_page url=%s' % response.url, logging.INFO) 167 | item = NewsItem() 168 | item['crawlkey'] = self.name 169 | item['category'] = ltos(response.xpath( 170 | '//div[@class="current"]/a[last()]/text()').extract()) 171 | item['link'] = response.url 172 | head_line = ltos(response.xpath('//div[@class="ct01"]/text()[1]').extract()) 173 | item['location'] = head_line.strip().split()[1] 174 | item['pubdate'] = datetime.strptime(head_line.strip().split()[0], '%Y-%m-%d') 175 | item['title'] = ltos(response.xpath('//h1/text()').extract()) 176 | content_temp = "".join([tt.strip() for tt in response.xpath( 177 | '//div[@class="ct02"]/font/div/div|//div[@class="ct02"]/font/div').extract()]) 178 | item['content'] = filter_tags(content_temp) 179 | hc = ltos(response.xpath('//div[@class="ct02"]').extract()) 180 | htmlcontent = clean_html(hc) 181 | # 特殊构造,不作为分组 182 | # (?=...)之后的字符串需要匹配表达式才能成功匹配 183 | # (?<=...)之前的字符串需要匹配表达式才能成功匹配 184 | pat_img = re.compile(r'( link_list url=%s' % response.url, logging.INFO) 215 | links = response.xpath('//div[@class="list"]') 216 | for link in links: 217 | url = 
link.xpath('div[1]/a/@href').extract()[0] 218 | url = 'http://www.haoyao.net/news/' + url.split('/')[-1] 219 | self.log('+++++++++++url=' + url, logging.INFO) 220 | date_str = (link.xpath('div[2]/text()').extract()[0]).strip() + ' 00:00:00' 221 | self.log('+++++++++++date_str=' + date_str, logging.INFO) 222 | yield Request(url=url, meta={'ds': date_str}, callback=self.parse_item_page) 223 | 224 | def parse_item_page(self, response): 225 | dstr = response.meta['ds'] 226 | try: 227 | self.log('-------------------> link_page url=%s' % response.url, logging.INFO) 228 | item = NewsItem() 229 | item['crawlkey'] = self.name 230 | item['category'] = '医药新闻' 231 | item['link'] = response.url 232 | item['location'] = ltos(response.xpath('//font[@color="#666666"]/a/text()').extract()) 233 | item['pubdate'] = datetime.strptime(dstr, '%Y-%m-%d %H:%M:%S') 234 | item['title'] = ltos(response.xpath('//span[@id="lblTitle"]/text()').extract()) 235 | content_temp = "".join([tt.strip() for tt in response.xpath( 236 | '//span[@id="spContent"]/p').extract()]) 237 | item['content'] = filter_tags(content_temp) 238 | htmlcontent = ltos(response.xpath('//div[@id="divContent"]').extract()) 239 | item['htmlcontent'] = clean_html(htmlcontent) 240 | # 特殊构造,不作为分组 241 | # (?=...)之后的字符串需要匹配表达式才能成功匹配 242 | # (?<=...)之前的字符串需要匹配表达式才能成功匹配 243 | pat_img = re.compile(r'( 6: 40 | break 41 | count += 1 42 | item = JokeItem() 43 | title = jk.xpath('*[1]/text()').extract_first().encode('utf-8') 44 | pic_content = jk.xpath('a[2]/img') 45 | txt_content = jk.xpath('a[2]/p') 46 | img_src = None 47 | if pic_content: 48 | item['image_urls'] = pic_content.xpath('@src').extract() 49 | full_imgurl = item['image_urls'][0] 50 | img_src = full_imgurl 51 | filename = os.path.basename(item['image_urls'][0]) 52 | self.log('-------------' + full_imgurl, logging.INFO) 53 | with contextlib.closing(request.urlopen(full_imgurl)) as f: 54 | with open(os.path.join(IMAGES_STORE, filename), 'wb') as bfile: 55 | bfile.write(f.read()) 56 | item['content'] = '

%s

' % title 57 | else: 58 | content = '<br/>'.join(txt_content.xpath('text()').extract().encode('utf-8')) 59 | strong_txt = txt_content.xpath('strong/text()').extract_first() 60 | if strong_txt: 61 | content = '%s<br/>' % strong_txt.encode('utf-8') + content 62 | item['content'] = '

%s

%s' % (title, content) 63 | items.append(item) 64 | jokelist.append((item['content'], img_src)) 65 | send_mail(jokelist) 66 | return items 67 | -------------------------------------------------------------------------------- /coolscrapy/spiders/js_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: 对于js异步加载网页的支持 5 | Desc : 爬取京东网首页,下面内容基本都是异步加载的,我选取“猜你喜欢”这个异步加载内容来测试 6 | """ 7 | import logging 8 | import re 9 | import json 10 | import base64 11 | import scrapy 12 | from scrapy_splash import SplashRequest 13 | 14 | 15 | class JsSpider(scrapy.Spider): 16 | name = "jd" 17 | allowed_domains = ["jd.com"] 18 | start_urls = [ 19 | "http://www.jd.com/" 20 | ] 21 | 22 | def start_requests(self): 23 | splash_args = { 24 | 'wait': 0.5, 25 | # 'http_method': 'GET', 26 | # 'html': 1, 27 | # 'png': 1, 28 | # 'width': 600, 29 | # 'render_all': 1, 30 | } 31 | for url in self.start_urls: 32 | yield SplashRequest(url, self.parse_result, endpoint='render.html', 33 | args=splash_args) 34 | 35 | def parse_result(self, response): 36 | logging.info(u'----------使用splash爬取京东网首页异步加载内容-----------') 37 | guessyou = response.xpath('//div[@id="guessyou"]/div[1]/h2/text()').extract_first() 38 | logging.info(u"find:%s" % guessyou) 39 | logging.info(u'---------------success----------------') 40 | 41 | 42 | if __name__ == '__main__': 43 | body = u'发布于: 2016年04月08日' 44 | pat4 = re.compile(r'\d{4}年\d{2}月\d{2}日') 45 | if (re.search(pat4, body)): 46 | print(re.search(pat4, body).group()) 47 | 48 | -------------------------------------------------------------------------------- /coolscrapy/spiders/link_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: 爬取链接的蜘蛛 5 | Desc : 6 | """ 7 | import logging 8 | from coolscrapy.items import HuxiuItem 9 | import scrapy 10 | from scrapy.spiders import CrawlSpider, Rule 11 | from scrapy.linkextractors import LinkExtractor 12 | 13 | 14 | class LinkSpider(CrawlSpider): 15 | name = "link" 16 | allowed_domains = ["huxiu.com"] 17 | start_urls = [ 18 | "http://www.huxiu.com/index.php" 19 | ] 20 | 21 | rules = ( 22 | # 提取匹配正则式'/group?f=index_group'链接 (但是不能匹配'deny.html') 23 | # 并且会递归爬取(如果没有定义callback,默认follow=True). 24 | Rule(LinkExtractor(allow=('/group?f=index_group', ), deny=('deny\.html', ))), 25 | # 提取匹配'/article/\d+/\d+.html'的链接,并使用parse_item来解析它们下载后的内容,不递归 26 | Rule(LinkExtractor(allow=('/article/\d+/\d+\.html', )), callback='parse_item'), 27 | ) 28 | 29 | def parse_item(self, response): 30 | self.logger.info('Hi, this is an item page! 
%s', response.url) 31 | detail = response.xpath('//div[@class="article-wrap"]') 32 | item = HuxiuItem() 33 | item['title'] = detail.xpath('h1/text()')[0].extract() 34 | item['link'] = response.url 35 | item['published'] = detail.xpath( 36 | 'div[@class="article-author"]/span[@class="article-time"]/text()')[0].extract() 37 | logging.info(item['title'],item['link'],item['published']) 38 | yield item 39 | 40 | 41 | -------------------------------------------------------------------------------- /coolscrapy/spiders/login1_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: 登录爬虫 5 | Desc : 模拟登录https://github.com后将自己的issue全部爬出来 6 | tips:使用chrome调试post表单的时候勾选Preserve log和Disable cache 7 | """ 8 | import logging 9 | import re 10 | import sys 11 | import scrapy 12 | from scrapy.spiders import CrawlSpider, Rule 13 | from scrapy.linkextractors import LinkExtractor 14 | from scrapy.http import Request, FormRequest, HtmlResponse 15 | 16 | logging.basicConfig(level=logging.INFO, 17 | format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s', 18 | datefmt='%Y-%m-%d %H:%M:%S', 19 | handlers=[logging.StreamHandler(sys.stdout)]) 20 | 21 | 22 | class GithubSpider(CrawlSpider): 23 | name = "github" 24 | allowed_domains = ["github.com"] 25 | start_urls = [ 26 | 'https://github.com/issues', 27 | ] 28 | rules = ( 29 | # 消息列表 30 | Rule(LinkExtractor(allow=('/issues/\d+',), 31 | restrict_xpaths='//ul[starts-with(@class, "table-list")]/li/div[2]/a[2]'), 32 | callback='parse_page'), 33 | # 下一页, If callback is None follow defaults to True, otherwise it defaults to False 34 | Rule(LinkExtractor(restrict_xpaths='//a[@class="next_page"]')), 35 | ) 36 | post_headers = { 37 | "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 38 | "Accept-Encoding": "gzip, deflate", 39 | "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6", 40 | "Cache-Control": "no-cache", 41 | "Connection": "keep-alive", 42 | "Content-Type": "application/x-www-form-urlencoded", 43 | "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36", 44 | "Referer": "https://github.com/", 45 | } 46 | 47 | # 重写了爬虫类的方法, 实现了自定义请求, 运行成功后会调用callback回调函数 48 | def start_requests(self): 49 | return [Request("https://github.com/login", 50 | meta={'cookiejar': 1}, callback=self.post_login)] 51 | 52 | # FormRequeset 53 | def post_login(self, response): 54 | # 先去拿隐藏的表单参数authenticity_token 55 | authenticity_token = response.xpath( 56 | '//input[@name="authenticity_token"]/@value').extract_first() 57 | logging.info('authenticity_token=' + authenticity_token) 58 | # FormRequeset.from_response是Scrapy提供的一个函数, 用于post表单 59 | # 登陆成功后, 会调用after_login回调函数,如果url跟Request页面的一样就省略掉 60 | return [FormRequest.from_response(response, 61 | url='https://github.com/session', 62 | meta={'cookiejar': response.meta['cookiejar']}, 63 | headers=self.post_headers, # 注意此处的headers 64 | formdata={ 65 | 'utf8': '✓', 66 | 'login': 'yidao620c', 67 | 'password': '******', 68 | 'authenticity_token': authenticity_token 69 | }, 70 | callback=self.after_login, 71 | dont_filter=True 72 | )] 73 | 74 | def after_login(self, response): 75 | # 登录之后,开始进入我要爬取的私信页面 76 | for url in self.start_urls: 77 | logging.info('letter url=' + url) 78 | # 因为我们上面定义了Rule,所以只需要简单的生成初始爬取Request即可 79 | # yield self.make_requests_from_url(url) 80 | yield Request(url, meta={'cookiejar': 
response.meta['cookiejar']}) 81 | # 如果是普通的Spider,而不是CrawlerSpider,没有定义Rule规则, 82 | # 那么就需要像下面这样定义每个Request的callback 83 | # yield Request(url, dont_filter=True, 84 | # # meta={'dont_redirect': True, 85 | # # 'handle_httpstatus_list': [302]}, 86 | # callback=self.parse_page, ) 87 | 88 | def parse_page(self, response): 89 | """这个是使用LinkExtractor自动处理链接以及`下一页`""" 90 | logging.info(u'--------------消息分割线-----------------') 91 | logging.info(response.url) 92 | issue_title = response.xpath( 93 | '//span[@class="js-issue-title"]/text()').extract_first() 94 | logging.info(u'issue_title:' + issue_title.encode('utf-8')) 95 | 96 | # def parse_page(self, response): 97 | # """这个是不使用LinkExtractor我自己手动处理链接以及下一页""" 98 | # logging.info(response.url) 99 | # for each_msg in response.xpath('//ul[@class="Msgs"]/li'): 100 | # logging.info('--------------消息分割线-----------------') 101 | # logging.info(''.join(each_msg.xpath('.//div[@class="msg"]//*/text()').extract())) 102 | # next_page = response.xpath('//li[@class="page next"]/a') 103 | # if next_page: 104 | # logging.info(u'继续处理下一页') 105 | # yield Request(response.url + next_page.xpath('@href').extract()) 106 | # else: 107 | # logging.info(u"已经处理完成,没有下一页了") 108 | 109 | def _requests_to_follow(self, response): 110 | """重写加入cookiejar的更新""" 111 | if not isinstance(response, HtmlResponse): 112 | return 113 | seen = set() 114 | for n, rule in enumerate(self._rules): 115 | links = [l for l in rule.link_extractor.extract_links(response) if l not in seen] 116 | if links and rule.process_links: 117 | links = rule.process_links(links) 118 | for link in links: 119 | seen.add(link) 120 | r = Request(url=link.url, callback=self._response_downloaded) 121 | # 下面这句是我重写的 122 | r.meta.update(rule=n, link_text=link.text, cookiejar=response.meta['cookiejar']) 123 | yield rule.process_request(r) 124 | -------------------------------------------------------------------------------- /coolscrapy/spiders/login2_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: 登录爬虫 5 | Desc : 模拟登录http://www.iteye.com后将自己的私信全部爬出来 6 | tips:使用chrome调试post表单的时候勾选Preserve log和Disable cache 7 | """ 8 | import logging 9 | import re 10 | import sys 11 | import scrapy 12 | from scrapy.spiders import CrawlSpider, Rule 13 | from scrapy.linkextractors import LinkExtractor 14 | from scrapy.http import Request, FormRequest, HtmlResponse 15 | 16 | logging.basicConfig(level=logging.INFO, 17 | format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s', 18 | datefmt='%Y-%m-%d %H:%M:%S', 19 | handlers=[logging.StreamHandler(sys.stdout)]) 20 | 21 | 22 | class IteyeSpider(CrawlSpider): 23 | name = "iteye" 24 | allowed_domains = ["iteye.com"] 25 | start_urls = [ 26 | 'http://my.iteye.com/messages', 27 | 'http://my.iteye.com/messages/store', 28 | ] 29 | rules = ( 30 | # 消息列表 31 | Rule(LinkExtractor(allow=('/messages/\d+',), 32 | restrict_xpaths='//table[@class="admin"]/tbody/tr/td[2]'), 33 | callback='parse_page'), 34 | # 下一页, If callback is None follow defaults to True, otherwise it defaults to False 35 | Rule(LinkExtractor(restrict_xpaths='//a[@class="next_page"]')), 36 | ) 37 | request_headers = { 38 | "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko", 39 | "Referer": "http://www.iteye.com/login", 40 | } 41 | 42 | post_headers = { 43 | "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 44 | "Accept-Encoding": "gzip, 
deflate", 45 | "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6", 46 | "Cache-Control": "no-cache", 47 | "Connection": "keep-alive", 48 | "Content-Type": "application/x-www-form-urlencoded", 49 | "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36", 50 | "Referer": "http://www.iteye.com/", 51 | } 52 | 53 | # 重写了爬虫类的方法, 实现了自定义请求, 运行成功后会调用callback回调函数 54 | def start_requests(self): 55 | return [Request("http://www.iteye.com/login", 56 | meta={'cookiejar': 1}, callback=self.post_login)] 57 | 58 | # FormRequeset 59 | def post_login(self, response): 60 | # 先去拿隐藏的表单参数authenticity_token 61 | authenticity_token = response.xpath( 62 | '//input[@name="authenticity_token"]/@value').extract_first() 63 | logging.info('authenticity_token=' + authenticity_token) 64 | # FormRequeset.from_response是Scrapy提供的一个函数, 用于post表单 65 | # 登陆成功后, 会调用after_login回调函数,如果url跟Request页面的一样就省略掉 66 | return [FormRequest.from_response(response, 67 | url='http://www.iteye.com/login', 68 | meta={'cookiejar': response.meta['cookiejar']}, 69 | headers=self.post_headers, # 注意此处的headers 70 | formdata={ 71 | 'name': 'yidao620c', 72 | 'password': '******', 73 | 'authenticity_token': authenticity_token 74 | }, 75 | callback=self.after_login, 76 | dont_filter=True 77 | )] 78 | 79 | def after_login(self, response): 80 | # logging.info(response.body.encode('utf-8')) 81 | # 登录之后,开始进入我要爬取的私信页面 82 | # 对于登录成功后的页面我不感兴趣,所以这里response没啥用 83 | for url in self.start_urls: 84 | logging.info('letter url=' + url) 85 | # 因为我们上面定义了Rule,所以只需要简单的生成初始爬取Request即可 86 | yield Request(url, meta={'cookiejar': response.meta['cookiejar']}) 87 | # 如果是普通的Spider,而不是CrawlerSpider,没有定义Rule规则, 88 | # 那么就需要像下面这样定义每个Request的callback 89 | # yield Request(url, dont_filter=True, 90 | # callback=self.parse_page, ) 91 | 92 | def parse_page(self, response): 93 | """这个是使用LinkExtractor自动处理链接以及`下一页`""" 94 | logging.info(u'--------------消息分割线-----------------') 95 | logging.info(response.url) 96 | logging.info(response.xpath('//a[@href="/messages/new"]/text()').extract_first()) 97 | # msg_time = response.xpath( 98 | # '//div[@id="main"]/table[1]/tbody/tr[1]/td[1]/text()').extract_first() 99 | # logging.info(msg_time) 100 | # msg_title = response.xpath( 101 | # '//div[@id="main"]/table[1]/tbody/tr[2]/th[2]/span/text()').extract_first() 102 | # logging.info(msg_title) 103 | 104 | # def parse_page(self, response): 105 | # """这个是不使用LinkExtractor我自己手动处理链接以及下一页""" 106 | # logging.info(response.url) 107 | # for each_msg in response.xpath('//ul[@class="Msgs"]/li'): 108 | # logging.info('--------------消息分割线-----------------') 109 | # logging.info(''.join(each_msg.xpath('.//div[@class="msg"]//*/text()').extract())) 110 | # next_page = response.xpath('//li[@class="page next"]/a') 111 | # if next_page: 112 | # logging.info(u'继续处理下一页') 113 | # yield Request(response.url + next_page.xpath('@href').extract()) 114 | # else: 115 | # logging.info(u"已经处理完成,没有下一页了") 116 | 117 | def _requests_to_follow(self, response): 118 | """重写加入cookiejar的更新""" 119 | if not isinstance(response, HtmlResponse): 120 | return 121 | seen = set() 122 | for n, rule in enumerate(self._rules): 123 | links = [l for l in rule.link_extractor.extract_links(response) if l not in seen] 124 | if links and rule.process_links: 125 | links = rule.process_links(links) 126 | for link in links: 127 | seen.add(link) 128 | r = Request(url=link.url, callback=self._response_downloaded) 129 | # 下面这句是我重写的 130 | r.meta.update(rule=n, link_text=link.text, 
cookiejar=response.meta['cookiejar']) 131 | yield rule.process_request(r) 132 | -------------------------------------------------------------------------------- /coolscrapy/spiders/test_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: 爬虫小测试类 5 | Desc : 6 | """ 7 | import logging 8 | import scrapy 9 | import re 10 | 11 | 12 | class TestSpider(scrapy.Spider): 13 | name = "test" 14 | allowed_domains = ["jd.com"] 15 | start_urls = [ 16 | "http://www.jd.com/" 17 | ] 18 | 19 | def parse(self, response): 20 | logging.info(u'---------我这个是简单的直接获取京东网首页测试---------') 21 | guessyou = response.xpath('//div[@id="guessyou"]/div[1]/h2/text()').extract_first() 22 | logging.info(u"find:%s" % guessyou) 23 | logging.info(u'---------------success----------------') 24 | 25 | if __name__ == '__main__': 26 | body = u'发布于: 2016年04月08日' 27 | pat4 = re.compile(r'\d{4}年\d{2}月\d{2}日') 28 | if (re.search(pat4, body)): 29 | print(re.search(pat4, body).group()) 30 | -------------------------------------------------------------------------------- /coolscrapy/spiders/tobacco_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: 烟草条形码爬虫 5 | -- 烟草表 6 | DROP TABLE IF EXISTS `t_tobacco`; 7 | CREATE TABLE `t_tobacco` ( 8 | `id` BIGINT PRIMARY KEY AUTO_INCREMENT COMMENT '主键ID', 9 | `product_name` VARCHAR(32) COMMENT '产品名称', 10 | `brand` VARCHAR(32) COMMENT '品牌', 11 | `product_type` VARCHAR(32) COMMENT '产品类型', 12 | `package_spec` VARCHAR(64) COMMENT '包装规格', 13 | `reference_price` VARCHAR(32) COMMENT '参考价格', 14 | `manufacturer` VARCHAR(32) COMMENT '生产厂家', 15 | `pics` VARCHAR(255) COMMENT '图片URL', 16 | `created_time` DATETIME DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间', 17 | `updated_time` DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间' 18 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='烟草表'; 19 | 20 | -- 烟草条形码表 21 | DROP TABLE IF EXISTS `t_barcode`; 22 | CREATE TABLE `t_barcode` ( 23 | `id` BIGINT PRIMARY KEY AUTO_INCREMENT COMMENT '主键ID', 24 | `tobacco_id` BIGINT COMMENT '香烟产品ID', 25 | `barcode` VARCHAR(32) COMMENT '条形码', 26 | `btype` VARCHAR(32) COMMENT '类型 小盒条形码/条包条形码', 27 | `created_time` DATETIME DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间', 28 | `updated_time` DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间' 29 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='烟草条形码表'; 30 | """ 31 | from scrapy import Request 32 | 33 | from coolscrapy.utils import parse_text, tx 34 | from scrapy.spiders import CrawlSpider, Rule, Spider 35 | from scrapy.linkextractors import LinkExtractor 36 | from coolscrapy.items import Article, TobaccoItem 37 | 38 | 39 | class TobaccoSpider(Spider): 40 | """ 41 | 爬取本页面内容,然后再提取下一页链接生成新的Request标准做法 42 | 另外还使用了图片下载管道 43 | """ 44 | name = "tobacco" 45 | allowed_domains = ["etmoc.com"] 46 | base_url = "http://www.etmoc.com/market/Brandlist.asp" 47 | start_urls = [ 48 | "http://www.etmoc.com/market/Brandlist.asp?page=86&worded=&temp=" 49 | ] 50 | pics_pre = 'http://www.etmoc.com/' 51 | 52 | def parse(self, response): 53 | # 处理本页内容 54 | for item in self.parse_page(response): 55 | yield item 56 | # 找下一页链接递归爬 57 | next_url = tx(response.xpath('//a[text()="【下一页】"]/@href')) 58 | if next_url: 59 | self.logger.info('+++++++++++next_url++++++++++=' + self.base_url + next_url) 60 | yield Request(url=self.base_url 
+ next_url, meta={'ds': "ds"}, callback=self.parse) 61 | 62 | def parse_page(self, response): 63 | self.logger.info('Hi, this is a page = %s', response.url) 64 | items = [] 65 | for ind, each_row in enumerate(response.xpath('//div[@id="mainlist"]/table/tbody/tr')): 66 | if ind == 0: 67 | continue 68 | item = TobaccoItem() 69 | item['pics'] = self.pics_pre + tx(each_row.xpath('td[1]/a/img/@src'))[3:] 70 | product_name = tx(each_row.xpath('td[2]/p[1]/text()')) 71 | brand = tx(each_row.xpath('td[2]/p[2]/a/text()')) 72 | barcode1 = tx(each_row.xpath('td[2]/p[3]/text()')) 73 | barcode2 = tx(each_row.xpath('td[2]/p[4]/text()')) 74 | item['product'] = "{}/{}/{}/{}".format(product_name, brand, barcode1, barcode2) 75 | item['product_type'] = tx(each_row.xpath('td[3]/text()')) 76 | item['package_spec'] = tx(each_row.xpath('td[4]/text()')) 77 | item['reference_price'] = tx(each_row.xpath('td[5]/span/text()')) 78 | # 生产厂家有可能包含链接,我取里面的文本,使用//text() 79 | item['manufacturer'] = tx(each_row.xpath('td[6]//text()')) 80 | self.logger.info("pics={},product={},product_type={},package_spec={}," 81 | "reference_price={},manufacturer={}".format( 82 | item['pics'], item['product'], item['product_type'], item['package_spec'] 83 | , item['reference_price'], item['manufacturer'])) 84 | items.append(item) 85 | return items 86 | -------------------------------------------------------------------------------- /coolscrapy/spiders/xml_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: 爬取XML订阅的蜘蛛 5 | Desc : 6 | """ 7 | from coolscrapy.items import BlogItem 8 | import scrapy 9 | from scrapy.spiders import XMLFeedSpider 10 | 11 | 12 | class XMLSpider(XMLFeedSpider): 13 | name = "xml" 14 | namespaces = [('atom', 'http://www.w3.org/2005/Atom')] 15 | allowed_domains = ["github.io"] 16 | start_urls = [ 17 | "http://yidao620c.github.io/atom.xml" 18 | ] 19 | iterator = 'xml' # 缺省的iternodes,貌似对于有namespace的xml不行 20 | itertag = 'atom:entry' 21 | 22 | def parse_node(self, response, node): 23 | # self.logger.info('Hi, this is a <%s> node!', self.itertag) 24 | item = BlogItem() 25 | item['title'] = node.xpath('atom:title/text()')[0].extract() 26 | item['link'] = node.xpath('atom:link/@href')[0].extract() 27 | item['id'] = node.xpath('atom:id/text()')[0].extract() 28 | item['published'] = node.xpath('atom:published/text()')[0].extract() 29 | item['updated'] = node.xpath('atom:updated/text()')[0].extract() 30 | self.logger.info('|'.join([item['title'],item['link'],item['id'],item['published']])) 31 | return item 32 | 33 | -------------------------------------------------------------------------------- /coolscrapy/utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | """ 4 | Topic: 一些工具类 5 | Desc : 6 | """ 7 | import re 8 | import sys 9 | import smtplib 10 | from contextlib import contextmanager 11 | from email.mime.multipart import MIMEMultipart 12 | from email.mime.text import MIMEText 13 | from email.mime.image import MIMEImage 14 | import os.path 15 | from coolscrapy.settings import IMAGES_STORE 16 | from coolscrapy.models import ArticleRule 17 | from coolscrapy.models import db_connect, create_news_table 18 | from sqlalchemy.orm import sessionmaker 19 | 20 | 21 | def filter_tags(htmlstr): 22 | """更深层次的过滤,类似instapaper或者readitlater这种服务,很有意思的研究课题 23 | 过滤HTML中的标签 24 | 将HTML中标签等信息去掉 25 | @param htmlstr HTML字符串. 
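A rough illustration (the exact output depends on the regexes below):
filter_tags('<p>Hello<br/>world</p><script>x()</script>') gives roughly
'Hello\nworld\n': scripts, styles, comments and tags are dropped, paragraph
and <br> boundaries become newlines, and blank lines are collapsed.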
26 | """ 27 | # 先过滤CDATA 28 | re_pp = re.compile('

', re.I) # 段落结尾 29 | re_cdata = re.compile('//]*//\]\]>', re.I) # 匹配CDATA 30 | re_script = re.compile('<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I) # Script 31 | re_style = re.compile('<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I) # style 32 | re_br = re.compile('') # 处理换行 33 | re_h = re.compile(']*>') # HTML标签 34 | re_comment = re.compile('') # HTML注释 35 | 36 | s = re_pp.sub('\n', htmlstr) # 段落结尾变换行符 37 | s = re_cdata.sub('', s) # 去掉CDATA 38 | s = re_script.sub('', s) # 去掉SCRIPT 39 | s = re_style.sub('', s) # 去掉style 40 | s = re_br.sub('\n', s) # 将br转换为换行 41 | s = re_h.sub('', s) # 去掉HTML 标签 42 | s = re_comment.sub('', s) # 去掉HTML注释 43 | # 去掉多余的空行 44 | blank_line = re.compile('\n+') 45 | s = blank_line.sub('\n', s) 46 | s = replace_charentity(s) # 替换实体 47 | return "".join([t.strip() + '\n' for t in s.split('\n') if t.strip() != '']) 48 | 49 | 50 | def replace_charentity(htmlstr): 51 | """ 52 | ##替换常用HTML字符实体. 53 | #使用正常的字符替换HTML中特殊的字符实体. 54 | #你可以添加新的实体字符到CHAR_ENTITIES中,处理更多HTML字符实体. 55 | #@param htmlstr HTML字符串. 56 | """ 57 | char_entities = {'nbsp': ' ', '160': ' ', 58 | 'lt': '<', '60': '<', 59 | 'gt': '>', '62': '>', 60 | 'amp': '&', '38': '&', 61 | 'quot': '"', '34': '"',} 62 | 63 | re_charentity = re.compile(r'&#?(?P\w+);') 64 | sz = re_charentity.search(htmlstr) 65 | while sz: 66 | entity = sz.group() # entity全称,如> 67 | key = sz.group('name') # 去除&;后entity,如>为gt 68 | try: 69 | htmlstr = re_charentity.sub(char_entities[key], htmlstr, 1) 70 | sz = re_charentity.search(htmlstr) 71 | except KeyError: 72 | # 以空串代替 73 | htmlstr = re_charentity.sub('', htmlstr, 1) 74 | sz = re_charentity.search(htmlstr) 75 | return htmlstr 76 | 77 | 78 | pat1 = re.compile(r'
(?:.|\n)*?
') 79 | pat2 = re.compile(r'