├── .idea
│   ├── encodings.xml
│   ├── inspectionProfiles
│   │   └── profiles_settings.xml
│   ├── misc.xml
│   ├── modules.xml
│   ├── preferred-vcs.xml
│   └── weixin.iml
├── README.md
├── scrapy.cfg
└── weixin
    ├── __init__.py
    ├── items.py
    ├── main.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── WeiXin.py
        └── __init__.py
/README.md:
--------------------------------------------------------------------------------
# The code has been revised and now runs successfully ---- 2017-4-4 #
## 1. Overview ##
**Target URL:** [http://weixin.sogou.com/weixin?type=2&query=python&ie=utf8](http://weixin.sogou.com/weixin?type=2&query=python&ie=utf8)

**What it does**: crawls Sogou WeChat search results for articles about Python, extracting each article's title, title link, and description, as shown in the figure below.

**Data**: the scraped data is not saved anywhere; this exercise is mainly about setting up IP and user-agent pools. For a full crawler of WeChat public-account articles via Sogou, I recommend the open-source project [wechat_sogou_crawl](https://github.com/pujinxiao/wechat_sogou_crawl).

Figure 1



## 2. Running ##
1. `key` is the search keyword and can be changed.
2. Set up the proxy IPs and User-Agents, then run `main.py` (a sketch of where these live follows below).

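The two things to edit live in the spider and in the downloader middlewares; here is a minimal sketch of the relevant lines (the names match `weixin/spiders/WeiXin.py` and `weixin/middlewares.py` shown later in this dump, and the concrete values are only placeholders):

```python
# weixin/spiders/WeiXin.py -- the keyword used to build the result-page URLs
key = 'python'   # change to any Sogou WeChat search term

# weixin/middlewares.py -- fill the pools with your own working entries
ip_pools = [
    {'ip': '124.65.238.166:80'},   # replace with proxies that actually work
]
user_agent_pools = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36',
]
```

With those set, `python main.py` (equivalent to running `scrapy crawl WeiXin` from the project root) starts the crawl.
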
## 3. Issues ---- questions and suggestions welcome ##
Note: this project is mainly an exercise in setting up an IP proxy pool.
> 1. The IPs I use come from free proxy sites, and usable free IPs are genuinely scarce these days, especially against the Sogou WeChat endpoint. Crawling works with my local IP, but through a proxy IP the XPath expressions come back empty. Is the proxy quality too poor, or did the page simply not finish loading? Unresolved. The output looks like this:

```
2017-04-04 16:57:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36
2017-04-04 16:57:40 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3
Currently using IP 124.65.238.166:80
Currently using user-agent Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3
2017-04-04 16:57:41 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://sogou.com/)
2017-04-04 16:57:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://weixin.sogou.com/weixin?query=python&type=2&page=10>
{'dec': [], 'link': [], 'title': []}
2017-04-04 16:57:42 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://sogou.com/)
2017-04-04 16:57:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://weixin.sogou.com/weixin?query=python&type=2&page=9>
{'dec': [], 'link': [], 'title': []}
2017-04-04 16:57:44 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://sogou.com/)
```

**Anyone interested is welcome to help optimize this and track down the problem; I will merge your code and credit you as a contributor so we can grow together.**
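One way to narrow down the empty-XPath problem (this is not part of the original project) is to dump the raw HTML whenever a proxied response yields no result titles; if Sogou is returning a verification or anti-bot page through the proxy instead of the search results, it will show up immediately in the saved file. A minimal, hypothetical sketch reusing the XPath from `weixin/spiders/WeiXin.py`:

```python
import scrapy

class DebugWeixinSpider(scrapy.Spider):
    """Hypothetical debug spider: same XPath as WeiXin.py, but it saves the
    raw body of any page that contains no result titles."""
    name = 'WeiXinDebug'
    allowed_domains = ['sogou.com']
    start_urls = ['http://weixin.sogou.com/weixin?query=python&type=2&page=1']

    def parse(self, response):
        titles = response.xpath('//div[@class="txt-box"]/h3/a').extract()
        if not titles:
            filename = 'debug_page_' + response.url.split('page=')[-1] + '.html'
            with open(filename, 'wb') as f:
                f.write(response.body)   # inspect this file by hand
            self.logger.warning('No titles in %s, body saved to %s',
                                response.url, filename)
```

If the saved pages turn out to be CAPTCHA or verification pages, the proxy is being blocked by Sogou rather than the XPath being wrong.
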
## 4. Notes ##
1) How to handle common anti-crawler mechanisms:

1. Browser disguise / user-agent pool;
1. IP restrictions -------- IP proxy pool;
1. AJAX / asynchronous JS ------- packet capture;
1. CAPTCHAs ------- captcha-solving services.

2) Scattered notes:

    1. def process_request():          # handle the request
           request.meta["proxy"]=....  # attach the proxy IP
    2. In Scrapy, a request that has been retried twice and still fails is dropped, which tells you that proxy IP is no good (the retry settings involved are sketched below).

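For reference, the "give up after two retries" behaviour in note 2 comes from Scrapy's built-in RetryMiddleware and is controlled from `settings.py`; a minimal sketch with the defaults of the Scrapy versions of that time:

```python
# settings.py -- retry behaviour of Scrapy's built-in RetryMiddleware
RETRY_ENABLED = True                           # retrying is on by default
RETRY_TIMES = 2                                # retry a failed request twice, then give up
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]   # response codes treated as failures
```

Raising `RETRY_TIMES` only helps if the pool contains at least some working proxies; a dead proxy will fail the same way on every retry.
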
67 | ----------
If this project is useful to you, please give it a star ---- many thanks.
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
1 | # Automatically created by: scrapy startproject
2 | #
3 | # For more information about the [deploy] section see:
4 | # https://scrapyd.readthedocs.org/en/latest/deploy.html
5 |
6 | [settings]
7 | default = weixin.settings
8 |
9 | [deploy]
10 | #url = http://localhost:6800/
11 | project = weixin
12 |
--------------------------------------------------------------------------------
/weixin/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pujinxiao/weixin/d3ec0c9204ee8688c9b1efbf9d9238b0296df34e/weixin/__init__.py
--------------------------------------------------------------------------------
/weixin/items.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define here the models for your scraped items
4 | #
5 | # See documentation in:
6 | # http://doc.scrapy.org/en/latest/topics/items.html
7 |
8 | import scrapy
9 |
10 |
11 | class WeixinItem(scrapy.Item):
12 |     # define the fields for your item here like:
13 |     # name = scrapy.Field()
14 |     title = scrapy.Field()   # article title
15 |     link = scrapy.Field()    # article link
16 |     dec = scrapy.Field()     # article description
17 |
--------------------------------------------------------------------------------
/weixin/main.py:
--------------------------------------------------------------------------------
1 | # Run the WeiXin spider programmatically, equivalent to `scrapy crawl WeiXin`.
2 | from scrapy import cmdline
3 | cmdline.execute('scrapy crawl WeiXin'.split())
--------------------------------------------------------------------------------
/weixin/middlewares.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import random
3 | from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware   # built-in proxy-IP middleware (standard import)
4 | from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware   # built-in user-agent middleware (standard import)
5 | class IPPOOLS(HttpProxyMiddleware):
6 |     def __init__(self, ip=''):
7 |         '''Initialize.'''
8 |         self.ip = ip
9 |     def process_request(self, request, spider):
10 |         '''Attach a randomly chosen proxy IP to each request.'''
11 |         ip = random.choice(self.ip_pools)   # pick one IP at random
12 |         print('Currently using IP ' + ip['ip'])
13 |         try:
14 |             request.meta["proxy"] = "http://" + ip['ip']
15 |         except Exception as e:
16 |             print(e)
17 |             pass
18 |     ip_pools = [
19 |         {'ip': '124.65.238.166:80'},
20 |         # {'ip': ''},
21 |     ]
22 | class UAPOOLS(UserAgentMiddleware):
23 |     def __init__(self, user_agent=''):
24 |         self.user_agent = user_agent
25 |     def process_request(self, request, spider):
26 |         '''Attach a randomly chosen User-Agent to each request.'''
27 |         ua = random.choice(self.user_agent_pools)
28 |         print('Currently using user-agent ' + ua)
29 |         try:
30 |             request.headers.setdefault('User-Agent', ua)
31 |         except Exception as e:
32 |             print(e)
33 |             pass
34 |     user_agent_pools = [
35 |         'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
36 |         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
37 |         'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36',
38 |     ]
--------------------------------------------------------------------------------
/weixin/pipelines.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define your item pipelines here
4 | #
5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting
6 | # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
7 |
8 | import re
9 | class WeixinPipeline(object):
10 |     def process_item(self, item, spider):
11 |         for i in range(len(item['title'])):
12 |             html = item['title'][i]
13 |             reg = re.compile(r'<[^>]+>')   # strip HTML tags
14 |             title = reg.sub('', html)
15 |             print(title)
16 |             print(item['link'][i])
17 |             html_1 = item['dec'][i]
18 |             reg_1 = re.compile(r'<[^>]+>')
19 |             dec = reg_1.sub('', html_1)
20 |             print(dec)
21 |         return item
22 |
--------------------------------------------------------------------------------
/weixin/settings.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Scrapy settings for weixin project
4 | #
5 | # For simplicity, this file contains only settings considered important or
6 | # commonly used. You can find more settings consulting the documentation:
7 | #
8 | # http://doc.scrapy.org/en/latest/topics/settings.html
9 | # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
10 | # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
11 |
12 | BOT_NAME = 'weixin'
13 |
14 | SPIDER_MODULES = ['weixin.spiders']
15 | NEWSPIDER_MODULE = 'weixin.spiders'
16 |
17 |
18 | # Obey robots.txt rules
19 | ROBOTSTXT_OBEY = False
20 |
21 | # Configure maximum concurrent requests performed by Scrapy (default: 16)
22 | #CONCURRENT_REQUESTS = 32
23 |
24 | # Configure a delay for requests for the same website (default: 0)
25 | # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
26 | # See also autothrottle settings and docs
27 | DOWNLOAD_DELAY = 1
28 | # The download delay setting will honor only one of:
29 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16
30 | #CONCURRENT_REQUESTS_PER_IP = 16
31 |
32 | # Disable cookies (enabled by default)
33 | COOKIES_ENABLED = False
34 |
35 | # Disable Telnet Console (enabled by default)
36 | #TELNETCONSOLE_ENABLED = False
37 |
38 | # Override the default request headers:
39 | #DEFAULT_REQUEST_HEADERS = {
40 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
41 | # 'Accept-Language': 'en',
42 | #}
43 |
44 | # Enable or disable spider middlewares
45 | # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
46 | #SPIDER_MIDDLEWARES = {
47 | # 'weixin.middlewares.WeixinSpiderMiddleware': 543,
48 | #}
49 |
50 | # Enable or disable downloader middlewares
51 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
52 | DOWNLOADER_MIDDLEWARES = {
53 | 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware':123,
54 | 'weixin.middlewares.IPPOOLS':124,
55 | 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware' : 125,
56 | 'weixin.middlewares.UAPOOLS':126
57 | }
58 |
59 | # Enable or disable extensions
60 | # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
61 | #EXTENSIONS = {
62 | # 'scrapy.extensions.telnet.TelnetConsole': None,
63 | #}
64 |
65 | # Configure item pipelines
66 | # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
67 | ITEM_PIPELINES = {
68 | 'weixin.pipelines.WeixinPipeline': 300,
69 | }
70 |
71 | # Enable and configure the AutoThrottle extension (disabled by default)
72 | # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
73 | #AUTOTHROTTLE_ENABLED = True
74 | # The initial download delay
75 | #AUTOTHROTTLE_START_DELAY = 5
76 | # The maximum download delay to be set in case of high latencies
77 | #AUTOTHROTTLE_MAX_DELAY = 60
78 | # The average number of requests Scrapy should be sending in parallel to
79 | # each remote server
80 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
81 | # Enable showing throttling stats for every response received:
82 | #AUTOTHROTTLE_DEBUG = False
83 |
84 | # Enable and configure HTTP caching (disabled by default)
85 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
86 | #HTTPCACHE_ENABLED = True
87 | #HTTPCACHE_EXPIRATION_SECS = 0
88 | #HTTPCACHE_DIR = 'httpcache'
89 | #HTTPCACHE_IGNORE_HTTP_CODES = []
90 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
91 |
--------------------------------------------------------------------------------
/weixin/spiders/WeiXin.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import scrapy
3 | from scrapy.http import Request
4 | from weixin.items import WeixinItem
5 | class WeixinSpider(scrapy.Spider):
6 |     name = "WeiXin"
7 |     allowed_domains = ["sogou.com"]
8 |     start_urls = ['http://sogou.com/']
9 |
10 |     def parse(self, response):
11 |         '''Build the requests for each result page.'''
12 |         key = 'python'   # search keyword
13 |         for i in range(1, 11):
14 |             url = 'http://weixin.sogou.com/weixin?query=' + key + '&type=2&page=' + str(i)
15 |             yield Request(url=url, callback=self.get_content)
16 |     def get_content(self, response):
17 |         '''Extract the fields of interest.'''
18 |         item = WeixinItem()
19 |         item['title'] = response.xpath('//div[@class="txt-box"]/h3/a').extract()       # all titles on the page (10, with HTML tags)
20 |         item['link'] = response.xpath('//div[@class="txt-box"]/h3/a/@href').extract()  # all title links on the page (10)
21 |         item['dec'] = response.xpath('//p[@class="txt-info"]').extract()               # all descriptions on the page (10, with HTML tags)
22 |         yield item
--------------------------------------------------------------------------------
/weixin/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 |
--------------------------------------------------------------------------------