├── NLP └── readme.md ├── README.md ├── Spider ├── ch_Bilibili │ ├── images │ │ ├── big.png │ │ ├── small.png │ │ ├── spider_image.gif │ │ └── three.png │ └── readme.md ├── ch_Code │ ├── README.md │ └── sliding_code.py ├── ch_Distributedcrawler │ ├── Connection.py │ ├── Duperfilter.py │ ├── Pipelines.py │ ├── Queue.py │ ├── Scheduler.py │ ├── Spider.py │ ├── picklecompat.py │ └── readme.md ├── ch_Haiwang │ ├── Spider │ │ ├── .idea │ │ │ ├── Spider.iml │ │ │ ├── misc.xml │ │ │ ├── modules.xml │ │ │ ├── vcs.xml │ │ │ └── workspace.xml │ │ ├── Spider │ │ │ ├── __init__.py │ │ │ ├── __pycache__ │ │ │ │ ├── __init__.cpython-35.pyc │ │ │ │ ├── items.cpython-35.pyc │ │ │ │ ├── pipelines.cpython-35.pyc │ │ │ │ └── settings.cpython-35.pyc │ │ │ ├── items.py │ │ │ ├── middlewares.py │ │ │ ├── pipelines.py │ │ │ ├── settings.py │ │ │ ├── spiders │ │ │ │ ├── Haiwang.py │ │ │ │ ├── __init__.py │ │ │ │ └── __pycache__ │ │ │ │ │ ├── Haiwang.cpython-35.pyc │ │ │ │ │ └── __init__.cpython-35.pyc │ │ │ └── tools │ │ │ │ ├── __init__.py │ │ │ │ └── xici_ip.py │ │ ├── __init__.py │ │ ├── run.py │ │ └── scrapy.cfg │ ├── analysis.py │ ├── city_coordinates.json │ ├── images │ │ ├── 1.png │ │ ├── 2.png │ │ ├── chart_line.png │ │ ├── ciyun.png │ │ ├── csv.png │ │ ├── geo.png │ │ ├── list.png │ │ └── mongo.png │ ├── pip.txt │ ├── readme.md │ └── stop_words.txt ├── ch_summary │ ├── summary01.md │ ├── summary02.md │ ├── summary03.md │ └── summary04.md └── readme.md ├── ch01 ├── readme.md ├── 特征探索性分析.ipynb └── 特征预处理.ipynb ├── ch02 ├── Ass_rule.py ├── K-means.py ├── RandomForest.ipynb ├── SVM.ipynb ├── datas │ ├── data1.txt │ ├── data2.txt │ ├── samtrain.csv │ ├── samval.csv │ ├── test.csv │ └── train.csv ├── images │ ├── 1.jpg │ ├── 2.jpg │ ├── 3.jpg │ ├── 4.jpg │ ├── 5.jpg │ ├── Kmeans.png │ └── sklearn.png ├── logistic_regression.ipynb └── readme.md ├── ch03 └── readme.md └── kaggle ├── data ├── HR.bak.csv ├── HR.csv ├── rfr.csv ├── test.csv └── train.csv ├── readme.md ├── 共享单车案例.ipynb ├── 员工离职率数据分析.ipynb └── 员工离职率特征工程与建模.ipynb /NLP/readme.md: -------------------------------------------------------------------------------- 1 | ### nlp 2 | 3 | * jieba是分词中使用最广泛的了,demo看[海王电影分析](https://github.com/fenglei110/Data-analysis/blob/master/Spider/ch_Haiwang/analysis.py) -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data-analysis 2 | python数据分析与挖掘建模 3 | 4 | - :soccer: [数据分析与特征处理](./ch01) 5 | - :basketball: [机器学习与数据建模](./ch02) 6 | - :apple: [模型评估](./ch03) 7 | - :hamburger: [自然语言处理](./NLP) 8 | - :cherries: [爬虫那点事](./Spider) 9 | - :fries: [kaggle竞赛项目](./kaggle) 10 | - :banana: [知识图谱]() 11 | 12 | ## 常用工具 13 | - numpy 14 | - pandas 15 | - matplotlib 16 | - seaborn 基于matplotlib,对图像的丰富 17 | - SciPy 科学计算中包的集合 18 | - scipy.integrade 数值积分例程和微分方程求解器 19 | - scipy.linalg 线性代数例程和矩阵分解 20 | - scipy.optimize 函数优化器和根查找算法 21 | - scipy.signal 信号处理工具 22 | - scipy.sparse 稀疏矩阵和稀疏线性系统求解器 23 | - scipy.special SPECFUN(实现了许多常用数学函数) 24 | - scipy.stats 标准连续和离散概率分布 25 | - scipy.weave 利用内敛c++代码加速数组计算的工具 26 | 27 | - scikit-learn 简称sk-learn, 机器学习工具,用于数据分析和数据挖掘,建立在Numpy, SciPy和matplotlib上。 28 | - Jupyter Notebook的本质是一个Web应用程序,便于创建和共享文学化程序文档,支持实时代码,数学方程,可视化和markdown,kaggle竞赛里资料都是Jupyter格式。用途包括:数据清理和转换,数值模拟,统计建模,机器学习等。 29 | - TensorFlow 是一个采用数据流图,用于数值计算的开源软件库。最初被Google用于机器学习和深度神经网络方面的研究,但也可广泛用于其他计算领域。 30 | - Anaconda 包括Conda,Python以及180多安装好的工具包机器依赖,比如:numpy, pandas等。Conda是一个开源的包,环境管理器,可以用于在同一个机器上安装不同版本的软件包及其依赖,并能够在不同的环境之间切换。 31 | 
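
A minimal sketch of how the tools listed above typically fit together in this kind of workflow (pandas for the data table, scikit-learn for modeling). The file name `train.csv` and the `label` column are hypothetical placeholders for illustration only, not data shipped with this repository.

```python
# Minimal pandas + scikit-learn sketch; "train.csv" and "label" are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv('train.csv')                      # load a feature table
X, y = df.drop(columns=['label']), df['label']     # split features / target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)                        # fit on the training split
print(accuracy_score(y_test, model.predict(X_test)))  # evaluate on the held-out split
```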
32 | **jieba分词Demo** 33 | ![ciyun](https://github.com/fenglei110/Data-analysis/blob/master/Spider/ch_Haiwang/images/ciyun.png) 34 | -------------------------------------------------------------------------------- /Spider/ch_Bilibili/images/big.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Bilibili/images/big.png -------------------------------------------------------------------------------- /Spider/ch_Bilibili/images/small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Bilibili/images/small.png -------------------------------------------------------------------------------- /Spider/ch_Bilibili/images/spider_image.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Bilibili/images/spider_image.gif -------------------------------------------------------------------------------- /Spider/ch_Bilibili/images/three.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Bilibili/images/three.png -------------------------------------------------------------------------------- /Spider/ch_Bilibili/readme.md: -------------------------------------------------------------------------------- 1 | # B 站爬取视频信息 2 | 一个魔性网站,n多次被鬼畜视频带跑了,尤其最近本山大叔的改革春风,彻底是把我吹狂乱了,时刻提醒自己我是来爬虫的... 3 | 4 | ### 准备工作 5 | 6 | `scrapy` `python3.5` 7 | 8 | 寻找api规律,获取json数据 9 | 10 | 基本开始的链接为:`url = 'https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video&view_type=hot_rank&pic_size=160x100&order=click©_right=-1&cate_id=1&page=1&pagesize=20&time_from=20181201&time_to=20190109'` 11 | 12 | `cate_id`可以从网页中正则匹配到,闲麻烦就直接从1遍历到200了,基本就这范围。 13 | 14 | 如果想爬更多数据,`page` `time`都可以自行设置。 15 | 16 | #### spider代码 17 | ```py 18 | def parse(self, response): 19 | if response: 20 | text = json.loads(response.text) 21 | res = text.get('result') 22 | numpages = text.get('numPages') 23 | numResults = text.get('numResults') 24 | msg = text.get('msg') 25 | if msg == 'success': 26 | for i in res: 27 | author = i.get('author') 28 | id = i.get('id') 29 | pubdate = i.get('pubdate') 30 | favorite = i.get('favorites') 31 | rank_score = i.get('rank_score') 32 | video_review = i.get('video_review') 33 | tag = i.get('tag') 34 | title = i.get('title') 35 | arcurl = i.get('arcurl') 36 | 37 | item = BilibiliItem() 38 | item['numResults'] = numResults 39 | item['author'] = author 40 | item['id'] = id 41 | item['pubdate'] = pubdate 42 | item['favorite'] = favorite 43 | item['rank_score'] = rank_score 44 | item['video_review'] = video_review 45 | item['tag'] = tag 46 | item['title'] = title 47 | item['arcurl'] = arcurl 48 | yield item 49 | 50 | ``` 51 | ![bili-1](https://github.com/fenglei110/Data-analysis/blob/master/Spider/ch_Bilibili/images/big.png) 52 | 53 | ![bili-2](https://github.com/fenglei110/Data-analysis/blob/master/Spider/ch_Bilibili/images/small.png) 54 | 55 | ![bili-3](https://github.com/fenglei110/Data-analysis/blob/master/Spider/ch_Bilibili/images/three.png) 56 | 57 | 至于爬取后要怎么处理就看自己爱好了,最好保存为 csv 文件,方便pandas处理 58 | 59 | 60 | Bilibili视频的链接为 `https://www.bilibili.com/video/av + v_aid` 61 | 
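
Following the suggestion above to export the scraped items to CSV for pandas, here is a minimal post-processing sketch. The file name `bilibili.csv` and the column names (`id`, `title`, `favorite`) are assumptions based on the fields yielded in `parse()`, not guaranteed output of the spider.

```py
# Minimal post-processing sketch, assuming the items were exported to "bilibili.csv"
# with the fields produced in parse() above (id, title, favorite, ...).
import pandas as pd

df = pd.read_csv('bilibili.csv')

# Build each video URL from its aid, as described above: /video/av + v_aid
df['url'] = 'https://www.bilibili.com/video/av' + df['id'].astype(str)

# Example: the 10 most-favorited videos
top10 = df.sort_values('favorite', ascending=False).head(10)
print(top10[['title', 'favorite', 'url']])
```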
62 | 对数据感兴趣的话可以邮箱联系我,共同进步。 63 | -------------------------------------------------------------------------------- /Spider/ch_Code/README.md: -------------------------------------------------------------------------------- 1 | # 滑动验证码破解示例 2 | 通过Selenium模拟用户滑动解锁 3 | 由『国家企业信用信息公示系统』进行破解体验 4 | 5 | 具体步骤: 6 | 1. 使用Selenium打开页面 7 | 2. 匹配到输入框,输入要查询的信息,并点击查询按钮 8 | 3. 读取验证码图片,并做缺口识别 9 | 4. 根据缺口位置,计算滑动距离 10 | 5. 根据滑动距离,拖拽滑块到需要匹配的位置 11 | 12 | 13 | ![spider-1](https://github.com/fenglei110/Data-analysis/blob/master/Spider/ch_Bilibili/images/spider_image.gif) -------------------------------------------------------------------------------- /Spider/ch_Code/sliding_code.py: -------------------------------------------------------------------------------- 1 | # -*-coding:utf-8 -*- 2 | import random 3 | import time 4 | 5 | from selenium.webdriver import ActionChains 6 | from selenium.webdriver.support import expected_conditions as EC 7 | from selenium.webdriver.support.ui import WebDriverWait 8 | from selenium.webdriver.common.by import By 9 | from urllib.request import urlretrieve 10 | from selenium import webdriver 11 | from bs4 import BeautifulSoup 12 | import PIL.Image as image 13 | import re 14 | 15 | 16 | class Crack(): 17 | def __init__(self,keyword): 18 | self.url = 'http://bj.gsxt.gov.cn/sydq/loginSydqAction!sydq.dhtml' 19 | self.browser = webdriver.Chrome('D:\\chromedriver.exe') 20 | self.wait = WebDriverWait(self.browser, 100) 21 | self.keyword = keyword 22 | self.BORDER = 6 23 | 24 | def open(self): 25 | """ 26 | 打开浏览器,并输入查询内容 27 | """ 28 | self.browser.get(self.url) 29 | keyword = self.wait.until(EC.presence_of_element_located((By.ID, 'keyword_qycx'))) 30 | button = self.wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'btn'))) 31 | keyword.send_keys(self.keyword) 32 | button.click() 33 | 34 | def get_images(self, bg_filename='bg.jpg', fullbg_filename='fullbg.jpg'): 35 | """ 36 | 获取验证码图片 37 | :return: 图片的location信息 38 | """ 39 | bg = [] 40 | fullgb = [] 41 | while bg == [] and fullgb == []: 42 | bf = BeautifulSoup(self.browser.page_source, 'lxml') 43 | bg = bf.find_all('div', class_ = 'gt_cut_bg_slice') 44 | fullgb = bf.find_all('div', class_ = 'gt_cut_fullbg_slice') 45 | bg_url = re.findall('url\(\"(.*)\"\);', bg[0].get('style'))[0].replace('webp', 'jpg') 46 | fullgb_url = re.findall('url\(\"(.*)\"\);', fullgb[0].get('style'))[0].replace('webp', 'jpg') 47 | bg_location_list = [] 48 | fullbg_location_list = [] 49 | for each_bg in bg: 50 | location = dict() 51 | location['x'] = int(re.findall('background-position: (.*)px (.*)px;',each_bg.get('style'))[0][0]) 52 | location['y'] = int(re.findall('background-position: (.*)px (.*)px;',each_bg.get('style'))[0][1]) 53 | bg_location_list.append(location) 54 | for each_fullgb in fullgb: 55 | location = dict() 56 | location['x'] = int(re.findall('background-position: (.*)px (.*)px;',each_fullgb.get('style'))[0][0]) 57 | location['y'] = int(re.findall('background-position: (.*)px (.*)px;',each_fullgb.get('style'))[0][1]) 58 | fullbg_location_list.append(location) 59 | 60 | urlretrieve(url=bg_url, filename=bg_filename) 61 | print('缺口图片下载完成') 62 | urlretrieve(url=fullgb_url, filename=fullbg_filename) 63 | print('背景图片下载完成') 64 | return bg_location_list, fullbg_location_list 65 | 66 | def get_merge_image(self, filename, location_list): 67 | """ 68 | 根据位置对图片进行合并还原 69 | :filename:图片 70 | :location_list:图片位置 71 | """ 72 | im = image.open(filename) 73 | new_im = image.new('RGB', (260, 116)) 74 | im_list_upper = [] 75 | im_list_down = [] 76 | 77 | for 
location in location_list: 78 | if location['y'] == -58: 79 | im_list_upper.append(im.crop((abs(location['x']), 58, abs(location['x']) + 10, 166))) 80 | if location['y'] == 0: 81 | im_list_down.append(im.crop((abs(location['x']), 0, abs(location['x']) + 10, 58))) 82 | 83 | new_im = image.new('RGB', (260,116)) 84 | 85 | x_offset = 0 86 | for im in im_list_upper: 87 | new_im.paste(im, (x_offset,0)) 88 | x_offset += im.size[0] 89 | 90 | x_offset = 0 91 | for im in im_list_down: 92 | new_im.paste(im, (x_offset,58)) 93 | x_offset += im.size[0] 94 | 95 | new_im.save(filename) 96 | 97 | return new_im 98 | 99 | def is_pixel_equal(self, img1, img2, x, y): 100 | """ 101 | 判断两个像素是否相同 102 | :param img1: 图片1 103 | :param img2: 图片2 104 | :param x: 位置x 105 | :param y: 位置y 106 | :return: 像素是否相同 107 | """ 108 | # 取两个图片的像素点 109 | pix1 = img1.load()[x, y] 110 | pix2 = img2.load()[x, y] 111 | threshold = 60 112 | if (abs(pix1[0] - pix2[0] < threshold) and abs(pix1[1] - pix2[1] < threshold) 113 | and abs(pix1[2] - pix2[2] < threshold)): 114 | return True 115 | else: 116 | return False 117 | 118 | def get_gap(self, img1, img2): 119 | """ 120 | 获取缺口偏移量 121 | :param img1: 不带缺口图片 122 | :param img2: 带缺口图片 123 | :return: 124 | """ 125 | left = 43 126 | for i in range(left, img1.size[0]): 127 | for j in range(img1.size[1]): 128 | if not self.is_pixel_equal(img1, img2, i, j): 129 | left = i 130 | return left 131 | return left 132 | 133 | def get_track(self, distance): 134 | """ 135 | 根据偏移量获取移动轨迹 136 | :param distance: 偏移量 137 | :return: 移动轨迹 138 | """ 139 | # 移动轨迹 140 | track = [] 141 | # 当前位移 142 | current = 0 143 | # 减速阈值 144 | mid = distance * 4 / 5 145 | # 计算间隔 146 | t = 0.2 147 | # 初速度 148 | v = 0 149 | 150 | while current < distance: 151 | if current < mid: 152 | # 加速度为正2 153 | a = 2 154 | else: 155 | # 加速度为负3 156 | a = -3 157 | # 初速度v0 158 | v0 = v 159 | # 当前速度v = v0 + at 160 | v = v0 + a * t 161 | # 移动距离x = v0t + 1/2 * a * t^2 162 | move = v0 * t + 1 / 2 * a * t * t 163 | # 当前位移 164 | current += move 165 | # 加入轨迹 166 | track.append(round(move)) 167 | return track 168 | 169 | def get_slider(self): 170 | """ 171 | 获取滑块 172 | :return: 滑块对象 173 | """ 174 | while True: 175 | try: 176 | slider = self.browser.find_element_by_xpath("//div[@class='gt_slider_knob gt_show']") 177 | break 178 | except: 179 | time.sleep(0.5) 180 | return slider 181 | 182 | def move_to_gap(self, slider, track): 183 | """ 184 | 拖动滑块到缺口处 185 | :param slider: 滑块 186 | :param track: 轨迹 187 | :return: 188 | """ 189 | ActionChains(self.browser).click_and_hold(slider).perform() 190 | while track: 191 | x = random.choice(track) 192 | ActionChains(self.browser).move_by_offset(xoffset=x, yoffset=0).perform() 193 | track.remove(x) 194 | time.sleep(0.5) 195 | ActionChains(self.browser).release().perform() 196 | 197 | def crack(self): 198 | # 打开浏览器 199 | self.open() 200 | 201 | # 保存的图片名字 202 | bg_filename = 'bg.jpg' 203 | fullbg_filename = 'fullbg.jpg' 204 | 205 | # 获取图片 206 | bg_location_list, fullbg_location_list = self.get_images(bg_filename, fullbg_filename) 207 | 208 | # 根据位置对图片进行合并还原 209 | bg_img = self.get_merge_image(bg_filename, bg_location_list) 210 | fullbg_img = self.get_merge_image(fullbg_filename, fullbg_location_list) 211 | 212 | # 获取缺口位置 213 | gap = self.get_gap(fullbg_img, bg_img) 214 | print('缺口位置', gap) 215 | 216 | track = self.get_track(gap-self.BORDER) 217 | print('滑动滑块') 218 | print(track) 219 | 220 | # 点按呼出缺口 221 | slider = self.get_slider() 222 | # 拖动滑块到缺口处 223 | self.move_to_gap(slider, track) 224 | 225 | 226 | if __name__ == 
'__main__': 227 | print('开始验证') 228 | crack = Crack(u'中国移动') 229 | crack.crack() 230 | print('验证成功') -------------------------------------------------------------------------------- /Spider/ch_Distributedcrawler/Connection.py: -------------------------------------------------------------------------------- 1 | 负责根据setting中配置实例化redis连接。被dupefilter和scheduler调用,总之涉及到redis存取的都要使用到这个模块。 2 | 这个文件主要是实现连接redis数据库的功能,这些连接接口在其他文件中经常被用到 3 | import six 4 | from scrapy.utils.misc import load_object 5 | from . import defaults 6 | 7 | 8 | # Shortcut maps 'setting name' -> 'parmater name'. 9 | # 要想连接到redis数据库,和其他数据库差不多,需要一个ip地址、端口号、用户名密码(可选)和一个整形的数据库编号 10 | # Shortcut maps 'setting name' -> 'parmater name'. 11 | SETTINGS_PARAMS_MAP = { 12 | 'REDIS_URL': 'url', 13 | 'REDIS_HOST': 'host', 14 | 'REDIS_PORT': 'port', 15 | 'REDIS_ENCODING': 'encoding', 16 | } 17 | 18 | 19 | def get_redis_from_settings(settings): 20 | """Returns a redis client instance from given Scrapy settings object. 21 | 22 | This function uses ``get_client`` to instantiate the client and uses 23 | ``defaults.REDIS_PARAMS`` global as defaults values for the parameters. You 24 | can override them using the ``REDIS_PARAMS`` setting. 25 | 26 | Parameters 27 | ---------- 28 | settings : Settings 29 | A scrapy settings object. See the supported settings below. 30 | 31 | Returns 32 | ------- 33 | server 34 | Redis client instance. 35 | 36 | Other Parameters 37 | ---------------- 38 | REDIS_URL : str, optional 39 | Server connection URL. 40 | REDIS_HOST : str, optional 41 | Server host. 42 | REDIS_PORT : str, optional 43 | Server port. 44 | REDIS_ENCODING : str, optional 45 | Data encoding. 46 | REDIS_PARAMS : dict, optional 47 | Additional client parameters. 48 | 49 | """ 50 | params = defaults.REDIS_PARAMS.copy() 51 | params.update(settings.getdict('REDIS_PARAMS')) 52 | # XXX: Deprecate REDIS_* settings. 53 | for source, dest in SETTINGS_PARAMS_MAP.items(): 54 | val = settings.get(source) 55 | if val: 56 | params[dest] = val 57 | 58 | # Allow ``redis_cls`` to be a path to a class. 59 | if isinstance(params.get('redis_cls'), six.string_types): 60 | params['redis_cls'] = load_object(params['redis_cls']) 61 | #在这里调用get_redis函数 62 | return get_redis(**params) 63 | 64 | 65 | # Backwards compatible alias. 66 | from_settings = get_redis_from_settings 67 | 68 | 69 | # 返回的是redis库的Redis对象,可以直接用来进行数据操作的对象 70 | def get_redis(**kwargs): 71 | """Returns a redis client instance. 72 | 73 | Parameters 74 | ---------- 75 | redis_cls : class, optional 76 | Defaults to ``redis.StrictRedis``. 77 | url : str, optional 78 | If given, ``redis_cls.from_url`` is used to instantiate the class. 79 | **kwargs 80 | Extra parameters to be passed to the ``redis_cls`` class. 81 | 82 | Returns 83 | ------- 84 | server 85 | Redis client instance. 86 | 87 | """ 88 | redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS) 89 | url = kwargs.pop('url', None) 90 | if url: 91 | return redis_cls.from_url(url, **kwargs) 92 | else: 93 | return redis_cls(**kwargs) -------------------------------------------------------------------------------- /Spider/ch_Distributedcrawler/Duperfilter.py: -------------------------------------------------------------------------------- 1 | 负责执行requst的去重,实现的很有技巧性,使用redis的set数据结构。但是注意scheduler并不使用其中用于在这个模块中实现的dupefilter键做request的调度, 2 | 而是使用queue.py模块中实现的queue。当request不重复时,将其存入到queue中,调度时将其弹出。 3 | 4 | import logging 5 | import time 6 | from scrapy.dupefilters import BaseDupeFilter 7 | from scrapy.utils.request import request_fingerprint 8 | from . 
import defaults 9 | from .connection import get_redis_from_settings 10 | 11 | 12 | logger = logging.getLogger(__name__) 13 | 14 | 15 | # TODO: Rename class to RedisDupeFilter. 16 | class RFPDupeFilter(BaseDupeFilter): 17 | """Redis-based request duplicates filter. 18 | 19 | This class can also be used with default Scrapy's scheduler. 20 | 21 | """ 22 | 23 | logger = logger 24 | 25 | def __init__(self, server, key, debug=False): 26 | """Initialize the duplicates filter. 27 | 28 | Parameters 29 | ---------- 30 | server : redis.StrictRedis 31 | The redis server instance. 32 | key : str 33 | Redis key Where to store fingerprints. 34 | debug : bool, optional 35 | Whether to log filtered requests. 36 | 37 | """ 38 | self.server = server 39 | self.key = key 40 | self.debug = debug 41 | self.logdupes = True 42 | 43 | @classmethod 44 | def from_settings(cls, settings): 45 | """Returns an instance from given settings. 46 | 47 | This uses by default the key ``dupefilter:``. When using the 48 | ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as 49 | it needs to pass the spider name in the key. 50 | 51 | Parameters 52 | ---------- 53 | settings : scrapy.settings.Settings 54 | 55 | Returns 56 | ------- 57 | RFPDupeFilter 58 | A RFPDupeFilter instance. 59 | 60 | 61 | """ 62 | server = get_redis_from_settings(settings) 63 | # XXX: This creates one-time key. needed to support to use this 64 | # class as standalone dupefilter with scrapy's default scheduler 65 | # if scrapy passes spider on open() method this wouldn't be needed 66 | # TODO: Use SCRAPY_JOB env as default and fallback to timestamp. 67 | key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())} 68 | debug = settings.getbool('DUPEFILTER_DEBUG') 69 | return cls(server, key=key, debug=debug) 70 | 71 | @classmethod 72 | def from_crawler(cls, crawler): 73 | """Returns instance from crawler. 74 | 75 | Parameters 76 | ---------- 77 | crawler : scrapy.crawler.Crawler 78 | 79 | Returns 80 | ------- 81 | RFPDupeFilter 82 | Instance of RFPDupeFilter. 83 | 84 | """ 85 | return cls.from_settings(crawler.settings) 86 | 87 | def request_seen(self, request): 88 | """Returns True if request was already seen. 89 | 90 | Parameters 91 | ---------- 92 | request : scrapy.http.Request 93 | 94 | Returns 95 | ------- 96 | bool 97 | 98 | """ 99 | fp = self.request_fingerprint(request) 100 | # This returns the number of values added, zero if already exists. 101 | added = self.server.sadd(self.key, fp) 102 | return added == 0 103 | 104 | def request_fingerprint(self, request): 105 | """Returns a fingerprint for a given request. 106 | 107 | Parameters 108 | ---------- 109 | request : scrapy.http.Request 110 | 111 | Returns 112 | ------- 113 | str 114 | 115 | """ 116 | return request_fingerprint(request) 117 | 118 | def close(self, reason=''): 119 | """Delete data on close. Called by Scrapy's scheduler. 120 | 121 | Parameters 122 | ---------- 123 | reason : str, optional 124 | 125 | """ 126 | self.clear() 127 | 128 | def clear(self): 129 | """Clears fingerprints data.""" 130 | self.server.delete(self.key) 131 | 132 | def log(self, request, spider): 133 | """Logs given request. 
134 | 135 | Parameters 136 | ---------- 137 | request : scrapy.http.Request 138 | spider : scrapy.spiders.Spider 139 | 140 | """ 141 | if self.debug: 142 | msg = "Filtered duplicate request: %(request)s" 143 | self.logger.debug(msg, {'request': request}, extra={'spider': spider}) 144 | elif self.logdupes: 145 | msg = ("Filtered duplicate request %(request)s" 146 | " - no more duplicates will be shown" 147 | " (see DUPEFILTER_DEBUG to show all duplicates)") 148 | self.logger.debug(msg, {'request': request}, extra={'spider': spider}) 149 | self.logdupes = False 150 | 151 | 这个文件看起来比较复杂,重写了scrapy本身已经实现的request判重功能。因为本身scrapy单机跑的话,只需要读取内存中的request队列或者持久化的request队列 152 | (scrapy默认的持久化似乎是json格式的文件,不是数据库)就能判断这次要发出的request url是否已经请求过或者正在调度(本地读就行了)。而分布式跑的话,就 153 | 需要各个主机上的scheduler都连接同一个数据库的同一个request池来判断这次的请求是否是重复的了。 154 | 155 | 在这个文件中,通过继承BaseDupeFilter重写他的方法,实现了基于redis的判重。根据源代码来看,scrapy-redis使用了scrapy本身的一个fingerprint接 156 | request_fingerprint,这个接口很有趣,根据scrapy文档所说,他通过hash来判断两个url是否相同(相同的url会生成相同的hash结果),但是当两个url的地址相同, 157 | get型参数相同但是顺序不同时,也会生成相同的hash结果(这个真的比较神奇。。。)所以scrapy-redis依旧使用url的fingerprint来判断request请求是否已经出现过。 158 | 这个类通过连接redis,使用一个key来向redis的一个set中插入fingerprint(这个key对于同一种spider是相同的,redis是一个key-value的数据库,如果key是相 159 | 同的,访问到的值就是相同的,这里使用spider名字+DupeFilter的key就是为了在不同主机上的不同爬虫实例,只要属于同一种spider,就会访问到同一个set,而这个set 160 | 就是他们的url判重池),如果返回值为0,说明该set中该fingerprint已经存在(因为集合是没有重复值的),则返回False,如果返回值为1,说明添加了一个fingerprint 161 | 到set中,则说明这个request没有重复,于是返回True,还顺便把新fingerprint加入到数据库中了。 DupeFilter判重会在scheduler类中用到,每一个request在进入 162 | 调度之前都要进行判重,如果重复就不需要参加调度,直接舍弃就好了,不然就是白白浪费资源。 -------------------------------------------------------------------------------- /Spider/ch_Distributedcrawler/Pipelines.py: -------------------------------------------------------------------------------- 1 | 这是是用来实现分布式处理的作用。它将Item存储在redis中以实现分布式处理。由于在这里需要读取配置,所以就用到了from_crawler()函数。 2 | from scrapy.utils.misc import load_object 3 | from scrapy.utils.serialize import ScrapyJSONEncoder 4 | from twisted.internet.threads import deferToThread 5 | 6 | from . import connection, defaults 7 | 8 | 9 | default_serialize = ScrapyJSONEncoder().encode 10 | 11 | 12 | class RedisPipeline(object): 13 | """Pushes serialized item into a redis list/queue 14 | 15 | Settings 16 | -------- 17 | REDIS_ITEMS_KEY : str 18 | Redis key where to store items. 19 | REDIS_ITEMS_SERIALIZER : str 20 | Object path to serializer function. 21 | 22 | """ 23 | 24 | def __init__(self, server, 25 | key=defaults.PIPELINE_KEY, 26 | serialize_func=default_serialize): 27 | """Initialize pipeline. 28 | 29 | Parameters 30 | ---------- 31 | server : StrictRedis 32 | Redis client instance. 33 | key : str 34 | Redis key where to store items. 35 | serialize_func : callable 36 | Items serializer function. 
37 | 38 | """ 39 | self.server = server 40 | self.key = key 41 | self.serialize = serialize_func 42 | 43 | @classmethod 44 | def from_settings(cls, settings): 45 | params = { 46 | 'server': connection.from_settings(settings), 47 | } 48 | if settings.get('REDIS_ITEMS_KEY'): 49 | params['key'] = settings['REDIS_ITEMS_KEY'] 50 | if settings.get('REDIS_ITEMS_SERIALIZER'): 51 | params['serialize_func'] = load_object( 52 | settings['REDIS_ITEMS_SERIALIZER'] 53 | ) 54 | 55 | return cls(**params) 56 | 57 | @classmethod 58 | def from_crawler(cls, crawler): 59 | return cls.from_settings(crawler.settings) 60 | 61 | def process_item(self, item, spider): 62 | return deferToThread(self._process_item, item, spider) 63 | 64 | def _process_item(self, item, spider): 65 | key = self.item_key(item, spider) 66 | data = self.serialize(item) 67 | self.server.rpush(key, data) 68 | return item 69 | 70 | def item_key(self, item, spider): 71 | """Returns redis key based on given spider. 72 | 73 | Override this function to use a different key depending on the item 74 | and/or spider. 75 | 76 | """ 77 | return self.key % {'spider': spider.name} 78 | pipelines文件实现了一个item pipieline类,和scrapy的item pipeline是同一个对象,通过从settings中拿到我们配置的REDIS_ITEMS_KEY作为key, 79 | 把item串行化之后存入redis数据库对应的value中(这个value可以看出出是个list,我们的每个item是这个list中的一个结点), 80 | 这个pipeline把提取出的item存起来,主要是为了方便我们延后处理数据。 -------------------------------------------------------------------------------- /Spider/ch_Distributedcrawler/Queue.py: -------------------------------------------------------------------------------- 1 | 该文件实现了几个容器类,可以看这些容器和redis交互频繁,同时使用了我们上边picklecompat中定义的序列化器。这个文件实现的几个容器大体相同,只不过一个是队列,一个是栈,一个是优先级队列,这三个容器到时候会被scheduler对象实例化,来实现request的调度。比如我们使用SpiderQueue最为调度队列的类型,到时候request的调度方法就是先进先出,而实用SpiderStack就是先进后出了。 2 | 从SpiderQueue的实现看出来,他的push函数就和其他容器的一样,只不过push进去的request请求先被scrapy的接口request_to_dict变成了一个dict对象(因为request对象实在是比较复杂,有方法有属性不好串行化),之后使用picklecompat中的serializer串行化为字符串,然后使用一个特定的key存入redis中(该key在同一种spider中是相同的)。而调用pop时,其实就是从redis用那个特定的key去读其值(一个list),从list中读取最早进去的那个,于是就先进先出了。 这些容器类都会作为scheduler调度request的容器,scheduler在每个主机上都会实例化一个,并且和spider一一对应,所以分布式运行时会有一个spider的多个实例和一个scheduler的多个实例存在于不同的主机上,但是,因为scheduler都是用相同的容器,而这些容器都连接同一个redis服务器,又都使用spider名加queue来作为key读写数据,所以不同主机上的不同爬虫实例共用用一个request调度池,实现了分布式爬虫之间的统一调度。 3 | 4 | from scrapy.utils.reqser import request_to_dict, request_from_dict 5 | from . import picklecompat 6 | 7 | 8 | class Base(object): 9 | """Per-spider base queue class""" 10 | 11 | def __init__(self, server, spider, key, serializer=None): 12 | """Initialize per-spider redis queue. 13 | 14 | Parameters 15 | ---------- 16 | server : StrictRedis 17 | Redis client instance. 18 | spider : Spider 19 | Scrapy spider instance. 20 | key: str 21 | Redis key where to put and get messages. 22 | serializer : object 23 | Serializer object with ``loads`` and ``dumps`` methods. 24 | 25 | """ 26 | if serializer is None: 27 | # Backward compatibility. 28 | # TODO: deprecate pickle. 
29 | serializer = picklecompat 30 | if not hasattr(serializer, 'loads'): 31 | raise TypeError("serializer does not implement 'loads' function: %r" 32 | % serializer) 33 | if not hasattr(serializer, 'dumps'): 34 | raise TypeError("serializer '%s' does not implement 'dumps' function: %r" 35 | % serializer) 36 | 37 | self.server = server 38 | self.spider = spider 39 | self.key = key % {'spider': spider.name} 40 | self.serializer = serializer 41 | 42 | def _encode_request(self, request): 43 | """Encode a request object""" 44 | obj = request_to_dict(request, self.spider) 45 | return self.serializer.dumps(obj) 46 | 47 | def _decode_request(self, encoded_request): 48 | """Decode an request previously encoded""" 49 | obj = self.serializer.loads(encoded_request) 50 | return request_from_dict(obj, self.spider) 51 | 52 | def __len__(self): 53 | """Return the length of the queue""" 54 | raise NotImplementedError 55 | 56 | def push(self, request): 57 | """Push a request""" 58 | raise NotImplementedError 59 | 60 | def pop(self, timeout=0): 61 | """Pop a request""" 62 | raise NotImplementedError 63 | 64 | def clear(self): 65 | """Clear queue/stack""" 66 | self.server.delete(self.key) 67 | 68 | 69 | class FifoQueue(Base): 70 | """Per-spider FIFO queue""" 71 | 72 | def __len__(self): 73 | """Return the length of the queue""" 74 | return self.server.llen(self.key) 75 | 76 | def push(self, request): 77 | """Push a request""" 78 | self.server.lpush(self.key, self._encode_request(request)) 79 | 80 | def pop(self, timeout=0): 81 | """Pop a request""" 82 | if timeout > 0: 83 | data = self.server.brpop(self.key, timeout) 84 | if isinstance(data, tuple): 85 | data = data[1] 86 | else: 87 | data = self.server.rpop(self.key) 88 | if data: 89 | return self._decode_request(data) 90 | 91 | 92 | class PriorityQueue(Base): 93 | """Per-spider priority queue abstraction using redis' sorted set""" 94 | 95 | def __len__(self): 96 | """Return the length of the queue""" 97 | return self.server.zcard(self.key) 98 | 99 | def push(self, request): 100 | """Push a request""" 101 | data = self._encode_request(request) 102 | score = -request.priority 103 | # We don't use zadd method as the order of arguments change depending on 104 | # whether the class is Redis or StrictRedis, and the option of using 105 | # kwargs only accepts strings, not bytes. 106 | self.server.execute_command('ZADD', self.key, score, data) 107 | 108 | def pop(self, timeout=0): 109 | """ 110 | Pop a request 111 | timeout not support in this queue class 112 | """ 113 | # use atomic range/remove using multi/exec 114 | pipe = self.server.pipeline() 115 | pipe.multi() 116 | pipe.zrange(self.key, 0, 0).zremrangebyrank(self.key, 0, 0) 117 | results, count = pipe.execute() 118 | if results: 119 | return self._decode_request(results[0]) 120 | 121 | 122 | class LifoQueue(Base): 123 | """Per-spider LIFO queue.""" 124 | 125 | def __len__(self): 126 | """Return the length of the stack""" 127 | return self.server.llen(self.key) 128 | 129 | def push(self, request): 130 | """Push a request""" 131 | self.server.lpush(self.key, self._encode_request(request)) 132 | 133 | def pop(self, timeout=0): 134 | """Pop a request""" 135 | if timeout > 0: 136 | data = self.server.blpop(self.key, timeout) 137 | if isinstance(data, tuple): 138 | data = data[1] 139 | else: 140 | data = self.server.lpop(self.key) 141 | 142 | if data: 143 | return self._decode_request(data) 144 | 145 | 146 | # TODO: Deprecate the use of these names. 
147 | SpiderQueue = FifoQueue 148 | SpiderStack = LifoQueue 149 | SpiderPriorityQueue = PriorityQueue -------------------------------------------------------------------------------- /Spider/ch_Distributedcrawler/Scheduler.py: -------------------------------------------------------------------------------- 1 | 此扩展是对scrapy中自带的scheduler的替代(在settings的SCHEDULER变量中指出),正是利用此扩展实现crawler的分布式调度。 2 | 其利用的数据结构来自于queue中实现的数据结构。scrapy-redis所实现的两种分布式:爬虫分布式以及item处理分布式就是由模块 3 | scheduler和模块pipelines实现。上述其它模块作为为二者辅助的功能模块. 4 | 5 | import importlib 6 | import six 7 | from scrapy.utils.misc import load_object 8 | from . import connection, defaults 9 | 10 | 11 | # TODO: add SCRAPY_JOB support. 12 | class Scheduler(object): 13 | """Redis-based scheduler 14 | 15 | Settings 16 | -------- 17 | SCHEDULER_PERSIST : bool (default: False) 18 | Whether to persist or clear redis queue. 19 | SCHEDULER_FLUSH_ON_START : bool (default: False) 20 | Whether to flush redis queue on start. 21 | SCHEDULER_IDLE_BEFORE_CLOSE : int (default: 0) 22 | How many seconds to wait before closing if no message is received. 23 | SCHEDULER_QUEUE_KEY : str 24 | Scheduler redis key. 25 | SCHEDULER_QUEUE_CLASS : str 26 | Scheduler queue class. 27 | SCHEDULER_DUPEFILTER_KEY : str 28 | Scheduler dupefilter redis key. 29 | SCHEDULER_DUPEFILTER_CLASS : str 30 | Scheduler dupefilter class. 31 | SCHEDULER_SERIALIZER : str 32 | Scheduler serializer. 33 | 34 | """ 35 | 36 | def __init__(self, server, 37 | persist=False, 38 | flush_on_start=False, 39 | queue_key=defaults.SCHEDULER_QUEUE_KEY, 40 | queue_cls=defaults.SCHEDULER_QUEUE_CLASS, 41 | dupefilter_key=defaults.SCHEDULER_DUPEFILTER_KEY, 42 | dupefilter_cls=defaults.SCHEDULER_DUPEFILTER_CLASS, 43 | idle_before_close=0, 44 | serializer=None): 45 | """Initialize scheduler. 46 | 47 | Parameters 48 | ---------- 49 | server : Redis 50 | The redis server instance. 51 | persist : bool 52 | Whether to flush requests when closing. Default is False. 53 | flush_on_start : bool 54 | Whether to flush requests on start. Default is False. 55 | queue_key : str 56 | Requests queue key. 57 | queue_cls : str 58 | Importable path to the queue class. 59 | dupefilter_key : str 60 | Duplicates filter key. 61 | dupefilter_cls : str 62 | Importable path to the dupefilter class. 63 | idle_before_close : int 64 | Timeout before giving up. 65 | 66 | """ 67 | if idle_before_close < 0: 68 | raise TypeError("idle_before_close cannot be negative") 69 | 70 | self.server = server 71 | self.persist = persist 72 | self.flush_on_start = flush_on_start 73 | self.queue_key = queue_key 74 | self.queue_cls = queue_cls 75 | self.dupefilter_cls = dupefilter_cls 76 | self.dupefilter_key = dupefilter_key 77 | self.idle_before_close = idle_before_close 78 | self.serializer = serializer 79 | self.stats = None 80 | 81 | def __len__(self): 82 | return len(self.queue) 83 | 84 | @classmethod 85 | def from_settings(cls, settings): 86 | kwargs = { 87 | 'persist': settings.getbool('SCHEDULER_PERSIST'), 88 | 'flush_on_start': settings.getbool('SCHEDULER_FLUSH_ON_START'), 89 | 'idle_before_close': settings.getint('SCHEDULER_IDLE_BEFORE_CLOSE'), 90 | } 91 | 92 | # If these values are missing, it means we want to use the defaults. 93 | optional = { 94 | # TODO: Use custom prefixes for this settings to note that are 95 | # specific to scrapy-redis. 96 | 'queue_key': 'SCHEDULER_QUEUE_KEY', 97 | 'queue_cls': 'SCHEDULER_QUEUE_CLASS', 98 | 'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY', 99 | # We use the default setting name to keep compatibility. 
100 | 'dupefilter_cls': 'DUPEFILTER_CLASS', 101 | 'serializer': 'SCHEDULER_SERIALIZER', 102 | } 103 | for name, setting_name in optional.items(): 104 | val = settings.get(setting_name) 105 | if val: 106 | kwargs[name] = val 107 | 108 | # Support serializer as a path to a module. 109 | if isinstance(kwargs.get('serializer'), six.string_types): 110 | kwargs['serializer'] = importlib.import_module(kwargs['serializer']) 111 | 112 | server = connection.from_settings(settings) 113 | # Ensure the connection is working. 114 | server.ping() 115 | 116 | return cls(server=server, **kwargs) 117 | 118 | @classmethod 119 | def from_crawler(cls, crawler): 120 | instance = cls.from_settings(crawler.settings) 121 | # FIXME: for now, stats are only supported from this constructor 122 | instance.stats = crawler.stats 123 | return instance 124 | 125 | def open(self, spider): 126 | self.spider = spider 127 | 128 | try: 129 | self.queue = load_object(self.queue_cls)( 130 | server=self.server, 131 | spider=spider, 132 | key=self.queue_key % {'spider': spider.name}, 133 | serializer=self.serializer, 134 | ) 135 | except TypeError as e: 136 | raise ValueError("Failed to instantiate queue class '%s': %s", 137 | self.queue_cls, e) 138 | 139 | try: 140 | self.df = load_object(self.dupefilter_cls)( 141 | server=self.server, 142 | key=self.dupefilter_key % {'spider': spider.name}, 143 | debug=spider.settings.getbool('DUPEFILTER_DEBUG'), 144 | ) 145 | except TypeError as e: 146 | raise ValueError("Failed to instantiate dupefilter class '%s': %s", 147 | self.dupefilter_cls, e) 148 | 149 | if self.flush_on_start: 150 | self.flush() 151 | # notice if there are requests already in the queue to resume the crawl 152 | if len(self.queue): 153 | spider.log("Resuming crawl (%d requests scheduled)" % len(self.queue)) 154 | 155 | def close(self, reason): 156 | if not self.persist: 157 | self.flush() 158 | 159 | def flush(self): 160 | self.df.clear() 161 | self.queue.clear() 162 | 163 | def enqueue_request(self, request): 164 | if not request.dont_filter and self.df.request_seen(request): 165 | self.df.log(request, self.spider) 166 | return False 167 | if self.stats: 168 | self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider) 169 | self.queue.push(request) 170 | return True 171 | 172 | def next_request(self): 173 | block_pop_timeout = self.idle_before_close 174 | request = self.queue.pop(block_pop_timeout) 175 | if request and self.stats: 176 | self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider) 177 | return request 178 | 179 | def has_pending_requests(self): 180 | return len(self) > 0 181 | 这个文件重写了scheduler类,用来代替scrapy.core.scheduler的原有调度器。其实对原有调度器的逻辑没有很大的改变, 182 | 主要是使用了redis作为数据存储的媒介,以达到各个爬虫之间的统一调度。 scheduler负责调度各个spider的request请求, 183 | scheduler初始化时,通过settings文件读取queue和dupefilters的类型(一般就用上边默认的),配置queue和dupefilters 184 | 使用的key(一般就是spider name加上queue或者dupefilters,这样对于同一种spider的不同实例,就会使用相同的数据块了)。 185 | 每当一个request要被调度时,enqueue_request被调用,scheduler使用dupefilters来判断这个url是否重复,如果不重复, 186 | 就添加到queue的容器中(先进先出,先进后出和优先级都可以,可以在settings中配置)。当调度完成时,next_request被调用, 187 | scheduler就通过queue容器的接口,取出一个request,把他发送给相应的spider,让spider进行爬取工作。 -------------------------------------------------------------------------------- /Spider/ch_Distributedcrawler/Spider.py: -------------------------------------------------------------------------------- 1 | 设计的这个spider从redis中读取要爬的url, 然后执行爬取,若爬取过程中返回更多的url,那么继续进行直至所有的request完成。之后继续从redis中读取url,循环这个过程。 2 | 分析:在这个spider中通过connect 
signals.spider_idle信号实现对crawler状态的监视。当idle时,返回新的make_requests_from_url(url)给引擎,进而交给调度器调度。 3 | 4 | from scrapy import signals 5 | from scrapy.exceptions import DontCloseSpider 6 | from scrapy.spiders import Spider, CrawlSpider 7 | from . import connection, defaults 8 | from .utils import bytes_to_str 9 | 10 | 11 | class RedisMixin(object): 12 | """Mixin class to implement reading urls from a redis queue.""" 13 | redis_key = None 14 | redis_batch_size = None 15 | redis_encoding = None 16 | 17 | # Redis client placeholder. 18 | server = None 19 | 20 | def start_requests(self): 21 | """Returns a batch of start requests from redis.""" 22 | return self.next_requests() 23 | 24 | def setup_redis(self, crawler=None): 25 | """Setup redis connection and idle signal. 26 | 27 | This should be called after the spider has set its crawler object. 28 | """ 29 | if self.server is not None: 30 | return 31 | 32 | if crawler is None: 33 | # We allow optional crawler argument to keep backwards 34 | # compatibility. 35 | # XXX: Raise a deprecation warning. 36 | crawler = getattr(self, 'crawler', None) 37 | 38 | if crawler is None: 39 | raise ValueError("crawler is required") 40 | 41 | settings = crawler.settings 42 | 43 | if self.redis_key is None: 44 | self.redis_key = settings.get( 45 | 'REDIS_START_URLS_KEY', defaults.START_URLS_KEY, 46 | ) 47 | 48 | self.redis_key = self.redis_key % {'name': self.name} 49 | 50 | if not self.redis_key.strip(): 51 | raise ValueError("redis_key must not be empty") 52 | 53 | if self.redis_batch_size is None: 54 | # TODO: Deprecate this setting (REDIS_START_URLS_BATCH_SIZE). 55 | self.redis_batch_size = settings.getint( 56 | 'REDIS_START_URLS_BATCH_SIZE', 57 | settings.getint('CONCURRENT_REQUESTS'), 58 | ) 59 | 60 | try: 61 | self.redis_batch_size = int(self.redis_batch_size) 62 | except (TypeError, ValueError): 63 | raise ValueError("redis_batch_size must be an integer") 64 | 65 | if self.redis_encoding is None: 66 | self.redis_encoding = settings.get('REDIS_ENCODING', defaults.REDIS_ENCODING) 67 | 68 | self.logger.info("Reading start URLs from redis key '%(redis_key)s' " 69 | "(batch size: %(redis_batch_size)s, encoding: %(redis_encoding)s", 70 | self.__dict__) 71 | 72 | self.server = connection.from_settings(crawler.settings) 73 | # The idle signal is called when the spider has no requests left, 74 | # that's when we will schedule new requests from redis queue 75 | crawler.signals.connect(self.spider_idle, signal=signals.spider_idle) 76 | 77 | def next_requests(self): 78 | """Returns a request to be scheduled or none.""" 79 | use_set = self.settings.getbool('REDIS_START_URLS_AS_SET', defaults.START_URLS_AS_SET) 80 | fetch_one = self.server.spop if use_set else self.server.lpop 81 | # XXX: Do we need to use a timeout here? 82 | found = 0 83 | # TODO: Use redis pipeline execution. 84 | while found < self.redis_batch_size: 85 | data = fetch_one(self.redis_key) 86 | if not data: 87 | # Queue empty. 88 | break 89 | req = self.make_request_from_data(data) 90 | if req: 91 | yield req 92 | found += 1 93 | else: 94 | self.logger.debug("Request not made from data: %r", data) 95 | 96 | if found: 97 | self.logger.debug("Read %s requests from '%s'", found, self.redis_key) 98 | 99 | def make_request_from_data(self, data): 100 | """Returns a Request instance from data coming from Redis. 101 | 102 | By default, ``data`` is an encoded URL. You can override this method to 103 | provide your own message decoding. 
104 | 105 | Parameters 106 | ---------- 107 | data : bytes 108 | Message from redis. 109 | 110 | """ 111 | url = bytes_to_str(data, self.redis_encoding) 112 | return self.make_requests_from_url(url) 113 | 114 | def schedule_next_requests(self): 115 | """Schedules a request if available""" 116 | # TODO: While there is capacity, schedule a batch of redis requests. 117 | for req in self.next_requests(): 118 | self.crawler.engine.crawl(req, spider=self) 119 | 120 | def spider_idle(self): 121 | """Schedules a request if available, otherwise waits.""" 122 | # XXX: Handle a sentinel to close the spider. 123 | self.schedule_next_requests() 124 | raise DontCloseSpider 125 | 126 | 127 | class RedisSpider(RedisMixin, Spider): 128 | """Spider that reads urls from redis queue when idle. 129 | 130 | Attributes 131 | ---------- 132 | redis_key : str (default: REDIS_START_URLS_KEY) 133 | Redis key where to fetch start URLs from.. 134 | redis_batch_size : int (default: CONCURRENT_REQUESTS) 135 | Number of messages to fetch from redis on each attempt. 136 | redis_encoding : str (default: REDIS_ENCODING) 137 | Encoding to use when decoding messages from redis queue. 138 | 139 | Settings 140 | -------- 141 | REDIS_START_URLS_KEY : str (default: ":start_urls") 142 | Default Redis key where to fetch start URLs from.. 143 | REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS) 144 | Default number of messages to fetch from redis on each attempt. 145 | REDIS_START_URLS_AS_SET : bool (default: False) 146 | Use SET operations to retrieve messages from the redis queue. If False, 147 | the messages are retrieve using the LPOP command. 148 | REDIS_ENCODING : str (default: "utf-8") 149 | Default encoding to use when decoding messages from redis queue. 150 | 151 | """ 152 | 153 | @classmethod 154 | def from_crawler(self, crawler, *args, **kwargs): 155 | obj = super(RedisSpider, self).from_crawler(crawler, *args, **kwargs) 156 | obj.setup_redis(crawler) 157 | return obj 158 | 159 | 160 | class RedisCrawlSpider(RedisMixin, CrawlSpider): 161 | """Spider that reads urls from redis queue when idle. 162 | 163 | Attributes 164 | ---------- 165 | redis_key : str (default: REDIS_START_URLS_KEY) 166 | Redis key where to fetch start URLs from.. 167 | redis_batch_size : int (default: CONCURRENT_REQUESTS) 168 | Number of messages to fetch from redis on each attempt. 169 | redis_encoding : str (default: REDIS_ENCODING) 170 | Encoding to use when decoding messages from redis queue. 171 | 172 | Settings 173 | -------- 174 | REDIS_START_URLS_KEY : str (default: ":start_urls") 175 | Default Redis key where to fetch start URLs from.. 176 | REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS) 177 | Default number of messages to fetch from redis on each attempt. 178 | REDIS_START_URLS_AS_SET : bool (default: True) 179 | Use SET operations to retrieve messages from the redis queue. 180 | REDIS_ENCODING : str (default: "utf-8") 181 | Default encoding to use when decoding messages from redis queue. 
182 | 183 | """ 184 | 185 | @classmethod 186 | def from_crawler(self, crawler, *args, **kwargs): 187 | obj = super(RedisCrawlSpider, self).from_crawler(crawler, *args, **kwargs) 188 | obj.setup_redis(crawler) 189 | return obj 190 | spider的改动也不是很大,主要是通过connect接口,给spider绑定了spider_idle信号,spider初始化时,通过setup_redis函数初始化好和redis的连接, 191 | 后通过next_requests函数从redis中取出strat url,使用的key是settings中REDIS_START_URLS_AS_SET定义的(注意了这里的初始化url池和我们上边的 192 | queue的url池不是一个东西,queue的池是用于调度的,初始化url池是存放入口url的,他们都存在redis中,但是使用不同的key来区分,就当成是不同的表吧), 193 | spider使用少量的start url,可以发展出很多新的url,这些url会进入scheduler进行判重和调度。直到spider跑到调度池内没有url的时候,会触发spider_idle信号, 194 | 从而触发spider的next_requests函数,再次从redis的start url池中读取一些url。 195 | -------------------------------------------------------------------------------- /Spider/ch_Distributedcrawler/picklecompat.py: -------------------------------------------------------------------------------- 1 | 在python中,一般可以使用pickle类来进行python对象的序列化,而cPickle提供了一个更快速简单的接口。 2 | """A pickle wrapper module with protocol=-1 by default.""" 3 | 4 | try: 5 | import cPickle as pickle # PY2 6 | except ImportError: 7 | import pickle 8 | 9 | 10 | def loads(s): 11 | return pickle.loads(s) 12 | 13 | 14 | def dumps(obj): 15 | return pickle.dumps(obj, protocol=-1) 16 | 这里实现了loads和dumps两个函数,其实就是实现了一个序列化器。 17 | 因为redis数据库不能存储复杂对象(key部分只能是字符串,value部分只能是字符串,字符串列表,字符串集合和hash),所以我们存啥都要先串行化成文本才行。 18 | 这里使用的就是python的pickle模块,一个兼容py2和py3的串行化工具。这个serializer主要用于一会的scheduler存reuqest对象。 -------------------------------------------------------------------------------- /Spider/ch_Distributedcrawler/readme.md: -------------------------------------------------------------------------------- 1 | ## 分布式爬虫 2 | 3 | #### 分布式爬虫的优点 4 | 5 | 1. 充分利用多机器的带宽加速爬取 6 | 2. 充分利用多机的ip加速爬取速度 7 | 8 | #### 分布式爬虫需要解决的问题 9 | 10 | 1. request队列集中管理 11 | 2. 去重集中管理 12 | 13 | #### scrapy不支持分布式,结合redis的特性自然而然产生了scrapy-redis,下面是scrapy-redis核心代码的梳理。 14 | - [Spider](./Spider.py) 15 | - [Scheduler](./Scheduler.py) 16 | - [Queue](./Queue.py) 17 | - [Pipelines](./Pipelines.py) 18 | - [picklecompat](./picklecompat.py) 19 | - [Duperfilter](./Duperfilter.py) 20 | - [Connection](./Connection.py) 21 | 22 | 对于分布式爬虫案例就不写了,随便搜搜就好多,比如说新浪,页面结构算是比较复杂的了,而且反爬挺烦人。时间间隔控制一下,另外需要获取cookie,当然只需静态获取还比较容易,对于动态方式具体操作可以看我的博客。 23 | > [**动态获取cookie**](https://blog.csdn.net/fenglei0415/article/details/81865379) 24 | 25 | ```md 26 | slave端,拷贝相同的代码,输入同样的运行命令 scrapy runspider demo.py,进入等待 27 | master端,安装redis,在redis_client输入命令 lpush spidername: start_url 就ok了。 28 | ``` 29 | --- 30 | 最后总结一下scrapy-redis的总体思路: 31 | 1. 这个工程通过重写scheduler和spider类,实现了调度、spider启动和redis的交互。 32 | 2. 通过实现新的dupefilter和queue类,达到了判重和调度容器和redis的交互,因为每个主机上的爬虫进程都访问同一个redis数据库,所以调度和判重都统一进行统一管理,达到了分布式爬虫的目的。 33 | 3. 当spider被初始化时,同时会初始化一个对应的scheduler对象,这个调度器对象通过读取settings,配置好自己的调度容器queue和判重工具dupefilter。 34 | 4. 每当一个spider产出一个request的时候,scrapy内核会把这个reuqest递交给这个spider对应的scheduler对象进行调度,scheduler对象通过访问redis对request进行判重,如果不重复就把他添加进redis中的调度池。当调度条件满足时,scheduler对象就从redis的调度池中取出一个request发送给spider,让他爬取。 35 | 5. 
当spider爬取的所有暂时可用url之后,scheduler发现这个spider对应的redis的调度池空了,于是触发信号spider_idle,spider收到这个信号之后,直接连接redis读取strart url池,拿去新的一批url入口,然后再次重复上边的工作。 36 | 37 | #### redis常用命令: 38 | ```md 39 | # 字符串 40 | set key value 41 | get key 42 | getrange key start end 43 | strlen key 44 | incr/decr key 45 | append key value 46 | ``` 47 | ```md 48 | # 哈希 49 | hset key name value 50 | hget key 51 | hexists key fields 52 | hdel key filds 53 | hkeys key 54 | hvals key 55 | ``` 56 | ```md 57 | # 列表 58 | lpush/rpush mylist value 59 | lrange mylist 0 10 60 | blpop/brpop key1 timeout 61 | lpop/rpop key 62 | llen key 63 | lindex key index 64 | ``` 65 | ```md 66 | # 集合 67 | sadd myset value 68 | scard key 69 | sdiff key 70 | sinter key 71 | spop key 72 | smember key member 73 | ``` 74 | ```md 75 | # 有序集合 76 | zadd myset 0 value 77 | srangebyscore myset 0 100 78 | zcount key min max 79 | ``` 80 | -------------------------------------------------------------------------------- /Spider/ch_Haiwang/Spider/.idea/Spider.iml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 11 | -------------------------------------------------------------------------------- /Spider/ch_Haiwang/Spider/.idea/misc.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 6 | 7 | -------------------------------------------------------------------------------- /Spider/ch_Haiwang/Spider/.idea/modules.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | -------------------------------------------------------------------------------- /Spider/ch_Haiwang/Spider/.idea/vcs.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | -------------------------------------------------------------------------------- /Spider/ch_Haiwang/Spider/.idea/workspace.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 153 | 154 | 155 | 156 | assistAwardInfo 157 | nickName 158 | cityName 159 | score 160 | startTime 161 | approve 162 | reply 163 | avatarurl 164 | 165 | 166 | 167 | 169 | 170 | 180 | 181 | 182 | 183 | 184 | true 185 | DEFINITION_ORDER 186 | 187 | 188 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 |