├── NLP
│   └── readme.md
├── README.md
├── Spider
│   ├── ch_Bilibili
│   │   ├── images
│   │   │   ├── big.png
│   │   │   ├── small.png
│   │   │   ├── spider_image.gif
│   │   │   └── three.png
│   │   └── readme.md
│   ├── ch_Code
│   │   ├── README.md
│   │   └── sliding_code.py
│   ├── ch_Distributedcrawler
│   │   ├── Connection.py
│   │   ├── Duperfilter.py
│   │   ├── Pipelines.py
│   │   ├── Queue.py
│   │   ├── Scheduler.py
│   │   ├── Spider.py
│   │   ├── picklecompat.py
│   │   └── readme.md
│   ├── ch_Haiwang
│   │   ├── Spider
│   │   │   ├── .idea
│   │   │   │   ├── Spider.iml
│   │   │   │   ├── misc.xml
│   │   │   │   ├── modules.xml
│   │   │   │   ├── vcs.xml
│   │   │   │   └── workspace.xml
│   │   │   ├── Spider
│   │   │   │   ├── __init__.py
│   │   │   │   ├── __pycache__
│   │   │   │   │   ├── __init__.cpython-35.pyc
│   │   │   │   │   ├── items.cpython-35.pyc
│   │   │   │   │   ├── pipelines.cpython-35.pyc
│   │   │   │   │   └── settings.cpython-35.pyc
│   │   │   │   ├── items.py
│   │   │   │   ├── middlewares.py
│   │   │   │   ├── pipelines.py
│   │   │   │   ├── settings.py
│   │   │   │   ├── spiders
│   │   │   │   │   ├── Haiwang.py
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   └── __pycache__
│   │   │   │   │       ├── Haiwang.cpython-35.pyc
│   │   │   │   │       └── __init__.cpython-35.pyc
│   │   │   │   └── tools
│   │   │   │       ├── __init__.py
│   │   │   │       └── xici_ip.py
│   │   │   ├── __init__.py
│   │   │   ├── run.py
│   │   │   └── scrapy.cfg
│   │   ├── analysis.py
│   │   ├── city_coordinates.json
│   │   ├── images
│   │   │   ├── 1.png
│   │   │   ├── 2.png
│   │   │   ├── chart_line.png
│   │   │   ├── ciyun.png
│   │   │   ├── csv.png
│   │   │   ├── geo.png
│   │   │   ├── list.png
│   │   │   └── mongo.png
│   │   ├── pip.txt
│   │   ├── readme.md
│   │   └── stop_words.txt
│   ├── ch_summary
│   │   ├── summary01.md
│   │   ├── summary02.md
│   │   ├── summary03.md
│   │   └── summary04.md
│   └── readme.md
├── ch01
│   ├── readme.md
│   ├── 特征探索性分析.ipynb
│   └── 特征预处理.ipynb
├── ch02
│   ├── Ass_rule.py
│   ├── K-means.py
│   ├── RandomForest.ipynb
│   ├── SVM.ipynb
│   ├── datas
│   │   ├── data1.txt
│   │   ├── data2.txt
│   │   ├── samtrain.csv
│   │   ├── samval.csv
│   │   ├── test.csv
│   │   └── train.csv
│   ├── images
│   │   ├── 1.jpg
│   │   ├── 2.jpg
│   │   ├── 3.jpg
│   │   ├── 4.jpg
│   │   ├── 5.jpg
│   │   ├── Kmeans.png
│   │   └── sklearn.png
│   ├── logistic_regression.ipynb
│   └── readme.md
├── ch03
│   └── readme.md
└── kaggle
    ├── data
    │   ├── HR.bak.csv
    │   ├── HR.csv
    │   ├── rfr.csv
    │   ├── test.csv
    │   └── train.csv
    ├── readme.md
    ├── 共享单车案例.ipynb
    ├── 员工离职率数据分析.ipynb
    └── 员工离职率特征工程与建模.ipynb
/NLP/readme.md:
--------------------------------------------------------------------------------
1 | ### NLP
2 |
3 | * jieba is the most widely used library for Chinese word segmentation; for a demo, see the [Haiwang movie review analysis](https://github.com/fenglei110/Data-analysis/blob/master/Spider/ch_Haiwang/analysis.py)
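
A minimal jieba sketch (assumed usage, not taken from `analysis.py`): segment a sentence and extract keywords.

```py
import jieba
import jieba.analyse

text = '海王是一部不错的电影'  # "Aquaman is a pretty good movie"
print('/'.join(jieba.cut(text)))                 # exact-mode segmentation
print(jieba.analyse.extract_tags(text, topK=3))  # TF-IDF keyword extraction
```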
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data-analysis
2 | Data analysis, mining and modeling with Python
3 |
4 | - :soccer: [Data analysis and feature processing](./ch01)
5 | - :basketball: [Machine learning and data modeling](./ch02)
6 | - :apple: [Model evaluation](./ch03)
7 | - :hamburger: [Natural language processing](./NLP)
8 | - :cherries: [Notes on web scraping](./Spider)
9 | - :fries: [Kaggle competition projects](./kaggle)
10 | - :banana: [Knowledge graph]()
11 |
12 | ## Common tools
13 | - numpy
14 | - pandas
15 | - matplotlib
16 | - seaborn: built on matplotlib, adds richer statistical plots
17 | - SciPy: a collection of packages for scientific computing
18 |     - scipy.integrate: numerical integration routines and differential-equation solvers
19 |     - scipy.linalg: linear algebra routines and matrix decompositions
20 |     - scipy.optimize: function optimizers and root-finding algorithms
21 |     - scipy.signal: signal-processing tools
22 |     - scipy.sparse: sparse matrices and sparse linear-system solvers
23 |     - scipy.special: wrapper around SPECFUN (implements many common mathematical functions)
24 |     - scipy.stats: standard continuous and discrete probability distributions
25 |     - scipy.weave: tools for speeding up array computations with inline C++ code
26 |
27 | - scikit-learn (sklearn for short): a machine learning toolkit for data analysis and data mining, built on NumPy, SciPy and matplotlib.
28 | - Jupyter Notebook: essentially a web application for creating and sharing literate program documents; it supports live code, math equations, visualization and markdown. Most Kaggle competition material is shared as Jupyter notebooks. Typical uses include data cleaning and transformation, numerical simulation, statistical modeling and machine learning.
29 | - TensorFlow: an open-source library for numerical computation on data-flow graphs. Originally used at Google for machine learning and deep neural network research, it is general enough for many other domains.
30 | - Anaconda: bundles Conda, Python and more than 180 pre-installed packages and their dependencies, such as numpy and pandas. Conda is an open-source package and environment manager that can install different versions of packages and their dependencies on the same machine and switch between environments.
31 |
32 | **jieba word segmentation demo**
33 |
34 |
--------------------------------------------------------------------------------
/Spider/ch_Bilibili/images/big.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Bilibili/images/big.png
--------------------------------------------------------------------------------
/Spider/ch_Bilibili/images/small.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Bilibili/images/small.png
--------------------------------------------------------------------------------
/Spider/ch_Bilibili/images/spider_image.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Bilibili/images/spider_image.gif
--------------------------------------------------------------------------------
/Spider/ch_Bilibili/images/three.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Bilibili/images/three.png
--------------------------------------------------------------------------------
/Spider/ch_Bilibili/readme.md:
--------------------------------------------------------------------------------
1 | # Scraping video info from Bilibili
2 | A wildly addictive site. I got sidetracked by remix videos more times than I can count (Uncle Benshan's "Gaige Chunfeng" remix completely wrecked me recently), so I kept reminding myself that I was there to scrape...
3 |
4 | ### Prerequisites
5 |
6 | `scrapy` `python3.5`
7 |
8 | Work out the pattern of the API and fetch the JSON data.
9 |
10 | The basic starting link is: `url = 'https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video&view_type=hot_rank&pic_size=160x100&order=click&copy_right=-1&cate_id=1&page=1&pagesize=20&time_from=20181201&time_to=20190109'`
11 |
12 | `cate_id` can be regex-matched from the page source, but to save the trouble I simply iterate it from 1 to 200, which roughly covers the range.
13 |
14 | If you want more data, `page` and the `time` parameters can be adjusted as well (see the sketch below).
15 |
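
A minimal sketch of generating the start URLs by iterating `cate_id` (assumptions: it reuses the query parameters above; the repo's actual spider may build its URLs differently):

```py
BASE = ('https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video'
        '&view_type=hot_rank&pic_size=160x100&order=click&copy_right=-1'
        '&cate_id={cate_id}&page={page}&pagesize=20'
        '&time_from=20181201&time_to=20190109')

# One hot-rank page per category id, as described above.
start_urls = [BASE.format(cate_id=cid, page=1) for cid in range(1, 201)]
```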
16 | #### Spider code
17 | ```py
18 | def parse(self, response):
19 | if response:
20 | text = json.loads(response.text)
21 | res = text.get('result')
22 | numpages = text.get('numPages')
23 | numResults = text.get('numResults')
24 | msg = text.get('msg')
25 | if msg == 'success':
26 | for i in res:
27 | author = i.get('author')
28 | id = i.get('id')
29 | pubdate = i.get('pubdate')
30 | favorite = i.get('favorites')
31 | rank_score = i.get('rank_score')
32 | video_review = i.get('video_review')
33 | tag = i.get('tag')
34 | title = i.get('title')
35 | arcurl = i.get('arcurl')
36 |
37 | item = BilibiliItem()
38 | item['numResults'] = numResults
39 | item['author'] = author
40 | item['id'] = id
41 | item['pubdate'] = pubdate
42 | item['favorite'] = favorite
43 | item['rank_score'] = rank_score
44 | item['video_review'] = video_review
45 | item['tag'] = tag
46 | item['title'] = title
47 | item['arcurl'] = arcurl
48 | yield item
49 |
50 | ```
51 | 
52 |
53 | 
54 |
55 | 
56 |
57 | What you do with the data after scraping is up to you; saving it as a CSV file is the most convenient option, since pandas handles CSVs easily, for example:
58 |
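A quick pandas sketch (assuming the items were exported to a hypothetical `bilibili.csv` that keeps the item field names):

```py
import pandas as pd

df = pd.read_csv('bilibili.csv')
print(df.sort_values('favorite', ascending=False).head(10))  # top 10 videos by favorites
```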
59 |
60 | The link to a Bilibili video is `https://www.bilibili.com/video/av + v_aid`
61 |
62 | If you're interested in the data, feel free to contact me by email so we can learn from each other.
63 |
--------------------------------------------------------------------------------
/Spider/ch_Code/README.md:
--------------------------------------------------------------------------------
1 | # Cracking a sliding captcha
2 | Simulates the user's slide-to-unlock gesture with Selenium.
3 | The demo target is the National Enterprise Credit Information Publicity System (国家企业信用信息公示系统).
4 |
5 | Steps:
6 | 1. Open the page with Selenium
7 | 2. Locate the input box, type the query, and click the search button
8 | 3. Download the captcha image and detect the gap
9 | 4. Compute the slide distance from the gap position
10 | 5. Drag the slider by that distance to the matching position
11 |
12 |
13 | 
--------------------------------------------------------------------------------
/Spider/ch_Code/sliding_code.py:
--------------------------------------------------------------------------------
1 | # -*-coding:utf-8 -*-
2 | import random
3 | import time
4 |
5 | from selenium.webdriver import ActionChains
6 | from selenium.webdriver.support import expected_conditions as EC
7 | from selenium.webdriver.support.ui import WebDriverWait
8 | from selenium.webdriver.common.by import By
9 | from urllib.request import urlretrieve
10 | from selenium import webdriver
11 | from bs4 import BeautifulSoup
12 | import PIL.Image as image
13 | import re
14 |
15 |
16 | class Crack():
17 | def __init__(self,keyword):
18 | self.url = 'http://bj.gsxt.gov.cn/sydq/loginSydqAction!sydq.dhtml'
19 | self.browser = webdriver.Chrome('D:\\chromedriver.exe')
20 | self.wait = WebDriverWait(self.browser, 100)
21 | self.keyword = keyword
22 | self.BORDER = 6
23 |
24 | def open(self):
25 | """
26 | Open the browser and submit the query keyword
27 | """
28 | self.browser.get(self.url)
29 | keyword = self.wait.until(EC.presence_of_element_located((By.ID, 'keyword_qycx')))
30 | button = self.wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'btn')))
31 | keyword.send_keys(self.keyword)
32 | button.click()
33 |
34 | def get_images(self, bg_filename='bg.jpg', fullbg_filename='fullbg.jpg'):
35 | """
36 | Download the captcha images
37 | :return: location info for the image slices
38 | """
39 | bg = []
40 | fullgb = []
41 | while bg == [] and fullgb == []:
42 | bf = BeautifulSoup(self.browser.page_source, 'lxml')
43 | bg = bf.find_all('div', class_ = 'gt_cut_bg_slice')
44 | fullgb = bf.find_all('div', class_ = 'gt_cut_fullbg_slice')
45 | bg_url = re.findall('url\(\"(.*)\"\);', bg[0].get('style'))[0].replace('webp', 'jpg')
46 | fullgb_url = re.findall('url\(\"(.*)\"\);', fullgb[0].get('style'))[0].replace('webp', 'jpg')
47 | bg_location_list = []
48 | fullbg_location_list = []
49 | for each_bg in bg:
50 | location = dict()
51 | location['x'] = int(re.findall('background-position: (.*)px (.*)px;',each_bg.get('style'))[0][0])
52 | location['y'] = int(re.findall('background-position: (.*)px (.*)px;',each_bg.get('style'))[0][1])
53 | bg_location_list.append(location)
54 | for each_fullgb in fullgb:
55 | location = dict()
56 | location['x'] = int(re.findall('background-position: (.*)px (.*)px;',each_fullgb.get('style'))[0][0])
57 | location['y'] = int(re.findall('background-position: (.*)px (.*)px;',each_fullgb.get('style'))[0][1])
58 | fullbg_location_list.append(location)
59 |
60 | urlretrieve(url=bg_url, filename=bg_filename)
61 | print('Gap image downloaded')
62 | urlretrieve(url=fullgb_url, filename=fullbg_filename)
63 | print('Background image downloaded')
64 | return bg_location_list, fullbg_location_list
65 |
66 | def get_merge_image(self, filename, location_list):
67 | """
68 | Reassemble the image from its slices according to their positions
69 | :filename: image file
70 | :location_list: slice positions
71 | """
72 | im = image.open(filename)
73 | new_im = image.new('RGB', (260, 116))
74 | im_list_upper = []
75 | im_list_down = []
76 |
77 | for location in location_list:
78 | if location['y'] == -58:
79 | im_list_upper.append(im.crop((abs(location['x']), 58, abs(location['x']) + 10, 166)))
80 | if location['y'] == 0:
81 | im_list_down.append(im.crop((abs(location['x']), 0, abs(location['x']) + 10, 58)))
82 |
83 | new_im = image.new('RGB', (260,116))
84 |
85 | x_offset = 0
86 | for im in im_list_upper:
87 | new_im.paste(im, (x_offset,0))
88 | x_offset += im.size[0]
89 |
90 | x_offset = 0
91 | for im in im_list_down:
92 | new_im.paste(im, (x_offset,58))
93 | x_offset += im.size[0]
94 |
95 | new_im.save(filename)
96 |
97 | return new_im
98 |
99 | def is_pixel_equal(self, img1, img2, x, y):
100 | """
101 | Check whether two pixels are (nearly) identical
102 | :param img1: image 1
103 | :param img2: image 2
104 | :param x: x position
105 | :param y: y position
106 | :return: whether the pixels match within the threshold
107 | """
108 | # sample the pixel at (x, y) from both images
109 | pix1 = img1.load()[x, y]
110 | pix2 = img2.load()[x, y]
111 | threshold = 60
112 | if (abs(pix1[0] - pix2[0]) < threshold and abs(pix1[1] - pix2[1]) < threshold
113 | and abs(pix1[2] - pix2[2]) < threshold):
114 | return True
115 | else:
116 | return False
117 |
118 | def get_gap(self, img1, img2):
119 | """
120 | Find the horizontal offset of the gap
121 | :param img1: image without the gap
122 | :param img2: image with the gap
123 | :return: x offset of the gap
124 | """
125 | left = 43
126 | for i in range(left, img1.size[0]):
127 | for j in range(img1.size[1]):
128 | if not self.is_pixel_equal(img1, img2, i, j):
129 | left = i
130 | return left
131 | return left
132 |
133 | def get_track(self, distance):
134 | """
135 | Build a movement track for the given offset
136 | :param distance: offset to travel
137 | :return: movement track (list of step sizes)
138 | """
139 | # movement track
140 | track = []
141 | # current displacement
142 | current = 0
143 | # threshold where deceleration starts
144 | mid = distance * 4 / 5
145 | # time step
146 | t = 0.2
147 | # initial velocity
148 | v = 0
149 |
150 | while current < distance:
151 | if current < mid:
152 | # acceleration of +2
153 | a = 2
154 | else:
155 | # acceleration of -3
156 | a = -3
157 | # velocity at the start of this step
158 | v0 = v
159 | # current velocity: v = v0 + a * t
160 | v = v0 + a * t
161 | # distance moved this step: x = v0 * t + 1/2 * a * t^2
162 | move = v0 * t + 1 / 2 * a * t * t
163 | # update the displacement
164 | current += move
165 | # append the step to the track
166 | track.append(round(move))
167 | return track
168 |
169 | def get_slider(self):
170 | """
171 | Wait for the slider knob to appear
172 | :return: the slider element
173 | """
174 | while True:
175 | try:
176 | slider = self.browser.find_element_by_xpath("//div[@class='gt_slider_knob gt_show']")
177 | break
178 | except:
179 | time.sleep(0.5)
180 | return slider
181 |
182 | def move_to_gap(self, slider, track):
183 | """
184 | Drag the slider to the gap
185 | :param slider: slider element
186 | :param track: movement track
187 | :return:
188 | """
189 | ActionChains(self.browser).click_and_hold(slider).perform()
190 | while track:
191 | x = random.choice(track)
192 | ActionChains(self.browser).move_by_offset(xoffset=x, yoffset=0).perform()
193 | track.remove(x)
194 | time.sleep(0.5)
195 | ActionChains(self.browser).release().perform()
196 |
197 | def crack(self):
198 | # open the browser and submit the query
199 | self.open()
200 |
201 | # filenames for the saved images
202 | bg_filename = 'bg.jpg'
203 | fullbg_filename = 'fullbg.jpg'
204 |
205 | # download the image slices
206 | bg_location_list, fullbg_location_list = self.get_images(bg_filename, fullbg_filename)
207 |
208 | # reassemble the images from the slice positions
209 | bg_img = self.get_merge_image(bg_filename, bg_location_list)
210 | fullbg_img = self.get_merge_image(fullbg_filename, fullbg_location_list)
211 |
212 | # locate the gap
213 | gap = self.get_gap(fullbg_img, bg_img)
214 | print('Gap position', gap)
215 |
216 | track = self.get_track(gap-self.BORDER)
217 | print('Moving the slider')
218 | print(track)
219 |
220 | # press the knob so the gap appears
221 | slider = self.get_slider()
222 | # drag the slider to the gap
223 | self.move_to_gap(slider, track)
224 |
225 |
226 | if __name__ == '__main__':
227 | print('Starting verification')
228 | crack = Crack(u'中国移动')  # search keyword: "China Mobile"
229 | crack.crack()
230 | print('Verification succeeded')
--------------------------------------------------------------------------------
/Spider/ch_Distributedcrawler/Connection.py:
--------------------------------------------------------------------------------
1 | """Builds the redis client from the Scrapy settings. It is called by the dupefilter and the scheduler;
2 | in short, every module that reads from or writes to redis goes through the helpers defined here."""
3 | import six
4 | from scrapy.utils.misc import load_object
5 | from . import defaults
6 |
7 |
8 | # Shortcut maps 'setting name' -> 'parameter name'.
9 | # Connecting to redis is much like any other database: you need a host, a port, optionally a
10 | # password, and an integer database number (or a single REDIS_URL that encodes all of them).
11 | SETTINGS_PARAMS_MAP = {
12 | 'REDIS_URL': 'url',
13 | 'REDIS_HOST': 'host',
14 | 'REDIS_PORT': 'port',
15 | 'REDIS_ENCODING': 'encoding',
16 | }
17 |
18 |
19 | def get_redis_from_settings(settings):
20 | """Returns a redis client instance from given Scrapy settings object.
21 |
22 | This function uses ``get_client`` to instantiate the client and uses
23 | ``defaults.REDIS_PARAMS`` global as defaults values for the parameters. You
24 | can override them using the ``REDIS_PARAMS`` setting.
25 |
26 | Parameters
27 | ----------
28 | settings : Settings
29 | A scrapy settings object. See the supported settings below.
30 |
31 | Returns
32 | -------
33 | server
34 | Redis client instance.
35 |
36 | Other Parameters
37 | ----------------
38 | REDIS_URL : str, optional
39 | Server connection URL.
40 | REDIS_HOST : str, optional
41 | Server host.
42 | REDIS_PORT : str, optional
43 | Server port.
44 | REDIS_ENCODING : str, optional
45 | Data encoding.
46 | REDIS_PARAMS : dict, optional
47 | Additional client parameters.
48 |
49 | """
50 | params = defaults.REDIS_PARAMS.copy()
51 | params.update(settings.getdict('REDIS_PARAMS'))
52 | # XXX: Deprecate REDIS_* settings.
53 | for source, dest in SETTINGS_PARAMS_MAP.items():
54 | val = settings.get(source)
55 | if val:
56 | params[dest] = val
57 |
58 | # Allow ``redis_cls`` to be a path to a class.
59 | if isinstance(params.get('redis_cls'), six.string_types):
60 | params['redis_cls'] = load_object(params['redis_cls'])
61 | # Delegate to get_redis() to actually build the client
62 | return get_redis(**params)
63 |
64 |
65 | # Backwards compatible alias.
66 | from_settings = get_redis_from_settings
67 |
68 |
69 | # Returns a Redis object from the redis library, ready to be used for data operations directly
70 | def get_redis(**kwargs):
71 | """Returns a redis client instance.
72 |
73 | Parameters
74 | ----------
75 | redis_cls : class, optional
76 | Defaults to ``redis.StrictRedis``.
77 | url : str, optional
78 | If given, ``redis_cls.from_url`` is used to instantiate the class.
79 | **kwargs
80 | Extra parameters to be passed to the ``redis_cls`` class.
81 |
82 | Returns
83 | -------
84 | server
85 | Redis client instance.
86 |
87 | """
88 | redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS)
89 | url = kwargs.pop('url', None)
90 | if url:
91 | return redis_cls.from_url(url, **kwargs)
92 | else:
93 | return redis_cls(**kwargs)
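
# Illustration (assumed values, not part of the original module): with
#   REDIS_URL = 'redis://:password@localhost:6379/0'
# in settings.py, the scheduler and dupefilter simply do
#   server = get_redis_from_settings(crawler.settings)
#   server.ping()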
--------------------------------------------------------------------------------
/Spider/ch_Distributedcrawler/Duperfilter.py:
--------------------------------------------------------------------------------
1 | """Performs request deduplication, implemented rather cleverly on top of a redis set. Note that the scheduler does not schedule requests off the dupefilter key kept here;
2 | scheduling uses the queue implemented in queue.py: a request that is not a duplicate is pushed onto that queue and popped from it when it is scheduled."""
3 |
4 | import logging
5 | import time
6 | from scrapy.dupefilters import BaseDupeFilter
7 | from scrapy.utils.request import request_fingerprint
8 | from . import defaults
9 | from .connection import get_redis_from_settings
10 |
11 |
12 | logger = logging.getLogger(__name__)
13 |
14 |
15 | # TODO: Rename class to RedisDupeFilter.
16 | class RFPDupeFilter(BaseDupeFilter):
17 | """Redis-based request duplicates filter.
18 |
19 | This class can also be used with default Scrapy's scheduler.
20 |
21 | """
22 |
23 | logger = logger
24 |
25 | def __init__(self, server, key, debug=False):
26 | """Initialize the duplicates filter.
27 |
28 | Parameters
29 | ----------
30 | server : redis.StrictRedis
31 | The redis server instance.
32 | key : str
33 | Redis key Where to store fingerprints.
34 | debug : bool, optional
35 | Whether to log filtered requests.
36 |
37 | """
38 | self.server = server
39 | self.key = key
40 | self.debug = debug
41 | self.logdupes = True
42 |
43 | @classmethod
44 | def from_settings(cls, settings):
45 | """Returns an instance from given settings.
46 |
47 | This uses by default the key ``dupefilter:``. When using the
48 | ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
49 | it needs to pass the spider name in the key.
50 |
51 | Parameters
52 | ----------
53 | settings : scrapy.settings.Settings
54 |
55 | Returns
56 | -------
57 | RFPDupeFilter
58 | A RFPDupeFilter instance.
59 |
60 |
61 | """
62 | server = get_redis_from_settings(settings)
63 | # XXX: This creates one-time key. needed to support to use this
64 | # class as standalone dupefilter with scrapy's default scheduler
65 | # if scrapy passes spider on open() method this wouldn't be needed
66 | # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
67 | key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
68 | debug = settings.getbool('DUPEFILTER_DEBUG')
69 | return cls(server, key=key, debug=debug)
70 |
71 | @classmethod
72 | def from_crawler(cls, crawler):
73 | """Returns instance from crawler.
74 |
75 | Parameters
76 | ----------
77 | crawler : scrapy.crawler.Crawler
78 |
79 | Returns
80 | -------
81 | RFPDupeFilter
82 | Instance of RFPDupeFilter.
83 |
84 | """
85 | return cls.from_settings(crawler.settings)
86 |
87 | def request_seen(self, request):
88 | """Returns True if request was already seen.
89 |
90 | Parameters
91 | ----------
92 | request : scrapy.http.Request
93 |
94 | Returns
95 | -------
96 | bool
97 |
98 | """
99 | fp = self.request_fingerprint(request)
100 | # This returns the number of values added, zero if already exists.
101 | added = self.server.sadd(self.key, fp)
102 | return added == 0
103 |
104 | def request_fingerprint(self, request):
105 | """Returns a fingerprint for a given request.
106 |
107 | Parameters
108 | ----------
109 | request : scrapy.http.Request
110 |
111 | Returns
112 | -------
113 | str
114 |
115 | """
116 | return request_fingerprint(request)
117 |
118 | def close(self, reason=''):
119 | """Delete data on close. Called by Scrapy's scheduler.
120 |
121 | Parameters
122 | ----------
123 | reason : str, optional
124 |
125 | """
126 | self.clear()
127 |
128 | def clear(self):
129 | """Clears fingerprints data."""
130 | self.server.delete(self.key)
131 |
132 | def log(self, request, spider):
133 | """Logs given request.
134 |
135 | Parameters
136 | ----------
137 | request : scrapy.http.Request
138 | spider : scrapy.spiders.Spider
139 |
140 | """
141 | if self.debug:
142 | msg = "Filtered duplicate request: %(request)s"
143 | self.logger.debug(msg, {'request': request}, extra={'spider': spider})
144 | elif self.logdupes:
145 | msg = ("Filtered duplicate request %(request)s"
146 | " - no more duplicates will be shown"
147 | " (see DUPEFILTER_DEBUG to show all duplicates)")
148 | self.logger.debug(msg, {'request': request}, extra={'spider': spider})
149 | self.logdupes = False
150 |
151 | # This file looks involved, but it simply rewrites the request deduplication that scrapy already provides. On a single
152 | # machine, scrapy only needs its local in-memory (or locally persisted) request queue to tell whether a URL has already
153 | # been requested or is currently being scheduled. In a distributed run, the scheduler on every host must consult the
154 | # same request pool in the same database to decide whether a request is a duplicate.
155 | # Deduplication is implemented here on redis by subclassing BaseDupeFilter. scrapy-redis reuses scrapy's own
156 | # request_fingerprint interface: it hashes a request so that identical URLs give identical fingerprints, and two URLs
157 | # that differ only in the order of their GET parameters also hash to the same value, so the fingerprint is a reliable
158 | # "have we seen this request" token. Fingerprints are inserted into a redis set under a key shared by every instance of
159 | # the same spider (spider name plus the dupefilter suffix), so crawlers on different hosts share one deduplication pool.
160 | # If SADD returns 0 the fingerprint was already in the set, so request_seen() returns True and the request is dropped;
161 | # if SADD returns 1 the fingerprint is new and has just been stored, so request_seen() returns False and the scheduler
162 | # will enqueue the request. Every request is checked this way before scheduling, so duplicates never waste resources.
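
# Illustration (sketch, assuming scrapy is installed): parameter order does not change the fingerprint.
#   from scrapy.http import Request
#   from scrapy.utils.request import request_fingerprint
#   fp1 = request_fingerprint(Request('http://example.com/?a=1&b=2'))
#   fp2 = request_fingerprint(Request('http://example.com/?b=2&a=1'))
#   assert fp1 == fp2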
--------------------------------------------------------------------------------
/Spider/ch_Distributedcrawler/Pipelines.py:
--------------------------------------------------------------------------------
1 | """Implements the distributed item-processing side: items are stored in redis so they can be processed anywhere. Reading the configuration is why from_crawler() is used here."""
2 | from scrapy.utils.misc import load_object
3 | from scrapy.utils.serialize import ScrapyJSONEncoder
4 | from twisted.internet.threads import deferToThread
5 |
6 | from . import connection, defaults
7 |
8 |
9 | default_serialize = ScrapyJSONEncoder().encode
10 |
11 |
12 | class RedisPipeline(object):
13 | """Pushes serialized item into a redis list/queue
14 |
15 | Settings
16 | --------
17 | REDIS_ITEMS_KEY : str
18 | Redis key where to store items.
19 | REDIS_ITEMS_SERIALIZER : str
20 | Object path to serializer function.
21 |
22 | """
23 |
24 | def __init__(self, server,
25 | key=defaults.PIPELINE_KEY,
26 | serialize_func=default_serialize):
27 | """Initialize pipeline.
28 |
29 | Parameters
30 | ----------
31 | server : StrictRedis
32 | Redis client instance.
33 | key : str
34 | Redis key where to store items.
35 | serialize_func : callable
36 | Items serializer function.
37 |
38 | """
39 | self.server = server
40 | self.key = key
41 | self.serialize = serialize_func
42 |
43 | @classmethod
44 | def from_settings(cls, settings):
45 | params = {
46 | 'server': connection.from_settings(settings),
47 | }
48 | if settings.get('REDIS_ITEMS_KEY'):
49 | params['key'] = settings['REDIS_ITEMS_KEY']
50 | if settings.get('REDIS_ITEMS_SERIALIZER'):
51 | params['serialize_func'] = load_object(
52 | settings['REDIS_ITEMS_SERIALIZER']
53 | )
54 |
55 | return cls(**params)
56 |
57 | @classmethod
58 | def from_crawler(cls, crawler):
59 | return cls.from_settings(crawler.settings)
60 |
61 | def process_item(self, item, spider):
62 | return deferToThread(self._process_item, item, spider)
63 |
64 | def _process_item(self, item, spider):
65 | key = self.item_key(item, spider)
66 | data = self.serialize(item)
67 | self.server.rpush(key, data)
68 | return item
69 |
70 | def item_key(self, item, spider):
71 | """Returns redis key based on given spider.
72 |
73 | Override this function to use a different key depending on the item
74 | and/or spider.
75 |
76 | """
77 | return self.key % {'spider': spider.name}
78 | # This file implements an item pipeline (the same kind of object as a regular scrapy item pipeline). It uses the
79 | # REDIS_ITEMS_KEY configured in settings as the redis key and pushes each serialized item onto the list stored under
80 | # that key (each item becomes one element of that list). Storing items this way makes it easy to process the data later, for example:
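# Illustration (assumed key name; scrapy-redis's default PIPELINE_KEY is '%(spider)s:items'):
#   import json, redis
#   raw = redis.StrictRedis().lpop('myspider:items')
#   item = json.loads(raw) if raw else None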
--------------------------------------------------------------------------------
/Spider/ch_Distributedcrawler/Queue.py:
--------------------------------------------------------------------------------
1 | """This module implements the container classes the scheduler uses to hold requests. They all talk to redis and use the serializer defined in picklecompat. The containers are largely the same except for ordering: one is a FIFO queue, one is a stack (LIFO) and one is a priority queue. The scheduler instantiates whichever one is configured, so using SpiderQueue as the scheduling queue gives first-in-first-out scheduling, while SpiderStack gives last-in-first-out.
2 | As SpiderQueue shows, push() first turns the request into a dict with scrapy's request_to_dict (a Request object has methods and attributes and is awkward to serialize directly), serializes that dict with the picklecompat serializer, and stores it in redis under a key that is the same for every instance of a given spider; pop() reads the list stored under that key and takes the oldest element, hence first-in-first-out. Each host instantiates its own scheduler, one per spider instance, but because all of them use the same container class, connect to the same redis server and build the key from the spider name plus 'queue', crawler instances on different hosts share a single request scheduling pool, which is what gives unified scheduling across the distributed crawl."""
3 |
4 | from scrapy.utils.reqser import request_to_dict, request_from_dict
5 | from . import picklecompat
6 |
7 |
8 | class Base(object):
9 | """Per-spider base queue class"""
10 |
11 | def __init__(self, server, spider, key, serializer=None):
12 | """Initialize per-spider redis queue.
13 |
14 | Parameters
15 | ----------
16 | server : StrictRedis
17 | Redis client instance.
18 | spider : Spider
19 | Scrapy spider instance.
20 | key: str
21 | Redis key where to put and get messages.
22 | serializer : object
23 | Serializer object with ``loads`` and ``dumps`` methods.
24 |
25 | """
26 | if serializer is None:
27 | # Backward compatibility.
28 | # TODO: deprecate pickle.
29 | serializer = picklecompat
30 | if not hasattr(serializer, 'loads'):
31 | raise TypeError("serializer does not implement 'loads' function: %r"
32 | % serializer)
33 | if not hasattr(serializer, 'dumps'):
34 | raise TypeError("serializer '%s' does not implement 'dumps' function: %r"
35 | % serializer)
36 |
37 | self.server = server
38 | self.spider = spider
39 | self.key = key % {'spider': spider.name}
40 | self.serializer = serializer
41 |
42 | def _encode_request(self, request):
43 | """Encode a request object"""
44 | obj = request_to_dict(request, self.spider)
45 | return self.serializer.dumps(obj)
46 |
47 | def _decode_request(self, encoded_request):
48 | """Decode an request previously encoded"""
49 | obj = self.serializer.loads(encoded_request)
50 | return request_from_dict(obj, self.spider)
51 |
52 | def __len__(self):
53 | """Return the length of the queue"""
54 | raise NotImplementedError
55 |
56 | def push(self, request):
57 | """Push a request"""
58 | raise NotImplementedError
59 |
60 | def pop(self, timeout=0):
61 | """Pop a request"""
62 | raise NotImplementedError
63 |
64 | def clear(self):
65 | """Clear queue/stack"""
66 | self.server.delete(self.key)
67 |
68 |
69 | class FifoQueue(Base):
70 | """Per-spider FIFO queue"""
71 |
72 | def __len__(self):
73 | """Return the length of the queue"""
74 | return self.server.llen(self.key)
75 |
76 | def push(self, request):
77 | """Push a request"""
78 | self.server.lpush(self.key, self._encode_request(request))
79 |
80 | def pop(self, timeout=0):
81 | """Pop a request"""
82 | if timeout > 0:
83 | data = self.server.brpop(self.key, timeout)
84 | if isinstance(data, tuple):
85 | data = data[1]
86 | else:
87 | data = self.server.rpop(self.key)
88 | if data:
89 | return self._decode_request(data)
90 |
91 |
92 | class PriorityQueue(Base):
93 | """Per-spider priority queue abstraction using redis' sorted set"""
94 |
95 | def __len__(self):
96 | """Return the length of the queue"""
97 | return self.server.zcard(self.key)
98 |
99 | def push(self, request):
100 | """Push a request"""
101 | data = self._encode_request(request)
102 | score = -request.priority
103 | # We don't use zadd method as the order of arguments change depending on
104 | # whether the class is Redis or StrictRedis, and the option of using
105 | # kwargs only accepts strings, not bytes.
106 | self.server.execute_command('ZADD', self.key, score, data)
107 |
108 | def pop(self, timeout=0):
109 | """
110 | Pop a request
111 | timeout not support in this queue class
112 | """
113 | # use atomic range/remove using multi/exec
114 | pipe = self.server.pipeline()
115 | pipe.multi()
116 | pipe.zrange(self.key, 0, 0).zremrangebyrank(self.key, 0, 0)
117 | results, count = pipe.execute()
118 | if results:
119 | return self._decode_request(results[0])
120 |
121 |
122 | class LifoQueue(Base):
123 | """Per-spider LIFO queue."""
124 |
125 | def __len__(self):
126 | """Return the length of the stack"""
127 | return self.server.llen(self.key)
128 |
129 | def push(self, request):
130 | """Push a request"""
131 | self.server.lpush(self.key, self._encode_request(request))
132 |
133 | def pop(self, timeout=0):
134 | """Pop a request"""
135 | if timeout > 0:
136 | data = self.server.blpop(self.key, timeout)
137 | if isinstance(data, tuple):
138 | data = data[1]
139 | else:
140 | data = self.server.lpop(self.key)
141 |
142 | if data:
143 | return self._decode_request(data)
144 |
145 |
146 | # TODO: Deprecate the use of these names.
147 | SpiderQueue = FifoQueue
148 | SpiderStack = LifoQueue
149 | SpiderPriorityQueue = PriorityQueue
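
# Which container the scheduler uses is chosen in settings.py (class paths assume scrapy-redis's layout):
#   SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # default: priority queue
#   SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'      # first in, first out
#   SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'      # last in, first out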
--------------------------------------------------------------------------------
/Spider/ch_Distributedcrawler/Scheduler.py:
--------------------------------------------------------------------------------
1 | """A drop-in replacement for scrapy's built-in scheduler (pointed to by the SCHEDULER setting); this extension is what makes distributed scheduling possible.
2 | Its data structures come from the queue module. The two kinds of distribution scrapy-redis provides, distributed crawling and distributed item processing,
3 | are implemented by the scheduler module and the pipelines module respectively; the other modules are support code for these two."""
4 |
5 | import importlib
6 | import six
7 | from scrapy.utils.misc import load_object
8 | from . import connection, defaults
9 |
10 |
11 | # TODO: add SCRAPY_JOB support.
12 | class Scheduler(object):
13 | """Redis-based scheduler
14 |
15 | Settings
16 | --------
17 | SCHEDULER_PERSIST : bool (default: False)
18 | Whether to persist or clear redis queue.
19 | SCHEDULER_FLUSH_ON_START : bool (default: False)
20 | Whether to flush redis queue on start.
21 | SCHEDULER_IDLE_BEFORE_CLOSE : int (default: 0)
22 | How many seconds to wait before closing if no message is received.
23 | SCHEDULER_QUEUE_KEY : str
24 | Scheduler redis key.
25 | SCHEDULER_QUEUE_CLASS : str
26 | Scheduler queue class.
27 | SCHEDULER_DUPEFILTER_KEY : str
28 | Scheduler dupefilter redis key.
29 | SCHEDULER_DUPEFILTER_CLASS : str
30 | Scheduler dupefilter class.
31 | SCHEDULER_SERIALIZER : str
32 | Scheduler serializer.
33 |
34 | """
35 |
36 | def __init__(self, server,
37 | persist=False,
38 | flush_on_start=False,
39 | queue_key=defaults.SCHEDULER_QUEUE_KEY,
40 | queue_cls=defaults.SCHEDULER_QUEUE_CLASS,
41 | dupefilter_key=defaults.SCHEDULER_DUPEFILTER_KEY,
42 | dupefilter_cls=defaults.SCHEDULER_DUPEFILTER_CLASS,
43 | idle_before_close=0,
44 | serializer=None):
45 | """Initialize scheduler.
46 |
47 | Parameters
48 | ----------
49 | server : Redis
50 | The redis server instance.
51 | persist : bool
52 | Whether to flush requests when closing. Default is False.
53 | flush_on_start : bool
54 | Whether to flush requests on start. Default is False.
55 | queue_key : str
56 | Requests queue key.
57 | queue_cls : str
58 | Importable path to the queue class.
59 | dupefilter_key : str
60 | Duplicates filter key.
61 | dupefilter_cls : str
62 | Importable path to the dupefilter class.
63 | idle_before_close : int
64 | Timeout before giving up.
65 |
66 | """
67 | if idle_before_close < 0:
68 | raise TypeError("idle_before_close cannot be negative")
69 |
70 | self.server = server
71 | self.persist = persist
72 | self.flush_on_start = flush_on_start
73 | self.queue_key = queue_key
74 | self.queue_cls = queue_cls
75 | self.dupefilter_cls = dupefilter_cls
76 | self.dupefilter_key = dupefilter_key
77 | self.idle_before_close = idle_before_close
78 | self.serializer = serializer
79 | self.stats = None
80 |
81 | def __len__(self):
82 | return len(self.queue)
83 |
84 | @classmethod
85 | def from_settings(cls, settings):
86 | kwargs = {
87 | 'persist': settings.getbool('SCHEDULER_PERSIST'),
88 | 'flush_on_start': settings.getbool('SCHEDULER_FLUSH_ON_START'),
89 | 'idle_before_close': settings.getint('SCHEDULER_IDLE_BEFORE_CLOSE'),
90 | }
91 |
92 | # If these values are missing, it means we want to use the defaults.
93 | optional = {
94 | # TODO: Use custom prefixes for this settings to note that are
95 | # specific to scrapy-redis.
96 | 'queue_key': 'SCHEDULER_QUEUE_KEY',
97 | 'queue_cls': 'SCHEDULER_QUEUE_CLASS',
98 | 'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY',
99 | # We use the default setting name to keep compatibility.
100 | 'dupefilter_cls': 'DUPEFILTER_CLASS',
101 | 'serializer': 'SCHEDULER_SERIALIZER',
102 | }
103 | for name, setting_name in optional.items():
104 | val = settings.get(setting_name)
105 | if val:
106 | kwargs[name] = val
107 |
108 | # Support serializer as a path to a module.
109 | if isinstance(kwargs.get('serializer'), six.string_types):
110 | kwargs['serializer'] = importlib.import_module(kwargs['serializer'])
111 |
112 | server = connection.from_settings(settings)
113 | # Ensure the connection is working.
114 | server.ping()
115 |
116 | return cls(server=server, **kwargs)
117 |
118 | @classmethod
119 | def from_crawler(cls, crawler):
120 | instance = cls.from_settings(crawler.settings)
121 | # FIXME: for now, stats are only supported from this constructor
122 | instance.stats = crawler.stats
123 | return instance
124 |
125 | def open(self, spider):
126 | self.spider = spider
127 |
128 | try:
129 | self.queue = load_object(self.queue_cls)(
130 | server=self.server,
131 | spider=spider,
132 | key=self.queue_key % {'spider': spider.name},
133 | serializer=self.serializer,
134 | )
135 | except TypeError as e:
136 | raise ValueError("Failed to instantiate queue class '%s': %s",
137 | self.queue_cls, e)
138 |
139 | try:
140 | self.df = load_object(self.dupefilter_cls)(
141 | server=self.server,
142 | key=self.dupefilter_key % {'spider': spider.name},
143 | debug=spider.settings.getbool('DUPEFILTER_DEBUG'),
144 | )
145 | except TypeError as e:
146 | raise ValueError("Failed to instantiate dupefilter class '%s': %s",
147 | self.dupefilter_cls, e)
148 |
149 | if self.flush_on_start:
150 | self.flush()
151 | # notice if there are requests already in the queue to resume the crawl
152 | if len(self.queue):
153 | spider.log("Resuming crawl (%d requests scheduled)" % len(self.queue))
154 |
155 | def close(self, reason):
156 | if not self.persist:
157 | self.flush()
158 |
159 | def flush(self):
160 | self.df.clear()
161 | self.queue.clear()
162 |
163 | def enqueue_request(self, request):
164 | if not request.dont_filter and self.df.request_seen(request):
165 | self.df.log(request, self.spider)
166 | return False
167 | if self.stats:
168 | self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
169 | self.queue.push(request)
170 | return True
171 |
172 | def next_request(self):
173 | block_pop_timeout = self.idle_before_close
174 | request = self.queue.pop(block_pop_timeout)
175 | if request and self.stats:
176 | self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
177 | return request
178 |
179 | def has_pending_requests(self):
180 | return len(self) > 0
181 | # This file rewrites the scheduler class to replace scrapy.core.scheduler's original scheduler. The logic barely changes;
182 | # the difference is that redis is used as the storage medium so that all crawlers are scheduled together.
183 | # The scheduler is responsible for scheduling each spider's requests. On initialization it reads the queue and dupefilter
184 | # types from the settings (the defaults above are usually fine) and configures the keys they use (normally the spider name
185 | # plus 'queue' or 'dupefilter', so that different instances of the same spider share the same data blocks).
186 | # Whenever a request is to be scheduled, enqueue_request is called: the dupefilter checks whether the request is a duplicate, and if not it is pushed into the queue container (FIFO, LIFO or priority, configurable in settings).
187 | # When the next request is needed, next_request is called: the scheduler pops a request from the queue container and hands it to the corresponding spider so it can crawl.
--------------------------------------------------------------------------------
/Spider/ch_Distributedcrawler/Spider.py:
--------------------------------------------------------------------------------
1 | """This spider reads the URLs to crawl from redis and crawls them; if the crawl yields more URLs, it keeps going until every request is finished, then reads more URLs from redis and repeats the cycle.
2 | It watches the crawler's state by connecting to the signals.spider_idle signal: when the spider goes idle, it hands new requests built by make_requests_from_url(url) back to the engine, which passes them on to the scheduler."""
3 |
4 | from scrapy import signals
5 | from scrapy.exceptions import DontCloseSpider
6 | from scrapy.spiders import Spider, CrawlSpider
7 | from . import connection, defaults
8 | from .utils import bytes_to_str
9 |
10 |
11 | class RedisMixin(object):
12 | """Mixin class to implement reading urls from a redis queue."""
13 | redis_key = None
14 | redis_batch_size = None
15 | redis_encoding = None
16 |
17 | # Redis client placeholder.
18 | server = None
19 |
20 | def start_requests(self):
21 | """Returns a batch of start requests from redis."""
22 | return self.next_requests()
23 |
24 | def setup_redis(self, crawler=None):
25 | """Setup redis connection and idle signal.
26 |
27 | This should be called after the spider has set its crawler object.
28 | """
29 | if self.server is not None:
30 | return
31 |
32 | if crawler is None:
33 | # We allow optional crawler argument to keep backwards
34 | # compatibility.
35 | # XXX: Raise a deprecation warning.
36 | crawler = getattr(self, 'crawler', None)
37 |
38 | if crawler is None:
39 | raise ValueError("crawler is required")
40 |
41 | settings = crawler.settings
42 |
43 | if self.redis_key is None:
44 | self.redis_key = settings.get(
45 | 'REDIS_START_URLS_KEY', defaults.START_URLS_KEY,
46 | )
47 |
48 | self.redis_key = self.redis_key % {'name': self.name}
49 |
50 | if not self.redis_key.strip():
51 | raise ValueError("redis_key must not be empty")
52 |
53 | if self.redis_batch_size is None:
54 | # TODO: Deprecate this setting (REDIS_START_URLS_BATCH_SIZE).
55 | self.redis_batch_size = settings.getint(
56 | 'REDIS_START_URLS_BATCH_SIZE',
57 | settings.getint('CONCURRENT_REQUESTS'),
58 | )
59 |
60 | try:
61 | self.redis_batch_size = int(self.redis_batch_size)
62 | except (TypeError, ValueError):
63 | raise ValueError("redis_batch_size must be an integer")
64 |
65 | if self.redis_encoding is None:
66 | self.redis_encoding = settings.get('REDIS_ENCODING', defaults.REDIS_ENCODING)
67 |
68 | self.logger.info("Reading start URLs from redis key '%(redis_key)s' "
69 | "(batch size: %(redis_batch_size)s, encoding: %(redis_encoding)s",
70 | self.__dict__)
71 |
72 | self.server = connection.from_settings(crawler.settings)
73 | # The idle signal is called when the spider has no requests left,
74 | # that's when we will schedule new requests from redis queue
75 | crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)
76 |
77 | def next_requests(self):
78 | """Returns a request to be scheduled or none."""
79 | use_set = self.settings.getbool('REDIS_START_URLS_AS_SET', defaults.START_URLS_AS_SET)
80 | fetch_one = self.server.spop if use_set else self.server.lpop
81 | # XXX: Do we need to use a timeout here?
82 | found = 0
83 | # TODO: Use redis pipeline execution.
84 | while found < self.redis_batch_size:
85 | data = fetch_one(self.redis_key)
86 | if not data:
87 | # Queue empty.
88 | break
89 | req = self.make_request_from_data(data)
90 | if req:
91 | yield req
92 | found += 1
93 | else:
94 | self.logger.debug("Request not made from data: %r", data)
95 |
96 | if found:
97 | self.logger.debug("Read %s requests from '%s'", found, self.redis_key)
98 |
99 | def make_request_from_data(self, data):
100 | """Returns a Request instance from data coming from Redis.
101 |
102 | By default, ``data`` is an encoded URL. You can override this method to
103 | provide your own message decoding.
104 |
105 | Parameters
106 | ----------
107 | data : bytes
108 | Message from redis.
109 |
110 | """
111 | url = bytes_to_str(data, self.redis_encoding)
112 | return self.make_requests_from_url(url)
113 |
114 | def schedule_next_requests(self):
115 | """Schedules a request if available"""
116 | # TODO: While there is capacity, schedule a batch of redis requests.
117 | for req in self.next_requests():
118 | self.crawler.engine.crawl(req, spider=self)
119 |
120 | def spider_idle(self):
121 | """Schedules a request if available, otherwise waits."""
122 | # XXX: Handle a sentinel to close the spider.
123 | self.schedule_next_requests()
124 | raise DontCloseSpider
125 |
126 |
127 | class RedisSpider(RedisMixin, Spider):
128 | """Spider that reads urls from redis queue when idle.
129 |
130 | Attributes
131 | ----------
132 | redis_key : str (default: REDIS_START_URLS_KEY)
133 | Redis key where to fetch start URLs from..
134 | redis_batch_size : int (default: CONCURRENT_REQUESTS)
135 | Number of messages to fetch from redis on each attempt.
136 | redis_encoding : str (default: REDIS_ENCODING)
137 | Encoding to use when decoding messages from redis queue.
138 |
139 | Settings
140 | --------
141 | REDIS_START_URLS_KEY : str (default: ":start_urls")
142 | Default Redis key where to fetch start URLs from..
143 | REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS)
144 | Default number of messages to fetch from redis on each attempt.
145 | REDIS_START_URLS_AS_SET : bool (default: False)
146 | Use SET operations to retrieve messages from the redis queue. If False,
147 | the messages are retrieve using the LPOP command.
148 | REDIS_ENCODING : str (default: "utf-8")
149 | Default encoding to use when decoding messages from redis queue.
150 |
151 | """
152 |
153 | @classmethod
154 | def from_crawler(self, crawler, *args, **kwargs):
155 | obj = super(RedisSpider, self).from_crawler(crawler, *args, **kwargs)
156 | obj.setup_redis(crawler)
157 | return obj
158 |
159 |
160 | class RedisCrawlSpider(RedisMixin, CrawlSpider):
161 | """Spider that reads urls from redis queue when idle.
162 |
163 | Attributes
164 | ----------
165 | redis_key : str (default: REDIS_START_URLS_KEY)
166 | Redis key where to fetch start URLs from..
167 | redis_batch_size : int (default: CONCURRENT_REQUESTS)
168 | Number of messages to fetch from redis on each attempt.
169 | redis_encoding : str (default: REDIS_ENCODING)
170 | Encoding to use when decoding messages from redis queue.
171 |
172 | Settings
173 | --------
174 | REDIS_START_URLS_KEY : str (default: ":start_urls")
175 | Default Redis key where to fetch start URLs from..
176 | REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS)
177 | Default number of messages to fetch from redis on each attempt.
178 | REDIS_START_URLS_AS_SET : bool (default: True)
179 | Use SET operations to retrieve messages from the redis queue.
180 | REDIS_ENCODING : str (default: "utf-8")
181 | Default encoding to use when decoding messages from redis queue.
182 |
183 | """
184 |
185 | @classmethod
186 | def from_crawler(self, crawler, *args, **kwargs):
187 | obj = super(RedisCrawlSpider, self).from_crawler(crawler, *args, **kwargs)
188 | obj.setup_redis(crawler)
189 | return obj
190 | # The spider itself changes very little. setup_redis() binds the spider_idle signal through the crawler's connect
191 | # interface and opens the redis connection when the spider is initialized; next_requests() then pops start URLs from
192 | # redis using the key configured in settings (note that this start-URL pool is not the same thing as the scheduling
193 | # queue above: the queue holds requests being scheduled, the start-URL pool holds entry URLs, and both live in redis under different keys, like two different tables).
194 | # A handful of start URLs fans out into many new URLs, which go to the scheduler for deduplication and scheduling; when the scheduling pool runs dry, the spider_idle signal fires, next_requests() runs again and another batch of start URLs is read from redis.
195 |
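# Illustration (assumed spider name; the default key format is '%(name)s:start_urls'): seed the start-URL pool with redis-py.
#   import redis
#   redis.StrictRedis().lpush('myspider:start_urls', 'https://example.com/page1')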
--------------------------------------------------------------------------------
/Spider/ch_Distributedcrawler/picklecompat.py:
--------------------------------------------------------------------------------
1 | # In Python, the pickle module is the usual way to serialize Python objects; on Python 2, cPickle offers a faster drop-in interface.
2 | """A pickle wrapper module with protocol=-1 by default."""
3 |
4 | try:
5 | import cPickle as pickle # PY2
6 | except ImportError:
7 | import pickle
8 |
9 |
10 | def loads(s):
11 | return pickle.loads(s)
12 |
13 |
14 | def dumps(obj):
15 | return pickle.dumps(obj, protocol=-1)
16 | # The loads/dumps functions above simply turn this module into a serializer.
17 | # Redis cannot store arbitrary objects (keys must be strings; values can only be strings, lists, sets or hashes of strings), so everything has to be serialized before it is stored.
18 | # Python's pickle module, available on both py2 and py3, is used for that; this serializer is mainly used by the scheduler to store request objects, for example:
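# Illustration: round-trip the kind of dict the scheduler stores for each request.
#   data = dumps({'url': 'http://example.com', 'method': 'GET'})
#   assert loads(data) == {'url': 'http://example.com', 'method': 'GET'}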
--------------------------------------------------------------------------------
/Spider/ch_Distributedcrawler/readme.md:
--------------------------------------------------------------------------------
1 | ## Distributed crawling
2 |
3 | #### Advantages of a distributed crawler
4 |
5 | 1. Make full use of the bandwidth of several machines to speed up crawling
6 | 2. Make full use of the IP addresses of several machines to speed up crawling
7 |
8 | #### Problems a distributed crawler must solve
9 |
10 | 1. Centralized management of the request queue
11 | 2. Centralized management of deduplication
12 |
13 | #### Scrapy itself does not support distributed crawling; combining it with redis's strengths naturally led to scrapy-redis. Below is a walkthrough of the scrapy-redis core code.
14 | - [Spider](./Spider.py)
15 | - [Scheduler](./Scheduler.py)
16 | - [Queue](./Queue.py)
17 | - [Pipelines](./Pipelines.py)
18 | - [picklecompat](./picklecompat.py)
19 | - [Duperfilter](./Duperfilter.py)
20 | - [Connection](./Connection.py)
21 |
22 | I won't write up a distributed crawling case study here, since plenty are a quick search away. Sina (Weibo), for example, has a fairly complex page structure and annoying anti-crawling measures: throttle the request interval, and you also need cookies. Getting them statically is easy enough; for the dynamic approach, see my blog post.
23 | > [**Getting cookies dynamically**](https://blog.csdn.net/fenglei0415/article/details/81865379)
24 |
25 | ```md
26 | On each slave, copy the same code and run the same command, scrapy runspider demo.py; the spiders then wait.
27 | On the master, install redis and run lpush <spidername>:start_urls <start_url> from the redis client. That's it.
28 | ```
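
A minimal `settings.py` sketch for wiring a project to scrapy-redis (values are examples; adjust host/port and pipeline priority to your setup):

```py
# Use the redis-backed scheduler and dupefilter instead of scrapy's defaults.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue and dupefilter between runs

# Optionally push scraped items into redis for later processing.
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 300}

REDIS_URL = "redis://localhost:6379"
```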
29 | ---
30 | To wrap up, the overall idea of scrapy-redis:
31 | 1. The project rewrites the scheduler and spider classes so that scheduling, spider start-up and redis are wired together.
32 | 2. New dupefilter and queue classes connect deduplication and the scheduling container to redis. Because every crawler process on every host talks to the same redis database, scheduling and deduplication are managed in one place, which is the whole point of a distributed crawler.
33 | 3. When a spider is initialized, a matching scheduler object is initialized too; it reads the settings and configures its scheduling container (queue) and its deduplication tool (dupefilter).
34 | 4. Whenever a spider produces a request, the scrapy engine hands it to that spider's scheduler, which checks redis for duplicates; if the request is new it is added to the scheduling pool in redis. When the scheduling conditions are met, the scheduler pops a request from the redis pool and sends it to the spider to crawl.
35 | 5. When the spider has crawled all currently available URLs and the scheduler finds that this spider's redis scheduling pool is empty, the spider_idle signal fires; the spider then reads the start-URL pool in redis directly, grabs a new batch of entry URLs, and the whole cycle repeats.
36 |
37 | #### Common redis commands:
38 | ```md
39 | # Strings
40 | set key value
41 | get key
42 | getrange key start end
43 | strlen key
44 | incr/decr key
45 | append key value
46 | ```
47 | ```md
48 | # Hashes
49 | hset key field value
50 | hget key field
51 | hexists key field
52 | hdel key field
53 | hkeys key
54 | hvals key
55 | ```
56 | ```md
57 | # Lists
58 | lpush/rpush mylist value
59 | lrange mylist 0 10
60 | blpop/brpop key1 timeout
61 | lpop/rpop key
62 | llen key
63 | lindex key index
64 | ```
65 | ```md
66 | # Sets
67 | sadd myset value
68 | scard key
69 | sdiff key1 key2
70 | sinter key1 key2
71 | spop key
72 | sismember key member
73 | ```
74 | ```md
75 | # Sorted sets
76 | zadd myset 0 value
77 | zrangebyscore myset 0 100
78 | zcount key min max
79 | ```
80 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/.idea/Spider.iml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/.idea/misc.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/.idea/modules.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/.idea/vcs.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/.idea/workspace.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/Spider/Spider/__init__.py
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/__pycache__/__init__.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/Spider/Spider/__pycache__/__init__.cpython-35.pyc
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/__pycache__/items.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/Spider/Spider/__pycache__/items.cpython-35.pyc
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/__pycache__/pipelines.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/Spider/Spider/__pycache__/pipelines.cpython-35.pyc
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/__pycache__/settings.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/Spider/Spider/__pycache__/settings.cpython-35.pyc
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/items.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define here the models for your scraped items
4 | #
5 | # See documentation in:
6 | # https://doc.scrapy.org/en/latest/topics/items.html
7 |
8 | import scrapy
9 |
10 |
11 | class SpiderItem(scrapy.Item):
12 | # define the fields for your item here like:
13 | # name = scrapy.Field()
14 | pass
15 |
16 |
17 | class HaiwangItem(scrapy.Item):
18 | # define the fields for your item here like:
19 | # name = scrapy.Field()
20 | nickName = scrapy.Field() # nickname
21 | cityName = scrapy.Field() # city
22 | content = scrapy.Field() # comment text
23 | score = scrapy.Field() # rating
24 | startTime = scrapy.Field() # comment time
25 | approve = scrapy.Field() #
26 | reply = scrapy.Field() # reply num
27 | avatarurl = scrapy.Field() # image url
28 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/middlewares.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define here the models for your spider middleware
4 | #
5 | # See documentation in:
6 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
7 |
8 | from scrapy import signals
9 | from fake_useragent import UserAgent
10 | from .tools.xici_ip import GetIp
11 |
12 |
13 | class RandomUserAgentMiddleware(object):
14 | # Rotate the User-Agent header randomly
15 | def __init__(self, crawler):
16 | super(RandomUserAgentMiddleware, self).__init__()
17 | self.ua = UserAgent()
18 | self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")
19 |
20 | @classmethod
21 | def from_crawler(cls, crawler):
22 | return cls(crawler)
23 |
24 | def process_request(self, request, spider):
25 | def get_ua():
26 | return getattr(self.ua, self.ua_type)
27 | request.headers.setdefault('User-Agent', get_ua())
28 |
29 |
30 | class RandomProxyMiddleware(object):
31 | # Set a random proxy IP for each request
32 | def process_request(self, request, spider):
33 | get_ip = GetIp()
34 | request.meta['proxy'] = get_ip.get_random_ip()
35 |
36 |
37 | class SpiderSpiderMiddleware(object):
38 | # Not all methods need to be defined. If a method is not defined,
39 | # scrapy acts as if the spider middleware does not modify the
40 | # passed objects.
41 |
42 | @classmethod
43 | def from_crawler(cls, crawler):
44 | # This method is used by Scrapy to create your spiders.
45 | s = cls()
46 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
47 | return s
48 |
49 | def process_spider_input(self, response, spider):
50 | # Called for each response that goes through the spider
51 | # middleware and into the spider.
52 |
53 | # Should return None or raise an exception.
54 | return None
55 |
56 | def process_spider_output(self, response, result, spider):
57 | # Called with the results returned from the Spider, after
58 | # it has processed the response.
59 |
60 | # Must return an iterable of Request, dict or Item objects.
61 | for i in result:
62 | yield i
63 |
64 | def process_spider_exception(self, response, exception, spider):
65 | # Called when a spider or process_spider_input() method
66 | # (from other spider middleware) raises an exception.
67 |
68 | # Should return either None or an iterable of Response, dict
69 | # or Item objects.
70 | pass
71 |
72 | def process_start_requests(self, start_requests, spider):
73 | # Called with the start requests of the spider, and works
74 | # similarly to the process_spider_output() method, except
75 | # that it doesn’t have a response associated.
76 |
77 | # Must return only requests (not items).
78 | for r in start_requests:
79 | yield r
80 |
81 | def spider_opened(self, spider):
82 | spider.logger.info('Spider opened: %s' % spider.name)
83 |
84 |
85 | class SpiderDownloaderMiddleware(object):
86 | # Not all methods need to be defined. If a method is not defined,
87 | # scrapy acts as if the downloader middleware does not modify the
88 | # passed objects.
89 |
90 | @classmethod
91 | def from_crawler(cls, crawler):
92 | # This method is used by Scrapy to create your spiders.
93 | s = cls()
94 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
95 | return s
96 |
97 | def process_request(self, request, spider):
98 | # Called for each request that goes through the downloader
99 | # middleware.
100 |
101 | # Must either:
102 | # - return None: continue processing this request
103 | # - or return a Response object
104 | # - or return a Request object
105 | # - or raise IgnoreRequest: process_exception() methods of
106 | # installed downloader middleware will be called
107 | return None
108 |
109 | def process_response(self, request, response, spider):
110 | # Called with the response returned from the downloader.
111 |
112 | # Must either;
113 | # - return a Response object
114 | # - return a Request object
115 | # - or raise IgnoreRequest
116 | return response
117 |
118 | def process_exception(self, request, exception, spider):
119 | # Called when a download handler or a process_request()
120 | # (from other downloader middleware) raises an exception.
121 |
122 | # Must either:
123 | # - return None: continue processing this exception
124 | # - return a Response object: stops process_exception() chain
125 | # - return a Request object: stops process_exception() chain
126 | pass
127 |
128 | def spider_opened(self, spider):
129 | spider.logger.info('Spider opened: %s' % spider.name)
130 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/pipelines.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define your item pipelines here
4 | #
5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting
6 | # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
7 | import csv
8 | import os
9 | import pymongo
10 | from Spider.settings import MONGO_DBNAME, MONGO_HOST, MONGO_PORT, MONGO_SHEETNAME
11 |
12 |
13 | class SpiderPipeline(object):
14 | def process_item(self, item, spider):
15 | return item
16 |
17 |
18 | class HaiwangPipeline(object):
19 | def __init__(self):
20 | store_file = os.path.dirname(__file__) + '/spiders/haiwang.csv'
21 | print(store_file)
22 | self.file = open(store_file, "a+", newline="", encoding="utf-8")
23 | self.writer = csv.writer(self.file)
24 |
25 | def process_item(self, item, spider):
26 | try:
27 | self.writer.writerow((
28 | item["nickName"],
29 | item["cityName"],
30 | item["content"],
31 | item["approve"],
32 | item["reply"],
33 | item["startTime"],
34 | item["avatarurl"],
35 | item["score"]
36 | ))
37 |
38 | except Exception as e:
39 | print(e.args)
40 |
41 | def close_spider(self, spider):
42 | self.file.close()
43 |
44 |
45 | class MongoPipline(object):
46 | def __init__(self):
47 | host = MONGO_HOST
48 | port = MONGO_PORT
49 | dbname = MONGO_DBNAME
50 | sheetname = MONGO_SHEETNAME
51 |
52 | client = pymongo.MongoClient(host=host, port=port)
53 | # 得到数据库对象
54 | mydb = client[dbname]
55 | # 得到表对象
56 | self.table = mydb[sheetname]
57 |
58 | def process_item(self, item, spider):
59 | dict_item = dict(item)
60 |         self.table.insert_one(dict_item)  # pymongo 3.x 中 insert 已弃用,改用 insert_one
61 | return item
62 |
63 | def close_spider(self, spider):
64 | print("close spider....")
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/settings.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Scrapy settings for Spider project
4 | #
5 | # For simplicity, this file contains only settings considered important or
6 | # commonly used. You can find more settings consulting the documentation:
7 | #
8 | # https://doc.scrapy.org/en/latest/topics/settings.html
9 | # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
10 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
11 |
12 | BOT_NAME = 'Spider'
13 |
14 | SPIDER_MODULES = ['Spider.spiders']
15 | NEWSPIDER_MODULE = 'Spider.spiders'
16 |
17 |
18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent
19 | # USER_AGENT = 'Spider (+http://www.yourdomain.com)'
20 |
21 | # Obey robots.txt rules
22 | ROBOTSTXT_OBEY = False
23 |
24 | # Configure maximum concurrent requests performed by Scrapy (default: 16)
25 | # CONCURRENT_REQUESTS = 32
26 |
27 | # Configure a delay for requests for the same website (default: 0)
28 | # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
29 | # See also autothrottle settings and docs
30 | DOWNLOAD_DELAY = 1
31 | # The download delay setting will honor only one of:
32 | # CONCURRENT_REQUESTS_PER_DOMAIN = 16
33 | # CONCURRENT_REQUESTS_PER_IP = 16
34 |
35 | # Disable cookies (enabled by default)
36 | COOKIES_ENABLED = False
37 |
38 | # Disable Telnet Console (enabled by default)
39 | # TELNETCONSOLE_ENABLED = False
40 |
41 | # Override the default request headers:
42 | DEFAULT_REQUEST_HEADERS = {
43 | "Referer": "http://m.maoyan.com/movie/249342/comments?_v_=yes",
44 | "User-Agent": "Mozilla/5.0 Chrome/63.0.3239.26 Mobile Safari/537.36",
45 | "X-Requested-With": "superagent"
46 | }
47 |
48 | # fake_useragent插件维护了大量的user-agent, 可以自行选择ie/Firefox/Chrome...
49 | RANDOM_UA_TYPE = "random"
50 |
51 | # Enable or disable spider middlewares
52 | # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
53 | # SPIDER_MIDDLEWARES = {
54 | # 'Spider.middlewares.SpiderSpiderMiddleware': 543,
55 | # }
56 |
57 | # Enable or disable downloader middlewares
58 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
59 | DOWNLOADER_MIDDLEWARES = {
60 | # 'Spider.middlewares.SpiderDownloaderMiddleware': 543,
61 |     # 加入自定义中间件
62 | 'Spider.middlewares.RandomUserAgentMiddleware': 543,
63 | 'Spider.middlewares.RandomProxyMiddleware': 544,
64 | # 关闭内置UserAgentMiddleware
65 | 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
66 | }
67 |
68 | # Enable or disable extensions
69 | # See https://doc.scrapy.org/en/latest/topics/extensions.html
70 | # EXTENSIONS = {
71 | # 'scrapy.extensions.telnet.TelnetConsole': None,
72 | # }
73 |
74 | # Configure item pipelines
75 | # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
76 | ITEM_PIPELINES = {
77 | # 'Spider.pipelines.SpiderPipeline': 300,
78 | 'Spider.pipelines.MongoPipline': 299,
79 | 'Spider.pipelines.HaiwangPipeline': 301,
80 | }
81 | MONGO_HOST = '127.0.0.1' # ip
82 | MONGO_PORT = 27017 # port
83 | MONGO_DBNAME = 'movie' # db name
84 | MONGO_SHEETNAME = 'Haiwang' # table name
85 |
86 | # Enable and configure the AutoThrottle extension (disabled by default)
87 | # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
88 | # AUTOTHROTTLE_ENABLED = True
89 | # The initial download delay
90 | # AUTOTHROTTLE_START_DELAY = 5
91 | # The maximum download delay to be set in case of high latencies
92 | # AUTOTHROTTLE_MAX_DELAY = 60
93 | # The average number of requests Scrapy should be sending in parallel to
94 | # each remote server
95 | # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
96 | # Enable showing throttling stats for every response received:
97 | # AUTOTHROTTLE_DEBUG = False
98 |
99 | # Enable and configure HTTP caching (disabled by default)
100 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
101 | # HTTPCACHE_ENABLED = True
102 | # HTTPCACHE_EXPIRATION_SECS = 0
103 | # HTTPCACHE_DIR = 'httpcache'
104 | # HTTPCACHE_IGNORE_HTTP_CODES = []
105 | # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
106 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/spiders/Haiwang.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import scrapy
3 | import json
4 |
5 | from Spider.items import HaiwangItem
6 |
7 |
8 | class HaiwangSpider(scrapy.Spider):
9 | name = 'Haiwang'
10 | allowed_domains = ['m.maoyan.com']
11 | start_urls = ['http://m.maoyan.com/mmdb/comments/movie/249342.json?_v_=yes&offset=0&startTime=0']
12 |
13 | def parse(self, response):
14 | print(response.url)
15 | data = json.loads(response.text)
16 | print(data)
17 | item = HaiwangItem()
18 | for info in data["cmts"]:
19 | item["nickName"] = info["nickName"]
20 | item["cityName"] = info["cityName"] if "cityName" in info else ""
21 | item["content"] = info["content"]
22 | item["score"] = info["score"]
23 | item["startTime"] = info["startTime"]
24 | item["approve"] = info["approve"]
25 | item["reply"] = info["reply"]
26 | item["avatarurl"] = info["avatarurl"]
27 | print(item)
28 | yield item
29 |
30 | yield scrapy.Request("http://m.maoyan.com/mmdb/comments/movie/249342.json?_v_=yes&offset=0&startTime={}".
31 | format(item["startTime"]), callback=self.parse)
32 |
33 |
34 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/spiders/__pycache__/Haiwang.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/Spider/Spider/spiders/__pycache__/Haiwang.cpython-35.pyc
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/spiders/__pycache__/__init__.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/Spider/Spider/spiders/__pycache__/__init__.cpython-35.pyc
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/tools/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/Spider/Spider/tools/__init__.py
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/Spider/tools/xici_ip.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from scrapy.selector import Selector
3 | import pymysql
4 |
5 | conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="ip_spider", charset="utf8")
6 | cursor = conn.cursor()
7 |
8 |
9 | def crawl_ips():
10 | # 爬取西刺ip代理
11 | headers = {"User-Agent": "Mozilla/5.0 Chrome/63.0.3239.26 Mobile Safari/537.36"}
12 | for i in range(1568):
13 | re = requests.get("http://www.xicidaili.com/nn/{}".format(i), headers=headers)
14 |
15 | selector = Selector(text=re.text)
16 | all_trs = selector.css("#ip_list tr")
17 | ip_list = []
18 | speed = None
19 | for tr in all_trs[1:]:
20 | speed_str = tr.css(".bar::attr(title)").extract()[0]
21 | if speed_str:
22 | speed = float(speed_str.split("秒")[0])
23 | all_texts = tr.css("td::text").extract()
24 |
25 | ip = all_texts[0]
26 | port = all_texts[1]
27 | proxy_type = all_texts[5]
28 | ip_list.append((ip, port, proxy_type, speed))
29 |
30 | for ip_info in ip_list:
31 | cursor.execute(
32 | "insert proxy_ip(ip, port, speed, proxy_type) VALUES('{0}', '{1}', {2}, 'HTTP')".format(
33 | ip_info[0], ip_info[1], ip_info[3]
34 | )
35 | )
36 | conn.commit()
37 |
38 |
39 | class GetIp(object):
40 | def judge_ip(self, ip, port):
41 | http_url = "http://baidu.com"
42 | proxy_url = "http://{0}:{1}".format(ip, port)
43 | try:
44 | proxy_dict = {
45 | "http:": proxy_url
46 | }
47 | response = requests.get(http_url, proxies=proxy_dict)
48 |
49 | except Exception as e:
50 | print("invalid ip and port")
51 | self.delete_ip(ip)
52 | return False
53 | else:
54 | code = response.status_code
55 | if 300 > code >= 200:
56 | print("effective ip")
57 | return True
58 | else:
59 | print("invalid ip and port")
60 | self.delete_ip(ip)
61 |
62 | def delete_ip(self, ip):
63 | # mysql中删除无效的ip
64 | sql = """delete from proxy_ip where ip='{0}'""".format(ip)
65 | cursor.execute(sql)
66 | conn.commit()
67 | return True
68 |
69 | def get_random_ip(self):
70 | # 随机取出一个可用ip
71 | sql = """SELECT ip, port FROM proxy_ip ORDER BY RAND() LIMIT 1"""
72 | result = cursor.execute(sql)
73 | for ip_info in cursor.fetchall():
74 | ip = ip_info[0]
75 | port = ip_info[1]
76 | judge_re = self.judge_ip(ip, port)
77 | if judge_re:
78 | return "http://{0}:{1}".format(ip, port)
79 | else:
80 | return self.get_random_ip()
81 |
82 |
83 | if __name__ == '__main__':
84 | IP = GetIp()
85 | IP.get_random_ip()
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/Spider/__init__.py
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/run.py:
--------------------------------------------------------------------------------
1 | from scrapy import cmdline
2 |
3 | cmdline.execute("scrapy crawl Haiwang".split())
4 |
5 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/Spider/scrapy.cfg:
--------------------------------------------------------------------------------
1 | # Automatically created by: scrapy startproject
2 | #
3 | # For more information about the [deploy] section see:
4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html
5 |
6 | [settings]
7 | default = Spider.settings
8 |
9 | [deploy]
10 | #url = http://localhost:6800/
11 | project = Spider
12 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/analysis.py:
--------------------------------------------------------------------------------
1 | import json
2 | from collections import Counter
3 | import pandas as pd
4 | import numpy as np
5 | import seaborn as sns
6 | import jieba.analyse
7 | import matplotlib.pyplot as plt
8 | from matplotlib import font_manager as fm
9 | from pyecharts import WordCloud, Style, Geo
10 |
11 |
12 | def get_data():
13 | df = pd.read_csv("haiwang.csv", sep=",", header=None, names=["nickName","cityName","content","approve","reply","startTime","avatarurl","score"],encoding="utf-8")
14 | print(df.columns)
15 | return df
16 |
17 |
18 | # 清洗数据
19 | def clean_data():
20 | df = get_data()
21 | has_copy = any(df.duplicated())
22 | print(has_copy)
23 | data_duplicated = df.duplicated().value_counts()
24 | print(data_duplicated) # 查看有多少数据是重复的
25 | """
26 | >>>False 59900
27 | True 4450
28 | dtype: int64
29 | """
30 | data = df.drop_duplicates(keep="first") # 删掉重复值, first保留最先的
31 | data = data.reset_index(drop=True) # 删除部分行后重置索引
32 | data["startTime"] = pd.to_datetime(data["startTime"]) # dtype: object --> dtype: datetime64[ns]
33 | data["content_length"] = data["content"].apply(len) # 增加一列
34 | data = data[~data['nickName'].isin(["."])]
35 | return data
36 |
37 |
38 | # 查看数据基本情况
39 | def analysis1():
40 | data = clean_data()
41 | print(data.describe())
42 | print(data.isnull().any()) # 判断空值 cityName中存在空值
43 | print(len(data[data.nickName == "."])) # 38个nickName是 '.'
44 | # 删除.
45 | # data = data[~(data['nickName'] == ".")]
46 | # 和上面等价
47 | data = data[~data['nickName'].isin(["."])]
48 | print(data.head())
49 | print(data['nickName'].describe())
50 | print(data['cityName'].describe())
51 | return data
52 |
53 |
54 | def analysis2():
55 | """
56 | 分析打分score情况
57 | 饼状图基本参数都凑齐了
58 | """
59 | data = clean_data()
60 | grouped = data.groupby(by="score")["nickName"].size().tail(8)
61 | grouped = grouped.sort_values(ascending=False)
62 | index = grouped.index
63 | values = grouped.values
64 |
65 |     plt.subplots(figsize=(10, 7))  # 设置绘图区域大小
66 |     # 将横、纵坐标轴标准化处理,保证饼图是一个正圆,否则为椭圆
67 |     plt.axes(aspect='equal')
68 | # 控制x轴和y轴的范围
69 | plt.xlim(0, 4)
70 | plt.ylim(0, 4)
71 | # 绘制饼图
72 | patches, texts, autotexts = plt.pie(x=index, # 绘图数据
73 | labels=values, # 添加label
74 | explode=[0.1, 0, 0, 0, 0, 0, 0, 0],
75 | autopct='%.1f%%', # 设置百分比的格式,这里保留一位小数
76 | pctdistance=1.2, # 设置百分比标签与圆心的距离
77 | labeldistance=0.8, # 设置value与圆心的距离
78 | startangle=90, # 设置饼图的初始角度
79 | radius=1.5, # 设置饼图的半径
80 | shadow=True, # 添加阴影
81 |                                         counterclock=True,  # 是否逆时针,这里设置为逆时针方向
82 | wedgeprops={'linewidth': 1.5, 'edgecolor': 'green'}, # 设置饼图内外边界的属性值
83 | textprops={'fontsize': 12, 'color': 'k'}, # 设置文本标签的属性值
84 | center=(1.8, 1.8), # 设置饼图的原点
85 |                                         frame=0)  # 是否显示饼图的图框,这里设置不显示
86 | # 重新设置字体大小
87 | proptease = fm.FontProperties()
88 | proptease.set_size('small')
89 | plt.setp(autotexts, fontproperties=proptease)
90 | plt.setp(texts, fontproperties=proptease)
91 |
92 | # 删除x轴和y轴的刻度
93 | plt.xticks(())
94 | plt.yticks(())
95 | plt.legend()
96 | # 显示图形
97 | plt.savefig('pie2.png')
98 | plt.show()
99 |
100 |
101 | def analysis3():
102 | """分析评论时间"""
103 | data = clean_data()
104 | data["hour"] = data["startTime"].dt.hour # 提取小时
105 | data["startTime"] = data["startTime"].dt.date # 提取日期
106 | need_date = data[["startTime", "hour"]]
107 |
108 | def get_hour_size(data):
109 | hour_data = data.groupby(by="hour")["hour"].size().reset_index(name="count")
110 | return hour_data
111 |
112 | data = need_date.groupby(by="startTime").apply(get_hour_size)
113 | # print(data)
114 | data_reshape = data.pivot_table(index="startTime", columns="hour", values="count")[1:-2]
115 | data = data_reshape.describe()
116 | print(data)
117 | data_mean = data.loc["mean"] # 均值
118 |     data_std = data.loc["std"]  # 标准差
119 | data_min = data.loc["min"] # min
120 | data_max = data.loc["max"] # max
121 |
122 | # 坐标轴负号的处理
123 | plt.rcParams['axes.unicode_minus'] = False
124 |
125 | plt.title("24h count")
126 | plt.plot(data_mean.index, data_mean, color="green", label="mean")
127 | plt.plot(data_std.index, data_std, color="red", label="std")
128 | plt.plot(data_min.index, data_min, color="blue", label="min")
129 | plt.plot(data_max.index, data_max, color="yellow", label="max")
130 | plt.legend()
131 | plt.xlabel("one day time")
132 | plt.ylabel("pub sum")
133 | plt.savefig('chart_line.png')
134 | plt.show()
135 |
136 |
137 | def analysis4():
138 | data = clean_data()
139 | contents = list(data["content"].values)
140 | try:
141 | jieba.analyse.set_stop_words('stop_words.txt')
142 | tags = jieba.analyse.extract_tags(str(contents), topK=100, withWeight=True)
143 | name = []
144 | value = []
145 | for v, n in tags:
146 | # [('好看', 0.5783566110162118), ('特效', 0.2966753295335903), ('不错', 0.22288265823188907),...]
147 | name.append(v)
148 | value.append(int(n * 10000))
149 | wordcloud = WordCloud(width=1300, height=620)
150 | wordcloud.add("", name, value, word_size_range=[20, 100])
151 | wordcloud.render()
152 | except Exception as e:
153 | print(e)
154 |
155 |
156 | def handle(cities):
157 | """处理地名数据,解决坐标文件中找不到地名的问题"""
158 | with open(
159 | 'city_coordinates.json',
160 | mode='r', encoding='utf-8') as f:
161 | data = json.loads(f.read()) # 将str转换为json
162 |
163 | # 循环判断处理
164 | data_new = data.copy() # 拷贝所有地名数据
165 | for city in set(cities): # 使用set去重
166 | # 处理地名为空的数据
167 | if city == '':
168 | while city in cities:
169 | cities.remove(city)
170 | count = 0
171 | for k in data.keys():
172 | count += 1
173 | if k == city:
174 | break
175 | if k.startswith(city):
176 | # print(k, city)
177 | data_new[city] = data[k]
178 | break
179 | if k.startswith(city[0:-1]) and len(city) >= 3:
180 | data_new[city] = data[k]
181 | break
182 | # 处理不存在的地名
183 | if count == len(data):
184 | while city in cities:
185 | cities.remove(city)
186 |
187 | # 写入覆盖坐标文件
188 | with open(
189 | 'city_coordinates.json',
190 | mode='w', encoding='utf-8') as f:
191 | f.write(json.dumps(data_new, ensure_ascii=False)) # 将json转换为str
192 |
193 |
194 | def analysis5():
195 | data = clean_data()
196 | cities = list(data[~data["cityName"].isnull()]["cityName"].values)
197 | handle(cities)
198 |
199 | style = Style(
200 | title_color='#fff',
201 | title_pos='center right',
202 | width=1200,
203 | height=600,
204 | background_color='#404a59'
205 | )
206 |
207 | new_cities = Counter(cities).most_common(100)
208 | geo = Geo("《海王》粉丝分布", "数据来源:Github-fenglei110", **style.init_style)
209 | attr, value = geo.cast(new_cities)
210 | geo.add('', attr, value, visual_range=[0, 3000], visual_text_color='#fff', symbol_size=15,is_visualmap=True, is_piecewise=True, visual_split_number=10)
211 | geo.render('粉丝位置分布-GEO.html')
212 |
213 |
214 | if __name__ == '__main__':
215 | analysis3()
216 |
217 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/images/1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/images/1.png
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/images/2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/images/2.png
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/images/chart_line.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/images/chart_line.png
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/images/ciyun.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/images/ciyun.png
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/images/csv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/images/csv.png
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/images/geo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/images/geo.png
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/images/list.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/images/list.png
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/images/mongo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/Spider/ch_Haiwang/images/mongo.png
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/pip.txt:
--------------------------------------------------------------------------------
1 | Package Version
2 | --------------------------- -------
3 | asn1crypto 0.24.0
4 | attrs 18.2.0
5 | Automat 0.7.0
6 | cffi 1.11.5
7 | constantly 15.1.0
8 | cryptography 2.4.2
9 | cssselect 1.0.3
10 | cycler 0.10.0
11 | dukpy 0.2.2
12 | fake-useragent 0.1.11
13 | future 0.17.1
14 | hyperlink 18.0.0
15 | idna 2.8
16 | incremental 17.5.0
17 | javascripthon 0.10
18 | jieba 0.39
19 | Jinja2 2.10
20 | jupyter-echarts-pypkg 0.1.2
21 | kiwisolver 1.0.1
22 | lml 0.0.2
23 | lxml 4.2.5
24 | macropy3 1.1.0b2
25 | MarkupSafe 1.1.0
26 | matlab 0.1
27 | matplotlib 3.0.2
28 | numpy 1.15.4
29 | pandas 0.23.4
30 | parsel 1.5.1
31 | Pillow 5.4.1
32 | pip 18.1
33 | pkg-resources 0.0.0
34 | pyasn1 0.4.5
35 | pyasn1-modules 0.2.3
36 | pycparser 2.19
37 | PyDispatcher 2.0.5
38 | pyecharts 0.5.11
39 | pyecharts-javascripthon 0.0.6
40 | pyecharts-jupyter-installer 0.0.3
41 | PyHamcrest 1.9.0
42 | pymongo 3.7.2
43 | pyOpenSSL 18.0.0
44 | pyparsing 2.3.0
45 | python-dateutil 2.7.5
46 | pytz 2018.7
47 | queuelib 1.5.0
48 | scipy 1.2.0
49 | Scrapy 1.5.1
50 | seaborn 0.9.0
51 | service-identity 18.1.0
52 | setuptools 40.6.3
53 | six 1.12.0
54 | Twisted 18.9.0
55 | w3lib 1.19.0
56 | wheel 0.32.3
57 | zope.interface 4.6.0
58 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/readme.md:
--------------------------------------------------------------------------------
1 | # 猫眼爬取《海王》的评论信息
2 |
3 | ### 准备工作
4 |
5 | 准备虚拟环境
6 |
7 | 下载Fiddler抓包工具,前提是手机和电脑在同一局域网内,具体教程自行百度吧。
8 |
9 | 从猫眼寻找api规律,获取json数据
10 |
11 | 基本开始的链接为:`url = 'http://m.maoyan.com/mmdb/comments/movie/249342.json?_v_=yes&offset=0&startTime=0'`
12 |
13 |
14 | #### 步骤
15 | ```bash
16 | source venv3.5/bin/activate
17 | scrapy startproject Spider
18 | cd Spider
19 | scrapy genspider Haiwang m.maoyan.com
20 | ```
21 | #### 代码 中间件设置
22 | ##### `User-Agent`动态设置. 这里下载了`fake_useragent`插件, 维护了大量user-agent, 还是不错的.
23 | ##### 维护`IP`代理池, 具体操作查看tools目录. `github`上也有人写了比较完善的`scrapy-proxy`, 思路类似. 个人建议尽量还是使用收费代理, 便宜而且稳定. Scrapy官方团队(Scrapinghub)也提供了收费的`scrapy-crawlera`服务, 更省心.
24 | ```py
25 | class RandomUserAgentMiddleware(object):
26 | # 随机切换user-agent
27 | def __init__(self, crawler):
28 | super(RandomUserAgentMiddleware, self).__init__()
29 | self.ua = UserAgent()
30 | self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")
31 |
32 | @classmethod
33 | def from_crawler(cls, crawler):
34 | return cls(crawler)
35 |
36 | def process_request(self, request, spider):
37 | def get_ua():
38 | return getattr(self.ua, self.ua_type)
39 | request.headers.setdefault('User-Agent', get_ua())
40 |
41 | class RandomProxyMiddleware(object):
42 | # 动态ip设置
43 | def process_request(self, request, spider):
44 | get_ip = GetIp()
45 | request.meta['proxy'] = get_ip.get_random_ip()
46 | ```
47 |
48 | 至于爬取后要怎么处理就看自己爱好了,我是先保存为 csv 文件,然后再汇总插入到mongodb,毕竟文件存储有限。
49 |
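下面是把落地的 csv 汇总插入 MongoDB 的一个简单示意(仅为思路示意,库名、表名按本项目 settings.py 中的配置,字段顺序与 items 保持一致):

```py
import pandas as pd
import pymongo

# 读取爬虫落地的 csv,列名与 items.py 中字段保持一致
df = pd.read_csv("haiwang.csv", header=None,
                 names=["nickName", "cityName", "content", "approve",
                        "reply", "startTime", "avatarurl", "score"])

client = pymongo.MongoClient(host="127.0.0.1", port=27017)
table = client["movie"]["Haiwang"]
# 去重后批量写入,避免逐条插入的开销
table.insert_many(df.drop_duplicates().to_dict("records"))
```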
50 |
51 | #### 评分占比, 看得出来大部分给了5分好评
52 |
53 | 
54 |
55 | #### 评分人数占比
56 |
57 | 
58 |
59 | #### 平均发布时间占比, 绝大多数还是12点左右发布的, 凌晨5点人数最少, 符合常情
60 |
61 | 
62 |
63 | #### jieba制作词云图, 做图具体方法查看 [pyecharts](http://pyecharts.org/#/zh-cn/)
64 |
65 | 
66 |
67 | #### 观众分布
68 |
69 | 
70 |
71 | `为了这几个图煞费苦心, 还是不完美` :+1::+1::+1:
72 |
73 |
74 | 对数据感兴趣的话可以邮箱联系我,共同进步。
75 |
--------------------------------------------------------------------------------
/Spider/ch_Haiwang/stop_words.txt:
--------------------------------------------------------------------------------
1 | the
2 | of
3 | is
4 | and
5 | to
6 | in
7 | that
8 | we
9 | for
10 | an
11 | are
12 | by
13 | be
14 | as
15 | on
16 | with
17 | can
18 | if
19 | from
20 | which
21 | you
22 | it
23 | this
24 | then
25 | at
26 | have
27 | all
28 | not
29 | one
30 | has
31 | or
32 | that
33 | 的
34 | 了
35 | 和
36 | 是
37 | 就
38 | 都
39 | 而
40 | 及
41 | 與
42 | 著
43 | 或
44 | 一個
45 | 沒有
46 | 我們
47 | 你們
48 | 妳們
49 | 他們
50 | 她們
51 | 是否
--------------------------------------------------------------------------------
/Spider/ch_summary/summary01.md:
--------------------------------------------------------------------------------
1 | ## scrapy 如何中断续爬
2 |
3 | 1. scrapy简单易用,效率极高,自带多线程机制。但也正是因为多线程机制导致在用scrapy写爬虫的时候处理断点续爬很恼火。当你用for循环遍历一个网站的所有页面的时候,例如:
4 |
5 | 如果这个网站有10000页,启动scrapy后程序第一个发送的请求可能就是第10000页,然后第二个请求可能就是第1页。当程序跑到一半断掉的时候天知道有哪些页是爬过的。不过针对这个问题也不是没有解决办法。
6 | ```python
7 | totalpages = 10000
8 | for i in range(totalpages):
9 | url = '****page' + str(i)
10 | yield Request(url=url, callback=self.next)
11 | ```
12 | 2. 总之我们要想保证爬取数据的完整就要牺牲程序的效率。
13 |
14 | - 有的人把所有爬取过的url列表保存到一个文件当中,然后再次启动的时候每次爬取要和文件当中的url列表对比,如果相同则不再爬取。
15 | - 有的人在scrapy再次启动爬取的时候和数据库里面的数据做对比,如果相同则不存取。
16 | - 还有一种办法呢就是利用Request中的优先级(priority)
17 |
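以第一种思路为例,给出一个最简单的示意(文件名等均为假设,仅说明思路):
```python
import os

SEEN_FILE = 'seen_urls.txt'   # 已爬 url 的记录文件(文件名是假设的)

def load_seen():
    # 启动时读取已爬过的 url,爬取前先判断是否在集合里
    if not os.path.isfile(SEEN_FILE):
        return set()
    with open(SEEN_FILE, 'r', encoding='utf-8') as f:
        return set(line.strip() for line in f)

def mark_seen(url):
    # 每爬完一个页面就追加记录,重启后 load_seen() 即可跳过
    with open(SEEN_FILE, 'a', encoding='utf-8') as f:
        f.write(url + '\n')
```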
18 | 前两种方法比较直观,上面给出了第一种的简单示意,这里不再展开;主要说一下这个优先级,修改最开始那段遍历页码的代码
19 | ```python
20 | totalpages = 10000
21 | def first(self, response):
22 | for i in range(totalpages):
23 | url = '****page' + str(i)
24 | priority = totalpages-i
25 | yield Request(url=url,
26 |                       priority=priority,
27 | callback=self.next)
28 | ```
29 |
30 | 3. 当Request中加入了priority属性的时候,scrapy每次从请求队列中取请求时都会判断优先级,先取出优先级高的去访问。由于scrapy默认并发16个请求,这时优先级为100的请求就会先于优先级为90的被取出请求队列,这样我们就能大体上保证爬取网页的顺序性。
31 |
32 | 保证了顺序性之后呢,我们就要记录已经爬取的页数。由于发送请求,下载页面,存取数据这几个动作是顺序执行的,也就是说程序发送了这个请求不代表此时已经爬取到这一页了,只有当收到response的时候我们才能确定我们已经获取到了数据,这时我们才能记录爬取位置。完整的代码应该是这样的:
33 | ```python
34 | with open('filename', 'r+') as f: # 程序开始我们要读取上次程序断掉时候的爬取位置
35 | page = int(f.read()) - 20 # 毕竟多线程,防止小范围乱序,保证数据的完整性
36 | totalpages = 10000
37 |
38 | def first(self, response):
39 | for i in range(self.page, totalpages):
40 | url = '****page' + str(i)
41 | priority = totalpages - i # 优先级从高到低
42 | yield Request(url=url,
43 |                       meta={'page': i},  # 把当前页码传给回调
44 | priority=priority, # 设置优先级
45 |                       callback=self.next)  # 回调下一个函数
46 |
47 | def next(self, response):
48 |     with open('filename', 'w') as f:  # 请求成功保存当前page
49 |         f.write(str(response.meta['page']))
50 |
51 | ```
--------------------------------------------------------------------------------
/Spider/ch_summary/summary02.md:
--------------------------------------------------------------------------------
1 | ## url去重
2 |
3 | ### 去重方案
4 | - 关系数据库去重
5 |
6 | 例如将url保存到MySQL,每遇到一个url就启动一次查询,数据量大时,效率低
7 |
8 | - 缓存数据库去重
9 |
10 | Redis,使用其中的Set数据类型,可将内存中的数据持久化,应用广泛
11 |
12 | - 内存去重
13 |
14 | 将url直接保存到HashSet中,也就是python的set,这种方式也很耗内存。进一步将url经过MD5或SHA-1等哈希算法生成摘要,再存到HashSet中。MD5处理后摘要长度128位,32个字符,SHA-1摘要长度160位,这样占内存要小很多。
15 | 或者采用Bit-Map方法,建立一个BitSet,每一个url经过一个哈希函数映射到其中一位,内存消耗最少,但是会发生冲突,产生误判。
16 | 最后就是BloomFilter,对Bit-Map的扩展。
17 |
18 | **综上,比较好的方法为:url经过MD5或SHA-1加密,然后结合缓存数据库,基本满足大多数中型爬虫需要。数据量上亿或者几十亿时,用BloomFilter了。**
19 |
20 | #### scrapy-redis当中的去重
21 | ```python
22 | class RFPDupeFilter(BaseDupeFilter):
23 | def __init__(self, path=None, debug=False):
24 | self.fingerprints = set()
25 | ...
26 |
27 |     def request_fingerprint(self, request, include_headers=None):
28 | if include_headers:
29 |             include_headers = tuple(to_bytes(h.lower()) for h in sorted(include_headers))
30 | cache = _fingerprint_cache.setdefault(request, {})
31 |
32 | if include_headers not in cache:
33 | fp = hashlib.sha1()
34 | fp.update(to_bytes(request.method))
35 | fp.update(to_bytes(canonicalize_url(request.url)))
36 |             fp.update(request.body or b'')
37 |
38 | if include_headers:
39 | for hdr in include_headers:
40 | if hdr in request.headers:
41 | fp.update(hdr)
42 | for v in request.headers.getlist(hdr):
43 | fp.update(v)
44 |
45 | cache[include_headers] = fp.hexdigest()
46 | return cache[include_headers]
47 |
48 | ```
49 | 可以发现,RFPDupeFilter类初始化时默认用一个set()来保存请求指纹(scrapy-redis中则把指纹存到redis的set里)。
50 | request_fingerprint方法中,去重指纹是sha1(method + url + body),传入include_headers时才会把指定请求头算进去,所以实际能够去掉重复的比例并不大。
51 | 我们可以自定义使用url做指纹去重:
52 | ```python
53 | class SeenURLFilter(RFPDupeFilter):
54 |     """A dupe filter that checks only the URL"""
55 | def __init__(self, path=None):
56 | self.urls_seen = set()
57 |         RFPDupeFilter.__init__(self, path)
58 |
59 | def request_seen(self, request):
60 | if request.url in self.urls_seen:
61 | return True
62 | else:
63 | self.urls_seen.add(request.url)
64 | ```
65 | #### 去重原理
66 | 1. set方法的去重
67 | ```python
68 | class Foo:
69 | def __init__(self, name, count):
70 | self.count = count
71 | self.name = name
72 |
73 | def __hash__(self):
74 | return hash(self.count)
75 |
76 | def __eq__(self, other):
77 | print(self.__dict__, other.__dict__)
78 | return self.__dict__ == other.__dict__
79 |
80 | ```
81 | python中的set去重会调用__hash__这个魔法方法,如果传入的变量不可哈希,会直接抛出异常,如果返回哈希值相同,又会调用__eq__这个魔法方法。
82 | 所以,set去重是通过__hash__和__eq__结合实现的。
83 |
84 | 2. set 去重效率
85 |
86 | python的set去重优点是速度快,但要把所有待去重对象同时加载到内存,再进行比较判断,返回去重后的集合对象;虽然底层也是先哈希再判断,但还是比较耗内存,而且它返回的是整个去重后的集合,并没有直接提供"判断单个元素是否已存在"的接口。
87 |
88 | 我们可以利用redis缓存数据库中的set类型来判断元素是否已存在于集合中,它的底层实现原理与python的set类似。Redis的set类型有sadd()方法与sismember()方法:如果redis当中不存在这条记录,sadd会添加进去,sismember返回False;如果已存在,sadd不会重复添加,sismember返回True。 接下来计算一下:
89 | 1GB=1024MB=1024×1024KB=1024×1024×1024Byte,按每条MD5摘要(32个十六进制字符)占32Byte计算,不考虑哈希冲突的情况下,1024×1024×1024Byte/32=33554432条,1G内存大约可以去重3300万条记录;考虑到哈希表存储效率通常小于50%,1G内存大约可以去重1600多万条记录。因此达到亿级别的数据,就得采用布隆过滤器了。
90 |
91 | 3. 布隆过滤器
92 | 具体实现百度吧,说说要点:
93 | 1. 一个很长的二进制向量(位数组)
94 | 2. 一系列随机函数(哈希)
95 | 3. 空间效率和查询效率高
96 | 4. 有一定误判率(哈希表是精确匹配)
97 | 5. 广泛用于拼写检查,网络去重和数据库系统中
98 |
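下面给出一个基于位数组和多个哈希函数的布隆过滤器最简示意(位数、哈希个数等参数都是随意假设的,只用来说明原理,生产环境建议直接用现成的库,比如 pybloom-live):
```python
import hashlib

class SimpleBloomFilter(object):
    def __init__(self, bit_size=1 << 20, hash_count=5):
        self.bit_size = bit_size          # 位数组长度(bit)
        self.hash_count = hash_count      # 哈希函数个数
        self.bits = bytearray(bit_size // 8)

    def _positions(self, value):
        # 用 md5 加不同前缀模拟 hash_count 个相互独立的哈希函数
        for i in range(self.hash_count):
            digest = hashlib.md5(('%d:%s' % (i, value)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.bit_size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        # 只要有一位为0就一定没出现过;全为1时可能误判为"已存在"
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


bf = SimpleBloomFilter()
bf.add('http://example.com/page/1')
print('http://example.com/page/1' in bf)   # True
print('http://example.com/page/2' in bf)   # 极大概率 False
```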
--------------------------------------------------------------------------------
/Spider/ch_summary/summary03.md:
--------------------------------------------------------------------------------
1 | ### 爬虫高效率
2 | 爬虫的本质就是一个socket客户端与服务端的通信过程,如果我们有多个url待爬取,只用一个线程且采用串行的方式执行,
3 | 那只能等待爬取一个结束后才能继续下一个,效率会非常低。需要强调的是:对于单线程下串行N个任务,并不完全等同于低效。
4 |
5 | 如果这N个任务都是纯计算的任务,那么该线程对cpu的利用率仍然会很高,之所以单线程下串行多个爬虫任务低效, 是因为爬虫任务是明显的IO密集型程序。
6 |
7 | ---
8 | #### 一、同步、异步、回调机制
9 |
10 | 1、同步调用:即提交一个任务后就在原地等待任务结束,等到拿到任务的结果后再继续下一行代码,效率低下
11 | ```python
12 | import requests
13 |
14 | def parse_page(res):
15 | print('解析 %s' %(len(res)))
16 |
17 | def get_page(url):
18 | print('下载 %s' % url)
19 | response = requests.get(url)
20 | if response.status_code == 200:
21 | return response.text
22 |
23 | urls=['https://www.baidu.com/','http://www.sina.com.cn/','https://www.python.org']
24 | for url in urls:
25 | res = get_page(url) # 调用一个任务,就在原地等待任务结束拿到结果后才继续往后执行
26 | parse_page(res)
27 | ```
28 | 2、一个简单的解决方案:多线程或多进程
29 |
30 | 在服务器端使用多线程(或多进程)。
31 | 多线程(或多进程)的目的是让每个连接都拥有独立的线程(或进程),这样任何一个连接的阻塞都不会影响其他的连接。
32 | ```python
33 | #IO密集型程序应该用多线程
34 | import requests
35 | from threading import Thread, current_thread
36 |
37 | def parse_page(res):
38 | print('%s 解析 %s' %(current_thread().getName(),len(res)))
39 |
40 |
41 | def get_page(url, callback=parse_page):
42 | print('%s 下载 %s' %(current_thread().getName(),url))
43 | response = requests.get(url)
44 | if response.status_code == 200:
45 | callback(response.text)
46 |
47 |
48 | if __name__ == '__main__':
49 | urls=['https://www.baidu.com/','http://www.sina.com.cn/','https://www.python.org']
50 | for url in urls:
51 | t=Thread(target=get_page,args=(url,))
52 | t.start()
53 | ```
54 | 该方案的问题是:
55 | 开启多进程或多线程的方式,我们是无法无限制使用的:在需要同时响应成百上千路连接请求时,无论多线程还是多进程都会严重占用系统资源,降低系统对外界的响应效率,
56 | 而且线程与进程本身也更容易进入假死状态。
57 |
58 | 3、改进方案:
59 |
60 | 线程池或进程池+异步调用:提交一个任务后并不会等待任务结束,而是继续下一行代码。很多程序员可能会考虑使用`'线程池'`或`'连接池'`。
61 | `'线程池'`旨在减少创建和销毁线程的频率,维持一定合理数量的线程,并让空闲的线程重新承担新的执行任务。`'连接池'`则维持连接的缓存池,尽量重用已有的连接、减少创建和关闭连接的频率。这两种技术都可以很好地降低系统开销,被广泛应用于很多大型系统,如Nginx、tomcat和各种数据库等。
62 | ```python
63 | # IO密集型程序应该用多线程,我们使用线程池
64 |
65 | import requests
66 | from threading import current_thread
67 | from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
68 |
69 |
70 | def parse_page(res):
71 | res=res.result()
72 | print('%s 解析 %s' %(current_thread().getName(),len(res)))
73 |
74 | def get_page(url):
75 | print('%s 下载 %s' %(current_thread().getName(),url))
76 | response=requests.get(url)
77 | if response.status_code == 200:
78 | return response.text
79 |
80 |
81 | if __name__ == '__main__':
82 | urls=['https://www.baidu.com/', 'http://www.sina.com.cn/', 'https://www.python.org']
83 | pool = ThreadPoolExecutor(50)
84 | # pool = ProcessPoolExecutor(50)
85 | for url in urls:
86 | pool.submit(get_page,url).add_done_callback(parse_page)
87 | pool.shutdown(wait=True)
88 | ```
89 | 改进后方案其实也存在着问题:
90 |
91 | `'线程池'`和`'连接池'`技术也只是在一定程度上缓解了频繁调用IO接口带来的资源占用。而且,所谓`'池'`始终有其上限,当请求大大超过上限时,`'池'`构成的系统对外界的响应并不比没有池的时候效果好多少。所以使用`'池'`必须考虑其面临的响应规模,并根据响应规模调整'池'的大小。
92 |
93 | 对应上例中的所面临的可能同时出现的上千甚至上万次的客户端请求,`'线程池'`或`'连接池'`或许可以缓解部分压力,但是不能解决所有问题。
94 |
95 | 总之,多线程模型可以方便高效的解决小规模的服务请求,但面对大规模的服务请求,多线程模型也会遇到瓶颈,可以用非阻塞接口来尝试解决这个问题。
96 |
97 | #### 二、高性能
98 |
99 | 上述无论哪种解决方案其实没有解决一个性能相关的问题:IO阻塞。无论是多进程还是多线程,在遇到IO阻塞时都会被操作系统强行剥夺走CPU的执行权限,程序的执行效率因此就降低了下来。
100 |
101 | 解决这一问题的关键在于,我们自己从应用程序级别检测IO阻塞,然后切换到我们自己程序的其他任务执行,这样把我们程序的IO降到最低,我们的程序处于就绪态就会增多,以此来迷惑操作系统,操作系统便以为我们的程序是IO比较少的程序,从而会尽可能多的分配CPU给我们,这样也就达到了提升程序执行效率的目的。
102 |
103 | 1、在python3.3之后新增了asyncio模块,可以帮我们检测IO(只能是网络IO),实现应用程序级别的切换。
104 | ```python
105 | import asyncio
106 |
107 |
108 | @asyncio.coroutine
109 | def task(task_id, seconds):
110 | print('%s is start' %task_id)
111 |     yield from asyncio.sleep(seconds)  # 只能检测网络IO,检测到IO后切换到其他任务执行
112 | print('%s is end' %task_id)
113 |
114 | tasks = [task(task_id='任务1', seconds=3), task('任务2', 2), task(task_id='任务3', seconds=1)]
115 | loop = asyncio.get_event_loop()
116 | loop.run_until_complete(asyncio.wait(tasks))
117 | loop.close()
118 | ```
119 | 2、但asyncio模块只能发tcp级别的请求,不能发http协议,因此,在我们需要发送http请求的时候,需要我们自定义http报头。
120 | ```python
121 | import asyncio
122 | import uuid
123 |
124 | user_agent='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
125 |
126 |
127 | def parse_page(host,res):
128 | print('%s 解析结果 %s' %(host,len(res)))
129 | with open('%s.html' %(uuid.uuid1()),'wb') as f:
130 | f.write(res)
131 |
132 | @asyncio.coroutine
133 | def get_page(host, port=80, url='/', callback=parse_page, ssl=False):
134 | print('下载 http://%s:%s%s' %(host,port,url))
135 | #步骤一(IO阻塞):发起tcp链接,是阻塞操作,因此需要yield from
136 | if ssl:
137 | port = 443
138 | recv, send = yield from asyncio.open_connection(host=host, port=port, ssl=ssl)
139 |
140 | # 步骤二:封装http协议的报头,因为asyncio模块只能封装并发送tcp包,因此这一步需要我们自己封装http协议的包
141 |     request_headers = '''GET %s HTTP/1.0\r\nHost: %s\r\nUser-agent: %s\r\n\r\n''' % (url, host, user_agent)
142 |     # request_headers = '''POST %s HTTP/1.0\r\nHost: %s\r\n\r\nname=egon&password=123''' % (url, host,)
143 | request_headers=request_headers.encode('utf-8')
144 |
145 | # 步骤三(IO阻塞):发送http请求包
146 | send.write(request_headers)
147 | yield from send.drain()
148 |
149 | # 步骤四(IO阻塞):接收响应头
150 | while True:
151 | line = yield from recv.readline()
152 |         if line == b'\r\n':
153 | break
154 |
155 | print('%s Response headers:%s' %(host,line))
156 |
157 | # 步骤五(IO阻塞):接收响应体
158 | text = yield from recv.read()
159 |
160 | # 步骤六:执行回调函数
161 | callback(host, text)
162 |
163 | # 步骤七:关闭套接字
164 | send.close() # 没有recv.close()方法,因为是四次挥手断链接,双向链接的两端,一端发完数据后执行send.close()另外一端就被动地断开
165 |
166 |
167 | if __name__ == '__main__':
168 | tasks = [
169 | get_page('www.baidu.com', url='/s?wd=美女', ssl=True),
170 | get_page('www.cnblogs.com', url='/', ssl=True),
171 | ]
172 |
173 |
174 | loop = asyncio.get_event_loop()
175 | loop.run_until_complete(asyncio.wait(tasks))
176 | loop.close()
177 | ```
178 | 3、自定义http报头多少有点麻烦,于是有了aiohttp模块,专门帮我们封装http报头,然后我们还需要用asyncio检测IO实现切换。
179 |
180 | ```python
181 | import aiohttp
182 | import asyncio
183 |
184 |
185 | @asyncio.coroutine
186 | def get_page(url):
187 | print('GET:%s' %url)
188 | response = yield from aiohttp.request('GET',url)
189 | data = yield from response.read()
190 | print(url, data)
191 | response.close()
192 | return 1
193 |
194 | tasks=[
195 | get_page('https://www.python.org/doc'),
196 | get_page('https://www.baidu.com'),
197 | get_page('https://www.openstack.org')
198 | ]
199 |
200 | loop = asyncio.get_event_loop()
201 | results = loop.run_until_complete(asyncio.gather(*tasks))
202 | loop.close()
203 |
204 | print('=====>',results) # [1, 1, 1]
205 | ```
206 | 4、此外,还可以将requests.get函数传给asyncio,就能够被检测了。
207 | ```python
208 | import requests
209 | import asyncio
210 |
211 |
212 | @asyncio.coroutine
213 | def get_page(func, *args):
214 | print('GET:%s' % args[0])
215 | loop = asyncio.get_event_loop()
216 |     future = loop.run_in_executor(None, func, *args)
217 |     response = yield from future
218 | print(response.url, len(response.text))
219 | return 1
220 |
221 |
222 | tasks=[
223 | get_page(requests.get, 'https://www.python.org/doc'),
224 | get_page(requests.get, 'https://www.baidu.com'),
225 | get_page(requests.get, 'https://www.openstack.org')
226 | ]
227 |
228 | loop = asyncio.get_event_loop()
229 | results = loop.run_until_complete(asyncio.gather(*tasks))
230 | loop.close()
231 |
232 | print('=====>',results) # [1, 1, 1]
233 | ```
234 | 5、还有之前在协程时介绍的gevent模块
235 | ```python
236 | from gevent import monkey; monkey.patch_all()
237 | import gevent
238 | import requests
239 |
240 |
241 | def get_page(url):
242 | print('GET:%s' % url)
243 | response = requests.get(url)
244 | print(url, len(response.text))
245 | return 1
246 |
247 |
248 | # g1=gevent.spawn(get_page,'https://www.python.org/doc')
249 | # g2=gevent.spawn(get_page,'https://www.cnblogs.com/linhaifeng')
250 | # g3=gevent.spawn(get_page,'https://www.openstack.org')
251 | # gevent.joinall([g1,g2,g3,])
252 | # print(g1.value,g2.value,g3.value) # 拿到返回值
253 |
254 |
255 | #协程池
256 | from gevent.pool import Pool
257 |
258 | pool = Pool(2)
259 | g1 = pool.spawn(get_page, 'https://www.python.org/doc')
260 | g2 = pool.spawn(get_page, 'https://www.cnblogs.com/linhaifeng')
261 | g3 = pool.spawn(get_page, 'https://www.openstack.org')
262 | gevent.joinall([g1, g2, g3,])
263 | print(g1.value, g2.value, g3.value) # 拿到返回值
264 | ```
265 | 6、封装了gevent+requests模块的grequests模块
266 | ```python
267 | #pip3 install grequests
268 | import grequests
269 |
270 |
271 | request_list=[
272 | grequests.get('https://wwww.xxxx.org/doc1'),
273 | grequests.get('https://www.cnblogs.com/linhaifeng'),
274 | grequests.get('https://www.openstack.org')
275 | ]
276 |
277 | ##### 执行并获取响应列表 #####
278 |
279 | # response_list = grequests.map(request_list)
280 | # print(response_list)
281 |
282 | ##### 执行并获取响应列表(处理异常) #####
283 |
284 | def exception_handler(request, exception):
285 | # print(request,exception)
286 | print('%s Request failed' % request.url)
287 |
288 | response_list = grequests.map(request_list, exception_handler=exception_handler)
289 | print(response_list)
290 | ```
291 | 7、twisted:是一个网络框架,其中一个功能是发送异步请求,检测IO并自动切换,但是本人用的并不多。
292 | ```
293 | 问题一:error: Microsoft Visual C++ 14.0 is required. Get it with 'Microsoft Visual C++ Build Tools': http://landinghub.visualstudio.com/visual-cpp-build-tools
294 |
295 | https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
296 |
297 | pip3 install C:\Users\Administrator\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
298 | pip3 install twisted
299 |
300 |
301 | 问题二:ModuleNotFoundError: No module named 'win32api'
302 | https://sourceforge.net/projects/pywin32/files/pywin32/
303 |
304 |
305 | 问题三:openssl
306 | pip3 install pyopenssl
307 |
308 | ```
309 |
310 | ```python
311 | # twisted基本用法
312 |
313 | from twisted.web.client import getPage,defer
314 | from twisted.internet import reactor
315 |
316 | def all_done(arg):
317 | # print(arg)
318 | reactor.stop()
319 |
320 | def callback(res):
321 | print(res)
322 | return 1
323 |
324 | defer_list = []
325 | urls = [
326 | 'http://www.baidu.com',
327 | 'http://www.bing.com',
328 | 'https://www.python.org',
329 | ]
330 |
331 | for url in urls:
332 |     obj = getPage(url.encode('utf-8'),)
333 | obj.addCallback(callback)
334 | defer_list.append(obj)
335 |
336 | defer.DeferredList(defer_list).addBoth(all_done)
337 | reactor.run()
338 |
339 |
340 | # twisted的getPage的详细用法
341 | from twisted.internet import reactor
342 | from twisted.web.client import getPage
343 | import urllib.parse
344 |
345 | def one_done(arg):
346 | print(arg)
347 | reactor.stop()
348 |
349 | post_data = urllib.parse.urlencode({'check_data': 'adf'})
350 | post_data = bytes(post_data, encoding='utf8')
351 | headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
352 | response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
353 | method=bytes('POST', encoding='utf8'),
354 | postdata=post_data,
355 | cookies={},
356 | headers=headers)
357 | response.addBoth(one_done)
358 | reactor.run()
359 | ```
360 | 8、tornado是最常用的异步网络框架之一
361 | ```python
362 | from tornado.httpclient import AsyncHTTPClient
363 | from tornado.httpclient import HTTPRequest
364 | from tornado import ioloop
365 |
366 |
367 | def handle_response(response):
368 | '''
369 | 处理返回值内容(需要维护计数器,来停止IO循环),调用 ioloop.IOLoop.current().stop()
370 | :param response:
371 | :return:
372 | '''
373 | if response.error:
374 | print('Error:', response.error)
375 | else:
376 | print(response.body)
377 |
378 |
379 | def func():
380 | url_list = [
381 | 'http://www.baidu.com',
382 | 'http://www.bing.com',
383 | ]
384 |
385 | for url in url_list:
386 | print(url)
387 | http_client = AsyncHTTPClient()
388 | http_client.fetch(HTTPRequest(url), handle_response)
389 |
390 | ioloop.IOLoop.current().add_callback(func)
391 | ioloop.IOLoop.current().start()
392 | ```
393 | 发现上例在所有任务都完毕后也不能正常结束,为了解决该问题,让我们来加上计数器。
394 | ```python
395 | from tornado.httpclient import AsyncHTTPClient
396 | from tornado.httpclient import HTTPRequest
397 | from tornado import ioloop
398 |
399 |
400 | count=0
401 |
402 | def handle_response(response):
403 | """
404 | 处理返回值内容(需要维护计数器,来停止IO循环),调用 ioloop.IOLoop.current().stop()
405 | :param response:
406 | :return:
407 | """
408 | if response.error:
409 | print('Error:', response.error)
410 | else:
411 | print(len(response.body))
412 |
413 | global count
414 | count-=1 #完成一次回调,计数减1
415 |
416 | if count == 0:
417 | ioloop.IOLoop.current().stop()
418 |
419 | def func():
420 | url_list = [
421 | 'http://www.baidu.com',
422 | 'http://www.bing.com',
423 | ]
424 |
425 | global count
426 | for url in url_list:
427 | print(url)
428 | http_client = AsyncHTTPClient()
429 | http_client.fetch(HTTPRequest(url), handle_response)
430 | count+=1 # 计数加1
431 |
432 | ioloop.IOLoop.current().add_callback(func)
433 | ioloop.IOLoop.current().start()
434 | ```
435 |
436 | 9、celery同样是python的一个非常好用的异步任务框架,功能强大,内部封装了多进程/多线程等并发方式。本人之前使用selenium和phantomjs比较频繁,但是相对很慢,配合celery大大提高了效率,有兴趣可以看celery官网,下面附一个最简用法示意。
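这个示意只展示 celery 的基本用法(broker 地址、任务内容都是假设的,实际配置以 celery 官方文档为准):
```python
# tasks.py(示意)
from celery import Celery
import requests

# 假设用本机 redis 作为 broker,地址按需修改
app = Celery('tasks', broker='redis://127.0.0.1:6379/0')

@app.task
def fetch(url):
    # 每个任务由 celery 的 worker 进程异步执行
    return len(requests.get(url).text)

# 启动 worker:celery -A tasks worker --loglevel=info
# 调用方:fetch.delay('https://www.python.org') 立即返回,不阻塞当前进程
```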
--------------------------------------------------------------------------------
/Spider/ch_summary/summary04.md:
--------------------------------------------------------------------------------
1 | 布隆过滤器这里是通过写文件的方式持久化的,多进程使用时需要自行添加同步和互斥,较为繁琐,不推荐在多线程/多进程的场景使用;另外写文件耗费时间长,可以累积一批后一次性写入,或者利用上下文管理器在退出时写入。
2 | ```python
3 | import os
4 | import json
5 | from pybloom import BloomFilter  # 假设使用 pybloom(或接口一致的 pybloom_live)提供的 BloomFilter
6 | class Spider(object):
7 | def __init__(self):
8 | # 布隆过滤器初始化
9 | self.bulongname = 'test.bl'
10 | if not os.path.isfile(self.bulongname):
11 | self.bl = BloomFilter(capacity=100000, error_rate=0.00001)
12 | else:
13 | with open(self.bulongname, 'rb') as f:
14 | self.bl = BloomFilter.fromfile(f)
15 |
16 | def __enter__(self):
17 | """
18 | 上下文管理器入口
19 | """
20 | return self
21 |
22 | def __exit__(self, *args):
23 | """
24 | 上下文管理器退出入口
25 | """
26 |         if getattr(self, 'conn', None) is not None:  # 如有数据库连接则先关闭
27 |             self.conn.close()
28 |         with open(self.bulongname, 'wb') as f:
29 |             self.bl.tofile(f)  # 退出时把布隆过滤器写回文件
30 |
31 | def get_info(self):
32 | """
33 | 抓取主函数
34 | """
35 | x = json.dumps(" ")
36 | if x not in self.bl:
37 | self.bl.add(x)
38 |
39 | if __name__ == '__main__':
40 | with Spider() as S:
41 | S.get_info()
42 |
43 | ```
44 |
45 | redis 集合去重能够支持多线程和多进程
46 | 步骤:
47 | 1. 建立连接池
48 | 2. 重复检查
49 |
50 | ```python
51 | import os
52 | import sys
53 | import hashlib
54 | import redis
55 |
56 | def redis_init(host='127.0.0.1'):
57 |     pool = redis.ConnectionPool(host=host, port=6379, db=0)
58 | r = redis.Redis(connection_pool=pool)
59 | return pool, r
60 |
61 | def redis_close(pool):
62 | """
63 | 释放连接池
64 | """
65 |     pool.disconnect()
66 |
67 | def sha1(x):
68 | sha1obj = hashlib.sha1()
69 |     sha1obj.update(x.encode('utf-8'))  # sha1 需要传入 bytes
70 | hash_value = sha1obj.hexdigest()
71 | return hash_value
72 |
73 | def check_repeate(r, check_str, set_name):
74 | """
75 | 向redis集合中添加元素,重复返回0, 不重复则添加,返回1
76 | """
77 | hash_value = sha1(check_str)
78 | result = r.sadd(set_name, hash_value)
79 | return result
80 |
81 | def main():
82 | pool, r= redis_init()
83 | temp_str = 'aaaaaa'
84 | result = check_repeate(r, temp_str, 'test:test')
85 | if result == 0:
86 | # TODO
87 | print('重复')
88 | else:
89 | # TODO
90 | print('不重复')
91 | redis_close(pool)
92 |
93 |
94 | if __name__ == '__main__':
95 | main()
96 |
97 | ```
--------------------------------------------------------------------------------
/Spider/readme.md:
--------------------------------------------------------------------------------
1 | ### 爬虫
2 |
3 | - [scrapy-redis 分布式源码分析](./ch_Distributedcrawler)
4 | - [爬猫眼,维护IP池,处理中间件](./ch_Haiwang)
5 | - [如何把评论做成词云](./ch_Haiwang)
6 | - [爬B站视频](./ch_Bilibili)
7 | - [scrapy 如何中断续爬](./ch_summary/summary01.md)
8 | - [url常用去重](./ch_summary/summary02.md)
9 | - [代码模拟url去重](./ch_summary/summary04.md)
10 | - [多线程,协程,与异步IO](./ch_summary/summary03.md)
11 | - [破解滑动验证码(网站变化太快,随机应变)](./ch_Code)
12 |
--------------------------------------------------------------------------------
/ch01/readme.md:
--------------------------------------------------------------------------------
1 | ## 数据分析是一个大工程,覆盖面广。主要包括探索分析,特征处理
2 | ## 探索分析(单因子分析,多因子分析,复合分析)
3 | ### 单因子分析
4 | #### 理论铺垫
5 | - 集中趋势
6 | 均值 中位数 分位数 众数
7 | - 离中趋势
8 | 标准差 方差
9 | - 数据分布
10 | - 偏态
11 | 正偏 负偏
12 | - 峰态
13 | 正态分布峰态系数一般为3
14 | - 正态
15 | 标准正态分布均值为0,方差为1
16 | - 三大分布
17 | 卡方分布 t分布 f分布
18 | - 抽样理论
19 | 抽样误差 抽样精度
20 |
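上面这些集中趋势、离中趋势、偏态峰态的统计量,用 pandas 基本都能直接算出来,举个小例子(数据是随手构造的):
```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2, 3, 4, 5, 100]})
# 集中趋势:均值、中位数、分位数、众数
print(df['x'].mean(), df['x'].median(), df['x'].quantile(0.25), df['x'].mode()[0])
# 离中趋势:标准差、方差
print(df['x'].std(), df['x'].var())
# 偏态与峰态;注意 pandas 的 kurt() 返回超额峰度,正态分布约为 0(即峰态系数 3 再减 3)
print(df['x'].skew(), df['x'].kurt())
```
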
21 | #### 具体方法
22 | - 异常值分析
23 | - 离散异常值
24 | - 连续异常值
25 | - 常识异常值
26 | - 对比分析
27 | - 绝对数与相对数
28 | - 时间,空间,理论维度比较
29 | - 结构分析
30 | - 各组成部分的分布与规律
31 | - 分布分析
32 | - 数据分布频率的显示分析
33 |
34 | ### 多因子分析
35 | - 假设检验与方差检验
36 | - 相关系数
37 | - 皮尔逊
38 | - 斯皮尔曼
39 | - 回归
40 | - 线性回归
41 | - PCA与奇异值分解
42 |
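以皮尔逊、斯皮尔曼相关系数为例,用 scipy 可以直接计算(数据为随手构造的示例):
```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 4, 5, 6], dtype=float)
# 皮尔逊相关系数(线性相关)与斯皮尔曼相关系数(秩相关),均返回 (相关系数, p值)
print(stats.pearsonr(x, y))
print(stats.spearmanr(x, y))
```
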
43 | ### 复合分析
44 | - 交叉分析
45 | - 分组与钻取
46 | - 相关分析
47 | - 因子分析
48 | - 聚类分析
49 | - 回归分析
50 |
51 | ### 小结
52 | |数据类型|可用方法|
53 | |:----:|:----:|
54 | |连续--连续|相关系数,假设检验|
55 | |连续--离散(二值)|相关系数,连续二值化,最大熵增益切分|
56 | |连续--离散(非二值)|相关系数(定序)|
57 | |离散(二值)--离散(二值)|相关系数,熵相关,F分值|
58 | |离散--离散(非二值)|熵相关,Gini,相关系数(定序)|
60 |
61 | ## 特征处理
62 | - 特征选择
63 | - 特征变换
64 | - 对指化
65 | - 离散化
66 | - 归一化,标准化
67 | - 数值化
68 | - 正规化
69 | - 特征降维
70 | - 特征衍生
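特征变换中最常用的归一化、标准化和数值化,用 sklearn 的一个简单示意(数据为随手构造):
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

X = np.array([[1.0], [5.0], [10.0]])
print(MinMaxScaler().fit_transform(X))    # 归一化到 [0, 1]
print(StandardScaler().fit_transform(X))  # 标准化为均值0、方差1
print(LabelEncoder().fit_transform(['low', 'high', 'medium']))  # 数值化(标签编码)
```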
--------------------------------------------------------------------------------
/ch02/Ass_rule.py:
--------------------------------------------------------------------------------
1 | """
2 | 关联规则/序列规则
3 | """
4 | from itertools import combinations
5 |
6 |
7 | def comb(lst):
8 | """
9 | 得到一个列表里所有元素的组合
10 | ["a", "b", "c"]
11 | >>> [("a",), ("b",), ("c",), ("a","b"), ("a","c"), ("b","c"), ("a","b","c")]
12 | """
13 | ret = []
14 | for i in range(1, len(lst) + 1):
15 | ret += list(combinations(lst, i))
16 | return ret
17 |
18 |
19 | class AprLayer(object):
20 | """存放项目数量相同的项集"""
21 | def __init__(self):
22 | self.d = dict()
23 |
24 |
25 | class AprNode(object):
26 | """
27 | 初始化每一个项集
28 | node = ("a", ) 或 ("b",) 或 ("a", "b")
29 | """
30 | def __init__(self, node):
31 | self.s = set(node)
32 | self.size = len(self.s)
33 | self.lnk_nodes = dict()
34 | self.num = 0
35 |
36 | def __hash__(self):
37 | return hash("__".join(sorted([str(itm) for itm in list(self.s)])))
38 |
39 | def __eq__(self, other):
40 | if "__".join(sorted([str(itm) for itm in list(self.s)])) == "__".join(
41 | sorted([str(itm) for itm in list(other.s)])):
42 | return True
43 | return False
44 |
45 | def isSubnode(self, node):
46 | return self.s.issubset(node.s)
47 |
48 | def incNum(self, num=1):
49 | self.num += num
50 |
51 | def addLnk(self, node):
52 | self.lnk_nodes[node] = node.s
53 |
54 |
55 | class AprBlk(object):
56 | """使AprLayer和AprNode两个数据结构结合"""
57 | def __init__(self, data):
58 | cnt = 0 # 计数器
59 | self.apr_layers = dict()
60 | self.data_num = len(data)
61 | for datum in data:
62 | cnt += 1
63 | datum = comb(datum)
64 | # datum=[("a",), ("b",), ("a","b")]
65 | nodes = [AprNode(da) for da in datum]
66 | for node in nodes:
67 | # 根据项集的数目
68 | if node.size not in self.apr_layers:
69 | # {1: {('a',): AprNode(('a',)), ('b',): AprNode(('b',))}, 2:{}, 3:{}}
70 | self.apr_layers[node.size] = AprLayer()
71 | if node not in self.apr_layers[node.size].d:
72 | self.apr_layers[node.size].d[node] = node
73 | # 调用数量加1
74 | self.apr_layers[node.size].d[node].incNum()
75 | for node in nodes:
76 | if node.size == 1:
77 | continue
78 | for sn in node.s:
79 | # 高阶项集减去一阶项集
80 | # sn -> '毛巾', set([sn]) -> {'毛巾'}
81 | sub_n = AprNode(node.s - set([sn]))
82 | # 在低阶项集上建立和高阶项集的联系
83 | self.apr_layers[node.size - 1].d[sub_n].addLnk(node)
84 |
85 | def getFreqItems(self, thd=1, hd=1):
86 | # thd=1 为阈值
87 | freq_items = []
88 | for layer in self.apr_layers:
89 | for node in self.apr_layers[layer].d:
90 | if self.apr_layers[layer].d[node].num < thd:
91 | continue
92 | freq_items.append((self.apr_layers[layer].d[node].s, self.apr_layers[layer].d[node].num))
93 | # 根据num从高到低排序
94 | freq_items.sort(key=lambda x: x[1], reverse=True)
95 | return freq_items[:hd]
96 |
97 | def getConf(self, low=True, h_thd=10, l_thd=1, hd=1):
98 | # h_thd 高阈值
99 | confidence = []
100 | for layer in self.apr_layers:
101 | for node in self.apr_layers[layer].d:
102 | if self.apr_layers[layer].d[node].num < h_thd:
103 | continue
104 | for lnk_node in node.lnk_nodes:
105 | if lnk_node.num < l_thd:
106 | continue
107 | # 置信度=低阶频繁项集所连接的高阶频繁项集的数量/低阶频繁项集的数量
108 | conf = float(lnk_node.num) / float(node.num)
109 | confidence.append([node.s, node.num, lnk_node.s, lnk_node.num, conf])
110 | # 根据置信度排序
111 | confidence.sort(key=lambda x: x[4])
112 | if low:
113 | # 返回低置信度
114 | return confidence[:hd]
115 | else:
116 | # 返回高置信度
117 |             return confidence[:-hd-1:-1]  # 取置信度最高的 hd 条,按从高到低排列
118 |
119 |
120 | class AssctAnaClass():
121 | """关联规则"""
122 | def __init__(self):
123 | self.apr_blk = None
124 |
125 | def fit(self, data):
126 | # 拟合数据
127 | self.apr_blk = AprBlk(data)
128 | return self
129 |
130 | def get_freq(self, thd=1, hd=1):
131 | # 取出频繁项集
132 | return self.apr_blk.getFreqItems(thd=thd, hd=hd)
133 |
134 | def get_conf_high(self, thd, h_thd=10):
135 | # 取出高置信度项集组合
136 | return self.apr_blk.getConf(low=False, h_thd=h_thd, l_thd=thd)
137 |
138 | def get_conf_low(self, thd, hd, l_thd=1):
139 | # 取出低置信度项集组合
140 | return self.apr_blk.getConf(h_thd=thd, l_thd=l_thd, hd=hd)
141 |
142 |
143 | def main():
144 | data = [
145 | ["牛奶", "啤酒", "尿布"],
146 | ["牛奶", "啤酒", "咖啡", "尿布"],
147 | ["香肠", "牛奶", "饼干"],
148 | ["尿布", "果汁", "啤酒"],
149 | ["钉子", "啤酒"],
150 | ["尿布", "毛巾", "香肠"],
151 | ["啤酒", "毛巾", "尿布", "饼干"]
152 | ]
153 | print("Freq", AssctAnaClass().fit(data).get_freq(thd=3, hd=10))
154 | print("Conf", AssctAnaClass().fit(data).get_conf_high(thd=3, h_thd=4))
155 |
156 |
157 | if __name__ == "__main__":
158 | main()
159 |
--------------------------------------------------------------------------------
/ch02/K-means.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import matplotlib.pyplot as plt
4 | from sklearn.datasets import make_circles, make_blobs, make_moons
5 | from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
6 |
7 | n_samples = 1000
8 | circles = make_circles(n_samples=n_samples, factor=0.5, noise=0.05)
9 | moons = make_moons(n_samples=n_samples, noise=0.05)
10 | blobs = make_blobs(n_samples=n_samples, random_state=8, center_box=(-1, 1), cluster_std=0.1)
11 | random_data = np.random.rand(n_samples, 2), None # None指标注
12 | colors = "bgrcmyk"
13 | data = [circles, moons, blobs, random_data] # 四个数据集
14 | models = [("None", None), # 不添加任何模型
15 | ("KMeans", KMeans(n_clusters=3)),] # KMeans 分成3类
16 | # ("DBscan", DBSCAN(min_samples=3, eps=0.2)), # 分成3类, E邻域
17 | # ("Agglomerative", AgglomerativeClustering(n_clusters=3, linkage="ward"))] # 指定ward方法
18 | f = plt.figure()
19 | for inx, clt in enumerate(models):
20 | clt_name, clt_entry = clt
21 | for i, dataset in enumerate(data):
22 | X, Y = dataset
23 | if not clt_entry:
24 | clt_res = [0 for i in range(len(X))]
25 | else:
26 | clt_entry.fit(X)
27 | clt_res = clt_entry.labels_.astype(np.int)
28 | f.add_subplot(len(models), len(data), inx*len(data)+i+1)
29 | plt.title(clt_name)
30 | [plt.scatter(X[p, 0], X[p, 1], color=colors[clt_res[p]]) for p in range(len(X))]
31 | plt.savefig("Kmeans.png")
32 | plt.show()
33 |
--------------------------------------------------------------------------------
/ch02/RandomForest.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "### 训练\n",
8 | "\n",
9 | "我们对训练集采用随机森林模型,并评估模型效果"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 1,
15 | "metadata": {},
16 | "outputs": [
17 | {
18 | "name": "stdout",
19 | "output_type": "stream",
20 | "text": [
21 | "Populating the interactive namespace from numpy and matplotlib\n"
22 | ]
23 | }
24 | ],
25 | "source": [
26 | "%pylab inline\n",
27 | "# 导入训练集、验证集和测试集\n",
28 | "\n",
29 | "import pandas as pd\n",
30 | "\n",
31 | "samtrain = pd.read_csv('samtrain.csv')\n",
32 | "samval = pd.read_csv('samval.csv')\n",
33 | "samtest = pd.read_csv('samtest.csv')\n",
34 | "\n",
35 | "# 使用 sklearn的随机森林模型,其模块叫做 sklearn.ensemble.RandomForestClassifier\n",
36 | "\n",
37 | "# 在这里我们需要将标签列 ('activity') 转换为整数表示,\n",
38 | "# 因为Python的RandomForest package需要这样的格式。 \n",
39 | "\n",
40 | "# 其对应关系如下:\n",
41 | "# laying = 1, sitting = 2, standing = 3, walk = 4, walkup = 5, walkdown = 6\n",
42 | "\n",
43 | "import randomforests as rf\n",
44 | "samtrain = rf.remap_col(samtrain,'activity')\n",
45 | "samval = rf.remap_col(samval,'activity')\n",
46 | "samtest = rf.remap_col(samtest,'activity')"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 2,
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 | "import sklearn.ensemble as sk\n",
56 | "rfc = sk.RandomForestClassifier(n_estimators=500, oob_score=True)\n",
57 | "train_data = samtrain[samtrain.columns[1:-2]]\n",
58 | "train_truth = samtrain['activity']\n",
59 | "model = rfc.fit(train_data, train_truth)"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 3,
65 | "metadata": {},
66 | "outputs": [
67 | {
68 | "data": {
69 | "text/plain": [
70 | "0.98174904942965779"
71 | ]
72 | },
73 | "execution_count": 3,
74 | "metadata": {},
75 | "output_type": "execute_result"
76 | }
77 | ],
78 | "source": [
79 | "# Use the OOB (out-of-bag) score to estimate the model's accuracy.\n",
80 | "rfc.oob_score_"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 4,
86 | "metadata": {},
87 | "outputs": [
88 | {
89 | "data": {
90 | "text/plain": [
91 | "[(0.048788075395111638, 'tAccMean'),\n",
92 | " (0.044887862923922571, 'tAccStd'),\n",
93 | " (0.044231502495174914, 'tJerkMean'),\n",
94 | " (0.04892499919665521, 'tGyroJerkMagSD'),\n",
95 | " (0.058161561399143025, 'fAccMean'),\n",
96 | " (0.0448666616780896, 'fJerkSD'),\n",
97 | " (0.14045995765086935, 'angleGyroJerkGravity'),\n",
98 | " (0.16538335816293095, 'angleXGravity'),\n",
99 | " (0.047154808012715918, 'angleYGravity')]"
100 | ]
101 | },
102 | "execution_count": 4,
103 | "metadata": {},
104 | "output_type": "execute_result"
105 | }
106 | ],
107 | "source": [
108 | "# Use the \"feature importance\" scores to look at the 10 most important features\n",
109 | "fi = enumerate(rfc.feature_importances_)\n",
110 | "cols = samtrain.columns\n",
111 | "[(value,cols[i]) for (i,value) in fi if value > 0.04]\n",
112 | "## The threshold 0.04 was chosen empirically; it happens to yield the 10 best features.\n",
113 | "## Changing this threshold gives a different number of features.\n",
114 | "## The line below is kept as a backup in case you change the parameters and want the original back.\n",
115 | "## [(value,cols[i]) for (i,value) in fi if value > 0.04]"
116 | ]
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "We call predict() on the validation and test sets and compute the corresponding errors."
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 5,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "# The pandas data frame carries a dummy unnamed column at position 0, so we start from column 1.\n",
132 | "# not using subject column, activity ie target is in last columns hence -2 i.e dropping last 2 cols\n",
133 | "\n",
134 | "val_data = samval[samval.columns[1:-2]]\n",
135 | "val_truth = samval['activity']\n",
136 | "val_pred = rfc.predict(val_data)\n",
137 | "\n",
138 | "test_data = samtest[samtest.columns[1:-2]]\n",
139 | "test_truth = samtest['activity']\n",
140 | "test_pred = rfc.predict(test_data)"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "#### Printing the errors "
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 7,
153 | "metadata": {},
154 | "outputs": [
155 | {
156 | "name": "stdout",
157 | "output_type": "stream",
158 | "text": [
159 | "mean accuracy score for validation set = 0.834911\n",
160 | "mean accuracy score for test set = 0.900337\n"
161 | ]
162 | }
163 | ],
164 | "source": [
165 | "print(\"mean accuracy score for validation set = %f\" %(rfc.score(val_data, val_truth)))\n",
166 | "print(\"mean accuracy score for test set = %f\" %(rfc.score(test_data, test_truth)))"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 7,
172 | "metadata": {},
173 | "outputs": [
174 | {
175 | "data": {
176 | "text/plain": [
177 | "array([[293, 0, 0, 0, 0, 0],\n",
178 | " [ 0, 224, 40, 0, 0, 0],\n",
179 | " [ 0, 29, 254, 0, 0, 0],\n",
180 | " [ 0, 0, 0, 197, 26, 6],\n",
181 | " [ 0, 0, 16, 1, 173, 26],\n",
182 | " [ 0, 0, 0, 3, 14, 183]])"
183 | ]
184 | },
185 | "execution_count": 7,
186 | "metadata": {},
187 | "output_type": "execute_result"
188 | }
189 | ],
190 | "source": [
191 | "# Use a confusion matrix to see which activities were misclassified.\n",
192 | "# See [5] for details\n",
193 | "import sklearn.metrics as skm\n",
194 | "test_cm = skm.confusion_matrix(test_truth,test_pred)\n",
195 | "test_cm"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 9,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "# Visualize the confusion matrix"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": 9,
210 | "metadata": {},
211 | "outputs": [
212 | {
213 | "data": {
214 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAPgAAAD0CAYAAAC2E+twAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGmVJREFUeJzt3Xu4JFV57/HvbwYYISggKMoAMyCSAOcokIgHiTDEIzJK\nMDEmAj4nCImeE+KBR2KCoHmGMd4wEaIIJ4mQeURFRHwQSVAGAhsDclMYGS4DCAwMIzNyvzhhmMt7\n/lhrQ++mL7W7+lK79u/zPPXs7urqWqt399tr1arqdykiMLN6mjHqCpjZ4DjAzWrMAW5WYw5wsxpz\ngJvVmAPcrMYqH+CSXiHpUklPSfpOif0cJelH/azbqEj6XUl39fjc3SXdKulpSR/td92GSdIcSRsl\nVf5zPCp9+8fkALpZ0rOSVkr6d0kH9GHX7wdeA2wTER/odScRcX5EHNqH+gxU/sDu2mmbiLg2Ivbo\nsYi/Aa6KiK0i4qs97uNFkhZIOq/sfvK+ur72FgpdyCHpIEkreqjWlNaXAJd0InA68BngtcDOwFnA\n7/dh93OAe2L6XJHT8XVKmlly/3OAO3p5Yh/K7maQ77H6vf+tpVDxZXk/yy4sIkotwKuAZ4H3ddhm\nM+AfgZXAw8AZwKb5sYOAFcCJwOq8zdH5sVOBtcALwDPAMcAC4BsN+54DbARm5PsfAu7L298HHJnX\nHw38Z8Pz3gbcBDwJ3Ajs3/DY1cCngWvzfn4EvLrNaxuv/1831P+9wHzgbuAx4OSG7d8C/CSXuxI4\nE9gkP3ZNfi3P5XL/uGH/fwM8Anx9fF1+zq7A48De+f4OwK+AA1vU9T+A9cB/5f3vlt+/8/JzHgA+\n2bD90fl/cHp+HZ9u2t+78vuzNn8Gbm34TJwD/DLX/e8A5cfeAIwBT+Uyv93utbeo/wzgH4BHgV8A\nxwEbmt77O/PzfwF8JK/fAliTX/uz+fHXdXovCn724zMFFyDKxlpP8dmHAH8XKQBndNjm0/kfuW1e\nrgMWNgTIOlLgzsyB8Wtgq/z4AuC8hn01358z/ibnN/JpYLf82PbAHg0f1h/n29sATwBH5ecdke9v\n0xDg9+YP46x8/3MdAnwd8Mlc/z/PH9xv5vrsmT9cc/L2+wL7kVqUnUmt6fEN+9sI7NJi/58DNs31\nOQh4qGGbPwNuBzYHLgdO6/BeXA0c23D/PODiXNc5pC+lYxr+Z+tyIM0AZrXY34T3I6+7GDgbeAWw\nHXAD8OH82PnkLzzSF//b2r32FmX9H1IA7wBsDVzFxACfD8zNt9+eP0d7N/wfH2raX8f3okiAn1Zw\nYUQB3o8u+rbAYxGxscM2R5EC+vGIeBxYCPyvhsdfAP4uIjZExA9J3+K/2WN9NgD/XdIrImJ1RLQa\njHoPqdt/fkRsjIgLgGVMPKRYFBH3RcRa4EJg7w5lvkD6AtgAXED6UP9jRKyJiDtJH8o3A0TELRFx\nUyQPAf9C+vA1UovXtCAi1uX6TBAR55JarBtJX2qf6lDXlwpJg1MfAD6R6/og8CUmvjcrI+Ls/H96\nWdkt9vlaUqB9LCKej4jHSL23I/Im64A5kmZHxAsR8ZPmXXTY/R+T/q+/jIingM83PhgRP4yI5fn2\nfwKLSYHeUsH3oqNNCi6j0o8AfxzYrstI5g7AQw33H8zrXtxH0xfEGmDLyVYkItaQPrB/ATySR99b\nfVHskOvQ6EFgdsP9VZOoz+ORv9JJ3V9IrTgN67YEkPTGXK9HJD0FfJb0hdDJoxGxrss25wB7AWcW\n2HbcdqTPX/N70/h/mOzA1BxST+MRSU9IehL4J9JAKaRDmRnATZKWSjpmEvveoak+E95DSfMlXS/p\n8VzufDr8b3t8LybYvOAyKv0I8OtJx2B/0GGblaQ3ftwc0vFZL35N6k6Oe33jgxFxRUQcQjrGupv0\nrdzsl8DcpnU753oO2v8D7gLeEBFbk7r2nVot6D7w9hukVvJc4FRJWxesy2PkFrVh3Rwm/h+6DUw1\nP74CeB7YNiJeHRHbRMTWEfEmgIj4VUR8JCJmk7rcZ09i5PwRYKemugIgaTPgIuCLwGsiYhvgh7z0\nv231Onp5LybYtOAyKqUDPCKeIR2HnSXpvZI2l7RJ/jb9Qt7sAuBTkraTtB3wt8A3eixyCXCgpJ0k\nbQV8YvwBSa+VdLikLUgf3OdIx3XNLgPeKOkISTMlfQDYA7i0xzpNxiuBZyJijaTfIvU2Gq0iDZxN\nxleAmyLiI6TX9s9FnpR7TRcCn5W0paQ5wMeY3HuzGpgrSXmfq0hd4zMkvVLJrpIOBJD0fknjPYSn\nSO/P+HvU7bVfCBwvabakbYCTGh7bLC+PRcRGSfOBQ5rqua2kVzWs6/ZedDUduuhExOmkUfBPkbqm\nD5EGZr6fN/kM8FPgNuDn+fZnO+2yQ1lXAt/J+7qZiUE5I9djJal1OpAWb1pEPAEcBnw8b/dx4D0R\n8WS38gtqfn7j/Y8DH5T0DCkQL2ja9lTgvNy9fX+3giQdTvogH5dXnQjsI+nIgnU7nnQIcj/wY+Cb\nEbGoW7kNvktq9R6X9NO87mhSsN1JGrz8LqlHBWnk+sb8+r9PGtRanh87lc6v/WukQcTxz9D3XnxR\nEc/l1/JdSU+QjvkvaXj8buDbwP15/6+j+3vRVdVbcL106DiCwqVDSV3LGcC5EXHaEMo8lxTcq8e7\njcMgaUfSiPX2pBbraxHxlQGXOYsUtJuRGpKLImLhIMtsKn8GKRAfjojDh1TmctKZlI3AuojYb4Bl\nRdFvhCOAiJhU978fRnaJX37zv0o6zbYXcGTuJg3aolzmsK0HToyIvYD9gb8c9OvNo94HR8Q+pLMA\n8yUN7APfwgmkVnyYNgLzImKfQQb3uKq34KO8hnc/4N6IeDCP+l5AukBkoCLiWtKFDUMVEasiYkm+\n/RxpcGd252f1pdw1+eYsUis+lC5b7rG8mzS6P0xiiJ9rB3h7s5l4yuNhhvCBrwJJc0kt6o1DKGuG\npFtJA1hXRMTNgy4zO4N0SmzYx4ABXJF/F/HhQRc2HU6T2SRI2pJ0OueE3JIPVL5AZR9gR+CtkvYc\ndJmS3kMa41hCalGHeex5QETsS+o9/KWk3x1kYWVG0SXtKOkqSXfkawL+b15/gaRb8vKApFsannOy\npHsl3SXpkDa7nlC/UVlJOvc8bkeGcx56ZCRtQgrub0TEJd2276eIeEbS1cChDP64+ADgcEnvJjVg\nr5R0XkT86YDLJSIeyX8flXQx6VDw2kGVV7L7PT4usyR/8f9M0hURMX7VH5L+gXQ6EUl7AH9COqW7\nI3ClpDc2XGT1MqNswW8GdlP6Te9mpIHGHwyp7GG3KuP+FbgzIr48jMLydQdb5dubA+8kXZI7UBFx\nSkTsHBG7kt7Xq4YR3JK2yIEyfvHPIaRr9AemTAtecFzmT0jX70Mao7ogItbnU4v3kr7A2hpZgOfr\ntj9KuijiDlLFe0piMBmSzif98
GV3SQ9N8lLJMuUeAHwQ+D2lhAu35NOEg/R64GpJS0jH+5dHxGUD\nLnOUtgeuzWMONwCXRsTiQRbYr0G2VuMykt4OrIqI+/Oq5nGrlXQZtxplF52I+BG9/6ik1zKPGmZ5\nDeVeR/q12TDLXEr6xdTIRMQ1pJ+CDqOsB+j8o6C+60cAdRiXOZJ0cU7PRhrgZlNdu9b5prx0025c\nRim5xvuY+AW9konX4ncdtxrplWxmU5mkeKDgtrvQ+ko2pXRXj0XEiU3rDwVOioiDG9btCXwLeCup\na34F0HGQzS24WQllRtEbxmWW5nGDAE7Jh64foKl7HhF3SrqQdBZkHXBcp+AGt+BmPZMUq7pvBqRf\n2oziWnS34GYlbFo0gtYPtBpt9S3AJbkrYLUwmZZ2k+kS4JCyPkzWGDCvRJkLeyq1HyX3ajqVO4oy\ny5Y7uV/TbjrUE5+T5y66WQmFW/ARqXj1zKpt01mjrkFnIw/wudOu5OlU7ijKHHK5I4+gzkZevbnT\nruTpVO4oyhxyuSOPoM4qXj2ziqt4BFW8emYV51F0sxqreARVvHpmFVfxUfRCCR8kHSppmaR7JJ3U\n/Rlm00TFpzbpWnRD/vJ3kOb0ulnSJREx8NQ/ZpVX8T5wkRZ8JPnLzaaEmQWXESkS4NM2f7lZVyW6\n6C3SJh/f9PhfSdoo6dUN66ZM2mSzqa9cBLVKm7w4IpblmWHeScMc6L2kTS5SvcL5y8cabs9llFep\nmRW1PC89KhHgearlVfn2c5LG0yYv46WZYRpTib+YNhlYLmk8bXLbGXKKVO/F/OWkCdiPIGV7fJl5\nBXZmVi1zmdgUTTIBbJ9OkzWmTc5TQq+IiKXShJ+mzwaub7hfPm1yRGyQNJ6/fHya34HnLzebEvpw\nkNuYNhnYAJxC6p6XVqh6o8hfbjYltBkhH/sVjD3a/enNaZMl/TdSl+LnSs33jsAtedrnSU/35UE2\nszLaRNC8HdIybmH72eAmTGcVEbeTcjQCIOkBYN+IeFLSD4BvSTqd1DXfjS7p1x3gZmWUiKAuaZPH\nBXkevV7SJjvAzcoocRFLkems8gSOjfc/D3y+aBkOcLMyKh5BFa+eWcW9YtQV6MwBblaGEz6Y1VjF\nI6ji1TOruIpHUMWrZ1Zx7qKb1VjFI6iv1et9nrDexWmTm0uqX7RgBHMtPn/q8Mu0zqZTgJtNOxVP\nuugANyuj4hFU8eqZVVzFI6ji1TOrOI+im9VYxSOo4tUzq7iKR1ChmU3MrI0SedHbpU2W9H5Jt0va\nIGnfpuc4bbLZ0JT7NVnLtMnAUuAPgX9u3HhQaZPNrJ0BpE2OiP8AUFNKVXpIm+wuulkZfZq6qDFt\ncofNmmcZ6po2uWuASzpX0mpJt3Wvptk004fZRRvTJkfEc/2uXjeLgDOB8/pZsFkttImgsdvS0k1z\n2uQum68Edmq4Xz5tckRcm2c1MbNmbbrf8/ZJy7iF57fdw4S0yS00HoePp00+A6dNNhuCEqPo7dIm\n572eCWwH/JukJREx32mTzYZtcGmTv9/mOaNMmzzWcHsunl/Uqm85o5pddBiKVk9MPBZoY16JqpiN\nwlxKzS5a8QAvcprsfOAnwO6SHpJ0zOCrZTZF9OE02SAVGUU/ahgVMZuS/HNRsxqreARVvHpmFeec\nbGY1VvEIqnj1zCqu4hFU8eqZVVzFI6ji1TOrtvAoull9bah4BFW8embV5gA3q7G1szYruOULA61H\nOw5wsxI2zKz2QfiUD3Cd9PRIyo3fKvDbmz7TsuHP3mqdbSh5raqkc4HDgNUR8aa87s3AP5F+Fz7+\nu++f5sdOBo4lZWQ9ISIWd9q/ky6albCemYWWDhYB72pa90VgQUTsAywA/h5A0p68lDZ5PnB2i8yr\nEzjAzUrYwCaFlnYi4lrgyabVG4Gt8u2teSnv2uHktMkRsRwYT5vc1pTvopuNUtkuehsfAy6X9CVS\nHoa35fWzgesbtiufNtnM2tvAzELLJP0F6fh6Z1Kw/2uv9XMLblbCWlqfJrtp7HluGnu+190eHREn\nAETERZLOyev7nzbZzNprd3z92/O25Lfnbfni/bMWdjzb05wSbaWkgyLiGknvIB1rg9Mmmw1XH06T\nnU9KZritpIdIo+YfBr4iaSbwPPARAKdNNhuysgHeISXa77TZfpRpk82mly7nuEfOAW5WQqdz3FVQ\nJG3yjpKuknSHpKWSjh9GxcymggGdJuubIl8/64ETI2JJnub0Z5IWR8SyAdfNrPJeaHOarCqK5EVf\nBazKt5+TdBdpiN4BbtNerY7BJc0F9gZuHERlzKaaqh+DF65d7p5fRLqE7rnWW4013J6LJx+06ltO\nmckHR3l8XUShAJe0CSm4vxERl7Tfcl5fKmU2PHMpM/lgLQKcdLH7nRHx5UFWxmyqmfLH4JIOAD4I\nLJV0KxDAKRHxo0FXzqzqXqj43EVFRtGvo/JzKJqNRl266GbWwpTvoptZe7U5TWZmL1f1LrpTNpmV\nUPZadEnnSlot6baGdQskPSzplrwc2vDYyZLulXSXpEO61c8tuFkJfWjBFwFnAuc1rT89Ik5vXCFp\nD15Km7wjcKWkN3ZK+uAANythbcnTZBFxraQ5LR5qle/8veS0ycBySeNpk9teOu4uulkJA/y56Ecl\nLZF0jqTxHOmzgRUN2zhtstkgDSjAzwZ2jYi9Sb/k/FKv9XMX3ayEdufBHxhbwQNjK1o+1k1EPNpw\n92vApfm20yabDVO78+A7z9uFneft8uL9qxde33K7bELaZEmvy3kYAN4H3J5vT8e0yad332QARjHT\nZ9y/cOhlAmjX3lqi8i7tvsmIDSht8sGS9ibNUbYc+N/gtMlmQzegtMmLOmzvtMlmw9Ju6qKqcICb\nleBr0c1qrOrXojvAzUpwgJvVmH8PblZjPgY3qzF30c1qbMpPXSRpFvBjYLO8/UURMZpLqswqZsof\ng0fEWkkHR8QaSTOB6yT9MCI6XgNrNh3U4hg8Itbkm7Pyczpe/2o2XdTiGFzSDOBnwBuAsyLi5oHW\nymyKqEWAR8RGYB9JrwK+L2nPiLhzsFUzq74pfwzeKCKekXQ1cCjpJ2tNxhpuz8Wzi1r13ZOX3kz5\nY3BJ2wHrIuJpSZsD7wS+0Hrref2sm9kQ7J6XcZdN6tllT5NJOhc4DFgdEW/K674I/D6wFrgPOCYi\nnsmPnQwcC6wnTeW9uNP+i+Rkez1wtaQlpOyNl0fE5P4LZjW1npmFlg4WAe9qWrcY2CvnZLsXOBlA\n0p68lDZ5PnC2pFbZV19U5DTZUmDfbtuZTUdlu+it0iZHxJUNd28A/ijfPpxJpk2u9gGEWcUNYRT9\nWODb+fZsoDG5W9e0yQ5wsxLaBfizY7fw7NitpfYt6ZOk8a9vd924DQe4WQntAnyLeW9hi3lvefH+\nIwvbpllrSdKHgHcDv9ew2mmTzYap7NRFWXPa5EOBvwYOjIi1DdtNx7TJZqMzoLTJp5B+3H
VFHiS/\nISKOc9pksyFz2mSzGqvVpapmNtGUv1TVzNqrxa/JzKw1B3htHTD0ErXr8Cc8BLg+DhtJufvr30ZQ\n6nGT2nrtC1M8J5uZtbdhfbVDqNq1M6u4DevdRTerLQe4WY2tX+cAN6utjRuqHULVrp1Z1bmLblZj\nz1c7hKpdO7OqWz/qCnRWJOmimbWzvuDShqQTJC3Ny/F53TaSFku6W9LlkrbqtXoOcLMySgS4pL2A\nPwN+B9gbOEzSG4BPAFdGxG8CV5GzqvaicIBLmiHpFkk/6LUws9pZV3BpbQ/gxohYGxEbSLP4vo+U\nPfXreZuvA3/Qa/Um04KfQMvZTMymsQ0Fl9ZuB96eu+RbkHKw7QRsHxGrASJiFfDaXqtXKMAl7ZgL\nP6fXgsxqqUQXPSKWAacBV5CmVLmV1l8HPc/mW3QU/QxSErieD/bNaun5Nut/Pga3jXV9ekQsIqdo\nkvRZYAWwWtL2EbFa0uuAX/VavSJzk72HNG/SEknzaMj+aDbttRsh32teWsZ9c2HLzSS9JiIelbQz\n8IfA/wB2AT5Eat2PBi7ptXpFWvADgMMlvRvYHHilpPMi4k9fvulYw+25eHZRq77rmThZyCSVPw/+\nPUmv5qUsqc9IOg24UNKxwIOk+ch6UmRuslNIaVyRdBDwV62DGzy7qE09++dl3BmTe3rJAI+IA1us\newL4n+X2nPhKNrMy2p8Cq4RJBXhEXANcM6C6mE097U+BVYJbcLMyKn4tugPcrIx2p8kqwgFuVoZb\ncLMac4Cb1ZgD3KzG6nSazMya+DSZWY15FN2sxnwMblZjPgavq+tGUObmIygT9tfnR1Lu8th36GXO\nneyPoX0MblZjFe+iO6uqWRnl0yZvJem7ku6SdIektzptsllVlMuqCvBl4LKI2AN4M7CMUaRNNrMW\n1hZcWpD0KuDtOS8bEbE+Ip4G3ssI0iabWbNyXfRdgMckLcpzDvxLTp883LTJZtZGuS76JsC+wFkR\nsS/wa1L3vDlN8sDTJptZK+1Okz06Bo+NdXv2w8CKiPhpvv89UoAPL22ymXXQrvu9zby0jFv28rTJ\nOYBXSNo9Iu4B3gHckZcPMaS0yWbWTvnz4McD35K0KXA/cAwwk2GlTTazDkpeqhoRPwfe0uKh4aVN\nlrQceBrYCKyLiP36UbjZlNfmFFhVFG3BNwLzIuLJQVbGbMqp+KWqRQNc+JSa2ctV/NdkRYM2gCsk\n3Szpw4OskNmUUm5+8IEr2oIfEBGPSHoNKdDviohrB1kxsymhDl30iHgk/31U0sXAfkCLAB9ruD0X\nzy5qVXf92AvcMFain13xAFdE56vg8rWxMyLiOUm/ASwGFkbE4qbtAhYMrqbGqBI+wJtGUuryOHro\nZc7Vo0REobQPkoLdCl5F+gsV3m8/FWnBtwcuTgHMJsC3moPbbNqa6qfJIuIBYO8h1MVs6ql4F91X\nspmVUfHTZA5wszKcdNGsxtxFN6sxB7hZjVX8GNzXl5uVUSInm6RZkm6UdKukpZIW5PV1Spu83OUO\nxX0jKPO2EZSZrk6bCiJiLXBwROxDOhU9X9J+1Ctt8nKXOxT3j6DM0QR4qUtPhywi1uSbs0iHzIHT\nJpvVg6QZkm4FVgFXRMTN9DFtsgfZzEop11uIiI3APnkShIsl7UUf0yZ3/bFJ4R2la9XNprxJ/diE\nNW0e/XFexn2u634l/S1ph39OyqA0njb56jy10aT1LcDNppsU4E8X3HqrlwW4pO1IOQ6flrQ5cDnw\nBeAg4ImIOE3SScA2EfGJXuroLrpZKf9V5smvB74uaQZpPOw7EXGZpBvoU9pkt+BmPUot+IqCW+9U\n2d+Dm1lb1b5W1QFuVkq1z7k7wM1KcQtuVmNuwc1qrNQo+sA5wM1KcRfdrMbcRTerMbfgZjXmFtys\nxtyCm9WYW3CzGvNpMrMacwtuVmPVPgZ3TjazUtYVXFqTdKikZZLuyckd+sotuFkpvbfgOdHDV4F3\nAL8EbpZ0SUQs61PlHOBm5ZQ6Bt8PuDciHgSQdAEpZbID3KwaSh2Dz2ZiSpiHSUHfNw5ws1J8msys\nrh6EU+cU3HZ1i3UrgZ0b7u+Y1/WNky6ajYikmcDdpEG2R4CbgCMj4q5+leEW3GxEImKDpI8Ci0mn\nrM/tZ3CDW3CzWvOFLmY15gA3qzEHuFmNOcDNaswBblZjDnCzGnOAm9WYA9ysxv4/AA2FyxG6yD0A\nAAAASUVORK5CYII=\n",
215 | "text/plain": [
216 | ""
217 | ]
218 | },
219 | "metadata": {},
220 | "output_type": "display_data"
221 | }
222 | ],
223 | "source": [
224 | "import pylab as pl\n",
225 | "pl.matshow(test_cm)\n",
226 | "pl.title('Confusion matrix for test data')\n",
227 | "pl.colorbar()\n",
228 | "pl.show()"
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": 10,
234 | "metadata": {},
235 | "outputs": [],
236 | "source": [
237 | "# Compute a few more metrics of prediction quality\n",
238 | "# See [6],[7],[8],[9] for details"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": 12,
244 | "metadata": {},
245 | "outputs": [
246 | {
247 | "name": "stdout",
248 | "output_type": "stream",
249 | "text": [
250 | "Accuracy = 0.900337\n"
251 | ]
252 | }
253 | ],
254 | "source": [
255 | "# Accuracy\n",
256 | "print(\"Accuracy = %f\" %(skm.accuracy_score(test_truth,test_pred)))"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 13,
262 | "metadata": {},
263 | "outputs": [
264 | {
265 | "name": "stdout",
266 | "output_type": "stream",
267 | "text": [
268 | "Precision = 0.902996\n"
269 | ]
270 | },
271 | {
272 | "name": "stderr",
273 | "output_type": "stream",
274 | "text": [
275 | "C:\\Anaconda2\\lib\\site-packages\\sklearn\\metrics\\classification.py:1203: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring=\"f1_weighted\" instead of scoring=\"f1\".\n",
276 | " sample_weight=sample_weight)\n"
277 | ]
278 | }
279 | ],
280 | "source": [
281 | "# Precision\n",
282 | "print(\"Precision = %f\" %(skm.precision_score(test_truth,test_pred)))"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 14,
288 | "metadata": {},
289 | "outputs": [
290 | {
291 | "name": "stdout",
292 | "output_type": "stream",
293 | "text": [
294 | "Recall = 0.900337\n"
295 | ]
296 | },
297 | {
298 | "name": "stderr",
299 | "output_type": "stream",
300 | "text": [
301 | "C:\\Anaconda2\\lib\\site-packages\\sklearn\\metrics\\classification.py:1304: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring=\"f1_weighted\" instead of scoring=\"f1\".\n",
302 | " sample_weight=sample_weight)\n"
303 | ]
304 | }
305 | ],
306 | "source": [
307 | "# Recall\n",
308 | "print(\"Recall = %f\" %(skm.recall_score(test_truth,test_pred)))"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": 15,
314 | "metadata": {},
315 | "outputs": [
316 | {
317 | "name": "stdout",
318 | "output_type": "stream",
319 | "text": [
320 | "F1 score = 0.900621\n"
321 | ]
322 | },
323 | {
324 | "name": "stderr",
325 | "output_type": "stream",
326 | "text": [
327 | "C:\\Anaconda2\\lib\\site-packages\\sklearn\\metrics\\classification.py:756: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring=\"f1_weighted\" instead of scoring=\"f1\".\n",
328 | " sample_weight=sample_weight)\n"
329 | ]
330 | }
331 | ],
332 | "source": [
333 | "# F1 Score\n",
334 | "print(\"F1 score = %f\" %(skm.f1_score(test_truth,test_pred)))"
335 | ]
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "### References\n",
342 | "\n",
343 | "[1] Original dataset as R data https://spark-public.s3.amazonaws.com/dataanalysis/samsungData.rda \n",
344 | "[2] Human Activity Recognition Using Smartphones http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones \n",
345 | "[3] Android Developer Reference http://developer.android.com/reference/android/hardware/Sensor.html \n",
346 | "[4] Random Forests http://en.wikipedia.org/wiki/Random_forest \n",
347 | "[5] Confusion matrix http://en.wikipedia.org/wiki/Confusion_matrix\n",
348 | "[6] Mean Accuracy http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1054102&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1054102\n",
349 | "\n",
350 | "[7] Precision http://en.wikipedia.org/wiki/Precision_and_recall\n",
351 | "[8] Recall http://en.wikipedia.org/wiki/Precision_and_recall\n",
352 | "[9] F Measure http://en.wikipedia.org/wiki/Precision_and_recall"
353 | ]
354 | }
355 | ],
356 | "metadata": {
357 | "kernelspec": {
358 | "display_name": "Python 3",
359 | "language": "python",
360 | "name": "python3"
361 | },
362 | "language_info": {
363 | "codemirror_mode": {
364 | "name": "ipython",
365 | "version": 3
366 | },
367 | "file_extension": ".py",
368 | "mimetype": "text/x-python",
369 | "name": "python",
370 | "nbconvert_exporter": "python",
371 | "pygments_lexer": "ipython3",
372 | "version": "3.7.1"
373 | }
374 | },
375 | "nbformat": 4,
376 | "nbformat_minor": 1
377 | }
378 |
--------------------------------------------------------------------------------
/ch02/datas/data1.txt:
--------------------------------------------------------------------------------
1 | 34.62365962451697,78.0246928153624,0
2 | 30.28671076822607,43.89499752400101,0
3 | 35.84740876993872,72.90219802708364,0
4 | 60.18259938620976,86.30855209546826,1
5 | 79.0327360507101,75.3443764369103,1
6 | 45.08327747668339,56.3163717815305,0
7 | 61.10666453684766,96.51142588489624,1
8 | 75.02474556738889,46.55401354116538,1
9 | 76.09878670226257,87.42056971926803,1
10 | 84.43281996120035,43.53339331072109,1
11 | 95.86155507093572,38.22527805795094,0
12 | 75.01365838958247,30.60326323428011,0
13 | 82.30705337399482,76.48196330235604,1
14 | 69.36458875970939,97.71869196188608,1
15 | 39.53833914367223,76.03681085115882,0
16 | 53.9710521485623,89.20735013750205,1
17 | 69.07014406283025,52.74046973016765,1
18 | 67.94685547711617,46.67857410673128,0
19 | 70.66150955499435,92.92713789364831,1
20 | 76.97878372747498,47.57596364975532,1
21 | 67.37202754570876,42.83843832029179,0
22 | 89.67677575072079,65.79936592745237,1
23 | 50.534788289883,48.85581152764205,0
24 | 34.21206097786789,44.20952859866288,0
25 | 77.9240914545704,68.9723599933059,1
26 | 62.27101367004632,69.95445795447587,1
27 | 80.1901807509566,44.82162893218353,1
28 | 93.114388797442,38.80067033713209,0
29 | 61.83020602312595,50.25610789244621,0
30 | 38.78580379679423,64.99568095539578,0
31 | 61.379289447425,72.80788731317097,1
32 | 85.40451939411645,57.05198397627122,1
33 | 52.10797973193984,63.12762376881715,0
34 | 52.04540476831827,69.43286012045222,1
35 | 40.23689373545111,71.16774802184875,0
36 | 54.63510555424817,52.21388588061123,0
37 | 33.91550010906887,98.86943574220611,0
38 | 64.17698887494485,80.90806058670817,1
39 | 74.78925295941542,41.57341522824434,0
40 | 34.1836400264419,75.2377203360134,0
41 | 83.90239366249155,56.30804621605327,1
42 | 51.54772026906181,46.85629026349976,0
43 | 94.44336776917852,65.56892160559052,1
44 | 82.36875375713919,40.61825515970618,0
45 | 51.04775177128865,45.82270145776001,0
46 | 62.22267576120188,52.06099194836679,0
47 | 77.19303492601364,70.45820000180959,1
48 | 97.77159928000232,86.7278223300282,1
49 | 62.07306379667647,96.76882412413983,1
50 | 91.56497449807442,88.69629254546599,1
51 | 79.94481794066932,74.16311935043758,1
52 | 99.2725269292572,60.99903099844988,1
53 | 90.54671411399852,43.39060180650027,1
54 | 34.52451385320009,60.39634245837173,0
55 | 50.2864961189907,49.80453881323059,0
56 | 49.58667721632031,59.80895099453265,0
57 | 97.64563396007767,68.86157272420604,1
58 | 32.57720016809309,95.59854761387875,0
59 | 74.24869136721598,69.82457122657193,1
60 | 71.79646205863379,78.45356224515052,1
61 | 75.3956114656803,85.75993667331619,1
62 | 35.28611281526193,47.02051394723416,0
63 | 56.25381749711624,39.26147251058019,0
64 | 30.05882244669796,49.59297386723685,0
65 | 44.66826172480893,66.45008614558913,0
66 | 66.56089447242954,41.09209807936973,0
67 | 40.45755098375164,97.53518548909936,1
68 | 49.07256321908844,51.88321182073966,0
69 | 80.27957401466998,92.11606081344084,1
70 | 66.74671856944039,60.99139402740988,1
71 | 32.72283304060323,43.30717306430063,0
72 | 64.0393204150601,78.03168802018232,1
73 | 72.34649422579923,96.22759296761404,1
74 | 60.45788573918959,73.09499809758037,1
75 | 58.84095621726802,75.85844831279042,1
76 | 99.82785779692128,72.36925193383885,1
77 | 47.26426910848174,88.47586499559782,1
78 | 50.45815980285988,75.80985952982456,1
79 | 60.45555629271532,42.50840943572217,0
80 | 82.22666157785568,42.71987853716458,0
81 | 88.9138964166533,69.80378889835472,1
82 | 94.83450672430196,45.69430680250754,1
83 | 67.31925746917527,66.58935317747915,1
84 | 57.23870631569862,59.51428198012956,1
85 | 80.36675600171273,90.96014789746954,1
86 | 68.46852178591112,85.59430710452014,1
87 | 42.0754545384731,78.84478600148043,0
88 | 75.47770200533905,90.42453899753964,1
89 | 78.63542434898018,96.64742716885644,1
90 | 52.34800398794107,60.76950525602592,0
91 | 94.09433112516793,77.15910509073893,1
92 | 90.44855097096364,87.50879176484702,1
93 | 55.48216114069585,35.57070347228866,0
94 | 74.49269241843041,84.84513684930135,1
95 | 89.84580670720979,45.35828361091658,1
96 | 83.48916274498238,48.38028579728175,1
97 | 42.2617008099817,87.10385094025457,1
98 | 99.31500880510394,68.77540947206617,1
99 | 55.34001756003703,64.9319380069486,1
100 | 74.77589300092767,89.52981289513276,1
101 |
--------------------------------------------------------------------------------
/ch02/datas/data2.txt:
--------------------------------------------------------------------------------
1 | 0.051267,0.69956,1
2 | -0.092742,0.68494,1
3 | -0.21371,0.69225,1
4 | -0.375,0.50219,1
5 | -0.51325,0.46564,1
6 | -0.52477,0.2098,1
7 | -0.39804,0.034357,1
8 | -0.30588,-0.19225,1
9 | 0.016705,-0.40424,1
10 | 0.13191,-0.51389,1
11 | 0.38537,-0.56506,1
12 | 0.52938,-0.5212,1
13 | 0.63882,-0.24342,1
14 | 0.73675,-0.18494,1
15 | 0.54666,0.48757,1
16 | 0.322,0.5826,1
17 | 0.16647,0.53874,1
18 | -0.046659,0.81652,1
19 | -0.17339,0.69956,1
20 | -0.47869,0.63377,1
21 | -0.60541,0.59722,1
22 | -0.62846,0.33406,1
23 | -0.59389,0.005117,1
24 | -0.42108,-0.27266,1
25 | -0.11578,-0.39693,1
26 | 0.20104,-0.60161,1
27 | 0.46601,-0.53582,1
28 | 0.67339,-0.53582,1
29 | -0.13882,0.54605,1
30 | -0.29435,0.77997,1
31 | -0.26555,0.96272,1
32 | -0.16187,0.8019,1
33 | -0.17339,0.64839,1
34 | -0.28283,0.47295,1
35 | -0.36348,0.31213,1
36 | -0.30012,0.027047,1
37 | -0.23675,-0.21418,1
38 | -0.06394,-0.18494,1
39 | 0.062788,-0.16301,1
40 | 0.22984,-0.41155,1
41 | 0.2932,-0.2288,1
42 | 0.48329,-0.18494,1
43 | 0.64459,-0.14108,1
44 | 0.46025,0.012427,1
45 | 0.6273,0.15863,1
46 | 0.57546,0.26827,1
47 | 0.72523,0.44371,1
48 | 0.22408,0.52412,1
49 | 0.44297,0.67032,1
50 | 0.322,0.69225,1
51 | 0.13767,0.57529,1
52 | -0.0063364,0.39985,1
53 | -0.092742,0.55336,1
54 | -0.20795,0.35599,1
55 | -0.20795,0.17325,1
56 | -0.43836,0.21711,1
57 | -0.21947,-0.016813,1
58 | -0.13882,-0.27266,1
59 | 0.18376,0.93348,0
60 | 0.22408,0.77997,0
61 | 0.29896,0.61915,0
62 | 0.50634,0.75804,0
63 | 0.61578,0.7288,0
64 | 0.60426,0.59722,0
65 | 0.76555,0.50219,0
66 | 0.92684,0.3633,0
67 | 0.82316,0.27558,0
68 | 0.96141,0.085526,0
69 | 0.93836,0.012427,0
70 | 0.86348,-0.082602,0
71 | 0.89804,-0.20687,0
72 | 0.85196,-0.36769,0
73 | 0.82892,-0.5212,0
74 | 0.79435,-0.55775,0
75 | 0.59274,-0.7405,0
76 | 0.51786,-0.5943,0
77 | 0.46601,-0.41886,0
78 | 0.35081,-0.57968,0
79 | 0.28744,-0.76974,0
80 | 0.085829,-0.75512,0
81 | 0.14919,-0.57968,0
82 | -0.13306,-0.4481,0
83 | -0.40956,-0.41155,0
84 | -0.39228,-0.25804,0
85 | -0.74366,-0.25804,0
86 | -0.69758,0.041667,0
87 | -0.75518,0.2902,0
88 | -0.69758,0.68494,0
89 | -0.4038,0.70687,0
90 | -0.38076,0.91886,0
91 | -0.50749,0.90424,0
92 | -0.54781,0.70687,0
93 | 0.10311,0.77997,0
94 | 0.057028,0.91886,0
95 | -0.10426,0.99196,0
96 | -0.081221,1.1089,0
97 | 0.28744,1.087,0
98 | 0.39689,0.82383,0
99 | 0.63882,0.88962,0
100 | 0.82316,0.66301,0
101 | 0.67339,0.64108,0
102 | 1.0709,0.10015,0
103 | -0.046659,-0.57968,0
104 | -0.23675,-0.63816,0
105 | -0.15035,-0.36769,0
106 | -0.49021,-0.3019,0
107 | -0.46717,-0.13377,0
108 | -0.28859,-0.060673,0
109 | -0.61118,-0.067982,0
110 | -0.66302,-0.21418,0
111 | -0.59965,-0.41886,0
112 | -0.72638,-0.082602,0
113 | -0.83007,0.31213,0
114 | -0.72062,0.53874,0
115 | -0.59389,0.49488,0
116 | -0.48445,0.99927,0
117 | -0.0063364,0.99927,0
118 | 0.63265,-0.030612,0
119 |
--------------------------------------------------------------------------------
/ch02/datas/test.csv:
--------------------------------------------------------------------------------
1 | PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2 | 892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
3 | 893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
4 | 894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
5 | 895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
6 | 896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
7 | 897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
8 | 898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q
9 | 899,2,"Caldwell, Mr. Albert Francis",male,26,1,1,248738,29,,S
10 | 900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18,0,0,2657,7.2292,,C
11 | 901,3,"Davies, Mr. John Samuel",male,21,2,0,A/4 48871,24.15,,S
12 | 902,3,"Ilieff, Mr. Ylio",male,,0,0,349220,7.8958,,S
13 | 903,1,"Jones, Mr. Charles Cresson",male,46,0,0,694,26,,S
14 | 904,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23,1,0,21228,82.2667,B45,S
15 | 905,2,"Howard, Mr. Benjamin",male,63,1,0,24065,26,,S
16 | 906,1,"Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)",female,47,1,0,W.E.P. 5734,61.175,E31,S
17 | 907,2,"del Carlo, Mrs. Sebastiano (Argenia Genovesi)",female,24,1,0,SC/PARIS 2167,27.7208,,C
18 | 908,2,"Keane, Mr. Daniel",male,35,0,0,233734,12.35,,Q
19 | 909,3,"Assaf, Mr. Gerios",male,21,0,0,2692,7.225,,C
20 | 910,3,"Ilmakangas, Miss. Ida Livija",female,27,1,0,STON/O2. 3101270,7.925,,S
21 | 911,3,"Assaf Khalil, Mrs. Mariana (Miriam"")""",female,45,0,0,2696,7.225,,C
22 | 912,1,"Rothschild, Mr. Martin",male,55,1,0,PC 17603,59.4,,C
23 | 913,3,"Olsen, Master. Artur Karl",male,9,0,1,C 17368,3.1708,,S
24 | 914,1,"Flegenheim, Mrs. Alfred (Antoinette)",female,,0,0,PC 17598,31.6833,,S
25 | 915,1,"Williams, Mr. Richard Norris II",male,21,0,1,PC 17597,61.3792,,C
26 | 916,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48,1,3,PC 17608,262.375,B57 B59 B63 B66,C
27 | 917,3,"Robins, Mr. Alexander A",male,50,1,0,A/5. 3337,14.5,,S
28 | 918,1,"Ostby, Miss. Helene Ragnhild",female,22,0,1,113509,61.9792,B36,C
29 | 919,3,"Daher, Mr. Shedid",male,22.5,0,0,2698,7.225,,C
30 | 920,1,"Brady, Mr. John Bertram",male,41,0,0,113054,30.5,A21,S
31 | 921,3,"Samaan, Mr. Elias",male,,2,0,2662,21.6792,,C
32 | 922,2,"Louch, Mr. Charles Alexander",male,50,1,0,SC/AH 3085,26,,S
33 | 923,2,"Jefferys, Mr. Clifford Thomas",male,24,2,0,C.A. 31029,31.5,,S
34 | 924,3,"Dean, Mrs. Bertram (Eva Georgetta Light)",female,33,1,2,C.A. 2315,20.575,,S
35 | 925,3,"Johnston, Mrs. Andrew G (Elizabeth Lily"" Watson)""",female,,1,2,W./C. 6607,23.45,,S
36 | 926,1,"Mock, Mr. Philipp Edmund",male,30,1,0,13236,57.75,C78,C
37 | 927,3,"Katavelas, Mr. Vassilios (Catavelas Vassilios"")""",male,18.5,0,0,2682,7.2292,,C
38 | 928,3,"Roth, Miss. Sarah A",female,,0,0,342712,8.05,,S
39 | 929,3,"Cacic, Miss. Manda",female,21,0,0,315087,8.6625,,S
40 | 930,3,"Sap, Mr. Julius",male,25,0,0,345768,9.5,,S
41 | 931,3,"Hee, Mr. Ling",male,,0,0,1601,56.4958,,S
42 | 932,3,"Karun, Mr. Franz",male,39,0,1,349256,13.4167,,C
43 | 933,1,"Franklin, Mr. Thomas Parham",male,,0,0,113778,26.55,D34,S
44 | 934,3,"Goldsmith, Mr. Nathan",male,41,0,0,SOTON/O.Q. 3101263,7.85,,S
45 | 935,2,"Corbett, Mrs. Walter H (Irene Colvin)",female,30,0,0,237249,13,,S
46 | 936,1,"Kimball, Mrs. Edwin Nelson Jr (Gertrude Parsons)",female,45,1,0,11753,52.5542,D19,S
47 | 937,3,"Peltomaki, Mr. Nikolai Johannes",male,25,0,0,STON/O 2. 3101291,7.925,,S
48 | 938,1,"Chevre, Mr. Paul Romaine",male,45,0,0,PC 17594,29.7,A9,C
49 | 939,3,"Shaughnessy, Mr. Patrick",male,,0,0,370374,7.75,,Q
50 | 940,1,"Bucknell, Mrs. William Robert (Emma Eliza Ward)",female,60,0,0,11813,76.2917,D15,C
51 | 941,3,"Coutts, Mrs. William (Winnie Minnie"" Treanor)""",female,36,0,2,C.A. 37671,15.9,,S
52 | 942,1,"Smith, Mr. Lucien Philip",male,24,1,0,13695,60,C31,S
53 | 943,2,"Pulbaum, Mr. Franz",male,27,0,0,SC/PARIS 2168,15.0333,,C
54 | 944,2,"Hocking, Miss. Ellen Nellie""""",female,20,2,1,29105,23,,S
55 | 945,1,"Fortune, Miss. Ethel Flora",female,28,3,2,19950,263,C23 C25 C27,S
56 | 946,2,"Mangiavacchi, Mr. Serafino Emilio",male,,0,0,SC/A.3 2861,15.5792,,C
57 | 947,3,"Rice, Master. Albert",male,10,4,1,382652,29.125,,Q
58 | 948,3,"Cor, Mr. Bartol",male,35,0,0,349230,7.8958,,S
59 | 949,3,"Abelseth, Mr. Olaus Jorgensen",male,25,0,0,348122,7.65,F G63,S
60 | 950,3,"Davison, Mr. Thomas Henry",male,,1,0,386525,16.1,,S
61 | 951,1,"Chaudanson, Miss. Victorine",female,36,0,0,PC 17608,262.375,B61,C
62 | 952,3,"Dika, Mr. Mirko",male,17,0,0,349232,7.8958,,S
63 | 953,2,"McCrae, Mr. Arthur Gordon",male,32,0,0,237216,13.5,,S
64 | 954,3,"Bjorklund, Mr. Ernst Herbert",male,18,0,0,347090,7.75,,S
65 | 955,3,"Bradley, Miss. Bridget Delia",female,22,0,0,334914,7.725,,Q
66 | 956,1,"Ryerson, Master. John Borie",male,13,2,2,PC 17608,262.375,B57 B59 B63 B66,C
67 | 957,2,"Corey, Mrs. Percy C (Mary Phyllis Elizabeth Miller)",female,,0,0,F.C.C. 13534,21,,S
68 | 958,3,"Burns, Miss. Mary Delia",female,18,0,0,330963,7.8792,,Q
69 | 959,1,"Moore, Mr. Clarence Bloomfield",male,47,0,0,113796,42.4,,S
70 | 960,1,"Tucker, Mr. Gilbert Milligan Jr",male,31,0,0,2543,28.5375,C53,C
71 | 961,1,"Fortune, Mrs. Mark (Mary McDougald)",female,60,1,4,19950,263,C23 C25 C27,S
72 | 962,3,"Mulvihill, Miss. Bertha E",female,24,0,0,382653,7.75,,Q
73 | 963,3,"Minkoff, Mr. Lazar",male,21,0,0,349211,7.8958,,S
74 | 964,3,"Nieminen, Miss. Manta Josefina",female,29,0,0,3101297,7.925,,S
75 | 965,1,"Ovies y Rodriguez, Mr. Servando",male,28.5,0,0,PC 17562,27.7208,D43,C
76 | 966,1,"Geiger, Miss. Amalie",female,35,0,0,113503,211.5,C130,C
77 | 967,1,"Keeping, Mr. Edwin",male,32.5,0,0,113503,211.5,C132,C
78 | 968,3,"Miles, Mr. Frank",male,,0,0,359306,8.05,,S
79 | 969,1,"Cornell, Mrs. Robert Clifford (Malvina Helen Lamson)",female,55,2,0,11770,25.7,C101,S
80 | 970,2,"Aldworth, Mr. Charles Augustus",male,30,0,0,248744,13,,S
81 | 971,3,"Doyle, Miss. Elizabeth",female,24,0,0,368702,7.75,,Q
82 | 972,3,"Boulos, Master. Akar",male,6,1,1,2678,15.2458,,C
83 | 973,1,"Straus, Mr. Isidor",male,67,1,0,PC 17483,221.7792,C55 C57,S
84 | 974,1,"Case, Mr. Howard Brown",male,49,0,0,19924,26,,S
85 | 975,3,"Demetri, Mr. Marinko",male,,0,0,349238,7.8958,,S
86 | 976,2,"Lamb, Mr. John Joseph",male,,0,0,240261,10.7083,,Q
87 | 977,3,"Khalil, Mr. Betros",male,,1,0,2660,14.4542,,C
88 | 978,3,"Barry, Miss. Julia",female,27,0,0,330844,7.8792,,Q
89 | 979,3,"Badman, Miss. Emily Louisa",female,18,0,0,A/4 31416,8.05,,S
90 | 980,3,"O'Donoghue, Ms. Bridget",female,,0,0,364856,7.75,,Q
91 | 981,2,"Wells, Master. Ralph Lester",male,2,1,1,29103,23,,S
92 | 982,3,"Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judith Andersson)",female,22,1,0,347072,13.9,,S
93 | 983,3,"Pedersen, Mr. Olaf",male,,0,0,345498,7.775,,S
94 | 984,1,"Davidson, Mrs. Thornton (Orian Hays)",female,27,1,2,F.C. 12750,52,B71,S
95 | 985,3,"Guest, Mr. Robert",male,,0,0,376563,8.05,,S
96 | 986,1,"Birnbaum, Mr. Jakob",male,25,0,0,13905,26,,C
97 | 987,3,"Tenglin, Mr. Gunnar Isidor",male,25,0,0,350033,7.7958,,S
98 | 988,1,"Cavendish, Mrs. Tyrell William (Julia Florence Siegel)",female,76,1,0,19877,78.85,C46,S
99 | 989,3,"Makinen, Mr. Kalle Edvard",male,29,0,0,STON/O 2. 3101268,7.925,,S
100 | 990,3,"Braf, Miss. Elin Ester Maria",female,20,0,0,347471,7.8542,,S
101 | 991,3,"Nancarrow, Mr. William Henry",male,33,0,0,A./5. 3338,8.05,,S
102 | 992,1,"Stengel, Mrs. Charles Emil Henry (Annie May Morris)",female,43,1,0,11778,55.4417,C116,C
103 | 993,2,"Weisz, Mr. Leopold",male,27,1,0,228414,26,,S
104 | 994,3,"Foley, Mr. William",male,,0,0,365235,7.75,,Q
105 | 995,3,"Johansson Palmquist, Mr. Oskar Leander",male,26,0,0,347070,7.775,,S
106 | 996,3,"Thomas, Mrs. Alexander (Thamine Thelma"")""",female,16,1,1,2625,8.5167,,C
107 | 997,3,"Holthen, Mr. Johan Martin",male,28,0,0,C 4001,22.525,,S
108 | 998,3,"Buckley, Mr. Daniel",male,21,0,0,330920,7.8208,,Q
109 | 999,3,"Ryan, Mr. Edward",male,,0,0,383162,7.75,,Q
110 | 1000,3,"Willer, Mr. Aaron (Abi Weller"")""",male,,0,0,3410,8.7125,,S
111 | 1001,2,"Swane, Mr. George",male,18.5,0,0,248734,13,F,S
112 | 1002,2,"Stanton, Mr. Samuel Ward",male,41,0,0,237734,15.0458,,C
113 | 1003,3,"Shine, Miss. Ellen Natalia",female,,0,0,330968,7.7792,,Q
114 | 1004,1,"Evans, Miss. Edith Corse",female,36,0,0,PC 17531,31.6792,A29,C
115 | 1005,3,"Buckley, Miss. Katherine",female,18.5,0,0,329944,7.2833,,Q
116 | 1006,1,"Straus, Mrs. Isidor (Rosalie Ida Blun)",female,63,1,0,PC 17483,221.7792,C55 C57,S
117 | 1007,3,"Chronopoulos, Mr. Demetrios",male,18,1,0,2680,14.4542,,C
118 | 1008,3,"Thomas, Mr. John",male,,0,0,2681,6.4375,,C
119 | 1009,3,"Sandstrom, Miss. Beatrice Irene",female,1,1,1,PP 9549,16.7,G6,S
120 | 1010,1,"Beattie, Mr. Thomson",male,36,0,0,13050,75.2417,C6,C
121 | 1011,2,"Chapman, Mrs. John Henry (Sara Elizabeth Lawry)",female,29,1,0,SC/AH 29037,26,,S
122 | 1012,2,"Watt, Miss. Bertha J",female,12,0,0,C.A. 33595,15.75,,S
123 | 1013,3,"Kiernan, Mr. John",male,,1,0,367227,7.75,,Q
124 | 1014,1,"Schabert, Mrs. Paul (Emma Mock)",female,35,1,0,13236,57.75,C28,C
125 | 1015,3,"Carver, Mr. Alfred John",male,28,0,0,392095,7.25,,S
126 | 1016,3,"Kennedy, Mr. John",male,,0,0,368783,7.75,,Q
127 | 1017,3,"Cribb, Miss. Laura Alice",female,17,0,1,371362,16.1,,S
128 | 1018,3,"Brobeck, Mr. Karl Rudolf",male,22,0,0,350045,7.7958,,S
129 | 1019,3,"McCoy, Miss. Alicia",female,,2,0,367226,23.25,,Q
130 | 1020,2,"Bowenur, Mr. Solomon",male,42,0,0,211535,13,,S
131 | 1021,3,"Petersen, Mr. Marius",male,24,0,0,342441,8.05,,S
132 | 1022,3,"Spinner, Mr. Henry John",male,32,0,0,STON/OQ. 369943,8.05,,S
133 | 1023,1,"Gracie, Col. Archibald IV",male,53,0,0,113780,28.5,C51,C
134 | 1024,3,"Lefebre, Mrs. Frank (Frances)",female,,0,4,4133,25.4667,,S
135 | 1025,3,"Thomas, Mr. Charles P",male,,1,0,2621,6.4375,,C
136 | 1026,3,"Dintcheff, Mr. Valtcho",male,43,0,0,349226,7.8958,,S
137 | 1027,3,"Carlsson, Mr. Carl Robert",male,24,0,0,350409,7.8542,,S
138 | 1028,3,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C
139 | 1029,2,"Schmidt, Mr. August",male,26,0,0,248659,13,,S
140 | 1030,3,"Drapkin, Miss. Jennie",female,23,0,0,SOTON/OQ 392083,8.05,,S
141 | 1031,3,"Goodwin, Mr. Charles Frederick",male,40,1,6,CA 2144,46.9,,S
142 | 1032,3,"Goodwin, Miss. Jessie Allis",female,10,5,2,CA 2144,46.9,,S
143 | 1033,1,"Daniels, Miss. Sarah",female,33,0,0,113781,151.55,,S
144 | 1034,1,"Ryerson, Mr. Arthur Larned",male,61,1,3,PC 17608,262.375,B57 B59 B63 B66,C
145 | 1035,2,"Beauchamp, Mr. Henry James",male,28,0,0,244358,26,,S
146 | 1036,1,"Lindeberg-Lind, Mr. Erik Gustaf (Mr Edward Lingrey"")""",male,42,0,0,17475,26.55,,S
147 | 1037,3,"Vander Planke, Mr. Julius",male,31,3,0,345763,18,,S
148 | 1038,1,"Hilliard, Mr. Herbert Henry",male,,0,0,17463,51.8625,E46,S
149 | 1039,3,"Davies, Mr. Evan",male,22,0,0,SC/A4 23568,8.05,,S
150 | 1040,1,"Crafton, Mr. John Bertram",male,,0,0,113791,26.55,,S
151 | 1041,2,"Lahtinen, Rev. William",male,30,1,1,250651,26,,S
152 | 1042,1,"Earnshaw, Mrs. Boulton (Olive Potter)",female,23,0,1,11767,83.1583,C54,C
153 | 1043,3,"Matinoff, Mr. Nicola",male,,0,0,349255,7.8958,,C
154 | 1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S
155 | 1045,3,"Klasen, Mrs. (Hulda Kristina Eugenia Lofqvist)",female,36,0,2,350405,12.1833,,S
156 | 1046,3,"Asplund, Master. Filip Oscar",male,13,4,2,347077,31.3875,,S
157 | 1047,3,"Duquemin, Mr. Joseph",male,24,0,0,S.O./P.P. 752,7.55,,S
158 | 1048,1,"Bird, Miss. Ellen",female,29,0,0,PC 17483,221.7792,C97,S
159 | 1049,3,"Lundin, Miss. Olga Elida",female,23,0,0,347469,7.8542,,S
160 | 1050,1,"Borebank, Mr. John James",male,42,0,0,110489,26.55,D22,S
161 | 1051,3,"Peacock, Mrs. Benjamin (Edith Nile)",female,26,0,2,SOTON/O.Q. 3101315,13.775,,S
162 | 1052,3,"Smyth, Miss. Julia",female,,0,0,335432,7.7333,,Q
163 | 1053,3,"Touma, Master. Georges Youssef",male,7,1,1,2650,15.2458,,C
164 | 1054,2,"Wright, Miss. Marion",female,26,0,0,220844,13.5,,S
165 | 1055,3,"Pearce, Mr. Ernest",male,,0,0,343271,7,,S
166 | 1056,2,"Peruschitz, Rev. Joseph Maria",male,41,0,0,237393,13,,S
167 | 1057,3,"Kink-Heilmann, Mrs. Anton (Luise Heilmann)",female,26,1,1,315153,22.025,,S
168 | 1058,1,"Brandeis, Mr. Emil",male,48,0,0,PC 17591,50.4958,B10,C
169 | 1059,3,"Ford, Mr. Edward Watson",male,18,2,2,W./C. 6608,34.375,,S
170 | 1060,1,"Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genevieve Fosdick)",female,,0,0,17770,27.7208,,C
171 | 1061,3,"Hellstrom, Miss. Hilda Maria",female,22,0,0,7548,8.9625,,S
172 | 1062,3,"Lithman, Mr. Simon",male,,0,0,S.O./P.P. 251,7.55,,S
173 | 1063,3,"Zakarian, Mr. Ortin",male,27,0,0,2670,7.225,,C
174 | 1064,3,"Dyker, Mr. Adolf Fredrik",male,23,1,0,347072,13.9,,S
175 | 1065,3,"Torfa, Mr. Assad",male,,0,0,2673,7.2292,,C
176 | 1066,3,"Asplund, Mr. Carl Oscar Vilhelm Gustafsson",male,40,1,5,347077,31.3875,,S
177 | 1067,2,"Brown, Miss. Edith Eileen",female,15,0,2,29750,39,,S
178 | 1068,2,"Sincock, Miss. Maude",female,20,0,0,C.A. 33112,36.75,,S
179 | 1069,1,"Stengel, Mr. Charles Emil Henry",male,54,1,0,11778,55.4417,C116,C
180 | 1070,2,"Becker, Mrs. Allen Oliver (Nellie E Baumgardner)",female,36,0,3,230136,39,F4,S
181 | 1071,1,"Compton, Mrs. Alexander Taylor (Mary Eliza Ingersoll)",female,64,0,2,PC 17756,83.1583,E45,C
182 | 1072,2,"McCrie, Mr. James Matthew",male,30,0,0,233478,13,,S
183 | 1073,1,"Compton, Mr. Alexander Taylor Jr",male,37,1,1,PC 17756,83.1583,E52,C
184 | 1074,1,"Marvin, Mrs. Daniel Warner (Mary Graham Carmichael Farquarson)",female,18,1,0,113773,53.1,D30,S
185 | 1075,3,"Lane, Mr. Patrick",male,,0,0,7935,7.75,,Q
186 | 1076,1,"Douglas, Mrs. Frederick Charles (Mary Helene Baxter)",female,27,1,1,PC 17558,247.5208,B58 B60,C
187 | 1077,2,"Maybery, Mr. Frank Hubert",male,40,0,0,239059,16,,S
188 | 1078,2,"Phillips, Miss. Alice Frances Louisa",female,21,0,1,S.O./P.P. 2,21,,S
189 | 1079,3,"Davies, Mr. Joseph",male,17,2,0,A/4 48873,8.05,,S
190 | 1080,3,"Sage, Miss. Ada",female,,8,2,CA. 2343,69.55,,S
191 | 1081,2,"Veal, Mr. James",male,40,0,0,28221,13,,S
192 | 1082,2,"Angle, Mr. William A",male,34,1,0,226875,26,,S
193 | 1083,1,"Salomon, Mr. Abraham L",male,,0,0,111163,26,,S
194 | 1084,3,"van Billiard, Master. Walter John",male,11.5,1,1,A/5. 851,14.5,,S
195 | 1085,2,"Lingane, Mr. John",male,61,0,0,235509,12.35,,Q
196 | 1086,2,"Drew, Master. Marshall Brines",male,8,0,2,28220,32.5,,S
197 | 1087,3,"Karlsson, Mr. Julius Konrad Eugen",male,33,0,0,347465,7.8542,,S
198 | 1088,1,"Spedden, Master. Robert Douglas",male,6,0,2,16966,134.5,E34,C
199 | 1089,3,"Nilsson, Miss. Berta Olivia",female,18,0,0,347066,7.775,,S
200 | 1090,2,"Baimbrigge, Mr. Charles Robert",male,23,0,0,C.A. 31030,10.5,,S
201 | 1091,3,"Rasmussen, Mrs. (Lena Jacobsen Solvang)",female,,0,0,65305,8.1125,,S
202 | 1092,3,"Murphy, Miss. Nora",female,,0,0,36568,15.5,,Q
203 | 1093,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,347080,14.4,,S
204 | 1094,1,"Astor, Col. John Jacob",male,47,1,0,PC 17757,227.525,C62 C64,C
205 | 1095,2,"Quick, Miss. Winifred Vera",female,8,1,1,26360,26,,S
206 | 1096,2,"Andrew, Mr. Frank Thomas",male,25,0,0,C.A. 34050,10.5,,S
207 | 1097,1,"Omont, Mr. Alfred Fernand",male,,0,0,F.C. 12998,25.7417,,C
208 | 1098,3,"McGowan, Miss. Katherine",female,35,0,0,9232,7.75,,Q
209 | 1099,2,"Collett, Mr. Sidney C Stuart",male,24,0,0,28034,10.5,,S
210 | 1100,1,"Rosenbaum, Miss. Edith Louise",female,33,0,0,PC 17613,27.7208,A11,C
211 | 1101,3,"Delalic, Mr. Redjo",male,25,0,0,349250,7.8958,,S
212 | 1102,3,"Andersen, Mr. Albert Karvin",male,32,0,0,C 4001,22.525,,S
213 | 1103,3,"Finoli, Mr. Luigi",male,,0,0,SOTON/O.Q. 3101308,7.05,,S
214 | 1104,2,"Deacon, Mr. Percy William",male,17,0,0,S.O.C. 14879,73.5,,S
215 | 1105,2,"Howard, Mrs. Benjamin (Ellen Truelove Arman)",female,60,1,0,24065,26,,S
216 | 1106,3,"Andersson, Miss. Ida Augusta Margareta",female,38,4,2,347091,7.775,,S
217 | 1107,1,"Head, Mr. Christopher",male,42,0,0,113038,42.5,B11,S
218 | 1108,3,"Mahon, Miss. Bridget Delia",female,,0,0,330924,7.8792,,Q
219 | 1109,1,"Wick, Mr. George Dennick",male,57,1,1,36928,164.8667,,S
220 | 1110,1,"Widener, Mrs. George Dunton (Eleanor Elkins)",female,50,1,1,113503,211.5,C80,C
221 | 1111,3,"Thomson, Mr. Alexander Morrison",male,,0,0,32302,8.05,,S
222 | 1112,2,"Duran y More, Miss. Florentina",female,30,1,0,SC/PARIS 2148,13.8583,,C
223 | 1113,3,"Reynolds, Mr. Harold J",male,21,0,0,342684,8.05,,S
224 | 1114,2,"Cook, Mrs. (Selena Rogers)",female,22,0,0,W./C. 14266,10.5,F33,S
225 | 1115,3,"Karlsson, Mr. Einar Gervasius",male,21,0,0,350053,7.7958,,S
226 | 1116,1,"Candee, Mrs. Edward (Helen Churchill Hungerford)",female,53,0,0,PC 17606,27.4458,,C
227 | 1117,3,"Moubarek, Mrs. George (Omine Amenia"" Alexander)""",female,,0,2,2661,15.2458,,C
228 | 1118,3,"Asplund, Mr. Johan Charles",male,23,0,0,350054,7.7958,,S
229 | 1119,3,"McNeill, Miss. Bridget",female,,0,0,370368,7.75,,Q
230 | 1120,3,"Everett, Mr. Thomas James",male,40.5,0,0,C.A. 6212,15.1,,S
231 | 1121,2,"Hocking, Mr. Samuel James Metcalfe",male,36,0,0,242963,13,,S
232 | 1122,2,"Sweet, Mr. George Frederick",male,14,0,0,220845,65,,S
233 | 1123,1,"Willard, Miss. Constance",female,21,0,0,113795,26.55,,S
234 | 1124,3,"Wiklund, Mr. Karl Johan",male,21,1,0,3101266,6.4958,,S
235 | 1125,3,"Linehan, Mr. Michael",male,,0,0,330971,7.8792,,Q
236 | 1126,1,"Cumings, Mr. John Bradley",male,39,1,0,PC 17599,71.2833,C85,C
237 | 1127,3,"Vendel, Mr. Olof Edvin",male,20,0,0,350416,7.8542,,S
238 | 1128,1,"Warren, Mr. Frank Manley",male,64,1,0,110813,75.25,D37,C
239 | 1129,3,"Baccos, Mr. Raffull",male,20,0,0,2679,7.225,,C
240 | 1130,2,"Hiltunen, Miss. Marta",female,18,1,1,250650,13,,S
241 | 1131,1,"Douglas, Mrs. Walter Donald (Mahala Dutton)",female,48,1,0,PC 17761,106.425,C86,C
242 | 1132,1,"Lindstrom, Mrs. Carl Johan (Sigrid Posse)",female,55,0,0,112377,27.7208,,C
243 | 1133,2,"Christy, Mrs. (Alice Frances)",female,45,0,2,237789,30,,S
244 | 1134,1,"Spedden, Mr. Frederic Oakley",male,45,1,1,16966,134.5,E34,C
245 | 1135,3,"Hyman, Mr. Abraham",male,,0,0,3470,7.8875,,S
246 | 1136,3,"Johnston, Master. William Arthur Willie""""",male,,1,2,W./C. 6607,23.45,,S
247 | 1137,1,"Kenyon, Mr. Frederick R",male,41,1,0,17464,51.8625,D21,S
248 | 1138,2,"Karnes, Mrs. J Frank (Claire Bennett)",female,22,0,0,F.C.C. 13534,21,,S
249 | 1139,2,"Drew, Mr. James Vivian",male,42,1,1,28220,32.5,,S
250 | 1140,2,"Hold, Mrs. Stephen (Annie Margaret Hill)",female,29,1,0,26707,26,,S
251 | 1141,3,"Khalil, Mrs. Betros (Zahie Maria"" Elias)""",female,,1,0,2660,14.4542,,C
252 | 1142,2,"West, Miss. Barbara J",female,0.92,1,2,C.A. 34651,27.75,,S
253 | 1143,3,"Abrahamsson, Mr. Abraham August Johannes",male,20,0,0,SOTON/O2 3101284,7.925,,S
254 | 1144,1,"Clark, Mr. Walter Miller",male,27,1,0,13508,136.7792,C89,C
255 | 1145,3,"Salander, Mr. Karl Johan",male,24,0,0,7266,9.325,,S
256 | 1146,3,"Wenzel, Mr. Linhart",male,32.5,0,0,345775,9.5,,S
257 | 1147,3,"MacKay, Mr. George William",male,,0,0,C.A. 42795,7.55,,S
258 | 1148,3,"Mahon, Mr. John",male,,0,0,AQ/4 3130,7.75,,Q
259 | 1149,3,"Niklasson, Mr. Samuel",male,28,0,0,363611,8.05,,S
260 | 1150,2,"Bentham, Miss. Lilian W",female,19,0,0,28404,13,,S
261 | 1151,3,"Midtsjo, Mr. Karl Albert",male,21,0,0,345501,7.775,,S
262 | 1152,3,"de Messemaeker, Mr. Guillaume Joseph",male,36.5,1,0,345572,17.4,,S
263 | 1153,3,"Nilsson, Mr. August Ferdinand",male,21,0,0,350410,7.8542,,S
264 | 1154,2,"Wells, Mrs. Arthur Henry (Addie"" Dart Trevaskis)""",female,29,0,2,29103,23,,S
265 | 1155,3,"Klasen, Miss. Gertrud Emilia",female,1,1,1,350405,12.1833,,S
266 | 1156,2,"Portaluppi, Mr. Emilio Ilario Giuseppe",male,30,0,0,C.A. 34644,12.7375,,C
267 | 1157,3,"Lyntakoff, Mr. Stanko",male,,0,0,349235,7.8958,,S
268 | 1158,1,"Chisholm, Mr. Roderick Robert Crispin",male,,0,0,112051,0,,S
269 | 1159,3,"Warren, Mr. Charles William",male,,0,0,C.A. 49867,7.55,,S
270 | 1160,3,"Howard, Miss. May Elizabeth",female,,0,0,A. 2. 39186,8.05,,S
271 | 1161,3,"Pokrnic, Mr. Mate",male,17,0,0,315095,8.6625,,S
272 | 1162,1,"McCaffry, Mr. Thomas Francis",male,46,0,0,13050,75.2417,C6,C
273 | 1163,3,"Fox, Mr. Patrick",male,,0,0,368573,7.75,,Q
274 | 1164,1,"Clark, Mrs. Walter Miller (Virginia McDowell)",female,26,1,0,13508,136.7792,C89,C
275 | 1165,3,"Lennon, Miss. Mary",female,,1,0,370371,15.5,,Q
276 | 1166,3,"Saade, Mr. Jean Nassr",male,,0,0,2676,7.225,,C
277 | 1167,2,"Bryhl, Miss. Dagmar Jenny Ingeborg ",female,20,1,0,236853,26,,S
278 | 1168,2,"Parker, Mr. Clifford Richard",male,28,0,0,SC 14888,10.5,,S
279 | 1169,2,"Faunthorpe, Mr. Harry",male,40,1,0,2926,26,,S
280 | 1170,2,"Ware, Mr. John James",male,30,1,0,CA 31352,21,,S
281 | 1171,2,"Oxenham, Mr. Percy Thomas",male,22,0,0,W./C. 14260,10.5,,S
282 | 1172,3,"Oreskovic, Miss. Jelka",female,23,0,0,315085,8.6625,,S
283 | 1173,3,"Peacock, Master. Alfred Edward",male,0.75,1,1,SOTON/O.Q. 3101315,13.775,,S
284 | 1174,3,"Fleming, Miss. Honora",female,,0,0,364859,7.75,,Q
285 | 1175,3,"Touma, Miss. Maria Youssef",female,9,1,1,2650,15.2458,,C
286 | 1176,3,"Rosblom, Miss. Salli Helena",female,2,1,1,370129,20.2125,,S
287 | 1177,3,"Dennis, Mr. William",male,36,0,0,A/5 21175,7.25,,S
288 | 1178,3,"Franklin, Mr. Charles (Charles Fardon)",male,,0,0,SOTON/O.Q. 3101314,7.25,,S
289 | 1179,1,"Snyder, Mr. John Pillsbury",male,24,1,0,21228,82.2667,B45,S
290 | 1180,3,"Mardirosian, Mr. Sarkis",male,,0,0,2655,7.2292,F E46,C
291 | 1181,3,"Ford, Mr. Arthur",male,,0,0,A/5 1478,8.05,,S
292 | 1182,1,"Rheims, Mr. George Alexander Lucien",male,,0,0,PC 17607,39.6,,S
293 | 1183,3,"Daly, Miss. Margaret Marcella Maggie""""",female,30,0,0,382650,6.95,,Q
294 | 1184,3,"Nasr, Mr. Mustafa",male,,0,0,2652,7.2292,,C
295 | 1185,1,"Dodge, Dr. Washington",male,53,1,1,33638,81.8583,A34,S
296 | 1186,3,"Wittevrongel, Mr. Camille",male,36,0,0,345771,9.5,,S
297 | 1187,3,"Angheloff, Mr. Minko",male,26,0,0,349202,7.8958,,S
298 | 1188,2,"Laroche, Miss. Louise",female,1,1,2,SC/Paris 2123,41.5792,,C
299 | 1189,3,"Samaan, Mr. Hanna",male,,2,0,2662,21.6792,,C
300 | 1190,1,"Loring, Mr. Joseph Holland",male,30,0,0,113801,45.5,,S
301 | 1191,3,"Johansson, Mr. Nils",male,29,0,0,347467,7.8542,,S
302 | 1192,3,"Olsson, Mr. Oscar Wilhelm",male,32,0,0,347079,7.775,,S
303 | 1193,2,"Malachard, Mr. Noel",male,,0,0,237735,15.0458,D,C
304 | 1194,2,"Phillips, Mr. Escott Robert",male,43,0,1,S.O./P.P. 2,21,,S
305 | 1195,3,"Pokrnic, Mr. Tome",male,24,0,0,315092,8.6625,,S
306 | 1196,3,"McCarthy, Miss. Catherine Katie""""",female,,0,0,383123,7.75,,Q
307 | 1197,1,"Crosby, Mrs. Edward Gifford (Catherine Elizabeth Halstead)",female,64,1,1,112901,26.55,B26,S
308 | 1198,1,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.55,C22 C26,S
309 | 1199,3,"Aks, Master. Philip Frank",male,0.83,0,1,392091,9.35,,S
310 | 1200,1,"Hays, Mr. Charles Melville",male,55,1,1,12749,93.5,B69,S
311 | 1201,3,"Hansen, Mrs. Claus Peter (Jennie L Howard)",female,45,1,0,350026,14.1083,,S
312 | 1202,3,"Cacic, Mr. Jego Grga",male,18,0,0,315091,8.6625,,S
313 | 1203,3,"Vartanian, Mr. David",male,22,0,0,2658,7.225,,C
314 | 1204,3,"Sadowitz, Mr. Harry",male,,0,0,LP 1588,7.575,,S
315 | 1205,3,"Carr, Miss. Jeannie",female,37,0,0,368364,7.75,,Q
316 | 1206,1,"White, Mrs. John Stuart (Ella Holmes)",female,55,0,0,PC 17760,135.6333,C32,C
317 | 1207,3,"Hagardon, Miss. Kate",female,17,0,0,AQ/3. 30631,7.7333,,Q
318 | 1208,1,"Spencer, Mr. William Augustus",male,57,1,0,PC 17569,146.5208,B78,C
319 | 1209,2,"Rogers, Mr. Reginald Harry",male,19,0,0,28004,10.5,,S
320 | 1210,3,"Jonsson, Mr. Nils Hilding",male,27,0,0,350408,7.8542,,S
321 | 1211,2,"Jefferys, Mr. Ernest Wilfred",male,22,2,0,C.A. 31029,31.5,,S
322 | 1212,3,"Andersson, Mr. Johan Samuel",male,26,0,0,347075,7.775,,S
323 | 1213,3,"Krekorian, Mr. Neshan",male,25,0,0,2654,7.2292,F E57,C
324 | 1214,2,"Nesson, Mr. Israel",male,26,0,0,244368,13,F2,S
325 | 1215,1,"Rowe, Mr. Alfred G",male,33,0,0,113790,26.55,,S
326 | 1216,1,"Kreuchen, Miss. Emilie",female,39,0,0,24160,211.3375,,S
327 | 1217,3,"Assam, Mr. Ali",male,23,0,0,SOTON/O.Q. 3101309,7.05,,S
328 | 1218,2,"Becker, Miss. Ruth Elizabeth",female,12,2,1,230136,39,F4,S
329 | 1219,1,"Rosenshine, Mr. George (Mr George Thorne"")""",male,46,0,0,PC 17585,79.2,,C
330 | 1220,2,"Clarke, Mr. Charles Valentine",male,29,1,0,2003,26,,S
331 | 1221,2,"Enander, Mr. Ingvar",male,21,0,0,236854,13,,S
332 | 1222,2,"Davies, Mrs. John Morgan (Elizabeth Agnes Mary White) ",female,48,0,2,C.A. 33112,36.75,,S
333 | 1223,1,"Dulles, Mr. William Crothers",male,39,0,0,PC 17580,29.7,A18,C
334 | 1224,3,"Thomas, Mr. Tannous",male,,0,0,2684,7.225,,C
335 | 1225,3,"Nakid, Mrs. Said (Waika Mary"" Mowad)""",female,19,1,1,2653,15.7417,,C
336 | 1226,3,"Cor, Mr. Ivan",male,27,0,0,349229,7.8958,,S
337 | 1227,1,"Maguire, Mr. John Edward",male,30,0,0,110469,26,C106,S
338 | 1228,2,"de Brito, Mr. Jose Joaquim",male,32,0,0,244360,13,,S
339 | 1229,3,"Elias, Mr. Joseph",male,39,0,2,2675,7.2292,,C
340 | 1230,2,"Denbury, Mr. Herbert",male,25,0,0,C.A. 31029,31.5,,S
341 | 1231,3,"Betros, Master. Seman",male,,0,0,2622,7.2292,,C
342 | 1232,2,"Fillbrook, Mr. Joseph Charles",male,18,0,0,C.A. 15185,10.5,,S
343 | 1233,3,"Lundstrom, Mr. Thure Edvin",male,32,0,0,350403,7.5792,,S
344 | 1234,3,"Sage, Mr. John George",male,,1,9,CA. 2343,69.55,,S
345 | 1235,1,"Cardeza, Mrs. James Warburton Martinez (Charlotte Wardle Drake)",female,58,0,1,PC 17755,512.3292,B51 B53 B55,C
346 | 1236,3,"van Billiard, Master. James William",male,,1,1,A/5. 851,14.5,,S
347 | 1237,3,"Abelseth, Miss. Karen Marie",female,16,0,0,348125,7.65,,S
348 | 1238,2,"Botsford, Mr. William Hull",male,26,0,0,237670,13,,S
349 | 1239,3,"Whabee, Mrs. George Joseph (Shawneene Abi-Saab)",female,38,0,0,2688,7.2292,,C
350 | 1240,2,"Giles, Mr. Ralph",male,24,0,0,248726,13.5,,S
351 | 1241,2,"Walcroft, Miss. Nellie",female,31,0,0,F.C.C. 13528,21,,S
352 | 1242,1,"Greenfield, Mrs. Leo David (Blanche Strouse)",female,45,0,1,PC 17759,63.3583,D10 D12,C
353 | 1243,2,"Stokes, Mr. Philip Joseph",male,25,0,0,F.C.C. 13540,10.5,,S
354 | 1244,2,"Dibden, Mr. William",male,18,0,0,S.O.C. 14879,73.5,,S
355 | 1245,2,"Herman, Mr. Samuel",male,49,1,2,220845,65,,S
356 | 1246,3,"Dean, Miss. Elizabeth Gladys Millvina""""",female,0.17,1,2,C.A. 2315,20.575,,S
357 | 1247,1,"Julian, Mr. Henry Forbes",male,50,0,0,113044,26,E60,S
358 | 1248,1,"Brown, Mrs. John Murray (Caroline Lane Lamson)",female,59,2,0,11769,51.4792,C101,S
359 | 1249,3,"Lockyer, Mr. Edward",male,,0,0,1222,7.8792,,S
360 | 1250,3,"O'Keefe, Mr. Patrick",male,,0,0,368402,7.75,,Q
361 | 1251,3,"Lindell, Mrs. Edvard Bengtsson (Elin Gerda Persson)",female,30,1,0,349910,15.55,,S
362 | 1252,3,"Sage, Master. William Henry",male,14.5,8,2,CA. 2343,69.55,,S
363 | 1253,2,"Mallet, Mrs. Albert (Antoinette Magnin)",female,24,1,1,S.C./PARIS 2079,37.0042,,C
364 | 1254,2,"Ware, Mrs. John James (Florence Louise Long)",female,31,0,0,CA 31352,21,,S
365 | 1255,3,"Strilic, Mr. Ivan",male,27,0,0,315083,8.6625,,S
366 | 1256,1,"Harder, Mrs. George Achilles (Dorothy Annan)",female,25,1,0,11765,55.4417,E50,C
367 | 1257,3,"Sage, Mrs. John (Annie Bullen)",female,,1,9,CA. 2343,69.55,,S
368 | 1258,3,"Caram, Mr. Joseph",male,,1,0,2689,14.4583,,C
369 | 1259,3,"Riihivouri, Miss. Susanna Juhantytar Sanni""""",female,22,0,0,3101295,39.6875,,S
370 | 1260,1,"Gibson, Mrs. Leonard (Pauline C Boeson)",female,45,0,1,112378,59.4,,C
371 | 1261,2,"Pallas y Castello, Mr. Emilio",male,29,0,0,SC/PARIS 2147,13.8583,,C
372 | 1262,2,"Giles, Mr. Edgar",male,21,1,0,28133,11.5,,S
373 | 1263,1,"Wilson, Miss. Helen Alice",female,31,0,0,16966,134.5,E39 E41,C
374 | 1264,1,"Ismay, Mr. Joseph Bruce",male,49,0,0,112058,0,B52 B54 B56,S
375 | 1265,2,"Harbeck, Mr. William H",male,44,0,0,248746,13,,S
376 | 1266,1,"Dodge, Mrs. Washington (Ruth Vidaver)",female,54,1,1,33638,81.8583,A34,S
377 | 1267,1,"Bowen, Miss. Grace Scott",female,45,0,0,PC 17608,262.375,,C
378 | 1268,3,"Kink, Miss. Maria",female,22,2,0,315152,8.6625,,S
379 | 1269,2,"Cotterill, Mr. Henry Harry""""",male,21,0,0,29107,11.5,,S
380 | 1270,1,"Hipkins, Mr. William Edward",male,55,0,0,680,50,C39,S
381 | 1271,3,"Asplund, Master. Carl Edgar",male,5,4,2,347077,31.3875,,S
382 | 1272,3,"O'Connor, Mr. Patrick",male,,0,0,366713,7.75,,Q
383 | 1273,3,"Foley, Mr. Joseph",male,26,0,0,330910,7.8792,,Q
384 | 1274,3,"Risien, Mrs. Samuel (Emma)",female,,0,0,364498,14.5,,S
385 | 1275,3,"McNamee, Mrs. Neal (Eileen O'Leary)",female,19,1,0,376566,16.1,,S
386 | 1276,2,"Wheeler, Mr. Edwin Frederick""""",male,,0,0,SC/PARIS 2159,12.875,,S
387 | 1277,2,"Herman, Miss. Kate",female,24,1,2,220845,65,,S
388 | 1278,3,"Aronsson, Mr. Ernst Axel Algot",male,24,0,0,349911,7.775,,S
389 | 1279,2,"Ashby, Mr. John",male,57,0,0,244346,13,,S
390 | 1280,3,"Canavan, Mr. Patrick",male,21,0,0,364858,7.75,,Q
391 | 1281,3,"Palsson, Master. Paul Folke",male,6,3,1,349909,21.075,,S
392 | 1282,1,"Payne, Mr. Vivian Ponsonby",male,23,0,0,12749,93.5,B24,S
393 | 1283,1,"Lines, Mrs. Ernest H (Elizabeth Lindsey James)",female,51,0,1,PC 17592,39.4,D28,S
394 | 1284,3,"Abbott, Master. Eugene Joseph",male,13,0,2,C.A. 2673,20.25,,S
395 | 1285,2,"Gilbert, Mr. William",male,47,0,0,C.A. 30769,10.5,,S
396 | 1286,3,"Kink-Heilmann, Mr. Anton",male,29,3,1,315153,22.025,,S
397 | 1287,1,"Smith, Mrs. Lucien Philip (Mary Eloise Hughes)",female,18,1,0,13695,60,C31,S
398 | 1288,3,"Colbert, Mr. Patrick",male,24,0,0,371109,7.25,,Q
399 | 1289,1,"Frolicher-Stehli, Mrs. Maxmillian (Margaretha Emerentia Stehli)",female,48,1,1,13567,79.2,B41,C
400 | 1290,3,"Larsson-Rondberg, Mr. Edvard A",male,22,0,0,347065,7.775,,S
401 | 1291,3,"Conlon, Mr. Thomas Henry",male,31,0,0,21332,7.7333,,Q
402 | 1292,1,"Bonnell, Miss. Caroline",female,30,0,0,36928,164.8667,C7,S
403 | 1293,2,"Gale, Mr. Harry",male,38,1,0,28664,21,,S
404 | 1294,1,"Gibson, Miss. Dorothy Winifred",female,22,0,1,112378,59.4,,C
405 | 1295,1,"Carrau, Mr. Jose Pedro",male,17,0,0,113059,47.1,,S
406 | 1296,1,"Frauenthal, Mr. Isaac Gerald",male,43,1,0,17765,27.7208,D40,C
407 | 1297,2,"Nourney, Mr. Alfred (Baron von Drachstedt"")""",male,20,0,0,SC/PARIS 2166,13.8625,D38,C
408 | 1298,2,"Ware, Mr. William Jeffery",male,23,1,0,28666,10.5,,S
409 | 1299,1,"Widener, Mr. George Dunton",male,50,1,1,113503,211.5,C80,C
410 | 1300,3,"Riordan, Miss. Johanna Hannah""""",female,,0,0,334915,7.7208,,Q
411 | 1301,3,"Peacock, Miss. Treasteall",female,3,1,1,SOTON/O.Q. 3101315,13.775,,S
412 | 1302,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.75,,Q
413 | 1303,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37,1,0,19928,90,C78,Q
414 | 1304,3,"Henriksson, Miss. Jenny Lovisa",female,28,0,0,347086,7.775,,S
415 | 1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S
416 | 1306,1,"Oliva y Ocana, Dona. Fermina",female,39,0,0,PC 17758,108.9,C105,C
417 | 1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S
418 | 1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S
419 | 1309,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C
420 |
--------------------------------------------------------------------------------
/ch02/images/1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/ch02/images/1.jpg
--------------------------------------------------------------------------------
/ch02/images/2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/ch02/images/2.jpg
--------------------------------------------------------------------------------
/ch02/images/3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/ch02/images/3.jpg
--------------------------------------------------------------------------------
/ch02/images/4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/ch02/images/4.jpg
--------------------------------------------------------------------------------
/ch02/images/5.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/ch02/images/5.jpg
--------------------------------------------------------------------------------
/ch02/images/Kmeans.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/ch02/images/Kmeans.png
--------------------------------------------------------------------------------
/ch02/images/sklearn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simbafl/Data-analysis/1345a34149b784f6a2006ed7b996f6f780cc4991/ch02/images/sklearn.png
--------------------------------------------------------------------------------
/ch02/readme.md:
--------------------------------------------------------------------------------
1 | ## Supervised Learning
2 | #### Classification
3 | [Random Forest](RandomForest.ipynb) `KNN` `Naive Bayes` [Decision Tree](RandomForest.ipynb) [SVM](SVM.ipynb) `GBDT` `AdaBoost` ...
4 | #### Regression
5 | `Linear Regression` [Logistic Regression](logistic_regression.ipynb)
6 |
7 | ## Unsupervised Learning
8 | #### Clustering
9 | [K-means](K-means.py) `DBSCAN` `Hierarchical Clustering` [Association Rules](Ass_rule.py) `Sequence Rules` ...
10 |
11 | ## Semi-supervised Learning
12 | `Label Propagation`
13 |
14 | ### A typical machine learning workflow: problem modeling, feature engineering, model selection, model ensembling, model deployment
15 | #### 1. Problem modeling
16 | Solving a machine learning problem starts with modeling it: collect background material, understand the problem thoroughly, and abstract it into something a machine can predict.
17 | In this step, define the business metric and the model's prediction target, and choose a suitable evaluation metric for that target. Then select the most relevant subset of samples
18 | from the raw data for training, split it into training and test sets, and use cross-validation to select and evaluate models (a code sketch is given at the end of this file).
19 | 
20 | #### 2. Feature engineering
21 | With the problem modeled and the data filtered and cleaned, the next step is to extract features from the data, i.e. feature engineering. This requires a solid understanding of the algorithms the features will feed into.
22 | 
23 | #### 3. Common models (see below for more on choosing a model)
24 | The purpose of feature engineering is to feed features into a model so it can learn patterns from the data. There are many models, they differ greatly from one another,
25 | they suit different scenarios, and they handle different kinds of features.
26 | 
27 | #### 4. Model ensembling
28 | Exploit the differences between models by combining them, which can further improve the target metric (see the sketch at the end of this file).
29 | 
30 | #### 5. User profiling
31 | 
32 | #### With so many models implemented, how do you choose? The scikit-learn website offers a rough roadmap
33 | 
34 |
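35 | **Workflow sketch** (referenced in steps 1 and 4 above). A minimal, self-contained example of the hold-out split, cross-validated model selection, and a simple voting ensemble; it uses a synthetic dataset and arbitrary model choices purely for illustration and is not tied to any notebook in this repo.
36 |
37 | ```python
38 | from sklearn.datasets import make_classification
39 | from sklearn.model_selection import train_test_split, cross_val_score
40 | from sklearn.linear_model import LogisticRegression
41 | from sklearn.neighbors import KNeighborsClassifier
42 | from sklearn.ensemble import RandomForestClassifier, VotingClassifier
43 |
44 | # Synthetic stand-in for a cleaned feature matrix X and label vector y
45 | X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
46 |
47 | # Step 1: hold out a test set, then use cross-validation on the training part
48 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
49 |
50 | candidates = {
51 |     'LogisticRegression': LogisticRegression(max_iter=1000),
52 |     'KNN': KNeighborsClassifier(n_neighbors=3),
53 |     'RandomForest': RandomForestClassifier(n_estimators=100, random_state=0),
54 | }
55 | for name, model in candidates.items():
56 |     scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
57 |     print(name, 'CV F1: %.3f +/- %.3f' % (scores.mean(), scores.std()))
58 |
59 | # Step 4: combine the candidate models with soft voting
60 | ensemble = VotingClassifier(estimators=list(candidates.items()), voting='soft')
61 | ensemble.fit(X_train, y_train)
62 | print('Ensemble test accuracy:', ensemble.score(X_test, y_test))
63 | ```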
--------------------------------------------------------------------------------
/ch03/readme.md:
--------------------------------------------------------------------------------
1 |
2 | ### 1. Classification model evaluation
3 | #### Binary classification
4 | - Accuracy: (TP+TN)/(TP+TN+FN+FP)
5 | - Recall: TP/(TP+FN)
6 | - F-score: 2 × Precision × Recall / (Precision + Recall)
7 | - Precision: TP/(TP+FP)
8 | - False acceptance rate: FP/(FP+TN)
9 | - False rejection rate: FN/(TP+FN)
10 | - TP+TN+FP+FN = total number of samples
11 | #### Multi-class confusion matrix
12 | - Accuracy: (TP+TN)/(TP+TN+FN+FP)
13 | - Recall and F-score: computed differently than in the binary case (typically averaged per class)
14 | ##### Other classification evaluation tools include ROC and AUC, gain charts and KS charts (a code sketch follows at the end of this file)
15 | ### 2. Regression model evaluation
16 | - MAE: mean of the absolute residuals
17 | - MSE: mean of the squared residuals
18 | - RMSE: square root of MSE
19 | - Coefficient of determination R2 = SSR/SST (explained sum of squares over total sum of squares)
20 | ### 3. Clustering model evaluation
21 | - RMS
22 | - Silhouette coefficient
23 | ### 4. Association rule evaluation
24 | - Support
25 | - Confidence
26 | - Lift
27 |
28 |
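29 | Below is a minimal sketch of how the metrics above can be computed with scikit-learn; the arrays are made-up toy values used only for illustration.
30 |
31 | ```python
32 | import numpy as np
33 | from sklearn.metrics import (accuracy_score, precision_score, recall_score,
34 |                              f1_score, roc_auc_score, confusion_matrix,
35 |                              mean_absolute_error, mean_squared_error, r2_score,
36 |                              silhouette_score)
37 |
38 | # --- Binary classification ---
39 | y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
40 | y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
41 | y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # predicted probabilities
42 | tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
43 | print('Accuracy :', accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+TN+FN+FP)
44 | print('Recall   :', recall_score(y_true, y_pred))      # TP/(TP+FN)
45 | print('Precision:', precision_score(y_true, y_pred))   # TP/(TP+FP)
46 | print('F-score  :', f1_score(y_true, y_pred))          # 2*P*R/(P+R)
47 | print('FAR      :', fp / (fp + tn))                    # false acceptance rate
48 | print('FRR      :', fn / (tp + fn))                    # false rejection rate
49 | print('ROC AUC  :', roc_auc_score(y_true, y_score))
50 |
51 | # --- Regression ---
52 | y_true_r = np.array([3.0, 5.0, 2.5, 7.0])
53 | y_pred_r = np.array([2.8, 5.3, 2.9, 6.4])
54 | mse = mean_squared_error(y_true_r, y_pred_r)
55 | print('MAE :', mean_absolute_error(y_true_r, y_pred_r))
56 | print('MSE :', mse)
57 | print('RMSE:', np.sqrt(mse))
58 | print('R2  :', r2_score(y_true_r, y_pred_r))
59 |
60 | # --- Clustering: silhouette coefficient ---
61 | X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
62 | labels = np.array([0, 0, 1, 1])
63 | print('Silhouette:', silhouette_score(X, labels))
64 | ```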
--------------------------------------------------------------------------------
/kaggle/readme.md:
--------------------------------------------------------------------------------
1 | ## Kaggle competition projects
2 |
3 | #### Urban bike-sharing system
4 | The data record trip duration, departure location, arrival location and elapsed time, so a bike-share system can serve as a sensor network for studying how traffic moves through a city. In this competition, participants are asked to combine historical usage patterns with weather data to forecast bike rental demand in the Washington, D.C. area. Two years of hourly rental data are provided: the training set consists of the first 19 days of each month, while the test set runs from the 20th to the end of each month. Participants must predict the total number of bikes rented during each hour covered by the test set (a minimal modeling sketch follows below).
5 |
6 | #### Employee attrition analysis
7 | A partial extract of employee records from an HR department. The data are simulated from a real company's situation and are used here to analyse employee attrition.
/kaggle/员工离职率特征工程与建模.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## HR analysis\n",
8 | "### 标注是left,离散值,所以属于分类问题\n",
9 | "基于对HR.csv的特征探索分析基础,我们这里直接建模"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 1,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "import numpy as np\n",
20 | "import seaborn as sns\n",
21 | "import scipy.stats as ss\n",
22 | "import matplotlib.pyplot as plt\n",
23 | "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
24 | "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
25 | "from sklearn.decomposition import PCA\n",
26 | "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
27 | "# 验证数据的准确率,召回率,F值\n",
28 | "from sklearn.metrics import accuracy_score, recall_score, f1_score\n",
29 | "%matplotlib inline"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "### 特征处理"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 2,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "def hr_preprocessing(sl=False, le=False, npr=False, amh=False, tsc=False, wa=False, \n",
46 | " pl=False, dep=False, sal=False,lower_d=False, ld_n=1):\n",
47 | " def map_salary(s):\n",
48 | " d = dict([('low', 0), ('medium', 1), ('high', 2)])\n",
49 | " return d.get(s, 0)\n",
50 | " \n",
51 | " df = pd.read_csv(\"HR.csv\")\n",
52 | " \n",
53 | " # 清洗数据\n",
54 | " df = df.dropna(subset=['satisfaction_level', 'last_evaluation'])\n",
55 | " df = df[df['satisfaction_level']<=1][df['salary']!='nme']\n",
56 | " # 标注\n",
57 | " label = df['left']\n",
58 | " df=df.drop('left', axis=1)\n",
59 | " # 特征选择\n",
60 | " # 特征处理\n",
61 | " scaler_lst = [sl, le, npr, amh, tsc, wa, pl]\n",
62 | " column_lst = ['satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', \n",
63 | " 'time_spend_company', 'Work_accident', 'promotion_last_5years']\n",
64 | " for i in range(len(scaler_lst)):\n",
65 | " if not scaler_lst[i]:\n",
66 | " df[column_lst[i]] = MinMaxScaler().fit_transform(df[column_lst[i]].values.reshape(-1, 1)).reshape(1, -1)[0]\n",
67 | " else:\n",
68 | " df[column_lst[i]] = StandardScaler().fit_transform(df[column_lst[i]].values.reshape(-1, 1)).reshape(1, -1)[0]\n",
69 | " # 对离散值处理\n",
70 | " scaler_lst2 = [sal, dep]\n",
71 | " column_lst2 = ['salary', 'department']\n",
72 | " for i in range(len(scaler_lst2)):\n",
73 | " if not scaler_lst2[i]:\n",
74 | " if column_lst2[i] == 'salary':\n",
75 | " df[column_lst2[i]] = [map_salary(s) for s in df['salary'].values]\n",
76 | " else:\n",
77 | " df[column_lst2[i]] = LabelEncoder().fit_transform(df[column_lst2[i]].values.reshape(-1, 1))\n",
78 | " # 统一归一化处理\n",
79 | " df[column_lst2[i]] = MinMaxScaler().fit_transform(df[column_lst2[i]].values.reshape(-1, 1))\n",
80 | " else:\n",
81 | " df = pd.get_dummies(df, columns=[column_lst2[i]])\n",
82 | " if lower_d:\n",
83 | " # 如果为True,使用PCA降维\n",
84 | " return PCA(n_components=ld_n).fit_transform(df.values), label\n",
85 | " return df, label"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "### 建模"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 3,
98 | "metadata": {},
99 | "outputs": [
100 | {
101 | "name": "stdout",
102 | "output_type": "stream",
103 | "text": [
104 | "8999 3000 3000\n"
105 | ]
106 | },
107 | {
108 | "name": "stderr",
109 | "output_type": "stream",
110 | "text": [
111 | "/home/f/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
112 | " y = column_or_1d(y, warn=True)\n"
113 | ]
114 | }
115 | ],
116 | "source": [
117 | "def hr_modeling(features, label):\n",
118 | " from sklearn.model_selection import train_test_split\n",
119 | " f_v = features.values\n",
120 | " l_v = label.values\n",
121 | " # 训练集,验证集,Y为标注\n",
122 | " X_tt, X_validation, Y_tt, Y_validation = train_test_split(f_v, l_v, test_size=0.2)\n",
123 | " # 从训练集再切割0.25为测试集\n",
124 | " X_train, X_test, Y_train, Y_test = train_test_split(X_tt, Y_tt, test_size=0.25)\n",
125 | " print(len(X_train), len(X_validation), len(X_test))\n",
126 | "features, label = hr_preprocessing()\n",
127 | "hr_modeling(features, label)"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "训练集,验证集,测试集长度比为 3:1:1,符合预期"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 4,
140 | "metadata": {},
141 | "outputs": [
142 | {
143 | "name": "stdout",
144 | "output_type": "stream",
145 | "text": [
146 | "ACC: 0.9516666666666667\n",
147 | "REC: 0.9127423822714681\n",
148 | "F-Score: 0.9008885850991114\n"
149 | ]
150 | },
151 | {
152 | "name": "stderr",
153 | "output_type": "stream",
154 | "text": [
155 | "/home/f/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
156 | " y = column_or_1d(y, warn=True)\n"
157 | ]
158 | }
159 | ],
160 | "source": [
161 | "def hr_modeling(features, label):\n",
162 | " from sklearn.model_selection import train_test_split\n",
163 | " f_v = features.values\n",
164 | " l_v = label.values\n",
165 | " # 训练集,验证集,Y为标注\n",
166 | " X_tt, X_validation, Y_tt, Y_validation = train_test_split(f_v, l_v, test_size=0.2)\n",
167 | " # 从训练集再切割0.25为测试集\n",
168 | " X_train, X_test, Y_train, Y_test = train_test_split(X_tt, Y_tt, test_size=0.25)\n",
169 | " from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier\n",
170 | " # n_neighbors=3 结果最好\n",
171 | " knn_clf = KNeighborsClassifier(n_neighbors=3)\n",
172 | " knn_clf.fit(X_train, Y_train)\n",
173 | " Y_predict = knn_clf.predict(X_validation)\n",
174 | " # 比较实际验证集和预测值\n",
175 | " print(\"ACC:\", accuracy_score(Y_validation, Y_predict))\n",
176 | " print(\"REC:\", recall_score(Y_validation, Y_predict))\n",
177 | " print(\"F-Score:\", f1_score(Y_validation, Y_predict))\n",
178 | "features, label = hr_preprocessing()\n",
179 | "hr_modeling(features, label) "
180 | ]
181 | },
182 | {
183 | "cell_type": "markdown",
184 | "metadata": {},
185 | "source": [
186 | "### 接下来多尝试几个分类器"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 5,
192 | "metadata": {
193 | "scrolled": true
194 | },
195 | "outputs": [
196 | {
197 | "name": "stderr",
198 | "output_type": "stream",
199 | "text": [
200 | "/home/f/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
201 | " y = column_or_1d(y, warn=True)\n"
202 | ]
203 | },
204 | {
205 | "name": "stdout",
206 | "output_type": "stream",
207 | "text": [
208 | "0\n",
209 | "KNN ACC: 0.9738859873319258\n",
210 | "KNN REC: 0.9598893499308437\n",
211 | "KNN F: 0.9465787679017958\n",
212 | "1\n",
213 | "KNN ACC: 0.954\n",
214 | "KNN REC: 0.9337175792507204\n",
215 | "KNN F: 0.9037656903765692\n",
216 | "2\n",
217 | "KNN ACC: 0.9563333333333334\n",
218 | "KNN REC: 0.9519774011299436\n",
219 | "KNN F: 0.9114266396213658\n"
220 | ]
221 | }
222 | ],
223 | "source": [
224 | "def hr_modeling(features, label):\n",
225 | " from sklearn.model_selection import train_test_split\n",
226 | " f_v = features.values\n",
227 | " l_v = label.values\n",
228 | " # 训练集,验证集,Y为标注\n",
229 | " X_tt, X_validation, Y_tt, Y_validation = train_test_split(f_v, l_v, test_size=0.2)\n",
230 | " # 从训练集再切割0.25为测试集\n",
231 | " X_train, X_test, Y_train, Y_test = train_test_split(X_tt, Y_tt, test_size=0.25)\n",
232 | " from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier\n",
233 | " models = []\n",
234 | " models.append(('KNN', KNeighborsClassifier(n_neighbors=3)))\n",
235 | " for clf_name, clf in models:\n",
236 | " clf.fit(X_train, Y_train)\n",
237 | " xy_lst = [(X_train, Y_train), (X_validation, Y_validation), (X_test, Y_test)]\n",
238 | " for i in range(len(xy_lst)):\n",
239 | " X_part = xy_lst[i][0]\n",
240 | " Y_part = xy_lst[i][1]\n",
241 | " Y_predict = clf.predict(X_part)\n",
242 | " print(i)\n",
243 | " print(clf_name, ' ACC:',accuracy_score(Y_part, Y_predict))\n",
244 | " print(clf_name, ' REC:', recall_score(Y_part, Y_predict))\n",
245 | " print(clf_name, ' F:',f1_score(Y_part, Y_predict))\n",
246 | "\n",
247 | "features, label = hr_preprocessing()\n",
248 | "hr_modeling(features, label) "
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "0为训练集,1为验证集,2为测试集分别对应的准确率,召回率,F值。接下来我们把所有尝试的分类器写到一个函数里"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 6,
261 | "metadata": {},
262 | "outputs": [
263 | {
264 | "name": "stderr",
265 | "output_type": "stream",
266 | "text": [
267 | "/home/f/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
268 | " y = column_or_1d(y, warn=True)\n"
269 | ]
270 | },
271 | {
272 | "name": "stdout",
273 | "output_type": "stream",
274 | "text": [
275 | "0\n",
276 | "KNN ACC: 0.9747749749972219\n",
277 | "KNN REC: 0.9561567164179104\n",
278 | "KNN F: 0.9475387104229258\n",
279 | "1\n",
280 | "KNN ACC: 0.953\n",
281 | "KNN REC: 0.9208333333333333\n",
282 | "KNN F: 0.9038854805725972\n",
283 | "2\n",
284 | "KNN ACC: 0.9513333333333334\n",
285 | "KNN REC: 0.9151343705799151\n",
286 | "KNN F: 0.898611111111111\n",
287 | "0\n",
288 | "GaussianNB ACC: 0.7968663184798311\n",
289 | "GaussianNB REC: 0.7252798507462687\n",
290 | "GaussianNB F: 0.6298096395301741\n",
291 | "1\n",
292 | "GaussianNB ACC: 0.8116666666666666\n",
293 | "GaussianNB REC: 0.7486111111111111\n",
294 | "GaussianNB F: 0.6561168594035302\n",
295 | "2\n",
296 | "GaussianNB ACC: 0.8013333333333333\n",
297 | "GaussianNB REC: 0.743988684582744\n",
298 | "GaussianNB F: 0.6383495145631067\n",
299 | "0\n",
300 | "BernoulliNB ACC: 0.8386487387487499\n",
301 | "BernoulliNB REC: 0.4608208955223881\n",
302 | "BernoulliNB F: 0.5764294049008168\n",
303 | "1\n",
304 | "BernoulliNB ACC: 0.846\n",
305 | "BernoulliNB REC: 0.47638888888888886\n",
306 | "BernoulliNB F: 0.5975609756097561\n",
307 | "2\n",
308 | "BernoulliNB ACC: 0.8463333333333334\n",
309 | "BernoulliNB REC: 0.4837340876944837\n",
310 | "BernoulliNB F: 0.5973799126637555\n",
311 | "0\n",
312 | "DecisionTree ACC: 1.0\n",
313 | "DecisionTree REC: 1.0\n",
314 | "DecisionTree F: 1.0\n",
315 | "1\n",
316 | "DecisionTree ACC: 0.975\n",
317 | "DecisionTree REC: 0.9611111111111111\n",
318 | "DecisionTree F: 0.9485949280328992\n",
319 | "2\n",
320 | "DecisionTree ACC: 0.9736666666666667\n",
321 | "DecisionTree REC: 0.958981612446959\n",
322 | "DecisionTree F: 0.9449477351916377\n"
323 | ]
324 | },
325 | {
326 | "name": "stderr",
327 | "output_type": "stream",
328 | "text": [
329 | "/home/f/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.\n",
330 | " \"avoid this warning.\", FutureWarning)\n"
331 | ]
332 | },
333 | {
334 | "name": "stdout",
335 | "output_type": "stream",
336 | "text": [
337 | "0\n",
338 | "SVM ACC: 0.9523280364484943\n",
339 | "SVM REC: 0.902518656716418\n",
340 | "SVM F: 0.9002093510118633\n",
341 | "1\n",
342 | "SVM ACC: 0.9513333333333334\n",
343 | "SVM REC: 0.9\n",
344 | "SVM F: 0.8987517337031901\n",
345 | "2\n",
346 | "SVM ACC: 0.9586666666666667\n",
347 | "SVM REC: 0.9137199434229137\n",
348 | "SVM F: 0.9124293785310734\n",
349 | "0\n",
350 | "RandomForest ACC: 0.9978886542949217\n",
351 | "RandomForest REC: 0.9916044776119403\n",
352 | "RandomForest F: 0.9955513931163662\n",
353 | "1\n",
354 | "RandomForest ACC: 0.988\n",
355 | "RandomForest REC: 0.9583333333333334\n",
356 | "RandomForest F: 0.9745762711864409\n",
357 | "2\n",
358 | "RandomForest ACC: 0.9853333333333333\n",
359 | "RandomForest REC: 0.9462517680339463\n",
360 | "RandomForest F: 0.9681620839363243\n"
361 | ]
362 | },
363 | {
364 | "name": "stderr",
365 | "output_type": "stream",
366 | "text": [
367 | "/home/f/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n",
368 | " \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
369 | ]
370 | },
371 | {
372 | "name": "stdout",
373 | "output_type": "stream",
374 | "text": [
375 | "0\n",
376 | "AdaBoost ACC: 0.962884764973886\n",
377 | "AdaBoost REC: 0.9183768656716418\n",
378 | "AdaBoost F: 0.9218164794007491\n",
379 | "1\n",
380 | "AdaBoost ACC: 0.9633333333333334\n",
381 | "AdaBoost REC: 0.9083333333333333\n",
382 | "AdaBoost F: 0.922425952045134\n",
383 | "2\n",
384 | "AdaBoost ACC: 0.9596666666666667\n",
385 | "AdaBoost REC: 0.9151343705799151\n",
386 | "AdaBoost F: 0.914487632508834\n",
387 | "0\n",
388 | "LogisticRegression ACC: 0.7918657628625403\n",
389 | "LogisticRegression REC: 0.3516791044776119\n",
390 | "LogisticRegression F: 0.44602188701567574\n",
391 | "1\n",
392 | "LogisticRegression ACC: 0.7886666666666666\n",
393 | "LogisticRegression REC: 0.3527777777777778\n",
394 | "LogisticRegression F: 0.44483362521891423\n",
395 | "2\n",
396 | "LogisticRegression ACC: 0.7826666666666666\n",
397 | "LogisticRegression REC: 0.3437057991513437\n",
398 | "LogisticRegression F: 0.42706502636203864\n"
399 | ]
400 | }
401 | ],
402 | "source": [
403 | "def hr_modeling(features, label):\n",
404 | " from sklearn.model_selection import train_test_split\n",
405 | " f_v = features.values\n",
406 | " l_v = label.values\n",
407 | " # 训练集,验证集,Y为标注\n",
408 | " X_tt, X_validation, Y_tt, Y_validation = train_test_split(f_v, l_v, test_size=0.2)\n",
409 | " # 从训练集再切割0.25为测试集\n",
410 | " X_train, X_test, Y_train, Y_test = train_test_split(X_tt, Y_tt, test_size=0.25)\n",
411 | " from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier # KNN\n",
412 | " from sklearn.naive_bayes import GaussianNB, BernoulliNB # 高斯,泊努力 朴素贝叶斯(对离散值比较适合,这里也试试)\n",
413 | " from sklearn.tree import DecisionTreeClassifier # 决策树、\n",
414 | " from sklearn.svm import SVC # 支持向量机\n",
415 | " from sklearn.ensemble import RandomForestClassifier # 随机森林\n",
416 | " from sklearn.ensemble import AdaBoostClassifier # 集成\n",
417 | " from sklearn.linear_model import LogisticRegression # 逻辑斯特回归(这里数据线性不可分,效果也不会理想,适合线性可分情况)\n",
418 | " models = []\n",
419 | " models.append(('KNN', KNeighborsClassifier(n_neighbors=3)))\n",
420 | " models.append(('GaussianNB', GaussianNB()))\n",
421 | " models.append(('BernoulliNB', BernoulliNB()))\n",
422 | " models.append(('DecisionTree', DecisionTreeClassifier()))\n",
423 | " models.append(('SVM', SVC(C=100))) # 惩罚度,数值越大,运算越谨慎,时间越长\n",
424 | " models.append(('RandomForest', RandomForestClassifier()))\n",
425 | " models.append(('AdaBoost', AdaBoostClassifier()))\n",
426 | " models.append(('LogisticRegression', LogisticRegression(C=1000, solver='sag')))\n",
427 | " for clf_name, clf in models:\n",
428 | " clf.fit(X_train, Y_train)\n",
429 | " xy_lst = [(X_train, Y_train), (X_validation, Y_validation), (X_test, Y_test)]\n",
430 | " for i in range(len(xy_lst)):\n",
431 | " X_part = xy_lst[i][0]\n",
432 | " Y_part = xy_lst[i][1]\n",
433 | " Y_predict = clf.predict(X_part)\n",
434 | " print(i)\n",
435 | " print(clf_name, ' ACC:',accuracy_score(Y_part, Y_predict))\n",
436 | " print(clf_name, ' REC:', recall_score(Y_part, Y_predict))\n",
437 | " print(clf_name, ' F:',f1_score(Y_part, Y_predict))\n",
438 | "\n",
439 | "features, label = hr_preprocessing()\n",
440 | "hr_modeling(features, label) "
441 | ]
442 | },
443 | {
444 | "cell_type": "markdown",
445 | "metadata": {},
446 | "source": [
447 | "从结果看出,这样的数据情况下,随机森林和决策树是表现比较好的,然后是KNN。"
448 | ]
449 | },
450 | {
451 | "cell_type": "markdown",
452 | "metadata": {},
453 | "source": [
454 | "剩下的就是对选定模型调惨优化了"
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": null,
460 | "metadata": {},
461 | "outputs": [],
462 | "source": []
463 | }
464 | ],
465 | "metadata": {
466 | "kernelspec": {
467 | "display_name": "Python 3",
468 | "language": "python",
469 | "name": "python3"
470 | },
471 | "language_info": {
472 | "codemirror_mode": {
473 | "name": "ipython",
474 | "version": 3
475 | },
476 | "file_extension": ".py",
477 | "mimetype": "text/x-python",
478 | "name": "python",
479 | "nbconvert_exporter": "python",
480 | "pygments_lexer": "ipython3",
481 | "version": "3.7.1"
482 | }
483 | },
484 | "nbformat": 4,
485 | "nbformat_minor": 2
486 | }
487 |
--------------------------------------------------------------------------------