├── k3s入门笔记.md ├── 设计模式.md ├── go笔记.md ├── README.MD ├── 怎么使用机器学习预测可转债价格.pdf ├── 决策树 剪枝.assets ├── Selection_045.png └── Selection_046.png ├── .ipynb_checkpoints └── Untitled-checkpoint.ipynb ├── 投资于概率.md ├── github_mirror.md ├── redis_note.mkd ├── mysql事务.md ├── ubuntu终端Git中文乱码.md ├── flask_note.mkd ├── 一日一技.mkd ├── ubuntu下golang下载libxml2 报错信息 安装libxml2.md ├── Ubuntu中恢复rm命令误删文件(超级详细+亲测有效).md ├── go-vim.mkd ├── pandas分析工具.MD ├── 机器学习1-关于回归问题的准确性评价.md ├── 树莓派使用USB摄像头和motion实现监控.md ├── 决策树 剪枝.md ├── python pandas stack和unstack函数.md ├── Go.md ├── mysql 如何设置事务隔离级别 ├── pyecharts 使用经验.md ├── 通过pymysql程序debug学习数据库事务、隔离级别.md ├── 设置运行rsync 进程的用户.md ├── IDE安装教程.md ├── 目录.md ├── mongodb意外断电,非正常关闭, 造成不可启动, 日志出现WT_ERROR non-specific WiredTiger error, terminating.md ├── 用Golang写爬虫(一).md ├── scikit-learn随机森林调参小结.md ├── 为什么有了ip还要mac.md └── 随机森林超参数调参.md /k3s入门笔记.md: -------------------------------------------------------------------------------- 1 | # k3s 2 | 3 | -------------------------------------------------------------------------------- /设计模式.md: -------------------------------------------------------------------------------- 1 | # 不宜一开始就是用设计模式 2 | 如果项目简单,不需要扩展; 3 | -------------------------------------------------------------------------------- /go笔记.md: -------------------------------------------------------------------------------- 1 | # go笔记 2 | 3 | ## 字符串 4 | 5 | * 6 | 7 | -------------------------------------------------------------------------------- /README.MD: -------------------------------------------------------------------------------- 1 | # 保存的学习资料 2 | ###### 以pdf为主,来源为公众号保存的文件。因为保存到有道云笔记会失去代码格式,只好用插件保存为pdf 3 | 4 | -------------------------------------------------------------------------------- /怎么使用机器学习预测可转债价格.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rockyzsu/Collections/master/怎么使用机器学习预测可转债价格.pdf -------------------------------------------------------------------------------- /决策树 剪枝.assets/Selection_045.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rockyzsu/Collections/master/决策树 剪枝.assets/Selection_045.png -------------------------------------------------------------------------------- /决策树 剪枝.assets/Selection_046.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rockyzsu/Collections/master/决策树 剪枝.assets/Selection_046.png -------------------------------------------------------------------------------- /.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /投资于概率.md: -------------------------------------------------------------------------------- 1 | # 有人觉得用余额宝的利息里 买彩票,觉得这个一个很好的杠铃策略 2 | 可是,对于这种有明显概率事件的策略,收益期望是负的,比如你把全部组合买了,与最终付出的金额对比,是一个负收益。 3 | 4 | 而人们对于自己运气,总是抱有乐观的态度,总觉得自己有好运 5 | 6 | 而对于厄运,大部分人都觉得不会降临到自己的头上, 7 | 8 | -------------------------------------------------------------------------------- /github_mirror.md: -------------------------------------------------------------------------------- 1 | Github国内镜像网站,解决Github访问的神器 2 | 3 | 请叫我Pro大叔 4 | 1 5 | 2020.10.23 13:21:33 6 | 字数 34 7 | 阅读 131,936 8 | https://github.com.cnpmjs.org/ 9 | https://hub.fastgit.org/ 10 | https://github.wuyanzheshui.workers.dev/ 11 | 
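下面补充一个使用镜像的小例子(仅供参考:这些镜像站点的可用性会随时间变化,用之前最好先在浏览器里确认能打开;示例里的 user/repo 只是占位符,请换成实际仓库):

```
# 方式一:克隆时直接使用镜像域名
git clone https://hub.fastgit.org/user/repo.git

# 方式二:让 git 自动把 github.com 重写到镜像(全局生效)
git config --global url."https://hub.fastgit.org/".insteadOf "https://github.com/"

# 不想用了,取消重写
git config --global --unset url."https://hub.fastgit.org/".insteadOf
```

方式二的好处是已有项目里的 origin 地址不用改,git 访问时会自动替换;缺点是镜像一旦失效,所有走 GitHub 的操作都会跟着失败,到时记得取消重写。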
-------------------------------------------------------------------------------- /redis_note.mkd: -------------------------------------------------------------------------------- 1 | # redis 2 | ``` 3 | config get dir # get current redis folder 4 | ``` 5 | 6 | ``` 7 | redis.config 8 | 9 | save 900 1 : save file after 900s when file change 1 times 10 | ``` 11 | 12 | rdb default 13 | appendonly -------------------------------------------------------------------------------- /mysql事务.md: -------------------------------------------------------------------------------- 1 | 使用事务: 2 | 3 | 以begin开头 4 | 5 | 6 | 7 | 8 | 9 | ### 隔离级别 10 | 11 | ``` 12 | SELECT @@tx_isolation; 13 | ``` 14 | 15 | 得到 16 | 17 | > REPEATABLE-READ 18 | 19 | 默认是可重复读 20 | 21 | 22 | 23 | ``` 24 | SELECT @@global.tx_isolation; //全局 25 | ``` 26 | 27 | -------------------------------------------------------------------------------- /ubuntu终端Git中文乱码.md: -------------------------------------------------------------------------------- 1 | # ubuntu终端Git中文乱码 2 | 3 | csdn_yuan88 2019-04-04 21:42:56 2677 收藏 2 4 | 分类专栏: 技术_主机系统软件 文章标签: ubuntu git github 中文乱码 5 | 版权 6 | ubuntu终端Git中文乱码:200\273\347\273\223 7 | 使用git add添加要提交的文件的时候,显示形如2200\273\347\273\223乱码。 8 | 解决方案:git config --global core.quotepath false 9 | 10 | -------------------------------------------------------------------------------- /flask_note.mkd: -------------------------------------------------------------------------------- 1 | # flask note 2 | 3 | url route should start with /, 4 | @app.route('/api) 5 | 6 | 7 | 8 | ``` 9 | 解决方案 10 | 代码如下 11 | 12 | import requests 13 | 14 | s = requests.session() 15 | 16 | s.auth = ('用户名', '密码') 17 | 18 | res = s.get("URL") 19 | #获取请求的json数据 20 | data = res.json() 21 | 还有更简单的 22 | 23 | from requests.auth import HTTPBasicAuth 24 | requests.get('https://api.github.com/user', auth=HTTPBasicAuth('user', 'pass')) 25 | ``` -------------------------------------------------------------------------------- /一日一技.mkd: -------------------------------------------------------------------------------- 1 | remove all the code that can't work in my repo 2 | 3 | 4 | 5 | ---- 6 | ``` 7 | >>> x=[{1,2,3},{4,5,6},{7,8,9}] 8 | >>> set.intersection(*x) 9 | set() 10 | >>> 11 | 12 | ``` 13 | ---- 14 | 15 | ``` 16 | 除了做多重条件判断外,还可以用来自己和自己取或操作,实现重试。 17 | 18 | 例如api_1()可能成功也可能失败,所以需要尝试运行3次,那么代码可以这样写: 19 | 20 | weather = api_1() or api_1() or api_1() 21 | ``` 22 | 23 | ---- 24 | ## partial function is very useful 25 | ---- 26 | ``` 27 | select host_id, count(host_id) from host_info group by host_id, platform having count(host_id) > 1 28 | ``` 29 | -------------------------------------------------------------------------------- /ubuntu下golang下载libxml2 报错信息 安装libxml2.md: -------------------------------------------------------------------------------- 1 | ubuntu安装 libxml2 2 | 3 | ubuntu下golang下载libxml2 报错信息: 4 | 5 | ``` 6 | $ go get -u github.com/lestrrat-go/libxml2 7 | # pkg-config --cflags -- libxml-2.0 8 | Package libxml-2.0 was not found in the pkg-config search path. 
9 | Perhaps you should add the directory containing `libxml-2.0.pc' 10 | to the PKG_CONFIG_PATH environment variable 11 | No package 'libxml-2.0' found 12 | pkg-config: exit status 1 13 | 14 | 15 | ``` 16 | 17 | 因为系统少了个libxml2 开发包: 18 | 19 | 使用以下命令即可修复: 20 | 21 | ``` 22 | sudo apt install libxml2-dev 23 | ``` 24 | 25 | -------------------------------------------------------------------------------- /Ubuntu中恢复rm命令误删文件(超级详细+亲测有效).md: -------------------------------------------------------------------------------- 1 | # Ubuntu中恢复rm命令误删文件(超级详细+亲测有效) 2 | 3 | 置顶 2019年05月27日 11:13:12 [rain_Man2018](https://me.csdn.net/weixin_44038165) 阅读数 40 4 | 5 | 6 | 7 | 在实验室做项目时使用的是ubuntu16.04 8 | 9 | 某次开发时打字太快从而误删除别的文件,而且还是很重要的文件,ubuntu没有像windows一样的回收站,因此删完就没了,只能通过其他办法恢复。 10 | 11 | ### 第一步:进入误删除文件的目录内,查看被删文件的挂载分区 12 | 13 | 如 cd /home/conference 进入到conference目录,原来的误删除的文件处于此目录内 14 | 15 | 使用df -h命令查看此目录的挂载分区,如/dev/sda1是误删文件所在的分区 16 | 17 | ### 第二步: 安装恢复工具extundelete 18 | 19 | 使用命令: sudo apt-get install extundelete 20 | 21 | ### 第三步:恢复文件 22 | 23 | 使用命令: sudo extundelete /dev/sda1 --restore-all 24 | 25 | 通过以上三步骤后,会在当前目录中生成一个名为RECOVERED_FILES目录,并且将恢复的文件放到这个目录中,然后自己去一个一个找需要用到的文件。 26 | 27 | -------------------------------------------------------------------------------- /go-vim.mkd: -------------------------------------------------------------------------------- 1 | 解决go包管理代理网址无法访问:proxy.golang.org 2 | 默认使用的是proxy.golang.org,在国内无法访问,如下图所示: 3 | 4 | bogon:demo-path user$ make build_darwin 5 | rm -rf target/demo-0.6.0 6 | mkdir -p target/demo-0.6.0/bin 7 | env CGO_ENABLED=1 GO111MODULE=on go run build/spec.go target/demo-0.6.0/bin/demo-spec-0.6.0.yaml 8 | go: github.com/StackExchange/wmi@v0.0.0-20190523213315-cbe66965904d: Get "https://proxy.golang.org/github.com/%21stack%21exchange/wmi/@v/v0.0.0-20190523213315-cbe66965904d.mod": dial tcp 34.64.4.17:443: i/o timeout 9 | make: *** [build_yaml] Error 1 10 | bogon:demo-path user$ make build_darwin 11 | 解决方法: 12 | 换一个国内能访问的代理地址:https://goproxy.cn 13 | 14 | 执行命令: 15 | 16 | go env -w GOPROXY=https://goproxy.cn 17 | 重新执行命令,完美通过! 
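再补充一个稍完整的设置示例(仅作参考的写法,假设使用 Go 1.13+ 且 goproxy.cn 可正常访问):

```
# 打开 module 模式(Go 1.16+ 默认已打开,可省略)
go env -w GO111MODULE=on

# goproxy.cn 文档建议在代理后面加 direct,代理取不到的模块(如私有仓库)会直连源站
go env -w GOPROXY=https://goproxy.cn,direct

# 确认设置已经生效
go env GOPROXY
```

`go env -w` 写的是用户级的 env 文件,设置一次,所有项目都生效。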
-------------------------------------------------------------------------------- /pandas分析工具.MD: -------------------------------------------------------------------------------- 1 | # 一个神奇的库pandas-profiling 2 | 3 | 最近发现一个神奇的库pandas-profiling,**一行代码**生成**超详细数据分析报告**,实乃我等数据分析从业者的福音哈哈~ 4 | 5 | 6 | 7 | 一键生成超详细数据分析报告 8 | 9 | 一般来说,面对一个数据集,我们需要做一些探索性分析 (Exploratory data analysis),这个过程繁琐而冗杂。以泰坦尼克号数据集为例,传统方法是先用Dataframe.describe(): 10 | 11 | ```text 12 | import pandas as pd 13 | 14 | data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv') 15 | data.describe() 16 | ``` 17 | 18 | ![img](https://pic1.zhimg.com/50/v2-8768711ba6b36ff23353dfe0aa58df27_720w.jpg?source=1940ef5c)![img](https://pic1.zhimg.com/80/v2-8768711ba6b36ff23353dfe0aa58df27_720w.jpg?source=1940ef5c) 19 | 20 | 通过describe()方法,我们对数据集可以有一个大体的认知。然后,通过分析各变量之间的关系(直方图,散点图,柱状图,关联分析等等),我们可以进一步探索这个数据集。EDA这个步骤通常需要耗费大量的时间和精力。如今,有了pandas-profiling库,我们一行代码就可以生成一份超详细的数据分析报告~ 21 | 22 | 23 | 24 | ```text 25 | import pandas_profiling 26 | 27 | data.profile_report(title='Titanic Dataset') 28 | ``` 29 | 30 | 具体报告效果如下: 31 | 32 | ![img](https://pic2.zhimg.com/80/v2-faf97e710da736d8f1d3785efbf7d678_720w.jpg?source=1940ef5c) -------------------------------------------------------------------------------- /机器学习1-关于回归问题的准确性评价.md: -------------------------------------------------------------------------------- 1 | # [机器学习1-关于回归问题的准确性评价](https://www.cnblogs.com/rayshaw/p/8628174.html) 2 | 3 | 网址https://book.douban.com/reading/46607817/ 4 | 5 | 6 | 7 | 建立回归器后,需要建立评价回归器拟合效果的指标模型。 8 | 9 | 平均误差(mean absolute error):这是给定数据集的所有数据点的绝对误差平均值 10 | 11 | 均方误差(mean squared error):给定数据集的所有数据点的误差的平方的平均值,最流行 12 | 13 | 中位数绝对误差(mean absolute error):给定数据集的所有数据点的误差的中位数,可以消除异常值的干扰 14 | 15 | 解释方差分(explained variance score):用于衡量我们的模型对数据集波动的解释能力,如果得分为1.0,表明我们的模型是完美的。 16 | 17 | R方得分(R2 score):读作R方,指确定性相关系数,用于衡量模型对未知样本预测的效果,最好的得分为1.0,值也可以是负数。 18 | 19 | 对应代码: 20 | 21 | ``` 22 | import sklearn.metrics as sm 23 | 24 | print('mean absolute error=',round(sm.mean_absolute_error(y_test,y_test_pre),2)) 25 | print('mean squared error=',round(sm.mean_squared_error(y_test,y_test_pre),2)) 26 | print('median absolute error=',round(sm.median_absolute _error(y_test,y_test_pre),2)) 27 | print('explained variance score=',round(sm.explained_variance _score(y_test,y_test_pre),2)) 28 | print('R2 score=',round(sm.r2_score(y_test,y_test_pre),2)) 29 | ``` 30 | 31 | 32 | 33 | 通常情况下,尽量保证均方误差最低,而且解释方差分最高。 34 | 35 | -------------------------------------------------------------------------------- /树莓派使用USB摄像头和motion实现监控.md: -------------------------------------------------------------------------------- 1 | ## 树莓派使用USB摄像头和motion实现监控 2 | 3 | ***\**\*顶\*\**\*** 4 | 5 | 本文同步至个人博客:cyang.tech 6 | 7 | # 一、工具 8 | 9 | - 1、树莓派3B 10 | - 2、USB摄像头 11 | 12 | # 二、操作步骤 13 | 14 | - 1、安装motion 15 | 16 | ``` 17 | 1sudo apt-get install motion 18 | 2 19 | ``` 20 | 21 | - 2、配置motion 22 | 23 | (1) 24 | 25 | ``` 26 | 1sudo nano /etc/default/motion 27 | 2 28 | ``` 29 | 30 | 将里面的no修改成yes,让motion可以一直在后台运行:start_motion_daemon=yes 31 | 32 | (2) 33 | 34 | ``` 35 | 1sudo nano /etc/motion/motion.conf 36 | 2 37 | ``` 38 | 39 | 修改配置文件,这个文件比较长,请确保一下参数的配置。在nano编辑器下,可以使用^w快速查找到如下配置内容。也可以使用^v向下翻页。 40 | 41 | - 3、启动motion 42 | 43 | ``` 44 | 1sudo motion 45 | 2 46 | ``` 47 | 48 | - 4、查看视频数据 49 | 50 | 在局域网内的设备,不管是手机还是电脑,均可打开浏览器访问树莓派IP:8081 51 | 52 | - 5、退出motion 53 | 54 | ``` 55 | 1killall -TERM motion 56 | 2 57 | ``` 58 | 59 | 或者 60 | 61 | ``` 62 | 1service motion stop 63 | 2 64 | ``` 65 | 66 | # 三、可能出现的问题 67 | 
68 | - 1、配置错误 69 | 70 | 出现Unknown config option "sdl_threadnr" 71 | 解决方法: 72 | 在配置文件中,直接将这一行内容进行注释。不是下图光标所在处,是光标下面sdl_threadnr 0这一行,注释成# sdl_threadnr 0即可。 73 | 74 | - 2、8081页面无法显示 75 | 76 | 在8081端口,无法显示数据,但是在8080端口可以看到motion的信息。 77 | **解决方法:** 78 | 这可能是摄像头没有被识别,可以将摄像头拔下重新插入。 -------------------------------------------------------------------------------- /决策树 剪枝.md: -------------------------------------------------------------------------------- 1 | 信息获取度: 2 | 3 | 没有该特征是的 信息熵 4 | 5 | 有该特征时的信息熵 6 | 7 | 差值就是 信息增益 8 | 9 | 信息增益最大的作为根节点 10 | 11 | 12 | 13 | ID3: information gain 14 | 15 | 其他方法只是重新定义了其度量方法 16 | 17 | C4.5 gain ratio 18 | 19 | cart : gini index 20 | 21 | 22 | 23 | 连续型的属性需要离散化 24 | 25 | 26 | 27 | 深度太大, 分的太细,容易过拟合 28 | 29 | 30 | 31 | 优点: 32 | 33 | 小规模数据集有效 34 | 35 | 缺点: 36 | 37 | 连续型数据变量不好 38 | 39 | 类别较多是,错误增加较快 40 | 41 | 42 | 43 | # 决策树 剪枝 44 | 45 | * 限制深度, 叶子节点数,叶子节点的样本数 46 | 47 | 48 | 49 | # 集成算法 50 | 51 | * 多种机器学习方法合成在一起,取平均 52 | 53 | 54 | 55 | 56 | 57 | # 随机森林: 58 | 59 | * 多个决策树, 得到的结果取众数. 并行训练一堆树 60 | 61 | * 随机, 数据随机采样, 特征也随机采样 60% - 80% 62 | * 树要不一样,保证其泛化能力 63 | 64 | ![Selection_045](/home/xda/hub/Collections/决策树 剪枝.assets/Selection_045.png) 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | ![Selection_046](/home/xda/hub/Collections/决策树 剪枝.assets/Selection_046.png) 73 | 74 | 75 | 76 | 平时用的数的数量100,200个差不多就够了 77 | 78 | 79 | 80 | 81 | 82 | # Boosting: 83 | 84 | 从弱学习器开始加强 85 | 86 | 串行的算法. 87 | 88 | 先算出一棵树, 根据残差,继续加强 89 | 90 | AdaBoots, XgBoost 91 | 92 | 93 | 94 | # Stacking: 95 | 96 | 堆叠模型, 几个模型,加在一起, 取一个平均 97 | 98 | 99 | 100 | 101 | 102 | 210300102609 103 | 104 | -------------------------------------------------------------------------------- /python pandas stack和unstack函数.md: -------------------------------------------------------------------------------- 1 | # [python pandas stack和unstack函数](https://www.cnblogs.com/bambipai/p/7658311.html) 2 | 3 |   在用pandas进行数据重排时,经常用到stack和unstack两个函数。stack的意思是堆叠,堆积,unstack即“不要堆叠”,我对两个函数是这样理解和区分的。 4 | 5 |   常见的数据的层次化结构有两种,一种是表格,一种是“花括号”,即下面这样的l两种形式: 6 | 7 | | | store1 | store2 | store3 | 8 | | ------- | ------ | ------ | ------ | 9 | | street1 | 1 | 2 | 3 | 10 | | street2 | 4 | 5 | 6 | 11 | 12 | ​ ![img](http://xximg.30daydo.com/typora/1153897-20171013213916309-1610254152.png) 13 | 14 |  表格在行列方向上均有索引(类似于DataFrame),花括号结构只有“列方向”上的索引(类似于层次化的Series),结构更加偏向于堆叠(Series-stack,方便记忆)。**stack函数会将数据从”表格结构“变成”花括号结构“**,即将其行索引变成列索引,反之,unstack函数将数据从”花括号结构“变成”表格结构“,即要将其中一层的列索引变成行索引。例: 15 | 16 | 17 | 18 | ``` 19 | import numpy as np 20 | import pandas as pd 21 | from pandas import Series,DataFrame 22 | data=DataFrame(np.arange(6).reshape((2,3)),index=pd.Index(['street1','street2']),columns=pd.Index(['one','two','three'])) 23 | print(data) 24 | print('-----------------------------------------\n') 25 | data2=data.stack() 26 | data3=data2.unstack() 27 | print(data2) 28 | print('-----------------------------------------\n') 29 | print(data3) 30 | ``` 31 | 32 | 33 | 34 | ·打印结果如下:使用stack函数,将data的行索引['one','two','three’]转变成列索引(第二层),便得到了一个层次化的Series(data2),使用unstack函数,将data2的第二层列索引转变成行索引(默认的,可以改变),便又得到了DataFrame(data3)。 35 | 36 | ![img](http://xximg.30daydo.com/typora/1153897-20171012213543402-1226924570.png) 37 | 38 | -------------------------------------------------------------------------------- /Go.md: -------------------------------------------------------------------------------- 1 | 1. %v 只输出所有的值 2 | 3 | 2. %+v 先输出字段类型,再输出该字段的值 4 | 5 | 3. 
%#v 先输出结构体名字值,再输出结构体(字段类型+字段的值) 6 | 7 | 輸出結果分析 8 | 作为函数参数是值拷贝,在函数中slice的修改是通过slice中保存的地址对底层数组进行修改。但是删除操作,需要传递地址。 9 | 作为函数参数,当在函数中使用append增加切片元素的时候,就相当于创建一个新的变量。 10 | 11 | ## new 12 | 创建的是一个指针! 13 | 14 | GO是一个前类型语言,不同类型之间不能进行赋值 15 | 16 | 不同类型的指针不能互相赋值,unsafe指针除外 17 | 18 | 写代码的书,最好不要在写完代码后,额外在代码其他地方写注释,最后在代码下面就写注释 19 | ---- 20 | struct 里面小写的字段,包外面是无法正常访问的。 21 | 22 | 指针的运算需要使用uintptr进行转换后才能相加,减 23 | uintptr移动的单位是字节,8bit 24 | 25 | --- 26 | *& 成对出现就是抵消了。等于无 27 | 28 | go继承C的结构体,而没有类的概念 29 | 面象现象特性: 封装 30 | 31 | 同一个包内的 小写 私有变量可以访问,包外就不可以了 32 | 33 | 34 | GO 没有继承 35 | 用的是组合 has-a 36 | 37 | 继承是 is-a 38 | 39 | 空接口 断言 40 | i.(T) i为空接口类型的变量,T为所需要的断言的类型 41 | 42 | 断言推荐 使用 43 | ``` 44 | if value,ok:=i.(T);ok{ 45 | 46 | } 47 | ``` 48 | 这样,即使断言出错,也不会panic 49 | 50 | 51 | # VS 类型判断 52 | i.(type) type 是关键字 53 | ``` 54 | package main 55 | 56 | import ( 57 | "fmt" 58 | ) 59 | 60 | type data interface{} 61 | type Car struct { 62 | Color string 63 | Brand string 64 | } 65 | 66 | func main() { 67 | slice := make([]data, 3) 68 | slice[0] = 1 // an int 69 | slice[1] = "Hello" // a string 70 | slice[2] = Car{"Red", "BMW"} //a struct 71 | 72 | for i, v := range slice { 73 | //v.(type)不能在switch外的任何逻辑里面使用 74 | switch value := v.(type) { 75 | case int: 76 | { 77 | fmt.Printf("slice[%d] type is int[%d]\n", i, value) 78 | } 79 | case string: 80 | { 81 | fmt.Printf("slice[%d] type is string [%s]\n", i, value) 82 | } 83 | case Car: 84 | { 85 | fmt.Printf("slice[%d] type is Car [%s]\n", i, value) 86 | } 87 | default: 88 | { 89 | 90 | } 91 | } 92 | } 93 | } 94 | 95 | ``` 96 | 函数签名:不包含函数名,类型,返回值 97 | 98 | 方法签名: QA: 似乎签名里面 参数没有名称,返回也没有名称 ? 99 | ``` 100 | type IDrive interface{ 101 | Drive() // 方法签名 102 | } 103 | ``` 104 | 105 | main函数本质是一个独立的主协程,如果主协程退出,则其他的协程即使还没有执行完,也将终止执行 106 | 107 | 反射的性能相对较低 108 | 109 | 切片的传参方式, 110 | func myfunc(args ...interface{}){ 111 | 112 | Exec(args...) // 把他传进args 113 | 114 | } 115 | 116 | 117 | 118 | 性能测试工具 119 | ipmort _ "net/khttp/pprof" 120 | 121 | web: 127.0.0.1:8080/debug/pprof 122 | 123 | 生成的exe临时文集那 124 | /tmp/go-build893595111/b001/exe/ 125 | 126 | 压缩文件记得close,不然无法解压 127 | 128 | -------------------------------------------------------------------------------- /mysql 如何设置事务隔离级别: -------------------------------------------------------------------------------- 1 | 1. 查看当前事物级别:SELECT @@tx_isolation; 2 | 3 | ![mysql 如何设置事务隔离级别](https://exp-picture.cdn.bcebos.com/64a62a0f64781423d3b4038daac2bbd6e0d0b2a0.jpg?x-bce-process=image%2Fresize%2Cm_lfit%2Cw_500%2Climit_1%2Fformat%2Cf_jpg%2Fquality%2Cq_80) 4 | 5 | 2. 6 | 7 | 设置mysql的隔离级别:set session transaction isolation level 设置事务隔离级别 8 | 9 | 3. 10 | 11 | 设置read uncommitted级别:set session transaction isolation level read uncommitted; 12 | 13 | 查看设置结果:SELECT @@tx_isolation; 14 | 15 | ![mysql 如何设置事务隔离级别](https://exp-picture.cdn.bcebos.com/40d2d0e8b004541ba8028795869a310e1699a6a0.jpg?x-bce-process=image%2Fresize%2Cm_lfit%2Cw_500%2Climit_1%2Fformat%2Cf_jpg%2Fquality%2Cq_80) 16 | 17 | ![mysql 如何设置事务隔离级别](https://exp-picture.cdn.bcebos.com/f367139a310e17999727cf0bc9406afec214a3a0.jpg?x-bce-process=image%2Fresize%2Cm_lfit%2Cw_500%2Climit_1%2Fformat%2Cf_jpg%2Fquality%2Cq_80) 18 | 19 | 4. 
20 | 21 | 设置read committed级别:set session transaction isolation level read committed 22 | 23 | 查看设置结果:SELECT @@tx_isolation; 24 | 25 | ![mysql 如何设置事务隔离级别](https://exp-picture.cdn.bcebos.com/c255efc595ee41c1312ad9e08d88912ca4ca9ba0.jpg?x-bce-process=image%2Fresize%2Cm_lfit%2Cw_500%2Climit_1%2Fformat%2Cf_jpg%2Fquality%2Cq_80) 26 | 27 | ![mysql 如何设置事务隔离级别](https://exp-picture.cdn.bcebos.com/07c98f2ca5cadce81c45754ffcf7980e5e2095a0.jpg?x-bce-process=image%2Fresize%2Cm_lfit%2Cw_500%2Climit_1%2Fformat%2Cf_jpg%2Fquality%2Cq_80) 28 | 29 | 5. 30 | 31 | 设置repeatable read级别:set session transaction isolation level repeatable read; 32 | 33 | 查看设置结果:SELECT @@tx_isolation; 34 | 35 | ![mysql 如何设置事务隔离级别](https://exp-picture.cdn.bcebos.com/994f412043715fdba52dcc89468920c5270f8ca0.jpg?x-bce-process=image%2Fresize%2Cm_lfit%2Cw_500%2Climit_1%2Fformat%2Cf_jpg%2Fquality%2Cq_80) 36 | 37 | ![mysql 如何设置事务隔离级别](https://exp-picture.cdn.bcebos.com/274e9635dd8a59dede0afdb2b370d5413b8c84a0.jpg?x-bce-process=image%2Fresize%2Cm_lfit%2Cw_500%2Climit_1%2Fformat%2Cf_jpg%2Fquality%2Cq_80) 38 | 39 | 6. 6 40 | 41 | 设置serializable级别:set session transaction isolation level serializable 42 | 43 | 查看设置结果:SELECT @@tx_isolation; 44 | 45 | ![mysql 如何设置事务隔离级别](https://exp-picture.cdn.bcebos.com/89402670d5413a8c0605d0bc1ffc508c9ace81a0.jpg?x-bce-process=image%2Fresize%2Cm_lfit%2Cw_500%2Climit_1%2Fformat%2Cf_jpg%2Fquality%2Cq_80) 46 | 47 | ![mysql 如何设置事务隔离级别](https://exp-picture.cdn.bcebos.com/95bd4e8c9bcec7f82c7a286e034ce54a2e27fba0.jpg?x-bce-process=image%2Fresize%2Cm_lfit%2Cw_500%2Climit_1%2Fformat%2Cf_jpg%2Fquality%2Cq_80) 48 | 49 | END -------------------------------------------------------------------------------- /pyecharts 使用经验.md: -------------------------------------------------------------------------------- 1 | # pyecharts 使用经验 2 | 3 | 4 | 5 | ## 全局配置 6 | 7 | set_global_opts() [注意,不是在**初始化**里面] 8 | 9 | 里面设置标题: 10 | 11 | title_opts=opts.TitleOpts(title="主标题", subtitle="副标题") 12 | 13 | TitleOpts -> title_opts 14 | 15 | 16 | 17 | ```python 18 | scatter = ( 19 | Scatter(init_opts=opts.InitOpts(width="1024px", height="768px",bg_color=JsCode(background_color_js))) 20 | .add_xaxis(xaxis_data=X) 21 | .add_yaxis( 22 | series_name="上市价格", 23 | y_axis=Y, 24 | symbol_size=10, 25 | label_opts=opts.LabelOpts(is_show=False), 26 | ) 27 | .set_series_opts() 28 | .set_global_opts( 29 | title_opts=opts.TitleOpts(title='转债上市价格预测'), 30 | xaxis_opts=opts.AxisOpts( 31 | type_="value", 32 | splitline_opts=opts.SplitLineOpts(is_show=True), 33 | min_=60, 34 | ), 35 | yaxis_opts=opts.AxisOpts( 36 | name='我是y轴,右边的', # 坐标轴的显示名字 37 | type_="value", 38 | axistick_opts=opts.AxisTickOpts(is_show=True), 39 | splitline_opts=opts.SplitLineOpts(is_show=True), 40 | min_=60, 41 | 42 | ), 43 | tooltip_opts=opts.TooltipOpts(is_show=False), 44 | ) 45 | ) 46 | scatter.render('only.html') 47 | ``` 48 | 49 | 50 | 51 | 52 | 53 | ## 多图绘制 54 | 55 | ```python 56 | scatter = ( 57 | Scatter(init_opts=opts.InitOpts(width="800px", height="600px")) 58 | .add_xaxis(xaxis_data=X) 59 | .add_yaxis( 60 | series_name="转债价格", 61 | y_axis=Y, 62 | symbol_size=3, 63 | label_opts=opts.LabelOpts(is_show=False), 64 | ) 65 | .set_series_opts() 66 | .set_global_opts( 67 | xaxis_opts=opts.AxisOpts( 68 | type_="value", splitline_opts=opts.SplitLineOpts(is_show=True), 69 | min_=60, 70 | ), 71 | yaxis_opts=opts.AxisOpts( 72 | type_="value", 73 | axistick_opts=opts.AxisTickOpts(is_show=True), 74 | splitline_opts=opts.SplitLineOpts(is_show=True), 75 | min_=60, 76 | 77 | ) 78 | ) 79 | ) 80 | 81 | line 
= ( 82 | Line() 83 | .add_xaxis(xaxis_data=X0) 84 | .add_yaxis( 85 | series_name="可转债价格预测", 86 | # yaxis_index=1, 87 | symbol_size=5, 88 | y_axis=Y0, 89 | label_opts=opts.LabelOpts(is_show=False), 90 | linestyle_opts=opts.LineStyleOpts(width=1,color='yellow'), 91 | 92 | ).set_colors("yellow") 93 | ) 94 | scatter.overlap(line) 95 | scatter.render_notebook() 96 | ``` 97 | 98 | * 如果需要2条Y轴,需要 设置 yaxis_index=1. -------------------------------------------------------------------------------- /通过pymysql程序debug学习数据库事务、隔离级别.md: -------------------------------------------------------------------------------- 1 | # 通过pymysql程序debug学习数据库事务、隔离级别 2 | 3 | ### 问题 4 | 5 | 今天在使用pymysql连数据库的时候,出现了一个bug,查询数据库某个数据,但是在我在数据库中执行sql语句改变数据后,pymsql的查询依然没有发生改变。 6 | 代码如下: 7 | 8 | ```python 9 | # 5.6.10 10 | 11 | conn = pymysql.connect(host=HOST, port=PORT, user=USER, passwd=PSWD, db=DB) 12 | 13 | def fetch(): 14 | cursor = conn.cursor() 15 | sql = "SELECT * FROM hello" 16 | try: 17 | res = cursor.execute(sql) 18 | except: 19 | pass 20 | 21 | finally: 22 | 23 | cursor.close() 24 | for data in cursor.fetchall(): 25 | print(*data) 26 | 27 | while True: 28 | fetch() 29 | time.sleep(2) 30 | ``` 31 | 32 | ### 解决问题 33 | 34 | 首先,我们还是找出问题原因,并解决它,查阅相关文档后可知,因为我们的查询语句执行后,没有`commit()`,这会导致查询事务没有提交,mysql数据库会返回上次查询到的结果。 35 | 所以,不管是增删查改,最好都以事务的形式提交! 36 | 37 | ```python 38 | try: 39 | res = cursor.execute(sql) 40 | conn.commit() 41 | 42 | except: 43 | pass 44 | finally: 45 | cursor.close() 46 | ``` 47 | 48 | # 分析 49 | 50 | 接下来我们来仔细分析,为什么查询也需要提交事务 51 | 52 | - 数据库的事务 53 | - 脏读、不可重复读、幻读 54 | - 数据库事务隔离级别 55 | - 数据库的锁 56 | 57 | ## 1.数据库的事务特性 58 | 59 | 先简单了解数据库事务的特性 60 | 61 | 1. 原子性:原子性是指事务是一个不可分割的工作单位,事务中的操作要么都发生要么都不发生。 62 | 2. 一致性:如果事务执行之前数据库是一个完整性的状态,那么事务结束后,无论事务是否执行成功,数据库仍然是一个完整性状态. 63 | 数据库的完整性状态:当一个数据库中的所有的数据都符合数据库中所定义的所有的约束,此时可以称数据库是一个完整性状态. 64 | 3. 隔离性:事务的隔离性是指多个用户并发访问数据库时,一个用户的事务不能被其它用户的事务所干扰,多个并发事务之间事务要隔离 65 | 4. 持久性:持久性是指一个事务一旦被提交,它对数据库中数据的改变就是永久性的,接下来即使数据库发生故障也不应该对其有任何影响 66 | 67 | ## 2.脏读、不可重复读、幻读 68 | 69 | ### 脏读(读取未提交的数据) 70 | 71 | | 转账 | 取钱 | 72 | | :---------------------------------- | :----------------------------- | 73 | | 事务开始 | | 74 | | | 事务开始 | 75 | | | 查看余额为2000 | 76 | | | 取钱1000 | 77 | | 查余额为1000 | | 78 | | | 未知错误,事务回滚,余额为2000 | 79 | | 转账2000,余额为3000(脏读1000+2000) | | 80 | | 事务提交 | | 81 | 82 | 所以莫名其妙就少了1000块钱 83 | 84 | ### 不可重复读(两次读取结果不一致) 85 | 86 | 拿购物和取钱说事,有天小A去取钱,看余额有2000块(事务开始),很开心,此时她老婆看到喜欢的东西,手速极快的下单,付款2000(其他事务提交完成),这时小A到ATM取1000块钱,ATM提示余额不足!小A感到很疑惑,刚才明明还有2000的啊? 
87 | 88 | | 取钱 | 购物 | 89 | | :-------------- | :------- | 90 | | 事务开始 | | 91 | | 查看余额为2000 | | 92 | | | 事务开始 | 93 | | | 消费2000 | 94 | | | 事务提交 | 95 | | 再次查询余额为0 | | 96 | | 事务结束 | | 97 | 98 | ### 幻读(多次读取,总量不一样) 99 | 100 | 这天,小A查自己这个月的账单(事务开始),发现5笔购物总计消费1000块,这时,他老婆又眼疾手快的下单付款买了一件衣服2000块(其他事务结束),这时,小A再看他的账单,总计消费变成了3000块,就像产生幻觉一样(事务结束) 101 | 102 | | 查账 | 购物 | 103 | | :----------------- | :--------------- | 104 | | 事务开始 | | 105 | | 查看账单为1000 | | 106 | | | 事务开始 | 107 | | | 增加一笔账单2000 | 108 | | | 事务提交 | 109 | | 再次查看账单为3000 | | 110 | | 事务结束 | | 111 | 112 | ### 不可重复读和幻读的区别 113 | 114 | 可能到这,大概了解了3种读取数据会出现的异常情况了,但可能对不可重复读和幻读有疑问,似乎差不多啊。 115 | 可以这么理解,不可重复读是针对于数据库表的某条记录而言,也就是针对update一些。解决办法例如:我们可以在读取事务进行的时候对该条记录加锁,以避免重复读不一致的问题。 116 | 幻读是针对多条记录而言,针对insert,delete一些,在同一事务两次查询结果数目不一致。解决办法例如:我们可以在读取事务进行的时候对整个表加锁,以避免。 117 | 118 | ## 3.数据库的隔离级别 119 | 120 | 数据库的隔离级别,由低到高依次为Read uncommitted 、Read committed 、Repeatable read 、Serializable ,这四个级别可以逐个解决脏读 、不可重复读 、幻读 这几类问题。 121 | √可以避免,×不能避免 122 | 123 | | | 脏读 | 不可重复读 | 幻读 | 124 | | :--------------- | :--- | :--------- | :--- | 125 | | Read uncommitted | × | × | × | 126 | | Read committed | √ | × | × | 127 | | Repeatable read | √ | √ | × | 128 | | Serializable | √ | √ | √ | 129 | 130 | 而mysql默认为 Repeatable read,Sql Server , Oracle默认为 Read committed 131 | 到这里,基本可以完结今天的bug原因了,mysql可以避免重复读的问题的,它并不是通过前面提到的加锁来控制的,而是,同一事务的查询结果都是事务开始的时候保存的快照,所以如果不commit,查询结果不会改变! 132 | 133 | ## 4.数据库的锁 134 | 135 | 还想继续深究数据库是如何加锁来保证事务的四大特性的。有时间一定去了解,到时候来更新。。。哈哈 136 | [InnoDB锁机制](https://www.toutiao.com/i6669505644367708680/?tt_from=mobile_qq&utm_campaign=client_share×tamp=1552909227&app=news_article&utm_source=mobile_qq&utm_medium=toutiao_android&group_id=6669505644367708680) -------------------------------------------------------------------------------- /设置运行rsync 进程的用户.md: -------------------------------------------------------------------------------- 1 | 安装及配置 2 | 安装运行: 3 | 4 | yum -y install rsync 5 | #启动rsync服务 6 | systemctl start rsyncd.service 7 | systemctl enable rsyncd.service 8 | 9 | #检查是否已经成功启动 10 | netstat -lnp|grep 873 11 | 1 12 | 2 13 | 3 14 | 4 15 | 5 16 | 6 17 | 7 18 | 19 | 好了,好了。安装成功。 20 | 21 | 配置: 22 | 首先,配置文件在: 23 | /etc/rsyncd.conf 24 | 25 | vim /etc/rsyncd.conf 26 | 1 27 | 看到: 28 | 29 | 30 | 31 | 好了,先修改成: 32 | 33 | uid = root 34 | # //设置运行rsync 进程的用户 35 | gid = root 36 | use chroot = no 37 | max connections = 4 38 | # pid file = /var/run/rsyncd.pid 39 | #//CentOS7中yum安装不需指定pid file 否则报错 40 | lock file=/var/run/rsyncd.lock 41 | log file = /var/log/rsyncd.log 42 | # //此文件定义完成后系统会自动创建 43 | exclude = lost+found/ 44 | transfer logging = yes 45 | timeout = 900 46 | ignore nonreadable = yes 47 | # //同步时跳过没有权限的目录 48 | dont compress = *.gz *.tgz *.zip *.z *.Z *.rpm *.deb *.bz2 49 | # //传输时不压缩的文件 50 | 1 51 | 2 52 | 3 53 | 4 54 | 5 55 | 6 56 | 7 57 | 8 58 | 9 59 | 10 60 | 11 61 | 12 62 | 13 63 | 14 64 | 15 65 | 16 66 | 17 67 | 68 | 69 | 重启: 70 | 71 | systemctl restart rsyncd.service 72 | 1 73 | 两个服务器间传输文件 74 | 好了,上面配置完rsync了,那么接下来,假设有两台服务器,开发服务器dev及线上测试环境test,现在需要从dev将可运行代码更新到test上面去。 75 | 76 | 注意:每一次传输文件客户端将告诉服务端需要调用哪个传输规则进行传输,而传输规则如下: 77 | 78 | [simba] //此名字即客户端使用rsync来同步的路径 79 | path=/usr/local/simba //实际需要同步的路径 80 | comment=simba //和中括号里名字一样就行 81 | ignore errors 82 | read only=no //表示可以pull 83 | write only=no //表示可以push 84 | list=no 85 | auth users=rsyncuser //客户端获取文件的身份此用户并不是本机中确实存在的用户 86 | secrets file=/etc/rsyncd.passwd //用来认证客户端的秘钥文件 格式 USERNAME:PASSWD 此文件权 87 | //限一定需要改为600,且属主必须与运行rsync的用户一致。 88 | hosts allow=* //允许所有主机访问 89 | 1 90 | 2 91 | 3 92 | 4 93 | 5 94 | 6 95 | 7 
96 | 8 97 | 9 98 | 10 99 | 11 100 | 好了,我们自己做一个规则,假定test上面接收的目录是: /data/www/helloRsync 101 | 102 | 103 | #创建目录 104 | mkdir /data/www/helloRsync 105 | 1 106 | 2 107 | 3 108 | 4 109 | 而dev上面也是。。。 110 | 请自行创建。 111 | 那么在test上面的/etc/rsyncd.conf添加规则: 112 | 113 | #规则名称,作为测试用规则,直接用这个算了。 114 | [helloRsync] 115 | #同步的路径 116 | path=/data/www/helloRsync 117 | #规则描述 118 | comment=测试规则 119 | ignore errors 120 | #是否可以pull 121 | read only=no 122 | #是否可以push 123 | write only=no 124 | list=no 125 | #下面配置同步时候的身份,注意该身份是在rsync里面定义的,并非是本机实际用户。等下说说如何在rsync里面定义身份。 126 | #客户端获取文件的身份此用户并不是本机中确实存在的用户 127 | auth users=rsyncuser 128 | #//用来认证客户端的秘钥文件 格式 USERNAME:PASSWD 此文件权 129 | #//限一定需要改为600,且属主必须与运行rsync的用户一致。 130 | secrets file=/etc/rsyncd.passwd 131 | #允许所有主机访问 132 | hosts allow=* 133 | 1 134 | 2 135 | 3 136 | 4 137 | 5 138 | 6 139 | 7 140 | 8 141 | 9 142 | 10 143 | 11 144 | 12 145 | 13 146 | 14 147 | 15 148 | 16 149 | 17 150 | 18 151 | 19 152 | 20 153 | 154 | 155 | 给rsync定义身份,如下: 156 | 157 | echo 'rsyncuser:123456'>/etc/rsyncd.passwd //文件用户名和路径为上面定义,别写错,密码自己定 158 | chmod 600 /etc/rsyncd.passwd //修改权限 159 | 1 160 | 2 161 | 重启服务。 162 | 163 | systemctl restart rsyncd.service 164 | 1 165 | 客户端的配置 166 | 167 | 1。创建密码。 168 | 169 | echo '123456' >>/etc/rsyncd-test.passwd //注意这里只需要服务器rsyncd.passwd 中的密码 170 | chmod 600 /etc/rsyncd-test.passwd 171 | 1 172 | 2 173 | 为了测试顺利,我们添加一些文件进行同步,譬如: 174 | 175 | echo 'test,hello'>> /data/www/helloRsync/readme.txt 176 | 1 177 | 2 178 | 好了,同步: 179 | 180 | rsync -auv --password-file=/etc/rsyncd-test.passwd rsyncuser@120.x.x.x::helloRsync /data/www/helloRsync/ 181 | 1 182 | 183 | 184 | 看看服务器上面对应目录 185 | 186 | 187 | 188 | 检查原因 189 | 190 | 191 | 192 | test上面看日志: 193 | 194 | vim /var/log/rsyncd.log 195 | 1 196 | 2 197 | 198 | 199 | 好了, 200 | 201 | 202 | 203 | auth user 是错的,要用 auth users,也是醉了,都没发现吗? 204 | 205 | 改过以后重启rsync服务,在尝试 206 | 207 | 得到: 208 | 209 | 210 | 211 | name or service not kunown 212 | 213 | 据查这不是问题,怀疑,是rsync命令问题,应该是将test的文件下载下来了,于是将目录调换一下顺序,得到: 214 | 215 | rsync -auv --password-file=/etc/rsyncd-test.passwd /data/www/helloRsync/ rsyncuser@120.x.x.x::helloRsync 216 | 1 217 | 报错: 218 | 219 | 220 | 好了,是没有放开权限: 221 | 222 | 223 | 224 | 225 | 放开以后重启终于成功了。泪流满面。坑爹丫。 226 | 227 | 一份真实环境用的更新脚本demo 228 | 229 | src="/usr/local/webroot/java-web-bld" 230 | rsync -rauvvt --progress \ 231 | --password-file=/etc/rsyncd-test.passwd \ 232 | --exclude="console/data" \ 233 | --exclude=".svn" \ 234 | --exclude="WEB-INF/logs" \ 235 | --exclude="res/upload" \ 236 | --exclude="WEB-INF/upload" \ 237 | --exclude="WEB-INF/temp" \ 238 | --exclude="env.properties" \ 239 | /usr/local/webroot/java-web-bld/ rsyncuser@xxx.xx.xx.xxx::backend 240 | ———————————————— 241 | 版权声明:本文为CSDN博主「码农下的天桥」的原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接及本声明。 242 | 原文链接:https://blog.csdn.net/cdnight/article/details/78861543 -------------------------------------------------------------------------------- /IDE安装教程.md: -------------------------------------------------------------------------------- 1 | # IDE安装教程 2 | 3 | ## 安装教程 4 | 5 | ### 1.官网下载专业版 6 | 7 | 打开官网: 8 | 9 | ```markdown 10 | https://www.jetbrains.com/pycharm/download/#section=windows 11 | ``` 12 | 13 | 选择你的平台,如果你是Windows,选择Windows,Mac选择Mac. 14 | 15 | 如图所示,点击专业版下载: 16 | 17 | ![image-20211205143554475](https://billy.taoxiaoxin.club//img/202112051435582.png) 18 | 19 | ### 2.安装 20 | 21 | 找到下载好的安装包,右键以管理员身份运行. 
22 | 23 | ![image-20211205144235843](https://billy.taoxiaoxin.club//img/202112051442908.png) 24 | 25 | 选择是 26 | 27 | ![image-20211205144606399](https://billy.taoxiaoxin.club//img/202112051446473.png) 28 | 29 | 点击NEXT 30 | 31 | ![image-20211205144724158](https://billy.taoxiaoxin.club//img/202112051447226.png) 32 | 33 | 选择安装路径,点击NEXT. 34 | 35 | ![image-20211205144823445](https://billy.taoxiaoxin.club//img/202112051448505.png) 36 | 37 | 勾选桌面快捷方式,点击NEXT. 38 | 39 | ![image-20211205144947424](https://billy.taoxiaoxin.club//img/202112051449499.png) 40 | 41 | 直接点击安装即可 42 | 43 | ![图片](https://billy.taoxiaoxin.club//img/202112051451065.webp) 44 | 45 | 等待安装完成之后,勾选启动Pycharm. 46 | 47 | ![image-20211205145704003](https://billy.taoxiaoxin.club//img/202112051457070.png) 48 | 49 | ## 激活IDE 50 | 51 | **重要提示:**如果之前激活IDEA时,修改过Host文件,请删除Host中的相关配置 52 | 53 | ### 1.下载插件 54 | 55 | 下载激活插件: 56 | 57 | ```markdown 58 | https://txx.lanzouw.com/iH3Ugx9t07i 59 | ``` 60 | 61 | 下载的文件包含: 62 | 63 | - reset_script (如果你已经过了30天试用期,可以使用这个重置) 64 | - 激活码 65 | - 补丁Jar文件 66 | 67 | ### 2.打开Pycharm 68 | 69 | 然后在桌面双击pycharm运行。第一次运行的话就无需导入,直接点击ok。 70 | 71 | ![图片](https://billy.taoxiaoxin.club//img/202112051506573.webp) 72 | 73 | ### 3.注册账号登录 74 | 75 | 以前的pycharm30天是不需要登录的,但是现在出现了注册登录 76 | 77 | 首先没有账号的小伙伴还是要先**注册/登录**一下,先免费试用30天 78 | 79 | 注册登录账号几分钟就搞定了 80 | 81 | ![image-20211205151311024](https://billy.taoxiaoxin.club//img/202112051513115.png) 82 | 83 | 当然,它是支持第三方登录注册的,如果有GitHub账号的用户直接可以使用GitHub账号登录注册. 84 | 85 | ![image-20211205151712886](https://billy.taoxiaoxin.club//img/202112051517989.png) 86 | 87 | 这里我使用的我GitHub账号登录 88 | 89 | ![image-20211205151907855](https://billy.taoxiaoxin.club//img/202112051519002.png) 90 | 91 | #### 温馨提示: 92 | 93 | 如果你之前登录过并且试用期30天已经过了,**请解压 reset_script.zip 文件** 94 | 95 | > **解压之后会看到下面两个文件,根据自己的操作系统执行对应文件即可** 96 | 97 | windows点击运行文件:`reset_jetbrains_eval_windows.vbs` 98 | 99 | Mac / Linux 执行:`reset_jetbrains_eval_mac_linux.sh` 100 | 101 | ### 4.免费试用 102 | 103 | 登录成功之后,点击 **Start Trial** **,开始免费试用** 104 | 105 | ![image-20211205152122912](https://billy.taoxiaoxin.club//img/202112051521002.png) 106 | 107 | ### 5.创建项目 108 | 109 | 首先随便创建一个项目,点击`New Project` 110 | 111 | ![image-20211205152755650](https://billy.taoxiaoxin.club//img/202112051527743.png) 112 | 113 | 选择你的项目创建路径和本地Python解释器. 114 | 115 | ![image-20211205152941187](https://billy.taoxiaoxin.club//img/202112051529268.png) 116 | 117 | ![image-20211205153132772](https://billy.taoxiaoxin.club//img/202112051531851.png) 118 | 119 | 最后点击`create` 120 | 121 | ![image-20211205153228019](https://billy.taoxiaoxin.club//img/202112051532122.png) 122 | 123 | ### 6.开始激活 124 | 125 | 复制激活插件**FineAgent.jar**到你安装pycharm的路径下. 126 | 127 | ![image-20211205154217567](https://billy.taoxiaoxin.club//img/202112051542628.png) 128 | 129 | 新建一个idea的文件夹 130 | 131 | ![image-20211205154404752](https://billy.taoxiaoxin.club//img/202112051544812.png) 132 | 133 | 粘贴激活插件文件`FineAgent.jar`到`idea`文件夹下. 134 | 135 | ![image-20211205154545412](https://billy.taoxiaoxin.club//img/202112051545475.png) 136 | 137 | #### 温馨提示 138 | 139 | 如果你忘记了你的pycharm安装路径在哪里,可以点击pycharm快捷方式--->右键---->点击打开文件所在位置. 
140 | 141 | ![image-20211205154951219](https://billy.taoxiaoxin.club//img/202112051549307.png) 142 | 143 | 返回到目录`PyCharm 2021.3`下 144 | 145 | ![image-20211205155129199](https://billy.taoxiaoxin.club//img/202112051551273.png) 146 | 147 | 进入到IDEA项目开发界面(`默认情况下,需要创建一个项目或者打开一个项目,才能进入到这个页面`) 148 | 149 | 点击如图所示的菜单:`Help - > Edit Custom VM Options...`。修改`idea64.exe.vmoptions`文件。 150 | 151 | ![image-20211205153707749](https://billy.taoxiaoxin.club//img/202112051537034.png) 152 | 153 | 点击按钮,会打开`idea64.exe.vmoptions`文件。 154 | 155 | ![image-20211205153842636](https://billy.taoxiaoxin.club//img/202112051538789.png) 156 | 157 | 然后到刚刚的`idea`文件夹并复制文件夹路径. 158 | 159 | ![image-20211205160022532](https://billy.taoxiaoxin.club//img/202112051600631.png) 160 | 161 | 在`idea64.exe.vmoptions`文件中添加,这样一段话 162 | 163 | `-javaagent:文件路径\FindAgent.jar` 164 | 165 | 示例: 166 | 167 | ```markdown 168 | # FindAgent.jar 的文件路径 169 | -javaagent:F:\Program Files\JetBrains\PyCharm 2021.3\idea\FineAgent.jar 170 | ``` 171 | 172 | ![image-20211205160341723](https://billy.taoxiaoxin.club//img/202112051603792.png) 173 | 174 | 然后保存,重启Pycharm. 175 | 176 | 重启之后,又会重新进入到激活页面,这个时候,我们选择`Activate IntelliJ IDEA` 177 | 178 | ![image-20211205162746965](https://billy.taoxiaoxin.club//img/202112051627050.png) 179 | 180 | 找到激活文件里面的激活码 181 | 182 | ![image-20211205162906996](https://billy.taoxiaoxin.club//img/202112051629139.png) 183 | 184 | ![image-20211205162921802](https://billy.taoxiaoxin.club//img/202112051629895.png) 185 | 186 | 打开文件,全选复制 187 | 188 | ![image-20211205163003785](https://billy.taoxiaoxin.club//img/202112051630887.png) 189 | 190 | 回到Pycharm激活页面,粘贴激活码,点击激活按钮. 191 | 192 | ![image-20211205163217727](https://billy.taoxiaoxin.club//img/202112051632829.png) 193 | 194 | 恭喜你!成功激活到 2099 年!!! 195 | 196 | ![image-20211205163541824](https://billy.taoxiaoxin.club//img/202112051635929.png) -------------------------------------------------------------------------------- /目录.md: -------------------------------------------------------------------------------- 1 | 目录 2 | 3 | MJPG简介: 4 | 5 | 1.硬件与驱动 6 | 7 | 1.1用到的工具材料: 8 | 9 | 1.2检查是否存在USB摄像头设备 10 | 11 | 2 .安装MJPG-Streamer 12 | 13 | 3.启动 MJPG-Streamer 14 | 15 | 3.1 输入以下命令 16 | 17 | 3.2参数说明: 18 | 19 | 4.实时视频接收 20 | 21 | MJPG简介: 22 | 23 |   MJPG是MJPEG的缩写,但是MJPEG还可以表示文件格式扩展名. 24 |   MJPEG 25 |   全名为 “Motion Joint Photographic Experts Group”,是一种视频编码格式, 26 |   Motion JPEG技术常用与闭合电路的电视摄像机的模拟视频信号“翻译”成视频流,并存储在硬盘上。典型的应用如数字视频记录器等。MJPEG不像MPEG,不使用帧间编码,因此用一个非线性编辑器就很容易编辑。MJPEG的压缩算法与MPEG一脉相承,功能很强大,能发送高质图片,生成完全动画视频等。但相应地,MJPEG对带宽的要求也很高,相当于T-1,MJPEG信息是存储在数字媒体中的庞然大物,需要大量的存储空间以满足如今多数用户的需求。因此从另一个角度说,在某些条件下,MJPEG也许是效率最低的编码/解码器之一。 27 |   MJPEG 是 24-bit 的 “true-color” 影像标准,MJPEG 的工作是将 RGB 格式的影像转换成 YCrCB 格式,目的是为了减少档案大小,一般约可减少 1/3 ~ 1/2 左右。 28 |   MJPEG与MJPG的区别: 29 |   1、mjpeg是视频,就是由系列jpg图片组成的视频。 30 |   2、MJPG是MJPEG的缩写,但是MJPEG还可以表示文件格式扩展名. 
31 | 32 | 1.硬件与驱动 33 | 1.1用到的工具材料: 34 | 树莓派3B+ 35 | PC电脑 36 | USB摄像头 37 | 1.2检查是否存在USB摄像头设备 38 | 输入以下指令: 39 | 40 | pi@raspberrypi:~ $ lsusb 41 | 可以看到usb摄像头的一些信息。见红框; 42 | 43 | 44 | 45 | 或则输入: 46 | 47 | pi@raspberrypi:~ $ ls /dev 48 | 可以看到 红框中有 video0 设备 也可以说明 usb摄像头 正常运行 49 | 50 | 51 | 52 | 2 .安装MJPG-Streamer 53 | 依次通过以下命令 安装 (直接复制粘贴即可) 54 | 55 | pi@raspberrypi:~ $ sudo apt-get install cmake libjpeg8-dev 56 | pi@raspberrypi:~ $ wget https://github.com/Five-great/mjpg-streamer/archive/master.zip 57 | pi@raspberrypi:~ $ unzip mjpg-streamer-master.zip 58 | pi@raspberrypi:~ $ cd mjp*g-* 59 | pi@raspberrypi:~/mjpg-streamer-master $ cd mjpg-* 60 | pi@raspberrypi:~/mjpg-streamer-master/mjpg-streamer-experimental $ make 61 | pi@raspberrypi:~/mjpg-streamer-master/mjpg-streamer-experimental $ sudo make install 62 | pi@raspberrypi:~/mjpg-streamer-master/mjpg-streamer-experimental $ cd home 63 | pi@raspberrypi:~ $ 64 | 上面的包源已经发生变化 可前往https://fivecc.coding.net/public/mjpg-streamer/mjpg-streamer/git 查看, 65 | 66 | 直链下载:https://fivecc.coding.net/p/mjpg-streamer/d/mjpg-streamer/git/archive/master/?download=true 并解压 67 | 68 | 或者 通过 git 方式 无需解压 69 | 70 | git clone https://e.coding.net/fivecc/mjpg-streamer/mjpg-streamer.git 71 | 执行如下操作 72 | 73 | pi@raspberrypi:~ $ sudo apt-get install cmake libjpeg8-dev 74 | pi@raspberrypi:~ $ git clone https://e.coding.net/fivecc/mjpg-streamer/mjpg-streamer.git 75 | pi@raspberrypi:~ $ cd mjpg-* 76 | pi@raspberrypi:~/mjpg-streamer-master $ cd mjpg-* 77 | pi@raspberrypi:~/mjpg-streamer-master/mjpg-streamer-experimental $ make 78 | pi@raspberrypi:~/mjpg-streamer-master/mjpg-streamer-experimental $ sudo make install 79 | pi@raspberrypi:~/mjpg-streamer-master/mjpg-streamer-experimental $ cd 80 | pi@raspberrypi:~ $ 81 | 一路没报错 就安好了 82 | 83 | 下面我们先来分析一下mjpg_streamer这个功能的结构: 84 | 85 | Mjpg_streamer.c /* 主程序主要运行如下几个部分 */ 86 | input_init(); /* 输入相关的初始化 */ 87 | 88 | output_init(); /* 输出相关的初始化 */ 89 | input_run(); /* 运行输入函数,采集输入数据*/ 90 | output_run(); /* 输出初函数,把数据收集起来通过网络socket发送出去 */ 91 | 92 | 3.启动 MJPG-Streamer 93 | 3.1 输入以下命令 94 | pi@raspberrypi: ~ $ /usr/local/bin/mjpg_streamer -i "/usr/local/lib/mjpg-streamer/input_uvc.so -n -f 30 -r 1280x720" -o "/usr/local/lib/mjpg-streamer/output_http.so -p 8080 -w /usr/local/share/mjpg-streamer/www" 95 | 见到如下便是成功了 96 | 97 | 98 | 99 | 3.2参数说明: 100 | -i "/usr/local/lib/mjpg-streamer/input_uvc.so -n -f 30 -r 1280x720" 101 | 102 | -i 输入 103 | 104 | input_uvc.so:UVC输入组件 105 | 106 | -f 30 :表示30帧 107 | 108 | -r 1280*720 :分辨率 109 | 110 | -y :YUV格式输入(有卡顿),不加表示MJPG输入(需要摄像头支持) 111 | 112 | -o "/usr/local/lib/mjpg-streamer/output_http.so -p 8080 -w /usr/local/share/mjpg-streamer/www" 113 | 114 | -o 输出 115 | 116 | output_http.so :网页输出组件 117 | 118 | -w www : 网页输出 119 | 120 | -p 8080 :端口 8080 121 | 122 | -d 1000 : 时间1S 123 | 124 | 4.实时视频接收 125 | 1)可以直接打开网址 http://<树莓派ip>:/javascript.html 126 | 127 | 2) 创建一个.html 文件 就可以访问 128 | 129 | index.html 130 | 131 | 132 | 133 | 134 | 实时视频 135 | 151 | 152 | 153 | 154 |
158 | 159 | 194 | 195 | 196 | 视频效果: 197 | 198 | 199 | ———————————————— 200 | 版权声明:本文为CSDN博主「Five-菜鸟级」的原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接及本声明。 201 | 原文链接:https://blog.csdn.net/qq_41923622/article/details/88366185 -------------------------------------------------------------------------------- /mongodb意外断电,非正常关闭, 造成不可启动, 日志出现WT_ERROR non-specific WiredTiger error, terminating.md: -------------------------------------------------------------------------------- 1 | # mongodb意外断电,非正常关闭, 造成不可启动, 日志出现WT_ERROR: non-specific WiredTiger error, terminating 2 | 3 | 4 | 5 | # 亲测 6 | 7 | 8 | 9 | 10 | 11 | 今天启动MongoDB, /**/mongod -f /**/mongodb.conf 12 | 13 | 今天服务器突然都断电了, 我重新启动的时候, 已经启动不成功, 去log文件夹下面, 看了一下日志: 14 | 15 | ``` 16 | 2020-12-28T12:51:29.423+0800 I CONTROL [initandlisten] target_arch: x86_64 17 | 2020-12-28T12:51:29.423+0800 I CONTROL [initandlisten] options: { config: "/home/**/mongodb/conf/mongodb.conf", net: { port: 27017 }, processManagement: { fork: true, pidFilePath: "/home/**/mongodb/data/mongodb.pid" }, repair: true, replication: { oplogSizeMB: 512 }, storage: { dbPath: "/home/**/mongodb/data/" }, systemLog: { destination: "file", logAppend: true, path: "/home/**/mongodb/log/mongodb.log" } } 18 | 2020-12-28T12:51:29.449+0800 E NETWORK [initandlisten] listen(): bind() failed No space left on device for socket: /tmp/mongodb-27017.sock 19 | 2020-12-28T12:51:29.449+0800 E NETWORK [initandlisten] Failed to set up sockets during startup. 20 | 2020-12-28T12:51:29.449+0800 E STORAGE [initandlisten] Failed to set up listener: InternalError: Failed to set up sockets 21 | 2020-12-28T12:51:29.449+0800 I NETWORK [initandlisten] shutdown: going to close listening sockets... 22 | 2020-12-28T12:51:29.449+0800 I NETWORK [initandlisten] shutdown: going to flush diaglog... 
23 | 2020-12-28T12:51:29.449+0800 I CONTROL [initandlisten] now exiting 24 | 2020-12-28T12:51:29.449+0800 I CONTROL [initandlisten] shutting down with code:48 25 | 2020-12-28T13:03:26.708+0800 I CONTROL [main] ***** SERVER RESTARTED ***** 26 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] MongoDB starting : pid=110233 port=27017 dbpath=/home/**/mongodb/data/ 64-bit host=localhost.localdomain 27 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] db version v3.4.3 28 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] git version: f07437fb5a6cca07c10bafa78365456eb1d6d5e1 29 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.0-fips 29 Mar 2010 30 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] allocator: tcmalloc 31 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] modules: none 32 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] build environment: 33 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] distmod: amazon 34 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] distarch: x86_64 35 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] target_arch: x86_64 36 | 2020-12-28T13:03:26.713+0800 I CONTROL [initandlisten] options: { config: "/home/**/mongodb/conf/mongodb.conf", net: { port: 27017 }, processManagement: { fork: true, pidFilePath: "/home/**/mongodb/data/mongodb.pid" }, repair: true, replication: { oplogSizeMB: 512 }, storage: { dbPath: "/home/**/mongodb/data/" }, systemLog: { destination: "file", logAppend: true, path: "/home/**/mongodb/log/mongodb.log" } } 37 | 2020-12-28T13:03:26.737+0800 E NETWORK [initandlisten] listen(): bind() failed No space left on device for socket: /tmp/mongodb-27017.sock 38 | 2020-12-28T13:03:26.737+0800 E NETWORK [initandlisten] Failed to set up sockets during startup. 39 | 2020-12-28T13:03:26.737+0800 E STORAGE [initandlisten] Failed to set up listener: InternalError: Failed to set up sockets 40 | 2020-12-28T13:03:26.737+0800 I NETWORK [initandlisten] shutdown: going to close listening sockets... 41 | 2020-12-28T13:03:26.737+0800 I NETWORK [initandlisten] shutdown: going to flush diaglog... 
42 | 2020-12-28T13:03:26.737+0800 I CONTROL [initandlisten] now exiting 43 | 2020-12-28T13:03:26.737+0800 I CONTROL [initandlisten] shutting down with code:48 44 | ``` 45 | 46 | 我看到了listen(): bind() failed No space left on device for socket: /tmp/mongodb-27017.sock, 这句话, 说明磁盘空间不够, df -h, 看了一下, 确实是空间不够, 清理完之后, 接着重新启动, 又失败: 47 | 48 | ``` 49 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] MongoDB starting : pid=8356 port=27017 dbpath=/home/**/mongodb/data/ 64-bit host=localhost.localdomain 50 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] db version v3.4.3 51 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] git version: f07437fb5a6cca07c10bafa78365456eb1d6d5e1 52 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.0-fips 29 Mar 2010 53 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] allocator: tcmalloc 54 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] modules: none 55 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] build environment: 56 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] distmod: amazon 57 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] distarch: x86_64 58 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] target_arch: x86_64 59 | 2020-12-28T19:04:49.385+0800 I CONTROL [initandlisten] options: { config: "/home/**/mongodb/conf/mongodb.conf", net: { port: 27017 }, processManagement: { fork: true, pidFilePath: "/home/**/mongodb/data/mongodb.pid" }, repair: true, replication: { oplogSizeMB: 512 }, storage: { dbPath: "/home/**/mongodb/data/" }, systemLog: { destination: "file", logAppend: true, path: "/home/**/mongodb/log/mongodb.log" } } 60 | 2020-12-28T19:04:49.414+0800 I - [initandlisten] Detected data files in /home/**/mongodb/data/ created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'. 61 | 2020-12-28T19:04:49.414+0800 I STORAGE [initandlisten] Detected WT journal files. Running recovery from last checkpoint. 62 | 2020-12-28T19:04:49.414+0800 I STORAGE [initandlisten] journal to nojournal transition config: create,cache_size=15557M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0), 63 | 2020-12-28T19:04:49.436+0800 E STORAGE [initandlisten] WiredTiger error (-31802) [1609153489:436086][8356:0x7fd6498a2e00], txn-recover: unsupported WiredTiger file version: this build only supports major/minor versions up to 1/0, and the file is version 2/0: WT_ERROR: non-specific WiredTiger error 64 | 2020-12-28T19:04:49.436+0800 E STORAGE [initandlisten] WiredTiger error (0) [1609153489:436137][8356:0x7fd6498a2e00], txn-recover: WiredTiger is unable to read the recovery log. 
65 | 2020-12-28T19:04:49.436+0800 E STORAGE [initandlisten] WiredTiger error (0) [1609153489:436146][8356:0x7fd6498a2e00], txn-recover: This may be due to the log files being encrypted, being from an older version or due to corruption on disk 66 | 2020-12-28T19:04:49.436+0800 E STORAGE [initandlisten] WiredTiger error (0) [1609153489:436152][8356:0x7fd6498a2e00], txn-recover: You should confirm that you have opened the database with the correct options including all encryption and compression options 67 | 2020-12-28T19:04:49.436+0800 E STORAGE [initandlisten] WiredTiger error (-31802) [1609153489:436163][8356:0x7fd6498a2e00], txn-recover: Recovery failed: WT_ERROR: non-specific WiredTiger error 68 | 2020-12-28T19:04:49.436+0800 I - [initandlisten] Assertion: 28718:-31802: WT_ERROR: non-specific WiredTiger error src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp 251 69 | 2020-12-28T19:04:49.438+0800 I STORAGE [initandlisten] exception in initAndListen: 28718 -31802: WT_ERROR: non-specific WiredTiger error, terminating 70 | 2020-12-28T19:04:49.438+0800 I NETWORK [initandlisten] shutdown: going to close listening sockets... 71 | 2020-12-28T19:04:49.438+0800 I NETWORK [initandlisten] removing socket file: /tmp/mongodb-27017.sock 72 | 2020-12-28T19:04:49.438+0800 I NETWORK [initandlisten] shutdown: going to flush diaglog... 73 | 2020-12-28T19:04:49.438+0800 I CONTROL [initandlisten] now exiting 74 | 2020-12-28T19:04:49.438+0800 I CONTROL [initandlisten] shutting down with code:100 75 | 2020-12-28T19:06:07.035+0800 I CONTROL [main] ***** SERVER RESTARTED ***** 76 | ``` 77 | 78 | 上面由于不正常关机, 造成数据库被锁起来了, 需要去db目录下面: 79 | 80 | ``` 81 | cd /data/db //进入数据文件夹 82 | rm -r journal 83 | rm -r mongod.lock 84 | rm -r WiredTiger.lock 85 | ``` 86 | 87 | 然后接着重启服务就可以了 -------------------------------------------------------------------------------- /用Golang写爬虫(一).md: -------------------------------------------------------------------------------- 1 | ## [用Golang写*爬虫*(一)](https://zhuanlan.zhihu.com/p/55039990) 2 | 3 | [![小歪丶](https://pica.zhimg.com/50/v2-aeb6e58abd907333ff19ed4ef1cf13b1_s.jpg?source=4e949a73)](https://www.zhihu.com/people/cuishite) 4 | 5 | [小歪丶](https://www.zhihu.com/people/cuishite) 6 | 7 | 乐观、积极、向上 8 | 9 | 前言近期有些项目需要用到Golang,大概花了一周来看语法,然后就开始看爬虫相关的。这里记录下如何使用Golang来写爬虫的几个步骤,最终完成的效果如下图 10 | ![img](https://pic1.zhimg.com/v2-bcb084009335290f2b53ad25605c2ffc_b.jpg) 11 | 环境安装比较简单`sudo apt-get install golang # (Linux) brew install go # (Mac)`安装之后注意`GOPATH`和`GOROOT`等环境变量设置,IDE用的是jetbrains家的GoLand。建议先去看看Golang的官方文档,学习基本语法知识。地址:[官方教程中文版](https://link.zhihu.com/?target=https%3A//tour.go-zh.org/welcome/1)创建文档新建文件`crawler.go`,并写入如下代码: 12 | 13 | ```go 14 | package main import "fmt" func main() { fmt.Println("Hello, world") } 15 | ``` 16 | 17 | 18 | 19 | 运行方法:`go run crawler.go`,肉眼可见,编译速度比JAVA要快得多。下载网页这里先从Golang原生http库开始,直接使用`net/http`包内的函数请求`import "net/http" ... 
resp, err := http.Get("http://wwww.baidu.com") ` 20 | 所以代码可以这样写 21 | 22 | ```go 23 | package main 24 | 25 | import ( 26 | "fmt" 27 | "io/ioutil" 28 | "net/http" 29 | ) 30 | 31 | func main() { 32 | fmt.Println("Hello, world") 33 | resp, err := http.Get("http://www.baidu.com/") 34 | if err != nil { 35 | fmt.Println("http get error", err) 36 | return 37 | } 38 | body, err := ioutil.ReadAll(resp.Body) 39 | if err != nil { 40 | fmt.Println("read error", err) 41 | return 42 | } 43 | fmt.Println(string(body)) 44 | } 45 | 46 | ``` 47 | 48 | 49 | 50 | Golang的错误处理就是这样的,习惯就好。这里更好的做法是把下载方法封装为函数。 51 | 52 | ```go 53 | package main import ( "fmt" "io/ioutil" "net/http" ) 54 | func main() { 55 | fmt.Println("Hello, world") 56 | url := "http://www.baidu.com/" 57 | download(url) 58 | } 59 | 60 | func download(url string) { 61 | client := &http.Client{} 62 | eq, _ := http.NewRequest("GET", url, nil) 63 | // 自定义Header 64 | eq.Header.Set("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)") 65 | resp, err := client.Do(req) 66 | if err != nil { 67 | fmt.Println("http get error", err) 68 | return } 69 | //函数结束后关闭相关链接 70 | defer resp.Body.Close() 71 | body, err := ioutil.ReadAll(resp.Body) 72 | f err != nil { 73 | fmt.Println("read error", err) 74 | return } 75 | fmt.Println(string(body)) } 76 | ``` 77 | 78 | 79 | 80 | 解析网页go常见的解析器xpath、jquery、正则都有,直接搜索即可,我这里偷懒,直接用别人写好的轮子`collectlinks`,可以提取网页中所有的链接,下载方法`go get -u github.com/jackdanger/collectlinks` 81 | 82 | ```go 83 | package main import ( "fmt" "github.com/jackdanger/collectlinks" "net/http" ) 84 | func main() { 85 | fmt.Println("Hello, world") 86 | url := "http://www.baidu.com/" 87 | download(url) 88 | } 89 | func download(url string) { 90 | client := &http.Client{} 91 | req, _ := http.NewRequest("GET", url, nil) 92 | // 自定义Header 93 | req.Header.Set("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)") 94 | resp, err := client.Do(req) 95 | if err != nil { 96 | fmt.Println("http get error", err) 97 | return 98 | } 99 | //函数结束后关闭相关链接 100 | defer resp.Body.Close() 101 | links := collectlinks.All(resp.Body) 102 | for _, link := range links { 103 | fmt.Println("parse url", link) 104 | } } 105 | ``` 106 | 107 | 108 | 109 | `并发Golang使用关键字`go`即可开启一个新的go程,也叫`goroutine`,使用 go 语句开启一个新的 goroutine 之后,go 语句之后的函数调用将在新的 goroutine 中执行,而不会阻塞当前的程序执行。所以使用Golang可以很容易写成异步IO。`` 110 | 111 | ```go 112 | package main import ( "fmt" "github.com/jackdanger/collectlinks" "net/http" ) 113 | func main() { 114 | fmt.Println("Hello, world") 115 | url := "http://www.baidu.com/" 116 | queue := make(chan string) 117 | 118 | go func() { 119 | queue <- url }() 120 | for uri := range queue { 121 | download(uri, queue) 122 | } } 123 | 124 | func download(url string, queue chan string) { 125 | client := &http.Client{} 126 | req, _ := http.NewRequest("GET", url, nil) 127 | // 自定义Header 128 | req.Header.Set("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)") 129 | resp, err := client.Do(req) 130 | if err != nil { 131 | fmt.Println("http get error", err) 132 | return 133 | } 134 | //函数结束后关闭相关链接 135 | defer resp.Body.Close() 136 | links := collectlinks.All(resp.Body) 137 | for _, link := range links { 138 | fmt.Println("parse url", link) 139 | go func() { 140 | queue <- link 141 | }() 142 | } } 143 | ``` 144 | 145 | 146 | 147 | ``现在的流程是main有一个for循环读取来自名为queue的通道,download下载网页和链接解析,将发现的链接放入main使用的同一队列中,并再开启一个新的goroutine去抓取形成无限循环。这里对于新手来说真的不好理解,涉及到Golang的两个比较重要的东西:goroutine和channels,这个我也不大懂,这里也不多讲了,以后有机会细说。官方:A *goroutine* is a lightweight thread managed by the 
Go runtime。翻译过来就是:Goroutine是由Go运行时管理的轻量级线程。channels是连接并发goroutine的管道,可以理解为goroutine通信的管道。 可以将值从一个goroutine发送到通道,并将这些值接收到另一个goroutine中。对这部分有兴趣的可以去看文档。好了,到这里爬虫基本上已经完成了,但是还有两个问题:去重、链接是否有效。链接转为绝对路径` 148 | 149 | ```go 150 | package main import ( "fmt" "github.com/jackdanger/collectlinks" "net/http" "net/url" ) func main() { fmt.Println("Hello, world") url := "http://www.baidu.com/" queue := make(chan string) go func() { queue <- url }() for uri := range queue { download(uri, queue) } } 151 | func download(url string, queue chan string) { 152 | client := &http.Client{} 153 | req, _ := http.NewRequest("GET", url, nil) 154 | // 自定义Header 155 | req.Header.Set("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)") resp, err := client.Do(req) if err != nil { 156 | fmt.Println("http get error", err) 157 | return } 158 | //函数结束后关闭相关链接 159 | defer resp.Body.Close() 160 | links := collectlinks.All(resp.Body) 161 | for _, link := range links { absolute := urlJoin(link, url) 162 | if url != " " { fmt.Println("parse url", absolute) 163 | 164 | go func() { 165 | queue <- absolute }() } } } 166 | func urlJoin(href, base string) string { uri, err := url.Parse(href) if err != nil { return " " } baseUrl, err := url.Parse(base) if err != nil { return " " } return baseUrl.ResolveReference(uri).String() } `这里新写了一个`urlJoin`函数,功能和Python中的`urllib.parse.urljoin`一样。去重我们维护一个map用来记录,那些是已经访问过的。`package main import ( "fmt" "github.com/jackdanger/collectlinks" "net/http" "net/url" ) 167 | var visited = make(map[string]bool) func main() { fmt.Println("Hello, world") url := "http://www.baidu.com/" queue := make(chan string) go func() { queue <- url }() for uri := range queue { download(uri, queue) } } 168 | func download(url string, queue chan string) { visited[url] = true client := &http.Client{} 169 | req, _ := http.NewRequest("GET", url, nil) 170 | // 自定义Header 171 | req.Header.Set("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)") resp, err := client.Do(req) if err != nil { fmt.Println("http get error", err) return } 172 | //函数结束后关闭相关链接 173 | defer resp.Body.Close() 174 | links := collectlinks.All(resp.Body) 175 | for _, link := range links { absolute := urlJoin(link, url) 176 | 177 | if url != " " { if !visited[absolute] { fmt.Println("parse url", absolute) 178 | go func() { queue <- absolute }() } } } } 179 | func urlJoin(href, base string) string { uri, err := url.Parse(href) 180 | if err != nil { return " " } 181 | baseUrl, err := url.Parse(base) if err != nil { return " " } 182 | return baseUrl.ResolveReference(uri).String() } 183 | ``` 184 | 185 | 186 | 187 | 好了大功告成,运行程序,会像一张网铺开一直不停的抓下去。写到这里,我突然觉得我忘了什么,哦,忘记加timeout了,必须要为每次请求加上超时,前两天才写了的。完整代码就补贴上来了,[在github中](https://link.zhihu.com/?target=https%3A//github.com/zhangslob/awesome_crawl/tree/master/Golang_basic_spider)。运行一段时间后的资源消耗![img](https://pic2.zhimg.com/v2-0ac07bcc60064b854bc00a10f0617cd5_b.jpg) 188 | 189 | ![img](https://pic3.zhimg.com/v2-5cf4f851eddd46677d670b12297e0ba6_b.jpg) 190 | CPU使用率并不高,内存因为会保存一张不断增大的map,所以会一直上涨。如果是用Python,该怎么写呢?资源消耗和Golang比会如何呢?有兴趣的小伙伴可以去试试。后记都说Golang的并发好,体验了下确实如此。Golang起步晚,但是发展的块。采集还是多学点技能防身吧。我从上周开始学习Golang语法,跟着官方文档学习,基本上都可以看懂在做什么,除了那几块难理解的,需要自己多写多用才行。有兴趣的小伙伴一起入坑啊。 -------------------------------------------------------------------------------- /scikit-learn随机森林调参小结.md: -------------------------------------------------------------------------------- 1 | # [scikit-learn随机森林调参小结](https://www.cnblogs.com/pinard/p/6160412.html) 2 | 3 |     
在[Bagging与随机森林算法原理小结](http://www.cnblogs.com/pinard/p/6156009.html)中,我们对随机森林(Random Forest, 以下简称RF)的原理做了总结。本文就从实践的角度对RF做一个总结。重点讲述scikit-learn中RF的调参注意事项,以及和GBDT调参的异同点。 4 | 5 | # 1. scikit-learn随机森林类库概述 6 | 7 |     在scikit-learn中,RF的分类类是RandomForestClassifier,回归类是RandomForestRegressor。当然RF的变种Extra Trees也有, 分类类ExtraTreesClassifier,回归类ExtraTreesRegressor。由于RF和Extra Trees的区别较小,调参方法基本相同,本文只关注于RF的调参。 8 | 9 |     和GBDT的调参类似,RF需要调参的参数也包括两部分,第一部分是Bagging框架的参数,第二部分是CART决策树的参数。下面我们就对这些参数做一个介绍。 10 | 11 | 12 | 13 | # 2. RF框架参数 14 | 15 |     首先我们关注于RF的Bagging框架的参数。这里可以和GBDT对比来学习。在[scikit-learn 梯度提升树(GBDT)调参小结](http://www.cnblogs.com/pinard/p/6143927.html)中我们对GBDT的框架参数做了介绍。GBDT的框架参数比较多,重要的有最大迭代器个数,步长和子采样比例,调参起来比较费力。但是RF则比较简单,这是因为bagging框架里的各个弱学习器之间是没有依赖关系的,这减小的调参的难度。换句话说,达到同样的调参效果,RF调参时间要比GBDT少一些。 16 | 17 |     下面我来看看RF重要的Bagging框架的参数,由于RandomForestClassifier和RandomForestRegressor参数绝大部分相同,这里会将它们一起讲,不同点会指出。 18 | 19 |     1) **n_estimators**: 也就是最大的弱学习器的个数。一般来说n_estimators太小,容易欠拟合,n_estimators太大,计算量会太大,并且n_estimators到一定的数量后,再增大n_estimators获得的模型提升会很小,所以一般选择一个适中的数值。默认是100。 20 | 21 |     2) **oob_score** :即是否采用袋外样本来评估模型的好坏。默认识False。个人推荐设置为True,因为袋外分数反应了一个模型拟合后的泛化能力。 22 | 23 |     3) **criterion**:即CART树做划分时对特征的评价标准。分类模型和回归模型的损失函数是不一样的。分类RF对应的CART分类树默认是基尼系数gini,另一个可选择的标准是信息增益。回归RF对应的CART回归树默认是均方差mse,另一个可以选择的标准是绝对值差mae。一般来说选择默认的标准就已经很好的。 24 | 25 |     从上面可以看出, RF重要的框架参数比较少,主要需要关注的是 n_estimators,即RF最大的决策树个数。 26 | 27 | # 3. RF决策树参数 28 | 29 |     下面我们再来看RF的决策树参数,它要调参的参数基本和GBDT相同,如下: 30 | 31 |     1) RF划分时考虑的最大特征数**max_features**: 可以使用很多种类型的值,默认是"auto",意味着划分时最多考虑*𝑁*‾‾√N个特征;如果是"log2"意味着划分时最多考虑*𝑙**𝑜**𝑔*2*𝑁*log2N个特征;如果是"sqrt"或者"auto"意味着划分时最多考虑*𝑁*‾‾√N个特征。如果是整数,代表考虑的特征绝对数。如果是浮点数,代表考虑特征百分比,即考虑(百分比xN)取整后的特征数。其中N为样本总特征数。一般我们用默认的"auto"就可以了,如果特征数非常多,我们可以灵活使用刚才描述的其他取值来控制划分时考虑的最大特征数,以控制决策树的生成时间。 32 | 33 |     2) 决策树最大深度**max_depth**: 默认可以不输入,如果不输入的话,决策树在建立子树的时候不会限制子树的深度。一般来说,数据少或者特征少的时候可以不管这个值。如果模型样本量多,特征也多的情况下,推荐限制这个最大深度,具体的取值取决于数据的分布。常用的可以取值10-100之间。 34 | 35 |     3) 内部节点再划分所需最小样本数**min_samples_split**: 这个值限制了子树继续划分的条件,如果某节点的样本数少于min_samples_split,则不会继续再尝试选择最优特征来进行划分。 默认是2.如果样本量不大,不需要管这个值。如果样本量数量级非常大,则推荐增大这个值。 36 | 37 |     4) 叶子节点最少样本数**min_samples_leaf**: 这个值限制了叶子节点最少的样本数,如果某叶子节点数目小于样本数,则会和兄弟节点一起被剪枝。 默认是1,可以输入最少的样本数的整数,或者最少样本数占样本总数的百分比。如果样本量不大,不需要管这个值。如果样本量数量级非常大,则推荐增大这个值。 38 | 39 |     5)叶子节点最小的样本权重和**min_weight_fraction_leaf**:这个值限制了叶子节点所有样本权重和的最小值,如果小于这个值,则会和兄弟节点一起被剪枝。 默认是0,就是不考虑权重问题。一般来说,如果我们有较多样本有缺失值,或者分类树样本的分布类别偏差很大,就会引入样本权重,这时我们就要注意这个值了。 40 | 41 |     6) 最大叶子节点数**max_leaf_nodes**: 通过限制最大叶子节点数,可以防止过拟合,默认是"None”,即不限制最大的叶子节点数。如果加了限制,算法会建立在最大叶子节点数内最优的决策树。如果特征不多,可以不考虑这个值,但是如果特征分成多的话,可以加以限制,具体的值可以通过交叉验证得到。 42 | 43 |     7) 节点划分最小不纯度**min_impurity_split:** 这个值限制了决策树的增长,如果某节点的不纯度(基于基尼系数,均方差)小于这个阈值,则该节点不再生成子节点。即为叶子节点 。一般不推荐改动默认值1e-7。 44 | 45 |     上面决策树参数中最重要的包括最大特征数max_features, 最大深度max_depth, 内部节点再划分所需最小样本数min_samples_split和叶子节点最少样本数min_samples_leaf。 46 | 47 | # 4.RF调参实例 48 | 49 |     这里仍然使用GBDT调参时同样的数据集来做RF调参的实例,数据的[下载地址在这](http://files.cnblogs.com/files/pinard/train_modified.zip)。本例我们采用袋外分数来评估我们模型的好坏。 50 | 51 |     完整代码参见我的github:https://github.com/ljpzzz/machinelearning/blob/master/ensemble-learning/random_forest_classifier.ipynb 52 | 53 |     首先,我们载入需要的类库: 54 | 55 | [![复制代码](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0);) 56 | 57 | ``` 58 | import pandas as pd 59 | import numpy as np 60 | from sklearn.ensemble import RandomForestClassifier 61 | from sklearn.grid_search import GridSearchCV 62 | from sklearn import cross_validation, metrics 63 | 64 | import matplotlib.pylab as plt 65 | %matplotlib 
inline 66 | ``` 67 | 68 | [![复制代码](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0);) 69 | 70 |     接着,我们把解压的数据用下面的代码载入,顺便看看数据的类别分布。 71 | 72 | ``` 73 | train = pd.read_csv('train_modified.csv') 74 | target='Disbursed' # Disbursed的值就是二元分类的输出 75 | IDcol = 'ID' 76 | train['Disbursed'].value_counts() 77 | ``` 78 | 79 |     可以看到类别输出如下,也就是类别0的占大多数。 80 | 81 | 0 19680 82 | 1 320 83 | Name: Disbursed, dtype: int64 84 | 85 |     接着我们选择好样本特征和类别输出。 86 | 87 | ``` 88 | x_columns = [x for x in train.columns if x not in [target, IDcol]] 89 | X = train[x_columns] 90 | y = train['Disbursed'] 91 | ``` 92 | 93 |     不管任何参数,都用默认的,我们拟合下数据看看: 94 | 95 | ``` 96 | rf0 = RandomForestClassifier(oob_score=True, random_state=10) 97 | rf0.fit(X,y) 98 | print rf0.oob_score_ 99 | y_predprob = rf0.predict_proba(X)[:,1] 100 | print "AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob) 101 | ``` 102 | 103 |     输出如下,可见袋外分数已经很高,而且AUC分数也很高。相对于GBDT的默认参数输出,RF的默认参数拟合效果对本例要好一些。 104 | 105 | 0.98005 106 | AUC Score (Train): 0.999833 107 | 108 |    我们首先对n_estimators进行网格搜索: 109 | 110 | [![复制代码](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0);) 111 | 112 | ``` 113 | param_test1 = {'n_estimators':range(10,71,10)} 114 | gsearch1 = GridSearchCV(estimator = RandomForestClassifier(min_samples_split=100, 115 | min_samples_leaf=20,max_depth=8,max_features='sqrt' ,random_state=10), 116 | param_grid = param_test1, scoring='roc_auc',cv=5) 117 | gsearch1.fit(X,y) 118 | gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_ 119 | ``` 120 | 121 | [![复制代码](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0);) 122 | 123 |     输出结果如下: 124 | 125 | ([mean: 0.80681, std: 0.02236, params: {'n_estimators': 10}, 126 | mean: 0.81600, std: 0.03275, params: {'n_estimators': 20}, 127 | mean: 0.81818, std: 0.03136, params: {'n_estimators': 30}, 128 | mean: 0.81838, std: 0.03118, params: {'n_estimators': 40}, 129 | mean: 0.82034, std: 0.03001, params: {'n_estimators': 50}, 130 | mean: 0.82113, std: 0.02966, params: {'n_estimators': 60}, 131 | mean: 0.81992, std: 0.02836, params: {'n_estimators': 70}], 132 | {'n_estimators': 60}, 133 | 0.8211334476626017) 134 | 135 |     这样我们得到了最佳的弱学习器迭代次数,接着我们对决策树最大深度max_depth和内部节点再划分所需最小样本数min_samples_split进行网格搜索。 136 | 137 | [![复制代码](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0);) 138 | 139 | ``` 140 | param_test2 = {'max_depth':range(3,14,2), 'min_samples_split':range(50,201,20)} 141 | gsearch2 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 60, 142 | min_samples_leaf=20,max_features='sqrt' ,oob_score=True, random_state=10), 143 | param_grid = param_test2, scoring='roc_auc',iid=False, cv=5) 144 | gsearch2.fit(X,y) 145 | gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_ 146 | ``` 147 | 148 | [![复制代码](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0);) 149 | 150 |     输出如下: 151 | 152 | ([mean: 0.79379, std: 0.02347, params: {'min_samples_split': 50, 'max_depth': 3}, 153 | mean: 0.79339, std: 0.02410, params: {'min_samples_split': 70, 'max_depth': 3}, 154 | mean: 0.79350, std: 0.02462, params: {'min_samples_split': 90, 'max_depth': 3}, 155 | mean: 0.79367, std: 0.02493, params: {'min_samples_split': 110, 'max_depth': 3}, 156 | mean: 0.79387, std: 0.02521, params: {'min_samples_split': 130, 'max_depth': 3}, 157 | mean: 0.79373, std: 0.02524, params: {'min_samples_split': 150, 'max_depth': 3}, 158 | mean: 0.79378, std: 0.02532, params: {'min_samples_split': 170, 'max_depth': 3}, 159 | 
mean: 0.79349, std: 0.02542, params: {'min_samples_split': 190, 'max_depth': 3}, 160 | mean: 0.80960, std: 0.02602, params: {'min_samples_split': 50, 'max_depth': 5}, 161 | mean: 0.80920, std: 0.02629, params: {'min_samples_split': 70, 'max_depth': 5}, 162 | mean: 0.80888, std: 0.02522, params: {'min_samples_split': 90, 'max_depth': 5}, 163 | mean: 0.80923, std: 0.02777, params: {'min_samples_split': 110, 'max_depth': 5}, 164 | mean: 0.80823, std: 0.02634, params: {'min_samples_split': 130, 'max_depth': 5}, 165 | mean: 0.80801, std: 0.02637, params: {'min_samples_split': 150, 'max_depth': 5}, 166 | mean: 0.80792, std: 0.02685, params: {'min_samples_split': 170, 'max_depth': 5}, 167 | mean: 0.80771, std: 0.02587, params: {'min_samples_split': 190, 'max_depth': 5}, 168 | mean: 0.81688, std: 0.02996, params: {'min_samples_split': 50, 'max_depth': 7}, 169 | mean: 0.81872, std: 0.02584, params: {'min_samples_split': 70, 'max_depth': 7}, 170 | mean: 0.81501, std: 0.02857, params: {'min_samples_split': 90, 'max_depth': 7}, 171 | mean: 0.81476, std: 0.02552, params: {'min_samples_split': 110, 'max_depth': 7}, 172 | mean: 0.81557, std: 0.02791, params: {'min_samples_split': 130, 'max_depth': 7}, 173 | mean: 0.81459, std: 0.02905, params: {'min_samples_split': 150, 'max_depth': 7}, 174 | mean: 0.81601, std: 0.02808, params: {'min_samples_split': 170, 'max_depth': 7}, 175 | mean: 0.81704, std: 0.02757, params: {'min_samples_split': 190, 'max_depth': 7}, 176 | mean: 0.82090, std: 0.02665, params: {'min_samples_split': 50, 'max_depth': 9}, 177 | mean: 0.81908, std: 0.02527, params: {'min_samples_split': 70, 'max_depth': 9}, 178 | mean: 0.82036, std: 0.02422, params: {'min_samples_split': 90, 'max_depth': 9}, 179 | mean: 0.81889, std: 0.02927, params: {'min_samples_split': 110, 'max_depth': 9}, 180 | mean: 0.81991, std: 0.02868, params: {'min_samples_split': 130, 'max_depth': 9}, 181 | mean: 0.81788, std: 0.02436, params: {'min_samples_split': 150, 'max_depth': 9}, 182 | mean: 0.81898, std: 0.02588, params: {'min_samples_split': 170, 'max_depth': 9}, 183 | mean: 0.81746, std: 0.02716, params: {'min_samples_split': 190, 'max_depth': 9}, 184 | mean: 0.82395, std: 0.02454, params: {'min_samples_split': 50, 'max_depth': 11}, 185 | mean: 0.82380, std: 0.02258, params: {'min_samples_split': 70, 'max_depth': 11}, 186 | mean: 0.81953, std: 0.02552, params: {'min_samples_split': 90, 'max_depth': 11}, 187 | mean: 0.82254, std: 0.02366, params: {'min_samples_split': 110, 'max_depth': 11}, 188 | mean: 0.81950, std: 0.02768, params: {'min_samples_split': 130, 'max_depth': 11}, 189 | mean: 0.81887, std: 0.02636, params: {'min_samples_split': 150, 'max_depth': 11}, 190 | mean: 0.81910, std: 0.02734, params: {'min_samples_split': 170, 'max_depth': 11}, 191 | mean: 0.81564, std: 0.02622, params: {'min_samples_split': 190, 'max_depth': 11}, 192 | mean: 0.82291, std: 0.02092, params: {'min_samples_split': 50, 'max_depth': 13}, 193 | mean: 0.82177, std: 0.02513, params: {'min_samples_split': 70, 'max_depth': 13}, 194 | mean: 0.82415, std: 0.02480, params: {'min_samples_split': 90, 'max_depth': 13}, 195 | mean: 0.82420, std: 0.02417, params: {'min_samples_split': 110, 'max_depth': 13}, 196 | mean: 0.82209, std: 0.02481, params: {'min_samples_split': 130, 'max_depth': 13}, 197 | mean: 0.81852, std: 0.02227, params: {'min_samples_split': 150, 'max_depth': 13}, 198 | mean: 0.81955, std: 0.02885, params: {'min_samples_split': 170, 'max_depth': 13}, 199 | mean: 0.82092, std: 0.02600, params: {'min_samples_split': 190, 
'max_depth': 13}], 200 | {'max_depth': 13, 'min_samples_split': 110}, 201 | 0.8242016800050813) 202 | 203 |     我们看看我们现在模型的袋外分数: 204 | 205 | ``` 206 | rf1 = RandomForestClassifier(n_estimators= 60, max_depth=13, min_samples_split=110, 207 | min_samples_leaf=20,max_features='sqrt' ,oob_score=True, random_state=10) 208 | rf1.fit(X,y) 209 | print rf1.oob_score_ 210 | ``` 211 | 212 |     输出结果为: 213 | 214 | 0.984 215 | 216 |     可见此时我们的袋外分数有一定的提高。也就是时候模型的泛化能力增强了。 217 | 218 |     对于内部节点再划分所需最小样本数min_samples_split,我们暂时不能一起定下来,因为这个还和决策树其他的参数存在关联。下面我们再对内部节点再划分所需最小样本数min_samples_split和叶子节点最少样本数min_samples_leaf一起调参。 219 | 220 | [![复制代码](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0);) 221 | 222 | ``` 223 | param_test3 = {'min_samples_split':range(80,150,20), 'min_samples_leaf':range(10,60,10)} 224 | gsearch3 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 60, max_depth=13, 225 | max_features='sqrt' ,oob_score=True, random_state=10), 226 | param_grid = param_test3, scoring='roc_auc',iid=False, cv=5) 227 | gsearch3.fit(X,y) 228 | gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_ 229 | ``` 230 | 231 | [![复制代码](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0);) 232 | 233 |     输出如下: 234 | 235 | ([mean: 0.82093, std: 0.02287, params: {'min_samples_split': 80, 'min_samples_leaf': 10}, 236 | mean: 0.81913, std: 0.02141, params: {'min_samples_split': 100, 'min_samples_leaf': 10}, 237 | mean: 0.82048, std: 0.02328, params: {'min_samples_split': 120, 'min_samples_leaf': 10}, 238 | mean: 0.81798, std: 0.02099, params: {'min_samples_split': 140, 'min_samples_leaf': 10}, 239 | mean: 0.82094, std: 0.02535, params: {'min_samples_split': 80, 'min_samples_leaf': 20}, 240 | mean: 0.82097, std: 0.02327, params: {'min_samples_split': 100, 'min_samples_leaf': 20}, 241 | mean: 0.82487, std: 0.02110, params: {'min_samples_split': 120, 'min_samples_leaf': 20}, 242 | mean: 0.82169, std: 0.02406, params: {'min_samples_split': 140, 'min_samples_leaf': 20}, 243 | mean: 0.82352, std: 0.02271, params: {'min_samples_split': 80, 'min_samples_leaf': 30}, 244 | mean: 0.82164, std: 0.02381, params: {'min_samples_split': 100, 'min_samples_leaf': 30}, 245 | mean: 0.82070, std: 0.02528, params: {'min_samples_split': 120, 'min_samples_leaf': 30}, 246 | mean: 0.82141, std: 0.02508, params: {'min_samples_split': 140, 'min_samples_leaf': 30}, 247 | mean: 0.82278, std: 0.02294, params: {'min_samples_split': 80, 'min_samples_leaf': 40}, 248 | mean: 0.82141, std: 0.02547, params: {'min_samples_split': 100, 'min_samples_leaf': 40}, 249 | mean: 0.82043, std: 0.02724, params: {'min_samples_split': 120, 'min_samples_leaf': 40}, 250 | mean: 0.82162, std: 0.02348, params: {'min_samples_split': 140, 'min_samples_leaf': 40}, 251 | mean: 0.82225, std: 0.02431, params: {'min_samples_split': 80, 'min_samples_leaf': 50}, 252 | mean: 0.82225, std: 0.02431, params: {'min_samples_split': 100, 'min_samples_leaf': 50}, 253 | mean: 0.81890, std: 0.02458, params: {'min_samples_split': 120, 'min_samples_leaf': 50}, 254 | mean: 0.81917, std: 0.02528, params: {'min_samples_split': 140, 'min_samples_leaf': 50}], 255 | {'min_samples_leaf': 20, 'min_samples_split': 120}, 256 | 0.8248650279471544) 257 | 258 |     最后我们再对最大特征数max_features做调参: 259 | 260 | [![复制代码](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0);) 261 | 262 | ``` 263 | param_test4 = {'max_features':range(3,11,2)} 264 | gsearch4 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 60, max_depth=13, 
min_samples_split=120, 265 | min_samples_leaf=20 ,oob_score=True, random_state=10), 266 | param_grid = param_test4, scoring='roc_auc',iid=False, cv=5) 267 | gsearch4.fit(X,y) 268 | gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_ 269 | ``` 270 | 271 | [![复制代码](https://common.cnblogs.com/images/copycode.gif)](javascript:void(0);) 272 | 273 |     输出如下: 274 | 275 | ([mean: 0.81981, std: 0.02586, params: {'max_features': 3}, 276 | mean: 0.81639, std: 0.02533, params: {'max_features': 5}, 277 | mean: 0.82487, std: 0.02110, params: {'max_features': 7}, 278 | mean: 0.81704, std: 0.02209, params: {'max_features': 9}], 279 | {'max_features': 7}, 280 | 0.8248650279471544) 281 | 282 |     用我们搜索到的最佳参数,我们再看看最终的模型拟合: 283 | 284 | ``` 285 | rf2 = RandomForestClassifier(n_estimators= 60, max_depth=13, min_samples_split=120, 286 | min_samples_leaf=20,max_features=7 ,oob_score=True, random_state=10) 287 | rf2.fit(X,y) 288 | print rf2.oob_score_ 289 | ``` 290 | 291 |     此时的输出为: 292 | 293 | 0.984 294 | 295 |     可见此时模型的袋外分数基本没有提高,主要原因是0.984已经是一个很高的袋外分数了,如果想进一步需要提高模型的泛化能力,我们需要更多的数据。 296 | 297 | 以上就是RF调参的一个总结,希望可以帮到朋友们。 298 | 299 | 300 | 301 | (欢迎转载,转载请注明出处。欢迎沟通交流: liujianping-ok@163.com) -------------------------------------------------------------------------------- /为什么有了ip还要mac.md: -------------------------------------------------------------------------------- 1 | 作者:闪客sun 2 | 链接:https://www.zhihu.com/question/21546408/answer/2303205686 3 | 来源:知乎 4 | 著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。 5 | 6 | 7 | 8 | **你是一台电脑,你的名字叫 A** 9 | 10 | 很久很久之前,你不与任何其他电脑相连接,孤苦伶仃。 11 | 12 | ![img](http://xximg.30daydo.com/typora/v2-3afd7b073f743911a89fbe8b9a7a5d8b_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-3afd7b073f743911a89fbe8b9a7a5d8b_720w.jpg) 13 | 14 | 直到有一天,你希望与另一台电脑 B 建立通信,于是你们各开了一个网口,用一根**网线**连接了起来。 15 | 16 | ![img](http://xximg.30daydo.com/typora/v2-7821155291abb010af49fac532126e66_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-7821155291abb010af49fac532126e66_720w.jpg) 17 | 18 | 用一根网线连接起来怎么就能"通信"了呢?我可以给你讲 IO、讲中断、讲缓冲区,但这不是研究网络时该关心的问题。 19 | 20 | 如果你纠结,要么去研究一下操作系统是如何处理网络 IO 的,要么去研究一下包是如何被网卡转换成电信号发送出去的,要么就仅仅把它当做电脑里有个小人在**开枪**吧~ 21 | 22 | ![img](http://xximg.30daydo.com/typora/v2-5e57f86574e7a13d56d3014a60fc5d7a_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-5e57f86574e7a13d56d3014a60fc5d7a_720w.jpg) 23 | 24 | 反正,你们就是连起来了,并且可以通信。 25 | 26 | ## 第一层 27 | 28 | 有一天,一个新伙伴 C 加入了,但聪明的你们很快发现,可以每个人开**两个网口**,用一共**三根网线**,彼此相连。 29 | 30 | ![img](http://xximg.30daydo.com/typora/v2-33b51fa0279d2917af3a5c43877b3a89_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-33b51fa0279d2917af3a5c43877b3a89_720w.jpg) 31 | 32 | 随着越来越多的人加入,你发现身上开的网口实在太多了,而且网线密密麻麻,混乱不堪。(而实际上一台电脑根本开不了这么多网口,所以这种连线只在理论上可行,所以连不上的我就用红色虚线表示了,就是这么严谨哈哈~) 33 | 34 | ![img](http://xximg.30daydo.com/typora/v2-838a512a8ab3deced546c582834e5779_720w.jpg)![img](https://pic3.zhimg.com/80/v2-838a512a8ab3deced546c582834e5779_720w.jpg?source=1940ef5c) 35 | 36 | 于是你们发明了一个中间设备,你们将网线都插到这个设备上,由这个设备做转发,就可以彼此之间通信了,本质上和原来一样,只不过网口的数量和网线的数量减少了,不再那么混乱。 37 | 38 | ![img](http://xximg.30daydo.com/typora/v2-365de7097f7923ce302737a72adf9895_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-365de7097f7923ce302737a72adf9895_720w.jpg) 39 | 40 | 你给它取名叫**[集线器](https://www.zhihu.com/search?q=集线器&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={"sourceType"%3A"answer"%2C"sourceId"%3A2303205686})**,它仅仅是无脑将电信号**转发到所有出口(广播)**,不做任何处理,你觉得它是没有智商的,因此把人家定性在了**物理层**。 41 | 42 | 
![img](http://xximg.30daydo.com/typora/v2-04ec9b151d018fa9c43459360a8b9395_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-04ec9b151d018fa9c43459360a8b9395_720w.jpg) 43 | 44 | 由于转发到了所有出口,那 BCDE 四台机器怎么知道数据包是不是发给自己的呢? 45 | 46 | 首先,你要给所有的连接到交换机的设备,都起个名字。原来你们叫 ABCD,但现在需要一个更专业的,**全局唯一**的名字作为标识,你把这个更高端的名字称为 **MAC 地址**。 47 | 48 | 你的 MAC 地址是 aa-aa-aa-aa-aa-aa,你的伙伴 b 的 MAC 地址是 bb-bb-bb-bb-bb-bb,以此类推,不重复就好。 49 | 50 | 这样,A 在发送数据包给 B 时,只要在头部拼接一个这样结构的数据,就可以了。 51 | 52 | ![img](http://xximg.30daydo.com/typora/v2-637dc286dcf5e0a2e6109faeeee4a579_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-637dc286dcf5e0a2e6109faeeee4a579_720w.jpg) 53 | 54 | B 在收到数据包后,根据头部的目标 MAC 地址信息,判断这个数据包的确是发给自己的,于是便**收下**。 55 | 56 | 其他的 CDE 收到数据包后,根据头部的目标 MAC 地址信息,判断这个数据包并不是发给自己的,于是便**丢弃**。 57 | 58 | ![img](http://xximg.30daydo.com/typora/v2-85fac7764973e898f77c8398457999c8_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-85fac7764973e898f77c8398457999c8_720w.jpg) 59 | 60 | 虽然集线器使整个布局干净不少,但原来我只要发给电脑 B 的消息,现在却要发给连接到集线器中的所有电脑,这样既不安全,又不节省网络资源。 61 | 62 | ## 第二层 63 | 64 | 如果把这个集线器弄得更智能一些,**只发给目标 MAC 地址指向的那台电脑**,就好了。 65 | 66 | ![img](http://xximg.30daydo.com/typora/v2-50c150a977d7be5c8ee549bd951ed37c_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-50c150a977d7be5c8ee549bd951ed37c_720w.jpg) 67 | 68 | 虽然只比集线器多了这一点点区别,但看起来似乎有智能了,你把这东西叫做**交换机**。也正因为这一点点智能,你把它放在了另一个层级,**数据链路层**。 69 | 70 | ![img](http://xximg.30daydo.com/typora/v2-c373ac4ecd477da6dba19e39ebdda13d_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-c373ac4ecd477da6dba19e39ebdda13d_720w.jpg) 71 | 72 | 如上图所示,你是这样设计的。 73 | 74 | 交换机内部维护一张 **MAC 地址表**,记录着每一个 MAC 地址的设备,连接在其哪一个端口上。 75 | 76 | | MAC 地址 | 端口 | 77 | | ----------------- | ---- | 78 | | bb-bb-bb-bb-bb-bb | 1 | 79 | | cc-cc-cc-cc-cc-cc | 3 | 80 | | aa-aa-aa-aa-aa-aa | 4 | 81 | | dd-dd-dd-dd-dd-dd | 5 | 82 | 83 | 假如你仍然要发给 B 一个数据包,构造了如下的数据结构从网口出去。 84 | 85 | ![img](http://xximg.30daydo.com/typora/v2-637dc286dcf5e0a2e6109faeeee4a579_720w.jpg)![img](https://pic3.zhimg.com/80/v2-637dc286dcf5e0a2e6109faeeee4a579_720w.jpg?source=1940ef5c) 86 | 87 | 到达交换机时,交换机内部通过自己维护的 MAC 地址表,发现**目标机器 B 的 MAC 地址 bb-bb-bb-bb-bb-bb 映射到了端口 1 上**,于是把数据从 1 号端口发给了 B,完事~ 88 | 89 | 你给这个通过这样传输方式而组成的小范围的网络,叫做**[以太网](https://www.zhihu.com/search?q=以太网&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={"sourceType"%3A"answer"%2C"sourceId"%3A2303205686})**。 90 | 91 | 当然最开始的时候,MAC 地址表是空的,是怎么逐步建立起来的呢? 
92 | 93 | 假如在 MAC 地址表为空是,你给 B 发送了如下数据 94 | 95 | ![img](http://xximg.30daydo.com/typora/v2-637dc286dcf5e0a2e6109faeeee4a579_720w.jpg)![img](https://pic3.zhimg.com/80/v2-637dc286dcf5e0a2e6109faeeee4a579_720w.jpg?source=1940ef5c) 96 | 97 | 由于这个包从端口 4 进入的交换机,所以此时交换机就可以在 MAC地址表记录第一条数据: 98 | 99 | **MAC:aa-aa-aa-aa-aa-aa-aa** 100 | **端口:4** 101 | 102 | 交换机看目标 MAC 地址(bb-bb-bb-bb-bb-bb)在地址表中并没有映射关系,于是将此包发给了**所有端口**,也即发给了所有机器。 103 | 104 | 之后,只有机器 B 收到了确实是发给自己的包,于是做出了**响应**,响应数据从端口 1 进入交换机,于是交换机此时在地址表中更新了第二条数据: 105 | 106 | **MAC:bb-bb-bb-bb-bb-bb** 107 | **端口:1** 108 | 109 | 过程如下 110 | 111 | ![img](http://xximg.30daydo.com/typora/v2-61ffeb32ec6178cb85878bcc355ed69b_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-61ffeb32ec6178cb85878bcc355ed69b_720w.jpg) 112 | 113 | 经过该网络中的机器不断地通信,交换机最终将 MAC 地址表建立完毕~ 114 | 115 | 随着机器数量越多,交换机的端口也不够了,但聪明的你发现,只要将多个交换机连接起来,这个问题就轻而易举搞定~ 116 | 117 | ![img](http://xximg.30daydo.com/typora/v2-2c7c41918f50fcdeb7234666db95f5b7_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-2c7c41918f50fcdeb7234666db95f5b7_720w.jpg) 118 | 119 | 你完全不需要设计额外的东西,只需要按照之前的设计和规矩来,按照上述的接线方式即可完成所有电脑的互联,所以交换机设计的这种规则,真的很巧妙。你想想看为什么(比如 A 要发数据给 F)。 120 | 121 | 但是你要注意,上面那根红色的线,最终在 MAC 地址表中可不是一条记录呀,而是要把 EFGH 这四台机器与该端口(端口6)的映射全部记录在表中。 122 | 123 | 最终,**两个交换机将分别记录 A ~ H 所有机器的映射记录**。 124 | 125 | 这在只有 8 台电脑的时候还好,甚至在只有几百台电脑的时候,都还好,所以这种交换机的设计方式,已经足足支撑一阵子了。 126 | 127 | 但很遗憾,人是贪婪的动物,很快,电脑的数量就发展到几千、几万、几十万。 128 | 129 | ## 第三层 130 | 131 | 交换机已经无法记录如此庞大的映射关系了。 132 | 133 | 此时你动了歪脑筋,你发现了问题的根本在于,连出去的那根红色的网线,后面不知道有多少个设备不断地连接进来,从而使得地址表越来越大。 134 | 135 | 那我可不可以让那根红色的网线,接入一个**新的设备**,这个设备就跟电脑一样有自己独立的 MAC 地址,而且同时还能帮我把数据包做一次**转发**呢? 136 | 137 | 这个设备就是**路由器,**它的功能就是,作为一台独立的拥有 MAC 地址的设备,并且可以帮我把数据包做一次转发**,**你把它定在了**网络层。** 138 | 139 | ![img](http://xximg.30daydo.com/typora/v2-775a7cb7a6ce0ebf1a7d00f91f43ce07_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-775a7cb7a6ce0ebf1a7d00f91f43ce07_720w.jpg) 140 | 141 | 注意,路由器的每一个端口,都有独立的 MAC 地址 142 | 143 | 好了,现在交换机的 MAC 地址表中,只需要多出一条 MAC 地址 ABAB 与其端口的映射关系,就可以成功把数据包转交给路由器了,这条搞定。 144 | 145 | 那如何做到,把发送给 C 和 D,甚至是把发送给 DEFGH.... 的数据包,统统先发送给路由器呢? 146 | 147 | 不难想到这样一个点子,假如电脑 C 和 D 的 MAC 地址拥有共同的前缀,比如分别是 148 | 149 | **C 的 MAC 地址:FFFF-FFFF-CCCC** 150 | **D 的 MAC 地址:FFFF-FFFF-DDDD** 151 | 152 | 那我们就可以说,将目标 MAC 地址为 **FFFF-FFFF-?开头的**,统统先发送给路由器。 153 | 154 | 这样是否可行呢?答案是否定的。 155 | 156 | 我们先从现实中 MAC 地址的结构入手,MAC地址也叫物理地址、硬件地址,长度为 48 位,一般这样来表示 157 | 158 | **00-16-EA-AE-3C-40** 159 | 160 | 它是由网络设备制造商生产时烧录在网卡的EPROM(一种闪存芯片,通常可以通过程序擦写)。其中**前 24 位(00-16-EA)代表网络硬件制造商的编号,后 24 位(AE-3C-40)是该厂家自己分配的,一般表示系列号。**只要不更改自己的 MAC 地址,MAC 地址在世界是唯一的。形象地说,MAC地址就如同身份证上的身份证号码,具有唯一性。 161 | 162 | 那如果你希望向上面那样表示将目标 MAC 地址为 **FFFF-FFFF-?开头的**,统一从路由器出去发给某一群设备(后面会提到这其实是子网的概念),那你就需要要求某一子网下统统买一个厂商制造的设备,要么你就需要要求厂商在生产网络设备烧录 MAC 地址时,提前按照你规划好的子网结构来定 MAC 地址,并且日后这个网络的结构都不能轻易改变。 163 | 164 | 这显然是不现实的。 165 | 166 | 于是你发明了一个新的地址,给每一台机器一个 32 位的编号,如: 167 | 168 | **11000000101010000000000000000001** 169 | 170 | 你觉得有些不清晰,于是把它分成四个部分,中间用点相连。 171 | 172 | **11000000.10101000.00000000.00000001** 173 | 174 | 你还觉得不清晰,于是把它转换成 10 进制。 175 | 176 | **192.168.0.1** 177 | 178 | 最后你给了这个地址一个响亮的名字,**IP 地址**。现在每一台电脑,同时有自己的 MAC 地址,又有自己的 IP 地址,只不过 IP 地址是**软件层面**上的,可以随时修改,MAC 地址一般是无法修改的。 179 | 180 | 这样一个可以随时修改的 IP 地址,就可以根据你规划的网络拓扑结构,来调整了。 181 | 182 | ![img](http://xximg.30daydo.com/typora/v2-102187e33d28be523c8d8696142bf2f1_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-102187e33d28be523c8d8696142bf2f1_720w.jpg) 183 | 184 | 如上图所示,假如我想要发送数据包给 ABCD 其中一台设备,不论哪一台,我都可以这样描述,**"将 IP 地址为 192.168.0 开头的全部发送给到路由器,之后再怎么转发,交给它!"**,巧妙吧。 185 | 186 | 那交给路由器之后,路由器又是怎么把数据包准确转发给指定设备的呢? 
187 | 188 | 别急我们慢慢来。 189 | 190 | 我们先给上面的组网方式中的每一台设备,加上自己的 IP 地址 191 | 192 | ![img](http://xximg.30daydo.com/typora/v2-1f76057980bf71669ec0c8dce306f1be_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-1f76057980bf71669ec0c8dce306f1be_720w.jpg) 193 | 194 | 现在两个设备之间传输,除了加上数据链路层的头部之外,还要再增加一个网络层的头部。 195 | 196 | 假如 A 给 B 发送数据,由于它们直接连着交换机,所以 A 直接发出如下数据包即可,其实网络层没有体现出作用。 197 | 198 | ![img](http://xximg.30daydo.com/typora/v2-8a5d9caff5dfea84263062d5951cf766_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-8a5d9caff5dfea84263062d5951cf766_720w.jpg) 199 | 200 | 但假如 A 给 C 发送数据,A 就需要先转交给路由器,然后再由路由器转交给 C。由于最底层的传输仍然需要依赖以太网,所以数据包是分成两段的。 201 | 202 | A ~ 路由器这段的包如下: 203 | 204 | ![img](http://xximg.30daydo.com/typora/v2-a17bc9cf72ea7880933a367348c67c42_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-a17bc9cf72ea7880933a367348c67c42_720w.jpg) 205 | 206 | 路由器到 C 这段的包如下: 207 | 208 | ![img](http://xximg.30daydo.com/typora/v2-7c8a35a13b2bc9d534b36a52ced9830f_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-7c8a35a13b2bc9d534b36a52ced9830f_720w.jpg) 209 | 210 | 好了,上面说的两种情况(A->B,A->C),相信细心的读者应该会有不少疑问,下面我们一个个来展开。 211 | 212 | **A 给 C 发数据包,怎么知道是否要通过路由器转发呢?** 213 | 214 | **答案:子网** 215 | 216 | 如果源 IP 与目的 IP 处于一个子网,直接将包通过交换机发出去。 217 | 218 | 如果源 IP 与目的 IP 不处于一个子网,就交给路由器去处理。 219 | 220 | 好,那现在只需要解决,什么叫处于一个子网就好了。 221 | 222 | - [192.168.0.1](https://www.zhihu.com/search?q=192.168.0.1&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={"sourceType"%3A"answer"%2C"sourceId"%3A2303205686}) 和 192.168.0.2 处于同一个子网 223 | - 192.168.0.1 和 [192.168.1.1](https://www.zhihu.com/search?q=192.168.1.1&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={"sourceType"%3A"answer"%2C"sourceId"%3A2303205686}) 处于不同子网 224 | 225 | 这两个是我们人为规定的,即我们想表示,对于 192.168.0.1 来说: 226 | 227 | **[http://192.168.0.xxx](https://link.zhihu.com/?target=http%3A//192.168.0.xxx)** **开头的,就算是在一个子网,否则就是在不同的子网。** 228 | 229 | 那对于计算机来说,怎么表达这个意思呢?于是人们发明了**子网掩码**的概念 230 | 231 | 假如某台机器的子网掩码定为 255.255.255.0 232 | 233 | 这表示,将源 IP 与目的 IP 分别同这个子网掩码进行**与运算,相等则是在一个子网,不相等就是在不同子网**,就这么简单。 234 | 235 | 比如 236 | 237 | - **A电脑**:192.168.0.1 & 255.255.255.0 = 192.168.0.0 238 | - **B电脑**:192.168.0.2 & 255.255.255.0 = 192.168.0.0 239 | - **C电脑**:192.168.1.1 & 255.255.255.0 = 192.168.1.0 240 | - **D电脑**:192.168.1.2 & 255.255.255.0 = 192.168.1.0 241 | 242 | 那么 A 与 B 在同一个子网,C 与 D 在同一个子网,但是 A 与 C 就不在同一个子网,与 D 也不在同一个子网,以此类推。 243 | 244 | ![img](http://xximg.30daydo.com/typora/v2-157b025512b6187e9a60dd069599c915_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-157b025512b6187e9a60dd069599c915_720w.jpg) 245 | 246 | 所以如果 A 给 C 发消息,A 和 C 的 IP 地址分别 & A 机器配置的子网掩码,发现不相等,则 A 认为 C 和自己不在同一个子网,于是把包发给路由器,就不管了,**之后怎么转发,A 不关心**。 247 | 248 | **A 如何知道,哪个设备是路由器?** 249 | 250 | **答案:在 A 上要设置默认网关** 251 | 252 | 上一步 A 通过是否与 C 在同一个子网内,判断出自己应该把包发给路由器,那路由器的 IP 是多少呢? 253 | 254 | 其实说发给路由器不准确,应该说 A 会把包发给**默认网关**。 255 | 256 | 对 A 来说,A 只能**直接**把包发给同处于一个子网下的某个 IP 上,所以发给路由器还是发给某个电脑,对 A 来说也不关心,只要这个设备有个 IP 地址就行。 257 | 258 | 所以**[默认网关](https://www.zhihu.com/search?q=默认网关&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={"sourceType"%3A"answer"%2C"sourceId"%3A2303205686}),就是 A 在自己电脑里配置的一个 IP 地址**,以便在发给不同子网的机器时,发给这个 IP 地址。 259 | 260 | ![img](http://xximg.30daydo.com/typora/v2-0ea2b5e1a7ab9e3c22045db6374beeb9_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-0ea2b5e1a7ab9e3c22045db6374beeb9_720w.jpg) 261 | 262 | 仅此而已! 
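上面"分别与子网掩码做与运算、结果相同才算同一子网,不同子网就交给默认网关"的判断,可以用几行 Python 验证一下(`same_subnet`、`next_hop` 都是本文随手起的名字,仅作示意,不是哪个库的现成接口):

```python
import ipaddress

def same_subnet(src_ip: str, dst_ip: str, netmask: str) -> bool:
    """照文中的做法:两个 IP 分别与子网掩码按位与,结果相同即同一子网。"""
    mask = int(ipaddress.IPv4Address(netmask))
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src & mask) == (dst & mask)

def next_hop(src_ip: str, dst_ip: str, netmask: str, gateway: str) -> str:
    """同一子网就直接发给对方,否则先发给默认网关(只返回下一跳 IP,示意用)。"""
    return dst_ip if same_subnet(src_ip, dst_ip, netmask) else gateway

# A(192.168.0.1)发给 B(192.168.0.2):同一子网,直接发
print(next_hop("192.168.0.1", "192.168.0.2", "255.255.255.0", "192.168.0.254"))  # 192.168.0.2
# A 发给 C(192.168.1.1):不同子网,先发给默认网关
print(next_hop("192.168.0.1", "192.168.1.1", "255.255.255.0", "192.168.0.254"))  # 192.168.0.254
```

跑一下就能看到,结果和文中 A、B、C 那几个例子的结论是一致的。
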
263 | 264 | **路由器如何知道C在哪里?** 265 | 266 | **答案:[路由表](https://www.zhihu.com/search?q=路由表&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={"sourceType"%3A"answer"%2C"sourceId"%3A2303205686})** 267 | 268 | 现在 A 要给 C 发数据包,已经可以成功发到路由器这里了,最后一个问题就是,**路由器怎么知道,收到的这个数据包,该从自己的哪个端口出去**,才能直接(或间接)地最终到达目的地 C 呢。 269 | 270 | 路由器收到的数据包有目的 IP 也就是 C 的 IP 地址,需要转化成从自己的哪个端口出去,很容易想到,应该有个表,就像 MAC 地址表一样。 271 | 272 | 这个表就叫**路由表**。 273 | 274 | 至于这个路由表是怎么出来的,有很多路由算法,本文不展开,因为我也不会哈哈~ 275 | 276 | 不同于 MAC 地址表的是,路由表并不是一对一这种明确关系,我们下面看一个路由表的结构。 277 | 278 | | 目的地址 | 子网掩码 | 下一跳 | 端口 | 279 | | ------------- | --------------- | ------ | ---- | 280 | | 192.168.0.0 | 255.255.255.0 | | 0 | 281 | | 192.168.0.254 | 255.255.255.255 | | 0 | 282 | | 192.168.1.0 | 255.255.255.0 | | 1 | 283 | | 192.168.1.254 | 255.255.255.255 | | 1 | 284 | 285 | 我们学习一种新的表示方法,由于子网掩码其实就表示前多少位表示子网的网段,所以如 192.168.0.0(255.255.255.0) 也可以简写为 192.168.0.0/24 286 | 287 | | 目的地址 | 下一跳 | 端口 | 288 | | ---------------- | ------ | ---- | 289 | | 192.168.0.0/24 | | 0 | 290 | | 192.168.0.254/32 | | 0 | 291 | | 192.168.1.0/24 | | 1 | 292 | | 192.168.1.254/32 | | 1 | 293 | 294 | 这就很好理解了,路由表就表示,**[http://192.168.0.xxx](https://link.zhihu.com/?target=http%3A//192.168.0.xxx)** **这个子网下的,都转发到 0 号端口,[http://192.168.1.xxx](https://link.zhihu.com/?target=http%3A//192.168.1.xxx)** **这个子网下的,都转发到 1 号端口**。下一跳列还没有值,我们先不管 295 | 296 | 配合着结构图来看(这里把子网掩码和默认网关都补齐了) 297 | 298 | ![img](http://xximg.30daydo.com/typora/v2-746317999373975988a8cae191a1f95d_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-746317999373975988a8cae191a1f95d_720w.jpg) 299 | 300 | **刚才说的都是 IP 层,但发送数据包的数据链路层需要知道 MAC 地址,可是我只知道 IP 地址该怎么办呢?** 301 | 302 | **答案:arp** 303 | 304 | 假如你(A)此时**不知道**你同伴 B 的 MAC 地址(现实中就是不知道的,刚刚我们只是假设已知),你只知道它的 IP 地址,你该怎么把数据包准确传给 B 呢? 305 | 306 | 答案很简单,在[网络层](https://www.zhihu.com/search?q=网络层&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={"sourceType"%3A"answer"%2C"sourceId"%3A2303205686}),**我需要把 IP 地址对应的 MAC 地址找到**,也就是通过某种方式,找到 **192.168.0.2** 对应的 MAC 地址 **BBBB**。 307 | 308 | 这种方式就是 **arp 协议**,同时电脑 A 和 B 里面也会有一张 **arp 缓存表**,表中记录着 **IP 与 MAC 地址**的对应关系。 309 | 310 | | IP 地址 | MAC 地址 | 311 | | ----------- | -------- | 312 | | 192.168.0.2 | BBBB | 313 | 314 | 一开始的时候这个表是**空的**,电脑 A 为了知道电脑 B(192.168.0.2)的 MAC 地址,将会**广播**一条 arp 请求,B 收到请求后,带上自己的 MAC 地址给 A 一个**响应**。此时 A 便更新了自己的 arp 表。 315 | 316 | 这样通过大家不断广播 arp 请求,最终所有电脑里面都将 arp 缓存表更新完整。 317 | 318 | **总结一下** 319 | 320 | 好了,总结一下,到目前为止就几条规则 321 | 322 | **从各个节点的视角来看** 323 | 324 | **电脑视角:** 325 | 326 | - 首先我要知道我的 IP 以及对方的 IP 327 | - 通过子网掩码判断我们是否在同一个子网 328 | - 在同一个子网就通过 arp 获取对方 mac 地址直接扔出去 329 | - 不在同一个子网就通过 arp 获取默认网关的 mac 地址直接扔出去 330 | 331 | **交换机视角:** 332 | 333 | - 我收到的数据包必须有目标 MAC 地址 334 | - 通过 MAC 地址表查映射关系 335 | - 查到了就按照映射关系从我的指定端口发出去 336 | - 查不到就所有端口都发出去 337 | 338 | **路由器视角:** 339 | 340 | - 我收到的数据包必须有目标 IP 地址 341 | - 通过路由表查映射关系 342 | - 查到了就按照映射关系从我的指定端口发出去(不在任何一个子网范围,走其路由器的默认网关也是查到了) 343 | - 查不到则返回一个路由不可达的数据包 344 | 345 | 如果你嗅觉足够敏锐,你应该可以感受到下面这句话: 346 | 347 | > 网络层(IP协议)本身没有传输包的功能,包的实际传输是委托给数据链路层(以太网中的交换机)来实现的。 348 | 349 | **涉及到的三张表分别是** 350 | 351 | - 交换机中有 **MAC 地址**表用于映射 MAC 地址和它的端口 352 | - 路由器中有**路由表**用于映射 IP 地址(段)和它的端口 353 | - 电脑和路由器中都有 **arp 缓存表**用于缓存 IP 和 MAC 地址的映射关系 354 | 355 | **这三张表是怎么来的** 356 | 357 | - MAC 地址表是通过以太网内各节点之间不断通过交换机通信,不断完善起来的。 358 | - 路由表是各种路由算法 + 人工配置逐步完善起来的。 359 | - arp 缓存表是不断通过 arp 协议的请求逐步完善起来的。 360 | 361 | 知道了以上这些,目前网络上两个节点是如何发送数据包的这个过程,就完全可以解释通了 362 | 363 | 
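在看最后的拓扑图之前,不妨用几行 Python 把上面"路由器视角"的查表逻辑也演示一下(`routes`、`route` 同样是本文虚构的示意名字;路由表就用前面那张只有 0、1 两个端口的简化表):

```python
import ipaddress

# 简化路由表:网段 -> 转发端口(这里先忽略"下一跳"那一列)
routes = {
    "192.168.0.0/24": 0,
    "192.168.1.0/24": 1,
}

def route(dst_ip: str):
    """按最长前缀匹配查表:查到返回端口,查不到返回 None,对应"路由不可达"。"""
    dst = ipaddress.ip_address(dst_ip)
    matched = [ipaddress.ip_network(net) for net in routes if dst in ipaddress.ip_network(net)]
    if not matched:
        return None
    best = max(matched, key=lambda n: n.prefixlen)  # 前缀越长越精确
    return routes[str(best)]

print(route("192.168.0.2"))  # 0:从 0 号口发出去
print(route("192.168.1.1"))  # 1:从 1 号口发出去
print(route("10.0.0.1"))     # None:路由不可达
```

真实路由器的查表当然复杂得多,这里只是把"查到就按映射转发、查不到就不可达"这条规则写出来而已。
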
![img](http://xximg.30daydo.com/typora/v2-a1c1f6a087a7872c22dd7458b18f6b51_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-a1c1f6a087a7872c22dd7458b18f6b51_720w.jpg) 364 | 365 | 那接下来我们就放上本章 **最后一个** [网络拓扑图](https://www.zhihu.com/search?q=网络拓扑图&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={"sourceType"%3A"answer"%2C"sourceId"%3A2303205686})吧,请做好 **战斗** 准备! 366 | 367 | 368 | 369 | ![img](http://xximg.30daydo.com/typora/v2-0c45d67e3a66459fccaa56aa46569f69_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-0c45d67e3a66459fccaa56aa46569f69_720w.jpg) 370 | 371 | 这时路由器 1 连接了路由器 2,所以其路由表有了下一条地址这一个概念,所以它的路由表就变成了这个样子。如果匹配到了有下一跳地址的一项,则需要再次匹配,找到其端口,并找到下一跳 IP 的 MAC 地址。 372 | 373 | 也就是说找来找去,最终必须能映射到一个端口号,然后从这个端口号把数据包发出去。 374 | 375 | 376 | 377 | | 目的地址 | | 下一跳 | 378 | | ---------------- | ------------- | ------ | 379 | | 192.168.0.0/24 | | 0 | 380 | | 192.168.0.254/32 | | 0 | 381 | | 192.168.1.0/24 | | 1 | 382 | | 192.168.1.254/32 | | 1 | 383 | | 192.168.2.0/24 | 192.168.100.5 | | 384 | | 192.168.100.0/24 | | 2 | 385 | | 192.168.100.4/32 | | 2 | 386 | 387 | ### **这时如果 A 给 F 发送一个数据包,能不能通呢?如果通的话整个过程是怎样的呢?** 388 | 389 | 390 | 391 | ![img](http://xximg.30daydo.com/typora/v2-a9d11ba2d66d90fed1ce2a7e8902c909_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-a9d11ba2d66d90fed1ce2a7e8902c909_720w.jpg) 392 | 393 | 394 | 395 | 思考一分钟... 396 | 397 | **详细过程动画描述:** 398 | 399 | 400 | 401 | ![img](http://xximg.30daydo.com/typora/v2-d43f6b095306e1a20ead68c63c1b70a0_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-d43f6b095306e1a20ead68c63c1b70a0_720w.jpg) 402 | 403 | 404 | 405 | **详细过程文字描述:** 406 | 407 | **1.** 首先 A(192.168.0.1)通过子网掩码(255.255.255.0)计算出自己与 F(192.168.2.2)并不在同一个子网内,于是决定发送给默认网关(192.168.0.254) 408 | 409 | **2.** A 通过 ARP 找到 默认网关 [192.168.0.254](https://www.zhihu.com/search?q=192.168.0.254&search_source=Entity&hybrid_search_source=Entity&hybrid_search_extra={"sourceType"%3A"answer"%2C"sourceId"%3A2303205686}) 的 MAC 地址。 410 | 411 | **3.** A 将源 MAC 地址(AAAA)与网关 MAC 地址(ABAB)封装在数据链路层头部,又将源 IP 地址(192.168.0.1)和目的 IP 地址(192.168.2.2)(注意这里千万不要以为填写的是默认网关的 IP 地址,从始至终这个数据包的两个 IP 地址都是不变的,只有 MAC 地址在不断变化)封装在网络层头部,然后发包 412 | 413 | 414 | 415 | ![img](http://xximg.30daydo.com/typora/v2-ad5dd1a501789f19a405929b760f4df8_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-ad5dd1a501789f19a405929b760f4df8_720w.jpg) 416 | 417 | 418 | 419 | **4.** 交换机 1 收到数据包后,发现目标 MAC 地址是 ABAB,转发给路由器1 420 | 421 | ***5.*** *数据包来到了路由器 1,发现其目标 IP 地址是 192.168.2.2,查看其路由表,发现了下一跳的地址是 192.168.100.5* 422 | 423 | **6.** 所以此时路由器 1 需要做两件事,第一件是再次匹配路由表,发现匹配到了端口为 2,于是将其封装到数据链路层,最后把包从 2 号口发出去。 424 | 425 | **7.** 此时路由器 2 收到了数据包,看到其目的地址是 192.168.2.2,查询其路由表,匹配到端口号为 1,准备从 1 号口把数据包送出去。 426 | 427 | **8.** 但此时路由器 2 需要知道 192.168.2.2 的 MAC 地址了,于是查看其 arp 缓存,找到其 MAC 地址为 FFFF,将其封装在数据链路层头部,并从 1 号端口把包发出去。 428 | 429 | **9.** 交换机 3 收到了数据包,发现目的 MAC 地址为 FFFF,查询其 MAC 地址表,发现应该从其 6 号端口出去,于是从 6 号端口把数据包发出去。 430 | 431 | **10.** **F 最终收到了数据包!**并且发现目的 MAC 地址就是自己,于是收下了这个包 432 | 433 | **更详细且精准的过程:** 434 | 435 | 读到这相信大家已经很累了,理解上述过程基本上网络层以下的部分主流程就基本疏通了,如果你想要本过程更为专业的过程描述,可以在公~号 **低并发编程** 后台回复 **网络**,获得我模拟这个过程的 Cisco Packet Tracer 源文件。 436 | 437 | ![img](http://xximg.30daydo.com/typora/v2-d38fca315ee1ae83a7c27a45d43f9a8c_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-d38fca315ee1ae83a7c27a45d43f9a8c_720w.jpg) 438 | 439 | 440 | 441 | 每一步包的传输都会有各层的原始数据,以及专业的过程描述 442 | 443 | 444 | 445 | ![img](http://xximg.30daydo.com/typora/v2-d2224785f6747e348f47ab27daaa9f8e_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-d2224785f6747e348f47ab27daaa9f8e_720w.jpg) 446 | 447 | 448 
| 449 | 同时在此基础之上你也可以设计自己的网络拓扑结构,进行各种实验,来加深网络传输过程的理解。 450 | 451 | ## 后记 452 | 453 | 至此,经过物理层、数据链路层、网络层这前三层的协议,以及根据这些协议设计的各种网络设备(网线、集线器、交换机、路由器),理论上只要拥有对方的 IP 地址,就已经将地球上任意位置的两个节点连通了。 454 | 455 | 456 | 457 | ![img](http://xximg.30daydo.com/typora/v2-be4fa17d84090b8a8e388f40a8f9f2e8_720w.jpg)![img](http://xximg.30daydo.com/typora/v2-be4fa17d84090b8a8e388f40a8f9f2e8_720w.jpg) 458 | 459 | 460 | 461 | 本文经过了很多次的修改,删减了不少影响主流程的内容,就是为了让读者能抓住网络传输前三层的真正核心思想。同时网络相关的知识也是多且杂,我也还有很多搞不清楚的地方,非常欢迎大家与我交流,共同进步。 -------------------------------------------------------------------------------- /随机森林超参数调参.md: -------------------------------------------------------------------------------- 1 | # Hyperparameter Tuning the Random Forest in Python 2 | 3 | [![Will Koehrsen](https://miro.medium.com/fit/c/56/56/1*SckxdIFfjlR-cWXkL5ya-g.jpeg)](https://williamkoehrsen.medium.com/?source=post_page-----28d2aa77dd74--------------------------------) 4 | 5 | [Will Koehrsen](https://williamkoehrsen.medium.com/?source=post_page-----28d2aa77dd74--------------------------------) 6 | 7 | [Jan 10, 2018·12 min read](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74?source=post_page-----28d2aa77dd74--------------------------------) 8 | 9 | 10 | 11 | 12 | 13 | **Improving the Random Forest Part Two** 14 | 15 | So we’ve built a random forest model to solve our machine learning problem (perhaps by following this [end-to-end guide](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0)) but we’re not too impressed by the results. What are our options? As we saw in the [first part of this series](https://towardsdatascience.com/improving-random-forest-in-python-part-1-893916666cd), our first step should be to gather more data and perform feature engineering. Gathering more data and feature engineering usually has the greatest payoff in terms of time invested versus improved performance, but when we have exhausted all data sources, it’s time to move on to model hyperparameter tuning. This post will focus on optimizing the random forest model in Python using Scikit-Learn tools. Although this article builds on part one, it fully stands on its own, and we will cover many widely-applicable machine learning concepts. 16 | 17 | ![img](https://miro.medium.com/max/2000/1*G1_rf6QQBIs_vO_d98WfAQ.png) 18 | 19 | One Tree in a Random Forest 20 | 21 | I have included Python code in this article where it is most instructive. Full code and data to follow along can be found on the project [Github page](https://github.com/WillKoehrsen/Machine-Learning-Projects/tree/master/random_forest_explained). 22 | 23 | # A Brief Explanation of Hyperparameter Tuning 24 | 25 | The best way to think about hyperparameters is like the settings of an algorithm that can be adjusted to optimize performance, just as we might turn the knobs of [an AM radio to get a clear signal](https://electronics.howstuffworks.com/radio8.htm) (or your parents might have!). While model *parameters* are learned during training — such as the slope and intercept in a linear regression — *hyperparameters* must be set by the data scientist before training. In the case of a random forest, hyperparameters include the number of decision trees in the forest and the number of features considered by each tree when splitting a node. (The parameters of a random forest are the variables and thresholds used to split each node learned during training). 
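As a quick aside (this sketch is mine, not from the original post), the parameter/hyperparameter split is easy to see in scikit-learn with the linear-regression example mentioned above: the constructor argument is a hyperparameter chosen before training, while the slope and intercept are parameters learned from the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 2x + 1

# fit_intercept is a hyperparameter: we choose it before training
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# coef_ and intercept_ are parameters: they are learned during training
print(model.coef_, model.intercept_)  # approximately [2.0] and 1.0
```
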
Scikit-Learn implements a set of [sensible default hyperparameters ](https://arxiv.org/abs/1309.0238)for all models, but these are not guaranteed to be optimal for a problem. The best hyperparameters are usually impossible to determine ahead of time, and tuning a model is where machine learning turns from a science into trial-and-error based engineering. 26 | 27 | ![img](https://miro.medium.com/max/60/1*0215Gzmw56XvORtB7-Torw.png?q=20) 28 | 29 | ![img](https://miro.medium.com/max/425/1*0215Gzmw56XvORtB7-Torw.png) 30 | 31 | Hyperparameters and Parameters 32 | 33 | Hyperparameter tuning relies more on experimental results than theory, and thus the best method to determine the optimal settings is to try many different combinations evaluate the performance of each model. However, evaluating each model only on the training set can lead to one of the most fundamental problems in machine learning: [overfitting](https://elitedatascience.com/overfitting-in-machine-learning). 34 | 35 | If we optimize the model for the training data, then our model will score very well on the training set, but will not be able to generalize to new data, such as in a test set. When a model performs highly on the training set but poorly on the test set, this is known as overfitting, or essentially creating a model that knows the training set very well but cannot be applied to new problems. It’s like a student who has memorized the simple problems in the textbook but has no idea how to apply concepts in the messy real world. 36 | 37 | An overfit model may look impressive on the training set, but will be useless in a real application. Therefore, the standard procedure for hyperparameter optimization accounts for overfitting through [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html). 38 | 39 | # Cross Validation 40 | 41 | The technique of cross validation (CV) is best explained by example using the most common method, [K-Fold CV.](http://statweb.stanford.edu/~tibs/sta306bfiles/cvwrong.pdf) When we approach a machine learning problem, we make sure to split our data into a training and a testing set. In K-Fold CV, we further split our training set into K number of subsets, called folds. We then iteratively fit the model K times, each time training the data on K-1 of the folds and evaluating on the Kth fold (called the validation data). As an example, consider fitting a model with K = 5. The first iteration we train on the first four folds and evaluate on the fifth. The second time we train on the first, second, third, and fifth fold and evaluate on the fourth. We repeat this procedure 3 more times, each time evaluating on a different fold. At the very end of training, we average the performance on each of the folds to come up with final validation metrics for the model. 42 | 43 | ![img](https://miro.medium.com/max/60/0*KH3dnbGNcmyV_ODL.png?q=20) 44 | 45 | ![img](https://miro.medium.com/max/700/0*KH3dnbGNcmyV_ODL.png) 46 | 47 | 5 Fold Cross Validation ([Source](https://stackoverflow.com/questions/31947183/how-to-implement-walk-forward-testing-in-sklearn)) 48 | 49 | For hyperparameter tuning, we perform many iterations of the entire K-Fold CV process, each time using different model settings. We then compare all of the models, select the best one, train it on the full training set, and then evaluate on the testing set. This sounds like an awfully tedious process! 
Each time we want to assess a different set of hyperparameters, we have to split our training data into K fold and train and evaluate K times. If we have 10 sets of hyperparameters and are using 5-Fold CV, that represents 50 training loops. Fortunately, as with most problems in machine learning, someone has solved our problem and model tuning with K-Fold CV can be automatically implemented in Scikit-Learn. 50 | 51 | # Random Search Cross Validation in Scikit-Learn 52 | 53 | Usually, we only have a vague idea of the best hyperparameters and thus the best approach to narrow our search is to evaluate a wide range of values for each hyperparameter. Using Scikit-Learn’s RandomizedSearchCV method, we can define a grid of hyperparameter ranges, and randomly sample from the grid, performing K-Fold CV with each combination of values. 54 | 55 | As a brief recap before we get into model tuning, we are dealing with a supervised regression machine learning problem. We are trying to predict the temperature tomorrow in our city (Seattle, WA) using past historical weather data. We have 4.5 years of training data, 1.5 years of test data, and are using 6 different features (variables) to make our predictions. (To see the full code for data preparation, see the [notebook](https://github.com/WillKoehrsen/Machine-Learning-Projects/blob/master/random_forest_explained/Improving Random Forest Part 2.ipynb)). 56 | 57 | Let’s examine the features quickly. 58 | 59 | ![img](https://miro.medium.com/max/60/1*Gr3BUzeZjEeS8q3G6b1pkg.png?q=20) 60 | 61 | ![img](https://miro.medium.com/max/469/1*Gr3BUzeZjEeS8q3G6b1pkg.png) 62 | 63 | Features for Temperature Prediction 64 | 65 | - temp_1 = max temperature (in F) one day prior 66 | - average = historical average max temperature 67 | - ws_1 = average wind speed one day prior 68 | - temp_2 = max temperature two days prior 69 | - friend = prediction from our “trusty” friend 70 | - year = calendar year 71 | 72 | In previous posts, we checked the data to check for anomalies and we know our data is clean. Therefore, we can skip the data cleaning and jump straight into hyperparameter tuning. 73 | 74 | To look at the available hyperparameters, we can create a random forest and examine the default values. 75 | 76 | ``` 77 | from sklearn.ensemble import RandomForestRegressor 78 | rf = RandomForestRegressor(random_state = 42) 79 | from pprint import pprint 80 | # Look at parameters used by our current forest 81 | print('Parameters currently in use:\n') 82 | pprint(rf.get_params()) 83 | 84 | Parameters currently in use: 85 | 86 | {'bootstrap': True, 87 | 'criterion': 'mse', 88 | 'max_depth': None, 89 | 'max_features': 'auto', 90 | 'max_leaf_nodes': None, 91 | 'min_impurity_decrease': 0.0, 92 | 'min_impurity_split': None, 93 | 'min_samples_leaf': 1, 94 | 'min_samples_split': 2, 95 | 'min_weight_fraction_leaf': 0.0, 96 | 'n_estimators': 10, 97 | 'n_jobs': 1, 98 | 'oob_score': False, 99 | 'random_state': 42, 100 | 'verbose': 0, 101 | 'warm_start': False} 102 | ``` 103 | 104 | Wow, that is quite an overwhelming list! How do we know where to start? A good place is the [documentation on the random forest in Scikit-Learn](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). This tells us the most important settings are the number of trees in the forest (n_estimators) and the number of features considered for splitting at each leaf node (max_features). 
We could go read the [research papers on the random forest ](https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf)and try to theorize the best hyperparameters, but a more efficient use of our time is just to try out a wide range of values and see what works! We will try adjusting the following set of hyperparameters: 105 | 106 | - n_estimators = number of trees in the foreset 107 | - max_features = max number of features considered for splitting a node 108 | - max_depth = max number of levels in each decision tree 109 | - min_samples_split = min number of data points placed in a node before the node is split 110 | - min_samples_leaf = min number of data points allowed in a leaf node 111 | - bootstrap = method for sampling data points (with or without replacement) 112 | 113 | 114 | 115 | ## Random Hyperparameter Grid 116 | 117 | To use RandomizedSearchCV, we first need to create a parameter grid to sample from during fitting: 118 | 119 | ``` 120 | from sklearn.model_selection import RandomizedSearchCV# Number of trees in random forest 121 | n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)] 122 | # Number of features to consider at every split 123 | max_features = ['auto', 'sqrt'] 124 | # Maximum number of levels in tree 125 | max_depth = [int(x) for x in np.linspace(10, 110, num = 11)] 126 | max_depth.append(None) 127 | # Minimum number of samples required to split a node 128 | min_samples_split = [2, 5, 10] 129 | # Minimum number of samples required at each leaf node 130 | min_samples_leaf = [1, 2, 4] 131 | # Method of selecting samples for training each tree 132 | bootstrap = [True, False]# Create the random grid 133 | random_grid = {'n_estimators': n_estimators, 134 | 'max_features': max_features, 135 | 'max_depth': max_depth, 136 | 'min_samples_split': min_samples_split, 137 | 'min_samples_leaf': min_samples_leaf, 138 | 'bootstrap': bootstrap}pprint(random_grid){'bootstrap': [True, False], 139 | 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None], 140 | 'max_features': ['auto', 'sqrt'], 141 | 'min_samples_leaf': [1, 2, 4], 142 | 'min_samples_split': [2, 5, 10], 143 | 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]} 144 | ``` 145 | 146 | On each iteration, the algorithm will choose a difference combination of the features. Altogether, there are 2 * 12 * 2 * 3 * 3 * 10 = 4320 settings! However, the benefit of a random search is that we are not trying every combination, but selecting at random to sample a wide range of values. 147 | 148 | ## Random Search Training 149 | 150 | Now, we instantiate the random search and fit it like any Scikit-Learn model: 151 | 152 | ``` 153 | # Use the random grid to search for best hyperparameters 154 | # First create the base model to tune 155 | rf = RandomForestRegressor() 156 | # Random search of parameters, using 3 fold cross validation, 157 | # search across 100 different combinations, and use all available cores 158 | rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)# Fit the random search model 159 | rf_random.fit(train_features, train_labels) 160 | ``` 161 | 162 | The most important arguments in RandomizedSearchCV are n_iter, which controls the number of different combinations to try, and cv which is the number of folds to use for cross validation (we use 100 and 3 respectively). 
More iterations will cover a wider search space and more cv folds reduces the chances of overfitting, but raising each will increase the run time. Machine learning is a field of trade-offs, and performance vs time is one of the most fundamental. 163 | 164 | We can view the best parameters from fitting the random search: 165 | 166 | ``` 167 | rf_random.best_params_{'bootstrap': True, 168 | 'max_depth': 70, 169 | 'max_features': 'auto', 170 | 'min_samples_leaf': 4, 171 | 'min_samples_split': 10, 172 | 'n_estimators': 400} 173 | ``` 174 | 175 | From these results, we should be able to narrow the range of values for each hyperparameter. 176 | 177 | ## Evaluate Random Search 178 | 179 | To determine if random search yielded a better model, we compare the base model with the best random search model. 180 | 181 | ``` 182 | def evaluate(model, test_features, test_labels): 183 | predictions = model.predict(test_features) 184 | errors = abs(predictions - test_labels) 185 | mape = 100 * np.mean(errors / test_labels) 186 | accuracy = 100 - mape 187 | print('Model Performance') 188 | print('Average Error: {:0.4f} degrees.'.format(np.mean(errors))) 189 | print('Accuracy = {:0.2f}%.'.format(accuracy)) 190 | 191 | return accuracy 192 | 193 | base_model = RandomForestRegressor(n_estimators = 10, random_state = 42) 194 | base_model.fit(train_features, train_labels) 195 | base_accuracy = evaluate(base_model, test_features, test_labels) 196 | 197 | Model Performance 198 | Average Error: 3.9199 degrees. 199 | Accuracy = 93.36%. 200 | 201 | best_random = rf_random.best_estimator_ 202 | random_accuracy = evaluate(best_random, test_features, test_labels) 203 | 204 | Model Performance 205 | 206 | Average Error: 3.7152 degrees. 207 | Accuracy = 93.73%.print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))Improvement of 0.40%. 208 | ``` 209 | 210 | We achieved an unspectacular improvement in accuracy of 0.4%. Depending on the application though, this could be a significant benefit. We can further improve our results by using grid search to focus on the most promising hyperparameters ranges found in the random search. 211 | 212 | # Grid Search with Cross Validation 213 | 214 | Random search allowed us to narrow down the range for each hyperparameter. Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try. We do this with GridSearchCV, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define. To use Grid Search, we make another grid based on the best values provided by random search: 215 | 216 | ``` 217 | from sklearn.model_selection import GridSearchCV 218 | # Create the parameter grid based on the results of random search 219 | 220 | param_grid = { 221 | 'bootstrap': [True], 222 | 'max_depth': [80, 90, 100, 110], 223 | 'max_features': [2, 3], 224 | 'min_samples_leaf': [3, 4, 5], 225 | 'min_samples_split': [8, 10, 12], 226 | 'n_estimators': [100, 200, 300, 1000] 227 | } 228 | # Create a based model 229 | rf = RandomForestRegressor()# Instantiate the grid search model 230 | grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 231 | cv = 3, n_jobs = -1, verbose = 2) 232 | ``` 233 | 234 | This will try out 1 * 4 * 2 * 3 * 3 * 4 = 288 combinations of settings. 
We can fit the model, display the best hyperparameters, and evaluate performance: 235 | 236 | ``` 237 | # Fit the grid search to the data 238 | grid_search.fit(train_features, train_labels) 239 | 240 | grid_search.best_params_{'bootstrap': True, 241 | 'max_depth': 80, 242 | 'max_features': 3, 243 | 'min_samples_leaf': 5, 244 | 'min_samples_split': 12, 245 | 'n_estimators': 100} 246 | 247 | best_grid = grid_search.best_estimator_ 248 | grid_accuracy = evaluate(best_grid, test_features, test_labels) 249 | 250 | Model Performance 251 | Average Error: 3.6561 degrees. 252 | Accuracy = 93.83%. 253 | print('Improvement of {:0.2f}%.'.format( 100 * (grid_accuracy - base_accuracy) / base_accuracy)) 254 | Improvement of 0.50%. 255 | ``` 256 | 257 | It seems we have about maxed out performance, but we can give it one more try with a grid further refined from our previous results. The code is the same as before just with a different grid so I only present the results: 258 | 259 | ``` 260 | Model Performance 261 | Average Error: 3.6602 degrees. 262 | Accuracy = 93.82%. 263 | Improvement of 0.49%. 264 | ``` 265 | 266 | A small decrease in performance indicates we have reached diminishing returns for hyperparameter tuning. We could continue, but the returns would be minimal at best. 267 | 268 | # Comparisons 269 | 270 | We can make some quick comparisons between the different approaches used to improve performance showing the returns on each. The following table shows the final results from all the improvements we made (including those from the first part): 271 | 272 | ![img](https://miro.medium.com/max/60/1*vBCSIIIxyTLKzcJMiV5lKg.png?q=20) 273 | 274 | ![img](https://miro.medium.com/max/680/1*vBCSIIIxyTLKzcJMiV5lKg.png) 275 | 276 | Comparison of All Models 277 | 278 | Model is the (very unimaginative) names for the models, accuracy is the percentage accuracy, error is the average absolute error in degrees, n_features is the number of features in the dataset, n_trees is the number of decision trees in the forest, and time is the training and predicting time in seconds. 279 | 280 | The models are as follows: 281 | 282 | - average: original baseline computed by predicting historical average max temperature for each day in test set 283 | - one_year: model trained using a single year of data 284 | - four_years_all: model trained using 4.5 years of data and expanded features (see Part One for details) 285 | - four_years_red: model trained using 4.5 years of data and subset of most important features 286 | - best_random: best model from random search with cross validation 287 | - first_grid: best model from first grid search with cross validation (selected as the final model) 288 | - second_grid: best model from second grid search 289 | 290 | **Overall, gathering more data and feature selection reduced the error by 17.69% while hyperparameter further reduced the error by 6.73%.** 291 | 292 | ![img](https://miro.medium.com/max/60/1*6gpuSFyshQ-shuvnOjYeGg.png?q=20) 293 | 294 | ![img](https://miro.medium.com/max/686/1*6gpuSFyshQ-shuvnOjYeGg.png) 295 | 296 | Model Comparison (see Notebook for code) 297 | 298 | In terms of programmer-hours, gathering data took about 6 hours while hyperparameter tuning took about 3 hours. As with any pursuit in life, there is a point at which pursuing further optimization is not worth the effort and knowing when to stop can be just as important as being able to keep going (sorry for getting all philosophical). 
Moreover, in any data problem, there is what is called the [Bayes error rate](https://en.wikipedia.org/wiki/Bayes_error_rate), which is the absolute minimum possible error in a problem. Bayes error, also called reproducible error, is a combination of latent variables, the factors affecting a problem which we cannot measure, and inherent noise in any physical process. Creating a perfect model is therefore not possible. Nonetheless, in this example, we were able to significantly improve our model with hyperparameter tuning and we covered numerous machine learning topics which are broadly applicable. 299 | 300 | 301 | 302 | # **Training Visualizations** 303 | 304 | To further analyze the process of hyperparameter optimization, we can change one setting at a time and see the effect on the model performance (essentially conducting a controlled experiment). For example, we can create a grid with a range of number of trees, perform grid search CV, and then plot the results. Plotting the training and testing error and the training time will allow us to inspect how changing one hyperparameter impacts the model. 305 | 306 | First we can look at the effect of changing the number of trees in the forest. (see notebook for training and plotting code) 307 | 308 | ![img](https://miro.medium.com/max/60/1*mDQtEHIojnUuqiNJc0Na7w.png?q=20) 309 | 310 | ![img](https://miro.medium.com/max/633/1*mDQtEHIojnUuqiNJc0Na7w.png) 311 | 312 | Number of Trees Training Curves 313 | 314 | As the number of trees increases, our error decreases up to a point. There is not much benefit in accuracy to increasing the number of trees beyond 20 (our final model had 100) and the training time rises consistently. 315 | 316 | We can also examine curves for the number of features to split a node: 317 | 318 | ![img](https://miro.medium.com/max/60/1*ZDd5Bs7ed5bZEL93xDwflA.png?q=20) 319 | 320 | ![img](https://miro.medium.com/max/645/1*ZDd5Bs7ed5bZEL93xDwflA.png) 321 | 322 | Number of Features Training Curves 323 | 324 | As we increase the number of features retained, the model accuracy increases as expected. The training time also increases although not significantly. 325 | 326 | Together with the quantitative stats, these visuals can give us a good idea of the trade-offs we make with different combinations of hyperparameters. Although there is usually no way to know ahead of time what settings will work the best, this example has demonstrated the simple tools in Python that allow us to optimize our machine learning model. 327 | 328 | As always, I welcome feedback and constructive criticism. I can be reached at wjk68@case.edu --------------------------------------------------------------------------------