├── .gitignore
├── Decision Tree
│   ├── DecisionTree.csv
│   ├── DecisionTree.ipynb
│   ├── README.md
│   └── RandomForestRegression.ipynb
├── LICENSE
├── Linear Regression
│   ├── README.md
│   └── demo
│       ├── README.md
│       ├── housing_price.py
│       ├── kc_test.csv
│       └── kc_train.csv
├── Logistic Regression
│   ├── README.md
│   └── demo
│       ├── CreditScoring.ipynb
│       ├── KaggleCredit2.csv.zip
│       └── README.md
├── Model Ensemble
│   ├── README.md
│   └── kaggle_credict.ipynb
├── README.md
├── Regularization
│   ├── LassoCV.ipynb
│   ├── README.md
│   └── RidgeCV.ipynb
├── SVM
│   ├── README.md
│   └── cnews_demo
│       ├── README.md
│       └── svm_classification.ipynb
└── Word2Vec
    ├── README.md
    └── word2vec.ipynb

/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 | 
6 | # C extensions
7 | *.so
8 | 
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 | 
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 | 
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 | 
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 | 
50 | # Translations
51 | *.mo
52 | *.pot
53 | 
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 | 
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 | 
63 | # Scrapy stuff:
64 | .scrapy
65 | 
66 | # Sphinx documentation
67 | docs/_build/
68 | 
69 | # PyBuilder
70 | target/
71 | 
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 | 
75 | # pyenv
76 | .python-version
77 | 
78 | # celery beat schedule file
79 | celerybeat-schedule
80 | 
81 | # SageMath parsed files
82 | *.sage.py
83 | 
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 | 
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 | 
97 | # Rope project settings
98 | .ropeproject
99 | 
100 | # mkdocs documentation
101 | /site
102 | 
103 | # mypy
104 | .mypy_cache/
105 | 

--------------------------------------------------------------------------------
/Decision Tree/README.md:
--------------------------------------------------------------------------------
1 | ## Contents
2 | - [1. Decision Trees](#1-decision-trees)
3 | - [1.1 From LR to Decision Trees](#11-from-lr-to-decision-trees)
4 | - [1.2 How a "Tree" Grows](#12-how-a-tree-grows)
5 | - [1.3 How the "Tree" Branches](#13-how-the-tree-branches)
6 | - [1.3.1 The ID3 Algorithm](#131-the-id3-algorithm)
7 | - [1.3.2 C4.5](#132-c45)
8 | - [1.3.3 The CART Algorithm](#133-the-cart-algorithm)
9 | - [1.3.4 Three Kinds of Decision Trees](#134-three-kinds-of-decision-trees)
10 | - [1.4 Random Forest](#14-random-forest)
11 | - [Decision tree demo](https://github.com/mantchs/machine_learning_model/blob/master/Decision%20Tree/DecisionTree.ipynb)
12 | - [Random forest demo](https://github.com/mantchs/machine_learning_model/blob/master/Decision%20Tree/RandomForestRegression.ipynb)
13 | 
14 | ## 1. Decision Trees
15 | 
16 | ### 1.1 From LR to Decision Trees
17 | 
18 | Most of us have used logistic regression (LR) for classification, so let us first summarize the strengths and weaknesses of the LR model:
19 | 
20 | #### Strengths
21 | 
22 | - Well suited to scenarios that need a **class probability**.
23 | - Efficient to implement and train.
24 | - Handles linear features well.
25 | 
26 | #### Weaknesses
27 | 
28 | - Performance suffers when the feature space is very large.
29 | - Does not cope well with a large number of multi-valued categorical features.
30 | - Non-linear features have to be transformed before they can be used.
31 | 
32 | Those are the pros and cons of LR. Decision trees were introduced precisely to cover LR's weak spots, which is why they are worth learning: no single model is a silver bullet.
33 | 
34 | #### Strengths of decision trees
35 | 
36 | - They mimic the intuitive decision rules people actually use.
37 | - They can handle non-linear features.
38 | - They take interactions between features into account.
39 | 
40 | The picture below makes the fundamental difference between the LR model and the decision tree model easier to see. Consider a decision problem: whether to go on a blind date, where a girl's mother wants to introduce a suitor to her daughter.
41 | 
42 | ![](https://www.wailian.work/images/2018/12/11/image8efdc.png)
43 | 
44 | The contrast should now be clear: LR throws all the features into the model at once, whereas a decision tree works like the if-else statements of a programming language, making one conditional test after another. That is the fundamental difference.
45 | 
46 | ### 1.2 How a "Tree" Grows
47 | 
48 | A decision tree makes its decisions over a "tree" structure, which raises two questions:
49 | 
50 | - How does the "tree" grow?
51 | - When does the "tree" stop growing?
52 | 
53 | Answer these two questions and the model is essentially built. The overall flow of a decision tree is "divide and conquer": a recursive procedure from the root down to the leaves, where each internal node picks one "splitting" attribute, i.e. one feature. Let us tackle the two questions in turn.
54 | 
55 | #### When does the "tree" stop growing?
56 | 
57 | - All samples at the current node belong to the same class, so no split is needed. For example, every sample decides to go on the blind date; the outcome no longer depends on the features, so there is nothing left to split.
58 | - The attribute set is empty, or all samples take identical values on every attribute, so no split is possible. For example, all samples share exactly the same feature values; the training set is too uniform to split.
59 | - The sample set at the current node is empty, so it cannot be split.
60 | 
61 | ### 1.3 How the "Tree" Branches
62 | 
63 | In everyday life we constantly face decisions: where to eat, which gadget to buy, where to travel. You will notice that these choices usually follow what most people choose, that is, we go with the crowd. Decision trees work the same way: once most of the samples at a node belong to one class, the decision has effectively been made.
64 | 
65 | Abstracting "the choice of the crowd" leads to the notion of purity: the more dominant the majority choice, the higher the purity. Going one step further brings us to a key statement: **the lower the information entropy, the higher the purity**. You have probably heard of "entropy"; informally, information entropy measures how much "information" a set contains. If all samples share the same attribute values, the information feels uniform and undifferentiated; if the attribute values all differ, the amount of information is much larger.
66 | 
67 | This is where the entropy formula comes in, and it is actually quite simple:
68 | 
69 | ![](https://www.wailian.work/images/2018/12/11/image278e3.png)
70 | 
71 | Here Pk is the proportion of samples of class k in the current sample set D.
72 | 
73 | #### Information gain
74 | 
75 | Without further ado, the formula:
76 | 
77 | ![](http://www.wailian.work/images/2018/12/11/imagedb167.png)
78 | 
79 | If it looks opaque, remember the one-line summary: information gain = entropy before the split minus entropy after the split. It measures the size of the "step" taken toward higher purity.
80 | 
81 | #### 1.3.1 The ID3 Algorithm
82 | 
83 | In short: compute the entropy at the root node, then split on each candidate attribute in turn and compute the entropy of the resulting child nodes. The information gain of an attribute is the root entropy minus the attribute's post-split entropy. Sort the attributes by information gain in descending order; the first one becomes the first splitting attribute, and so on down the list. This determines the shape of the tree, that is, how it "grows".
84 | 
85 | If this is still unclear, walk through the examples shared below together with the explanation above:
86 | 
87 | 1.https://www.wailian.work/images/2018/12/11/image39e7b.png
88 | 
89 | 2.https://www.wailian.work/images/2018/12/11/image61cdc.png
90 | 
91 | 3.https://www.wailian.work/images/2018/12/11/image9e194.png
92 | 
93 | 4.https://www.wailian.work/images/2018/12/11/image09288.png
94 | 
95 | Information gain has one problem, however: it is biased toward attributes with many possible values. Imagine, for example, treating an "ID" column as an attribute. This is what motivates the next algorithm, C4.5.
96 | 
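To make the entropy and information-gain computation concrete, here is a minimal Python sketch (an editorial addition, not a file from this repo; the function names and toy data are purely illustrative). `entropy` implements Ent(D) = -sum_k p_k * log2(p_k) from the formula above, and `information_gain` subtracts the weighted post-split entropy from the pre-split entropy, which is exactly the quantity ID3 ranks attributes by.

```python
import numpy as np

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), computed from class proportions in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v), where v ranges over the values of `feature`."""
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature):
        mask = feature == v            # samples falling into branch v
        weighted += mask.mean() * entropy(labels[mask])
    return total - weighted

# Toy data: "weather" fully determines the label, an "id parity" attribute does not.
weather = np.array(["sunny", "sunny", "rain", "rain", "rain", "sunny"])
go_out  = np.array([1, 1, 0, 0, 0, 1])
print(information_gain(weather, go_out))           # 1.0, equal to entropy(go_out): maximal gain
print(information_gain(np.arange(6) % 2, go_out))  # ~0.08: a nearly useless split
```

Run on this toy data, the informative attribute gets the full entropy as its gain while the arbitrary one gets almost nothing, which is why ID3 would split on the former first.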
97 | #### 1.3.2 C4.5
98 | 
99 | To fix this problem with information gain, C4.5 introduces the information gain ratio:
100 | 
101 | ![](http://www.wailian.work/images/2018/12/11/image3ef88.png)
102 | 
103 | The more possible values attribute a has (the larger V is), the larger IV(a) usually is. **In essence, the gain ratio multiplies the information gain by a penalty factor: the factor is small when the attribute has many possible values and large when it has few.** The gain ratio has a drawback of its own:
104 | 
105 | - Drawback: the gain ratio is biased toward attributes with few possible values.
106 | 
107 | How the gain ratio is used in practice: because of this drawback, C4.5 does not simply pick the attribute with the largest gain ratio. It first selects, from the candidate attributes, those whose information gain is above average, and then, among these, picks the one with the highest gain ratio.
108 | 
109 | #### 1.3.3 The CART Algorithm
110 | 
111 | Mathematicians are clever indeed: they came up with yet another measure of purity, the Gini index (another pesky formula):
112 | 
113 | ![](https://www.wailian.work/images/2018/12/11/image02d2f.png)
114 | 
115 | It is the probability that a randomly chosen sample from the set gets misclassified. As an analogy, picture a bag holding balls of three colours: the Gini index is the probability that two balls drawn from the bag have different colours. **The smaller Gini(D) is, the purer the data set D.**
116 | 
117 | ##### An example
118 | 
119 | Suppose we have a feature "education level" with three possible values: "bachelor", "master" and "doctorate".
120 | 
121 | When the sample set D is split on "education level" there are three candidate split points, and therefore three possible partitions:
122 | 
123 | 1. Split point "bachelor", resulting subsets: {bachelor}, {master, doctorate}
124 | 
125 | 2. Split point "master", resulting subsets: {master}, {bachelor, doctorate}
126 | 
127 | 3. Split point "doctorate", resulting subsets: {doctorate}, {bachelor, master}
128 | 
129 | For each of these partitions we can compute the purity of splitting D into two subsets on **feature = some value**:
130 | 
131 | ![](https://www.wailian.work/images/2018/12/11/imagedd375.png)
132 | 
133 | So **for a feature with more than two possible values, we compute Gini(D, Ai) with every value Ai taken as the split point (where Ai is one of the possible values of feature A).**
134 | 
135 | Then, among all candidate partitions, we pick the one with the smallest Gini(D, Ai); its split point is the best split point for feature A on sample set D. With this rule the tree can grow into a full "big tree".
136 | 
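As an equally informal sketch of the CART criterion (again an editorial addition; the function names and toy labels are made up), `gini` implements Gini(D) = 1 - sum_k p_k^2, and `gini_index` computes the weighted Gini of the two-way partition {feature == value, feature != value}. The best split point for a feature is then simply the value that minimizes it, mirroring the "education level" example above.

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: probability that two samples drawn from D disagree in class."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(feature, labels, value):
    """Weighted Gini of the binary split D1 = {feature == value}, D2 = {feature != value}."""
    mask = feature == value
    return mask.mean() * gini(labels[mask]) + (~mask).mean() * gini(labels[~mask])

# Toy data for the "education level" example from the text (the labels are hypothetical).
education = np.array(["bachelor", "bachelor", "master", "master", "doctorate", "doctorate"])
label     = np.array([0, 0, 0, 1, 1, 1])

# Evaluate every candidate split point and keep the one with the smallest Gini index.
for v in np.unique(education):
    print(v, round(gini_index(education, label, v), 3))
best = min(np.unique(education), key=lambda v: gini_index(education, label, v))
print("best split point:", best)
```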
137 | #### 1.3.4 Three Kinds of Decision Trees
138 | 
139 | - **ID3**: attributes with many values make the data pure more easily, so their information gain is larger.
140 | 
141 |   The result is a huge but shallow tree, which is not reasonable.
142 | 
143 | - **C4.5**: replaces information gain with the information gain ratio.
144 | 
145 | - **CART**: replaces entropy with the Gini index and minimizes impurity rather than maximizing information gain.
146 | 
147 | ### 1.4 Random Forest
148 | 
149 | #### The Bagging idea
150 | 
151 | Bagging stands for bootstrap aggregating. The idea is to repeatedly draw a random subset of the full sample set, train on each subset, and then combine the results by voting or averaging. This makes it very likely that bad samples are avoided, which raises accuracy: some samples are essentially noise, and a model that learns the noise loses accuracy.
152 | 
153 | **An example**:
154 | 
155 | Suppose there are 1000 samples. The old way of thinking is to train on all 1000 directly. With Bagging we instead draw, say, 800 samples for training; if the noisy points happen to lie outside those 800 samples, they are avoided entirely. Repeating this procedure and averaging the model outputs gives a better result.
156 | 
157 | #### Random forest
158 | 
159 | RandomForest is an optimized, tree-based version of Bagging. One tree is rarely as good as many trees, hence the random forest, which addresses the weak generalization of a single decision tree. (Think of it as "three cobblers with their wits combined equal Zhuge Liang", i.e. many weak learners beat one.)
160 | 
161 | With the same data and the same algorithm we can only grow one tree; Bagging is what lets us create different data sets. The **Bagging** strategy comes from bootstrap aggregation: from a sample set of N data points, resample N points with replacement (the sample size stays N); build a classifier (ID3/C4.5/CART/SVM/logistic regression) on the resampled points; repeat these two steps m times to obtain m classifiers; finally, decide which class a data point belongs to by the vote of these m classifiers.
162 | 
163 | In short: randomly choose the samples, randomly choose the features, randomly choose the classifier, build many such decision trees, and let them vote on the class of each data point (**the voting scheme can be a veto rule, simple majority, or weighted majority**).
164 | 
165 | #### Strengths:
166 | 
167 | - On many of today's data sets it has a clear edge over other algorithms and performs well.
168 | - It can handle very high-dimensional data (many features) without explicit feature selection, because feature subsets are chosen at random.
169 | - After training it can report which features are the most important.
170 | - Training is fast and easy to parallelize (the trees are independent of one another during training).
171 | - During training it can detect interactions between features.
172 | - For imbalanced data sets it can balance the error.
173 | - Accuracy holds up even when a large fraction of the feature values are missing.
174 | 
175 | #### Weaknesses:
176 | 
177 | - Random forests have been shown to overfit on some **noisy** classification or regression problems.
178 | - For data whose categorical attributes have different numbers of values, attributes with more values exert a larger influence on the forest, so the feature importances it produces on such data are not reliable.
179 | 
180 | ### 1.5 Python code
181 | 
182 | [Decision tree demo](https://github.com/mantchs/machine_learning_model/blob/master/Decision%20Tree/DecisionTree.ipynb)
183 | 
184 | [Random forest demo](https://github.com/mantchs/machine_learning_model/blob/master/Decision%20Tree/RandomForestRegression.ipynb)
185 | 
186 | .
187 | 
188 | .
189 | 
190 | .
191 | 
192 | .
193 | 
194 | ![image.png](https://upload-images.jianshu.io/upload_images/13876065-08b587647d14267c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
195 | 
196 | You are welcome to add me on WeChat to discuss. Please mention "machine learning" in your request.
197 | 
198 | 

--------------------------------------------------------------------------------
/Decision Tree/RandomForestRegression.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 构建随机森林回归模型" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "#### 0.import工具库" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "import pandas as pd\n", 26 | "from sklearn import preprocessing\n", 27 | "from sklearn.ensemble import RandomForestRegressor\n", 28 | "from sklearn.datasets import load_boston" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "#### 1.加载数据" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 2, 41 | "metadata": { 42 | "collapsed": true 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "boston_house = load_boston()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 3, 52 | "metadata": { 53 | "collapsed": true 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "boston_feature_name = boston_house.feature_names\n", 58 | "boston_features = boston_house.data\n", 59 | "boston_target = boston_house.target" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 4, 65 | "metadata": {}, 66 | "outputs": [ 67 | { 68 | "data": { 69 | "text/plain": [ 70 | "array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n", 71 | " 'TAX', 'PTRATIO', 'B', 'LSTAT'],\n", 72 | " dtype='|S7')" 73 | ] 74 | }, 75 | "execution_count": 4, 76 | "metadata": {}, 77 | "output_type": "execute_result" 78 | } 79 | ], 80 | "source": [ 81 | 
"boston_feature_name" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 5, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "Boston House Prices dataset\n", 94 | "===========================\n", 95 | "\n", 96 | "Notes\n", 97 | "------\n", 98 | "Data Set Characteristics: \n", 99 | "\n", 100 | " :Number of Instances: 506 \n", 101 | "\n", 102 | " :Number of Attributes: 13 numeric/categorical predictive\n", 103 | " \n", 104 | " :Median Value (attribute 14) is usually the target\n", 105 | "\n", 106 | " :Attribute Information (in order):\n", 107 | " - CRIM per capita crime rate by town\n", 108 | " - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n", 109 | " - INDUS proportion of non-retail business acres per town\n", 110 | " - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n", 111 | " - NOX nitric oxides concentration (parts per 10 million)\n", 112 | " - RM average number of rooms per dwelling\n", 113 | " - AGE proportion of owner-occupied units built prior to 1940\n", 114 | " - DIS weighted distances to five Boston employment centres\n", 115 | " - RAD index of accessibility to radial highways\n", 116 | " - TAX full-value property-tax rate per $10,000\n", 117 | " - PTRATIO pupil-teacher ratio by town\n", 118 | " - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n", 119 | " - LSTAT % lower status of the population\n", 120 | " - MEDV Median value of owner-occupied homes in $1000's\n", 121 | "\n", 122 | " :Missing Attribute Values: None\n", 123 | "\n", 124 | " :Creator: Harrison, D. and Rubinfeld, D.L.\n", 125 | "\n", 126 | "This is a copy of UCI ML housing dataset.\n", 127 | "http://archive.ics.uci.edu/ml/datasets/Housing\n", 128 | "\n", 129 | "\n", 130 | "This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n", 131 | "\n", 132 | "The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\n", 133 | "prices and the demand for clean air', J. Environ. Economics & Management,\n", 134 | "vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n", 135 | "...', Wiley, 1980. N.B. Various transformations are used in the table on\n", 136 | "pages 244-261 of the latter.\n", 137 | "\n", 138 | "The Boston house-price data has been used in many machine learning papers that address regression\n", 139 | "problems. \n", 140 | " \n", 141 | "**References**\n", 142 | "\n", 143 | " - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n", 144 | " - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n", 145 | " - many more! 
(see http://archive.ics.uci.edu/ml/datasets/Housing)\n", 146 | "\n" 147 | ] 148 | } 149 | ], 150 | "source": [ 151 | "print(boston_house.DESCR)" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 6, 157 | "metadata": {}, 158 | "outputs": [ 159 | { 160 | "data": { 161 | "text/plain": [ 162 | "array([[ 6.32000000e-03, 1.80000000e+01, 2.31000000e+00,\n", 163 | " 0.00000000e+00, 5.38000000e-01, 6.57500000e+00,\n", 164 | " 6.52000000e+01, 4.09000000e+00, 1.00000000e+00,\n", 165 | " 2.96000000e+02, 1.53000000e+01, 3.96900000e+02,\n", 166 | " 4.98000000e+00],\n", 167 | " [ 2.73100000e-02, 0.00000000e+00, 7.07000000e+00,\n", 168 | " 0.00000000e+00, 4.69000000e-01, 6.42100000e+00,\n", 169 | " 7.89000000e+01, 4.96710000e+00, 2.00000000e+00,\n", 170 | " 2.42000000e+02, 1.78000000e+01, 3.96900000e+02,\n", 171 | " 9.14000000e+00],\n", 172 | " [ 2.72900000e-02, 0.00000000e+00, 7.07000000e+00,\n", 173 | " 0.00000000e+00, 4.69000000e-01, 7.18500000e+00,\n", 174 | " 6.11000000e+01, 4.96710000e+00, 2.00000000e+00,\n", 175 | " 2.42000000e+02, 1.78000000e+01, 3.92830000e+02,\n", 176 | " 4.03000000e+00],\n", 177 | " [ 3.23700000e-02, 0.00000000e+00, 2.18000000e+00,\n", 178 | " 0.00000000e+00, 4.58000000e-01, 6.99800000e+00,\n", 179 | " 4.58000000e+01, 6.06220000e+00, 3.00000000e+00,\n", 180 | " 2.22000000e+02, 1.87000000e+01, 3.94630000e+02,\n", 181 | " 2.94000000e+00],\n", 182 | " [ 6.90500000e-02, 0.00000000e+00, 2.18000000e+00,\n", 183 | " 0.00000000e+00, 4.58000000e-01, 7.14700000e+00,\n", 184 | " 5.42000000e+01, 6.06220000e+00, 3.00000000e+00,\n", 185 | " 2.22000000e+02, 1.87000000e+01, 3.96900000e+02,\n", 186 | " 5.33000000e+00]])" 187 | ] 188 | }, 189 | "execution_count": 6, 190 | "metadata": {}, 191 | "output_type": "execute_result" 192 | } 193 | ], 194 | "source": [ 195 | "boston_features[:5,:]" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 7, 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "data": { 205 | "text/plain": [ 206 | "array([ 24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5,\n", 207 | " 18.9, 15. , 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5,\n", 208 | " 20.2, 18.2, 13.6, 19.6, 15.2, 14.5, 15.6, 13.9, 16.6,\n", 209 | " 14.8, 18.4, 21. , 12.7, 14.5, 13.2, 13.1, 13.5, 18.9,\n", 210 | " 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7, 21.2,\n", 211 | " 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4,\n", 212 | " 18.9, 35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2,\n", 213 | " 25. , 33. , 23.5, 19.4, 22. , 17.4, 20.9, 24.2, 21.7,\n", 214 | " 22.8, 23.4, 24.1, 21.4, 20. , 20.8, 21.2, 20.3, 28. ,\n", 215 | " 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2, 23.6, 28.7,\n", 216 | " 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,\n", 217 | " 33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4,\n", 218 | " 19.8, 19.4, 21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2,\n", 219 | " 19.2, 20.4, 19.3, 22. , 20.3, 20.5, 17.3, 18.8, 21.4,\n", 220 | " 15.7, 16.2, 18. , 14.3, 19.2, 19.6, 23. , 18.4, 15.6,\n", 221 | " 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4, 15.6,\n", 222 | " 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3,\n", 223 | " 19.4, 17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. ,\n", 224 | " 50. , 50. , 22.7, 25. , 50. , 23.8, 23.8, 22.3, 17.4,\n", 225 | " 19.1, 23.1, 23.6, 22.6, 29.4, 23.2, 24.6, 29.9, 37.2,\n", 226 | " 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. , 32. , 29.8,\n", 227 | " 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,\n", 228 | " 34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. 
, 22.6, 24.4,\n", 229 | " 22.5, 24.4, 20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. ,\n", 230 | " 23.3, 28.7, 21.5, 23. , 26.7, 21.7, 27.5, 30.1, 44.8,\n", 231 | " 50. , 37.6, 31.6, 46.7, 31.5, 24.3, 31.7, 41.7, 48.3,\n", 232 | " 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1, 22.2,\n", 233 | " 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8,\n", 234 | " 29.6, 42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8,\n", 235 | " 43.1, 48.8, 31. , 36.5, 22.8, 30.7, 50. , 43.5, 20.7,\n", 236 | " 21.1, 25.2, 24.4, 35.2, 32.4, 32. , 33.2, 33.1, 29.1,\n", 237 | " 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. , 20.1, 23.2,\n", 238 | " 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,\n", 239 | " 20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4,\n", 240 | " 33.4, 28.2, 22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8,\n", 241 | " 16.2, 17.8, 19.8, 23.1, 21. , 23.8, 23.1, 20.4, 18.5,\n", 242 | " 25. , 24.6, 23. , 22.2, 19.3, 22.6, 19.8, 17.1, 19.4,\n", 243 | " 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7, 32.7,\n", 244 | " 16.5, 23.9, 31.2, 17.5, 17.2, 23.1, 24.5, 26.6, 22.9,\n", 245 | " 24.1, 18.6, 30.1, 18.2, 20.6, 17.8, 21.7, 22.7, 22.6,\n", 246 | " 25. , 19.9, 20.8, 16.8, 21.9, 27.5, 21.9, 23.1, 50. ,\n", 247 | " 50. , 50. , 50. , 50. , 13.8, 13.8, 15. , 13.9, 13.3,\n", 248 | " 13.1, 10.2, 10.4, 10.9, 11.3, 12.3, 8.8, 7.2, 10.5,\n", 249 | " 7.4, 10.2, 11.5, 15.1, 23.2, 9.7, 13.8, 12.7, 13.1,\n", 250 | " 12.5, 8.5, 5. , 6.3, 5.6, 7.2, 12.1, 8.3, 8.5,\n", 251 | " 5. , 11.9, 27.9, 17.2, 27.5, 15. , 17.2, 17.9, 16.3,\n", 252 | " 7. , 7.2, 7.5, 10.4, 8.8, 8.4, 16.7, 14.2, 20.8,\n", 253 | " 13.4, 11.7, 8.3, 10.2, 10.9, 11. , 9.5, 14.5, 14.1,\n", 254 | " 16.1, 14.3, 11.7, 13.4, 9.6, 8.7, 8.4, 12.8, 10.5,\n", 255 | " 17.1, 18.4, 15.4, 10.8, 11.8, 14.9, 12.6, 14.1, 13. ,\n", 256 | " 13.4, 15.2, 16.1, 17.8, 14.9, 14.1, 12.7, 13.5, 14.9,\n", 257 | " 20. , 16.4, 17.7, 19.5, 20.2, 21.4, 19.9, 19. , 19.1,\n", 258 | " 19.1, 20.1, 19.9, 19.6, 23.2, 29.8, 13.8, 13.3, 16.7,\n", 259 | " 12. , 14.6, 21.4, 23. , 23.7, 25. , 21.8, 20.6, 21.2,\n", 260 | " 19.1, 20.6, 15.2, 7. , 8.1, 13.6, 20.1, 21.8, 24.5,\n", 261 | " 23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9,\n", 262 | " 22. 
, 11.9])" 263 | ] 264 | }, 265 | "execution_count": 7, 266 | "metadata": {}, 267 | "output_type": "execute_result" 268 | } 269 | ], 270 | "source": [ 271 | "boston_target" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "### 构建模型" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 8, 284 | "metadata": { 285 | "scrolled": true 286 | }, 287 | "outputs": [ 288 | { 289 | "name": "stdout", 290 | "output_type": "stream", 291 | "text": [ 292 | "Help on class RandomForestRegressor in module sklearn.ensemble.forest:\n", 293 | "\n", 294 | "class RandomForestRegressor(ForestRegressor)\n", 295 | " | A random forest regressor.\n", 296 | " | \n", 297 | " | A random forest is a meta estimator that fits a number of classifying\n", 298 | " | decision trees on various sub-samples of the dataset and use averaging\n", 299 | " | to improve the predictive accuracy and control over-fitting.\n", 300 | " | The sub-sample size is always the same as the original\n", 301 | " | input sample size but the samples are drawn with replacement if\n", 302 | " | `bootstrap=True` (default).\n", 303 | " | \n", 304 | " | Read more in the :ref:`User Guide `.\n", 305 | " | \n", 306 | " | Parameters\n", 307 | " | ----------\n", 308 | " | n_estimators : integer, optional (default=10)\n", 309 | " | The number of trees in the forest.\n", 310 | " | \n", 311 | " | criterion : string, optional (default=\"mse\")\n", 312 | " | The function to measure the quality of a split. Supported criteria\n", 313 | " | are \"mse\" for the mean squared error, which is equal to variance\n", 314 | " | reduction as feature selection criterion, and \"mae\" for the mean\n", 315 | " | absolute error.\n", 316 | " | \n", 317 | " | .. versionadded:: 0.18\n", 318 | " | Mean Absolute Error (MAE) criterion.\n", 319 | " | \n", 320 | " | max_features : int, float, string or None, optional (default=\"auto\")\n", 321 | " | The number of features to consider when looking for the best split:\n", 322 | " | \n", 323 | " | - If int, then consider `max_features` features at each split.\n", 324 | " | - If float, then `max_features` is a percentage and\n", 325 | " | `int(max_features * n_features)` features are considered at each\n", 326 | " | split.\n", 327 | " | - If \"auto\", then `max_features=n_features`.\n", 328 | " | - If \"sqrt\", then `max_features=sqrt(n_features)`.\n", 329 | " | - If \"log2\", then `max_features=log2(n_features)`.\n", 330 | " | - If None, then `max_features=n_features`.\n", 331 | " | \n", 332 | " | Note: the search for a split does not stop until at least one\n", 333 | " | valid partition of the node samples is found, even if it requires to\n", 334 | " | effectively inspect more than ``max_features`` features.\n", 335 | " | \n", 336 | " | max_depth : integer or None, optional (default=None)\n", 337 | " | The maximum depth of the tree. If None, then nodes are expanded until\n", 338 | " | all leaves are pure or until all leaves contain less than\n", 339 | " | min_samples_split samples.\n", 340 | " | \n", 341 | " | min_samples_split : int, float, optional (default=2)\n", 342 | " | The minimum number of samples required to split an internal node:\n", 343 | " | \n", 344 | " | - If int, then consider `min_samples_split` as the minimum number.\n", 345 | " | - If float, then `min_samples_split` is a percentage and\n", 346 | " | `ceil(min_samples_split * n_samples)` are the minimum\n", 347 | " | number of samples for each split.\n", 348 | " | \n", 349 | " | .. 
versionchanged:: 0.18\n", 350 | " | Added float values for percentages.\n", 351 | " | \n", 352 | " | min_samples_leaf : int, float, optional (default=1)\n", 353 | " | The minimum number of samples required to be at a leaf node:\n", 354 | " | \n", 355 | " | - If int, then consider `min_samples_leaf` as the minimum number.\n", 356 | " | - If float, then `min_samples_leaf` is a percentage and\n", 357 | " | `ceil(min_samples_leaf * n_samples)` are the minimum\n", 358 | " | number of samples for each node.\n", 359 | " | \n", 360 | " | .. versionchanged:: 0.18\n", 361 | " | Added float values for percentages.\n", 362 | " | \n", 363 | " | min_weight_fraction_leaf : float, optional (default=0.)\n", 364 | " | The minimum weighted fraction of the sum total of weights (of all\n", 365 | " | the input samples) required to be at a leaf node. Samples have\n", 366 | " | equal weight when sample_weight is not provided.\n", 367 | " | \n", 368 | " | max_leaf_nodes : int or None, optional (default=None)\n", 369 | " | Grow trees with ``max_leaf_nodes`` in best-first fashion.\n", 370 | " | Best nodes are defined as relative reduction in impurity.\n", 371 | " | If None then unlimited number of leaf nodes.\n", 372 | " | \n", 373 | " | min_impurity_split : float,\n", 374 | " | Threshold for early stopping in tree growth. A node will split\n", 375 | " | if its impurity is above the threshold, otherwise it is a leaf.\n", 376 | " | \n", 377 | " | .. deprecated:: 0.19\n", 378 | " | ``min_impurity_split`` has been deprecated in favor of\n", 379 | " | ``min_impurity_decrease`` in 0.19 and will be removed in 0.21.\n", 380 | " | Use ``min_impurity_decrease`` instead.\n", 381 | " | \n", 382 | " | min_impurity_decrease : float, optional (default=0.)\n", 383 | " | A node will be split if this split induces a decrease of the impurity\n", 384 | " | greater than or equal to this value.\n", 385 | " | \n", 386 | " | The weighted impurity decrease equation is the following::\n", 387 | " | \n", 388 | " | N_t / N * (impurity - N_t_R / N_t * right_impurity\n", 389 | " | - N_t_L / N_t * left_impurity)\n", 390 | " | \n", 391 | " | where ``N`` is the total number of samples, ``N_t`` is the number of\n", 392 | " | samples at the current node, ``N_t_L`` is the number of samples in the\n", 393 | " | left child, and ``N_t_R`` is the number of samples in the right child.\n", 394 | " | \n", 395 | " | ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,\n", 396 | " | if ``sample_weight`` is passed.\n", 397 | " | \n", 398 | " | .. 
versionadded:: 0.19\n", 399 | " | \n", 400 | " | bootstrap : boolean, optional (default=True)\n", 401 | " | Whether bootstrap samples are used when building trees.\n", 402 | " | \n", 403 | " | oob_score : bool, optional (default=False)\n", 404 | " | whether to use out-of-bag samples to estimate\n", 405 | " | the R^2 on unseen data.\n", 406 | " | \n", 407 | " | n_jobs : integer, optional (default=1)\n", 408 | " | The number of jobs to run in parallel for both `fit` and `predict`.\n", 409 | " | If -1, then the number of jobs is set to the number of cores.\n", 410 | " | \n", 411 | " | random_state : int, RandomState instance or None, optional (default=None)\n", 412 | " | If int, random_state is the seed used by the random number generator;\n", 413 | " | If RandomState instance, random_state is the random number generator;\n", 414 | " | If None, the random number generator is the RandomState instance used\n", 415 | " | by `np.random`.\n", 416 | " | \n", 417 | " | verbose : int, optional (default=0)\n", 418 | " | Controls the verbosity of the tree building process.\n", 419 | " | \n", 420 | " | warm_start : bool, optional (default=False)\n", 421 | " | When set to ``True``, reuse the solution of the previous call to fit\n", 422 | " | and add more estimators to the ensemble, otherwise, just fit a whole\n", 423 | " | new forest.\n", 424 | " | \n", 425 | " | Attributes\n", 426 | " | ----------\n", 427 | " | estimators_ : list of DecisionTreeRegressor\n", 428 | " | The collection of fitted sub-estimators.\n", 429 | " | \n", 430 | " | feature_importances_ : array of shape = [n_features]\n", 431 | " | The feature importances (the higher, the more important the feature).\n", 432 | " | \n", 433 | " | n_features_ : int\n", 434 | " | The number of features when ``fit`` is performed.\n", 435 | " | \n", 436 | " | n_outputs_ : int\n", 437 | " | The number of outputs when ``fit`` is performed.\n", 438 | " | \n", 439 | " | oob_score_ : float\n", 440 | " | Score of the training dataset obtained using an out-of-bag estimate.\n", 441 | " | \n", 442 | " | oob_prediction_ : array of shape = [n_samples]\n", 443 | " | Prediction computed with out-of-bag estimate on the training set.\n", 444 | " | \n", 445 | " | Examples\n", 446 | " | --------\n", 447 | " | >>> from sklearn.ensemble import RandomForestRegressor\n", 448 | " | >>> from sklearn.datasets import make_regression\n", 449 | " | >>>\n", 450 | " | >>> X, y = make_regression(n_features=4, n_informative=2,\n", 451 | " | ... random_state=0, shuffle=False)\n", 452 | " | >>> regr = RandomForestRegressor(max_depth=2, random_state=0)\n", 453 | " | >>> regr.fit(X, y)\n", 454 | " | RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,\n", 455 | " | max_features='auto', max_leaf_nodes=None,\n", 456 | " | min_impurity_decrease=0.0, min_impurity_split=None,\n", 457 | " | min_samples_leaf=1, min_samples_split=2,\n", 458 | " | min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n", 459 | " | oob_score=False, random_state=0, verbose=0, warm_start=False)\n", 460 | " | >>> print(regr.feature_importances_)\n", 461 | " | [ 0.17339552 0.81594114 0. 0.01066333]\n", 462 | " | >>> print(regr.predict([[0, 0, 0, 0]]))\n", 463 | " | [-2.50699856]\n", 464 | " | \n", 465 | " | Notes\n", 466 | " | -----\n", 467 | " | The default values for the parameters controlling the size of the trees\n", 468 | " | (e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and\n", 469 | " | unpruned trees which can potentially be very large on some data sets. 
To\n", 470 | " | reduce memory consumption, the complexity and size of the trees should be\n", 471 | " | controlled by setting those parameter values.\n", 472 | " | \n", 473 | " | The features are always randomly permuted at each split. Therefore,\n", 474 | " | the best found split may vary, even with the same training data,\n", 475 | " | ``max_features=n_features`` and ``bootstrap=False``, if the improvement\n", 476 | " | of the criterion is identical for several splits enumerated during the\n", 477 | " | search of the best split. To obtain a deterministic behaviour during\n", 478 | " | fitting, ``random_state`` has to be fixed.\n", 479 | " | \n", 480 | " | References\n", 481 | " | ----------\n", 482 | " | \n", 483 | " | .. [1] L. Breiman, \"Random Forests\", Machine Learning, 45(1), 5-32, 2001.\n", 484 | " | \n", 485 | " | See also\n", 486 | " | --------\n", 487 | " | DecisionTreeRegressor, ExtraTreesRegressor\n", 488 | " | \n", 489 | " | Method resolution order:\n", 490 | " | RandomForestRegressor\n", 491 | " | ForestRegressor\n", 492 | " | abc.NewBase\n", 493 | " | BaseForest\n", 494 | " | abc.NewBase\n", 495 | " | sklearn.ensemble.base.BaseEnsemble\n", 496 | " | abc.NewBase\n", 497 | " | sklearn.base.BaseEstimator\n", 498 | " | sklearn.base.MetaEstimatorMixin\n", 499 | " | sklearn.base.RegressorMixin\n", 500 | " | __builtin__.object\n", 501 | " | \n", 502 | " | Methods defined here:\n", 503 | " | \n", 504 | " | __init__(self, n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)\n", 505 | " | \n", 506 | " | ----------------------------------------------------------------------\n", 507 | " | Data and other attributes defined here:\n", 508 | " | \n", 509 | " | __abstractmethods__ = frozenset([])\n", 510 | " | \n", 511 | " | ----------------------------------------------------------------------\n", 512 | " | Methods inherited from ForestRegressor:\n", 513 | " | \n", 514 | " | predict(self, X)\n", 515 | " | Predict regression target for X.\n", 516 | " | \n", 517 | " | The predicted regression target of an input sample is computed as the\n", 518 | " | mean predicted regression targets of the trees in the forest.\n", 519 | " | \n", 520 | " | Parameters\n", 521 | " | ----------\n", 522 | " | X : array-like or sparse matrix of shape = [n_samples, n_features]\n", 523 | " | The input samples. Internally, its dtype will be converted to\n", 524 | " | ``dtype=np.float32``. If a sparse matrix is provided, it will be\n", 525 | " | converted into a sparse ``csr_matrix``.\n", 526 | " | \n", 527 | " | Returns\n", 528 | " | -------\n", 529 | " | y : array of shape = [n_samples] or [n_samples, n_outputs]\n", 530 | " | The predicted values.\n", 531 | " | \n", 532 | " | ----------------------------------------------------------------------\n", 533 | " | Methods inherited from BaseForest:\n", 534 | " | \n", 535 | " | apply(self, X)\n", 536 | " | Apply trees in the forest to X, return leaf indices.\n", 537 | " | \n", 538 | " | Parameters\n", 539 | " | ----------\n", 540 | " | X : array-like or sparse matrix, shape = [n_samples, n_features]\n", 541 | " | The input samples. Internally, its dtype will be converted to\n", 542 | " | ``dtype=np.float32``. 
If a sparse matrix is provided, it will be\n", 543 | " | converted into a sparse ``csr_matrix``.\n", 544 | " | \n", 545 | " | Returns\n", 546 | " | -------\n", 547 | " | X_leaves : array_like, shape = [n_samples, n_estimators]\n", 548 | " | For each datapoint x in X and for each tree in the forest,\n", 549 | " | return the index of the leaf x ends up in.\n", 550 | " | \n", 551 | " | decision_path(self, X)\n", 552 | " | Return the decision path in the forest\n", 553 | " | \n", 554 | " | .. versionadded:: 0.18\n", 555 | " | \n", 556 | " | Parameters\n", 557 | " | ----------\n", 558 | " | X : array-like or sparse matrix, shape = [n_samples, n_features]\n", 559 | " | The input samples. Internally, its dtype will be converted to\n", 560 | " | ``dtype=np.float32``. If a sparse matrix is provided, it will be\n", 561 | " | converted into a sparse ``csr_matrix``.\n", 562 | " | \n", 563 | " | Returns\n", 564 | " | -------\n", 565 | " | indicator : sparse csr array, shape = [n_samples, n_nodes]\n", 566 | " | Return a node indicator matrix where non zero elements\n", 567 | " | indicates that the samples goes through the nodes.\n", 568 | " | \n", 569 | " | n_nodes_ptr : array of size (n_estimators + 1, )\n", 570 | " | The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]]\n", 571 | " | gives the indicator value for the i-th estimator.\n", 572 | " | \n", 573 | " | fit(self, X, y, sample_weight=None)\n", 574 | " | Build a forest of trees from the training set (X, y).\n", 575 | " | \n", 576 | " | Parameters\n", 577 | " | ----------\n", 578 | " | X : array-like or sparse matrix of shape = [n_samples, n_features]\n", 579 | " | The training input samples. Internally, its dtype will be converted to\n", 580 | " | ``dtype=np.float32``. If a sparse matrix is provided, it will be\n", 581 | " | converted into a sparse ``csc_matrix``.\n", 582 | " | \n", 583 | " | y : array-like, shape = [n_samples] or [n_samples, n_outputs]\n", 584 | " | The target values (class labels in classification, real numbers in\n", 585 | " | regression).\n", 586 | " | \n", 587 | " | sample_weight : array-like, shape = [n_samples] or None\n", 588 | " | Sample weights. If None, then samples are equally weighted. Splits\n", 589 | " | that would create child nodes with net zero or negative weight are\n", 590 | " | ignored while searching for a split in each node. 
In the case of\n", 591 | " | classification, splits are also ignored if they would result in any\n", 592 | " | single class carrying a negative weight in either child node.\n", 593 | " | \n", 594 | " | Returns\n", 595 | " | -------\n", 596 | " | self : object\n", 597 | " | Returns self.\n", 598 | " | \n", 599 | " | ----------------------------------------------------------------------\n", 600 | " | Data descriptors inherited from BaseForest:\n", 601 | " | \n", 602 | " | feature_importances_\n", 603 | " | Return the feature importances (the higher, the more important the\n", 604 | " | feature).\n", 605 | " | \n", 606 | " | Returns\n", 607 | " | -------\n", 608 | " | feature_importances_ : array, shape = [n_features]\n", 609 | " | \n", 610 | " | ----------------------------------------------------------------------\n", 611 | " | Methods inherited from sklearn.ensemble.base.BaseEnsemble:\n", 612 | " | \n", 613 | " | __getitem__(self, index)\n", 614 | " | Returns the index'th estimator in the ensemble.\n", 615 | " | \n", 616 | " | __iter__(self)\n", 617 | " | Returns iterator over estimators in the ensemble.\n", 618 | " | \n", 619 | " | __len__(self)\n", 620 | " | Returns the number of estimators in the ensemble.\n", 621 | " | \n", 622 | " | ----------------------------------------------------------------------\n", 623 | " | Methods inherited from sklearn.base.BaseEstimator:\n", 624 | " | \n", 625 | " | __getstate__(self)\n", 626 | " | \n", 627 | " | __repr__(self)\n", 628 | " | \n", 629 | " | __setstate__(self, state)\n", 630 | " | \n", 631 | " | get_params(self, deep=True)\n", 632 | " | Get parameters for this estimator.\n", 633 | " | \n", 634 | " | Parameters\n", 635 | " | ----------\n", 636 | " | deep : boolean, optional\n", 637 | " | If True, will return the parameters for this estimator and\n", 638 | " | contained subobjects that are estimators.\n", 639 | " | \n", 640 | " | Returns\n", 641 | " | -------\n", 642 | " | params : mapping of string to any\n", 643 | " | Parameter names mapped to their values.\n", 644 | " | \n", 645 | " | set_params(self, **params)\n", 646 | " | Set the parameters of this estimator.\n", 647 | " | \n", 648 | " | The method works on simple estimators as well as on nested objects\n", 649 | " | (such as pipelines). 
The latter have parameters of the form\n", 650 | " | ``__`` so that it's possible to update each\n", 651 | " | component of a nested object.\n", 652 | " | \n", 653 | " | Returns\n", 654 | " | -------\n", 655 | " | self\n", 656 | " | \n", 657 | " | ----------------------------------------------------------------------\n", 658 | " | Data descriptors inherited from sklearn.base.BaseEstimator:\n", 659 | " | \n", 660 | " | __dict__\n", 661 | " | dictionary for instance variables (if defined)\n", 662 | " | \n", 663 | " | __weakref__\n", 664 | " | list of weak references to the object (if defined)\n", 665 | " | \n", 666 | " | ----------------------------------------------------------------------\n", 667 | " | Methods inherited from sklearn.base.RegressorMixin:\n", 668 | " | \n", 669 | " | score(self, X, y, sample_weight=None)\n", 670 | " | Returns the coefficient of determination R^2 of the prediction.\n", 671 | " | \n", 672 | " | The coefficient R^2 is defined as (1 - u/v), where u is the residual\n", 673 | " | sum of squares ((y_true - y_pred) ** 2).sum() and v is the total\n", 674 | " | sum of squares ((y_true - y_true.mean()) ** 2).sum().\n", 675 | " | The best possible score is 1.0 and it can be negative (because the\n", 676 | " | model can be arbitrarily worse). A constant model that always\n", 677 | " | predicts the expected value of y, disregarding the input features,\n", 678 | " | would get a R^2 score of 0.0.\n", 679 | " | \n", 680 | " | Parameters\n", 681 | " | ----------\n", 682 | " | X : array-like, shape = (n_samples, n_features)\n", 683 | " | Test samples.\n", 684 | " | \n", 685 | " | y : array-like, shape = (n_samples) or (n_samples, n_outputs)\n", 686 | " | True values for X.\n", 687 | " | \n", 688 | " | sample_weight : array-like, shape = [n_samples], optional\n", 689 | " | Sample weights.\n", 690 | " | \n", 691 | " | Returns\n", 692 | " | -------\n", 693 | " | score : float\n", 694 | " | R^2 of self.predict(X) wrt. 
y.\n", 695 | "\n" 696 | ] 697 | } 698 | ], 699 | "source": [ 700 | "help(RandomForestRegressor)" 701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": 9, 706 | "metadata": { 707 | "collapsed": true 708 | }, 709 | "outputs": [], 710 | "source": [ 711 | "rgs = RandomForestRegressor(n_estimators=15) ##随机森林模型\n", 712 | "rgs = rgs.fit(boston_features, boston_target)" 713 | ] 714 | }, 715 | { 716 | "cell_type": "code", 717 | "execution_count": 10, 718 | "metadata": {}, 719 | "outputs": [ 720 | { 721 | "data": { 722 | "text/plain": [ 723 | "RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n", 724 | " max_features='auto', max_leaf_nodes=None,\n", 725 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 726 | " min_samples_leaf=1, min_samples_split=2,\n", 727 | " min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,\n", 728 | " oob_score=False, random_state=None, verbose=0, warm_start=False)" 729 | ] 730 | }, 731 | "execution_count": 10, 732 | "metadata": {}, 733 | "output_type": "execute_result" 734 | } 735 | ], 736 | "source": [ 737 | "rgs" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": 11, 743 | "metadata": {}, 744 | "outputs": [ 745 | { 746 | "data": { 747 | "text/plain": [ 748 | "array([ 26.16666667, 22.24 , 33.76666667, 33.67333333,\n", 749 | " 35.46 , 26.74 , 21.84 , 26.59333333,\n", 750 | " 20.31333333, 19.88 , 18.53333333, 19.57333333,\n", 751 | " 21.64 , 19.64666667, 19.37333333, 19.92 ,\n", 752 | " 22.3 , 17.8 , 19.64 , 18.73333333,\n", 753 | " 13.80666667, 18.42666667, 15.5 , 14.46666667,\n", 754 | " 15.54666667, 14.18 , 16.34666667, 14.58666667,\n", 755 | " 18.25333333, 21.46 , 13.38 , 15.86 ,\n", 756 | " 14.09333333, 13.8 , 13.71333333, 19.51333333,\n", 757 | " 19.96666667, 21.49333333, 23.56666667, 30.07333333,\n", 758 | " 34.7 , 28.31333333, 25.04 , 24.64666667,\n", 759 | " 21.45333333, 19.42 , 19.74 , 18.26666667,\n", 760 | " 19.57333333, 20.02666667, 20.74666667, 20.84666667,\n", 761 | " 25.42666667, 22.25333333, 19.12666667, 34.66666667,\n", 762 | " 24.3 , 32.18666667, 23.42666667, 19.78 ,\n", 763 | " 18.78666667, 17.82666667, 23.03333333, 25.81333333,\n", 764 | " 32.30666667, 23.85333333, 20.3 , 21.85333333,\n", 765 | " 18.59333333, 21.10666667, 24.42 , 21.25333333,\n", 766 | " 23.2 , 23.56666667, 24.22 , 22.24 ,\n", 767 | " 20.08 , 21.34666667, 21.24 , 20.49333333,\n", 768 | " 27.87333333, 24.4 , 24.16666667, 22.98 ,\n", 769 | " 23.48666667, 27.13333333, 21.23333333, 22.28666667,\n", 770 | " 26.63333333, 29.01333333, 22.4 , 22.06666667,\n", 771 | " 22.75333333, 24.77333333, 21.38666667, 26.91333333,\n", 772 | " 21.57333333, 40.41333333, 44.06 , 32.64666667,\n", 773 | " 26.9 , 25.90666667, 19.05333333, 19.85333333,\n", 774 | " 20.00666667, 19.73333333, 19.09333333, 20.34666667,\n", 775 | " 20.21333333, 19.20666667, 21.24 , 24.06 ,\n", 776 | " 18.76666667, 18.66 , 19.18666667, 18.37333333,\n", 777 | " 21.08 , 20.04 , 19.32 , 19.11333333,\n", 778 | " 21.79333333, 21.07333333, 19.68 , 16.92 ,\n", 779 | " 19.00666667, 20.57333333, 15.87333333, 16.04666667,\n", 780 | " 17.64666667, 14.80666667, 19.72666667, 19.89333333,\n", 781 | " 21.52 , 18.16666667, 15.46 , 18.89333333,\n", 782 | " 16.71333333, 18.36 , 13.54666667, 16.38666667,\n", 783 | " 13.9 , 14.04 , 13.41333333, 14.58666667,\n", 784 | " 12.49333333, 13.95333333, 15.80666667, 14.44 ,\n", 785 | " 16.67333333, 15.32 , 21.67333333, 19.6 ,\n", 786 | " 16.92666667, 17.62666667, 16.9 , 15.82666667,\n", 787 | " 14.26666667, 34.84666667, 23.87333333, 
24.14 ,\n", 788 | " 26.14666667, 48.45333333, 49.56666667, 49.44666667,\n", 789 | " 21.82666667, 24.29333333, 49.80666667, 22.55333333,\n", 790 | " 22.67333333, 21.97333333, 18.66666667, 20.72 ,\n", 791 | " 21.75333333, 23.42666667, 22.56666667, 29.07333333,\n", 792 | " 23.02666667, 24.20666667, 28.74666667, 35.60666667,\n", 793 | " 42.8 , 33.5 , 36.97333333, 32.09333333,\n", 794 | " 26.45333333, 28.88 , 47.63333333, 30.38666667,\n", 795 | " 29.18666667, 34.84666667, 34.12666667, 29.24666667,\n", 796 | " 35.76666667, 31.82666667, 29.34 , 48.67333333,\n", 797 | " 33.6 , 30.52666667, 34.26666667, 34.84 ,\n", 798 | " 33.73333333, 23.38 , 42.36666667, 48.6 ,\n", 799 | " 48.47333333, 22.15333333, 24.27333333, 21.84 ,\n", 800 | " 23.44 , 20.54666667, 21.4 , 20.68 ,\n", 801 | " 22.51333333, 26.79333333, 22.83333333, 24.82666667,\n", 802 | " 22.53333333, 27.07333333, 20.67333333, 22.8 ,\n", 803 | " 27. , 21.24 , 26.67333333, 29.24 ,\n", 804 | " 45.48 , 48.93333333, 41.64666667, 31.81333333,\n", 805 | " 46.19333333, 32.8 , 23.28 , 31.92666667,\n", 806 | " 43.86666667, 45.86 , 28.46 , 23.44 ,\n", 807 | " 25.91333333, 31.70666667, 23.57333333, 24.63333333,\n", 808 | " 24.99333333, 20.82666667, 22.10666667, 23.97333333,\n", 809 | " 18.09333333, 18.52 , 23.64666667, 20.46 ,\n", 810 | " 23.2 , 26.64 , 24.36 , 27.22 ,\n", 811 | " 30.37333333, 40.96 , 22.07333333, 21.23333333,\n", 812 | " 44.58666667, 49.24666667, 36.1 , 30.28 ,\n", 813 | " 33.12666667, 43.98 , 47.82 , 30.35333333,\n", 814 | " 36.10666667, 21.79333333, 29.96 , 48.74666667,\n", 815 | " 43.60666667, 20.66666667, 21.44 , 25.01333333,\n", 816 | " 24.76 , 38.78 , 33.61333333, 33.6 ,\n", 817 | " 33.22666667, 32.66666667, 26.78 , 34.47333333,\n", 818 | " 46.01333333, 34.73333333, 45.41333333, 48.90666667,\n", 819 | " 31.54 , 22.25333333, 20.25333333, 23.65333333,\n", 820 | " 22.66 , 24.64666667, 31.19333333, 34.18 ,\n", 821 | " 28.34666667, 23.28666667, 22.12 , 27.36 ,\n", 822 | " 26.49333333, 20.88666667, 24.1 , 29.98666667,\n", 823 | " 26.59333333, 23.92 , 25.13333333, 32.73333333,\n", 824 | " 35.57333333, 28.28666667, 34.6 , 29.93333333,\n", 825 | " 25.48666667, 20.17333333, 17.52 , 22.84 ,\n", 826 | " 19.96 , 21.66666667, 23.65333333, 16.72666667,\n", 827 | " 17.96666667, 19.64 , 23.01333333, 21.31333333,\n", 828 | " 23.89333333, 23.42 , 21.14666667, 18.29333333,\n", 829 | " 24.74666667, 24.94666667, 23.97333333, 22.04 ,\n", 830 | " 20.17333333, 22.75333333, 19.79333333, 17.60666667,\n", 831 | " 20.96 , 22.32 , 22.04666667, 20.34666667,\n", 832 | " 19.46666667, 19.18 , 20.54666667, 19.31333333,\n", 833 | " 18.98 , 32.66666667, 18.45333333, 25.26666667,\n", 834 | " 30.28 , 17.92666667, 18.08 , 23.3 ,\n", 835 | " 24.58666667, 27.50666667, 23.44 , 25.18 ,\n", 836 | " 20.55333333, 29.7 , 19.15333333, 20.94 ,\n", 837 | " 16.68 , 21.38666667, 21.86666667, 22.14 ,\n", 838 | " 23.73333333, 20.13333333, 20. 
, 17.24666667,\n", 839 | " 26.76666667, 22.62666667, 19.58 , 21.64 ,\n", 840 | " 46.78666667, 47.02 , 41.21333333, 45.14666667,\n", 841 | " 42.74666667, 13.46 , 13.32 , 22.26666667,\n", 842 | " 13.04666667, 12.56 , 12.65333333, 10.74 ,\n", 843 | " 15.79333333, 10.94666667, 11.46 , 11.27333333,\n", 844 | " 9.1 , 8.34666667, 8.96 , 7.56 ,\n", 845 | " 9.76 , 11.43333333, 14.76 , 20.18 ,\n", 846 | " 9.70666667, 14.08 , 11.27333333, 13.44666667,\n", 847 | " 13.02666667, 10.52 , 5.72666667, 7.14666667,\n", 848 | " 7.20666667, 8.32666667, 11.5 , 9.32 ,\n", 849 | " 8.04666667, 6.98666667, 12.48666667, 31.44666667,\n", 850 | " 16.7 , 25.96666667, 21.21333333, 17.04 ,\n", 851 | " 17.02666667, 16.3 , 7.26666667, 7.3 ,\n", 852 | " 8.25333333, 10.02 , 8.87333333, 11.07333333,\n", 853 | " 15.46666667, 14.47333333, 18.96666667, 12.74 ,\n", 854 | " 12.3 , 8.9 , 11.50666667, 12.76 ,\n", 855 | " 12.12 , 10.05333333, 14.50666667, 17.65333333,\n", 856 | " 17.75333333, 14.52 , 11.85333333, 11.67333333,\n", 857 | " 10.94 , 8.68666667, 8.02 , 12.10666667,\n", 858 | " 10.29333333, 16.21333333, 17.60666667, 15.25333333,\n", 859 | " 10.71333333, 12.16666667, 14.80666667, 14.68666667,\n", 860 | " 14.30666667, 13.55333333, 14.02 , 15.37333333,\n", 861 | " 16.08666667, 19.74666667, 14.77333333, 13.98 ,\n", 862 | " 13.53333333, 13.98 , 15.16 , 19.47333333,\n", 863 | " 16.33333333, 18.45333333, 20.74 , 20.99333333,\n", 864 | " 21.57333333, 19.66666667, 17.67333333, 15.99333333,\n", 865 | " 18.84666667, 20.22 , 18.75333333, 19.84 ,\n", 866 | " 22.87333333, 27.68666667, 13.87333333, 14.12 ,\n", 867 | " 16.30666667, 11.87333333, 14.16 , 20.96 ,\n", 868 | " 22.48666667, 24.33333333, 26.02666667, 20.92 ,\n", 869 | " 20.38 , 22.08 , 18.89333333, 20.89333333,\n", 870 | " 14.98666667, 9.30666667, 9.91333333, 13.98666667,\n", 871 | " 20.46 , 21.08666667, 23.27333333, 19.56 ,\n", 872 | " 19.76666667, 18.92666667, 21.03333333, 18.05333333,\n", 873 | " 17.84666667, 22.7 , 20.64666667, 25.46666667,\n", 874 | " 23.96666667, 14.72666667])" 875 | ] 876 | }, 877 | "execution_count": 11, 878 | "metadata": {}, 879 | "output_type": "execute_result" 880 | } 881 | ], 882 | "source": [ 883 | "rgs.predict(boston_features)" 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": 12, 889 | "metadata": { 890 | "collapsed": true 891 | }, 892 | "outputs": [], 893 | "source": [ 894 | "from sklearn import tree" 895 | ] 896 | }, 897 | { 898 | "cell_type": "code", 899 | "execution_count": 13, 900 | "metadata": {}, 901 | "outputs": [ 902 | { 903 | "data": { 904 | "text/plain": [ 905 | "DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,\n", 906 | " max_leaf_nodes=None, min_impurity_decrease=0.0,\n", 907 | " min_impurity_split=None, min_samples_leaf=1,\n", 908 | " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", 909 | " presort=False, random_state=None, splitter='best')" 910 | ] 911 | }, 912 | "execution_count": 13, 913 | "metadata": {}, 914 | "output_type": "execute_result" 915 | } 916 | ], 917 | "source": [ 918 | "rgs2 = tree.DecisionTreeRegressor() ##决策树模型,比较两个模型的预测结果!\n", 919 | "rgs2.fit(boston_features, boston_target)" 920 | ] 921 | }, 922 | { 923 | "cell_type": "code", 924 | "execution_count": 14, 925 | "metadata": {}, 926 | "outputs": [ 927 | { 928 | "data": { 929 | "text/plain": [ 930 | "array([ 24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5,\n", 931 | " 18.9, 15. 
, 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5,\n", 932 | " 20.2, 18.2, 13.6, 19.6, 15.2, 14.5, 15.6, 13.9, 16.6,\n", 933 | " 14.8, 18.4, 21. , 12.7, 14.5, 13.2, 13.1, 13.5, 18.9,\n", 934 | " 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7, 21.2,\n", 935 | " 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4,\n", 936 | " 18.9, 35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2,\n", 937 | " 25. , 33. , 23.5, 19.4, 22. , 17.4, 20.9, 24.2, 21.7,\n", 938 | " 22.8, 23.4, 24.1, 21.4, 20. , 20.8, 21.2, 20.3, 28. ,\n", 939 | " 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2, 23.6, 28.7,\n", 940 | " 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,\n", 941 | " 33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4,\n", 942 | " 19.8, 19.4, 21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2,\n", 943 | " 19.2, 20.4, 19.3, 22. , 20.3, 20.5, 17.3, 18.8, 21.4,\n", 944 | " 15.7, 16.2, 18. , 14.3, 19.2, 19.6, 23. , 18.4, 15.6,\n", 945 | " 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4, 15.6,\n", 946 | " 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3,\n", 947 | " 19.4, 17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. ,\n", 948 | " 50. , 50. , 22.7, 25. , 50. , 23.8, 23.8, 22.3, 17.4,\n", 949 | " 19.1, 23.1, 23.6, 22.6, 29.4, 23.2, 24.6, 29.9, 37.2,\n", 950 | " 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. , 32. , 29.8,\n", 951 | " 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,\n", 952 | " 34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. , 22.6, 24.4,\n", 953 | " 22.5, 24.4, 20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. ,\n", 954 | " 23.3, 28.7, 21.5, 23. , 26.7, 21.7, 27.5, 30.1, 44.8,\n", 955 | " 50. , 37.6, 31.6, 46.7, 31.5, 24.3, 31.7, 41.7, 48.3,\n", 956 | " 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1, 22.2,\n", 957 | " 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8,\n", 958 | " 29.6, 42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8,\n", 959 | " 43.1, 48.8, 31. , 36.5, 22.8, 30.7, 50. , 43.5, 20.7,\n", 960 | " 21.1, 25.2, 24.4, 35.2, 32.4, 32. , 33.2, 33.1, 29.1,\n", 961 | " 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. , 20.1, 23.2,\n", 962 | " 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,\n", 963 | " 20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4,\n", 964 | " 33.4, 28.2, 22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8,\n", 965 | " 16.2, 17.8, 19.8, 23.1, 21. , 23.8, 23.1, 20.4, 18.5,\n", 966 | " 25. , 24.6, 23. , 22.2, 19.3, 22.6, 19.8, 17.1, 19.4,\n", 967 | " 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7, 32.7,\n", 968 | " 16.5, 23.9, 31.2, 17.5, 17.2, 23.1, 24.5, 26.6, 22.9,\n", 969 | " 24.1, 18.6, 30.1, 18.2, 20.6, 17.8, 21.7, 22.7, 22.6,\n", 970 | " 25. , 19.9, 20.8, 16.8, 21.9, 27.5, 21.9, 23.1, 50. ,\n", 971 | " 50. , 50. , 50. , 50. , 13.8, 13.8, 15. , 13.9, 13.3,\n", 972 | " 13.1, 10.2, 10.4, 10.9, 11.3, 12.3, 8.8, 7.2, 10.5,\n", 973 | " 7.4, 10.2, 11.5, 15.1, 23.2, 9.7, 13.8, 12.7, 13.1,\n", 974 | " 12.5, 8.5, 5. , 6.3, 5.6, 7.2, 12.1, 8.3, 8.5,\n", 975 | " 5. , 11.9, 27.9, 17.2, 27.5, 15. , 17.2, 17.9, 16.3,\n", 976 | " 7. , 7.2, 7.5, 10.4, 8.8, 8.4, 16.7, 14.2, 20.8,\n", 977 | " 13.4, 11.7, 8.3, 10.2, 10.9, 11. , 9.5, 14.5, 14.1,\n", 978 | " 16.1, 14.3, 11.7, 13.4, 9.6, 8.7, 8.4, 12.8, 10.5,\n", 979 | " 17.1, 18.4, 15.4, 10.8, 11.8, 14.9, 12.6, 14.1, 13. ,\n", 980 | " 13.4, 15.2, 16.1, 17.8, 14.9, 14.1, 12.7, 13.5, 14.9,\n", 981 | " 20. , 16.4, 17.7, 19.5, 20.2, 21.4, 19.9, 19. , 19.1,\n", 982 | " 19.1, 20.1, 19.9, 19.6, 23.2, 29.8, 13.8, 13.3, 16.7,\n", 983 | " 12. , 14.6, 21.4, 23. , 23.7, 25. , 21.8, 20.6, 21.2,\n", 984 | " 19.1, 20.6, 15.2, 7. 
, 8.1, 13.6, 20.1, 21.8, 24.5,\n", 985 | " 23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9,\n", 986 | " 22. , 11.9])" 987 | ] 988 | }, 989 | "execution_count": 14, 990 | "metadata": {}, 991 | "output_type": "execute_result" 992 | } 993 | ], 994 | "source": [ 995 | "rgs2.predict(boston_features)" 996 | ] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": null, 1001 | "metadata": { 1002 | "collapsed": true 1003 | }, 1004 | "outputs": [], 1005 | "source": [] 1006 | } 1007 | ], 1008 | "metadata": { 1009 | "kernelspec": { 1010 | "display_name": "Python 2", 1011 | "language": "python", 1012 | "name": "python2" 1013 | }, 1014 | "language_info": { 1015 | "codemirror_mode": { 1016 | "name": "ipython", 1017 | "version": 2 1018 | }, 1019 | "file_extension": ".py", 1020 | "mimetype": "text/x-python", 1021 | "name": "python", 1022 | "nbconvert_exporter": "python", 1023 | "pygments_lexer": "ipython2", 1024 | "version": "2.7.12" 1025 | } 1026 | }, 1027 | "nbformat": 4, 1028 | "nbformat_minor": 1 1029 | } 1030 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /Linear Regression/README.md: -------------------------------------------------------------------------------- 1 | ## 目录 2 | - [1.线性回归](#1线性回归linear-regression案例) 3 | - [1.1什么是线性回归](#11什么是线性回归) 4 | - [1.2线性回归要解决什么问题](#12线性回归要解决什么问题) 5 | - [1.3线性回归的一般模型](#13线性回归的一般模型) 6 | - [1.4如何使用模型](#14如何使用模型) 7 | - [1.5模型计算](#15模型计算) 8 | - [1.6过拟合与欠拟合](#16过拟合与欠拟合underfitting-and-overfitting) 9 | - [1.7Python实现代码](https://github.com/mantchs/machine_learning_model/tree/master/Linear%20Regression/demo) 10 | 11 | ## 1.线性回归(Linear Regression)                                                [案例](https://github.com/mantchs/machine_learning_model/tree/master/Linear%20Regression/demo) 12 | 13 | ### 1.1什么是线性回归 14 | 15 | 我们首先用弄清楚什么是线性,什么是非线性。 16 | 17 | - 线性:两个变量之间的关系**是**一次函数关系的——图象**是直线**,叫做线性。 18 | 19 | **注意:题目的线性是指广义的线性,也就是数据与数据之间的关系。** 20 | 21 | - 非线性:两个变量之间的关系**不是**一次函数关系的——图象**不是直线**,叫做非线性。 22 | 23 | 相信通过以上两个概念大家已经很清楚了,其次我们经常说的回归回归到底是什么意思呢。 24 | 25 | - 回归:人们在测量事物的时候因为客观条件所限,求得的都是测量值,而不是事物真实的值,为了能够得到真实值,无限次的进行测量,最后通过这些测量数据计算**回归到真实值**,这就是回归的由来。 26 | 27 | 通俗的说就是用一个函数去逼近这个真实值,那又有人问了,线性回归不是用来做预测吗?是的,通过大量的数据我们是可以预测到**真实值**的。如果还是不明白,大家可以加一下我的**微信:wei15693176** 进行讨论。 28 | 29 | ### 1.2线性回归要解决什么问题 30 | 31 | 对大量的观测数据进行处理,从而得到比较符合事物内部规律的数学表达式。也就是说寻找到数据与数据之间的规律所在,从而就可以模拟出结果,也就是对结果进行预测。解决的就是通过已知的数据得到未知的结果。例如:对房价的预测、判断信用评价、电影票房预估等。 32 | 33 | ### 1.3线性回归的一般模型 34 | 35 | ![image.png](http://www.wailian.work/images/2018/12/10/1240.png) 36 | 37 | 大家看上面图片,图片上有很多个小点点,通过这些小点点我们很难预测当x值=某个值时,y的值是多少,我们无法得知,所以,数学家是很聪明的,是否能够找到一条直线来描述这些点的趋势或者分布呢?答案是肯定的。相信大家在学校的时候都学过这样的直线,只是当时不知道这个方程在现实中是可以用来预测很多事物的。 38 | 39 | 那么问题来了,什么是模型呢?先来看看下面这幅图。 40 | 41 | ![image.png](http://www.wailian.work/images/2018/12/10/12406aaf3.png) 42 | 43 | 假设数据就是**x**,结果是**y**,那中间的模型其实就是一个**方程**,这是一种片面的解释,但有助于我们去理解模型到底是个什么东西。以前在学校的时候总是不理解数学建模比赛到底在做些什么,现在理解了,是从题目给的数据中找到数据与数据之间的关系,建立数学方程模型,得到结果解决现实问题。其实是和机器学习中的模型是一样的意思。那么线性回归的一般模型是什么呢? 44 | 45 | ![image.png](http://www.wailian.work/images/2018/12/10/124062c0c.png) 46 | 47 | 模型神秘的面纱已经被我们揭开了,就是以上这个公式,不要被公式吓到,只要知道模型长什么样就行了。假设i=0,表示的是一元一次方程,是穿过坐标系中原点的一条直线,以此类推。 48 | 49 | ### 1.4如何使用模型 50 | 51 | 我们知道x是已知条件,通过公式求出y。已知条件其实就是我们的数据,以预测房价的案例来说明: 52 | 53 | ![image.png](http://www.wailian.work/images/2018/12/10/124020e1e.png) 54 | 55 | 上图给出的是某个地区房价的一些相关信息,有日期、房间数、建筑面积、房屋评分等特征,表里头的数据就是我们要的x1、x2、x3…….... 自然的表中的price列就是房屋的价格,也就是y。现在需要求的就是theta的值了,后续步骤都需要依赖计算机来训练求解。 56 | 57 | ### 1.5模型计算 58 | 59 | 当然,这些计算虽然复杂,但python库中有现成的函数直接调用就可以求解。我们为了理解内部的计算原理,就需要一步一步的来剖析计算过程。 60 | 61 | 为了容易理解模型,假设该模型是一元一次函数,我们把一组数据x和y带入模型中,会得到如下图所示线段。 62 | 63 | ![image.png](http://www.wailian.work/images/2018/12/10/124058856.png) 64 | 65 | 是不是觉得这条直线拟合得不够好?显然最好的效果应该是这条直线穿过所有的点才是,需要对模型进行优化,这里我们要引入一个概念。 66 | 67 | - **损失函数**:是用来估量你模型的预测值 f(x)与真实值 YY 的不一致程度,损失函数越小,模型的效果就越好。 68 | 69 | ![image-20181209191632406](http://www.wailian.work/images/2018/12/10/image.png) 70 | 71 | 不要看公式很复杂,其实就是一句话,(预测值-真实值)的平法和的平均值,换句话说就是点到直线距离和最小。用一幅图来表示: 72 | 73 | ![image.png](http://www.wailian.work/images/2018/12/10/1240f2d47.png) 74 | 75 | **解释**:一开始损失函数是比较大的,但随着直线的不断变化(模型不断训练),损失函数会越来越小,从而达到极小值点,也就是我们要得到的最终模型。 76 | 77 | 这种方法我们统称为**梯度下降法**。随着模型的不断训练,损失函数的梯度越来越平,直至极小值点,点到直线的距离和最小,所以这条直线就会经过所有的点,这就是我们要求的模型(函数)。 78 | 79 | 以此类推,高维的线性回归模型也是一样的,利用梯度下降法优化模型,寻找极值点,这就是模型训练的过程。 80 | 81 | ### 1.6过拟合与欠拟合(underfitting and overfitting) 82 | 83 | 在机器学习模型训练当中,模型的泛化能力越强,就越能说明这个模型表现很好。什么是模型的泛化能力? 
84 | 85 | - **模型的泛化能力**:机器学习模型学习到的概念在它处于学习的过程中时模型没有遇见过的样本时候的表现。 86 | 87 | 模型的泛化能力直接导致了模型会过拟合与欠拟合的情况。让我们来看看一下情况: 88 | 89 | ![](http://www.wailian.work/images/2018/12/10/imageb8d54.png) 90 | 91 | 我们的目标是要实现点到直线的平方和最小,那通过以上图示显然可以看出中间那幅图的拟合程度很好,最左边的情况属于欠拟合,最右边的情况属于过拟合。 92 | 93 | - **欠拟合**:训练集的预测值,与训练集的真实值有不少的误差,称之为欠拟合。 94 | - **过拟合**:训练集的预测值,完全贴合训练集的真实值,称之为过拟合。 95 | 96 | 欠拟合已经很明白了,就是误差比较大,而过拟合呢是训练集上表现得很好,换一批数据进行预测结果就很不理想了,泛化泛化说的就是一个通用性。 97 | 98 | #### 解决方法 99 | 100 | 使用正则化项,也就是给梯度下降公式加上一个参数,即: 101 | 102 | ![](http://www.wailian.work/images/2018/12/10/image7d03d.png) 103 | 104 | 加入这个正则化项好处: 105 | 106 | - 控制参数幅度,不让模型“无法无天”。 107 | - 限制参数搜索空间 108 | - 解决欠拟合与过拟合的问题。 109 | 110 | 看到这里是不是觉得很麻烦,我之前说过现在是解释线性回归模型的原理与优化,但是到了真正使用上这些方法是一句话的事,因为这些计算库别人已经准备好了,感谢开源吧! 111 | 112 | ### [1.7Python实现代码](https://github.com/mantchs/machine_learning_model/tree/master/Linear%20Regression/demo) 113 | 114 |
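To make the ideas in this section concrete (a squared-error loss minimized by gradient descent, plus a regularization term to rein in the parameters), here is a minimal sketch using scikit-learn's `Ridge`, which is ordinary linear regression with an L2 penalty added to the loss. The toy data and the `alpha` value are placeholders for illustration only and are not part of the original demo.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Toy data standing in for the housing features/prices used in the demo
rng = np.random.RandomState(0)
X = rng.rand(100, 3)                                   # 100 samples, 3 features
y = X @ np.array([3.0, -2.0, 1.5]) + 0.1 * rng.randn(100)

# Ridge = least squares + an L2 penalty; alpha controls how strongly the
# parameters are shrunk (alpha=0 falls back to plain linear regression)
model = Ridge(alpha=1.0)
model.fit(X, y)

preds = model.predict(X)
print("MSE:", mean_squared_error(y, preds))
print("coefficients:", model.coef_)
```

Swapping `Ridge` for `Lasso` gives the L1-penalized variant of the same idea (see the Regularization folder of this repository).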
115 |
116 |
117 |
118 | 119 | ![image.png](https://upload-images.jianshu.io/upload_images/13876065-08b587647d14267c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) 120 | 121 | 欢迎添加微信交流!请备注“机器学习”。 122 | 123 | 124 | -------------------------------------------------------------------------------- /Linear Regression/demo/README.md: -------------------------------------------------------------------------------- 1 | ## 目录 2 | - [1.题目](#1题目) 3 | - [2.步骤](#2步骤) 4 | - [3.模型选择](#3模型选择) 5 | - [4.环境配置](#4环境配置) 6 | - [5.csv数据处理](#5csv数据处理) 7 | - [6.数据处理](#6数据处理) 8 | - [7.模型训练](#7模型训练) 9 | - [8.完整代码](https://github.com/mantchs/machine_learning_model/blob/master/Linear%20Regression/demo/housing_price.py) 10 | 11 |         这篇介绍的是我在做房价预测模型时的python代码,房价预测在机器学习入门中已经是个经典的题目了,但我发现目前网上还没有能够很好地做一个demo出来,使得入门者不能很快的找到“入口”在哪,所以在此介绍我是如何做的预测房价模型的题目,仅供参考。 12 | ## 1.题目: 13 |         从给定的房屋基本信息以及房屋销售信息等,建立一个回归模型预测房屋的销售价格。 14 | 数据下载请点击:[下载](https://pan.baidu.com/share/init?surl=kVdwI3d),密码:mfqy。 15 | - **数据说明**: 16 | 数据主要包括2014年5月至2015年5月美国King County的房屋销售价格以及房屋的基本信息。 17 | 数据分为训练数据和测试数据,分别保存在kc_train.csv和kc_test.csv两个文件中。 18 | 其中训练数据主要包括10000条记录,14个字段,主要字段说明如下: 19 | 第一列“销售日期”:2014年5月到2015年5月房屋出售时的日期 20 | 第二列“销售价格”:房屋交易价格,单位为美元,是目标预测值 21 | 第三列“卧室数”:房屋中的卧室数目 22 | 第四列“浴室数”:房屋中的浴室数目 23 | 第五列“房屋面积”:房屋里的生活面积 24 | 第六列“停车面积”:停车坪的面积 25 | 第七列“楼层数”:房屋的楼层数 26 | 第八列“房屋评分”:King County房屋评分系统对房屋的总体评分 27 | 第九列“建筑面积”:除了地下室之外的房屋建筑面积 28 | 第十列“地下室面积”:地下室的面积 29 | 第十一列“建筑年份”:房屋建成的年份 30 | 第十二列“修复年份”:房屋上次修复的年份 31 | 第十三列"纬度":房屋所在纬度 32 | 第十四列“经度”:房屋所在经度 33 | 34 |         测试数据主要包括3000条记录,13个字段,跟训练数据的不同是测试数据并不包括房屋销售价格,学员需要通过由训练数据所建立的模型以及所给的测试数据,得出测试数据相应的房屋销售价格预测值。 35 | 36 | ## 2.步骤 37 | ![](http://www.wailian.work/images/2018/12/10/12400f554.png) 38 | 39 | - 1.选择合适的模型,对模型的好坏进行评估和选择。 40 | - 2.对缺失的值进行补齐操作,可以使用均值的方式补齐数据,使得准确度更高。 41 | - 3.数据的取值一般跟属性有关系,但世界万物的属性是很多的,有些值小,但不代表不重要,所有为了提高预测的准确度,统一数据维度进行计算,方法有特征缩放和归一法等。 42 | - 4.数据处理好之后就可以进行调用模型库进行训练了。 43 | - 5.使用测试数据进行目标函数预测输出,观察结果是否符合预期。或者通过画出对比函数进行结果线条对比。 44 | 45 | ## 3.模型选择 46 | 这里我们选择多元线性回归模型。公式如下:选择多元线性回归模型。 47 | ![](http://www.wailian.work/images/2018/12/10/12409d868.png) 48 | 49 | y表示我们要求的销售价格,x表示特征值。需要调用sklearn库来进行训练。 50 | 51 | 52 | ## 4.环境配置 53 | - python3.5 54 | - numpy库 55 | - pandas库 56 | - matplotlib库进行画图 57 | - seaborn库 58 | - sklearn库 59 | 60 | ## 5.csv数据处理 61 | 下载的是两个数据文件,一个是真实数据,一个是测试数据,打开*kc_train.csv*,能够看到第二列是销售价格,而我们要预测的就是销售价格,所以在训练过程中是不需要销售价格的,把第二列删除掉,新建一个csv文件存放销售价格这一列,作为后面的结果对比。 62 | 63 | ## 6.数据处理 64 | 首先先读取数据,查看数据是否存在缺失值,然后进行特征缩放统一数据维度。代码如下:(注:最后会给出完整代码) 65 | ```python 66 | #读取数据 67 | housing = pd.read_csv('kc_train.csv') 68 | target=pd.read_csv('kc_train2.csv') #销售价格 69 | t=pd.read_csv('kc_test.csv') #测试数据 70 | 71 | #数据预处理 72 | housing.info() #查看是否有缺失值 73 | 74 | #特征缩放 75 | from sklearn.preprocessing import MinMaxScaler 76 | minmax_scaler=MinMaxScaler() 77 | minmax_scaler.fit(housing) #进行内部拟合,内部参数会发生变化 78 | scaler_housing=minmax_scaler.transform(housing) 79 | scaler_housing=pd.DataFrame(scaler_housing,columns=housing.columns) 80 | ``` 81 | 82 | ## 7.模型训练 83 | 使用sklearn库的线性回归函数进行调用训练。梯度下降法获得误差最小值。最后使用均方误差法来评价模型的好坏程度,并画图进行比较。 84 | ```python 85 | #选择基于梯度下降的线性回归模型 86 | from sklearn.linear_model import LinearRegression 87 | LR_reg=LinearRegression() 88 | #进行拟合 89 | LR_reg.fit(scaler_housing,target) 90 | 91 | 92 | #使用均方误差用于评价模型好坏 93 | from sklearn.metrics import mean_squared_error 94 | preds=LR_reg.predict(scaler_housing) #输入数据进行预测得到结果 95 | mse=mean_squared_error(preds,target) #使用均方误差来评价模型好坏,可以输出mse进行查看评价值 96 | 97 | #绘图进行比较 98 | plot.figure(figsize=(10,7)) #画布大小 99 | num=100 100 | 
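# Plot the first `num` true prices against the model's predictions to eyeball the fit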
x=np.arange(1,num+1) #取100个点进行比较 101 | plot.plot(x,target[:num],label='target') #目标取值 102 | plot.plot(x,preds[:num],label='preds') #预测取值 103 | plot.legend(loc='upper right') #线条显示位置 104 | plot.show() 105 | ``` 106 | 最后输出的图是这样的: 107 | ![](http://www.wailian.work/images/2018/12/10/124094e96.png) 108 | 从这张结果对比图中就可以看出模型是否得到精确的目标函数,是否能够精确预测房价。 109 | - 如果想要预测test文件里的数据,那就把test文件里的数据进行读取,并且进行特征缩放,调用: 110 | **LR_reg.predict(test)** 111 | 就可以得到预测结果,并进行输出操作。 112 | - 到这里可以看到机器学习也不是不能够学会,只要深入研究和总结,就能够找到学习的方法,重要的是总结,最后就是调用一些机器学习的方法库就行了,当然这只是入门级的,我觉得入门级的写到这已经足够了,很多人都能够看得懂,代码量不多。但要理解线性回归的概念性东西还是要多看资料。 113 | 114 | ## [8.完整代码](https://github.com/mantchs/machine_learning_model/blob/master/Linear%20Regression/demo/housing_price.py) 115 | 116 |
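The prediction step described above can be sketched as follows. This assumes the training code from section 7 has already run, so `LR_reg` is fitted; following the demo script, a fresh `MinMaxScaler` is simply refit on the test file.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

t = pd.read_csv('kc_test.csv')             # test data, 13 columns, no price column
mm = MinMaxScaler()
scaler_t = pd.DataFrame(mm.fit_transform(t), columns=t.columns)

result = LR_reg.predict(scaler_t)          # predicted sale prices
pd.DataFrame(result).to_csv('result.csv')  # write them out for inspection
```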
117 |
118 |
119 |
120 | 121 | ![image.png](https://upload-images.jianshu.io/upload_images/13876065-08b587647d14267c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) 122 | 123 | 欢迎添加微信交流!请备注“机器学习”。 124 | -------------------------------------------------------------------------------- /Linear Regression/demo/housing_price.py: -------------------------------------------------------------------------------- 1 | # 兼容 pythone2,3 2 | from __future__ import print_function 3 | 4 | # 导入相关python库 5 | import os 6 | import numpy as np 7 | import pandas as pd 8 | 9 | #设定随机数种子 10 | np.random.seed(36) 11 | 12 | #使用matplotlib库画图 13 | import matplotlib 14 | import seaborn 15 | import matplotlib.pyplot as plot 16 | 17 | from sklearn import datasets 18 | 19 | 20 | #读取数据 21 | housing = pd.read_csv('kc_train.csv') 22 | target=pd.read_csv('kc_train2.csv') #销售价格 23 | t=pd.read_csv('kc_test.csv') #测试数据 24 | 25 | #数据预处理 26 | housing.info() #查看是否有缺失值 27 | 28 | #特征缩放 29 | from sklearn.preprocessing import MinMaxScaler 30 | minmax_scaler=MinMaxScaler() 31 | minmax_scaler.fit(housing) #进行内部拟合,内部参数会发生变化 32 | scaler_housing=minmax_scaler.transform(housing) 33 | scaler_housing=pd.DataFrame(scaler_housing,columns=housing.columns) 34 | 35 | mm=MinMaxScaler() 36 | mm.fit(t) 37 | scaler_t=mm.transform(t) 38 | scaler_t=pd.DataFrame(scaler_t,columns=t.columns) 39 | 40 | 41 | 42 | #选择基于梯度下降的线性回归模型 43 | from sklearn.linear_model import LinearRegression 44 | LR_reg=LinearRegression() 45 | #进行拟合 46 | LR_reg.fit(scaler_housing,target) 47 | 48 | 49 | #使用均方误差用于评价模型好坏 50 | from sklearn.metrics import mean_squared_error 51 | preds=LR_reg.predict(scaler_housing) #输入数据进行预测得到结果 52 | mse=mean_squared_error(preds,target) #使用均方误差来评价模型好坏,可以输出mse进行查看评价值 53 | 54 | #绘图进行比较 55 | plot.figure(figsize=(10,7)) #画布大小 56 | num=100 57 | x=np.arange(1,num+1) #取100个点进行比较 58 | plot.plot(x,target[:num],label='target') #目标取值 59 | plot.plot(x,preds[:num],label='preds') #预测取值 60 | plot.legend(loc='upper right') #线条显示位置 61 | plot.show() 62 | 63 | 64 | #输出测试数据 65 | result=LR_reg.predict(scaler_t) 66 | df_result=pd.DataFrame(result) 67 | df_result.to_csv("result.csv") 68 | -------------------------------------------------------------------------------- /Logistic Regression/README.md: -------------------------------------------------------------------------------- 1 | ## 目录 2 | - [1.逻辑回归](#1逻辑回归logistic-regression案例) 3 | - [1.1逻辑回归与线性回归的关系](#11逻辑回归与线性回归的关系) 4 | - [1.2损失函数](#12损失函数) 5 | - [1.3多分类问题](#13多分类问题one-vs-rest) 6 | - [1.4逻辑回归的一些经验](#14逻辑回归lr的一些经验) 7 | - [1.5LR的应用](#15lr的应用) 8 | - [1.6Python代码实现](https://github.com/mantchs/machine_learning_model/blob/master/Logistic%20Regression/demo/CreditScoring.ipynb) 9 | 10 | ## 1.逻辑回归(Logistic Regression)                                                [案例](https://github.com/mantchs/machine_learning_model/blob/master/Logistic%20Regression/demo/CreditScoring.ipynb) 11 | 12 | 13 | 14 | ### 1.1逻辑回归与线性回归的关系 15 | 16 | 逻辑回归是用来做分类算法的,大家都熟悉线性回归,一般形式是Y=aX+b,y的取值范围是[-∞, +∞],有这么多取值,怎么进行分类呢?不用担心,伟大的数学家已经为我们找到了一个方法。 17 | 18 | 首先我们先来看一个函数,这个函数叫做Sigmoid函数: 19 | 20 | ![](http://www.wailian.work/images/2018/12/10/image061f6.png) 21 | 22 | 函数中t无论取什么值,其结果都在[0,-1]的区间内,回想一下,一个分类问题就有两种答案,一种是“是”,一种是“否”,那0对应着“否”,1对应着“是”,那又有人问了,你这不是[0,1]的区间吗,怎么会只有0和1呢?这个问题问得好,我们假设分类的阈值是0.5,那么超过0.5的归为1分类,低于0.5的归为0分类,阈值是可以自己设定的。 23 | 24 | 好了,接下来我们把aX+b带入t中就得到了我们的逻辑回归的一般模型方程: 25 | 26 | ![](http://www.wailian.work/images/2018/12/10/image4e5fa.png) 27 | 28 | 结果P也可以理解为概率,换句话说概率大于0.5的属于1分类,概率小于0.5的属于0分类,这就达到了分类的目的。 29 | 30 | ### 1.2损失函数 31 | 32 | 逻辑回归的损失函数跟其它的不同,先一睹尊容: 33 | 34 | 
![](http://www.wailian.work/images/2018/12/10/imagedbfb5.png) 35 | 36 | 解释一下,当真实值为1分类时,用第一个方程来表示损失函数;当真实值为0分类时,用第二个方程来表示损失函数,为什么要加上log函数呢?可以试想一下,当真实样本为1是,但h=0概率,那么log0=∞,这就对模型最大的惩罚力度;当h=1时,那么log1=0,相当于没有惩罚,也就是没有损失,达到最优结果。所以数学家就想出了用log函数来表示损失函数,把上述两式合并起来就是如下函数,并加上正则化项: 37 | 38 | ![](https://www.wailian.work/images/2018/12/10/image8771f.png) 39 | 40 | 最后按照梯度下降法一样,求解极小值点,得到想要的模型效果。 41 | 42 | ### 1.3多分类问题(one vs rest) 43 | 44 | 其实我们可以从二分类问题过度到多分类问题,思路步骤如下: 45 | 46 | 1.将类型class1看作正样本,其他类型全部看作负样本,然后我们就可以得到样本标记类型为该类型的概率p1。 47 | 48 | 2.然后再将另外类型class2看作正样本,其他类型全部看作负样本,同理得到p2。 49 | 50 | 3.以此循环,我们可以得到该待预测样本的标记类型分别为类型class i时的概率pi,最后我们取pi中最大的那个概率对应的样本标记类型作为我们的待预测样本类型。 51 | 52 | ![](https://www.wailian.work/images/2018/12/10/image31617.png) 53 | 54 | 总之还是以二分类来依次划分,并求出概率结果。 55 | 56 | ### 1.4逻辑回归(LR)的一些经验 57 | 58 | - 模型本身并没有好坏之分。 59 | - LR能以概率的形式输出结果,而非只是0,1判定。 60 | - LR的可解释性强,可控度高(你要给老板讲的嘛…)。 61 | - 训练快,feature engineering之后效果赞。 62 | - 因为结果是概率,可以做ranking model。 63 | 64 | ### 1.5LR的应用 65 | 66 | - CTR预估/推荐系统的learning to rank/各种分类场景。 67 | - 某搜索引擎厂的广告CTR预估基线版是LR。 68 | - 某电商搜索排序/广告CTR预估基线版是LR。 69 | - 某电商的购物搭配推荐用了大量LR。 70 | - 某现在一天广告赚1000w+的新闻app排序基线是LR。 71 | 72 | ### [1.6Python代码实现](https://github.com/mantchs/machine_learning_model/blob/master/Logistic%20Regression/demo/CreditScoring.ipynb) 73 | 74 |
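To make the formulas above concrete, here is a small NumPy sketch (not part of the original notebook) that evaluates the sigmoid and the corresponding log loss on a few made-up values; the numbers are for illustration only.

```python
import numpy as np

def sigmoid(t):
    # squashes any real number into (0, 1); we read the output as P(y = 1)
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(y_true, p):
    # -log(p) when the true label is 1, -log(1 - p) when it is 0:
    # a confident but wrong prediction is penalized heavily, a correct one costs almost nothing
    p = np.clip(p, 1e-12, 1 - 1e-12)       # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

z = np.array([-2.0, 0.0, 3.0])             # e.g. aX + b for three samples
print(sigmoid(z))                          # roughly [0.119, 0.5, 0.953]

y = np.array([1, 0, 1])
print(log_loss(y, sigmoid(z)))             # average penalty over the three samples
```

With the usual 0.5 threshold, a probability like 0.953 is read as class 1 and 0.119 as class 0; the credit-scoring notebook later lowers this threshold to 0.3 to make the classifier more sensitive.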
75 |
76 |
77 |
78 | 79 | ![image.png](https://upload-images.jianshu.io/upload_images/13876065-08b587647d14267c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) 80 | 81 | 欢迎添加微信交流!请备注“机器学习”。 82 | -------------------------------------------------------------------------------- /Logistic Regression/demo/CreditScoring.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## machine learning for credit scoring\n", 8 | "\n", 9 | "\n", 10 | "Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. \n", 11 | "\n", 12 | "Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)\n", 13 | "\n", 14 | "Attribute Information:\n", 15 | "\n", 16 | "|Variable Name\t|\tDescription\t|\tType|\n", 17 | "|----|----|----|\n", 18 | "|SeriousDlqin2yrs\t|\tPerson experienced 90 days past due delinquency or worse \t|\tY/N|\n", 19 | "|RevolvingUtilizationOfUnsecuredLines\t|\tTotal balance on credit divided by the sum of credit limits\t|\tpercentage|\n", 20 | "|age\t|\tAge of borrower in years\t|\tinteger|\n", 21 | "|NumberOfTime30-59DaysPastDueNotWorse\t|\tNumber of times borrower has been 30-59 days past due |\tinteger|\n", 22 | "|DebtRatio\t|\tMonthly debt payments\t|\tpercentage|\n", 23 | "|MonthlyIncome\t|\tMonthly income\t|\treal|\n", 24 | "|NumberOfOpenCreditLinesAndLoans\t|\tNumber of Open loans |\tinteger|\n", 25 | "|NumberOfTimes90DaysLate\t|\tNumber of times borrower has been 90 days or more past due.\t|\tinteger|\n", 26 | "|NumberRealEstateLoansOrLines\t|\tNumber of mortgage and real estate loans\t|\tinteger|\n", 27 | "|NumberOfTime60-89DaysPastDueNotWorse\t|\tNumber of times borrower has been 60-89 days past due |integer|\n", 28 | "|NumberOfDependents\t|\tNumber of dependents in family\t|\tinteger|\n" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "Read the data into Pandas " 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 1, 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "data": { 45 | "text/html": [ 46 | "
\n", 47 | "\n", 60 | "\n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | "
SeriousDlqin2yrsRevolvingUtilizationOfUnsecuredLinesageNumberOfTime30-59DaysPastDueNotWorseDebtRatioMonthlyIncomeNumberOfOpenCreditLinesAndLoansNumberOfTimes90DaysLateNumberRealEstateLoansOrLinesNumberOfTime60-89DaysPastDueNotWorseNumberOfDependents
010.76612745.02.00.8029829120.013.00.06.00.02.0
100.95715140.00.00.1218762600.04.00.00.00.01.0
200.65818038.01.00.0851133042.02.01.00.00.00.0
300.23381030.00.00.0360503300.05.00.00.00.00.0
400.90723949.01.00.02492663588.07.00.01.00.00.0
\n", 150 | "
" 151 | ], 152 | "text/plain": [ 153 | " SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age \\\n", 154 | "0 1 0.766127 45.0 \n", 155 | "1 0 0.957151 40.0 \n", 156 | "2 0 0.658180 38.0 \n", 157 | "3 0 0.233810 30.0 \n", 158 | "4 0 0.907239 49.0 \n", 159 | "\n", 160 | " NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome \\\n", 161 | "0 2.0 0.802982 9120.0 \n", 162 | "1 0.0 0.121876 2600.0 \n", 163 | "2 1.0 0.085113 3042.0 \n", 164 | "3 0.0 0.036050 3300.0 \n", 165 | "4 1.0 0.024926 63588.0 \n", 166 | "\n", 167 | " NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate \\\n", 168 | "0 13.0 0.0 \n", 169 | "1 4.0 0.0 \n", 170 | "2 2.0 1.0 \n", 171 | "3 5.0 0.0 \n", 172 | "4 7.0 0.0 \n", 173 | "\n", 174 | " NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse \\\n", 175 | "0 6.0 0.0 \n", 176 | "1 0.0 0.0 \n", 177 | "2 0.0 0.0 \n", 178 | "3 0.0 0.0 \n", 179 | "4 1.0 0.0 \n", 180 | "\n", 181 | " NumberOfDependents \n", 182 | "0 2.0 \n", 183 | "1 1.0 \n", 184 | "2 0.0 \n", 185 | "3 0.0 \n", 186 | "4 0.0 " 187 | ] 188 | }, 189 | "execution_count": 1, 190 | "metadata": {}, 191 | "output_type": "execute_result" 192 | } 193 | ], 194 | "source": [ 195 | "import pandas as pd\n", 196 | "pd.set_option('display.max_columns', 500)\n", 197 | "import zipfile\n", 198 | "with zipfile.ZipFile('KaggleCredit2.csv.zip', 'r') as z: ##读取zip里的文件\n", 199 | " f = z.open('KaggleCredit2.csv')\n", 200 | " data = pd.read_csv(f, index_col=0)\n", 201 | "data.head()" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 2, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "data": { 211 | "text/plain": [ 212 | "(112915, 11)" 213 | ] 214 | }, 215 | "execution_count": 2, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "data.shape" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "Drop na" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 3, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "data": { 238 | "text/plain": [ 239 | "SeriousDlqin2yrs 0\n", 240 | "RevolvingUtilizationOfUnsecuredLines 0\n", 241 | "age 4267\n", 242 | "NumberOfTime30-59DaysPastDueNotWorse 0\n", 243 | "DebtRatio 0\n", 244 | "MonthlyIncome 0\n", 245 | "NumberOfOpenCreditLinesAndLoans 0\n", 246 | "NumberOfTimes90DaysLate 0\n", 247 | "NumberRealEstateLoansOrLines 0\n", 248 | "NumberOfTime60-89DaysPastDueNotWorse 0\n", 249 | "NumberOfDependents 4267\n", 250 | "dtype: int64" 251 | ] 252 | }, 253 | "execution_count": 3, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | } 257 | ], 258 | "source": [ 259 | "data.isnull().sum(axis=0)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 4, 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "data": { 269 | "text/plain": [ 270 | "(108648, 11)" 271 | ] 272 | }, 273 | "execution_count": 4, 274 | "metadata": {}, 275 | "output_type": "execute_result" 276 | } 277 | ], 278 | "source": [ 279 | "data.dropna(inplace=True) ##去掉为空的数据\n", 280 | "data.shape" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "Create X and y" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 5, 293 | "metadata": { 294 | "collapsed": true 295 | }, 296 | "outputs": [], 297 | "source": [ 298 | "y = data['SeriousDlqin2yrs']\n", 299 | "X = data.drop('SeriousDlqin2yrs', axis=1)" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | 
"execution_count": 6, 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "data": { 309 | "text/plain": [ 310 | "0.06742876076872101" 311 | ] 312 | }, 313 | "execution_count": 6, 314 | "metadata": {}, 315 | "output_type": "execute_result" 316 | } 317 | ], 318 | "source": [ 319 | "y.mean() ##求取均值" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "# 练习1\n", 327 | "\n", 328 | "把数据切分成训练集和测试集" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 7, 334 | "metadata": {}, 335 | "outputs": [ 336 | { 337 | "name": "stdout", 338 | "output_type": "stream", 339 | "text": [ 340 | "(21730, 10)\n" 341 | ] 342 | } 343 | ], 344 | "source": [ 345 | "from sklearn import model_selection\n", 346 | "x_tran,x_test,y_tran,y_test=model_selection.train_test_split(X,y,test_size=0.2)\n", 347 | "print(x_test.shape)" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "# 练习2\n", 355 | "使用logistic regression/决策树/SVM/KNN...等sklearn分类算法进行分类,尝试查sklearn API了解模型参数含义,调整不同的参数。" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 8, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "name": "stdout", 365 | "output_type": "stream", 366 | "text": [ 367 | "0.9323730412572769\n" 368 | ] 369 | }, 370 | { 371 | "name": "stderr", 372 | "output_type": "stream", 373 | "text": [ 374 | "/usr/local/lib/python3.5/dist-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge\n", 375 | " \"the coef_ did not converge\", ConvergenceWarning)\n" 376 | ] 377 | } 378 | ], 379 | "source": [ 380 | "from sklearn.linear_model import LogisticRegression\n", 381 | "## https://blog.csdn.net/sun_shengyun/article/details/53811483\n", 382 | "lr=LogisticRegression(multi_class='ovr',solver='sag',class_weight='balanced')\n", 383 | "lr.fit(x_tran,y_tran)\n", 384 | "score=lr.score(x_tran,y_tran)\n", 385 | "print(score) ##最好的分数是1" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "# 练习3\n", 393 | "在测试集上进行预测,计算准确度" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 9, 399 | "metadata": {}, 400 | "outputs": [ 401 | { 402 | "name": "stdout", 403 | "output_type": "stream", 404 | "text": [ 405 | "训练集准确率: 0.9323730412572769\n", 406 | "测试集准确率: 0.9332719742291763\n" 407 | ] 408 | } 409 | ], 410 | "source": [ 411 | "from sklearn.metrics import accuracy_score\n", 412 | "## https://blog.csdn.net/qq_16095417/article/details/79590455\n", 413 | "train_score=accuracy_score(y_tran,lr.predict(x_tran))\n", 414 | "test_score=lr.score(x_test,y_test)\n", 415 | "print('训练集准确率:',train_score)\n", 416 | "print('测试集准确率:',test_score)" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "# 练习4\n", 424 | "查看sklearn的官方说明,了解分类问题的评估标准,并对此例进行评估。" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 10, 430 | "metadata": {}, 431 | "outputs": [ 432 | { 433 | "name": "stdout", 434 | "output_type": "stream", 435 | "text": [ 436 | "训练集召回率: 0.4999938302834368\n", 437 | "测试集召回率: 0.4999753463833144\n" 438 | ] 439 | } 440 | ], 441 | "source": [ 442 | "##召回率\n", 443 | "from sklearn.metrics import recall_score\n", 444 | "train_recall=recall_score(y_tran,lr.predict(x_tran),average='macro')\n", 445 | "test_recall=recall_score(y_test,lr.predict(x_test),average='macro')\n", 446 | "print('训练集召回率:',train_recall)\n", 447 | 
"print('测试集召回率:',test_recall)" 448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "# 练习5\n", 455 | "\n", 456 | "银行通常会有更严格的要求,因为fraud带来的后果通常比较严重,一般我们会调整模型的标准。
\n", 457 | "比如在logistic regression当中,一般我们的概率判定边界为0.5,但是我们可以把阈值设定低一些,来提高模型的“敏感度”,试试看把阈值设定为0.3,再看看这时的评估指标(主要是准确率和召回率)。\n", 458 | "\n", 459 | "tips:sklearn的很多分类模型,predict_prob可以拿到预估的概率,可以根据它和设定的阈值大小去判断最终结果(分类类别)" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": 11, 465 | "metadata": {}, 466 | "outputs": [ 467 | { 468 | "name": "stdout", 469 | "output_type": "stream", 470 | "text": [ 471 | "0.9333179935572941\n" 472 | ] 473 | } 474 | ], 475 | "source": [ 476 | "import numpy as np\n", 477 | "y_pro=lr.predict_proba(x_test) ##获取预测概率值\n", 478 | "y_prd2 = [list(p>=0.3).index(1) for i,p in enumerate(y_pro)] ##设定0.3阈值,把大于0.3的看成1分类。\n", 479 | "train_score=accuracy_score(y_test,y_prd2)\n", 480 | "print(train_score)" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": null, 486 | "metadata": { 487 | "collapsed": true 488 | }, 489 | "outputs": [], 490 | "source": [] 491 | } 492 | ], 493 | "metadata": { 494 | "kernelspec": { 495 | "display_name": "Python 3", 496 | "language": "python", 497 | "name": "python3" 498 | }, 499 | "language_info": { 500 | "codemirror_mode": { 501 | "name": "ipython", 502 | "version": 3 503 | }, 504 | "file_extension": ".py", 505 | "mimetype": "text/x-python", 506 | "name": "python", 507 | "nbconvert_exporter": "python", 508 | "pygments_lexer": "ipython3", 509 | "version": "3.5.2" 510 | } 511 | }, 512 | "nbformat": 4, 513 | "nbformat_minor": 1 514 | } 515 | -------------------------------------------------------------------------------- /Logistic Regression/demo/KaggleCredit2.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mantchs/machine_learning_model/4fc08c9f5f6ce644a11f0945b3ac06506f13ea10/Logistic Regression/demo/KaggleCredit2.csv.zip -------------------------------------------------------------------------------- /Logistic Regression/demo/README.md: -------------------------------------------------------------------------------- 1 | ## Kaggle上简单的金融信用分类,代码都有注释,此处不再赘述! 2 | -------------------------------------------------------------------------------- /Model Ensemble/README.md: -------------------------------------------------------------------------------- 1 | ## 目录 2 | - [1.信用卡欺诈预测案例](#1信用卡欺诈预测案例) 3 | - [2.模型集成](#2模型集成model-ensemble) 4 | - [2.1Bagging](#21bagging) 5 | - [2.2Stacking](#22stacking) 6 | - [2.3Adaboost](#23adaboost) 7 | - [2.4图解模型集成](#24图解模型集成) 8 | - [3.案例总流程](#3案例总流程) 9 | - [4.初始化工作](#4初始化工作) 10 | - [5.数据下采样](#5数据下采样) 11 | - [6.模型训练](#6模型训练) 12 | - [6.1KNN](#61knn) 13 | - [6.2SVM-RBF](#62--svm-rbf) 14 | - [6.3SVM-POLY](#63-svm-poly) 15 | - [6.4Logistic Regression](#64-logistic-regression) 16 | - [6.5Random Forest](#65-random-forest) 17 | - [6.6决策边界](#66决策边界) 18 | - [6.7模型建模](#67-模型建模) 19 | - [7.结果](#7结果) 20 | - [7.1预测](#71预测) 21 | - [7.2模型评估](#72模型评估) 22 | - [8.完整代码]() 23 | - [数据集下载](https://v2.fangcloud.com/share/a63342d8bd816c43f281dab455) 24 | - [代码](https://github.com/mantchs/machine_learning_model/blob/master/Model%20Ensemble/kaggle_credict.ipynb) 25 | 26 | ## 1.信用卡欺诈预测案例 27 | 28 | 这是一道kaggle上的题目。 29 | 30 | 我们都知道信用卡,能够透支一大笔钱来供自己消费,正因为这一点,不法分子就利用信用卡进一特性来实施欺诈行为。银行为了能够检测出这一欺诈行为,通过机器学习模型进行智能识别,提前冻结该账户,避免造成银行的损失。那么我们应该通过什么方式来提高这种识别精度呢!这就是今天要说的主题,多模型融合预测。使用到的模型算法有:**KNN、SVM、Logistic Regression(LR)、Random Forest**。 31 | 32 | 我会讲到**如何使用多模型进行融合计算(模型集成)、模型评估、超参数调节、K折交叉验证**等,力求能够讲得清楚,希望大家通过这篇博文能够了解到一个完整的机器学习算法到底是怎样的,如有讲得不到位亦或是错误的地方,望告知! 
33 | 34 | 以下我们正式开始介绍。 35 | 36 | **数据集下载:**https://v2.fangcloud.com/share/a63342d8bd816c43f281dab455 37 | 38 | [GitHub完整代码](https://github.com/mantchs/machine_learning_model/blob/master/Model%20Ensemble/kaggle_credict.ipynb) 39 | 40 | ## 2.模型集成(model ensemble) 41 | 42 | 我们先从概念着手,这是我们的地基,要建起高楼大厦,首先地基要稳。 43 | 44 | - **多模型:**分类问题是以多个模型计算出的结果进行投票决定最终答案,线性问题以多个模型计算出来的结果求取均值作为预测数值。 45 | 46 | 那么多模型融合存在着多种实现方法:**Bagging思想、Stacking、Adaboost。** 47 | 48 | ### 2.1Bagging 49 | 50 | Bagging是bootstrap aggregating。Bagging思想就是从总体样本当中随机取一部分样本进行训练,通过多次这样的结果,进行投票亦或求取平均值作为结果输出,这就极大可能的避免了不好的样本数据,从而提高准确度。因为有些是不好的样本,相当于噪声,模型学入噪声后会使准确度不高。一句话概括就是:**群众的力量是伟大的,集体智慧是惊人的。** 51 | 52 | 而反观多模型,其实也是一样的,利用多个模型的结果进行投票亦或求取均值作为最终的输出,用的就是Bagging的思想。 53 | 54 | ### 2.2Stacking 55 | 56 | stacking是一种分层模型集成框架。以两层为例,第一层由多个基学习器组成,其输入为原始训练集,第二层的模型则是以第一层基学习器的输出作为训练集进行再训练,从而得到完整的stacking模型。如果是多层次的话,以此类推。一句话概括:**站在巨人的肩膀上,能看得更远。** 57 | 58 | ![TIM截图20181231134358.png](https://i.loli.net/2018/12/31/5c29aca8a6f0e.png) 59 | 60 | ### 2.3Adaboost 61 | 62 | 所谓的AdaBoost的核心思想其实是,既然找一个强分类器不容易,那么我们干脆就不找了吧!我们可以去找多个弱分类器,这是比较容易实现的一件事情,然后再集成这些弱分类器就有可能达到强分类器的效果了,其中这里的弱分类器真的是很弱,你只需要构建一个比瞎猜的效果好一点点的分类器就可以了。一句话概括:**坚守一万小时定律,努力学习。** 63 | 64 | ### 2.4图解模型集成 65 | 66 | ![无标题.png](https://i.loli.net/2018/12/31/5c29b3ef23dc1.png) 67 | 68 | ## 3.案例总流程 69 | 70 | ![未命名文件 (1).jpg](https://i.loli.net/2018/12/31/5c29bf94e3484.jpg) 71 | 72 | 1. 首先拉取数据到python中。 73 | 2. 将数据划分成训练集和测试集,训练集由于分类极度不平衡,所以采取下采样工作,使分类比例达到一致。 74 | 3. 将训练集送入模型中训练,同时以K折交叉验证方法来进行超参数调节,哪一组超参数表现好,就选择哪一组超参数。 75 | 4. 寻找到超参数后,用同样的方法寻找决策边界,至此模型训练完成。 76 | 5. 使用模型集成预测测试集,并使用ROC曲线分析法,得到模型的评估指标。 77 | 78 | ## 4.初始化工作 79 | 80 | 啥都不说,先上代码,这里需要说明的就是sklearn.model_selection这个类库,因为老版本和新版本的区别还是很大的,如果巡行报错,尝试着升级sklearn库。 81 | 82 | ```python 83 | # 数据读取与计算 84 | import pandas as pd 85 | import matplotlib.pyplot as plt 86 | import numpy as np 87 | 88 | # 数据预处理与模型选择 89 | from sklearn.preprocessing import StandardScaler 90 | from sklearn.model_selection import train_test_split 91 | from sklearn.linear_model import LogisticRegression 92 | from sklearn.model_selection import KFold 93 | from sklearn.model_selection import cross_val_score 94 | from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report 95 | import itertools 96 | 97 | # 随机森林与SVM 98 | from sklearn.ensemble import RandomForestClassifier 99 | from sklearn.svm import SVC 100 | from sklearn.neighbors import KNeighborsClassifier 101 | from scipy import stats 102 | 103 | import warnings 104 | warnings.filterwarnings("ignore") 105 | 106 | # 一些基本参数设定 107 | mode = 2 #投票个数阈值 108 | ratio = 1 #负样本倍率 109 | iteration1 = 1 #总流程循环次数 110 | show_best_c = True #是否显示最优超参数 111 | show_bdry = True #是否显示决策边界 112 | 113 | ##读取数据,删除无用的时间特征。 114 | data=pd.read_csv('creditcard.csv') 115 | data.drop('Time',axis=1,inplace=True) 116 | ``` 117 | 118 | 119 | 120 | ## 5.数据下采样 121 | 122 | 先回答什么是数据下采样: 123 | 124 | **数据下采样:**数据集中正样本和负样本的比例严重失调,这会给模型的学习带来很大的困扰,例如,正样本有100个,而负样本只有1个,模型只是看到了正样本,而学习不到负样本,这回造成模型对负样本的预测能力几乎为0。所以为了避免这种数据倾斜,处理数据使得正样本和负样本的数量基本均等,这样的模型泛化能力才会高。 125 | 126 | 反观**数据上采样**也是一样的,只不过是基准样本不一样而已。 127 | 128 | 这里的数据处理采用下标的方式,较容易运算。 129 | 130 | ```python 131 | #欺诈类的样本下标 132 | fraud_indices=np.array(data[data.Class==1].index) 133 | #进行随机排列 134 | np.random.shuffle(fraud_indices) 135 | 136 | #获取正常样本下标 137 | normal_indices=np.array(data[data.Class==0].index) 138 | np.random.shuffle(normal_indices) 139 | 140 | 141 | #划分训练集和测试集 142 | train_normal_indices, train_fraud_indices, test_normal_indices 143 | ,test_fraud_indices = 
split_train_test(normal_indices,fraud_indices) 144 | 145 | ##合并测试集 146 | test_indices=np.concatenate([test_normal_indices,test_fraud_indices]) 147 | 148 | #通过下标选取测试集数据,[表示选取行,表示选取列] 149 | test_data=data.iloc[test_indices,:] 150 | x_test=test_data.ix[:,test_data.columns != 'Class'] 151 | y_test=test_data.ix[:,test_data.columns == 'Class'] 152 | 153 | #数据下采样,调用下采样函数 getTrainingSample 154 | x_train_undersample,y_train_undersample,train_normal_pos=getTrainingSample( 155 | train_fraud_indices,train_normal_indices,data,0,ratio) 156 | ``` 157 | 158 | getTrainingSample函数如下,由于代码显示效果不行,所以以图代替,文章开头已有源代码链接,注解已写得很清楚,不需重复赘述: 159 | 160 | ![UTOOLS1546243337995.png](https://i.loli.net/2018/12/31/5c29cd0a797ff.png) 161 | 162 | 163 | 164 | ## 6.模型训练 165 | 166 | ### 6.1KNN 167 | 168 | ```python 169 | #用不同的模型进行训练 170 | models_dict = {'knn' : knn_module, 'svm_rbf': svm_rbf_module, 'svm_poly': svm_poly_module, 171 | 'lr': lr_module, 'rf': rf_module} 172 | 173 | #knn中取不同的k值(超参数) 174 | c_param_range_knn=[3,5,7,9] 175 | #自定义cross_validation_recall,使用循环找出最适合的超参数。 176 | best_c_knn=cross_validation_recall(x,y, c_param_range_knn,models_dict, 'knn') 177 | ``` 178 | 179 | cross_validation_recall函数如下: 180 | 181 | ![UTOOLS1546245831285.png](https://i.loli.net/2018/12/31/5c29d6c7816f9.png) 182 | 183 | 这里有几个概念需要解释一下,以防大家看不懂。 184 | 185 | - **K折交叉验证:**K折交叉验证(k-fold cross-validation)首先将所有数据分割成K个子样本,不重复的选取其中一个子样本作为测试集,其他K-1个样本用来训练。共重复K次,平均K次的结果或者使用其它指标,最终得到一个单一估测。 186 | 187 | 这个方法的优势在于,保证每个子样本都参与训练且都被测试,降低泛化误差。其中,10折交叉验证是最常用的。 188 | 189 | - **ROC曲线**:评估模型好坏的方式,已有人解释非常清楚,此处不再赘述,欲了解请点击: 190 | 191 | https://www.cnblogs.com/gatherstars/p/6084696.html 192 | 193 | 接下来就是真正的模型训练函数: 194 | 195 | ```python 196 | def knn_module(x,y,indices, c_param, bdry=None): 197 | #超参数赋值 198 | knn=KNeighborsClassifier(n_neighbors=c_param) 199 | #ravel把数组变平 200 | knn.fit(x.iloc[indices[0],:], y.iloc[indices[0],:].values.ravel()) 201 | y_pred_undersample = knn.predict(x.iloc[indices[1],:].values) 202 | 203 | return y_pred_undersample 204 | ``` 205 | 206 | 模型评估,计算召回率和auc值: 207 | 208 | ```python 209 | #计算召回率和auc 210 | #y_t是真实值,y_p是预测值 211 | def compute_recall_and_auc(y_t, y_p): 212 | #混淆矩阵 https://www.cnblogs.com/zhixingheyi/p/8097782.html 213 | # https://blog.csdn.net/xierhacker/article/details/70903617 214 | cnf_matrix=confusion_matrix(y_t,y_p) 215 | #设置numpy的打印精度 216 | np.set_printoptions(precision=2) 217 | recall_score = cnf_matrix[0,0]/(cnf_matrix[1,0]+cnf_matrix[0,0]) 218 | 219 | #Roc曲线 220 | # https://www.cnblogs.com/gatherstars/p/6084696.html 221 | fpr, tpr,thresholds = roc_curve(y_t,y_p) 222 | roc_auc= auc(fpr,tpr) 223 | return recall_score , roc_auc 224 | ``` 225 | 226 | 227 | 228 | ### 6.2 SVM-RBF 229 | 230 | 径向基函数(RBF)做SVM的核函数。 231 | 232 | 欲想了解核函数:https://blog.csdn.net/v_JULY_v/article/details/7624837#commentBox 233 | 234 | ```python 235 | # SVM-RBF中不同的参数 236 | c_param_range_svm_rbf=[0.01,0.1,1,10,100] 237 | best_c_svm_rbf = cross_validation_recall(x,y,c_param_range_svm_rbf, models_dict, 'svm_rbf') 238 | 239 | def svm_rbf_module(x, y, indices, c_param, bdry= 0.5): 240 | svm_rbf = SVC(C=c_param, probability=True) 241 | svm_rbf.fit(x.iloc[indices[0],:], y.iloc[indices[0],:].values.ravel()) 242 | y_pred_undersample = svm_rbf.predict_proba(x.iloc[indices[1],:].values)[:,1] >= bdry#True/Flase 243 | return y_pred_undersample 244 | ``` 245 | 246 | ### 6.3 SVM-POLY 247 | 248 | 多项式(POLY)做SVM的核函数。 249 | 250 | ![UTOOLS1546247241520.png](https://i.loli.net/2018/12/31/5c29dc498aea3.png) 251 | 252 | 训练函数为: 253 | 254 | ```python 255 | def svm_poly_module(x,y, 
indices, c_param, bdry=0.5): 256 | svm_poly=SVC(C=c_param[0], kernel='poly', degree= c_param[1], probability=True) 257 | svm_poly.fit(x.iloc[indices[0],:], y.iloc[indices[0],:].values.ravel()) 258 | y_pred_undersample = svm_poly.predict_proba(x.iloc[indices[1],:].values)[:,1] >= bdry 259 | return y_pred_undersample 260 | ``` 261 | 262 | ### 6.4 Logistic Regression 263 | 264 | 逻辑回归模型 265 | 266 | ```python 267 | # 逻辑回归当中的正则化强度 268 | c_param_range_lr=[0.01,0.1,1,10,100] 269 | best_c_lr = cross_validation_recall(x,y, c_param_range_lr, models_dict, 'lr') 270 | 271 | def lr_module(x,y, indices, c_param, bdry=0.5): 272 | # penalty惩罚系数 273 | lr = LogisticRegression(C=c_param,penalty='11') 274 | lr.fit(X.iloc[indices[0],:], y.iloc[indices[0],:].values.ravel()) 275 | y_pred_undersample= lr.predict_proba(X.iloc[indices[1],:].values)[:,1]>=bdry 276 | return y_pred_undersample 277 | ``` 278 | 279 | ### 6.5 Random Forest 280 | 281 | 随机森林模型,欲知超参数含义请点击: 282 | 283 | https://www.cnblogs.com/harvey888/p/6512312.html 284 | 285 | ```python 286 | # 随机森林里调参 287 | c_param_range_rf = [2,5,10,15,20] 288 | best_c_rf= cross_validation_recall(X, y, c_param_range_rf, models_dict, 'rf') 289 | ``` 290 | 291 | ![UTOOLS1546247662478.png](https://i.loli.net/2018/12/31/5c29ddee81187.png) 292 | 293 | 294 | 295 | ### 6.6决策边界 296 | 297 | 在具有两个类的统计分类问题中,决策边界或决策表面是超曲面,其将基础向量空间划分为两个集合,一个集合。 分类器将决策边界一侧的所有点分类为属于一个类,而将另一侧的所有点分类为属于另一个类。 298 | 299 | 所以这一步我们要做的就是根据AUC值找出模型最好的决策边界值,也就是概率值。大于这一概率值为正样本,反之为负样本。 300 | 301 | ```python 302 | # 交叉验证确定合适的决策边界阈值 303 | fold = KFold(4,shuffle=True) 304 | 305 | # 定义各个模型的计算公式 306 | def lr_bdry_module(recall_acc, roc_auc): 307 | return 0.9*recall_acc+0.1*roc_auc 308 | def svm_rbf_bdry_module(recall_acc, roc_auc): 309 | return recall_acc*roc_auc 310 | def svm_poly_bdry_module(recall_acc, roc_auc): 311 | return recall_acc*roc_auc 312 | def rf_bdry_module(recall_acc, roc_auc): 313 | return 0.5*recall_acc+0.5*roc_auc 314 | bdry_dict = {'lr': lr_bdry_module,'svm_rbf': svm_rbf_bdry_module, 315 | 'svm_poly': svm_poly_bdry_module, 'rf': rf_bdry_module} 316 | 317 | # decision_boundary是一个计算决策边界的函数 318 | best_bdry_svm_rbf= decision_boundary(x, y, fold, best_c_svm_rbf, bdry_dict, models_dict, 'svm_rbf') 319 | best_bdry_svm_poly = decision_boundary(x, y, fold, best_c_svm_poly, bdry_dict, models_dict, 'svm_poly') 320 | best_bdry_lr = decision_boundary(x, y, fold, best_c_lr, bdry_dict, models_dict, 'lr') 321 | best_bdry_rf = decision_boundary(x, y, fold, best_c_rf, bdry_dict, models_dict, 'rf') 322 | best_bdry = [0.5, best_bdry_svm_rbf, best_bdry_svm_poly, best_bdry_lr, best_bdry_rf] 323 | ``` 324 | 325 | decision_boundary函数为,与前面寻找超参数大致相同: 326 | 327 | ![UTOOLS1546248292832.png](https://i.loli.net/2018/12/31/5c29e06617b39.png) 328 | 329 | 330 | 331 | ### 6.7 模型建模 332 | 333 | 寻找到最优的超参数和决策边界后,就可以正式开始训练各个模型了。 334 | 335 | ```python 336 | # 最优参数建模 337 | knn = KNeighborsClassifier(n_neighbors = int(best_c_knn)) 338 | knn.fit(x.values, y.values.ravel()) 339 | 340 | svm_rbf = SVC(C=best_c_svm_rbf, probability = True) 341 | svm_rbf.fit(x.values, y.values.ravel()) 342 | 343 | svm_poly = SVC(C=best_c_svm_poly[0], kernel = 'poly', degree = best_c_svm_poly[1], probability = True) 344 | svm_poly.fit(x.values, y.values.ravel()) 345 | 346 | lr = LogisticRegression(C = best_c_lr, penalty ='l1', warm_start = False) 347 | lr.fit(x.values, y.values.ravel()) 348 | 349 | rf = RandomForestClassifier(n_jobs=-1, n_estimators = 100, criterion = 'entropy', 350 | max_features = 'auto', max_depth = None, 351 | min_samples_split = int(best_c_rf), 
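                            # min_samples_split uses the value picked by the cross-validated search above; random_state pins the seed for reproducibility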
random_state=0) 352 | rf.fit(x.values, y.values.ravel()) 353 | 354 | models = [knn,svm_rbf,svm_poly, lr, rf] 355 | ``` 356 | 357 | 358 | 359 | ## 7.结果 360 | 361 | ### 7.1预测 362 | 363 | 使用之前划分的测试集运用以上训练出来的模型进行预测,预测使用的是模型集成的投票机制。 364 | 365 | 我们先来看看预测的代码: 366 | 367 | ![UTOOLS1546248653719.png](https://i.loli.net/2018/12/31/5c29e1cdbd6cc.png) 368 | 369 | 模型集成投票代码: 370 | 371 | ![UTOOLS1546248915382.png](https://i.loli.net/2018/12/31/5c29e2d3aa3f2.png) 372 | 373 | ### 7.2模型评估 374 | 375 | 使用AUC进行模型评估,预测部分代码已经记录有相关指标数据,只要计算平均得分就可以。 376 | 377 | ```python 378 | #计算平均得分 379 | mean_recall_score = np.mean(recall_score_list) 380 | std_recall_score = np.std(recall_score_list) 381 | 382 | mean_auc= np.mean(auc_list) 383 | std_auc = np.std(auc_list) 384 | ``` 385 | 386 | 387 | 388 | ## 8.完整代码 389 | 390 | **数据集下载:**https://v2.fangcloud.com/share/a63342d8bd816c43f281dab455 391 | 392 | [GitHub完整代码](https://github.com/mantchs/machine_learning_model/blob/master/Model%20Ensemble/kaggle_credict.ipynb) 393 | 394 | . 395 | 396 | . 397 | 398 | . 399 | 400 | ![image.png](https://upload-images.jianshu.io/upload_images/13876065-08b587647d14267c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) 401 | 402 | 欢迎添加微信交流!请备注“机器学习”。 403 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | [寻觅互联网,少有机器学习通俗易懂之算法讲解、案例等,项目立于这一问题之上,整理一份基本算法讲解+案例于文档,供大家学习之。通俗易懂之文章亦不可以面概全,但凡有不正确或争议之处,望告知,自当不吝赐教!](#) 3 | 4 | ## 目录 5 | 6 | ## [1.线性回归](https://github.com/mantchs/machine_learning_model/tree/master/Linear%20Regression) 7 | ###     [1.1房价预测案例](https://github.com/mantchs/machine_learning_model/tree/master/Linear%20Regression/demo) 8 | ## [2.逻辑回归](https://github.com/mantchs/machine_learning_model/tree/master/Logistic%20Regression) 9 | ###     [2.1金融信用分类案例](https://github.com/mantchs/machine_learning_model/blob/master/Logistic%20Regression/demo/CreditScoring.ipynb) 10 | ## [3.决策树](https://github.com/mantchs/machine_learning_model/tree/master/Decision%20Tree) 11 | ###     [3.1决策树案例](https://github.com/mantchs/machine_learning_model/blob/master/Decision%20Tree/DecisionTree.ipynb) 12 | ###     [3.2随机森林案例](https://github.com/mantchs/machine_learning_model/blob/master/Decision%20Tree/RandomForestRegression.ipynb) 13 | ## [4.SVM](https://github.com/mantchs/machine_learning_model/tree/master/SVM) 14 | ###     [4.1新闻分类案例](https://github.com/mantchs/machine_learning_model/tree/master/SVM/cnews_demo) 15 | ## [5.模型集成(多模型融合)](https://github.com/mantchs/machine_learning_model/tree/master/Model%20Ensemble) 16 | ###     [5.1信用卡欺诈预测案例](https://github.com/mantchs/machine_learning_model/tree/master/Model%20Ensemble) 17 | ## [6.L1、L2正则化,ElasticNetCV](https://github.com/mantchs/machine_learning_model/tree/master/Regularization) 18 | ## [7.NLP从词袋到Word2Vec](https://github.com/mantchs/machine_learning_model/tree/master/Word2Vec) 19 | ###     [7.1Word2Vec训练维基百科文章](https://github.com/mantchs/machine_learning_model/blob/master/Word2Vec/word2vec.ipynb) 20 | 21 | 22 |
23 |
24 |
25 |
26 | 27 | ![image.png](https://upload-images.jianshu.io/upload_images/13876065-08b587647d14267c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) 28 | 29 | 欢迎添加微信交流!请备注“机器学习”。 30 | -------------------------------------------------------------------------------- /Regularization/LassoCV.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 6, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "\n", 13 | "author : duanxxnj@163.com\n", 14 | "time : 2016-06-06_15-41\n", 15 | "\n", 16 | "Lasso 回归应用于稀疏信号\n", 17 | "\n", 18 | "\n", 19 | "测试集上的R2可决系数 : -0.185315\n" 20 | ] 21 | }, 22 | { 23 | "data": { 24 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXYAAAD8CAYAAABjAo9vAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3Xl8VNX5+PHPmTV7CEnYCYuCrCFAWEQBFQVXUOpOLdSFUtfWn2ttq+3XtrZa963U4r7WXUGLKApuICDIvggIgQCBEELW2c7vjzsz2TdmMneSed6vV153MnfmzjN3Zp45c85zz1Vaa4QQQrQfFrMDEEIIEV6S2IUQop2RxC6EEO2MJHYhhGhnJLELIUQ7I4ldCCHaGUnsQgjRzjQ7sSul5imlDiil1lW7rqNS6hOl1Fb/Mq11whRCCNFcLWmxPwecWeu6O4BPtdb9gE/9/wshhDCRasmRp0qp3sCHWush/v83A6dorfOVUl2Bz7XWJzS1nYyMDN27d+9jClgIIWLVypUrD2qtM5u6nS3Ex+mstc4H8Cf3Ts25U+/evVmxYkWIDy2EELFFKfVTc24XscFTpdRspdQKpdSKgoKCSD2sEELEnFAT+35/Fwz+5YGGbqi1nqu1ztVa52ZmNvlLQgghxDEKNbG/D8z0X54JvBfi9oQQQoSo2X3sSqlXgVOADKVUHnA3cB/whlLqKmAXcFFrBClEe+V2u8nLy6OiosLsUEQUiYuLo0ePHtjt9mO6f7MTu9b6sgZWTTqmRxZCkJeXR3JyMr1790YpZXY4IgporTl06BB5eXn06dPnmLYhR54KYaKKigrS09MlqYsgpRTp6ekh/YqTxC6EySSpi9pCfU9IYheitfl88P1L4HWbHYmIEZLYhWhta16F966Drx8zO5J6JSUlmR1CyG699VYGDx7MrbfeSkFBAWPGjGH48OEsXbqUs88+m6Kiogbv+/TTT/PCCy8c0+Pu3LmTV1555VjDbjWhHnkqhGhKeaGxLDtkbhzt2L/+9S8KCgpwOp289tprDBgwgOeffx6A8ePHN3rfOXPmHPPjBhL75ZdffszbaA3SYheitTkSjWVCR3PjaIEPPvgg2Oo9/fTT2b9/PwBffPEFOTk55OTkMHz4cI4ePUp+fj4TJkwgJyeHIUOGsHTpUgBeffVVhg4dypAhQ7j99tvrfZzvvvuOcePGMWzYMEaPHs3Ro0epqKjgl7/8JUOHDmX48OEsXrwYAK/Xy6233sqoUaPIzs7mX//6FwBTp06ltLSUMWPG8Pe//53bbruNBQsWkJOTQ3l5Ob179+bgwYMAvPDCC2RnZzNs2DCuuOIKAO655x4eeOABAH788UfOPPNMRo4cyfjx49m0aRMAs2bN4sYbb2TcuHH07duXN998E4A77riDpUuXkpOTw0MPPcT69esZPXo0OTk5ZGdns3Xr1tZ4eZokLXYhWtug8yGlB/TIbfRmf/pgPRv2Fof3obulcPd5g1t8v5NPPplvv/0WpRTPPPMM//jHP/jnP//JAw88wBNPPMFJJ51ESUkJcXFxzJ07lylTpnDXXXfh9XopKytj79693H777axcuZK0tDQmT57Mu+++y/nnnx98DJfLxSWXXMLrr7/OqFGjKC4uJj4+nkceeQSAtWvXsmnTJiZPnsyWLVt44YUXSE1N5bvvvqOyspKTTjqJyZMn8/7775OUlMTq1asB6Ny5MytWrODxxx+v8ZzWr1/PX/7yF7766isyMjIoLCys87xnz57N008/Tb9+/Vi2bBnXXnstn332GQD5+fl8+eWXbNq0ialTp3LhhRdy33338cADD/Dhhx8CcMMNN3DTTTcxY8YMXC4XXq+3xfs+HCSxC9HaEjpC/8lmR9EieXl5XHLJJeTn5+NyuYL11CeddBI333wzM2bMYPr06fTo0YNRo0Zx5ZVX4na7Of/888nJyeGzzz7jlFNOITB9yIwZM1iyZEmNxL5582a6du3KqFGjAEhJSQHgyy+/5IYbbgBgwIAB9OrViy1btrBw4UJ++OGHYGv5yJEjbN26tdm13p999hkXXnghGRkZAHTsWPMXVElJCV9//TUXXVR1nGVlZWXw8vnnn4/FYmHQoEHBXzC1nXjiifzlL38hLy+P6dOn069fv2bFFm6S2IVobbuWwbzJcMG/YNilDd7sWFrWreWGG27g5ptvZurUqXz++efcc889gNH1cM4557BgwQLGjh3LokWLmDBhAkuWLGH+/PlcccUV3HrrrcEk3Ritdb1lfQ1NJa615rHHHmPKlCnH9JwaerwAn89Hhw4dgi3/2pxOZ5MxXn755YwZM4b58+czZcoUnnnmGU477bRjijcU0scuRGvbvcxY7ltrbhwtcOTIEbp37w4QHIQEow966NCh3H777eTm5rJp0yZ++uknOnXqxDXXXMNVV13FqlWrGDNmDF988QUHDx7E6/Xy6quvMnHixBqPMWDAAPbu3ct3330HwNGjR/F4PEyYMIGXX34ZgC1btrBr1y5OOOEEpkyZwlNPPYXb7Q6uKy0tbfZzmjRpEm+88QaHDhmD2LW7YlJSUujTpw///e9/ASN5r1mzptFtJicnc/To0eD/27dvp2/fvtx4441MnTqVH374odnxhZO02IVobVF+AFJZWRk9evQI/n/zzTdzzz33cNFFF9G9e3fGjh3Ljh07AHj44YdZvHgxVquVQYMGcdZZZ/Haa69x//33Y7fbSUpK4oUXXq
Br16787W9/49RTT0Vrzdlnn820adNqPK7D4eD111/nhhtuoLy8nPj4eBYtWsS1117LnDlzGDp0KDabjeeeew6n08nVV1/Nzp07GTFiBFprMjMzeffdd5v9PAcPHsxdd93FxIkTsVqtDB8+nOeee67GbV5++WV+/etfc++99+J2u7n00ksZNmxYg9vMzs7GZrMxbNgwZs2aRUVFBS+99BJ2u50uXbrwxz/+sdnxhVOLzqAULrm5uVpOtCFixtePwcLfw9jr4My/1li1ceNGBg4caFJgIprV995QSq3UWjc+Co90xQghRLsjiV2I1mZ1GMu0XubGIWKG9LEL0doGT4fOQ6D7CLMjETFCErsQrS0p0/gTIkLC0hWjlPqtUmq9UmqdUupVpVRcOLYrRLuQvwbuSYWvHjU7EhEjQk7sSqnuwI1ArtZ6CGAFGj4KQ4hYs2OJsSyp/2hFIcItXIOnNiBeKWUDEoC9YdquECJKNDX9LcAf//hHFi1adEzb//zzzzn33HOP6b7NsWnTpuDkZT/++COPPvooAwcOZMaMGbz//vvcd999jd5/3Lhxx/zYzz33HHv3Ri4thtzHrrXeo5R6AONk1uXAQq31wtq3U0rNBmYDZGVlhfqwQogI0VqjtWbBggVN3vbPf/5zBCI6Nu+++y7Tpk3jT3/6EwBPPvkkH330UXCumalTpzZ6/6+//vqYH/u5555jyJAhdOvW7Zi30RLh6IpJA6YBfYBuQKJS6ue1b6e1nqu1ztVa5wYmBhIiJphwEGBLPPjggwwZMoQhQ4bw8MMPA8Y84wMHDuTaa69lxIgR7N69u8b0t//3f//HgAEDOOOMM7jsssuC097OmjUrOElX7969ufvuuxkxYgRDhw4NToG7fPlyxo0bx/Dhwxk3bhybN29uND6v18stt9zC0KFDyc7O5rHHjBOWfPrppwwfPpyhQ4dy5ZVXBifsWrlyJRMnTmTkyJFMmTKF/Px8FixYwMMPP8wzzzzDqaeeypw5c9i+fTtTp07loYce4rnnnuP6668HYP/+/VxwwQUMGzaMYcOGBRN69ROS3H///cHpg+++++4a++yaa65h8ODBTJ48mfLyct58801WrFjBjBkzglMJ33HHHQwaNIjs7GxuueWW0F/EWsJRFXM6sENrXQCglHobGAe8FIZtC9H2+TzGssvQpm/77Dl1rxt8Poy+Blxl8PJFddfnXA7DZ0DpIXjjFzXX/XJ+ow+3cuVKnn32WZYtW4bWmjFjxjBx4kTS0tLYvHkzzz77LE8++WSN+6xYsYK33nqL77//Ho/Hw4gRIxg5cmS928/IyGDVqlU8+eSTPPDAAzzzzDMMGDCAJUuWYLPZWLRoEb/73e946623Goxx7ty57Nixg++//x6bzUZhYSEVFRXMmjWLTz/9lP79+/OLX/yCp556iuuuu44bbriB9957j8zMTF5//XXuuusu5s2bx5w5c0hKSgom0o8//pjFixeTkZFRY2qBG2+8kYkTJ/LOO+/g9XopKSmpEc/ChQvZunUry5cvR2vN1KlTWbJkCVlZWWzdupVXX32Vf//731x88cW89dZb/PznP+fxxx/ngQceIDc3l8LCQt555x02bdqEUqrJ7q1jEY7EvgsYq5RKwOiKmQTIfAFCBAy/Ao6fBJ2bkdgj7Msvv+SCCy4gMdE4Gcj06dNZunQpU6dOpVevXowdO7be+0ybNo34+HgAzjvvvAa3P336dABGjhzJ22+/DRgTjM2cOZOtW7eilApO6tWQRYsWMWfOHGw2I1117NiRNWvW0KdPH/r37w/AzJkzeeKJJzj99NNZt24dZ5xxBmC09rt27dqSXcJnn30WPFWe1WolNTW1xvqFCxeycOFChg8fDhjT/W7dupWsrCz69OlDTk5O8Dnv3LmzzvZTUlKIi4vj6quv5pxzzmmVcYVw9LEvU0q9CawCPMD3wNxQtytEu9GSOvbGWtiOhMbXJ6Y32UKvrbG5ogLJviX3qS0w1a3VasXjMX65/OEPf+DUU0/lnXfeYefOnZxyyilNxlh7ut3GpvYdPHgw33zzTbNjbCmtNXfeeSe/+tWvaly/c+fOGlP7Wq1WysvL69zfZrOxfPlyPv30U1577TUef/zx4Mk8wiUsVTFa67u11gO01kO01ldorSubvpcQMSJvhVHHvvD3ZkdSx4QJE3j33XcpKyujtLSUd955p8lzhJ588sl88MEHVFRUUFJSwvz5LfsyqT4lcO3ZFeszefJknn766eAXQ2FhIQMGDGDnzp1s27YNgBdffJGJEydywgknUFBQEEzsbreb9evXtyi+SZMm8dRTTwFGi7+4uOZZraZMmcK8efOCXTR79uzhwIEDjW6z+vS+JSUlHDlyhLPPPpuHH364wfnfQyFzxQjR2jb5E58n+to7I0aMYNasWYwePZoxY8Zw9dVXB7sYGjJq1CimTp3KsGHDmD59Orm5uXW6Kxpz2223ceedd3LSSSc169RxV199NVlZWcFzlb7yyivExcXx7LPPctFFFzF06FAsFgtz5szB4XDw5ptvcvvttzNs2DBycnJaXM3yyCOPsHjxYoYOHcrIkSPrfDFMnjyZyy+/nBNPPJGhQ4dy4YUX1piTvT6zZs1izpw55OTkcPToUc4991yys7OZOHEiDz30UIviaw6ZtleI1vbJH+GrRyD3Kjj3wRqr2uq0vSUlJSQlJVFWVsaECROYO3cuI0bIXDjhFMq0vTJXjBCtzedvlQaqY9qB2bNns2HDBioqKpg5c6Yk9SgjiV2I1hZI7NqcM9a3hldeecXsEEQjpI9diNYWaKn3Orne1WZ0h4roFup7QlrsQrS2U+6EE6+Djn3qrIqLi+PQoUOkp6fXKekTsUlrzaFDh4iLO/ZJciWxC9HaEtMhPs3okrFYa6zq0aMHeXl5FBQUmBSciEZxcXE1TjDeUpLYhWhtWxbCKxfBoGlw8Qs1Vtnt9uAkVEKEi/SxC9Ha1huH0gcHUYVoZZLYhWhtgcFT7TM3DhEzJLEL0draYR27iG6S2IVobYH6demKEREiiV2I1hZI6IMaP0OPEOEiVTFCtLbznwTvo0bZoxARIC12IVpbXCrY46Gy8RkAhQiXsCR2pVQHpdSbSqlNSqmNSqkTw7FdIdqF71+Gv3aFF6ebHYmIEeHqinkE+FhrfaFSygEkhGm7QrR9P7xuLNvRJGAiuoWc2JVSKcAEYBaA1toFuELdrhDthpQ7iggLR1dMX6AAeFYp9b1S6hmlVP0nSxQiFgXLHeUAJREZ4UjsNmAE8JTWejhQCtxR+0ZKqdlKqRVKqRUy4ZGIKcEjT6UrRkRGOBJ7HpCntV7m//9NjERfg9Z6rtY6V2udm5nZzDO2C9EeBLpiRl1tbhwiZoTcx6613qeU2q2UOkFrvRmYBGwIPTQh2okrPzbmibHHmx2JiBHhqoq5AXjZXxGzHfhlmLYrRNtnc0L5YSgvgpSuZkcjYkBYErvWejXQ5JmzhYhJS/8Jn/4ZUnvCb9eZHY2IAXLkqQid1rBsLpQVmh1JdFrzmrGUScBEhEhiF
6Hb9wN8dCu8M8fsSKJToCpG6thFhEhiF6GzOo1l75PMjSNaBVrqUu4oIkQSuwif1GM/+W67JkeeigiTaXtF6DwVxvLgVnPjiFZKQce+cOJ1ZkciYoQkdhE6Z7KxzP/B3DiilVTCiAiTrhgRuo59AQWdB5sdSfQqPQgHNpkdhYgR0mIXoVMKrA7wyqSe9Zp/C2xbBEW74G4pCRWtTxK7CN2Pi8FbCbu+MTuS6LTmVXCVGJe1Nr4IhWhF0hUjQuepNJZyAE79qu8XLVP3itYniV2ELtAFc0qd2ZoF1Kxfly8/EQGS2EXoAom9Q5a5cUQrnwcs9qrLQrQy6WMXoQskqx1LIPMEc2OJNlob5aDHTYLjTgWr3eyIRAyQxC5Cl9bHWG54D0ZfY24s0UYpuGOX2VGIGCNdMSJ0WWOg89CqA5VEXSUFkLcCvG6zIxExQFrsInQ+H1htUsdeH3cFvDMbXKVGLfstWyGpk9lRiXYubC12pZRVKfW9UurDcG1TtBHfPAZ7vzfOECRq8rqMLqrDPxn/y+CpiIBwdsXcBGwM4/ZEWyHdCw0LlDpaHcZSyh1FBIQlsSulegDnAM+EY3uijQkk9kteMjeOaBRI5LZAYpcWu2h94WqxPwzcBshhdbHI6wKLLbZP1Pzj4von+Qok9sDJSOTIUxEBISd2pdS5wAGt9combjdbKbVCKbWioKAg1IcV0cTrMlqiK541OxLzvHg+LLil/nUpPaDfGXDhszJwKiIiHC32k4CpSqmdwGvAaUqpOr/JtdZztda5WuvczMzMMDysiBq9/KfE++Zxc+MwU8YJkNCx7vXJneHm9TDhFhgyXUpCRUSEnNi11ndqrXtorXsDlwKfaa1/HnJkou0YcDYMuyy2yx0PbYOi3Q2vLz1odNdUFEcuJhGz5AAlETpXmTGAGsvVMdoLe1fVvf7IHnhxOix90OiuKdwe+dhEzAlrYtdaf661Pjec2xRtwIe/hXVvxm6L3dfIgKirBH78FMoO+W8r5Y6i9UmLXYQukNC9MVrK5/XPR59Yz8Bo7XJHLYldtD5J7CJ0PjekZsFNq82OxBzucmM5/ua66wJ164FyR6ljFxEgiV2EzuuGhLT6q0JiQeAMUvWRI0+FCSSxi9B5XZC/Bhb9qfH+5vYqLgWUBT6u5wxSVodRCtnrRJjxJnQeHPn4RMyRxC5Cl30JpPWGLx80umVijSMRRs6ChIy66zoPhuuXw8DzjIOUYvVXjYgoSewidMMuhdyrjMuxWPJYUWzMbll2sOHblBfBxg/h6L7IxSViliR2Ebqj+435xiE2Sx4PbDQSe332rob/TIGN78PrM2BPPbXuQoSZJHYRupd+Bl/cZ1yOxRa7t9rgae0xhvLDsPtb44QbIOWOIiIksYvQVW+lx2KLPVAV02di3dkb60zbK4ldtD5J7CJ0XhcMvgB+XwCpPcyOJvI8/tb4lL8YpwisTteatlfq2EUESGIXofN5wJ5gtEqVMjuayAu02CtL6rbI6xx5GoPloCLiJLGL0HldcHArLLgVjuSZHU3kZY2F/mfBs2fWff7OZOg2wqhl/+VH0PdUc2IUMUUSuwjdKXdA95GwfC6U7Dc7mshL7QFDLzQu1z4Ktc94mL0YugyBXuMgSc5FIFqfJHYRutwrjYNvIDarYgq3w09fGZcD/e21ucpgzWtwcFvk4hIxSxJ7G7d0awFfbWvkwJhIKNgMlUeNy7GY2Ne/AyvmGZdrt9g3fwxPnQQFG+GdX8GOz5u92Y/X7eO0Bz7H7ZV+edEytqZvIqLZQ59swaIUJx1fz+HskaA1PDEG+kww/o/lckeo22IvL4T966r+b8FcOhvyi9l+sJQj5W4ykpwhBiliSThOZt1TKbVYKbVRKbVeKXVTOAITzVNS6aG4wsRWss8LaHAkgcUem3XagWQ+6Y/QIavmuhCm7S2t9NRYCtFc4Wixe4D/p7VepZRKBlYqpT7RWm8Iw7ZFE0oqPHi1Ni+AQAu952i47BXz4jCTpxLiUmH8/6u7zldr2t4WHHkaSOglkthFC4XjZNb5WutV/stHgY1A91C3K5qnpNJDcbmJH/xAYrfazYvBbJ4KUFY4vLPuyaoDLfRjOPK0JNhij8FfQSIkYR08VUr1BoYDy8K5XVHTT4dK2V9cgdaakkoP5W4vLo9JA2yBwdKKYnj3Wvjpa3PiMNPY64yjTh8ZBhs/qLkupZsx/hCXCr9aCjkzmr3ZkmCLPQYHpEVIwjZ4qpRKAt4CfqO1Lq5n/WxgNkBWVlbt1aIFbnj1e3p2TOCBC4fh8/fCHK1wk27GAJsjEc57FFK7GxOBZY016rVjSWZ/42QbUHNCMIAB5xh/APFpLdpsVVeMtNhFy4Slxa6UsmMk9Ze11m/Xdxut9Vytda7WOjczUw7SCMWB4koKiis5Wq0ld6TcpFadIwFGzoQu2cb/sVgVs/NL2PWtcbmx0+R99wzkrWj2ZgMJXQZPRUuFoypGAf8BNmqtHww9JNGU4go3xRXuGn2vxRUmffjd5ZC3Elwlxv+xWMe+9J+w5H7jcu1yx+/+Aw9nG/tpwa2w+aNmb1aqYsSxCkeL/STgCuA0pdRq/9/ZYdiuqIfb66PM5aW43E1JtWRebFaLvXAHPHMa/PSN8X8sJnZPJThTqi5XV34Yin4yBlcttgarYio9Xsb+9VPm/5AfvC7Qx37UrC9t0WaF3Meutf4SiMEp/cwRSOBHyt01yuBM64oJdL04EiGug5G8Yo2nwug/P+t+Y86c6gJVMBarkdwbqGM/VOJiX3EFG/KPcE52V6B6VYwkdtEyMfgpbNsCXS6lLm+NZG7aQUqBROVIhDt+MicGs3kqwRYHY2bXXRdooSuL8aXXwJGnh8tc/qXxOrq9vmClU6lLErtoGZkrpo2p3uWSf6S82vUmffiljt1osducxpw5RbtqrvN5jJa6UmCxNNhiL/In9CJ/gq/eSpeqGNFS0mJvY6q30vcWVUvsZrXYA4ndYoe3roG+p8Dw5tdqtwsXv2C02J+fajz/85+oWpdxQlW546+WVPXF1xJosQcSfPV+9RIzp4wQbZK02NuY6gl8b5FRgZHgsJrXx95pEFw4DzJPgC3/g31rzYnDTJ0HQ/pxRqu9dlXMsEvgkheNy2m9IaFjvZsIdMEEltW7X+TI02O3p6ic+z7ahNdn4rQbJpDE3sZUT+B7isqxKOicEmdeVUxSJxjyM0jMMLpjYrGOfdWLRsmnLa7h+djBqGPfsrDeVUWlgRZ7za4Yh9Uic8WE4LNNB3j6ix/ZeajU7FAiShJ7G1O9L31vUTmJDhsp8Xbz6tiP7oftn4Or1JjoKhYT+/ybYeP7/hZ7rXLHRffAoyOMy0sfgg3v1ruJqha7sf8C/eqdU50yeBqCwAD0geJGDhxrhySxtzHVu2IKSipJirORGm83r8W+Ywm8MA2K88Fqi706dp/P+DKzxdXfFVNRDBVFxmVLw+WOgZZ6hdtHhdsbbLF3To6TcscQVHqML8gDRxv5JdUOyeBpG3Ok3E2Cw0qZy4vWkOi0kRJnI6+wzJyAglUxNujQ
C+I7mBOHWQJzw9icMOEWo6yxOu01qmLAn9jr7y8PtNQDlwMHn3VOiWPtniNhDztWBFrs+4slsYsoVlzupktKHLsKy/D4NEnOQFeMyVUxVgfM+tCcGMwUaKHb4uD40+uu93mqDtpq5MjTQFcMwOHSqoPPOqfEUenx4fH6sFnlB3ZLVSV26YoRUay4wkNKvJ2UeKNuPMlpIyXOTnG5B23GCTeCZwhyRP6xo4GnWov94La6k3z5fEZLHRo98rSozEWnZGfwcrArJsW4Tipjjk2wj/1obCV2abHXw+XxUeHxkhIXfQfdHCl3kxpvJyXORmGpiySn0cfu8vqocPuId1gjG1D1A5Q+vtNYnvHnyMZgpoQMuH6lUcb48R2wexnctKZqfc/RVdP1/nJBg1MuHC5zM6BLMgeOVnK4zE2Jy4PDZqFDgvEeLHF5SE2IvvdjtKuUrhgR8OAnW/jwh70sve1UjMkro8fRcjdZHRNI9bfYE502UuKNl7G4wh35xD7gHEg/HuyJsHd1Veu0jTpUUsnRCg+9MxKbdwerDTKOZ29RORnYcdSuisn9ZdXlBmrYvT5NcYWbvpmJLNtRGOxjT3LaSHQar22JTAR2TKqqYmIrsUtXTD2+3X6IvMPl7D0SfW+GI+VuUuJs1bpirMFfFqZUxqT1hv5TjATXDurY735/PZf9+9vmd2sd3Y/+6lFueupdlu8ubXw+9hXz4PuX61x9pNyN1tA73fgyCXTFJDqtVYldKmOOictb1RVjSlelSSSx1+Lx+tiYb5wAam1edFUjaG207FLi7cFknlQtyZsygHpgY9Uc41ZHmy531FqzfEch+UcqyDtc3vQdAIp2oT75A/FHd7Cr2Fc3sf93FvxronF59auw9r91NhGoiOmcEkeCw2p0xVR6SXLaSfYn9vZa8vifL3eweNOBVtt+oMVe5vLG1JejJPZathWUBPvl1u+NrsRe4fbh9mqjj71aV0z3DnEAbNlfEvmg1rwGb8w0LrfxxL6nqDw4yLZq1+Hm3cljfAFUageHKhW6dh27p7Lm1L31DJ4Gatg7JNhJS3Bw2N9iT6rWYm+Pif1ohZu/LdjIQ4u2tNpjVFY7F3BIlTGHd8L+DeAyqay4hSSx1xJopSfH2aKufjgwnUBKnD3Yr57stHFcZhLdO8Tz6cb9kQ/K666qiEnrDWm9Ih9DmHy/q6jey43yt9ArsfOedxwrxz4G1X/y+7xV4w4WG+i60/YeLjVe17QEBx0S7BRs5W64AAAgAElEQVSVuSl1eUh02khqx10xX249iMen+SHvSKsdQBQ4QAlCPEhp+b/hqRPh773aROMlXOc8PVMptVkptU0pdUc4tmmWdXuOkOiwcsbAzqzbcySq+uUCXS2p8fYag6dKKU4f2Ikvtx2kwh3hsjivy+hfBzjzr3Bp3T7ktmLVrsPE2S3k9krj+2a32I1k0bdrOrusWXzsHm5M0Rvg81Qrd6x/2t5AV0xagiPYYi+pMBJ7e+5jX7z5ADaLsa++2FzQKo/h8vjI8J/kPaRpBbZ9aiy9Lsj7LgyRta5wnPPUCjwBnAUMAi5TSg0KdbtmWbe3mMHdUsnukcrBEldUHdgQGBxNibcF+9gDH/xJAztT4fbx9Y8HIxuU19VuathX7Soiu0cHcnt3ZP3e4mZ9SboqjK6YQT0zOa1LJZatH4O7WsuwGUeeBqbq7ZBoD7bYSyo9JDttJDqN+0ZbV4zWmvdW7+GFb3biO4aZE30+zeLNBUwZ3IXOKU4Wb26dfnaX10fPjvFACCWPxXuhYCOMv8V4LQNJPoqFo9xxNLBNa70dQCn1GjAN2BCGbdfw1baDeFa9hNNds4ukJK4bOztNAuCEPW9j99acya04PotdmcYA1sDdr2PVVZUb+1OHUZCaTWrpTnoeWsqo/AKGZ3VgUFEKT9GVxz7byvC4ffQo/LpOPNu6nEOFoyPpRzfR9XDdb/HN3c7HbUsm88g6Oh/5vs76jd0vwmuNo3PR92QWr6uzfn2Py9AWG10Ll5NesplDJZVcZS2i79YNpJV7gGxjcG3bIsYd2MC1ji3sXfA5365MwmtxsrHHxQBkFXxBSnnNE0C4rYls7j4dgN4HFpFUUXWuzX6dkknP7MK6zHNIcFjpu+9j8vN28lNhKTaLhZyeqVhSuvJm5Vhy8g/T1aN4fel2hv70PFkHl/BT5inBbRUl9CYvYzwAg3e/gqp15GVhUj/2dhwL2seQ3XVb+weTB7EvbSRWbyUD97yBT9nZ2vUc3LbkGrer+RwVOzpNojSua3B9d72fM20r8Po0a/KO4PZXS+zpdibnjR+FLtjMmH2vMKZPOj0q4nGr3Sx9/hsScy9lXM5gCratZPuy+WhqJrFv40/llYonuK//YLodeZGzd96He3VHVGUx8ytz6GQbhcXiYe3S7Viy/gYo+r7xKHHuql8ECQUlnGeNJ9l5NmkJDk4uXkCCLmPk4TScy5fwK/smXOs3817HS5iW092YJdJdQV5ROXmHy/z7uQ95GSfjtFu5zDcfK17W5B0JfjkVJvVnb8cxzd7PtQU+J3ZPKSfsfZv8ogrW+cegXluWRHy/iRxKGYjTXUS//A/q3H9Px7EcTupHfGUBx+3/mHKXl6llB5lm78ZZHct4eks/nlmaRmJFPn0OLKpz/52Zp1ES352Usl1kHfyizvrtnSaT3q0PUzKLKN/4P7ZWppE9+Re4PD4m2dcx1rGehJWf821eEgA+ZWdDz0sB6HlwKallO2tsz21NYHP3n1FS6WHQstuYDDy6fwgXpwwlfsUbLHKNrvF8atudPp4jib3rPJ+C5MHknHw2PTsm1LlPOIUjsXcHdlf7Pw8YU/tGSqnZwGyArKysY3qgj9ft44p1c+lv2VPj+i+82dzr7gbAV84n6a4O1Vg/3zuae92dAFjjfJRUVTUAssuXyQTXI8yz/4Ox1tWMtfifQR4MTriXl5ftosKyhH86nq4Tzz3rOrFJZ/EL6/+4wP58nfW3rMsiT3fiWuuHTLO/Xmf9r9f25zAp3GJ7j/Ns79VZ/4u1w3Bh527b25xj+x8A59iB5dDNYiM1/lV6ZSTC529jW/0yt1mAYuOvSCcyY81QAJ60v8JY6/Ia287TGVyxeiAAL9hfYKy12jzqW4HMgVxf1pUBXVJ42jWXrruXEUyTm6GoYw637U2gj5pIKrmsnr+R2db9/M6+iq5Fq4Kbesd7Eve6MwDY6HyEeFWzHPIlzyTu9aRiwcf2uAfr7IOnPedynyeBFEr5wb/+/fWH+NA7lkftj/OCdzLf+gay3vlbrKoq6a7fuIF7PVfQlUOU4WSkZQtnOR7ABlQ/K+ll6+OxpfVg8KFl/M76EuwCdsEf7EAeXLCzE88cfzyLPpnPZfv/WSe+31V2ozwuixF9OrFzRxrsBPv8mwBY4PoN//ONNm64bmPwPgsczzDIUnUawbHA4PhslPozA7omczLv0Mt6IPg+vNMKCw+MZPZrJzCmTzpdFv8Nyg7SA+jh38bb3pO5150OwOUJ96B8leRUi/NFz+nc60lpZD+fV2c/V/dP94U85rX
TlUN8419/QeB4qSL489dlzPPC8SqPRc6697/NfQ1veD3kqG28619/qh3YANnAx54buXd+J8ZZ1vGKo5741tv43JfDZMt3zHU8VDe+dUl8pwv58sz99Pj8j2QDpSdOw+XxMdqzhJMtC4KfDYBiHc/lPwwD4DH7K4y1fltje/t0GlesNjoe3k4pZp8jiyc2OCnU2dxjf4FlX37CG14Pw9Q23qvn+b603s2HvhPrPJ+nPOeRfMKEVk/sKtQ+ZKXURcAUrfXV/v+vAEZrrW9o6D65ubl6xYoVDa1uUIXbi7u8uObgFBh9lw7/ASWVR+ve0WIFe0Kd9fbVz2Pf8BZlMz7A+fm9aGcy3nE3kmA3vu9cljgqfcrobqivPtmeYGy7ofWORCM2T2X99d3B9RX1D8g4koz+2mrrHTaF0+r/aR/nPxuPuxy8bjS65mnUnP5Wrbus/smnAutdpcFBvd+/u5Yt+0t4fc44sv/6NScfn8FLvxjCRU99SVqig8OlLjxeTXpKPOsOePjfbyZgqd6hV3v/W2xgj69/XfX1WoOrnqoeq92Yh0VrVFkBSY8NovKUP+IecD5JT4+g/KxH8PUYTcK8iVSc+RCe/mejyg6iEzLAmUzi44PJy5jAaZvO4+1rRrDncDm3vrmGV64ZS6/0RCY+9C1nDetJ92Qb//psHUtvO40O8XYqvV627i9h6tzVzDmlH88v3cplIzpx0+n9asZnT8DhsOO0Wflp82oOvXwV5MxgWcIpPPjFbr7+3ZnE2Wv1eLpK6ryH4x12bPHG61FafBjQJDqM96HH5+Oj9Qe54a3NLLhxPIPSFWjN+H8s5pQTMrntzAFgsaHt8Zzx4BeM6+Ggc0o8r6/YxSe/mYjTbmnRfq5/vcOYNkH7wFWK3aqIsxnvw0qvF5e2G+t9XuP9VpvNaWzD5zHer9R8L5drB57AlAvuekpN7fHGc/C6653zvkw7mPLo1/RItnLywTe4w/4aB6/bwvRnNzCmRzz3XTCQMletz0AzPh8WpUhURndjpVZG+aSrxHguVkcjzzfO2Ke1n4/VTnx84jHP+6OUWqm1zm3qduFosecBPav93wPYG4bt1hFntxJnT2viRvUf3Vfv+om/hYm/JRlgat3WmMP/B3agsSMRm7O+MSGut8eDPR4FJMfVsz4utfH7x1XNyNivZzfe27iFFXuNL6JytxcciRR6nHSKT+Fn2Rnc8fZaOFDCb0/vX/cw98b2f1OvTXxTr103sNhwekpwYnyY4ncvhTGz4K59xGuf8WFKrbYdrxuHMw4PNkqI56iGEhLo2DGd1A4JjOyTybLth+iSGkfPLl3okGa0ep3AkL5p5GTt4qkvfkRrCz8bN5Dk1PpPbQfQ7fhsTvf+H1fF92V3YRld01LI9M//UvN5NP4eTkypud4GZHQ0Ek9RuQucGXh9mrxyG2lp6SRXe76nDejE+6v30jHJw7DjepKRkVH3AZraz02ur/mcnP4/gx2o701YfX183U02sb7m/eu2dpOBX088jr99tInhVuPxKysr6ereRXb5YayOkSTHN/A5aurz4U+TTsBps9Z6/Y7t+ba2cFTFfAf0U0r1UUo5gEuB98Ow3chxldb9FRCjBnUzEtfbq4zurnJ/K6fC7SPObmVqTjeSnTasFsWlo3s2uJ1WoZRxztCKI8Y85wDr3jReO4u16oTa3z4N6942LnvdWG2O4HMp85+0IsHfGh7bN53tB0v5bmchY/um13nIS0dloTUM69mBgV0bTuoAdquFvhlJbDtwlG37S+jXKSkMT9oQmDPmSLWTXmsNHRNrDlxPGtCZUpeX3YXlTBrYOWyP3xbMHNeb7B6p9OlkJOrKygrGeZZzxe672/wR0S0VcmLXWnuA64H/ARuBN7TW60PdbsR8cjf8tRvMPcXsSKJCIHkt3GDUxFd4AondS5zdQoLDxm1nDeCmSf3onNJYS6WVXL0ITvs9VBZXXTf/Zlj816r/V/wHNvjHLLwubA4j+ZW5vJT5BxMT/HPqBJK526sZ27duS/XcYV0Z1DWFX0/s26zwju+cxMb8o2w/WMLxncOf2Iv8lVGH/KfSS0+q2Xo+6fgMnDbjYz1pQKewPX5bEGe38v71J9Nv4mWcU/lXSqwdUIHyUktsTaAWlknAtNYLgAXh2FbE2fzJKbGen6wxqEtKHGkJ9uD84BX+Fnu520u83UiGV4w18SCk9OOMpao22diKeTXnQg/05QL43Fjtxmtc7vJS7vJiUQST36BuKSQ7bZS4PIzuUzexJzhsLLhpfLPD69cpifk/5PsvJzdx6+brEG98OQVKIw+V+BN7rRZ7vMPKGYM6k3+kgm4dIt8FEA2syZms170p81nRPjdYaXBWzfYqtp5tfbpmG8uOzWuRtXdKKQZ2TeHrH43KogqPD621kdgjPXNkfTa8bwzCDp8BN2+CBwcY12f0r7pNYDIyreHM+6DjUFhymFKXh9JKLwkOW3DWTqtFccqATuQXldMhIfR6/P6dk6tdDl+LPc5uwWGzBI8+LvS32Dsm1Y35wYtz8MVw12JqeR4/t36C6+jxWLQXHxYsltg6yF4Se/8zjQ9/zgyzI4kag/yJPSPJSWmlB5fXh9bGT13T/fA6FO4wEntyl6rr04+vumzxJ3alYMyvcLi9wMeUubyUuz11vqDuvzA7bEMs1fvVj8sMX2JXStEh3s6RciOhHyo1qrDSE+sOzjpssZXEautwZCP32p/ls6Jp2PHiU7aYmzsl1p5vXRYrjP11VemgICfLqJIZ07cjFR4vFS6jFDIqEntg8PTbp+HlC2HyX4zra7TY/ZOReT2wfz1O9xEsKjB46g32rwfE2a1h+zXSKz0Rm0XRvUN88KjgcAkclQpVXTFpcvKNOuz+MZXS8gqe90xm/sj/mBxR5EliF3WcPaQrn/x2AoO6pqB11eRjdeqxzRCXYgycHtgA+T9Axz6QmgUZ1erLZ/wXrngHKorgqXGotW+S4LAZg6euqrGC1uCwWTihSzJDuoe/oZAaX5XYC0tddEiwy3lQ6+Gw+08nWF5BPukUp2ebHFHkSVeMqMNiUfTrnMzSrca8M4X+SapaMyE2mzPF6GOvKDKS/IBzjL/qHP4657JCY2m1E++wUu72UF5Piz3cnpmZi6MVEm5qvIM9RcbBLodKK+sMnAqD3Wkk9rKKcnLVdo7fvxuYY25QESZf96JBge6Jw9GU2ONSAW1MzORsoFW8+hVY+mC187E6SHRYKa30BqfDbU1dU+PrlCGGQ4cEO0f8r8WhEle9/esCnA5jv1RUVPAz61JyNj5gckSRJ4ldNCjQ9RI4EURU9LGPnAW37TAqXho6YnDrQiO5B6ZpsNqJ93fFlLdyV0xr6hBvD9axF5a66hycJAy2nrlMcj3EWvpjV96qA9diiHTFiAYFEmDgRBBRkdid/kqT9OMgMbP+21gdRms92GK3k+Dviqlv8LSt6JBgp8zlpdLjpbDUxag+ktjr5UjggKM7HSstWPGiY6yGHSSxi0Y4/Yk80GKPijr2wz8ZByRNvL3qYKXaApMvpXSFaU9CtxEkOP
ZRUmkk9nhH23zbpyZUHaRUWOYiQ1rs9Ssp4BrLB6w6Og473pg76hSkK0Y0IthiL4uiqpiyg/DVw3Bwa8O3CbTY49OMeve0XsTbrf4jTz1tt8Xun8Rq58HSeueJEX6lB7jR9yKdK37EFqMt9ij4pIpoFUjs0VUV4+9Xf/USWPVi/bex2I3pZSuOwK5lUFFMgsNKqctDmbttd8UAbD9onEimYysM0LYL/kTucrm4x/0Ldpw+1+SAIk8Su2hQXK2umKjoY69+IFl984YDnPV3uG077F0N8ybDvrUkOG0cLnWjddXMjm1NYL6YHw8Yz1u6YhoQaKF73ewjHdWxj7nxmEASu2hQdA6eVkvsDVXFBE4mHayKcZBgtwZPCN1WW+yBE5iv3WOckq41SirbBX8VjE15OdvyLR13ts35CUMhiV00qHa5Y1R0xdirTRXcUB375o/g3euqVcXYaiTzqBgEPgaBk5os21FIr/SEsM733q74W+w2vMy0LSRtfd3TVrZ3kthFg+IcVYOnFgV2qzI5Ir8ZbxnLhub32b8OVr9Udcoyq6NGJUxbbbEnO21Y/C/BzBN7Y7FEyesRbZI6c2eft3jbOx4bXlQM1rGHlNiVUvcrpTYppX5QSr2jlOrQ9L1EWxE4p2VgLvbAVLemS0iD406DpC71rw+Ut7mMQUasjhrJvK0mdotFkRpvJ9Fh5cLcHk3fIVZZrLjiM6nEgQ0vFknsLfYJMERrnQ1sAe4MPSQRLexWhdXfKoyK/vWAn76B4yZBZv/611v9g4rdR8JFz0NylxrdL/H2tjl4CjChfybXnno8KXGxl6yazevm3EPzGK02Gi12W+ztq5De4VrrhdX+/Ra4MLRwRDRRShHvH3SMqsS+daFx5vdx19e/PtBCS+oMXYYAkOioqqBJdEbRc2mhRy4dbnYI0U/7OHX/c3xnuVha7GFwJfBRGLcnokBgADWqBhx3fAF5y8HnrX+9PcGomCncDts+BY+rXXTFiGYKDp76+Ln7LtR5D5scUOQ1mdiVUouUUuvq+ZtW7TZ3AR7g5Ua2M1sptUIptaKgoCA80YtWF2ipR8VRp7VZGkjQw2fAHbtgz0p4aTq4y2p2xbTROnbRTBYrGoVNeSi2pqFi8HzGTb7DtdanN7ZeKTUTOBeYpHXDJxjTWs8F5gLk5ubG7gkZ25hAYo+KUseAGW/Brq+bvl21aXsTHK7g1QnR9FxEq/ApG3a8XGVbAJsVnHCm2SFFVEhNF6XUmcDtwEStdVl4QhLRJD7YYo+iZNjvdOOvIXtWwlePgtN/YmmrnQRHVbdNVHUriVbhs9iw4uNK3octlphL7KH+vn4cSAY+UUqtVko9HYaYRBSJysTelJIDsOFdYwlgsQW7XywKnDF+sudY8L+zlvIPzyXY8FZNMRBDQq2KOb7pW4m2zBkYPG1Lid1arY7d6gClgt0vCQ5b9NTji1Zjj0/Gg82f2GOvKib2vspEi8RH8+BpQwJ17NkXw4T/B0CCM5DY29AXlDhmAzY+zlSLxoYHrLGX5mLvGYsWicrB06YEWmgdehpHqAIOqwWrRUlijxFddrzFeMvxWPHFZFdMG2qGCTMEW+xtKSE6EiClO+xbC1uMY+iUvztGSh1jg7LYsCkvV2a+BhNuMzuciJPELhoV6IIJzBvTJnQdBjdvgIIt8OFvg1fHO6zSYo8VVjt2vHgdycYXfYyRxC4aFWipt8kSQa+rxhnqEySxxwxltRNHJVcc/TfsWGp2OBEniV00KtgV05ZKBI/ug5cuNOaUsVadZei4zCT6ZiSaGJiIFCOxuzir+E3Y+73Z4URcG/q0CjMEB0/bUkvX64Jtn0BFUY0W+zMzc7ln6mATAxOR4r7qc65132T8E4OTgMlIkmhUmzxAqVorvfqHWurXY4fTbsWO/2jjGKyKib1nLFokOHjaFhP7qGtg5CxTQxHmsHzzGL9xfGX8Iy12IWpqk3XsgQ9yWu/gfOwixmxbxNm2deBDjjwVora4ttoVk3EC/PQ1dB4UPEhJxBCrnfQuveDqbWZHYgoZPBWNGt6zA6cN6MQJXZLNDqX5bE64fjkU74FvnzI7GmEGix18HrBYjL8YE3vPWLRIp5Q45s0aRWp8G/w563XXHEgVscNihaKf4IObIP8Hs6OJOEnson16fiocWB+TA2cCcKZAZQmsfA6O5JkdTcRJYhftU+CgFGmxx6YLnoIr/2dcjsEvd0nson0K1C7HYEWE8PN5jGUM1rGHJbErpW5RSmmlVOydNVZEJ6sDep0Mp/7O7EiEGVY+D+/+2rgsib3llFI9gTOAXaGHI0SYWB3QIQtSu5sdiTDDvh/g8A6wJ8Rkd1w4WuwPAbcBOgzbEiI8ug2DLR/Drm/NjkSYwWIDZyrclQ9ZY8yOJuJCSuxKqanAHq31mjDFI0R4XPISVB6FLf8zOxJhBosNfG6zozBNk51PSqlFQJd6Vt0F/A6Y3JwHUkrNBmYDZGVltSBEIY6B1sYHOwYrIgTG6+4ug7euhtP/FHNdck222LXWp2uth9T+A7YDfYA1SqmdQA9glVKqvi8BtNZztda5WuvczMzMcD4HIep680pjKYk9NsV1MJZr/wuuUnNjMcExDxdrrdcCnQL/+5N7rtb6YBjiEiI0h/xzhMTgwJkATv4NJHWGd+eAVapihGhfJLHHrkAfewweyxC2rzKtde9wbUuIkCWkQ1IXyJlhdiTCDBveh/dvMC7HYHectNhF+2SLg6RMiEsxOxJhhsD8MFZnTLbYJbGL9imtF+xbC/s3mB2JMEOglf7b9ZCYbm4sJpDELtqnUVcbywOS2GNSYBqBGK1ll8Qu2ievy1jGYP+qoOp1f/kiY17+GCOJXbRPC/9gLKUqJjbFpxnL/etAtaHTOoaJJHbRPpUWGMsYHDgTwIBzYPwtoOTUeEK0P9IVE7t87pj9YpfELtqn3uONZa9x5sYhzLHrW/jqEfBWmh2JKSSxi/bJHm/0rUqLPTZVFJsdgakksYt2SoP2QnG+2YEIMwTmh/nlx+bGYRJJ7KJ96jzEWFYUmRuHMEewjt1jbhwmkcQu2qfAB1rKHWNTYND0vzPNjcMkkthF+/Tdf4xl5VFz4xDmiEs1lmWHzI3DJJLYRfvUfaSxTIi9eUIE0HkQDL4AMvqbHYkpYm8GehEbzvgTjJwFHXqaHYkwi9dd1dceY2LzWYv2z2qHzNhsrQng8E+w6UOzozBNyF0xSqkblFKblVLrlVL/CEdQQggRkhithgkIqcWulDoVmAZka60rlVKdmrqPEEK0usCBaVMfNzcOk4TaYv81cJ/WuhJAa30g9JCEECJEgXLHGG25h5rY+wPjlVLLlFJfKKVGNXRDpdRspdQKpdSKgoKCEB9WCCEaERg0/fA35sZhkia7YpRSi4Au9ay6y3//NGAsMAp4QynVV2uta99Yaz0XmAuQm5tbZ70QQoSNPc7sCEzVZGLXWp/e0Dql1K+Bt/2JfLlSygdkANIkF0KYx5kMnYfGbLlrqF0x7wKnASil+gMO4GCoQQkhRMh8HqljP0bzgHlKqXWAC5hZX
zeMEEJElM8HBRuNvxgUUmLXWruAn4cpFiGECI8YPB1edbH5O0UI0f5Z7DDuerOjMEVsf60JIdovq92YLyYGSWIXQrRP7jL4JjaPPJWuGCFE+3TOg9Atx+woTCGJXQjRPo26yuwITCNdMUII0c5IYhdCiHZGErsQQrQzktiFEKKdkcQuhBDtjCR2IYRoZySxCyFEOyOJXQgh2hllxiy7SqkC4KdjvHsG0Tnne7TGBdEbm8TVMtEaF0RvbO0trl5a68ymbmRKYg+FUmqF1jrX7Dhqi9a4IHpjk7haJlrjguiNLVbjkq4YIYRoZySxCyFEO9MWE/tcswNoQLTGBdEbm8TVMtEaF0RvbDEZV5vrYxdCCNG4tthiF0II0Yg2ldiVUmcqpTYrpbYppe4wMY6eSqnFSqmNSqn1Sqmb/Nffo5Tao5Ra7f8724TYdiql1voff4X/uo5KqU+UUlv9y7QIx3RCtX2yWilVrJT6jVn7Syk1Tyl1QCm1rtp19e4jZXjU/577QSk1IsJx3a+U2uR/7HeUUh381/dWSpVX23dPRziuBl87pdSd/v21WSk1JcJxvV4tpp1KqdX+6yO5vxrKD5F7j2mt28QfYAV+BPoCDmANMMikWLoCI/yXk4EtwCDgHuAWk/fTTiCj1nX/AO7wX74D+LvJr+M+oJdZ+wuYAIwA1jW1j4CzgY8ABYwFlkU4rsmAzX/579Xi6l39dibsr3pfO//nYA3gBPr4P7PWSMVVa/0/gT+asL8ayg8Re4+1pRb7aGCb1nq71toFvAZMMyMQrXW+1nqV//JRYCPQ3YxYmmka8Lz/8vPA+SbGMgn4UWt9rAeohUxrvQQorHV1Q/toGvCCNnwLdFBKdY1UXFrrhVprj//fb4EerfHYLY2rEdOA17TWlVrrHcA2jM9uRONSSingYuDV1njsxjSSHyL2HmtLib07sLva/3lEQTJVSvUGhgPL/Fdd7/85NS/SXR5+GliolFqplJrtv66z1jofjDcd0MmEuAIupeaHzez9FdDQPoqm992VGC27gD5Kqe+VUl8opcabEE99r1207K/xwH6t9dZq10V8f9XKDxF7j7WlxK7quc7Ukh6lVBLwFvAbrXUx8BRwHJAD5GP8FIy0k7TWI4CzgOuUUhNMiKFeSikHMBX4r/+qaNhfTYmK951S6i7AA7zsvyofyNJaDwduBl5RSqVEMKSGXruo2F/AZdRsQER8f9WTHxq8aT3XhbTP2lJizwN6Vvu/B7DXpFhQStkxXrSXtdZvA2it92utvVprH/BvWuknaGO01nv9ywPAO/4Y9gd+2vmXByIdl99ZwCqt9X5/jKbvr2oa2kemv++UUjOBc4EZ2t8p6+/qOOS/vBKjL7t/pGJq5LWLhv1lA6YDrweui/T+qi8/EMH3WFtK7N8B/ZRSffwtv0uB980IxN9/9x9go9b6wWrXV+8XuwBYV/u+rRxXolIqOXAZY+BtHcZ+mum/2UzgvUjGVU2NVpTZ+6uWhvbR+8Av/JULY4EjgZ/TkVbpcUIAAAEUSURBVKCUOhO4HZiqtS6rdn2mUsrqv9wX6Adsj2BcDb127wOXKqWcSqk+/riWRyouv9OBTVrrvMAVkdxfDeUHIvkei8Qocbj+MEaPt2B8295lYhwnY/xU+gFY7f87G3gRWOu//n2ga4Tj6otRkbAGWB/YR0A68Cmw1b/saMI+SwAOAanVrjNlf2F8ueQDbozW0lUN7SOMn8lP+N9za4HcCMe1DaP/NfA+e9p/25/5X+M1wCrgvAjH1eBrB9zl31+bgbMiGZf/+ueAObVuG8n91VB+iNh7TI48FUKIdqYtdcUIIYRoBknsQgjRzkhiF0KIdkYSuxBCtDOS2IUQop2RxC6EEO2MJHYhhGhnJLELIUQ78/8BJOmqATbzV3YAAAAASUVORK5CYII=\n", 25 | "text/plain": [ 26 | "
" 27 | ] 28 | }, 29 | "metadata": { 30 | "needs_background": "light" 31 | }, 32 | "output_type": "display_data" 33 | } 34 | ], 35 | "source": [ 36 | "#!/usr/bin/python\n", 37 | "# -*- coding: utf-8 -*-\n", 38 | "\n", 39 | "print(__doc__)\n", 40 | "\n", 41 | "import numpy as np\n", 42 | "import matplotlib.pyplot as plt\n", 43 | "import time\n", 44 | "\n", 45 | "from sklearn.linear_model import Lasso\n", 46 | "from sklearn.metrics import r2_score\n", 47 | "\n", 48 | "# 用于产生稀疏数据\n", 49 | "np.random.seed(int(time.time()))\n", 50 | "# 生成系数数据,样本为50个,参数为200维\n", 51 | "n_samples, n_features = 50, 200\n", 52 | "# 基于高斯函数生成数据\n", 53 | "X = np.random.randn(n_samples, n_features)\n", 54 | "# 每个变量对应的系数\n", 55 | "coef = 3 * np.random.randn(n_features)\n", 56 | "# 变量的下标\n", 57 | "inds = np.arange(n_features)\n", 58 | "# 变量下标随机排列\n", 59 | "np.random.shuffle(inds)\n", 60 | "# 仅仅保留10个变量的系数,其他系数全部设置为0\n", 61 | "# 生成稀疏参数\n", 62 | "coef[inds[10:]] = 0\n", 63 | "# 得到目标值,y\n", 64 | "y = np.dot(X, coef)\n", 65 | "# 为y添加噪声\n", 66 | "y += 0.01 * np.random.normal((n_samples,))\n", 67 | "\n", 68 | "# 将数据分为训练集和测试集\n", 69 | "n_samples = X.shape[0]\n", 70 | "X_train, y_train = X[:n_samples // 2], y[:n_samples // 2]\n", 71 | "X_test, y_test = X[n_samples // 2:], y[n_samples // 2:]\n", 72 | "\n", 73 | "# Lasso 回归的参数\n", 74 | "alpha = 0.1\n", 75 | "lasso = Lasso(max_iter=10000, alpha=alpha)\n", 76 | "\n", 77 | "# 基于训练数据,得到的模型的测试结果\n", 78 | "# 这里使用的是坐标轴下降算法(coordinate descent)\n", 79 | "y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)\n", 80 | "\n", 81 | "# 这里是R2可决系数(coefficient of determination)\n", 82 | "# 回归平方和(RSS)在总变差(TSS)中所占的比重称为可决系数\n", 83 | "# 可决系数可以作为综合度量回归模型对样本观测值拟合优度的度量指标。\n", 84 | "# 可决系数越大,说明在总变差中由模型作出了解释的部分占的比重越大,模型拟合优度越好。\n", 85 | "# 反之可决系数小,说明模型对样本观测值的拟合程度越差。\n", 86 | "# R2可决系数最好的效果是1。\n", 87 | "r2_score_lasso = r2_score(y_test, y_pred_lasso)\n", 88 | "\n", 89 | "print(\"测试集上的R2可决系数 : %f\" % r2_score_lasso)\n", 90 | "\n", 91 | "plt.plot(lasso.coef_, label='Lasso coefficients')\n", 92 | "plt.plot(coef, '--', label='original coefficients')\n", 93 | "plt.legend(loc='best')\n", 94 | "\n", 95 | "plt.show()" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [] 104 | } 105 | ], 106 | "metadata": { 107 | "kernelspec": { 108 | "display_name": "Python 3", 109 | "language": "python", 110 | "name": "python3" 111 | }, 112 | "language_info": { 113 | "codemirror_mode": { 114 | "name": "ipython", 115 | "version": 3 116 | }, 117 | "file_extension": ".py", 118 | "mimetype": "text/x-python", 119 | "name": "python", 120 | "nbconvert_exporter": "python", 121 | "pygments_lexer": "ipython3", 122 | "version": "3.5.4" 123 | } 124 | }, 125 | "nbformat": 4, 126 | "nbformat_minor": 2 127 | } 128 | -------------------------------------------------------------------------------- /Regularization/README.md: -------------------------------------------------------------------------------- 1 | ## 目录 2 | - [1.L2正则化](#1l2正则化岭回归) 3 | - [1.1问题](#11问题) 4 | - [1.2公式](#12公式) 5 | - [1.3对应图形](#13对应图形) 6 | - [1.4使用场景](#14使用场景) 7 | - [1.5代码实现](https://github.com/mantchs/machine_learning_model/blob/master/Regularization) 8 | - [2.L1正则化lasso回归](#2l1正则化lasso回归) 9 | - [2.1公式](#21公式) 10 | - [2.2对应图形](#22对应图形) 11 | - [2.3使用场景](#23使用场景) 12 | - [2.4代码实现](https://github.com/mantchs/machine_learning_model/blob/master/Regularization) 13 | - [3.ElasticNet回归](#3elasticnet回归) 14 | - [3.1公式](#31公式) 15 | - [3.2使用场景](#32使用场景) 16 | - [3.3代码实现](#33代码实现) 17 | 18 | ## 
1.L2正则化(岭回归) 19 | 20 | ### 1.1问题 21 | 22 | ![](https://images0.cnblogs.com/blog/663864/201411/081949249876584.png) 23 | 24 | 想要理解什么是正则化,首先我们先来了解上图的方程式。当训练的特征和数据很少时,往往会造成欠拟合的情况,对应的是左边的坐标;而我们想要达到的目的往往是中间的坐标,适当的特征和数据用来训练;但往往现实生活中影响结果的因素是很多的,也就是说会有很多个特征值,所以训练模型的时候往往会造成过拟合的情况,如右边的坐标所示。 25 | 26 | ### 1.2公式 27 | 28 | 以图中的公式为例,往往我们得到的模型是: 29 | 30 | ![UTOOLS1546959038274.png](https://i.loli.net/2019/01/08/5c34b8bda06d3.png) 31 | 32 | 为了能够得到中间坐标的图形,肯定是**希望θ3和θ4越小越好**,因为这两项越小就越接近于0,就可以得到中间的图形了。 33 | 34 | 对应的损失函数也加上这个惩罚项(为了惩罚θ):假设*λ*=1000 35 | 36 | ![UTOOLS1546959169901.png](https://i.loli.net/2019/01/08/5c34b941f2bf5.png) 37 | 38 | 为了求得最小值,**使θ值趋近于0**,这就达到了我们的目的,得到中间坐标的方程。 39 | 40 | 把以上公式通用化得: 41 | 42 | ![UTOOLS1546959221738.png](https://i.loli.net/2019/01/08/5c34b975cf88d.png) 43 | 44 | 相当于在原始损失函数中加上了一个惩罚项(λ项) 45 | 46 | 这就是**防止过拟合**的一个方法,通常叫做**L2正则化,也叫作岭回归。** 47 | 48 | ### 1.3对应图形 49 | 50 | 我们可以简化L2正则化的方程: 51 | 52 | ![UTOOLS1546959273104.png](https://i.loli.net/2019/01/08/5c34b9a91663f.png) 53 | 54 | J0表示原始的损失函数,咱们假设正则化项为: 55 | 56 | ![UTOOLS1546959466689.png](https://i.loli.net/2019/01/08/5c34ba6b35b6a.png) 57 | 58 | 我们不妨回忆一下圆形的方程: 59 | 60 | ![UTOOLS1546959496318.png](https://i.loli.net/2019/01/08/5c34ba88553fb.png) 61 | 62 | 其中(a,b)为圆心坐标,r为半径。那么经过坐标原点的单位元可以写成: 63 | 64 | ![UTOOLS1546959601400.png](https://i.loli.net/2019/01/08/5c34baf161c01.png) 65 | 66 | 正和L2正则化项一样,同时,机器学习的任务就是要通过一些方法(比如梯度下降)求出损失函数的最小值。 67 | 68 | 此时我们的任务变成在L约束下求出J0取最小值的解。 69 | 70 | 求解J0的过程可以画出等值线。同时L2正则化的函数L也可以在w1w2的二维平面上画出来。如下图: 71 | 72 | ![UTOOLS1546953455440.png](https://i.loli.net/2019/01/08/5c34a2efa0b01.png) 73 | 74 | L表示为图中的黑色圆形,随着梯度下降法的不断逼近,与圆第一次产生交点,而这个交点很难出现在坐标轴上。 75 | 76 | 这就说明了L2正则化不容易得到稀疏矩阵,同时为了求出损失函数的最小值,使得w1和w2无限接近于0,达到防止过拟合的问题。 77 | 78 | ### 1.4使用场景 79 | 80 | 只要数据线性相关,用LinearRegression拟合的不是很好,**需要正则化**,可以考虑使用岭回归(L2), 如何输入特征的维度很高,而且是稀疏线性关系的话, 岭回归就不太合适,考虑使用Lasso回归。 81 | 82 | ### 1.5代码实现 83 | 84 | [GitHub代码--L2正则化](https://github.com/mantchs/machine_learning_model/blob/master/Regularization/RidgeCV.ipynb) 85 | 86 | ## 2.L1正则化(lasso回归) 87 | 88 | ### 2.1公式 89 | 90 | L1正则化与L2正则化的区别在于惩罚项的不同: 91 | 92 | ![UTOOLS1546959770276.png](https://i.loli.net/2019/01/08/5c34bb9ac2621.png) 93 | 94 | L1正则化表现的是θ的绝对值,变化为上面提到的w1和w2可以表示为: 95 | 96 | ![UTOOLS1546959815505.png](https://i.loli.net/2019/01/08/5c34bbc779e82.png) 97 | 98 | ### 2.2对应图形 99 | 100 | 求解J0的过程可以画出等值线。同时L1正则化的函数也可以在w1w2的二维平面上画出来。如下图: 101 | 102 | ![UTOOLS1546955675245.png](https://i.loli.net/2019/01/08/5c34ab9a546f7.png) 103 | 104 | 惩罚项表示为图中的黑色棱形,随着梯度下降法的不断逼近,与棱形第一次产生交点,而这个交点很容易出现在坐标轴上。**这就说明了L1正则化容易得到稀疏矩阵。** 105 | 106 | ### 2.3使用场景 107 | 108 | **L1正则化(Lasso回归)可以使得一些特征的系数变小,甚至还使一些绝对值较小的系数直接变为0**,从而增强模型的泛化能力 。对于高纬的特征数据,尤其是线性关系是稀疏的,就采用L1正则化(Lasso回归),或者是要在一堆特征里面找出主要的特征,那么L1正则化(Lasso回归)更是首选了。 109 | 110 | ### 2.4代码实现 111 | 112 | [GitHub代码--L1正则化](https://github.com/mantchs/machine_learning_model/blob/master/Regularization/LassoCV.ipynb) 113 | 114 | ## 3.ElasticNet回归 115 | 116 | ### 3.1公式 117 | 118 | **ElasticNet综合了L1正则化项和L2正则化项**,以下是它的公式: 119 | 120 | ![UTOOLS1546959876945.png](https://i.loli.net/2019/01/08/5c34bc051086c.png) 121 | 122 | ### 3.2使用场景 123 | 124 | ElasticNet在我们发现用Lasso回归太过(太多特征被稀疏为0),而岭回归也正则化的不够(回归系数衰减太慢)的时候,可以考虑使用ElasticNet回归来综合,得到比较好的结果。 125 | 126 | ### 3.3代码实现 127 | 128 | ```python 129 | from sklearn import linear_model 130 | #得到拟合模型,其中x_train,y_train为训练集 131 | ENSTest = linear_model.ElasticNetCV(alphas=[0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10], l1_ratio=[.01, .1, .5, .9, .99], max_iter=5000).fit(x_train, y_train) 132 | #利用模型预测,x_test为测试集特征变量 133 | y_prediction = 
ENSTest.predict(x_test) 134 | ``` 135 | 136 | . 137 | 138 | . 139 | 140 | . 141 | 142 | ![image.png](https://upload-images.jianshu.io/upload_images/13876065-08b587647d14267c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) 143 | 144 | 欢迎添加微信交流!请备注“机器学习”。 145 | 146 | -------------------------------------------------------------------------------- /Regularization/RidgeCV.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "系数矩阵:\n", 13 | " [1.57719167]\n", 14 | "线性回归模型:\n", 15 | " RidgeCV(alphas=array([ 0.1, 1. , 10. ]), cv=None, fit_intercept=True,\n", 16 | " gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)\n" 17 | ] 18 | }, 19 | { 20 | "data": { 21 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEKCAYAAAD9xUlFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3X2U1dV97/H3B0QxV1AayRULOsmtRCim2I5eC7mNUWgMEszqtalJNXXqlbV0xRh11V7biDI02phErLk2CTEBY2IeV01wFBOikkRQzFCVyKAk0cGHEnlQRCMgOt/7x3mY8zTMYeb8zuPntRZrze/32+fM9ydyvue792/vrYjAzMwMYEStAzAzs/rhpGBmZllOCmZmluWkYGZmWU4KZmaW5aRgZmZZTgpmZpblpGBmZllOCmZmlnVQrQM4UEceeWS0tbXVOgwzs4aybt267RExfrB2DZcU2tra6O7urnUYZmYNRdLmctq5+8jMzLKcFMzMLMtJwczMspwUzMwsy0nBzMyynBTMzCzLScHMrAYKd72sl10wnRTMzKps8cpNdHb1ZBNBRNDZ1cPilZtqHJmTgplZVUUEu/bsY+nq3mxi6OzqYenqXnbt2VfziqHhZjSbmdWbiEDSgMe5JLFg7lQAlq7uZenqXgA6ZraxYO7UAV9XLa4UzMyGYShdQbmJIaMeEgI4KZiZDdlQu4Iy7XLlJpZacveRmdkQDaUrKDdxZNpljqH2FUPilYKkkZIeldQ1wPWPSOqRtEHSHUnHY2ZWSQfaFSSJsaNH5SWOBXOn0jGzjbGjR/W/butWOOkk+PGPk76FPNWoFC4FNgJjCy9IOg64CpgZES9LekcV4jEzq5iBuoL2lxgumz05bzA6kxiy7a+4Am68MfXzgw/CBz6QWPyFEq0UJE0EzgRuHaDJhcAtEfEyQERsTTIeM7NKKuwKeub6OXTMbMsbYxhIYcKQBE8+CVJ/Qrj+eli0KMlbKJJ0pXATcCUwZoDrkwEkrQZGAtdGxL0Jx2RmVhEDdQUB+V1Bg4mAEQXf0XfuhMMPr3DEg0ssKUiaC2yNiHWSTt3P7z8OOBWYCPxC0rSI2FnwXvOB+QDHHHNMUiGbmR2wQbuCBvOZz8CnP91//I1vwHnnJRBpeZKsFGYC8yTNAUYDYyV9MyLOzWnzPPBwROwDnpH0FKkk8cvcN4qIJcASgPb29to/s2VmlqNkV9Bgdu6EcePyz/3+9/C2t+138lvSEhtTiIirImJiRLQB5wD3FyQEgB8C7weQdCSp7qSnk4rJzKwuSHkJ4b7zLiX6+rIJITX57am8l1RrDkPVJ69J6pQ0L334Y2CHpB7gAeAfImJHtWMyM6uKX/0qlRByLPzRr7jg6NlFk99W9rxIX18fUN0F86oyeS0iVgGr0j8vyDkfwOXpP2ZmzauwO+gLX4DLL2dBBEh5k9+mThhDz5ZXWXT3xrzJbR0z2xLvWvKMZjOzJN1xB/zt3+afy+kKygxMZxICQNcl72XR3RtrsmCe1z4yM0uKlJ8QVq3KSwhQevLbors3cvWZU/LOVWv5CycFM7NKu/ji4u6iCHjf+wpODTz5be4XH8xrW60F89x9ZGZWKXv3wujR+edeeAGOPrpk81KT364+cwprn95Bz5ZXa7JgnpOCmVklTJoEzz/ff3zccbBp8KeFCie/jRgxgllT/jv/811vH94s6SFSPazffSDa29uju7u71mGYmaU89RQcf3z+ub174eCDh/W2B7KbWzkkrYuI9sHaeUzBzGyopPyEcPHFqbGDYSaE1FsPYZZ0Bbj7yMzsQH3nO/DRj+afa7Bel4G4UjAzOxBSfkK48camSQjgSsHMrDx///ewdGn+uSZKBhlOCmZm+9PXByNH5p976CE45ZTaxJMwJwUzs4EcfDDs25d/rgmrg1weUzAzK7R1a2rsIDch7NjR9AkBXCmYmeUrfPRz3Dh46aXaxFIDrhTMzAB+8YvihPDWWy2VEMBJwcwslQz+4i/6jy+6KNVVNKL1PiITv2NJIyU9KqlrP23OlhSSBp2CbWZWMdddV3o103//99rEUweqMaZwKbARGFvqoqQxwCeBtVWIxcwspTAZ3HknfPjDtYmljiRaKUiaCJwJ3LqfZouAG4A9ScZiZgbASSeVrg6cEIDku49uAq4E+kpdlHQiMCkiBuxaMjOriNdfTyWD3FWWf/vblnjM9EAk1n0kaS6wNSLWSTq1xPURwGLg/DLeaz4wH+CYY46pbKBm1vxKrTDqZFBSkpXCTGCepF7gO8Bpkr6Zc30MMA1YlW5zCrC81GBzRCyJiPaIaB8/fnyCIZtZU3nyyeKEsGePE8J+JJYUIuKqiJgYEW3AOcD9EXFuzvVXIuLIiGhLt3kYmBcR3kHHzIZPgilT+o9POy2VDA45pHYxNYCqP4QrqVPSvGr/XjNrEd/6VumB5Pvuq008DaYqy1xExCpgVfrnBQO0ObUasZhZEytMBjfdBJdeWptYGpTXPjKzxvfxj8Ptt+ef87jBkDgpmFnjeustOKjgY2ztWjj55NrE0wScFMysMY0YUVwNuDoYttZb7cnMKi4KPowLjyvqxRdTYwe5v+Oll5wQKs
RJwcyGZfHKTXR29WQTQUTQ2dXD4pWbKv/LJDjqqP7j8eNTyWDcuMr/rhblpGBmQxYR7Nqzj6Wre7OJobOrh6Wre9m1Z19eohiWVauKnyzq60vtkGYV5TEFMxsySSyYOxWApat7Wbq6F4CpE8Zw9ZlTkJRNFGNHj+Ky2ZOH8kvyjy+5BG6+eZiR20BcKZjZsOQmhoyeLa+y6O6NA1YOZensLD0JzQkhUa4UzGxYMh/8uaZOGJNXOXTMbCtKHPtVmAx+9COY54UQqsGVgpkNWW4l0DGzjWeun0PHzDZ6trya1y6TEAYdgJ4+vXR14IRQNU4KZjZkkhg7elS2EpDE1WdOYeqEMXntOu/qYeFdGwbuRvr971PJ4PHH+88984wfM60Bdx+Z2bBcNnsyEZEdVF5090Z6trxKx4w2gmDZms0sXdMLwPkzjs0mjyzvdVBXXCmY2bBlPuTzKocPTeWaD/1xXrtrPvTH/Qlh40bvdVCHnBTMrKIumz05bwwhV3aSmwRTcwaeZ8/2Xgd1wknBzBJRagB651e+jkYUfOxEwE9+UpsgrYjHFMys4koNQF8zb1p+o5tvTk1Es7qSeFKQNBLoBl6IiLkF1y4H/g/wJrAN+PuI2Jx0TGaWvOwA9MKFsHBh/kWPG9StalQKlwIbgbElrj0KtEfE65IuAm4A/qYKMZlZ0t56CxXudfDII3DSSbWJx8qS6JiCpInAmcCtpa5HxAMR8Xr68GFgYpLxmFmVSPmb32SWunZCqHtJDzTfBFwJ9JXR9gJgRakLkuZL6pbUvW3btkrGZ2aV9LvfFT9m+uqrqRVNrSEklhQkzQW2RsS6MtqeC7QDnyt1PSKWRER7RLSPHz++wpGaWUVIMGFC//FRR6Wqg8MOq11MdsCSrBRmAvMk9QLfAU6T9M3CRpJmAf8MzIuIvQnGY2ZJuP/+0nsdbNlSm3hsWBJLChFxVURMjIg24Bzg/og4N7eNpBOBr5BKCN4tw6zRSHD66f3Hn/pUqjootXSFNYSqT16T1Ckps+Th54DDgO9LekzS8mrHY2ZDcO21pVczXby4JuFY5VRl8lpErAJWpX9ekHN+VjV+v5lVSAQUzkhevhw+9KHaxGMV5xnNZi0ms6LpQMcDOuEEeOKJwjercHRWa177yKyFLF65qX9ROvo3ydnvxjevvZbqKspNCL29TghNyknBrEVEBLv27GPp6t5sYhh0/2QJxowpfCM49tjqBG1V5+4jsxaQ6SLKLGldav/kvC6kDRtgWsECdnv3wsEHVyliqxVXCmZNLrfLKLNdZq6SO6HlJoQPfCBVHTghtAQnBbMmVthl1NfXx9wvPpjXJjvGsGxZ6cdM7723egFbzTkpmDWxTJdRx8w2lq7u5V3/tIKeLa8ydcIYnr7ug9nzGjECOjr6X3jLLR5IblEqObhUx9rb26O7u7vWYZg1lIjgnVfdkz1++roPMmLECOKcc9B3v1vYuMrRWTVIWhcR7YO1c6Vg1uQyTxnl+pflvwIpPyF0dzshmJ8+MmtmuY+dZp8yKpyRnGpY/eCsLrlSMGtieXsl/9m44oSwc6cTguVxpWDW5C6bPbnoqaKYOBE991yNIrJ65krBrJl9//sl9zpwQrCBuFIwa1aFyeCss+CHP6xNLNYwXCmYNZvzzis9Cc0JwcqQeFKQNFLSo5K6Slw7RNJ3Jf1G0lpJbUnHY9a0MjuefTNn19vbbvNAsh2QanQfXQpsBMaWuHYB8HJE/JGkc4DPAn9ThZjMmkup/RAiUusdVT8aa2CJVgqSJgJnArcO0OQs4Lb0zz8ATldZu32YGQAvvVT8ZNHjj2cTwqB7JZgVSLr76CbgSqBvgOt/CDwHEBFvAq8Ab084JrPmIMHb8/+5tP1jF52bR5a3V4JZCYklBUlzga0RsW5/zUqcK/q/V9J8Sd2Surdt21axGM0a0n33FXcX7d5N9PVlF7h751X35M9idgFuZUqyUpgJzJPUC3wHOE3SNwvaPA9MApB0EHA48FLhG0XEkohoj4j28ePHJxiyWZ2TYNasvFMLlz9BHHJIeXslmA0isaQQEVdFxMSIaAPOAe6PiHMLmi0H/i7989npNq5zzQp98pPFYwd9fSxc/kR5eyWYlanqk9ckdQLdEbEc+Bpwu6TfkKoQzql2PGZ1r/Cb/gUXwK23Iii5vebUCWPouuS9LLp7Y/acKwYrV1WSQkSsAlalf16Qc34P8NfViMGs4QzwmGl+k9QmOpkPf4CuS97LiBEjsglj7OhRTghWNs9oNktQYddNWV05b7xRnBDuuafkJLRSeyUsuntjdj/mBXOnphbEMyuT1z4yS8jilZvYtWdftusm8wE+dvSogT+oy6gO+k8X75WQOQZ3GdnQuFIwS0BEsGvPvuwg8KDzBp58sjghvPjifpeoyNsrIZ0AMvsxu8vIhsp7NJslJDcRZJScN3AA1cFAvyf3/QqPzcB7NJvVXOabe66rz5zS/4H9pS+V3OvgQBewK0wATgg2HE4KZgkpNQg894sP0tfXl0oGF1/cf+HEE/tXOTWrIScFswQUDgI/fd0HmTphDFf+2+WMGDkyr+3C5U8Q6/a3GoxZ9fjpI7MEFA0CA/d86n15ba49fT669JN+SsjqipOCWUIumz05Neg7orggb/vH1J5TzzghWJ0ZtPtI0ickjatGMGZN5aWXihLC//v3u7IJAbw2kdWfcsYUjgJ+Kel7ks7wJjhmZSix18HC5U/w+c2iY2Ybz1w/J7vMtROD1ZNBk0JEfBo4jtTidecDv5Z0naT/kXBsZo3nZz8rfoJozx6I8EQzawhljSlEREj6HfA74E1gHPADSSsj4sokAzRrGIUf7IcdBq++mj3MjjGk22USgxOC1ZNyxhQ+KWkdcAOwGjghIi4C/gz43wnHZ1b/rryyOCFE5CWEDE80s3pXTqVwJPBXEbE592RE9KW33DSre4ktBVH4HldcAZ///PDf16xGBk0KufsflLi2sbLhmFXekFYrHczhh8OuXfnnPFhsTSCxGc2SRkt6RNLjkjZIWliizTGSHpD0qKT1kuYkFY+1pgNerXQwmb0OchPC/fc7IVjTSHLy2l7gtIh4TdIo4EFJKyLi4Zw2nwa+FxFfkjQVuAdoSzAmazG5i9LlbllZcrXSwd+s+JyTgTWZxCqFSHktfTgq/afwX1AAY9M/Hw78V1LxWOsqtVrpASWETZuKE8L27U4I1pQSXRBP0khJjwFbgZURsbagybXAuZKeJ1UlXJJkPNaaSq1WWvaEMQne/e7CNyyamGbWLBJNChHxVkRMByYCJ0uaVtDko8CyiJgIzAFul1QUk6T5kroldW/bti3JkK3JFK5WWvZM4q9/vSJ7HZg1mqosiBcROyWtAs4Ansi5dEH6HBHxkKTRpB6B3Vrw+iXAEkjtvFaNmK05DLRlJTDwTOLCc7Nnw09+UoVozWovsaQgaTywL50QDgVmAZ8taPYscDqwTNIUYDTgUsAqquyZxLNmwX335Z9zZWAtJsnuownAA5LWA78kNabQJalT0rx0myuACyU9DnwbOD+8MpglY
L8ziTM7nuUmhK99zQnBWlJilUJErAdOLHF+Qc7PPcDMpGIwG5QfMzXL4+04rTXt2FGcEDZtckKwlued16z1uDowG5ArBWsd999fnBD27nVCMMvhSsFaQ2EyGDcOXnqpNrGY1TFXCtbcrrii9F4HTghmJTkpWPOS4MYb+4+vvNJdRWaDcPeRNZ+3vQ12784/52RgVhZXCtY89u5NVQe5CeFnP3NCMDsArhSsOfgxU7OKcKVgje2pp4oTwo4dTghmQ+RKwRqXqwOzinOlYI3nq1/1XgdmCXGlYI2lMBl88INwzz21icWsCTkpWGN4//th1ar8c64MzCrO3UdW3zJ7HeQmhGXLnBDMEuJKweqXB5LNqi6xSkHSaEmPSHpc0gZJCwdo9xFJPek2dyQVjzWQ7duLE8Kvf+2EYFYFSVYKe4HTIuI1SaOAByWtiIiHMw0kHQdcBcyMiJclvSPBeKwRuDowq6nEKoVIeS19OCr9p/Bf94XALRHxcvo1W5OKx+rcT39anBDeeMMJwazKEh1TkDQSWAf8EakP/7UFTSan260GRgLXRsS9ScZkdagwGbzjHfDii7WJxazFJfr0UUS8FRHTgYnAyZKmFTQ5CDgOOBX4KHCrpCMK30fSfEndkrq3bduWZMhWTZdeWnqvAycEs5qpyiOpEbETWAWcUXDpeeBHEbEvIp4BniKVJApfvyQi2iOiffz48YnHa1Ugwc039x//0z+5q8isDiTWfSRpPLAvInZKOhSYBXy2oNkPSVUIyyQdSao76emkYrI64IFks7qWZKUwAXhA0nrgl8DKiOiS1ClpXrrNj4EdknqAB4B/iIgdCcZktbJ7d3FCuPdeJwSzOqNosH+U7e3t0d3dXesw7EC4OjCrOUnrIqJ9sHZe5sKS8+ijxQlh+/YhJ4TCLzCN9oXGrBF4mQtLxjCqg4hAOa+PCG766a/ZtWcfC+ZORRIRQWdXD2NHj+Ky2ZMrFbVZy3OlYJX1mc8Ma6+DxSs30dnVk60CIoLOu3r42aatLF3dm73W2dXD0tW97NqzzxWDWQW5UrDKKUwGJ5wA69eX/fKIYNeefSxd3QvAgrlTUx/+a3rpmNHG9ElHsHR1b/Z6x8y2bOVgZpXhpGDDd/TRsGVL/rkhfHuXxIK5UwFKfvgDLFuzOdveCcGs8tx91OKGNXib2esgNyF88YvDerIoNzFkZI47u3ryzud2M5lZZbhSaFGFg7eZc4vu3pgdvC0c8M2T0GOmmfGCXJ139RAEy9ZszlYNmTEFcMVgVklOCi1o8cpN7Nq9L/tBGxE89uxOnt+5m+2vvUHHzDb6+vryEkTWli2p7qJcGzfC8ccPO67cAeTCD//pk47g/BnHZhNAJpGNHT3KCcGsgpwUWkx2MDc9eHv+jGPz+umnHHUYV585hUV3b8x+OGcrhoQnoUli7OhReQPIuR/+n5p1XDYBZK45IZhVlmc0t6Dcb+T7k/1w/o//gLPPzr/4xhswatSQf3/hPIQDOTazA+cZzTagUoO5pSyYOxWNGFGcECKGnBBKzkPo6mHxyk158RXGa2bV4aTQgjITwnKdP+NYphx1WPb4S3del0oI+S8cVndR7jwET0Izq08eU2gx2Q/iNanB2+mTDgf6n/+fctRhrLjs1PzXXHghWrJk2L97sHkIrgjMas9JocUUDuZmPPbcK/zwE+8tar/4J09VdG2hTGLIHc9wQjCrH+4+akGXzZ6c/SCWhHbvLk4IK1cSfX0VX2yu5DwET0IzqxuuFFpU9pv5fh4zrfR39/3NQwBXDGb1ILFKQdJoSY9IelzSBkkL99P2bEkhadDHpaxCNmwoTgg7diS6+c1A8xA6ZrZ5EppZnUiyUtgLnBYRr0kaBTwoaUVEPJzbSNIY4JPA2gRjsVw13AmtcPkMT0Izqy+JVQqR8lr6cFT6T6lPnkXADcCepGKxtFtuGdZeB5XieQhm9SvRgWZJIyU9BmwFVkbE2oLrJwKTIqJrkPeZL6lbUve2bdsSjLiJSfCJT/Qff/jD/aucmpmlJZoUIuKtiJgOTAROljQtc03SCGAxcEUZ77MkItojon38+PHJBdyEYsaM4g/+CLjzztoEZGZ1rSqPpEbETmAVcEbO6THANGCVpF7gFGC5B5srJF0F6KGH+k9961ssXP5E3pISZma5knz6aLykI9I/HwrMAp7MXI+IVyLiyIhoi4g24GFgXkR4tbvhkqBgiYro66NzzJ94SQkz268knz6aANwmaSSp5PO9iOiS1Al0R8TyBH93a3rxRTjqqLxT/7bkXhb/9k246h7AS0qY2f556exmMcBjphHBO9MJAeCZ6+c4IZi1IC+d3SpWrChOCPv2ZROCl5QwswPhZS4aWWEymDQJnn0W8JISZjY0TgqN6KKL4Mtfzj9X8O1/sK0tnRDMrBSPKTSawg/za6+Fa64ZsLm3tjQzKH9MwZVCoxjiekVeUsLMDoQHmuvd7t3FCeGhh6q+XpGZtQZXCvWshquZmllrcqVQjzZtKk4Ir7zihGBmiXOlUG9cHZhZDblSqBff/nbp1UydEMysipwU6oEEH/tY//EVVzgZmFlNuPuolj72sVSFkMvJwMxqyEmhFiKKlrZmxQo444zS7c3MqsRJodo8kGxmdcxjCkNQuDRIWUuFbN9enBBeeMEJwczqSpI7r42W9IikxyVtkLSwRJvLJfVIWi/pPknHJhVPpSxeuSlv+enMaqT73eJSgsK9pSPg6KMTjNTM7MAlWSnsBU6LiD8BpgNnSDqloM2jQHtEvAf4AXBDgvEMW0Twyu43WLq6N5sYFt61YeAtLn/+8+Lq4M03XR2YWd1KbEwhUp+Qr6UPR6X/REGbB3IOHwbOTSqeSrjpp79GiPNnHMvS1b3ZvQmmTzq8eH+CwmQwZw7cfXf1gjUzG4JExxQkjZT0GLAVWBkRa/fT/AJgRZLxDEdEsGvPPpau6UXkf+CfOGlc/8HChaUnoTkhmFkDSPTpo4h4C5gu6QjgTknTIuKJwnaSzgXagfeVeh9J84H5AMccc0yCEQ+8/4AkxhxyEFMmjGHpmt681/znsy9nAs1/syVL4MILE43XzKySqvL0UUTsBFYBRQ/iS5oF/DMwLyL2DvD6JRHRHhHt4wsHbCsXY94gcuZPZhA5Uyls3PJq0WtvvOajqHDeQYQTgpk1nMQqBUnjgX0RsVPSocAs4LMFbU4EvgKcERFbk4plMItXbuKV3W8glKoCAoLgsede4bHndtIxsw1I7WvctX4L2197A4BD3nyDp77wV/lvtn49nHDCAcfgHdLMrB4k2X00AbhN0khSFcn3IqJLUifQHRHLgc8BhwHfT38APhsR8xKMqUimAli2ZjPnzziWjhlted1DHTPasnsbL7p7YzYh9H52bqk3G1IMi1duYteefdnB6kyFMnb0KC6bPXlI72lmNhRJPn20HjixxPkFOT/PSur3lyt3Q/vM00S5Fnyo/6misaMP4n8duofbrz07r80Xf/SfXDKv6FbLkh3ATv/uBXOn0tnVw9LVvXTMbHPFYGZV5WUu6E8MpZJCZ1dP
Nmlc9pfH513b9Qfv4D0Xfp0OHTzkD+/CpJSJoWNmf4ViZlYtXuYCspPQcmW7klb3suyzt5ccSB6z/Xd0zGxj7OhRw/o2n5sYMjLHg86WNjOroJavFDL998vWbGb6pMM5cdI4gsiOMRSOHcQ3voHOOw/o/zAfTkLIzIIuTEoL79qQHfh2N5KZVUvLJwVJjB09qqi75s+X384HzipYdSOCwo/l4XxQ5z71tGzNZs7/82NZ2/sSG7e8yrI1m4H+gW4nBDOrhpZPCgCXzZ7c/008vdfBB3IbPP44vOc9Ff2duU89TZ90OB0z2giiaB5E7kC3mVnSWm5MYaBlryXBv/5r8eY3ERVPCJnft2DuVDpmtvHYc6+wdE0vy9ZsZuqEMXntcldkNTNLWkslhYGWvf63FRvgXe+Cq67qb/zyy4mvZlpqgLlny6t0zGzjmevn0DGzLW9FVjOzpLVMUsidD5D5kO3s6mH7V2/j0jnT4JlnUg07O1PJ4IgjqhJTZ1dP3rmpE8Zw9ZlT8iqJ4T7dZGZWLjXaN9D29vbo7u4e0mszH8JLV/dy2N7XeeKmj/RfPPNMuOuu0ttlJiA3lswgd+FxZnazE4KZDZekdRHRPli7lhpoznz7/q+ld/CVO6/rv7BxIxx//MAvTCiW3KeecruScisDJwQzq6aWSgqZb+fv7X0MgFvbz+KFBZ9hwbvfXfSoaTXkPfVEZeY9mJkNR0uNKWS6Z55eeAPR18cLCz5T84HcwgTghGBmtdQylUK53TVmZq2spQaawfsWmFlrKneguWW6jzLcXWNmNrCWSwpmZjawxJKCpNGSHpH0uKQNkhaWaHOIpO9K+o2ktZLakorHzMwGl2SlsBc4LSL+BJgOnCHplII2FwAvR8QfAYsp2MPZzMyqK7GkECmvpQ9Hpf8UjmqfBdyW/vkHwOlyJ7+ZWc0kOqYgaaSkx4CtwMqIWFvQ5A+B5wAi4k3gFeDtScZkZmYDSzQpRMRbETEdmAicLGlaQZNSVUHRM7KS5kvqltS9bdu2JEI1MzOqNHktInZKWgWcATyRc+l5YBLwvKSDgMOBl0q8fgmwBEDSNkmbDzCEI4HtQwi9kfmeW4PvuTVU4p6PLadRYklB0nhgXzohHArMonggeTnwd8BDwNnA/THIbLqIGD+EWLrLmbTRTHzPrcH33Bqqec9JVgoTgNskjSTVTfW9iOiS1Al0R8Ry4GvA7ZJ+Q6pCOCfBeMzMbBCJJYWIWA+cWOL8gpyf9wB/nVQMZmZ2YFplRvOSWgdQA77n1uB7bg1Vu+eGWxDPzMyS0yqVgpmZlaGpkoKkMyQ9lV5L6f+WuN50ay2Vcc+XS+qRtF7SfZLKeiytng12zzntzpYUkhr+SZVy7lnSR9L1MIueAAAD0klEQVR/1xsk3VHtGCutjP+3j5H0gKRH0/9/z6lFnJUi6euStkp6YoDrknRz+r/Hekl/mkggEdEUf4CRwG+BdwEHA48DUwvaXAx8Of3zOcB3ax13Fe75/cDb0j9f1Ar3nG43Bvg58DDQXuu4q/D3fBzwKDAuffyOWsddhXteAlyU/nkq0FvruId5z38B/CnwxADX5wArSE36PQVYm0QczVQpnAz8JiKejog3gO+QWlspV7OttTToPUfEAxHxevrwYVKzyxtZOX/PAIuAG4A91QwuIeXc84XALRHxMkBEbK1yjJVWzj0HMDb98+HAf1UxvoqLiJ9TYvJujrOAb0TKw8ARkiZUOo5mSgrZdZTSnk+fK9kmmmOtpXLuOdcFpL5pNLJB71nSicCkiOiqZmAJKufveTIwWdJqSQ9LOqNq0SWjnHu+FjhX0vPAPcAl1QmtZg703/uQNNMezeWso1TWWksNpOz7kXQu0A68L9GIkrffe5Y0gtQy7OdXK6AqKOfv+SBSXUinkqoGfyFpWkTsTDi2pJRzzx8FlkXEFyT9OamJsNMioi/58GqiKp9fzVQpZNZRyphIcTmZbbO/tZYaSDn3jKRZwD8D8yJib5ViS8pg9zwGmAasktRLqu91eYMPNpf7//aPImJfRDwDPEUqSTSqcu75AuB7ABHxEDCa1BpBzaqsf+/D1UxJ4ZfAcZLeKelgUgPJywvaZNZagjLXWqpzg95zuivlK6QSQqP3M8Mg9xwRr0TEkRHRFhFtpMZR5kVEd23CrYhy/t/+IamHCpB0JKnupKerGmVllXPPzwKnA0iaQiopNPMyysuBj6efQjoFeCUitlT6lzRN91FEvCnpE8CPST258PWI2NDMay2Vec+fAw4Dvp8eU382IubVLOhhKvOem0qZ9/xj4C8l9QBvAf8QETtqF/XwlHnPVwBflXQZqW6U8xv5S56kb5Pq/jsyPU5yDanNyYiIL5MaN5kD/AZ4HehIJI4G/m9oZmYV1kzdR2ZmNkxOCmZmluWkYGZmWU4KZmaW5aRgZmZZTgpmZpblpGBmZllOCmbDJOmk9Pr2oyX9t/R+BtNqHZfZUHjymlkFSPoXUsssHAo8HxHX1zgksyFxUjCrgPT6PL8ktX/DjIh4q8YhmQ2Ju4/MKuMPSK0xNYZUxWDWkFwpmFWApOWkdgd7JzAhIj5R45DMhqRpVkk1qxVJHwfejIg7JI0E1kg6LSLur3VsZgfKlYKZmWV5TMHMzLKcFMzMLMtJwczMspwUzMwsy0nBzMyynBTMzCzLScHMzLKcFMzMLOv/A3kJcL698zadAAAAAElFTkSuQmCC\n", 22 | "text/plain": [ 23 | "
" 24 | ] 25 | }, 26 | "metadata": { 27 | "needs_background": "light" 28 | }, 29 | "output_type": "display_data" 30 | } 31 | ], 32 | "source": [ 33 | "import numpy as np # 快速操作结构数组的工具\n", 34 | "import matplotlib.pyplot as plt # 可视化绘制\n", 35 | "from sklearn.linear_model import Ridge,RidgeCV # Ridge岭回归,RidgeCV带有广义交叉验证的岭回归\n", 36 | "\n", 37 | "\n", 38 | "# 样本数据集,第一列为x,第二列为y,在x和y之间建立回归模型\n", 39 | "data=[\n", 40 | " [0.067732,3.176513],[0.427810,3.816464],[0.995731,4.550095],[0.738336,4.256571],[0.981083,4.560815],\n", 41 | " [0.526171,3.929515],[0.378887,3.526170],[0.033859,3.156393],[0.132791,3.110301],[0.138306,3.149813],\n", 42 | " [0.247809,3.476346],[0.648270,4.119688],[0.731209,4.282233],[0.236833,3.486582],[0.969788,4.655492],\n", 43 | " [0.607492,3.965162],[0.358622,3.514900],[0.147846,3.125947],[0.637820,4.094115],[0.230372,3.476039],\n", 44 | " [0.070237,3.210610],[0.067154,3.190612],[0.925577,4.631504],[0.717733,4.295890],[0.015371,3.085028],\n", 45 | " [0.335070,3.448080],[0.040486,3.167440],[0.212575,3.364266],[0.617218,3.993482],[0.541196,3.891471]\n", 46 | "]\n", 47 | "\n", 48 | "\n", 49 | "#生成X和y矩阵\n", 50 | "dataMat = np.array(data)\n", 51 | "X = dataMat[:,0:1] # 变量x\n", 52 | "y = dataMat[:,1] #变量y\n", 53 | "\n", 54 | "\n", 55 | "\n", 56 | "# ========岭回归========\n", 57 | "model = Ridge(alpha=0.5)\n", 58 | "model = RidgeCV(alphas=[0.1, 1.0, 10.0]) # 通过RidgeCV可以设置多个参数值,算法使用交叉验证获取最佳参数值\n", 59 | "model.fit(X, y) # 线性回归建模\n", 60 | "print('系数矩阵:\\n',model.coef_)\n", 61 | "print('线性回归模型:\\n',model)\n", 62 | "# print('交叉验证最佳alpha值',model.alpha_) # 只有在使用RidgeCV算法时才有效\n", 63 | "# 使用模型预测\n", 64 | "predicted = model.predict(X)\n", 65 | "\n", 66 | "# 绘制散点图 参数:x横轴 y纵轴\n", 67 | "plt.scatter(X, y, marker='x')\n", 68 | "plt.plot(X, predicted,c='r')\n", 69 | "\n", 70 | "# 绘制x轴和y轴坐标\n", 71 | "plt.xlabel(\"x\")\n", 72 | "plt.ylabel(\"y\")\n", 73 | "\n", 74 | "# 显示图形\n", 75 | "plt.show()" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [] 84 | } 85 | ], 86 | "metadata": { 87 | "kernelspec": { 88 | "display_name": "Python 3", 89 | "language": "python", 90 | "name": "python3" 91 | }, 92 | "language_info": { 93 | "codemirror_mode": { 94 | "name": "ipython", 95 | "version": 3 96 | }, 97 | "file_extension": ".py", 98 | "mimetype": "text/x-python", 99 | "name": "python", 100 | "nbconvert_exporter": "python", 101 | "pygments_lexer": "ipython3", 102 | "version": "3.5.4" 103 | } 104 | }, 105 | "nbformat": 4, 106 | "nbformat_minor": 2 107 | } 108 | -------------------------------------------------------------------------------- /SVM/README.md: -------------------------------------------------------------------------------- 1 | ## 目录 2 | - [1.SVM讲解](#1svm讲解案例) 3 | - [1.1支持向量机(SVM)的由来](#11支持向量机svm的由来) 4 | - [1.2如何找到超平面](#12如何找到超平面) 5 | - [1.3最大间隔分类器](#13最大间隔分类器) 6 | - [1.4后续问题](#14后续问题) 7 | - [1.5新闻分类实例](https://github.com/mantchs/machine_learning_model/tree/master/SVM/cnews_demo) 8 | 9 | ## 1.SVM讲解                                                [案例](https://github.com/mantchs/machine_learning_model/tree/master/SVM/cnews_demo) 10 | 11 | SVM是一个很复杂的算法,不是一篇博文就能够讲完的,所以此篇的定位是初学者能够接受的程度,并且讲的都是SVM的一种思想,通过此篇能够使读着会使用SVM就行,具体SVM的推导过程有一篇博文是讲得非常细的,具体链接我放到最后面,供大家参考。 12 | 13 | ### 1.1支持向量机(SVM)的由来 14 | 15 | 首先我们先来看一个3维的平面方程:**Ax+By+Cz+D=0** 16 | 17 | 这就是我们中学所学的,从这个方程我们可以推导出二维空间的一条直线:**Ax+By+D=0** 18 | 19 | 那么,依次类推,更高维的空间叫做一个超平面: 20 | 21 | ![](https://www.wailian.work/images/2018/12/14/image.png) 22 | 23 | x代表的是一个向量,接下来我们看下二维空间的几何表示: 24 | 25 
| ![imagea8724.png](https://www.wailian.work/images/2018/12/14/imagea8724.png) 26 | 27 | SVM的目标是找到一个超平面,这个超平面能够很好的解决二分类问题,所以先找到各个分类的样本点离这个超平面最近的点,使得这个点到超平面的距离最大化,最近的点就是虚线所画的。由以上超平面公式计算得出大于1的就属于打叉分类,如果小于0的属于圆圈分类。 28 | 29 | 这些点能够很好地确定一个超平面,而且在几何空间中表示的也是一个向量,**那么就把这些能够用来确定超平面的向量称为支持向量(直接支持超平面的生成),于是该算法就叫做支持向量机(SVM)了。** 30 | 31 | ### 1.2如何找到超平面 32 | 33 | #### 函数间隔 34 | 35 | 在超平面w*x+b=0确定的情况下,|w*x+b|能够表示点x到距离超平面的远近,而通过观察w*x+b的符号与类标记y的符号是否一致可判断分类是否正确,所以,可以用(y*(w*x+b))的正负性来判定或表示分类的正确性。于此,我们便引出了函数间隔(functional margin)的概念。定义函数间隔(用![image5b81c.png](https://www.wailian.work/images/2018/12/14/image5b81c.png)表示)为: 36 | 37 | ![image18f21.png](https://www.wailian.work/images/2018/12/14/image18f21.png) 38 | 39 | 但是这个函数间隔有个问题,就是我成倍的增加w和b的值,则函数值也会跟着成倍增加,但这个超平面没有改变。所以有函数间隔还不够,需要一个几何间隔。 40 | 41 | #### 几何间隔 42 | 43 | 我们把w做一个约束条件,假定对于一个点 x ,令其垂直投影到超平面上的对应点为 x0 ,w 是垂直于超平面的一个向量,为样本x到超平面的距离,如下图所示: 44 | 45 | ![imageff0e4.png](http://www.wailian.work/images/2018/12/14/imageff0e4.png) 46 | 47 | 根据平面几何知识,有 48 | 49 | ![image5b9c1.png](https://www.wailian.work/images/2018/12/14/image5b9c1.png) 50 | 51 | 其中||w||为w的二阶范数(范数是一个类似于模的表示长度的概念),![imaged559c.png](https://www.wailian.work/images/2018/12/14/imaged559c.png)是单位向量(一个向量除以它的模称之为单位向量)。又由于*x*0 是超平面上的点,满足 *f*(*x*0)=0,代入超平面的方程![image3f5df.png](https://www.wailian.work/images/2018/12/14/image3f5df.png),可得![imagef9888.png](https://www.wailian.work/images/2018/12/14/imagef9888.png),即![imagea4151.png](http://www.wailian.work/images/2018/12/14/imagea4151.png)。随即让此式![image5b9c1.png](https://www.wailian.work/images/2018/12/14/image5b9c1.png)的两边同时乘以![image993f1.png](https://www.wailian.work/images/2018/12/14/image993f1.png),再根据![imagea4151.png](http://www.wailian.work/images/2018/12/14/imagea4151.png)和![image651f2.png](https://www.wailian.work/images/2018/12/14/image651f2.png),即可算出*γ*: 52 | 53 | ![image5833a.png](https://www.wailian.work/images/2018/12/14/image5833a.png) 54 | 55 | 为了得到![image9d8b4.png](https://www.wailian.work/images/2018/12/14/image9d8b4.png)的绝对值,令![image9d8b4.png](https://www.wailian.work/images/2018/12/14/image9d8b4.png)乘上对应的类别 y,即可得出几何间隔(用![image7718c.png](https://www.wailian.work/images/2018/12/14/image7718c.png)表示)的定义: 56 | 57 | ![image7cee4.png](http://www.wailian.work/images/2018/12/14/image7cee4.png) 58 | 59 | ### 1.3最大间隔分类器 60 | 61 | 对一个数据点进行分类,当超平面离数据点的“间隔”越大,分类的确信度(confidence)也越大。所以,为了使得分类的确信度尽量高,需要让所选择的超平面能够最大化这个“间隔”值。这个间隔就是下图中的Gap的一半。 62 | 63 | ![imagef7dc1.png](https://www.wailian.work/images/2018/12/14/imagef7dc1.png) 64 | 65 | 回顾下几何间隔的定义![image7cee4.png](http://www.wailian.work/images/2018/12/14/image7cee4.png),可知:如果令函数间隔![image7718c.png](https://www.wailian.work/images/2018/12/14/image7718c.png)等于1(之所以令![image7718c.png](https://www.wailian.work/images/2018/12/14/image7718c.png)等于1,是为了方便推导和优化,且这样做对目标函数的优化没有影响,至于为什么,请见本文评论下第42楼回复),则有![image7718c.png](https://www.wailian.work/images/2018/12/14/image7718c.png)= 1 / ||w||,从而上述目标函数转化成了 66 | 67 | ![imageb73e7.png](https://www.wailian.work/images/2018/12/14/imageb73e7.png) 68 | 69 | ### 1.4后续问题 70 | 71 | 至此,SVM的第一层已经了解了,就是求最大的几何间隔,对于那些只关心怎么用SVM的朋友便已足够,不必再更进一层深究其更深的原理。 72 | 73 | SVM要深入的话有很多内容需要讲到,比如:线性不可分问题、核函数、SMO算法等。 74 | 75 | 在此推荐一篇博文,这篇博文把深入的SVM内容也讲了,包括推导过程等。如果想进一步了解SVM,推荐看一下: 76 | 77 | 支持向量机通俗导论:https://blog.csdn.net/v_JULY_v/article/details/7624837#commentBox 78 | 79 | ### [1.5新闻分类实例](https://github.com/mantchs/machine_learning_model/tree/master/SVM/cnews_demo) 80 | 81 | . 82 | 83 | . 84 | 85 | . 86 | 87 | . 
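为了更直观地看到“支持向量”和超平面 w·x+b=0 的求解结果,下面给出一个示意性的最小代码例子:数据是随机构造的两类线性可分样本,这里借用 scikit-learn 的 SVC(线性核)来演示最大间隔分类,与 1.5 节基于 libsvm 的新闻分类案例相互独立,仅供参考。

```python
# 示意代码:用线性核 SVC 演示最大间隔分类(数据为随机构造,仅作演示)
import numpy as np
from sklearn.svm import SVC

np.random.seed(0)
# 构造两类线性可分的二维样本,各 20 个点
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
y = [0] * 20 + [1] * 20

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# w·x + b = 0 即为求得的超平面,support_vectors_ 就是“支持向量”
print('w =', clf.coef_)
print('b =', clf.intercept_)
print('支持向量个数:', len(clf.support_vectors_))
```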
88 | 89 | ![image.png](https://upload-images.jianshu.io/upload_images/13876065-08b587647d14267c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) 90 | 91 | 欢迎添加微信交流!请备注“机器学习”。 92 | -------------------------------------------------------------------------------- /SVM/cnews_demo/README.md: -------------------------------------------------------------------------------- 1 | ## 目录 2 | - [1.新闻分类案例](#1新闻分类案例) 3 | - [1.1介绍](#11介绍) 4 | - [1.2数据集下载](#12数据集下载) 5 | - [1.3libsvm库安装](#13libsvm库安装) 6 | - [1.4实现步骤](#14实现步骤) 7 | - [1.5代码实现](https://github.com/mantchs/machine_learning_model/blob/master/SVM/cnews_demo/svm_classification.ipynb) 8 | 9 | ## 1.新闻分类案例 10 | 11 | ### 1.1介绍 12 | 13 | 这是一个预测新闻分类的案例,通过给定的数据集来预测测试集的新闻分类,该案例用到的是libsvm库,实现步骤我已经写到代码里面了,每一步都有注释,相信很多人都能够看得明白。 14 | 15 | ### 1.2数据集下载 16 | 17 | 因为数据集比较大,不适合放到github里,所以单独下载吧,放到与代码同级目录即可。 18 | 19 | 有三个文件,一个是训练数据,一个是测试数据,一个是分类。 20 | 21 | 训练数据:https://pan.baidu.com/s/1ZkxGIvvGml3vig-9_s1pRw 22 | 23 | ### 1.3libsvm库安装 24 | 25 | LIBSVM是台湾大学林智仁(Lin Chih-Jen)教授等开发设计的一个简单、易于使用和快速有效的SVM[模式识别](https://baike.baidu.com/item/%E6%A8%A1%E5%BC%8F%E8%AF%86%E5%88%AB/295301)与回归的软件包。其它的svm库也有,这里以libsvm为例。 26 | 27 | libsvm下载地址:[libsvm-3.23.zip](http://www.csie.ntu.edu.tw/~cjlin/cgi-bin/libsvm.cgi?+http://www.csie.ntu.edu.tw/~cjlin/libsvm+zip) 28 | 29 | #### MAC系统 30 | 31 | 1.下载libsvm后解压,进入目录有个文件:**libsvm.so.2**,把这个文件复制到python安装目录**site-packages/**下。 32 | 33 | 2.在**site-packages/**下新建libsvm文件夹,并进入**libsvm**目录新建init.py的空文件。 34 | 35 | 3.进入libsvm解压路径:**libsvm-3.23/python/**,把里面的三个文件:**svm.py、svmutil.py、commonutil.py**,复制到新建的:**site-packages/libsvm/**目录下。之后就可以使用libsvm了。 36 | 37 | #### Windows系统 38 | 39 | 安装教程:https://www.cnblogs.com/bbn0111/p/8318629.html 40 | ### 1.4实现步骤 41 | 42 | 1.先对数据集进行分词,本案例用的是**jieba**分词。 43 | 44 | 2.对分词的结果进行词频统计,分配词ID。 45 | 46 | 3.根据词ID生成词向量,这就是最终的训练数据。 47 | 48 | 4.调用libsvm训练器进行训练。 49 | 50 | ### [1.5代码实现](https://github.com/mantchs/machine_learning_model/blob/master/SVM/cnews_demo/svm_classification.ipynb) 51 | 52 | . 53 | 54 | . 55 | 56 | . 57 | 58 | . 
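补充说明:下面是 libsvm Python 接口的最小调用示意,假设词向量文件已经按 1.4 节的步骤生成(文件名与 1.5 节代码中的一致),完整流程请以 1.5 节的 notebook 为准。

```python
# 示意代码:libsvm 的训练与预测调用(假设特征文件已按 1.4 节步骤生成)
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

# 读取 libsvm 格式的训练 / 测试数据(标签 + 稀疏词向量)
train_y, train_x = svm_read_problem('cnews_feature_file.txt')
test_y, test_x = svm_read_problem('cnews_feature_test_file.txt')

# -s 0 表示 C-SVC,-t 0 表示线性核,-c 5 为惩罚系数
model = svm_train(train_y, train_x, '-s 0 -c 5 -t 0')

# p_acc 中包含在测试集上的准确率
p_labels, p_acc, p_vals = svm_predict(test_y, test_x, model)
```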
59 | 60 | ![image.png](https://upload-images.jianshu.io/upload_images/13876065-08b587647d14267c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) 61 | 62 | 欢迎添加微信交流!请备注“机器学习”。 63 | -------------------------------------------------------------------------------- /SVM/cnews_demo/svm_classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 11, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import sys\n", 10 | "import os\n", 11 | "import jieba\n", 12 | "from libsvm import svm\n", 13 | "from libsvm.svmutil import svm_read_problem,svm_train,svm_predict,svm_save_model,svm_load_model\n", 14 | "\n", 15 | "## 数据集下载:https://pan.baidu.com/s/1ZkxGIvvGml3vig-9_s1pRw\n", 16 | "news_file='cnews.train.txt' ##原始是数据\n", 17 | "test_file='cnews.test.txt' ##测试数据\n", 18 | "output_word_file='cnews_dict.txt' ##进过分词后的数\n", 19 | "output_word_test_file='cnews_dict_test.txt'\n", 20 | "feature_file='cnews_feature_file.txt' ##最后生成的词向量文件\n", 21 | "feature_test_file='cnews_feature_test_file.txt'\n", 22 | "model_filename='cnews_model' ##模型保存的文件" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 3, 28 | "metadata": {}, 29 | "outputs": [ 30 | { 31 | "name": "stderr", 32 | "output_type": "stream", 33 | "text": [ 34 | "Building prefix dict from the default dictionary ...\n" 35 | ] 36 | }, 37 | { 38 | "name": "stdout", 39 | "output_type": "stream", 40 | "text": [ 41 | "马晓旭意外受伤让国奥警惕 无奈大雨格外青睐殷家军记者傅亚雨沈阳报道 来到沈阳,国奥队依然没有摆脱雨水的困扰。7月31日下午6点,国奥队的日常训练再度受到大雨的干扰,无奈之下队员们只慢跑了25分钟就草草收场。31日上午10点,国奥队在奥体中心外场训练的时候,天就是阴沉沉的,气象预报显示当天下午沈阳就有大雨,但幸好队伍上午的训练并没有受到任何干扰。下午6点,当球队抵达训练场时,大雨已经下了几个小时,而且丝毫没有停下来的意思。抱着试一试的态度,球队开始了当天下午的例行训练,25分钟过去了,天气没有任何转好的迹象,为了保护球员们,国奥队决定中止当天的训练,全队立即返回酒店。在雨中训练对足球队来说并不是什么稀罕事,但在奥运会即将开始之前,全队变得“娇贵”了。在沈阳最后一周的训练,国奥队首先要保证现有的球员不再出现意外的伤病情况以免影响正式比赛,因此这一阶段控制训练受伤、控制感冒等疾病的出现被队伍放在了相当重要的位置。而抵达沈阳之后,中后卫冯萧霆就一直没有训练,冯萧霆是7月27日在长春患上了感冒,因此也没有参加29日跟塞尔维亚的热身赛。队伍介绍说,冯萧霆并没有出现发烧症状,但为了安全起见,这两天还是让他静养休息,等感冒彻底好了之后再恢复训练。由于有了冯萧霆这个例子,因此国奥队对雨中训练就显得特别谨慎,主要是担心球员们受凉而引发感冒,造成非战斗减员。而女足队员马晓旭在热身赛中受伤导致无缘奥运的前科,也让在沈阳的国奥队现在格外警惕,“训练中不断嘱咐队员们要注意动作,我们可不能再出这样的事情了。”一位工作人员表示。从长春到沈阳,雨水一路伴随着国奥队,“也邪了,我们走到哪儿雨就下到哪儿,在长春几次训练都被大雨给搅和了,没想到来沈阳又碰到这种事情。”一位国奥球员也对雨水的“青睐”有些不解。\n" 42 | ] 43 | }, 44 | { 45 | "name": "stderr", 46 | "output_type": "stream", 47 | "text": [ 48 | "Dumping model to file cache /var/folders/jr/sc47kr2548z426dnw3xkrjqm0000gn/T/jieba.cache\n", 49 | "Loading model cost 1.041 seconds.\n", 50 | "Prefix dict has been built succesfully.\n" 51 | ] 52 | }, 53 | { 54 | "name": "stdout", 55 | "output_type": "stream", 56 | "text": [ 57 | "马晓旭/ 意外/ 受伤/ 让/ 国奥/ 警惕/ / 无奈/ 大雨/ 格外/ 青睐/ 殷家/ 军/ 记者/ 傅亚雨/ 沈阳/ 报道/ / 来到/ 沈阳/ ,/ 国奥队/ 依然/ 没有/ 摆脱/ 雨水/ 的/ 困扰/ 。/ 7/ 月/ 31/ 日/ 下午/ 6/ 点/ ,/ 国奥队/ 的/ 日常/ 训练/ 再度/ 受到/ 大雨/ 的/ 干扰/ ,/ 无奈/ 之下/ 队员/ 们/ 只/ 慢跑/ 了/ 25/ 分钟/ 就/ 草草收场/ 。/ 31/ 日/ 上午/ 10/ 点/ ,/ 国奥队/ 在/ 奥体中心/ 外场/ 训练/ 的/ 时候/ ,/ 天/ 就是/ 阴沉沉/ 的/ ,/ 气象预报/ 显示/ 当天/ 下午/ 沈阳/ 就/ 有/ 大雨/ ,/ 但/ 幸好/ 队伍/ 上午/ 的/ 训练/ 并/ 没有/ 受到/ 任何/ 干扰/ 。/ 下午/ 6/ 点/ ,/ 当/ 球队/ 抵达/ 训练场/ 时/ ,/ 大雨/ 已经/ 下/ 了/ 几个/ 小时/ ,/ 而且/ 丝毫/ 没有/ 停下来/ 的/ 意思/ 。/ 抱/ 着/ 试一试/ 的/ 态度/ ,/ 球队/ 开始/ 了/ 当天/ 下午/ 的/ 例行/ 训练/ ,/ 25/ 分钟/ 过去/ 了/ ,/ 天气/ 没有/ 任何/ 转好/ 的/ 迹象/ ,/ 为了/ 保护/ 球员/ 们/ ,/ 国奥队/ 决定/ 中止/ 当天/ 的/ 训练/ ,/ 全队/ 立即/ 返回/ 酒店/ 。/ 在/ 雨/ 中/ 训练/ 对/ 足球队/ 来说/ 并/ 不是/ 什么/ 稀罕/ 事/ ,/ 但/ 在/ 奥运会/ 即将/ 开始/ 之前/ ,/ 全队/ 变得/ “/ 娇贵/ ”/ 了/ 。/ 在/ 沈阳/ 最后/ 一周/ 的/ 训练/ ,/ 国奥队/ 首先/ 要/ 保证/ 现有/ 的/ 球员/ 不再/ 出现意外/ 的/ 伤病/ 情况/ 以免/ 影响/ 正式/ 比赛/ ,/ 因此/ 这一/ 阶段/ 控制/ 训练/ 受伤/ 、/ 控制/ 感冒/ 等/ 疾病/ 的/ 出现/ 被/ 队伍/ 放在/ 了/ 相当/ 重要/ 的/ 位置/ 。/ 而/ 抵达/ 沈阳/ 之后/ ,/ 中/ 后卫/ 
冯萧霆/ 就/ 一直/ 没有/ 训练/ ,/ 冯萧霆/ 是/ 7/ 月/ 27/ 日/ 在/ 长春/ 患上/ 了/ 感冒/ ,/ 因此/ 也/ 没有/ 参加/ 29/ 日/ 跟/ 塞尔维亚/ 的/ 热身赛/ 。/ 队伍/ 介绍/ 说/ ,/ 冯萧霆/ 并/ 没有/ 出现/ 发烧/ 症状/ ,/ 但/ 为了/ 安全/ 起/ 见/ ,/ 这/ 两天/ 还是/ 让/ 他/ 静养/ 休息/ ,/ 等/ 感冒/ 彻底/ 好/ 了/ 之后/ 再/ 恢复/ 训练/ 。/ 由于/ 有/ 了/ 冯萧霆/ 这个/ 例子/ ,/ 因此/ 国奥队/ 对雨中/ 训练/ 就/ 显得/ 特别/ 谨慎/ ,/ 主要/ 是/ 担心/ 球员/ 们/ 受凉/ 而/ 引发/ 感冒/ ,/ 造成/ 非战斗/ 减员/ 。/ 而/ 女足/ 队员/ 马晓旭/ 在/ 热身赛/ 中/ 受伤/ 导致/ 无缘/ 奥运/ 的/ 前科/ ,/ 也/ 让/ 在/ 沈阳/ 的/ 国奥队/ 现在/ 格外/ 警惕/ ,/ “/ 训练/ 中/ 不断/ 嘱咐/ 队员/ 们/ 要/ 注意/ 动作/ ,/ 我们/ 可/ 不能/ 再出/ 这样/ 的/ 事情/ 了/ 。/ ”/ 一位/ 工作人员/ 表示/ 。/ 从/ 长春/ 到/ 沈阳/ ,/ 雨水/ 一路/ 伴随/ 着/ 国奥队/ ,/ “/ 也/ 邪/ 了/ ,/ 我们/ 走/ 到/ 哪儿/ 雨/ 就/ 下/ 到/ 哪儿/ ,/ 在/ 长春/ 几次/ 训练/ 都/ 被/ 大雨/ 给/ 搅和/ 了/ ,/ 没想到/ 来/ 沈阳/ 又/ 碰到/ 这种/ 事情/ 。/ ”/ 一位/ 国奥/ 球员/ 也/ 对/ 雨水/ 的/ “/ 青睐/ ”/ 有些/ 不解/ 。\n" 58 | ] 59 | } 60 | ], 61 | "source": [ 62 | "\n", 63 | "\n", 64 | "with open(news_file, 'r') as f: ##读取新闻文章\n", 65 | " lines = f.readlines()\n", 66 | "\n", 67 | "label, content = lines[0].strip('\\r\\n').split('\\t')\n", 68 | "print(content)\n", 69 | "\n", 70 | "words_iter = jieba.cut(content) ##使用jiejia进行分词操作\n", 71 | "print('/ '.join(words_iter))" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 2, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stderr", 81 | "output_type": "stream", 82 | "text": [ 83 | "Building prefix dict from the default dictionary ...\n", 84 | "Loading model from cache /var/folders/jr/sc47kr2548z426dnw3xkrjqm0000gn/T/jieba.cache\n", 85 | "Loading model cost 0.792 seconds.\n", 86 | "Prefix dict has been built succesfully.\n" 87 | ] 88 | }, 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "==========分词完成====================\n" 94 | ] 95 | } 96 | ], 97 | "source": [ 98 | "\n", 99 | "\n", 100 | "def generate_word_file(input_char_file, output_word_file): ##定义分词函数,并写入文件\n", 101 | " with open(input_char_file, 'r') as f:\n", 102 | " lines = f.readlines()\n", 103 | " with open(output_word_file, 'w') as f:\n", 104 | " for line in lines:\n", 105 | " label, content = line.strip('\\r\\n').split('\\t')\n", 106 | " words_iter = jieba.cut(content)\n", 107 | " word_content = ''\n", 108 | " for word in words_iter:\n", 109 | " word = word.strip(' ')\n", 110 | " if word != '':\n", 111 | " word_content += word + ' '\n", 112 | " out_line = '%s\\t%s\\n' % (label, word_content.strip(' '))\n", 113 | " f.write(out_line)\n", 114 | "\n", 115 | "generate_word_file(news_file, output_word_file)\n", 116 | "generate_word_file(test_file, output_word_test_file)\n", 117 | "print('==========分词完成====================') ##需要的时间比较长" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 2, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "name": "stdout", 127 | "output_type": "stream", 128 | "text": [ 129 | "10\n" 130 | ] 131 | } 132 | ], 133 | "source": [ 134 | "\n", 135 | "\n", 136 | "class Category: ##分类topic\n", 137 | " def __init__(self, category_file):\n", 138 | " self._category_to_id = {}\n", 139 | " with open(category_file, 'r') as f:\n", 140 | " lines = f.readlines()\n", 141 | " for line in lines:\n", 142 | " category, idx = line.strip('\\r\\n').split('\\t')\n", 143 | " idx = int(idx)\n", 144 | " self._category_to_id[category] = idx\n", 145 | " \n", 146 | " def category_to_id(self, category):\n", 147 | " return self._category_to_id[category]\n", 148 | " \n", 149 | " def size(self):\n", 150 | " return len(self._category_to_id)\n", 151 | "\n", 152 | "category_file='cnews.category.txt'\n", 153 | "category_vocab = Category(category_file)\n", 154 | "print(category_vocab.size())" 155 | ] 156 | 
}, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 12, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "name": "stdout", 164 | "output_type": "stream", 165 | "text": [ 166 | "10353\n" 167 | ] 168 | } 169 | ], 170 | "source": [ 171 | "\n", 172 | "##对分词后的数据进行词频统计并过滤,分配词ID\n", 173 | "\n", 174 | "def generate_feature_dict(train_file, feature_threshold=10): \n", 175 | " feature_dict = {}\n", 176 | " with open(train_file, 'r') as f:\n", 177 | " lines = f.readlines()\n", 178 | " for line in lines:\n", 179 | " label, content = line.strip('\\r\\n').split('\\t')\n", 180 | " for word in content.split(' '):\n", 181 | " if not word in feature_dict:\n", 182 | " feature_dict.setdefault(word, 0)\n", 183 | " feature_dict[word] += 1\n", 184 | " filtered_feature_dict = {}\n", 185 | " for feature_name in feature_dict:\n", 186 | " if feature_dict[feature_name] < feature_threshold:\n", 187 | " continue\n", 188 | " if not feature_name in filtered_feature_dict:\n", 189 | " filtered_feature_dict[feature_name] = len(filtered_feature_dict) + 1\n", 190 | " return filtered_feature_dict\n", 191 | " \n", 192 | "feature_dict = generate_feature_dict(output_word_file, feature_threshold=200)\n", 193 | "print(len(feature_dict))" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 5, 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "name": "stdout", 203 | "output_type": "stream", 204 | "text": [ 205 | "==========构造词向量完成完成====================\n" 206 | ] 207 | } 208 | ], 209 | "source": [ 210 | "\n", 211 | "\n", 212 | "def generate_feature_line(line, feature_dict, category_vocab): ##对每一篇文章根据词id构造词向量。\n", 213 | " label, content = line.strip('\\r\\n').split('\\t')\n", 214 | " label_id = category_vocab.category_to_id(label)\n", 215 | " feature_example = {}\n", 216 | " for word in content.split(' '):\n", 217 | " if not word in feature_dict:\n", 218 | " continue\n", 219 | " feature_id = feature_dict[word]\n", 220 | " feature_example.setdefault(feature_id, 0)\n", 221 | " feature_example[feature_id] += 1\n", 222 | " feature_line = '%d' % label_id\n", 223 | " sorted_feature_example = sorted(feature_example.items(), key=lambda d:d[0])\n", 224 | " for item in sorted_feature_example:\n", 225 | " feature_line += ' %d:%d' % item\n", 226 | " return feature_line\n", 227 | "\n", 228 | "##循环没一篇文章,得到词向量化后的文件\n", 229 | "\n", 230 | "def convert_raw_to_feature(raw_file, feature_file, feature_dict, category_vocab): \n", 231 | " with open(raw_file, 'r') as f:\n", 232 | " lines = f.readlines()\n", 233 | " with open(feature_file, 'w') as f:\n", 234 | " for line in lines:\n", 235 | " feature_line = generate_feature_line(line, feature_dict, category_vocab)\n", 236 | " f.write('%s\\n' % feature_line)\n", 237 | " \n", 238 | "##测试数据运用相同的词ID表\n", 239 | "convert_raw_to_feature(output_word_file, feature_file, feature_dict, category_vocab)\n", 240 | "convert_raw_to_feature(output_word_test_file, feature_test_file, feature_dict, category_vocab) \n", 241 | "print('==========构造词向量完成完成====================')" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 3, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "name": "stdout", 251 | "output_type": "stream", 252 | "text": [ 253 | "0.0 {10155: 1.0, 6149: 2.0, 10246: 1.0, 5640: 1.0, 8713: 1.0, 8714: 1.0, 4107: 2.0, 12: 1.0, 5720: 1.0, 19: 2.0, 6164: 2.0, 6165: 1.0, 495: 2.0, 5656: 1.0, 7707: 1.0, 3101: 2.0, 6661: 3.0, 3104: 3.0, 5665: 1.0, 7714: 1.0, 9764: 1.0, 6181: 2.0, 7207: 1.0, 9259: 1.0, 6190: 1.0, 2286: 3.0, 3128: 1.0, 5177: 3.0, 
570: 1.0, 3647: 1.0, 6208: 38.0, 4162: 1.0, 4167: 1.0, 4680: 4.0, 3146: 1.0, 3154: 1.0, 8291: 1.0, 5206: 1.0, 7255: 1.0, 6232: 1.0, 4187: 1.0, 9821: 1.0, 5218: 1.0, 4707: 1.0, 5186: 1.0, 3379: 1.0, 107: 21.0, 5740: 1.0, 2670: 3.0, 7794: 2.0, 6774: 1.0, 2476: 13.0, 1387: 1.0, 5253: 1.0, 5254: 1.0, 6797: 8.0, 5262: 1.0, 5265: 1.0, 1170: 1.0, 9647: 3.0, 3736: 2.0, 6511: 2.0, 8350: 1.0, 2208: 1.0, 1185: 1.0, 1187: 1.0, 2214: 7.0, 1706: 12.0, 8876: 4.0, 850: 2.0, 2222: 1.0, 6323: 1.0, 8885: 1.0, 4791: 3.0, 192: 1.0, 4128: 2.0, 5067: 1.0, 8900: 4.0, 10274: 1.0, 718: 1.0, 3796: 4.0, 9944: 2.0, 7385: 1.0, 6362: 2.0, 5853: 1.0, 4837: 3.0, 5352: 1.0, 7914: 1.0, 5870: 1.0, 2288: 4.0, 2289: 1.0, 3833: 1.0, 5882: 1.0, 9469: 1.0, 6398: 2.0, 7939: 1.0, 3844: 14.0, 8453: 1.0, 2310: 1.0, 9991: 1.0, 4878: 5.0, 2839: 1.0, 4377: 1.0, 6432: 2.0, 2340: 1.0, 10118: 2.0, 3878: 2.0, 5418: 3.0, 3463: 1.0, 3887: 1.0, 8497: 1.0, 818: 1.0, 2355: 1.0, 10035: 1.0, 6465: 2.0, 322: 1.0, 9027: 8.0, 9548: 1.0, 6991: 3.0, 2386: 4.0, 7508: 1.0, 2905: 1.0, 5469: 1.0, 10081: 1.0, 1891: 1.0, 4454: 2.0, 2919: 1.0, 4458: 2.0, 3435: 1.0, 2194: 1.0, 2927: 1.0, 9586: 1.0, 7030: 1.0, 10108: 1.0, 1770: 1.0, 2942: 1.0, 5525: 1.0, 6530: 1.0, 902: 2.0, 6977: 2.0, 5002: 1.0, 8075: 1.0, 909: 2.0, 7567: 1.0, 6034: 1.0, 6549: 4.0, 1942: 1.0, 1435: 1.0, 1438: 1.0, 2464: 2.0, 7586: 1.0, 10149: 1.0, 1959: 1.0, 9101: 1.0, 937: 2.0, 6727: 2.0, 6572: 1.0, 4525: 1.0, 6575: 2.0, 5192: 1.0, 1262: 2.0, 5022: 1.0, 7608: 2.0, 8123: 1.0, 6588: 4.0, 3007: 1.0, 1991: 4.0, 8650: 1.0, 8139: 1.0, 6606: 2.0, 3536: 1.0, 8659: 2.0, 3547: 3.0, 9696: 1.0, 10210: 3.0, 4603: 1.0, 6822: 1.0, 998: 1.0, 1618: 2.0, 5103: 1.0, 8688: 1.0, 5106: 1.0, 4598: 1.0, 3065: 2.0, 2555: 2.0, 4605: 1.0}\n" 254 | ] 255 | } 256 | ], 257 | "source": [ 258 | "\n", 259 | "\n", 260 | "##生成svm训练数据\n", 261 | "train_label, train_value = svm_read_problem(feature_file)\n", 262 | "print(train_label[0],train_value[0])\n", 263 | "train_test_label, train_test_value = svm_read_problem(feature_test_file)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 9, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "name": "stdout", 273 | "output_type": "stream", 274 | "text": [ 275 | "=======模型训练完成================\n" 276 | ] 277 | } 278 | ], 279 | "source": [ 280 | "\n", 281 | "if(os.path.exists(model_filename)): ##判断模型是否存在,存在直接读取\n", 282 | " model=svm_load_model(model_filename)\n", 283 | "else:\n", 284 | " model=svm_train(train_label,train_value,'-s 0 -c 5 -t 0 -g 0.5 -e 0.1') ##模型训练\n", 285 | " svm_save_model(model_filename,model) \n", 286 | "print(\"=======模型训练完成================\")" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 10, 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "name": "stdout", 296 | "output_type": "stream", 297 | "text": [ 298 | "Accuracy = 94.15% (9415/10000) (classification)\n", 299 | "(94.15, 0.6576, 0.9223329891919024)\n" 300 | ] 301 | } 302 | ], 303 | "source": [ 304 | "\n", 305 | "##模型预测,并打印出精确度。\n", 306 | "p_labs, p_acc, p_vals =svm_predict(train_test_label, train_test_value, model) \n", 307 | "print(p_acc)" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": { 314 | "collapsed": true 315 | }, 316 | "outputs": [], 317 | "source": [] 318 | } 319 | ], 320 | "metadata": { 321 | "kernelspec": { 322 | "display_name": "Python 3", 323 | "language": "python", 324 | "name": "python3" 325 | }, 326 | "language_info": { 327 | "codemirror_mode": { 328 | "name": "ipython", 
329 | "version": 3 330 | }, 331 | "file_extension": ".py", 332 | "mimetype": "text/x-python", 333 | "name": "python", 334 | "nbconvert_exporter": "python", 335 | "pygments_lexer": "ipython3", 336 | "version": "3.5.0" 337 | } 338 | }, 339 | "nbformat": 4, 340 | "nbformat_minor": 2 341 | } 342 | -------------------------------------------------------------------------------- /Word2Vec/README.md: -------------------------------------------------------------------------------- 1 | ## 目录 2 | - [1.离散表示](#1离散表示) 3 | - [1.1 One-hot表示](#11-one-hot表示) 4 | - [1.2 词袋模型](#12-词袋模型) 5 | - [1.3 TF-IDF](#13-tf-idf) 6 | - [1.4 n-gram模型](#14-n-gram模型) 7 | - [1.5 离散表示存在的问题](#15-离散表示存在的问题) 8 | - [2.分布式表示](#2-分布式表示) 9 | - [2.1 共现矩阵](#21-共现矩阵) 10 | - [3.神经网络表示](#3神经网络表示) 11 | - [3.1 NNLM](#31-nnlm) 12 | - [3.2 Word2Vec](#32-word2vec) 13 | - [3.3 sense2vec](#33-sense2vec) 14 | - [4.代码实现Word2Vec](https://github.com/mantchs/machine_learning_model/blob/master/Word2Vec/word2vec.ipynb) 15 | 16 | 17 | 18 | 在NLP(自然语言处理)领域,文本表示是第一步,也是很重要的一步,通俗来说就是把人类的语言符号转化为机器能够进行计算的数字,因为普通的文本语言机器是看不懂的,必须通过转化来表征对应文本。早期是**基于规则**的方法进行转化,而现代的方法是**基于统计机器学习**的方法。 19 | 20 | 数据决定了机器学习的上限,而算法只是尽可能逼近这个上限,在本文中数据指的就是文本表示,所以,弄懂文本表示的发展历程,对于NLP学习者来说是必不可少的。接下来开始我们的发展历程。文本表示分为**离散表示**和**分布式表示**: 21 | 22 | ## 1.离散表示 23 | 24 | ### 1.1 One-hot表示 25 | 26 | One-hot简称读热向量编码,也是特征工程中最常用的方法。其步骤如下: 27 | 28 | 1. 构造文本分词后的字典,每个分词是一个比特值,比特值为0或者1。 29 | 2. 每个分词的文本表示为该分词的比特位为1,其余位为0的矩阵表示。 30 | 31 | 例如:**John likes to watch movies. Mary likes too** 32 | 33 | **John also likes to watch football games.** 34 | 35 | 以上两句可以构造一个词典,**{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10} ** 36 | 37 | 每个词典索引对应着比特位。那么利用One-hot表示为: 38 | 39 | **John: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] ** 40 | 41 | **likes: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]** .......等等,以此类推。 42 | 43 | One-hot表示文本信息的**缺点**: 44 | 45 | - 随着语料库的增加,数据特征的维度会越来越大,产生一个维度很高,又很稀疏的矩阵。 46 | - 这种表示方法的分词顺序和在句子中的顺序是无关的,不能保留词与词之间的关系信息。 47 | 48 | 49 | 50 | ### 1.2 词袋模型 51 | 52 | 词袋模型(Bag-of-words model),像是句子或是文件这样的文字可以用一个袋子装着这些词的方式表现,这种表现方式不考虑文法以及词的顺序。 53 | 54 | **文档的向量表示可以直接将各词的词向量表示加和**。例如: 55 | 56 | **John likes to watch movies. Mary likes too** 57 | 58 | **John also likes to watch football games.** 59 | 60 | 以上两句可以构造一个词典,**{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10} ** 61 | 62 | 那么第一句的向量表示为:**[1,2,1,1,1,0,0,0,1,1]**,其中的2表示**likes**在该句中出现了2次,依次类推。 63 | 64 | 词袋模型同样有一下**缺点**: 65 | 66 | - 词向量化后,词与词之间是有大小关系的,不一定词出现的越多,权重越大。 67 | - 词与词之间是没有顺序关系的。 68 | 69 | 70 | 71 | ### 1.3 TF-IDF 72 | 73 | TF-IDF(term frequency–inverse document frequency)是一种用于信息检索与数据挖掘的常用加权技术。TF意思是词频(Term Frequency),IDF意思是逆文本频率指数(Inverse Document Frequency)。 74 | 75 | **字词的重要性随着它在文件中出现的次数成正比增加,但同时会随着它在语料库中出现的频率成反比下降。一个词语在一篇文章中出现次数越多, 同时在所有文档中出现次数越少, 越能够代表该文章。** 76 | 77 | $$TF_w=\frac{在某一类中词条w出现的次数}{该类中所有的词条数目}$$ 78 | 79 | $$IDF=log(\frac{语料库的文档总数}{包含词条w的文档总数+1})$$,分母之所以加1,是为了避免分母为0。 80 | 81 | 那么,$TF-IDF=TF*IDF$,从这个公式可以看出,当w在文档中出现的次数增大时,而TF-IDF的值是减小的,所以也就体现了以上所说的了。 82 | 83 | **缺点:**还是没有把词与词之间的关系顺序表达出来。 84 | 85 | 86 | 87 | ### 1.4 n-gram模型 88 | 89 | n-gram模型为了保持词的顺序,做了一个滑窗的操作,这里的n表示的就是滑窗的大小,例如2-gram模型,也就是把2个词当做一组来处理,然后向后移动一个词的长度,再次组成另一组词,把这些生成一个字典,按照词袋模型的方式进行编码得到结果。改模型考虑了词的顺序。 90 | 91 | 例如: 92 | 93 | **John likes to watch movies. 
Mary likes too** 94 | 95 | **John also likes to watch football games.** 96 | 97 | 以上两句可以构造一个词典,**{"John likes”: 1, "likes to”: 2, "to watch”: 3, "watch movies”: 4, "Mary likes”: 5, "likes too”: 6, "John also”: 7, "also likes”: 8, “watch football”: 9, "football games": 10}** 98 | 99 | 那么第一句的向量表示为:**[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]**,其中第一个1表示**John likes**在该句中出现了1次,依次类推。 100 | 101 | **缺点:**随着n的大小增加,词表会成指数型膨胀,会越来越大。 102 | 103 | 104 | 105 | ### 1.5 离散表示存在的问题 106 | 107 | 由于存在以下的问题,对于一般的NLP问题,是可以使用离散表示文本信息来解决问题的,但对于要求精度较高的场景就不适合了。 108 | 109 | - 无法衡量词向量之间的关系。 110 | - 词表的维度随着语料库的增长而膨胀。 111 | - n-gram词序列随语料库增长呈指数型膨胀,更加快。 112 | - 离散数据来表示文本会带来数据稀疏问题,导致丢失了信息,与我们生活中理解的信息是不一样的。 113 | 114 | 115 | 116 | ## 2. 分布式表示 117 | 118 | 科学家们为了提高模型的精度,又发明出了分布式的表示文本信息的方法,这就是这一节需要介绍的。 119 | 120 | **用一个词附近的其它词来表示该词,这是现代统计自然语言处理中最有创见的想法之一。**当初科学家发明这种方法是基于人的语言表达,认为一个词是由这个词的周边词汇一起来构成精确的语义信息。就好比,物以类聚人以群分,如果你想了解一个人,可以通过他周围的人进行了解,因为周围人都有一些共同点才能聚集起来。 121 | 122 | 123 | 124 | ### 2.1 共现矩阵 125 | 126 | 共现矩阵顾名思义就是共同出现的意思,词文档的共现矩阵主要用于发现主题(topic),用于主题模型,如LSA。 127 | 128 | 局域窗中的word-word共现矩阵可以挖掘语法和语义信息,**例如:** 129 | 130 | - I like deep learning. 131 | - I like NLP. 132 | - I enjoy flying 133 | 134 | 有以上三句话,设置滑窗为2,可以得到一个词典:**{"I like","like deep","deep learning","like NLP","I enjoy","enjoy flying","I like"}**。 135 | 136 | 我们可以得到一个共现矩阵(对称矩阵): 137 | 138 | ![image](https://wx2.sinaimg.cn/large/00630Defly1g2rwv1op5zj30q70c7wh2.jpg) 139 | 140 | 中间的每个格子表示的是行和列组成的词组在词典中共同出现的次数,也就体现了**共现**的特性。 141 | 142 | **存在的问题:** 143 | 144 | - 向量维数随着词典大小线性增长。 145 | - 存储整个词典的空间消耗非常大。 146 | - 一些模型如文本分类模型会面临稀疏性问题。 147 | - **模型会欠稳定,每新增一份语料进来,稳定性就会变化。** 148 | 149 | 150 | 151 | ## 3.神经网络表示 152 | 153 | ### 3.1 NNLM 154 | 155 | NNLM (Neural Network Language model),神经网络语言模型是03年提出来的,通过训练得到中间产物--词向量矩阵,这就是我们要得到的文本表示向量矩阵。 156 | 157 | NNLM说的是定义一个前向窗口大小,其实和上面提到的窗口是一个意思。把这个窗口中最后一个词当做y,把之前的词当做输入x,通俗来说就是预测这个窗口中最后一个词出现概率的模型。 158 | 159 | ![image](https://wx3.sinaimg.cn/large/00630Defly1g2vb5thw9rj30eq065dg4.jpg) 160 | 161 | 以下是NNLM的网络结构图: 162 | 163 | ![image](https://wx3.sinaimg.cn/large/00630Defly1g2t1f4bqilj30lv0e2adl.jpg) 164 | 165 | - input层是一个前向词的输入,是经过one-hot编码的词向量表示形式,具有V*1的矩阵。 166 | 167 | - C矩阵是投影矩阵,也就是稠密词向量表示,在神经网络中是**w参数矩阵**,该矩阵的大小为D*V,正好与input层进行全连接(相乘)得到D\*1的矩阵,采用线性映射将one-hot表 示投影到稠密D维表示。 168 | 169 | ![image](https://wx3.sinaimg.cn/large/00630Defly1g2t1s20jpnj30f107575i.jpg) 170 | 171 | - output层(softmax)自然是前向窗中需要预测的词。 172 | 173 | - 通过BP+SGD得到最优的C投影矩阵,这就是NNLM的中间产物,也是我们所求的文本表示矩阵,**通过NNLM将稀疏矩阵投影到稠密向量矩阵中。** 174 | 175 | ### 3.2 Word2Vec 176 | 177 | 谷歌2013年提出的Word2Vec是目前最常用的词嵌入模型之一。Word2Vec实际 是一种浅层的神经网络模型,它有两种网络结构,**分别是CBOW(Continues Bag of Words)连续词袋和Skip-gram。**Word2Vec和上面的NNLM很类似,但比NNLM简单。 178 | 179 | **CBOW** 180 | 181 | CBOW是通过中间词来预测窗口中上下文词出现的概率模型,把中间词当做y,把窗口中的其它词当做x输入,x输入是经过one-hot编码过的,然后通过一个隐层进行求和操作,最后通过激活函数softmax,可以计算出每个单词的生成概率,接下来的任务就是训练神经网络的权重,使得语料库中所有单词的整体生成概率最大化,而求得的权重矩阵就是文本表示词向量的结果。 182 | 183 | ![image](https://ws2.sinaimg.cn/large/00630Defly1g2u6va5fvyj30gf0h0aby.jpg) 184 | 185 | **Skip-gram**: 186 | 187 | Skip-gram是通过当前词来预测窗口中上下文词出现的概率模型,把当前词当做x,把窗口中其它词当做y,依然是通过一个隐层接一个Softmax激活函数来预测其它词的概率。如下图所示: 188 | 189 | ![image](https://ws4.sinaimg.cn/large/00630Defly1g2u7f4tsl2j30eu0h20tq.jpg) 190 | 191 | **优化方法**: 192 | 193 | - **层次Softmax:**至此还没有结束,因为如果单单只是接一个softmax激活函数,计算量还是很大的,有多少词就会有多少维的权重矩阵,所以这里就提出**层次Softmax(Hierarchical Softmax)**,使用Huffman Tree来编码输出层的词典,相当于平铺到各个叶子节点上,**瞬间把维度降低到了树的深度**,可以看如下图所示。这课Tree把出现频率高的词放到靠近根节点的叶子节点处,每一次只要做二分类计算,计算路径上所有非叶子节点词向量的贡献即可。 194 | 195 | ![image](https://ws4.sinaimg.cn/large/00630Defly1g2u762c7nwj30jb0fs0wh.jpg) 196 | 197 | - 
**负例采样(Negative Sampling):**这种优化方式做的事情是,在正确单词以外的负样本中进行采样,最终目的是为了减少负样本的数量,达到减少计算量效果。将词典中的每一个词对应一条线段,所有词组成了[0,1]间的剖分,如下图所示,然后每次随机生成一个[1, M-1]间的整数,看落在哪个词对应的剖分上就选择哪个词,最后会得到一个负样本集合。 198 | 199 | ![image](https://wx3.sinaimg.cn/large/00630Defly1g2u7vvrgjnj30lu07d75v.jpg) 200 | 201 | **Word2Vec存在的问题** 202 | 203 | - 对每个local context window单独训练,没有利用包 含在global co-currence矩阵中的统计信息。 204 | - 对多义词无法很好的表示和处理,因为使用了唯一 的词向量 205 | 206 | 207 | 208 | ### 3.3 sense2vec 209 | 210 | word2vec模型的问题在于词语的多义性。比如duck这个单词常见的含义有 水禽或者下蹲,但对于 word2vec 模型来说,它倾向于将所有概念做归一化 平滑处理,得到一个最终的表现形式。 211 | 212 | ## 4.代码实现 213 | 214 | [github Word2Vec训练维基百科文章]() 215 | 216 |
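除了上面链接的完整示例,这里再补充几段很小的示意代码,方便对照前文各小节理解。这些代码并非仓库原有内容,默认已安装 scikit-learn / numpy / gensim,函数名和参数取自这些库的常用接口,语料沿用前文的英文例句。先看 1.2 词袋模型和 1.3 TF-IDF 的向量化(注意 scikit-learn 的 IDF 带平滑项,与上文公式略有出入):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.",
]

# 词袋模型:统计每个词在句子中出现的次数(默认会转成小写)
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)
print(bow.get_feature_names())   # 词典
print(bow_matrix.toarray())      # 每一行是一句话的词频向量,likes 对应位置为 2

# TF-IDF:在词频基础上乘以逆文档频率,降低在所有文档里都出现的词的权重
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print(tfidf_matrix.toarray())
```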
217 |
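1.4 节的 n-gram 也可以用 CountVectorizer 的 ngram_range 参数得到,下面以 2-gram 为例(同样只是示意):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.",
]

# ngram_range=(2, 2) 表示只取 2-gram,即相邻两个词为一组
# 注意 CountVectorizer 不分句,跨句的词对("movies mary")也会被计入
bigram = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram.fit_transform(corpus)

print(bigram.get_feature_names())   # 形如 "john likes"、"likes to" 的 2-gram 词典
print(bigram_matrix.toarray())      # 每一行是一句话的 2-gram 计数向量
```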
218 |
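2.1 节的 word-word 共现矩阵可以用 numpy 直接构造。下面的小例子沿用上文的三句话,只统计相邻词的共现次数(示意实现,忽略标点;真实语料维度很大,应使用稀疏矩阵):

```python
import numpy as np

sentences = [
    ["I", "like", "deep", "learning"],
    ["I", "like", "NLP"],
    ["I", "enjoy", "flying"],
]

# 构造词表以及词到下标的映射
vocab = sorted({w for s in sentences for w in s})
w2i = {w: i for i, w in enumerate(vocab)}

# 对称的共现矩阵:统计每个词与其左右相邻词共同出现的次数
window = 1
cooc = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if i != j:
                cooc[w2i[w], w2i[s[j]]] += 1

print(vocab)
print(cooc)
```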
219 |
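3.2 节提到的 CBOW / Skip-gram,以及层次 Softmax、负例采样两种优化,在 gensim 里分别对应 sg、hs、negative 三个参数。下面用玩具语料演示接口用法(参数均为示意值,完整的中文维基训练见 word2vec.ipynb):

```python
from gensim.models import Word2Vec

# 玩具语料:每个元素是一句分好词的句子
sentences = [
    ["I", "like", "deep", "learning"],
    ["I", "like", "NLP"],
    ["I", "enjoy", "flying"],
]

# sg=0 表示 CBOW,sg=1 表示 Skip-gram
# hs=1 使用层次 Softmax;hs=0 且 negative>0 使用负例采样
model = Word2Vec(sentences,
                 size=50,          # 词向量维度(gensim 4.x 中改名为 vector_size)
                 window=2,         # 上下文窗口大小
                 min_count=1,      # 词频低于该值的词会被丢弃
                 sg=1,             # Skip-gram
                 hs=0, negative=5) # 负例采样,负样本数为 5

print(model.wv["NLP"].shape)           # (50,)
print(model.wv.most_similar("NLP"))    # 玩具语料上的结果没有实际意义,仅演示接口
```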
220 | 221 | ![image.png](https://upload-images.jianshu.io/upload_images/13876065-08b587647d14267c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) 222 | 223 | 欢迎添加微信交流!请备注“机器学习”。 224 | 225 | -------------------------------------------------------------------------------- /Word2Vec/word2vec.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## word2vec训练中文模型\n", 8 | "------\n", 9 | "### 1.准备数据与预处理\n", 10 | "首先需要一份比较大的中文语料数据,可以考虑中文的维基百科(也可以试试搜狗的新闻语料库)。中文维基百科的打包文件地址为 \n", 11 | "链接: https://pan.baidu.com/s/1H-wuIve0d_fvczvy3EOKMQ 提取码: uqua \n", 12 | "百度网盘加速下载地址:https://www.baiduwp.com/?m=index\n", 13 | "\n", 14 | "中文维基百科的数据不是太大,xml的压缩文件大约1G左右。首先用处理这个XML压缩文件。\n", 15 | "\n", 16 | "**注意输入输出地址**" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 9, 22 | "metadata": {}, 23 | "outputs": [ 24 | { 25 | "name": "stderr", 26 | "output_type": "stream", 27 | "text": [ 28 | "2019-05-08 21:42:31,184: INFO: running c:\\users\\mantch\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel_launcher.py -f C:\\Users\\mantch\\AppData\\Roaming\\jupyter\\runtime\\kernel-30939db9-3a59-4a92-844c-704c6189dbef.json\n", 29 | "2019-05-08 21:43:12,274: INFO: Saved 10000 articles\n", 30 | "2019-05-08 21:43:45,223: INFO: Saved 20000 articles\n", 31 | "2019-05-08 21:44:14,638: INFO: Saved 30000 articles\n", 32 | "2019-05-08 21:44:44,601: INFO: Saved 40000 articles\n", 33 | "2019-05-08 21:45:16,004: INFO: Saved 50000 articles\n", 34 | "2019-05-08 21:45:47,421: INFO: Saved 60000 articles\n", 35 | "2019-05-08 21:46:16,722: INFO: Saved 70000 articles\n", 36 | "2019-05-08 21:46:46,733: INFO: Saved 80000 articles\n", 37 | "2019-05-08 21:47:16,143: INFO: Saved 90000 articles\n", 38 | "2019-05-08 21:47:47,533: INFO: Saved 100000 articles\n", 39 | "2019-05-08 21:48:29,591: INFO: Saved 110000 articles\n", 40 | "2019-05-08 21:49:04,530: INFO: Saved 120000 articles\n", 41 | "2019-05-08 21:49:40,279: INFO: Saved 130000 articles\n", 42 | "2019-05-08 21:50:15,592: INFO: Saved 140000 articles\n", 43 | "2019-05-08 21:50:54,183: INFO: Saved 150000 articles\n", 44 | "2019-05-08 21:51:31,123: INFO: Saved 160000 articles\n", 45 | "2019-05-08 21:52:06,278: INFO: Saved 170000 articles\n", 46 | "2019-05-08 21:52:43,157: INFO: Saved 180000 articles\n", 47 | "2019-05-08 21:55:59,809: INFO: Saved 190000 articles\n", 48 | "2019-05-08 21:57:01,859: INFO: Saved 200000 articles\n", 49 | "2019-05-08 21:58:33,921: INFO: Saved 210000 articles\n", 50 | "2019-05-08 21:59:26,744: INFO: Saved 220000 articles\n", 51 | "2019-05-08 22:00:41,757: INFO: Saved 230000 articles\n", 52 | "2019-05-08 22:01:36,532: INFO: Saved 240000 articles\n", 53 | "2019-05-08 22:02:26,347: INFO: Saved 250000 articles\n", 54 | "2019-05-08 22:03:08,634: INFO: Saved 260000 articles\n", 55 | "2019-05-08 22:03:53,447: INFO: Saved 270000 articles\n", 56 | "2019-05-08 22:04:37,136: INFO: Saved 280000 articles\n", 57 | "2019-05-08 22:05:14,017: INFO: Saved 290000 articles\n", 58 | "2019-05-08 22:06:01,296: INFO: Saved 300000 articles\n", 59 | "2019-05-08 22:06:47,762: INFO: Saved 310000 articles\n", 60 | "2019-05-08 22:07:39,714: INFO: Saved 320000 articles\n", 61 | "2019-05-08 22:08:28,825: INFO: Saved 330000 articles\n", 62 | "2019-05-08 22:09:11,412: INFO: finished iterating over Wikipedia corpus of 338005 documents with 77273203 positions (total 3288566 articles, 91445479 positions before pruning articles 
shorter than 50 words)\n", 63 | "2019-05-08 22:09:11,555: INFO: Finished Saved 338005 articles\n" 64 | ] 65 | } 66 | ], 67 | "source": [ 68 | "import logging\n", 69 | "import os.path\n", 70 | "import sys\n", 71 | "from gensim.corpora import WikiCorpus\n", 72 | "if __name__ == '__main__':\n", 73 | " \n", 74 | " # 定义输入输出\n", 75 | " basename = \"F:/temp/DL/\"\n", 76 | " inp = basename+'zhwiki-latest-pages-articles.xml.bz2'\n", 77 | " outp = basename+'wiki.zh.text'\n", 78 | " \n", 79 | " program = os.path.basename(basename)\n", 80 | " logger = logging.getLogger(program)\n", 81 | " logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')\n", 82 | " logging.root.setLevel(level=logging.INFO)\n", 83 | " logger.info(\"running %s\" % ' '.join(sys.argv))\n", 84 | " # check and process input arguments\n", 85 | " if len(sys.argv) < 3:\n", 86 | " print(globals()['__doc__'] % locals())\n", 87 | " sys.exit(1)\n", 88 | " \n", 89 | " space = \" \"\n", 90 | " i = 0\n", 91 | " output = open(outp, 'w',encoding='utf-8')\n", 92 | " wiki = WikiCorpus(inp, lemmatize=False, dictionary={})\n", 93 | " for text in wiki.get_texts():\n", 94 | " output.write(space.join(text) + \"\\n\")\n", 95 | " i = i + 1\n", 96 | " if (i % 10000 == 0):\n", 97 | " logger.info(\"Saved \" + str(i) + \" articles\")\n", 98 | " output.close()\n", 99 | " logger.info(\"Finished Saved \" + str(i) + \" articles\")\n", 100 | "\n" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "\n", 108 | "### 2.训练数据\n", 109 | "Python的话可用jieba完成分词,生成分词文件wiki.zh.text.seg \n", 110 | "接着用word2vec工具训练: \n", 111 | "\n", 112 | "**注意输入输出地址**" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "import logging\n", 122 | "import os.path\n", 123 | "import sys\n", 124 | "import multiprocessing\n", 125 | "from gensim.corpora import WikiCorpus\n", 126 | "from gensim.models import Word2Vec\n", 127 | "from gensim.models.word2vec import LineSentence\n", 128 | "\n", 129 | "# 定义输入输出\n", 130 | "basename = \"F:/temp/DL/\"\n", 131 | "inp = basename+'wiki.zh.text'\n", 132 | "outp1 = basename+'wiki.zh.text.model'\n", 133 | "outp2 = basename+'wiki.zh.text.vector'\n", 134 | "\n", 135 | "program = os.path.basename(basename)\n", 136 | "logger = logging.getLogger(program)\n", 137 | "logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')\n", 138 | "logging.root.setLevel(level=logging.INFO)\n", 139 | "logger.info(\"running %s\" % ' '.join(sys.argv))\n", 140 | "# check and process input arguments\n", 141 | "if len(sys.argv) < 4:\n", 142 | " print(globals()['__doc__'] % locals())\n", 143 | "\n", 144 | "model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,\n", 145 | " workers=multiprocessing.cpu_count())\n", 146 | "# trim unneeded model memory = use(much) less RAM\n", 147 | "#model.init_sims(replace=True)\n", 148 | "model.save(outp1)\n", 149 | "model.save_word2vec_format(outp2, binary=False)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "raw", 154 | "metadata": {}, 155 | "source": [ 156 | "输出如下:\n", 157 | "2019-05-08 22:28:25,638: INFO: running c:\\users\\mantch\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel_launcher.py -f C:\\Users\\mantch\\AppData\\Roaming\\jupyter\\runtime\\kernel-b1f915fd-fdb2-43fc-bcf3-b361fb4a7c3d.json\n", 158 | "2019-05-08 22:28:25,640: INFO: collecting all words and their counts\n", 159 | "2019-05-08 22:28:25,642: INFO: PROGRESS: at sentence #0, 
processed 0 words, keeping 0 word types\n", 160 | "Automatically created module for IPython interactive environment\n", 161 | "2019-05-08 22:28:27,887: INFO: PROGRESS: at sentence #10000, processed 4278620 words, keeping 2586311 word types\n", 162 | "2019-05-08 22:28:29,666: INFO: PROGRESS: at sentence #20000, processed 7491125 words, keeping 4291863 word types\n", 163 | "2019-05-08 22:28:31,445: INFO: PROGRESS: at sentence #30000, processed 10424455 words, keeping 5704507 word types\n", 164 | "2019-05-08 22:28:32,854: INFO: PROGRESS: at sentence #40000, processed 13190001 words, keeping 6983862 word types\n", 165 | "2019-05-08 22:28:34,125: INFO: PROGRESS: at sentence #50000, processed 15813238 words, keeping 8145905 word types\n", 166 | "2019-05-08 22:28:35,353: INFO: PROGRESS: at sentence #60000, processed 18388731 words, keeping 9198885 word types\n", 167 | "2019-05-08 22:28:36,544: INFO: PROGRESS: at sentence #70000, processed 20773000 words, keeping 10203788 word types\n", 168 | "2019-05-08 22:28:37,652: INFO: PROGRESS: at sentence #80000, processed 23064544 words, keeping 11144885 word types\n", 169 | "2019-05-08 22:28:39,490: INFO: PROGRESS: at sentence #90000, processed 25324650 words, keeping 12034343 word types\n", 170 | "2019-05-08 22:28:40,688: INFO: PROGRESS: at sentence #100000, processed 27672540 words, keeping 12878856 word types\n", 171 | "2019-05-08 22:28:41,871: INFO: PROGRESS: at sentence #110000, processed 29985282 words, keeping 13688622 word types\n", 172 | "2019-05-08 22:28:42,944: INFO: PROGRESS: at sentence #120000, processed 32025045 words, keeping 14477767 word types\n", 173 | "2019-05-08 22:28:44,048: INFO: PROGRESS: at sentence #130000, processed 34267390 words, keeping 15309447 word types\n", 174 | "2019-05-08 22:28:45,197: INFO: PROGRESS: at sentence #140000, processed 36451394 words, keeping 16090548 word types\n", 175 | "2019-05-08 22:28:46,345: INFO: PROGRESS: at sentence #150000, processed 38671717 words, keeping 16877015 word types\n", 176 | "2019-05-08 22:28:47,483: INFO: PROGRESS: at sentence #160000, processed 40778409 words, keeping 17648492 word types\n", 177 | "2019-05-08 22:28:48,655: INFO: PROGRESS: at sentence #170000, processed 43154040 words, keeping 18308373 word types\n", 178 | "2019-05-08 22:28:49,759: INFO: PROGRESS: at sentence #180000, processed 45231681 words, keeping 19010906 word types\n", 179 | "2019-05-08 22:28:50,826: INFO: PROGRESS: at sentence #190000, processed 47190144 words, keeping 19659373 word types\n", 180 | "2019-05-08 22:28:51,886: INFO: PROGRESS: at sentence #200000, processed 49201934 words, keeping 20311518 word types\n", 181 | "2019-05-08 22:28:52,856: INFO: PROGRESS: at sentence #210000, processed 51116197 words, keeping 20917125 word types\n", 182 | "2019-05-08 22:28:53,859: INFO: PROGRESS: at sentence #220000, processed 53321151 words, keeping 21513016 word types\n", 183 | "2019-05-08 22:28:54,921: INFO: PROGRESS: at sentence #230000, processed 55408211 words, keeping 22207241 word types\n", 184 | "2019-05-08 22:28:59,645: INFO: PROGRESS: at sentence #240000, processed 57442276 words, keeping 22849499 word types\n", 185 | "2019-05-08 22:29:00,988: INFO: PROGRESS: at sentence #250000, processed 59563975 words, keeping 23544817 word types\n", 186 | "2019-05-08 22:29:02,292: INFO: PROGRESS: at sentence #260000, processed 61764248 words, keeping 24222911 word types\n", 187 | "2019-05-08 22:29:03,654: INFO: PROGRESS: at sentence #270000, processed 63938511 words, keeping 24906453 word types\n", 188 | "2019-05-08 
22:29:04,900: INFO: PROGRESS: at sentence #280000, processed 66096661 words, keeping 25519781 word types\n", 189 | "2019-05-08 22:29:06,057: INFO: PROGRESS: at sentence #290000, processed 67947209 words, keeping 26062482 word types\n", 190 | "2019-05-08 22:29:07,229: INFO: PROGRESS: at sentence #300000, processed 69927780 words, keeping 26649878 word types\n", 191 | "2019-05-08 22:29:08,506: INFO: PROGRESS: at sentence #310000, processed 71800313 words, keeping 27230264 word types\n", 192 | "2019-05-08 22:29:09,836: INFO: PROGRESS: at sentence #320000, processed 73942427 words, keeping 27850568 word types\n", 193 | "2019-05-08 22:29:11,419: INFO: PROGRESS: at sentence #330000, processed 75859220 words, keeping 28467061 word types\n", 194 | "2019-05-08 22:29:12,379: INFO: collected 28914285 word types from a corpus of 77273203 raw words and 338042 sentences" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "\n", 202 | "### 3.测试结果" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 1, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "name": "stderr", 212 | "output_type": "stream", 213 | "text": [ 214 | "c:\\users\\mantch\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel_launcher.py:11: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n", 215 | " # This is added back by InteractiveShellApp.init_path()\n" 216 | ] 217 | }, 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "排球 0.8914323449134827\n", 223 | "籃球 0.8889479041099548\n", 224 | "棒球 0.854706883430481\n", 225 | "高爾夫 0.832783043384552\n", 226 | "高爾夫球 0.8316080570220947\n", 227 | "網球 0.8276922702789307\n", 228 | "橄欖球 0.823620080947876\n", 229 | "英式足球 0.8229209184646606\n", 230 | "板球 0.822044312953949\n", 231 | "欖球 0.8151556253433228\n" 232 | ] 233 | } 234 | ], 235 | "source": [ 236 | "\n", 237 | "# 测试结果\n", 238 | "import gensim\n", 239 | "\n", 240 | "# 定义输入输出\n", 241 | "basename = \"F:/temp/DL/\"\n", 242 | "model = basename+'wiki.zh.text.model'\n", 243 | "\n", 244 | "model = gensim.models.Word2Vec.load(model)\n", 245 | "\n", 246 | "result = model.most_similar(u\"足球\")\n", 247 | "for e in result:\n", 248 | " print(e[0], e[1])" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 2, 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "name": "stdout", 258 | "output_type": "stream", 259 | "text": [ 260 | "女人 0.908246636390686\n", 261 | "男孩 0.872255802154541\n", 262 | "女孩 0.8567496538162231\n", 263 | "孩子 0.8363182544708252\n", 264 | "知道 0.8341636061668396\n", 265 | "某人 0.8211491107940674\n", 266 | "漂亮 0.8023637533187866\n", 267 | "伴侶 0.8001378774642944\n", 268 | "什麼 0.7944830656051636\n", 269 | "嫉妒 0.7929206490516663\n" 270 | ] 271 | }, 272 | { 273 | "name": "stderr", 274 | "output_type": "stream", 275 | "text": [ 276 | "c:\\users\\mantch\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).\n", 277 | " \"\"\"Entry point for launching an IPython kernel.\n" 278 | ] 279 | } 280 | ], 281 | "source": [ 282 | "result = model.most_similar(u\"男人\")\n", 283 | "for e in result:\n", 284 | " print(e[0], e[1])" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "metadata": {}, 291 | "outputs": [], 292 | "source": [] 
293 | } 294 | ], 295 | "metadata": { 296 | "kernelspec": { 297 | "display_name": "Python 3", 298 | "language": "python", 299 | "name": "python3" 300 | }, 301 | "language_info": { 302 | "codemirror_mode": { 303 | "name": "ipython", 304 | "version": 3 305 | }, 306 | "file_extension": ".py", 307 | "mimetype": "text/x-python", 308 | "name": "python", 309 | "nbconvert_exporter": "python", 310 | "pygments_lexer": "ipython3", 311 | "version": "3.5.4" 312 | } 313 | }, 314 | "nbformat": 4, 315 | "nbformat_minor": 2 316 | } 317 | --------------------------------------------------------------------------------
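补充一点:word2vec.ipynb 第 2 步提到"可用 jieba 完成分词,生成分词文件 wiki.zh.text.seg",但笔记本里没有给出这一步的代码。下面是一个示意性的分词脚本,文件路径沿用笔记本里的写法,jieba 需要额外 pip install;分词完成后,把训练脚本中的 inp 改成 wiki.zh.text.seg 即可。

```python
# 对 WikiCorpus 抽取出的语料逐行分词,生成 wiki.zh.text.seg
import jieba

basename = "F:/temp/DL/"            # 与笔记本中保持一致的目录
inp = basename + "wiki.zh.text"
outp = basename + "wiki.zh.text.seg"

with open(inp, "r", encoding="utf-8") as fin, \
     open(outp, "w", encoding="utf-8") as fout:
    for line in fin:
        # jieba.cut 返回一个生成器,用空格把分词结果拼回一行
        words = jieba.cut(line.strip())
        fout.write(" ".join(words) + "\n")
```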