├── .ipynb_checkpoints ├── KMedoids-checkpoint.ipynb ├── Kmeans-checkpoint.ipynb ├── evolutionary(experiment 1, find max value)-checkpoint.ipynb ├── evolutionary(experiment 2, Cracking Passwords)-checkpoint.ipynb ├── evolutionary(experiment 3, TSP)-checkpoint.ipynb └── smote-checkpoint.ipynb ├── KMedoids.html ├── KMedoids.ipynb ├── KMedoids.zip ├── Kmeans.html ├── Kmeans.ipynb ├── Kmeans.zip ├── PCA.html ├── PCA.ipynb ├── PCA.zip ├── README.md ├── data └── china.csv ├── evolutionary(experiment 1, find max value).ipynb ├── evolutionary(experiment 2, Cracking Passwords).ipynb ├── evolutionary(experiment 3, TSP).ipynb ├── evolutionary1.jpg ├── evolutionary2.jpg ├── evolutionary3.jpg ├── k_nearest_neighbors.html ├── k_nearest_neighbors.ipynb ├── k_nearest_neighbors.md ├── linear_discriminant_analysis(BiClass).html ├── linear_discriminant_analysis(BiClass).ipynb ├── linear_discriminant_analysis(BiClass).zip ├── linear_discriminant_analysis(MutiClass).html ├── linear_discriminant_analysis(MutiClass).ipynb ├── linear_discriminant_analysis(MutiClass).zip ├── logistic_regression.html ├── logistic_regression.ipynb ├── logistic_regression.zip ├── naive_bayes(continuity features).html ├── naive_bayes(continuity features).ipynb ├── naive_bayes(continuity features).md ├── naive_bayes(discrete features).html ├── naive_bayes(discrete features).ipynb ├── naive_bayes(discrete features).md ├── regression.html ├── regression.ipynb ├── regression.zip ├── smote.html ├── smote.ipynb └── smote.md /.ipynb_checkpoints/evolutionary(experiment 2, Cracking Passwords)-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "上一篇文章我们动手实验了用遗传算法求解函数在给定区间的最大值。本篇文章再来看一个实验:用遗传算法破解密码。\n", 8 | "\n", 9 | "在这个问题中,我们的个体就是一串字符串了,其目的就是找到一个与密码完全相同的字符串。基本步骤与前一篇文章基本类似,不过在本问题中,我们用字符的ASCII值来表示个体(字符串)的DNA。其它的就不多说了,还是看详细代码吧:" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [ 19 | { 20 | "name": "stdout", 21 | "output_type": "stream", 22 | "text": [ 23 | "第0 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t IJ6{Kvl\\o\"y\n", 24 | "第10 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I loveqyouJ\n", 25 | "第20 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I loveWyoue\n", 26 | "第30 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I love$you#\n", 27 | "第40 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovelyou$\n", 28 | "第50 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I love*youZ\n", 29 | "第60 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovePyouZ\n", 30 | "第70 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I love you'\n", 31 | "第80 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovemyou$\n", 32 | "第90 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovemyouH\n", 33 | "第100 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovebyou!\n", 34 | "第110 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovesyou!\n", 35 | "第120 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovebyou!\n", 36 | "第125 次进化后, 找到了密码: \t I love you!\n" 37 | ] 38 | } 39 | ], 40 | "source": [ 41 | "import numpy as np\n", 42 | "\n", 43 | "\n", 44 | "class GeneticAlgorithm(object):\n", 45 | " \"\"\"遗传算法.\n", 46 | "\n", 47 | " Parameters:\n", 48 | " -----------\n", 49 | " cross_rate: float\n", 50 | " 交配的可能性大小.\n", 51 | " mutate_rate: float\n", 52 | " 基因突变的可能性大小. 
\n", 53 | " n_population: int\n", 54 | " 种群的大小.\n", 55 | " n_iterations: int\n", 56 | " 迭代次数.\n", 57 | " password: str\n", 58 | " 欲破解的密码.\n", 59 | " \"\"\"\n", 60 | " def __init__(self, cross_rate, mutation_rate, n_population, n_iterations, password):\n", 61 | " self.cross_rate = cross_rate\n", 62 | " self.mutate_rate = mutation_rate\n", 63 | " self.n_population = n_population\n", 64 | " self.n_iterations = n_iterations\n", 65 | " self.password = password # 要破解的密码\n", 66 | " self.password_size = len(self.password) # 要破解密码的长度\n", 67 | " self.password_ascii = np.fromstring(self.password, dtype=np.uint8) # 将password转换成ASCII\n", 68 | " self.ascii_bounder = [32, 126+1]\n", 69 | " \n", 70 | "\n", 71 | " # 初始化一个种群\n", 72 | " def init_population(self):\n", 73 | " population = np.random.randint(low=self.ascii_bounder[0], high=self.ascii_bounder[1], \n", 74 | " size=(self.n_population, self.password_size)).astype(np.int8)\n", 75 | " return population\n", 76 | "\n", 77 | " # 将个体的DNA转换成ASCII\n", 78 | " def translateDNA(self, DNA): # convert to readable string\n", 79 | " return DNA.tostring().decode('ascii')\n", 80 | "\n", 81 | " # 计算种群中每个个体的适应度,适应度越高,说明该个体的基因越好\n", 82 | " def fitness(self, population):\n", 83 | " match_num = (population == self.password_ascii).sum(axis=1)\n", 84 | " return match_num\n", 85 | "\n", 86 | " # 对种群按照其适应度进行采样,这样适应度高的个体就会以更高的概率被选择\n", 87 | " def select(self, population):\n", 88 | " fitness = self.fitness(population) + 1e-4 # add a small amount to avoid all zero fitness\n", 89 | " idx = np.random.choice(np.arange(self.n_population), size=self.n_population, replace=True, p=fitness/fitness.sum())\n", 90 | " return population[idx]\n", 91 | "\n", 92 | " # 进行交配\n", 93 | " def create_child(self, parent, pop):\n", 94 | " if np.random.rand() < self.cross_rate:\n", 95 | " index = np.random.randint(0, self.n_population, size=1) # select another individual from pop\n", 96 | " cross_points = np.random.randint(0, 2, self.password_size).astype(np.bool) # choose crossover points\n", 97 | " parent[cross_points] = pop[index, cross_points] # mating and produce one child\n", 98 | " #child = parent\n", 99 | " return parent\n", 100 | "\n", 101 | " # 基因突变\n", 102 | " def mutate_child(self, child):\n", 103 | " for point in range(self.password_size):\n", 104 | " if np.random.rand() < self.mutate_rate:\n", 105 | " child[point] = np.random.randint(*self.ascii_bounder) # choose a random ASCII index\n", 106 | " return child\n", 107 | "\n", 108 | " # 进化\n", 109 | " def evolution(self):\n", 110 | " population = self.init_population()\n", 111 | " for i in range(self.n_iterations):\n", 112 | " fitness = self.fitness(population)\n", 113 | " \n", 114 | " best_person = population[np.argmax(fitness)]\n", 115 | " best_person_ascii = self.translateDNA(best_person)\n", 116 | " \n", 117 | " if i % 10 == 0:\n", 118 | " print(u'第%-4d次进化后, 基因最好的个体(与欲破解的密码最接近)是: \\t %s'% (i, best_person_ascii))\n", 119 | " \n", 120 | " if best_person_ascii == self.password:\n", 121 | " print(u'第%-4d次进化后, 找到了密码: \\t %s'% (i, best_person_ascii))\n", 122 | " break\n", 123 | " \n", 124 | " population = self.select(population)\n", 125 | " population_copy = population.copy()\n", 126 | " \n", 127 | " for parent in population:\n", 128 | " child = self.create_child(parent, population_copy)\n", 129 | " child = self.mutate_child(child)\n", 130 | " parent[:] = child\n", 131 | " \n", 132 | " population = population\n", 133 | " \n", 134 | "def main():\n", 135 | " password = 'I love you!' 
# 要破解的密码\n", 136 | " \n", 137 | " ga = GeneticAlgorithm(cross_rate=0.8, mutation_rate=0.01, n_population=300, n_iterations=500, password=password)\n", 138 | " \n", 139 | " ga.evolution()\n", 140 | "\n", 141 | "if __name__ == '__main__':\n", 142 | " main()\n" 143 | ] 144 | } 145 | ], 146 | "metadata": { 147 | "kernelspec": { 148 | "display_name": "Python [conda root]", 149 | "language": "python", 150 | "name": "conda-root-py" 151 | }, 152 | "language_info": { 153 | "codemirror_mode": { 154 | "name": "ipython", 155 | "version": 2 156 | }, 157 | "file_extension": ".py", 158 | "mimetype": "text/x-python", 159 | "name": "python", 160 | "nbconvert_exporter": "python", 161 | "pygments_lexer": "ipython2", 162 | "version": "2.7.12" 163 | } 164 | }, 165 | "nbformat": 4, 166 | "nbformat_minor": 1 167 | } 168 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/smote-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "本文将主要详细介绍一下SMOTE(Synthetic Minority Oversampling Technique)算法从原理到代码实践,SMOTE主要是用来解决类不平衡问题的,在讲解SMOTE算法之前,我们先解释一下什么是类不平衡问题、为什么类不平衡带来的问题以及相应的解决方法。\n", 8 | "\n", 9 | "\n", 10 | "## 1. 什么是类不平衡问题\n", 11 | "\n", 12 | "类不平衡(class-imbalance)是指在训练分类器中所使用的训练集的类别分布不均。比如说一个二分类问题,1000个训练样本,比较理想的情况是正类、负类样本的数量相差不多;而如果正类样本有995个、负类样本仅5个,就意味着存在类不平衡。\n", 13 | "\n", 14 | "在后文中,把样本数量过少的类别称为“少数类”。\n", 15 | "\n", 16 | "但实际上,数据集上的类不平衡到底有没有达到需要特殊处理的程度,还要看不处理时训练出来的模型在验证集上的效果。有些时候是没必要处理的。\n", 17 | "\n", 18 | "\n", 19 | "## 2. 类不平衡引发的问题\n", 20 | "\n", 21 | "\n", 22 | "### 2.1 从模型的训练过程来看\n", 23 | "\n", 24 | "从训练模型的角度来说,如果某类的样本数量很少,那么这个类别所提供的“信息”就太少。\n", 25 | "\n", 26 | "使用经验风险(模型在训练集上的平均损失)最小化作为模型的学习准则。设损失函数为0-1 loss(这是一种典型的均等代价的损失函数),那么优化目标就等价于错误率最小化(也就是accuracy最大化)。考虑极端情况:1000个训练样本中,正类样本999个,负类样本1个。训练过程中在某次迭代结束后,模型把所有的样本都分为正类,虽然分错了这个负类,但是所带来的损失实在微不足道,accuracy已经是99.9%,于是满足停机条件或者达到最大迭代次数之后自然没必要再优化下去,ok,到此为止,训练结束!于是这个模型……\n", 27 | "\n", 28 | "模型没有学习到如何去判别出少数类,这时候模型的召回率会非常低。\n", 29 | "\n", 30 | "\n", 31 | "\n", 32 | "### 2.2 从模型的预测过程来看\n", 33 | "\n", 34 | "考虑二项Logistic回归模型。输入一个样本 x ,模型输出的是其属于正类的概率 ŷ 。当 ŷ >0.5 时,模型判定该样本属于正类,否则就是属于反类。\n", 35 | "\n", 36 | "为什么是0.5呢?可以认为模型是出于最大后验概率决策的角度考虑的,选择了0.5意味着当模型估计的样本属于正类的后验概率要大于样本属于负类的后验概率时就将样本判为正类。但实际上,这个后验概率的估计值是否准确呢?\n", 37 | "\n", 38 | "从几率(odds)的角度考虑:几率表达的是样本属于正类的可能性与属于负类的可能性的比值。模型对于样本的预测几率为 $\\frac{ŷ} {1−ŷ}$ 。\n", 39 | "\n", 40 | "模型在做出决策时,当然希望能够遵循真实样本总体的正负类样本分布:设 θ 等于正类样本数除以全部样本数,那么样本的真实几率为 $\\frac{θ}{1−θ}$ 。当观测几率大于真实几率时,也就是 $ŷ >θ$ 时,那么就判定这个样本属于正类。\n", 41 | "\n", 42 | "虽然我们无法获悉真实样本总体,但之于训练集,存在这样一个假设:训练集是真实样本总体的无偏采样。正是因为这个假设,所以认为训练集的观测几率$\\frac { \\hat{\\theta}}{1−\\hat{ \\theta }}$就代表了真实几率 $\\frac{θ}{1−θ}$ 。\n", 43 | "\n", 44 | "所以,在这个假设下,当一个样本的预测几率大于观测几率时,就应该将样本判断为正类。\n", 45 | "\n", 46 | "\n", 47 | "## 3. 解决类不平衡问题的方法\n", 48 | "\n", 49 | "目前主要有三种办法:\n", 50 | "\n", 51 | "### 3.1 调整 θ 值\n", 52 | "\n", 53 | "根据训练集的正负样本比例,调整 θ 值。   \n", 54 | "\n", 55 | "这样做的依据是上面所述的对训练集的假设。但在给定任务中,这个假设是否成立,还有待讨论。\n", 56 | "\n", 57 | "\n", 58 | "### 3.2 过采样\n", 59 | "\n", 60 | "对训练集里面样本数量较少的类别(少数类)进行过采样,合成新的样本来缓解类不平衡。下面将介绍一种经典的过采样算法:SMOTE。\n", 61 | "\n", 62 | "### 3.3 欠采样\n", 63 | "\n", 64 | "对训练集里面样本数量较多的类别(多数类)进行欠采样,抛弃一些样本来缓解类不平衡。\n", 65 | "\n", 66 | "\n", 67 | "\n", 68 | "## 4. 
SMOTE算法原理\n", 69 | "\n", 70 | "SMOTE,合成少数类过采样技术.它是基于随机过采样算法的一种改进方案,由于随机过采样采取简单复制样本的策略来增加少数类样本,这样容易产生模型过拟合的问题,即使得模型学习到的信息过于特别(Specific)而不够泛化(General),SMOTE算法的基本思想是对少数类样本进行分析并根据少数类样本人工合成新样本添加到数据集中,算法流程如下。\n", 71 | "\n", 72 | "### 4.1 SMOTE算法流程\n", 73 | "\n", 74 | "对于正样本数据集X(minority class samples),遍历每一个样本:\n", 75 | "\n", 76 | "$\\,\\,\\,\\,\\,\\,$ (1) 对于少数类(X)中每一个样本x,计算它到少数类样本集(X)中所有样本的距离,得到其k近邻。\n", 77 | " \n", 78 | "$\\,\\,\\,\\,\\,\\,$ (2) 根据样本不平衡比例设置一个采样比例以确定采样倍率sampling_rate,对于每一个少数类样本x,从其k近邻中随机选择sampling_rate个近邻,假设选择的近邻为 ${x^{(1)}, x^{(2)}, ..., x^{(sampling\\_rate)}}$ 。\n", 79 | " \n", 80 | "$\\,\\,\\,\\,\\,\\,$ (3) 对于每一个随机选出的近邻 $x^{(i)} \\, (i=1,2, ..., {sampling\\_rate}) $,分别与原样本按照如下的公式构建新的样本\n", 81 | "\n", 82 | "$$ x_{new} = x + rand(0, 1) * (x^{(i)} - x) $$\n", 83 | "\n", 84 | "\n", 85 | "### 4.2 SMOTE算法代码实现\n", 86 | "\n", 87 | "下面我们就用代码来实现一下:" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 1, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "[[ 2.55355825 3.55355825 5.33033738]\n", 100 | " [ 3. 4.89432435 2.42270262]\n", 101 | " [ 2. 2. 1. ]\n", 102 | " [ 3. 5. 2. ]\n", 103 | " [ 5. 3. 4. ]\n", 104 | " [ 3. 3.85514586 5.85514586]\n", 105 | " [ 1. 2. 3. ]\n", 106 | " [ 3. 4. 6. ]\n", 107 | " [ 2. 2. 1. ]\n", 108 | " [ 3. 5. 2. ]\n", 109 | " [ 5. 3. 4. ]\n", 110 | " [ 3. 2. 4. ]]\n" 111 | ] 112 | } 113 | ], 114 | "source": [ 115 | "import random\n", 116 | "from sklearn.neighbors import NearestNeighbors\n", 117 | "import numpy as np\n", 118 | "\n", 119 | "class Smote:\n", 120 | " \"\"\"\n", 121 | " SMOTE过采样算法.\n", 122 | "\n", 123 | "\n", 124 | " Parameters:\n", 125 | " -----------\n", 126 | " k: int\n", 127 | " 选取的近邻数目.\n", 128 | " sampling_rate: int\n", 129 | " 采样倍数, attention sampling_rate < k.\n", 130 | " newindex: int\n", 131 | " 生成的新样本(合成样本)的索引号.\n", 132 | " \"\"\"\n", 133 | " def __init__(self, sampling_rate=5, k=5):\n", 134 | " self.sampling_rate = sampling_rate\n", 135 | " self.k = k\n", 136 | " self.newindex = 0\n", 137 | "\n", 138 | " def fit(self, X, y=None):\n", 139 | " if y is not None:\n", 140 | " negative_X = X[y==0]\n", 141 | " X = X[y==1]\n", 142 | " \n", 143 | " n_samples, n_features = X.shape\n", 144 | " # 初始化一个矩阵, 用来存储合成样本\n", 145 | " self.synthetic = np.zeros((n_samples * self.sampling_rate, n_features))\n", 146 | " \n", 147 | " # 找出正样本集(数据集X)中的每一个样本在数据集X中的k个近邻\n", 148 | " knn = NearestNeighbors(n_neighbors=self.k).fit(X)\n", 149 | " for i in range(len(X)):\n", 150 | " k_neighbors = knn.kneighbors(X[i].reshape(1,-1), \n", 151 | " return_distance=False)[0]\n", 152 | " # 对正样本集(minority class samples)中每个样本, 分别根据其k个近邻生成\n", 153 | " # sampling_rate个新的样本\n", 154 | " self.synthetic_samples(X, i, k_neighbors)\n", 155 | " \n", 156 | " if y is not None:\n", 157 | " return ( np.concatenate((self.synthetic, X, negative_X), axis=0), \n", 158 | " np.concatenate(([1]*(len(self.synthetic)+len(X)), y[y==0]), axis=0) )\n", 159 | " \n", 160 | " return np.concatenate((self.synthetic, X), axis=0)\n", 161 | "\n", 162 | "\n", 163 | " # 对正样本集(minority class samples)中每个样本, 分别根据其k个近邻生成sampling_rate个新的样本\n", 164 | " def synthetic_samples(self, X, i, k_neighbors):\n", 165 | " for j in range(self.sampling_rate):\n", 166 | " # 从k个近邻里面随机选择一个近邻\n", 167 | " neighbor = np.random.choice(k_neighbors)\n", 168 | " # 计算样本X[i]与刚刚选择的近邻的差\n", 169 | " diff = X[neighbor] - X[i]\n", 170 | " # 生成新的数据\n", 171 | " self.synthetic[self.newindex] = X[i] + random.random() * diff\n", 172 | " self.newindex += 1\n", 173 | 
" \n", 174 | "X=np.array([[1,2,3],[3,4,6],[2,2,1],[3,5,2],[5,3,4],[3,2,4]])\n", 175 | "y = np.array([1, 1, 1, 0, 0, 0])\n", 176 | "smote=Smote(sampling_rate=1, k=5)\n", 177 | "print(smote.fit(X))" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "### 4.3 SMOTE算法的缺陷\n", 185 | "\n", 186 | "该算法主要存在两方面的问题:一是在近邻选择时,存在一定的盲目性。从上面的算法流程可以看出,在算法执行过程中,需要确定k值,即选择多少个近邻样本,这需要用户自行解决。从k值的定义可以看出,k值的下限是sampling_rate(sampling_rate为从k个近邻中随机挑选出的近邻样本的个数,且有 sampling_rate < k ), sampling_rate的大小可以根据负类样本数量、正类样本数量和数据集最后需要达到的平衡率决定。但k值的上限没有办法确定,只能根据具体的数据集去反复测试。因此如何确定k值,才能使算法达到最优这是未知的。\n", 187 | "\n", 188 | "另外,该算法无法克服非平衡数据集的数据分布问题,容易产生分布边缘化问题。由于正类样本(少数类样本)的分布决定了其可选择的近邻,如果一个正类样本处在正类样本集的分布边缘,则由此正类样本和相邻样本产生的“人造”样本也会处在这个边缘,且会越来越边缘化,从而模糊了正类样本和负类样本的边界,而且使边界变得越来越模糊。这种边界模糊性,虽然使数据集的平衡性得到了改善,但加大了分类算法进行分类的难度。\n", 189 | "\n", 190 | "针对SMOTE算法的进一步改进\n", 191 | "\n", 192 | "针对SMOTE算法存在的边缘化和盲目性等问题,很多人纷纷提出了新的改进办法,在一定程度上改进了算法的性能,但还存在许多需要解决的问题。\n", 193 | "\n", 194 | "Han等人Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning 在SMOTE算法基础上进行了改进,提出了Borderhne.SMOTE算法,解决了生成样本重叠(Overlapping)的问题该算法在运行的过程中,查找一个适当的区域,该区域可以较好地反应数据集的性质,然后在该区域内进行插值,以使新增加的“人造”样本更有效。这个适当的区域一般由经验给定,因此算法在执行的过程中有一定的局限性。\n", 195 | "\n", 196 | "\n", 197 | "\n", 198 | "参考文献:\n", 199 | "\n", 200 | "http://www.cnblogs.com/Determined22/p/5772538.html\n", 201 | "\n", 202 | "smote算法的论文地址:https://www.jair.org/media/953/live-953-2037-jair.pdf\n", 203 | "\n", 204 | "http://blog.csdn.net/yaphat/article/details/60347968\n", 205 | "\n", 206 | "http://blog.csdn.net/Yaphat/article/details/52463304?locationNum=7#0-tsina-1-78137-397232819ff9a47a7b7e80a40613cfe1\n" 207 | ] 208 | } 209 | ], 210 | "metadata": { 211 | "kernelspec": { 212 | "display_name": "Python 3", 213 | "language": "python", 214 | "name": "python3" 215 | }, 216 | "language_info": { 217 | "codemirror_mode": { 218 | "name": "ipython", 219 | "version": 3 220 | }, 221 | "file_extension": ".py", 222 | "mimetype": "text/x-python", 223 | "name": "python", 224 | "nbconvert_exporter": "python", 225 | "pygments_lexer": "ipython3", 226 | "version": "3.6.1" 227 | } 228 | }, 229 | "nbformat": 4, 230 | "nbformat_minor": 2 231 | } 232 | -------------------------------------------------------------------------------- /KMedoids.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Allensmile/Machine-learning-implement/HEAD/KMedoids.zip -------------------------------------------------------------------------------- /Kmeans.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Allensmile/Machine-learning-implement/HEAD/Kmeans.zip -------------------------------------------------------------------------------- /PCA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "\n", 9 | "前面两篇文章详细讲解了线性判别分析LDA,说到LDA,就不能不提到主成份分析,简称为PCA,是一种非监督学习算法,经常被用来进行数据降维、有损数据压缩、特征抽取、数据可视化(Jolliffe, 2002)。它也被称为Karhunen-Loève变换。\n", 10 | "\n", 11 | "\n", 12 | "\n", 13 | "\n", 14 | "## 1. 
PCA原理\n", 15 | "\n", 16 | "\n", 17 | "PCA的思想是将$n$维特征映射到$k$维空间上$k" 39 | ] 40 | }, 41 | "metadata": {}, 42 | "output_type": "display_data" 43 | } 44 | ], 45 | "source": [ 46 | "from IPython.display import display, Image\n", 47 | "\n", 48 | "display(Image('pca1.jpg'))" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "如上图所示,红色点表示原样本点$x^{(i)}$,$u$是蓝色直线的斜率也是直线的方向向量,而且是单位向量,直线上的蓝色点表示原样本点$x^{(i)}$在$u$上的投影。容易知道投影点离原点的距离是$x^{(i)T}u$,由于这些原始样本点的每一维特征均值都为0,因此投影到u上的样本点的均值仍然是0。\n", 56 | "\n", 57 | "假设原始数据集为$X_{mxn}$,我们的目标是找到最佳的投影空间$W_{nxk}=(w_1, w_2, …, w_k)$,其中$w_i$是单位向量且$w_i$与$w_j (i\\neq j)$正交, 何为最佳的$W$?就是原始样本点投影到$W$上之后,使得投影后的样本点方差最大。\n", 58 | "\n", 59 | "由于投影后均值为0,因此投影后的总方差为:\n", 60 | "\n", 61 | "$$ \\frac{1}{m}\\sum_{i=1}^m (x^{(i)T}w)^2 = \\frac{1}{m}\\sum_{i=1}^m w^T x^{(i)} x^{(i)T}w =\\sum_{i=1}^m w^T (\\frac{1}{m} x^{(i)} x^{(i)T}) w_1 $$ \n", 62 | "\n", 63 | "$\\frac{1}{m} x^{(i)} x^{(i)T}$是不是似曾相识,没错,它就是原始数据集$X$的协方差矩阵(因为$x^{(i)}$的均值为0,因为无偏估计的原因,一般协方差矩阵除以$m-1$,这里用m)。\n", 64 | "\n", 65 | "记$\\lambda = \\frac{1}{m}\\sum_{i=1}^m (x^{(i)T}w)^2$, $\\sum = \\frac{1}{m} x^{(i)} x^{(i)T}$, 则有 \n", 66 | "\n", 67 | "$$\\lambda = w^T \\sum w $$.\n", 68 | "\n", 69 | "上式两边同时左乘$w$,注意到$w w^T =1$(单位向量),则有\n", 70 | "\n", 71 | "$$\\lambda w = \\sum w $$.\n", 72 | "\n", 73 | "所以$w$是矩阵$\\sum$的特征值所对应的特征向量。\n", 74 | "\n", 75 | "欲使投影后的总方差最大,即$\\lambda $最大,因此最佳的投影向量$w$是特征值$\\lambda $最大时对应的特征向量,因此,当我们将$w$设置为与具有最大的特征值$\\lambda$的特征向量相等时,方差会达到最大值。这个特征向量被称为第一主成分。\n", 76 | "\n", 77 | "我们可以用一种增量的方式定义额外的主成分,方法为:在所有与那些已经考虑过的方向正交的所有可能的方向中,将新的方向选择为最大化投影方差的方向。如果我们考虑 $k$ 维投影空间的一般情形,那么最大化投影数据方差的最优线性投影由数据协方差矩阵 $\\sum $ 的 $k$ 个特征向量 $w_1,..., w_k $ 定义,对应于 $k$ 个最大的特征值 $\\lambda_1,..., \\lambda_k $ 。可以通过归纳法很容易地证明出来。\n", 78 | "\n", 79 | "\n", 80 | "因此,我们只需要对协方差矩阵进行特征值分解,得到的前$k$大特征值对应的特征向量就是最佳的$k$维新特征,而且这$k$维新特征是正交的。得到前$k$个$u$以后,原始数据集$X$通过变换可以得到新的样本。\n", 81 | "\n", 82 | "\n", 83 | "### PCA算法流程\n", 84 | "\n", 85 | "算法输入:数据集$X_{mxn}$\n", 86 | "* 按列计算数据集$X$的均值$X_{mean}$,然后令$X_{new} = X - X_{mean}$;\n", 87 | "* 求解矩阵$X_{new}$的协方差矩阵,并将其记为$Cov$;\n", 88 | "* 计算协方差矩阵$COv$的特征值和相应的特征向量;\n", 89 | "* 将特征值按照从大到小的排序,选择其中最大的$k$个,然后将其对应的$k$个特征向量分别作为列向量组成特征向量矩阵$W_{nxk}$;\n", 90 | "* 计算$X_{new}W$,即将数据集$X_{new}$投影到选取的特征向量上,这样就得到了我们需要的已经降维的数据集$X_{new}W$。\n", 91 | "\n", 92 | "\n", 93 | "\n", 94 | "注意,计算一个$nxn$矩阵的完整的特征向量分解的时间复杂度为 $O(n^3)$ 。如果我们将数据集投影到前 $k$ 个主成分中,那么我们只需寻找前 $k$ 个特征值和特征向量。这可以使用更高效的方法得到,例如幂方法(power method) (Golub and Van Loan, 1996),它的时间复杂度为 $O(k n^2)$,或者我们也可以使用 EM 算法。" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "### 1.2 最小平方误差理论" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 2, 107 | "metadata": { 108 | "scrolled": false 109 | }, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAVUAAAEcCAIAAAD4ObDdAAAAAXNSR0IArs4c6QAAAARnQU1BAACx\njwv8YQUAAAAJcEhZcwAADsIAAA7CARUoSoAAACfHSURBVHhe7V1peBRV1sZldMZnZhxn1E+/mdHn\nG2dcRp/HMSKOiiA7goBGQSEIiAKyCRhigCCI7ARkXwUkQAg7sodNQEzYIexLCDthSdjDFoL3e5uC\nptOp7q7uruq6VfXWkx8h3Dr3nPfUW/fWveeeU0zwIgJEwKkIFHOq4bSbCBABQf7zISACzkWA/Heu\n72k5ESD/+QwQAeciQP471/e0nAiQ/3wGiIBzESD/net7Wk4EyH8+A0TAuQiQ/871PS0nAuQ/nwEi\n4FwEyH/n+p6WEwHyn88AEXAuApLyf9GiRdOmTZs5c+bhw4fhnGvXri1ZsmTWrFk5OTnO9RUtJwJ6\nIyAp/y9evNi2bduBAwe67V2xYgVeAXqbT3lEwNEISMp/+ASEb9q06fnz5xX/DB48+MqVK472FY0n\nAnojIC//8/PzY2Njp0+fDpOPHz8+YcIEvW2nPCLgdATk5T88s3DhwhYtWuTl5U2ZMkVZCOBFBIiA\njghIzX9M+L/44gtMATwXAnQ0nqKIgMMRkJr/8A3W/Bs1apSWluZwP9F8ImAEArLz/8KFC9gIwFqA\nEcZTJhFwOAKy8z87Oxsf/w53Es0nAgYhIDv/k5OTT5w4YZDxFEsEHI6ApPzH5n/Hjh1HjBgxbtw4\nh3uI5hMB4xCQlP/bt2/v0aMH4n8LCgqMM56SiYDDEZCU/26vbNiw4caNGw53Es0nAgYhIDv/+/fv\nn5mZaZDxFEsEHI6A1PxH/E/jxo2TkpIc7iSaTwQMQkBq/q9fvx7BP/Hx8QYZT7FEwOEISM3/MWPG\ngP+4Tp486XA/0XwiYAQC8vIfy36tW7dW+L948WIjjKdMIuBwBOTl/549exTy40pMTHS4n2g+ETAC\nAXn5P3XqVDf/mzRpglPARthPmUTAyQjIy/+EhAQ3//HLmjVrnOwn2k4EjEBAUv4j4Y8n+fH7yJEj\njbCfMomAkxGQlP+pqale/EcikOvXrzvZVbSdCOiOgKT87927txf/8c8dO3bobj8FEgEnIyAj/5Hz\nA2F/Rfk/adIkJ7uKthMB3RGQkf/p6ekK+Vu2bNmpUyf3XKBdu3a620+BRMDJCMjI/+HDh4P87du3\nP3bsGHb+d+3ahYPAyhvhyJEjTvYWbScC+iIgHf+R6g85vzHm4ysApoL/u3fvxi+rV69GOZB58+bp\naz+lEQEnIyAd/7dt24acP+6lfjf/4aR9+/ahCpCTvUXbiYC+CEjHf68Kn578h+WnT59mOhB9nwBK\nczIC0vHfyxle/Heyq2g7EdAdAfJfd0gpkAhYBgHy3zKuoqJEQHcEyH/dIaVAImAZBMh/y7iKihIB\n3REg/3WHlAKJgGUQIP8t4yoqSgR0R4D81x1SCiQClkGA/LeMq6goEdAdAfJfd0gpkAhYBgHy3zKu\noqJEQHcEyH/dIaVAImAZBMh/y7iKihIB3REg/3WHlAKJgGUQIP8t4yoqSgR0R4D81x1SCiQClkGA\n/LeMq6goEdAdAfJfd0gpkAhYBgHy3zKuoqJEQHcEyH/dIaVAImAZBMh/y7iKihIB3REg/3WHlAKJ\ngGUQIP8t4yoqSgR0R4D81x1SCiQClkGA/LeMq6goEdAdAfJfd0gp0C4IbEkXI7uIIQli9lhxOc8u\nVhWyg/y3pVtpVHgI7N8lar8koord+Sn1oJjYPzyhMt5N/svoFepkJgJHskTZRwqR3/0i+KGXmYoZ\n0Df5bwCoFGlpBDrUUSc/3gIl7hOnT1raOC/lyX87eZO2hI3A2VwXyT1n/l6/22sKQP6H/cRQgJ0Q\nyEjzR368CzA7sNFF/tvImTQlfAS2rA7A/4S64XcijwTyXx5fUBMJELh0UZT8g79XwJShEmipmwrk\nv25QUpBNEOjZzCf/8WrIu6DRzOvXry9atKh3797db16DBw9evny5xnsj1oz8jxjU7MgiCJw7LaKf\nVXkFvHy3WDQ5WBsGDBgQdfNaunRpsPdGoD35HwGQ2YXVEDiTI+JriVfuvfMWiH5OrJwTghljx45V\n+L9u3boQbjf6FvLfaIQp37IInMoWC5LF7B/EplUh20D+hwyd68bExMTdu3eHJYI3EwHzECD/w8Ke\n/A8LPt5sNgLkf1geIP/Dgo83m40A+R+WB8j/sOAL/+YbNwSOwa6Y7foGxu+8gkTAD/9zc3ODFKZ/\nc67/6Y+pfSROGy6q/ePOGniVJ0TyAPtYFxFLfPE/KysL0QERUcFfJ+S/6S6QVYEBcephML1ayKqx\njHr54n/Xrl03b95susbkv+kukFIBzPb9nIFLS5VSaRmVGjp0qLL/v2DBAkW//Pz8lJQU/OXYsWOm\na0z+m+4CKRVo95E//reqJqXSZiqVnZ3dv3//JUuWuJVA/O/ixYvbtm3b+ObVtGnThISEjh07tmrV\nCv9s0qQJXgRmanyzb/LfdBdIqUCFx/zxH2HwvG4jgGl8XFxcmTJlwP/Tp09bCxjy31r+ipS2ZR/2\nx//XfhcpPeTtB6P3/Pnz69atW6NGjcmTJ1+6dEleXX1rRv5b0WvG69ywpD/+f/Qf4zWQt4ezZ8+O\nHj26UqVKmMOvXLny119/lVfXQJqR/4EQcub/zxvvj//YF3TkhU07rNuXLl26S5cumZmZNsCA/LeB\nEw0wIf+aaPC6+isgpri4ctmALuUViRF+1apVWMCrWLHiqFGjzpw5I6+uQWpG/gcJmHOaXzgnvqjq\n/QpoWkHgeLxjrsuXL0+dOjU6OrpOnTrz5s2TYcVeX+zJf33xtJ20zG1icHvRtZEY8JXYtcl25vk0\n6MSJE0jdUbZs2djY2I0bN9rVcPLfrp6lXSEisHXr1vj4eOzn9e3b9+jRoyFKscht5L9FHEU1DUYA\n4Tqpqan16tWrXr16cnJyXp49C/55oUj+G/xYUbz0CJw/f/6HH354++23GzVqhBSdN5x0zJH8l/7x\npIKGIXDgwAFk5sV+XqdOnZyZZor8N+zhomCJEUhPT2/RokWFChVGjBghwzl8s6Ai/81Cnv2agMCV\nK1dmzJjxwQcffPjhh7Nnz7527ZoJSsjUJfkvkzeoi2EInDp1ChU4sJ/XunVrOVNxG2a6P8Hkvymw\ns9PIIbBjx44OHTq89dZbKMVz+PDhyHVshZ7Ifyt4iToGj0BBQQGO3zdo0KBq1aoTJky4cEFr3a7g\nu7LwHeS/hZ1H1VURANWTkpKqVKny6aefLlu2DC8CAuULAfKfz4Z9EDh06FCvXr0w1UemnZ07d9rH\nMMMsIf8Ng5aCI4jA2rVrkVerXLlyyLeHpb4I9mztrsh/a/vP4dpfvXp11qxZtWrVqlmz5syZM/FP\nhwMSrPnkf7CIsb0UCOTk5AwbNqx8+fJffPHFmjVrpNDJ
gkqQ/xZ0WgRVRuqLnw/OK7gh0RLarl27\nkEUXQbs9e/Y8ePBgBMGwYVfkvw2dqqNJh89lRo0otv2k+bXrcSwHi/lY0sfC/rhx47ifp4uXyX9d\nYLSekBu/3kjZOmjUhm/xk5SRmHVmx7jNfUZv7L43d4unMVeuXy4+8p4zl81cUbt48eLEiRPfeeed\n+vXro2YWDupaD25ZNSb/ZfWM8XqdzDtaPul/MLzHpr6H3r5Z3nBj9s9Fu208p4zxuqj3cOTIkT59\n+mA/r3379tu2bTNLDRv3S/7b2LmBTZu583vw/+URd8/aNRrTAdUbVh6YE1iQ3i3Wr1/fpk0bhOsP\nGjTo5MmTeounvFsIkP+OfhSwsBc9+Tm8AiokPYYvAtOxQIJNHMv76KOPkHJz+vTpSL9pukr2VoD8\nt7d/A1v30/5Z4D9+0g+7ylFfv3Edc4GEpTGB79S1BSpn4Sg+DuQ3b948LS3N0kU1dAXGWGHkv7H4\nyi993dGfyo57GPyPmVEc2oJ4a48srT7pnxHTfO/evZ07d8Z+Xrdu3fbv3x+xftkRECD/Hf0YXLh6\nrtvKJpuyVylTgEX7pgCOnac2RoD/2M9bsWIFKuFWrlx57Nix586dc7QnTDKe/DcJeAm6PXM5p/m8\nSisPzsWYr6wCVJn4xKm8Y27+78nNwNth/bHl+iqLUpmTJk1Cmt2PP/544cKF3M/TF96gpJH/QcFl\nn8ZY7Vu2f+ayrBn4/sfmP35RflYfWazwH0uDw9d12n9ml442Hzt2rF+/fkitjwT7W7YUCjTQsReK\n0o4A+a8dK6e0BP+rJT8Vt+gDBP/pZfOmTZtQSAf7eSiqk52drZdYygkTAfI/TABteLsy/oP/XVc0\nCtM87OfNnz+/WrVqNWrUmDJlyr59+xDD6+dC7Z0we+TtQSFA/gcFlyMaK/y/lH8xevKzM3aOCs1m\nFMn9/vvvUTD3888/R+guDu1ADtb8kIHXfeGgPpYAPf/CtYDQ0A75LvI/ZOjseWN+Qf607cNLj30I\nk39E/pUYdd+CvclXr1/Rbm1mZmaXLl2wn/ftt9/id9yYmJiIcppFJfzyyy8op61dMlvqjgD5rzuk\n1haIZb/cSyfwc/7qGUwBlN/xUgholeuk8M8/Y7THmI+RH+O/+5bx48erHtdDrh6c3md+voDYGteA\n/DcOW6dIRpQuvu3ffffdOnXq4Gsf3/xelqPYRtHqWmiGUD+U3FM+DUK/0N2K2WLSQDFjlMg+FLoc\nR95J/jvS7ToZffz4caznY1Ufa/sbN270JRWndz2nA0oz7P/16NEDmTzw7ghdnRkjRYXHRFSxWz/F\n7xGx74kzOaELdNid5L/DHB68uZj8n758EucCPG8Fe7/66ivs5GM/H7v6/qXiDC/C+73aYGkAgUBL\nly4NffxPSrzDfPcrAL+8/29x/s7XR/AWO+gO8t9Bzg7WVMQIIQQIa4EIDUSmACQIwfo8Ivbq1av3\nftWqiOHLy8vTIlN1/o8bka4T/NciQaXNyaMCo70n7T1/HxAXoliH3Ub+O8zhwZg7ZG0H5VyA+6dU\n82f6Valy6s03f33qKaG5og5Ijhw+RXsOi/8jOvskP14EZf4irjEXcGBnk/+BMXK12LVJTB4iUgaJ\nDSu03WBeq8uXRGqKmNhf/DhGnA49cwb2/N4c80cv/lf47jc3OrQXmPBPmiQGDtRoJDJ5qMb8hcX/\nllX88R+vgP2s/xHYP+R/IIz2bhH1Xi30qEU/J9aGOmsN1FtY///rr2LUt6LUn+5oW+I+0fkTgTdC\n8Ff2hYNe5Ff+iSODLmFYdS9dGgE9WgQjn8eePXt0Hv8D8j9rhxbdHN6G/Pf7AGRuE6UeVBlnSvxG\npLuyZch19WquPiR+VjqoyTAC8pB7572a7xYfcq/XK6Di+MfvpAnq1k3MnasFgYyMjGvXrunM/6Ed\n/Y3/pR8SV4OIWdJihS3bkP9+3droLZ8PWZUnNY5+EXputq7xxwd8vGi4kGkP+fawn4cZOzLwFf3+\nn5DR746YnBzxzjsapArk7VUtvB3W/B9b/S/f7dPkvm20KMY25L/vZ2D/rgBfmCs1jX4Resg6N/Cn\n7QfP+1cD2XXbtWuHTLuI1UXWXaUxhvph674uNfZPmAWUGnTPmHktvIU0bCi2bw9oYExMDJL86Dz+\nQ9z3XdVNfvdphgAEdIrSgPz3DdT8CQH4P7yTRpQj0QwM97UZpvz9ikouTcTeIqM+xmcc0UGUjuoq\nfX7BNSQFuV7uLVGy5B1DkKirTRtRsaLIyAhoHV4uOPmjP/8hMXmAa6n/juF3ieaVRa7KWYOASjqz\nAfkfBv+HfS3RQxOQ/4VXARGQjyo6qKWDijo4kKvKz0LWrVolHnhAIIxnxQpRs6aIiRGrV2s0v3bt\n2jj5awj/IRTf+djv+KGX611wUGWVUaOSzmxG/vv2OxaQ/Y+oy3+U6KHp+LE/baOfdauKmnmonIfz\neYi93bkzmE0y7Pn/7W+ifXtx+wNBo/l4xagm9gzr+19j32zmFwHy3y88n77pk1Rv/10USFQVU2Sk\n+eM/jscIgTq5LVu2LFeuHCrnon5u0NTAnj/2/IK/1q1bp3rIj/wPHkud7yD//QK6J0O8+UcVXr1y\nr1g1X2dXhC+uW2PVV8DV+q/PnDqlZs2atWrVmjVrFlgXYlcovIfxX8MHv5d8Q/b/Q7SBtxVCgPwP\n9EAg8q/2S4V4Vf0pkbYw0G1m/D/if7ArXvIPbm1PFf/t0JoVypUti2P2a9eu1UGnXr2wmxesHGT7\ndO8peN7L8T9YJHVvT/5rg3RLupjQT+DAWVoqSmRou8ekVnkXxJxxO3vHJ9St+VapUr169Tp0SL9T\n8cjq8dBD4vjxoGzDkSHVSl7kf1AwGtGY/DcCVdNkYhkfi/kNGzbEwn5SUpJq1p1wlWvWTHTsGJQQ\nX+f/kDLo/PnzQYliY30RIP/1xdM0adi6nzBhArbxGzRosHjxYgOTaiGS59FHxZUgomv79+/vq4Yv\n4oIwQymaHcA0HB3WMflveYcjtLZ3794I3evQocN2DdF4OhhcrZoIJm8nynupjvOYrSxfvjwqKooV\nAXRwSkgiyP+QYJPjJuyrtW7dGuH6gwcP9jXAGqLpTz+Jf/9bu2Rf839IQBZA8l87krq3JP91h9Rw\ngThLN3v2bJDq/fffnzFjBo7rGd5l0Q5efFGkpmrsNyUlpWj+P0xb5s2bhwAk8l8jjEY0I/+NQNUo\nmbm5uSNGjKhQoQIy56anpxvVjRa548aJSpW0NEQb5Pn0CjfCa2vIkCF4KSB9IPmvEUYjmpH/RqCq\nv0zkz+jUqROCdrt3764aS6t/l/4l4jz/449rOfwHMV7zf5T9QQyikiYc6wLkf6R959Ef+W8i+IG7\nVlbIPvvss8qVK2MV7dy5m7l3JLm6dhWffaZFFxT58swUOnXqVFT+Vm7k978WAI1rQ/4bh21YkkGY\n5OTk6tWrI9l
uamqqjIXxcIIAsUCnTgW0E9GHniv8qA703nvvkf8BcYtAA/I/AiAH1wXS6fft2xf7\neQibRZr94G7Wt/Xhw+LLL12nfXGNHy+GDvUW36iR+PbbgH16xf+vXr0ac34lI4gy/qtmBwoolg3C\nR4D8Dx9D3SSghA4K6WA/D0V1UFpHN7khC0LSjj59bp35++QTUb26t6QdO1yrAGq5/Txb4i3mWRQM\nHzXIL/b2229jgjNy5EjwH/aqBgiHrDhv1IgA+a8RKAObgRtz585F8TzMivFtLBcTvv/+Fv/xImjX\nTgUF7AJgL8DvVbduXSQd8GyCyF+cR0JdABiLjYzA2UcMhN/Rosl/09yPhx6r+iiSjak+yuauWrUK\nrDBNG6+Oc3PFoEGie3fRosUt/i9aJFJSVNTD3xELEIj/qvn/sKihz6lEWVCznh7kv+E+w6I9eL5y\n5UoUuhw4cCCidHE+B3v4L7/8Mqa+r732GjLwGa5EUB1gYQ+p/rCwhwP/5crd4j+q/fhKGfL882LZ\nMj89bN26VfXVhvN/S5YsCUo1NtYXAfJfXzwFZrP4psWo3qxZs+jo6DfeeAMk93UhgE+K73wvDOLi\nXEm+lGv48MA5f/CN4DcROPL/odpnUaDJf50fvuDFkf/BY+b3DjzTWNbC8O6H9sp/tWjRQmP9TJ1V\nDCjuhRdctFcu9/e/n7sQgIwTgWoVfpSbGjVq5Cv/H8f/gN4wtAH5bwi8iMlHOm0/bwEcejXwiG6Y\nNj33nOjQIQj+o+nXXwvkBfBx4SNfNX6B43+Yjgr/dvI/fAxVJGBt76effkLKvaKzAHz2o3K2Ib3q\nJRSJ/Z98UiDVD67Bg8WrrwYWfOKE+POfb91SpLWf/H8c/wNja2QL8l9ndJGHA4l3EN+uOv/HcgAW\nAnXuUndxZ8+6vvn/+U9XeG+tWuLvfxdLNdQ7bdBAIDug2oUlz6NHjxb9H47/ursuWIHkf7CI+WyP\nIDYs+7366qtu5mP8x6Pv/mflcmX2bNuqW39GC8KO/cWLOKDj2gXQciFUEdmBb57q8boQ3eAr/x/H\nfy3QGteG/NcBW3zfYonLzXPM8Fu1aqXsbPf4sqXy95ioZ09F3Sve+rMYEGfb0rRly4rk5KKA+sr/\nwfFfh4cvPBHO4r9S9ArR9TiRFh5urrvx+M6cObNq1apu5mN636dPnzvR7D/PqxL1Av63TdQ/Lkfd\ndSeJOApy27I6NcqBFy9eFFikJ8KZX87/w3/kdJfgIP4jBX3FihXHjBmDtBNvvvnmcPcWV/Cg4mlG\nTkvPvX2V+pmX8/a98QjI/13UX28UrSMmVe3A4BFQvwPxi08/LVApsPCFA3+qJ5c5/usFfMhyHMR/\nnEIFG5VENDhRjw/1S5cuBQscStni7IoSuqdcmPnjiL5KBPuMkROjHp0e9bB6Wa4Kj6l+Kgerj3Tt\ncUYwOtpLK87/pXPTbYWcwn/M/EuUKAG64swJJv84bIPf09LSNDoG29c4hO+5nwdpnTt3RmCvTwnx\nH+big99PBdHdmzX2bqVmeKU+8ohAdXCPa9q0aWexp1Dk4vhvumedwn/EnynDNXLjY2VOubSkncfE\ndfTo0Tii4x7wy5cvP2rUqKIJLb19GRsdoHzw9nWmu98QBXBMsFUrT8nYFuH3vyFQhy3UKfxHemyF\nwNq337OyspBy75VXXnEzPyYmZv78+Z5H2f3hPzDeH/9L/EacCb4Cb9j+joQAbPUjFsijsA/n/5GA\nPaQ+nMJ/gFOjRg3lc10hMJJSqKadwUk1HMVFFR3P/by4uLjNm4Ocrh/OFJ5r/l4fAvG1QvKXRW6q\nU0f06+fW9ZdfflFdauH833R3Ooj/KIx3ays+JiYhIQG7gF7oI0Zl8uTJyLTpZn6pUqWQmib06jS9\nmqtPAUr9SezfabrvDVRg/XpXBHFBgdIFjkIiqRm//w0EPFTRDuI/IEK1CXyLIiZv1qxZnifS8XQm\nJibiKL6b+fqk4kH+rMRWovg9hd4CVZ4QO9aH6i/r3IcMAtOmKeri/K/qQinHf9Pd6Sz+F4UbKfda\ntmzpuZ+H0hrYF9AzFc/xw2JkF/F1PdG1kVg2wz0qmu57YxWYOVO8/rrIvyZWL942ekD+mmUCb8PC\nF/lvrAs0SHco/7EEMGfOHAzy7gEfgz/K1EhRWkOD2yzQBKcGHn5IFH8Ic5+YqGcORN0vqjwppg5z\na46MgDgljfh/Ym6iNx3Hf5TQQuQfPuzdzMcHP4KCWYhez6cQgYAd6oi/FRMPFQP/P456Zm/Ub299\nBA1JUDrCeup3333XpUsX2U9D64mLdLIcxP9du3a1b9/ecz8Pi/wIB5A3D4d0T4tmhRZPdbH9xWLi\nnmLihWJbox4otAJyM/ABk3+sC2I7JkI1yzXr7qiG9uc/InOx8o8U1O4BH68ArP/vQO56XgYhgANO\nyn7no66fj6KevTP+44+d6ivdDho0CEnQZCxtZBAs8om1M/8R8zt+/HiE67mZX6ZMmWHDhqnGosnn\nGstqhMm/e8vjedcUoOWL/3B9/7sjIN75P8W2n3/+OZxTWJYFSCLF7cn/Q4cOYTHPMxVHrVq1Zs+e\nfS1QpRqJPGNdVbDO7xn49Kdi+/92f/5N8ue/fM/ZVx84+86/cBwAF5YAkftc+V25QjiRZV2cZNDc\nbvxHbTnU0vAM3cNxvXXrbBppL8MTpKpD9LN3Rvtnim39wwN7on6Hv5x47cG0Cv9K+zwaO6yq1+7d\nu6W1yZaKRZD/eRdERprYtEqcCVwx1o01wnK0PBPYSZoxY0a1atXczMfhfET44cy/Ld1mplFI9Ym6\nQEj4i58hQ0RGhujUyVUmEH93XxP7ey74dX7piWNR9935y/IfzdSffXsgEBH+46BL98/FG7+/9QSU\nuE+0+0gc3qfFEQH5j4M9KKpTsmRJN/ORigNbSpKm1tdis/xtli8Xd90lihUT333niuopX154VSu9\nnCc++o+b8LOi/nIp6u5b/2xe2SkRUPL7UQjj+Y/R/r1nVMLgy+KU+K6AEPnhPw7wtG3b1jN0r3Hj\nxjjex2KSAVHVoUH9+i7+o/hvjx5iwwYVgWdzRdMKit8/jHp2N+b/L9/tCgq4clmH3ilCJwSM53/8\nhz6PwdZ+KaAVRfmP7aKFCxcipbx7wEcqDoSRqFaYDChfa4ODe8SccWL2WLFT7VnXKsVG7Q4dEvff\n73oF1Kzpz6q9W8XQjsPr1jg1IEEcybKR/TYxxWD+5xwXOOjuJwcOVgT8Xp78x/owUnGULVvWMxUH\n/hI4FUc4zjqwWzQpV8iEmOKuhQxeX37p4v/vfy9OnnSBgQyfWBHo0qUoMNjkU83/QwhNR8Bg/i+a\nHCAHzohvtPB/375933zzjZLAS7mQwGvBggVaU3GEDDOG/bJqCfz++1uxUfoyHiFbrfFGrAI++KDr\nFaBk+8FCAMj/6adF7/aV/0NjP2xmHAIG83/e+AD8H3K7zpwPE1u3bq3k
6lOu4sWLf/XVV/jyNw6R\nQpJvf8GqWFHtHwKBLo69UN0Ai389e7r4jw+BAwdcSOAvavzHaWvV/L+OBU8eww3mP+bJfib/+C98\nUfu+MG/0TMWBdf6IVsv2n8AHyqctlMeREdVk0yZXeg+k9MzOdvEfPxUrisuX7/AfVUCQBfB2hDXS\nqDHmMqIO0tyZwfyHHh887/MVUOpBgY2iIhcW8JGiF4v5StbNKlWqTJ8+XbWAlGYzQ2o4f0KAl9fw\nTiHJtfhN+M6fPdv1gzz/SKA8Z86tH0zKlPEfywFxca7aYbcvzv+ldbnx/F8517XxozoLSB7ghQvC\nP1NSUqpXr47jOsrnfffu3bXE/xiCb8CPF1vW8AgHSvD/ww9F1aqicJw1gjJNeH2HY4hj7jWe/4AS\nq4Al/1DoFYAQoHF9PEFGjr1+/frhfI5Xps2A8T8GempPRoDxf8mt/FYG6mAt0eD/xx+7SgAlJXkq\n3qRJE9X6v9YyzpbaRoT/QA7Bv8j9klBXtK8txvYUp2/uGN28Nm3ahDAeMB8VtYpm2jST/1Cu/n99\nvgIqPq61MK4tHxxVo5T5//bt4uGHPYOCfOX/cw4w0loaKf4XAQBze+TSR0Z9pOVG1l1fB79M5v+2\nteL1wrkrlA8ZfNEgkx8vTwSwHFivnihVynUQIDZW/PWv4naRVRRN4yF/OR8WE/ivhPFUqlQJ00KE\n6/rPtGky/+G0Leni3acLzQIq/i/Jr/I0X70qTp92/WAjAMU/8MuZM0ozvOWZ5I/8FwjjQfrt0qVL\nI5hHY7iu+fyH37DPv3qxK4fviM5i6XRO+4N9lOvVq5eZmRnsXWwfAQQiMf4rFXWaNm2K8tuaKud5\n2C0F/yPgB1t3kZGRoWc+dVtjFWHjjOU/dn2mTJmCNNtYAZo7d24I4brkf4QfCCO6w2EtjdM9I3qn\nTD8IGMX/nJwcVM7CWZ3Y2FjU2AjZB+R/yNDJcyPy/CLVlzz6UBM3AkbxH2E8SL8T/q4v+W+DhxVf\nfyFM/WxguPwmGMV/vSwn//VC0kQ5jP81EXz/XZP/0rrGPoph0yf0Gsr2gUFGS8h/Gb1iM51wfIvp\nGOX0Kfkvp19spRXn/9K6k/yX1jX2UQxBH9gPso89NrKE/LeRM2U1ZfDgwcbmaJTVcPn1Iv/l95Hl\nNeT8X1oXkv/SusY+iiH0E7VY7WOPjSwh/23kTFlNQbV1lGmSVTtH60X+O9r9kTEe8f979uyJTF/s\nJSgEyP+g4GLjUBBYu3YtCrSGcifvMRgB8t9ggCleICfYpyzELOeDQP7L6RdbaYUKLjz/K6dHyX85\n/WIrrbZv315QUGArk+xiDPlvF09KbAfGf+b/k9M/5L+cfrGVVvXr10fqR1uZZBdjyH+7eFJiO5j/\nT1rnkP/SusY+ijH/n7S+JP+ldY19FGvZsuWhQ4fsY4+NLCH/beRMWU1BNedrhSuCyqqp4/Qi/x3n\n8sgbzPN/kcdcY4/kv0ag2Cx0BHr06HH8+PHQ7+edhiFA/hsGLQXfRgD1XS9evEg8JESA/JfQKXZT\nifN/aT1K/kvrGvsoNnbs2NMoB8xLPgTIf/l8YjuN+vfvz/x/cnqV/JfTL7bSivN/ad1J/kvrGvso\ntmDBAub/k9Od5L+cfrGVVvHx8SdOnLCVSXYxhvy3iycltoP5/6R1DvkvrWvso9jGjRuvXr1qH3ts\nZAn5byNnymrKJ598cvjwYVm1c7Re5L+j3R8Z42NiYpj/LzJQB9sL+R8sYmwfNALM/xc0ZJG6gfyP\nFNIO7gf5/7KyshwMgLymk//y+sY2mjVo0ID8l9Ob5L+cfrGVVps2bbpx44atTLKLMeS/XTwpsR3c\n/5fWOeS/tK6xj2Jt2rTh/p+c7iT/5fSLrbRaunQp43/k9Cj5L6dfbKUVz/9J607yX1rX2Eex3r17\n8/yPnO4k/+X0i620mjhxIs//yulR8l9Ov9hKK87/pXUn+S+ta+yj2Pjx45n/S053kv9y+sVWWvXt\n25f5P+X0KPkvp19spRXn/9K6k/yX1jX2UWzx4sWs/yGnO8l/Of1iK61iY2NZ/0tOj5L/cvrFVlox\n/l9ad5L/0rrGPorh/B/rf8vpTvJfTr/YSqv69evz/I+cHiX/5fSLrbRi/j9p3Un+S+sa+yiG/H/M\n/yGnO8l/Of1iK61q1669b98+W5lkF2PIf7t4UmI7GjZsuH//fokVdK5q5L9zfR8xyzds2MD5f8TQ\nDqoj8j8ouNg4FAS4/x8KahG5h/yPCMzO7iQuLu7IkSPOxkBS68l/SR1jJ7UWLVp05coVO1lkG1vI\nf9u4Ul5DeP5PWt+Q/9K6xj6KYf7fq1evgR4X/jl//nz7WGhZS8h/y7rOOop36NAhOzvbU98DBw78\n+OOP1rFARdPr169v3brV6nnNyX9LP4TWUB7x/zgCZDP+w5y0tLSmTZtiWrNixYqzZ89qccbJkyen\nTp06atSoadOmYU0kIyMDv0OOlnuNaEP+G4EqZRZCoGPHjl7nf2ww/isWgsaNbl/dunWbO3duwJNO\neFN8+eWXyImG2zdv3ozXgYmPC/lvIvhO6Rr1vw4ePGi/8R8WIa5pyJAh7leA8kt8fHxycjJOPeAb\nQdXHW7ZsQTOkRcLgb25kFPnvFBKaaCfifzHRtSX/YRSm8d98843XK0D5Z4sWLYYPH56enp6Xl+eF\n/7hx49Ag4GTBaK+ZwP+UlJRBmq/WrVv36NFDc3M2lBEB5P/CkNjO42rbtm1CQoKMuoakEybzTZo0\nUX0FuP/Ys2fPJUuWuPOgpKamAhbMHYxmuH/5JvA/KysL8x+NV+fOnbFRpLExm8mJwJo1a7BC5nWt\nX79eTm1D0Arf8H7I36xZM1RAwyKfOwlqZmbmnDlz9u7d27hx41WrVpn4CjCB/0FZm5iYuHv37qBu\nYWMiEEkEMJ5hF6Ao/xH1gMJn27Zty8/P99QHpdDcn/1TpkzBNwI2BSKpsGdf5L9ZyLNfOyCQm5uL\nxXxP8nft2hVju9d6p9vUnJyc7t27g//Kh8DChQtx79dff425gClwkP+mwM5O7YAAgn+6dOkCAmP8\nHzBgwPLly7WUOfv15uW23+ufEcaF/I8w4OzOJgiAt0lJSWPGjEF2A+uebiL/bfI40owII1BQUGDu\n1r0u9pL/usBIIUTAkgiQ/5Z0G5UmArogQP7rAiOFEAFLIkD+W9JtVJoI6IIA+a8LjBRCBCyJAPlv\nSbdRaSKgCwLkvy4wUggRsCQCsvMfByStnmLJks8FlXYGArLz3xleoJVEwBwEyH9zcGevREAGBMh/\nGbxAHYiAOQj8P01GQFuvKm/zAAAAAElFTkSuQmCC\n", 114 | "text/plain": [ 115 | "" 116 | ] 117 | }, 118 | "metadata": {}, 119 | "output_type": "display_data" 120 | } 121 | ], 122 | 
"source": [ 123 | "from IPython.display import display, Image\n", 124 | "\n", 125 | "display(Image('pca2.png'))" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "如上图所示,假设有这样的二维样本点(红色点),按照前文我们讲解的最大方差理论,我们的目标是是求一条直线,使得样本点投影到直线或者平面上的点的方差最大。本质是求直线或者平面,那么度量直线求的好不好,不仅仅只有方差最大化的方法。再回想我们最开始学习的线性回归等,目的也是求一个线性函数使得直线能够最佳拟合样本点,那么我们能不能认为最佳的直线就是回归后的直线呢?回归时我们的最小二乘法度量的是样本点到直线的坐标轴距离。比如这个问题中,特征是x,类标签是y。回归时最小二乘法度量的是距离d。如果使用回归方法来度量最佳直线,那么就是直接在原始样本上做回归了,跟特征选择就没什么关系了。\n", 133 | "\n", 134 | "因此,我们打算选用另外一种评价直线好坏的方法,使用点到直线的距离$d’$来度量。\n", 135 | "\n", 136 | "现在有$m$个样本点$x^{(1)}, ..., x^{(m)}$,每个样本点为$n$维。将样本点$x^{(i)}$在直线上的投影记为$x^{(1)’}$,那么我们就是要最小化\n", 137 | "\n", 138 | "$$ \\sum_{i=1}^{m}(x^{(i)’} - x^{(i)})^2$$\n", 139 | "\n", 140 | "这个公式称作最小平方误差(Least Squared Error)。\n", 141 | "\n", 142 | "初中我们就已经知道确定一条直线,只需要知道直线经过某一个点和其方向即可。\n", 143 | "\n", 144 | "首先,我们确定直线经过的点,假设要在空间中找一点$x_0$来代表这$m$个样本点,“代表”这个词不是量化的,因此要量化的话,我们就是要找一个$n$维的点$x_0$,使得\n", 145 | "\n", 146 | "$$ J_0(x_0) = \\sum_{i=1}^{m}(x_0 - x^{(i)})^2$$\n", 147 | "\n", 148 | "最小。其中$ J_0(x_0)$是平方错误评价函数(squared-error criterion function),假设$\\bar x$为$m$个样本点的均值,即\n", 149 | "\n", 150 | "$$\\bar x = \\frac{1}{m} \\sum_{i=1}^{m}x^{(i)}$$\n", 151 | "\n", 152 | "则\n", 153 | "\n", 154 | "\\begin{align*}\n", 155 | "J_0(x_0) \n", 156 | "& = \\sum_{i=1}^{m}(x_0 - x^{(i)})^2 \\\\\n", 157 | "& = \\sum_{i=1}^{m}((x_0 - \\bar x) - (x^{(i)} - \\bar x))^2 \\\\\n", 158 | "& = \\sum_{i=1}^{m}(x_0 - \\bar x)^2 - 2 \\sum_{i=1}^{m}(x_0 - \\bar x)^T (x^{(i)} - \\bar x) + \\sum_{i=1}^{m}(x^{(i)} - \\bar x)^2 \\\\\n", 159 | "& = \\sum_{i=1}^{m}(x_0 - \\bar x)^2 - 2 (x_0 - \\bar x)^T \\sum_{i=1}^{m} (x^{(i)} - \\bar x) + \\sum_{i=1}^{m}(x^{(i)} - \\bar x)^2 \\\\\n", 160 | "& = \\sum_{i=1}^{m}(x_0 - \\bar x)^2 + \\sum_{i=1}^{m}(x^{(i)} - \\bar x)^2 \n", 161 | "\\end{align*}\n", 162 | "\n", 163 | "显然,上式的第二项与$x_0$无关,因此,$J_0(x_0) $在$\\bar x$处有最小值。\n", 164 | "\n", 165 | "接下来,我们确定直线的方向向量。我们已经知道直线经过点$\\bar x$,假设直线的方向是单位向量$\\vec e$。那么直线上任意一点$x^{(i)’}$有:\n", 166 | "\n", 167 | "$$ x^{(i)’} = \\bar x + a_i \\vec e $$\n", 168 | "\n", 169 | "其中,$a_i$是$x^{(i)’}$到点$\\bar x$的距离。\n", 170 | "\n", 171 | "我们重新定义最小平方误差:\n", 172 | "\n", 173 | "\\begin{align*}\n", 174 | "J_1(a_1, a_2, ..., a_m, \\vec e) \n", 175 | "& = \\sum_{i=1}^{m}(x^{(i)’} - x^{(i)})^2 \\\\\n", 176 | "& = \\sum_{i=1}^{m}((\\bar x + a_i \\vec e) - x^{(i)})^2 \\\\\n", 177 | "& = \\sum_{i=1}^{m}(a_i \\vec e - (x^{(i)} - \\bar x ))^2 \\\\\n", 178 | "& = \\sum_{i=1}^{m} a^2_i \\vec e^2 -2 \\sum_{i=1}^{m} a_i \\vec e^T (x^{(i)}-\\bar x) + \\sum_{i=1}^{m} (x^{(i)} - \\bar x)^2 \n", 179 | "\\end{align*}\n", 180 | "\n", 181 | "\n", 182 | "我们首先固定$\\vec e$,将其看做是常量,然后令$J_1$ 关于 $a_i$ 的导数等于0,则有:\n", 183 | "\n", 184 | "$$ a_i = \\vec e^T(x^{(i)}-\\bar x), $$\n", 185 | "\n", 186 | "这个结果意思是说,如果知道了$\\vec e$,那么将$(x_{(i)}-\\bar x)$与$\\vec e$做内积,就可以知道了$x^{(i)}$在$\\vec e$上的投影离$\\bar x$的长度距离,不过这个结果不用求都知道。\n", 187 | "\n", 188 | "然后是固定$ a_i $,对$\\vec e$求偏导数,我们先将 $a_i $代入$J_1$,得\n", 189 | "\n", 190 | "\\begin{align*}\n", 191 | "J_1( \\vec e) \n", 192 | "& = \\sum_{i=1}^{m} a^2_i \\vec e^2 -2 \\sum_{i=1}^{m} a_i \\vec e^T (x^{(i)}-\\bar x) + \\sum_{i=1}^{m} (x^{(i)} - \\bar x)^2 \\\\\n", 193 | "& = \\sum_{i=1}^{m} a^2_i \\vec e^2 -2 \\sum_{i=1}^{m} a^2_i+ \\sum_{i=1}^{m} (x^{(i)} - \\bar x)^2 \\\\\n", 194 | "& = - \\sum_{i=1}^{m} (\\vec e^T(x^{(i)}-\\bar x))^2 + \\sum_{i=1}^{m} (x^{(i)} - \\bar x)^2 \\\\\n", 195 | "& = - \\sum_{i=1}^{m} \\vec e^T (x^{(i)}-\\bar x)(x^{(i)}-\\bar x)^T \\vec e + \\sum_{i=1}^{m} (x^{(i)} - 
\\bar x)^2 \\\\\n", 196 | "& = - \\vec e^T S \\vec e + \\sum_{i=1}^{m} (x^{(i)} - \\bar x)^2 \n", 197 | "\\end{align*}\n", 198 | "\n", 199 | "其中$S = \\sum_{i=1}^{m} (x^{(i)}-\\bar x)(x^{(i)}-\\bar x)^T $,与协方差矩阵类似,只是缺少个分母$n-1$,我们称之为散列矩阵(scatter matrix)。\n", 200 | "\n", 201 | "现在我们就可以用拉格朗日乘数法求解方向向量$\\vec e$了。令\n", 202 | "\n", 203 | "$$f(\\vec e) = - \\vec e^T S \\vec e + \\sum_{i=1}^{m} (x^{(i)} - \\bar x)^2 + \\lambda (\\vec e^T \\vec e - 1) $$\n", 204 | "\n", 205 | "令上式关于$\\vec e$的偏导数等于0,则可得\n", 206 | "\n", 207 | "$$S \\vec e = \\lambda \\vec e$$\n", 208 | "\n", 209 | "两边除以$n-1$就变成了对协方差矩阵求特征值向量了。\n", 210 | "\n", 211 | "从不同的思路出发,最后得到同一个结果,对协方差矩阵求特征向量,求得后特征向量上就成为了新的坐标,如下图:" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 3, 217 | "metadata": { 218 | "scrolled": true 219 | }, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAANgAAACbCAIAAACPjsJ4AAAAAXNSR0IArs4c6QAAAARnQU1BAACx\njwv8YQUAAAAJcEhZcwAADsIAAA7CARUoSoAAAAwUSURBVHhe7Z3dbx3FGcb9J/Sm/wE3VSVyERWp\n6gWRKkCUFi6o6A1NuIiMEK1U1w1K1DQ6Io1ahGvSpE2dhITEpXUAW6Fue2iQjAlYqSFJ7Xyo5tQJ\nxXUS4SbYx8E4hxLn9DlnfNZ7Pndmd8/ufDwrXwQxuzv7zO/MvPvsOzMdRR5UQAMFOjSoA6tABYoE\nkRBooQBB1KIZWAmCSAa0UIAgatEMrARBJANaKEAQtWgGVoIgkgEtFCCIWjQDK0EQyYAWChBELZqB\nlSCIZEALBQiiFs3AShBEMqCFAgRRi2ZgJewA8eORv9z3k4tX2JzmKmAuiDdOvXn/xvFz1y7u2Jn7\ncOrs/pH85+Y2A2tuLojF29d+9+if+rMne0/eWmFDmq6AwSAW7/zr2Cvr7jvx1k3TG4H1L5qdoX17\nfurZnbnrbEcLFDC3R7y9tHjp1NhvOC5bQCEewVwQPz51cse+S1e88HD88pgdTeLmU5gLYm179Y7s\nOji2181WtOCp7QFx+NzQ+v0d7BcNhdIeEEemTgBEskgQU1Zgei4nQHzoyNeuLsymXBveXlEBe3pE\nwAcKNxy664GX1oHF+aVPFKVg8TQV0B/EhYnD2zN/nv7vu4cPnJ5vLRVAFK8s6B3Pz06kqSvvraiA\n/iAWVy4f3/rCG+8MHDu9FPApr+v45uX/LQNHdoeKGKRf3AAQiyszbzzb9aMXzywFySX4Q484cPpo\nUFn+f70UMAHE4sri348cOL0gqRxwRIyIrlGyPIvpoID+IK58dmN2YnAocFz2q5nJboGtqIO+rIOk\nAvqD+MnEwOE9Q28rZXrhZWXTsUckJWAxHRTQH8SSSp2dnblcTkkvgMivLEqKpVvYABALhQJAzGaz\nSkqBQrxEK53CwikqYACIk5OTADGTyajKhFcWjNGqZ7F8KgoYAOLg4CBAxDE3N6ekEd5X4G8rncLC\naSlgAIjd3d0CxNHRUSWZaG4ryZVuYd1BnJmZERT2lA9VsWBuM0lRVbRUyusOIt5RRIyIVxaAmM/n\nlWSiua0kV4qFdQcR/An7Rmgk/lPpQJhIc1tJsVQK6w6iEMUDMYRGNLdDiJb8KfaDCE1hKNLcTp4t\npTs6ASLNbSUmUinsBIhQFl/8aG6nQpjkTV0BEe8rSMmRFIXFklfAFRBhbnMiS/J4yd/RFRChCM1t\neSySL+kQiDC3MZ2FmdvJQyZzR4dAhBw0t2WYSKWMWyDixRmRYipC86atFXALRJrb2v4enAMR5jan\ns2iIo3Mg0tzWkEJUyUUQsW6YS+a2GXt/uAiiDeb2fO0SU1iDyrfiz+KHZ566d/DJZ4aPThUWjdj7\nw0UQMRBgTRJTM7evzha7NhfXd9x69J7XX+5C1w4fQKzHhySjtfX4vpjo69/wq3+r5RGnOGo7CqLB\n5vbBvaBQ/F1/7KsCQUFhlVdfeH//0P3fffOkKXt/OAqi+OJnZOb2pkc8EPGPh/u/Dgrrevf58+9u\nO/Kf0b1HNr50LX8nxY5O+tbugijMbeO++F19cZcH4vKen4NCO5Y+cxdE/FYRYJmSuY0fDPpvOKA7\nXnli5g+9xcyW4vDQqfcGTal/YM/oNIhGmNsIZzHyovPGz8bi3F6nQRTmtraLHAM7ZGmIEND6NXBd\nB3HV3J7OFZeTXdgTd6zzAr3xS0yyQS+I4di4KDZwFG5YwHUQCx9NLzzwpdXwf+REOBHVzgLxZSOw\n9DdQtcQymMMPA520g2vquQ5i0WfLFR9KJENseMjvvxRhUBeLGHnx8mt9INjiF+s8iL1rbkiJj+bD\npVq316J0NYhX3juBENCRQJAgNldgfGytfwKUCRzoAtH1lofmG9+726lAkCC25Gs6t7xn5+8zX7l1\nM4nNqhAIXhje99a2e/B35p8jjryLBP7AnR+aKwrFaW5jfC9HfjWHPxDU1jMKJKZNBQjiqrAgI57M\nbe/tB6/GFUsISTFeIMgNK2nfBPyYYzC30RdWUmNK/xgfA98iUwuOoPWmdJTOkj3imnqwkaNmbsOm\n9oE4+fTdwhFkIBjIKEFck0hkbkccOlce/5bH4gcX3w5sABYQChDEKhIwgAZkbmPwxQcY9Hx1hwgE\nn9jdcb5n880XftbwfYXYNVOAIFYpIzK3mwZz/pEXLyWVwwsE8XWEgWC4HxtBrNWt1S67/u+BiAWX\nl0UiGf7wjZiBYDgE3R6a0bfhOwrS7usSHTDCNs3crgZx48H1su7j+YlSKituhw85PBop4GqP6J/5\nUQcH8EIP10Au39D8wc7vy6apwk30ezqNvG7C6SSI1SZLqWusPhqa28AOgKIXPNe3ZXH6ggI6/s/Z\nIBJJDzzqFHASRKhQSTsQtnM9GP6MQC8QDJmmWuNyN3rjJpmugoigDaMzcETY1yg3G/Bt/2sXyEO8\nGMPuGIhExe2qM2HJn6eAqyC2RAAWzL53euHjgEXZQJBMRVOAIFbpJ+YroRcsmTjvG7ssSTQmUjmb\nIK7KLuYrITT0AkHuspskka6D6E1
cbxgItjK3k2wlB+7lLogyE9e5y25iPwEXQVSauN7U3E6sidy4\nkVsgBkxcxzcP8d3P5zlzl91kfghOgCg7cb3Jdz8Hp7snA5//LpaDqDBxHd2h/4swchQqB3fZTYBL\na0HEkKo8cd3fI1Zn5cBZpLPdVhwtBBEd2G9f/WGYiev47idWpan77gdzEUZ3W1vC8YvbA6IIBLcf\nevCjx+8CTHc2frthQn+49qa5HU43+bNsANEfCC48v3Ut1EP3Ft/BXXbj07LBlcwGscHE9eq1zmPU\njuZ2jGLWX0o/EJGUhRcFOHm+hbk6Oztrqt504rp/ra24F1WCuW3kRgRtJSimi+sHoj9ltZIp6IGI\nWM1LU208cV1wDPMFmX9xrzG3am7jnSbuK8fUmgZfRjMQa7LqKx4KQMTIKNJU0S2ltoLR1dnP7/3y\nagzKjP9YsdcMxJrZJOUkfgSCD/7yG2Ipy5TNvJrlZRNedjvWhtftYpqBCHkwqoovHJtK6/2LFYy+\ns2uDFhPXk19eVjde2lYf/UAs9YGzU6MD/onr9S8rbROk5YURHXqfAeN+E0rnibS5q14gijRVEQj6\n91TSBcTyj+SLlw9ieVktemhtMIpeEV1AFGmqzQJBjUAsS05zOzp5NVdIH0QxcR29YIsVjHQD0eBd\ndmMnKKYLpgmi/MR13UCE+MiBoLkdE4Sly6QAohcIyk9c1xBEsctujC3h+KUSBdELBNGdKDmCGoII\nbuR/SI5DJvP4CYHon7ge4n1TTxCN2GVXBgIdyiQEIl5EQq5gVBZJTxBRMZidSl27Dk2uZx0SAjHi\nw2sLIn5dUTciiCiNLacTxEgtKTYiCBFsRLqrjScTxKitiqgjYCOCqHdw4nyCGLWZaW5HVbB8PkGM\nQUb0iDS3I+pIECMKWDqd5nZ0EQlidA1LV5Dd5yKeu1l4FYIYT6PS3I6oI0GMKODa6THsshtbXcy7\nEEGMrc2wzgTN7dBqEsTQ0tWeGMsuu7HVxrQLEcQ4W4zmdmg1CWJo6RqcGLDLbpy3su1aBDHmFqW5\nHU5QghhOt6ZnCXObezerykoQVRULLk9zO1ijuhIEMYRoAac03GU3/tvYdUWC2Jb2pLmtKitBVFVM\nqjy++NHcllKqUoggKsklW5jmtqxSBFFVKdXyyFBk5ra8aOwR5bVSK0lzW0kvgqgkl1ph7rIrrxdB\nlNdKuSTWuqW5LakaQZQUKmQx7rIrKRxBlBQqZLHEzO25ubnJyclCoRCyommfRhDb3gKJ7bLb19eH\nJTH6+/vHx8fz+Xz5wZZm/7b7B7/o/fXu7OXPVtr+qBFuoCOIUNOm47GfPvzN3nVpPVH++tlXn9l6\n9MKnESBJ4lQdQax/bm3XvpFpImFuJ7BWE3rB7u5uQTz+ga6xNFgvTL723PZth/6xJFPX9MoQxCS0\nT8DcRnTY09OTyWSy2Wwul6s81WLutQOvT53944+fO35J606RICYBYgK77M7MzOB9JYmHac89CGJ7\ndK27Ks3t1kITxIRA5C67BDEh1AJvw8ztFhKZ0SMGtjELmK4AQTS9BS2pP0G0pCFNfwyCaHoLWlJ/\ngmhJQ5r+GATR9Ba0pP4E0ZKGNP0xCKLpLWhJ/QmiJQ1p+mMQRNNb0JL6/x979rRTaksCmgAAAABJ\nRU5ErkJggg==\n", 224 | "text/plain": [ 225 | "" 226 | ] 227 | }, 228 | "metadata": {}, 229 | "output_type": "display_data" 230 | } 231 | ], 232 | "source": [ 233 | "display(Image('pca3.png'))" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "这时候点都聚集在新的坐标轴周围,因为我们使用的最小平方误差的意义就在此。另外,PRML书上从线性子空间的角度进行了详细的阐述,有兴趣的读者可以看看。\n", 241 | "\n", 242 | "\n", 243 | "\n", 244 | "## 2. PCA算法优缺点:\n", 245 | "\n", 246 | "优点:\n", 247 | "\n", 248 | "* 它是无监督学习,完全无参数限制的。在PCA的计算过程中完全不需要人为的设定参数或是根据任何经验模型对计算进行干预,最后的结果只与数据相关,与用户是独立的。\n", 249 | "\n", 250 | "* 用PCA技术可以对数据进行降维,同时对新求出的“主元”向量的重要性进行排序,根据需要取前面最重要的部分,将后面的维数省去,可以达到降维从而简化模型或是对数据进行压缩的效果。同时最大程度的保持了原有数据的信息。\n", 251 | "\n", 252 | "* 各主成分之间正交,可消除原始数据成分间的相互影响。\n", 253 | "\n", 254 | "* 计算方法简单,易于在计算机上实现。\n", 255 | "\n", 256 | "\n", 257 | "缺点:\n", 258 | "\n", 259 | "* 如果用户对观测对象有一定的先验知识,掌握了数据的一些特征,却无法通过参数化等方法对处理过程进行干预,可能会得不到预期的效果,效率也不高。\n", 260 | "\n", 261 | "* 贡献率小的主成分往往可能含有对样本差异的重要信息。\n", 262 | "\n", 263 | "* 特征值矩阵的正交向量空间是否唯一有待讨论。\n", 264 | "\n", 265 | "* 在非高斯分布的情况下,PCA方法得出的主元可能并不是最优的,此时在寻找主元时不能将方差作为衡量重要性的标准。\n", 266 | "\n", 267 | "\n", 268 | "\n", 269 | "## 3. 
代码实现" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 4, 275 | "metadata": { 276 | "scrolled": true 277 | }, 278 | "outputs": [ 279 | { 280 | "data": { 281 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY4AAAEKCAYAAAAFJbKyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XucXWV18PHfOpPLMGQgkovk6oBgGpCikgDFxJIGAkZN\nxNaoQ62S2pS8iEgU3zbQipfmRUBpKVbeFBN5WwZKFUxUhFwaNIoSkgiEEKIIMVdMiGISQhIys94/\n9j6TM2f23mfvc/Y+e+8z6/v5zGdm9rk9CeSs8zzPetYSVcUYY4wJq5D2AIwxxuSLBQ5jjDGRWOAw\nxhgTiQUOY4wxkVjgMMYYE4kFDmOMMZFY4DDGGBOJBQ5jjDGRWOAwxhgTSb+0B5CEoUOHaltbW9rD\nMMaY3Fi3bt3LqjoszH0bMnC0tbWxdu3atIdhjDG5ISK/CXtfW6oyxhgTiQUOY4wxkVjgMMYYE0mq\nexwi8hngVmCYqr7scfulwL8ATcBdqnpTnYdojDGhvf7662zfvp1Dhw6lPRRfzc3NjB49mv79+1f9\nHKkFDhEZA0wDtvrc3gR8HbgY2A48ISJLVfXZ+o3SGGPC2759O62trbS1tSEiaQ+nF1Vl7969bN++\nnVNOOaXq50lzqeo24HOAXyepc4HnVfUFVT0C3AfMrNfgjDEmqkOHDjFkyJBMBg0AEWHIkCE1z4hS\nCRwiMhPYoapPBdxtFLCt5Pft7jVjjMmsrAaNojjGl9hSlYisAE72uOl6YD7OMlWcrzcHmAMwduzY\nOJ/aGGNMicRmHKp6kaq+tfwLeAE4BXhKRLYAo4H1IlIeZHYAY0p+H+1e83u9hao6QVUnDBsW6vCj\naXBLNm9i0uKFvPn2rzJp8UKWbN6U9pCMqYuHH36YcePGcdppp3HTTfHnFNV9c1xVNwDDi7+7wWOC\nR1bVE8DpInIKTsD4MNBer3GafFuyeRPzVy7jtaNHAdi5fz/zVy4DYOa48WkOzZhEdXZ2ctVVV7F8\n+XJGjx7NxIkTmTFjBmeccUZsr5GpcxwiMlJEHgJQ1aPAJ4FHgE3A/aq6Mc3xmfy45bHV3UGj6LWj\nR7nlsdUpjciY3lZ2rObytrlMa5rF5W1zWdlR+/+fa9as4bTTTuPUU09lwIABfPjDH2bJkiUxjPaY\n1GtVqWpbyc87geklvz8EPJTCsEzO7dq/P9J1Y+ptZcdqbptzJ4cPHgFg99aXuW3OnQBMbZ9c9fPu\n2LGDMWOOrfKPHj2axx9/vLbBlsnUjMOYuIxobY103Zh6WzS/oztoFB0+eIRF8ztSGlF4FjhMQ7ru\ngskc16/nhPq4fv247oLqP8klrevgUrp2X0jXS+Oc7weXpj0kk6A92/ZGuh7WqFGj2Lbt2EmG7du3\nM2pUvCcZLHCYhjRz3HgWTJ3GyNZWBBjZ2sqCqdMyuzHedXAp7LsBunYC6nzfd4MFjwY2bMyQSNfD\nmjhxIr/61a948cUXOXLkCPfddx8zZsyo6TnLpb7HYUxSZo4bn9lA0cuBrwHlp3kPOddb4v1Hb7Jh\n9oL2HnscAANbBjB7QW3Jo/369eOOO+7gkksuobOzk9mzZ3PmmWfWOtyerxHrsxljqtO1K9p1k3vF\nDfBF8zvYs20vw8YMYfaC9po2xoumT5/O9OnTK9+xShY4jMmCwgh3mcrjumlYU9snxxIo6s32OIzJ\ngkHzgOayi83udWOyxWYcxmRAoWUGXeDsaXTtcmYag+ZRsP0Nk0EWOIzJiELLDNsIN7lgS1XGGGMi\nscBhjDEmEgscxhjTYJIuq26BwxhjGkixrPoPf/hDnn32We69916effbZWF/DNseNMSYlSzZv4pbH\nVrNr/35GtLZy3QWTa652UFpWHeguqx5nPw4LHCaXkvgHZ0w9JdVszMqqG+PhH1atYN4jD7Fz/36U\nY//grDWsyZM8NxuzwGFyZcnmTXRseAotu56Xf3DGFCXVbMzKqhtT5pbHVvcKGkXW3c/kSVLNxupR\nVt0Ch8mVoOBg3f1MniTVbKy0rPr48eOZNWuWlVU3fduI1lZ2egQPgUx39zOmXHEDPIkkDyurnhMr\nO1YnUlff9HTdBZN7ZKKAEzTazzrbsqpM7uSq2VgJCxwxWNmxukcnr91bX+a2OXcCWPCIWZKf0owx\n4VjgiMGi+R092j8CHD54hEXzOyxwJCCvn9KMaRS2OR6DPdv2RrpujDF5ZoEjBsPGDIl03Rhj8swC\nRwxmL2hnYMuAHtcGtgxg9oL2lEZkjDHJscARg6ntk7l24ZUMHzsUEWH42KFcu/BK298wxqQi6bLq\ntjkek6ntky1QGGNSVyyrvnz5ckaPHs3EiROZMWNGrNVxU51xiMhnRERFZKjP7VtEZIOIPCkia+s9\nPmOqsWTzJiYtXsibb/8qkxYvtOKLxlfXwaV07b6QrpfGOd8PLq35OUvLqg8YMKC7rHqcUptxiMgY\nYBqwtcJdp6jqy3UYkjE1S6pUtmk8XQeXwr4bgEPuhZ2w7wa6gEJL9bWlGr2s+m3A58C3Zp0xuZPn\nUtmmzg58je6g0e2Qez3bUgkcIjIT2KGqT1W4qwIrRGSdiMyp8JxzRGStiKzds2dPbGM1JoqkSmWn\nIYllFFOia1e06yHVo6x6YktVIrICONnjpuuB+TjLVJVMUtUdIjIcWC4iz6nqj73uqKoLgYUAEyZM\nsFmMSYVfEcY8Ve7tOrgU9n8Z9JWSi/Eso5gShRHO36vX9RqUllUfNWoU9913Hx0dHTU9Z7nEZhyq\nepGqvrX8C3gBOAV4SkS2AKOB9SLSK8io6g73+27gQeDcpMZrsiWvG8xJlcqul+5199Kg0S0fyyi5\nMWge0Fx2sdm9Xr2GLKuuqhuA4cXf3eAxoXwDXESOBwqqut/9eRrwxXqO1aQjzxvMuS/C6LnuXqLG\nZRRzTKFlBl3g/J137XJmGoPmxTKj61Nl1UVkJHCXqk4H3gg8KCLgjLNDVR9Oc3ymPoI2mNN+A16y\neVPFoJDrIoyVAkONyyimp0LLDMjh0l/qgUNV20p+3glMd39+ATg7pWGZFGV1gznPM6HQ/NbdgTiW\nUUxjsJIjJnOS6sVciyWbN/HZZT9s/FRbz3V3gMFwwpdtYzwE1Wzn5sQxvtRnHMaU8+ryV88N5vLl\nqCltp/LApo10+vyDK50JhVnKyrIk1937gubmZvbu3cuQIUNwl9kzRVXZu3cvzc1eHw7Cs8BhMifN\nDWav5aiODU8FnlItzoSqXcrKWrDJ67p7FowePZrt27eT5bNkzc3NjB49uqbnqBg4RKS/qr5edm2o\n
lQExSUprg9lrYz4oaJTOhKrZ1O8T+yZ9SP/+/TnllFPSHkbifPc4RGSKiGwHdonIMhFpK7l5WdID\nMyYNUTbgm0RYMHVa9xt8NZv6VqLE5FHQ5vjNwCWqOhTnRPZyETnfvS17i3fGxMBvA778f/jj+vXj\n1mnv7jErqGZTP6sZZMYECQocA1R1I4Cqfht4P3C3iLwfK0xoGpTfye/2s85mZGsrAoxsbe0x06j0\n2KBN/SxmkBlTSdAex+sicrKqvgSgqhtFZCrwfeDNdRmdMXVWy8Z8NY9NO4Os6+BSy6AykYlfTq+I\nXATsKa9gKyInAp9U1X+qw/iqMmHCBF271vo+mXxIK6uqVz8IAJrtvEYfJSLrVHVCqPtm/bBKNeod\nOFZ2rGbR/A72bNvLsDFDmL2g3drI9gFZS6ONqmv3hT7VWUdSGP5oz/u+8nk4dD/QCTRB8ywKg79Q\nh1GaeokSOOwcR41Wdqzmtjl3cvjgEQB2b32Z2+bcCWDBo4E1RBptyH4QTtC4t+RKJxy6l65XsODR\nR1nJkRotmt/RHTSKDh88wqL58da/N9kSNo02jfLwoV/Tr2Bh+fVD93vfz++6aXgVA4eIfDDMtb5q\nz7a9ka6bxhAmjbY4K9m5fz/KsVlJksEj0muG7gfR6fNqznXrFNj3hJlx/H3Ia33SsDFDIl03jSFM\nGm0ah/uivGahZQac8GUojATE+e65Md7k82pNxzbYu3YCeqxToAWPhhZ0cvzdIvKvwCgRub3k61vA\nUb/H5dXKjtVc3jaXaU2zuLxtLis7wv3jnr2gnYEtA3pcG9gygNkL2pMYpsmIMGc2/GYlO/fvT2zp\nKuqBwkLLDArDH6Vw8mbnu1c2VfMs7xdrnuXT+Klnp0CbkTSeoM3xncBaYAawruT6fuDaJAdVb7Vs\ncBdvt6yqviXMmQ2//uNAj2Wk0uerVRI9zwuDv0DXK3hmVXW9NM77Qe4Ge6+UX+td3hAqpuN6FTnM\nuqjpuJe3zWX31t41G4ePHco9W77R67ql35owyjOv/IxsbeUnV8zp9dhqUn29XvO4fv08T7rHoVJK\nb5SUX5OuuNNxzxWRG4E3ufcXQFX11OqHmC1RNrgt/daEVT4r8fuIVr6MVEuqb91L0g+a532IsLjB\nHjLl1+RLmMDxTZylqXX4p1fk2rAxQzxnHF4b3EHptxY4TLnS8vCTFi8MtYxUa8/1epakr9j4ya8V\nrfUuz7UwWVV/UNUfqupuVd1b/Ep8ZHUUZYPb0m9NtcIWQcxbxdzADfbQKb8mT8LMOFaJyC3AA8Dh\n4kVVXZ/YqOosygZ3lNmJMaXCLiP5bXAXRFiyeVNdT6bXWlbFWtE2pjCb46s8Lquq/lkyQ6pdkrWq\nyvc4wJmdXLvwSluqMrEI2lRPcqM7zDjq+fqmvmLdHFfVKbUPqXFY+q0Jq9pP68X7fHbZD+ks+2AX\nZa+jVrXutZjGFabn+BuBBcBIVX23iJwB/ImqfjPx0WXU1PbJFihMoFqLIM4cN555jzzkeVu99jry\nttdi6ifM5vi3gEeAke7vvwQ+ndSAjKmnpIoQxlFuJExZkySLKFZ6fTsR3neFCRxDVfV+cPa4VPUo\nDZqWa/qWJIsQxvFp3SsLS4Apbc4RqqSLKAZlgVmNqr4tTOB4VUSG4PYZF5HzgT8kOipj6iDJIoR+\nn9YVQs8MZo4bzwfGn4mUPf6BTRu790+SLKI4c9x4Fkyd5t1rPUSNKtO4wqTjzgOWAm8WkZ8Cw4C/\nqOVF3ZPofwPscS/NV9VeC7oicinwLzjlOe9S1Ztqed04WLmRxpHkGv51F0zmc8sf5vWurl63Rdnv\nWLXlhV4nzovBIanxh9rUtxPhfVqYrKr1IvKnwDicmfLmmGpX3aaqt/rdKCJNwNeBi4HtwBMislRV\nn43htasStdyIBZlsi7sgYOkb7uDmZjo9gkZR2OykoOCQREHD0Jv6diK8TwvbAfBc4GzgHcBHROSv\nkhtSj9d8XlVfUNUjwH3AzDq8rq8o3f6KQWb31pdR1e4gE7Zcu0le2JPcYZTvN/z+0CH8w4YjzMwg\naIM6zvEXhV7+shPhfVqYDoD/AdwKTAImul+hDolUcLWIPC0ii0TkDR63jwK2lfy+3b2WmijlRqyl\nbPYFruFH5PWGW0mY/Y6g4BDn+Iu8gtn7xvyKe991Z4/sqfBNoEwjCrPHMQE4QysdMS8jIiuAkz1u\nuh74BvAlnH87XwK+CsyO8vwerzcHmAMwduzYWp7KV5RyI1bTKh/iKghY7b5Cpf2OSmVK4i5oWL78\n9b4xv+KfJv6Yln5uUCzvp2GBok8KEziewQkAkXa9VPWiMPcTkX8Hvu9x0w5gTMnvo91rfq+3EFgI\nTsmR8CMNb/aCds9yI17FEK2mVd8S1LSpkkr7HfWsdnvdBZN77HF89o/XHAsa3dzsKQsafVaocxzA\nsyLyiIgsLX7V8qIiUrqDdhlOcCr3BHC6iJwiIgOAD+Nkd6Vmavtkrl14JcPHDkVEGD52qG+NKq+K\nuwic955z6jRaU61qDtV5LSlFkZXT2OXLXyNaDnjfMUPZU3YQsf7C/J9+YwKve7OIvA1nqWoL8LcA\nIjISJ+12uqoeFZFP4pxabwIWqerGBMYSSZhyI8VsqvI9DhSW3b2KM985zrKrMqraUiFhmzb5qSUT\nKm6lM5yu3d/LdPaUtaZNR8XquNBdr2qi++saVd2d6KhqlGR13CArO1bzb9csZt/e4E+Pfi1pTfr8\nmi15tXet5nkGDxzI4c7O3FSc7fXGDEBz5I3wroNLayqt7vd4a00bnyjVccNkVc0C1gAfBGYBj4tI\nTQcAG1Ex/bZS0ADbIM+yuA7VeS1d9RdBRHjt6FGaxDkPHkcmVJIKLTOg+TKcST/O9+bLor/pV1me\npOvgUrpeOhf2fdb78XYQMRVhlqquByYWZxkiMgxYAXw7yYHljefSlA8pCBcXPkihqUBXZxfDxw61\nw4EZEdehuvKlqxMHDuTV11/n94ecT+6dqj3SarOq6+BSOPQgx8rTdcKhB+k6eE744BFUniTgObxn\nO2WPt4OIqQizOV4oW5raG/JxfUqUWURXZ1eP73Y4MDviPFQ3c9x4fnLFHH79qc/QMmBAr/IjcdaV\nKqqlWq7nJnMcNalCzArCv3bZ4+0gYirCzDgeFpFHgHvd3z8EeDcK6MP80m/DKh4OtFlHusK2d42q\nHnWlBjc3c+DwYV539y0rbez32DeQE0EPAD3Pa/i+cUdZCqowK/Db4A4MGu7jrTVtOsLUqrpORD6A\nc3IcYKGqPpjssPLH64xHVLb3kQ1JnJuoR12p4jJYKb8zIr3erPUVj1c4hLO34dVFoXDsBHklg+Z5\nb7AXZwV+sxrf1+75eDuIWH9hl5weA34ErAJ+ltxw8qk0/bbQ5PyVDh87lNaTBkV6HlXl8ra5tmTV\ngOpVV8qL56ym0jJQt056LwW510NucFcsT+I7e/F5bRls5
U1SFiar6hM4WVWX4ZRT/7mI1FQepJGU\nFjMEZ9+ieJr8wO9fjfx8tt/RmOpVV8qLZ02s0EtNzpv0sayqUuH3OgotMygMf5TCyZud76Vv+n4b\n2W6A6RlwbqXwxjUWNFJW8RyHiGwGLlDVve7vQ4DHVHVcHcZXlXqe47i8ba7n3sbwsUMBAvc9illV\nXuysR98Tqg9GCb+zIn5Kz4v4nn/opT+c8H9g33XgeaxRKJy8OfQYvMR1VsTUJtZzHDhZVKX/d+53\nrxmCixl6lh0pcfaFZyIinrfZfkffUk0bWM+zIoUCgwcO9Lx/jywuz2wkL68fS3v1EkPaq1XazZ8w\ngeN5nEN/N4rI54GfA78UkXki0udz3vyKFqoqi+Z3MO1jU3wf+4uVG3wfb8UQ+5Zq2sB6LX/dfPGl\nrP/bT+L9ccRZ3lqyeRPv+q+XuPZnF/DSayeg6r5Z+6lD2mvgUpbJnDDpuL92v4qWuN+zU1wnRUHZ\nVLu3vsyyu1dFfrxfxV2Tf37LUdWm6/plgPllcZ04cGB3JtbO/afzvW2ndy9hve8Nc33TZi3t1ZQK\nk477hXoMJK+K5y4Wze/w3M+olJ5b+nhrMdvYggooxp2uW14eHZw9jmLJk1LFmc37PhScNmtpr6ao\nYuAQkQk4ZUfeVHp/Vf3jBMeVK8WKudOaZhGl39Xbp57V4/GmsQUtR/m90UdJ1y2fzXxg/Jms2vJC\nj9nNvEe8z+7u2r+fQsuc3M4qai2iaKIJs1R1D3AdsAEqtlHu0/xOjw8fO5RRp4/gFys3dF97+9Sz\nuHn5P1Z8zuIZEZuN5F/QclStJ9a9ZjMPbNrYK+X3lsdWB85s8jirsNLq9RcmcOxRVeuMEkLQfkU1\nb/bFMyLF5yue8QAseORQpeWoWk6sB81mSp8zjplN5oQoomgzkniFyar6vIjcJSIfEZEPFL8SH1kO\nRekQGIZXxd1iTSuTP0mcHi8Ku7mexEHE1FUoolhLWXfjLcyM4wrgj4D+HFuqUuCBpAaVZ3HuVwSd\nETH5U+1yVJiDgVE21+vZw7wuKpVWr7Ksu/EXJnBMzPIp8TwKu2/ht2diZzzyK+qbdthWtg25BBVW\npSKK1uwpdmGWqh4TkTMSH0kfUVrbSlUDa1N5nTy3Mx59S9iDgXlcgjrWg+MtdL003vle7MURQcWT\n5wmeeu+rwsw4zgeeFJEXgcOAAGrpuNUJ2rcon3XYGQ8T5WBgnpagetencsunV5kRFZgNVmlGYiIL\nEzguTXwUfUjUfQs749G3xX0wMGohxcQElnWPd//BTr3Hr+JSlar+BhgMvM/9GuxeMyVWdqzm8ra5\nTGuaFdhTw2pTmSjizMSqppBiHDzbwlbaX4h5/8FqYcUrTD+Oa3AOAQ53v/5TRK5OemB5YvsWJilx\n7l1UU0ixVn6psHBi8ANt/yHTwixV/TVwnqq+CiAiX8HpAvivSQ4sTyrtW5RnUU372BQe/8G6SPsW\ndoK874pr7yKpvueB/FJhpRm0H909znuw/YesCxM4hJ6Nfzvda8YVtG/hdfp72d2rIh0MtBPkJg5J\n9D2vyG/JSV/Bu6vgYDjhBltKyrgw6biLOdaP40acfhzfTHRUKQm7T1EuaN8iyulvv9e3E+QmDkme\nXPflu+RU/nn02OXyoOG5R2JSFWZz/Gs4p8d/535doar/nPTA6i3KPkW5oH2LsFlUQa9vJ8hN0ZLN\nm5i0eCFvvv2rvfuIV5DKWQ+/BlCebWhxZyLHWLmQbPLtOS4iE4GhqvrDsuvTgd+q6ro6jK8q1fQc\nD+odHqb3t98ehN/zFp+70v2CepdbX/K+pfwUOfTsI55VXgUG2fdZ3/sXTv7lscf69UYvjKQw/NH4\nB9uHxdVz/CvAsx7XNwK3VDOwInfZa4eIPOl+Tfe53xYR2eDeJ1okiKjWT/VT2ydzz5ZvsKzzfu7Z\n8o3uvYegvuNhZxWWiWUgnayoOHinwg72uXfZ9RAFDG0Zq/6CAker13kN99rQGF77NlV9m/vl3V3G\nMcW9T6hIWK2kzleUVsz1Utyr8HsdKTh5CHFW3TX55Jf9tHP//qqWrlJ1wg145uYIPd/8A8qF2DJW\neoICxxsCbmuJeyBpS/JTfXE2IuKdjOY3qwDo6uzqzqDymtGYviMo+6meB/ri4NSXuoleMwx9peeb\nv98eyaB5wVVvTaKCAscKEfknKXm3E8cXgf+J4bWvFpGnRWSRiPgFKXXHsU5E5gQ9mYjMEZG1IrJ2\nz549kQcTdy8NL0GzmuLrF5p6/yexDCoD3llR5fKwdFVUaJkBBa/PoMfe/LsLGEppgHEDiVW9TU3Q\n5vjxwF3AucCT7uWzgbXAJ1T1QOATi6wATva46XqclN6XcQLDl4ARqjrb4zlGqeoOERkOLAeuVtUf\nV/pDVbM5Xg/l5zEAEHjflZfwqa9/AsC3b7mIsKzz/noN1WRUaa0pv+72Avz6U5/xfVyqNarKdL00\nDu8MK6Fw8mbnPr0KIgI0u4cIX+n9UNs4r0qUzXHfjy/uSfGPiMipwJnu5Y2q+kKYJ1bVi8LcT0T+\nHfi+z3PscL/vFpEHcYJYxcCRVVPbJ7Pxp5v53p2PHPu3orDs7lWc+c5xTG2fbD04TKDSU+STFi8M\ndaAvbE+P8sdUE2git2it1IQJ/JektBln9tF3q96m1RI3zDmOF1T1e+5XqKBRiYiU7nhdBjzjcZ/j\nRaS1+DMwzet+efP4D9b1+oBVuhRlGVQmrLAH+qJmY0Uphlh6ruTGhz/N0T9cH22zOmgPo8h36ekP\nwX04GlyayQFhTo4n4WY3zfZpYApwLYCIjBSRYobVG4GfiMhTwBrgB6r6cDrDjc7vFHiltN967LWY\nxhD2QF/UGlVhA015gPnr039MPzlc9mzBm9UVmzBBYGZVlKq3DZe6m2JyQJhaVbFT1Y/6XN8JTHd/\nfgFnTyV3gmpLhVmKsh4cJqwwBRD9alQVRFiyeVPVgaY8wIxo8dn2rLBZHdiECWJpxNRrn6TKhlGZ\nkmJygO+MQ0ROCvpKfGQ5FlRbypaiTL35ZWN1qnouQfml/ZZfLw8kuw4O8h5AjSXSQ81KSnjOLBox\ndTfFlrhBS1XrcDKo1nl8ZS9lKUP8lqN2b32Zmz92R4+gYktRJm7l9awAFkydRpPHOSKvJaiweyfl\ngeTWp8/l4NHyABXPZnXYJSnf/h9eG/CQ79TdMPtDCfENHKp6iqqe6n4v/zo18ZHlWFAGVFdnV4/f\nz3vPORY0TGz+YdUK5j3yUK+NbYAun9T78plD2L2T8gDzvW2n84X1UzjYNYx6bFZHmll4lnAn1w2j\nos7E4uR7jqPHnZwDeqdTEt7CnKdIS9rnODzPa/goNBV45PX/qsOoTKNbsnkT8x55yPNUxEh3duC1\n1zGytZWfXBF4vjbwNdM4H+J7tsO3j7nf7db/oyiWcxwlT/YJ4BpgNM5BwPNxOgD+WS2DbGTFGcTN\nH7uj1wyjXKXb
jQnrlsdW+x4K3LV/P1+7ZLpndd1a+nHE1Z0wssCZhUefj8JIZwln/5fLDg2+EnqT\nPK0zE1kUJh33GmAi8BtVnQK8HfA4rmlKTW2fjHZVns0BkRtHGeMlqAXsiNbWdPpx1Mg3hdZ3b6IT\n6F92rf+xN3kJLnHi+9q/PRf2/Z0VVHSFScc9pKqHRAQRGaiqz4nIuMRH1gD8Um/LlTZuAmsHa6rj\nl3Yr0D2rSG2GUIWgFFrfE+cyGHpVQyr5ABcyhbXXa3uVNikGnD446wgz49guIoOB7wLLRWQJ0Kvc\nuult9oJ2+g0If1QmqJhhtW1tTd/hlQ0lQPtZZ+cmWPQQlELrl1GkAEfLrh89NqMIm8Lq+doe8pyV\nVYOK72qqepn7440isgo4EcjNCe40FWcO/3bNYvbtdT4Jtp40iKtun81XPvqvnsUM92zb26ub4Hnv\nOYdld6/yPFBosxNTVAwOUTer0yyAGLhvEDA7KLTMcGYevToLXuf7GCD8YcKwASHHWVm1CJtV9Q5g\nEk48/6mqrk96YLVIO6sqDL9Wsa0nDeLIoSO9Kuh67Xpa61hTqzTb0fpmRrkppdW0jQ3zmDCb3L7P\n00NzQ9XGiqt1bPHJ/hG4GxiC0/lvsYjcUNsQjWfjJoGjR472TuP1ie27t75sS1emJqm2o610mrua\nA24hHhPqMKHn8/THaTzV9woqlguzAH85cLaqHgIQkZtw0nK/nOTAGp1fifXXDoRYVy1hG+umFlEL\nIMaqwka133JU0Jt1NY9J8nmiyFO6b5jAsZOeJ2cGAjsSG1Ef4lVi3ZfPclVRcWPdAoeJsmfhl4kV\n1KY2NiHzBBW7AAASQUlEQVR6cVQsgFgmzjffqK9di7wVYQyTVfUHYKOIfEtEFuP0xHhFRG4XkduT\nHV5j86tpVW5gywDed+Ul3aXWa30+07ii9NKA8HWpEhFzraU0+1PULGdFGMPMOB50v4oeTWYofY/f\nOY/WkwZx3KDm7qyq2QvanZmE217Wb2PdugSaoD0Lr1lHtZlYcYh9OSjozTeDn9p7yFn/9DDpuHfX\nYyB90ewF7dxyxdfpfP1YiYSm/k1cdfvswCWn2Qvae9XCstLsBqrbs4hyKDDu1N1Yl4Ny9ubbQ5gW\nuhkS1I/jfvf7BhF5uvyrfkNsTCs7VvP1Ty3qETSAwKWoIusSaPyE7aVRjajLYHWXYn+KmqVYIr0a\nvuc4RGSEqu4SkTd53a6qmT09nvVzHJWq59r5DFOtJM9lTFq8MPbqunGqdC4k69LOqoqlOq6qFud3\nBWBXSTrucTj9wE2VvDoElrJNblOtJPcsUk3dDSGNFNpaeAYKn4ONWRNmc/y/gQtKfu90r01MZER9\nQKXAYJvcphZJFTJMNXU3pDB7Jml/su8eQ47Sb8uFScftp6rdH4/dnwcE3N9UEBQYbJPbZFWqqbsx\nyUzKbs7Sb8uFCRx7RKQ7BIrITKByrXDjy7PcCCAFYdrHptgmt8mkPPbz6CUrb9h5zgAj3FLVlcA9\nInIHzvnlbcBfJTqqBlasfHv44JFep8G1S1l29yrOfOc4Cx4mk/LUz8NTVt6wc5Z+W67ijENVf62q\n5wNnAONV9QJVfT75oTWeYjZV9+E9j4S2oJ4cxpgaZSVlN2fpt+XC9BwfCPw50Ab0K54zUNUvJjqy\nBlQpm6rIsqqMSUjYfhwJy1sGWLkwS1VLcOpVrQMOJzucxhY2IFhWlTHJyNIbdj2LKMYtTOAYraqX\nJj6SPiBUD3KB895zTn0GZEwflOc37KwIk1X1mIicFfcLi8jVIvKciGwUkZt97nOpiGwWkedF5O/i\nHkO9eWVTSVNZiRGFZXevssZMxpjMCjPjmAR8XERexFmqEkBV9Y+rfVERmQLMxGkQdVhEhnvcpwn4\nOnAxsB14QkSWquqz1b5u2oqZUqX9xF87cIj9vzvQ437WW8OkJc3+43mVhQOF9RYmcLw7gdedC9yk\nqocBVHW3x33OBZ5X1RcAROQ+nGCT28ABTvAoDQjTmmZ53s82yE29lde5KhYxBCx4+Mj7CfBqBVXH\nPcH9cb/PVy3eAkwWkcdF5Eci4lW+ZBTOmZGi7e61huK3EW4b5KbeUu0/nldZOVBYZ0Ezjg7gvTjZ\nVIqzRFWkwKlBTywiK4CTPW663n3dk4DzcWpe3S8ip6pfqd4QRGQOMAdg7Nix1T5N3VlvDZMVWS9i\nmElZOVBYZ0HVcd8rzqGNP1XVrVGfWFUv8rtNROYCD7iBYo2IdAFDgT0ld9sBjCn5fTQBvc5VdSGw\nEJyy6lHHmxavfY/ujn/G1FEeihhmTs5PgFcrMKvKfWP/QQKv+11gCoCIvAWnaGJ5nuoTwOkicoqI\nDAA+DOSgeXB0U9snc8+Wb7Cs837u2fIN36CxsmM1l7fNZVrTLC5vm2uZVyZWjVDEsO5yfgK8WmE2\nx9eLyERVfSLG110ELBKRZ4AjwMdUVUVkJHCXqk5X1aMi8kngEaAJWKSqG2McQ66UN3/avfVlbptz\nJ4DNTkws0uw/nldZOlBYT74dALvvIPIccDqwBXiVGNJxk5b1DoDVuLxtrufhQesWaEzy+kLKbSwd\nAEtcUuN4TAz80nMtbdeYZPXVlNsgQem4zSLyaeA64FJgh6r+pvhVtxEawNJ2jUlNH025DRK0OX43\nMAHYgHMI8Kt1GVEDW9mxmj8fNpuLCx/k4sIH+cDQK0JvcHuVK7G0XWPqoI+m3AYJWqo6Q1XPAhCR\nbwJr6jOkxrSyYzW3zv43jh45dsBq/+8OcMsVXwcqb3Bb2q4xKemjKbdBfDfHRWS9qr7D7/csy+Lm\nuN/mNtgGtzFZ1muPA4BmOOHLDbXHEdfm+Nkisq/4nMBx7u/FrKoT/B9qygVtYtsGtzHZ1VdTboME\nnRxvqudAGl1QL45qNriLvctt2cqY5FkPj57C9OMwMZi9oJ1+A3rH6ab+TZE3uEt7l6tq92FAO0lu\njKkHCxx1MrV9Mp9d9L84Ycixuj+tJw3iusVXRZ4pePUuL/bwMMaYpIU5AGhiUt6Lo1p2GNAYkyab\nceSQHQY0xqTJAkcO2WFAY0yabKkqh+wwoDEmTRWr4+ZRFg8AGmNMlkU5AGhLVcYYYyKxwGGMMSYS\nCxzGGGMiscBhjDEmEgscxhhjIrHAYYwxJhILHMYYYyKxwGGMMSYSCxwpWNmxmsvb5jKtaRaXt821\ncujGmFyxkiN1VuylUSyLXuylAZX7jhtjTBbYjKPOrJeGMSbvLHDUmfXSMMbknQWOOrNeGsaYvLPA\nUWfWS8MY46Xr4FK6dl9I10vjnO8Hl6Y9JF+pbY6LyNXAVUAn8ANV/ZzHfbYA+937HA1b8jfLrJeG\nMaZc18GlsO8G4JB7YSfsu4EuoNAyI82heUolcIjIFGAmcLaqHhaR4QF3n6KqL9dpaIla2bG6R8D4\n3/9xtQUMYwwc+BrdQaPbIee6BY5uc4GbVPUwgKruTmkcdWNpuMYYX127o
l1PWVp7HG8BJovI4yLy\nIxGZ6HM/BVaIyDoRmRP0hCIyR0TWisjaPXv2xD7gWlkarjHGV2FEtOspS2zGISIrgJM9brrefd2T\ngPOBicD9InKq9u5jO0lVd7hLWctF5DlV/bHX66nqQmAhOK1j4/pzxMXScI0xvgbN67nHAUCzcz2D\nEgscqnqR320iMhd4wA0Ua0SkCxgK9JgqqOoO9/tuEXkQOBfwDBxZN2zMEHZv7b1VY2m4xphCywy6\nwNnT6NrlzDQGzcvkxjikt1T1XWAKgIi8BRgA9HhXFZHjRaS1+DMwDXimzuOMjaXhGmOCFFpmUBj+\nKIWTNzvfMxo0IL3N8UXAIhF5BjgCfExVVURGAnep6nTgjcCDIlIcZ4eqPpzSeGtmabjGmEYhvbcV\n8m/ChAm6du3atIdhjDG5ISLrwp6Vs5PjxhhjIrHAYYwxJhILHMYYYyKxwJFR1iXQGJNV1gEwg6w8\niTEmy2zGkUFWnsSYxpCnUulR2Iwjg6w8iTH5l7dS6VHYjCODrEugMQ0gqFR6zlngyCArT2JMA8hZ\nqfQoLHBk0NT2yVy78EqGjx2KiDB87FCuXXilbYwbkyc5K5Uehe1xZNTU9skWKIzJs5yVSo/CAocx\nxiQgb6XSo7DAYYwxCSm0zMhkz/Ba2R6HMcaYSCxwGGOMicQChzHGmEgscBhjjInEAocxxphILHAY\nY4yJxAKHMcaYSCxwGGOMiURUNe0xxE5E9gC/cX8dCryc4nCisLEmw8aaDBtrMtIa65tUdViYOzZk\n4CglImtVdULa4wjDxpoMG2sybKzJyMNYbanKGGNMJBY4jDHGRNIXAsfCtAcQgY01GTbWZNhYk5H5\nsTb8Hocxxph49YUZhzHGmBg1ZOAQkTEiskpEnhWRjSJyTdpj8iMizSKyRkSecsf6hbTHVImINInI\nL0Tk+2mPpRIR2SIiG0TkSRFZm/Z4/IjIYBH5tog8JyKbRORP0h6THxEZ5/59Fr/2icin0x6XFxG5\n1v139YyI3CsizWmPyY+IXOOOc2NW/z6LGnKpSkRGACNUdb2ItALrgPer6rMpD60XERHgeFU9ICL9\ngZ8A16jqz1Memi8RmQdMAE5Q1femPZ4gIrIFmKCqmc7hF5G7gdWqepeIDABaVPWVtMdViYg0ATuA\n81T1N5XuX08iMgrn39MZqvqaiNwPPKSq30p3ZL2JyFuB+4BzgSPAw8CVqvp8qgPz0ZAzDlXdparr\n3Z/3A5uAUemOyps6Dri/9ne/MhvNRWQ08B7grrTH0ihE5ETgXcA3AVT1SB6Chmsq8OusBY0S/YDj\nRKQf0ALsTHk8fsYDj6vqQVU9CvwI+EDKY/LVkIGjlIi0AW8HHk93JP7cpZ8ngd3AclXN7FiBfwY+\nB0475RxQYIWIrBOROWkPxscpwB5gsbsEeJeIHJ/2oEL6MHBv2oPwoqo7gFuBrcAu4A+quizdUfl6\nBpgsIkNEpAWYDoxJeUy+GjpwiMgg4DvAp1V1X9rj8aOqnar6NmA0cK47bc0cEXkvsFtV16U9lggm\nuX+37wauEpF3pT0gD/2AdwDfUNW3A68Cf5fukCpzl9RmAP+d9li8iMgbgJk4gXkkcLyI/GW6o/Km\nqpuArwDLcJapngQ6Ux1UgIYNHO5+wXeAe1T1gbTHE4a7PLEKuDTtsfh4JzDD3Te4D/gzEfnPdIcU\nzP3UiaruBh7EWUPOmu3A9pKZ5rdxAknWvRtYr6q/TXsgPi4CXlTVPar6OvAAcEHKY/Klqt9U1XNU\n9V3A74Ffpj0mPw0ZONwN528Cm1T1a2mPJ4iIDBORwe7PxwEXA8+lOypvqvr3qjpaVdtwlij+R1Uz\n+QkOQESOd5MjcJd+puEsCWSKqr4EbBORce6lqUDmEjk8fISMLlO5tgLni0iL+54wFWe/M5NEZLj7\nfSzO/kZHuiPy1y/tASTkncBHgQ3u3gHAfFV9KMUx+RkB3O1mpxSA+1U182muOfFG4EHnPYN+QIeq\nPpzukHxdDdzjLv+8AFyR8ngCuYH4YuBv0x6LH1V9XES+DawHjgK/INunsr8jIkOA14Grspwg0ZDp\nuMYYY5LTkEtVxhhjkmOBwxhjTCQWOIwxxkRigcMYY0wkFjiMMcZEYoHDZIqIdLoVV58Rkf92yy94\n3e+h4vmXiM8/0k3RrHZ8W0RkqMf1QSLyf0Xk1255k0dF5LxqXycLRORtIjLd57YhbgXqAyJyR73H\nZtJlgcNkzWuq+jZVfStOldArS28UR0FVp1eT566qO1X1L+IabIm7gN8Bp6vqOTjnMHoFmJx5G07N\nJC+HgH8APlu/4ZissMBhsmw1cJqItInIZhH5fzgnv8cUP/m7t20SkX93+xgsc0/gIyKnicgKcXqd\nrBeRN7v3f8a9/eMissSdHfxKRD5ffGER+a47c9hYqTiiiLwZOA+4QVW7AFT1RVX9gXv7PHcG9Uyx\nz4I7judE5Fsi8ksRuUdELhKRn7pjOde9340i8h8i8jP3+t+410VEbnGfc4OIfMi9fqH75yn29rjH\nPTWNiJwjIj9y/1yPiNN+APf+XxGnL8wvRWSyexDxi8CH3Bngh0r/zKr6qqr+BCeAmL5GVe3LvjLz\nBRxwv/cDlgBzgTacarznl9xvC84n+jacU8Fvc6/fD/yl+/PjwGXuz804ZbXbgGfcax/HqZo6BDgO\nJyhNcG87yf1evD6k9HXLxjwDeNDnz3MOsAE4HhgEbMSp1lwc91k4H+DWAYsAwSnM91338TcCT7nj\nGApswynY9+fAcqAJ54T8VpwqBBcCf8ApmFkAfgZMwinX/xgwzH3eDwGL3J8fBb7q/jwdWFHy93NH\nhf9eFe9jX4331aglR0x+HVdSJmY1Ts2xkcBv1L+51YuqWnzMOqDNrVE1SlUfBFDVQwDuh+9Sy1V1\nr3vbAzhvsmuBT4nIZe59xgCnA3ur+PNMwgkqr5a8xmRgqTvuDe71jcBKVVUR2YATWIqWqOprwGsi\nsgqnUOMk4F5V7QR+KyI/AiYC+4A1qrrdfd4n3ed6BXgrsNz9O2jCCZpFxUKg68pe25heLHCYrHlN\nnTLo3dw3ulcDHnO45OdOnE/nYZXX3FERuRCnsuqfqOpBEXkUZ8biZyNwtog0uW/kYZWOu6vk9y56\n/tvsNcYIz9vpPpcAG1XVryXt4bL7G+PL9jhMQ1Kn8+N2EXk/gIgM9MnQulhETnL3Rd4P/BQ4Efi9\nGzT+CDi/wmv9GmeW8oWS/YQ2EXkPzqzp/eJUaD0euMy9FsVMcXrTD8FZinrCfY4PidMEbBhOB8E1\nAc+xGRgmbi9zEekvImdWeN39QGvEsZo+wAKHaWQfxVlyehpnff9kj/uswenb8jTwHVVdi9NIp5+I\nbAJuAsL0f/8Ezl7D8+7m+7dwml6td39eg7Pncpeq/iLin+NpnD4tPwe+pKo7cXqLPI2z//E/wOfU\nKc/uSVWPAH8BfEVEnsJpFFSp
N8Uq4AyvzXHo7uf+NeDjIrJdRM6I+OcyOWXVcU2fJSIfx9kM/2Ta\nY/EjIjfiJAzcmvZYjCmyGYcxxphIbMZhjDEmEptxGGOMicQChzHGmEgscBhjjInEAocxxphILHAY\nY4yJxAKHMcaYSP4/R2+xkGhBEnYAAAAASUVORK5CYII=\n", 282 | "text/plain": [ 283 | "" 284 | ] 285 | }, 286 | "metadata": {}, 287 | "output_type": "display_data" 288 | } 289 | ], 290 | "source": [ 291 | "from __future__ import print_function\n", 292 | "from sklearn import datasets\n", 293 | "import matplotlib.pyplot as plt\n", 294 | "import matplotlib.cm as cmx\n", 295 | "import matplotlib.colors as colors\n", 296 | "import numpy as np\n", 297 | "%matplotlib inline\n", 298 | "\n", 299 | "\n", 300 | "\n", 301 | "def shuffle_data(X, y, seed=None):\n", 302 | " if seed:\n", 303 | " np.random.seed(seed)\n", 304 | "\n", 305 | " idx = np.arange(X.shape[0])\n", 306 | " np.random.shuffle(idx)\n", 307 | "\n", 308 | " return X[idx], y[idx]\n", 309 | "\n", 310 | "\n", 311 | "\n", 312 | "# 正规化数据集 X\n", 313 | "def normalize(X, axis=-1, p=2):\n", 314 | " lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis))\n", 315 | " lp_norm[lp_norm == 0] = 1\n", 316 | " return X / np.expand_dims(lp_norm, axis)\n", 317 | "\n", 318 | "\n", 319 | "# 标准化数据集 X\n", 320 | "def standardize(X):\n", 321 | " X_std = np.zeros(X.shape)\n", 322 | " mean = X.mean(axis=0)\n", 323 | " std = X.std(axis=0)\n", 324 | "\n", 325 | " # 做除法运算时请永远记住分母不能等于0的情形\n", 326 | " # X_std = (X - X.mean(axis=0)) / X.std(axis=0) \n", 327 | " for col in range(np.shape(X)[1]):\n", 328 | " if std[col]:\n", 329 | " X_std[:, col] = (X_std[:, col] - mean[col]) / std[col]\n", 330 | "\n", 331 | " return X_std\n", 332 | "\n", 333 | "\n", 334 | "# 划分数据集为训练集和测试集\n", 335 | "def train_test_split(X, y, test_size=0.2, shuffle=True, seed=None):\n", 336 | " if shuffle:\n", 337 | " X, y = shuffle_data(X, y, seed)\n", 338 | "\n", 339 | " n_train_samples = int(X.shape[0] * (1-test_size))\n", 340 | " x_train, x_test = X[:n_train_samples], X[n_train_samples:]\n", 341 | " y_train, y_test = y[:n_train_samples], y[n_train_samples:]\n", 342 | "\n", 343 | " return x_train, x_test, y_train, y_test\n", 344 | "\n", 345 | "\n", 346 | "\n", 347 | "# 计算矩阵X的协方差矩阵\n", 348 | "def calculate_covariance_matrix(X, Y=np.empty((0,0))):\n", 349 | " if not Y.any():\n", 350 | " Y = X\n", 351 | " n_samples = np.shape(X)[0]\n", 352 | " covariance_matrix = (1 / (n_samples-1)) * (X - X.mean(axis=0)).T.dot(Y - Y.mean(axis=0))\n", 353 | "\n", 354 | " return np.array(covariance_matrix, dtype=float)\n", 355 | "\n", 356 | "\n", 357 | "# 计算数据集X每列的方差\n", 358 | "def calculate_variance(X):\n", 359 | " n_samples = np.shape(X)[0]\n", 360 | " variance = (1 / n_samples) * np.diag((X - X.mean(axis=0)).T.dot(X - X.mean(axis=0)))\n", 361 | " return variance\n", 362 | "\n", 363 | "\n", 364 | "# 计算数据集X每列的标准差\n", 365 | "def calculate_std_dev(X):\n", 366 | " std_dev = np.sqrt(calculate_variance(X))\n", 367 | " return std_dev\n", 368 | "\n", 369 | "\n", 370 | "# 计算相关系数矩阵\n", 371 | "def calculate_correlation_matrix(X, Y=np.empty([0])):\n", 372 | " # 先计算协方差矩阵\n", 373 | " covariance_matrix = calculate_covariance_matrix(X, Y)\n", 374 | " # 计算X, Y的标准差\n", 375 | " std_dev_X = np.expand_dims(calculate_std_dev(X), 1)\n", 376 | " std_dev_y = np.expand_dims(calculate_std_dev(Y), 1)\n", 377 | " correlation_matrix = np.divide(covariance_matrix, std_dev_X.dot(std_dev_y.T))\n", 378 | "\n", 379 | " return np.array(correlation_matrix, dtype=float)\n", 380 | "\n", 381 | "\n", 382 | "\n", 383 | "class PCA():\n", 384 | " \"\"\"\n", 385 | " 主成份分析算法PCA,非监督学习算法.\n", 386 | " \"\"\"\n", 387 | " def __init__(self):\n", 388 | " 
self.eigen_values = None\n", 389 | " self.eigen_vectors = None\n", 390 | " self.k = 2\n", 391 | "\n", 392 | " def transform(self, X):\n", 393 | " \"\"\" \n", 394 | " 将原始数据集X通过PCA进行降维\n", 395 | " \"\"\"\n", 396 | " covariance = calculate_covariance_matrix(X)\n", 397 | "\n", 398 | " # 求解特征值和特征向量\n", 399 | " self.eigen_values, self.eigen_vectors = np.linalg.eig(covariance)\n", 400 | "\n", 401 | " # 将特征值从大到小进行排序,注意特征向量是按列排的,即self.eigen_vectors第k列是self.eigen_values中第k个特征值对应的特征向量\n", 402 | " idx = self.eigen_values.argsort()[::-1]\n", 403 | " eigenvalues = self.eigen_values[idx][:self.k]\n", 404 | " eigenvectors = self.eigen_vectors[:, idx][:, :self.k]\n", 405 | "\n", 406 | " # 将原始数据集X映射到低维空间\n", 407 | " X_transformed = X.dot(eigenvectors)\n", 408 | "\n", 409 | " return X_transformed\n", 410 | "\n", 411 | "\n", 412 | "def main():\n", 413 | " # Load the dataset\n", 414 | " data = datasets.load_iris()\n", 415 | " X = data.data\n", 416 | " y = data.target\n", 417 | "\n", 418 | " # 将数据集X映射到低维空间\n", 419 | " X_trans = PCA().transform(X)\n", 420 | "\n", 421 | " x1 = X_trans[:, 0]\n", 422 | " x2 = X_trans[:, 1]\n", 423 | "\n", 424 | " cmap = plt.get_cmap('viridis')\n", 425 | " colors = [cmap(i) for i in np.linspace(0, 1, len(np.unique(y)))]\n", 426 | "\n", 427 | " class_distr = []\n", 428 | " # Plot the different class distributions\n", 429 | " for i, l in enumerate(np.unique(y)):\n", 430 | " _x1 = x1[y == l]\n", 431 | " _x2 = x2[y == l]\n", 432 | " _y = y[y == l]\n", 433 | " class_distr.append(plt.scatter(_x1, _x2, color=colors[i]))\n", 434 | "\n", 435 | " # Add a legend\n", 436 | " plt.legend(class_distr, y, loc=1)\n", 437 | "\n", 438 | " # Axis labels\n", 439 | " plt.xlabel('Principal Component 1')\n", 440 | " plt.ylabel('Principal Component 2')\n", 441 | " plt.show()\n", 442 | "\n", 443 | "\n", 444 | "if __name__ == \"__main__\":\n", 445 | " main()\n" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "metadata": {}, 451 | "source": [ 452 | "参考文献:\n", 453 | "\n", 454 | "《模式识别和机器学习》\n", 455 | "\n", 456 | "http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020216.html" 457 | ] 458 | } 459 | ], 460 | "metadata": { 461 | "kernelspec": { 462 | "display_name": "Python 3", 463 | "language": "python", 464 | "name": "python3" 465 | }, 466 | "language_info": { 467 | "codemirror_mode": { 468 | "name": "ipython", 469 | "version": 3 470 | }, 471 | "file_extension": ".py", 472 | "mimetype": "text/x-python", 473 | "name": "python", 474 | "nbconvert_exporter": "python", 475 | "pygments_lexer": "ipython3", 476 | "version": "3.6.1" 477 | } 478 | }, 479 | "nbformat": 4, 480 | "nbformat_minor": 2 481 | } 482 | -------------------------------------------------------------------------------- /PCA.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Allensmile/Machine-learning-implement/HEAD/PCA.zip -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Machine-learning-implement 2 | 3 | Teach you how to implement machine learning algorithms 4 | 5 | 最近萌生了实现一些机器学习算法的想法,主要基于以下几个原因: 6 | 7 | 1. 希望可以帮助一些刚刚入门的小伙伴。 8 | 2. 通过实现算法,加深自己对算法理论的理解。 9 | 3. 弄清楚数学语言如何转换为计算机语言。 10 | 4. 可以了解很多看书学不到的各种trick,所有算法几乎都有坑。比如hyper-parameter什么意义怎么设,怎么初始化,numerical stability的怎么保证,如何保证矩阵正定,计算机rounding error的影响,numerical underflow和overflow问题等等。 11 | 5. 
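A small follow-up to the PCA notebook above: the eigenvalues that `transform()` sorts already tell you how much variance the first k components keep, which the notebook does not report. The sketch below is illustrative only; the helper name `explained_variance_ratio` is mine, not the notebook's, and it assumes the same iris data and the same covariance-eigendecomposition route used above.

```python
import numpy as np
from sklearn import datasets


def explained_variance_ratio(X, k=2):
    """Fraction of total variance kept by the top-k principal components."""
    X_centered = X - X.mean(axis=0)
    # same covariance matrix as the notebook (up to the unbiased 1/(n-1) factor)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is enough here because a covariance matrix is symmetric
    eigen_values = np.linalg.eigh(cov)[0]
    eigen_values = np.sort(eigen_values)[::-1]
    return eigen_values[:k].sum() / eigen_values.sum()


if __name__ == "__main__":
    X = datasets.load_iris().data
    # for iris this is roughly 0.98, which is why the 2-D scatter plot above
    # still separates the three classes so cleanly
    print(explained_variance_ratio(X, k=2))
```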
对整个领域各个算法的关联有更深刻的了解,思维形成一个关系网。看到一个算法就会自然的去想跟其他算法的联系,怎么去扩展。如果一篇paper我不能把它纳入到这个关系网里,我就觉得自己没懂。要么推出联系,要么推出矛盾证明这篇paper垃圾。 12 | 6. 实现这样一个项目会花掉我大量的时间,尽管这样,我还是会在时间允许的情况下会尽快更新项目。 13 | 14 | -------------------------------------------------------------------------------- /data/china.csv: -------------------------------------------------------------------------------- 1 | 北京 ;116.46;39.92 天津 ;117.2;39.13 上海 ;121.48;31.22 重庆 ;106.54;29.59 拉萨 ;91.11;29.97 乌鲁木齐 ;87.68;43.77 银川 ;106.27;38.47 呼和浩特 ;111.65;40.82 南宁 ;108.33;22.84 哈尔滨 ;126.63;45.75 长春 ;125.35;43.88 沈阳 ;123.38;41.8 石家庄 ;114.48;38.03 太原 ;112.53;37.87 西宁 ;101.74;36.56 济南 ;117;36.65 郑州 ;113.6;34.76 南京;118.78;32.04 合肥;117.27;31.86 杭州;120.19;30.26 福州;119.3;26.08 南昌;115.89;28.68 长沙;113;28.21 武汉;114.31;30.52 广州;113.23;23.16 台北;121.5;25.05 海口;110.35;20.02 兰州;103.73;36.03 西安;108.95;34.27 成都;104.06;30.67 贵阳;106.71;26.57 昆明;102.73;25.04 香港;114.1;22.2 澳门;113.33;22.13 -------------------------------------------------------------------------------- /evolutionary(experiment 2, Cracking Passwords).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "上一篇文章我们动手实验了用遗传算法求解函数在给定区间的最大值。本篇文章再来看一个实验:用遗传算法破解密码。\n", 8 | "\n", 9 | "在这个问题中,我们的个体就是一串字符串了,其目的就是找到一个与密码完全相同的字符串。基本步骤与前一篇文章基本类似,不过在本问题中,我们用字符的ASCII值来表示个体(字符串)的DNA。其它的就不多说了,还是看详细代码吧:" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [ 19 | { 20 | "name": "stdout", 21 | "output_type": "stream", 22 | "text": [ 23 | "第0 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t IJ6{Kvl\\o\"y\n", 24 | "第10 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I loveqyouJ\n", 25 | "第20 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I loveWyoue\n", 26 | "第30 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I love$you#\n", 27 | "第40 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovelyou$\n", 28 | "第50 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I love*youZ\n", 29 | "第60 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovePyouZ\n", 30 | "第70 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I love you'\n", 31 | "第80 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovemyou$\n", 32 | "第90 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovemyouH\n", 33 | "第100 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovebyou!\n", 34 | "第110 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovesyou!\n", 35 | "第120 次进化后, 基因最好的个体(与欲破解的密码最接近)是: \t I lovebyou!\n", 36 | "第125 次进化后, 找到了密码: \t I love you!\n" 37 | ] 38 | } 39 | ], 40 | "source": [ 41 | "import numpy as np\n", 42 | "\n", 43 | "\n", 44 | "class GeneticAlgorithm(object):\n", 45 | " \"\"\"遗传算法.\n", 46 | "\n", 47 | " Parameters:\n", 48 | " -----------\n", 49 | " cross_rate: float\n", 50 | " 交配的可能性大小.\n", 51 | " mutate_rate: float\n", 52 | " 基因突变的可能性大小. 
\n", 53 | " n_population: int\n", 54 | " 种群的大小.\n", 55 | " n_iterations: int\n", 56 | " 迭代次数.\n", 57 | " password: str\n", 58 | " 欲破解的密码.\n", 59 | " \"\"\"\n", 60 | " def __init__(self, cross_rate, mutation_rate, n_population, n_iterations, password):\n", 61 | " self.cross_rate = cross_rate\n", 62 | " self.mutate_rate = mutation_rate\n", 63 | " self.n_population = n_population\n", 64 | " self.n_iterations = n_iterations\n", 65 | " self.password = password # 要破解的密码\n", 66 | " self.password_size = len(self.password) # 要破解密码的长度\n", 67 | " self.password_ascii = np.fromstring(self.password, dtype=np.uint8) # 将password转换成ASCII\n", 68 | " self.ascii_bounder = [32, 126+1]\n", 69 | " \n", 70 | "\n", 71 | " # 初始化一个种群\n", 72 | " def init_population(self):\n", 73 | " population = np.random.randint(low=self.ascii_bounder[0], high=self.ascii_bounder[1], \n", 74 | " size=(self.n_population, self.password_size)).astype(np.int8)\n", 75 | " return population\n", 76 | "\n", 77 | " # 将个体的DNA转换成ASCII\n", 78 | " def translateDNA(self, DNA): # convert to readable string\n", 79 | " return DNA.tostring().decode('ascii')\n", 80 | "\n", 81 | " # 计算种群中每个个体的适应度,适应度越高,说明该个体的基因越好\n", 82 | " def fitness(self, population):\n", 83 | " match_num = (population == self.password_ascii).sum(axis=1)\n", 84 | " return match_num\n", 85 | "\n", 86 | " # 对种群按照其适应度进行采样,这样适应度高的个体就会以更高的概率被选择\n", 87 | " def select(self, population):\n", 88 | " fitness = self.fitness(population) + 1e-4 # add a small amount to avoid all zero fitness\n", 89 | " idx = np.random.choice(np.arange(self.n_population), size=self.n_population, replace=True, p=fitness/fitness.sum())\n", 90 | " return population[idx]\n", 91 | "\n", 92 | " # 进行交配\n", 93 | " def create_child(self, parent, pop):\n", 94 | " if np.random.rand() < self.cross_rate:\n", 95 | " index = np.random.randint(0, self.n_population, size=1) # select another individual from pop\n", 96 | " cross_points = np.random.randint(0, 2, self.password_size).astype(np.bool) # choose crossover points\n", 97 | " parent[cross_points] = pop[index, cross_points] # mating and produce one child\n", 98 | " #child = parent\n", 99 | " return parent\n", 100 | "\n", 101 | " # 基因突变\n", 102 | " def mutate_child(self, child):\n", 103 | " for point in range(self.password_size):\n", 104 | " if np.random.rand() < self.mutate_rate:\n", 105 | " child[point] = np.random.randint(*self.ascii_bounder) # choose a random ASCII index\n", 106 | " return child\n", 107 | "\n", 108 | " # 进化\n", 109 | " def evolution(self):\n", 110 | " population = self.init_population()\n", 111 | " for i in range(self.n_iterations):\n", 112 | " fitness = self.fitness(population)\n", 113 | " \n", 114 | " best_person = population[np.argmax(fitness)]\n", 115 | " best_person_ascii = self.translateDNA(best_person)\n", 116 | " \n", 117 | " if i % 10 == 0:\n", 118 | " print(u'第%-4d次进化后, 基因最好的个体(与欲破解的密码最接近)是: \\t %s'% (i, best_person_ascii))\n", 119 | " \n", 120 | " if best_person_ascii == self.password:\n", 121 | " print(u'第%-4d次进化后, 找到了密码: \\t %s'% (i, best_person_ascii))\n", 122 | " break\n", 123 | " \n", 124 | " population = self.select(population)\n", 125 | " population_copy = population.copy()\n", 126 | " \n", 127 | " for parent in population:\n", 128 | " child = self.create_child(parent, population_copy)\n", 129 | " child = self.mutate_child(child)\n", 130 | " parent[:] = child\n", 131 | " \n", 132 | " population = population\n", 133 | " \n", 134 | "def main():\n", 135 | " password = 'I love you!' 
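# NOTE (editor's sketch, not part of the original notebook): on recent NumPy releases
# several calls used here are deprecated or removed: np.fromstring for binary data,
# ndarray.tostring, and the np.bool alias. The lines below show modern equivalents for
# the same conversions; `_pw` is only an illustrative variable name.
import numpy as np

_pw = 'I love you!'
_pw_ascii = np.frombuffer(_pw.encode('ascii'), dtype=np.uint8)   # instead of np.fromstring(_pw, dtype=np.uint8)
_cross_points = np.random.randint(0, 2, len(_pw)).astype(bool)   # instead of .astype(np.bool)
_decoded = _pw_ascii.tobytes().decode('ascii')                   # instead of .tostring().decode('ascii')
assert _decoded == _pw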
# 要破解的密码\n", 136 | " \n", 137 | " ga = GeneticAlgorithm(cross_rate=0.8, mutation_rate=0.01, n_population=300, n_iterations=500, password=password)\n", 138 | " \n", 139 | " ga.evolution()\n", 140 | "\n", 141 | "if __name__ == '__main__':\n", 142 | " main()\n" 143 | ] 144 | } 145 | ], 146 | "metadata": { 147 | "kernelspec": { 148 | "display_name": "Python [conda root]", 149 | "language": "python", 150 | "name": "conda-root-py" 151 | }, 152 | "language_info": { 153 | "codemirror_mode": { 154 | "name": "ipython", 155 | "version": 2 156 | }, 157 | "file_extension": ".py", 158 | "mimetype": "text/x-python", 159 | "name": "python", 160 | "nbconvert_exporter": "python", 161 | "pygments_lexer": "ipython2", 162 | "version": "2.7.12" 163 | } 164 | }, 165 | "nbformat": 4, 166 | "nbformat_minor": 1 167 | } 168 | -------------------------------------------------------------------------------- /evolutionary1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Allensmile/Machine-learning-implement/HEAD/evolutionary1.jpg -------------------------------------------------------------------------------- /evolutionary2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Allensmile/Machine-learning-implement/HEAD/evolutionary2.jpg -------------------------------------------------------------------------------- /evolutionary3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Allensmile/Machine-learning-implement/HEAD/evolutionary3.jpg -------------------------------------------------------------------------------- /k_nearest_neighbors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# K近邻分类算法 (K-Nearest Neighbor) \n", 8 | "\n", 9 | "\n", 10 | "KNN分类算法非常简单,该算法的核心思想是如果一个样本在特征空间中的k个最相邻的样本中的大多数属于某一个类别,则该样本也属于这个类别。该方法在确定分类决策上只依据最邻近K个样本的类别来决定待分样本所属的类别。KNN是一个懒惰算法,也就是说在平时不好好学习,考试(对测试样本分类)时才临阵发力(临时去找k个近邻),因此在预测的时候速度比较慢。\n", 11 | "\n", 12 | "KNN模型是非参数模型,既然有非参数模型,那就肯定还有参数模型,那何为参数模型与非参数模型呢?\n", 13 | "\n", 14 | "\n", 15 | "\n", 16 | "## 1. 参数模型与非参数模型\n", 17 | "\n", 18 | "### 1.1 参数模型\n", 19 | "\n", 20 | "参数模型是指选择某种形式的函数并通过机器学习用一系列固定个数的参数尽可能表征这些数据的某种模式。参数模型具有如下特征:\n", 21 | "1. 不管数据量有多大,在模型确定了,参数的个数就确定了,即参数个数不随着样本量的增大而增加,从关系上说它们相互独立;\n", 22 | "2. 一般参数模型会对数据有一定的假设,如分布的假设,空间的假设等,并且这些假设可以由参数来描述;\n", 23 | "3. 参数模型预测速度快。\n", 24 | "\n", 25 | "常用参数学习的模型有: \n", 26 | "\n", 27 | "* 回归模型(线性回归、岭回归、lasso回归、多项式回归)\n", 28 | "* 逻辑回归\n", 29 | "* 线性判别分析(Linear Discriminant Analysis)\n", 30 | "* 感知器\n", 31 | "* 朴素贝叶斯\n", 32 | "* 神经网络\n", 33 | "* 使用线性核的SVM\n", 34 | "* Mixture models\n", 35 | "* K-means\n", 36 | "* Hidden Markov models\n", 37 | "* Factor analysis / pPCA / PMF\n", 38 | "\n", 39 | "\n", 40 | "### 1.2 非参数模型\n", 41 | "\n", 42 | "非参数模型是指系统的数学模型中非显式地包含可估参数。注意不要被名字误导,非参不等于无参。非参数模型具有以下特征:\n", 43 | "\n", 44 | "1. 数据决定了函数形式,函数参数个数不固定;\n", 45 | "2. 随着训练数据量的增加,参数个数一般也会随之增长,模型越来越大;\n", 46 | "3. 对数据本身做较少的先验假设;\n", 47 | "4. 预测速度慢。\n", 48 | "\n", 49 | "一些常用的非参学习模型: \n", 50 | "\n", 51 | "* k-Nearest Neighbors\n", 52 | "* Decision Trees like CART and C4.5\n", 53 | "* 使用非线性核的SVM\n", 54 | "* Gradient Boosted Decision Trees\n", 55 | "* Gaussian processes for regression\n", 56 | "* Dirichlet process mixtures\n", 57 | "* infinite HMMs\n", 58 | "* infinite latent factor models\n", 59 | "\n", 60 | "\n", 61 | "## 2. 
KNN算法步骤:\n", 62 | "\n", 63 | "1. 准备数据,对数据进行预处理;\n", 64 | "\n", 65 | "2. 设定参数,如k;\n", 66 | "\n", 67 | "3. 遍历测试集,\n", 68 | "\n", 69 | " 对测试集中每个样本,计算该样本(测试集中)到训练集中每个样本的距离;\n", 70 | "\n", 71 | " 取出训练集中到该样本(测试集中)的距离最小的k个样本的类别标签;\n", 72 | "\n", 73 | " 对类别标签进行计数,类别标签次数最多的就是该样本(测试集中)的类别标签。\n", 74 | "\n", 75 | "\n", 76 | "4. 遍历完毕.\n", 77 | "\n", 78 | "\n", 79 | "\n", 80 | "\n", 81 | "\n", 82 | "## 3. KNN算法优点和缺点\n", 83 | "\n", 84 | "### 3.1 KNN算法优点\n", 85 | "\n", 86 | "1. 简单,易于理解,易于实现,无需估计参数,无需训练;\n", 87 | "\n", 88 | "2. 适合对稀有事件进行分类;\n", 89 | "\n", 90 | "3. 特别适合于多分类问题(multi-modal,对象具有多个类别标签), kNN比SVM的表现要好;\n", 91 | "\n", 92 | "4. 由于KNN方法主要靠周围有限的邻近的样本,而不是靠判别类域的方法来确定所属类别的,因此对于类域的交叉或重叠较多的待分样本集来说,KNN方法较其他方法更为适合;\n", 93 | "\n", 94 | "5. 该算法比较适用于样本容量比较大的类域的自动分类,而那些样本容量较小的类域采用这种算法比较容易产生误分。\n", 95 | "\n", 96 | "\n", 97 | "\n", 98 | "### 3.2 KNN算法缺点\n", 99 | "\n", 100 | "1. 该算法在分类时有个主要的不足是,当样本不平衡时,如一个类的样本容量很大,而其他类样本容量很小时,有可能导致当输入一个新样本时,该样本的K个邻居中大容量类的样本占多数。 该算法只计算“最近的”邻居样本,某一类的样本数量很大,那么或者这类样本并不接近目标样本,或者这类样本很靠近目标样本。无论怎样,数量并不能影响运行结果。可以采用权值的方法(和该样本距离小的邻居权值大)来改进;\n", 101 | "\n", 102 | "2. 该方法的另一个不足之处是计算量较大,因为对每一个待分类的文本都要计算它到全体已知样本的距离,才能求得它的K个最近邻点。\n", 103 | "\n", 104 | "3. 属于硬分类,即直接给出这个样本的类别,并不是给出这个样本有多大的可能性属于该类别;\n", 105 | "\n", 106 | "4. 可解释性差,无法给出像决策树那样的规则;\n", 107 | "\n", 108 | "5. 计算量较大。目前常用的解决方法是事先对已知样本点进行剪辑,事先去除对分类作用不大的样本。\n" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 1, 114 | "metadata": { 115 | "scrolled": false 116 | }, 117 | "outputs": [ 118 | { 119 | "name": "stdout", 120 | "output_type": "stream", 121 | "text": [ 122 | "Accuracy: 0.893939393939\n" 123 | ] 124 | } 125 | ], 126 | "source": [ 127 | "from __future__ import print_function\n", 128 | "import math\n", 129 | "import numpy as np\n", 130 | "from sklearn import datasets\n", 131 | "import matplotlib.pyplot as plt\n", 132 | "from collections import Counter\n", 133 | "from sklearn.datasets import make_classification\n", 134 | "%matplotlib inline\n", 135 | "\n", 136 | "\n", 137 | "def shuffle_data(X, y, seed=None):\n", 138 | " if seed:\n", 139 | " np.random.seed(seed)\n", 140 | " \n", 141 | " idx = np.arange(X.shape[0])\n", 142 | " np.random.shuffle(idx)\n", 143 | " \n", 144 | " return X[idx], y[idx]\n", 145 | "\n", 146 | "\n", 147 | "\n", 148 | "# 正规化数据集 X\n", 149 | "def normalize(X, axis=-1, p=2):\n", 150 | " lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis))\n", 151 | " lp_norm[lp_norm == 0] = 1\n", 152 | " return X / np.expand_dims(lp_norm, axis)\n", 153 | "\n", 154 | "\n", 155 | "# 标准化数据集 X\n", 156 | "def standardize(X):\n", 157 | " X_std = np.zeros(X.shape)\n", 158 | " mean = X.mean(axis=0)\n", 159 | " std = X.std(axis=0)\n", 160 | "\n", 161 | " # 做除法运算时请永远记住分母不能等于0的情形\n", 162 | " # X_std = (X - X.mean(axis=0)) / X.std(axis=0) \n", 163 | " for col in range(np.shape(X)[1]):\n", 164 | " if std[col]:\n", 165 | " X_std[:, col] = (X_std[:, col] - mean[col]) / std[col]\n", 166 | "\n", 167 | " return X_std\n", 168 | "\n", 169 | "\n", 170 | "# 划分数据集为训练集和测试集\n", 171 | "def train_test_split(X, y, test_size=0.2, shuffle=True, seed=None):\n", 172 | " if shuffle:\n", 173 | " X, y = shuffle_data(X, y, seed)\n", 174 | "\n", 175 | " n_train_samples = int(X.shape[0] * (1-test_size))\n", 176 | " x_train, x_test = X[:n_train_samples], X[n_train_samples:]\n", 177 | " y_train, y_test = y[:n_train_samples], y[n_train_samples:]\n", 178 | "\n", 179 | " return x_train, x_test, y_train, y_test\n", 180 | "\n", 181 | "def accuracy(y, y_pred):\n", 182 | " y = y.reshape(y.shape[0], -1)\n", 183 | " y_pred = y_pred.reshape(y_pred.shape[0], 
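# NOTE (editor's sketch, not part of the original notebook): as written, standardize()
# above builds X_std from zeros and then standardizes X_std itself, so every column ends
# up filled with the constant -mean/std instead of the standardized data. A minimal
# corrected version follows; it is renamed standardize_fixed here only to avoid clashing
# with the original definition.
import numpy as np

def standardize_fixed(X):
    """Column-wise standardization of X, skipping zero-variance columns."""
    X_std = np.zeros(X.shape)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    for col in range(np.shape(X)[1]):
        if std[col]:
            # the right-hand side must read from X, not from X_std
            X_std[:, col] = (X[:, col] - mean[col]) / std[col]
    return X_std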
-1)\n", 184 | " return np.sum(y == y_pred)/len(y)\n", 185 | "\n", 186 | "\n", 187 | "class KNN():\n", 188 | " \"\"\" K近邻分类算法.\n", 189 | "\n", 190 | " Parameters:\n", 191 | " -----------\n", 192 | " k: int\n", 193 | " 最近邻个数.\n", 194 | " \"\"\"\n", 195 | " def __init__(self, k=5):\n", 196 | " self.k = k\n", 197 | "\n", 198 | " # 计算一个样本与训练集中所有样本的欧氏距离的平方\n", 199 | " def euclidean_distance(self, one_sample, X_train):\n", 200 | " one_sample = one_sample.reshape(1, -1)\n", 201 | " X_train = X_train.reshape(X_train.shape[0], -1)\n", 202 | " distances = np.power(np.tile(one_sample, (X_train.shape[0], 1)) - X_train, 2).sum(axis=1)\n", 203 | " return distances\n", 204 | " \n", 205 | " # 获取k个近邻的类别标签\n", 206 | " def get_k_neighbor_labels(self, distances, y_train, k):\n", 207 | " k_neighbor_labels = []\n", 208 | " for distance in np.sort(distances)[:k]:\n", 209 | "\n", 210 | " label = y_train[distances==distance]\n", 211 | " k_neighbor_labels.append(label)\n", 212 | "\n", 213 | " return np.array(k_neighbor_labels).reshape(-1, )\n", 214 | " \n", 215 | " # 进行标签统计,得票最多的标签就是该测试样本的预测标签\n", 216 | " def vote(self, one_sample, X_train, y_train, k):\n", 217 | " distances = self.euclidean_distance(one_sample, X_train)\n", 218 | " y_train = y_train.reshape(y_train.shape[0], 1)\n", 219 | " k_neighbor_labels = self.get_k_neighbor_labels(distances, y_train, k)\n", 220 | " \n", 221 | " find_label, find_count = 0, 0\n", 222 | " for label, count in Counter(k_neighbor_labels).items():\n", 223 | " if count > find_count:\n", 224 | " find_count = count\n", 225 | " find_label = label\n", 226 | " return find_label\n", 227 | " \n", 228 | " # 对测试集进行预测\n", 229 | " def predict(self, X_test, X_train, y_train):\n", 230 | " y_pred = []\n", 231 | " for sample in X_test:\n", 232 | " label = self.vote(sample, X_train, y_train, self.k)\n", 233 | " y_pred.append(label)\n", 234 | " return np.array(y_pred)\n", 235 | "\n", 236 | "\n", 237 | "def main():\n", 238 | " data = make_classification(n_samples=200, n_features=4, n_informative=2, \n", 239 | " n_redundant=2, n_repeated=0, n_classes=2)\n", 240 | " X, y = data[0], data[1]\n", 241 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True)\n", 242 | " clf = KNN(k=5)\n", 243 | " y_pred = clf.predict(X_test, X_train, y_train)\n", 244 | " \n", 245 | " accu = accuracy(y_test, y_pred)\n", 246 | " print (\"Accuracy:\", accu)\n", 247 | "\n", 248 | "\n", 249 | "if __name__ == \"__main__\":\n", 250 | " main()\n" 251 | ] 252 | } 253 | ], 254 | "metadata": { 255 | "kernelspec": { 256 | "display_name": "Python 3", 257 | "language": "python", 258 | "name": "python3" 259 | }, 260 | "language_info": { 261 | "codemirror_mode": { 262 | "name": "ipython", 263 | "version": 3 264 | }, 265 | "file_extension": ".py", 266 | "mimetype": "text/x-python", 267 | "name": "python", 268 | "nbconvert_exporter": "python", 269 | "pygments_lexer": "ipython3", 270 | "version": "3.6.1" 271 | } 272 | }, 273 | "nbformat": 4, 274 | "nbformat_minor": 2 275 | } 276 | -------------------------------------------------------------------------------- /k_nearest_neighbors.md: -------------------------------------------------------------------------------- 1 | 2 | # K近邻分类算法 (K-Nearest Neighbor) 3 | 4 | 5 | KNN分类算法非常简单,该算法的核心思想是如果一个样本在特征空间中的k个最相邻的样本中的大多数属于某一个类别,则该样本也属于这个类别。该方法在确定分类决策上只依据最邻近K个样本的类别来决定待分样本所属的类别。KNN是一个懒惰算法,也就是说在平时不好好学习,考试(对测试样本分类)时才临阵发力(临时去找k个近邻),因此在预测的时候速度比较慢。 6 | 7 | KNN模型是非参数模型,既然有非参数模型,那就肯定还有参数模型,那何为参数模型与非参数模型呢? 8 | 9 | 10 | 11 | ## 1. 
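One detail of the KNN implementation above worth flagging: `get_k_neighbor_labels` collects labels by matching distance values, so when several training points share a distance the vote can involve more than k labels. A common alternative is to pick the k nearest samples by index with `np.argsort`. The sketch below is a minimal illustration of that variant; the function name `knn_predict_one` is mine, not the notebook's, and it assumes plain NumPy arrays for `X_train` and `y_train`.

```python
import numpy as np
from collections import Counter


def knn_predict_one(x, X_train, y_train, k=5):
    """Majority vote over the k nearest training points, selected by index."""
    # squared Euclidean distance from x to every training sample
    distances = ((X_train - x) ** 2).sum(axis=1)
    # indices of the k smallest distances; ties are broken by position
    nearest_idx = np.argsort(distances)[:k]
    # most common label among the k neighbours
    return Counter(y_train[nearest_idx]).most_common(1)[0][0]


if __name__ == "__main__":
    # toy usage: two well separated 2-D clusters
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                        [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
    y_train = np.array([0, 0, 0, 1, 1, 1])
    print(knn_predict_one(np.array([0.05, 0.10]), X_train, y_train, k=3))  # expected 0
    print(knn_predict_one(np.array([5.05, 5.00]), X_train, y_train, k=3))  # expected 1
```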
参数模型与非参数模型 12 | 13 | ### 1.1 参数模型 14 | 15 | 参数模型是指选择某种形式的函数并通过机器学习用一系列固定个数的参数尽可能表征这些数据的某种模式。参数模型具有如下特征: 16 | 1. 不管数据量有多大,在模型确定了,参数的个数就确定了,即参数个数不随着样本量的增大而增加,从关系上说它们相互独立; 17 | 2. 一般参数模型会对数据有一定的假设,如分布的假设,空间的假设等,并且这些假设可以由参数来描述; 18 | 3. 参数模型预测速度快。 19 | 20 | 常用参数学习的模型有: 21 | 22 | * 回归模型(线性回归、岭回归、lasso回归、多项式回归) 23 | * 逻辑回归 24 | * 线性判别分析(Linear Discriminant Analysis) 25 | * 感知器 26 | * 朴素贝叶斯 27 | * 神经网络 28 | * 使用线性核的SVM 29 | * Mixture models 30 | * K-means 31 | * Hidden Markov models 32 | * Factor analysis / pPCA / PMF 33 | 34 | 35 | ### 1.2 非参数模型 36 | 37 | 非参数模型是指系统的数学模型中非显式地包含可估参数。注意不要被名字误导,非参不等于无参。非参数模型具有以下特征: 38 | 39 | 1. 数据决定了函数形式,函数参数个数不固定; 40 | 2. 随着训练数据量的增加,参数个数一般也会随之增长,模型越来越大; 41 | 3. 对数据本身做较少的先验假设; 42 | 4. 预测速度慢。 43 | 44 | 一些常用的非参学习模型: 45 | 46 | * k-Nearest Neighbors 47 | * Decision Trees like CART and C4.5 48 | * 使用非线性核的SVM 49 | * Gradient Boosted Decision Trees 50 | * Gaussian processes for regression 51 | * Dirichlet process mixtures 52 | * infinite HMMs 53 | * infinite latent factor models 54 | 55 | 56 | ## 2. KNN算法步骤: 57 | 58 | 1. 准备数据,对数据进行预处理; 59 | 60 | 2. 设定参数,如k; 61 | 62 | 3. 遍历测试集, 63 | 64 | 对测试集中每个样本,计算该样本(测试集中)到训练集中每个样本的距离; 65 | 66 | 取出训练集中到该样本(测试集中)的距离最小的k个样本的类别标签; 67 | 68 | 对类别标签进行计数,类别标签次数最多的就是该样本(测试集中)的类别标签。 69 | 70 | 71 | 4. 遍历完毕. 72 | 73 | 74 | 75 | 76 | 77 | ## 3. KNN算法优点和缺点 78 | 79 | ### 3.1 KNN算法优点 80 | 81 | 1. 简单,易于理解,易于实现,无需估计参数,无需训练; 82 | 83 | 2. 适合对稀有事件进行分类; 84 | 85 | 3. 特别适合于多分类问题(multi-modal,对象具有多个类别标签), kNN比SVM的表现要好; 86 | 87 | 4. 由于KNN方法主要靠周围有限的邻近的样本,而不是靠判别类域的方法来确定所属类别的,因此对于类域的交叉或重叠较多的待分样本集来说,KNN方法较其他方法更为适合; 88 | 89 | 5. 该算法比较适用于样本容量比较大的类域的自动分类,而那些样本容量较小的类域采用这种算法比较容易产生误分。 90 | 91 | 92 | 93 | ### 3.2 KNN算法缺点 94 | 95 | 1. 该算法在分类时有个主要的不足是,当样本不平衡时,如一个类的样本容量很大,而其他类样本容量很小时,有可能导致当输入一个新样本时,该样本的K个邻居中大容量类的样本占多数。 该算法只计算“最近的”邻居样本,某一类的样本数量很大,那么或者这类样本并不接近目标样本,或者这类样本很靠近目标样本。无论怎样,数量并不能影响运行结果。可以采用权值的方法(和该样本距离小的邻居权值大)来改进; 96 | 97 | 2. 该方法的另一个不足之处是计算量较大,因为对每一个待分类的文本都要计算它到全体已知样本的距离,才能求得它的K个最近邻点。 98 | 99 | 3. 属于硬分类,即直接给出这个样本的类别,并不是给出这个样本有多大的可能性属于该类别; 100 | 101 | 4. 可解释性差,无法给出像决策树那样的规则; 102 | 103 | 5. 
计算量较大。目前常用的解决方法是事先对已知样本点进行剪辑,事先去除对分类作用不大的样本。 104 | 105 | 106 | 107 | ```python 108 | from __future__ import print_function 109 | import math 110 | import numpy as np 111 | from sklearn import datasets 112 | import matplotlib.pyplot as plt 113 | from collections import Counter 114 | from sklearn.datasets import make_classification 115 | %matplotlib inline 116 | 117 | 118 | def shuffle_data(X, y, seed=None): 119 | if seed: 120 | np.random.seed(seed) 121 | 122 | idx = np.arange(X.shape[0]) 123 | np.random.shuffle(idx) 124 | 125 | return X[idx], y[idx] 126 | 127 | 128 | 129 | # 正规化数据集 X 130 | def normalize(X, axis=-1, p=2): 131 | lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis)) 132 | lp_norm[lp_norm == 0] = 1 133 | return X / np.expand_dims(lp_norm, axis) 134 | 135 | 136 | # 标准化数据集 X 137 | def standardize(X): 138 | X_std = np.zeros(X.shape) 139 | mean = X.mean(axis=0) 140 | std = X.std(axis=0) 141 | 142 | # 做除法运算时请永远记住分母不能等于0的情形 143 | # X_std = (X - X.mean(axis=0)) / X.std(axis=0) 144 | for col in range(np.shape(X)[1]): 145 | if std[col]: 146 | X_std[:, col] = (X_std[:, col] - mean[col]) / std[col] 147 | 148 | return X_std 149 | 150 | 151 | # 划分数据集为训练集和测试集 152 | def train_test_split(X, y, test_size=0.2, shuffle=True, seed=None): 153 | if shuffle: 154 | X, y = shuffle_data(X, y, seed) 155 | 156 | n_train_samples = int(X.shape[0] * (1-test_size)) 157 | x_train, x_test = X[:n_train_samples], X[n_train_samples:] 158 | y_train, y_test = y[:n_train_samples], y[n_train_samples:] 159 | 160 | return x_train, x_test, y_train, y_test 161 | 162 | def accuracy(y, y_pred): 163 | y = y.reshape(y.shape[0], -1) 164 | y_pred = y_pred.reshape(y_pred.shape[0], -1) 165 | return np.sum(y == y_pred)/len(y) 166 | 167 | 168 | class KNN(): 169 | """ K近邻分类算法. 170 | 171 | Parameters: 172 | ----------- 173 | k: int 174 | 最近邻个数. 
175 | """ 176 | def __init__(self, k=5): 177 | self.k = k 178 | 179 | # 计算一个样本与训练集中所有样本的欧氏距离的平方 180 | def euclidean_distance(self, one_sample, X_train): 181 | one_sample = one_sample.reshape(1, -1) 182 | X_train = X_train.reshape(X_train.shape[0], -1) 183 | distances = np.power(np.tile(one_sample, (X_train.shape[0], 1)) - X_train, 2).sum(axis=1) 184 | return distances 185 | 186 | # 获取k个近邻的类别标签 187 | def get_k_neighbor_labels(self, distances, y_train, k): 188 | k_neighbor_labels = [] 189 | for distance in np.sort(distances)[:k]: 190 | 191 | label = y_train[distances==distance] 192 | k_neighbor_labels.append(label) 193 | 194 | return np.array(k_neighbor_labels).reshape(-1, ) 195 | 196 | # 进行标签统计,得票最多的标签就是该测试样本的预测标签 197 | def vote(self, one_sample, X_train, y_train, k): 198 | distances = self.euclidean_distance(one_sample, X_train) 199 | y_train = y_train.reshape(y_train.shape[0], 1) 200 | k_neighbor_labels = self.get_k_neighbor_labels(distances, y_train, k) 201 | 202 | find_label, find_count = 0, 0 203 | for label, count in Counter(k_neighbor_labels).items(): 204 | if count > find_count: 205 | find_count = count 206 | find_label = label 207 | return find_label 208 | 209 | # 对测试集进行预测 210 | def predict(self, X_test, X_train, y_train): 211 | y_pred = [] 212 | for sample in X_test: 213 | label = self.vote(sample, X_train, y_train, self.k) 214 | y_pred.append(label) 215 | return np.array(y_pred) 216 | 217 | 218 | def main(): 219 | data = make_classification(n_samples=200, n_features=4, n_informative=2, 220 | n_redundant=2, n_repeated=0, n_classes=2) 221 | X, y = data[0], data[1] 222 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True) 223 | clf = KNN(k=5) 224 | y_pred = clf.predict(X_test, X_train, y_train) 225 | 226 | accu = accuracy(y_test, y_pred) 227 | print ("Accuracy:", accu) 228 | 229 | 230 | if __name__ == "__main__": 231 | main() 232 | 233 | ``` 234 | 235 | Accuracy: 0.893939393939 236 | 237 | -------------------------------------------------------------------------------- /linear_discriminant_analysis(BiClass).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 线性判别分析" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "线性判别分析(Linear Discriminant Analysis或者Fisher’s Linear Discriminant)简称LDA,是一种监督学习算法。\n", 15 | "\n", 16 | "LDA的原理是,将数据通过线性变换(投影)的方法,映射到维度更低的空间中,使得投影后的点满足同类型标签的样本在映射后的空间比较近,不同类型标签的样本在映射后的空间比较远。" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## 一、线性判别分析(二类情形)\n", 24 | "\n", 25 | "\n", 26 | "在讲解算法理论之前,先补充一下协方差矩阵的定义。\n", 27 | "\n", 28 | "\n", 29 | "### 1. 协方差矩阵定义\n", 30 | "\n", 31 | "矩阵$X_{mxn}$协方差的计算公式:\n", 32 | "\n", 33 | "设$x, y$分别是两个列向量,则$x, y$的协方差为\n", 34 | "$$cov(x, y) = E(x-\\bar{x})(y-\\bar{y})$$\n", 35 | "\n", 36 | "若将$x, y$合并成一个矩阵$X_{mxn}$,则求矩阵$X_{mxn}$,则矩阵$X_{mxn}$的协方差矩阵为\n", 37 | "\n", 38 | "$$A = \\sum_{i=1...m,j=1...n}a_{ij} = cov(X_i, X_j) = E(X_i-\\bar{X_i})(X_j-\\bar{X_j})$$\n", 39 | "\n", 40 | "\n", 41 | "### 2. 
模型原理" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 32, 47 | "metadata": { 48 | "scrolled": false 49 | }, 50 | "outputs": [ 51 | { 52 | "data": { 53 | "image/jpeg": "R0lGODlhSAE+AfcAAAkBAg8HHA4UGxIKCBEKFhwTCRoVFR44GRQhISEaFyUbJiQhGjEoFCYqJiEw\nMTUlMjU4NTUvQipGJUMkK0c5Om49O0U2RlhJOkdESU9JUElQUFdDSF1OXFVXTFtbW1xbZ1thV2Rb\nWmleanNdYX5hXGNnZ2tteGVxamhwcHlpaHZ6a3ZwcAp05wF39Bd26RR99HZqmHl4hWh7zwKD7wGA\n8BmH6RWG8iaG7iiL8iWT7iyR8TaJ6TaZ6zmT8T2j6nWAglmW3USM5UWW6kSX9lWf70um6kCi8Uuw\n51el6lil8Faz6Wud1H2h3Weh6GKn82u16mey8nKn6HW47HG48XrH55NUVIRwcKRsbcolJuAVCPYB\nAvIKEvMVC/QYF/AmF/MiJPI2KvQ3NvI5RPJPP9VIQMtvb8Zyd+dPUfNIR/JIVvJXSvJWU/JHavJo\nWu96afNnafNpcvF5a/J5d4B5hrJ4gM9vh8Z7geB7iPN6ifF8kIaMf5iBeMWOePaJfYiEiYWQgpSI\nhpSOkpabjZaWmJiboZailJ+ipp20rbmFhbmQnb6Vq7WbtaOknKenq6y4tbuotrG1qbO1uIqr54W6\n6pG76JO/96iuxqe5w4nI8o7R6JjH6ZXH9pnV7LzMzqfN6qTH8qrV6avV9LLN7LfY6LnX8seHicaK\nlcqWhsmWmN6FhtiLmNSQitSbmMaDosqYo8aasNWbpsqmndeoncumqMipscy2tNinp9aqtNuxqde1\nuO+Fj/SCh/OOlvGVhfWZmOWbofWZqeummPaim+qopeqrtuCwr+O4tvWlpfaqsfK0qvS4tcm9wM27\n1d27xNG+2Oe9w/K6xcPButnBuujGufXDvMfFx8zD1cjW2t3GydLG1NrRytjY2MPP6sXW7sPf9d/P\n4Nrb5N3g3Mvj8tnm59fq9ebJxuXL0e/XyOfW1vbIyPfP0/LXzPbW1ufb5PXb5eji3O73wvXi3Orp\n6OXs9Orz9vfo5/bt8fjy7f7+/v/M////AP//M///Zv//mf//zAAAACH5BAEAAP8ALAAAAABIAT4B\nAAj/APEJHEiwoMGDCBMqXMiwocOHECNKnEixosWLGDNq3Mixo8ePIEOKHEmypMmTKFOqXMmypcuX\nMGPKnEmzps2bOHPq3Mmzp8+fQIMKHUq0qNGjSJMqXcq0qdOnUKNKnUq1qtWrWLNq3cq1q9evYMOK\nHUu2rNmzaNOqXcv2q7x2MfzIa0v3bCQPDwAMUEGvrt+wjRoAEEABwAJIfxNzbaQAQAFC1QQAKDFX\nsWWrBwZH2IbPhOFol0NLJaQXkKFx+MARAHCir+jXS+U1ptCuUDiBKwAk4Ay791FAAwjowTeIt6MF\nABr5Xj5UHggADVAXF7gNA4AQzLP/jCS4g0BC4AbO/wFgQbt5nfRKAAhwTeB0fNsKAzh0vr7Ndn/k\n1XPPe1sHAxOgZt+AN72HTzjWCEjggjMZyOCDNDkI4YQv2UbhhS9JiOGGJmnI4YcheQjiiByJSOKJ\nF1mI4ooamcjiiw65COOMCQ1yG404xnhjjjzWyFuPQBYkY5AzDknki4LseCSQRi6JYpNOkghllCCq\nSGWOU165oZVaFvljlzBmCeaEXI65ophmMohmmgSuyaZ9Nr55ZnhynuhmndoF8iWeHBKiIJ8c3gmo\nb3EOymGZhk4oKEv1KANNOum4k2hRi67UixaYaiHHpEQh2hMwmWoRB6dDVarSpZmOSmpQpqbkS6iq\nrv/6U6souXMMMscck46sQNHKK1u+/qpWksIuGGyxZxWKbH3HLkuWn84yu2e0yzVLLVjWXuuVINNm\nlI4cccDxhjDaZnWsMF1kuka5WHmKkTBcZIoGu1cdC8wW6g7FDqTpsENvjN1ilAwaYRS8aVD1pJHv\nvwwpy5E97ETMzn4Io7EwwwrpeVY9Y1yMMULZFlWPGB5/bFDIRcVTTjnnUGzyQQ6//FfMMtc1CJ01\nzxxwzmqhzHNT0P5s885Cm0Vz0Wj5jDRS7i5NltJOF9V01GFBXVE97rDjjstUl2Q1Rb6kEUYacMTT\ndYdEy/RGpmGYfTZJX08kB9v+vj2SxjutjWkYddv/HRKxOx2jBhtvxMG13x7FjXhOii9+E+COf9V4\n5BGmPWI3mmQ+ypKT+zVEC6An4VqQR69oBOg0OMG55SCeDrrqR3ZelxE11DCDFKMDCTnlXMnOe4as\n/w4Vt8L3HnzxTfmOvEp4L3/V7g3Vow6/hzvPU5b1bLFFF11QY32vxwtUT6jef+8T9AyNn6ky5vuU\n5Tqh7oo4Os0kJWYyyiiTjNt+18NKOUhRHr3aYYt2HKV0QVkHHnYhh10cI03lKMUBw5eTZIRKDWyK\nxS2MIsCLWDBTb3iTKTZIFPQN5YOYSsOb4IGK6rmPgjhRBve41wY5sSMVvejbrGCIE3fEY2su7NIx\n/9K1hiDmhHjta4jCMKXDnnSQYcr4ghYwyCoeOo8d09AGKoLSvCQ6pAzoAIoJvagQVADwhWSESDlY\noQxfGLFBVixe9YYBhi684Y0xGWMaEQKHTK1jJ0+U2dwwBQ1AxtF86UBDF85wvUNeBQk7iGQN5nEi\ndiQjF7YApJLSwoMW0AB0fyKRGcLIOEdaxQeoa0EoR1QPV+AiDrwokCmr0knQqfJF7PACpvJgEz2S\nxRNUCCYVclfJTIHBJoH8mT3w0IUvJAOZs0xiPYRxBzyuBIl7rIgtjAHNbFqEHrCQHxy9eZEqWBMl\nXSSnRM7BinScsyS+XJIUIsmDKFByK9QIQxfk8P/OkSTzNUmwpRDIwZU1YKoL6ojJP0WThBmAbqBc\nCQamsiBOl8TzSEgQKEG5Qgw5qIIYCo3miThBBSQ8QRNhQcU0YLLQqKnjFAZ0SUujpg1T8G8lCFSn\nRDApU5Hq1Bi5aEmzjoGHBT7TfOhgxU1RctGJ1EMNJXNeOVqIU58O5Kny8iIsuKmSY9ljiVqYlzRR\nkQ2VYJMj93gDGL4AhhCqhB7kiOs9l4UOO/SzRVYdyL6U0a+VTKIHgG0CtdBB1ZPM1CJSsCUPrjWL\n+t3VIulcSkZBh4Rr1UMVFqPiSJo6lMQ+VFuvwpQv4JZXmWjiCajlhLaiqIUvIIO0OhXJMdCgC3j/\nbjK2HanHKeyxjsdG5LCO2wYWthCGQoIEuItjR6ba4FuA4VYk7lBXcxuS0+diZBdc6MJoj1ta6w5E\nGb+AxXQb1l3vDgQXK/0IZ81bETNUpiPI5V0zUHEM424kvpSzBxla+8CNnJW9HkkHvrRQxPuWF8D4\nWGIfSnRgB
LvhDpnEx3gJElkEe+QZEc7IeqUCig6DYnPIkocrSIkR/IZkErZsQQ2chQ4JZsTEIJns\nJ1fsrFxsESMbhsoTWuBQFVNrGje2CIw/ooQU0zhapXDkkD3CiSII4QhHAHG02MGKJv62wRYuSDky\nPJHqZpkj9chFUCmyZOTZoxTvvfKXUbKNUzQ3/8drvkgpQCqRMjtPpXXGcpwJcuaYPuS/ex5JOWaR\nZvKuJBRSmIQUPgHgbUIkbvSYbAuGAGB5FLYhFR4JPZJAgxnMwAgIrkcprIwQOGOEHlD4JA0ojeBb\n4OIe1NWzQaTQAx30AApZRsWYF6I4cohDHHFd2oQtImISJ8TUgc5IOy4NMlkn2yC1IKGPns0SW+wa\nZs6mNkEIu1SCePkppKhEJTaxCXEsDR2rmLZWdmxLlFJp2BvRYLO3wu5PuntpfJDGQZB9FCqk+N5I\nc4cp7HGybM/kCS+wgQ1e4AmqTfVwdr7IOCa+SqTRgqsDAbS2U1IPW2DD2wbfuEEiCPK3gcMZjv+Q\nFFekEeRMC+0RjQGABHCmlTkLhN/sckQDDDAAAGyg4lWJByrCGPEZbSMSKgAAADQQCT9jJRstLDqO\nflAA3XCgPVopRy6kjqNGfMAAAIBAoWOyjbKb/exoT7va1Y6ObdBhBJ1QOzPMPveyZyMbdsd72eu+\nDb7rfRt3p7vg7W72wO998IAv/N+tgfi/rx3tfzf8Nhh/eMLn3eyU73vhFY/5xnO+7Jn3++cn7/nL\ng770gI884tGODnCUfRzWEEHPge4SQqDACii4fe5XsILc4x4FvPe98IHfe90TP/cnSEEFSKD74OM+\n+M3v/e+df/zoG5/60H++9He//ewff/rdL/7/9Ief+/Jff/vAT3/zhY/98F9f/dq3fvzHX/zei//7\n+Kd++u/ffvnv/v38d38lkALcdwGzxxV7YAqREwkJAAC0NxWBgAgftziQAHYhJxOFoA2lMIF+w4AA\ncIEhhQ6uUA/EFDXbIAJK94BSYSHmAAM3gAQlWDQrAHZKB4IsxRnkwAKg8wRRUw0MoHQD4ACE0BXT\nQQ82ADpSsDSNEAIIoBslYAhjhxXv4QlFsAQqiDHgkAFKhwGMUDU/Mg/NAAv4AFc/Qw9+gAFc92J7\n4grMoAQ4sAk/Uw9RmFs7ZBBBADo3YG4ihxIaUmQtsAMbtYdocxCQFAqCyIfdMg+PYGyH6E8B/4MO\nswBvjRhrCbFNowAK2SSJJ+FLkgA6kzCJIbIz81BLOKCHoAhfRAMFDgUEc3WKBqYQ84AJUZAIrpg4\nljMPxlA/tehft3UQ6KAK3baLFBE0DKENtnANSNBwwlgRUGIJLtACL2CIy9hlx8MJR9gCUzCN1PgQ\nkdYCEKWNEbFh9MANiwCOeSYR0yCG5vgQ31YQ8sAKCbWO1NWLDGEOrDCH8lhyE5ELxjAOn+AN+Thv\nTkULQvACPUAKAVlwFaEJqAM7CTkQU7MQ45ADnoQJD8kR4FAETHCRHsGPHAlmqOB0H3kRLGcNMTiS\nEdEOMuACR4CSF7EJM/BJmXBVLhkRoXCNlP9QkwtJBJNwC7CmkxNBSYpwCcAGlBMhBC6AAzkgCkYJ\nEX4IOjvQlA+BYp7UAkUglQ8RBTzQA0BAc1i5EPRAD+8AC/j4lQeRCwpolg8RDNKmlgxhCtfmlgmR\nVGUplwIhDVxmlwdhD7mQXnqJEPWACJo4ks/gCn+pELXAB4eZEKVgDot5EO/ACoz4mPhgDiNImQWR\nC8WAmQTRDq8wmB+5DnbAmQRxC8vwk6RpC3lJmfdgCpP5mP6jcqRpY6QpEPbgcbWJD+UgC3Vpl+UQ\nC7mJD7oWnKhgX6R5BcGJYQRXm2KWm/UgCzZ4is1gmLkpC8EQnHVwRqQZD63wDrn5DW6Wm7kIgAvB\n2ZykEhAAOw==\n", 54 | "text/plain": [ 55 | "" 56 | ] 57 | }, 58 | "metadata": {}, 59 | "output_type": "display_data" 60 | } 61 | ], 62 | "source": [ 63 | "from IPython.display import Image, display\n", 64 | "\n", 65 | "display(Image('lda.jpg'))" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "上图中红色的正方形形状的点为0类的原样本点、蓝色的正方形形状的点为1类的原样本点,过原点的那条直线就是投影的直线,从图上可以清楚的看到,红色的点和蓝色的点被原点明显的分开了。下面具体看看二分类LDA二类问题的情形:\n", 73 | "\n", 74 | "现在我们觉得原始特征数太多,想将$n$维特征降到只有一维(LDA映射到的低维空间维度小于等于$n_{labels}-1$),而又要保证类别能够“清晰”地反映在低维数据上,也就是这一维就能决定每个样例的类别。\n", 75 | "\n", 76 | "假设用来区分二分类的直线(投影函数)为:\n", 77 | "\n", 78 | "$$y=w^Tx$$\n", 79 | "\n", 80 | "注意这里得到的$y$值不是0/1值,而是$x$投影到直线上的点到原点的距离。\n", 81 | "\n", 82 | "已知数据集\n", 83 | "\n", 84 | "$$D=\\{x^{(1)}, x^{(1)}, …, x^{(m)}\\},$$\n", 85 | "\n", 86 | "将$D$按照类别标签划分为两类$D_1, D_2$, 其中$D_1\\bigcup D_2=D, D_1\\bigcap D_2=\\emptyset$, \n", 87 | "定义两个子集的中心:\n", 88 | "\n", 89 | "$$\\mu_1 = \\frac{1}{n_1} \\sum_{x^{(i)} \\in D_1} {x^{(i)}},$$\n", 90 | "$$\\mu_2 = \\frac{1}{n_2} \\sum_{x^{(i)} \\in D_2} {x^{(i)}},$$\n", 91 | "\n", 92 | "则两个子集投影后的中心为\n", 93 | "$$\\tilde{\\mu_1} = \\frac{1}{n_1} \\sum_{y^{(i)} \\in \\tilde{D_1}} w^T {x^{(i)}},$$\n", 94 | "$$\\tilde{\\mu_2} = \\frac{1}{n_2} \\sum_{x^{(i)} \\in D_2} w^T {x^{(i)}},$$\n", 95 | "\n", 96 | "则两个子集投影后的方差分别为\n", 97 | "$$ \\tilde{ \\sigma}^2_1 = \\frac{1}{n_1} \\sum_{y^{(i)} \\in \\tilde{D_1}} ({y^{(i)}-\\tilde{\\mu_1}})^2 = \n", 98 | "\\frac{1}{n_1} \\sum_{x^{(i)} \\in {D_1}} ({w^T x^{(i)}-w^T \\mu_1})^2 = \n", 99 | "\\frac{1}{n_1} \\sum_{x^{(i)} \\in {D_1}} w^T ({x^{(i)}- \\mu_1}) 
({x^{(i)}- \\mu_1})^T w ,$$\n", 100 | "\n", 101 | "同理可得\n", 102 | "$$ \\tilde{ \\sigma}^2_2 = \\frac{1}{n_2} \\sum_{y^{(i)} \\in \\tilde{D_2}} ({y^{(i)}-\\tilde{\\mu_2}})^2 =\n", 103 | "\\frac{1}{n_2} \\sum_{x^{(i)} \\in {D_2}} w^T ({x^{(i)}- \\mu_2}) ({x^{(i)}- \\mu_2})^T w , $$\n", 104 | "\n", 105 | "令\n", 106 | "$$ S_1 = \\frac{1}{n_1} \\sum_{x^{(i)} \\in {D_1}} ({x^{(i)}- \\mu_1}) ({x^{(i)}- \\mu_1})^T, $$\n", 107 | "\n", 108 | "$$ S_2 = \\frac{1}{n_2} \\sum_{x^{(i)} \\in {D_2}} ({x^{(i)}- \\mu_2}) ({x^{(i)}- \\mu_2})^T, $$\n", 109 | "\n", 110 | "则有\n", 111 | "$$\\tilde{ \\sigma}^2_1 = w^T S_1 w, \\, \\tilde{\\sigma}^2_2 = w^T S_2 w. $$\n", 112 | "\n", 113 | "令\n", 114 | "$$ \\tilde{S_1} = \\frac{1}{n_1} \\sum_{y^{(i)} \\in \\tilde{D_1}} ({y^{(i)}- \\tilde{\\mu_1}}) ({y^{(i)}- \\tilde{\\mu_1}})^T, $$\n", 115 | "\n", 116 | "$$ \\tilde{S_2} = \\frac{1}{n_2} \\sum_{y^{(i)} \\in \\tilde{D_2}} ({y^{(i)}- \\tilde{\\mu_2}}) ({y^{(i)}- \\tilde{\\mu_2}})^T, $$\n", 117 | "\n", 118 | "则有\n", 119 | "$$ \\tilde{S_1} = w^T S_1 w, \\,\\,\\, \\tilde{S_2} = w^T S_2 w, $$\n", 120 | "\n", 121 | "现在我们就可以定义损失函数:\n", 122 | "\n", 123 | "$$ J(w) = \\frac{|\\tilde{\\mu}_1 - \\tilde{\\mu}_2|^2 }{\\tilde{S}^2_1 + \\tilde{S}^2_2 } $$\n", 124 | "\n", 125 | "我们分类的目标是,使得类别内的点距离越近(集中),类别间的点越远越好。分母表示数据被映射到低维空间之后每一个类别内的方差之和,方差越大表示在低维空间(映射后的空间)一个类别内的点越分散,欲使类别内的点距离越近(集中),分母应该越小越好。分子为在映射后的空间两个类别各自的中心点的距离的平方,欲使类别间的点越远,分子越大越好。故我们最大化$J(w)$,求出的w就是最优的了。\n", 126 | "\n", 127 | "因为\n", 128 | "\n", 129 | "$$ |\\tilde{\\mu}_1 - \\tilde{\\mu}_2|^2 = w^T (\\mu_1 - \\mu_2) (\\mu_1 - \\mu_2)^T w = w^T S_B w,$$\n", 130 | "\n", 131 | "其中,\n", 132 | "\n", 133 | "$$ S_B = (\\mu_1 - \\mu_2) (\\mu_1 - \\mu_2)^T .$$\n", 134 | "\n", 135 | "设$S_w = S_1 + S_2,$ 则\n", 136 | "\n", 137 | "$$ J(w) = \\frac{w^T S_B w}{w^T S_w w} $$\n", 138 | "\n", 139 | "这样就可以用最喜欢的拉格朗日乘数法了,但是有一个问题,如果分子、分母是都可以取任意值,就会导致有无穷解,我们将分母限制为长度为1(这是用拉格朗日乘子法一个很重要的技巧,在下面将说的PCA里面也会用到,如果忘记了,请复习一下高数),并作为拉格朗日乘子法的限制条件,带入得到:\n", 140 | "\n", 141 | "$$ loss(w) = w^T S_B w - (\\lambda w^T S_w w -1) $$\n", 142 | "\n", 143 | "令$$\\frac{dloss}{dw}=2 S_B w - 2 \\lambda S_w w = 0, $$ \n", 144 | "\n", 145 | "则有$$ S_B w = \\lambda S_w w. $$ \n", 146 | "\n", 147 | "很显然,$S_B w$和$\\mu_1 - \\mu_2$是平行的,又因为对$w$扩大缩小任何倍(平移$w$)不影响结果,因此,只要找到的$w$满足条件$S_B w$与$\\mu_1 - \\mu_2$平行即可。如果$S_w$是非奇异的,则有\n", 148 | "\n", 149 | "$$w = S_w^{-1}{(\\mu_1 - \\mu_2)}. $$\n", 150 | "\n", 151 | "下面看看具体的数学推导,\n", 152 | "$$ S_B w = (\\mu_1 - \\mu_2) (\\mu_1 - \\mu_2)^T w = (\\mu_1 - \\mu_2) \\lambda_w .$$\n", 153 | "\n", 154 | "将上式代入特征值公式中可得$$ S_w^{-1} S_B w = S_w^{-1} (\\mu_1 - \\mu_2) \\lambda_w = \\lambda w , $$\n", 155 | "\n", 156 | "因为$w$的平移不影响结果,故可以扔掉$\\lambda_w , \\lambda $,因此可得\n", 157 | "\n", 158 | "$$w = S_w^{-1}{(\\mu_1 - \\mu_2)}. 
$$\n", 159 | "\n", 160 | "\n", 161 | "得到$w$之后,就可以对测试数据进行分类了。\n", 162 | "\n", 163 | "一个常见的LDA分类基本思想是假设各个类别的样本数据符合高斯分布,这样利用LDA进行投影后,可以利用极大似然估计计算各个类别投影数据的均值和方差,进而得到该类别高斯分布的概率密度函数。当一个新的样本到来后,我们可以将它投影,然后将投影后的样本特征分别带入各个类别的高斯分布概率密度函数,计算它属于这个类别的概率,最大的概率对应的类别即为预测类别。\n", 164 | "\n", 165 | "\n", 166 | "但是这里还有另外一种分类的思想,就以LDA二值分类为例,我们可以将测试数据投影到低维空间(直线,因为二分类问题是投影到一维空间),得到$y$,然后看看$y$是否在超过某个阈值$y_0$,超过是某一类,否则是另一类。但是又该怎么去寻找这个$y_0$呢?\n", 167 | "\n", 168 | "因为\n", 169 | "\n", 170 | "$$ y = w^T x, $$\n", 171 | "\n", 172 | "根据中心极限定理,独立同分布的随机变量和服从高斯分布,然后利用极大似然估计求\n", 173 | "\n", 174 | "$$ p(y|label_i), $$\n", 175 | "\n", 176 | "然后用决策理论里的公式来寻找最佳的$y_0$,详情请参阅PRML。这是一种可行但比较繁琐的选取方法。\n", 177 | "\n", 178 | "其实,还有另外一种非常巧妙的方法可以确定$y_0=0$,投影之前的数据集的标签$y_{label}$是用0和1来表示,这里我们将其做一个简单的变换,\n", 179 | "\n", 180 | "$$ \\begin{cases}\n", 181 | "\\tilde y_{label}=\\frac{m}{n_1} & \\text{ if } x\\in D_1 \\\\ \n", 182 | "\\tilde y_{label}=-\\frac{m}{n_2} & \\text{ if } x\\in D_2\n", 183 | "\\end{cases} $$\n", 184 | "\n", 185 | "从变换后的$\\tilde y_{label}$的定义可以看出,对于样本$x^{(i)}$, 若$\\tilde y_{label}^{(i)}>0$,则$ x^{(i)} \\in D_1$,即$ y^{(i)}_{label}=0$,若$\\tilde y_{label}^{(i)}<0$,则$ x^{(i)} \\in D_2$,即$ y^{(i)}_{label}=1$.\n", 186 | "\n", 187 | "\n" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "## 算法流程\n", 195 | "\n", 196 | "\n", 197 | "输入:数据集$D=\\{x^{(1)}, x^{(1)}, …, x^{(m)}\\}$;\n", 198 | "\n", 199 | "输出:投影后的样本集$D′$;\n", 200 | "\n", 201 | "* 计算类内散度矩阵$S_w$;\n", 202 | "\n", 203 | "* 求解向量$w$,其中$w = S_w^{-1}(\\mu_1 - \\mu_2)$;\n", 204 | "\n", 205 | "* 将原始样本集投影到以$w$为基向量生成的低维空间中(1维),投影后的样本集就是我们需要的样本集$D′$(1维特征)。\n" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 1, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "name": "stdout", 215 | "output_type": "stream", 216 | "text": [ 217 | "Accuracy: 1.0\n" 218 | ] 219 | } 220 | ], 221 | "source": [ 222 | "from sklearn import datasets\n", 223 | "import matplotlib.pyplot as plt\n", 224 | "%matplotlib inline\n", 225 | "import numpy as np\n", 226 | "import pandas as pd\n", 227 | "\n", 228 | "\n", 229 | "\n", 230 | "def shuffle_data(X, y, seed=None):\n", 231 | " if seed:\n", 232 | " np.random.seed(seed)\n", 233 | "\n", 234 | " idx = np.arange(X.shape[0])\n", 235 | " np.random.shuffle(idx)\n", 236 | "\n", 237 | " return X[idx], y[idx]\n", 238 | "\n", 239 | "\n", 240 | "\n", 241 | "# 正规化数据集 X\n", 242 | "def normalize(X, axis=-1, p=2):\n", 243 | " lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis))\n", 244 | " lp_norm[lp_norm == 0] = 1\n", 245 | " return X / np.expand_dims(lp_norm, axis)\n", 246 | "\n", 247 | "\n", 248 | "# 标准化数据集 X\n", 249 | "def standardize(X):\n", 250 | " X_std = np.zeros(X.shape)\n", 251 | " mean = X.mean(axis=0)\n", 252 | " std = X.std(axis=0)\n", 253 | " \n", 254 | " # 做除法运算时请永远记住分母不能等于0的情形\n", 255 | " # X_std = (X - X.mean(axis=0)) / X.std(axis=0) \n", 256 | " for col in range(np.shape(X)[1]):\n", 257 | " if std[col]:\n", 258 | " X_std[:, col] = (X_std[:, col] - mean[col]) / std[col]\n", 259 | " \n", 260 | " return X_std\n", 261 | "\n", 262 | "\n", 263 | "# 划分数据集为训练集和测试集\n", 264 | "def train_test_split(X, y, test_size=0.2, shuffle=True, seed=None):\n", 265 | " if shuffle:\n", 266 | " X, y = shuffle_data(X, y, seed)\n", 267 | " \n", 268 | " n_train_samples = int(X.shape[0] * (1-test_size))\n", 269 | " x_train, x_test = X[:n_train_samples], X[n_train_samples:]\n", 270 | " y_train, y_test = y[:n_train_samples], y[n_train_samples:]\n", 271 | "\n", 272 | " return x_train, x_test, y_train, y_test\n", 
273 | "\n", 274 | "\n", 275 | "def accuracy(y, y_pred):\n", 276 | " y = y.reshape(y.shape[0], -1)\n", 277 | " y_pred = y_pred.reshape(y_pred.shape[0], -1)\n", 278 | " return np.sum(y == y_pred)/len(y)\n", 279 | "\n", 280 | "\n", 281 | "# 计算矩阵X的协方差矩阵\n", 282 | "def calculate_covariance_matrix(X, Y=np.empty((0,0))):\n", 283 | " if not Y.any():\n", 284 | " Y = X\n", 285 | " n_samples = np.shape(X)[0]\n", 286 | " covariance_matrix = (1 / (n_samples-1)) * (X - X.mean(axis=0)).T.dot(Y - Y.mean(axis=0))\n", 287 | "\n", 288 | " return np.array(covariance_matrix, dtype=float)\n", 289 | "\n", 290 | "\n", 291 | "\n", 292 | "class BiClassLDA():\n", 293 | " \"\"\"\n", 294 | " 线性判别分析分类算法(Linear Discriminant Analysis classifier). 既可以用来分类也可以用来降维.\n", 295 | " 此处实现二类情形(二类情形分类)\n", 296 | " \"\"\"\n", 297 | " def __init__(self):\n", 298 | " self.w = None\n", 299 | " \n", 300 | " \n", 301 | " def transform(self, X, y):\n", 302 | " self.fit(X, y)\n", 303 | " # Project data onto vector\n", 304 | " X_transform = X.dot(self.w)\n", 305 | " return X_transform\n", 306 | "\n", 307 | " \n", 308 | " def fit(self, X, y):\n", 309 | " # Separate data by class\n", 310 | " X = X.reshape(X.shape[0], -1)\n", 311 | " \n", 312 | " X1 = X[y == 0]\n", 313 | " X2 = X[y == 1]\n", 314 | " y = y.reshape(y.shape[0], -1)\n", 315 | " \n", 316 | " # 计算两个子集的协方差矩阵\n", 317 | " S1 = calculate_covariance_matrix(X1)\n", 318 | " S2 = calculate_covariance_matrix(X2)\n", 319 | " Sw = S1 + S2\n", 320 | " \n", 321 | " # 计算两个子集的均值\n", 322 | " mu1 = X1.mean(axis=0)\n", 323 | " mu2 = X2.mean(axis=0)\n", 324 | " mean_diff = np.atleast_1d(mu1 - mu2)\n", 325 | " mean_diff = mean_diff.reshape(X.shape[1], -1)\n", 326 | " \n", 327 | " # 计算w. 其中w = Sw^(-1)(mu1 - mu2), 这里我求解的是Sw的伪逆, 因为Sw可能是奇异的\n", 328 | " self.w = np.linalg.pinv(Sw).dot(mean_diff)\n", 329 | " \n", 330 | "\n", 331 | " def predict(self, X):\n", 332 | " y_pred = []\n", 333 | " for sample in X:\n", 334 | " sample = sample.reshape(1, sample.shape[0])\n", 335 | " h = sample.dot(self.w)\n", 336 | " y = 1 * (h[0][0] < 0)\n", 337 | " y_pred.append(y)\n", 338 | " return y_pred\n", 339 | "\n", 340 | "\n", 341 | "def main():\n", 342 | " # 加载数据\n", 343 | " data = datasets.load_iris()\n", 344 | " X = data.data\n", 345 | " y = data.target\n", 346 | "\n", 347 | " # 只取label=0和1的数据,因为是二分类问题\n", 348 | " X = X[y != 2]\n", 349 | " y = y[y != 2]\n", 350 | " \n", 351 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)\n", 352 | " \n", 353 | " # 训练模型\n", 354 | " lda = BiClassLDA()\n", 355 | " lda.fit(X_train, y_train)\n", 356 | " lda.transform(X_train, y_train)\n", 357 | " \n", 358 | " # 在测试集上预测\n", 359 | " y_pred = lda.predict(X_test)\n", 360 | " y_pred = np.array(y_pred)\n", 361 | " accu = accuracy(y_test, y_pred)\n", 362 | " print (\"Accuracy:\", accu)\n", 363 | " \n", 364 | "\n", 365 | "if __name__ == \"__main__\":\n", 366 | " main()\n" 367 | ] 368 | } 369 | ], 370 | "metadata": { 371 | "kernelspec": { 372 | "display_name": "Python 3", 373 | "language": "python", 374 | "name": "python3" 375 | }, 376 | "language_info": { 377 | "codemirror_mode": { 378 | "name": "ipython", 379 | "version": 3 380 | }, 381 | "file_extension": ".py", 382 | "mimetype": "text/x-python", 383 | "name": "python", 384 | "nbconvert_exporter": "python", 385 | "pygments_lexer": "ipython3", 386 | "version": "3.6.1" 387 | } 388 | }, 389 | "nbformat": 4, 390 | "nbformat_minor": 2 391 | } 392 | -------------------------------------------------------------------------------- /linear_discriminant_analysis(BiClass).zip: 
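To make the two-class LDA rule above concrete, here is a small self-contained sketch on synthetic data. It follows the same construction, `w = Sw^(-1)(mu1 - mu2)`, but thresholds the projection at the midpoint of the two projected class means rather than at 0, so it does not depend on how the data happens to be centred. The function name `fit_lda_direction` and the toy data are mine, not the notebook's.

```python
import numpy as np


def fit_lda_direction(X1, X2):
    """Two-class LDA direction w = Sw^(-1) (mu1 - mu2), with Sw the sum of the
    per-class covariance matrices; the pseudo-inverse guards against singular Sw."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    return np.linalg.pinv(Sw).dot(mu1 - mu2)


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    # two overlapping Gaussian blobs in 2-D
    X1 = rng.randn(200, 2)
    X2 = rng.randn(200, 2) + np.array([2.5, 2.5])

    w = fit_lda_direction(X1, X2)

    # class 1 always projects above class 2 along w, because
    # (mu1 - mu2)^T Sw^(-1) (mu1 - mu2) >= 0; threshold at the midpoint
    threshold = 0.5 * (X1.dot(w).mean() + X2.dot(w).mean())
    correct = (X1.dot(w) > threshold).sum() + (X2.dot(w) <= threshold).sum()
    print("separation accuracy: %.3f" % (correct / 400.0))
```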
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Allensmile/Machine-learning-implement/HEAD/linear_discriminant_analysis(BiClass).zip -------------------------------------------------------------------------------- /linear_discriminant_analysis(MutiClass).zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Allensmile/Machine-learning-implement/HEAD/linear_discriminant_analysis(MutiClass).zip -------------------------------------------------------------------------------- /logistic_regression.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Allensmile/Machine-learning-implement/HEAD/logistic_regression.zip -------------------------------------------------------------------------------- /naive_bayes(continuity features).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "朴素贝叶斯算法是基于贝叶斯定理和特征之间条件独立假设的分类方法。对于给定的训练数据集,首先基于特征条件独立假设学习输入/输出的联合概率分布;然后基于此模型,对给定的输入x,利用贝叶斯定理求出后验概率最大的输出y。朴素贝叶斯算法实现简单,学习和预测的效率都很高,是一种常用的方法。\n", 9 | "\n", 10 | "本文考虑当特征是连续情形时,朴素贝叶斯分类方法的原理以及如何从零开始实现贝叶斯分类算法。此时,我们通常有两种处理方式,第一种就是将连续特征离散化(区间化),这样就转换成了离散情形,完全按照特征离散情形即可完成分类,具体原理可以参见上一篇文章。第二种处理方式就是本文的重点,详情请看下文:\n", 11 | "\n", 12 | "\n", 13 | "\n", 14 | "\n", 15 | "\n", 16 | "## 1. 朴素贝叶斯算法原理\n", 17 | "\n", 18 | "\n", 19 | "与特征是离散情形时原理类似,只是在计算后验概率时有点不一样,具体计算方法如下:\n", 20 | "\n", 21 | "这时,可以假设每个类别中的样本集的每个特征均服从正态分布,通过其样本集计算出均值和方差,也就是得到正态分布的密度函数。有了密度函数,就可以把值代入,算出某一点的密度函数的值。为了阐述的更加清楚,下面我摘取了一个实例,以供大家更好的理解。\n", 22 | "\n", 23 | "\n", 24 | "\n", 25 | "## 2. 朴素贝叶斯的应用\n", 26 | "\n", 27 | "\n", 28 | "下面是一组人类身体特征的统计资料。\n", 29 | "\n", 30 | "   性别  身高(英尺) 体重(磅)  脚掌(英寸)\n", 31 | "\n", 32 | "   男    6       180     12\n", 33 | "   男    5.92     190     11\n", 34 | "   男    5.58     170     12\n", 35 | "   男    5.92     165     10\n", 36 | "   女    5       100     6\n", 37 | "   女    5.5      150     8\n", 38 | "   女    5.42     130     7\n", 39 | "   女    5.75     150     9\n", 40 | "\n", 41 | "已知某人身高6英尺、体重130磅,脚掌8英寸,请问该人是男是女?\n", 42 | "\n", 43 | "根据朴素贝叶斯分类器,计算下面这个式子的值。\n", 44 | "\n", 45 | " P(身高|性别) x P(体重|性别) x P(脚掌|性别) x P(性别)\n", 46 | "\n", 47 | "这里的困难在于,由于身高、体重、脚掌都是连续变量,不能采用离散变量的方法计算概率。而且由于样本太少,所以也无法分成区间计算。怎么办?\n", 48 | "\n", 49 | "这时,可以假设男性和女性的身高、体重、脚掌都是正态分布,通过样本计算出均值和方差,也就是得到正态分布的密度函数。有了密度函数,就可以把值代入,算出某一点的密度函数的值。\n", 50 | "\n", 51 | "比如,男性的身高是均值5.855、方差0.035的正态分布。所以,男性的身高为6英尺的概率的相对值等于1.5789(大于1并没有关系,因为这里是密度函数的值,只用来反映各个值的相对可能性)。\n", 52 | "\n", 53 | "从上面的计算结果可以看出,分母都一样,因此,我们只需要比价分子的大小即可。显然,P(不转化|Mx上海)的分子大于P(转化|Mx上海)的分子,因此,这个上海男性用户的预测结果是不转化。这就是贝叶斯分类器的基本方法:在统计资料的基础上,依据某些特征,计算各个类别的概率,从而实现分类。\n", 54 | "\n", 55 | "$$ p(height|male) = \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} e^{-\\frac{(6-\\mu)^2}{2 \\sigma^2}} \\approx 1.5789 $$\n", 56 | "\n", 57 | "有了这些数据以后,就可以计算性别的分类了。\n", 58 | "\n", 59 | "   P(身高=6|男) x P(体重=130|男) x P(脚掌=8|男) x P(男)\n", 60 | "     = 6.1984 x e-9\n", 61 | "\n", 62 | "   P(身高=6|女) x P(体重=130|女) x P(脚掌=8|女) x P(女)\n", 63 | "     = 5.3778 x e-4\n", 64 | "\n", 65 | "可以看到,女性的概率比男性要高出将近10000倍,所以判断该人为女性。" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 1, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "name": "stdout", 75 | "output_type": "stream", 76 | "text": [ 77 | "Accuracy: 0.973333333333\n" 78 | ] 79 | } 80 | ], 81 | "source": [ 82 | "from __future__ import division, 
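# NOTE (editor's sketch, not part of the original notebook): a quick check of the worked
# example above. Plugging the male height statistics (mean 5.855, variance 0.035) into the
# normal density at height 6 should reproduce the quoted value of about 1.5789; values above
# 1 are fine because this is a density, not a probability.
import math

def gaussian_density(x, mean, var):
    """Normal probability density N(x | mean, var)."""
    return (1.0 / math.sqrt(2.0 * math.pi * var)) * math.exp(-(x - mean) ** 2 / (2.0 * var))

print(gaussian_density(6.0, mean=5.855, var=0.035))  # approximately 1.5789, matching the text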
print_function\n", 83 | "from sklearn import datasets\n", 84 | "import matplotlib.pyplot as plt\n", 85 | "import math\n", 86 | "import numpy as np\n", 87 | "import pandas as pd\n", 88 | "%matplotlib inline\n", 89 | "\n", 90 | "\n", 91 | "def shuffle_data(X, y, seed=None):\n", 92 | " if seed:\n", 93 | " np.random.seed(seed)\n", 94 | "\n", 95 | " idx = np.arange(X.shape[0])\n", 96 | " np.random.shuffle(idx)\n", 97 | "\n", 98 | " return X[idx], y[idx]\n", 99 | "\n", 100 | "\n", 101 | "# 正规化数据集 X\n", 102 | "def normalize(X, axis=-1, p=2):\n", 103 | " lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis))\n", 104 | " lp_norm[lp_norm == 0] = 1\n", 105 | " return X / np.expand_dims(lp_norm, axis)\n", 106 | "\n", 107 | "\n", 108 | "# 标准化数据集 X\n", 109 | "def standardize(X):\n", 110 | " X_std = np.zeros(X.shape)\n", 111 | " mean = X.mean(axis=0)\n", 112 | " std = X.std(axis=0)\n", 113 | " \n", 114 | " # 做除法运算时请永远记住分母不能等于0的情形\n", 115 | " # X_std = (X - X.mean(axis=0)) / X.std(axis=0) \n", 116 | " for col in range(np.shape(X)[1]):\n", 117 | " if std[col]:\n", 118 | " X_std[:, col] = (X_std[:, col] - mean[col]) / std[col]\n", 119 | " return X_std\n", 120 | "\n", 121 | "\n", 122 | "# 划分数据集为训练集和测试集\n", 123 | "def train_test_split(X, y, test_size=0.2, shuffle=True, seed=None):\n", 124 | " if shuffle:\n", 125 | " X, y = shuffle_data(X, y, seed)\n", 126 | "\n", 127 | " n_train_samples = int(X.shape[0] * (1-test_size))\n", 128 | " x_train, x_test = X[:n_train_samples], X[n_train_samples:]\n", 129 | " y_train, y_test = y[:n_train_samples], y[n_train_samples:]\n", 130 | "\n", 131 | " return x_train, x_test, y_train, y_test\n", 132 | "\n", 133 | "\n", 134 | "def accuracy(y, y_pred):\n", 135 | " y = y.reshape(y.shape[0], -1)\n", 136 | " y_pred = y_pred.reshape(y_pred.shape[0], -1)\n", 137 | " return np.sum(y == y_pred)/len(y)\n", 138 | "\n", 139 | "\n", 140 | "\n", 141 | "\n", 142 | "class NaiveBayes():\n", 143 | " \"\"\"朴素贝叶斯分类模型. \"\"\"\n", 144 | " def __init__(self):\n", 145 | " self.classes = None\n", 146 | " self.X = None\n", 147 | " self.y = None\n", 148 | " # 存储高斯分布的参数(均值, 方差), 因为预测的时候需要, 模型训练的过程中其实就是计算出\n", 149 | " # 所有高斯分布(因为朴素贝叶斯模型假设每个类别的样本集每个特征都服从高斯分布, 固有多个\n", 150 | " # 高斯分布)的参数\n", 151 | " self.parameters = []\n", 152 | "\n", 153 | " def fit(self, X, y):\n", 154 | " self.X = X\n", 155 | " self.y = y\n", 156 | " self.classes = np.unique(y)\n", 157 | " # 计算每一个类别每个特征的均值和方差\n", 158 | " for i in range(len(self.classes)):\n", 159 | " c = self.classes[i]\n", 160 | " # 选出该类别的数据集\n", 161 | " x_where_c = X[np.where(y == c)]\n", 162 | " # 计算该类别数据集的均值和方差\n", 163 | " self.parameters.append([])\n", 164 | " for j in range(len(x_where_c[0, :])):\n", 165 | " col = x_where_c[:, j]\n", 166 | " parameters = {}\n", 167 | " parameters[\"mean\"] = col.mean()\n", 168 | " parameters[\"var\"] = col.var()\n", 169 | " self.parameters[i].append(parameters)\n", 170 | "\n", 171 | " # 计算高斯分布密度函数的值\n", 172 | " def calculate_gaussian_probability(self, mean, var, x):\n", 173 | " coeff = (1.0 / (math.sqrt((2.0 * math.pi) * var)))\n", 174 | " exponent = math.exp(-(math.pow(x - mean, 2) / (2 * var)))\n", 175 | " return coeff * exponent\n", 176 | "\n", 177 | " # 计算先验概率 \n", 178 | " def calculate_priori_probability(self, c):\n", 179 | " x_where_c = self.X[np.where(self.y == c)]\n", 180 | " n_samples_for_c = x_where_c.shape[0]\n", 181 | " n_samples = self.X.shape[0]\n", 182 | " return n_samples_for_c / n_samples\n", 183 | "\n", 184 | " # Classify using Bayes Rule, P(Y|X) = P(X|Y)*P(Y)/P(X)\n", 185 | " # P(X|Y) - Probability. 
Gaussian distribution (given by calculate_probability)\n", 186 | " # P(Y) - Prior (given by calculate_prior)\n", 187 | " # P(X) - Scales the posterior to the range 0 - 1 (ignored)\n", 188 | " # Classify the sample as the class that results in the largest P(Y|X)\n", 189 | " # (posterior)\n", 190 | " def classify(self, sample):\n", 191 | " posteriors = []\n", 192 | " \n", 193 | " # 遍历所有类别\n", 194 | " for i in range(len(self.classes)):\n", 195 | " c = self.classes[i]\n", 196 | " prior = self.calculate_priori_probability(c)\n", 197 | " posterior = np.log(prior)\n", 198 | " \n", 199 | " # probability = P(Y)*P(x1|Y)*P(x2|Y)*...*P(xN|Y)\n", 200 | " # 遍历所有特征 \n", 201 | " for j, params in enumerate(self.parameters[i]):\n", 202 | " # 取出第i个类别第j个特征的均值和方差\n", 203 | " mean = params[\"mean\"]\n", 204 | " var = params[\"var\"]\n", 205 | " # 取出预测样本的第j个特征\n", 206 | " sample_feature = sample[j]\n", 207 | " # 按照高斯分布的密度函数计算密度值\n", 208 | " prob = self.calculate_gaussian_probability(mean, var, sample_feature)\n", 209 | " # 朴素贝叶斯模型假设特征之间条件独立,即P(x1,x2,x3|Y) = P(x1|Y)*P(x2|Y)*P(x3|Y), \n", 210 | " # 并且用取对数的方法将累乘转成累加的形式\n", 211 | " posterior += np.log(prob)\n", 212 | " \n", 213 | " posteriors.append(posterior)\n", 214 | " \n", 215 | " # 对概率进行排序\n", 216 | " index_of_max = np.argmax(posteriors)\n", 217 | " max_value = posteriors[index_of_max]\n", 218 | "\n", 219 | " return self.classes[index_of_max]\n", 220 | "\n", 221 | " # 对数据集进行类别预测\n", 222 | " def predict(self, X):\n", 223 | " y_pred = []\n", 224 | " for sample in X:\n", 225 | " y = self.classify(sample)\n", 226 | " y_pred.append(y)\n", 227 | " return np.array(y_pred)\n", 228 | "\n", 229 | "\n", 230 | "def main():\n", 231 | " data = datasets.load_iris()\n", 232 | " X = normalize(data.data)\n", 233 | " y = data.target\n", 234 | "\n", 235 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)\n", 236 | "\n", 237 | " clf = NaiveBayes()\n", 238 | " clf.fit(X_train, y_train)\n", 239 | " y_pred = np.array(clf.predict(X_test))\n", 240 | "\n", 241 | " accu = accuracy(y_test, y_pred)\n", 242 | "\n", 243 | " print (\"Accuracy:\", accu)\n", 244 | "\n", 245 | " \n", 246 | "if __name__ == \"__main__\":\n", 247 | " main()\n" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "参考文献:\n", 255 | "http://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html\n", 256 | "\n", 257 | "李航《统计学习方法》" 258 | ] 259 | } 260 | ], 261 | "metadata": { 262 | "kernelspec": { 263 | "display_name": "Python 3", 264 | "language": "python", 265 | "name": "python3" 266 | }, 267 | "language_info": { 268 | "codemirror_mode": { 269 | "name": "ipython", 270 | "version": 3 271 | }, 272 | "file_extension": ".py", 273 | "mimetype": "text/x-python", 274 | "name": "python", 275 | "nbconvert_exporter": "python", 276 | "pygments_lexer": "ipython3", 277 | "version": "3.6.1" 278 | } 279 | }, 280 | "nbformat": 4, 281 | "nbformat_minor": 2 282 | } 283 | -------------------------------------------------------------------------------- /naive_bayes(continuity features).md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 朴素贝叶斯算法是基于贝叶斯定理和特征之间条件独立假设的分类方法。对于给定的训练数据集,首先基于特征条件独立假设学习输入/输出的联合概率分布;然后基于此模型,对给定的输入x,利用贝叶斯定理求出后验概率最大的输出y。朴素贝叶斯算法实现简单,学习和预测的效率都很高,是一种常用的方法。 4 | 5 | 本文考虑当特征是连续情形时,朴素贝叶斯分类方法的原理以及如何从零开始实现贝叶斯分类算法。此时,我们通常有两种处理方式,第一种就是将连续特征离散化(区间化),这样就转换成了离散情形,完全按照特征离散情形即可完成分类,具体原理可以参见上一篇文章。第二种处理方式就是本文的重点,详情请看下文: 6 | 7 | 8 | 9 | 10 | 11 | ## 1. 
朴素贝叶斯算法原理 12 | 13 | 14 | 与特征是离散情形时原理类似,只是在计算后验概率时有点不一样,具体计算方法如下: 15 | 16 | 这时,可以假设每个类别中的样本集的每个特征均服从正态分布,通过其样本集计算出均值和方差,也就是得到正态分布的密度函数。有了密度函数,就可以把值代入,算出某一点的密度函数的值。为了阐述的更加清楚,下面我摘取了一个实例,以供大家更好的理解。 17 | 18 | 19 | 20 | ## 2. 朴素贝叶斯的应用 21 | 22 | 23 | 下面是一组人类身体特征的统计资料。 24 | 25 |   性别  身高(英尺) 体重(磅)  脚掌(英寸) 26 | 27 |   男    6       180     12 28 |   男    5.92     190     11 29 |   男    5.58     170     12 30 |   男    5.92     165     10 31 |   女    5       100     6 32 |   女    5.5      150     8 33 |   女    5.42     130     7 34 |   女    5.75     150     9 35 | 36 | 已知某人身高6英尺、体重130磅,脚掌8英寸,请问该人是男是女? 37 | 38 | 根据朴素贝叶斯分类器,计算下面这个式子的值。 39 | 40 | P(身高|性别) x P(体重|性别) x P(脚掌|性别) x P(性别) 41 | 42 | 这里的困难在于,由于身高、体重、脚掌都是连续变量,不能采用离散变量的方法计算概率。而且由于样本太少,所以也无法分成区间计算。怎么办? 43 | 44 | 这时,可以假设男性和女性的身高、体重、脚掌都是正态分布,通过样本计算出均值和方差,也就是得到正态分布的密度函数。有了密度函数,就可以把值代入,算出某一点的密度函数的值。 45 | 46 | 比如,男性的身高是均值5.855、方差0.035的正态分布。所以,男性的身高为6英尺的概率的相对值等于1.5789(大于1并没有关系,因为这里是密度函数的值,只用来反映各个值的相对可能性)。 47 | 48 | 从上面的计算结果可以看出,分母都一样,因此,我们只需要比价分子的大小即可。显然,P(不转化|Mx上海)的分子大于P(转化|Mx上海)的分子,因此,这个上海男性用户的预测结果是不转化。这就是贝叶斯分类器的基本方法:在统计资料的基础上,依据某些特征,计算各个类别的概率,从而实现分类。 49 | 50 | $$ p(height|male) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(6-\mu)^2}{2 \sigma^2}} \approx 1.5789 $$ 51 | 52 | 有了这些数据以后,就可以计算性别的分类了。 53 | 54 |   P(身高=6|男) x P(体重=130|男) x P(脚掌=8|男) x P(男) 55 |     = 6.1984 x e-9 56 | 57 |   P(身高=6|女) x P(体重=130|女) x P(脚掌=8|女) x P(女) 58 |     = 5.3778 x e-4 59 | 60 | 可以看到,女性的概率比男性要高出将近10000倍,所以判断该人为女性。 61 | 62 | 63 | ```python 64 | from __future__ import division, print_function 65 | from sklearn import datasets 66 | import matplotlib.pyplot as plt 67 | import math 68 | import numpy as np 69 | import pandas as pd 70 | %matplotlib inline 71 | 72 | 73 | def shuffle_data(X, y, seed=None): 74 | if seed: 75 | np.random.seed(seed) 76 | 77 | idx = np.arange(X.shape[0]) 78 | np.random.shuffle(idx) 79 | 80 | return X[idx], y[idx] 81 | 82 | 83 | # 正规化数据集 X 84 | def normalize(X, axis=-1, p=2): 85 | lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis)) 86 | lp_norm[lp_norm == 0] = 1 87 | return X / np.expand_dims(lp_norm, axis) 88 | 89 | 90 | # 标准化数据集 X 91 | def standardize(X): 92 | X_std = np.zeros(X.shape) 93 | mean = X.mean(axis=0) 94 | std = X.std(axis=0) 95 | 96 | # 做除法运算时请永远记住分母不能等于0的情形 97 | # X_std = (X - X.mean(axis=0)) / X.std(axis=0) 98 | for col in range(np.shape(X)[1]): 99 | if std[col]: 100 | X_std[:, col] = (X_std[:, col] - mean[col]) / std[col] 101 | return X_std 102 | 103 | 104 | # 划分数据集为训练集和测试集 105 | def train_test_split(X, y, test_size=0.2, shuffle=True, seed=None): 106 | if shuffle: 107 | X, y = shuffle_data(X, y, seed) 108 | 109 | n_train_samples = int(X.shape[0] * (1-test_size)) 110 | x_train, x_test = X[:n_train_samples], X[n_train_samples:] 111 | y_train, y_test = y[:n_train_samples], y[n_train_samples:] 112 | 113 | return x_train, x_test, y_train, y_test 114 | 115 | 116 | def accuracy(y, y_pred): 117 | y = y.reshape(y.shape[0], -1) 118 | y_pred = y_pred.reshape(y_pred.shape[0], -1) 119 | return np.sum(y == y_pred)/len(y) 120 | 121 | 122 | 123 | 124 | class NaiveBayes(): 125 | """朴素贝叶斯分类模型. 
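    连续特征版本的实现思路: 假设每个类别下的每个特征都服从一维高斯分布;
    fit() 为每个 (类别, 特征) 组合估计均值和方差, classify() 在对数空间中
    把先验概率与各特征的高斯密度值累加, 返回后验最大的类别.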
""" 126 | def __init__(self): 127 | self.classes = None 128 | self.X = None 129 | self.y = None 130 | # 存储高斯分布的参数(均值, 方差), 因为预测的时候需要, 模型训练的过程中其实就是计算出 131 | # 所有高斯分布(因为朴素贝叶斯模型假设每个类别的样本集每个特征都服从高斯分布, 固有多个 132 | # 高斯分布)的参数 133 | self.parameters = [] 134 | 135 | def fit(self, X, y): 136 | self.X = X 137 | self.y = y 138 | self.classes = np.unique(y) 139 | # 计算每一个类别每个特征的均值和方差 140 | for i in range(len(self.classes)): 141 | c = self.classes[i] 142 | # 选出该类别的数据集 143 | x_where_c = X[np.where(y == c)] 144 | # 计算该类别数据集的均值和方差 145 | self.parameters.append([]) 146 | for j in range(len(x_where_c[0, :])): 147 | col = x_where_c[:, j] 148 | parameters = {} 149 | parameters["mean"] = col.mean() 150 | parameters["var"] = col.var() 151 | self.parameters[i].append(parameters) 152 | 153 | # 计算高斯分布密度函数的值 154 | def calculate_gaussian_probability(self, mean, var, x): 155 | coeff = (1.0 / (math.sqrt((2.0 * math.pi) * var))) 156 | exponent = math.exp(-(math.pow(x - mean, 2) / (2 * var))) 157 | return coeff * exponent 158 | 159 | # 计算先验概率 160 | def calculate_priori_probability(self, c): 161 | x_where_c = self.X[np.where(self.y == c)] 162 | n_samples_for_c = x_where_c.shape[0] 163 | n_samples = self.X.shape[0] 164 | return n_samples_for_c / n_samples 165 | 166 | # Classify using Bayes Rule, P(Y|X) = P(X|Y)*P(Y)/P(X) 167 | # P(X|Y) - Probability. Gaussian distribution (given by calculate_probability) 168 | # P(Y) - Prior (given by calculate_prior) 169 | # P(X) - Scales the posterior to the range 0 - 1 (ignored) 170 | # Classify the sample as the class that results in the largest P(Y|X) 171 | # (posterior) 172 | def classify(self, sample): 173 | posteriors = [] 174 | 175 | # 遍历所有类别 176 | for i in range(len(self.classes)): 177 | c = self.classes[i] 178 | prior = self.calculate_priori_probability(c) 179 | posterior = np.log(prior) 180 | 181 | # probability = P(Y)*P(x1|Y)*P(x2|Y)*...*P(xN|Y) 182 | # 遍历所有特征 183 | for j, params in enumerate(self.parameters[i]): 184 | # 取出第i个类别第j个特征的均值和方差 185 | mean = params["mean"] 186 | var = params["var"] 187 | # 取出预测样本的第j个特征 188 | sample_feature = sample[j] 189 | # 按照高斯分布的密度函数计算密度值 190 | prob = self.calculate_gaussian_probability(mean, var, sample_feature) 191 | # 朴素贝叶斯模型假设特征之间条件独立,即P(x1,x2,x3|Y) = P(x1|Y)*P(x2|Y)*P(x3|Y), 192 | # 并且用取对数的方法将累乘转成累加的形式 193 | posterior += np.log(prob) 194 | 195 | posteriors.append(posterior) 196 | 197 | # 对概率进行排序 198 | index_of_max = np.argmax(posteriors) 199 | max_value = posteriors[index_of_max] 200 | 201 | return self.classes[index_of_max] 202 | 203 | # 对数据集进行类别预测 204 | def predict(self, X): 205 | y_pred = [] 206 | for sample in X: 207 | y = self.classify(sample) 208 | y_pred.append(y) 209 | return np.array(y_pred) 210 | 211 | 212 | def main(): 213 | data = datasets.load_iris() 214 | X = normalize(data.data) 215 | y = data.target 216 | 217 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5) 218 | 219 | clf = NaiveBayes() 220 | clf.fit(X_train, y_train) 221 | y_pred = np.array(clf.predict(X_test)) 222 | 223 | accu = accuracy(y_test, y_pred) 224 | 225 | print ("Accuracy:", accu) 226 | 227 | 228 | if __name__ == "__main__": 229 | main() 230 | 231 | ``` 232 | 233 | Accuracy: 0.973333333333 234 | 235 | 236 | 参考文献: 237 | http://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html 238 | 239 | 李航《统计学习方法》 240 | -------------------------------------------------------------------------------- /naive_bayes(discrete features).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 
| "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "朴素贝叶斯算法是基于贝叶斯定理和特征之间条件独立假设的分类方法。对于给定的训练数据集,首先基于特征条件独立假设学习输入/输出的联合概率分布;然后基于此模型,对给定的输入x,利用贝叶斯定理求出后验概率最大的输出y。朴素贝叶斯算法实现简单,学习和预测的效率都很高,是一种常用的方法。\n", 9 | "\n", 10 | "本文考虑当特征是离散情形时,朴素贝叶斯分类方法的原理以及如何从零开始实现贝叶斯分类算法。\n", 11 | "\n", 12 | "\n", 13 | "\n", 14 | "\n", 15 | "\n", 16 | "## 1. 朴素贝叶斯算法原理\n", 17 | "\n", 18 | "\n", 19 | "### 1.1 贝叶斯定理\n", 20 | "\n", 21 | "\n", 22 | "根据贝叶斯定理,对一个分类问题,给定样本特征x,样本属于类别y的概率是 \n", 23 | "\n", 24 | "$$ p(y|x) = \\frac{p(y)p(x|y)}{p(x)} \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,(1) $$\n", 25 | "\n", 26 | "在这里,x是一个特征向量,设x维度为n。因为朴素的假设,即特征之间条件独立,根据全概率公式展开,公式(1)可以写成\n", 27 | "\n", 28 | "\n", 29 | "\\begin{align*}\n", 30 | "p(y=c_k|x=(x_1, x_2, ..., x_n)) \n", 31 | "& = \\frac{p(y=c_k)p(x|y=c_k)}{p(x)} \\\\\n", 32 | "& = \\frac{p(y=c_k)p(x=(x_1, x_2, ..., x_n)|y=c_k)}{p(x)} \\\\\n", 33 | "& = \\frac{p(y=c_k)p(x=x_1|y=c_k)p(x=x_2|y=c_k)...p(x=x_n|y=c_k)}{p(x)} \\\\\n", 34 | "& = \\frac{p(y=c_k) \\prod_{i=1}^{n} p(x=x_i|y=c_k)}{p(x)} \\\\\n", 35 | "& = \\frac{p(y=c_k) \\prod_{i=1}^{n} p(x=x_i|y=c_k)} { \\sum_{k=1}^{n_{labels}} p(x|y=c_k)p(y=c_k)} \\\\\n", 36 | "& = \\frac{p(y=c_k) \\prod_{i=1}^{n} p(x=x_i|y=c_k)} { \\sum_{k=1}^{n_{labels}} p(y=c_k)\\prod_{i=1}^{n} p(x=x_i|y=c_k)} \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, (2)\n", 37 | "\\end{align*}\n", 38 | "\n", 39 | "\n", 40 | "\n", 41 | "上式中$p(y=c_k)$称为先验概率,$ { \\prod_{i=1}^{n} p(x=x_i|y=c_k)} $称为似然度,$p(y=c_k|x=(x_1, x_2, ..., x_n)) $称为后验概率,显然,给定一个数据集,先验概率和似然度我们都可以计算出来,因此,后验概率 $p(y=c_k|x=(x_1, x_2, ..., x_n)) ( k = 1, 2, ..., n_{labels})$ 就可以得出,然后找出$\\{ p(y=c_1|x=(x_1, x_2, ..., x_n)), p(y=c_2|x=(x_1, x_2, ..., x_n)) , ..., p(y=c_{n_{labels}}|x=(x_1, x_2, ..., x_n)) \\}$中的最大值,其所对应的$c_k$就是算法预测的类别。\n", 42 | "\n", 43 | "\n", 44 | "\n", 45 | "\n", 46 | "\n", 47 | "### 1.2 朴素贝叶斯算法的参数学习\n", 48 | "\n", 49 | "\n", 50 | "\n", 51 | "下面介绍如何从数据中,学习得到朴素贝叶斯分类模型的参数。\n", 52 | "\n", 53 | "\n", 54 | "设训练集$X_{mxn}=\\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}), ..., (x^{(m)},y^{(m)})\\} $中有$m$个样本,其中 $x^{(i)}=(x^{(i)}_1,x^{(i)}_2,...,x^{(i)}_n)^T$是$n$维向量,$y^{(i)}∈\\{c_1,c_2,...c_{n_{labels}}\\}$属于$c_{n_{labels}}$类中的一类。\n", 55 | "\n", 56 | "首先,计算公式(2)中的$p(y=c_k)$, \n", 57 | "$$p(y=c_k)=\\sum_i^m \\frac{I(y^{(i)}=c_k)}{m} \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, (3)$$\n", 58 | "其中$I(x)$为指示函数,若括号内成立,则计1,否则为0。\n", 59 | "\n", 60 | "接下来计算分子中的条件概率,设$n$维特征的第$j$维有$L$个取值,则第$j$维特征的某个取值$a_{jl}$,在给定某分类$c_k$下的条件概率为:\n", 61 | " $$ p(x_j=a_{jl}|y=c_k)=\\frac{\\sum_i^m=I(x_{ji}=a_{jl},y^{(i)}=c_k)}{ \\sum_i^m I(y^{(i)}=c_k)} \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,(4)$$\n", 62 | "\n", 63 | "经过上述两个步骤,我们就可以得到模型的基本概率,也就完成了学习的任务。\n", 64 | "\n", 65 | "\n", 66 | "\n", 67 | "\n", 68 | "### 1.3 朴素贝叶斯算法的分类\n", 69 | "\n", 70 | "\n", 71 | "通过学到的概率,给定未分类新实例$x$,就可以通过上述概率进行计算,得到该实例属于各类的后验概率$p(y=c_k|x)$,因为对所有的类来说,公式(2)中分母的值都相同,所以只计算分子部分即可,具体步骤如下:\n", 72 | "\n", 73 | 
"1.计算该实例属于$y=c_k$类的概率\n", 74 | "$$p(y=c_k|x)=p(y=c_k)\\prod_{i=1}^{n} p(x=x_i|y=c_k) \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,(5) $$\n", 75 | "\n", 76 | "确定该实例所属的类别标签$y$\n", 77 | "$$y=arg \\, \\underset{c_k}{ max } p(y=c_k|x) \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, (6) $$\n", 78 | "\n", 79 | "这样我们就得到了新实例的分类结果。\n", 80 | "\n", 81 | "\n", 82 | "\n", 83 | "## 2. 朴素贝叶斯算法的应用\n", 84 | "\n", 85 | "\n", 86 | "前面讲了一大堆的公式,为了让大家更好的理解朴素贝叶斯算法,下面从一个具体的例子开始讲起,你会看到贝叶斯分类器非常简单。\n", 87 | "\n", 88 | "现在有一个数据集,总共5个样本,如下表。目标是预测用户是否会转化。\n", 89 | "\n", 90 | "\n", 91 | " id   性别   城市   转化\n", 92 | "\n", 93 | " 1    M  北京   1\n", 94 | "\n", 95 | " 2    F  上海   0\n", 96 | "\n", 97 | " 3   M   广州  1\n", 98 | "\n", 99 | " 4    M   北京  1\n", 100 | "\n", 101 | " 5    F  上海   0\n", 102 | "\n", 103 | "\n", 104 | "假设我们现在需要预测一个用户(上海男性)是否会转化?\n", 105 | "\n", 106 | "根据贝叶斯定理:\n", 107 | "\n", 108 | "  P(A|B) = P(B|A) P(A) / P(B)\n", 109 | "\n", 110 | "可得\n", 111 | "\n", 112 | "    P(转化|Mx上海)\n", 113 | "     = P(Mx上海|转化) x P(转化) / P(Mx上海)\n", 114 | "\n", 115 | "假定\"性别\"和\"城市\"这两个特征是独立的,因此,上面的等式就变成了\n", 116 | "\n", 117 | "    P(转化|Mx上海)\n", 118 | "     = (P(M|转化) x P(上海|转化) x P(转化)) / (P(M) x P(上海))\n", 119 | "     = (2/2 x 0/2 x 2/5) / (P(M) x P(上海))\n", 120 | "     = 0\n", 121 | "\n", 122 | "因此,这个上海男性用户转化的概率是0。同理,可以计算这个这个上海男性用户不转化的概率,即\n", 123 | "\n", 124 | "    P(不转化|Mx上海)\n", 125 | "     = (P(M|不转化) x P(上海|不转化) x P(不转化)) / (P(M) x P(上海))\n", 126 | "     = (1/3 x 2/3 x 3/5) / (P(M) x P(上海))\n", 127 | "\n", 128 | "从上面的计算结果可以看出,分母都一样,因此,我们只需要比价分子的大小即可。显然,P(不转化|Mx上海)的分子大于P(转化|Mx上海)的分子,因此,这个上海男性用户的预测结果是不转化。这就是贝叶斯分类器的基本方法:在统计资料的基础上,依据某些特征,计算各个类别的概率,从而实现分类。\n", 129 | "\n", 130 | "\n", 131 | "\n", 132 | "## 3. 
朴素贝叶斯算法的注意事项\n", 133 | "\n", 134 | "上一节(第2节)可以看出,P(转化|Mx上海)=0,也就是说,我们得到了‘上海男人不转化’这个结论,显然这是不合理的,之所以出现了这样不合理的结果,是因为我们的训练集中没有上海男人有过转化,也许是我们的数据集收集太少或者其它原因导致的。所以我们应该对我们的结果进行修正。通常的修正方法是用拉普拉斯平滑方法。\n", 135 | "\n", 136 | "### 3.1 拉普拉斯平滑\n", 137 | "\n", 138 | "\n", 139 | "拉普拉斯平滑思想非常简单,其就是给学习步骤中的两个概率计算公式(3)(4)中的分子和分母都分别加上一个常数,就可以避免这个问题。更新过后的公式如下:\n", 140 | "\n", 141 | "$$p(y=c_k)=\\sum_i^m \\frac{I(y^{(i)}=c_k) + \\lambda}{m + \\lambda n_{labels}} \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, (7)$$\n", 142 | "\n", 143 | "其中,$n_{labels}$是类别标签的个数,$\\lambda \\geq 0$是一个常数。\n", 144 | "\n", 145 | "$$ p(x_j=a_{jl}|y=c_k)=\\frac{\\sum_i^m=I(x_{ji}=a_{jl},y^{(i)}=c_k) + \\lambda}{ \\sum_i^m I(y^{(i)}=c_k) + \\lambda L_j} \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\, \\,\\,\\,\\,\\,\\,\\,\\,\\,\\,\\,(8)$$\n", 146 | "\n", 147 | "其中,$L_j$是$X[y==c_k]$数据集中第$j$维特征的不同特征值的个数,用$Python$脚本来描述,则$L_j=len(np.unique(X[y==c_k][:, j]))$。\n", 148 | "\n", 149 | "显然,对于任意的$j=1,2,..., L_j, \\, k=1,2,...,n_{labels}$,有\n", 150 | "$$ p(x_j=a_{jl}|y=c_k)>0, \\,\\,\\,\\,\\,\\, \\sum_{l=1}^{L_j} p(x_j=a_{jl}|y=c_k)=1 $$\n", 151 | "\n", 152 | "\n", 153 | "这说明(8)式仍然是概率,同理可知(7)式也是概率。平滑因子λ=0即为(3)(4)实现的最大似然估计,这时会出现前文提到的0概率问题;而λ=1则避免了0概率问题,这种方法被称为拉普拉斯平滑。\n", 154 | "\n", 155 | "下面,我们就用拉普拉斯平滑修正第2节中的实例:\n", 156 | "\n", 157 | "\n", 158 | "    P(转化|Mx上海)\n", 159 | "     = (P(M|转化) x P(上海|转化) x P(转化)) / (P(M) x P(上海))\n", 160 | "     = (3/3 x 1/4 x 3/7) / (P(M) x P(上海))\n", 161 | " \n", 162 | " \n", 163 | "    P(不转化|Mx上海)\n", 164 | "     = (P(M|不转化) x P(上海|不转化) x P(不转化)) / (P(M) x P(上海))\n", 165 | "     = (2/5 x 3/5 x 4/7) / (P(M) x P(上海))\n", 166 | " \n", 167 | "\n", 168 | "\n", 169 | "### 3.2 概率连乘的问题\n", 170 | "\n", 171 | "在具体的实现代码过程中,我们有一点需要注意,就是很多个大于0且小于1的数累乘最终会接近于0,由于计算机的位数限制,所以多个概率相乘之后的数值在计算机内存中保存的数值是0。针对这个问题,我们通常将概率连乘式子取对数,将累乘的运算转换成累加的运算,这样就避免了上述的问题。\n", 172 | "\n", 173 | "\n", 174 | "\n", 175 | "## 4. 
朴素贝叶斯算法的优缺点\n", 176 | "\n", 177 | "\n", 178 | "朴素贝叶斯的主要优点有:\n", 179 | "\n", 180 | "* 朴素贝叶斯模型发源于古典数学理论,有稳定的分类效率。\n", 181 | "\n", 182 | "* 对小规模的数据表现很好,能个处理多分类任务,适合增量式训练,尤其是数据量超出内存时,我们可以一批批的去增量训练。\n", 183 | "\n", 184 | "* 对缺失数据不太敏感,算法也比较简单,常用于文本分类。\n", 185 | "\n", 186 | "\n", 187 | "朴素贝叶斯的主要缺点有:\n", 188 | "\n", 189 | "* 理论上,朴素贝叶斯模型与其他分类方法相比具有最小的误差率。但是实际上并非总是如此,这是因为朴素贝叶斯模型假设属性之间相互独立,这个假设在实际应用中往往是不成立的,在属性个数比较多或者属性之间相关性较大时,分类效果不好。而在属性相关性较小时,朴素贝叶斯性能最为良好。对于这一点,有半朴素贝叶斯之类的算法通过考虑部分关联性适度改进。\n", 190 | "\n", 191 | "* 需要知道先验概率,且先验概率很多时候取决于假设,假设的模型可以有很多种,因此在某些时候会由于假设的先验模型的原因导致预测效果不佳。\n", 192 | "\n", 193 | "* 由于我们是通过先验和数据来决定后验的概率从而决定分类,所以分类决策存在一定的错误率。\n", 194 | "\n", 195 | "* 对输入数据的表达形式很敏感。\n", 196 | "\n", 197 | "* 对于连续特征,假设了每个类别标签的数据集中的每个特征均服从高斯分布。" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 1, 203 | "metadata": { 204 | "scrolled": false 205 | }, 206 | "outputs": [ 207 | { 208 | "name": "stdout", 209 | "output_type": "stream", 210 | "text": [ 211 | "Accuracy: 1.0\n" 212 | ] 213 | } 214 | ], 215 | "source": [ 216 | "from __future__ import division, print_function\n", 217 | "from sklearn import datasets\n", 218 | "import matplotlib.pyplot as plt\n", 219 | "import math\n", 220 | "import numpy as np\n", 221 | "import pandas as pd\n", 222 | "%matplotlib inline\n", 223 | "\n", 224 | "\n", 225 | "def shuffle_data(X, y, seed=None):\n", 226 | " if seed:\n", 227 | " np.random.seed(seed)\n", 228 | "\n", 229 | " idx = np.arange(X.shape[0])\n", 230 | " np.random.shuffle(idx)\n", 231 | "\n", 232 | " return X[idx], y[idx]\n", 233 | "\n", 234 | "\n", 235 | "# 正规化数据集 X\n", 236 | "def normalize(X, axis=-1, p=2):\n", 237 | " lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis))\n", 238 | " lp_norm[lp_norm == 0] = 1\n", 239 | " return X / np.expand_dims(lp_norm, axis)\n", 240 | "\n", 241 | "\n", 242 | "# 标准化数据集 X\n", 243 | "def standardize(X):\n", 244 | " X_std = np.zeros(X.shape)\n", 245 | " mean = X.mean(axis=0)\n", 246 | " std = X.std(axis=0)\n", 247 | " \n", 248 | " # 做除法运算时请永远记住分母不能等于0的情形\n", 249 | " # X_std = (X - X.mean(axis=0)) / X.std(axis=0) \n", 250 | " for col in range(np.shape(X)[1]):\n", 251 | " if std[col]:\n", 252 | " X_std[:, col] = (X_std[:, col] - mean[col]) / std[col]\n", 253 | " return X_std\n", 254 | "\n", 255 | "\n", 256 | "# 划分数据集为训练集和测试集\n", 257 | "def train_test_split(X, y, test_size=0.2, shuffle=True, seed=None):\n", 258 | " if shuffle:\n", 259 | " X, y = shuffle_data(X, y, seed)\n", 260 | "\n", 261 | " n_train_samples = int(X.shape[0] * (1-test_size))\n", 262 | " x_train, x_test = X[:n_train_samples], X[n_train_samples:]\n", 263 | " y_train, y_test = y[:n_train_samples], y[n_train_samples:]\n", 264 | "\n", 265 | " return x_train, x_test, y_train, y_test\n", 266 | "\n", 267 | "\n", 268 | "def accuracy(y, y_pred):\n", 269 | " y = y.reshape(y.shape[0], -1)\n", 270 | " y_pred = y_pred.reshape(y_pred.shape[0], -1)\n", 271 | " return np.sum(y == y_pred)/len(y)\n", 272 | "\n", 273 | "\n", 274 | "\n", 275 | "\n", 276 | "class NaiveBayes():\n", 277 | " \"\"\"朴素贝叶斯分类模型. 
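离散特征版本: fit() 统计各类别下每个特征取值的出现频率并做拉普拉斯平滑(以对数形式存储), classify() 累加先验与各条件概率的对数, 取后验最大的类别.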
\"\"\"\n", 278 | " def __init__(self):\n", 279 | " self.classes = None\n", 280 | " self.X = None\n", 281 | " self.y = None\n", 282 | " # 存储每个类别标签数据集中每个特征中每个特征值的出现概率, 因为预测的时候需要, 模型训练的过程中其实就是计算出这些概率\n", 283 | " self.parameters = []\n", 284 | "\n", 285 | " def fit(self, X, y):\n", 286 | " self.X = X\n", 287 | " self.y = y\n", 288 | " self.classes = np.unique(y)\n", 289 | " # 遍历所有类别的数据集,计算每一个类别数据集每个特征中每个特征值的出现概率\n", 290 | " for i in range(len(self.classes)):\n", 291 | " c = self.classes[i]\n", 292 | " # 选出该类别的数据集\n", 293 | " x_where_c = X[np.where(y == c)]\n", 294 | " \n", 295 | " self.parameters.append([])\n", 296 | " # 遍历该类别数据的所有特征,计算该类别数据集每个特征中每个特征值的出现概率\n", 297 | " for j in range(x_where_c.shape[1]):\n", 298 | " feautre_values_where_c_j = np.unique(x_where_c[:, j])\n", 299 | " \n", 300 | " parameters = {}\n", 301 | " # 遍历整个训练数据集该特征的所有特征值(如果遍历该类别数据集x_where_c中该特征的所有特征值, \n", 302 | " # 则每列的特征值都不全,因此整个数据集X中存在但是不在x_where_c中的特征值将得不到其概率,\n", 303 | " # feautre_values_where_c_j), 计算该类别数据集该特征中每个特征值的出现概率\n", 304 | " for feature_value in X[:, j]: # feautre_values_where_c_j\n", 305 | " n_feature_value = x_where_c[x_where_c[:, j]==feature_value].shape[0]\n", 306 | " # 用Laplance平滑对概率进行修正, 并且用取对数的方法将累乘转成累加的形式\n", 307 | " parameters[feature_value] = np.log((n_feature_value + 1) / \n", 308 | " (x_where_c.shape[0] + len(feautre_values_where_c_j)))\n", 309 | " self.parameters[i].append(parameters)\n", 310 | " \n", 311 | " # 计算先验概率\n", 312 | " def calculate_priori_probability(self, c):\n", 313 | " x_where_c = self.X[np.where(self.y == c)]\n", 314 | " n_samples_for_c = x_where_c.shape[0]\n", 315 | " n_samples = self.X.shape[0]\n", 316 | " return (n_samples_for_c + 1) / (n_samples + len(self.classes))\n", 317 | "\n", 318 | " def classify(self, sample):\n", 319 | " posteriors = []\n", 320 | " \n", 321 | " # 遍历所有类别\n", 322 | " for i in range(len(self.classes)):\n", 323 | " c = self.classes[i]\n", 324 | " prior = self.calculate_priori_probability(c)\n", 325 | " posterior = np.log(prior)\n", 326 | " \n", 327 | " # probability = P(Y)*P(x1|Y)*P(x2|Y)*...*P(xN|Y)\n", 328 | " # 遍历所有特征\n", 329 | " for j, params in enumerate(self.parameters[i]):\n", 330 | " # 取出预测样本的第j个特征\n", 331 | " sample_feature = sample[j]\n", 332 | " # 取出参数中第i个类别第j个特征特征值为sample_feature的概率, 如果测试集中的样本\n", 333 | " # 有特征值没有出现, 则假设该特征值的概率为1/self.X.shape[0]\n", 334 | " proba = params.get(sample_feature, np.log(1/self.X.shape[0]))\n", 335 | " \n", 336 | " # 朴素贝叶斯模型假设特征之间条件独立,即P(x1,x2,x3|Y) = P(x1|Y)*P(x2|Y)*P(x3|Y)\n", 337 | " posterior += proba\n", 338 | " \n", 339 | " posteriors.append(posterior)\n", 340 | " \n", 341 | " # 对概率进行排序\n", 342 | " index_of_max = np.argmax(posteriors)\n", 343 | " max_value = posteriors[index_of_max]\n", 344 | "\n", 345 | " return self.classes[index_of_max]\n", 346 | "\n", 347 | " # 对数据集进行类别预测\n", 348 | " def predict(self, X):\n", 349 | " y_pred = []\n", 350 | " for sample in X:\n", 351 | " y = self.classify(sample)\n", 352 | " y_pred.append(y)\n", 353 | " return np.array(y_pred)\n", 354 | "\n", 355 | "\n", 356 | " \n", 357 | "def main():\n", 358 | " X = np.array([['M','北京'], ['F', '上海'], ['M' ,'广州'], ['M' ,'北京'], \n", 359 | " ['F' ,'上海'], ['M','北京'], ['F', '上海'], ['M' ,'广州'], \n", 360 | " ['M' ,'北京'], ['F' ,'上海']])\n", 361 | " y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])\n", 362 | "\n", 363 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6)\n", 364 | "\n", 365 | " clf = NaiveBayes()\n", 366 | " clf.fit(X_train, y_train)\n", 367 | " y_pred = np.array(clf.predict(X_test))\n", 368 | "\n", 369 | " accu = 
accuracy(y_test, y_pred)\n", 370 | "\n", 371 | " print (\"Accuracy:\", accu)\n", 372 | "\n", 373 | "\n", 374 | "if __name__ == \"__main__\":\n", 375 | " main()\n" 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "参考文献:\n", 383 | "http://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html\n", 384 | "\n", 385 | "李航《统计学习方法》" 386 | ] 387 | } 388 | ], 389 | "metadata": { 390 | "anaconda-cloud": {}, 391 | "kernelspec": { 392 | "display_name": "Python 3", 393 | "language": "python", 394 | "name": "python3" 395 | }, 396 | "language_info": { 397 | "codemirror_mode": { 398 | "name": "ipython", 399 | "version": 3 400 | }, 401 | "file_extension": ".py", 402 | "mimetype": "text/x-python", 403 | "name": "python", 404 | "nbconvert_exporter": "python", 405 | "pygments_lexer": "ipython3", 406 | "version": "3.6.1" 407 | } 408 | }, 409 | "nbformat": 4, 410 | "nbformat_minor": 2 411 | } 412 | -------------------------------------------------------------------------------- /naive_bayes(discrete features).md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 朴素贝叶斯算法是基于贝叶斯定理和特征之间条件独立假设的分类方法。对于给定的训练数据集,首先基于特征条件独立假设学习输入/输出的联合概率分布;然后基于此模型,对给定的输入x,利用贝叶斯定理求出后验概率最大的输出y。朴素贝叶斯算法实现简单,学习和预测的效率都很高,是一种常用的方法。 4 | 5 | 本文考虑当特征是离散情形时,朴素贝叶斯分类方法的原理以及如何从零开始实现贝叶斯分类算法。 6 | 7 | 8 | 9 | 10 | 11 | ## 1. 朴素贝叶斯算法原理 12 | 13 | 14 | ### 1.1 贝叶斯定理 15 | 16 | 17 | 根据贝叶斯定理,对一个分类问题,给定样本特征x,样本属于类别y的概率是 18 | 19 | $$ p(y|x) = \frac{p(y)p(x|y)}{p(x)} \,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,(1) $$ 20 | 21 | 在这里,x是一个特征向量,设x维度为n。因为朴素的假设,即特征之间条件独立,根据全概率公式展开,公式(1)可以写成 22 | 23 | 24 | \begin{align*} 25 | p(y=c_k|x=(x_1, x_2, ..., x_n)) 26 | & = \frac{p(y=c_k)p(x|y=c_k)}{p(x)} \\ 27 | & = \frac{p(y=c_k)p(x=(x_1, x_2, ..., x_n)|y=c_k)}{p(x)} \\ 28 | & = \frac{p(y=c_k)p(x=x_1|y=c_k)p(x=x_2|y=c_k)...p(x=x_n|y=c_k)}{p(x)} \\ 29 | & = \frac{p(y=c_k) \prod_{i=1}^{n} p(x=x_i|y=c_k)}{p(x)} \\ 30 | & = \frac{p(y=c_k) \prod_{i=1}^{n} p(x=x_i|y=c_k)} { \sum_{k=1}^{n_{labels}} p(x|y=c_k)p(y=c_k)} \\ 31 | & = \frac{p(y=c_k) \prod_{i=1}^{n} p(x=x_i|y=c_k)} { \sum_{k=1}^{n_{labels}} p(y=c_k)\prod_{i=1}^{n} p(x=x_i|y=c_k)} \,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\, (2) 32 | \end{align*} 33 | 34 | 35 | 36 | 上式中$p(y=c_k)$称为先验概率,$ { \prod_{i=1}^{n} p(x=x_i|y=c_k)} $称为似然度,$p(y=c_k|x=(x_1, x_2, ..., x_n)) $称为后验概率,显然,给定一个数据集,先验概率和似然度我们都可以计算出来,因此,后验概率 $p(y=c_k|x=(x_1, x_2, ..., x_n)) ( k = 1, 2, ..., n_{labels})$ 就可以得出,然后找出$\{ p(y=c_1|x=(x_1, x_2, ..., x_n)), p(y=c_2|x=(x_1, x_2, ..., x_n)) , ..., p(y=c_{n_{labels}}|x=(x_1, x_2, ..., x_n)) \}$中的最大值,其所对应的$c_k$就是算法预测的类别。 37 | 38 | 39 | 40 | 41 | 42 | ### 1.2 朴素贝叶斯算法的参数学习 43 | 44 | 45 | 46 | 下面介绍如何从数据中,学习得到朴素贝叶斯分类模型的参数。 47 | 48 | 49 | 设训练集$X_{mxn}=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}), ..., (x^{(m)},y^{(m)})\} $中有$m$个样本,其中 $x^{(i)}=(x^{(i)}_1,x^{(i)}_2,...,x^{(i)}_n)^T$是$n$维向量,$y^{(i)}∈\{c_1,c_2,...c_{n_{labels}}\}$属于$c_{n_{labels}}$类中的一类。 50 | 51 | 首先,计算公式(2)中的$p(y=c_k)$, 52 | $$p(y=c_k)=\sum_i^m \frac{I(y^{(i)}=c_k)}{m} \,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\, (3)$$ 53 | 其中$I(x)$为指示函数,若括号内成立,则计1,否则为0。 54 | 55 | 接下来计算分子中的条件概率,设$n$维特征的第$j$维有$L$个取值,则第$j$维特征的某个取值$a_{jl}$,在给定某分类$c_k$下的条件概率为: 56 | $$ 
p(x_j=a_{jl}|y=c_k)=\frac{\sum_i^m=I(x_{ji}=a_{jl},y^{(i)}=c_k)}{ \sum_i^m I(y^{(i)}=c_k)} \,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,(4)$$ 57 | 58 | 经过上述两个步骤,我们就可以得到模型的基本概率,也就完成了学习的任务。 59 | 60 | 61 | 62 | 63 | ### 1.3 朴素贝叶斯算法的分类 64 | 65 | 66 | 通过学到的概率,给定未分类新实例$x$,就可以通过上述概率进行计算,得到该实例属于各类的后验概率$p(y=c_k|x)$,因为对所有的类来说,公式(2)中分母的值都相同,所以只计算分子部分即可,具体步骤如下: 67 | 68 | 1.计算该实例属于$y=c_k$类的概率 69 | $$p(y=c_k|x)=p(y=c_k)\prod_{i=1}^{n} p(x=x_i|y=c_k) \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,(5) $$ 70 | 71 | 确定该实例所属的类别标签$y$ 72 | $$y=arg \, \underset{c_k}{ max } p(y=c_k|x) \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, (6) $$ 73 | 74 | 这样我们就得到了新实例的分类结果。 75 | 76 | 77 | 78 | ## 2. 朴素贝叶斯算法的应用 79 | 80 | 81 | 前面讲了一大堆的公式,为了让大家更好的理解朴素贝叶斯算法,下面从一个具体的例子开始讲起,你会看到贝叶斯分类器非常简单。 82 | 83 | 现在有一个数据集,总共5个样本,如下表。目标是预测用户是否会转化。 84 | 85 | 86 | id   性别   城市   转化 87 | 88 | 1    M  北京   1 89 | 90 | 2    F  上海   0 91 | 92 | 3   M   广州  1 93 | 94 | 4    M   北京  1 95 | 96 | 5    F  上海   0 97 | 98 | 99 | 假设我们现在需要预测一个用户(上海男性)是否会转化? 100 | 101 | 根据贝叶斯定理: 102 | 103 |  P(A|B) = P(B|A) P(A) / P(B) 104 | 105 | 可得 106 | 107 |    P(转化|Mx上海) 108 |     = P(Mx上海|转化) x P(转化) / P(Mx上海) 109 | 110 | 假定"性别"和"城市"这两个特征是独立的,因此,上面的等式就变成了 111 | 112 |    P(转化|Mx上海) 113 |     = (P(M|转化) x P(上海|转化) x P(转化)) / (P(M) x P(上海)) 114 |     = (2/2 x 0/2 x 2/5) / (P(M) x P(上海)) 115 |     = 0 116 | 117 | 因此,这个上海男性用户转化的概率是0。同理,可以计算这个这个上海男性用户不转化的概率,即 118 | 119 |    P(不转化|Mx上海) 120 |     = (P(M|不转化) x P(上海|不转化) x P(不转化)) / (P(M) x P(上海)) 121 |     = (1/3 x 2/3 x 3/5) / (P(M) x P(上海)) 122 | 123 | 从上面的计算结果可以看出,分母都一样,因此,我们只需要比价分子的大小即可。显然,P(不转化|Mx上海)的分子大于P(转化|Mx上海)的分子,因此,这个上海男性用户的预测结果是不转化。这就是贝叶斯分类器的基本方法:在统计资料的基础上,依据某些特征,计算各个类别的概率,从而实现分类。 124 | 125 | 126 | 127 | ## 3. 
朴素贝叶斯算法的注意事项 128 | 129 | 上一节(第2节)可以看出,P(转化|Mx上海)=0,也就是说,我们得到了‘上海男人不转化’这个结论,显然这是不合理的,之所以出现了这样不合理的结果,是因为我们的训练集中没有上海男人有过转化,也许是我们的数据集收集太少或者其它原因导致的。所以我们应该对我们的结果进行修正。通常的修正方法是用拉普拉斯平滑方法。 130 | 131 | ### 3.1 拉普拉斯平滑 132 | 133 | 134 | 拉普拉斯平滑思想非常简单,其就是给学习步骤中的两个概率计算公式(3)(4)中的分子和分母都分别加上一个常数,就可以避免这个问题。更新过后的公式如下: 135 | 136 | $$p(y=c_k)=\sum_i^m \frac{I(y^{(i)}=c_k) + \lambda}{m + \lambda n_{labels}} \,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\, (7)$$ 137 | 138 | 其中,$n_{labels}$是类别标签的个数,$\lambda \geq 0$是一个常数。 139 | 140 | $$ p(x_j=a_{jl}|y=c_k)=\frac{\sum_i^m=I(x_{ji}=a_{jl},y^{(i)}=c_k) + \lambda}{ \sum_i^m I(y^{(i)}=c_k) + \lambda L_j} \,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \,\,\,\,\,\,\,\,\,\,\,(8)$$ 141 | 142 | 其中,$L_j$是$X[y==c_k]$数据集中第$j$维特征的不同特征值的个数,用$Python$脚本来描述,则$L_j=len(np.unique(X[y==c_k][:, j]))$。 143 | 144 | 显然,对于任意的$j=1,2,..., L_j, \, k=1,2,...,n_{labels}$,有 145 | $$ p(x_j=a_{jl}|y=c_k)>0, \,\,\,\,\,\, \sum_{l=1}^{L_j} p(x_j=a_{jl}|y=c_k)=1 $$ 146 | 147 | 148 | 这说明(8)式仍然是概率,同理可知(7)式也是概率。平滑因子λ=0即为(3)(4)实现的最大似然估计,这时会出现前文提到的0概率问题;而λ=1则避免了0概率问题,这种方法被称为拉普拉斯平滑。 149 | 150 | 下面,我们就用拉普拉斯平滑修正第2节中的实例: 151 | 152 | 153 |    P(转化|Mx上海) 154 |     = (P(M|转化) x P(上海|转化) x P(转化)) / (P(M) x P(上海)) 155 |     = (3/3 x 1/4 x 3/7) / (P(M) x P(上海)) 156 | 157 | 158 |    P(不转化|Mx上海) 159 |     = (P(M|不转化) x P(上海|不转化) x P(不转化)) / (P(M) x P(上海)) 160 |     = (2/5 x 3/5 x 4/7) / (P(M) x P(上海)) 161 | 162 | 163 | 164 | ### 3.2 概率连乘的问题 165 | 166 | 在具体的实现代码过程中,我们有一点需要注意,就是很多个大于0且小于1的数累乘最终会接近于0,由于计算机的位数限制,所以多个概率相乘之后的数值在计算机内存中保存的数值是0。针对这个问题,我们通常将概率连乘式子取对数,将累乘的运算转换成累加的运算,这样就避免了上述的问题。 167 | 168 | 169 | 170 | ## 4. 
朴素贝叶斯算法的优缺点 171 | 172 | 173 | 朴素贝叶斯的主要优点有: 174 | 175 | * 朴素贝叶斯模型发源于古典数学理论,有稳定的分类效率。 176 | 177 | * 对小规模的数据表现很好,能个处理多分类任务,适合增量式训练,尤其是数据量超出内存时,我们可以一批批的去增量训练。 178 | 179 | * 对缺失数据不太敏感,算法也比较简单,常用于文本分类。 180 | 181 | 182 | 朴素贝叶斯的主要缺点有: 183 | 184 | * 理论上,朴素贝叶斯模型与其他分类方法相比具有最小的误差率。但是实际上并非总是如此,这是因为朴素贝叶斯模型假设属性之间相互独立,这个假设在实际应用中往往是不成立的,在属性个数比较多或者属性之间相关性较大时,分类效果不好。而在属性相关性较小时,朴素贝叶斯性能最为良好。对于这一点,有半朴素贝叶斯之类的算法通过考虑部分关联性适度改进。 185 | 186 | * 需要知道先验概率,且先验概率很多时候取决于假设,假设的模型可以有很多种,因此在某些时候会由于假设的先验模型的原因导致预测效果不佳。 187 | 188 | * 由于我们是通过先验和数据来决定后验的概率从而决定分类,所以分类决策存在一定的错误率。 189 | 190 | * 对输入数据的表达形式很敏感。 191 | 192 | * 对于连续特征,假设了每个类别标签的数据集中的每个特征均服从高斯分布。 193 | 194 | 195 | ```python 196 | from __future__ import division, print_function 197 | from sklearn import datasets 198 | import matplotlib.pyplot as plt 199 | import math 200 | import numpy as np 201 | import pandas as pd 202 | %matplotlib inline 203 | 204 | 205 | def shuffle_data(X, y, seed=None): 206 | if seed: 207 | np.random.seed(seed) 208 | 209 | idx = np.arange(X.shape[0]) 210 | np.random.shuffle(idx) 211 | 212 | return X[idx], y[idx] 213 | 214 | 215 | # 正规化数据集 X 216 | def normalize(X, axis=-1, p=2): 217 | lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis)) 218 | lp_norm[lp_norm == 0] = 1 219 | return X / np.expand_dims(lp_norm, axis) 220 | 221 | 222 | # 标准化数据集 X 223 | def standardize(X): 224 | X_std = np.zeros(X.shape) 225 | mean = X.mean(axis=0) 226 | std = X.std(axis=0) 227 | 228 | # 做除法运算时请永远记住分母不能等于0的情形 229 | # X_std = (X - X.mean(axis=0)) / X.std(axis=0) 230 | for col in range(np.shape(X)[1]): 231 | if std[col]: 232 | X_std[:, col] = (X_std[:, col] - mean[col]) / std[col] 233 | return X_std 234 | 235 | 236 | # 划分数据集为训练集和测试集 237 | def train_test_split(X, y, test_size=0.2, shuffle=True, seed=None): 238 | if shuffle: 239 | X, y = shuffle_data(X, y, seed) 240 | 241 | n_train_samples = int(X.shape[0] * (1-test_size)) 242 | x_train, x_test = X[:n_train_samples], X[n_train_samples:] 243 | y_train, y_test = y[:n_train_samples], y[n_train_samples:] 244 | 245 | return x_train, x_test, y_train, y_test 246 | 247 | 248 | def accuracy(y, y_pred): 249 | y = y.reshape(y.shape[0], -1) 250 | y_pred = y_pred.reshape(y_pred.shape[0], -1) 251 | return np.sum(y == y_pred)/len(y) 252 | 253 | 254 | 255 | 256 | class NaiveBayes(): 257 | """朴素贝叶斯分类模型. 
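    离散特征版本的实现思路: fit() 统计每个类别下各特征取值的出现频率,
    并用拉普拉斯平滑修正后取对数存入 self.parameters; classify() 把先验的
    对数与各特征对应的对数条件概率相加, 返回后验最大的类别.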
""" 258 | def __init__(self): 259 | self.classes = None 260 | self.X = None 261 | self.y = None 262 | # 存储每个类别标签数据集中每个特征中每个特征值的出现概率, 因为预测的时候需要, 模型训练的过程中其实就是计算出这些概率 263 | self.parameters = [] 264 | 265 | def fit(self, X, y): 266 | self.X = X 267 | self.y = y 268 | self.classes = np.unique(y) 269 | # 遍历所有类别的数据集,计算每一个类别数据集每个特征中每个特征值的出现概率 270 | for i in range(len(self.classes)): 271 | c = self.classes[i] 272 | # 选出该类别的数据集 273 | x_where_c = X[np.where(y == c)] 274 | 275 | self.parameters.append([]) 276 | # 遍历该类别数据的所有特征,计算该类别数据集每个特征中每个特征值的出现概率 277 | for j in range(x_where_c.shape[1]): 278 | feautre_values_where_c_j = np.unique(x_where_c[:, j]) 279 | 280 | parameters = {} 281 | # 遍历整个训练数据集该特征的所有特征值(如果遍历该类别数据集x_where_c中该特征的所有特征值, 282 | # 则每列的特征值都不全,因此整个数据集X中存在但是不在x_where_c中的特征值将得不到其概率, 283 | # feautre_values_where_c_j), 计算该类别数据集该特征中每个特征值的出现概率 284 | for feature_value in X[:, j]: # feautre_values_where_c_j 285 | n_feature_value = x_where_c[x_where_c[:, j]==feature_value].shape[0] 286 | # 用Laplance平滑对概率进行修正, 并且用取对数的方法将累乘转成累加的形式 287 | parameters[feature_value] = np.log((n_feature_value + 1) / 288 | (x_where_c.shape[0] + len(feautre_values_where_c_j))) 289 | self.parameters[i].append(parameters) 290 | 291 | # 计算先验概率 292 | def calculate_priori_probability(self, c): 293 | x_where_c = self.X[np.where(self.y == c)] 294 | n_samples_for_c = x_where_c.shape[0] 295 | n_samples = self.X.shape[0] 296 | return (n_samples_for_c + 1) / (n_samples + len(self.classes)) 297 | 298 | def classify(self, sample): 299 | posteriors = [] 300 | 301 | # 遍历所有类别 302 | for i in range(len(self.classes)): 303 | c = self.classes[i] 304 | prior = self.calculate_priori_probability(c) 305 | posterior = np.log(prior) 306 | 307 | # probability = P(Y)*P(x1|Y)*P(x2|Y)*...*P(xN|Y) 308 | # 遍历所有特征 309 | for j, params in enumerate(self.parameters[i]): 310 | # 取出预测样本的第j个特征 311 | sample_feature = sample[j] 312 | # 取出参数中第i个类别第j个特征特征值为sample_feature的概率, 如果测试集中的样本 313 | # 有特征值没有出现, 则假设该特征值的概率为1/self.X.shape[0] 314 | proba = params.get(sample_feature, np.log(1/self.X.shape[0])) 315 | 316 | # 朴素贝叶斯模型假设特征之间条件独立,即P(x1,x2,x3|Y) = P(x1|Y)*P(x2|Y)*P(x3|Y) 317 | posterior += proba 318 | 319 | posteriors.append(posterior) 320 | 321 | # 对概率进行排序 322 | index_of_max = np.argmax(posteriors) 323 | max_value = posteriors[index_of_max] 324 | 325 | return self.classes[index_of_max] 326 | 327 | # 对数据集进行类别预测 328 | def predict(self, X): 329 | y_pred = [] 330 | for sample in X: 331 | y = self.classify(sample) 332 | y_pred.append(y) 333 | return np.array(y_pred) 334 | 335 | 336 | 337 | def main(): 338 | X = np.array([['M','北京'], ['F', '上海'], ['M' ,'广州'], ['M' ,'北京'], 339 | ['F' ,'上海'], ['M','北京'], ['F', '上海'], ['M' ,'广州'], 340 | ['M' ,'北京'], ['F' ,'上海']]) 341 | y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0]) 342 | 343 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6) 344 | 345 | clf = NaiveBayes() 346 | clf.fit(X_train, y_train) 347 | y_pred = np.array(clf.predict(X_test)) 348 | 349 | accu = accuracy(y_test, y_pred) 350 | 351 | print ("Accuracy:", accu) 352 | 353 | 354 | if __name__ == "__main__": 355 | main() 356 | 357 | ``` 358 | 359 | Accuracy: 1.0 360 | 361 | 362 | 参考文献: 363 | http://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html 364 | 365 | 李航《统计学习方法》 366 | -------------------------------------------------------------------------------- /regression.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Allensmile/Machine-learning-implement/HEAD/regression.zip 
-------------------------------------------------------------------------------- /smote.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "本文将主要详细介绍一下SMOTE(Synthetic Minority Oversampling Technique)算法从原理到代码实践,SMOTE主要是用来解决类不平衡问题的,在讲解SMOTE算法之前,我们先解释一下什么是类不平衡问题、为什么类不平衡带来的问题以及相应的解决方法。\n", 8 | "\n", 9 | "\n", 10 | "## 1. 什么是类不平衡问题\n", 11 | "\n", 12 | "类不平衡(class-imbalance)是指在训练分类器中所使用的训练集的类别分布不均。比如说一个二分类问题,1000个训练样本,比较理想的情况是正类、负类样本的数量相差不多;而如果正类样本有995个、负类样本仅5个,就意味着存在类不平衡。\n", 13 | "\n", 14 | "在后文中,把样本数量过少的类别称为“少数类”。\n", 15 | "\n", 16 | "但实际上,数据集上的类不平衡到底有没有达到需要特殊处理的程度,还要看不处理时训练出来的模型在验证集上的效果。有些时候是没必要处理的。\n", 17 | "\n", 18 | "\n", 19 | "## 2. 类不平衡引发的问题\n", 20 | "\n", 21 | "\n", 22 | "### 2.1 从模型的训练过程来看\n", 23 | "\n", 24 | "从训练模型的角度来说,如果某类的样本数量很少,那么这个类别所提供的“信息”就太少。\n", 25 | "\n", 26 | "使用经验风险(模型在训练集上的平均损失)最小化作为模型的学习准则。设损失函数为0-1 loss(这是一种典型的均等代价的损失函数),那么优化目标就等价于错误率最小化(也就是accuracy最大化)。考虑极端情况:1000个训练样本中,正类样本999个,负类样本1个。训练过程中在某次迭代结束后,模型把所有的样本都分为正类,虽然分错了这个负类,但是所带来的损失实在微不足道,accuracy已经是99.9%,于是满足停机条件或者达到最大迭代次数之后自然没必要再优化下去,ok,到此为止,训练结束!于是这个模型……\n", 27 | "\n", 28 | "模型没有学习到如何去判别出少数类,这时候模型的召回率会非常低。\n", 29 | "\n", 30 | "\n", 31 | "\n", 32 | "### 2.2 从模型的预测过程来看\n", 33 | "\n", 34 | "考虑二项Logistic回归模型。输入一个样本 x ,模型输出的是其属于正类的概率 ŷ 。当 ŷ >0.5 时,模型判定该样本属于正类,否则就是属于反类。\n", 35 | "\n", 36 | "为什么是0.5呢?可以认为模型是出于最大后验概率决策的角度考虑的,选择了0.5意味着当模型估计的样本属于正类的后验概率要大于样本属于负类的后验概率时就将样本判为正类。但实际上,这个后验概率的估计值是否准确呢?\n", 37 | "\n", 38 | "从几率(odds)的角度考虑:几率表达的是样本属于正类的可能性与属于负类的可能性的比值。模型对于样本的预测几率为 $\\frac{ŷ} {1−ŷ}$ 。\n", 39 | "\n", 40 | "模型在做出决策时,当然希望能够遵循真实样本总体的正负类样本分布:设 θ 等于正类样本数除以全部样本数,那么样本的真实几率为 $\\frac{θ}{1−θ}$ 。当观测几率大于真实几率时,也就是 $ŷ >θ$ 时,那么就判定这个样本属于正类。\n", 41 | "\n", 42 | "虽然我们无法获悉真实样本总体,但之于训练集,存在这样一个假设:训练集是真实样本总体的无偏采样。正是因为这个假设,所以认为训练集的观测几率$\\frac { \\hat{\\theta}}{1−\\hat{ \\theta }}$就代表了真实几率 $\\frac{θ}{1−θ}$ 。\n", 43 | "\n", 44 | "所以,在这个假设下,当一个样本的预测几率大于观测几率时,就应该将样本判断为正类。\n", 45 | "\n", 46 | "\n", 47 | "## 3. 解决类不平衡问题的方法\n", 48 | "\n", 49 | "目前主要有三种办法:\n", 50 | "\n", 51 | "### 3.1 调整 θ 值\n", 52 | "\n", 53 | "根据训练集的正负样本比例,调整 θ 值。   \n", 54 | "\n", 55 | "这样做的依据是上面所述的对训练集的假设。但在给定任务中,这个假设是否成立,还有待讨论。\n", 56 | "\n", 57 | "\n", 58 | "### 3.2 过采样\n", 59 | "\n", 60 | "对训练集里面样本数量较少的类别(少数类)进行过采样,合成新的样本来缓解类不平衡。下面将介绍一种经典的过采样算法:SMOTE。\n", 61 | "\n", 62 | "### 3.3 欠采样\n", 63 | "\n", 64 | "对训练集里面样本数量较多的类别(多数类)进行欠采样,抛弃一些样本来缓解类不平衡。\n", 65 | "\n", 66 | "\n", 67 | "\n", 68 | "## 4. 
SMOTE算法原理\n", 69 | "\n", 70 | "SMOTE,合成少数类过采样技术.它是基于随机过采样算法的一种改进方案,由于随机过采样采取简单复制样本的策略来增加少数类样本,这样容易产生模型过拟合的问题,即使得模型学习到的信息过于特别(Specific)而不够泛化(General),SMOTE算法的基本思想是对少数类样本进行分析并根据少数类样本人工合成新样本添加到数据集中,算法流程如下。\n", 71 | "\n", 72 | "### 4.1 SMOTE算法流程\n", 73 | "\n", 74 | "对于正样本数据集X(minority class samples),遍历每一个样本:\n", 75 | "\n", 76 | "$\\,\\,\\,\\,\\,\\,$ (1) 对于少数类(X)中每一个样本x,计算它到少数类样本集(X)中所有样本的距离,得到其k近邻。\n", 77 | " \n", 78 | "$\\,\\,\\,\\,\\,\\,$ (2) 根据样本不平衡比例设置一个采样比例以确定采样倍率sampling_rate,对于每一个少数类样本x,从其k近邻中随机选择sampling_rate个近邻,假设选择的近邻为 ${x^{(1)}, x^{(2)}, ..., x^{(sampling\\_rate)}}$ 。\n", 79 | " \n", 80 | "$\\,\\,\\,\\,\\,\\,$ (3) 对于每一个随机选出的近邻 $x^{(i)} \\, (i=1,2, ..., {sampling\\_rate}) $,分别与原样本按照如下的公式构建新的样本\n", 81 | "\n", 82 | "$$ x_{new} = x + rand(0, 1) * (x^{(i)} - x) $$\n", 83 | "\n", 84 | "\n", 85 | "### 4.2 SMOTE算法代码实现\n", 86 | "\n", 87 | "下面我们就用代码来实现一下:" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 1, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "[[ 2.55355825 3.55355825 5.33033738]\n", 100 | " [ 3. 4.89432435 2.42270262]\n", 101 | " [ 2. 2. 1. ]\n", 102 | " [ 3. 5. 2. ]\n", 103 | " [ 5. 3. 4. ]\n", 104 | " [ 3. 3.85514586 5.85514586]\n", 105 | " [ 1. 2. 3. ]\n", 106 | " [ 3. 4. 6. ]\n", 107 | " [ 2. 2. 1. ]\n", 108 | " [ 3. 5. 2. ]\n", 109 | " [ 5. 3. 4. ]\n", 110 | " [ 3. 2. 4. ]]\n" 111 | ] 112 | } 113 | ], 114 | "source": [ 115 | "import random\n", 116 | "from sklearn.neighbors import NearestNeighbors\n", 117 | "import numpy as np\n", 118 | "\n", 119 | "class Smote:\n", 120 | " \"\"\"\n", 121 | " SMOTE过采样算法.\n", 122 | "\n", 123 | "\n", 124 | " Parameters:\n", 125 | " -----------\n", 126 | " k: int\n", 127 | " 选取的近邻数目.\n", 128 | " sampling_rate: int\n", 129 | " 采样倍数, attention sampling_rate < k.\n", 130 | " newindex: int\n", 131 | " 生成的新样本(合成样本)的索引号.\n", 132 | " \"\"\"\n", 133 | " def __init__(self, sampling_rate=5, k=5):\n", 134 | " self.sampling_rate = sampling_rate\n", 135 | " self.k = k\n", 136 | " self.newindex = 0\n", 137 | "\n", 138 | " def fit(self, X, y=None):\n", 139 | " if y is not None:\n", 140 | " negative_X = X[y==0]\n", 141 | " X = X[y==1]\n", 142 | " \n", 143 | " n_samples, n_features = X.shape\n", 144 | " # 初始化一个矩阵, 用来存储合成样本\n", 145 | " self.synthetic = np.zeros((n_samples * self.sampling_rate, n_features))\n", 146 | " \n", 147 | " # 找出正样本集(数据集X)中的每一个样本在数据集X中的k个近邻\n", 148 | " knn = NearestNeighbors(n_neighbors=self.k).fit(X)\n", 149 | " for i in range(len(X)):\n", 150 | " k_neighbors = knn.kneighbors(X[i].reshape(1,-1), \n", 151 | " return_distance=False)[0]\n", 152 | " # 对正样本集(minority class samples)中每个样本, 分别根据其k个近邻生成\n", 153 | " # sampling_rate个新的样本\n", 154 | " self.synthetic_samples(X, i, k_neighbors)\n", 155 | " \n", 156 | " if y is not None:\n", 157 | " return ( np.concatenate((self.synthetic, X, negative_X), axis=0), \n", 158 | " np.concatenate(([1]*(len(self.synthetic)+len(X)), y[y==0]), axis=0) )\n", 159 | " \n", 160 | " return np.concatenate((self.synthetic, X), axis=0)\n", 161 | "\n", 162 | "\n", 163 | " # 对正样本集(minority class samples)中每个样本, 分别根据其k个近邻生成sampling_rate个新的样本\n", 164 | " def synthetic_samples(self, X, i, k_neighbors):\n", 165 | " for j in range(self.sampling_rate):\n", 166 | " # 从k个近邻里面随机选择一个近邻\n", 167 | " neighbor = np.random.choice(k_neighbors)\n", 168 | " # 计算样本X[i]与刚刚选择的近邻的差\n", 169 | " diff = X[neighbor] - X[i]\n", 170 | " # 生成新的数据\n", 171 | " self.synthetic[self.newindex] = X[i] + random.random() * diff\n", 172 | " self.newindex += 1\n", 173 | 
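"# 用法说明(另见下方示例): fit(X) 只对 X 做过采样, 返回 [合成样本; 原样本];\n",
 "# fit(X, y) 则把 y==1 视为少数类, 先对其过采样, 再连同 y==0 的多数类样本和对应标签一起返回.\n",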
" \n", 174 | "X=np.array([[1,2,3],[3,4,6],[2,2,1],[3,5,2],[5,3,4],[3,2,4]])\n", 175 | "y = np.array([1, 1, 1, 0, 0, 0])\n", 176 | "smote=Smote(sampling_rate=1, k=5)\n", 177 | "print(smote.fit(X))" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "### 4.3 SMOTE算法的缺陷\n", 185 | "\n", 186 | "该算法主要存在两方面的问题:一是在近邻选择时,存在一定的盲目性。从上面的算法流程可以看出,在算法执行过程中,需要确定k值,即选择多少个近邻样本,这需要用户自行解决。从k值的定义可以看出,k值的下限是sampling_rate(sampling_rate为从k个近邻中随机挑选出的近邻样本的个数,且有 sampling_rate < k ), sampling_rate的大小可以根据负类样本数量、正类样本数量和数据集最后需要达到的平衡率决定。但k值的上限没有办法确定,只能根据具体的数据集去反复测试。因此如何确定k值,才能使算法达到最优这是未知的。\n", 187 | "\n", 188 | "另外,该算法无法克服非平衡数据集的数据分布问题,容易产生分布边缘化问题。由于正类样本(少数类样本)的分布决定了其可选择的近邻,如果一个正类样本处在正类样本集的分布边缘,则由此正类样本和相邻样本产生的“人造”样本也会处在这个边缘,且会越来越边缘化,从而模糊了正类样本和负类样本的边界,而且使边界变得越来越模糊。这种边界模糊性,虽然使数据集的平衡性得到了改善,但加大了分类算法进行分类的难度。\n", 189 | "\n", 190 | "针对SMOTE算法的进一步改进\n", 191 | "\n", 192 | "针对SMOTE算法存在的边缘化和盲目性等问题,很多人纷纷提出了新的改进办法,在一定程度上改进了算法的性能,但还存在许多需要解决的问题。\n", 193 | "\n", 194 | "Han等人Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning 在SMOTE算法基础上进行了改进,提出了Borderhne.SMOTE算法,解决了生成样本重叠(Overlapping)的问题该算法在运行的过程中,查找一个适当的区域,该区域可以较好地反应数据集的性质,然后在该区域内进行插值,以使新增加的“人造”样本更有效。这个适当的区域一般由经验给定,因此算法在执行的过程中有一定的局限性。\n", 195 | "\n", 196 | "\n", 197 | "\n", 198 | "参考文献:\n", 199 | "\n", 200 | "http://www.cnblogs.com/Determined22/p/5772538.html\n", 201 | "\n", 202 | "smote算法的论文地址:https://www.jair.org/media/953/live-953-2037-jair.pdf\n", 203 | "\n", 204 | "http://blog.csdn.net/yaphat/article/details/60347968\n", 205 | "\n", 206 | "http://blog.csdn.net/Yaphat/article/details/52463304?locationNum=7#0-tsina-1-78137-397232819ff9a47a7b7e80a40613cfe1\n" 207 | ] 208 | } 209 | ], 210 | "metadata": { 211 | "kernelspec": { 212 | "display_name": "Python 3", 213 | "language": "python", 214 | "name": "python3" 215 | }, 216 | "language_info": { 217 | "codemirror_mode": { 218 | "name": "ipython", 219 | "version": 3 220 | }, 221 | "file_extension": ".py", 222 | "mimetype": "text/x-python", 223 | "name": "python", 224 | "nbconvert_exporter": "python", 225 | "pygments_lexer": "ipython3", 226 | "version": "3.6.1" 227 | } 228 | }, 229 | "nbformat": 4, 230 | "nbformat_minor": 2 231 | } 232 | -------------------------------------------------------------------------------- /smote.md: -------------------------------------------------------------------------------- 1 | 2 | 本文将主要详细介绍一下SMOTE(Synthetic Minority Oversampling Technique)算法从原理到代码实践,SMOTE主要是用来解决类不平衡问题的,在讲解SMOTE算法之前,我们先解释一下什么是类不平衡问题、为什么类不平衡带来的问题以及相应的解决方法。 3 | 4 | 5 | ## 1. 什么是类不平衡问题 6 | 7 | 类不平衡(class-imbalance)是指在训练分类器中所使用的训练集的类别分布不均。比如说一个二分类问题,1000个训练样本,比较理想的情况是正类、负类样本的数量相差不多;而如果正类样本有995个、负类样本仅5个,就意味着存在类不平衡。 8 | 9 | 在后文中,把样本数量过少的类别称为“少数类”。 10 | 11 | 但实际上,数据集上的类不平衡到底有没有达到需要特殊处理的程度,还要看不处理时训练出来的模型在验证集上的效果。有些时候是没必要处理的。 12 | 13 | 14 | ## 2. 类不平衡引发的问题 15 | 16 | 17 | ### 2.1 从模型的训练过程来看 18 | 19 | 从训练模型的角度来说,如果某类的样本数量很少,那么这个类别所提供的“信息”就太少。 20 | 21 | 使用经验风险(模型在训练集上的平均损失)最小化作为模型的学习准则。设损失函数为0-1 loss(这是一种典型的均等代价的损失函数),那么优化目标就等价于错误率最小化(也就是accuracy最大化)。考虑极端情况:1000个训练样本中,正类样本999个,负类样本1个。训练过程中在某次迭代结束后,模型把所有的样本都分为正类,虽然分错了这个负类,但是所带来的损失实在微不足道,accuracy已经是99.9%,于是满足停机条件或者达到最大迭代次数之后自然没必要再优化下去,ok,到此为止,训练结束!于是这个模型…… 22 | 23 | 模型没有学习到如何去判别出少数类,这时候模型的召回率会非常低。 24 | 25 | 26 | 27 | ### 2.2 从模型的预测过程来看 28 | 29 | 考虑二项Logistic回归模型。输入一个样本 x ,模型输出的是其属于正类的概率 ŷ 。当 ŷ >0.5 时,模型判定该样本属于正类,否则就是属于反类。 30 | 31 | 为什么是0.5呢?可以认为模型是出于最大后验概率决策的角度考虑的,选择了0.5意味着当模型估计的样本属于正类的后验概率要大于样本属于负类的后验概率时就将样本判为正类。但实际上,这个后验概率的估计值是否准确呢? 
32 | 33 | 从几率(odds)的角度考虑:几率表达的是样本属于正类的可能性与属于负类的可能性的比值。模型对于样本的预测几率为 $\frac{ŷ} {1−ŷ}$ 。 34 | 35 | 模型在做出决策时,当然希望能够遵循真实样本总体的正负类样本分布:设 θ 等于正类样本数除以全部样本数,那么样本的真实几率为 $\frac{θ}{1−θ}$ 。当观测几率大于真实几率时,也就是 $ŷ >θ$ 时,那么就判定这个样本属于正类。 36 | 37 | 虽然我们无法获悉真实样本总体,但之于训练集,存在这样一个假设:训练集是真实样本总体的无偏采样。正是因为这个假设,所以认为训练集的观测几率$\frac { \hat{\theta}}{1−\hat{ \theta }}$就代表了真实几率 $\frac{θ}{1−θ}$ 。 38 | 39 | 所以,在这个假设下,当一个样本的预测几率大于观测几率时,就应该将样本判断为正类。 40 | 41 | 42 | ## 3. 解决类不平衡问题的方法 43 | 44 | 目前主要有三种办法: 45 | 46 | ### 3.1 调整 θ 值 47 | 48 | 根据训练集的正负样本比例,调整 θ 值。    49 | 50 | 这样做的依据是上面所述的对训练集的假设。但在给定任务中,这个假设是否成立,还有待讨论。 51 | 52 | 53 | ### 3.2 过采样 54 | 55 | 对训练集里面样本数量较少的类别(少数类)进行过采样,合成新的样本来缓解类不平衡。下面将介绍一种经典的过采样算法:SMOTE。 56 | 57 | ### 3.3 欠采样 58 | 59 | 对训练集里面样本数量较多的类别(多数类)进行欠采样,抛弃一些样本来缓解类不平衡。 60 | 61 | 62 | 63 | ## 4. SMOTE算法原理 64 | 65 | SMOTE,合成少数类过采样技术.它是基于随机过采样算法的一种改进方案,由于随机过采样采取简单复制样本的策略来增加少数类样本,这样容易产生模型过拟合的问题,即使得模型学习到的信息过于特别(Specific)而不够泛化(General),SMOTE算法的基本思想是对少数类样本进行分析并根据少数类样本人工合成新样本添加到数据集中,算法流程如下。 66 | 67 | ### 4.1 SMOTE算法流程 68 | 69 | 对于正样本数据集X(minority class samples),遍历每一个样本: 70 | 71 | $\,\,\,\,\,\,$ (1) 对于少数类(X)中每一个样本x,计算它到少数类样本集(X)中所有样本的距离,得到其k近邻。 72 | 73 | $\,\,\,\,\,\,$ (2) 根据样本不平衡比例设置一个采样比例以确定采样倍率sampling_rate,对于每一个少数类样本x,从其k近邻中随机选择sampling_rate个近邻,假设选择的近邻为 ${x^{(1)}, x^{(2)}, ..., x^{(sampling\_rate)}}$ 。 74 | 75 | $\,\,\,\,\,\,$ (3) 对于每一个随机选出的近邻 $x^{(i)} \, (i=1,2, ..., {sampling\_rate}) $,分别与原样本按照如下的公式构建新的样本 76 | 77 | $$ x_{new} = x + rand(0, 1) * (x^{(i)} - x) $$ 78 | 79 | 80 | ### 4.2 SMOTE算法代码实现 81 | 82 | 下面我们就用代码来实现一下: 83 | 84 | 85 | ```python 86 | import random 87 | from sklearn.neighbors import NearestNeighbors 88 | import numpy as np 89 | 90 | class Smote: 91 | """ 92 | SMOTE过采样算法. 93 | 94 | 95 | Parameters: 96 | ----------- 97 | k: int 98 | 选取的近邻数目. 99 | sampling_rate: int 100 | 采样倍数, attention sampling_rate < k. 101 | newindex: int 102 | 生成的新样本(合成样本)的索引号. 103 | """ 104 | def __init__(self, sampling_rate=5, k=5): 105 | self.sampling_rate = sampling_rate 106 | self.k = k 107 | self.newindex = 0 108 | 109 | def fit(self, X, y=None): 110 | if y is not None: 111 | negative_X = X[y==0] 112 | X = X[y==1] 113 | 114 | n_samples, n_features = X.shape 115 | # 初始化一个矩阵, 用来存储合成样本 116 | self.synthetic = np.zeros((n_samples * self.sampling_rate, n_features)) 117 | 118 | # 找出正样本集(数据集X)中的每一个样本在数据集X中的k个近邻 119 | knn = NearestNeighbors(n_neighbors=self.k).fit(X) 120 | for i in range(len(X)): 121 | k_neighbors = knn.kneighbors(X[i].reshape(1,-1), 122 | return_distance=False)[0] 123 | # 对正样本集(minority class samples)中每个样本, 分别根据其k个近邻生成 124 | # sampling_rate个新的样本 125 | self.synthetic_samples(X, i, k_neighbors) 126 | 127 | if y is not None: 128 | return ( np.concatenate((self.synthetic, X, negative_X), axis=0), 129 | np.concatenate(([1]*(len(self.synthetic)+len(X)), y[y==0]), axis=0) ) 130 | 131 | return np.concatenate((self.synthetic, X), axis=0) 132 | 133 | 134 | # 对正样本集(minority class samples)中每个样本, 分别根据其k个近邻生成sampling_rate个新的样本 135 | def synthetic_samples(self, X, i, k_neighbors): 136 | for j in range(self.sampling_rate): 137 | # 从k个近邻里面随机选择一个近邻 138 | neighbor = np.random.choice(k_neighbors) 139 | # 计算样本X[i]与刚刚选择的近邻的差 140 | diff = X[neighbor] - X[i] 141 | # 生成新的数据 142 | self.synthetic[self.newindex] = X[i] + random.random() * diff 143 | self.newindex += 1 144 | 145 | X=np.array([[1,2,3],[3,4,6],[2,2,1],[3,5,2],[5,3,4],[3,2,4]]) 146 | y = np.array([1, 1, 1, 0, 0, 0]) 147 | smote=Smote(sampling_rate=1, k=5) 148 | print(smote.fit(X)) 149 | ``` 150 | 151 | [[ 2.55355825 3.55355825 5.33033738] 152 | [ 3. 4.89432435 2.42270262] 153 | [ 2. 2. 
1. ] 154 | [ 3. 5. 2. ] 155 | [ 5. 3. 4. ] 156 | [ 3. 3.85514586 5.85514586] 157 | [ 1. 2. 3. ] 158 | [ 3. 4. 6. ] 159 | [ 2. 2. 1. ] 160 | [ 3. 5. 2. ] 161 | [ 5. 3. 4. ] 162 | [ 3. 2. 4. ]] 163 | 164 | 165 | ### 4.3 SMOTE算法的缺陷 166 | 167 | 该算法主要存在两方面的问题:一是在近邻选择时,存在一定的盲目性。从上面的算法流程可以看出,在算法执行过程中,需要确定k值,即选择多少个近邻样本,这需要用户自行解决。从k值的定义可以看出,k值的下限是sampling_rate(sampling_rate为从k个近邻中随机挑选出的近邻样本的个数,且有 sampling_rate < k ), sampling_rate的大小可以根据负类样本数量、正类样本数量和数据集最后需要达到的平衡率决定。但k值的上限没有办法确定,只能根据具体的数据集去反复测试。因此如何确定k值,才能使算法达到最优这是未知的。 168 | 169 | 另外,该算法无法克服非平衡数据集的数据分布问题,容易产生分布边缘化问题。由于正类样本(少数类样本)的分布决定了其可选择的近邻,如果一个正类样本处在正类样本集的分布边缘,则由此正类样本和相邻样本产生的“人造”样本也会处在这个边缘,且会越来越边缘化,从而模糊了正类样本和负类样本的边界,而且使边界变得越来越模糊。这种边界模糊性,虽然使数据集的平衡性得到了改善,但加大了分类算法进行分类的难度。 170 | 171 | 针对SMOTE算法的进一步改进 172 | 173 | 针对SMOTE算法存在的边缘化和盲目性等问题,很多人纷纷提出了新的改进办法,在一定程度上改进了算法的性能,但还存在许多需要解决的问题。 174 | 175 | Han等人Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning 在SMOTE算法基础上进行了改进,提出了Borderhne.SMOTE算法,解决了生成样本重叠(Overlapping)的问题该算法在运行的过程中,查找一个适当的区域,该区域可以较好地反应数据集的性质,然后在该区域内进行插值,以使新增加的“人造”样本更有效。这个适当的区域一般由经验给定,因此算法在执行的过程中有一定的局限性。 176 | 177 | 178 | 179 | 参考文献: 180 | 181 | http://www.cnblogs.com/Determined22/p/5772538.html 182 | 183 | smote算法的论文地址:https://www.jair.org/media/953/live-953-2037-jair.pdf 184 | 185 | http://blog.csdn.net/yaphat/article/details/60347968 186 | 187 | http://blog.csdn.net/Yaphat/article/details/52463304?locationNum=7#0-tsina-1-78137-397232819ff9a47a7b7e80a40613cfe1 188 | 189 | --------------------------------------------------------------------------------