├── AdaBoost
│   └── AdaBoost_sklearn.py
├── README.md
├── data
│   ├── test.csv
│   ├── train.csv
│   └── train_binary.csv
├── decision_tree
│   ├── C45.py
│   ├── ID3.py
│   └── decision_tree_sklearn.py
├── imgs
│   ├── Adaboost_sklearn_result_1.png
│   ├── Adaboost_sklearn_result_2.png
│   ├── C45_result.png
│   ├── ID3_result.png
│   ├── decision_tree_sklearn_result.png
│   ├── knn_result.png
│   ├── knn_sklearn_result.png
│   ├── logistic_regression_result.png
│   ├── logistic_regression_sklearn_result.png
│   ├── maxEnt_result.png
│   ├── naive_bayes_result.png
│   ├── naive_bayes_sklearn_result.png
│   ├── perceptron_result.png
│   ├── perceptron_sklearn_result.png
│   └── svm_sklearn_result.png
├── knn
│   ├── knn.py
│   └── knn_sklearn.py
├── logistic_regression
│   ├── logistic_regression.py
│   └── logistic_regression_sklearn.py
├── maxEnt
│   └── maxEnt.py
├── naive_bayes
│   ├── naive_bayes.py
│   └── naive_bayes_sklearn.py
├── perceptron
│   ├── perceptron.py
│   └── perceptron_sklearn.py
└── svm
    └── svm_sklearn.py

--------------------------------------------------------------------------------
/AdaBoost/AdaBoost_sklearn.py:
--------------------------------------------------------------------------------

# encoding=utf-8

import pandas as pd
import time

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.ensemble import AdaBoostClassifier

if __name__ == '__main__':

    print("Start read data...")
    time_1 = time.time()

    raw_data = pd.read_csv('../data/train_binary.csv', header=0)
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Randomly hold out 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    print('Start training...')
    # n_estimators is the number of weak classifiers to combine;
    # algorithm is one of {'SAMME', 'SAMME.R'} (default 'SAMME.R'): 'SAMME.R' uses real boosting, 'SAMME' uses discrete boosting
    clf = AdaBoostClassifier(n_estimators=100, algorithm='SAMME.R')
    clf.fit(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting...')
    test_predict = clf.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# lihang_algorithms
Python and sklearn implementations of the algorithms from Hang Li's "Statistical Learning Methods" (《统计学习方法》)

<br>Experiment data: the MNIST dataset; here we use the preprocessed version from Kaggle
<br>Official download: http://yann.lecun.com/exdb/mnist/
<br>Preprocessed data on Kaggle: https://www.kaggle.com/c/digit-recognizer/data

* * *

## Chapter 2: Perceptron
Applicable problem: binary classification
<br>Experiment data: since the perceptron is a binary classifier, the label column of the MNIST dataset [train.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train.csv) was tweaked slightly: labels equal to 0 stay 0, and labels greater than 0 become 1, turning the ten-class data into two-class data. It is available as [train_binary.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train_binary.csv); a conversion sketch follows.
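
A minimal sketch of how such a file can be produced (assuming the Kaggle CSV names its label column `label`):

```python
import pandas as pd

# One-off conversion: collapse the ten MNIST classes into two
df = pd.read_csv('data/train.csv', header=0)
df['label'] = (df['label'] > 0).astype(int)  # 0 stays 0, 1-9 become 1
df.to_csv('data/train_binary.csv', index=False)
```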

<br>Code: [perceptron/perceptron.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/perceptron/perceptron.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/perceptron_result.png)

<br>Code (sklearn implementation): [perceptron/perceptron_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/perceptron/perceptron_sklearn.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/perceptron_sklearn_result.png)

## Chapter 3: k-Nearest Neighbors
Applicable problem: multi-class classification
<br>Three basic elements: the choice of k, the distance metric, and the classification decision rule (sketched below)
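
A minimal sketch of how the three elements fit together (Euclidean distance as the metric, a fixed k, majority vote as the decision rule; a simplified version of the loop in knn.py, assuming numpy arrays):

```python
import numpy as np

def knn_predict(test_vec, train_set, train_labels, k=10):
    # Distance metric: Euclidean distance to every training point
    dists = np.linalg.norm(train_set - test_vec, axis=1)
    # Choice of k: keep the indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Decision rule: majority vote among the neighbours' labels
    return np.bincount(train_labels[nearest]).argmax()
```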

<br>Code: [knn/knn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/knn/knn.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/knn_result.png)

<br>Code (sklearn implementation): [knn/knn_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/knn/knn_sklearn.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/knn_sklearn_result.png)

## Chapter 4: Naive Bayes
Applicable problem: multi-class classification
<br>Based on Bayes' theorem and the assumption of conditional independence between features
<br>The three commonly used models are (sklearn equivalents are sketched after this list):
- Gaussian model: for features that are continuous variables
- Multinomial model: the most common; requires discrete features
- Bernoulli model: requires discrete, boolean features, i.e. true/false or 1/0
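
A sketch of the sklearn equivalents (parameter values are illustrative):

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

clf_gaussian = GaussianNB()                  # continuous features
clf_multinomial = MultinomialNB(alpha=1.0)   # discrete features, Laplace smoothing
clf_bernoulli = BernoulliNB(binarize=0.5)    # boolean features, binarized at 0.5
```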
<br>Code (multinomial model): [naive_bayes/naive_bayes.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/naive_bayes/naive_bayes.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/naive_bayes_result.png)

<br>Code (multinomial model, sklearn implementation): [naive_bayes/naive_bayes_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/naive_bayes/naive_bayes_sklearn.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/naive_bayes_sklearn_result.png)

## Chapter 5: Decision Trees
Applicable problem: multi-class classification
<br>Three steps: feature selection, decision tree generation, and decision tree pruning
<br>Common decision tree algorithms (the first two split criteria are sketched after this list):
- **ID3**: splits features by **information gain**
- **C4.5**: splits features by **information gain ratio**
- **CART**: splits features by the **Gini index**
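
The ID3 and C4.5 criteria differ by a single normalization, mirroring the `calc_ent`/`calc_condition_ent` helpers in ID3.py and C45.py (a sketch assuming those helpers are in scope):

```python
def info_gain(x, y):            # ID3: g(D, A) = H(D) - H(D|A)
    return calc_ent(y) - calc_condition_ent(x, y)

def info_gain_ratio(x, y):      # C4.5: g_R(D, A) = g(D, A) / H_A(D)
    h_a = calc_ent(x)
    return info_gain(x, y) / h_a if h_a != 0 else 0.0
```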
<br>ID3 algorithm code: [decision_tree/ID3.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/decision_tree/ID3.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/ID3_result.png)

<br>C4.5 algorithm code: [decision_tree/C45.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/decision_tree/C45.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/C45_result.png)

<br>CART algorithm code (sklearn implementation): [decision_tree/decision_tree_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/decision_tree/decision_tree_sklearn.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/decision_tree_sklearn_result.png)

## Chapter 6: Logistic Regression
### Binomial logistic regression
Applicable problem: binary classification
<br>Comparable to the perceptron algorithm (its per-sample update is sketched below)
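
A sketch of the per-sample gradient step used in logistic_regression.py (x is the feature vector extended with a trailing 1 for the bias, y ∈ {0, 1}):

```python
import math

def sgd_step(w, x, y, lr=0.0001):
    wx = sum(w[i] * x[i] for i in range(len(w)))
    while wx > 700:      # same overflow guard as the repo code
        wx /= 2
    p = math.exp(wx) / (1 + math.exp(wx))  # P(y=1|x), i.e. sigmoid(w·x)
    # Gradient ascent on the log-likelihood: w <- w + lr * (y - p) * x
    return [w[i] + lr * (y - p) * x[i] for i in range(len(w))]
```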
<br>Experiment data: [train_binary.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train_binary.csv)
<br>Code: [logistic_regression/logistic_regression.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/logistic_regression/logistic_regression.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/logistic_regression_result.png)
### (Multinomial) logistic regression
Applicable problem: multi-class classification
<br>Experiment data: [train.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train.csv)
<br>Code (sklearn implementation): [logistic_regression/logistic_regression_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/logistic_regression/logistic_regression_sklearn.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/logistic_regression_sklearn_result.png)

## Chapter 6: Maximum Entropy Model
Applicable problem: multi-class classification
<br>Below, the maximum entropy model is learned with the Improved Iterative Scaling (IIS) algorithm, with the feature function defined as the indicator function f(x, y) = 1 if the training sample contains feature x with label y, and f(x, y) = 0 otherwise
81 |
<br>Unlike the other classifiers, the x in the maximum entropy model's f(x,y) is a single feature rather than an n-dimensional feature vector, so we need to add a distinguishing tag to each feature dimension; e.g. X=(x0,x1,x2,...) becomes X=(0_x0,1_x1,2_x2,...), as sketched below
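
The tagging mirrors `rebuild_features` in maxEnt/maxEnt.py; a compact sketch:

```python
def rebuild_features(features):
    # Prefix each value with its dimension index, so equal values in
    # different positions become distinct features: 0_x0, 1_x1, 2_x2, ...
    return [[str(i) + '_' + str(f) for i, f in enumerate(feature)]
            for feature in features]
```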

<br>Code: [maxEnt/maxEnt.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/maxEnt/maxEnt.py)
<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/maxEnt_result.png)

## Chapter 7: Support Vector Machines
Applicable problem: binary classification
<br>Experiment data: the two-class data [train_binary.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train_binary.csv)
<br>SVM has three models, from simple to complex:
- When the training data are linearly separable, hard-margin maximization learns a **hard-margin SVM**, also called a **linearly separable SVM**
- When the training data are approximately linearly separable, soft-margin maximization learns a **soft-margin SVM**, also called a **linear SVM**
- When the training data are not linearly separable, soft-margin maximization together with the **kernel trick** learns a **nonlinear SVM**
<br>Code (sklearn implementation): [svm/svm_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/svm/svm_sklearn.py)
<br>*Note: decomposition strategies such as OvO and OvR can extend SVM (or any other binary classifier) to multi-class problems; sklearn already implements this, as sketched below*
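
An explicit one-vs-rest decomposition looks like this (a sketch; sklearn's SVC already handles multi-class input internally via one-vs-one):

```python
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(svm.SVC())  # one binary SVM per class
clf.fit(train_features, train_labels)
test_predict = clf.predict(test_features)
```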

<br>Result:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/svm_sklearn_result.png)

## Chapter 8: Boosting Methods
Boosting combines a series of weak classifiers into one strong classifier; AdaBoost is its representative algorithm
### The AdaBoost algorithm
Applicable problem: binary classification; handling multi-class classification requires modification (sklearn's multi-class variants are sketched below)
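
sklearn's `AdaBoostClassifier` ships both boosting variants used in the script below; a sketch:

```python
from sklearn.ensemble import AdaBoostClassifier

# SAMME.R: real boosting on class probabilities (the script's choice)
clf_real = AdaBoostClassifier(n_estimators=100, algorithm='SAMME.R')
# SAMME: discrete boosting, the multi-class generalization of AdaBoost
clf_discrete = AdaBoostClassifier(n_estimators=100, algorithm='SAMME')
```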
<br>Code (sklearn implementation): [AdaBoost/AdaBoost_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/AdaBoost/AdaBoost_sklearn.py)

<br>Result with [train.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train.csv) as the experiment data:
<br>![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/Adaboost_sklearn_result_1.png)

<br>Result with [train_binary.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train_binary.csv) as the experiment data:
![](https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/master/imgs/Adaboost_sklearn_result_2.png) 108 | -------------------------------------------------------------------------------- /decision_tree/C45.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import cv2 4 | import time 5 | import numpy as np 6 | import pandas as pd 7 | 8 | 9 | from sklearn.cross_validation import train_test_split 10 | from sklearn.metrics import accuracy_score 11 | 12 | # 二值化 13 | def binaryzation(img): 14 | cv_img = img.astype(np.uint8) 15 | cv2.threshold(cv_img,50,1,cv2.THRESH_BINARY_INV,cv_img) 16 | return cv_img 17 | 18 | def binaryzation_features(trainset): 19 | features = [] 20 | 21 | for img in trainset: 22 | img = np.reshape(img,(28,28)) 23 | cv_img = img.astype(np.uint8) 24 | 25 | img_b = binaryzation(cv_img) 26 | # hog_feature = np.transpose(hog_feature) 27 | features.append(img_b) 28 | 29 | features = np.array(features) 30 | features = np.reshape(features,(-1,feature_len)) 31 | 32 | return features 33 | 34 | 35 | class Tree(object): 36 | def __init__(self,node_type,Class = None, feature = None): 37 | self.node_type = node_type # 节点类型(internal或leaf) 38 | self.dict = {} # dict的键表示特征Ag的可能值ai,值表示根据ai得到的子树 39 | self.Class = Class # 叶节点表示的类,若是内部节点则为none 40 | self.feature = feature # 表示当前的树即将由第feature个特征划分(即第feature特征是使得当前树中信息增益最大的特征) 41 | 42 | def add_tree(self,key,tree): 43 | self.dict[key] = tree 44 | 45 | def predict(self,features): 46 | if self.node_type == 'leaf' or (features[self.feature] not in self.dict): 47 | return self.Class 48 | 49 | tree = self.dict.get(features[self.feature]) 50 | return tree.predict(features) 51 | 52 | # 计算数据集x的经验熵H(x) 53 | def calc_ent(x): 54 | x_value_list = set([x[i] for i in range(x.shape[0])]) 55 | ent = 0.0 56 | for x_value in x_value_list: 57 | p = float(x[x == x_value].shape[0]) / x.shape[0] 58 | logp = np.log2(p) 59 | ent -= p * logp 60 | 61 | return ent 62 | 63 | # 计算条件熵H(y/x) 64 | def calc_condition_ent(x, y): 65 | x_value_list = set([x[i] for i in range(x.shape[0])]) 66 | ent = 0.0 67 | for x_value in x_value_list: 68 | sub_y = y[x == x_value] 69 | temp_ent = calc_ent(sub_y) 70 | ent += (float(sub_y.shape[0]) / y.shape[0]) * temp_ent 71 | 72 | return ent 73 | 74 | # 计算信息增益 75 | def calc_ent_grap(x,y): 76 | base_ent = calc_ent(y) 77 | condition_ent = calc_condition_ent(x, y) 78 | ent_grap = base_ent - condition_ent 79 | 80 | return ent_grap 81 | 82 | # C4.5算法 83 | def recurse_train(train_set,train_label,features): 84 | 85 | LEAF = 'leaf' 86 | INTERNAL = 'internal' 87 | 88 | # 步骤1——如果训练集train_set中的所有实例都属于同一类Ck 89 | label_set = set(train_label) 90 | if len(label_set) == 1: 91 | return Tree(LEAF,Class = label_set.pop()) 92 | 93 | # 步骤2——如果特征集features为空 94 | class_len = [(i,len(list(filter(lambda x:x==i,train_label)))) for i in range(class_num)] # 计算每一个类出现的个数 95 | (max_class,max_len) = max(class_len,key = lambda x:x[1]) 96 | 97 | if len(features) == 0: 98 | return Tree(LEAF,Class = max_class) 99 | 100 | # 步骤3——计算信息增益,并选择信息增益最大的特征 101 | max_feature = 0 102 | max_gda = 0 103 | D = train_label 104 | for feature in features: 105 | # print(type(train_set)) 106 | A = np.array(train_set[:,feature].flat) # 选择训练集中的第feature列(即第feature个特征) 107 | gda = calc_ent_grap(A,D) 108 | if calc_ent(A) != 0: ####### 计算信息增益比,这是与ID3算法唯一的不同 109 | gda /= calc_ent(A) 110 | if gda > max_gda: 111 | max_gda,max_feature = gda,feature 112 | 113 | # 步骤4——信息增益小于阈值 114 | if max_gda < epsilon: 115 | return Tree(LEAF,Class = 
max_class) 116 | 117 | # 步骤5——构建非空子集 118 | sub_features = list(filter(lambda x:x!=max_feature,features)) 119 | tree = Tree(INTERNAL,feature=max_feature) 120 | 121 | max_feature_col = np.array(train_set[:,max_feature].flat) 122 | feature_value_list = set([max_feature_col[i] for i in range(max_feature_col.shape[0])]) # 保存信息增益最大的特征可能的取值 (shape[0]表示计算行数) 123 | for feature_value in feature_value_list: 124 | 125 | index = [] 126 | for i in range(len(train_label)): 127 | if train_set[i][max_feature] == feature_value: 128 | index.append(i) 129 | 130 | sub_train_set = train_set[index] 131 | sub_train_label = train_label[index] 132 | 133 | sub_tree = recurse_train(sub_train_set,sub_train_label,sub_features) 134 | tree.add_tree(feature_value,sub_tree) 135 | 136 | return tree 137 | 138 | def train(train_set,train_label,features): 139 | return recurse_train(train_set,train_label,features) 140 | 141 | def predict(test_set,tree): 142 | result = [] 143 | for features in test_set: 144 | tmp_predict = tree.predict(features) 145 | result.append(tmp_predict) 146 | return np.array(result) 147 | 148 | 149 | class_num = 10 # MINST数据集有10种labels,分别是“0,1,2,3,4,5,6,7,8,9” 150 | feature_len = 784 # MINST数据集每个image有28*28=784个特征(pixels) 151 | epsilon = 0.001 # 设定阈值 152 | 153 | if __name__ == '__main__': 154 | 155 | print("Start read data...") 156 | 157 | time_1 = time.time() 158 | 159 | raw_data = pd.read_csv('../data/train.csv', header=0) # 读取csv数据 160 | data = raw_data.values 161 | 162 | imgs = data[::, 1::] 163 | features = binaryzation_features(imgs) # 图片二值化(很重要,不然预测准确率很低) 164 | labels = data[::, 0] 165 | 166 | # 避免过拟合,采用交叉验证,随机选取33%数据作为测试集,剩余为训练集 167 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 168 | time_2 = time.time() 169 | print('read data cost %f seconds' % (time_2 - time_1)) 170 | 171 | # 通过C4.5算法生成决策树 172 | print('Start training...') 173 | tree = train(train_features,train_labels,list(range(feature_len))) 174 | time_3 = time.time() 175 | print('training cost %f seconds' % (time_3 - time_2)) 176 | 177 | print('Start predicting...') 178 | test_predict = predict(test_features,tree) 179 | time_4 = time.time() 180 | print('predicting cost %f seconds' % (time_4 - time_3)) 181 | 182 | # print("预测的结果为:") 183 | # print(test_predict) 184 | for i in range(len(test_predict)): 185 | if test_predict[i] == None: 186 | test_predict[i] = epsilon 187 | score = accuracy_score(test_labels, test_predict) 188 | print("The accruacy score is %f" % score) 189 | 190 | -------------------------------------------------------------------------------- /decision_tree/ID3.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import cv2 4 | import time 5 | import numpy as np 6 | import pandas as pd 7 | 8 | 9 | from sklearn.cross_validation import train_test_split 10 | from sklearn.metrics import accuracy_score 11 | 12 | # 二值化 13 | def binaryzation(img): 14 | cv_img = img.astype(np.uint8) 15 | cv2.threshold(cv_img,50,1,cv2.THRESH_BINARY_INV,cv_img) 16 | return cv_img 17 | 18 | def binaryzation_features(trainset): 19 | features = [] 20 | 21 | for img in trainset: 22 | img = np.reshape(img,(28,28)) 23 | cv_img = img.astype(np.uint8) 24 | 25 | img_b = binaryzation(cv_img) 26 | # hog_feature = np.transpose(hog_feature) 27 | features.append(img_b) 28 | 29 | features = np.array(features) 30 | features = np.reshape(features,(-1,feature_len)) 31 | 32 | return features 33 | 34 | 35 | class Tree(object): 36 | 
def __init__(self,node_type,Class = None, feature = None): 37 | self.node_type = node_type # 节点类型(internal或leaf) 38 | self.dict = {} # dict的键表示特征Ag的可能值ai,值表示根据ai得到的子树 39 | self.Class = Class # 叶节点表示的类,若是内部节点则为none 40 | self.feature = feature # 表示当前的树即将由第feature个特征划分(即第feature特征是使得当前树中信息增益最大的特征) 41 | 42 | def add_tree(self,key,tree): 43 | self.dict[key] = tree 44 | 45 | def predict(self,features): 46 | if self.node_type == 'leaf' or (features[self.feature] not in self.dict): 47 | return self.Class 48 | 49 | tree = self.dict.get(features[self.feature]) 50 | return tree.predict(features) 51 | 52 | # 计算数据集x的经验熵H(x) 53 | def calc_ent(x): 54 | x_value_list = set([x[i] for i in range(x.shape[0])]) 55 | ent = 0.0 56 | for x_value in x_value_list: 57 | p = float(x[x == x_value].shape[0]) / x.shape[0] 58 | logp = np.log2(p) 59 | ent -= p * logp 60 | 61 | return ent 62 | 63 | # 计算条件熵H(y/x) 64 | def calc_condition_ent(x, y): 65 | x_value_list = set([x[i] for i in range(x.shape[0])]) 66 | ent = 0.0 67 | for x_value in x_value_list: 68 | sub_y = y[x == x_value] 69 | temp_ent = calc_ent(sub_y) 70 | ent += (float(sub_y.shape[0]) / y.shape[0]) * temp_ent 71 | 72 | return ent 73 | 74 | # 计算信息增益 75 | def calc_ent_grap(x,y): 76 | base_ent = calc_ent(y) 77 | condition_ent = calc_condition_ent(x, y) 78 | ent_grap = base_ent - condition_ent 79 | 80 | return ent_grap 81 | 82 | # ID3算法 83 | def recurse_train(train_set,train_label,features): 84 | 85 | LEAF = 'leaf' 86 | INTERNAL = 'internal' 87 | 88 | # 步骤1——如果训练集train_set中的所有实例都属于同一类Ck 89 | label_set = set(train_label) 90 | if len(label_set) == 1: 91 | return Tree(LEAF,Class = label_set.pop()) 92 | 93 | # 步骤2——如果特征集features为空 94 | class_len = [(i,len(list(filter(lambda x:x==i,train_label)))) for i in range(class_num)] # 计算每一个类出现的个数 95 | (max_class,max_len) = max(class_len,key = lambda x:x[1]) 96 | 97 | if len(features) == 0: 98 | return Tree(LEAF,Class = max_class) 99 | 100 | # 步骤3——计算信息增益,并选择信息增益最大的特征 101 | max_feature = 0 102 | max_gda = 0 103 | D = train_label 104 | for feature in features: 105 | # print(type(train_set)) 106 | A = np.array(train_set[:,feature].flat) # 选择训练集中的第feature列(即第feature个特征) 107 | gda=calc_ent_grap(A,D) 108 | if gda > max_gda: 109 | max_gda,max_feature = gda,feature 110 | 111 | # 步骤4——信息增益小于阈值 112 | if max_gda < epsilon: 113 | return Tree(LEAF,Class = max_class) 114 | 115 | # 步骤5——构建非空子集 116 | sub_features = list(filter(lambda x:x!=max_feature,features)) 117 | tree = Tree(INTERNAL,feature=max_feature) 118 | 119 | max_feature_col = np.array(train_set[:,max_feature].flat) 120 | feature_value_list = set([max_feature_col[i] for i in range(max_feature_col.shape[0])]) # 保存信息增益最大的特征可能的取值 (shape[0]表示计算行数) 121 | for feature_value in feature_value_list: 122 | 123 | index = [] 124 | for i in range(len(train_label)): 125 | if train_set[i][max_feature] == feature_value: 126 | index.append(i) 127 | 128 | sub_train_set = train_set[index] 129 | sub_train_label = train_label[index] 130 | 131 | sub_tree = recurse_train(sub_train_set,sub_train_label,sub_features) 132 | tree.add_tree(feature_value,sub_tree) 133 | 134 | return tree 135 | 136 | def train(train_set,train_label,features): 137 | return recurse_train(train_set,train_label,features) 138 | 139 | def predict(test_set,tree): 140 | result = [] 141 | for features in test_set: 142 | tmp_predict = tree.predict(features) 143 | result.append(tmp_predict) 144 | return np.array(result) 145 | 146 | 147 | class_num = 10 # MINST数据集有10种labels,分别是“0,1,2,3,4,5,6,7,8,9” 148 | feature_len = 784 # 
MINST数据集每个image有28*28=784个特征(pixels) 149 | epsilon = 0.001 # 设定阈值 150 | 151 | if __name__ == '__main__': 152 | 153 | print("Start read data...") 154 | 155 | time_1 = time.time() 156 | 157 | raw_data = pd.read_csv('../data/train.csv', header=0) # 读取csv数据 158 | data = raw_data.values 159 | 160 | imgs = data[::, 1::] 161 | features = binaryzation_features(imgs) # 图片二值化(很重要,不然预测准确率很低) 162 | labels = data[::, 0] 163 | 164 | # 避免过拟合,采用交叉验证,随机选取33%数据作为测试集,剩余为训练集 165 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 166 | time_2 = time.time() 167 | print('read data cost %f seconds' % (time_2 - time_1)) 168 | 169 | # 通过ID3算法生成决策树 170 | print('Start training...') 171 | tree = train(train_features,train_labels,list(range(feature_len))) 172 | time_3 = time.time() 173 | print('training cost %f seconds' % (time_3 - time_2)) 174 | 175 | print('Start predicting...') 176 | test_predict = predict(test_features,tree) 177 | time_4 = time.time() 178 | print('predicting cost %f seconds' % (time_4 - time_3)) 179 | 180 | # print("预测的结果为:") 181 | # print(test_predict) 182 | for i in range(len(test_predict)): 183 | if test_predict[i] == None: 184 | test_predict[i] = epsilon 185 | score = accuracy_score(test_labels, test_predict) 186 | print("The accruacy score is %f" % score) 187 | 188 | -------------------------------------------------------------------------------- /decision_tree/decision_tree_sklearn.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import pandas as pd 4 | import time 5 | 6 | from sklearn.cross_validation import train_test_split 7 | from sklearn.metrics import accuracy_score 8 | 9 | from sklearn.tree import DecisionTreeClassifier 10 | 11 | 12 | 13 | if __name__ == '__main__': 14 | 15 | print("Start read data...") 16 | time_1 = time.time() 17 | 18 | raw_data = pd.read_csv('../data/train.csv', header=0) 19 | data = raw_data.values 20 | 21 | features = data[::, 1::] 22 | labels = data[::, 0] 23 | 24 | # 随机选取33%数据作为测试集,剩余为训练集 25 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 26 | 27 | time_2 = time.time() 28 | print('read data cost %f seconds' % (time_2 - time_1)) 29 | 30 | 31 | print('Start training...') 32 | # criterion可选‘gini’, ‘entropy’,默认为gini(对应CART算法),entropy为信息增益(对应ID3算法) 33 | clf = DecisionTreeClassifier(criterion='gini') 34 | clf.fit(train_features,train_labels) 35 | time_3 = time.time() 36 | print('training cost %f seconds' % (time_3 - time_2)) 37 | 38 | 39 | print('Start predicting...') 40 | test_predict = clf.predict(test_features) 41 | time_4 = time.time() 42 | print('predicting cost %f seconds' % (time_4 - time_3)) 43 | 44 | 45 | score = accuracy_score(test_labels, test_predict) 46 | print("The accruacy score is %f" % score) 47 | -------------------------------------------------------------------------------- /imgs/Adaboost_sklearn_result_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/Adaboost_sklearn_result_1.png -------------------------------------------------------------------------------- /imgs/Adaboost_sklearn_result_2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/Adaboost_sklearn_result_2.png -------------------------------------------------------------------------------- /imgs/C45_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/C45_result.png -------------------------------------------------------------------------------- /imgs/ID3_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/ID3_result.png -------------------------------------------------------------------------------- /imgs/decision_tree_sklearn_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/decision_tree_sklearn_result.png -------------------------------------------------------------------------------- /imgs/knn_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/knn_result.png -------------------------------------------------------------------------------- /imgs/knn_sklearn_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/knn_sklearn_result.png -------------------------------------------------------------------------------- /imgs/logistic_regression_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/logistic_regression_result.png -------------------------------------------------------------------------------- /imgs/logistic_regression_sklearn_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/logistic_regression_sklearn_result.png -------------------------------------------------------------------------------- /imgs/maxEnt_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/maxEnt_result.png -------------------------------------------------------------------------------- /imgs/naive_bayes_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/naive_bayes_result.png -------------------------------------------------------------------------------- /imgs/naive_bayes_sklearn_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/naive_bayes_sklearn_result.png -------------------------------------------------------------------------------- /imgs/perceptron_result.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/perceptron_result.png -------------------------------------------------------------------------------- /imgs/perceptron_sklearn_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/perceptron_sklearn_result.png -------------------------------------------------------------------------------- /imgs/svm_sklearn_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/svm_sklearn_result.png -------------------------------------------------------------------------------- /knn/knn.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import pandas as pd 4 | import numpy as np 5 | import time 6 | 7 | from sklearn.cross_validation import train_test_split 8 | from sklearn.metrics import accuracy_score 9 | 10 | def Predict(testset, trainset, train_labels): 11 | predict = [] # 保存测试集预测到的label,并返回 12 | count = 0 # 当前测试数据为第count个 13 | 14 | for test_vec in testset: 15 | # 输出当前运行的测试用例坐标,用于测试 16 | count += 1 17 | print("the number of %d is predicting..."%count) 18 | 19 | knn_list = [] # 当前k个最近邻居 20 | max_index = -1 # 当前k个最近邻居中距离最远点的坐标 21 | max_dist = 0 # 当前k个最近邻居中距离最远点的距离 22 | 23 | # 初始化knn_list,将前k个点的距离放入knn_list中 24 | for i in range(k): 25 | label = train_labels[i] 26 | train_vec = trainset[i] 27 | dist = np.linalg.norm(train_vec - test_vec) # 计算两个点的欧氏距离 28 | knn_list.append((dist, label)) 29 | 30 | # 剩下的点 31 | for i in range(k, len(train_labels)): 32 | label = train_labels[i] 33 | train_vec = trainset[i] 34 | dist = np.linalg.norm(train_vec - test_vec) # 计算两个点的欧氏距离 35 | 36 | # 寻找k个邻近点中距离最远的点 37 | if max_index < 0: 38 | for j in range(k): 39 | if max_dist < knn_list[j][0]: 40 | max_index = j 41 | max_dist = knn_list[max_index][0] 42 | 43 | # 如果当前k个最近邻中存在点距离比当前点距离远,则替换 44 | if dist < max_dist: 45 | knn_list[max_index] = (dist, label) 46 | max_index = -1 47 | max_dist = 0 48 | 49 | 50 | # 统计选票 51 | class_total = k 52 | class_count = [0 for i in range(class_total)] 53 | for dist, label in knn_list: 54 | class_count[label] += 1 55 | 56 | # 找出最大选票 57 | mmax = max(class_count) 58 | 59 | # 找出最大选票标签 60 | for i in range(class_total): 61 | if mmax == class_count[i]: 62 | predict.append(i) 63 | break 64 | 65 | return np.array(predict) 66 | 67 | k = 10 # 选取k值 68 | 69 | if __name__ == '__main__': 70 | 71 | print("Start read data") 72 | 73 | time_1 = time.time() 74 | 75 | raw_data = pd.read_csv('../data/train.csv', header=0) # 读取csv数据 76 | data = raw_data.values 77 | 78 | features = data[::, 1::] 79 | labels = data[::, 0] 80 | 81 | # 避免过拟合,采用交叉验证,随机选取33%数据作为测试集,剩余为训练集 82 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 83 | 84 | time_2 = time.time() 85 | print('read data cost %f seconds' % (time_2 - time_1)) 86 | 87 | print('Start training') 88 | print('knn need not train') 89 | 90 | time_3 = time.time() 91 | print('training cost %f seconds' % (time_3 - time_2)) 92 | 93 | print('Start predicting') 94 | test_predict = Predict(test_features, train_features, train_labels) 95 | time_4 = time.time() 96 | 
print('predicting cost %f seconds' % (time_4 - time_3)) 97 | 98 | score = accuracy_score(test_labels, test_predict) 99 | print("The accruacy score is %f" % score) 100 | -------------------------------------------------------------------------------- /knn/knn_sklearn.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import pandas as pd 4 | import numpy as np 5 | import time 6 | 7 | from sklearn.neighbors import KNeighborsClassifier 8 | 9 | from sklearn.cross_validation import train_test_split 10 | 11 | 12 | if __name__ == '__main__': 13 | 14 | print("Start read data...") 15 | 16 | time_1 = time.time() 17 | 18 | raw_data = pd.read_csv('../data/train.csv', header=0) # 读取csv数据 19 | data = raw_data.values 20 | 21 | features = data[::, 1::] 22 | labels = data[::, 0] 23 | 24 | # 随机选取33%数据作为测试集,剩余为训练集 25 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 26 | 27 | time_2 = time.time() 28 | print('read data cost %f seconds' % (time_2 - time_1)) 29 | 30 | print('Start training...') 31 | neigh = KNeighborsClassifier(n_neighbors=10) 32 | neigh.fit(train_features, train_labels) 33 | time_3 = time.time() 34 | print('training cost %f seconds...' % (time_3 - time_2)) 35 | 36 | print('Start predicting...') 37 | test_predict = neigh.predict(test_features) 38 | time_4 = time.time() 39 | print('predicting cost %f seconds' % (time_4 - time_3)) 40 | 41 | score = neigh.score(test_features, test_labels) 42 | print("The accruacy score is %f" % score) 43 | -------------------------------------------------------------------------------- /logistic_regression/logistic_regression.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import time 4 | import math 5 | import random 6 | import pandas as pd 7 | 8 | from sklearn.cross_validation import train_test_split 9 | from sklearn.metrics import accuracy_score 10 | 11 | 12 | class LogisticRegression(object): 13 | 14 | def __init__(self): 15 | self.learning_step = 0.0001 # 学习率 16 | self.max_iteration = 5000 # 分类正确上界,当分类正确的次数超过上界时,认为已训练好,退出训练 17 | 18 | def train(self,features, labels): 19 | self.w = [0.0] * (len(features[0]) + 1) # 初始化模型参数 20 | 21 | correct_count = 0 # 分类正确的次数 22 | 23 | while correct_count < self.max_iteration: 24 | 25 | # 随机选取数据(xi,yi) 26 | index = random.randint(0, len(labels) - 1) 27 | x = list(features[index]) 28 | x.append(1.0) 29 | y = labels[index] 30 | 31 | if y == self.predict_(x): # 分类正确的次数加1,并跳过下面的步骤 32 | correct_count += 1 33 | continue 34 | 35 | wx = sum([self.w[i] * x[i] for i in range(len(self.w))]) 36 | while wx>700: # 控制运算结果越界 37 | wx/=2 38 | exp_wx = math.exp(wx) 39 | 40 | for i in range(len(self.w)): 41 | self.w[i] -= self.learning_step * \ 42 | (-y * x[i] + float(x[i] * exp_wx) / float(1 + exp_wx)) 43 | 44 | def predict_(self,x): 45 | wx = sum([self.w[j] * x[j] for j in range(len(self.w))]) 46 | while wx>700: # 控制运算结果越界 47 | wx/=2 48 | exp_wx = math.exp(wx) 49 | 50 | predict1 = exp_wx / (1 + exp_wx) 51 | predict0 = 1 / (1 + exp_wx) 52 | 53 | if predict1 > predict0: 54 | return 1 55 | else: 56 | return 0 57 | 58 | 59 | def predict(self,features): 60 | labels = [] 61 | 62 | for feature in features: 63 | x = list(feature) 64 | x.append(1) 65 | labels.append(self.predict_(x)) 66 | 67 | return labels 68 | 69 | if __name__ == "__main__": 70 | print("Start read data...") 71 | 72 | time_1 = time.time() 73 | 74 | raw_data = 
pd.read_csv('../data/train_binary.csv', header=0) # 读取csv数据,并将第一行视为表头,返回DataFrame类型 75 | data = raw_data.values 76 | 77 | features = data[::, 1::] 78 | labels = data[::, 0] 79 | 80 | # 避免过拟合,采用交叉验证,随机选取33%数据作为测试集,剩余为训练集 81 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 82 | 83 | time_2 = time.time() 84 | print('read data cost %f seconds' % (time_2 - time_1)) 85 | 86 | print('Start training...') 87 | lr = LogisticRegression() 88 | lr.train(train_features, train_labels) 89 | time_3 = time.time() 90 | print('training cost %f seconds' % (time_3 - time_2)) 91 | 92 | print('Start predicting...') 93 | test_predict = lr.predict(test_features) 94 | time_4 = time.time() 95 | print('predicting cost %f seconds' % (time_4 - time_3)) 96 | 97 | score = accuracy_score(test_labels, test_predict) 98 | print("The accruacy score is %f" % score) 99 | 100 | -------------------------------------------------------------------------------- /logistic_regression/logistic_regression_sklearn.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import pandas as pd 4 | import time 5 | 6 | from sklearn.cross_validation import train_test_split 7 | from sklearn.metrics import accuracy_score 8 | 9 | from sklearn.linear_model import LogisticRegression 10 | 11 | 12 | 13 | if __name__ == '__main__': 14 | 15 | print("Start read data...") 16 | time_1 = time.time() 17 | 18 | raw_data = pd.read_csv('../data/train.csv', header=0) 19 | data = raw_data.values 20 | 21 | features = data[::, 1::] 22 | labels = data[::, 0] 23 | 24 | # 随机选取33%数据作为测试集,剩余为训练集 25 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 26 | 27 | time_2 = time.time() 28 | print('read data cost %f seconds' % (time_2 - time_1)) 29 | 30 | 31 | print('Start training...') 32 | # multi_class可选‘ovr’, ‘multinomial’,默认为ovr用于二类分类,multinomial用于多类分类 33 | clf = LogisticRegression(max_iter=100,solver='saga',multi_class='multinomial') 34 | clf.fit(train_features,train_labels) 35 | time_3 = time.time() 36 | print('training cost %f seconds' % (time_3 - time_2)) 37 | 38 | 39 | print('Start predicting...') 40 | test_predict = clf.predict(test_features) 41 | time_4 = time.time() 42 | print('predicting cost %f seconds' % (time_4 - time_3)) 43 | 44 | 45 | score = accuracy_score(test_labels, test_predict) 46 | print("The accruacy score is %f" % score) 47 | -------------------------------------------------------------------------------- /maxEnt/maxEnt.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import pandas as pd 4 | import time 5 | import math 6 | 7 | from collections import defaultdict 8 | 9 | from sklearn.cross_validation import train_test_split 10 | from sklearn.metrics import accuracy_score 11 | 12 | 13 | class MaxEnt(object): 14 | 15 | def init_params(self, X, Y): 16 | self.X_ = X 17 | self.Y_ = set() 18 | 19 | self.cal_Vxy(X, Y) 20 | 21 | self.N = len(X) # 训练集大小,如P59例子中为15 22 | self.n = len(self.Vxy) # 数据集中(x,y)对数,如P59例子中为6+3+3+5=17对 23 | self.M = 10000.0 # 设置P91中的M,可认为是学习速率 24 | 25 | self.build_dict() 26 | self.cal_Pxy() 27 | 28 | def cal_Vxy(self, X, Y): 29 | ''' 30 | 计算v(X=x,Y=y),P82 31 | ''' 32 | self.Vxy = defaultdict(int) 33 | 34 | for i in range(len(X)): 35 | x_, y = X[i], Y[i] 36 | self.Y_.add(y) 37 | 38 | for x in x_: 39 | self.Vxy[(x, y)] += 1 40 | 41 | def build_dict(self): 42 | 
self.id2xy = {} 43 | self.xy2id = {} 44 | 45 | for i, (x, y) in enumerate(self.Vxy): 46 | self.id2xy[i] = (x, y) 47 | self.xy2id[(x, y)] = i 48 | 49 | def cal_Pxy(self): 50 | ''' 51 | 计算P(X=x,Y=y),P82 52 | ''' 53 | self.Pxy = defaultdict(float) 54 | for id in range(self.n): 55 | (x, y) = self.id2xy[id] 56 | self.Pxy[id] = float(self.Vxy[(x, y)]) / float(self.N) 57 | 58 | 59 | def cal_Zx(self, X, y): 60 | ''' 61 | 计算Zw(x/yi),根据P85公式6.23,Zw(x)未相加前的单项 62 | ''' 63 | result = 0.0 64 | for x in X: 65 | if (x,y) in self.xy2id: 66 | id = self.xy2id[(x, y)] 67 | result += self.w[id] 68 | return (math.exp(result), y) 69 | 70 | def cal_Pyx(self, X): 71 | ''' 72 | 计算P(y|x),根据P85公式6.22 73 | ''' 74 | Pyxs = [(self.cal_Zx(X, y)) for y in self.Y_] 75 | Zwx = sum([prob for prob, y in Pyxs]) 76 | return [(prob / Zwx, y) for prob, y in Pyxs] 77 | 78 | def cal_Epfi(self): 79 | ''' 80 | 计算Ep(fi),根据P83最上面的公式 81 | ''' 82 | self.Epfi = [0.0 for i in range(self.n)] 83 | 84 | for i, X in enumerate(self.X_): 85 | Pyxs = self.cal_Pyx(X) 86 | 87 | for x in X: 88 | for Pyx, y in Pyxs: 89 | if (x,y) in self.xy2id: 90 | id = self.xy2id[(x, y)] 91 | 92 | self.Epfi[id] += Pyx * (1.0 / self.N) 93 | 94 | 95 | def train(self, X, Y): 96 | ''' 97 | IIS学习算法 98 | ''' 99 | self.init_params(X, Y) 100 | 101 | # 第一步: 初始化参数值wi为0 102 | self.w = [0.0 for i in range(self.n)] 103 | 104 | max_iteration = 500 # 设置最大迭代次数 105 | for times in range(max_iteration): 106 | print("the number of iterater : %d " % times) 107 | 108 | # 第二步:求δi 109 | detas = [] 110 | self.cal_Epfi() 111 | for i in range(self.n): 112 | deta = 1 / self.M * math.log(self.Pxy[i] / self.Epfi[i]) # 指定的特征函数为指示函数,因此E~p(fi)等于Pxy 113 | detas.append(deta) 114 | 115 | # if len(filter(lambda x: abs(x) >= 0.01, detas)) == 0: 116 | # break 117 | 118 | # 第三步:更新Wi 119 | self.w = [self.w[i] + detas[i] for i in range(self.n)] 120 | 121 | def predict(self, testset): 122 | results = [] 123 | for test in testset: 124 | result = self.cal_Pyx(test) 125 | results.append(max(result, key=lambda x: x[0])[1]) 126 | return results 127 | 128 | 129 | def rebuild_features(features): 130 | ''' 131 | 最大熵模型中的f(x,y)中的x是单独的一个特征,不是一个n维特征向量,因此我们需要对每个维度特征加一个区分标签 132 | 具体地:将原feature的(a0,a1,a2,a3,a4,...) 
变成 (0_a0,1_a1,2_a2,3_a3,4_a4,...)形式 133 | ''' 134 | new_features = [] 135 | for feature in features: 136 | new_feature = [] 137 | for i, f in enumerate(feature): 138 | new_feature.append(str(i) + '_' + str(f)) 139 | new_features.append(new_feature) 140 | return new_features 141 | 142 | 143 | if __name__ == '__main__': 144 | 145 | print("Start read data...") 146 | 147 | time_1 = time.time() 148 | 149 | raw_data = pd.read_csv('../data/train.csv', header=0) # 读取csv数据 150 | data = raw_data.values 151 | 152 | features = data[:5000:, 1::] 153 | labels = data[:5000:, 0] 154 | 155 | # 避免过拟合,采用交叉验证,随机选取33%数据作为测试集,剩余为训练集 156 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 157 | 158 | train_features = rebuild_features(train_features) 159 | test_features = rebuild_features(test_features) 160 | 161 | time_2 = time.time() 162 | print('read data cost %f seconds' % (time_2 - time_1)) 163 | 164 | print('Start training...') 165 | met = MaxEnt() 166 | met.train(train_features, train_labels) 167 | 168 | time_3 = time.time() 169 | print('training cost %f seconds' % (time_3 - time_2)) 170 | 171 | print('Start predicting...') 172 | test_predict = met.predict(test_features) 173 | time_4 = time.time() 174 | print('predicting cost %f seconds' % (time_4 - time_3)) 175 | 176 | score = accuracy_score(test_labels, test_predict) 177 | print("The accruacy score is %f" % score) 178 | -------------------------------------------------------------------------------- /naive_bayes/naive_bayes.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import pandas as pd 4 | import numpy as np 5 | import cv2 6 | import time 7 | 8 | from sklearn.cross_validation import train_test_split 9 | from sklearn.metrics import accuracy_score 10 | 11 | # 二值化处理 12 | def binaryzation(img): 13 | cv_img = img.astype(np.uint8) # 类型转化成Numpy中的uint8型 14 | cv2.threshold(cv_img, 50, 1, cv2.THRESH_BINARY_INV, cv_img) # 大于50的值赋值为0,不然赋值为1 15 | return cv_img 16 | 17 | # 训练,计算出先验概率和条件概率 18 | def Train(trainset, train_labels): 19 | prior_probability = np.zeros(class_num) # 先验概率 20 | conditional_probability = np.zeros((class_num, feature_len, 2)) # 条件概率 21 | 22 | # 计算 23 | for i in range(len(train_labels)): 24 | img = binaryzation(trainset[i]) # 图片二值化,让每一个特征都只有0,1两种取值 25 | label = train_labels[i] 26 | 27 | prior_probability[label] += 1 28 | 29 | for j in range(feature_len): 30 | conditional_probability[label][j][img[j]] += 1 31 | 32 | # 将条件概率归到[1,10001] 33 | for i in range(class_num): 34 | for j in range(feature_len): 35 | 36 | # 经过二值化后图像只有0,1两种取值 37 | pix_0 = conditional_probability[i][j][0] 38 | pix_1 = conditional_probability[i][j][1] 39 | 40 | # 计算0,1像素点对应的条件概率 41 | probalility_0 = (float(pix_0)/float(pix_0+pix_1))*10000 + 1 42 | probalility_1 = (float(pix_1)/float(pix_0+pix_1))*10000 + 1 43 | 44 | conditional_probability[i][j][0] = probalility_0 45 | conditional_probability[i][j][1] = probalility_1 46 | 47 | return prior_probability, conditional_probability 48 | 49 | # 计算概率 50 | def calculate_probability(img, label): 51 | probability = int(prior_probability[label]) 52 | 53 | for j in range(feature_len): 54 | probability *= int(conditional_probability[label][j][img[j]]) 55 | 56 | return probability 57 | 58 | # 预测 59 | def Predict(testset, prior_probability, conditional_probability): 60 | predict = [] 61 | 62 | # 对每个输入的x,将后验概率最大的类作为x的类输出 63 | for img in testset: 64 | 65 | img = binaryzation(img) # 图像二值化 66 | 67 | 
max_label = 0 68 | max_probability = calculate_probability(img, 0) 69 | 70 | for j in range(1, class_num): 71 | probability = calculate_probability(img, j) 72 | 73 | if max_probability < probability: 74 | max_label = j 75 | max_probability = probability 76 | 77 | predict.append(max_label) 78 | 79 | return np.array(predict) 80 | 81 | 82 | class_num = 10 # MINST数据集有10种labels,分别是“0,1,2,3,4,5,6,7,8,9” 83 | feature_len = 784 # MINST数据集每个image有28*28=784个特征(pixels) 84 | 85 | if __name__ == '__main__': 86 | 87 | print("Start read data") 88 | time_1 = time.time() 89 | 90 | raw_data = pd.read_csv('../data/train.csv', header=0) # 读取csv数据 91 | data = raw_data.values 92 | 93 | features = data[::, 1::] 94 | labels = data[::, 0] 95 | 96 | # 避免过拟合,采用交叉验证,随机选取33%数据作为测试集,剩余为训练集 97 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 98 | 99 | time_2 = time.time() 100 | print('read data cost %f seconds' % (time_2 - time_1)) 101 | 102 | 103 | print('Start training') 104 | prior_probability, conditional_probability = Train(train_features, train_labels) 105 | time_3 = time.time() 106 | print('training cost %f seconds' % (time_3 - time_2)) 107 | 108 | 109 | print('Start predicting') 110 | test_predict = Predict(test_features, prior_probability, conditional_probability) 111 | time_4 = time.time() 112 | print('predicting cost %f seconds' % (time_4 - time_3)) 113 | 114 | 115 | score = accuracy_score(test_labels, test_predict) 116 | print("The accruacy score is %f" % score) 117 | 118 | -------------------------------------------------------------------------------- /naive_bayes/naive_bayes_sklearn.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import pandas as pd 4 | import numpy as np 5 | import time 6 | 7 | from sklearn.naive_bayes import MultinomialNB 8 | 9 | from sklearn.cross_validation import train_test_split 10 | from sklearn.metrics import accuracy_score 11 | 12 | 13 | if __name__ == '__main__': 14 | 15 | print("Start read data...") 16 | time_1 = time.time() 17 | 18 | raw_data = pd.read_csv('../data/train.csv', header=0) # 读取csv数据 19 | data = raw_data.values 20 | 21 | features = data[::, 1::] 22 | labels = data[::, 0] 23 | 24 | # 随机选取33%数据作为测试集,剩余为训练集 25 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 26 | 27 | time_2 = time.time() 28 | print('read data cost %f seconds' % (time_2 - time_1)) 29 | 30 | 31 | print('Start training...') 32 | clf = MultinomialNB(alpha=1.0) # 加入laplace平滑 33 | clf.fit(train_features, train_labels) 34 | time_3 = time.time() 35 | print('training cost %f seconds' % (time_3 - time_2)) 36 | 37 | 38 | print('Start predicting...') 39 | test_predict = clf.predict(test_features) 40 | time_4 = time.time() 41 | print('predicting cost %f seconds' % (time_4 - time_3)) 42 | 43 | 44 | score = accuracy_score(test_labels, test_predict) 45 | print("The accruacy score is %f" % score) 46 | 47 | -------------------------------------------------------------------------------- /perceptron/perceptron.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import pandas as pd 4 | import random 5 | import time 6 | 7 | from sklearn.cross_validation import train_test_split 8 | from sklearn.metrics import accuracy_score 9 | 10 | 11 | class Perceptron(object): 12 | 13 | def __init__(self): 14 | self.learning_step = 0.001 # 学习率 15 | 
self.max_iteration = 5000 # 分类正确上界,当分类正确的次数超过上界时,认为已训练好,退出训练 16 | 17 | def train(self, features, labels): 18 | 19 | # 初始化w,b为0,b在最后一位 20 | self.w = [0.0] * (len(features[0]) + 1) 21 | 22 | correct_count = 0 # 分类正确的次数 23 | 24 | while correct_count < self.max_iteration: 25 | 26 | # 随机选取数据(xi,yi) 27 | index = random.randint(0, len(labels) - 1) 28 | x = list(features[index]) 29 | x.append(1.0) # 加上1是为了与b相乘 30 | y = 2 * labels[index] - 1 # label为1转化为正实例点+1,label为0转化为负实例点-1 31 | 32 | # 计算w*xi+b 33 | wx = sum([self.w[j] * x[j] for j in range(len(self.w))]) 34 | 35 | # 如果yi(w*xi+b) > 0 则分类正确的次数加1 36 | if wx * y > 0: 37 | correct_count += 1 38 | continue 39 | 40 | # 如果yi(w*xi+b) <= 0 则更新w(最后一位实际上b)的值 41 | for i in range(len(self.w)): 42 | self.w[i] += self.learning_step * (y * x[i]) 43 | 44 | def predict_(self, x): 45 | wx = sum([self.w[j] * x[j] for j in range(len(self.w))]) 46 | return int(wx > 0) # w*xi+b>0则返回返回1,否则返回0 47 | 48 | def predict(self, features): 49 | labels = [] 50 | for feature in features: 51 | x = list(feature) 52 | x.append(1) 53 | labels.append(self.predict_(x)) 54 | return labels 55 | 56 | 57 | if __name__ == '__main__': 58 | 59 | print("Start read data") 60 | 61 | time_1 = time.time() 62 | 63 | raw_data = pd.read_csv('../data/train_binary.csv', header=0) # 读取csv数据,并将第一行视为表头,返回DataFrame类型 64 | data = raw_data.values 65 | 66 | features = data[::, 1::] 67 | labels = data[::, 0] 68 | 69 | # 避免过拟合,采用交叉验证,随机选取33%数据作为测试集,剩余为训练集 70 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 71 | 72 | time_2 = time.time() 73 | print('read data cost %f seconds' % (time_2 - time_1)) 74 | 75 | print('Start training') 76 | p = Perceptron() 77 | p.train(train_features, train_labels) 78 | 79 | time_3 = time.time() 80 | print('training cost %f seconds' % (time_3 - time_2)) 81 | 82 | print('Start predicting') 83 | test_predict = p.predict(test_features) 84 | time_4 = time.time() 85 | print('predicting cost %f seconds' % (time_4 - time_3)) 86 | 87 | score = accuracy_score(test_labels, test_predict) 88 | print("The accruacy score is %f" % score) 89 | -------------------------------------------------------------------------------- /perceptron/perceptron_sklearn.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import pandas as pd 4 | import time 5 | 6 | from sklearn.cross_validation import train_test_split 7 | from sklearn.metrics import accuracy_score 8 | 9 | from sklearn.linear_model import Perceptron 10 | 11 | 12 | 13 | if __name__ == '__main__': 14 | 15 | print("Start read data...") 16 | time_1 = time.time() 17 | 18 | raw_data = pd.read_csv('../data/train_binary.csv', header=0) # 读取csv数据,并将第一行视为表头,返回DataFrame类型 19 | data = raw_data.values 20 | 21 | features = data[::, 1::] 22 | labels = data[::, 0] 23 | 24 | # 随机选取33%数据作为测试集,剩余为训练集 25 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 26 | 27 | time_2 = time.time() 28 | print('read data cost %f seconds' % (time_2 - time_1)) 29 | 30 | print('Start training...') 31 | clf = Perceptron(alpha=0.0001,max_iter=2000) # 设置步长及最大迭代次数 32 | clf.fit(train_features,train_labels) 33 | time_3 = time.time() 34 | print('training cost %f seconds' % (time_3 - time_2)) 35 | 36 | print('Start predicting...') 37 | test_predict = clf.predict(test_features) 38 | time_4 = time.time() 39 | print('predicting cost %f seconds' % (time_4 - time_3)) 40 | 41 | score = 
accuracy_score(test_labels, test_predict) 42 | print("The accruacy score is %f" % score) 43 | -------------------------------------------------------------------------------- /svm/svm_sklearn.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import time 4 | 5 | import numpy as np 6 | import pandas as pd 7 | from sklearn.cross_validation import train_test_split 8 | from sklearn.metrics import accuracy_score 9 | from sklearn import datasets 10 | from sklearn import svm 11 | 12 | if __name__ == '__main__': 13 | 14 | print('prepare datasets...') 15 | # Iris数据集 16 | # iris=datasets.load_iris() 17 | # features=iris.data 18 | # labels=iris.target 19 | 20 | # MINST数据集 21 | raw_data = pd.read_csv('../data/train_binary.csv', header=0) # 读取csv数据,并将第一行视为表头,返回DataFrame类型 22 | data = raw_data.values 23 | features = data[::, 1::] 24 | labels = data[::, 0] # 选取33%数据作为测试集,剩余为训练集 25 | 26 | train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0) 27 | 28 | time_2=time.time() 29 | print('Start training...') 30 | clf = svm.SVC() # svm class 31 | clf.fit(train_features, train_labels) # training the svc model 32 | time_3 = time.time() 33 | print('training cost %f seconds' % (time_3 - time_2)) 34 | 35 | print('Start predicting...') 36 | test_predict=clf.predict(test_features) 37 | time_4 = time.time() 38 | print('predicting cost %f seconds' % (time_4 - time_3)) 39 | 40 | score = accuracy_score(test_labels, test_predict) 41 | print("The accruacy score is %f" % score) 42 | --------------------------------------------------------------------------------