├── AdaBoost
│   └── AdaBoost_sklearn.py
├── README.md
├── data
│   ├── test.csv
│   ├── train.csv
│   └── train_binary.csv
├── decision_tree
│   ├── C45.py
│   ├── ID3.py
│   └── decision_tree_sklearn.py
├── imgs
│   ├── Adaboost_sklearn_result_1.png
│   ├── Adaboost_sklearn_result_2.png
│   ├── C45_result.png
│   ├── ID3_result.png
│   ├── decision_tree_sklearn_result.png
│   ├── knn_result.png
│   ├── knn_sklearn_result.png
│   ├── logistic_regression_result.png
│   ├── logistic_regression_sklearn_result.png
│   ├── maxEnt_result.png
│   ├── naive_bayes_result.png
│   ├── naive_bayes_sklearn_result.png
│   ├── perceptron_result.png
│   ├── perceptron_sklearn_result.png
│   └── svm_sklearn_result.png
├── knn
│   ├── knn.py
│   └── knn_sklearn.py
├── logistic_regression
│   ├── logistic_regression.py
│   └── logistic_regression_sklearn.py
├── maxEnt
│   └── maxEnt.py
├── naive_bayes
│   ├── naive_bayes.py
│   └── naive_bayes_sklearn.py
├── perceptron
│   ├── perceptron.py
│   └── perceptron_sklearn.py
└── svm
    └── svm_sklearn.py
/AdaBoost/AdaBoost_sklearn.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import pandas as pd
import time

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.ensemble import AdaBoostClassifier

if __name__ == '__main__':

    print("Start read data...")
    time_1 = time.time()

    raw_data = pd.read_csv('../data/train_binary.csv', header=0)
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Randomly hold out 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))


    print('Start training...')
    # n_estimators is the number of weak classifiers to combine;
    # algorithm is one of {'SAMME', 'SAMME.R'} (default 'SAMME.R'): 'SAMME.R' uses the real boosting algorithm, 'SAMME' the discrete boosting algorithm
    clf = AdaBoostClassifier(n_estimators=100, algorithm='SAMME.R')
    clf.fit(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))


    print('Start predicting...')
    test_predict = clf.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))


    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# lihang_algorithms
Python and sklearn implementations of the algorithms in Li Hang's *Statistical Learning Methods* (《统计学习方法》)
Experiment data: the MNIST dataset, here using the version preprocessed for Kaggle
Official download: http://yann.lecun.com/exdb/mnist/
Kaggle's preprocessed data: https://www.kaggle.com/c/digit-recognizer/data

* * *

## Chapter 2: The Perceptron
Applicable problem: binary classification
Experiment data: since the perceptron is a binary classifier, the label column of the MNIST set [train.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train.csv) was tweaked slightly: labels equal to 0 stay 0 and labels greater than 0 become 1, turning the 10-class data into binary data (a sketch of the relabeling appears at the end of this section). Available as [train_binary.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train_binary.csv)
Code: [perceptron/perceptron.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/perceptron/perceptron.py)
Result:

Code (sklearn implementation): [perceptron/perceptron_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/perceptron/perceptron_sklearn.py)
Result:


## Chapter 3: k-Nearest Neighbors
Applicable problem: multi-class classification
Three basic elements: the choice of k, the distance metric, and the classification decision rule (see the sketch below)
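Those three elements map directly onto the parameters of sklearn's `KNeighborsClassifier`; a minimal sketch (only `n_neighbors=10` matches this repo's `knn_sklearn.py`; the metric and weights shown are sklearn's effective defaults, written out):

```python
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(
    n_neighbors=10,      # the choice of k
    metric='euclidean',  # the distance metric (knn.py also uses the Euclidean norm)
    weights='uniform',   # the decision rule: unweighted majority vote
)
```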
Code: [knn/knn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/knn/knn.py)
Result:

Code (sklearn implementation): [knn/knn_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/knn/knn_sklearn.py)
Result:


## Chapter 4: Naive Bayes
Applicable problem: multi-class classification
Based on Bayes' theorem and the assumption that features are conditionally independent
The three commonly used models (see the sketch after this list) are:
- Gaussian model: for continuous-valued features
- Multinomial model: the most common; requires discrete features
- Bernoulli model: requires discrete, boolean features, i.e. true/false or 1/0
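The three models correspond one-to-one to sklearn classes; a minimal sketch (the toy arrays are illustrative only):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])  # toy binary features
y = np.array([1, 0, 1, 0])

GaussianNB().fit(X, y)              # continuous features
MultinomialNB(alpha=1.0).fit(X, y)  # discrete counts, with Laplace smoothing
BernoulliNB().fit(X, y)             # boolean features
```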

Code (multinomial model): [naive_bayes/naive_bayes.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/naive_bayes/naive_bayes.py)
Result:

Code (multinomial model, sklearn implementation): [naive_bayes/naive_bayes_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/naive_bayes/naive_bayes_sklearn.py)
Result:


## Chapter 5: Decision Trees
Applicable problem: multi-class classification
Three steps: feature selection, decision tree generation, and decision tree pruning
Common decision tree algorithms (the sketch after this list contrasts the first two criteria):
- **ID3**: splits on **information gain**
- **C4.5**: splits on the **information gain ratio**
- **CART**: splits on the **Gini index**
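The difference between the ID3 and C4.5 criteria comes down to one division; a numpy sketch mirroring `calc_ent`/`calc_ent_grap` in [decision_tree/ID3.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/decision_tree/ID3.py) and the extra normalization in [C45.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/decision_tree/C45.py):

```python
import numpy as np

def entropy(x):
    # empirical entropy H(X) of a 1-D array of discrete values
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, labels):
    # g(D, A) = H(D) - H(D|A): the ID3 criterion
    cond = sum((feature == v).mean() * entropy(labels[feature == v])
               for v in np.unique(feature))
    return entropy(labels) - cond

def gain_ratio(feature, labels):
    # g_R(D, A) = g(D, A) / H_A(D): the C4.5 criterion
    # (assumes the feature is not constant, so entropy(feature) > 0)
    return info_gain(feature, labels) / entropy(feature)
```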

ID3 code: [decision_tree/ID3.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/decision_tree/ID3.py)
Result:

C4.5 code: [decision_tree/C45.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/decision_tree/C45.py)
Result:

CART code (sklearn implementation): [decision_tree/decision_tree_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/decision_tree/decision_tree_sklearn.py)
Result:


## Chapter 6: Logistic Regression
### Binomial logistic regression
Applicable problem: binary classification
Comparable to the perceptron algorithm (a sketch of the decision rule follows below)
Experiment data: [train_binary.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train_binary.csv)
Code: [logistic_regression/logistic_regression.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/logistic_regression/logistic_regression.py)
Result:

### (Multinomial) logistic regression
Applicable problem: multi-class classification
Experiment data: [train.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train.csv)
Code (sklearn implementation): [logistic_regression/logistic_regression_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/logistic_regression/logistic_regression_sklearn.py)
Result:


## Chapter 6: The Maximum Entropy Model
Applicable problem: multi-class classification
Below, the improved iterative scaling algorithm (IIS) is used to learn the maximum entropy model, with the feature function defined as an indicator function: f(x,y) = 1 if the pair (x,y) appears in the training data, and 0 otherwise
Unlike the other classifiers, the x in the maximum entropy model's f(x,y) is a single feature rather than an n-dimensional feature vector, so each feature dimension gets a distinguishing tag: X=(x0,x1,x2,...) becomes X=(0_x0,1_x1,2_x2,...), as in the sketch below
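A sketch of that tagging (it mirrors `rebuild_features` in [maxEnt/maxEnt.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/maxEnt/maxEnt.py)):

```python
def rebuild_features(features):
    # tag each dimension so the same pixel value at different positions
    # becomes a distinct feature: (a0, a1, ...) -> ('0_a0', '1_a1', ...)
    return [[str(i) + '_' + str(v) for i, v in enumerate(row)] for row in features]

print(rebuild_features([[3, 0, 7]]))  # [['0_3', '1_0', '2_7']]
```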
Code: [maxEnt/maxEnt.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/maxEnt/maxEnt.py)
Result:


## Chapter 7: Support Vector Machines
Applicable problem: binary classification
Experiment data: the binary data in [train_binary.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train_binary.csv)
There are three SVM models, from simple to complex:
- when the training data are linearly separable, hard-margin maximization yields the **hard-margin SVM**, also called the **linearly separable SVM**
- when the training data are approximately linearly separable, soft-margin maximization yields the **soft-margin SVM**, also called the **linear SVM**
- when the training data are not linearly separable, soft-margin maximization plus the **kernel trick** yields the **nonlinear SVM**

Code (sklearn implementation): [svm/svm_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/svm/svm_sklearn.py)
*Note: decomposition strategies (e.g. OvO, OvR) extend an SVM, like any other binary classifier, to multi-class problems; sklearn already implements this (see the sketch below)*
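A brief sketch of those decompositions in sklearn (`SVC` handles multi-class input with a built-in one-vs-one scheme; `OneVsRestClassifier` makes one-vs-rest explicit):

```python
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier

clf_ovo = svm.SVC()                       # multi-class handled internally, one-vs-one
clf_ovr = OneVsRestClassifier(svm.SVC())  # explicit one-vs-rest decomposition
```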
Result:


## Chapter 8: Boosting
Boosting combines a series of weak classifiers into one strong classifier; AdaBoost is its representative algorithm (see the sketch below)
### The AdaBoost algorithm
Applicable problem: binary classification; handling multi-class problems requires modifications
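A minimal sketch of "combining weak classifiers": boosting 100 decision stumps (depth-1 trees, which are also sklearn's default base estimator) into one strong classifier. The base estimator is passed positionally here, since its keyword name differs across sklearn versions:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 100 decision stumps, reweighted and combined by AdaBoost
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)
```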
Code (sklearn implementation): [AdaBoost/AdaBoost_sklearn.py](https://github.com/fuqiuai/lihang_algorithms/blob/master/AdaBoost/AdaBoost_sklearn.py)
Result on [train.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train.csv):

Result on [train_binary.csv](https://github.com/fuqiuai/lihang_algorithms/blob/master/data/train_binary.csv):

--------------------------------------------------------------------------------
/decision_tree/C45.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import cv2
import time
import numpy as np
import pandas as pd


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Binarization
def binaryzation(img):
    cv_img = img.astype(np.uint8)
    cv2.threshold(cv_img, 50, 1, cv2.THRESH_BINARY_INV, cv_img)
    return cv_img

def binaryzation_features(trainset):
    features = []

    for img in trainset:
        img = np.reshape(img, (28, 28))
        cv_img = img.astype(np.uint8)

        img_b = binaryzation(cv_img)
        features.append(img_b)

    features = np.array(features)
    features = np.reshape(features, (-1, feature_len))

    return features


class Tree(object):
    def __init__(self, node_type, Class=None, feature=None):
        self.node_type = node_type  # node type ('internal' or 'leaf')
        self.dict = {}  # keys are the possible values ai of feature Ag; values are the subtrees grown for each ai
        self.Class = Class  # the class a leaf node represents; None for internal nodes
        self.feature = feature  # the feature this subtree splits on (the one maximizing the splitting criterion over the current subtree)

    def add_tree(self, key, tree):
        self.dict[key] = tree

    def predict(self, features):
        if self.node_type == 'leaf' or (features[self.feature] not in self.dict):
            return self.Class

        tree = self.dict.get(features[self.feature])
        return tree.predict(features)

# Compute the empirical entropy H(X) of dataset x
def calc_ent(x):
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        p = float(x[x == x_value].shape[0]) / x.shape[0]
        logp = np.log2(p)
        ent -= p * logp

    return ent

# Compute the conditional entropy H(Y|X)
def calc_condition_ent(x, y):
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        sub_y = y[x == x_value]
        temp_ent = calc_ent(sub_y)
        ent += (float(sub_y.shape[0]) / y.shape[0]) * temp_ent

    return ent

# Compute the information gain g(D,A) = H(D) - H(D|A)
def calc_ent_grap(x, y):
    base_ent = calc_ent(y)
    condition_ent = calc_condition_ent(x, y)
    ent_grap = base_ent - condition_ent

    return ent_grap

# The C4.5 algorithm
def recurse_train(train_set, train_label, features):

    LEAF = 'leaf'
    INTERNAL = 'internal'

    # Step 1 -- if all instances in train_set belong to the same class Ck
    label_set = set(train_label)
    if len(label_set) == 1:
        return Tree(LEAF, Class=label_set.pop())

    # Step 2 -- if the feature set is empty
    class_len = [(i, len(list(filter(lambda x: x == i, train_label)))) for i in range(class_num)]  # count the occurrences of each class
    (max_class, max_len) = max(class_len, key=lambda x: x[1])

    if len(features) == 0:
        return Tree(LEAF, Class=max_class)

    # Step 3 -- compute the information gain ratio and pick the feature maximizing it
    max_feature = 0
    max_gda = 0
    D = train_label
    for feature in features:
        A = np.array(train_set[:, feature].flat)  # select column `feature` of the training set (i.e. the feature-th feature)
        gda = calc_ent_grap(A, D)
        if calc_ent(A) != 0:  ####### divide by H_A(D) to get the information gain ratio -- the only difference from the ID3 algorithm
            gda /= calc_ent(A)
        if gda > max_gda:
            max_gda, max_feature = gda, feature

    # Step 4 -- the best gain ratio is below the threshold epsilon
    if max_gda < epsilon:
        return Tree(LEAF, Class=max_class)

    # Step 5 -- split into non-empty subsets and recurse
    sub_features = list(filter(lambda x: x != max_feature, features))
    tree = Tree(INTERNAL, feature=max_feature)

    max_feature_col = np.array(train_set[:, max_feature].flat)
    feature_value_list = set([max_feature_col[i] for i in range(max_feature_col.shape[0])])  # the possible values of the chosen feature (shape[0] is the row count)
    for feature_value in feature_value_list:

        index = []
        for i in range(len(train_label)):
            if train_set[i][max_feature] == feature_value:
                index.append(i)

        sub_train_set = train_set[index]
        sub_train_label = train_label[index]

        sub_tree = recurse_train(sub_train_set, sub_train_label, sub_features)
        tree.add_tree(feature_value, sub_tree)

    return tree

def train(train_set, train_label, features):
    return recurse_train(train_set, train_label, features)

def predict(test_set, tree):
    result = []
    for features in test_set:
        tmp_predict = tree.predict(features)
        result.append(tmp_predict)
    return np.array(result)


class_num = 10  # the MNIST dataset has 10 labels: 0,1,2,3,4,5,6,7,8,9
feature_len = 784  # each MNIST image has 28*28 = 784 features (pixels)
epsilon = 0.001  # threshold for the splitting criterion

if __name__ == '__main__':

    print("Start read data...")

    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the csv data
    data = raw_data.values

    imgs = data[::, 1::]
    features = binaryzation_features(imgs)  # binarize the images (important -- accuracy is very low otherwise)
    labels = data[::, 0]

    # Hold out a random 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)
    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    # Grow the decision tree with the C4.5 algorithm
    print('Start training...')
    tree = train(train_features, train_labels, list(range(feature_len)))
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting...')
    test_predict = predict(test_features, tree)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

    # predict() returns None when an internal node meets an unseen feature value and
    # carries no class; replace those with a dummy value so accuracy_score still works
    for i in range(len(test_predict)):
        if test_predict[i] is None:
            test_predict[i] = epsilon
    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/decision_tree/ID3.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import cv2
import time
import numpy as np
import pandas as pd


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Binarization
def binaryzation(img):
    cv_img = img.astype(np.uint8)
    cv2.threshold(cv_img, 50, 1, cv2.THRESH_BINARY_INV, cv_img)
    return cv_img

def binaryzation_features(trainset):
    features = []

    for img in trainset:
        img = np.reshape(img, (28, 28))
        cv_img = img.astype(np.uint8)

        img_b = binaryzation(cv_img)
        features.append(img_b)

    features = np.array(features)
    features = np.reshape(features, (-1, feature_len))

    return features


class Tree(object):
    def __init__(self, node_type, Class=None, feature=None):
        self.node_type = node_type  # node type ('internal' or 'leaf')
        self.dict = {}  # keys are the possible values ai of feature Ag; values are the subtrees grown for each ai
        self.Class = Class  # the class a leaf node represents; None for internal nodes
        self.feature = feature  # the feature this subtree splits on (the one with the largest information gain in the current subtree)

    def add_tree(self, key, tree):
        self.dict[key] = tree

    def predict(self, features):
        if self.node_type == 'leaf' or (features[self.feature] not in self.dict):
            return self.Class

        tree = self.dict.get(features[self.feature])
        return tree.predict(features)

# Compute the empirical entropy H(X) of dataset x
def calc_ent(x):
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        p = float(x[x == x_value].shape[0]) / x.shape[0]
        logp = np.log2(p)
        ent -= p * logp

    return ent

# Compute the conditional entropy H(Y|X)
def calc_condition_ent(x, y):
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        sub_y = y[x == x_value]
        temp_ent = calc_ent(sub_y)
        ent += (float(sub_y.shape[0]) / y.shape[0]) * temp_ent

    return ent

# Compute the information gain g(D,A) = H(D) - H(D|A)
def calc_ent_grap(x, y):
    base_ent = calc_ent(y)
    condition_ent = calc_condition_ent(x, y)
    ent_grap = base_ent - condition_ent

    return ent_grap

# The ID3 algorithm
def recurse_train(train_set, train_label, features):

    LEAF = 'leaf'
    INTERNAL = 'internal'

    # Step 1 -- if all instances in train_set belong to the same class Ck
    label_set = set(train_label)
    if len(label_set) == 1:
        return Tree(LEAF, Class=label_set.pop())

    # Step 2 -- if the feature set is empty
    class_len = [(i, len(list(filter(lambda x: x == i, train_label)))) for i in range(class_num)]  # count the occurrences of each class
    (max_class, max_len) = max(class_len, key=lambda x: x[1])

    if len(features) == 0:
        return Tree(LEAF, Class=max_class)

    # Step 3 -- compute the information gain and pick the feature maximizing it
    max_feature = 0
    max_gda = 0
    D = train_label
    for feature in features:
        A = np.array(train_set[:, feature].flat)  # select column `feature` of the training set (i.e. the feature-th feature)
        gda = calc_ent_grap(A, D)
        if gda > max_gda:
            max_gda, max_feature = gda, feature

    # Step 4 -- the best information gain is below the threshold epsilon
    if max_gda < epsilon:
        return Tree(LEAF, Class=max_class)

    # Step 5 -- split into non-empty subsets and recurse
    sub_features = list(filter(lambda x: x != max_feature, features))
    tree = Tree(INTERNAL, feature=max_feature)

    max_feature_col = np.array(train_set[:, max_feature].flat)
    feature_value_list = set([max_feature_col[i] for i in range(max_feature_col.shape[0])])  # the possible values of the chosen feature (shape[0] is the row count)
    for feature_value in feature_value_list:

        index = []
        for i in range(len(train_label)):
            if train_set[i][max_feature] == feature_value:
                index.append(i)

        sub_train_set = train_set[index]
        sub_train_label = train_label[index]

        sub_tree = recurse_train(sub_train_set, sub_train_label, sub_features)
        tree.add_tree(feature_value, sub_tree)

    return tree

def train(train_set, train_label, features):
    return recurse_train(train_set, train_label, features)

def predict(test_set, tree):
    result = []
    for features in test_set:
        tmp_predict = tree.predict(features)
        result.append(tmp_predict)
    return np.array(result)


class_num = 10  # the MNIST dataset has 10 labels: 0,1,2,3,4,5,6,7,8,9
feature_len = 784  # each MNIST image has 28*28 = 784 features (pixels)
epsilon = 0.001  # threshold for the splitting criterion

if __name__ == '__main__':

    print("Start read data...")

    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the csv data
    data = raw_data.values

    imgs = data[::, 1::]
    features = binaryzation_features(imgs)  # binarize the images (important -- accuracy is very low otherwise)
    labels = data[::, 0]

    # Hold out a random 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)
    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    # Grow the decision tree with the ID3 algorithm
    print('Start training...')
    tree = train(train_features, train_labels, list(range(feature_len)))
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting...')
    test_predict = predict(test_features, tree)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

    # predict() returns None when an internal node meets an unseen feature value and
    # carries no class; replace those with a dummy value so accuracy_score still works
    for i in range(len(test_predict)):
        if test_predict[i] is None:
            test_predict[i] = epsilon
    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/decision_tree/decision_tree_sklearn.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import pandas as pd
import time

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier



if __name__ == '__main__':

    print("Start read data...")
    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Randomly hold out 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))


    print('Start training...')
    # criterion is 'gini' or 'entropy'; the default 'gini' corresponds to CART, while 'entropy' uses information gain as in ID3
    clf = DecisionTreeClassifier(criterion='gini')
    clf.fit(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))


    print('Start predicting...')
    test_predict = clf.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))


    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/imgs/Adaboost_sklearn_result_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/Adaboost_sklearn_result_1.png
--------------------------------------------------------------------------------
/imgs/Adaboost_sklearn_result_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/Adaboost_sklearn_result_2.png
--------------------------------------------------------------------------------
/imgs/C45_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/C45_result.png
--------------------------------------------------------------------------------
/imgs/ID3_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/ID3_result.png
--------------------------------------------------------------------------------
/imgs/decision_tree_sklearn_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/decision_tree_sklearn_result.png
--------------------------------------------------------------------------------
/imgs/knn_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/knn_result.png
--------------------------------------------------------------------------------
/imgs/knn_sklearn_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/knn_sklearn_result.png
--------------------------------------------------------------------------------
/imgs/logistic_regression_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/logistic_regression_result.png
--------------------------------------------------------------------------------
/imgs/logistic_regression_sklearn_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/logistic_regression_sklearn_result.png
--------------------------------------------------------------------------------
/imgs/maxEnt_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/maxEnt_result.png
--------------------------------------------------------------------------------
/imgs/naive_bayes_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/naive_bayes_result.png
--------------------------------------------------------------------------------
/imgs/naive_bayes_sklearn_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/naive_bayes_sklearn_result.png
--------------------------------------------------------------------------------
/imgs/perceptron_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/perceptron_result.png
--------------------------------------------------------------------------------
/imgs/perceptron_sklearn_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/perceptron_sklearn_result.png
--------------------------------------------------------------------------------
/imgs/svm_sklearn_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fuqiuai/lihang_algorithms/b9a04f48a80c2fdefcff259d65d3925f83a51be0/imgs/svm_sklearn_result.png
--------------------------------------------------------------------------------
/knn/knn.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import pandas as pd
import numpy as np
import time

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def Predict(testset, trainset, train_labels):
    predict = []  # predicted labels for the test set (returned at the end)
    count = 0  # index of the test sample currently being processed

    for test_vec in testset:
        # report which test case is being predicted
        count += 1
        print("the number of %d is predicting..." % count)

        knn_list = []  # the current k nearest neighbours
        max_index = -1  # index (within knn_list) of the farthest of the k neighbours
        max_dist = 0  # distance of the farthest of the k neighbours

        # initialize knn_list with the distances of the first k points
        for i in range(k):
            label = train_labels[i]
            train_vec = trainset[i]
            dist = np.linalg.norm(train_vec - test_vec)  # Euclidean distance between the two points
            knn_list.append((dist, label))

        # the remaining points
        for i in range(k, len(train_labels)):
            label = train_labels[i]
            train_vec = trainset[i]
            dist = np.linalg.norm(train_vec - test_vec)  # Euclidean distance between the two points

            # find the farthest of the k current neighbours
            if max_index < 0:
                for j in range(k):
                    if max_dist < knn_list[j][0]:
                        max_index = j
                        max_dist = knn_list[max_index][0]

            # if the farthest current neighbour is farther away than this point, replace it
            if dist < max_dist:
                knn_list[max_index] = (dist, label)
                max_index = -1
                max_dist = 0


        # tally the votes
        class_total = 10  # number of label classes (digits 0-9); the original reused k here, which only works because k == 10
        class_count = [0 for i in range(class_total)]
        for dist, label in knn_list:
            class_count[label] += 1

        # highest vote count
        mmax = max(class_count)

        # label with the highest vote count
        for i in range(class_total):
            if mmax == class_count[i]:
                predict.append(i)
                break

    return np.array(predict)

k = 10  # the choice of k

if __name__ == '__main__':

    print("Start read data")

    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the csv data
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Hold out a random 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    print('Start training')
    print('knn need not train')

    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting')
    test_predict = Predict(test_features, train_features, train_labels)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/knn/knn_sklearn.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import pandas as pd
import numpy as np
import time

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split


if __name__ == '__main__':

    print("Start read data...")

    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the csv data
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Randomly hold out 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    print('Start training...')
    neigh = KNeighborsClassifier(n_neighbors=10)
    neigh.fit(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds...' % (time_3 - time_2))

    print('Start predicting...')
    test_predict = neigh.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

    score = neigh.score(test_features, test_labels)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/logistic_regression/logistic_regression.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import time
import math
import random
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class LogisticRegression(object):

    def __init__(self):
        self.learning_step = 0.0001  # learning rate
        self.max_iteration = 5000  # upper bound on correct classifications: once that many samples are classified correctly, the model is considered trained and training stops

    def train(self, features, labels):
        self.w = [0.0] * (len(features[0]) + 1)  # initialize the model parameters

        correct_count = 0  # number of correct classifications

        while correct_count < self.max_iteration:

            # pick a random sample (xi, yi)
            index = random.randint(0, len(labels) - 1)
            x = list(features[index])
            x.append(1.0)
            y = labels[index]

            if y == self.predict_(x):  # classified correctly: bump the counter and skip the update
                correct_count += 1
                continue

            wx = sum([self.w[i] * x[i] for i in range(len(self.w))])
            while wx > 700:  # shrink wx to keep math.exp from overflowing
                wx /= 2
            exp_wx = math.exp(wx)

            for i in range(len(self.w)):
                self.w[i] -= self.learning_step * \
                    (-y * x[i] + float(x[i] * exp_wx) / float(1 + exp_wx))

    def predict_(self, x):
        wx = sum([self.w[j] * x[j] for j in range(len(self.w))])
        while wx > 700:  # shrink wx to keep math.exp from overflowing (the sign, and hence the decision, is preserved)
            wx /= 2
        exp_wx = math.exp(wx)

        predict1 = exp_wx / (1 + exp_wx)
        predict0 = 1 / (1 + exp_wx)

        if predict1 > predict0:
            return 1
        else:
            return 0


    def predict(self, features):
        labels = []

        for feature in features:
            x = list(feature)
            x.append(1)
            labels.append(self.predict_(x))

        return labels

if __name__ == "__main__":
    print("Start read data...")

    time_1 = time.time()

    raw_data = pd.read_csv('../data/train_binary.csv', header=0)  # read the csv data, treating the first row as the header; returns a DataFrame
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Hold out a random 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    print('Start training...')
    lr = LogisticRegression()
    lr.train(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting...')
    test_predict = lr.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/logistic_regression/logistic_regression_sklearn.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import pandas as pd
import time

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression



if __name__ == '__main__':

    print("Start read data...")
    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Randomly hold out 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))


    print('Start training...')
    # multi_class is 'ovr' or 'multinomial': 'ovr' (the default) fits one binary problem per class, 'multinomial' fits a single multi-class model
    clf = LogisticRegression(max_iter=100, solver='saga', multi_class='multinomial')
    clf.fit(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))


    print('Start predicting...')
    test_predict = clf.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))


    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/maxEnt/maxEnt.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import pandas as pd
import time
import math

from collections import defaultdict

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class MaxEnt(object):

    def init_params(self, X, Y):
        self.X_ = X
        self.Y_ = set()

        self.cal_Vxy(X, Y)

        self.N = len(X)  # size of the training set, e.g. 15 in the example on P59
        self.n = len(self.Vxy)  # number of distinct (x,y) pairs in the dataset, e.g. 6+3+3+5=17 in the example on P59
        self.M = 10000.0  # the constant M from P91; 1/M behaves like a learning rate

        self.build_dict()
        self.cal_Pxy()

    def cal_Vxy(self, X, Y):
        '''
        Count v(X=x, Y=y), P82
        '''
        self.Vxy = defaultdict(int)

        for i in range(len(X)):
            x_, y = X[i], Y[i]
            self.Y_.add(y)

            for x in x_:
                self.Vxy[(x, y)] += 1

    def build_dict(self):
        self.id2xy = {}
        self.xy2id = {}

        for i, (x, y) in enumerate(self.Vxy):
            self.id2xy[i] = (x, y)
            self.xy2id[(x, y)] = i

    def cal_Pxy(self):
        '''
        Compute the empirical distribution P(X=x, Y=y), P82
        '''
        self.Pxy = defaultdict(float)
        for id in range(self.n):
            (x, y) = self.id2xy[id]
            self.Pxy[id] = float(self.Vxy[(x, y)]) / float(self.N)


    def cal_Zx(self, X, y):
        '''
        Compute one summand of Zw(x) for a fixed y, per formula 6.23 on P85
        '''
        result = 0.0
        for x in X:
            if (x, y) in self.xy2id:
                id = self.xy2id[(x, y)]
                result += self.w[id]
        return (math.exp(result), y)

    def cal_Pyx(self, X):
        '''
        Compute P(y|x) per formula 6.22 on P85
        '''
        Pyxs = [(self.cal_Zx(X, y)) for y in self.Y_]
        Zwx = sum([prob for prob, y in Pyxs])
        return [(prob / Zwx, y) for prob, y in Pyxs]

    def cal_Epfi(self):
        '''
        Compute Ep(fi) per the formula at the top of P83
        '''
        self.Epfi = [0.0 for i in range(self.n)]

        for i, X in enumerate(self.X_):
            Pyxs = self.cal_Pyx(X)

            for x in X:
                for Pyx, y in Pyxs:
                    if (x, y) in self.xy2id:
                        id = self.xy2id[(x, y)]

                        self.Epfi[id] += Pyx * (1.0 / self.N)


    def train(self, X, Y):
        '''
        The IIS (improved iterative scaling) learning algorithm
        '''
        self.init_params(X, Y)

        # Step 1: initialize every parameter wi to 0
        self.w = [0.0 for i in range(self.n)]

        max_iteration = 500  # maximum number of iterations
        for times in range(max_iteration):
            print("the number of iteration : %d " % times)

            # Step 2: solve for each delta_i
            detas = []
            self.cal_Epfi()
            for i in range(self.n):
                deta = 1 / self.M * math.log(self.Pxy[i] / self.Epfi[i])  # the feature functions are indicator functions, so E~p(fi) equals Pxy
                detas.append(deta)

            # if len(filter(lambda x: abs(x) >= 0.01, detas)) == 0:
            #     break

            # Step 3: update each wi
            self.w = [self.w[i] + detas[i] for i in range(self.n)]

    def predict(self, testset):
        results = []
        for test in testset:
            result = self.cal_Pyx(test)
            results.append(max(result, key=lambda x: x[0])[1])
        return results


def rebuild_features(features):
    '''
    In the maximum entropy model the x in f(x,y) is a single feature, not an n-dimensional
    feature vector, so each feature dimension gets a distinguishing tag.
    Concretely: the original feature vector (a0,a1,a2,a3,a4,...) becomes (0_a0,1_a1,2_a2,3_a3,4_a4,...)
    '''
    new_features = []
    for feature in features:
        new_feature = []
        for i, f in enumerate(feature):
            new_feature.append(str(i) + '_' + str(f))
        new_features.append(new_feature)
    return new_features


if __name__ == '__main__':

    print("Start read data...")

    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the csv data
    data = raw_data.values

    features = data[:5000:, 1::]
    labels = data[:5000:, 0]

    # Hold out a random 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    train_features = rebuild_features(train_features)
    test_features = rebuild_features(test_features)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    print('Start training...')
    met = MaxEnt()
    met.train(train_features, train_labels)

    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting...')
    test_predict = met.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/naive_bayes/naive_bayes.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import pandas as pd
import numpy as np
import cv2
import time

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Binarization
def binaryzation(img):
    cv_img = img.astype(np.uint8)  # convert to numpy uint8
    cv2.threshold(cv_img, 50, 1, cv2.THRESH_BINARY_INV, cv_img)  # values above 50 become 0, the rest become 1
    return cv_img

# Training: estimate the prior and conditional probabilities
def Train(trainset, train_labels):
    prior_probability = np.zeros(class_num)  # prior probabilities (kept as raw class counts; normalizing would not change the argmax)
    conditional_probability = np.zeros((class_num, feature_len, 2))  # conditional probabilities

    # counting
    for i in range(len(train_labels)):
        img = binaryzation(trainset[i])  # binarize the image so every feature takes only the values 0 and 1
        label = train_labels[i]

        prior_probability[label] += 1

        for j in range(feature_len):
            conditional_probability[label][j][img[j]] += 1

    # rescale the conditional probabilities into [1, 10001] (avoids zeros and underflow in the product later)
    for i in range(class_num):
        for j in range(feature_len):

            # after binarization a pixel is either 0 or 1
            pix_0 = conditional_probability[i][j][0]
            pix_1 = conditional_probability[i][j][1]

            # conditional probabilities for pixel values 0 and 1
            probability_0 = (float(pix_0) / float(pix_0 + pix_1)) * 10000 + 1
            probability_1 = (float(pix_1) / float(pix_0 + pix_1)) * 10000 + 1

            conditional_probability[i][j][0] = probability_0
            conditional_probability[i][j][1] = probability_1

    return prior_probability, conditional_probability

# Compute the (unnormalized) posterior of one image for one label
def calculate_probability(img, label):
    probability = int(prior_probability[label])

    for j in range(feature_len):
        probability *= int(conditional_probability[label][j][img[j]])

    return probability

# Prediction
def Predict(testset, prior_probability, conditional_probability):
    predict = []

    # for each input x, output the class with the largest posterior probability
    for img in testset:

        img = binaryzation(img)  # binarize the image

        max_label = 0
        max_probability = calculate_probability(img, 0)

        for j in range(1, class_num):
            probability = calculate_probability(img, j)

            if max_probability < probability:
                max_label = j
                max_probability = probability

        predict.append(max_label)

    return np.array(predict)


class_num = 10  # the MNIST dataset has 10 labels: 0,1,2,3,4,5,6,7,8,9
feature_len = 784  # each MNIST image has 28*28 = 784 features (pixels)

if __name__ == '__main__':

    print("Start read data")
    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the csv data
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Hold out a random 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))


    print('Start training')
    prior_probability, conditional_probability = Train(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))


    print('Start predicting')
    test_predict = Predict(test_features, prior_probability, conditional_probability)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))


    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/naive_bayes/naive_bayes_sklearn.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import pandas as pd
import numpy as np
import time

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


if __name__ == '__main__':

    print("Start read data...")
    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the csv data
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Randomly hold out 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))


    print('Start training...')
    clf = MultinomialNB(alpha=1.0)  # alpha=1.0 applies Laplace smoothing
    clf.fit(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))


    print('Start predicting...')
    test_predict = clf.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))


    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/perceptron/perceptron.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import pandas as pd
import random
import time

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class Perceptron(object):

    def __init__(self):
        self.learning_step = 0.001  # learning rate
        self.max_iteration = 5000  # upper bound on correct classifications: once that many samples are classified correctly, the model is considered trained and training stops

    def train(self, features, labels):

        # initialize w and b to 0; b is stored as the last component of w
        self.w = [0.0] * (len(features[0]) + 1)

        correct_count = 0  # number of correct classifications

        while correct_count < self.max_iteration:

            # pick a random sample (xi, yi)
            index = random.randint(0, len(labels) - 1)
            x = list(features[index])
            x.append(1.0)  # append 1 so the last weight acts as the bias b
            y = 2 * labels[index] - 1  # map label 1 to +1 (positive instance) and label 0 to -1 (negative instance)

            # compute w*xi + b
            wx = sum([self.w[j] * x[j] for j in range(len(self.w))])

            # if yi(w*xi+b) > 0, the sample is classified correctly
            if wx * y > 0:
                correct_count += 1
                continue

            # if yi(w*xi+b) <= 0, update w (whose last component is actually b)
            for i in range(len(self.w)):
                self.w[i] += self.learning_step * (y * x[i])

    def predict_(self, x):
        wx = sum([self.w[j] * x[j] for j in range(len(self.w))])
        return int(wx > 0)  # return 1 if w*xi+b > 0, otherwise 0

    def predict(self, features):
        labels = []
        for feature in features:
            x = list(feature)
            x.append(1)
            labels.append(self.predict_(x))
        return labels


if __name__ == '__main__':

    print("Start read data")

    time_1 = time.time()

    raw_data = pd.read_csv('../data/train_binary.csv', header=0)  # read the csv data, treating the first row as the header; returns a DataFrame
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Hold out a random 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    print('Start training')
    p = Perceptron()
    p.train(train_features, train_labels)

    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting')
    test_predict = p.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/perceptron/perceptron_sklearn.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import pandas as pd
import time

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.linear_model import Perceptron



if __name__ == '__main__':

    print("Start read data...")
    time_1 = time.time()

    raw_data = pd.read_csv('../data/train_binary.csv', header=0)  # read the csv data, treating the first row as the header; returns a DataFrame
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Randomly hold out 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    print('Start training...')
    clf = Perceptron(alpha=0.0001, max_iter=2000)  # alpha is the regularization strength (the step size is eta0); max_iter caps the number of training epochs
    clf.fit(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting...')
    test_predict = clf.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------
/svm/svm_sklearn.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import time

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import datasets
from sklearn import svm

if __name__ == '__main__':

    print('prepare datasets...')
    # the Iris dataset
    # iris=datasets.load_iris()
    # features=iris.data
    # labels=iris.target

    # the MNIST dataset
    raw_data = pd.read_csv('../data/train_binary.csv', header=0)  # read the csv data, treating the first row as the header; returns a DataFrame
    data = raw_data.values
    features = data[::, 1::]
    labels = data[::, 0]

    # hold out 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('Start training...')
    clf = svm.SVC()  # SVC with default parameters (RBF kernel)
    clf.fit(train_features, train_labels)  # train the SVC model
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting...')
    test_predict = clf.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
--------------------------------------------------------------------------------