├── SentimentAnalysis.ipynb ├── test.combined.txt ├── train.negative.txt └── train.positive.txt /SentimentAnalysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 项目: 基于评论的情感分析" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "本项目的目标是基于用户提供的评论,通过算法自动去判断其评论是正面的还是负面的情感。比如给定一个用户的评论:\n", 15 | "- 评论1: “我特别喜欢这个电器,我已经用了3个月,一点问题都没有!”\n", 16 | "- 评论2: “我从这家淘宝店卖的东西不到一周就开始坏掉了,强烈建议不要买,真实浪费钱”\n", 17 | "\n", 18 | "对于这两个评论,第一个明显是正面的,第二个是负面的。 我们希望搭建一个AI算法能够自动帮我们识别出评论是正面还是负面。\n", 19 | "\n", 20 | "情感分析的应用场景非常丰富,也是NLP技术在不同场景中落地的典范。比如对于一个证券领域,作为股民,其实比较关注舆论的变化,这个时候如果能有一个AI算法自动给网络上的舆论做正负面判断,然后把所有相关的结论再整合,这样我们可以根据这些大众的舆论,辅助做买卖的决策。 另外,在电商领域评论无处不在,而且评论已经成为影响用户购买决策的非常重要的因素,所以如果AI系统能够自动分析其情感,则后续可以做很多有意思的应用。 " 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "情感分析是文本处理领域经典的问题。整个系统一般会包括几个模块:\n", 28 | "- 数据的抓取: 通过爬虫的技术去网络抓取相关文本数据\n", 29 | "- 数据的清洗/预处理:在本文中一般需要去掉无用的信息,比如各种标签(HTML标签),标点符号,停用词等等\n", 30 | "- 把文本信息转换成向量: 这也成为特征工程,文本本身是不能作为模型的输入,只有数字(比如向量)才能成为模型的输入。所以进入模型之前,任何的信号都需要转换成模型可识别的数字信号(数字,向量,矩阵,张量...)\n", 31 | "- 选择合适的模型以及合适的评估方法。 对于情感分析来说,这是二分类问题(或者三分类:正面,负面,中性),所以需要采用分类算法比如逻辑回归,朴素贝叶斯,神经网络,SVM等等。另外,我们需要选择合适的评估方法,比如对于一个应用,我们是关注准确率呢,还是关注召回率呢? 这跟应用场景以及数据本身有关。比如训练数据里,有100个正样本,1个负样本,那这时候就不太适合用准确率来衡量系统的好坏,为什么? 因为在这种样本很不均匀的情况下,我直接把所有的数据分类成正样本,那在没有任何学习的情况下准确率也可以达到100/101,接近100%,这显然是不太合理的。所以在这种情况我们更趋向于用其他的评估方式比如AUC。评估方式是影响系统的关键因素,因为任何的学习过程都是在不断地优化我们制定的评估方式。" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "在本次项目中,我们已经给定了训练数据和测试数据,它们分别是 train.positive.txt, train.negative.txt, test_combined.txt. 请注意训练数据和测试数据的格式不一样,详情请见文件内容。 整个项目你需要完成以下步骤:\n", 39 | "- 数据的读取以及清洗: 从给定的.txt中读取内容,并做一些数据清洗,这里需要做几个工作: (1) 文本的读取,需要把字符串内容读进来。 (2)去掉无用的字符比如标点符号,多余的空格,换行符等 (3) 分词 \n", 40 | "- 把文本转换成TF-IDF向量: 这部分直接可以利用sklearn提供的TfidfVectorizer类来做。\n", 41 | "- 利用逻辑回归模型来做分类,并通过交叉验证选择最合适的超参数\n", 42 | "- 利用神经网络,支持向量机做分类,并通过交叉验证选择神经网络的合适的参数" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "### 第一部分: 数据预处理\n", 50 | "本部分你将要完成数据的预处理过程,包括数据的读取,数据清洗,分词,以及把文本转换成tf-idf向量。请注意: 在接下来的任务中,正面的情感我们标记为1, 负面\n", 51 | "的情感我们标记成0" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 4, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "import re\n", 61 | "import jieba\n", 62 | "import numpy as np\n", 63 | "\n", 64 | "def process_line(line): \n", 65 | " new_line = re.sub('([a-zA-Z0-9])','',line)\n", 66 | " new_line = ''.join(e for e in new_line if e.isalnum())\n", 67 | " new_line = ','.join(jieba.cut(new_line))\n", 68 | " return new_line\n", 69 | " \n", 70 | "def process_train(file_path):\n", 71 | " comments = [] # 用来存储评论\n", 72 | " labels = [] # 用来存储标签(正/负),如果是train_positive.txt,则所有标签为1, 否则0. \n", 73 | " with open(file_path) as file:\n", 74 | " # TODO 提取每一个评论,然后利用process_line函数来做处理,并添加到comments。\n", 75 | " text = file.read().replace(' ','').replace('\\n','')\n", 76 | " reg = ''\n", 77 | " result = re.findall(reg,text)\n", 78 | " for r in result:\n", 79 | " r = process_line(r)\n", 80 | " comments.append(r)\n", 81 | " if file_path == 'train.positive.txt':\n", 82 | " labels.append('1')\n", 83 | " else:\n", 84 | " labels.append('0')\n", 85 | " return comments, labels\n", 86 | " \n", 87 | " \n", 88 | "def process_test(file_path):\n", 89 | " comments = [] # 用来存储评论\n", 90 | " labels = [] # 用来存储标签(正/负).\n", 91 | " with open(file_path) as file:\n", 92 | " # TODO 提取每一个评论,然后利用process_line函数来做处理,并添加到\n", 93 | " # comments。\n", 94 | " text = file.read().replace(' ','').replace('\\n','')\n", 95 | " reg = ''\n", 96 | " result = re.findall(reg,text)\n", 97 | " for r in result:\n", 98 | " \n", 99 | " label = re.findall('label=\"(\\d)\"',r)[0]\n", 100 | " labels.append(label)\n", 101 | " r = process_line(r)\n", 102 | " comments.append(r)\n", 103 | " return comments, labels\n", 104 | " \n", 105 | "\n", 106 | "def read_file():\n", 107 | " \"\"\"\n", 108 | " 读取所提供的.txt文件,并把内容处理之后写到list里面。 这里需要分别处理四个文件,“train_positive.txt\", \"train_negative.txt\",\n", 109 | " \"test_combined.txt\" 并把每一个文件里的内容存储成列表。 \n", 110 | " \"\"\"\n", 111 | " # 处理训练数据,这两个文件的格式相同,请指定训练文件的路径\n", 112 | " train_pos_comments, train_pos_labels = process_train(\"train.positive.txt\")\n", 113 | " train_neg_comments, train_neg_labels = process_train(\"train.negative.txt\")\n", 114 | " \n", 115 | " # TODO: train_pos_comments和train_neg_comments合并成train_comments, train_pos_labels和train_neg_labels合并成train_labels\n", 116 | " train_comments = train_pos_comments + train_pos_labels\n", 117 | " train_labels = train_pos_labels + train_neg_labels\n", 118 | " # 处理测试数据, 请指定测试文件的路径\n", 119 | " test_comments, test_labels = process_test(\"test.combined.txt\")\n", 120 | " \n", 121 | " return train_comments, train_labels, test_comments, test_labels" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 5, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "name": "stderr", 131 | "output_type": "stream", 132 | "text": [ 133 | "Building prefix dict from the default dictionary ...\n", 134 | "Loading model from cache /var/folders/18/6srg07412v12cx1h90mpgggm0000gq/T/jieba.cache\n", 135 | "Loading model cost 1.311 seconds.\n", 136 | "Prefix dict has been built succesfully.\n" 137 | ] 138 | }, 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "10000 10000 2500 2500\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "# 读取数据,并对文本进行处理\n", 149 | "train_comments, train_labels, test_comments, test_labels = read_file()\n", 150 | "\n", 151 | "# 查看训练数据与测试数据大小\n", 152 | "print (len(train_comments), len(train_labels), len(test_comments), len(test_labels))" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 6, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "name": "stdout", 162 | "output_type": "stream", 163 | "text": [ 164 | "(10000, 17659) (10000,) (2500, 17659) (2500,)\n" 165 | ] 166 | } 167 | ], 168 | "source": [ 169 | "# 把每一个文本内容转换成tf-idf向量\n", 170 | "from sklearn.feature_extraction.text import TfidfVectorizer # 导入sklearn库\n", 171 | "# TODO: 利用TfidfVectorizer把train_comments转换成tf-idf,把结果存储在X_train, 这里X_train是稀疏矩阵(Sparse Matrix) \n", 172 | "# 并把train_labels转换成向量 y_train. 类似的,去创建X_test, y_test。 把文本转换成tf-idf过程请参考TfidfVectorizer的说明\n", 173 | "tfid_vec = TfidfVectorizer()\n", 174 | "X_train = tfid_vec.fit_transform(train_comments)\n", 175 | "y_train = np.array(train_labels)\n", 176 | "X_test = tfid_vec.transform(test_comments)\n", 177 | "y_test = np.array(test_labels)\n", 178 | "# 查看每个矩阵,向量的大小, 保证X_train和y_train, X_test和y_test的长度是一样的。\n", 179 | "print (np.shape(X_train), np.shape(y_train), np.shape(X_test), np.shape(y_test))" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "### 第二部分: 利用逻辑回归模型搭建情感分析引擎\n", 187 | "在本部分你将会利用罗回归模型(logistic regressiion)来搭建情感分析引擎。你需要完成整个pipeline的搭建过程,从而学会一个机器学习模型的搭建流程。 " 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 57, 193 | "metadata": {}, 194 | "outputs": [ 195 | { 196 | "name": "stdout", 197 | "output_type": "stream", 198 | "text": [ 199 | "训练数据上的准确率为:0.9585\n", 200 | "测试数据上的准确率为: 0.5248\n", 201 | "['这个,很,好']\n", 202 | " (0, 15943)\t1.0\n", 203 | "['1']\n" 204 | ] 205 | } 206 | ], 207 | "source": [ 208 | "from sklearn.linear_model import LogisticRegression\n", 209 | "# TODO: 初始化模型model,并利用模型的fit函数来做训练,暂时用默认的设置。\n", 210 | "# 具体使用方法请参考:http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n", 211 | "lr = LogisticRegression().fit(X_train,y_train)\n", 212 | "# 打印在训练数据上的准确率\n", 213 | "print (\"训练数据上的准确率为:\" + str(lr.score(X_train, y_train)))\n", 214 | "\n", 215 | "# 打印在测试数据上的准确率\n", 216 | "print (\"测试数据上的准确率为: \" + str(lr.score(X_test, y_test)))\n", 217 | "\n", 218 | "# TODO: 打印混淆矩阵(confusion matrix)。\n", 219 | "# 参考:http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html\n", 220 | "# 混淆矩阵是一种常用的分析方法。比如模型的准确率不理想的情况下,可以做进一步地分析,并找出原因。 在情感分析问题上,\n", 221 | "# 混淆矩阵可以用来分析有多少个原本正样本被分类成负样本,有多少原本是负样本的被分类成正样本,可以做这种精细化的结果分析,从而找到一些原因。 \n", 222 | "\n", 223 | "# TODO: 利用自己提出的例子来做测试。随意指定一个评论,接着利用process_line来做预处理,再利用之前构建好的TfidfVectorizer来把文本转换\n", 224 | "# 成tf-idf向量, 然后再利用构建好的model做预测(model.predict函数)\n", 225 | "test_comment1 = \"这个很好\"\n", 226 | "test_comment2 = \"垃圾\"\n", 227 | "test_comment3 = \"评论区说不烂都是骗人的,超赞\"\n", 228 | "\n", 229 | "a = []\n", 230 | "a.append(process_line(test_comment1))\n", 231 | "print(a)\n", 232 | "print(tfid_vec.transform(a))\n", 233 | "print(lr.predict(tfid_vec.transform(a)))\n", 234 | "\n", 235 | "# TODO: 输出结果,并自己分析一下是不是跟自己想象的结果一致。 也可以输出两个分类的概率\n" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "### 第三部分: 利用决策树,神经网络,SVM来训练模型。 \n", 243 | "本部分类似于第二部分的内容,只不过替换成其他的模型(包括决策树,神经网络,SVM模型)" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 58, 249 | "metadata": {}, 250 | "outputs": [ 251 | { 252 | "name": "stdout", 253 | "output_type": "stream", 254 | "text": [ 255 | "训练数据上的准确率为:0.9929\n", 256 | "测试数据上的准确率为: 0.5212\n", 257 | "训练数据上的准确率为:0.6961\n", 258 | "测试数据上的准确率为: 0.572\n", 259 | "训练数据上的准确率为:0.741\n", 260 | "测试数据上的准确率为: 0.5776\n" 261 | ] 262 | } 263 | ], 264 | "source": [ 265 | "# 利用决策树来做情感分析预测\n", 266 | "from sklearn import tree\n", 267 | "# 具体使用方法请参考:http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html\n", 268 | "\n", 269 | "# TODO: 初始化决策树模型,并利用模型的fit函数来做训练并打印在训练和测试数据上的准确率,利用决策树默认的参数设置\n", 270 | "dtc1 = tree.DecisionTreeClassifier().fit(X_train,y_train)\n", 271 | "# 打印在训练数据上的准确率\n", 272 | "print (\"训练数据上的准确率为:\" + str(dtc1.score(X_train, y_train)))\n", 273 | "\n", 274 | "# 打印在测试数据上的准确率\n", 275 | "print (\"测试数据上的准确率为: \" + str(dtc1.score(X_test, y_test)))\n", 276 | "\n", 277 | "\n", 278 | "# TODO: 初始化决策树模型,并利用模型的fit函数来做训练并打印在训练和测试数据上的准确率,设置max_depth参数为3\n", 279 | "dtc2 = tree.DecisionTreeClassifier(max_depth=3).fit(X_train,y_train)\n", 280 | "# 打印在训练数据上的准确率\n", 281 | "print (\"训练数据上的准确率为:\" + str(dtc2.score(X_train, y_train)))\n", 282 | "\n", 283 | "# 打印在测试数据上的准确率\n", 284 | "print (\"测试数据上的准确率为: \" + str(dtc2.score(X_test, y_test)))\n", 285 | "\n", 286 | "\n", 287 | "# TODO: 初始化决策树模型,并利用模型的fit函数来做训练并打印在训练和测试数据上的准确率,设置max_depth参数为5\n", 288 | "dtc3 = tree.DecisionTreeClassifier(max_depth=5).fit(X_train,y_train)\n", 289 | "# 打印在训练数据上的准确率\n", 290 | "print (\"训练数据上的准确率为:\" + str(dtc3.score(X_train, y_train)))\n", 291 | "\n", 292 | "# 打印在测试数据上的准确率\n", 293 | "print (\"测试数据上的准确率为: \" + str(dtc3.score(X_test, y_test)))" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 59, 299 | "metadata": {}, 300 | "outputs": [ 301 | { 302 | "name": "stdout", 303 | "output_type": "stream", 304 | "text": [ 305 | "训练数据上的准确率为:0.5155\n", 306 | "测试数据上的准确率为: 0.514\n" 307 | ] 308 | } 309 | ], 310 | "source": [ 311 | "# 利用支持向量机(SVM)来做情感分析预测\n", 312 | "from sklearn import svm\n", 313 | "# http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html\n", 314 | "\n", 315 | "# TODO: 初始化SVM模型,并利用模型的fit函数来做训练并打印在训练和测试数据上的准确率,SVM模型的kernel设置成“rbf”核函数\n", 316 | "svc = svm.SVC(kernel='rbf').fit(X_train,y_train)\n", 317 | "# 打印在训练数据上的准确率\n", 318 | "print (\"训练数据上的准确率为:\" + str(svc.score(X_train, y_train)))\n", 319 | "\n", 320 | "# 打印在测试数据上的准确率\n", 321 | "print (\"测试数据上的准确率为: \" + str(svc.score(X_test, y_test)))" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 60, 327 | "metadata": {}, 328 | "outputs": [ 329 | { 330 | "name": "stdout", 331 | "output_type": "stream", 332 | "text": [ 333 | "训练数据上的准确率为:0.9929\n", 334 | "测试数据上的准确率为: 0.516\n" 335 | ] 336 | } 337 | ], 338 | "source": [ 339 | "# 利用线性支持向量机(LinearSVM)来做情感分析的预测\n", 340 | "from sklearn.svm import LinearSVC\n", 341 | "# 具体的使用方式请见: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html\n", 342 | "\n", 343 | "# TODO: 初始化LinearSVC模型,并利用模型的fit函数来做训练并打印在训练和测试数据上的准确率,使用模型的默认参数。\n", 344 | "clf = LinearSVC()\n", 345 | "clf.fit(X_train,y_train)\n", 346 | "# 打印在训练数据上的准确率\n", 347 | "print (\"训练数据上的准确率为:\" + str(clf.score(X_train, y_train)))\n", 348 | "\n", 349 | "# 打印在测试数据上的准确率\n", 350 | "print (\"测试数据上的准确率为: \" + str(clf.score(X_test, y_test)))" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 70, 356 | "metadata": {}, 357 | "outputs": [ 358 | { 359 | "name": "stdout", 360 | "output_type": "stream", 361 | "text": [ 362 | "训练数据上的准确率为:0.9929\n", 363 | "测试数据上的准确率为: 0.4988\n" 364 | ] 365 | } 366 | ], 367 | "source": [ 368 | "# 利用神经网络模型来做情感分析的预测\n", 369 | "from sklearn.neural_network import MLPClassifier\n", 370 | "# 具体使用方法请见:http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html\n", 371 | "\n", 372 | "# TODO: 初始化MLPClassifier模型,并利用模型的fit函数来做训练并打印在训练和测试数据上的准确率,设置为hidden_layer_sizes为100,\n", 373 | "# 并使用\"lbfgs\" solver\n", 374 | "mlp = MLPClassifier(solver='lbfgs',hidden_layer_sizes=100)\n", 375 | "mlp.fit(X_train,y_train)\n", 376 | "# 打印在训练数据上的准确率\n", 377 | "print (\"训练数据上的准确率为:\" + str(mlp.score(X_train, y_train)))\n", 378 | "\n", 379 | "# 打印在测试数据上的准确率\n", 380 | "print (\"测试数据上的准确率为: \" + str(mlp.score(X_test, y_test)))" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "### 第四部分: 通过交叉验证找出最好的超参数\n", 388 | "调用一个sklearn模型本身很简单,只需要2行代码即可以完成所需要的操作。但这里的关键点在于怎么去寻找最优的超参数(hyperparameter)。 比如对于逻辑回归\n", 389 | "来说,我们可以设定一些参数的值如“penalty”, C等等,这些我们可以理解成是超参数。通常情况下,超参数对于整个模型的效果有着举足轻重的作用,这就意味着\n", 390 | "我们需要一种方式起来找到一个比较合适的参数。其中一个最常用的方法是grid search, 也就在一个去区间里面做搜索,然后找到最优的那个参数值。 \n", 391 | "\n", 392 | "举个例子,对于逻辑回归模型,它拥有一个超参数叫做C,在文档里面解释叫做“Inverse of regularization strength“, 就是正则的权重,而且这种权重的取值\n", 393 | "范围可以认为通常是(0.01, 1000)区间。这时候,通过grid search的方式我们依次可以尝试 0.01, 0.1, 1, 10, 100, 1000 这些值,然后找出使得\n", 394 | "模型的准确率最高的参数。当然,如果计算条件资源允许的话,可以尝试更多的值,比如0.01,0.05,0.1, 0.5, 1, 5, 10 ..。 当我们尝试越多值的时候,找到\n", 395 | "最优参数的概率就会越大。\n", 396 | "\n", 397 | "另外,参数的搜索过程离不开交叉验证,交叉验证相关的细节请参考线上的视频课程。\n", 398 | "\n", 399 | "在第四部分里,你将要编写程序来寻找最优的参数,分别针对逻辑回归和神经网络模型。" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 7, 405 | "metadata": {}, 406 | "outputs": [ 407 | { 408 | "name": "stdout", 409 | "output_type": "stream", 410 | "text": [ 411 | "最好的参数C值为: 1000.000000\n", 412 | "训练数据上的准确率为:0.9929\n", 413 | "测试数据上的准确率为: 0.5088\n" 414 | ] 415 | } 416 | ], 417 | "source": [ 418 | "from sklearn.linear_model import LogisticRegression\n", 419 | "from sklearn.model_selection import KFold\n", 420 | "params_c = np.logspace(-3,3,7) # 对于参数 “C”,尝试几个不同的值\n", 421 | "best_c = params_c[0] # 存储最好的C值\n", 422 | "best_acc = 0\n", 423 | "kf = KFold(n_splits=5,shuffle=False)\n", 424 | "for c in params_c:\n", 425 | " # TODO: 编写交叉验证的过程,对于每一个c值,计算出在验证集中的平均准确率。 在这里,我们做5-fold交叉验证。也就是,每一次把20%\n", 426 | " # 的数据作为验证集来对待,然后准确率为五次的平均值。我们把这个准确率命名为 acc_avg\n", 427 | " avg = 0\n", 428 | " for train_index, test_index in kf.split(X_train):\n", 429 | " lr = LogisticRegression(C=c).fit(X_train[train_index],y_train[train_index])\n", 430 | " avg += lr.score(X_train[test_index],y_train[test_index])\n", 431 | " acc_avg = avg/5\n", 432 | " if acc_avg > best_acc:\n", 433 | " best_acc = acc_avg\n", 434 | " best_c = c\n", 435 | "\n", 436 | "print (\"最好的参数C值为: %f\" % (best_c))\n", 437 | "# TODO 我们需要在整个训练数据上重新训练模型,但这次利用最好的参数best_c值\n", 438 | "# 提示: model = LogisticRegression(C=best_c).fit(X_train, y_train)\n", 439 | "lr = LogisticRegression(C=best_c).fit(X_train, y_train)\n", 440 | "\n", 441 | "# 打印在训练数据上的准确率\n", 442 | "print (\"训练数据上的准确率为:\" + str(lr.score(X_train, y_train)))\n", 443 | "\n", 444 | "# 打印在测试数据上的准确率\n", 445 | "print (\"测试数据上的准确率为: \" + str(lr.score(X_test, y_test)))" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": {}, 452 | "outputs": [], 453 | "source": [ 454 | "from sklearn.neural_network import MLPClassifier\n", 455 | "import numpy as np\n", 456 | "\n", 457 | "param_hidden_layer_sizes = np.linspace(10, 200, 20) # 针对参数 “hidden_layer_sizes”, 尝试几个不同的值\n", 458 | "param_alphas = np.logspace(-4,1,6) # 对于参数 \"alpha\", 尝试几个不同的值\n", 459 | "\n", 460 | "best_hidden_layer_size = param_hidden_layer_sizes[0]\n", 461 | "best_alpha = param_alphas[0]\n", 462 | "\n", 463 | "for size in param_hidden_layer_sizes:\n", 464 | " for val in param_alphas:\n", 465 | " # TODO 编写交叉验证的过程,需要做5-fold交叉验证。\n", 466 | " avg = 0\n", 467 | " for train_index, test_index in kf.split(X_train, y_train):\n", 468 | " mlp = MLPClassifier(alpha=int(val),hidden_layer_sizes=int(size))\n", 469 | " mlp.fit(X_train[train_index],y_train[train_index])\n", 470 | " avg += mlp.score(X_train[test_index],y_train[test_index])\n", 471 | " acc_avg = avg/5\n", 472 | " if acc_avg > best_acc:\n", 473 | " best_acc = acc_avg\n", 474 | " best_hidden_layer_size = size\n", 475 | " best_alpha = val\n", 476 | "\n", 477 | "print (\"最好的参数hidden_layer_size值为: %f\" % (best_hidden_layer_size))\n", 478 | "print (\"最好的参数alpha值为: %f\" % (best_alpha))\n", 479 | "\n", 480 | "# TODO 我们需要在整个训练数据上重新训练模型,但这次使用最好的参数hidden_layer_size和best_alpha\n", 481 | "mlp = MLPClassifier(alpha=best_alpha,hidden_layer_sizes=best_hidden_layer_size).fit(X_train,y_train)\n", 482 | "# 打印在训练数据上的准确率\n", 483 | "print (\"训练数据上的准确率为:\" + str(mlp.score(X_train, y_train)))\n", 484 | "\n", 485 | "# 打印在测试数据上的准确率\n", 486 | "print (\"测试数据上的准确率为: \" + str(mlp.score(X_test, y_test))) \n" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [] 495 | } 496 | ], 497 | "metadata": { 498 | "kernelspec": { 499 | "display_name": "Python 3", 500 | "language": "python", 501 | "name": "python3" 502 | }, 503 | "language_info": { 504 | "codemirror_mode": { 505 | "name": "ipython", 506 | "version": 3 507 | }, 508 | "file_extension": ".py", 509 | "mimetype": "text/x-python", 510 | "name": "python", 511 | "nbconvert_exporter": "python", 512 | "pygments_lexer": "ipython3", 513 | "version": "3.7.0" 514 | } 515 | }, 516 | "nbformat": 4, 517 | "nbformat_minor": 2 518 | } 519 | --------------------------------------------------------------------------------