├── 2019新版debug之后--中文自然语言处理--情感分析.ipynb ├── README.md ├── flowchart.jpg ├── neg ├── neg.0.txt ├── neg.1.txt ├── neg.10.txt ├── neg.1000.txt ├── neg.1001.txt ├── neg.1002.txt ├── neg.1003.txt ├── neg.1004.txt ├── neg.1005.txt ├── neg.1006.txt ├── neg.1007.txt ├── neg.1009.txt ├── neg.101.txt ├── neg.1010.txt ├── neg.1011.txt ├── neg.1012.txt ├── neg.1013.txt ├── neg.1014.txt ├── neg.1015.txt ├── neg.1017.txt ├── neg.1018.txt ├── neg.1019.txt ├── neg.102.txt ├── neg.1020.txt ├── neg.1022.txt ├── neg.1025.txt ├── neg.1026.txt ├── neg.1027.txt ├── neg.1028.txt ├── neg.1029.txt ├── neg.103.txt ├── neg.1030.txt ├── neg.1032.txt ├── neg.1033.txt ├── neg.1034.txt ├── neg.1035.txt ├── neg.1036.txt ├── neg.1038.txt ├── neg.1039.txt ├── neg.104.txt ├── neg.1040.txt ├── neg.1041.txt ├── neg.1042.txt ├── neg.1047.txt ├── neg.1048.txt ├── neg.1049.txt ├── neg.105.txt ├── neg.1050.txt ├── neg.1052.txt ├── neg.1053.txt ├── neg.1054.txt ├── neg.1055.txt ├── neg.1056.txt ├── neg.1057.txt ├── neg.1058.txt ├── neg.1059.txt ├── neg.106.txt ├── neg.1060.txt ├── neg.1061.txt ├── neg.1062.txt ├── neg.1063.txt ├── neg.1066.txt ├── neg.1067.txt ├── neg.1069.txt ├── neg.107.txt ├── neg.1070.txt ├── neg.1071.txt ├── neg.1072.txt ├── neg.1073.txt ├── neg.1074.txt ├── neg.1075.txt ├── neg.1076.txt ├── neg.1077.txt ├── neg.1078.txt ├── neg.1079.txt ├── neg.108.txt ├── neg.1080.txt ├── neg.1081.txt ├── neg.1082.txt ├── neg.1083.txt ├── neg.1084.txt ├── neg.1085.txt ├── neg.1086.txt ├── neg.1087.txt ├── neg.1088.txt ├── neg.1089.txt ├── neg.109.txt ├── neg.1090.txt ├── neg.1091.txt ├── neg.1092.txt ├── neg.1093.txt ├── neg.1094.txt ├── neg.1095.txt ├── neg.1096.txt ├── neg.1097.txt ├── neg.1098.txt ├── neg.1099.txt ├── neg.11.txt └── neg.110.txt ├── negative_samples.txt ├── pos ├── pos.10.txt ├── pos.100.txt ├── pos.1000.txt ├── pos.1001.txt ├── pos.1002.txt ├── pos.1003.txt ├── pos.1004.txt ├── pos.1005.txt ├── pos.1006.txt ├── pos.1007.txt ├── pos.1008.txt ├── pos.1009.txt ├── pos.101.txt ├── 
pos.1010.txt ├── pos.1012.txt ├── pos.1013.txt ├── pos.1014.txt ├── pos.1015.txt ├── pos.1016.txt ├── pos.1017.txt ├── pos.1018.txt ├── pos.1019.txt ├── pos.102.txt ├── pos.1020.txt ├── pos.1021.txt ├── pos.1022.txt ├── pos.1023.txt ├── pos.1024.txt ├── pos.1025.txt ├── pos.1026.txt ├── pos.1027.txt ├── pos.1028.txt ├── pos.1029.txt ├── pos.103.txt ├── pos.1030.txt ├── pos.1031.txt ├── pos.1032.txt ├── pos.1033.txt ├── pos.1034.txt ├── pos.1035.txt ├── pos.1036.txt ├── pos.1037.txt ├── pos.1038.txt ├── pos.1039.txt ├── pos.104.txt ├── pos.1040.txt ├── pos.1041.txt ├── pos.1042.txt ├── pos.1043.txt ├── pos.1044.txt ├── pos.1045.txt ├── pos.1046.txt ├── pos.1047.txt ├── pos.1048.txt ├── pos.1049.txt ├── pos.105.txt ├── pos.1050.txt ├── pos.1051.txt ├── pos.1052.txt ├── pos.1053.txt ├── pos.1054.txt ├── pos.1055.txt ├── pos.1056.txt ├── pos.1057.txt ├── pos.1058.txt ├── pos.1059.txt ├── pos.106.txt ├── pos.1060.txt ├── pos.1061.txt ├── pos.1062.txt ├── pos.1063.txt ├── pos.1064.txt ├── pos.1065.txt ├── pos.107.txt ├── pos.1073.txt ├── pos.1074.txt ├── pos.1075.txt ├── pos.1076.txt ├── pos.1077.txt ├── pos.1078.txt ├── pos.1079.txt ├── pos.108.txt ├── pos.1080.txt ├── pos.1081.txt ├── pos.1082.txt ├── pos.1083.txt ├── pos.1084.txt ├── pos.1085.txt ├── pos.1086.txt ├── pos.1087.txt ├── pos.1088.txt ├── pos.1089.txt ├── pos.109.txt ├── pos.1090.txt ├── pos.1091.txt ├── pos.1093.txt ├── pos.1094.txt ├── pos.1095.txt ├── pos.1096.txt └── pos.1097.txt ├── positive_samples.txt ├── 中文自然语言处理--情感分析.ipynb └── 语料.zip /2019新版debug之后--中文自然语言处理--情感分析.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 用Tensorflow进行中文自然语言处理--情感分析" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "$$f('真好喝')=1$$\n", 15 | "$$f('太难喝了')=0$$" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": 
[ 22 | "**Introduction** \n", 23 | "Hello everyone, I am Espresso. This is the first tutorial I have made: a simple, hands-on classification exercise in Chinese natural language processing. \n", 24 | "Why make this tutorial? Although there is plenty of learning material for natural language processing today, and even more in English, the resources online are messy: systematic material for Chinese is scarce, the knowledge points are very scattered, and practical, end-to-end study material is lacking. Even where code exists, the lack of comments means it takes a long time to understand. In my own learning I spent a whole day searching the web before I had worked out the steps and the software needed to process Chinese. \n", 25 | "So I felt obliged to make a beginner tutorial that combines the scattered material into one practical case for fellow learners. This tutorial focuses on the practical side; for theory I recommend the deeplearning.ai courses. In the code below, wherever a topic comes up I recommend some study material and attach links. For any copyright concerns please e-mail: a66777@188.com. \n", 26 | "Also, I have not done any in-depth research in natural language processing, so feedback from experts is welcome; please point out shortcomings and ways to improve." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "**Required libraries** \n", 34 | "numpy \n", 35 | "jieba \n", 36 | "gensim \n", 37 | "tensorflow \n", 38 | "matplotlib " 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 1, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "# First load the necessary libraries\n", 48 | "%matplotlib inline\n", 49 | "import numpy as np\n", 50 | "import matplotlib.pyplot as plt\n", 51 | "import re\n", 52 | "import jieba # jieba Chinese word segmentation\n", 53 | "# gensim is used to load the pre-trained word vectors\n", 54 | "from gensim.models import KeyedVectors\n", 55 | "import warnings\n", 56 | "warnings.filterwarnings(\"ignore\")\n", 57 | "# bz2 is used for decompression\n", 58 | "import bz2" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "**Pre-trained word vectors** \n", 66 | "This tutorial uses \"chinese-word-vectors\", open-sourced by researchers at the Institute of Chinese Information Processing of Beijing Normal University and the DBIIR lab of Renmin University of China. GitHub link: \n", 67 | "https://github.com/Embedding/Chinese-Word-Vectors \n", 68 | "If you do not know what word2vec is, I recommend the following article: \n", 69 | "https://zhuanlan.zhihu.com/p/26306795 \n", 70 | "Here we use the \"chinese-word-vectors\" Zhihu Word + Ngram vectors, which can be downloaded from the GitHub link above. We first load the pre-trained model and run a few simple tests:" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 2, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# Place the downloaded word-vector archive in the embeddings folder under the project root\n", 80 | "# Decompress the word vectors; this may take 1-2 minutes\n", 81 | "with open(\"embeddings/sgns.zhihu.bigram\", 'wb') as new_file, open(\"embeddings/sgns.zhihu.bigram.bz2\", 'rb') as file:\n", 82 | " decompressor = bz2.BZ2Decompressor()\n", 
83 | " for data in iter(lambda : file.read(100 * 1024), b''):\n", 84 | " new_file.write(decompressor.decompress(data))" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 3, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "# Load the pre-trained Chinese word embeddings with gensim; this may take 1-2 minutes\n", 94 | "cn_model = KeyedVectors.load_word2vec_format('embeddings/sgns.zhihu.bigram', \n", 95 | " binary=False, unicode_errors=\"ignore\")" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "**The word-vector model** \n", 103 | "In this word-vector model every word is an index that maps to a vector of length 300. The LSTM network we will build cannot process Chinese text directly; the text must first be segmented into words and the words converted into word vectors. Please refer to the figure below for the steps, which will be explained along with the code. If you do not know what RNN, GRU and LSTM are, I recommend the deeplearning.ai courses; NetEase open courses offer a free version with Chinese subtitles, but I still recommend the Coursera original, which has the quizzes and programming exercises: \n", 104 | "" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 4, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "词向量的长度为300\n" 117 | ] 118 | }, 119 | { 120 | "data": { 121 | "text/plain": [ 122 | "array([-2.603470e-01, 3.677500e-01, -2.379650e-01, 5.301700e-02,\n", 123 | " -3.628220e-01, -3.212010e-01, -1.903330e-01, 1.587220e-01,\n", 124 | " -7.156200e-02, -4.625400e-02, -1.137860e-01, 3.515600e-01,\n", 125 | " -6.408200e-02, -2.184840e-01, 3.286950e-01, -7.110330e-01,\n", 126 | " 1.620320e-01, 1.627490e-01, 5.528180e-01, 1.016860e-01,\n", 127 | " 1.060080e-01, 7.820700e-01, -7.537310e-01, -2.108400e-02,\n", 128 | " -4.758250e-01, -1.130420e-01, -2.053000e-01, 6.624390e-01,\n", 129 | " 2.435850e-01, 9.171890e-01, -2.090610e-01, -5.290000e-02,\n", 130 | " -7.969340e-01, 2.394940e-01, -9.028100e-02, 1.537360e-01,\n", 131 | " -4.003980e-01, -2.456100e-02, -1.717860e-01, 2.037790e-01,\n", 132 | " -4.344710e-01, -3.850430e-01, -9.366000e-02, 3.775310e-01,\n", 133 | " 2.659690e-01, 8.879800e-02, 2.493440e-01, 4.914900e-02,\n", 134 | " 5.996000e-03, 3.586430e-01, -1.044960e-01, -5.838460e-01,\n", 135 | " 3.093280e-01, 
-2.828090e-01, -8.563400e-02, -5.745400e-02,\n", 136 | " -2.075230e-01, 2.845980e-01, 1.414760e-01, 1.678570e-01,\n", 137 | " 1.957560e-01, 7.782140e-01, -2.359000e-01, -6.833100e-02,\n", 138 | " 2.560170e-01, -6.906900e-02, -1.219620e-01, 2.683020e-01,\n", 139 | " 1.678810e-01, 2.068910e-01, 1.987520e-01, 6.720900e-02,\n", 140 | " -3.975290e-01, -7.123140e-01, 5.613200e-02, 2.586000e-03,\n", 141 | " 5.616910e-01, 1.157000e-03, -4.341190e-01, 1.977480e-01,\n", 142 | " 2.519540e-01, 8.835000e-03, -3.554600e-01, -1.573500e-02,\n", 143 | " -2.526010e-01, 9.355900e-02, -3.962500e-02, -1.628350e-01,\n", 144 | " 2.980950e-01, 1.647900e-01, -5.454270e-01, 3.888790e-01,\n", 145 | " 1.446840e-01, -7.239600e-02, -7.597800e-02, -7.803000e-03,\n", 146 | " 2.020520e-01, -4.424750e-01, 3.911580e-01, 2.115100e-01,\n", 147 | " 6.516760e-01, 5.668030e-01, 5.065500e-02, -1.259650e-01,\n", 148 | " -3.720640e-01, 2.330470e-01, 6.659900e-02, 8.300600e-02,\n", 149 | " 2.540460e-01, -5.279760e-01, -3.843280e-01, 3.366460e-01,\n", 150 | " 2.336500e-01, 3.564750e-01, -4.884160e-01, -1.183910e-01,\n", 151 | " 1.365910e-01, 2.293420e-01, -6.151930e-01, 5.212050e-01,\n", 152 | " 3.412000e-01, 5.757940e-01, 2.354480e-01, -3.641530e-01,\n", 153 | " 7.373400e-02, 1.007380e-01, -3.211410e-01, -3.040480e-01,\n", 154 | " -3.738440e-01, -2.515150e-01, 2.633890e-01, 3.995490e-01,\n", 155 | " 4.461880e-01, 1.641110e-01, 1.449590e-01, -4.191540e-01,\n", 156 | " 2.297840e-01, 6.710600e-02, 3.316430e-01, -6.026500e-02,\n", 157 | " -5.130610e-01, 1.472570e-01, 2.414060e-01, 2.011000e-03,\n", 158 | " -3.823410e-01, -1.356010e-01, 3.112300e-01, 9.177830e-01,\n", 159 | " -4.511630e-01, 1.272190e-01, -9.431600e-02, -8.216000e-03,\n", 160 | " -3.835440e-01, 2.589400e-02, 6.374980e-01, 4.931630e-01,\n", 161 | " -1.865070e-01, 4.076900e-01, -1.841000e-03, 2.213160e-01,\n", 162 | " 2.253950e-01, -2.159220e-01, -7.611480e-01, -2.305920e-01,\n", 163 | " 1.296890e-01, -1.304100e-01, -4.742270e-01, 2.275500e-02,\n", 
164 | " 4.255050e-01, 1.570280e-01, 2.975300e-02, 1.931830e-01,\n", 165 | " 1.304340e-01, -3.179800e-02, 1.516650e-01, -2.154310e-01,\n", 166 | " -4.681410e-01, 1.007326e+00, -6.698940e-01, -1.555240e-01,\n", 167 | " 1.797170e-01, 2.848660e-01, 6.216130e-01, 1.549510e-01,\n", 168 | " 6.225000e-02, -2.227800e-02, 2.561270e-01, -1.006380e-01,\n", 169 | " 2.807900e-02, 4.597710e-01, -4.077750e-01, -1.777390e-01,\n", 170 | " 1.920500e-02, -4.829300e-02, 4.714700e-02, -3.715200e-01,\n", 171 | " -2.995930e-01, -3.719710e-01, 4.622800e-02, -1.436460e-01,\n", 172 | " 2.532540e-01, -9.334000e-02, -4.957400e-02, -3.803850e-01,\n", 173 | " 5.970110e-01, 3.578450e-01, -6.826000e-02, 4.735200e-02,\n", 174 | " -3.707590e-01, -8.621300e-02, -2.556480e-01, -5.950440e-01,\n", 175 | " -4.757790e-01, 1.079320e-01, 9.858300e-02, 8.540300e-01,\n", 176 | " 3.518370e-01, -1.306360e-01, -1.541590e-01, 1.166775e+00,\n", 177 | " 2.048860e-01, 5.952340e-01, 1.158830e-01, 6.774400e-02,\n", 178 | " 6.793920e-01, -3.610700e-01, 1.697870e-01, 4.118530e-01,\n", 179 | " 4.731000e-03, -7.516530e-01, -9.833700e-02, -2.312220e-01,\n", 180 | " -7.043300e-02, 1.576110e-01, -4.780500e-02, -7.344390e-01,\n", 181 | " -2.834330e-01, 4.582690e-01, 3.957010e-01, -8.484300e-02,\n", 182 | " -3.472550e-01, 1.291660e-01, 3.838960e-01, -3.287600e-02,\n", 183 | " -2.802220e-01, 5.257030e-01, -3.609300e-02, -4.842220e-01,\n", 184 | " 3.690700e-02, 3.429560e-01, 2.902490e-01, -1.624650e-01,\n", 185 | " -7.513700e-02, 2.669300e-01, 5.778230e-01, -3.074020e-01,\n", 186 | " -2.183790e-01, -2.834050e-01, 1.350870e-01, 1.490070e-01,\n", 187 | " 1.438400e-02, -2.509040e-01, -3.376100e-01, 1.291880e-01,\n", 188 | " -3.808700e-01, -4.420520e-01, -2.512300e-01, -1.328990e-01,\n", 189 | " -1.211970e-01, 2.532660e-01, 2.757050e-01, -3.382040e-01,\n", 190 | " 1.178070e-01, 3.860190e-01, 5.277960e-01, 4.581920e-01,\n", 191 | " 1.502310e-01, 1.226320e-01, 2.768540e-01, -4.502080e-01,\n", 192 | " -1.992670e-01, 1.689100e-02, 
1.188860e-01, 3.502440e-01,\n", 193 | " -4.064770e-01, 2.610280e-01, -1.934990e-01, -1.625660e-01,\n", 194 | " 2.498400e-02, -1.867150e-01, -1.954400e-02, -2.281900e-01,\n", 195 | " -3.417670e-01, -5.222770e-01, -9.543200e-02, -3.500350e-01,\n", 196 | " 2.154600e-02, 2.318040e-01, 5.395310e-01, -4.223720e-01],\n", 197 | " dtype=float32)" 198 | ] 199 | }, 200 | "execution_count": 4, 201 | "metadata": {}, 202 | "output_type": "execute_result" 203 | } 204 | ], 205 | "source": [ 206 | "# 由此可见每一个词都对应一个长度为300的向量\n", 207 | "embedding_dim = cn_model['山东大学'].shape[0]\n", 208 | "print('词向量的长度为{}'.format(embedding_dim))\n", 209 | "cn_model['山东大学']" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "Cosine Similarity for Vector Space Models by Christian S. Perone\n", 217 | "http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 5, 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "data": { 227 | "text/plain": [ 228 | "0.66128117" 229 | ] 230 | }, 231 | "execution_count": 5, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | } 235 | ], 236 | "source": [ 237 | "# 计算相似度\n", 238 | "cn_model.similarity('橘子', '橙子')" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 6, 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "data": { 248 | "text/plain": [ 249 | "0.66128117" 250 | ] 251 | }, 252 | "execution_count": 6, 253 | "metadata": {}, 254 | "output_type": "execute_result" 255 | } 256 | ], 257 | "source": [ 258 | "# dot('橘子'/|'橘子'|, '橙子'/|'橙子'| )\n", 259 | "np.dot(cn_model['橘子']/np.linalg.norm(cn_model['橘子']), \n", 260 | "cn_model['橙子']/np.linalg.norm(cn_model['橙子']))" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 7, 266 | "metadata": {}, 267 | "outputs": [ 268 | { 269 | "data": { 270 | "text/plain": [ 271 | "[('高中', 
0.7247823476791382),\n", 272 | " ('本科', 0.6768535375595093),\n", 273 | " ('研究生', 0.6244412660598755),\n", 274 | " ('中学', 0.6088204979896545),\n", 275 | " ('大学本科', 0.595908522605896),\n", 276 | " ('初中', 0.5883588790893555),\n", 277 | " ('读研', 0.5778335332870483),\n", 278 | " ('职高', 0.5767995119094849),\n", 279 | " ('大学毕业', 0.5767451524734497),\n", 280 | " ('师范大学', 0.5708829760551453)]" 281 | ] 282 | }, 283 | "execution_count": 7, 284 | "metadata": {}, 285 | "output_type": "execute_result" 286 | } 287 | ], 288 | "source": [ 289 | "# Find the most similar words (cosine similarity)\n", 290 | "cn_model.most_similar(positive=['大学'], topn=10)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 8, 296 | "metadata": {}, 297 | "outputs": [ 298 | { 299 | "name": "stdout", 300 | "output_type": "stream", 301 | "text": [ 302 | "在 老师 会计师 程序员 律师 医生 老人 中:\n", 303 | "不是同一类别的词为: 老人\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "# Find the word that does not belong with the others\n", 309 | "test_words = '老师 会计师 程序员 律师 医生 老人'\n", 310 | "test_words_result = cn_model.doesnt_match(test_words.split())\n", 311 | "print('在 '+test_words+' 中:\\n不是同一类别的词为: %s' %test_words_result)" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 9, 317 | "metadata": {}, 318 | "outputs": [ 319 | { 320 | "data": { 321 | "text/plain": [ 322 | "[('出轨', 0.6100173592567444)]" 323 | ] 324 | }, 325 | "execution_count": 9, 326 | "metadata": {}, 327 | "output_type": "execute_result" 328 | } 329 | ], 330 | "source": [ 331 | "cn_model.most_similar(positive=['女人','劈腿'], negative=['男人'], topn=1)" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "**Training corpus** \n", 339 | "This tutorial uses Songbo Tan's hotel review corpus. Even this corpus was hard to find a download link for: one blog wanted points I did not know how to earn, another link turned out to be dead, and I finally managed to download it by pasting the link into Thunder. I hope more people will share resources in the future. \n", 340 | "The training samples are placed in two folders,\n", 341 | "pos and neg. Each folder contains 2000 txt files, each holding one review, giving 4000 training samples in total; a sample of this size counts as very tiny in NLP:" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 10, 347 | 
"metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "# Get the sample file lists; the samples are stored in two folders,\n", 351 | "# 'pos' for positive reviews and 'neg' for negative reviews\n", 352 | "# Each folder contains 2000 txt files, one review per file\n", 353 | "import os\n", 354 | "pos_txts = os.listdir('pos')\n", 355 | "neg_txts = os.listdir('neg')" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 11, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "name": "stdout", 365 | "output_type": "stream", 366 | "text": [ 367 | "样本总共: 4000\n" 368 | ] 369 | } 370 | ], 371 | "source": [ 372 | "print( '样本总共: '+ str(len(pos_txts) + len(neg_txts)) )" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 12, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "# Now put all the review texts into one list\n", 382 | "# This differs from the video course: many students reported that reading the files\n", 383 | "# as shown in the video produced garbled text, caused by the original files being encoded in gbk,\n", 384 | "# so I did some simple preprocessing here to avoid the garbled text\n", 385 | "train_texts_orig = []\n", 386 | "# The labels corresponding to the texts\n", 387 | "train_target = []\n", 388 | "with open(\"positive_samples.txt\", \"r\", encoding=\"utf-8\") as f:\n", 389 | " lines = f.readlines()\n", 390 | " for line in lines:\n", 391 | " dic = eval(line)\n", 392 | " train_texts_orig.append(dic[\"text\"])\n", 393 | " train_target.append(dic[\"label\"])\n", 394 | "\n", 395 | "with open(\"negative_samples.txt\", \"r\", encoding=\"utf-8\") as f:\n", 396 | " lines = f.readlines()\n", 397 | " for line in lines:\n", 398 | " dic = eval(line)\n", 399 | " train_texts_orig.append(dic[\"text\"])\n", 400 | " train_target.append(dic[\"label\"])" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 13, 406 | "metadata": {}, 407 | "outputs": [ 408 | { 409 | "data": { 410 | "text/plain": [ 411 | "4000" 412 | ] 413 | }, 414 | "execution_count": 13, 415 | "metadata": {}, 416 | "output_type": "execute_result" 417 | } 418 | ], 419 | "source": [ 420 | "len(train_texts_orig)" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 14, 426 | "metadata": 
{}, 427 | "outputs": [], 428 | "source": [ 429 | "# We use TensorFlow's Keras interface to build the model\n", 430 | "from tensorflow.python.keras.models import Sequential\n", 431 | "from tensorflow.python.keras.layers import Dense, GRU, Embedding, LSTM, Bidirectional\n", 432 | "from tensorflow.python.keras.preprocessing.text import Tokenizer\n", 433 | "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n", 434 | "from tensorflow.python.keras.optimizers import RMSprop\n", 435 | "from tensorflow.python.keras.optimizers import Adam\n", 436 | "from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "**Segmentation and tokenization** \n", 444 | "First we remove the punctuation from each sample and segment it with jieba. jieba returns a generator, which cannot be tokenized directly, so we convert the segmentation result into a list and index it. Each review thus becomes a sequence of index numbers that correspond to the words in the pre-trained word-vector model." 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 15, 450 | "metadata": {}, 451 | "outputs": [ 452 | { 453 | "name": "stderr", 454 | "output_type": "stream", 455 | "text": [ 456 | "Building prefix dict from the default dictionary ...\n", 457 | "WARNING: Logging before flag parsing goes to stderr.\n", 458 | "I0610 16:36:46.897187 9584 __init__.py:111] Building prefix dict from the default dictionary ...\n", 459 | "Loading model from cache C:\\Users\\OSCARZ~1\\AppData\\Local\\Temp\\jieba.cache\n", 460 | "I0610 16:36:46.899155 9584 __init__.py:131] Loading model from cache C:\\Users\\OSCARZ~1\\AppData\\Local\\Temp\\jieba.cache\n", 461 | "Loading model cost 0.535 seconds.\n", 462 | "I0610 16:36:47.432762 9584 __init__.py:163] Loading model cost 0.535 seconds.\n", 463 | "Prefix dict has been built succesfully.\n", 464 | "I0610 16:36:47.433753 9584 __init__.py:164] Prefix dict has been built succesfully.\n" 465 | ] 466 | } 467 | ], 468 | "source": [ 469 | "# Segment and tokenize the texts\n", 470 | "# train_tokens is a long list containing 4000 small lists, one per review\n", 471 | "train_tokens = []\n", 
472 | "for text in train_texts_orig:\n", 473 | " # Remove punctuation\n", 474 | " text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\",text)\n", 475 | " # Segment with jieba\n", 476 | " cut = jieba.cut(text)\n", 477 | " # jieba returns a generator,\n", 478 | " # so convert it into a list\n", 479 | " cut_list = [ i for i in cut ]\n", 480 | " for i, word in enumerate(cut_list):\n", 481 | " try:\n", 482 | " # Convert each word into its index\n", 483 | " cut_list[i] = cn_model.vocab[word].index\n", 484 | " except KeyError:\n", 485 | " # If the word is not in the vocabulary, use 0\n", 486 | " cut_list[i] = 0\n", 487 | " train_tokens.append(cut_list)" 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "**Standardizing the index length** \n", 495 | "The reviews have different lengths. If we simply took the longest review and padded all the other reviews to that length, we would waste a lot of computation, so we choose a compromise length." 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 16, 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "# Get the length of every token list\n", 505 | "num_tokens = [ len(tokens) for tokens in train_tokens ]\n", 506 | "num_tokens = np.array(num_tokens)" 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "execution_count": 17, 512 | "metadata": {}, 513 | "outputs": [ 514 | { 515 | "data": { 516 | "text/plain": [ 517 | "71.4495" 518 | ] 519 | }, 520 | "execution_count": 17, 521 | "metadata": {}, 522 | "output_type": "execute_result" 523 | } 524 | ], 525 | "source": [ 526 | "# Mean token length\n", 527 | "np.mean(num_tokens)" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": 18, 533 | "metadata": {}, 534 | "outputs": [ 535 | { 536 | "data": { 537 | "text/plain": [ 538 | "1540" 539 | ] 540 | }, 541 | "execution_count": 18, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "# Length of the longest review in tokens\n", 548 | "np.max(num_tokens)" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 19, 554 | "metadata": {}, 555 | "outputs": [ 556 | { 557 | "data": { 558 | "image/png": 
"[base64 PNG omitted: histogram of np.log(num_tokens), title 'Distribution of tokens length']\n", 559 | "text/plain": [ 560 | "
" 561 | ] 562 | }, 563 | "metadata": { 564 | "needs_background": "light" 565 | }, 566 | "output_type": "display_data" 567 | } 568 | ], 569 | "source": [ 570 | "plt.hist(np.log(num_tokens), bins = 100)\n", 571 | "plt.xlim((0,10))\n", 572 | "plt.ylabel('number of tokens')\n", 573 | "plt.xlabel('length of tokens')\n", 574 | "plt.title('Distribution of tokens length')\n", 575 | "plt.show()" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 20, 581 | "metadata": {}, 582 | "outputs": [ 583 | { 584 | "data": { 585 | "text/plain": [ 586 | "236" 587 | ] 588 | }, 589 | "execution_count": 20, 590 | "metadata": {}, 591 | "output_type": "execute_result" 592 | } 593 | ], 594 | "source": [ 595 | "# 取tokens平均值并加上两个tokens的标准差,\n", 596 | "# 假设tokens长度的分布为正态分布,则max_tokens这个值可以涵盖95%左右的样本\n", 597 | "max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)\n", 598 | "max_tokens = int(max_tokens)\n", 599 | "max_tokens" 600 | ] 601 | }, 602 | { 603 | "cell_type": "code", 604 | "execution_count": 21, 605 | "metadata": {}, 606 | "outputs": [ 607 | { 608 | "data": { 609 | "text/plain": [ 610 | "0.9565" 611 | ] 612 | }, 613 | "execution_count": 21, 614 | "metadata": {}, 615 | "output_type": "execute_result" 616 | } 617 | ], 618 | "source": [ 619 | "# 取tokens的长度为236时,大约95%的样本被涵盖\n", 620 | "# 我们对长度不足的进行padding,超长的进行修剪\n", 621 | "np.sum( num_tokens < max_tokens ) / len(num_tokens)" 622 | ] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": {}, 627 | "source": [ 628 | "**反向tokenize** \n", 629 | "我们定义一个function,用来把索引转换成可阅读的文本,这对于debug很重要。" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": 22, 635 | "metadata": {}, 636 | "outputs": [], 637 | "source": [ 638 | "# 用来将tokens转换为文本\n", 639 | "def reverse_tokens(tokens):\n", 640 | " text = ''\n", 641 | " for i in tokens:\n", 642 | " if i != 0:\n", 643 | " text = text + cn_model.index2word[i]\n", 644 | " else:\n", 645 | " text = text + ' '\n", 646 | " return text" 647 | ] 648 | }, 649 | 
{ 650 | "cell_type": "code", 651 | "execution_count": 23, 652 | "metadata": {}, 653 | "outputs": [], 654 | "source": [ 655 | "reverse = reverse_tokens(train_tokens[0])" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "以下可见,训练样本的极性并不是那么精准,比如说下面的样本,对早餐并不满意,但被定义为正面评价,这会迷惑我们的模型,不过我们暂时不对训练样本进行任何修改。" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 24, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "data": { 672 | "text/plain": [ 673 | "'早餐太差无论去多少人那边也不加食品的酒店应该重视一下这个问题了房间本身很好'" 674 | ] 675 | }, 676 | "execution_count": 24, 677 | "metadata": {}, 678 | "output_type": "execute_result" 679 | } 680 | ], 681 | "source": [ 682 | "# 经过tokenize再恢复成文本\n", 683 | "# 可见标点符号都没有了\n", 684 | "reverse" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": 25, 690 | "metadata": {}, 691 | "outputs": [ 692 | { 693 | "data": { 694 | "text/plain": [ 695 | "'早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。\\n\\n房间本身很好。'" 696 | ] 697 | }, 698 | "execution_count": 25, 699 | "metadata": {}, 700 | "output_type": "execute_result" 701 | } 702 | ], 703 | "source": [ 704 | "# 原始文本\n", 705 | "train_texts_orig[0]" 706 | ] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "metadata": {}, 711 | "source": [ 712 | "**准备Embedding Matrix** \n", 713 | "现在我们来为模型准备embedding matrix(词向量矩阵),根据keras的要求,我们需要准备一个维度为$(numwords, embeddingdim)$的矩阵,num words代表我们使用的词汇的数量,embedding dimension在我们现在使用的预训练词向量模型中是300,每一个词汇都用一个长度为300的向量表示。 \n", 714 | "注意我们只选择使用前50k个使用频率最高的词,在这个预训练词向量模型中,一共有260万词汇量,如果全部使用在分类问题上会很浪费计算资源,因为我们的训练样本很小,一共只有4k,如果我们有100k,200k甚至更多的训练样本时,在分类问题上可以考虑增加使用的词汇量。" 715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": 26, 720 | "metadata": {}, 721 | "outputs": [ 722 | { 723 | "data": { 724 | "text/plain": [ 725 | "300" 726 | ] 727 | }, 728 | "execution_count": 26, 729 | "metadata": {}, 730 | "output_type": "execute_result" 731 | } 732 | ], 733 | "source": [ 734 | "embedding_dim" 735 | ] 736
| }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 27, 740 | "metadata": {}, 741 | "outputs": [], 742 | "source": [ 743 | "# 只使用前50000个词\n", 744 | "num_words = 50000\n", 745 | "# 初始化embedding_matrix,之后在keras上进行应用\n", 746 | "embedding_matrix = np.zeros((num_words, embedding_dim))\n", 747 | "# embedding_matrix为一个 [num_words,embedding_dim] 的矩阵\n", 748 | "# 维度为 50000 * 300\n", 749 | "for i in range(num_words):\n", 750 | " embedding_matrix[i,:] = cn_model[cn_model.index2word[i]]\n", 751 | "embedding_matrix = embedding_matrix.astype('float32')" 752 | ] 753 | }, 754 | { 755 | "cell_type": "code", 756 | "execution_count": 28, 757 | "metadata": {}, 758 | "outputs": [ 759 | { 760 | "data": { 761 | "text/plain": [ 762 | "300" 763 | ] 764 | }, 765 | "execution_count": 28, 766 | "metadata": {}, 767 | "output_type": "execute_result" 768 | } 769 | ], 770 | "source": [ 771 | "# 检查index是否对应,\n", 772 | "# 输出300说明长度为300的embedding向量逐位一致\n", 773 | "np.sum( cn_model[cn_model.index2word[333]] == embedding_matrix[333] )" 774 | ] 775 | }, 776 | { 777 | "cell_type": "code", 778 | "execution_count": 29, 779 | "metadata": {}, 780 | "outputs": [ 781 | { 782 | "data": { 783 | "text/plain": [ 784 | "(50000, 300)" 785 | ] 786 | }, 787 | "execution_count": 29, 788 | "metadata": {}, 789 | "output_type": "execute_result" 790 | } 791 | ], 792 | "source": [ 793 | "# embedding_matrix的维度,\n", 794 | "# 这个维度为keras的要求,后续会在模型中用到\n", 795 | "embedding_matrix.shape" 796 | ] 797 | }, 798 | { 799 | "cell_type": "markdown", 800 | "metadata": {}, 801 | "source": [ 802 | "**padding(填充)和truncating(修剪)** \n", 803 | "我们把文本转换为tokens(索引)之后,每一串索引的长度并不相等,所以为了方便模型的训练我们需要把索引的长度标准化,上面我们选择了236这个可以涵盖95%训练样本的长度,接下来我们进行padding和truncating,我们一般采用'pre'的方法,这会在文本索引的前面填充0,因为根据一些研究资料中的实践,如果在文本索引后面填充0的话,会对模型造成一些不良影响。" 804 | ] 805 | }, 806 | { 807 | "cell_type": "code", 808 | "execution_count": 30, 809 | "metadata": {}, 810 | "outputs": [], 811 | "source": [ 812 | "# 进行padding和truncating, 输入的train_tokens是一个list\n", 813 |
"# 返回的train_pad是一个numpy array\n", 814 | "train_pad = pad_sequences(train_tokens, maxlen=max_tokens,\n", 815 | " padding='pre', truncating='pre')" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": 31, 821 | "metadata": {}, 822 | "outputs": [], 823 | "source": [ 824 | "# 超出五万个词向量的词用0代替\n", 825 | "train_pad[ train_pad>=num_words ] = 0" 826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": 32, 831 | "metadata": {}, 832 | "outputs": [ 833 | { 834 | "data": { 835 | "text/plain": [ 836 | "array([ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 837 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 838 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 839 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 840 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 841 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 842 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 843 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 844 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 845 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 846 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 847 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 848 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 849 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 850 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 851 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 852 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 853 | " 290, 3053, 57, 169, 73, 1, 25, 11216, 49,\n", 854 | " 163, 15985, 0, 0, 30, 8, 0, 1, 228,\n", 855 | " 223, 40, 35, 653, 0, 5, 1642, 29, 11216,\n", 856 | " 2751, 500, 98, 30, 3159, 2225, 2146, 371, 6285,\n", 857 | " 169, 27396, 1, 1191, 5432, 1080, 20055, 57, 562,\n", 858 | " 1, 22671, 40, 35, 169, 2567, 0, 42665, 7761,\n", 859 | " 110, 0, 0, 41281, 0, 110, 0, 35891, 110,\n", 860 | " 0, 28781, 57, 169, 1419, 1, 11670, 0, 19470,\n", 861 | " 1, 0, 0, 169, 35071, 40, 562, 35, 12398,\n", 862 | " 657, 4857])" 863 | ] 864 | }, 865 | "execution_count": 32, 866 | "metadata": {}, 867 | "output_type": "execute_result" 868 | } 869 | ], 870 | "source": [ 871 | "# 可见padding之后前面的tokens全变成0,文本在最后面\n", 872 | "train_pad[33]" 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | 
"execution_count": 33, 878 | "metadata": {}, 879 | "outputs": [], 880 | "source": [ 881 | "# 准备target向量,前2000样本为1,后2000为0\n", 882 | "train_target = np.array(train_target)" 883 | ] 884 | }, 885 | { 886 | "cell_type": "code", 887 | "execution_count": 34, 888 | "metadata": {}, 889 | "outputs": [], 890 | "source": [ 891 | "# 进行训练和测试样本的分割\n", 892 | "from sklearn.model_selection import train_test_split" 893 | ] 894 | }, 895 | { 896 | "cell_type": "code", 897 | "execution_count": 50, 898 | "metadata": {}, 899 | "outputs": [], 900 | "source": [ 901 | "# 90%的样本用来训练,剩余10%用来测试\n", 902 | "X_train, X_test, y_train, y_test = train_test_split(train_pad,\n", 903 | " train_target,\n", 904 | " test_size=0.1,\n", 905 | " random_state=12)" 906 | ] 907 | }, 908 | { 909 | "cell_type": "code", 910 | "execution_count": 36, 911 | "metadata": {}, 912 | "outputs": [ 913 | { 914 | "name": "stdout", 915 | "output_type": "stream", 916 | "text": [ 917 | " 房间很大还有海景阳台走出酒店就是沙滩非常不错唯一遗憾的就是不能刷 不方便\n", 918 | "class: 1\n" 919 | ] 920 | } 921 | ], 922 | "source": [ 923 | "# 查看训练样本,确认无误\n", 924 | "print(reverse_tokens(X_train[35]))\n", 925 | "print('class: ',y_train[35])" 926 | ] 927 | }, 928 | { 929 | "cell_type": "markdown", 930 | "metadata": {}, 931 | "source": [ 932 | "现在我们用keras搭建LSTM模型,模型的第一层是Embedding层,只有当我们把tokens索引转换为词向量矩阵之后,才可以用神经网络对文本进行处理。\n", 933 | "keras提供了Embedding接口,避免了繁琐的稀疏矩阵操作。 \n", 934 | "在Embedding层我们输入的矩阵为:$$(batchsize, maxtokens)$$\n", 935 | "输出矩阵为: $$(batchsize, maxtokens, embeddingdim)$$" 936 | ] 937 | }, 938 | { 939 | "cell_type": "code", 940 | "execution_count": 37, 941 | "metadata": {}, 942 | "outputs": [], 943 | "source": [ 944 | "# 用LSTM对样本进行分类\n", 945 | "model = Sequential()" 946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": 38, 951 | "metadata": {}, 952 | "outputs": [], 953 | "source": [ 954 | "# 模型第一层为embedding\n", 955 | "model.add(Embedding(num_words,\n", 956 | " embedding_dim,\n", 957 | " weights=[embedding_matrix],\n", 958 | " 
input_length=max_tokens,\n", 959 | " trainable=False))" 960 | ] 961 | }, 962 | { 963 | "cell_type": "code", 964 | "execution_count": 39, 965 | "metadata": {}, 966 | "outputs": [], 967 | "source": [ 968 | "# 在2019年6月10日修改了一些大坑的bug, 可能是数据的顺序变了, \n", 969 | "# 结果模型训练的效果没有去年最早的时候效果好了, \n", 970 | "# 有兴趣的同学可以调整一下模型参数, 看看会不会有更好的结果\n", 971 | "model.add(Bidirectional(LSTM(units=64, return_sequences=True)))\n", 972 | "model.add(LSTM(units=16, return_sequences=False))" 973 | ] 974 | }, 975 | { 976 | "cell_type": "markdown", 977 | "metadata": {}, 978 | "source": [ 979 | "**构建模型** \n", 980 | "我在这个教程中尝试了几种神经网络结构,因为训练样本比较少,所以我们可以尽情尝试,训练过程等待时间并不长: \n", 981 | "**GRU:**如果使用GRU的话,测试样本可以达到87%的准确率,但我测试自己的文本内容时发现,GRU最后一层激活函数的输出都在0.5左右,说明模型的判断不是很明确,信心比较低,而且经过测试发现模型对于否定句的判断有时会失误,我们期望对于负面样本输出接近0,正面样本接近1而不是都徘徊于0.5之间。 \n", 982 | "**BiLSTM:**测试了LSTM和BiLSTM,发现BiLSTM的表现最好,LSTM的表现略好于GRU,这可能是因为BiLSTM对于比较长的句子结构有更好的记忆,有兴趣的朋友可以深入研究一下。 \n", 983 | "Embedding之后第一层我们用BiLSTM返回sequences,然后第二层16个单元的LSTM不返回sequences,只返回最终结果,最后是一个全连接层,用sigmoid激活函数输出结果。" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": 40, 989 | "metadata": {}, 990 | "outputs": [], 991 | "source": [ 992 | "# GRU的代码\n", 993 | "# model.add(GRU(units=32, return_sequences=True))\n", 994 | "# model.add(GRU(units=16, return_sequences=True))\n", 995 | "# model.add(GRU(units=4, return_sequences=False))" 996 | ] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": 41, 1001 | "metadata": {}, 1002 | "outputs": [], 1003 | "source": [ 1004 | "model.add(Dense(1, activation='sigmoid'))\n", 1005 | "# 我们使用adam以0.001的learning rate进行优化\n", 1006 | "optimizer = Adam(lr=1e-3)" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "code", 1011 | "execution_count": 42, 1012 | "metadata": {}, 1013 | "outputs": [], 1014 | "source": [ 1015 | "model.compile(loss='binary_crossentropy',\n", 1016 | " optimizer=optimizer,\n", 1017 | " metrics=['accuracy'])" 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "code", 1022 |
"execution_count": 43, 1023 | "metadata": {}, 1024 | "outputs": [ 1025 | { 1026 | "name": "stdout", 1027 | "output_type": "stream", 1028 | "text": [ 1029 | "Model: \"sequential\"\n", 1030 | "_________________________________________________________________\n", 1031 | "Layer (type) Output Shape Param # \n", 1032 | "=================================================================\n", 1033 | "embedding (Embedding) (None, 236, 300) 15000000 \n", 1034 | "_________________________________________________________________\n", 1035 | "bidirectional (Bidirectional (None, 236, 128) 186880 \n", 1036 | "_________________________________________________________________\n", 1037 | "lstm_1 (LSTM) (None, 16) 9280 \n", 1038 | "_________________________________________________________________\n", 1039 | "dense (Dense) (None, 1) 17 \n", 1040 | "=================================================================\n", 1041 | "Total params: 15,196,177\n", 1042 | "Trainable params: 196,177\n", 1043 | "Non-trainable params: 15,000,000\n", 1044 | "_________________________________________________________________\n" 1045 | ] 1046 | } 1047 | ], 1048 | "source": [ 1049 | "# 我们来看一下模型的结构,一共90k左右可训练的变量\n", 1050 | "model.summary()" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": 44, 1056 | "metadata": {}, 1057 | "outputs": [], 1058 | "source": [ 1059 | "# 建立一个权重的存储点\n", 1060 | "path_checkpoint = 'sentiment_checkpoint.keras'\n", 1061 | "checkpoint = ModelCheckpoint(filepath=path_checkpoint, monitor='val_loss',\n", 1062 | " verbose=1, save_weights_only=True,\n", 1063 | " save_best_only=True)" 1064 | ] 1065 | }, 1066 | { 1067 | "cell_type": "code", 1068 | "execution_count": 45, 1069 | "metadata": {}, 1070 | "outputs": [ 1071 | { 1072 | "name": "stdout", 1073 | "output_type": "stream", 1074 | "text": [ 1075 | "Shapes (300, 256) and (300, 128) are incompatible\n" 1076 | ] 1077 | } 1078 | ], 1079 | "source": [ 1080 | "# 尝试加载已训练模型\n", 1081 | "try:\n", 1082 | " 
model.load_weights(path_checkpoint)\n", 1083 | "except Exception as e:\n", 1084 | " print(e)" 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": 46, 1090 | "metadata": {}, 1091 | "outputs": [], 1092 | "source": [ 1093 | "# 定义early stopping, 如果5个epoch内validation loss没有改善则停止训练\n", 1094 | "earlystopping = EarlyStopping(monitor='val_loss', patience=5, verbose=1)" 1095 | ] 1096 | }, 1097 | { 1098 | "cell_type": "code", 1099 | "execution_count": 47, 1100 | "metadata": {}, 1101 | "outputs": [], 1102 | "source": [ 1103 | "# 自动降低learning rate\n", 1104 | "lr_reduction = ReduceLROnPlateau(monitor='val_loss',\n", 1105 | " factor=0.1, min_lr=1e-8, patience=0,\n", 1106 | " verbose=1)" 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "execution_count": 48, 1112 | "metadata": {}, 1113 | "outputs": [], 1114 | "source": [ 1115 | "# 定义callback函数\n", 1116 | "callbacks = [\n", 1117 | " earlystopping, \n", 1118 | " checkpoint,\n", 1119 | " lr_reduction\n", 1120 | "]" 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": 49, 1126 | "metadata": { 1127 | "scrolled": false 1128 | }, 1129 | "outputs": [ 1130 | { 1131 | "name": "stdout", 1132 | "output_type": "stream", 1133 | "text": [ 1134 | "Train on 3240 samples, validate on 360 samples\n", 1135 | "Epoch 1/20\n", 1136 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.6847 - accuracy: 0.5694\n", 1137 | "Epoch 00001: val_loss improved from inf to 0.65662, saving model to sentiment_checkpoint.keras\n", 1138 | "3240/3240 [==============================] - 34s 10ms/sample - loss: 0.6846 - accuracy: 0.5698 - val_loss: 0.6566 - val_accuracy: 0.6639\n", 1139 | "Epoch 2/20\n", 1140 | "3200/3240 [============================>.]
- ETA: 0s - loss: 0.6266 - accuracy: 0.6562\n", 1141 | "Epoch 00002: val_loss improved from 0.65662 to 0.56397, saving model to sentiment_checkpoint.keras\n", 1142 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.6266 - accuracy: 0.6556 - val_loss: 0.5640 - val_accuracy: 0.7139\n", 1143 | "Epoch 3/20\n", 1144 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.5093 - accuracy: 0.7591\n", 1145 | "Epoch 00003: val_loss improved from 0.56397 to 0.51803, saving model to sentiment_checkpoint.keras\n", 1146 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.5100 - accuracy: 0.7583 - val_loss: 0.5180 - val_accuracy: 0.7556\n", 1147 | "Epoch 4/20\n", 1148 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4914 - accuracy: 0.7734\n", 1149 | "Epoch 00004: val_loss improved from 0.51803 to 0.43727, saving model to sentiment_checkpoint.keras\n", 1150 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4904 - accuracy: 0.7744 - val_loss: 0.4373 - val_accuracy: 0.8250\n", 1151 | "Epoch 5/20\n", 1152 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4549 - accuracy: 0.8006\n", 1153 | "Epoch 00005: val_loss did not improve from 0.43727\n", 1154 | "\n", 1155 | "Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.\n", 1156 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.4564 - accuracy: 0.7994 - val_loss: 0.4508 - val_accuracy: 0.8000\n", 1157 | "Epoch 6/20\n", 1158 | "3200/3240 [============================>.] 
- ETA: 0s - loss: 0.4261 - accuracy: 0.8206\n", 1159 | "Epoch 00006: val_loss did not improve from 0.43727\n", 1160 | "\n", 1161 | "Epoch 00006: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.\n", 1162 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.4260 - accuracy: 0.8210 - val_loss: 0.4374 - val_accuracy: 0.8139\n", 1163 | "Epoch 7/20\n", 1164 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4158 - accuracy: 0.8256\n", 1165 | "Epoch 00007: val_loss improved from 0.43727 to 0.43676, saving model to sentiment_checkpoint.keras\n", 1166 | "3240/3240 [==============================] - 31s 9ms/sample - loss: 0.4160 - accuracy: 0.8256 - val_loss: 0.4368 - val_accuracy: 0.8139\n", 1167 | "Epoch 8/20\n", 1168 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4163 - accuracy: 0.8222\n", 1169 | "Epoch 00008: val_loss improved from 0.43676 to 0.43648, saving model to sentiment_checkpoint.keras\n", 1170 | "3240/3240 [==============================] - 31s 9ms/sample - loss: 0.4142 - accuracy: 0.8241 - val_loss: 0.4365 - val_accuracy: 0.8139\n", 1171 | "Epoch 9/20\n", 1172 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4125 - accuracy: 0.8241\n", 1173 | "Epoch 00009: val_loss improved from 0.43648 to 0.43615, saving model to sentiment_checkpoint.keras\n", 1174 | "3240/3240 [==============================] - 31s 9ms/sample - loss: 0.4131 - accuracy: 0.8228 - val_loss: 0.4361 - val_accuracy: 0.8139\n", 1175 | "Epoch 10/20\n", 1176 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4126 - accuracy: 0.8241\n", 1177 | "Epoch 00010: val_loss improved from 0.43615 to 0.43576, saving model to sentiment_checkpoint.keras\n", 1178 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4120 - accuracy: 0.8247 - val_loss: 0.4358 - val_accuracy: 0.8167\n", 1179 | "Epoch 11/20\n", 1180 | "3200/3240 [============================>.] 
- ETA: 0s - loss: 0.4110 - accuracy: 0.8253\n", 1181 | "Epoch 00011: val_loss improved from 0.43576 to 0.43573, saving model to sentiment_checkpoint.keras\n", 1182 | "\n", 1183 | "Epoch 00011: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.\n", 1184 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4109 - accuracy: 0.8253 - val_loss: 0.4357 - val_accuracy: 0.8167\n", 1185 | "Epoch 12/20\n", 1186 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4112 - accuracy: 0.8241\n", 1187 | "Epoch 00012: val_loss improved from 0.43573 to 0.43573, saving model to sentiment_checkpoint.keras\n", 1188 | "\n", 1189 | "Epoch 00012: ReduceLROnPlateau reducing learning rate to 1.0000001111620805e-07.\n", 1190 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1191 | "Epoch 13/20\n", 1192 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4109 - accuracy: 0.8238\n", 1193 | "Epoch 00013: val_loss improved from 0.43573 to 0.43572, saving model to sentiment_checkpoint.keras\n", 1194 | "\n", 1195 | "Epoch 00013: ReduceLROnPlateau reducing learning rate to 1.000000082740371e-08.\n", 1196 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1197 | "Epoch 14/20\n", 1198 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4098 - accuracy: 0.8250\n", 1199 | "Epoch 00014: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n", 1200 | "\n", 1201 | "Epoch 00014: ReduceLROnPlateau reducing learning rate to 1e-08.\n", 1202 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1203 | "Epoch 15/20\n", 1204 | "3200/3240 [============================>.] 
- ETA: 0s - loss: 0.4090 - accuracy: 0.8253\n", 1205 | "Epoch 00015: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n", 1206 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1207 | "Epoch 16/20\n", 1208 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4103 - accuracy: 0.8247\n", 1209 | "Epoch 00016: val_loss did not improve from 0.43572\n", 1210 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1211 | "Epoch 17/20\n", 1212 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4108 - accuracy: 0.8244\n", 1213 | "Epoch 00017: val_loss did not improve from 0.43572\n", 1214 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1215 | "Epoch 18/20\n", 1216 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4107 - accuracy: 0.8247\n", 1217 | "Epoch 00018: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n", 1218 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1219 | "Epoch 19/20\n", 1220 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4080 - accuracy: 0.8263\n", 1221 | "Epoch 00019: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n", 1222 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1223 | "Epoch 20/20\n", 1224 | "3200/3240 [============================>.] 
- ETA: 0s - loss: 0.4115 - accuracy: 0.8234\n", 1225 | "Epoch 00020: val_loss did not improve from 0.43572\n", 1226 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n" 1227 | ] 1228 | }, 1229 | { 1230 | "data": { 1231 | "text/plain": [ 1232 | "" 1233 | ] 1234 | }, 1235 | "execution_count": 49, 1236 | "metadata": {}, 1237 | "output_type": "execute_result" 1238 | } 1239 | ], 1240 | "source": [ 1241 | "# 开始训练\n", 1242 | "model.fit(X_train, y_train,\n", 1243 | " validation_split=0.1, \n", 1244 | " epochs=20,\n", 1245 | " batch_size=128,\n", 1246 | " callbacks=callbacks)" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "markdown", 1251 | "metadata": {}, 1252 | "source": [ 1253 | "**结论** \n", 1254 | "我们首先对测试样本进行预测,得到了还算满意的准确度。 \n", 1255 | "之后我们定义一个预测函数,来预测输入的文本的极性,可见模型对于否定句和一些简单的逻辑结构都可以进行准确的判断。" 1256 | ] 1257 | }, 1258 | { 1259 | "cell_type": "code", 1260 | "execution_count": 51, 1261 | "metadata": {}, 1262 | "outputs": [ 1263 | { 1264 | "name": "stdout", 1265 | "output_type": "stream", 1266 | "text": [ 1267 | "400/400 [==============================] - 1s 3ms/sample - loss: 0.4799 - accuracy: 0.7675\n", 1268 | "Accuracy:76.75%\n" 1269 | ] 1270 | } 1271 | ], 1272 | "source": [ 1273 | "result = model.evaluate(X_test, y_test)\n", 1274 | "print('Accuracy:{0:.2%}'.format(result[1]))" 1275 | ] 1276 | }, 1277 | { 1278 | "cell_type": "code", 1279 | "execution_count": 52, 1280 | "metadata": {}, 1281 | "outputs": [], 1282 | "source": [ 1283 | "def predict_sentiment(text):\n", 1284 | " print(text)\n", 1285 | " # 去标点\n", 1286 | " text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\",text)\n", 1287 | " # 分词\n", 1288 | " cut = jieba.cut(text)\n", 1289 | " cut_list = [ i for i in cut ]\n", 1290 | " # tokenize\n", 1291 | " for i, word in enumerate(cut_list):\n", 1292 | " try:\n", 1293 | " cut_list[i] = cn_model.vocab[word].index\n", 1294 | " if cut_list[i] >= num_words:\n", 1295 | " cut_list[i] = 0\n", 1296 | " except KeyError:\n", 1297 | " cut_list[i] = 0\n", 1298 | " # padding\n", 1299 | " tokens_pad = pad_sequences([cut_list], maxlen=max_tokens,\n", 1300 | " padding='pre', truncating='pre')\n", 1301 | " # 预测\n", 1302 | " result = model.predict(x=tokens_pad)\n", 1303 | " coef = result[0][0]\n", 1304 | " if coef >= 0.5:\n", 1305 | " print('是一例正面评价','output=%.2f'%coef)\n", 1306 | " else:\n", 1307 | " print('是一例负面评价','output=%.2f'%coef)" 1308 | ] 1309 | }, 1310 | { 1311 | "cell_type": "code", 1312 | "execution_count": 53, 1313 | "metadata": {}, 1314 | "outputs": [ 1315 | { 1316 | "name": "stdout", 1317 | "output_type": "stream", 1318 | "text": [ 1319 | "酒店设施不是新的,服务态度很不好\n", 1320 | "是一例正面评价 output=0.63\n", 1321 | "酒店卫生条件非常不好\n", 1322 | "是一例负面评价 output=0.47\n", 1323 | "床铺非常舒适\n", 1324 | "是一例正面评价 output=0.79\n", 1325 | "房间很凉,不给开暖气\n", 1326 | "是一例正面评价 output=0.53\n", 1327 | "房间很凉爽,空调冷气很足\n", 1328 | "是一例正面评价 output=0.79\n", 1329 | "酒店环境不好,住宿体验很不好\n", 1330 | "是一例负面评价 output=0.48\n", 1331 | "房间隔音不到位\n", 1332 | "是一例负面评价 output=0.31\n", 1333 | "晚上回来发现没有打扫卫生\n", 1334 | "是一例正面评价 output=0.50\n", 1335 | "因为过节所以要我临时加钱,比团购的价格贵\n", 1336 | "是一例负面评价 output=0.47\n" 1337 | ] 1338 | } 1339 | ], 1340 | "source": [ 1341 | "test_list = [\n", 1342 | " '酒店设施不是新的,服务态度很不好',\n", 1343 | " '酒店卫生条件非常不好',\n", 1344 | " '床铺非常舒适',\n", 1345 | " '房间很凉,不给开暖气',\n", 1346 | " '房间很凉爽,空调冷气很足',\n", 1347 | " '酒店环境不好,住宿体验很不好',\n", 1348 | " '房间隔音不到位' ,\n", 1349 | " '晚上回来发现没有打扫卫生',\n", 1350 | " '因为过节所以要我临时加钱,比团购的价格贵'\n", 1351 | "]\n", 1352 | "for text in test_list:\n", 1353 | " predict_sentiment(text)" 1354 | ] 1355 | }, 1356 | { 1357 | "cell_type": "markdown", 1358 | "metadata": {}, 1359 | "source": [ 1360 | "**错误分类的文本** \n", 1361 | "经过查看,发现错误分类的文本的含义大多比较含糊,就算人类也不容易判断极性,如index为101的这个句子,好像没有一点满意的成分,但这个例子在训练样本中被标记为正面评价,而我们的模型做出的负面评价的预测似乎是合理的。" 1362 | ] 1363 | }, 1364 | { 1365 | "cell_type": "code", 1366 | "execution_count": 54, 1367 | "metadata": {}, 1368 |
"outputs": [], 1369 | "source": [ 1370 | "y_pred = model.predict(X_test)\n", 1371 | "y_pred = y_pred.T[0]\n", 1372 | "y_pred = [1 if p>= 0.5 else 0 for p in y_pred]\n", 1373 | "y_pred = np.array(y_pred)" 1374 | ] 1375 | }, 1376 | { 1377 | "cell_type": "code", 1378 | "execution_count": 55, 1379 | "metadata": {}, 1380 | "outputs": [], 1381 | "source": [ 1382 | "y_actual = np.array(y_test)" 1383 | ] 1384 | }, 1385 | { 1386 | "cell_type": "code", 1387 | "execution_count": 56, 1388 | "metadata": {}, 1389 | "outputs": [], 1390 | "source": [ 1391 | "# 找出错误分类的索引\n", 1392 | "misclassified = np.where( y_pred != y_actual )[0]" 1393 | ] 1394 | }, 1395 | { 1396 | "cell_type": "code", 1397 | "execution_count": 57, 1398 | "metadata": { 1399 | "scrolled": true 1400 | }, 1401 | "outputs": [ 1402 | { 1403 | "name": "stdout", 1404 | "output_type": "stream", 1405 | "text": [ 1406 | "400\n" 1407 | ] 1408 | } 1409 | ], 1410 | "source": [ 1411 | "# 输出所有错误分类的索引\n", 1412 | "len(misclassified)\n", 1413 | "print(len(X_test))" 1414 | ] 1415 | }, 1416 | { 1417 | "cell_type": "code", 1418 | "execution_count": 58, 1419 | "metadata": {}, 1420 | "outputs": [ 1421 | { 1422 | "name": "stdout", 1423 | "output_type": "stream", 1424 | "text": [ 1425 | " 由于2007年 有一些新问题可能还没来得及解决我因为工作需要经常要住那里所以慎重的提出以下 :1 后 的 淋浴喷头的位置都太高我换了房间还是一样很不好用2 后的一些管理和服务还很不到位尤其是前台入住和 时代效率太低每次 都超过10分钟好像不符合 宾馆的要求\n", 1426 | "预测的分类 0\n", 1427 | "实际的分类 1\n" 1428 | ] 1429 | } 1430 | ], 1431 | "source": [ 1432 | "# 我们来找出错误分类的样本看看\n", 1433 | "idx=101\n", 1434 | "print(reverse_tokens(X_test[idx]))\n", 1435 | "print('预测的分类', y_pred[idx])\n", 1436 | "print('实际的分类', y_actual[idx])" 1437 | ] 1438 | }, 1439 | { 1440 | "cell_type": "code", 1441 | "execution_count": 59, 1442 | "metadata": {}, 1443 | "outputs": [ 1444 | { 1445 | "name": "stdout", 1446 | "output_type": "stream", 1447 | "text": [ 1448 | " 还是很 设施也不错但是 和以前 比急剧下滑了 和客房 的服务极差幸好我不是很在乎\n", 1449 | "预测的分类 1\n", 1450 | "实际的分类 1\n" 1451 | ] 1452 | } 1453 | ], 1454 | "source": [ 1455 | "idx=1\n", 
1456 | "print(reverse_tokens(X_test[idx]))\n", 1457 | "print('预测的分类', y_pred[idx])\n", 1458 | "print('实际的分类', y_actual[idx])" 1459 | ] 1460 | }, 1461 | { 1462 | "cell_type": "code", 1463 | "execution_count": null, 1464 | "metadata": {}, 1465 | "outputs": [], 1466 | "source": [] 1467 | }, 1468 | { 1469 | "cell_type": "code", 1470 | "execution_count": null, 1471 | "metadata": {}, 1472 | "outputs": [], 1473 | "source": [] 1474 | }, 1475 | { 1476 | "cell_type": "code", 1477 | "execution_count": null, 1478 | "metadata": {}, 1479 | "outputs": [], 1480 | "source": [] 1481 | } 1482 | ], 1483 | "metadata": { 1484 | "kernelspec": { 1485 | "display_name": "Python 3", 1486 | "language": "python", 1487 | "name": "python3" 1488 | }, 1489 | "language_info": { 1490 | "codemirror_mode": { 1491 | "name": "ipython", 1492 | "version": 3 1493 | }, 1494 | "file_extension": ".py", 1495 | "mimetype": "text/x-python", 1496 | "name": "python", 1497 | "nbconvert_exporter": "python", 1498 | "pygments_lexer": "ipython3", 1499 | "version": "3.6.8" 1500 | } 1501 | }, 1502 | "nbformat": 4, 1503 | "nbformat_minor": 2 1504 | } 1505 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # chinese_sentiment 2 | **用Tensorflow进行中文自然语言处理分类实践** 3 | 1. 词向量下载地址: 4 | 链接: https://pan.baidu.com/s/1GerioMpwj1zmju9NkkrsFg 5 | 提取码: x6v3 6 | 请下载之后在项目根目录建立"embeddings"文件夹, 将下载的文件放入(不用解压), 即可运行代码. 7 | 2. 很多同学遇到乱码等bug, 很抱歉没能及时回复, 现已重新处理了语料和代码, 已经没有了乱码的问题. 8 | 3. 修改了bug后, 可能是数据的顺序变了, 结果模型训练的效果相比去年差了一些, 有兴趣的同学可以调整一下模型参数, 看看会不会有更好的结果. 9 | 4. 代码写的比较早, 有些地方可能有坑, 现在先不重写了, 因为LSTM实在是属于比较老的模型, 近期会发布transformer语言模型的教程, 请大家关注. 10 | 5. 注意, debug之后的代码在"2019新版debug之后--中文自然语言处理--情感分析.ipynb"里, 对应的语料文件是"negative_samples.txt", "positive_samples.txt"这两个. 11 | 6. 如果有问题请在视频评论区留言, 这样各位学习的同学可以互相帮助解决问题, 或者在项目里提issue, 尽量不要给我写邮件, 因为可能回复不及时. 
12 | 13 | 教学视频地址: 14 | youtube: 15 | https://www.youtube.com/watch?v=-mcrmLmNOXA&t=991s 16 | bilibili: 17 | https://www.bilibili.com/video/av30543613?from=search&seid=74343163897647645 18 | 老版本中pos和neg中的语料不全,请解压“语料.zip”覆盖 19 | -------------------------------------------------------------------------------- /flowchart.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/flowchart.jpg -------------------------------------------------------------------------------- /neg/neg.0.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.0.txt -------------------------------------------------------------------------------- /neg/neg.1.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1.txt -------------------------------------------------------------------------------- /neg/neg.10.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.10.txt -------------------------------------------------------------------------------- /neg/neg.1000.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1000.txt -------------------------------------------------------------------------------- /neg/neg.1001.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1001.txt
--------------------------------------------------------------------------------
(The remaining /neg/neg.*.txt and /pos/pos.*.txt entries listed in the file tree follow the same raw-blob placeholder pattern and are omitted here.)
-------------------------------------------------------------------------------- /pos/pos.1026.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1026.txt -------------------------------------------------------------------------------- /pos/pos.1027.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1027.txt -------------------------------------------------------------------------------- /pos/pos.1028.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1028.txt -------------------------------------------------------------------------------- /pos/pos.1029.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1029.txt -------------------------------------------------------------------------------- /pos/pos.103.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.103.txt -------------------------------------------------------------------------------- /pos/pos.1030.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1030.txt -------------------------------------------------------------------------------- /pos/pos.1031.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1031.txt -------------------------------------------------------------------------------- /pos/pos.1032.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1032.txt -------------------------------------------------------------------------------- /pos/pos.1033.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1033.txt -------------------------------------------------------------------------------- /pos/pos.1034.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1034.txt -------------------------------------------------------------------------------- /pos/pos.1035.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1035.txt -------------------------------------------------------------------------------- /pos/pos.1036.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1036.txt -------------------------------------------------------------------------------- /pos/pos.1037.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1037.txt 
-------------------------------------------------------------------------------- /pos/pos.1038.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1038.txt -------------------------------------------------------------------------------- /pos/pos.1039.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1039.txt -------------------------------------------------------------------------------- /pos/pos.104.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.104.txt -------------------------------------------------------------------------------- /pos/pos.1040.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1040.txt -------------------------------------------------------------------------------- /pos/pos.1041.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1041.txt -------------------------------------------------------------------------------- /pos/pos.1042.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1042.txt -------------------------------------------------------------------------------- /pos/pos.1043.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1043.txt -------------------------------------------------------------------------------- /pos/pos.1044.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1044.txt -------------------------------------------------------------------------------- /pos/pos.1045.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1045.txt -------------------------------------------------------------------------------- /pos/pos.1046.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1046.txt -------------------------------------------------------------------------------- /pos/pos.1047.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1047.txt -------------------------------------------------------------------------------- /pos/pos.1048.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1048.txt -------------------------------------------------------------------------------- /pos/pos.1049.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1049.txt 
-------------------------------------------------------------------------------- /pos/pos.105.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.105.txt -------------------------------------------------------------------------------- /pos/pos.1050.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1050.txt -------------------------------------------------------------------------------- /pos/pos.1051.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1051.txt -------------------------------------------------------------------------------- /pos/pos.1052.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1052.txt -------------------------------------------------------------------------------- /pos/pos.1053.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1053.txt -------------------------------------------------------------------------------- /pos/pos.1054.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1054.txt -------------------------------------------------------------------------------- /pos/pos.1055.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1055.txt -------------------------------------------------------------------------------- /pos/pos.1056.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1056.txt -------------------------------------------------------------------------------- /pos/pos.1057.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1057.txt -------------------------------------------------------------------------------- /pos/pos.1058.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1058.txt -------------------------------------------------------------------------------- /pos/pos.1059.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1059.txt -------------------------------------------------------------------------------- /pos/pos.106.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.106.txt -------------------------------------------------------------------------------- /pos/pos.1060.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1060.txt 
-------------------------------------------------------------------------------- /pos/pos.1061.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1061.txt -------------------------------------------------------------------------------- /pos/pos.1062.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1062.txt -------------------------------------------------------------------------------- /pos/pos.1063.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1063.txt -------------------------------------------------------------------------------- /pos/pos.1064.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1064.txt -------------------------------------------------------------------------------- /pos/pos.1065.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1065.txt -------------------------------------------------------------------------------- /pos/pos.107.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.107.txt -------------------------------------------------------------------------------- /pos/pos.1073.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1073.txt -------------------------------------------------------------------------------- /pos/pos.1074.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1074.txt -------------------------------------------------------------------------------- /pos/pos.1075.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1075.txt -------------------------------------------------------------------------------- /pos/pos.1076.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1076.txt -------------------------------------------------------------------------------- /pos/pos.1077.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1077.txt -------------------------------------------------------------------------------- /pos/pos.1078.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1078.txt -------------------------------------------------------------------------------- /pos/pos.1079.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1079.txt 
-------------------------------------------------------------------------------- /pos/pos.108.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.108.txt -------------------------------------------------------------------------------- /pos/pos.1080.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1080.txt -------------------------------------------------------------------------------- /pos/pos.1081.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1081.txt -------------------------------------------------------------------------------- /pos/pos.1082.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1082.txt -------------------------------------------------------------------------------- /pos/pos.1083.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1083.txt -------------------------------------------------------------------------------- /pos/pos.1084.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1084.txt -------------------------------------------------------------------------------- /pos/pos.1085.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1085.txt -------------------------------------------------------------------------------- /pos/pos.1086.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1086.txt -------------------------------------------------------------------------------- /pos/pos.1087.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1087.txt -------------------------------------------------------------------------------- /pos/pos.1088.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1088.txt -------------------------------------------------------------------------------- /pos/pos.1089.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1089.txt -------------------------------------------------------------------------------- /pos/pos.109.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.109.txt -------------------------------------------------------------------------------- /pos/pos.1090.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1090.txt 
/pos/pos.1091.txt … /pos/pos.1097.txt: as above, each of these files contains only the corresponding raw-file URL https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/&lt;filename&gt; (entries condensed).

/中文自然语言处理--情感分析.ipynb:
--------------------------------------------------------------------------------

# Chinese Natural Language Processing with TensorFlow: Sentiment Analysis

$$f('真好喝')=1$$
$$f('太难喝了')=0$$

**Introduction**
Hello, I am Espresso, and this is the first tutorial I have made: a simple hands-on classification exercise in Chinese natural language processing.
Why make it? Although there are many NLP learning resources nowadays (far more in English), online material is chaotic: systematic Chinese-language material is scarce, the knowledge points are fragmented, and practical, worked examples are lacking; even where code exists, the absence of comments means it takes a long time to understand. In my own study it took me a full day of web searching before I had sorted out the steps and tools needed to process Chinese.
So I felt obliged to make an introductory tutorial that combines the scattered material into one practical case for learners. This tutorial emphasizes the practical side; for theory I recommend the deeplearning.ai courses. Where the code below touches on a topic, I recommend some learning material and attach links. If anything here infringes your rights, please e-mail: a66777@188.com.
Also, I have done no deep research in NLP myself, so experts are welcome to poke holes; I hope you will point out shortcomings and ways to improve.

**Required libraries**
numpy
jieba
gensim
tensorflow
matplotlib

```python
# First, load the required libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import re
import jieba  # jieba Chinese word segmentation
# gensim is used to load the pre-trained word vectors
from gensim.models import KeyedVectors
import warnings
warnings.filterwarnings("ignore")
```

Recorded stderr (harmless, Windows-only): `UserWarning: detected Windows; aliasing chunkize to chunkize_serial`

**Pre-trained word vectors**
This tutorial uses "chinese-word-vectors", open-sourced by researchers at the Institute of Chinese Information Processing, Beijing Normal University, and the DBIIR lab, Renmin University of China. GitHub link:
https://github.com/Embedding/Chinese-Word-Vectors
If you do not know what word2vec is, I recommend the following article:
https://zhuanlan.zhihu.com/p/26306795
Here we use the "chinese-word-vectors" Zhihu Word + Ngram vectors, which can be downloaded from the GitHub link above. We first load the pre-trained model and run some simple tests:

```python
# Load the pre-trained Chinese word embeddings with gensim
cn_model = KeyedVectors.load_word2vec_format('chinese_word_vectors/sgns.zhihu.bigram',
                                             binary=False)
```

**The word-vector model**
In this word-vector model, every word is an index corresponding to a vector of length 300. The LSTM neural network we are going to build cannot process Chinese text directly: the text must first be segmented into words, and the words converted into word vectors; the steps are shown in the flowchart and will be explained step by step alongside the code. If you do not know what RNNs, GRUs, and LSTMs are, I recommend the deeplearning.ai course; NetEase Open Courses carries a free version with Chinese subtitles, but I still recommend the Coursera original, which includes the quizzes and programming exercises.

```python
# As this shows, every word corresponds to a vector of length 300
embedding_dim = cn_model['山东大学'].shape[0]
print('词向量的长度为{}'.format(embedding_dim))
cn_model['山东大学']
```

Recorded output: `词向量的长度为300`, followed by the 300-dimensional float32 vector for '山东大学' (beginning `array([-2.603470e-01, 3.677500e-01, -2.379650e-01, ...], dtype=float32)`; full printout condensed).

Cosine Similarity for Vector Space Models by Christian S. Perone:
http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

```python
# Compute similarity
cn_model.similarity('橘子', '橙子')
```

Recorded output: `0.66128117`

```python
# dot('橘子'/|'橘子'|, '橙子'/|'橙子'|)
np.dot(cn_model['橘子'] / np.linalg.norm(cn_model['橘子']),
       cn_model['橙子'] / np.linalg.norm(cn_model['橙子']))
```

Recorded output: `0.66128117`, identical to `similarity`, which computes exactly this normalized dot product.

```python
# Find the most similar words by cosine similarity
cn_model.most_similar(positive=['大学'], topn=10)
```

Recorded output:
```
[('高中', 0.7247823476791382),
 ('本科', 0.6768535375595093),
 ('研究生', 0.6244412660598755),
 ('中学', 0.6088204979896545),
 ('大学本科', 0.595908522605896),
 ('初中', 0.5883588790893555),
 ('读研', 0.5778335332870483),
 ('职高', 0.5767995119094849),
 ('大学毕业', 0.5767451524734497),
 ('师范大学', 0.5708829760551453)]
```

```python
# Find the word that does not belong
test_words = '老师 会计师 程序员 律师 医生 老人'
test_words_result = cn_model.doesnt_match(test_words.split())
print('在 '+test_words+' 中:\n不是同一类别的词为: %s' % test_words_result)
```

Recorded output:
```
在 老师 会计师 程序员 律师 医生 老人 中:
不是同一类别的词为: 老人
```

```python
cn_model.most_similar(positive=['女人','出轨'], negative=['男人'], topn=1)
```

**Training corpus**
This tutorial uses Professor Tan Songbo's hotel-review corpus. Even this corpus was hard to find a download link for: one blog wanted points for the download and I did not know how to earn them; a link I finally found turned out to be dead; in the end, pasting the link into Thunder (迅雷) got it downloaded. I hope everyone will share resources more in the future.
The training samples are placed in two folders, pos and neg. Each folder holds 2000 txt files, and each file contains one review, giving 4000 training samples in total; a sample of this size counts as very tiny in NLP:
获得样本的索引,样本存放于两个文件夹中,\n", 333 | "# 分别为 正面评价'pos'文件夹 和 负面评价'neg'文件夹\n", 334 | "# 每个文件夹中有2000个txt文件,每个文件中是一例评价\n", 335 | "import os\n", 336 | "pos_txts = os.listdir('pos')\n", 337 | "neg_txts = os.listdir('neg')" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 39, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "name": "stdout", 347 | "output_type": "stream", 348 | "text": [ 349 | "样本总共: 4000\n" 350 | ] 351 | } 352 | ], 353 | "source": [ 354 | "print( '样本总共: '+ str(len(pos_txts) + len(neg_txts)) )" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 40, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "# 现在我们将所有的评价内容放置到一个list里\n", 364 | "\n", 365 | "train_texts_orig = [] # 存储所有评价,每例评价为一条string\n", 366 | "\n", 367 | "# 添加完所有样本之后,train_texts_orig为一个含有4000条文本的list\n", 368 | "# 其中前2000条文本为正面评价,后2000条为负面评价\n", 369 | "\n", 370 | "for i in range(len(pos_txts)):\n", 371 | " with open('pos/'+pos_txts[i], 'r', errors='ignore') as f:\n", 372 | " text = f.read().strip()\n", 373 | " train_texts_orig.append(text)\n", 374 | " f.close()\n", 375 | "for i in range(len(neg_txts)):\n", 376 | " with open('neg/'+neg_txts[i], 'r', errors='ignore') as f:\n", 377 | " text = f.read().strip()\n", 378 | " train_texts_orig.append(text)\n", 379 | " f.close()" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 41, 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "data": { 389 | "text/plain": [ 390 | "4000" 391 | ] 392 | }, 393 | "execution_count": 41, 394 | "metadata": {}, 395 | "output_type": "execute_result" 396 | } 397 | ], 398 | "source": [ 399 | "len(train_texts_orig)" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 42, 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "# 我们使用tensorflow的keras接口来建模\n", 409 | "from tensorflow.python.keras.models import Sequential\n", 410 | "from tensorflow.python.keras.layers import Dense, GRU, Embedding, 
LSTM, Bidirectional\n", 411 | "from tensorflow.python.keras.preprocessing.text import Tokenizer\n", 412 | "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n", 413 | "from tensorflow.python.keras.optimizers import RMSprop\n", 414 | "from tensorflow.python.keras.optimizers import Adam\n", 415 | "from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "**分词和tokenize** \n", 423 | "首先我们去掉每个样本的标点符号,然后用jieba分词,jieba分词返回一个生成器,没法直接进行tokenize,所以我们将分词结果转换成一个list,并将它索引化,这样每一例评价的文本变成一段索引数字,对应着预训练词向量模型中的词。" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 43, 429 | "metadata": {}, 430 | "outputs": [ 431 | { 432 | "name": "stderr", 433 | "output_type": "stream", 434 | "text": [ 435 | "Building prefix dict from the default dictionary ...\n", 436 | "Loading model from cache C:\\Users\\jinan\\AppData\\Local\\Temp\\jieba.cache\n", 437 | "Loading model cost 0.672 seconds.\n", 438 | "Prefix dict has been built succesfully.\n" 439 | ] 440 | } 441 | ], 442 | "source": [ 443 | "# 进行分词和tokenize\n", 444 | "# train_tokens是一个长长的list,其中含有4000个小list,对应每一条评价\n", 445 | "train_tokens = []\n", 446 | "for text in train_texts_orig:\n", 447 | " # 去掉标点\n", 448 | " text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\",text)\n", 449 | " # 结巴分词\n", 450 | " cut = jieba.cut(text)\n", 451 | " # 结巴分词的输出结果为一个生成器\n", 452 | " # 把生成器转换为list\n", 453 | " cut_list = [ i for i in cut ]\n", 454 | " for i, word in enumerate(cut_list):\n", 455 | " try:\n", 456 | " # 将词转换为索引index\n", 457 | " cut_list[i] = cn_model.vocab[word].index\n", 458 | " except KeyError:\n", 459 | " # 如果词不在字典中,则输出0\n", 460 | " cut_list[i] = 0\n", 461 | " train_tokens.append(cut_list)" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "**索引长度标准化** \n", 469 | 
"因为每段评语的长度是不一样的,我们如果单纯取最长的一个评语,并把其他评填充成同样的长度,这样十分浪费计算资源,所以我们取一个折衷的长度。" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 44, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [ 478 | "# 获得所有tokens的长度\n", 479 | "num_tokens = [ len(tokens) for tokens in train_tokens ]\n", 480 | "num_tokens = np.array(num_tokens)" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": 45, 486 | "metadata": {}, 487 | "outputs": [ 488 | { 489 | "data": { 490 | "text/plain": [ 491 | "71.4495" 492 | ] 493 | }, 494 | "execution_count": 45, 495 | "metadata": {}, 496 | "output_type": "execute_result" 497 | } 498 | ], 499 | "source": [ 500 | "# 平均tokens的长度\n", 501 | "np.mean(num_tokens)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 46, 507 | "metadata": {}, 508 | "outputs": [ 509 | { 510 | "data": { 511 | "text/plain": [ 512 | "1540" 513 | ] 514 | }, 515 | "execution_count": 46, 516 | "metadata": {}, 517 | "output_type": "execute_result" 518 | } 519 | ], 520 | "source": [ 521 | "# 最长的评价tokens的长度\n", 522 | "np.max(num_tokens)" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 89, 528 | "metadata": {}, 529 | "outputs": [ 530 | { 531 | "data": { 532 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYsAAAEWCAYAAACXGLsWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAHXBJREFUeJzt3XmYXVWd7vHva0DGCGIKhAQo0CgiDagRaUFF4dogKNxWBgcMg532qqCAV4PYgl69QmOjOBuZIiKCiIKiNohw0QcFwxgEsXkYQgiSIFMYGgi+94+9ipwUVbVPDafOqTrv53nOU2evPaxfdlXO76y19l5btomIiBjK89odQEREdL4ki4iIqJVkERERtZIsIiKiVpJFRETUSrKIiIhaSRbRFEnflvRvY3SszSQ9KmlKWb5c0gfG4tjleL+UNHusjjeMej8v6X5Jfx2DY+0iafFYxDWKGCzppW2ot+3/9niuJItA0p2SnpC0XNJDkq6U9EFJz/592P6g7f/T5LF2G2ob24tsr2v7mTGI/ThJ3+93/D1szx/tsYcZx6bAUcDWtl88wPp8AA6iXUkphifJIvq83fZUYHPgeOCTwKljXYmk1cb6mB1ic+Bvtpe2O5CIVkiyiFXYftj2hcD+wGxJ2wBIOkPS58v7aZJ+XlohD0j6raTnSToT2Az4Welm+oSk3vLN8VBJi4DfNJQ1Jo6XSLpa0sOSLpC0QanrOd/I+1ovknYHPgXsX+q7oax/tlurxPVpSXdJWirpe5LWK+v64pgtaVHpQjpmsHMjab2y/7JyvE+X4+8GXAJsUuI4o99+6wC/bFj/qKRNJK0h6SuSlpTXVyStMUjdh0u6WdKMsryXpOsbWoLb9js/H5d0Yzmf50hac6jf3RB/En3HXEPSl8p5uq90S67V+DuSdFQ5x/dKOrhh3xdJ+pmkRyT9sXTX/a6su6JsdkM5L/s37Dfg8aI9kixiQLavBhYDbxhg9VFlXQ+wEdUHtm0fCCyiaqWsa/vfG/Z5E/AK4J8GqfL9wCHAJsAK4KtNxPgr4P8C55T6thtgs4PK683AlsC6wNf7bbMz8HJgV+Azkl4xSJVfA9Yrx3lTiflg278G9gCWlDgO6hfnY/3Wr2t7CXAMsCOwPbAdsAPw6f6VqhorOgh4k+3Fkl4NnAb8K/Ai4DvAhf0SzX7A7sAWwLZlfxjkdzfIv7fRCcDLSqwvBaYDn2lY/+JybqYDhwLfkPTCsu4bwGNlm9nl1Xdu3ljeblfOyzlNHC/aIMkihrIE2GCA8qeBjYHNbT9t+7eun2TsONuP2X5ikPVn2r6pfLD+G7CfygD4KL0XOMn27bYfBY4GDujXqvms7Sds3wDcQPXBvYoSy/7A0baX274T+A/gwFHG9jnbS20vAz7b73iSdBJVgn1z2QbgX4Dv2L7K9jNlfOZJqsTT56u2l9h+APgZ1Yc8jOB3J0mlziNsP2B7OVWSPqBhs6fLv+Vp278AHgVeXs7bO4FjbT9u+2agmfGkAY/XxH7RIkkWMZTpwAMDlJ8I3AZcLOl2SXObONbdw1h/F7A6MK2pKIe2STle47FXo/pW3afx6qXHqVof/U0Dnj/AsaaPcWybNCyvD8wBvmj74YbyzYGjSlfSQ5IeAjbtt+9g/6aR/O56gLWBaxrq+1Up7/M32ysGqLOH6nw3/n7r/haGOl60SZJFDEjSa6k+CH/Xf135Zn2U7S2BtwNHStq1b/Ugh6xreWza8H4zqm+W91N1X6zdENcUVv2QqjvuEqoP18ZjrwDuq9mvv/tLTP2PdU+T+w8U50CxLWlYfhDYCzhd0k4N5XcDX7C9fsNrbdtn1wYx9O9uMPcDTwCvbKhvPdvNfHgvozrfMxrKNh1k2+hgSRaxCkkvkLQX8EPg+7YXDrDNXpJeWronHgGeKS+oPoS3HEHV75O0taS1gc8B55VLa/8CrClpT0mrU/XpN/bN3wf0DjFIezZwhKQ
tJK3LyjGOFYNsP6ASy7nAFyRNlbQ5cCTw/aH3XCXOF/UNrjfE9mlJPZKmUY0B9L8M+HKq7qqfSHpdKf4u8EFJr1NlnXJ+ptYFUfO7G5Dtv5c6vyxpw3Kc6ZIGG39q3PcZ4HzgOElrS9qKaqyn0Uj/ZmIcJVlEn59JWk71rfUY4CRgsCtQZgK/pupH/j3wzfKhBvBFqg/AhyR9fBj1nwmcQdV9siZwOFRXZwEfAk6h+hb/GNUAbZ8flZ9/k3TtAMc9rRz7CuAO4L+Bw4YRV6PDSv23U7W4flCOX8v2n6mSw+3l3GwCfB5YANwILASuLWX9972E6ndxoaTX2F5ANYbwdarWx22sHMCuM9TvbiifLPX8QdIj5RjNjiF8hGqw+q9Uv4uzqcZY+hwHzC/nZb8mjxnjTHn4UUSMJ0knAC+2Pe532cfIpWURES0laStJ25Yusx2oLoX9SbvjiuGZrHfTRkTnmErV9bQJsJTqkuML2hpRDFu6oSIiola6oSIiotaE7oaaNm2ae3t72x1GRMSEcs0119xvu6d+y5UmdLLo7e1lwYIF7Q4jImJCkXRX/VarSjdURETUSrKIiIhaSRYREVErySIiImolWURERK0ki4iIqJVkERERtZIsIiKiVpJFRETUSrKIUemdexG9cy9qdxgR0WJJFhERUSvJIiIiak3oiQRj8urr2rrz+D2Hvc9w94uIemlZRERErSSLiIiolWQRERG1kiwiIqJWkkVERNRKsoiIiFpJFhERUSvJIiIiaiVZxISReagi2ifJIiIiarUsWUg6TdJSSTc1lJ0o6c+SbpT0E0nrN6w7WtJtkm6V9E+tiisiIoavlS2LM4Dd+5VdAmxje1vgL8DRAJK2Bg4AXln2+aakKS2MLSIihqFlycL2FcAD/coutr2iLP4BmFHe7w380PaTtu8AbgN2aFVsERExPO0cszgE+GV5Px24u2Hd4lL2HJLmSFogacGyZctaHGJERECbkoWkY4AVwFl9RQNs5oH2tT3P9izbs3p6eloVYkRENBj351lImg3sBexquy8hLAY2bdhsBrBkvGOLyWMkz8OIiMGNa8tC0u7AJ4F32H68YdWFwAGS1pC0BTATuHo8Y4uIiMG1rGUh6WxgF2CapMXAsVRXP60BXCIJ4A+2P2j7T5LOBW6m6p76sO1nWhVbREQMT8uShe13D1B86hDbfwH4Qqviic6RLqKIiSd3cEdERK0ki4iIqJVkERERtZIsIiKi1rjfZxExlExBHtGZkiyiqzQmo1yNFdG8dENFREStJIuIiKiVbqiY8DLOEdF6aVlEREStJIuIiKiVbqjoCMPpSsrcUhHjLy2LiIiolWQRERG1kiwiIqJWkkVERNRKsoiIiFpJFhERUSvJIiIiaiVZRERErSSLiIiolTu4oys0c4d47gyPGFxaFhERUatlLQtJpwF7AUttb1PKNgDOAXqBO4H9bD8oScDJwNuAx4GDbF/bqtiiPfp/u8/U4hETRytbFmcAu/crmwtcansmcGlZBtgDmFlec4BvtTCuiFX0zr0oiSuiRsuShe0rgAf6Fe8NzC/v5wP7NJR/z5U/AOtL2rhVsUVExPCM95jFRrbvBSg/Nyzl04G7G7ZbXMqeQ9IcSQskLVi2bFlLg42IiEqnDHBrgDIPtKHtebZn2Z7V09PT4rAiIgLGP1nc19e9VH4uLeWLgU0btpsBLBnn2CIiYhDjnSwuBGaX97OBCxrK36/KjsDDfd1VERHRfq28dPZsYBdgmqTFwLHA8cC5kg4FFgH7ls1/QXXZ7G1Ul84e3Kq4IiJi+FqWLGy/e5BVuw6wrYEPtyqWiIgYnUz3ERNW7o2IGD+1YxaSdpK0Tnn/PkknSdq89aFFRESnaGaA+1vA45K2Az4B3AV8r6VRxaSQO6MjJo9muqFW2LakvYGTbZ8qaXbtXjHhdcIsrKNNNklWEWOjmWSxXNLRwPuAN0qaAqze2rAiIqKTNNMNtT/wJHCo7b9
STcNxYkujioiIjlLbsigJ4qSG5UVkzCIioqs0czXUP0v6L0kPS3pE0nJJj4xHcBER0RmaGbP4d+Dttm9pdTAREdGZmhmzuC+JIiKiuzXTslgg6Rzgp1QD3QDYPr9lUUVEREdpJlm8gGpyv7c2lBlIsoiI6BLNXA2VGWAjIrpcM1dDvUzSpZJuKsvbSvp060OLiIhO0cwA93eBo4GnAWzfCBzQyqAiIqKzNJMs1rZ9db+yFa0IJiIiOlMzyeJ+SS+hGtRG0ruAPPI0ulJm0o1u1czVUB8G5gFbSboHuINqUsGISakTZtuN6DTNJIt7bO9WHoD0PNvLJW3Q6sAiIqJzNNMNdb6k1Ww/VhLFi4FLWh1YdI50vUREM8nip8B5kqZI6gUupro6KiIiukQzN+V9V9LzqZJGL/Cvtq9sdWAREdE5Bk0Wko5sXAQ2Ba4HdpS0o+2TBt6znqQjgA9QXWG1EDgY2Bj4IbABcC1woO2nRlpHRESMnaG6oaY2vNYFfgLc1lA2IpKmA4cDs2xvA0yhusnvBODLtmcCDwKHjrSOiIgYW4O2LGx/tnFZ0tSq2I+OUb1rSXoaWJvqvo23AO8p6+cDxwHfGoO6IiJilGrHLCRtA5xJ1T2EpPuB99v+00gqtH2PpC8Bi4AnqAbMrwEest13Z/hiqmd9DxTPHGAOwGabbTaSEKJGrnyKiP6auRpqHnCk7c1tbw4cRTVf1IhIeiGwN7AFsAmwDrDHAJt6oP1tz7M9y/asnp6ekYYRERHD0MxNeevYvqxvwfbl5Qa9kdoNuMP2MgBJ5wOvB9Yv93OsAGYAS0ZRR0SttKAimtdMsrhd0r9RdUVBNdXHHaOocxHVFVVrU3VD7QosAC4D3kV1RdRs4IJR1BEt0PjhmqkwIrpLM91QhwA9VE/GOx+YBhw00gptXwWcR3V57MISwzzgk8CRkm4DXgScOtI6IiJibDXTstjN9uGNBZL2BX400kptHwsc26/4dmCHkR4zIiJap5mWxUBTe2S6j4iILjLUHdx7AG8Dpkv6asOqF5CHH0WXyWB4dLuhuqGWUA08v4PqPog+y4EjWhlURCdIgohYaag7uG8AbpD0A9tPj2NMERHRYWrHLJIoIiKimQHuiIjocoMmC0lnlp8fHb9wIiKiEw3VsniNpM2BQyS9UNIGja/xCjAiItpvqKuhvg38CtiS6mooNaxzKY+IiC4waMvC9ldtvwI4zfaWtrdoeCVRRER0kWaewf2/JG0HvKEUXWH7xtaGFZ0u9yBEdJfaq6EkHQ6cBWxYXmdJOqzVgUVEROdoZiLBDwCvs/0YgKQTgN8DX2tlYBER0Tmauc9CwDMNy8+w6mB3RERMcs20LE4HrpL0k7K8D3nWREREV2lmgPskSZcDO1O1KA62fV2rA4vx0TdQnSffRcRQmmlZYPtaqifbRUREF8rcUBERUSvJIiIiag2ZLCRNkfTr8QomIiI605DJwvYzwOOS1huneCIiogM1M8D938BCSZcAj/UV2j68ZVFFTBCN057kirKYzJpJFheVV0REdKlm7rOYL2ktYDPbt45FpZLWB04BtqGa7vwQ4FbgHKAXuBPYz/aDY1FfRESMTjMTCb4duJ7q2RZI2l7ShaOs92TgV7a3ArYDbgHmApfanglcWpajRXrnXpSZYyOiac1cOnscsAPwEIDt64EtRlqhpBcAb6RMGWL7KdsPAXsD88tm86mmFYmIiA7QTLJYYfvhfmUeRZ1bAsuA0yVdJ+kUSesAG9m+F6D83HCgnSXNkbRA0oJly5aNIoyIiGhWM8niJknvAaZIminpa8CVo6hzNeDVwLdsv4rqCqumu5xsz7M9y/asnp6eUYQRERHNaiZZHAa8EngSOBt4BPjYKOpcDCy2fVVZPo8qedwnaWOA8nPpKOqIaKmM+US3aeZqqMeBY8pDj2x7+WgqtP1XSXdLenm5umpX4Obymg0cX35eMJp6onN0y4dqZvCNyaw2WUh6LXAaMLU
sPwwcYvuaUdR7GNXjWZ8P3A4cTNXKOVfSocAiYN9RHD8iIsZQMzflnQp8yPZvASTtTPVApG1HWmm5omrWAKt2HekxIyKidZpJFsv7EgWA7d9JGlVXVExe3dLlFNFtBk0Wkl5d3l4t6TtUg9sG9gcub31oERHRKYZqWfxHv+VjG96P5j6LmITSooiY3AZNFrbfPJ6BRERE52rmaqj1gfdTTfD37PaZojwions0M8D9C+APwELg760NJyIiOlEzyWJN20e2PJKIiOhYzUz3caakf5G0saQN+l4tjywiIjpGMy2Lp4ATgWNYeRWUqWaPjYiILtBMsjgSeKnt+1sdTEREdKZmuqH+BDze6kAiJovMSBuTUTMti2eA6yVdRjVNOZBLZyMiukkzyeKn5RUREV2qmedZzK/bJiIiJrdm7uC+gwHmgrKdq6EiIrpEM91Qjc+dWJPqoUS5zyIioos00w31t35FX5H0O+AzrQkpYnLof0VUHrcaE1kz3VCvblh8HlVLY2rLIoqIiI7TTDdU43MtVgB3Avu1JJqIiOhIzXRD5bkWERFdrpluqDWAd/Lc51l8rnVhRUREJ2mmG+oC4GHgGhru4I6I4Wkc8M5gd0w0zSSLGbZ3H+uKJU0BFgD32N5L0hbAD6kuy70WOND2U2Ndb0REDF8zEwleKekfWlD3R4FbGpZPAL5seybwIHBoC+qMiIgRaCZZ7AxcI+lWSTdKWijpxtFUKmkGsCdwSlkW8BbgvLLJfGCf0dQRERFjp5luqD1aUO9XgE+w8n6NFwEP2V5RlhcD01tQb0REjEAzl87eNZYVStoLWGr7Gkm79BUPVPUg+88B5gBsttlmYxlaREQMopluqLG2E/AOSXdSDWi/haqlsb6kvuQ1A1gy0M6259meZXtWT0/PeMQbEdH1xj1Z2D7a9gzbvcABwG9svxe4DHhX2Ww21SW7ERHRAdrRshjMJ4EjJd1GNYZxapvjiYiIopkB7paxfTlweXl/O7BDO+OJiIiBdVLLIiIiOlRbWxYx/vo/YyEiohlpWURERK0ki4iIqJVkERERtZIsIiKiVpJFRETUSrKIiIhaSRYREVErySIiImolWURERK0ki4iIqJVkEdFGvXMvyhQsMSEkWURERK1MJBjRBmlNxESTlkVEh0tXVXSCJIuIiKiVbqiIDtLYgrjz+D3bGEnEqtKyiIiIWkkWERFRK8kiIiJqJVlEREStJIuIiKg17ldDSdoU+B7wYuDvwDzbJ0vaADgH6AXuBPaz/eB4xxfRDrmPIjpdO1oWK4CjbL8C2BH4sKStgbnApbZnApeW5YiI6ADjnixs32v72vJ+OXALMB3YG5hfNpsP7DPesUVExMDaOmYhqRd4FXAVsJHte6FKKMCGg+wzR9ICSQuWLVs2XqFGtF2m/Yh2aluykLQu8GPgY7YfaXY/2/Nsz7I9q6enp3UBRkTEs9qSLCStTpUozrJ9fim+T9LGZf3GwNJ2xBYREc/VjquhBJwK3GL7pIZVFwKzgePLzwvGO7aITpIup+gk7ZhIcCfgQGChpOtL2aeoksS5kg4FFgH7tiG2iIgYwLgnC9u/AzTI6l3HM5aIiSgz00Y75A7uiIiolWQRMYHlctoYL3n40SSW7oqIGCtpWURERK0ki4hJIN1R0WpJFhERUSvJIiIiaiVZREwi6Y6KVkmyiIiIWkkWERFRK8kiIiJqJVlEREStJIuIiKiV6T4iJrFM+RJjJS2LiIiolWQRERG10g0VMQmN9Ma8vv3SZRX9pWURERG1kiwmkUz1EEPp//eRv5cYjnRDTRDpHoixkgQRI5GWRURE1EqyiIhBpasq+iRZRERErY4bs5C0O3AyMAU4xfbxbQ4pYlIbqOXQvyxjZtFRyULSFOAbwP8AFgN/lHSh7ZvHO5Z2TJMwnDozjUNMBMNJMklIna3TuqF2AG6zfbvtp4AfAnu3OaaIiK4n2+2O4VmS3gXsbvsDZflA4HW2P9KwzRxgTln
cBrhp3APtTNOA+9sdRIfIuVgp52KlnIuVXm576nB26KhuKEADlK2SzWzPA+YBSFpge9Z4BNbpci5WyrlYKedipZyLlSQtGO4+ndYNtRjYtGF5BrCkTbFERETRacnij8BMSVtIej5wAHBhm2OKiOh6HdUNZXuFpI8A/0l16exptv80xC7zxieyCSHnYqWci5VyLlbKuVhp2Oeiowa4IyKiM3VaN1RERHSgJIuIiKg1YZOFpN0l3SrpNklz2x1Pu0jaVNJlkm6R9CdJH213TO0kaYqk6yT9vN2xtJuk9SWdJ+nP5e/jH9sdU7tIOqL8/7hJ0tmS1mx3TONF0mmSlkq6qaFsA0mXSPqv8vOFdceZkMmiYVqQPYCtgXdL2rq9UbXNCuAo268AdgQ+3MXnAuCjwC3tDqJDnAz8yvZWwHZ06XmRNB04HJhlexuqi2cOaG9U4+oMYPd+ZXOBS23PBC4ty0OakMmCTAvyLNv32r62vF9O9YEwvb1RtYekGcCewCntjqXdJL0AeCNwKoDtp2w/1N6o2mo1YC1JqwFr00X3b9m+AnigX/HewPzyfj6wT91xJmqymA7c3bC8mC79gGwkqRd4FXBVeyNpm68AnwD+3u5AOsCWwDLg9NItd4qkddodVDvYvgf4ErAIuBd42PbF7Y2q7TayfS9UXziBDet2mKjJonZakG4jaV3gx8DHbD/S7njGm6S9gKW2r2l3LB1iNeDVwLdsvwp4jCa6Giaj0h+/N7AFsAmwjqT3tTeqiWeiJotMC9JA0upUieIs2+e3O5422Ql4h6Q7qbol3yLp++0Nqa0WA4tt97Uyz6NKHt1oN+AO28tsPw2cD7y+zTG1232SNgYoP5fW7TBRk0WmBSkkiapf+hbbJ7U7nnaxfbTtGbZ7qf4efmO7a7892v4rcLekl5eiXYFxfy5Mh1gE7Chp7fL/ZVe6dLC/wYXA7PJ+NnBB3Q4dNd1Hs0YwLchkthNwILBQ0vWl7FO2f9HGmKIzHAacVb5Q3Q4c3OZ42sL2VZLOA66lunrwOrpo6g9JZwO7ANMkLQaOBY4HzpV0KFUy3bf2OJnuIyIi6kzUbqiIiBhHSRYREVErySIiImolWURERK0ki4iIqJVkEROWpEdbcMztJb2tYfk4SR8fxfH2LTO+XtavvFfSe5rY/yBJXx9p/RFjJckiYlXbA2+r3ap5hwIfsv3mfuW9QG2yiOgUSRYxKUj635L+KOlGSZ8tZb3lW/13y7MMLpa0Vln32rLt7yWdWJ5z8Hzgc8D+kq6XtH85/NaSLpd0u6TDB6n/3ZIWluOcUMo+A+wMfFvSif12OR54Q6nnCElrSjq9HOM6Sf2TC5L2LPFOk9Qj6cfl3/xHSTuVbY4rzy9YJV5J60i6SNINJcb9+x8/Yki288prQr6AR8vPt1LdkSuqL0A/p5qeu5fqjt3ty3bnAu8r728CXl/eHw/cVN4fBHy9oY7jgCuBNYBpwN+A1fvFsQnVXbA9VLMi/AbYp6y7nOo5Cv1j3wX4ecPyUcDp5f1W5Xhr9sUD/E/gt8ALyzY/AHYu7zejmu5l0HiBdwLfbahvvXb//vKaWK8JOd1HRD9vLa/ryvK6wEyqD9w7bPdNg3IN0CtpfWCq7StL+Q+AvYY4/kW2nwSelLQU2Ihqor4+rwUut70MQNJZVMnqp8P4N+wMfA3A9p8l3QW8rKx7MzALeKtXzii8G1WLp2//F0iaOkS8C4EvlVbPz23/dhixRSRZxKQg4Iu2v7NKYfV8jycbip4B1mLgKe6H0v8Y/f/fDPd4AxnqGLdTPZ/iZcCCUvY84B9tP7HKQark8Zx4bf9F0muoxmO+KOli258bg7ijS2TMIiaD/wQOKc/0QNJ0SYM+zMX2g8BySTuWosZHbC4Hpj53ryFdBbypjCVMAd4N/L+affrXcwXw3hL/y6i6lm4t6+4C/hn4nqRXlrKLgY/07Sxp+6Eqk7QJ8Ljt71M9CKhbpyuPEUqyiAnP1VP
PfgD8XtJCqmc31H3gHwrMk/R7qm/1D5fyy6i6d65vdhDY1ZPGji773gBca7tuyucbgRVlwPkI4JvAlBL/OcBBpSupr45bqZLJjyS9hPJM6TJIfzPwwZr6/gG4usxMfAzw+Wb+bRF9MutsdCVJ69p+tLyfC2xs+6NtDiuiY2XMIrrVnpKOpvo/cBfVVUcRMYi0LCIiolbGLCIiolaSRURE1EqyiIiIWkkWERFRK8kiIiJq/X+K7ei/t/dDegAAAABJRU5ErkJggg==\n", 533 | "text/plain": [ 534 | "
" 535 | ] 536 | }, 537 | "metadata": {}, 538 | "output_type": "display_data" 539 | } 540 | ], 541 | "source": [ 542 | "plt.hist(np.log(num_tokens), bins = 100)\n", 543 | "plt.xlim((0,10))\n", 544 | "plt.ylabel('number of tokens')\n", 545 | "plt.xlabel('length of tokens')\n", 546 | "plt.title('Distribution of tokens length')\n", 547 | "plt.show()" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 48, 553 | "metadata": {}, 554 | "outputs": [ 555 | { 556 | "data": { 557 | "text/plain": [ 558 | "236" 559 | ] 560 | }, 561 | "execution_count": 48, 562 | "metadata": {}, 563 | "output_type": "execute_result" 564 | } 565 | ], 566 | "source": [ 567 | "# 取tokens平均值并加上两个tokens的标准差,\n", 568 | "# 假设tokens长度的分布为正态分布,则max_tokens这个值可以涵盖95%左右的样本\n", 569 | "max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)\n", 570 | "max_tokens = int(max_tokens)\n", 571 | "max_tokens" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": 49, 577 | "metadata": {}, 578 | "outputs": [ 579 | { 580 | "data": { 581 | "text/plain": [ 582 | "0.9565" 583 | ] 584 | }, 585 | "execution_count": 49, 586 | "metadata": {}, 587 | "output_type": "execute_result" 588 | } 589 | ], 590 | "source": [ 591 | "# 取tokens的长度为236时,大约95%的样本被涵盖\n", 592 | "# 我们对长度不足的进行padding,超长的进行修剪\n", 593 | "np.sum( num_tokens < max_tokens ) / len(num_tokens)" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "**反向tokenize** \n", 601 | "我们定义一个function,用来把索引转换成可阅读的文本,这对于debug很重要。" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": 50, 607 | "metadata": {}, 608 | "outputs": [], 609 | "source": [ 610 | "# 用来将tokens转换为文本\n", 611 | "def reverse_tokens(tokens):\n", 612 | " text = ''\n", 613 | " for i in tokens:\n", 614 | " if i != 0:\n", 615 | " text = text + cn_model.index2word[i]\n", 616 | " else:\n", 617 | " text = text + ' '\n", 618 | " return text" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | 
"execution_count": 51, 624 | "metadata": {}, 625 | "outputs": [], 626 | "source": [ 627 | "reverse = reverse_tokens(train_tokens[0])" 628 | ] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "metadata": {}, 633 | "source": [ 634 | "以下可见,训练样本的极性并不是那么精准,比如说下面的样本,对早餐并不满意,但被定义为正面评价,这会迷惑我们的模型,不过我们暂时不对训练样本进行任何修改。" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": 52, 640 | "metadata": {}, 641 | "outputs": [ 642 | { 643 | "data": { 644 | "text/plain": [ 645 | "'早餐太差无论去多少人那边也不加食品的酒店应该重视一下这个问题了房间本身很好'" 646 | ] 647 | }, 648 | "execution_count": 52, 649 | "metadata": {}, 650 | "output_type": "execute_result" 651 | } 652 | ], 653 | "source": [ 654 | "# 经过tokenize再恢复成文本\n", 655 | "# 可见标点符号都没有了\n", 656 | "reverse" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 53, 662 | "metadata": {}, 663 | "outputs": [ 664 | { 665 | "data": { 666 | "text/plain": [ 667 | "'早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。\\n\\n房间本身很好。'" 668 | ] 669 | }, 670 | "execution_count": 53, 671 | "metadata": {}, 672 | "output_type": "execute_result" 673 | } 674 | ], 675 | "source": [ 676 | "# 原始文本\n", 677 | "train_texts_orig[0]" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "**准备Embedding Matrix** \n", 685 | "现在我们来为模型准备embedding matrix(词向量矩阵),根据keras的要求,我们需要准备一个维度为$(numwords, embeddingdim)$的矩阵,num words代表我们使用的词汇的数量,emdedding dimension在我们现在使用的预训练词向量模型中是300,每一个词汇都用一个长度为300的向量表示。 \n", 686 | "注意我们只选择使用前50k个使用频率最高的词,在这个预训练词向量模型中,一共有260万词汇量,如果全部使用在分类问题上会很浪费计算资源,因为我们的训练样本很小,一共只有4k,如果我们有100k,200k甚至更多的训练样本时,在分类问题上可以考虑减少使用的词汇量。" 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": 90, 692 | "metadata": {}, 693 | "outputs": [ 694 | { 695 | "data": { 696 | "text/plain": [ 697 | "300" 698 | ] 699 | }, 700 | "execution_count": 90, 701 | "metadata": {}, 702 | "output_type": "execute_result" 703 | } 704 | ], 705 | "source": [ 706 | "embedding_dim" 707 | ] 708 | }, 709 | { 710 | "cell_type": 
"code", 711 | "execution_count": 55, 712 | "metadata": {}, 713 | "outputs": [], 714 | "source": [ 715 | "# 只使用前20000个词\n", 716 | "num_words = 50000\n", 717 | "# 初始化embedding_matrix,之后在keras上进行应用\n", 718 | "embedding_matrix = np.zeros((num_words, embedding_dim))\n", 719 | "# embedding_matrix为一个 [num_words,embedding_dim] 的矩阵\n", 720 | "# 维度为 50000 * 300\n", 721 | "for i in range(num_words):\n", 722 | " embedding_matrix[i,:] = cn_model[cn_model.index2word[i]]\n", 723 | "embedding_matrix = embedding_matrix.astype('float32')" 724 | ] 725 | }, 726 | { 727 | "cell_type": "code", 728 | "execution_count": 56, 729 | "metadata": {}, 730 | "outputs": [ 731 | { 732 | "data": { 733 | "text/plain": [ 734 | "300" 735 | ] 736 | }, 737 | "execution_count": 56, 738 | "metadata": {}, 739 | "output_type": "execute_result" 740 | } 741 | ], 742 | "source": [ 743 | "# 检查index是否对应,\n", 744 | "# 输出300意义为长度为300的embedding向量一一对应\n", 745 | "np.sum( cn_model[cn_model.index2word[333]] == embedding_matrix[333] )" 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": 57, 751 | "metadata": {}, 752 | "outputs": [ 753 | { 754 | "data": { 755 | "text/plain": [ 756 | "(50000, 300)" 757 | ] 758 | }, 759 | "execution_count": 57, 760 | "metadata": {}, 761 | "output_type": "execute_result" 762 | } 763 | ], 764 | "source": [ 765 | "# embedding_matrix的维度,\n", 766 | "# 这个维度为keras的要求,后续会在模型中用到\n", 767 | "embedding_matrix.shape" 768 | ] 769 | }, 770 | { 771 | "cell_type": "markdown", 772 | "metadata": {}, 773 | "source": [ 774 | "**padding(填充)和truncating(修剪)** \n", 775 | "我们把文本转换为tokens(索引)之后,每一串索引的长度并不相等,所以为了方便模型的训练我们需要把索引的长度标准化,上面我们选择了236这个可以涵盖95%训练样本的长度,接下来我们进行padding和truncating,我们一般采用'pre'的方法,这会在文本索引的前面填充0,因为根据一些研究资料中的实践,如果在文本索引后面填充0的话,会对模型造成一些不良影响。" 776 | ] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "execution_count": 58, 781 | "metadata": {}, 782 | "outputs": [], 783 | "source": [ 784 | "# 进行padding和truncating, 输入的train_tokens是一个list\n", 785 | "# 返回的train_pad是一个numpy array\n", 
786 | "train_pad = pad_sequences(train_tokens, maxlen=max_tokens,\n", 787 | " padding='pre', truncating='pre')" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": 59, 793 | "metadata": {}, 794 | "outputs": [], 795 | "source": [ 796 | "# 超出五万个词向量的词用0代替\n", 797 | "train_pad[ train_pad>=num_words ] = 0" 798 | ] 799 | }, 800 | { 801 | "cell_type": "code", 802 | "execution_count": 60, 803 | "metadata": {}, 804 | "outputs": [ 805 | { 806 | "data": { 807 | "text/plain": [ 808 | "array([ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 809 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 810 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 811 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 812 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 813 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 814 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 815 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 816 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 817 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 818 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 819 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 820 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 821 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 822 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 823 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 824 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 825 | " 290, 3053, 57, 169, 73, 1, 25, 11216, 49,\n", 826 | " 163, 15985, 0, 0, 30, 8, 0, 1, 228,\n", 827 | " 223, 40, 35, 653, 0, 5, 1642, 29, 11216,\n", 828 | " 2751, 500, 98, 30, 3159, 2225, 2146, 371, 6285,\n", 829 | " 169, 27396, 1, 1191, 5432, 1080, 20055, 57, 562,\n", 830 | " 1, 22671, 40, 35, 169, 2567, 0, 42665, 7761,\n", 831 | " 110, 0, 0, 41281, 0, 110, 0, 35891, 110,\n", 832 | " 0, 28781, 57, 169, 1419, 1, 11670, 0, 19470,\n", 833 | " 1, 0, 0, 169, 35071, 40, 562, 35, 12398,\n", 834 | " 657, 4857])" 835 | ] 836 | }, 837 | "execution_count": 60, 838 | "metadata": {}, 839 | "output_type": "execute_result" 840 | } 841 | ], 842 | "source": [ 843 | "# 可见padding之后前面的tokens全变成0,文本在最后面\n", 844 | "train_pad[33]" 845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "execution_count": 61, 850 | "metadata": {}, 851 | 
"outputs": [], 852 | "source": [ 853 | "# 准备target向量,前2000样本为1,后2000为0\n", 854 | "train_target = np.concatenate( (np.ones(2000),np.zeros(2000)) )" 855 | ] 856 | }, 857 | { 858 | "cell_type": "code", 859 | "execution_count": 62, 860 | "metadata": {}, 861 | "outputs": [], 862 | "source": [ 863 | "# 进行训练和测试样本的分割\n", 864 | "from sklearn.model_selection import train_test_split" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": 63, 870 | "metadata": {}, 871 | "outputs": [], 872 | "source": [ 873 | "# 90%的样本用来训练,剩余10%用来测试\n", 874 | "X_train, X_test, y_train, y_test = train_test_split(train_pad,\n", 875 | " train_target,\n", 876 | " test_size=0.1,\n", 877 | " random_state=12)" 878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "execution_count": 64, 883 | "metadata": {}, 884 | "outputs": [ 885 | { 886 | "name": "stdout", 887 | "output_type": "stream", 888 | "text": [ 889 | " 房间很大还有海景阳台走出酒店就是沙滩非常不错唯一遗憾的就是不能刷 不方便\n", 890 | "class: 1.0\n" 891 | ] 892 | } 893 | ], 894 | "source": [ 895 | "# 查看训练样本,确认无误\n", 896 | "print(reverse_tokens(X_train[35]))\n", 897 | "print('class: ',y_train[35])" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "metadata": {}, 903 | "source": [ 904 | "现在我们用keras搭建LSTM模型,模型的第一层是Embedding层,只有当我们把tokens索引转换为词向量矩阵之后,才可以用神经网络对文本进行处理。\n", 905 | "keras提供了Embedding接口,避免了繁琐的稀疏矩阵操作。 \n", 906 | "在Embedding层我们输入的矩阵为:$$(batchsize, maxtokens)$$\n", 907 | "输出矩阵为: $$(batchsize, maxtokens, embeddingdim)$$" 908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": 65, 913 | "metadata": {}, 914 | "outputs": [], 915 | "source": [ 916 | "# 用LSTM对样本进行分类\n", 917 | "model = Sequential()" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": 66, 923 | "metadata": {}, 924 | "outputs": [], 925 | "source": [ 926 | "# 模型第一层为embedding\n", 927 | "model.add(Embedding(num_words,\n", 928 | " embedding_dim,\n", 929 | " weights=[embedding_matrix],\n", 930 | " input_length=max_tokens,\n", 931 | " 
trainable=False))" 932 | ] 933 | }, 934 | { 935 | "cell_type": "code", 936 | "execution_count": 67, 937 | "metadata": {}, 938 | "outputs": [], 939 | "source": [ 940 | "model.add(Bidirectional(LSTM(units=32, return_sequences=True)))\n", 941 | "model.add(LSTM(units=16, return_sequences=False))" 942 | ] 943 | }, 944 | { 945 | "cell_type": "markdown", 946 | "metadata": {}, 947 | "source": [ 948 | "**构建模型** \n", 949 | "我在这个教程中尝试了几种神经网络结构,因为训练样本比较少,所以我们可以尽情尝试,训练过程等待时间并不长: \n", 950 | "**GRU:**如果使用GRU的话,测试样本可以达到87%的准确率,但我测试自己的文本内容时发现,GRU最后一层激活函数的输出都在0.5左右,说明模型的判断不是很明确,信心比较低,而且经过测试发现模型对于否定句的判断有时会失误,我们期望对于负面样本输出接近0,正面样本接近1而不是都徘徊于0.5之间。 \n", 951 | "**BiLSTM:**测试了LSTM和BiLSTM,发现BiLSTM的表现最好,LSTM的表现略好于GRU,这可能是因为BiLSTM对于比较长的句子结构有更好的记忆,有兴趣的朋友可以深入研究一下。 \n", 952 | "Embedding之后第,一层我们用BiLSTM返回sequences,然后第二层16个单元的LSTM不返回sequences,只返回最终结果,最后是一个全链接层,用sigmoid激活函数输出结果。" 953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": 68, 958 | "metadata": {}, 959 | "outputs": [], 960 | "source": [ 961 | "# GRU的代码\n", 962 | "# model.add(GRU(units=32, return_sequences=True))\n", 963 | "# model.add(GRU(units=16, return_sequences=True))\n", 964 | "# model.add(GRU(units=4, return_sequences=False))" 965 | ] 966 | }, 967 | { 968 | "cell_type": "code", 969 | "execution_count": 69, 970 | "metadata": {}, 971 | "outputs": [], 972 | "source": [ 973 | "model.add(Dense(1, activation='sigmoid'))\n", 974 | "# 我们使用adam以0.001的learning rate进行优化\n", 975 | "optimizer = Adam(lr=1e-3)" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": 70, 981 | "metadata": {}, 982 | "outputs": [], 983 | "source": [ 984 | "model.compile(loss='binary_crossentropy',\n", 985 | " optimizer=optimizer,\n", 986 | " metrics=['accuracy'])" 987 | ] 988 | }, 989 | { 990 | "cell_type": "code", 991 | "execution_count": 71, 992 | "metadata": {}, 993 | "outputs": [ 994 | { 995 | "name": "stdout", 996 | "output_type": "stream", 997 | "text": [ 998 | 
"_________________________________________________________________\n", 999 | "Layer (type) Output Shape Param # \n", 1000 | "=================================================================\n", 1001 | "embedding_1 (Embedding) (None, 236, 300) 15000000 \n", 1002 | "_________________________________________________________________\n", 1003 | "bidirectional_1 (Bidirection (None, 236, 64) 85248 \n", 1004 | "_________________________________________________________________\n", 1005 | "lstm_2 (LSTM) (None, 16) 5184 \n", 1006 | "_________________________________________________________________\n", 1007 | "dense_1 (Dense) (None, 1) 17 \n", 1008 | "=================================================================\n", 1009 | "Total params: 15,090,449\n", 1010 | "Trainable params: 90,449\n", 1011 | "Non-trainable params: 15,000,000\n", 1012 | "_________________________________________________________________\n" 1013 | ] 1014 | } 1015 | ], 1016 | "source": [ 1017 | "# 我们来看一下模型的结构,一共90k左右可训练的变量\n", 1018 | "model.summary()" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "code", 1023 | "execution_count": 72, 1024 | "metadata": {}, 1025 | "outputs": [], 1026 | "source": [ 1027 | "# 建立一个权重的存储点\n", 1028 | "path_checkpoint = 'sentiment_checkpoint.keras'\n", 1029 | "checkpoint = ModelCheckpoint(filepath=path_checkpoint, monitor='val_loss',\n", 1030 | " verbose=1, save_weights_only=True,\n", 1031 | " save_best_only=True)" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "code", 1036 | "execution_count": 73, 1037 | "metadata": {}, 1038 | "outputs": [], 1039 | "source": [ 1040 | "# 尝试加载已训练模型\n", 1041 | "try:\n", 1042 | " model.load_weights(path_checkpoint)\n", 1043 | "except Exception as e:\n", 1044 | " print(e)" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "code", 1049 | "execution_count": 74, 1050 | "metadata": {}, 1051 | "outputs": [], 1052 | "source": [ 1053 | "# 定义early stoping如果3个epoch内validation loss没有改善则停止训练\n", 1054 | "earlystopping = EarlyStopping(monitor='val_loss', 
patience=3, verbose=1)" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "code", 1059 | "execution_count": 75, 1060 | "metadata": {}, 1061 | "outputs": [], 1062 | "source": [ 1063 | "# reduce the learning rate automatically when validation loss plateaus\n", 1064 | "lr_reduction = ReduceLROnPlateau(monitor='val_loss',\n", 1065 | " factor=0.1, min_lr=1e-5, patience=0,\n", 1066 | " verbose=1)" 1067 | ] 1068 | }, 1069 | { 1070 | "cell_type": "code", 1071 | "execution_count": 76, 1072 | "metadata": {}, 1073 | "outputs": [], 1074 | "source": [ 1075 | "# assemble the callback list\n", 1076 | "callbacks = [\n", 1077 | " earlystopping, \n", 1078 | " checkpoint,\n", 1079 | " lr_reduction\n", 1080 | "]" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "code", 1085 | "execution_count": 77, 1086 | "metadata": { 1087 | "scrolled": false 1088 | }, 1089 | "outputs": [], 1090 | "source": [ 1091 | "# start training\n", 1092 | "model.fit(X_train, y_train,\n", 1093 | " validation_split=0.1, \n", 1094 | " epochs=20,\n", 1095 | " batch_size=128,\n", 1096 | " callbacks=callbacks)" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "markdown", 1101 | "metadata": {}, 1102 | "source": [ 1103 | "**Conclusion** \n", 1104 | "We first evaluate the model on the test samples and obtain a reasonably satisfying accuracy. \n", 1105 | "Then we define a prediction function to estimate the polarity of an input text; the model judges negation and some simple logical structures correctly." 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "code", 1110 | "execution_count": 78, 1111 | "metadata": {}, 1112 | "outputs": [ 1113 | { 1114 | "name": "stdout", 1115 | "output_type": "stream", 1116 | "text": [ 1117 | "400/400 [==============================] - 5s 12ms/step\n", 1118 | "Accuracy:87.50%\n" 1119 | ] 1120 | } 1121 | ], 1122 | "source": [ 1123 | "result = model.evaluate(X_test, y_test)\n", 1124 | "print('Accuracy:{0:.2%}'.format(result[1]))" 1125 | ] 1126 | }, 1127 | { 1128 | "cell_type": "code", 1129 | "execution_count": 79, 1130 | "metadata": {}, 1131 | "outputs": [], 1132 | "source": [ 1133 | "def predict_sentiment(text):\n", 1134 | " print(text)\n", 1135 | " # strip punctuation\n", 1136 | " text = 
re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\", text)\n", 1137 | " # word segmentation\n", 1138 | " cut = jieba.cut(text)\n", 1139 | " cut_list = [i for i in cut]\n", 1140 | " # tokenize: map each word to its embedding index, unknown words to 0\n", 1141 | " for i, word in enumerate(cut_list):\n", 1142 | " try:\n", 1143 | " cut_list[i] = cn_model.vocab[word].index\n", 1144 | " except KeyError:\n", 1145 | " cut_list[i] = 0\n", 1146 | " # padding\n", 1147 | " tokens_pad = pad_sequences([cut_list], maxlen=max_tokens,\n", 1148 | " padding='pre', truncating='pre')\n", 1149 | " # predict\n", 1150 | " result = model.predict(x=tokens_pad)\n", 1151 | " coef = result[0][0]\n", 1152 | " if coef >= 0.5:\n", 1153 | " print('是一例正面评价','output=%.2f'%coef)\n", 1154 | " else:\n", 1155 | " print('是一例负面评价','output=%.2f'%coef)" 1156 | ] 1157 | }, 1158 | { 1159 | "cell_type": "code", 1160 | "execution_count": 80, 1161 | "metadata": {}, 1162 | "outputs": [ 1163 | { 1164 | "name": "stdout", 1165 | "output_type": "stream", 1166 | "text": [ 1167 | "酒店设施不是新的,服务态度很不好\n", 1168 | "是一例负面评价 output=0.14\n", 1169 | "酒店卫生条件非常不好\n", 1170 | "是一例负面评价 output=0.09\n", 1171 | "床铺非常舒适\n", 1172 | "是一例正面评价 output=0.76\n", 1173 | "房间很凉,不给开暖气\n", 1174 | "是一例负面评价 output=0.17\n", 1175 | "房间很凉爽,空调冷气很足\n", 1176 | "是一例正面评价 output=0.66\n", 1177 | "酒店环境不好,住宿体验很不好\n", 1178 | "是一例负面评价 output=0.06\n", 1179 | "房间隔音不到位\n", 1180 | "是一例负面评价 output=0.17\n", 1181 | "晚上回来发现没有打扫卫生\n", 1182 | "是一例负面评价 output=0.25\n", 1183 | "因为过节所以要我临时加钱,比团购的价格贵\n", 1184 | "是一例负面评价 output=0.06\n" 1185 | ] 1186 | } 1187 | ], 1188 | "source": [ 1189 | "test_list = [\n", 1190 | " '酒店设施不是新的,服务态度很不好',\n", 1191 | " '酒店卫生条件非常不好',\n", 1192 | " '床铺非常舒适',\n", 1193 | " '房间很凉,不给开暖气',\n", 1194 | " '房间很凉爽,空调冷气很足',\n", 1195 | " '酒店环境不好,住宿体验很不好',\n", 1196 | " '房间隔音不到位',\n", 1197 | " '晚上回来发现没有打扫卫生',\n", 1198 | " '因为过节所以要我临时加钱,比团购的价格贵'\n", 1199 | "]\n", 1200 | "for text in test_list:\n", 1201 | " predict_sentiment(text)" 1202 | ] 1203 | }, 1204 | { 1205 | "cell_type": "markdown", 1206 | "metadata": {}, 1207 | 
"source": [ 1208 | "**Misclassified texts** \n", 1209 | "On inspection, most of the misclassified texts turn out to be rather ambiguous; even a human would have trouble judging their polarity. Take the sentence at index 101: it hardly contains a trace of satisfaction, yet this review was labeled as positive in the training data, so our model's negative prediction actually seems reasonable." 1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "code", 1214 | "execution_count": 81, 1215 | "metadata": {}, 1216 | "outputs": [], 1217 | "source": [ 1218 | "y_pred = model.predict(X_test)\n", 1219 | "y_pred = y_pred.T[0]\n", 1220 | "y_pred = [1 if p >= 0.5 else 0 for p in y_pred]\n", 1221 | "y_pred = np.array(y_pred)" 1222 | ] 1223 | }, 1224 | { 1225 | "cell_type": "code", 1226 | "execution_count": 82, 1227 | "metadata": {}, 1228 | "outputs": [], 1229 | "source": [ 1230 | "y_actual = np.array(y_test)" 1231 | ] 1232 | }, 1233 | { 1234 | "cell_type": "code", 1235 | "execution_count": 83, 1236 | "metadata": {}, 1237 | "outputs": [], 1238 | "source": [ 1239 | "# find the indices of the misclassified samples\n", 1240 | "misclassified = np.where( y_pred != y_actual )[0]" 1241 | ] 1242 | }, 1243 | { 1244 | "cell_type": "code", 1245 | "execution_count": 92, 1246 | "metadata": { 1247 | "scrolled": true 1248 | }, 1249 | "outputs": [ 1250 | { 1251 | "name": "stdout", 1252 | "output_type": "stream", 1253 | "text": [ 1254 | "400\n" 1255 | ] 1256 | } 1257 | ], 1258 | "source": [ 1259 | "# count the misclassifications; here we only print the total number of test samples\n", 1260 | "len(misclassified)\n", 1261 | "print(len(X_test))" 1262 | ] 1263 | }, 1264 | { 1265 | "cell_type": "code", 1266 | "execution_count": 85, 1267 | "metadata": {}, 1268 | "outputs": [ 1269 | { 1270 | "name": "stdout", 1271 | "output_type": "stream", 1272 | "text": [ 1273 | " 由于2007年 有一些新问题可能还没来得及解决我因为工作需要经常要住那里所以慎重的提出以下 :1 后 的 淋浴喷头的位置都太高我换了房间还是一样很不好用2 后的一些管理和服务还很不到位尤其是前台入住和 时代效率太低每次 都超过10分钟好像不符合 宾馆的要求\n", 1274 | "预测的分类 0\n", 1275 | "实际的分类 1.0\n" 1276 | ] 1277 | } 1278 | ], 1279 | "source": [ 1280 | "# let's look at one of the misclassified samples\n", 1281 | "idx=101\n", 1282 | "print(reverse_tokens(X_test[idx]))\n", 1283 | "print('预测的分类', y_pred[idx])\n", 1284 | "print('实际的分类', y_actual[idx])" 1285 | ] 1286 | }, 1287 | { 1288 | "cell_type": "code", 1289 | "execution_count": 86, 1290 | 
"metadata": {}, 1291 | "outputs": [ 1292 | { 1293 | "name": "stdout", 1294 | "output_type": "stream", 1295 | "text": [ 1296 | " 还是很 设施也不错但是 和以前 比急剧下滑了 和客房 的服务极差幸好我不是很在乎\n", 1297 | "预测的分类 0\n", 1298 | "实际的分类 1.0\n" 1299 | ] 1300 | } 1301 | ], 1302 | "source": [ 1303 | "idx=1\n", 1304 | "print(reverse_tokens(X_test[idx]))\n", 1305 | "print('预测的分类', y_pred[idx])\n", 1306 | "print('实际的分类', y_actual[idx])" 1307 | ] 1308 | }, 1309 | { 1310 | "cell_type": "code", 1311 | "execution_count": null, 1312 | "metadata": {}, 1313 | "outputs": [], 1314 | "source": [] 1315 | }, 1316 | { 1317 | "cell_type": "code", 1318 | "execution_count": null, 1319 | "metadata": {}, 1320 | "outputs": [], 1321 | "source": [] 1322 | }, 1323 | { 1324 | "cell_type": "code", 1325 | "execution_count": null, 1326 | "metadata": {}, 1327 | "outputs": [], 1328 | "source": [] 1329 | } 1330 | ], 1331 | "metadata": { 1332 | "kernelspec": { 1333 | "display_name": "Python 3", 1334 | "language": "python", 1335 | "name": "python3" 1336 | }, 1337 | "language_info": { 1338 | "codemirror_mode": { 1339 | "name": "ipython", 1340 | "version": 3 1341 | }, 1342 | "file_extension": ".py", 1343 | "mimetype": "text/x-python", 1344 | "name": "python", 1345 | "nbconvert_exporter": "python", 1346 | "pygments_lexer": "ipython3", 1347 | "version": "3.6.5" 1348 | } 1349 | }, 1350 | "nbformat": 4, 1351 | "nbformat_minor": 2 1352 | } 1353 | -------------------------------------------------------------------------------- /语料.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/语料.zip --------------------------------------------------------------------------------
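A note on the tokenize-and-pad step inside the notebook's `predict_sentiment` cell: each segmented word is mapped to its embedding index (unknown words become 0), and the sequence is pre-padded or pre-truncated to `max_tokens`. The dependency-free sketch below mirrors that logic with a toy vocabulary standing in for `cn_model.vocab` and a hand-rolled substitute for Keras `pad_sequences(..., padding='pre', truncating='pre')`; the names `to_indices` and `pad_pre` are illustrative, not from the notebook.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic keras pad_sequences with padding='pre', truncating='pre'."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                       # truncate from the front
    return [value] * (maxlen - len(seq)) + seq     # pad on the front

def to_indices(words, vocab):
    """Map segmented words to embedding indices; unknown words become 0."""
    return [vocab.get(w, 0) for w in words]

# toy vocabulary standing in for the pretrained cn_model word vectors
vocab = {'酒店': 3, '服务': 7, '很好': 12}
tokens = to_indices(['酒店', '服务', '非常', '很好'], vocab)
print(tokens)              # [3, 7, 0, 12]  ('非常' is out of vocabulary)
print(pad_pre(tokens, 6))  # [0, 0, 3, 7, 0, 12]
print(pad_pre(tokens, 3))  # [7, 0, 12]
```

Pre-padding (rather than post-padding) keeps the informative tokens at the end of the sequence, next to the final time steps whose state the LSTM hands on to the dense layer.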