├── 2019新版debug之后--中文自然语言处理--情感分析.ipynb
├── README.md
├── flowchart.jpg
├── neg
│   ├── neg.0.txt
│   ├── neg.1.txt
│   ├── neg.10.txt
│   ├── neg.1000.txt
│   ├── neg.1001.txt
│   ├── neg.1002.txt
│   ├── neg.1003.txt
│   ├── neg.1004.txt
│   ├── neg.1005.txt
│   ├── neg.1006.txt
│   ├── neg.1007.txt
│   ├── neg.1009.txt
│   ├── neg.101.txt
│   ├── neg.1010.txt
│   ├── neg.1011.txt
│   ├── neg.1012.txt
│   ├── neg.1013.txt
│   ├── neg.1014.txt
│   ├── neg.1015.txt
│   ├── neg.1017.txt
│   ├── neg.1018.txt
│   ├── neg.1019.txt
│   ├── neg.102.txt
│   ├── neg.1020.txt
│   ├── neg.1022.txt
│   ├── neg.1025.txt
│   ├── neg.1026.txt
│   ├── neg.1027.txt
│   ├── neg.1028.txt
│   ├── neg.1029.txt
│   ├── neg.103.txt
│   ├── neg.1030.txt
│   ├── neg.1032.txt
│   ├── neg.1033.txt
│   ├── neg.1034.txt
│   ├── neg.1035.txt
│   ├── neg.1036.txt
│   ├── neg.1038.txt
│   ├── neg.1039.txt
│   ├── neg.104.txt
│   ├── neg.1040.txt
│   ├── neg.1041.txt
│   ├── neg.1042.txt
│   ├── neg.1047.txt
│   ├── neg.1048.txt
│   ├── neg.1049.txt
│   ├── neg.105.txt
│   ├── neg.1050.txt
│   ├── neg.1052.txt
│   ├── neg.1053.txt
│   ├── neg.1054.txt
│   ├── neg.1055.txt
│   ├── neg.1056.txt
│   ├── neg.1057.txt
│   ├── neg.1058.txt
│   ├── neg.1059.txt
│   ├── neg.106.txt
│   ├── neg.1060.txt
│   ├── neg.1061.txt
│   ├── neg.1062.txt
│   ├── neg.1063.txt
│   ├── neg.1066.txt
│   ├── neg.1067.txt
│   ├── neg.1069.txt
│   ├── neg.107.txt
│   ├── neg.1070.txt
│   ├── neg.1071.txt
│   ├── neg.1072.txt
│   ├── neg.1073.txt
│   ├── neg.1074.txt
│   ├── neg.1075.txt
│   ├── neg.1076.txt
│   ├── neg.1077.txt
│   ├── neg.1078.txt
│   ├── neg.1079.txt
│   ├── neg.108.txt
│   ├── neg.1080.txt
│   ├── neg.1081.txt
│   ├── neg.1082.txt
│   ├── neg.1083.txt
│   ├── neg.1084.txt
│   ├── neg.1085.txt
│   ├── neg.1086.txt
│   ├── neg.1087.txt
│   ├── neg.1088.txt
│   ├── neg.1089.txt
│   ├── neg.109.txt
│   ├── neg.1090.txt
│   ├── neg.1091.txt
│   ├── neg.1092.txt
│   ├── neg.1093.txt
│   ├── neg.1094.txt
│   ├── neg.1095.txt
│   ├── neg.1096.txt
│   ├── neg.1097.txt
│   ├── neg.1098.txt
│   ├── neg.1099.txt
│   ├── neg.11.txt
│   └── neg.110.txt
├── negative_samples.txt
├── pos
│   ├── pos.10.txt
│   ├── pos.100.txt
│   ├── pos.1000.txt
│   ├── pos.1001.txt
│   ├── pos.1002.txt
│   ├── pos.1003.txt
│   ├── pos.1004.txt
│   ├── pos.1005.txt
│   ├── pos.1006.txt
│   ├── pos.1007.txt
│   ├── pos.1008.txt
│   ├── pos.1009.txt
│   ├── pos.101.txt
│   ├── pos.1010.txt
│   ├── pos.1012.txt
│   ├── pos.1013.txt
│   ├── pos.1014.txt
│   ├── pos.1015.txt
│   ├── pos.1016.txt
│   ├── pos.1017.txt
│   ├── pos.1018.txt
│   ├── pos.1019.txt
│   ├── pos.102.txt
│   ├── pos.1020.txt
│   ├── pos.1021.txt
│   ├── pos.1022.txt
│   ├── pos.1023.txt
│   ├── pos.1024.txt
│   ├── pos.1025.txt
│   ├── pos.1026.txt
│   ├── pos.1027.txt
│   ├── pos.1028.txt
│   ├── pos.1029.txt
│   ├── pos.103.txt
│   ├── pos.1030.txt
│   ├── pos.1031.txt
│   ├── pos.1032.txt
│   ├── pos.1033.txt
│   ├── pos.1034.txt
│   ├── pos.1035.txt
│   ├── pos.1036.txt
│   ├── pos.1037.txt
│   ├── pos.1038.txt
│   ├── pos.1039.txt
│   ├── pos.104.txt
│   ├── pos.1040.txt
│   ├── pos.1041.txt
│   ├── pos.1042.txt
│   ├── pos.1043.txt
│   ├── pos.1044.txt
│   ├── pos.1045.txt
│   ├── pos.1046.txt
│   ├── pos.1047.txt
│   ├── pos.1048.txt
│   ├── pos.1049.txt
│   ├── pos.105.txt
│   ├── pos.1050.txt
│   ├── pos.1051.txt
│   ├── pos.1052.txt
│   ├── pos.1053.txt
│   ├── pos.1054.txt
│   ├── pos.1055.txt
│   ├── pos.1056.txt
│   ├── pos.1057.txt
│   ├── pos.1058.txt
│   ├── pos.1059.txt
│   ├── pos.106.txt
│   ├── pos.1060.txt
│   ├── pos.1061.txt
│   ├── pos.1062.txt
│   ├── pos.1063.txt
│   ├── pos.1064.txt
│   ├── pos.1065.txt
│   ├── pos.107.txt
│   ├── pos.1073.txt
│   ├── pos.1074.txt
│   ├── pos.1075.txt
│   ├── pos.1076.txt
│   ├── pos.1077.txt
│   ├── pos.1078.txt
│   ├── pos.1079.txt
│   ├── pos.108.txt
│   ├── pos.1080.txt
│   ├── pos.1081.txt
│   ├── pos.1082.txt
│   ├── pos.1083.txt
│   ├── pos.1084.txt
│   ├── pos.1085.txt
│   ├── pos.1086.txt
│   ├── pos.1087.txt
│   ├── pos.1088.txt
│   ├── pos.1089.txt
│   ├── pos.109.txt
│   ├── pos.1090.txt
│   ├── pos.1091.txt
│   ├── pos.1093.txt
│   ├── pos.1094.txt
│   ├── pos.1095.txt
│   ├── pos.1096.txt
│   └── pos.1097.txt
├── positive_samples.txt
├── 中文自然语言处理--情感分析.ipynb
└── 语料.zip
/2019新版debug之后--中文自然语言处理--情感分析.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Chinese Natural Language Processing with TensorFlow: Sentiment Analysis"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "$$f('真好喝')=1$$\n",
15 | "$$f('太难喝了')=0$$"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
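22 | "**Introduction** \n",
23 | "Hi everyone, I'm Espresso. This is the first tutorial I have made: a simple, hands-on classification exercise in Chinese natural language processing. \n",
24 | "Why make this tutorial? There is plenty of NLP learning material, especially in English, but online resources are disorganized. Systematic material in Chinese is scarce, the knowledge is scattered, and hands-on material is hard to find. Where code exists, missing comments make it slow to understand. When I was learning, it took me a full day of searching to work out the steps and software needed to process Chinese text. \n",
25 | "So I felt obliged to make an introductory tutorial that combines the scattered material into one practical case study. This tutorial focuses on practice; for theory I recommend the deeplearning.ai courses. In the code sections below I recommend learning material, with links, for each topic involved. If anything here infringes your rights, please e-mail: a66777@188.com. \n",
26 | "I have not studied NLP in any depth, so criticism from experts is welcome; please point out shortcomings and ways to improve."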
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "**Required libraries** \n",
34 | "numpy \n",
35 | "jieba \n",
36 | "gensim \n",
37 | "tensorflow \n",
38 | "matplotlib "
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 1,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "# First, import the required libraries\n",
48 | "%matplotlib inline\n",
49 | "import numpy as np\n",
50 | "import matplotlib.pyplot as plt\n",
51 | "import re\n",
52 | "import jieba # jieba Chinese word segmentation\n",
53 | "# gensim is used to load the pretrained word vectors\n",
54 | "from gensim.models import KeyedVectors\n",
55 | "import warnings\n",
56 | "warnings.filterwarnings(\"ignore\")\n",
57 | "# Used for decompression\n",
58 | "import bz2"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
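65 | "**Pretrained word vectors** \n",
66 | "This tutorial uses \"chinese-word-vectors\", open-sourced by researchers at the Institute of Chinese Information Processing, Beijing Normal University, and the DBIIR Lab, Renmin University of China. The GitHub link is: \n",
67 | "https://github.com/Embedding/Chinese-Word-Vectors \n",
68 | "If you don't know what word2vec is, I recommend the following article: \n",
69 | "https://zhuanlan.zhihu.com/p/26306795 \n",
70 | "Here we use the Zhihu Word + Ngram vectors from \"chinese-word-vectors\", which can be downloaded from the GitHub link above. We first load the pretrained model and run a few simple tests:"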
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 2,
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "# Place the downloaded word-vector archive in the embeddings folder under the project root\n",
80 | "# Decompress the word vectors; this may take 1-2 minutes\n",
81 | "with open(\"embeddings/sgns.zhihu.bigram\", 'wb') as new_file, open(\"embeddings/sgns.zhihu.bigram.bz2\", 'rb') as file:\n",
82 | "    decompressor = bz2.BZ2Decompressor()\n",
83 | "    for data in iter(lambda : file.read(100 * 1024), b''):\n",
84 | "        new_file.write(decompressor.decompress(data))"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 3,
90 | "metadata": {},
91 | "outputs": [],
92 | "source": [
93 | "# Load the pretrained Chinese word embeddings with gensim; this may take 1-2 minutes\n",
94 | "cn_model = KeyedVectors.load_word2vec_format('embeddings/sgns.zhihu.bigram', \n",
95 | "                                             binary=False, unicode_errors=\"ignore\")"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
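102 | "**Word-vector model** \n",
103 | "In this word-vector model, each word is an index that maps to a vector of length 300. The LSTM network we build today cannot process Chinese text directly: the text must first be segmented into words, and the words converted into word vectors. See the figure below for the steps, which are explained alongside the code. If you don't know what RNN, GRU and LSTM are, I recommend the deeplearning.ai courses. NetEase Open Courses hosts a free version with Chinese subtitles, but I still recommend the original Coursera version, which includes exercises and practice code: \n",
104 | "[Figure: flowchart of the processing steps]"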
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": 4,
110 | "metadata": {},
111 | "outputs": [
112 | {
113 | "name": "stdout",
114 | "output_type": "stream",
115 | "text": [
116 | "词向量的长度为300\n"
117 | ]
118 | },
119 | {
120 | "data": {
121 | "text/plain": [
122 | "array([-2.603470e-01, 3.677500e-01, -2.379650e-01, 5.301700e-02,\n",
123 | " -3.628220e-01, -3.212010e-01, -1.903330e-01, 1.587220e-01,\n",
124 | " -7.156200e-02, -4.625400e-02, -1.137860e-01, 3.515600e-01,\n",
125 | " -6.408200e-02, -2.184840e-01, 3.286950e-01, -7.110330e-01,\n",
126 | " 1.620320e-01, 1.627490e-01, 5.528180e-01, 1.016860e-01,\n",
127 | " 1.060080e-01, 7.820700e-01, -7.537310e-01, -2.108400e-02,\n",
128 | " -4.758250e-01, -1.130420e-01, -2.053000e-01, 6.624390e-01,\n",
129 | " 2.435850e-01, 9.171890e-01, -2.090610e-01, -5.290000e-02,\n",
130 | " -7.969340e-01, 2.394940e-01, -9.028100e-02, 1.537360e-01,\n",
131 | " -4.003980e-01, -2.456100e-02, -1.717860e-01, 2.037790e-01,\n",
132 | " -4.344710e-01, -3.850430e-01, -9.366000e-02, 3.775310e-01,\n",
133 | " 2.659690e-01, 8.879800e-02, 2.493440e-01, 4.914900e-02,\n",
134 | " 5.996000e-03, 3.586430e-01, -1.044960e-01, -5.838460e-01,\n",
135 | " 3.093280e-01, -2.828090e-01, -8.563400e-02, -5.745400e-02,\n",
136 | " -2.075230e-01, 2.845980e-01, 1.414760e-01, 1.678570e-01,\n",
137 | " 1.957560e-01, 7.782140e-01, -2.359000e-01, -6.833100e-02,\n",
138 | " 2.560170e-01, -6.906900e-02, -1.219620e-01, 2.683020e-01,\n",
139 | " 1.678810e-01, 2.068910e-01, 1.987520e-01, 6.720900e-02,\n",
140 | " -3.975290e-01, -7.123140e-01, 5.613200e-02, 2.586000e-03,\n",
141 | " 5.616910e-01, 1.157000e-03, -4.341190e-01, 1.977480e-01,\n",
142 | " 2.519540e-01, 8.835000e-03, -3.554600e-01, -1.573500e-02,\n",
143 | " -2.526010e-01, 9.355900e-02, -3.962500e-02, -1.628350e-01,\n",
144 | " 2.980950e-01, 1.647900e-01, -5.454270e-01, 3.888790e-01,\n",
145 | " 1.446840e-01, -7.239600e-02, -7.597800e-02, -7.803000e-03,\n",
146 | " 2.020520e-01, -4.424750e-01, 3.911580e-01, 2.115100e-01,\n",
147 | " 6.516760e-01, 5.668030e-01, 5.065500e-02, -1.259650e-01,\n",
148 | " -3.720640e-01, 2.330470e-01, 6.659900e-02, 8.300600e-02,\n",
149 | " 2.540460e-01, -5.279760e-01, -3.843280e-01, 3.366460e-01,\n",
150 | " 2.336500e-01, 3.564750e-01, -4.884160e-01, -1.183910e-01,\n",
151 | " 1.365910e-01, 2.293420e-01, -6.151930e-01, 5.212050e-01,\n",
152 | " 3.412000e-01, 5.757940e-01, 2.354480e-01, -3.641530e-01,\n",
153 | " 7.373400e-02, 1.007380e-01, -3.211410e-01, -3.040480e-01,\n",
154 | " -3.738440e-01, -2.515150e-01, 2.633890e-01, 3.995490e-01,\n",
155 | " 4.461880e-01, 1.641110e-01, 1.449590e-01, -4.191540e-01,\n",
156 | " 2.297840e-01, 6.710600e-02, 3.316430e-01, -6.026500e-02,\n",
157 | " -5.130610e-01, 1.472570e-01, 2.414060e-01, 2.011000e-03,\n",
158 | " -3.823410e-01, -1.356010e-01, 3.112300e-01, 9.177830e-01,\n",
159 | " -4.511630e-01, 1.272190e-01, -9.431600e-02, -8.216000e-03,\n",
160 | " -3.835440e-01, 2.589400e-02, 6.374980e-01, 4.931630e-01,\n",
161 | " -1.865070e-01, 4.076900e-01, -1.841000e-03, 2.213160e-01,\n",
162 | " 2.253950e-01, -2.159220e-01, -7.611480e-01, -2.305920e-01,\n",
163 | " 1.296890e-01, -1.304100e-01, -4.742270e-01, 2.275500e-02,\n",
164 | " 4.255050e-01, 1.570280e-01, 2.975300e-02, 1.931830e-01,\n",
165 | " 1.304340e-01, -3.179800e-02, 1.516650e-01, -2.154310e-01,\n",
166 | " -4.681410e-01, 1.007326e+00, -6.698940e-01, -1.555240e-01,\n",
167 | " 1.797170e-01, 2.848660e-01, 6.216130e-01, 1.549510e-01,\n",
168 | " 6.225000e-02, -2.227800e-02, 2.561270e-01, -1.006380e-01,\n",
169 | " 2.807900e-02, 4.597710e-01, -4.077750e-01, -1.777390e-01,\n",
170 | " 1.920500e-02, -4.829300e-02, 4.714700e-02, -3.715200e-01,\n",
171 | " -2.995930e-01, -3.719710e-01, 4.622800e-02, -1.436460e-01,\n",
172 | " 2.532540e-01, -9.334000e-02, -4.957400e-02, -3.803850e-01,\n",
173 | " 5.970110e-01, 3.578450e-01, -6.826000e-02, 4.735200e-02,\n",
174 | " -3.707590e-01, -8.621300e-02, -2.556480e-01, -5.950440e-01,\n",
175 | " -4.757790e-01, 1.079320e-01, 9.858300e-02, 8.540300e-01,\n",
176 | " 3.518370e-01, -1.306360e-01, -1.541590e-01, 1.166775e+00,\n",
177 | " 2.048860e-01, 5.952340e-01, 1.158830e-01, 6.774400e-02,\n",
178 | " 6.793920e-01, -3.610700e-01, 1.697870e-01, 4.118530e-01,\n",
179 | " 4.731000e-03, -7.516530e-01, -9.833700e-02, -2.312220e-01,\n",
180 | " -7.043300e-02, 1.576110e-01, -4.780500e-02, -7.344390e-01,\n",
181 | " -2.834330e-01, 4.582690e-01, 3.957010e-01, -8.484300e-02,\n",
182 | " -3.472550e-01, 1.291660e-01, 3.838960e-01, -3.287600e-02,\n",
183 | " -2.802220e-01, 5.257030e-01, -3.609300e-02, -4.842220e-01,\n",
184 | " 3.690700e-02, 3.429560e-01, 2.902490e-01, -1.624650e-01,\n",
185 | " -7.513700e-02, 2.669300e-01, 5.778230e-01, -3.074020e-01,\n",
186 | " -2.183790e-01, -2.834050e-01, 1.350870e-01, 1.490070e-01,\n",
187 | " 1.438400e-02, -2.509040e-01, -3.376100e-01, 1.291880e-01,\n",
188 | " -3.808700e-01, -4.420520e-01, -2.512300e-01, -1.328990e-01,\n",
189 | " -1.211970e-01, 2.532660e-01, 2.757050e-01, -3.382040e-01,\n",
190 | " 1.178070e-01, 3.860190e-01, 5.277960e-01, 4.581920e-01,\n",
191 | " 1.502310e-01, 1.226320e-01, 2.768540e-01, -4.502080e-01,\n",
192 | " -1.992670e-01, 1.689100e-02, 1.188860e-01, 3.502440e-01,\n",
193 | " -4.064770e-01, 2.610280e-01, -1.934990e-01, -1.625660e-01,\n",
194 | " 2.498400e-02, -1.867150e-01, -1.954400e-02, -2.281900e-01,\n",
195 | " -3.417670e-01, -5.222770e-01, -9.543200e-02, -3.500350e-01,\n",
196 | " 2.154600e-02, 2.318040e-01, 5.395310e-01, -4.223720e-01],\n",
197 | " dtype=float32)"
198 | ]
199 | },
200 | "execution_count": 4,
201 | "metadata": {},
202 | "output_type": "execute_result"
203 | }
204 | ],
205 | "source": [
206 | "# As shown here, every word maps to a vector of length 300\n",
207 | "embedding_dim = cn_model['山东大学'].shape[0]\n",
208 | "print('词向量的长度为{}'.format(embedding_dim))\n",
209 | "cn_model['山东大学']"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "Cosine Similarity for Vector Space Models by Christian S. Perone\n",
217 | "http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": 5,
223 | "metadata": {},
224 | "outputs": [
225 | {
226 | "data": {
227 | "text/plain": [
228 | "0.66128117"
229 | ]
230 | },
231 | "execution_count": 5,
232 | "metadata": {},
233 | "output_type": "execute_result"
234 | }
235 | ],
236 | "source": [
237 | "# Compute the similarity\n",
238 | "cn_model.similarity('橘子', '橙子')"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": 6,
244 | "metadata": {},
245 | "outputs": [
246 | {
247 | "data": {
248 | "text/plain": [
249 | "0.66128117"
250 | ]
251 | },
252 | "execution_count": 6,
253 | "metadata": {},
254 | "output_type": "execute_result"
255 | }
256 | ],
257 | "source": [
258 | "# dot('橘子'/|'橘子'|, '橙子'/|'橙子'| )\n",
259 | "np.dot(cn_model['橘子']/np.linalg.norm(cn_model['橘子']), \n",
260 | "cn_model['橙子']/np.linalg.norm(cn_model['橙子']))"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": 7,
266 | "metadata": {},
267 | "outputs": [
268 | {
269 | "data": {
270 | "text/plain": [
271 | "[('高中', 0.7247823476791382),\n",
272 | " ('本科', 0.6768535375595093),\n",
273 | " ('研究生', 0.6244412660598755),\n",
274 | " ('中学', 0.6088204979896545),\n",
275 | " ('大学本科', 0.595908522605896),\n",
276 | " ('初中', 0.5883588790893555),\n",
277 | " ('读研', 0.5778335332870483),\n",
278 | " ('职高', 0.5767995119094849),\n",
279 | " ('大学毕业', 0.5767451524734497),\n",
280 | " ('师范大学', 0.5708829760551453)]"
281 | ]
282 | },
283 | "execution_count": 7,
284 | "metadata": {},
285 | "output_type": "execute_result"
286 | }
287 | ],
288 | "source": [
289 | "# Find the most similar words by cosine similarity\n",
290 | "cn_model.most_similar(positive=['大学'], topn=10)"
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": 8,
296 | "metadata": {},
297 | "outputs": [
298 | {
299 | "name": "stdout",
300 | "output_type": "stream",
301 | "text": [
302 | "Among 老师 会计师 程序员 律师 医生 老人:\n",
303 | "the word that does not belong is: 老人\n"
304 | ]
305 | }
306 | ],
307 | "source": [
308 | "# Find the word that does not belong (teacher, accountant, programmer, lawyer, doctor, elderly person)\n",
309 | "test_words = '老师 会计师 程序员 律师 医生 老人'\n",
310 | "test_words_result = cn_model.doesnt_match(test_words.split())\n",
311 | "print('Among '+test_words+':\\nthe word that does not belong is: %s' %test_words_result)"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 9,
317 | "metadata": {},
318 | "outputs": [
319 | {
320 | "data": {
321 | "text/plain": [
322 | "[('出轨', 0.6100173592567444)]"
323 | ]
324 | },
325 | "execution_count": 9,
326 | "metadata": {},
327 | "output_type": "execute_result"
328 | }
329 | ],
330 | "source": [
331 | "cn_model.most_similar(positive=['女人','劈腿'], negative=['男人'], topn=1)"
332 | ]
333 | },
334 | {
335 | "cell_type": "markdown",
336 | "metadata": {},
337 | "source": [
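338 | "**Training corpus** \n",
339 | "This tutorial uses the hotel review corpus compiled by Tan Songbo. Even this corpus is hard to find a download link for: one blog required points to download it, and I did not know how to earn points. I eventually found a link, but it was dead; pasting it into the Xunlei download manager finally retrieved the file. I hope everyone will share resources more freely in future. \n",
340 | "The training samples are stored in two folders,\n",
341 | "pos and neg. Each folder contains 2000 txt files, each holding one review, for 4000 training samples in total. A dataset of this size is tiny by NLP standards:"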
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": 10,
347 | "metadata": {},
348 | "outputs": [],
349 | "source": [
350 | "# Get the sample file names; the samples are stored in two folders,\n",
351 | "# 'pos' for positive reviews and 'neg' for negative reviews\n",
352 | "# Each folder holds 2000 txt files, one review per file\n",
353 | "import os\n",
354 | "pos_txts = os.listdir('pos')\n",
355 | "neg_txts = os.listdir('neg')"
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "execution_count": 11,
361 | "metadata": {},
362 | "outputs": [
363 | {
364 | "name": "stdout",
365 | "output_type": "stream",
366 | "text": [
367 | "Total samples: 4000\n"
368 | ]
369 | }
370 | ],
371 | "source": [
372 | "print( 'Total samples: '+ str(len(pos_txts) + len(neg_txts)) )"
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": 12,
378 | "metadata": {},
379 | "outputs": [],
380 | "source": [
381 | "# Put all review texts into one list. This differs from the video course: many readers\n",
382 | "# reported garbled text with the course's loading method because the raw files are\n",
383 | "# GBK-encoded, so I did some simple preprocessing to avoid the garbling.\n",
384 | "import ast  # literal_eval parses each line without executing arbitrary code\n",
385 | "train_texts_orig = []\n",
386 | "# The labels that correspond to the texts\n",
387 | "train_target = []\n",
388 | "with open(\"positive_samples.txt\", \"r\", encoding=\"utf-8\") as f:\n",
389 | "    lines = f.readlines()\n",
390 | "    for line in lines:\n",
391 | "        dic = ast.literal_eval(line)\n",
392 | "        train_texts_orig.append(dic[\"text\"])\n",
393 | "        train_target.append(dic[\"label\"])\n",
394 | "\n",
395 | "with open(\"negative_samples.txt\", \"r\", encoding=\"utf-8\") as f:\n",
396 | "    lines = f.readlines()\n",
397 | "    for line in lines:\n",
398 | "        dic = ast.literal_eval(line)\n",
399 | "        train_texts_orig.append(dic[\"text\"])\n",
400 | "        train_target.append(dic[\"label\"])"
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": 13,
406 | "metadata": {},
407 | "outputs": [
408 | {
409 | "data": {
410 | "text/plain": [
411 | "4000"
412 | ]
413 | },
414 | "execution_count": 13,
415 | "metadata": {},
416 | "output_type": "execute_result"
417 | }
418 | ],
419 | "source": [
420 | "len(train_texts_orig)"
421 | ]
422 | },
423 | {
424 | "cell_type": "code",
425 | "execution_count": 14,
426 | "metadata": {},
427 | "outputs": [],
428 | "source": [
429 | "# We build the model with TensorFlow's Keras API (public tensorflow.keras namespace)\n",
430 | "from tensorflow.keras.models import Sequential\n",
431 | "from tensorflow.keras.layers import Dense, GRU, Embedding, LSTM, Bidirectional\n",
432 | "from tensorflow.keras.preprocessing.text import Tokenizer\n",
433 | "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
434 | "from tensorflow.keras.optimizers import RMSprop\n",
435 | "from tensorflow.keras.optimizers import Adam\n",
436 | "from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
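443 | "**Word segmentation and tokenization** \n",
444 | "First we strip the punctuation from each sample, then segment it with jieba. jieba returns a generator, which cannot be tokenized directly, so we convert the segmentation result to a list and replace each word with its index. Each review thus becomes a sequence of index numbers that correspond to words in the pretrained word-vector model."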
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": 15,
450 | "metadata": {},
451 | "outputs": [
452 | {
453 | "name": "stderr",
454 | "output_type": "stream",
455 | "text": [
456 | "Building prefix dict from the default dictionary ...\n",
457 | "WARNING: Logging before flag parsing goes to stderr.\n",
458 | "I0610 16:36:46.897187 9584 __init__.py:111] Building prefix dict from the default dictionary ...\n",
459 | "Loading model from cache C:\\Users\\OSCARZ~1\\AppData\\Local\\Temp\\jieba.cache\n",
460 | "I0610 16:36:46.899155 9584 __init__.py:131] Loading model from cache C:\\Users\\OSCARZ~1\\AppData\\Local\\Temp\\jieba.cache\n",
461 | "Loading model cost 0.535 seconds.\n",
462 | "I0610 16:36:47.432762 9584 __init__.py:163] Loading model cost 0.535 seconds.\n",
463 | "Prefix dict has been built succesfully.\n",
464 | "I0610 16:36:47.433753 9584 __init__.py:164] Prefix dict has been built succesfully.\n"
465 | ]
466 | }
467 | ],
468 | "source": [
469 | "# Segment and tokenize\n",
470 | "# train_tokens is one long list holding 4000 smaller lists, one per review\n",
471 | "train_tokens = []\n",
472 | "for text in train_texts_orig:\n",
473 | "    # Strip punctuation (ASCII and full-width)\n",
474 | "    text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——！，。？、~@#￥%……&*（）]+\", \"\",text)\n",
475 | "    # Segment with jieba\n",
476 | "    cut = jieba.cut(text)\n",
477 | "    # jieba.cut returns a generator,\n",
478 | "    # so convert it to a list\n",
479 | "    cut_list = [ i for i in cut ]\n",
480 | "    for i, word in enumerate(cut_list):\n",
481 | "        try:\n",
482 | "            # Convert the word to its index (gensim < 4.0 API; gensim 4.x uses key_to_index)\n",
483 | "            cut_list[i] = cn_model.vocab[word].index\n",
484 | "        except KeyError:\n",
485 | "            # Words missing from the vocabulary become 0\n",
486 | "            cut_list[i] = 0\n",
487 | "    train_tokens.append(cut_list)"
488 | ]
489 | },
490 | {
491 | "cell_type": "markdown",
492 | "metadata": {},
493 | "source": [
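494 | "**Standardizing sequence length** \n",
495 | "Reviews differ in length. If we simply took the longest review and padded all the others to the same length, we would waste a lot of computation, so we choose a compromise length."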
496 | ]
497 | },
498 | {
499 | "cell_type": "code",
500 | "execution_count": 16,
501 | "metadata": {},
502 | "outputs": [],
503 | "source": [
504 | "# Get the token count of every review\n",
505 | "num_tokens = [ len(tokens) for tokens in train_tokens ]\n",
506 | "num_tokens = np.array(num_tokens)"
507 | ]
508 | },
509 | {
510 | "cell_type": "code",
511 | "execution_count": 17,
512 | "metadata": {},
513 | "outputs": [
514 | {
515 | "data": {
516 | "text/plain": [
517 | "71.4495"
518 | ]
519 | },
520 | "execution_count": 17,
521 | "metadata": {},
522 | "output_type": "execute_result"
523 | }
524 | ],
525 | "source": [
526 | "# Mean token count\n",
527 | "np.mean(num_tokens)"
528 | ]
529 | },
530 | {
531 | "cell_type": "code",
532 | "execution_count": 18,
533 | "metadata": {},
534 | "outputs": [
535 | {
536 | "data": {
537 | "text/plain": [
538 | "1540"
539 | ]
540 | },
541 | "execution_count": 18,
542 | "metadata": {},
543 | "output_type": "execute_result"
544 | }
545 | ],
546 | "source": [
547 | "# Token count of the longest review\n",
548 | "np.max(num_tokens)"
549 | ]
550 | },
551 | {
552 | "cell_type": "code",
553 | "execution_count": 19,
554 | "metadata": {},
555 | "outputs": [
556 | {
557 | "data": {
558 | "image/png": "\n",
559 | "text/plain": [
560 | ""
561 | ]
562 | },
563 | "metadata": {
564 | "needs_background": "light"
565 | },
566 | "output_type": "display_data"
567 | }
568 | ],
569 | "source": [
570 | "plt.hist(np.log(num_tokens), bins = 100)\n",
571 | "plt.xlim((0,10))\n",
572 | "plt.ylabel('number of tokens')\n",
573 | "plt.xlabel('length of tokens')\n",
574 | "plt.title('Distribution of tokens length')\n",
575 | "plt.show()"
576 | ]
577 | },
578 | {
579 | "cell_type": "code",
580 | "execution_count": 20,
581 | "metadata": {},
582 | "outputs": [
583 | {
584 | "data": {
585 | "text/plain": [
586 | "236"
587 | ]
588 | },
589 | "execution_count": 20,
590 | "metadata": {},
591 | "output_type": "execute_result"
592 | }
593 | ],
594 | "source": [
595 | "# Take the mean token count plus two standard deviations.\n",
596 | "# If token counts were normally distributed, this max_tokens would cover roughly 95% of the samples or more\n",
597 | "max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)\n",
598 | "max_tokens = int(max_tokens)\n",
599 | "max_tokens"
600 | ]
601 | },
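{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Aside (not part of the original tutorial): choosing the cut-off by percentile** \n",
"Token counts are usually right-skewed rather than normally distributed, so a percentile is another way to pick the cut-off length. The sketch below is a minimal illustration under stated assumptions: it uses synthetic log-normal lengths instead of the hotel corpus, and the names `demo_lengths`, `cutoff_std` and `cutoff_pct` are hypothetical. It compares the mean-plus-two-standard-deviations rule used above with the 95th percentile and prints the share of samples each one covers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Synthetic, right-skewed lengths for illustration only (not the hotel corpus)\n",
"rng = np.random.default_rng(0)\n",
"demo_lengths = rng.lognormal(mean=4.0, sigma=0.8, size=4000).astype(int)\n",
"\n",
"# Rule used in this tutorial: mean + 2 * std\n",
"cutoff_std = int(np.mean(demo_lengths) + 2 * np.std(demo_lengths))\n",
"# Alternative rule: the 95th percentile of the observed lengths\n",
"cutoff_pct = int(np.percentile(demo_lengths, 95))\n",
"\n",
"# Print each cut-off and the fraction of samples that fit within it\n",
"for name, cutoff in [('mean + 2*std', cutoff_std), ('95th percentile', cutoff_pct)]:\n",
"    coverage = np.mean(demo_lengths <= cutoff)\n",
"    print(name, cutoff, round(float(coverage), 4))"
]
},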
602 | {
603 | "cell_type": "code",
604 | "execution_count": 21,
605 | "metadata": {},
606 | "outputs": [
607 | {
608 | "data": {
609 | "text/plain": [
610 | "0.9565"
611 | ]
612 | },
613 | "execution_count": 21,
614 | "metadata": {},
615 | "output_type": "execute_result"
616 | }
617 | ],
618 | "source": [
619 | "# With a length of 236 tokens, about 95% of the samples are covered\n",
620 | "# Shorter sequences are padded and longer ones are truncated\n",
621 | "np.sum( num_tokens < max_tokens ) / len(num_tokens)"
622 | ]
623 | },
624 | {
625 | "cell_type": "markdown",
626 | "metadata": {},
627 | "source": [
628 | "**Reverse tokenization** \n",
629 | "We define a function that converts indices back into readable text, which is important for debugging."
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": 22,
635 | "metadata": {},
636 | "outputs": [],
637 | "source": [
638 | "# Convert tokens back into text\n",
639 | "def reverse_tokens(tokens):\n",
640 | "    text = ''\n",
641 | "    for i in tokens:\n",
642 | "        if i != 0:\n",
643 | "            text = text + cn_model.index2word[i]\n",
644 | "        else:\n",
645 | "            text = text + ' '\n",
646 | "    return text"
647 | ]
648 | },
649 | {
650 | "cell_type": "code",
651 | "execution_count": 23,
652 | "metadata": {},
653 | "outputs": [],
654 | "source": [
655 | "reverse = reverse_tokens(train_tokens[0])"
656 | ]
657 | },
658 | {
659 | "cell_type": "markdown",
660 | "metadata": {},
661 | "source": [
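662 | "As shown below, the polarity labels of the training samples are not entirely accurate. The sample below, for example, complains about the breakfast yet is labeled as a positive review. This will confuse our model, but for now we leave the training samples unchanged."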
663 | ]
664 | },
665 | {
666 | "cell_type": "code",
667 | "execution_count": 24,
668 | "metadata": {},
669 | "outputs": [
670 | {
671 | "data": {
672 | "text/plain": [
673 | "'早餐太差无论去多少人那边也不加食品的酒店应该重视一下这个问题了房间本身很好'"
674 | ]
675 | },
676 | "execution_count": 24,
677 | "metadata": {},
678 | "output_type": "execute_result"
679 | }
680 | ],
681 | "source": [
682 | "# Text restored after tokenization\n",
683 | "# Note that all punctuation is gone\n",
684 | "reverse"
685 | ]
686 | },
687 | {
688 | "cell_type": "code",
689 | "execution_count": 25,
690 | "metadata": {},
691 | "outputs": [
692 | {
693 | "data": {
694 | "text/plain": [
695 | "'早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。\\n\\n房间本身很好。'"
696 | ]
697 | },
698 | "execution_count": 25,
699 | "metadata": {},
700 | "output_type": "execute_result"
701 | }
702 | ],
703 | "source": [
704 | "# Original text\n",
705 | "train_texts_orig[0]"
706 | ]
707 | },
708 | {
709 | "cell_type": "markdown",
710 | "metadata": {},
711 | "source": [
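712 | "**Preparing the embedding matrix** \n",
713 | "Now we prepare the embedding matrix (the matrix of word vectors) for the model. Keras requires a matrix of shape $(numwords, embeddingdim)$, where num words is the number of words we use and embedding dimension is 300 in the pretrained word-vector model used here: every word is represented by a vector of length 300. \n",
714 | "Note that we use only the 50k most frequent words. The pretrained model has a vocabulary of 2.6 million words; using all of it for a classification problem would waste computation because our training set is tiny, only 4k samples. With 100k, 200k or more training samples, it would be worth considering a larger vocabulary."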
715 | ]
716 | },
717 | {
718 | "cell_type": "code",
719 | "execution_count": 26,
720 | "metadata": {},
721 | "outputs": [
722 | {
723 | "data": {
724 | "text/plain": [
725 | "300"
726 | ]
727 | },
728 | "execution_count": 26,
729 | "metadata": {},
730 | "output_type": "execute_result"
731 | }
732 | ],
733 | "source": [
734 | "embedding_dim"
735 | ]
736 | },
737 | {
738 | "cell_type": "code",
739 | "execution_count": 27,
740 | "metadata": {},
741 | "outputs": [],
742 | "source": [
743 | "# Use only the first 50000 words\n",
744 | "num_words = 50000\n",
745 | "# Initialize embedding_matrix, which is passed to Keras later\n",
746 | "embedding_matrix = np.zeros((num_words, embedding_dim))\n",
747 | "# embedding_matrix is a [num_words, embedding_dim] matrix\n",
748 | "# with shape 50000 * 300\n",
749 | "for i in range(num_words):\n",
750 | "    embedding_matrix[i,:] = cn_model[cn_model.index2word[i]]\n",
751 | "embedding_matrix = embedding_matrix.astype('float32')"
752 | ]
753 | },
754 | {
755 | "cell_type": "code",
756 | "execution_count": 28,
757 | "metadata": {},
758 | "outputs": [
759 | {
760 | "data": {
761 | "text/plain": [
762 | "300"
763 | ]
764 | },
765 | "execution_count": 28,
766 | "metadata": {},
767 | "output_type": "execute_result"
768 | }
769 | ],
770 | "source": [
771 | "# Check that the indices line up;\n",
772 | "# an output of 300 means all 300 components of the embedding vector match\n",
773 | "np.sum( cn_model[cn_model.index2word[333]] == embedding_matrix[333] )"
774 | ]
775 | },
776 | {
777 | "cell_type": "code",
778 | "execution_count": 29,
779 | "metadata": {},
780 | "outputs": [
781 | {
782 | "data": {
783 | "text/plain": [
784 | "(50000, 300)"
785 | ]
786 | },
787 | "execution_count": 29,
788 | "metadata": {},
789 | "output_type": "execute_result"
790 | }
791 | ],
792 | "source": [
793 | "# Shape of embedding_matrix;\n",
794 | "# Keras requires this shape, and it is used in the model later\n",
795 | "embedding_matrix.shape"
796 | ]
797 | },
798 | {
799 | "cell_type": "markdown",
800 | "metadata": {},
801 | "source": [
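802 | "**Padding and truncating** \n",
803 | "After converting text to tokens (indices), the index sequences have different lengths, so to make training easier we standardize the length. Above we chose 236, which covers about 95% of the training samples. Next we pad and truncate. We generally use the 'pre' method, which pads zeros in front of the index sequence: according to practice reported in some studies, padding zeros after the sequence can hurt the model."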
804 | ]
805 | },
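{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Aside (not part of the original tutorial): what 'pre' padding and truncating do** \n",
"To make the 'pre' behaviour described above concrete, here is a small NumPy sketch on toy sequences. `pad_pre` is a hypothetical helper written only for illustration; it imitates, under that assumption, what `pad_sequences` does with `padding='pre'` and `truncating='pre'`: short sequences get zeros in front, and long sequences lose their earliest tokens."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def pad_pre(seqs, maxlen):\n",
"    # Hypothetical helper that mimics padding='pre', truncating='pre'\n",
"    out = np.zeros((len(seqs), maxlen), dtype='int32')\n",
"    for row, seq in enumerate(seqs):\n",
"        kept = seq[-maxlen:]              # 'pre' truncation drops the earliest tokens\n",
"        if kept:\n",
"            out[row, -len(kept):] = kept  # 'pre' padding leaves the zeros in front\n",
"    return out\n",
"\n",
"# The short sequence becomes [0 0 5 6]; the long one keeps its last four tokens [3 4 5 6]\n",
"print(pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4))"
]
},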
806 | {
807 | "cell_type": "code",
808 | "execution_count": 30,
809 | "metadata": {},
810 | "outputs": [],
811 | "source": [
812 | "# Pad and truncate; the input train_tokens is a list\n",
813 | "# and the returned train_pad is a numpy array\n",
814 | "train_pad = pad_sequences(train_tokens, maxlen=max_tokens,\n",
815 | "                          padding='pre', truncating='pre')"
816 | ]
817 | },
818 | {
819 | "cell_type": "code",
820 | "execution_count": 31,
821 | "metadata": {},
822 | "outputs": [],
823 | "source": [
824 | "# Replace indices beyond the 50000-word vocabulary with 0\n",
825 | "train_pad[ train_pad>=num_words ] = 0"
826 | ]
827 | },
828 | {
829 | "cell_type": "code",
830 | "execution_count": 32,
831 | "metadata": {},
832 | "outputs": [
833 | {
834 | "data": {
835 | "text/plain": [
836 | "array([ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
837 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
838 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
839 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
840 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
841 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
842 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
843 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
844 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
845 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
846 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
847 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
848 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
849 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
850 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
851 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
852 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
853 | " 290, 3053, 57, 169, 73, 1, 25, 11216, 49,\n",
854 | " 163, 15985, 0, 0, 30, 8, 0, 1, 228,\n",
855 | " 223, 40, 35, 653, 0, 5, 1642, 29, 11216,\n",
856 | " 2751, 500, 98, 30, 3159, 2225, 2146, 371, 6285,\n",
857 | " 169, 27396, 1, 1191, 5432, 1080, 20055, 57, 562,\n",
858 | " 1, 22671, 40, 35, 169, 2567, 0, 42665, 7761,\n",
859 | " 110, 0, 0, 41281, 0, 110, 0, 35891, 110,\n",
860 | " 0, 28781, 57, 169, 1419, 1, 11670, 0, 19470,\n",
861 | " 1, 0, 0, 169, 35071, 40, 562, 35, 12398,\n",
862 | " 657, 4857])"
863 | ]
864 | },
865 | "execution_count": 32,
866 | "metadata": {},
867 | "output_type": "execute_result"
868 | }
869 | ],
870 | "source": [
871 | "# After padding, the leading tokens are all 0 and the text sits at the end\n",
872 | "train_pad[33]"
873 | ]
874 | },
875 | {
876 | "cell_type": "code",
877 | "execution_count": 33,
878 | "metadata": {},
879 | "outputs": [],
880 | "source": [
881 | "# Prepare the target vector: the first 2000 samples are 1, the last 2000 are 0\n",
882 | "train_target = np.array(train_target)"
883 | ]
884 | },
885 | {
886 | "cell_type": "code",
887 | "execution_count": 34,
888 | "metadata": {},
889 | "outputs": [],
890 | "source": [
891 | "# Split the samples into training and test sets\n",
892 | "from sklearn.model_selection import train_test_split"
893 | ]
894 | },
895 | {
896 | "cell_type": "code",
897 | "execution_count": 50,
898 | "metadata": {},
899 | "outputs": [],
900 | "source": [
901 | "# 90% of the samples are used for training and the remaining 10% for testing\n",
902 | "X_train, X_test, y_train, y_test = train_test_split(train_pad,\n",
903 | "                                                    train_target,\n",
904 | "                                                    test_size=0.1,\n",
905 | "                                                    random_state=12)"
906 | ]
907 | },
908 | {
909 | "cell_type": "code",
910 | "execution_count": 36,
911 | "metadata": {},
912 | "outputs": [
913 | {
914 | "name": "stdout",
915 | "output_type": "stream",
916 | "text": [
917 | " 房间很大还有海景阳台走出酒店就是沙滩非常不错唯一遗憾的就是不能刷 不方便\n",
918 | "class: 1\n"
919 | ]
920 | }
921 | ],
922 | "source": [
923 | "# Inspect a training sample to confirm it is correct\n",
924 | "print(reverse_tokens(X_train[35]))\n",
925 | "print('class: ',y_train[35])"
926 | ]
927 | },
928 | {
929 | "cell_type": "markdown",
930 | "metadata": {},
931 | "source": [
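932 | "Now we build the LSTM model with Keras. The first layer is an Embedding layer: only after the token indices are converted into a matrix of word vectors can a neural network process the text.\n",
933 | "Keras provides an Embedding layer, which avoids tedious sparse-matrix operations. \n",
934 | "The input to the Embedding layer has shape: $$(batchsize, maxtokens)$$\n",
935 | "and the output has shape: $$(batchsize, maxtokens, embeddingdim)$$"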
936 | ]
937 | },
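{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Aside (not part of the original tutorial): the Embedding layer as a row lookup** \n",
"With frozen weights, an Embedding layer can be pictured as a row lookup into the embedding matrix. The NumPy sketch below illustrates the shape change from $(batchsize, maxtokens)$ to $(batchsize, maxtokens, embeddingdim)$. The toy sizes and the names `toy_matrix`, `toy_batch` and `toy_out` are assumptions made for illustration; this is not the Keras implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Toy sizes chosen only for illustration: 10 words, 3-dimensional vectors\n",
"toy_num_words, toy_dim = 10, 3\n",
"toy_matrix = np.arange(toy_num_words * toy_dim, dtype='float32').reshape(toy_num_words, toy_dim)\n",
"\n",
"# A batch of 2 pre-padded sequences with 4 token indices each: shape (batch_size, max_tokens)\n",
"toy_batch = np.array([[0, 0, 2, 7], [0, 1, 1, 9]])\n",
"\n",
"# Integer-array indexing replaces every index with its row of the matrix\n",
"toy_out = toy_matrix[toy_batch]\n",
"print(toy_batch.shape, '->', toy_out.shape)  # (2, 4) -> (2, 4, 3)"
]
},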
938 | {
939 | "cell_type": "code",
940 | "execution_count": 37,
941 | "metadata": {},
942 | "outputs": [],
943 | "source": [
944 | "# Classify the samples with an LSTM\n",
945 | "model = Sequential()"
946 | ]
947 | },
948 | {
949 | "cell_type": "code",
950 | "execution_count": 38,
951 | "metadata": {},
952 | "outputs": [],
953 | "source": [
954 | "# The first layer of the model is the embedding\n",
955 | "model.add(Embedding(num_words,\n",
956 | "                    embedding_dim,\n",
957 | "                    weights=[embedding_matrix],\n",
958 | "                    input_length=max_tokens,\n",
959 | "                    trainable=False))"
960 | ]
961 | },
962 | {
963 | "cell_type": "code",
964 | "execution_count": 39,
965 | "metadata": {},
966 | "outputs": [],
967 | "source": [
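968 | "# Some serious bugs were fixed on 2019-06-10. Possibly because the order of the data changed,\n",
969 | "# the trained model no longer performs as well as the earliest version from last year.\n",
970 | "# Interested readers can tune the model parameters and see whether the results improve.\n",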
971 | "model.add(Bidirectional(LSTM(units=64, return_sequences=True)))\n",
972 | "model.add(LSTM(units=16, return_sequences=False))"
973 | ]
974 | },
975 | {
976 | "cell_type": "markdown",
977 | "metadata": {},
978 | "source": [
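979 | "**Building the model**  \n",
980 | "I tried several neural network architectures in this tutorial. Because there are few training samples, we can experiment freely; training does not take long.  \n",
981 | "**GRU:** With a GRU, accuracy on the test samples can reach 87%. However, when I tested my own text, the output of the final activation was always around 0.5, which means the model's judgment is not decisive and its confidence is low. Testing also showed that the model sometimes misjudges negated sentences. We want the output to be close to 0 for negative samples and close to 1 for positive samples, rather than hovering around 0.5.  \n",
982 | "**BiLSTM:** I tested both LSTM and BiLSTM. BiLSTM performed best, and LSTM performed slightly better than GRU. This may be because BiLSTM remembers the structure of longer sentences better; interested readers can look into this further.  \n",
983 | "After the Embedding, the first layer is a BiLSTM that returns sequences. The second layer is a 16-unit LSTM that does not return sequences, only the final result. The last layer is a fully connected layer with a sigmoid activation that outputs the result."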
984 | ]
985 | },
986 | {
987 | "cell_type": "code",
988 | "execution_count": 40,
989 | "metadata": {},
990 | "outputs": [],
991 | "source": [
992 | "# GRU variant of the model\n",
993 | "# model.add(GRU(units=32, return_sequences=True))\n",
994 | "# model.add(GRU(units=16, return_sequences=True))\n",
995 | "# model.add(GRU(units=4, return_sequences=False))"
996 | ]
997 | },
998 | {
999 | "cell_type": "code",
1000 | "execution_count": 41,
1001 | "metadata": {},
1002 | "outputs": [],
1003 | "source": [
1004 | "model.add(Dense(1, activation='sigmoid'))\n",
1005 | "# Optimize with Adam at a learning rate of 0.001\n",
1006 | "optimizer = Adam(lr=1e-3)"
1007 | ]
1008 | },
1009 | {
1010 | "cell_type": "code",
1011 | "execution_count": 42,
1012 | "metadata": {},
1013 | "outputs": [],
1014 | "source": [
1015 | "model.compile(loss='binary_crossentropy',\n",
1016 | " optimizer=optimizer,\n",
1017 | " metrics=['accuracy'])"
1018 | ]
1019 | },
1020 | {
1021 | "cell_type": "code",
1022 | "execution_count": 43,
1023 | "metadata": {},
1024 | "outputs": [
1025 | {
1026 | "name": "stdout",
1027 | "output_type": "stream",
1028 | "text": [
1029 | "Model: \"sequential\"\n",
1030 | "_________________________________________________________________\n",
1031 | "Layer (type) Output Shape Param # \n",
1032 | "=================================================================\n",
1033 | "embedding (Embedding) (None, 236, 300) 15000000 \n",
1034 | "_________________________________________________________________\n",
1035 | "bidirectional (Bidirectional (None, 236, 128) 186880 \n",
1036 | "_________________________________________________________________\n",
1037 | "lstm_1 (LSTM) (None, 16) 9280 \n",
1038 | "_________________________________________________________________\n",
1039 | "dense (Dense) (None, 1) 17 \n",
1040 | "=================================================================\n",
1041 | "Total params: 15,196,177\n",
1042 | "Trainable params: 196,177\n",
1043 | "Non-trainable params: 15,000,000\n",
1044 | "_________________________________________________________________\n"
1045 | ]
1046 | }
1047 | ],
1048 | "source": [
1049 | "# Look at the model structure: about 196k trainable parameters (196,177 in the summary)\n",
1050 | "model.summary()"
1051 | ]
1052 | },
1053 | {
1054 | "cell_type": "code",
1055 | "execution_count": 44,
1056 | "metadata": {},
1057 | "outputs": [],
1058 | "source": [
1059 | "# Set up a checkpoint for saving the weights\n",
1060 | "path_checkpoint = 'sentiment_checkpoint.keras'\n",
1061 | "checkpoint = ModelCheckpoint(filepath=path_checkpoint, monitor='val_loss',\n",
1062 | " verbose=1, save_weights_only=True,\n",
1063 | " save_best_only=True)"
1064 | ]
1065 | },
1066 | {
1067 | "cell_type": "code",
1068 | "execution_count": 45,
1069 | "metadata": {},
1070 | "outputs": [
1071 | {
1072 | "name": "stdout",
1073 | "output_type": "stream",
1074 | "text": [
1075 | "Shapes (300, 256) and (300, 128) are incompatible\n"
1076 | ]
1077 | }
1078 | ],
1079 | "source": [
1080 | "# Try to load previously trained weights\n",
1081 | "try:\n",
1082 | "    model.load_weights(path_checkpoint)\n",
1083 | "except Exception as e:\n",
1084 | "    print(e)"
1085 | ]
1086 | },
1087 | {
1088 | "cell_type": "code",
1089 | "execution_count": 46,
1090 | "metadata": {},
1091 | "outputs": [],
1092 | "source": [
1093 | "# Define early stopping: stop training if the validation loss has not improved for 5 epochs\n",
1094 | "earlystopping = EarlyStopping(monitor='val_loss', patience=5, verbose=1)"
1095 | ]
1096 | },
1097 | {
1098 | "cell_type": "code",
1099 | "execution_count": 47,
1100 | "metadata": {},
1101 | "outputs": [],
1102 | "source": [
1103 | "# Automatically reduce the learning rate when val_loss stops improving\n",
1104 | "lr_reduction = ReduceLROnPlateau(monitor='val_loss',\n",
1105 | " factor=0.1, min_lr=1e-8, patience=0,\n",
1106 | " verbose=1)"
1107 | ]
1108 | },
1109 | {
1110 | "cell_type": "code",
1111 | "execution_count": 48,
1112 | "metadata": {},
1113 | "outputs": [],
1114 | "source": [
1115 | "# Collect the callbacks\n",
1116 | "callbacks = [\n",
1117 | " earlystopping, \n",
1118 | " checkpoint,\n",
1119 | " lr_reduction\n",
1120 | "]"
1121 | ]
1122 | },
1123 | {
1124 | "cell_type": "code",
1125 | "execution_count": 49,
1126 | "metadata": {
1127 | "scrolled": false
1128 | },
1129 | "outputs": [
1130 | {
1131 | "name": "stdout",
1132 | "output_type": "stream",
1133 | "text": [
1134 | "Train on 3240 samples, validate on 360 samples\n",
1135 | "Epoch 1/20\n",
1136 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.6847 - accuracy: 0.5694\n",
1137 | "Epoch 00001: val_loss improved from inf to 0.65662, saving model to sentiment_checkpoint.keras\n",
1138 | "3240/3240 [==============================] - 34s 10ms/sample - loss: 0.6846 - accuracy: 0.5698 - val_loss: 0.6566 - val_accuracy: 0.6639\n",
1139 | "Epoch 2/20\n",
1140 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.6266 - accuracy: 0.6562\n",
1141 | "Epoch 00002: val_loss improved from 0.65662 to 0.56397, saving model to sentiment_checkpoint.keras\n",
1142 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.6266 - accuracy: 0.6556 - val_loss: 0.5640 - val_accuracy: 0.7139\n",
1143 | "Epoch 3/20\n",
1144 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.5093 - accuracy: 0.7591\n",
1145 | "Epoch 00003: val_loss improved from 0.56397 to 0.51803, saving model to sentiment_checkpoint.keras\n",
1146 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.5100 - accuracy: 0.7583 - val_loss: 0.5180 - val_accuracy: 0.7556\n",
1147 | "Epoch 4/20\n",
1148 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4914 - accuracy: 0.7734\n",
1149 | "Epoch 00004: val_loss improved from 0.51803 to 0.43727, saving model to sentiment_checkpoint.keras\n",
1150 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4904 - accuracy: 0.7744 - val_loss: 0.4373 - val_accuracy: 0.8250\n",
1151 | "Epoch 5/20\n",
1152 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4549 - accuracy: 0.8006\n",
1153 | "Epoch 00005: val_loss did not improve from 0.43727\n",
1154 | "\n",
1155 | "Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.\n",
1156 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.4564 - accuracy: 0.7994 - val_loss: 0.4508 - val_accuracy: 0.8000\n",
1157 | "Epoch 6/20\n",
1158 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4261 - accuracy: 0.8206\n",
1159 | "Epoch 00006: val_loss did not improve from 0.43727\n",
1160 | "\n",
1161 | "Epoch 00006: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.\n",
1162 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.4260 - accuracy: 0.8210 - val_loss: 0.4374 - val_accuracy: 0.8139\n",
1163 | "Epoch 7/20\n",
1164 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4158 - accuracy: 0.8256\n",
1165 | "Epoch 00007: val_loss improved from 0.43727 to 0.43676, saving model to sentiment_checkpoint.keras\n",
1166 | "3240/3240 [==============================] - 31s 9ms/sample - loss: 0.4160 - accuracy: 0.8256 - val_loss: 0.4368 - val_accuracy: 0.8139\n",
1167 | "Epoch 8/20\n",
1168 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4163 - accuracy: 0.8222\n",
1169 | "Epoch 00008: val_loss improved from 0.43676 to 0.43648, saving model to sentiment_checkpoint.keras\n",
1170 | "3240/3240 [==============================] - 31s 9ms/sample - loss: 0.4142 - accuracy: 0.8241 - val_loss: 0.4365 - val_accuracy: 0.8139\n",
1171 | "Epoch 9/20\n",
1172 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4125 - accuracy: 0.8241\n",
1173 | "Epoch 00009: val_loss improved from 0.43648 to 0.43615, saving model to sentiment_checkpoint.keras\n",
1174 | "3240/3240 [==============================] - 31s 9ms/sample - loss: 0.4131 - accuracy: 0.8228 - val_loss: 0.4361 - val_accuracy: 0.8139\n",
1175 | "Epoch 10/20\n",
1176 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4126 - accuracy: 0.8241\n",
1177 | "Epoch 00010: val_loss improved from 0.43615 to 0.43576, saving model to sentiment_checkpoint.keras\n",
1178 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4120 - accuracy: 0.8247 - val_loss: 0.4358 - val_accuracy: 0.8167\n",
1179 | "Epoch 11/20\n",
1180 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4110 - accuracy: 0.8253\n",
1181 | "Epoch 00011: val_loss improved from 0.43576 to 0.43573, saving model to sentiment_checkpoint.keras\n",
1182 | "\n",
1183 | "Epoch 00011: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.\n",
1184 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4109 - accuracy: 0.8253 - val_loss: 0.4357 - val_accuracy: 0.8167\n",
1185 | "Epoch 12/20\n",
1186 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4112 - accuracy: 0.8241\n",
1187 | "Epoch 00012: val_loss improved from 0.43573 to 0.43573, saving model to sentiment_checkpoint.keras\n",
1188 | "\n",
1189 | "Epoch 00012: ReduceLROnPlateau reducing learning rate to 1.0000001111620805e-07.\n",
1190 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n",
1191 | "Epoch 13/20\n",
1192 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4109 - accuracy: 0.8238\n",
1193 | "Epoch 00013: val_loss improved from 0.43573 to 0.43572, saving model to sentiment_checkpoint.keras\n",
1194 | "\n",
1195 | "Epoch 00013: ReduceLROnPlateau reducing learning rate to 1.000000082740371e-08.\n",
1196 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n",
1197 | "Epoch 14/20\n",
1198 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4098 - accuracy: 0.8250\n",
1199 | "Epoch 00014: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n",
1200 | "\n",
1201 | "Epoch 00014: ReduceLROnPlateau reducing learning rate to 1e-08.\n",
1202 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n",
1203 | "Epoch 15/20\n",
1204 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4090 - accuracy: 0.8253\n",
1205 | "Epoch 00015: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n",
1206 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n",
1207 | "Epoch 16/20\n",
1208 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4103 - accuracy: 0.8247\n",
1209 | "Epoch 00016: val_loss did not improve from 0.43572\n",
1210 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n",
1211 | "Epoch 17/20\n",
1212 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4108 - accuracy: 0.8244\n",
1213 | "Epoch 00017: val_loss did not improve from 0.43572\n",
1214 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n",
1215 | "Epoch 18/20\n",
1216 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4107 - accuracy: 0.8247\n",
1217 | "Epoch 00018: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n",
1218 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n",
1219 | "Epoch 19/20\n",
1220 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4080 - accuracy: 0.8263\n",
1221 | "Epoch 00019: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n",
1222 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n",
1223 | "Epoch 20/20\n",
1224 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4115 - accuracy: 0.8234\n",
1225 | "Epoch 00020: val_loss did not improve from 0.43572\n",
1226 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n"
1227 | ]
1228 | },
1229 | {
1230 | "data": {
1231 | "text/plain": [
1232 | ""
1233 | ]
1234 | },
1235 | "execution_count": 49,
1236 | "metadata": {},
1237 | "output_type": "execute_result"
1238 | }
1239 | ],
1240 | "source": [
1241 | "# Start training\n",
1242 | "model.fit(X_train, y_train,\n",
1243 | " validation_split=0.1, \n",
1244 | " epochs=20,\n",
1245 | " batch_size=128,\n",
1246 | " callbacks=callbacks)"
1247 | ]
1248 | },
1249 | {
1250 | "cell_type": "markdown",
1251 | "metadata": {},
1252 | "source": [
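1253 | "**Conclusion**  \n",
1254 | "We first predict on the test samples and obtain a reasonably satisfactory accuracy.  \n",
1255 | "We then define a prediction function that predicts the polarity of input text. The model can judge many negated sentences and simple logical structures, although in the run recorded here several clearly negative sentences score at or above 0.5."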
1256 | ]
1257 | },
1258 | {
1259 | "cell_type": "code",
1260 | "execution_count": 51,
1261 | "metadata": {},
1262 | "outputs": [
1263 | {
1264 | "name": "stdout",
1265 | "output_type": "stream",
1266 | "text": [
1267 | "400/400 [==============================] - 1s 3ms/sample - loss: 0.4799 - accuracy: 0.76750s - loss: 0.4699 - accuracy\n",
1268 | "Accuracy:76.75%\n"
1269 | ]
1270 | }
1271 | ],
1272 | "source": [
1273 | "result = model.evaluate(X_test, y_test)\n",
1274 | "print('Accuracy:{0:.2%}'.format(result[1]))"
1275 | ]
1276 | },
1277 | {
1278 | "cell_type": "code",
1279 | "execution_count": 52,
1280 | "metadata": {},
1281 | "outputs": [],
1282 | "source": [
1283 | "def predict_sentiment(text):\n",
1284 | "    print(text)\n",
1285 | "    # Strip punctuation\n",
1286 | "    text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\",text)\n",
1287 | "    # Segment the text into words\n",
1288 | "    cut = jieba.cut(text)\n",
1289 | "    cut_list = [ i for i in cut ]\n",
1290 | "    # Map each word to its index; out-of-vocabulary or rare words become 0\n",
1291 | "    for i, word in enumerate(cut_list):\n",
1292 | "        try:\n",
1293 | "            cut_list[i] = cn_model.vocab[word].index\n",
1294 | "            if cut_list[i] >= 50000:\n",
1295 | "                cut_list[i] = 0\n",
1296 | "        except KeyError:\n",
1297 | "            cut_list[i] = 0\n",
1298 | "    # Pad (or truncate) to max_tokens\n",
1299 | "    tokens_pad = pad_sequences([cut_list], maxlen=max_tokens,\n",
1300 | "                               padding='pre', truncating='pre')\n",
1301 | "    # Predict\n",
1302 | "    result = model.predict(x=tokens_pad)\n",
1303 | "    coef = result[0][0]\n",
1304 | "    if coef >= 0.5:\n",
1305 | "        print('是一例正面评价','output=%.2f'%coef)  # positive review\n",
1306 | "    else:\n",
1307 | "        print('是一例负面评价','output=%.2f'%coef)  # negative review"
1308 | ]
1309 | },
1310 | {
1311 | "cell_type": "code",
1312 | "execution_count": 53,
1313 | "metadata": {},
1314 | "outputs": [
1315 | {
1316 | "name": "stdout",
1317 | "output_type": "stream",
1318 | "text": [
1319 | "酒店设施不是新的,服务态度很不好\n",
1320 | "是一例正面评价 output=0.63\n",
1321 | "酒店卫生条件非常不好\n",
1322 | "是一例负面评价 output=0.47\n",
1323 | "床铺非常舒适\n",
1324 | "是一例正面评价 output=0.79\n",
1325 | "房间很凉,不给开暖气\n",
1326 | "是一例正面评价 output=0.53\n",
1327 | "房间很凉爽,空调冷气很足\n",
1328 | "是一例正面评价 output=0.79\n",
1329 | "酒店环境不好,住宿体验很不好\n",
1330 | "是一例负面评价 output=0.48\n",
1331 | "房间隔音不到位\n",
1332 | "是一例负面评价 output=0.31\n",
1333 | "晚上回来发现没有打扫卫生\n",
1334 | "是一例正面评价 output=0.50\n",
1335 | "因为过节所以要我临时加钱,比团购的价格贵\n",
1336 | "是一例负面评价 output=0.47\n"
1337 | ]
1338 | }
1339 | ],
1340 | "source": [
1341 | "test_list = [\n",
1342 | " '酒店设施不是新的,服务态度很不好',\n",
1343 | " '酒店卫生条件非常不好',\n",
1344 | " '床铺非常舒适',\n",
1345 | " '房间很凉,不给开暖气',\n",
1346 | " '房间很凉爽,空调冷气很足',\n",
1347 | " '酒店环境不好,住宿体验很不好',\n",
1348 | " '房间隔音不到位' ,\n",
1349 | " '晚上回来发现没有打扫卫生',\n",
1350 | " '因为过节所以要我临时加钱,比团购的价格贵'\n",
1351 | "]\n",
1352 | "for text in test_list:\n",
1353 | " predict_sentiment(text)"
1354 | ]
1355 | },
1356 | {
1357 | "cell_type": "markdown",
1358 | "metadata": {},
1359 | "source": [
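1360 | "**Misclassified text**\n",
1361 | "On inspection, most of the misclassified texts are ambiguous in meaning, and even a human would find their polarity hard to judge. For example, the sentence at index 101 does not seem to express any satisfaction, yet it is labeled as a positive review in the data; our model's negative prediction seems reasonable."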
1362 | ]
1363 | },
1364 | {
1365 | "cell_type": "code",
1366 | "execution_count": 54,
1367 | "metadata": {},
1368 | "outputs": [],
1369 | "source": [
1370 | "y_pred = model.predict(X_test)\n",
1371 | "y_pred = y_pred.T[0]\n",
1372 | "y_pred = [1 if p>= 0.5 else 0 for p in y_pred]\n",
1373 | "y_pred = np.array(y_pred)"
1374 | ]
1375 | },
1376 | {
1377 | "cell_type": "code",
1378 | "execution_count": 55,
1379 | "metadata": {},
1380 | "outputs": [],
1381 | "source": [
1382 | "y_actual = np.array(y_test)"
1383 | ]
1384 | },
1385 | {
1386 | "cell_type": "code",
1387 | "execution_count": 56,
1388 | "metadata": {},
1389 | "outputs": [],
1390 | "source": [
1391 | "# Find the indices of the misclassified samples\n",
1392 | "misclassified = np.where( y_pred != y_actual )[0]"
1393 | ]
1394 | },
1395 | {
1396 | "cell_type": "code",
1397 | "execution_count": 57,
1398 | "metadata": {
1399 | "scrolled": true
1400 | },
1401 | "outputs": [
1402 | {
1403 | "name": "stdout",
1404 | "output_type": "stream",
1405 | "text": [
1406 | "400\n"
1407 | ]
1408 | }
1409 | ],
1410 | "source": [
1411 | "# len(misclassified) is the number of errors (not displayed); the print shows the test-set size\n",
1412 | "len(misclassified)\n",
1413 | "print(len(X_test))"
1414 | ]
1415 | },
1416 | {
1417 | "cell_type": "code",
1418 | "execution_count": 58,
1419 | "metadata": {},
1420 | "outputs": [
1421 | {
1422 | "name": "stdout",
1423 | "output_type": "stream",
1424 | "text": [
1425 | " 由于2007年 有一些新问题可能还没来得及解决我因为工作需要经常要住那里所以慎重的提出以下 :1 后 的 淋浴喷头的位置都太高我换了房间还是一样很不好用2 后的一些管理和服务还很不到位尤其是前台入住和 时代效率太低每次 都超过10分钟好像不符合 宾馆的要求\n",
1426 | "预测的分类 0\n",
1427 | "实际的分类 1\n"
1428 | ]
1429 | }
1430 | ],
1431 | "source": [
1432 | "# Look at one of the misclassified samples\n",
1433 | "idx=101\n",
1434 | "print(reverse_tokens(X_test[idx]))\n",
1435 | "print('预测的分类', y_pred[idx])\n",
1436 | "print('实际的分类', y_actual[idx])"
1437 | ]
1438 | },
1439 | {
1440 | "cell_type": "code",
1441 | "execution_count": 59,
1442 | "metadata": {},
1443 | "outputs": [
1444 | {
1445 | "name": "stdout",
1446 | "output_type": "stream",
1447 | "text": [
1448 | " 还是很 设施也不错但是 和以前 比急剧下滑了 和客房 的服务极差幸好我不是很在乎\n",
1449 | "预测的分类 1\n",
1450 | "实际的分类 1\n"
1451 | ]
1452 | }
1453 | ],
1454 | "source": [
1455 | "idx=1\n",
1456 | "print(reverse_tokens(X_test[idx]))\n",
1457 | "print('预测的分类', y_pred[idx])\n",
1458 | "print('实际的分类', y_actual[idx])"
1459 | ]
1460 | },
1461 | {
1462 | "cell_type": "code",
1463 | "execution_count": null,
1464 | "metadata": {},
1465 | "outputs": [],
1466 | "source": []
1467 | },
1468 | {
1469 | "cell_type": "code",
1470 | "execution_count": null,
1471 | "metadata": {},
1472 | "outputs": [],
1473 | "source": []
1474 | },
1475 | {
1476 | "cell_type": "code",
1477 | "execution_count": null,
1478 | "metadata": {},
1479 | "outputs": [],
1480 | "source": []
1481 | }
1482 | ],
1483 | "metadata": {
1484 | "kernelspec": {
1485 | "display_name": "Python 3",
1486 | "language": "python",
1487 | "name": "python3"
1488 | },
1489 | "language_info": {
1490 | "codemirror_mode": {
1491 | "name": "ipython",
1492 | "version": 3
1493 | },
1494 | "file_extension": ".py",
1495 | "mimetype": "text/x-python",
1496 | "name": "python",
1497 | "nbconvert_exporter": "python",
1498 | "pygments_lexer": "ipython3",
1499 | "version": "3.6.8"
1500 | }
1501 | },
1502 | "nbformat": 4,
1503 | "nbformat_minor": 2
1504 | }
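The parameter counts printed by `model.summary()` in the notebook above can be rechecked by hand. The sketch below is not part of the original notebook: it uses the standard parameter formulas for Embedding, LSTM and Dense layers, and the helper name `lstm_params` is made up for illustration. The vocabulary size of 50000 is inferred from the summary (15,000,000 / 300) and from the `>= 50000` check in `predict_sentiment`.

```python
# Recompute the parameter counts shown by model.summary().
# Layer sizes are read off the summary; lstm_params is a hypothetical helper.

def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * ((input_dim + units + 1) * units)

num_words, embedding_dim = 50000, 300          # inferred from the summary

embedding = num_words * embedding_dim          # frozen, since trainable=False
bilstm = 2 * lstm_params(embedding_dim, 64)    # forward and backward LSTM(64)
lstm = lstm_params(2 * 64, 16)                 # input is the 128-wide BiLSTM output
dense = 16 * 1 + 1                             # weights plus bias

trainable = bilstm + lstm + dense
print("embedding:", embedding)
print("bidirectional:", bilstm, "lstm_1:", lstm, "dense:", dense)
print("trainable:", trainable, "total:", trainable + embedding)
```

The trainable total comes to 196,177, so the model has roughly 196k trainable parameters; the 15,000,000 embedding weights stay fixed during training.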
1505 |
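In the training log above, some epochs print both "val_loss improved" (from `ModelCheckpoint`) and "ReduceLROnPlateau reducing learning rate". A likely explanation is that the two callbacks use different thresholds: the checkpoint accepts any decrease, while `ReduceLROnPlateau` only counts a decrease larger than its `min_delta`. The following is a simplified pure-Python model of that logic, not the library's code. It assumes the Keras default `min_delta=1e-4`, and the function name `simulate_reduce_lr` is hypothetical. The loss values are copied from the log for epochs 1 to 13.

```python
def simulate_reduce_lr(val_losses, lr=1e-3, factor=0.1, min_lr=1e-8,
                       patience=0, min_delta=1e-4):
    """Return (epochs at which the rate is reduced, final learning rate)."""
    best = float("inf")
    wait = 0
    reduced_at = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Only a drop of more than min_delta counts as an improvement.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience and lr > min_lr:
                lr = max(lr * factor, min_lr)
                reduced_at.append(epoch)
                wait = 0
    return reduced_at, lr


# val_loss for epochs 1..13, taken from the training log above.
log_val_losses = [0.65662, 0.56397, 0.51803, 0.43727, 0.4508, 0.4374,
                  0.43676, 0.43648, 0.43615, 0.43576, 0.43573, 0.43573,
                  0.43572]

epochs, final_lr = simulate_reduce_lr(log_val_losses)
print(epochs)
```

Under these assumptions the reductions land on epochs 5, 6, 11, 12 and 13, the same epochs at which the log reports a reduction. With `patience=0` and `factor=0.1` the rate falls from 1e-3 to about 1e-8 within a handful of epochs, after which training barely changes the weights; a larger `patience` or `factor` is one of the parameters worth trying.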
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # chinese_sentiment
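2 | **Chinese NLP classification in practice with Tensorflow**
3 | 1. Word-vector download:
4 | Link: https://pan.baidu.com/s/1GerioMpwj1zmju9NkkrsFg
5 | Extraction code: x6v3
6 | After downloading, create an "embeddings" folder in the project root and put the downloaded file in it (no need to decompress); the code can then be run.
7 | 2. Many readers ran into garbled text and other bugs; apologies for not replying promptly. The corpus and code have been reprocessed and the garbled-text problem is gone.
8 | 3. After the bug fixes, possibly because the order of the data changed, the trained model performs somewhat worse than last year's. Interested readers can tune the model parameters and see whether the results improve.
9 | 4. The code was written fairly early and may have pitfalls in places. It is not being rewritten for now, because LSTM is a rather old model. A tutorial on transformer language models will be published soon; stay tuned.
10 | 5. Note: the debugged code is in "2019新版debug之后--中文自然语言处理--情感分析.ipynb", and the corresponding corpus files are "negative_samples.txt" and "positive_samples.txt".
11 | 6. If you have questions, please leave a comment under the video so that fellow learners can help each other, or open an issue on the project. Please avoid emailing me, as replies may be slow.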
12 |
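Before running the notebook, it may help to confirm that the word-vector file from step 1 is where the code expects it. The sketch below is an illustration, not part of the project: the function name `find_embedding_files` is hypothetical, and since the README does not name the downloaded file, the check only lists whatever files are present in the "embeddings" folder.

```python
from pathlib import Path


def find_embedding_files(project_root="."):
    """Return the files found in <project_root>/embeddings, sorted by name."""
    emb_dir = Path(project_root) / "embeddings"
    if not emb_dir.is_dir():
        raise FileNotFoundError(f"missing folder: {emb_dir}")
    files = sorted(p for p in emb_dir.iterdir() if p.is_file())
    if not files:
        raise FileNotFoundError(f"no files in {emb_dir}; put the downloaded file there")
    return files


if __name__ == "__main__":
    try:
        for path in find_embedding_files():
            print(path.name, path.stat().st_size, "bytes")
    except FileNotFoundError as err:
        print(err)
```

Run it from the project root; it prints each file's name and size, or a message saying what is missing.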
13 | Tutorial video:
14 | youtube:
15 | https://www.youtube.com/watch?v=-mcrmLmNOXA&t=991s
16 | bilibili:
17 | https://www.bilibili.com/video/av30543613?from=search&seid=74343163897647645
18 | In the old version the corpora in pos and neg are incomplete; please unzip “语料.zip” and overwrite them.
19 |
--------------------------------------------------------------------------------
/flowchart.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/flowchart.jpg
--------------------------------------------------------------------------------
/neg/neg.0.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.0.txt
--------------------------------------------------------------------------------
/neg/neg.1.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1.txt
--------------------------------------------------------------------------------
/neg/neg.10.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.10.txt
--------------------------------------------------------------------------------
/neg/neg.1000.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1000.txt
--------------------------------------------------------------------------------
/neg/neg.1001.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1001.txt
--------------------------------------------------------------------------------
/neg/neg.1002.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1002.txt
--------------------------------------------------------------------------------
/neg/neg.1003.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1003.txt
--------------------------------------------------------------------------------
/neg/neg.1004.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1004.txt
--------------------------------------------------------------------------------
/neg/neg.1005.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1005.txt
--------------------------------------------------------------------------------
/neg/neg.1006.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1006.txt
--------------------------------------------------------------------------------
/neg/neg.1007.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1007.txt
--------------------------------------------------------------------------------
/neg/neg.1009.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1009.txt
--------------------------------------------------------------------------------
/neg/neg.101.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.101.txt
--------------------------------------------------------------------------------
/neg/neg.1010.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1010.txt
--------------------------------------------------------------------------------
/neg/neg.1011.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1011.txt
--------------------------------------------------------------------------------
/neg/neg.1012.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1012.txt
--------------------------------------------------------------------------------
/neg/neg.1013.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1013.txt
--------------------------------------------------------------------------------
/neg/neg.1014.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1014.txt
--------------------------------------------------------------------------------
/neg/neg.1015.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1015.txt
--------------------------------------------------------------------------------
/neg/neg.1017.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1017.txt
--------------------------------------------------------------------------------
/neg/neg.1018.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1018.txt
--------------------------------------------------------------------------------
/neg/neg.1019.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1019.txt
--------------------------------------------------------------------------------
/neg/neg.102.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.102.txt
--------------------------------------------------------------------------------
/neg/neg.1020.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1020.txt
--------------------------------------------------------------------------------
/neg/neg.1022.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1022.txt
--------------------------------------------------------------------------------
/neg/neg.1025.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1025.txt
--------------------------------------------------------------------------------
/neg/neg.1026.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1026.txt
--------------------------------------------------------------------------------
/neg/neg.1027.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1027.txt
--------------------------------------------------------------------------------
/neg/neg.1028.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1028.txt
--------------------------------------------------------------------------------
/neg/neg.1029.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1029.txt
--------------------------------------------------------------------------------
/neg/neg.103.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.103.txt
--------------------------------------------------------------------------------
/neg/neg.1030.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1030.txt
--------------------------------------------------------------------------------
/neg/neg.1032.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1032.txt
--------------------------------------------------------------------------------
/neg/neg.1033.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1033.txt
--------------------------------------------------------------------------------
/neg/neg.1034.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1034.txt
--------------------------------------------------------------------------------
/neg/neg.1035.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1035.txt
--------------------------------------------------------------------------------
/neg/neg.1036.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1036.txt
--------------------------------------------------------------------------------
/neg/neg.1038.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1038.txt
--------------------------------------------------------------------------------
/neg/neg.1039.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1039.txt
--------------------------------------------------------------------------------
/neg/neg.104.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.104.txt
--------------------------------------------------------------------------------
/neg/neg.1040.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1040.txt
--------------------------------------------------------------------------------
/neg/neg.1041.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1041.txt
--------------------------------------------------------------------------------
/neg/neg.1042.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1042.txt
--------------------------------------------------------------------------------
/neg/neg.1047.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1047.txt
--------------------------------------------------------------------------------
/neg/neg.1048.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1048.txt
--------------------------------------------------------------------------------
/neg/neg.1049.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1049.txt
--------------------------------------------------------------------------------
/neg/neg.105.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.105.txt
--------------------------------------------------------------------------------
/neg/neg.1050.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1050.txt
--------------------------------------------------------------------------------
/neg/neg.1052.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1052.txt
--------------------------------------------------------------------------------
/neg/neg.1053.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1053.txt
--------------------------------------------------------------------------------
/neg/neg.1054.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1054.txt
--------------------------------------------------------------------------------
/neg/neg.1055.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1055.txt
--------------------------------------------------------------------------------
/neg/neg.1056.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1056.txt
--------------------------------------------------------------------------------
/neg/neg.1057.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1057.txt
--------------------------------------------------------------------------------
/neg/neg.1058.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1058.txt
--------------------------------------------------------------------------------
/neg/neg.1059.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1059.txt
--------------------------------------------------------------------------------
/neg/neg.106.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.106.txt
--------------------------------------------------------------------------------
/neg/neg.1060.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1060.txt
--------------------------------------------------------------------------------
/neg/neg.1061.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1061.txt
--------------------------------------------------------------------------------
/neg/neg.1062.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1062.txt
--------------------------------------------------------------------------------
/neg/neg.1063.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1063.txt
--------------------------------------------------------------------------------
/neg/neg.1066.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1066.txt
--------------------------------------------------------------------------------
/neg/neg.1067.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1067.txt
--------------------------------------------------------------------------------
/neg/neg.1069.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1069.txt
--------------------------------------------------------------------------------
/neg/neg.107.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.107.txt
--------------------------------------------------------------------------------
/neg/neg.1070.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1070.txt
--------------------------------------------------------------------------------
/neg/neg.1071.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1071.txt
--------------------------------------------------------------------------------
/neg/neg.1072.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1072.txt
--------------------------------------------------------------------------------
/neg/neg.1073.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1073.txt
--------------------------------------------------------------------------------
/neg/neg.1074.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1074.txt
--------------------------------------------------------------------------------
/neg/neg.1075.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1075.txt
--------------------------------------------------------------------------------
/neg/neg.1076.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1076.txt
--------------------------------------------------------------------------------
/neg/neg.1077.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1077.txt
--------------------------------------------------------------------------------
/neg/neg.1078.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1078.txt
--------------------------------------------------------------------------------
/neg/neg.1079.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1079.txt
--------------------------------------------------------------------------------
/neg/neg.108.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.108.txt
--------------------------------------------------------------------------------
/neg/neg.1080.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1080.txt
--------------------------------------------------------------------------------
/neg/neg.1081.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1081.txt
--------------------------------------------------------------------------------
/neg/neg.1082.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1082.txt
--------------------------------------------------------------------------------
/neg/neg.1083.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1083.txt
--------------------------------------------------------------------------------
/neg/neg.1084.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1084.txt
--------------------------------------------------------------------------------
/neg/neg.1085.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1085.txt
--------------------------------------------------------------------------------
/neg/neg.1086.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1086.txt
--------------------------------------------------------------------------------
/neg/neg.1087.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1087.txt
--------------------------------------------------------------------------------
/neg/neg.1088.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1088.txt
--------------------------------------------------------------------------------
/neg/neg.1089.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1089.txt
--------------------------------------------------------------------------------
/neg/neg.109.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.109.txt
--------------------------------------------------------------------------------
/neg/neg.1090.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1090.txt
--------------------------------------------------------------------------------
/neg/neg.1091.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1091.txt
--------------------------------------------------------------------------------
/neg/neg.1092.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1092.txt
--------------------------------------------------------------------------------
/neg/neg.1093.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1093.txt
--------------------------------------------------------------------------------
/neg/neg.1094.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1094.txt
--------------------------------------------------------------------------------
/neg/neg.1095.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1095.txt
--------------------------------------------------------------------------------
/neg/neg.1096.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1096.txt
--------------------------------------------------------------------------------
/neg/neg.1097.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1097.txt
--------------------------------------------------------------------------------
/neg/neg.1098.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1098.txt
--------------------------------------------------------------------------------
/neg/neg.1099.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1099.txt
--------------------------------------------------------------------------------
/neg/neg.11.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.11.txt
--------------------------------------------------------------------------------
/neg/neg.110.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.110.txt
--------------------------------------------------------------------------------
/pos/pos.10.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.10.txt
--------------------------------------------------------------------------------
/pos/pos.100.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.100.txt
--------------------------------------------------------------------------------
/pos/pos.1000.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1000.txt
--------------------------------------------------------------------------------
/pos/pos.1001.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1001.txt
--------------------------------------------------------------------------------
/pos/pos.1002.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1002.txt
--------------------------------------------------------------------------------
/pos/pos.1003.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1003.txt
--------------------------------------------------------------------------------
/pos/pos.1004.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1004.txt
--------------------------------------------------------------------------------
/pos/pos.1005.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1005.txt
--------------------------------------------------------------------------------
/pos/pos.1006.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1006.txt
--------------------------------------------------------------------------------
/pos/pos.1007.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1007.txt
--------------------------------------------------------------------------------
/pos/pos.1008.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1008.txt
--------------------------------------------------------------------------------
/pos/pos.1009.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1009.txt
--------------------------------------------------------------------------------
/pos/pos.101.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.101.txt
--------------------------------------------------------------------------------
/pos/pos.1010.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1010.txt
--------------------------------------------------------------------------------
/pos/pos.1012.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1012.txt
--------------------------------------------------------------------------------
/pos/pos.1013.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1013.txt
--------------------------------------------------------------------------------
/pos/pos.1014.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1014.txt
--------------------------------------------------------------------------------
/pos/pos.1015.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1015.txt
--------------------------------------------------------------------------------
/pos/pos.1016.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1016.txt
--------------------------------------------------------------------------------
/pos/pos.1017.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1017.txt
--------------------------------------------------------------------------------
/pos/pos.1018.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1018.txt
--------------------------------------------------------------------------------
/pos/pos.1019.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1019.txt
--------------------------------------------------------------------------------
/pos/pos.102.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.102.txt
--------------------------------------------------------------------------------
/pos/pos.1020.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1020.txt
--------------------------------------------------------------------------------
/pos/pos.1021.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1021.txt
--------------------------------------------------------------------------------
/pos/pos.1022.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1022.txt
--------------------------------------------------------------------------------
/pos/pos.1023.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1023.txt
--------------------------------------------------------------------------------
/pos/pos.1024.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1024.txt
--------------------------------------------------------------------------------
/pos/pos.1025.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1025.txt
--------------------------------------------------------------------------------
/pos/pos.1026.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1026.txt
--------------------------------------------------------------------------------
/pos/pos.1027.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1027.txt
--------------------------------------------------------------------------------
/pos/pos.1028.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1028.txt
--------------------------------------------------------------------------------
/pos/pos.1029.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1029.txt
--------------------------------------------------------------------------------
/pos/pos.103.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.103.txt
--------------------------------------------------------------------------------
/pos/pos.1030.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1030.txt
--------------------------------------------------------------------------------
/pos/pos.1031.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1031.txt
--------------------------------------------------------------------------------
/pos/pos.1032.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1032.txt
--------------------------------------------------------------------------------
/pos/pos.1033.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1033.txt
--------------------------------------------------------------------------------
/pos/pos.1034.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1034.txt
--------------------------------------------------------------------------------
/pos/pos.1035.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1035.txt
--------------------------------------------------------------------------------
/pos/pos.1036.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1036.txt
--------------------------------------------------------------------------------
/pos/pos.1037.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1037.txt
--------------------------------------------------------------------------------
/pos/pos.1038.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1038.txt
--------------------------------------------------------------------------------
/pos/pos.1039.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1039.txt
--------------------------------------------------------------------------------
/pos/pos.104.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.104.txt
--------------------------------------------------------------------------------
/pos/pos.1040.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1040.txt
--------------------------------------------------------------------------------
/pos/pos.1041.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1041.txt
--------------------------------------------------------------------------------
/pos/pos.1042.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1042.txt
--------------------------------------------------------------------------------
/pos/pos.1043.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1043.txt
--------------------------------------------------------------------------------
/pos/pos.1044.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1044.txt
--------------------------------------------------------------------------------
/pos/pos.1045.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1045.txt
--------------------------------------------------------------------------------
/pos/pos.1046.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1046.txt
--------------------------------------------------------------------------------
/pos/pos.1047.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1047.txt
--------------------------------------------------------------------------------
/pos/pos.1048.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1048.txt
--------------------------------------------------------------------------------
/pos/pos.1049.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1049.txt
--------------------------------------------------------------------------------
/pos/pos.105.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.105.txt
--------------------------------------------------------------------------------
/pos/pos.1050.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1050.txt
--------------------------------------------------------------------------------
/pos/pos.1051.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1051.txt
--------------------------------------------------------------------------------
/pos/pos.1052.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1052.txt
--------------------------------------------------------------------------------
/pos/pos.1053.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1053.txt
--------------------------------------------------------------------------------
/pos/pos.1054.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1054.txt
--------------------------------------------------------------------------------
/pos/pos.1055.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1055.txt
--------------------------------------------------------------------------------
/pos/pos.1056.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1056.txt
--------------------------------------------------------------------------------
/pos/pos.1057.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1057.txt
--------------------------------------------------------------------------------
/pos/pos.1058.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1058.txt
--------------------------------------------------------------------------------
/pos/pos.1059.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1059.txt
--------------------------------------------------------------------------------
/pos/pos.106.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.106.txt
--------------------------------------------------------------------------------
/pos/pos.1060.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1060.txt
--------------------------------------------------------------------------------
/pos/pos.1061.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1061.txt
--------------------------------------------------------------------------------
/pos/pos.1062.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1062.txt
--------------------------------------------------------------------------------
/pos/pos.1063.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1063.txt
--------------------------------------------------------------------------------
/pos/pos.1064.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1064.txt
--------------------------------------------------------------------------------
/pos/pos.1065.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1065.txt
--------------------------------------------------------------------------------
/pos/pos.107.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.107.txt
--------------------------------------------------------------------------------
/pos/pos.1073.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1073.txt
--------------------------------------------------------------------------------
/pos/pos.1074.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1074.txt
--------------------------------------------------------------------------------
/pos/pos.1075.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1075.txt
--------------------------------------------------------------------------------
/pos/pos.1076.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1076.txt
--------------------------------------------------------------------------------
/pos/pos.1077.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1077.txt
--------------------------------------------------------------------------------
/pos/pos.1078.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1078.txt
--------------------------------------------------------------------------------
/pos/pos.1079.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1079.txt
--------------------------------------------------------------------------------
/pos/pos.108.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.108.txt
--------------------------------------------------------------------------------
/pos/pos.1080.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1080.txt
--------------------------------------------------------------------------------
/pos/pos.1081.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1081.txt
--------------------------------------------------------------------------------
/pos/pos.1082.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1082.txt
--------------------------------------------------------------------------------
/pos/pos.1083.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1083.txt
--------------------------------------------------------------------------------
/pos/pos.1084.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1084.txt
--------------------------------------------------------------------------------
/pos/pos.1085.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1085.txt
--------------------------------------------------------------------------------
/pos/pos.1086.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1086.txt
--------------------------------------------------------------------------------
/pos/pos.1087.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1087.txt
--------------------------------------------------------------------------------
/pos/pos.1088.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1088.txt
--------------------------------------------------------------------------------
/pos/pos.1089.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1089.txt
--------------------------------------------------------------------------------
/pos/pos.109.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.109.txt
--------------------------------------------------------------------------------
/pos/pos.1090.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1090.txt
--------------------------------------------------------------------------------
/pos/pos.1091.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1091.txt
--------------------------------------------------------------------------------
/pos/pos.1093.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1093.txt
--------------------------------------------------------------------------------
/pos/pos.1094.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1094.txt
--------------------------------------------------------------------------------
/pos/pos.1095.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1095.txt
--------------------------------------------------------------------------------
/pos/pos.1096.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1096.txt
--------------------------------------------------------------------------------
/pos/pos.1097.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1097.txt
--------------------------------------------------------------------------------
/中文自然语言处理--情感分析.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Chinese Natural Language Processing with TensorFlow: Sentiment Analysis"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "$$f('真好喝')=1$$\n",
15 | "$$f('太难喝了')=0$$"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
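22 | "**Introduction** \n",
23 | "Hi everyone, I'm Espresso. This is my first tutorial: a simple hands-on text classification exercise in Chinese natural language processing. \n",
24 | "Why make this tutorial? There is plenty of NLP learning material, especially in English, but online resources are disorganized. Systematic material in Chinese is scarce, the key points are scattered, and hands-on examples are rare. What code exists often lacks comments and takes a long time to understand. While learning, I spent a whole day searching the web just to work out the steps and software needed to process Chinese text. \n",
25 | "So I felt obliged to make an introductory tutorial that pulls the scattered material into a single worked example. This tutorial focuses on practice; for theory I recommend the deeplearning.ai courses. In the code sections below, wherever a topic comes up I recommend some learning material and link to it. In case of copyright infringement, please e-mail a66777@188.com. \n",
26 | "I have not studied NLP in any depth, so criticism from experts is welcome; please point out shortcomings and suggest improvements."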
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "**Required libraries** \n",
34 | "numpy \n",
35 | "jieba \n",
36 | "gensim \n",
37 | "tensorflow \n",
38 | "matplotlib "
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 2,
44 | "metadata": {},
45 | "outputs": [
46 | {
47 | "name": "stderr",
48 | "output_type": "stream",
49 | "text": [
50 | "c:\\users\\jinan\\appdata\\local\\programs\\python\\python36\\lib\\site-packages\\gensim\\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n",
51 | " warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n"
52 | ]
53 | }
54 | ],
55 | "source": [
56 | "# First, load the required libraries\n",
57 | "%matplotlib inline\n",
58 | "import numpy as np\n",
59 | "import matplotlib.pyplot as plt\n",
60 | "import re\n",
61 | "import jieba # jieba Chinese word segmentation\n",
62 | "# gensim is used to load the pretrained word vectors\n",
63 | "from gensim.models import KeyedVectors\n",
64 | "import warnings\n",
65 | "warnings.filterwarnings(\"ignore\")"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
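72 | "**Pretrained word vectors** \n",
73 | "This tutorial uses \"chinese-word-vectors\", open-sourced by researchers at the Institute of Chinese Information Processing, Beijing Normal University, and the DBIIR Lab, Renmin University of China. The GitHub link is: \n",
74 | "https://github.com/Embedding/Chinese-Word-Vectors \n",
75 | "If you don't know what word2vec is, I recommend the following article: \n",
76 | "https://zhuanlan.zhihu.com/p/26306795 \n",
77 | "Here we use the Zhihu Word + Ngram vectors from \"chinese-word-vectors\", which can be downloaded from the GitHub link above. We first load the pretrained model and run a few simple tests:"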
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 3,
83 | "metadata": {},
84 | "outputs": [],
85 | "source": [
86 | "# Load the pretrained Chinese word embeddings with gensim\n",
87 | "cn_model = KeyedVectors.load_word2vec_format('chinese_word_vectors/sgns.zhihu.bigram', \n",
88 | " binary=False)"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
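95 | "**Word-vector model** \n",
96 | "In this word-vector model, every word is an index that maps to a vector of length 300. The LSTM network we build today cannot process Chinese text directly: the text must first be segmented into words, and the words converted to word vectors. The steps are shown in the figure below and are explained alongside the code. If you don't know what RNN, GRU, and LSTM are, I recommend the deeplearning.ai courses. NetEase Open Courses offers a free version with Chinese subtitles, but I still recommend the original Coursera version, which includes the exercises and practice code: \n",
97 | "[Figure: flowchart of the processing steps, see flowchart.jpg]"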
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 29,
103 | "metadata": {},
104 | "outputs": [
105 | {
106 | "name": "stdout",
107 | "output_type": "stream",
108 | "text": [
109 | "词向量的长度为300\n"
110 | ]
111 | },
112 | {
113 | "data": {
114 | "text/plain": [
115 | "array([-2.603470e-01, 3.677500e-01, -2.379650e-01, 5.301700e-02,\n",
116 | " -3.628220e-01, -3.212010e-01, -1.903330e-01, 1.587220e-01,\n",
117 | " -7.156200e-02, -4.625400e-02, -1.137860e-01, 3.515600e-01,\n",
118 | " -6.408200e-02, -2.184840e-01, 3.286950e-01, -7.110330e-01,\n",
119 | " 1.620320e-01, 1.627490e-01, 5.528180e-01, 1.016860e-01,\n",
120 | " 1.060080e-01, 7.820700e-01, -7.537310e-01, -2.108400e-02,\n",
121 | " -4.758250e-01, -1.130420e-01, -2.053000e-01, 6.624390e-01,\n",
122 | " 2.435850e-01, 9.171890e-01, -2.090610e-01, -5.290000e-02,\n",
123 | " -7.969340e-01, 2.394940e-01, -9.028100e-02, 1.537360e-01,\n",
124 | " -4.003980e-01, -2.456100e-02, -1.717860e-01, 2.037790e-01,\n",
125 | " -4.344710e-01, -3.850430e-01, -9.366000e-02, 3.775310e-01,\n",
126 | " 2.659690e-01, 8.879800e-02, 2.493440e-01, 4.914900e-02,\n",
127 | " 5.996000e-03, 3.586430e-01, -1.044960e-01, -5.838460e-01,\n",
128 | " 3.093280e-01, -2.828090e-01, -8.563400e-02, -5.745400e-02,\n",
129 | " -2.075230e-01, 2.845980e-01, 1.414760e-01, 1.678570e-01,\n",
130 | " 1.957560e-01, 7.782140e-01, -2.359000e-01, -6.833100e-02,\n",
131 | " 2.560170e-01, -6.906900e-02, -1.219620e-01, 2.683020e-01,\n",
132 | " 1.678810e-01, 2.068910e-01, 1.987520e-01, 6.720900e-02,\n",
133 | " -3.975290e-01, -7.123140e-01, 5.613200e-02, 2.586000e-03,\n",
134 | " 5.616910e-01, 1.157000e-03, -4.341190e-01, 1.977480e-01,\n",
135 | " 2.519540e-01, 8.835000e-03, -3.554600e-01, -1.573500e-02,\n",
136 | " -2.526010e-01, 9.355900e-02, -3.962500e-02, -1.628350e-01,\n",
137 | " 2.980950e-01, 1.647900e-01, -5.454270e-01, 3.888790e-01,\n",
138 | " 1.446840e-01, -7.239600e-02, -7.597800e-02, -7.803000e-03,\n",
139 | " 2.020520e-01, -4.424750e-01, 3.911580e-01, 2.115100e-01,\n",
140 | " 6.516760e-01, 5.668030e-01, 5.065500e-02, -1.259650e-01,\n",
141 | " -3.720640e-01, 2.330470e-01, 6.659900e-02, 8.300600e-02,\n",
142 | " 2.540460e-01, -5.279760e-01, -3.843280e-01, 3.366460e-01,\n",
143 | " 2.336500e-01, 3.564750e-01, -4.884160e-01, -1.183910e-01,\n",
144 | " 1.365910e-01, 2.293420e-01, -6.151930e-01, 5.212050e-01,\n",
145 | " 3.412000e-01, 5.757940e-01, 2.354480e-01, -3.641530e-01,\n",
146 | " 7.373400e-02, 1.007380e-01, -3.211410e-01, -3.040480e-01,\n",
147 | " -3.738440e-01, -2.515150e-01, 2.633890e-01, 3.995490e-01,\n",
148 | " 4.461880e-01, 1.641110e-01, 1.449590e-01, -4.191540e-01,\n",
149 | " 2.297840e-01, 6.710600e-02, 3.316430e-01, -6.026500e-02,\n",
150 | " -5.130610e-01, 1.472570e-01, 2.414060e-01, 2.011000e-03,\n",
151 | " -3.823410e-01, -1.356010e-01, 3.112300e-01, 9.177830e-01,\n",
152 | " -4.511630e-01, 1.272190e-01, -9.431600e-02, -8.216000e-03,\n",
153 | " -3.835440e-01, 2.589400e-02, 6.374980e-01, 4.931630e-01,\n",
154 | " -1.865070e-01, 4.076900e-01, -1.841000e-03, 2.213160e-01,\n",
155 | " 2.253950e-01, -2.159220e-01, -7.611480e-01, -2.305920e-01,\n",
156 | " 1.296890e-01, -1.304100e-01, -4.742270e-01, 2.275500e-02,\n",
157 | " 4.255050e-01, 1.570280e-01, 2.975300e-02, 1.931830e-01,\n",
158 | " 1.304340e-01, -3.179800e-02, 1.516650e-01, -2.154310e-01,\n",
159 | " -4.681410e-01, 1.007326e+00, -6.698940e-01, -1.555240e-01,\n",
160 | " 1.797170e-01, 2.848660e-01, 6.216130e-01, 1.549510e-01,\n",
161 | " 6.225000e-02, -2.227800e-02, 2.561270e-01, -1.006380e-01,\n",
162 | " 2.807900e-02, 4.597710e-01, -4.077750e-01, -1.777390e-01,\n",
163 | " 1.920500e-02, -4.829300e-02, 4.714700e-02, -3.715200e-01,\n",
164 | " -2.995930e-01, -3.719710e-01, 4.622800e-02, -1.436460e-01,\n",
165 | " 2.532540e-01, -9.334000e-02, -4.957400e-02, -3.803850e-01,\n",
166 | " 5.970110e-01, 3.578450e-01, -6.826000e-02, 4.735200e-02,\n",
167 | " -3.707590e-01, -8.621300e-02, -2.556480e-01, -5.950440e-01,\n",
168 | " -4.757790e-01, 1.079320e-01, 9.858300e-02, 8.540300e-01,\n",
169 | " 3.518370e-01, -1.306360e-01, -1.541590e-01, 1.166775e+00,\n",
170 | " 2.048860e-01, 5.952340e-01, 1.158830e-01, 6.774400e-02,\n",
171 | " 6.793920e-01, -3.610700e-01, 1.697870e-01, 4.118530e-01,\n",
172 | " 4.731000e-03, -7.516530e-01, -9.833700e-02, -2.312220e-01,\n",
173 | " -7.043300e-02, 1.576110e-01, -4.780500e-02, -7.344390e-01,\n",
174 | " -2.834330e-01, 4.582690e-01, 3.957010e-01, -8.484300e-02,\n",
175 | " -3.472550e-01, 1.291660e-01, 3.838960e-01, -3.287600e-02,\n",
176 | " -2.802220e-01, 5.257030e-01, -3.609300e-02, -4.842220e-01,\n",
177 | " 3.690700e-02, 3.429560e-01, 2.902490e-01, -1.624650e-01,\n",
178 | " -7.513700e-02, 2.669300e-01, 5.778230e-01, -3.074020e-01,\n",
179 | " -2.183790e-01, -2.834050e-01, 1.350870e-01, 1.490070e-01,\n",
180 | " 1.438400e-02, -2.509040e-01, -3.376100e-01, 1.291880e-01,\n",
181 | " -3.808700e-01, -4.420520e-01, -2.512300e-01, -1.328990e-01,\n",
182 | " -1.211970e-01, 2.532660e-01, 2.757050e-01, -3.382040e-01,\n",
183 | " 1.178070e-01, 3.860190e-01, 5.277960e-01, 4.581920e-01,\n",
184 | " 1.502310e-01, 1.226320e-01, 2.768540e-01, -4.502080e-01,\n",
185 | " -1.992670e-01, 1.689100e-02, 1.188860e-01, 3.502440e-01,\n",
186 | " -4.064770e-01, 2.610280e-01, -1.934990e-01, -1.625660e-01,\n",
187 | " 2.498400e-02, -1.867150e-01, -1.954400e-02, -2.281900e-01,\n",
188 | " -3.417670e-01, -5.222770e-01, -9.543200e-02, -3.500350e-01,\n",
189 | " 2.154600e-02, 2.318040e-01, 5.395310e-01, -4.223720e-01],\n",
190 | " dtype=float32)"
191 | ]
192 | },
193 | "execution_count": 29,
194 | "metadata": {},
195 | "output_type": "execute_result"
196 | }
197 | ],
198 | "source": [
199 | "# As shown, each word maps to a vector of length 300\n",
200 | "embedding_dim = cn_model['山东大学'].shape[0]\n",
201 | "print('词向量的长度为{}'.format(embedding_dim))\n",
202 | "cn_model['山东大学']"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "Cosine Similarity for Vector Space Models by Christian S. Perone\n",
210 | "http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 33,
216 | "metadata": {},
217 | "outputs": [
218 | {
219 | "data": {
220 | "text/plain": [
221 | "0.66128117"
222 | ]
223 | },
224 | "execution_count": 33,
225 | "metadata": {},
226 | "output_type": "execute_result"
227 | }
228 | ],
229 | "source": [
230 | "# Compute the similarity between two words\n",
231 | "cn_model.similarity('橘子', '橙子')"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 34,
237 | "metadata": {},
238 | "outputs": [
239 | {
240 | "data": {
241 | "text/plain": [
242 | "0.66128117"
243 | ]
244 | },
245 | "execution_count": 34,
246 | "metadata": {},
247 | "output_type": "execute_result"
248 | }
249 | ],
250 | "source": [
251 | "# Cosine similarity by hand: dot('橘子'/|'橘子'|, '橙子'/|'橙子'|)\n",
252 | "np.dot(cn_model['橘子']/np.linalg.norm(cn_model['橘子']), \n",
253 | "cn_model['橙子']/np.linalg.norm(cn_model['橙子']))"
254 | ]
255 | },
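
The two cells above give the same value because `similarity` is the cosine similarity: the dot product of the two vectors after each is scaled to unit length. A minimal sketch on made-up toy vectors (the helper name `cosine_similarity` and the vectors `a`, `b`, `c` are illustrative assumptions, not part of the tutorial or of gensim):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    # Dot product divided by the product of the Euclidean norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for word vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

sim_ab = cosine_similarity(a, b)  # parallel vectors score close to 1
sim_ac = cosine_similarity(a, c)  # orthogonal vectors score close to 0
print(sim_ab, sim_ac)
```

Because only the direction matters, scaling a vector does not change its similarity to other vectors.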
256 | {
257 | "cell_type": "code",
258 | "execution_count": 35,
259 | "metadata": {},
260 | "outputs": [
261 | {
262 | "data": {
263 | "text/plain": [
264 | "[('高中', 0.7247823476791382),\n",
265 | " ('本科', 0.6768535375595093),\n",
266 | " ('研究生', 0.6244412660598755),\n",
267 | " ('中学', 0.6088204979896545),\n",
268 | " ('大学本科', 0.595908522605896),\n",
269 | " ('初中', 0.5883588790893555),\n",
270 | " ('读研', 0.5778335332870483),\n",
271 | " ('职高', 0.5767995119094849),\n",
272 | " ('大学毕业', 0.5767451524734497),\n",
273 | " ('师范大学', 0.5708829760551453)]"
274 | ]
275 | },
276 | "execution_count": 35,
277 | "metadata": {},
278 | "output_type": "execute_result"
279 | }
280 | ],
281 | "source": [
282 | "# Find the most similar words by cosine similarity\n",
283 | "cn_model.most_similar(positive=['大学'], topn=10)"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 36,
289 | "metadata": {},
290 | "outputs": [
291 | {
292 | "name": "stdout",
293 | "output_type": "stream",
294 | "text": [
295 | "Among 老师 会计师 程序员 律师 医生 老人:\n",
296 | "the word that does not belong is: 老人\n"
297 | ]
298 | }
299 | ],
300 | "source": [
301 | "# Find the word that does not belong\n",
302 | "test_words = '老师 会计师 程序员 律师 医生 老人'\n",
303 | "test_words_result = cn_model.doesnt_match(test_words.split())\n",
304 | "print('Among ' + test_words + ':\\nthe word that does not belong is: %s' % test_words_result)"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": null,
310 | "metadata": {},
311 | "outputs": [],
312 | "source": [
313 | "cn_model.most_similar(positive=['女人','出轨'], negative=['男人'], topn=1)"
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
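320 | "**Training corpus** \n",
321 | "This tutorial uses the hotel review corpus compiled by Tan Songbo. Even this corpus was hard to find a download link for: one blog required points to download it, and I had no idea how to earn points; later I finally found a link, but it was dead; eventually I pasted the link into Xunlei (Thunder) and managed to download it. I hope everyone will share resources more in the future. \n",
322 | "The training samples are stored in two folders,\n",
323 | "pos and neg. Each folder contains 2000 txt files, each holding one review, for 4000 training samples in total. A dataset of this size is very small by NLP standards:"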
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": 38,
329 | "metadata": {},
330 | "outputs": [],
331 | "source": [
332 | "# Get the sample file names. The samples are stored in two folders:\n",
333 | "# 'pos' for positive reviews and 'neg' for negative reviews.\n",
334 | "# Each folder has 2000 txt files, one review per file\n",
335 | "import os\n",
336 | "pos_txts = os.listdir('pos')\n",
337 | "neg_txts = os.listdir('neg')"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": 39,
343 | "metadata": {},
344 | "outputs": [
345 | {
346 | "name": "stdout",
347 | "output_type": "stream",
348 | "text": [
349 | "Total samples: 4000\n"
350 | ]
351 | }
352 | ],
353 | "source": [
354 | "print('Total samples: ' + str(len(pos_txts) + len(neg_txts)))"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": 40,
360 | "metadata": {},
361 | "outputs": [],
362 | "source": [
363 | "# Now we put all the review texts into a single list\n",
364 | "\n",
365 | "train_texts_orig = [] # holds all reviews, one string per review\n",
366 | "\n",
367 | "# After all samples are added, train_texts_orig is a list of 4000 texts:\n",
368 | "# the first 2000 are positive reviews, the last 2000 are negative\n",
369 | "\n",
370 | "for i in range(len(pos_txts)):\n",
371 | "    with open('pos/'+pos_txts[i], 'r', errors='ignore') as f:\n",
372 | "        text = f.read().strip()\n",
373 | "        train_texts_orig.append(text)\n",
374 | "        f.close()\n",
375 | "for i in range(len(neg_txts)):\n",
376 | "    with open('neg/'+neg_txts[i], 'r', errors='ignore') as f:\n",
377 | "        text = f.read().strip()\n",
378 | "        train_texts_orig.append(text)\n",
379 | "        f.close()"
380 | ]
381 | },
382 | {
383 | "cell_type": "code",
384 | "execution_count": 41,
385 | "metadata": {},
386 | "outputs": [
387 | {
388 | "data": {
389 | "text/plain": [
390 | "4000"
391 | ]
392 | },
393 | "execution_count": 41,
394 | "metadata": {},
395 | "output_type": "execute_result"
396 | }
397 | ],
398 | "source": [
399 | "len(train_texts_orig)"
400 | ]
401 | },
402 | {
403 | "cell_type": "code",
404 | "execution_count": 42,
405 | "metadata": {},
406 | "outputs": [],
407 | "source": [
408 | "# We use TensorFlow's Keras API to build the model\n",
409 | "from tensorflow.python.keras.models import Sequential\n",
410 | "from tensorflow.python.keras.layers import Dense, GRU, Embedding, LSTM, Bidirectional\n",
411 | "from tensorflow.python.keras.preprocessing.text import Tokenizer\n",
412 | "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n",
413 | "from tensorflow.python.keras.optimizers import RMSprop\n",
414 | "from tensorflow.python.keras.optimizers import Adam\n",
415 | "from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau"
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "metadata": {},
421 | "source": [
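422 | "**Word segmentation and tokenization** \n",
423 | "First we strip the punctuation from each sample and segment it with jieba. jieba returns a generator, which cannot be tokenized directly, so we convert the result to a list and then map each word to its index. Each review thus becomes a sequence of integer indices, each corresponding to a word in the pretrained word-vector model."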
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": 43,
429 | "metadata": {},
430 | "outputs": [
431 | {
432 | "name": "stderr",
433 | "output_type": "stream",
434 | "text": [
435 | "Building prefix dict from the default dictionary ...\n",
436 | "Loading model from cache C:\\Users\\jinan\\AppData\\Local\\Temp\\jieba.cache\n",
437 | "Loading model cost 0.672 seconds.\n",
438 | "Prefix dict has been built succesfully.\n"
439 | ]
440 | }
441 | ],
442 | "source": [
443 | "# Word segmentation and tokenization\n",
444 | "# train_tokens is a long list holding 4000 smaller lists, one per review\n",
445 | "train_tokens = []\n",
446 | "for text in train_texts_orig:\n",
447 | " # Remove punctuation\n",
448 | " text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\",text)\n",
449 | " # Segment with jieba\n",
450 | " cut = jieba.cut(text)\n",
451 | " # jieba.cut returns a generator,\n",
452 | " # so convert it to a list\n",
453 | " cut_list = [ i for i in cut ]\n",
454 | " for i, word in enumerate(cut_list):\n",
455 | " try:\n",
456 | " # Map the word to its vocabulary index\n",
457 | " cut_list[i] = cn_model.vocab[word].index\n",
458 | " except KeyError:\n",
459 | " # Words missing from the vocabulary get index 0\n",
460 | " cut_list[i] = 0\n",
461 | " train_tokens.append(cut_list)"
462 | ]
463 | },
464 | {
465 | "cell_type": "markdown",
466 | "metadata": {},
467 | "source": [
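468 | "**Normalizing sequence length**  \n",
469 | "Reviews differ in length. Padding every review to the length of the longest one would waste a lot of computation, so we pick a compromise length instead."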
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": 44,
475 | "metadata": {},
476 | "outputs": [],
477 | "source": [
478 | "# Length of every token sequence\n",
479 | "num_tokens = [ len(tokens) for tokens in train_tokens ]\n",
480 | "num_tokens = np.array(num_tokens)"
481 | ]
482 | },
483 | {
484 | "cell_type": "code",
485 | "execution_count": 45,
486 | "metadata": {},
487 | "outputs": [
488 | {
489 | "data": {
490 | "text/plain": [
491 | "71.4495"
492 | ]
493 | },
494 | "execution_count": 45,
495 | "metadata": {},
496 | "output_type": "execute_result"
497 | }
498 | ],
499 | "source": [
500 | "# Mean sequence length\n",
501 | "np.mean(num_tokens)"
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": 46,
507 | "metadata": {},
508 | "outputs": [
509 | {
510 | "data": {
511 | "text/plain": [
512 | "1540"
513 | ]
514 | },
515 | "execution_count": 46,
516 | "metadata": {},
517 | "output_type": "execute_result"
518 | }
519 | ],
520 | "source": [
521 | "# Length of the longest review, in tokens\n",
522 | "np.max(num_tokens)"
523 | ]
524 | },
525 | {
526 | "cell_type": "code",
527 | "execution_count": 89,
528 | "metadata": {},
529 | "outputs": [
530 | {
531 | "data": {
532 | "image/png": "\n",
533 | "text/plain": [
534 | ""
535 | ]
536 | },
537 | "metadata": {},
538 | "output_type": "display_data"
539 | }
540 | ],
541 | "source": [
542 | "plt.hist(np.log(num_tokens), bins = 100)\n",
543 | "plt.xlim((0,10))\n",
544 | "plt.ylabel('number of reviews')\n",
545 | "plt.xlabel('log(number of tokens)')\n",
546 | "plt.title('Distribution of review lengths (log scale)')\n",
547 | "plt.show()"
548 | ]
549 | },
550 | {
551 | "cell_type": "code",
552 | "execution_count": 48,
553 | "metadata": {},
554 | "outputs": [
555 | {
556 | "data": {
557 | "text/plain": [
558 | "236"
559 | ]
560 | },
561 | "execution_count": 48,
562 | "metadata": {},
563 | "output_type": "execute_result"
564 | }
565 | ],
566 | "source": [
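567 | "# Take the mean length plus two standard deviations as the cutoff.\n",
568 | "# For normally distributed lengths this would cover about 97.7% of the samples; the real coverage is checked next\n",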
569 | "max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)\n",
570 | "max_tokens = int(max_tokens)\n",
571 | "max_tokens"
572 | ]
573 | },
574 | {
575 | "cell_type": "code",
576 | "execution_count": 49,
577 | "metadata": {},
578 | "outputs": [
579 | {
580 | "data": {
581 | "text/plain": [
582 | "0.9565"
583 | ]
584 | },
585 | "execution_count": 49,
586 | "metadata": {},
587 | "output_type": "execute_result"
588 | }
589 | ],
590 | "source": [
591 | "# With max_tokens = 236, about 95% of the samples are fully covered\n",
592 | "# Shorter sequences will be padded and longer ones truncated\n",
593 | "np.sum( num_tokens < max_tokens ) / len(num_tokens)"
594 | ]
595 | },
596 | {
597 | "cell_type": "markdown",
598 | "metadata": {},
599 | "source": [
600 | "**Reverse tokenization**  \n",
601 | "We define a function that converts indices back into readable text, which is very useful for debugging."
602 | ]
603 | },
604 | {
605 | "cell_type": "code",
606 | "execution_count": 50,
607 | "metadata": {},
608 | "outputs": [],
609 | "source": [
610 | "# Convert a token sequence back into text\n",
611 | "def reverse_tokens(tokens):\n",
612 | " text = ''\n",
613 | " for i in tokens:\n",
614 | " if i != 0:\n",
615 | " text = text + cn_model.index2word[i]\n",
616 | " else:\n",
617 | " text = text + ' '\n",
618 | " return text"
619 | ]
620 | },
621 | {
622 | "cell_type": "code",
623 | "execution_count": 51,
624 | "metadata": {},
625 | "outputs": [],
626 | "source": [
627 | "reverse = reverse_tokens(train_tokens[0])"
628 | ]
629 | },
630 | {
631 | "cell_type": "markdown",
632 | "metadata": {},
633 | "source": [
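634 | "As shown below, the polarity labels of the training samples are not always accurate. The following sample, for example, complains about the breakfast yet is labeled as a positive review. Such labels can confuse the model, but for now we leave the training data unchanged."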
635 | ]
636 | },
637 | {
638 | "cell_type": "code",
639 | "execution_count": 52,
640 | "metadata": {},
641 | "outputs": [
642 | {
643 | "data": {
644 | "text/plain": [
645 | "'早餐太差无论去多少人那边也不加食品的酒店应该重视一下这个问题了房间本身很好'"
646 | ]
647 | },
648 | "execution_count": 52,
649 | "metadata": {},
650 | "output_type": "execute_result"
651 | }
652 | ],
653 | "source": [
654 | "# Text recovered after tokenization\n",
655 | "# Note that the punctuation is gone\n",
656 | "reverse"
657 | ]
658 | },
659 | {
660 | "cell_type": "code",
661 | "execution_count": 53,
662 | "metadata": {},
663 | "outputs": [
664 | {
665 | "data": {
666 | "text/plain": [
667 | "'早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。\\n\\n房间本身很好。'"
668 | ]
669 | },
670 | "execution_count": 53,
671 | "metadata": {},
672 | "output_type": "execute_result"
673 | }
674 | ],
675 | "source": [
676 | "# Original text\n",
677 | "train_texts_orig[0]"
678 | ]
679 | },
680 | {
681 | "cell_type": "markdown",
682 | "metadata": {},
683 | "source": [
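684 | "**Preparing the embedding matrix**  \n",
685 | "We now prepare the embedding matrix (word-vector matrix) for the model. Keras requires a matrix of shape $(\\mathrm{num\\_words}, \\mathrm{embedding\\_dim})$, where `num_words` is the number of words we use and `embedding_dim` is 300 for the pretrained word-vector model used here, so every word is represented by a vector of length 300.  \n",
686 | "Note that we use only the 50k most frequent words. The pretrained model has a vocabulary of 2.6 million words, and using all of them for a classification task would waste computation, because our training set is small: only 4k samples. With 100k, 200k or more training samples, we could consider using a larger vocabulary."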
687 | ]
688 | },
689 | {
690 | "cell_type": "code",
691 | "execution_count": 90,
692 | "metadata": {},
693 | "outputs": [
694 | {
695 | "data": {
696 | "text/plain": [
697 | "300"
698 | ]
699 | },
700 | "execution_count": 90,
701 | "metadata": {},
702 | "output_type": "execute_result"
703 | }
704 | ],
705 | "source": [
706 | "embedding_dim"
707 | ]
708 | },
709 | {
710 | "cell_type": "code",
711 | "execution_count": 55,
712 | "metadata": {},
713 | "outputs": [],
714 | "source": [
715 | "# Use only the first 50000 words\n",
716 | "num_words = 50000\n",
717 | "# Initialize embedding_matrix; it is passed to the Keras Embedding layer later\n",
718 | "embedding_matrix = np.zeros((num_words, embedding_dim))\n",
719 | "# embedding_matrix has shape [num_words, embedding_dim],\n",
720 | "# i.e. 50000 x 300\n",
721 | "for i in range(num_words):\n",
722 | " embedding_matrix[i,:] = cn_model[cn_model.index2word[i]]\n",
723 | "embedding_matrix = embedding_matrix.astype('float32')"
724 | ]
725 | },
726 | {
727 | "cell_type": "code",
728 | "execution_count": 56,
729 | "metadata": {},
730 | "outputs": [
731 | {
732 | "data": {
733 | "text/plain": [
734 | "300"
735 | ]
736 | },
737 | "execution_count": 56,
738 | "metadata": {},
739 | "output_type": "execute_result"
740 | }
741 | ],
742 | "source": [
743 | "# Check that the indices line up:\n",
744 | "# a result of 300 means all 300 components of the embedding vector match\n",
745 | "np.sum( cn_model[cn_model.index2word[333]] == embedding_matrix[333] )"
746 | ]
747 | },
748 | {
749 | "cell_type": "code",
750 | "execution_count": 57,
751 | "metadata": {},
752 | "outputs": [
753 | {
754 | "data": {
755 | "text/plain": [
756 | "(50000, 300)"
757 | ]
758 | },
759 | "execution_count": 57,
760 | "metadata": {},
761 | "output_type": "execute_result"
762 | }
763 | ],
764 | "source": [
765 | "# Shape of embedding_matrix;\n",
766 | "# Keras expects this shape, and the model uses it below\n",
767 | "embedding_matrix.shape"
768 | ]
769 | },
770 | {
771 | "cell_type": "markdown",
772 | "metadata": {},
773 | "source": [
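774 | "**Padding and truncating**  \n",
775 | "After converting text to tokens (indices), the sequences have different lengths, so we standardize the length to make training easier. Above we chose 236, which covers about 95% of the training samples. We now pad and truncate with the 'pre' mode, which places the zeros in front of the token indices; according to practice reported in some studies, padding zeros after the text can hurt the model."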
776 | ]
777 | },
778 | {
779 | "cell_type": "code",
780 | "execution_count": 58,
781 | "metadata": {},
782 | "outputs": [],
783 | "source": [
784 | "# Pad and truncate; the input train_tokens is a list\n",
785 | "# and the returned train_pad is a numpy array\n",
786 | "train_pad = pad_sequences(train_tokens, maxlen=max_tokens,\n",
787 | " padding='pre', truncating='pre')"
788 | ]
789 | },
790 | {
791 | "cell_type": "code",
792 | "execution_count": 59,
793 | "metadata": {},
794 | "outputs": [],
795 | "source": [
796 | "# Replace indices outside the 50,000-word vocabulary with 0\n",
797 | "train_pad[ train_pad>=num_words ] = 0"
798 | ]
799 | },
800 | {
801 | "cell_type": "code",
802 | "execution_count": 60,
803 | "metadata": {},
804 | "outputs": [
805 | {
806 | "data": {
807 | "text/plain": [
808 | "array([ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
809 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
810 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
811 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
812 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
813 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
814 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
815 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
816 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
817 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
818 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
819 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
820 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
821 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
822 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
823 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
824 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
825 | " 290, 3053, 57, 169, 73, 1, 25, 11216, 49,\n",
826 | " 163, 15985, 0, 0, 30, 8, 0, 1, 228,\n",
827 | " 223, 40, 35, 653, 0, 5, 1642, 29, 11216,\n",
828 | " 2751, 500, 98, 30, 3159, 2225, 2146, 371, 6285,\n",
829 | " 169, 27396, 1, 1191, 5432, 1080, 20055, 57, 562,\n",
830 | " 1, 22671, 40, 35, 169, 2567, 0, 42665, 7761,\n",
831 | " 110, 0, 0, 41281, 0, 110, 0, 35891, 110,\n",
832 | " 0, 28781, 57, 169, 1419, 1, 11670, 0, 19470,\n",
833 | " 1, 0, 0, 169, 35071, 40, 562, 35, 12398,\n",
834 | " 657, 4857])"
835 | ]
836 | },
837 | "execution_count": 60,
838 | "metadata": {},
839 | "output_type": "execute_result"
840 | }
841 | ],
842 | "source": [
843 | "# After padding, the leading positions are all 0 and the text sits at the end\n",
844 | "train_pad[33]"
845 | ]
846 | },
847 | {
848 | "cell_type": "code",
849 | "execution_count": 61,
850 | "metadata": {},
851 | "outputs": [],
852 | "source": [
853 | "# Build the target vector: 1 for the first 2000 samples, 0 for the last 2000\n",
854 | "train_target = np.concatenate( (np.ones(2000),np.zeros(2000)) )"
855 | ]
856 | },
857 | {
858 | "cell_type": "code",
859 | "execution_count": 62,
860 | "metadata": {},
861 | "outputs": [],
862 | "source": [
863 | "# Split the samples into training and test sets\n",
864 | "from sklearn.model_selection import train_test_split"
865 | ]
866 | },
867 | {
868 | "cell_type": "code",
869 | "execution_count": 63,
870 | "metadata": {},
871 | "outputs": [],
872 | "source": [
873 | "# 90% of the samples are used for training, the remaining 10% for testing\n",
874 | "X_train, X_test, y_train, y_test = train_test_split(train_pad,\n",
875 | " train_target,\n",
876 | " test_size=0.1,\n",
877 | " random_state=12)"
878 | ]
879 | },
880 | {
881 | "cell_type": "code",
882 | "execution_count": 64,
883 | "metadata": {},
884 | "outputs": [
885 | {
886 | "name": "stdout",
887 | "output_type": "stream",
888 | "text": [
889 | " 房间很大还有海景阳台走出酒店就是沙滩非常不错唯一遗憾的就是不能刷 不方便\n",
890 | "class: 1.0\n"
891 | ]
892 | }
893 | ],
894 | "source": [
895 | "# Inspect a training sample as a sanity check\n",
896 | "print(reverse_tokens(X_train[35]))\n",
897 | "print('class: ',y_train[35])"
898 | ]
899 | },
900 | {
901 | "cell_type": "markdown",
902 | "metadata": {},
903 | "source": [
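904 | "Now we build the LSTM model with Keras. The first layer is an Embedding layer: a neural network can only process the text once the token indices have been converted into word vectors.\n",
905 | "Keras provides the Embedding layer, which spares us tedious sparse-matrix operations.  \n",
906 | "The input to the Embedding layer has shape: $$(\\mathrm{batch\\_size}, \\mathrm{max\\_tokens})$$\n",
907 | "and its output has shape: $$(\\mathrm{batch\\_size}, \\mathrm{max\\_tokens}, \\mathrm{embedding\\_dim})$$"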
908 | ]
909 | },
910 | {
911 | "cell_type": "code",
912 | "execution_count": 65,
913 | "metadata": {},
914 | "outputs": [],
915 | "source": [
916 | "# Classify the samples with an LSTM\n",
917 | "model = Sequential()"
918 | ]
919 | },
920 | {
921 | "cell_type": "code",
922 | "execution_count": 66,
923 | "metadata": {},
924 | "outputs": [],
925 | "source": [
926 | "# The first layer of the model is the embedding\n",
927 | "model.add(Embedding(num_words,\n",
928 | " embedding_dim,\n",
929 | " weights=[embedding_matrix],\n",
930 | " input_length=max_tokens,\n",
931 | " trainable=False))"
932 | ]
933 | },
934 | {
935 | "cell_type": "code",
936 | "execution_count": 67,
937 | "metadata": {},
938 | "outputs": [],
939 | "source": [
940 | "model.add(Bidirectional(LSTM(units=32, return_sequences=True)))\n",
941 | "model.add(LSTM(units=16, return_sequences=False))"
942 | ]
943 | },
944 | {
945 | "cell_type": "markdown",
946 | "metadata": {},
947 | "source": [
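948 | "**Building the model**  \n",
949 | "I tried several network architectures in this tutorial. Because the training set is small, we can experiment freely; training does not take long:  \n",
950 | "**GRU:** With a GRU, the test set reaches 87% accuracy. However, when I tried my own sentences, the outputs of the final activation were all around 0.5, meaning the model was not confident in its decisions, and it sometimes misjudged negated sentences. We want the output to be close to 0 for negative samples and close to 1 for positive ones, not hovering around 0.5.  \n",
951 | "**BiLSTM:** I tested LSTM and BiLSTM. BiLSTM performed best and LSTM slightly better than GRU, possibly because BiLSTM retains longer sentence structures better. Interested readers can dig deeper.  \n",
952 | "After the Embedding layer, the first layer is a BiLSTM that returns sequences. The second layer is a 16-unit LSTM that returns only its final output, not the sequence. Last comes a fully connected layer with a sigmoid activation that produces the result."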
953 | ]
954 | },
955 | {
956 | "cell_type": "code",
957 | "execution_count": 68,
958 | "metadata": {},
959 | "outputs": [],
960 | "source": [
961 | "# GRU variant (commented out)\n",
962 | "# model.add(GRU(units=32, return_sequences=True))\n",
963 | "# model.add(GRU(units=16, return_sequences=True))\n",
964 | "# model.add(GRU(units=4, return_sequences=False))"
965 | ]
966 | },
967 | {
968 | "cell_type": "code",
969 | "execution_count": 69,
970 | "metadata": {},
971 | "outputs": [],
972 | "source": [
973 | "model.add(Dense(1, activation='sigmoid'))\n",
974 | "# Optimize with Adam at a learning rate of 0.001\n",
975 | "optimizer = Adam(lr=1e-3)"
976 | ]
977 | },
978 | {
979 | "cell_type": "code",
980 | "execution_count": 70,
981 | "metadata": {},
982 | "outputs": [],
983 | "source": [
984 | "model.compile(loss='binary_crossentropy',\n",
985 | " optimizer=optimizer,\n",
986 | " metrics=['accuracy'])"
987 | ]
988 | },
989 | {
990 | "cell_type": "code",
991 | "execution_count": 71,
992 | "metadata": {},
993 | "outputs": [
994 | {
995 | "name": "stdout",
996 | "output_type": "stream",
997 | "text": [
998 | "_________________________________________________________________\n",
999 | "Layer (type) Output Shape Param # \n",
1000 | "=================================================================\n",
1001 | "embedding_1 (Embedding) (None, 236, 300) 15000000 \n",
1002 | "_________________________________________________________________\n",
1003 | "bidirectional_1 (Bidirection (None, 236, 64) 85248 \n",
1004 | "_________________________________________________________________\n",
1005 | "lstm_2 (LSTM) (None, 16) 5184 \n",
1006 | "_________________________________________________________________\n",
1007 | "dense_1 (Dense) (None, 1) 17 \n",
1008 | "=================================================================\n",
1009 | "Total params: 15,090,449\n",
1010 | "Trainable params: 90,449\n",
1011 | "Non-trainable params: 15,000,000\n",
1012 | "_________________________________________________________________\n"
1013 | ]
1014 | }
1015 | ],
1016 | "source": [
1017 | "# Inspect the model structure: about 90k trainable parameters\n",
1018 | "model.summary()"
1019 | ]
1020 | },
1021 | {
1022 | "cell_type": "code",
1023 | "execution_count": 72,
1024 | "metadata": {},
1025 | "outputs": [],
1026 | "source": [
1027 | "# Set up a checkpoint for the model weights\n",
1028 | "path_checkpoint = 'sentiment_checkpoint.keras'\n",
1029 | "checkpoint = ModelCheckpoint(filepath=path_checkpoint, monitor='val_loss',\n",
1030 | " verbose=1, save_weights_only=True,\n",
1031 | " save_best_only=True)"
1032 | ]
1033 | },
1034 | {
1035 | "cell_type": "code",
1036 | "execution_count": 73,
1037 | "metadata": {},
1038 | "outputs": [],
1039 | "source": [
1040 | "# Try to load previously trained weights\n",
1041 | "try:\n",
1042 | " model.load_weights(path_checkpoint)\n",
1043 | "except Exception as e:\n",
1044 | " print(e)"
1045 | ]
1046 | },
1047 | {
1048 | "cell_type": "code",
1049 | "execution_count": 74,
1050 | "metadata": {},
1051 | "outputs": [],
1052 | "source": [
1053 | "# Early stopping: halt training if the validation loss has not improved for 3 epochs\n",
1054 | "earlystopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1)"
1055 | ]
1056 | },
1057 | {
1058 | "cell_type": "code",
1059 | "execution_count": 75,
1060 | "metadata": {},
1061 | "outputs": [],
1062 | "source": [
1063 | "# Automatically reduce the learning rate\n",
1064 | "lr_reduction = ReduceLROnPlateau(monitor='val_loss',\n",
1065 | " factor=0.1, min_lr=1e-5, patience=0,\n",
1066 | " verbose=1)"
1067 | ]
1068 | },
1069 | {
1070 | "cell_type": "code",
1071 | "execution_count": 76,
1072 | "metadata": {},
1073 | "outputs": [],
1074 | "source": [
1075 | "# Collect the callbacks\n",
1076 | "callbacks = [\n",
1077 | " earlystopping, \n",
1078 | " checkpoint,\n",
1079 | " lr_reduction\n",
1080 | "]"
1081 | ]
1082 | },
1083 | {
1084 | "cell_type": "code",
1085 | "execution_count": 77,
1086 | "metadata": {
1087 | "scrolled": false
1088 | },
1089 | "outputs": [],
1090 | "source": [
1091 | "# Start training\n",
1092 | "model.fit(X_train, y_train,\n",
1093 | " validation_split=0.1, \n",
1094 | " epochs=20,\n",
1095 | " batch_size=128,\n",
1096 | " callbacks=callbacks)"
1097 | ]
1098 | },
1099 | {
1100 | "cell_type": "markdown",
1101 | "metadata": {},
1102 | "source": [
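1103 | "**Conclusion**  \n",
1104 | "We first evaluate on the test samples and get a reasonably satisfying accuracy.  \n",
1105 | "We then define a prediction function that predicts the polarity of an input text. The model judges negated sentences and some simple logical structures correctly."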
1106 | ]
1107 | },
1108 | {
1109 | "cell_type": "code",
1110 | "execution_count": 78,
1111 | "metadata": {},
1112 | "outputs": [
1113 | {
1114 | "name": "stdout",
1115 | "output_type": "stream",
1116 | "text": [
1117 | "400/400 [==============================] - 5s 12ms/step\n",
1118 | "Accuracy:87.50%\n"
1119 | ]
1120 | }
1121 | ],
1122 | "source": [
1123 | "result = model.evaluate(X_test, y_test)\n",
1124 | "print('Accuracy:{0:.2%}'.format(result[1]))"
1125 | ]
1126 | },
1127 | {
1128 | "cell_type": "code",
1129 | "execution_count": 79,
1130 | "metadata": {},
1131 | "outputs": [],
1132 | "source": [
1133 | "def predict_sentiment(text):\n",
1134 | " print(text)\n",
1135 | " # Remove punctuation\n",
1136 | " text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\",text)\n",
1137 | " # Segment into words\n",
1138 | " cut = jieba.cut(text)\n",
1139 | " cut_list = [ i for i in cut ]\n",
1140 | " # tokenize\n",
1141 | " for i, word in enumerate(cut_list):\n",
1142 | " try:\n",
1143 | " cut_list[i] = cn_model.vocab[word].index\n",
1144 | " except KeyError:\n",
1145 | " cut_list[i] = 0\n",
1146 | " # padding\n",
1147 | " tokens_pad = pad_sequences([cut_list], maxlen=max_tokens,\n",
1148 | " padding='pre', truncating='pre')\n",
1149 | " tokens_pad[tokens_pad >= num_words] = 0 # zero out-of-vocabulary indices as in training, then predict\n",
1150 | " result = model.predict(x=tokens_pad)\n",
1151 | " coef = result[0][0]\n",
1152 | " if coef >= 0.5:\n",
1153 | " print('Positive review','output=%.2f'%coef)\n",
1154 | " else:\n",
1155 | " print('Negative review','output=%.2f'%coef)"
1156 | ]
1157 | },
1158 | {
1159 | "cell_type": "code",
1160 | "execution_count": 80,
1161 | "metadata": {},
1162 | "outputs": [
1163 | {
1164 | "name": "stdout",
1165 | "output_type": "stream",
1166 | "text": [
1167 | "酒店设施不是新的,服务态度很不好\n",
1168 | "Negative review output=0.14\n",
1169 | "酒店卫生条件非常不好\n",
1170 | "Negative review output=0.09\n",
1171 | "床铺非常舒适\n",
1172 | "Positive review output=0.76\n",
1173 | "房间很凉,不给开暖气\n",
1174 | "Negative review output=0.17\n",
1175 | "房间很凉爽,空调冷气很足\n",
1176 | "Positive review output=0.66\n",
1177 | "酒店环境不好,住宿体验很不好\n",
1178 | "Negative review output=0.06\n",
1179 | "房间隔音不到位\n",
1180 | "Negative review output=0.17\n",
1181 | "晚上回来发现没有打扫卫生\n",
1182 | "Negative review output=0.25\n",
1183 | "因为过节所以要我临时加钱,比团购的价格贵\n",
1184 | "Negative review output=0.06\n"
1185 | ]
1186 | }
1187 | ],
1188 | "source": [
1189 | "test_list = [\n",
1190 | " '酒店设施不是新的,服务态度很不好',\n",
1191 | " '酒店卫生条件非常不好',\n",
1192 | " '床铺非常舒适',\n",
1193 | " '房间很凉,不给开暖气',\n",
1194 | " '房间很凉爽,空调冷气很足',\n",
1195 | " '酒店环境不好,住宿体验很不好',\n",
1196 | " '房间隔音不到位' ,\n",
1197 | " '晚上回来发现没有打扫卫生',\n",
1198 | " '因为过节所以要我临时加钱,比团购的价格贵'\n",
1199 | "]\n",
1200 | "for text in test_list:\n",
1201 | " predict_sentiment(text)"
1202 | ]
1203 | },
1204 | {
1205 | "cell_type": "markdown",
1206 | "metadata": {},
1207 | "source": [
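1208 | "**Misclassified texts**  \n",
1209 | "On inspection, most of the misclassified texts are ambiguous, and even a human would find their polarity hard to judge. The sentence at index 101, for example, does not seem to express any satisfaction, yet it is labeled as a positive review in the data; our model's negative prediction looks reasonable."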
1210 | ]
1211 | },
1212 | {
1213 | "cell_type": "code",
1214 | "execution_count": 81,
1215 | "metadata": {},
1216 | "outputs": [],
1217 | "source": [
1218 | "y_pred = model.predict(X_test)\n",
1219 | "y_pred = y_pred.T[0]\n",
1220 | "y_pred = [1 if p>= 0.5 else 0 for p in y_pred]\n",
1221 | "y_pred = np.array(y_pred)"
1222 | ]
1223 | },
1224 | {
1225 | "cell_type": "code",
1226 | "execution_count": 82,
1227 | "metadata": {},
1228 | "outputs": [],
1229 | "source": [
1230 | "y_actual = np.array(y_test)"
1231 | ]
1232 | },
1233 | {
1234 | "cell_type": "code",
1235 | "execution_count": 83,
1236 | "metadata": {},
1237 | "outputs": [],
1238 | "source": [
1239 | "# Indices of the misclassified samples\n",
1240 | "misclassified = np.where( y_pred != y_actual )[0]"
1241 | ]
1242 | },
1243 | {
1244 | "cell_type": "code",
1245 | "execution_count": 92,
1246 | "metadata": {
1247 | "scrolled": true
1248 | },
1249 | "outputs": [
1250 | {
1251 | "name": "stdout",
1252 | "output_type": "stream",
1253 | "text": [
1254 | "400\n"
1255 | ]
1256 | }
1257 | ],
1258 | "source": [
1259 | "# len(misclassified) is evaluated but not displayed; only the test-set size is printed\n",
1260 | "len(misclassified)\n",
1261 | "print(len(X_test))"
1262 | ]
1263 | },
1264 | {
1265 | "cell_type": "code",
1266 | "execution_count": 85,
1267 | "metadata": {},
1268 | "outputs": [
1269 | {
1270 | "name": "stdout",
1271 | "output_type": "stream",
1272 | "text": [
1273 | " 由于2007年 有一些新问题可能还没来得及解决我因为工作需要经常要住那里所以慎重的提出以下 :1 后 的 淋浴喷头的位置都太高我换了房间还是一样很不好用2 后的一些管理和服务还很不到位尤其是前台入住和 时代效率太低每次 都超过10分钟好像不符合 宾馆的要求\n",
1274 | "Predicted class 0\n",
1275 | "Actual class 1.0\n"
1276 | ]
1277 | }
1278 | ],
1279 | "source": [
1280 | "# Look at one of the misclassified samples\n",
1281 | "idx=101\n",
1282 | "print(reverse_tokens(X_test[idx]))\n",
1283 | "print('Predicted class', y_pred[idx])\n",
1284 | "print('Actual class', y_actual[idx])"
1285 | ]
1286 | },
1287 | {
1288 | "cell_type": "code",
1289 | "execution_count": 86,
1290 | "metadata": {},
1291 | "outputs": [
1292 | {
1293 | "name": "stdout",
1294 | "output_type": "stream",
1295 | "text": [
1296 | " 还是很 设施也不错但是 和以前 比急剧下滑了 和客房 的服务极差幸好我不是很在乎\n",
1297 | "Predicted class 0\n",
1298 | "Actual class 1.0\n"
1299 | ]
1300 | }
1301 | ],
1302 | "source": [
1303 | "idx=1\n",
1304 | "print(reverse_tokens(X_test[idx]))\n",
1305 | "print('Predicted class', y_pred[idx])\n",
1306 | "print('Actual class', y_actual[idx])"
1307 | ]
1308 | },
1309 | {
1310 | "cell_type": "code",
1311 | "execution_count": null,
1312 | "metadata": {},
1313 | "outputs": [],
1314 | "source": []
1315 | },
1316 | {
1317 | "cell_type": "code",
1318 | "execution_count": null,
1319 | "metadata": {},
1320 | "outputs": [],
1321 | "source": []
1322 | },
1323 | {
1324 | "cell_type": "code",
1325 | "execution_count": null,
1326 | "metadata": {},
1327 | "outputs": [],
1328 | "source": []
1329 | }
1330 | ],
1331 | "metadata": {
1332 | "kernelspec": {
1333 | "display_name": "Python 3",
1334 | "language": "python",
1335 | "name": "python3"
1336 | },
1337 | "language_info": {
1338 | "codemirror_mode": {
1339 | "name": "ipython",
1340 | "version": 3
1341 | },
1342 | "file_extension": ".py",
1343 | "mimetype": "text/x-python",
1344 | "name": "python",
1345 | "nbconvert_exporter": "python",
1346 | "pygments_lexer": "ipython3",
1347 | "version": "3.6.5"
1348 | }
1349 | },
1350 | "nbformat": 4,
1351 | "nbformat_minor": 2
1352 | }
1353 |
--------------------------------------------------------------------------------
/语料.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/语料.zip
--------------------------------------------------------------------------------