├── 2019新版debug之后--中文自然语言处理--情感分析.ipynb ├── README.md ├── flowchart.jpg ├── neg ├── neg.0.txt ├── neg.1.txt ├── neg.10.txt ├── neg.1000.txt ├── neg.1001.txt ├── neg.1002.txt ├── neg.1003.txt ├── neg.1004.txt ├── neg.1005.txt ├── neg.1006.txt ├── neg.1007.txt ├── neg.1009.txt ├── neg.101.txt ├── neg.1010.txt ├── neg.1011.txt ├── neg.1012.txt ├── neg.1013.txt ├── neg.1014.txt ├── neg.1015.txt ├── neg.1017.txt ├── neg.1018.txt ├── neg.1019.txt ├── neg.102.txt ├── neg.1020.txt ├── neg.1022.txt ├── neg.1025.txt ├── neg.1026.txt ├── neg.1027.txt ├── neg.1028.txt ├── neg.1029.txt ├── neg.103.txt ├── neg.1030.txt ├── neg.1032.txt ├── neg.1033.txt ├── neg.1034.txt ├── neg.1035.txt ├── neg.1036.txt ├── neg.1038.txt ├── neg.1039.txt ├── neg.104.txt ├── neg.1040.txt ├── neg.1041.txt ├── neg.1042.txt ├── neg.1047.txt ├── neg.1048.txt ├── neg.1049.txt ├── neg.105.txt ├── neg.1050.txt ├── neg.1052.txt ├── neg.1053.txt ├── neg.1054.txt ├── neg.1055.txt ├── neg.1056.txt ├── neg.1057.txt ├── neg.1058.txt ├── neg.1059.txt ├── neg.106.txt ├── neg.1060.txt ├── neg.1061.txt ├── neg.1062.txt ├── neg.1063.txt ├── neg.1066.txt ├── neg.1067.txt ├── neg.1069.txt ├── neg.107.txt ├── neg.1070.txt ├── neg.1071.txt ├── neg.1072.txt ├── neg.1073.txt ├── neg.1074.txt ├── neg.1075.txt ├── neg.1076.txt ├── neg.1077.txt ├── neg.1078.txt ├── neg.1079.txt ├── neg.108.txt ├── neg.1080.txt ├── neg.1081.txt ├── neg.1082.txt ├── neg.1083.txt ├── neg.1084.txt ├── neg.1085.txt ├── neg.1086.txt ├── neg.1087.txt ├── neg.1088.txt ├── neg.1089.txt ├── neg.109.txt ├── neg.1090.txt ├── neg.1091.txt ├── neg.1092.txt ├── neg.1093.txt ├── neg.1094.txt ├── neg.1095.txt ├── neg.1096.txt ├── neg.1097.txt ├── neg.1098.txt ├── neg.1099.txt ├── neg.11.txt └── neg.110.txt ├── negative_samples.txt ├── pos ├── pos.10.txt ├── pos.100.txt ├── pos.1000.txt ├── pos.1001.txt ├── pos.1002.txt ├── pos.1003.txt ├── pos.1004.txt ├── pos.1005.txt ├── pos.1006.txt ├── pos.1007.txt ├── pos.1008.txt ├── pos.1009.txt ├── pos.101.txt ├── 
pos.1010.txt ├── pos.1012.txt ├── pos.1013.txt ├── pos.1014.txt ├── pos.1015.txt ├── pos.1016.txt ├── pos.1017.txt ├── pos.1018.txt ├── pos.1019.txt ├── pos.102.txt ├── pos.1020.txt ├── pos.1021.txt ├── pos.1022.txt ├── pos.1023.txt ├── pos.1024.txt ├── pos.1025.txt ├── pos.1026.txt ├── pos.1027.txt ├── pos.1028.txt ├── pos.1029.txt ├── pos.103.txt ├── pos.1030.txt ├── pos.1031.txt ├── pos.1032.txt ├── pos.1033.txt ├── pos.1034.txt ├── pos.1035.txt ├── pos.1036.txt ├── pos.1037.txt ├── pos.1038.txt ├── pos.1039.txt ├── pos.104.txt ├── pos.1040.txt ├── pos.1041.txt ├── pos.1042.txt ├── pos.1043.txt ├── pos.1044.txt ├── pos.1045.txt ├── pos.1046.txt ├── pos.1047.txt ├── pos.1048.txt ├── pos.1049.txt ├── pos.105.txt ├── pos.1050.txt ├── pos.1051.txt ├── pos.1052.txt ├── pos.1053.txt ├── pos.1054.txt ├── pos.1055.txt ├── pos.1056.txt ├── pos.1057.txt ├── pos.1058.txt ├── pos.1059.txt ├── pos.106.txt ├── pos.1060.txt ├── pos.1061.txt ├── pos.1062.txt ├── pos.1063.txt ├── pos.1064.txt ├── pos.1065.txt ├── pos.107.txt ├── pos.1073.txt ├── pos.1074.txt ├── pos.1075.txt ├── pos.1076.txt ├── pos.1077.txt ├── pos.1078.txt ├── pos.1079.txt ├── pos.108.txt ├── pos.1080.txt ├── pos.1081.txt ├── pos.1082.txt ├── pos.1083.txt ├── pos.1084.txt ├── pos.1085.txt ├── pos.1086.txt ├── pos.1087.txt ├── pos.1088.txt ├── pos.1089.txt ├── pos.109.txt ├── pos.1090.txt ├── pos.1091.txt ├── pos.1093.txt ├── pos.1094.txt ├── pos.1095.txt ├── pos.1096.txt └── pos.1097.txt ├── positive_samples.txt ├── 中文自然语言处理--情感分析.ipynb └── 语料.zip /2019新版debug之后--中文自然语言处理--情感分析.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 用Tensorflow进行中文自然语言处理--情感分析" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "$$f('真好喝')=1$$\n", 15 | "$$f('太难喝了')=0$$" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": 
[ 22 | "**Introduction** \n", 23 | "Hello everyone, I am Espresso. This is the first tutorial I have made: a simple, hands-on classification exercise in Chinese natural language processing. \n", 24 | "Why make this tutorial? Although there is plenty of learning material for natural language processing today, and even more in English, the resources online are messy: systematic material for Chinese is scarce, the knowledge points are very scattered, and practical, end-to-end study material is lacking. Even where code exists, the lack of comments means it takes a long time to understand. In my own learning I spent a whole day searching the web before I had worked out the steps and the software needed to process Chinese. \n", 25 | "So I felt obliged to make a beginner tutorial that combines the scattered material into one practical case for fellow learners. This tutorial focuses on the practical side; for theory I recommend the deeplearning.ai courses. In the code below, wherever a topic comes up I recommend some study material and attach links. For any copyright concerns please e-mail: a66777@188.com. \n", 26 | "Also, I have not done any in-depth research in natural language processing, so feedback from experts is welcome; please point out shortcomings and ways to improve." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "**Required libraries** \n", 34 | "numpy \n", 35 | "jieba \n", 36 | "gensim \n", 37 | "tensorflow \n", 38 | "matplotlib " 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 1, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "# First load the necessary libraries\n", 48 | "%matplotlib inline\n", 49 | "import numpy as np\n", 50 | "import matplotlib.pyplot as plt\n", 51 | "import re\n", 52 | "import jieba # jieba Chinese word segmentation\n", 53 | "# gensim is used to load the pre-trained word vectors\n", 54 | "from gensim.models import KeyedVectors\n", 55 | "import warnings\n", 56 | "warnings.filterwarnings(\"ignore\")\n", 57 | "# bz2 is used for decompression\n", 58 | "import bz2" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "**Pre-trained word vectors** \n", 66 | "This tutorial uses \"chinese-word-vectors\", open-sourced by researchers at the Institute of Chinese Information Processing of Beijing Normal University and the DBIIR lab of Renmin University of China. GitHub link: \n", 67 | "https://github.com/Embedding/Chinese-Word-Vectors \n", 68 | "If you do not know what word2vec is, I recommend the following article: \n", 69 | "https://zhuanlan.zhihu.com/p/26306795 \n", 70 | "Here we use the \"chinese-word-vectors\" Zhihu Word + Ngram vectors, which can be downloaded from the GitHub link above. We first load the pre-trained model and run a few simple tests:" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 2, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# Place the downloaded word-vector archive in the embeddings folder under the project root\n", 80 | "# Decompress the word vectors; this may take 1-2 minutes\n", 81 | "with open(\"embeddings/sgns.zhihu.bigram\", 'wb') as new_file, open(\"embeddings/sgns.zhihu.bigram.bz2\", 'rb') as file:\n", 82 | " decompressor = bz2.BZ2Decompressor()\n", 
83 | " for data in iter(lambda : file.read(100 * 1024), b''):\n", 84 | " new_file.write(decompressor.decompress(data))" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 3, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "# Load the pre-trained Chinese word embeddings with gensim; this may take 1-2 minutes\n", 94 | "cn_model = KeyedVectors.load_word2vec_format('embeddings/sgns.zhihu.bigram', \n", 95 | " binary=False, unicode_errors=\"ignore\")" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "**The word-vector model** \n", 103 | "In this word-vector model every word is an index that maps to a vector of length 300. The LSTM network we will build cannot process Chinese text directly; the text must first be segmented into words and the words converted into word vectors. Please refer to the figure below for the steps, which will be explained along with the code. If you do not know what RNN, GRU and LSTM are, I recommend the deeplearning.ai courses; NetEase open courses offer a free version with Chinese subtitles, but I still recommend the Coursera original, which has the quizzes and programming exercises: \n", 104 | "" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 4, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "词向量的长度为300\n" 117 | ] 118 | }, 119 | { 120 | "data": { 121 | "text/plain": [ 122 | "array([-2.603470e-01, 3.677500e-01, -2.379650e-01, 5.301700e-02,\n", 123 | " -3.628220e-01, -3.212010e-01, -1.903330e-01, 1.587220e-01,\n", 124 | " -7.156200e-02, -4.625400e-02, -1.137860e-01, 3.515600e-01,\n", 125 | " -6.408200e-02, -2.184840e-01, 3.286950e-01, -7.110330e-01,\n", 126 | " 1.620320e-01, 1.627490e-01, 5.528180e-01, 1.016860e-01,\n", 127 | " 1.060080e-01, 7.820700e-01, -7.537310e-01, -2.108400e-02,\n", 128 | " -4.758250e-01, -1.130420e-01, -2.053000e-01, 6.624390e-01,\n", 129 | " 2.435850e-01, 9.171890e-01, -2.090610e-01, -5.290000e-02,\n", 130 | " -7.969340e-01, 2.394940e-01, -9.028100e-02, 1.537360e-01,\n", 131 | " -4.003980e-01, -2.456100e-02, -1.717860e-01, 2.037790e-01,\n", 132 | " -4.344710e-01, -3.850430e-01, -9.366000e-02, 3.775310e-01,\n", 133 | " 2.659690e-01, 8.879800e-02, 2.493440e-01, 4.914900e-02,\n", 134 | " 5.996000e-03, 3.586430e-01, -1.044960e-01, -5.838460e-01,\n", 135 | " 3.093280e-01, 
-2.828090e-01, -8.563400e-02, -5.745400e-02,\n", 136 | " -2.075230e-01, 2.845980e-01, 1.414760e-01, 1.678570e-01,\n", 137 | " 1.957560e-01, 7.782140e-01, -2.359000e-01, -6.833100e-02,\n", 138 | " 2.560170e-01, -6.906900e-02, -1.219620e-01, 2.683020e-01,\n", 139 | " 1.678810e-01, 2.068910e-01, 1.987520e-01, 6.720900e-02,\n", 140 | " -3.975290e-01, -7.123140e-01, 5.613200e-02, 2.586000e-03,\n", 141 | " 5.616910e-01, 1.157000e-03, -4.341190e-01, 1.977480e-01,\n", 142 | " 2.519540e-01, 8.835000e-03, -3.554600e-01, -1.573500e-02,\n", 143 | " -2.526010e-01, 9.355900e-02, -3.962500e-02, -1.628350e-01,\n", 144 | " 2.980950e-01, 1.647900e-01, -5.454270e-01, 3.888790e-01,\n", 145 | " 1.446840e-01, -7.239600e-02, -7.597800e-02, -7.803000e-03,\n", 146 | " 2.020520e-01, -4.424750e-01, 3.911580e-01, 2.115100e-01,\n", 147 | " 6.516760e-01, 5.668030e-01, 5.065500e-02, -1.259650e-01,\n", 148 | " -3.720640e-01, 2.330470e-01, 6.659900e-02, 8.300600e-02,\n", 149 | " 2.540460e-01, -5.279760e-01, -3.843280e-01, 3.366460e-01,\n", 150 | " 2.336500e-01, 3.564750e-01, -4.884160e-01, -1.183910e-01,\n", 151 | " 1.365910e-01, 2.293420e-01, -6.151930e-01, 5.212050e-01,\n", 152 | " 3.412000e-01, 5.757940e-01, 2.354480e-01, -3.641530e-01,\n", 153 | " 7.373400e-02, 1.007380e-01, -3.211410e-01, -3.040480e-01,\n", 154 | " -3.738440e-01, -2.515150e-01, 2.633890e-01, 3.995490e-01,\n", 155 | " 4.461880e-01, 1.641110e-01, 1.449590e-01, -4.191540e-01,\n", 156 | " 2.297840e-01, 6.710600e-02, 3.316430e-01, -6.026500e-02,\n", 157 | " -5.130610e-01, 1.472570e-01, 2.414060e-01, 2.011000e-03,\n", 158 | " -3.823410e-01, -1.356010e-01, 3.112300e-01, 9.177830e-01,\n", 159 | " -4.511630e-01, 1.272190e-01, -9.431600e-02, -8.216000e-03,\n", 160 | " -3.835440e-01, 2.589400e-02, 6.374980e-01, 4.931630e-01,\n", 161 | " -1.865070e-01, 4.076900e-01, -1.841000e-03, 2.213160e-01,\n", 162 | " 2.253950e-01, -2.159220e-01, -7.611480e-01, -2.305920e-01,\n", 163 | " 1.296890e-01, -1.304100e-01, -4.742270e-01, 2.275500e-02,\n", 
164 | " 4.255050e-01, 1.570280e-01, 2.975300e-02, 1.931830e-01,\n", 165 | " 1.304340e-01, -3.179800e-02, 1.516650e-01, -2.154310e-01,\n", 166 | " -4.681410e-01, 1.007326e+00, -6.698940e-01, -1.555240e-01,\n", 167 | " 1.797170e-01, 2.848660e-01, 6.216130e-01, 1.549510e-01,\n", 168 | " 6.225000e-02, -2.227800e-02, 2.561270e-01, -1.006380e-01,\n", 169 | " 2.807900e-02, 4.597710e-01, -4.077750e-01, -1.777390e-01,\n", 170 | " 1.920500e-02, -4.829300e-02, 4.714700e-02, -3.715200e-01,\n", 171 | " -2.995930e-01, -3.719710e-01, 4.622800e-02, -1.436460e-01,\n", 172 | " 2.532540e-01, -9.334000e-02, -4.957400e-02, -3.803850e-01,\n", 173 | " 5.970110e-01, 3.578450e-01, -6.826000e-02, 4.735200e-02,\n", 174 | " -3.707590e-01, -8.621300e-02, -2.556480e-01, -5.950440e-01,\n", 175 | " -4.757790e-01, 1.079320e-01, 9.858300e-02, 8.540300e-01,\n", 176 | " 3.518370e-01, -1.306360e-01, -1.541590e-01, 1.166775e+00,\n", 177 | " 2.048860e-01, 5.952340e-01, 1.158830e-01, 6.774400e-02,\n", 178 | " 6.793920e-01, -3.610700e-01, 1.697870e-01, 4.118530e-01,\n", 179 | " 4.731000e-03, -7.516530e-01, -9.833700e-02, -2.312220e-01,\n", 180 | " -7.043300e-02, 1.576110e-01, -4.780500e-02, -7.344390e-01,\n", 181 | " -2.834330e-01, 4.582690e-01, 3.957010e-01, -8.484300e-02,\n", 182 | " -3.472550e-01, 1.291660e-01, 3.838960e-01, -3.287600e-02,\n", 183 | " -2.802220e-01, 5.257030e-01, -3.609300e-02, -4.842220e-01,\n", 184 | " 3.690700e-02, 3.429560e-01, 2.902490e-01, -1.624650e-01,\n", 185 | " -7.513700e-02, 2.669300e-01, 5.778230e-01, -3.074020e-01,\n", 186 | " -2.183790e-01, -2.834050e-01, 1.350870e-01, 1.490070e-01,\n", 187 | " 1.438400e-02, -2.509040e-01, -3.376100e-01, 1.291880e-01,\n", 188 | " -3.808700e-01, -4.420520e-01, -2.512300e-01, -1.328990e-01,\n", 189 | " -1.211970e-01, 2.532660e-01, 2.757050e-01, -3.382040e-01,\n", 190 | " 1.178070e-01, 3.860190e-01, 5.277960e-01, 4.581920e-01,\n", 191 | " 1.502310e-01, 1.226320e-01, 2.768540e-01, -4.502080e-01,\n", 192 | " -1.992670e-01, 1.689100e-02, 
1.188860e-01, 3.502440e-01,\n", 193 | " -4.064770e-01, 2.610280e-01, -1.934990e-01, -1.625660e-01,\n", 194 | " 2.498400e-02, -1.867150e-01, -1.954400e-02, -2.281900e-01,\n", 195 | " -3.417670e-01, -5.222770e-01, -9.543200e-02, -3.500350e-01,\n", 196 | " 2.154600e-02, 2.318040e-01, 5.395310e-01, -4.223720e-01],\n", 197 | " dtype=float32)" 198 | ] 199 | }, 200 | "execution_count": 4, 201 | "metadata": {}, 202 | "output_type": "execute_result" 203 | } 204 | ], 205 | "source": [ 206 | "# 由此可见每一个词都对应一个长度为300的向量\n", 207 | "embedding_dim = cn_model['山东大学'].shape[0]\n", 208 | "print('词向量的长度为{}'.format(embedding_dim))\n", 209 | "cn_model['山东大学']" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "Cosine Similarity for Vector Space Models by Christian S. Perone\n", 217 | "http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 5, 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "data": { 227 | "text/plain": [ 228 | "0.66128117" 229 | ] 230 | }, 231 | "execution_count": 5, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | } 235 | ], 236 | "source": [ 237 | "# 计算相似度\n", 238 | "cn_model.similarity('橘子', '橙子')" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 6, 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "data": { 248 | "text/plain": [ 249 | "0.66128117" 250 | ] 251 | }, 252 | "execution_count": 6, 253 | "metadata": {}, 254 | "output_type": "execute_result" 255 | } 256 | ], 257 | "source": [ 258 | "# dot('橘子'/|'橘子'|, '橙子'/|'橙子'| )\n", 259 | "np.dot(cn_model['橘子']/np.linalg.norm(cn_model['橘子']), \n", 260 | "cn_model['橙子']/np.linalg.norm(cn_model['橙子']))" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 7, 266 | "metadata": {}, 267 | "outputs": [ 268 | { 269 | "data": { 270 | "text/plain": [ 271 | "[('高中', 
0.7247823476791382),\n", 272 | " ('本科', 0.6768535375595093),\n", 273 | " ('研究生', 0.6244412660598755),\n", 274 | " ('中学', 0.6088204979896545),\n", 275 | " ('大学本科', 0.595908522605896),\n", 276 | " ('初中', 0.5883588790893555),\n", 277 | " ('读研', 0.5778335332870483),\n", 278 | " ('职高', 0.5767995119094849),\n", 279 | " ('大学毕业', 0.5767451524734497),\n", 280 | " ('师范大学', 0.5708829760551453)]" 281 | ] 282 | }, 283 | "execution_count": 7, 284 | "metadata": {}, 285 | "output_type": "execute_result" 286 | } 287 | ], 288 | "source": [ 289 | "# Find the most similar words (cosine similarity)\n", 290 | "cn_model.most_similar(positive=['大学'], topn=10)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 8, 296 | "metadata": {}, 297 | "outputs": [ 298 | { 299 | "name": "stdout", 300 | "output_type": "stream", 301 | "text": [ 302 | "在 老师 会计师 程序员 律师 医生 老人 中:\n", 303 | "不是同一类别的词为: 老人\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "# Find the word that does not belong with the others\n", 309 | "test_words = '老师 会计师 程序员 律师 医生 老人'\n", 310 | "test_words_result = cn_model.doesnt_match(test_words.split())\n", 311 | "print('在 '+test_words+' 中:\\n不是同一类别的词为: %s' %test_words_result)" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 9, 317 | "metadata": {}, 318 | "outputs": [ 319 | { 320 | "data": { 321 | "text/plain": [ 322 | "[('出轨', 0.6100173592567444)]" 323 | ] 324 | }, 325 | "execution_count": 9, 326 | "metadata": {}, 327 | "output_type": "execute_result" 328 | } 329 | ], 330 | "source": [ 331 | "cn_model.most_similar(positive=['女人','劈腿'], negative=['男人'], topn=1)" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "**Training corpus** \n", 339 | "This tutorial uses Songbo Tan's hotel review corpus. Even this corpus was hard to find a download link for: one blog wanted points I did not know how to earn, another link turned out to be dead, and I finally managed to download it by pasting the link into Thunder. I hope more people will share resources in the future. \n", 340 | "The training samples are placed in two folders,\n", 341 | "pos and neg. Each folder contains 2000 txt files, each holding one review, giving 4000 training samples in total; a sample of this size counts as very tiny in NLP:" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 10, 347 | 
"metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "# Get the sample file lists; the samples are stored in two folders,\n", 351 | "# 'pos' for positive reviews and 'neg' for negative reviews\n", 352 | "# Each folder contains 2000 txt files, one review per file\n", 353 | "import os\n", 354 | "pos_txts = os.listdir('pos')\n", 355 | "neg_txts = os.listdir('neg')" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 11, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "name": "stdout", 365 | "output_type": "stream", 366 | "text": [ 367 | "样本总共: 4000\n" 368 | ] 369 | } 370 | ], 371 | "source": [ 372 | "print( '样本总共: '+ str(len(pos_txts) + len(neg_txts)) )" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 12, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "# Now put all the review texts into one list\n", 382 | "# This differs from the video course: many students reported that reading the files\n", 383 | "# as shown in the video produced garbled text, caused by the original files being encoded in gbk,\n", 384 | "# so I did some simple preprocessing here to avoid the garbled text\n", 385 | "train_texts_orig = []\n", 386 | "# The labels corresponding to the texts\n", 387 | "train_target = []\n", 388 | "with open(\"positive_samples.txt\", \"r\", encoding=\"utf-8\") as f:\n", 389 | " lines = f.readlines()\n", 390 | " for line in lines:\n", 391 | " dic = eval(line)\n", 392 | " train_texts_orig.append(dic[\"text\"])\n", 393 | " train_target.append(dic[\"label\"])\n", 394 | "\n", 395 | "with open(\"negative_samples.txt\", \"r\", encoding=\"utf-8\") as f:\n", 396 | " lines = f.readlines()\n", 397 | " for line in lines:\n", 398 | " dic = eval(line)\n", 399 | " train_texts_orig.append(dic[\"text\"])\n", 400 | " train_target.append(dic[\"label\"])" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 13, 406 | "metadata": {}, 407 | "outputs": [ 408 | { 409 | "data": { 410 | "text/plain": [ 411 | "4000" 412 | ] 413 | }, 414 | "execution_count": 13, 415 | "metadata": {}, 416 | "output_type": "execute_result" 417 | } 418 | ], 419 | "source": [ 420 | "len(train_texts_orig)" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 14, 426 | "metadata": 
{}, 427 | "outputs": [], 428 | "source": [ 429 | "# We use TensorFlow's Keras interface to build the model\n", 430 | "from tensorflow.python.keras.models import Sequential\n", 431 | "from tensorflow.python.keras.layers import Dense, GRU, Embedding, LSTM, Bidirectional\n", 432 | "from tensorflow.python.keras.preprocessing.text import Tokenizer\n", 433 | "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n", 434 | "from tensorflow.python.keras.optimizers import RMSprop\n", 435 | "from tensorflow.python.keras.optimizers import Adam\n", 436 | "from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "**Segmentation and tokenization** \n", 444 | "First we remove the punctuation from each sample and segment it with jieba. jieba returns a generator, which cannot be tokenized directly, so we convert the segmentation result into a list and index it. Each review thus becomes a sequence of index numbers that correspond to the words in the pre-trained word-vector model." 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 15, 450 | "metadata": {}, 451 | "outputs": [ 452 | { 453 | "name": "stderr", 454 | "output_type": "stream", 455 | "text": [ 456 | "Building prefix dict from the default dictionary ...\n", 457 | "WARNING: Logging before flag parsing goes to stderr.\n", 458 | "I0610 16:36:46.897187 9584 __init__.py:111] Building prefix dict from the default dictionary ...\n", 459 | "Loading model from cache C:\\Users\\OSCARZ~1\\AppData\\Local\\Temp\\jieba.cache\n", 460 | "I0610 16:36:46.899155 9584 __init__.py:131] Loading model from cache C:\\Users\\OSCARZ~1\\AppData\\Local\\Temp\\jieba.cache\n", 461 | "Loading model cost 0.535 seconds.\n", 462 | "I0610 16:36:47.432762 9584 __init__.py:163] Loading model cost 0.535 seconds.\n", 463 | "Prefix dict has been built succesfully.\n", 464 | "I0610 16:36:47.433753 9584 __init__.py:164] Prefix dict has been built succesfully.\n" 465 | ] 466 | } 467 | ], 468 | "source": [ 469 | "# Segment and tokenize the texts\n", 470 | "# train_tokens is a long list containing 4000 small lists, one per review\n", 471 | "train_tokens = []\n", 
472 | "for text in train_texts_orig:\n", 473 | " # Remove punctuation\n", 474 | " text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\",text)\n", 475 | " # Segment with jieba\n", 476 | " cut = jieba.cut(text)\n", 477 | " # jieba returns a generator,\n", 478 | " # so convert it into a list\n", 479 | " cut_list = [ i for i in cut ]\n", 480 | " for i, word in enumerate(cut_list):\n", 481 | " try:\n", 482 | " # Convert each word into its index\n", 483 | " cut_list[i] = cn_model.vocab[word].index\n", 484 | " except KeyError:\n", 485 | " # If the word is not in the vocabulary, use 0\n", 486 | " cut_list[i] = 0\n", 487 | " train_tokens.append(cut_list)" 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "**Standardizing the index length** \n", 495 | "The reviews have different lengths. If we simply took the longest review and padded all the other reviews to that length, we would waste a lot of computation, so we choose a compromise length." 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 16, 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "# Get the length of every token list\n", 505 | "num_tokens = [ len(tokens) for tokens in train_tokens ]\n", 506 | "num_tokens = np.array(num_tokens)" 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "execution_count": 17, 512 | "metadata": {}, 513 | "outputs": [ 514 | { 515 | "data": { 516 | "text/plain": [ 517 | "71.4495" 518 | ] 519 | }, 520 | "execution_count": 17, 521 | "metadata": {}, 522 | "output_type": "execute_result" 523 | } 524 | ], 525 | "source": [ 526 | "# Mean token length\n", 527 | "np.mean(num_tokens)" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": 18, 533 | "metadata": {}, 534 | "outputs": [ 535 | { 536 | "data": { 537 | "text/plain": [ 538 | "1540" 539 | ] 540 | }, 541 | "execution_count": 18, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "# Length of the longest review in tokens\n", 548 | "np.max(num_tokens)" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 19, 554 | "metadata": {}, 555 | "outputs": [ 556 | { 557 | "data": { 558 | "image/png": 
"[base64 PNG omitted: histogram of np.log(num_tokens), title 'Distribution of tokens length']\n", 559 | "text/plain": [ 560 | "
" 561 | ] 562 | }, 563 | "metadata": { 564 | "needs_background": "light" 565 | }, 566 | "output_type": "display_data" 567 | } 568 | ], 569 | "source": [ 570 | "plt.hist(np.log(num_tokens), bins = 100)\n", 571 | "plt.xlim((0,10))\n", 572 | "plt.ylabel('number of tokens')\n", 573 | "plt.xlabel('length of tokens')\n", 574 | "plt.title('Distribution of tokens length')\n", 575 | "plt.show()" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 20, 581 | "metadata": {}, 582 | "outputs": [ 583 | { 584 | "data": { 585 | "text/plain": [ 586 | "236" 587 | ] 588 | }, 589 | "execution_count": 20, 590 | "metadata": {}, 591 | "output_type": "execute_result" 592 | } 593 | ], 594 | "source": [ 595 | "# 取tokens平均值并加上两个tokens的标准差,\n", 596 | "# 假设tokens长度的分布为正态分布,则max_tokens这个值可以涵盖95%左右的样本\n", 597 | "max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)\n", 598 | "max_tokens = int(max_tokens)\n", 599 | "max_tokens" 600 | ] 601 | }, 602 | { 603 | "cell_type": "code", 604 | "execution_count": 21, 605 | "metadata": {}, 606 | "outputs": [ 607 | { 608 | "data": { 609 | "text/plain": [ 610 | "0.9565" 611 | ] 612 | }, 613 | "execution_count": 21, 614 | "metadata": {}, 615 | "output_type": "execute_result" 616 | } 617 | ], 618 | "source": [ 619 | "# 取tokens的长度为236时,大约95%的样本被涵盖\n", 620 | "# 我们对长度不足的进行padding,超长的进行修剪\n", 621 | "np.sum( num_tokens < max_tokens ) / len(num_tokens)" 622 | ] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": {}, 627 | "source": [ 628 | "**反向tokenize** \n", 629 | "我们定义一个function,用来把索引转换成可阅读的文本,这对于debug很重要。" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": 22, 635 | "metadata": {}, 636 | "outputs": [], 637 | "source": [ 638 | "# 用来将tokens转换为文本\n", 639 | "def reverse_tokens(tokens):\n", 640 | " text = ''\n", 641 | " for i in tokens:\n", 642 | " if i != 0:\n", 643 | " text = text + cn_model.index2word[i]\n", 644 | " else:\n", 645 | " text = text + ' '\n", 646 | " return text" 647 | ] 648 | }, 649 | 
{ 650 | "cell_type": "code", 651 | "execution_count": 23, 652 | "metadata": {}, 653 | "outputs": [], 654 | "source": [ 655 | "reverse = reverse_tokens(train_tokens[0])" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "以下可见,训练样本的极性并不是那么精准,比如说下面的样本,对早餐并不满意,但被定义为正面评价,这会迷惑我们的模型,不过我们暂时不对训练样本进行任何修改。" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 24, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "data": { 672 | "text/plain": [ 673 | "'早餐太差无论去多少人那边也不加食品的酒店应该重视一下这个问题了房间本身很好'" 674 | ] 675 | }, 676 | "execution_count": 24, 677 | "metadata": {}, 678 | "output_type": "execute_result" 679 | } 680 | ], 681 | "source": [ 682 | "# 经过tokenize再恢复成文本\n", 683 | "# 可见标点符号都没有了\n", 684 | "reverse" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": 25, 690 | "metadata": {}, 691 | "outputs": [ 692 | { 693 | "data": { 694 | "text/plain": [ 695 | "'早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。\\n\\n房间本身很好。'" 696 | ] 697 | }, 698 | "execution_count": 25, 699 | "metadata": {}, 700 | "output_type": "execute_result" 701 | } 702 | ], 703 | "source": [ 704 | "# 原始文本\n", 705 | "train_texts_orig[0]" 706 | ] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "metadata": {}, 711 | "source": [ 712 | "**准备Embedding Matrix** \n", 713 | "现在我们来为模型准备embedding matrix(词向量矩阵),根据keras的要求,我们需要准备一个维度为$(numwords, embeddingdim)$的矩阵,num words代表我们使用的词汇的数量,embedding dimension在我们现在使用的预训练词向量模型中是300,每一个词汇都用一个长度为300的向量表示。 \n", 714 | "注意我们只选择使用前50k个使用频率最高的词,在这个预训练词向量模型中,一共有260万词汇量,如果全部使用在分类问题上会很浪费计算资源,因为我们的训练样本很小,一共只有4k,如果我们有100k,200k甚至更多的训练样本时,在分类问题上可以考虑增加使用的词汇量。" 715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": 26, 720 | "metadata": {}, 721 | "outputs": [ 722 | { 723 | "data": { 724 | "text/plain": [ 725 | "300" 726 | ] 727 | }, 728 | "execution_count": 26, 729 | "metadata": {}, 730 | "output_type": "execute_result" 731 | } 732 | ], 733 | "source": [ 734 | "embedding_dim" 735 | ] 736
| }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 27, 740 | "metadata": {}, 741 | "outputs": [], 742 | "source": [ 743 | "# 只使用前50000个词\n", 744 | "num_words = 50000\n", 745 | "# 初始化embedding_matrix,之后在keras上进行应用\n", 746 | "embedding_matrix = np.zeros((num_words, embedding_dim))\n", 747 | "# embedding_matrix为一个 [num_words,embedding_dim] 的矩阵\n", 748 | "# 维度为 50000 * 300\n", 749 | "for i in range(num_words):\n", 750 | " embedding_matrix[i,:] = cn_model[cn_model.index2word[i]]\n", 751 | "embedding_matrix = embedding_matrix.astype('float32')" 752 | ] 753 | }, 754 | { 755 | "cell_type": "code", 756 | "execution_count": 28, 757 | "metadata": {}, 758 | "outputs": [ 759 | { 760 | "data": { 761 | "text/plain": [ 762 | "300" 763 | ] 764 | }, 765 | "execution_count": 28, 766 | "metadata": {}, 767 | "output_type": "execute_result" 768 | } 769 | ], 770 | "source": [ 771 | "# 检查index是否对应,\n", 772 | "# 输出300说明长度为300的embedding向量逐位一致\n", 773 | "np.sum( cn_model[cn_model.index2word[333]] == embedding_matrix[333] )" 774 | ] 775 | }, 776 | { 777 | "cell_type": "code", 778 | "execution_count": 29, 779 | "metadata": {}, 780 | "outputs": [ 781 | { 782 | "data": { 783 | "text/plain": [ 784 | "(50000, 300)" 785 | ] 786 | }, 787 | "execution_count": 29, 788 | "metadata": {}, 789 | "output_type": "execute_result" 790 | } 791 | ], 792 | "source": [ 793 | "# embedding_matrix的维度,\n", 794 | "# 这个维度为keras的要求,后续会在模型中用到\n", 795 | "embedding_matrix.shape" 796 | ] 797 | }, 798 | { 799 | "cell_type": "markdown", 800 | "metadata": {}, 801 | "source": [ 802 | "**padding(填充)和truncating(修剪)** \n", 803 | "我们把文本转换为tokens(索引)之后,每一串索引的长度并不相等,所以为了方便模型的训练我们需要把索引的长度标准化,上面我们选择了236这个可以涵盖95%训练样本的长度,接下来我们进行padding和truncating,我们一般采用'pre'的方法,这会在文本索引的前面填充0,因为根据一些研究资料中的实践,如果在文本索引后面填充0的话,会对模型造成一些不良影响。" 804 | ] 805 | }, 806 | { 807 | "cell_type": "code", 808 | "execution_count": 30, 809 | "metadata": {}, 810 | "outputs": [], 811 | "source": [ 812 | "# 进行padding和truncating, 输入的train_tokens是一个list\n", 813 |
"# 返回的train_pad是一个numpy array\n", 814 | "train_pad = pad_sequences(train_tokens, maxlen=max_tokens,\n", 815 | " padding='pre', truncating='pre')" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": 31, 821 | "metadata": {}, 822 | "outputs": [], 823 | "source": [ 824 | "# 超出五万个词向量的词用0代替\n", 825 | "train_pad[ train_pad>=num_words ] = 0" 826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": 32, 831 | "metadata": {}, 832 | "outputs": [ 833 | { 834 | "data": { 835 | "text/plain": [ 836 | "array([ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 837 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 838 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 839 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 840 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 841 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 842 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 843 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 844 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 845 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 846 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 847 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 848 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 849 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 850 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 851 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 852 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 853 | " 290, 3053, 57, 169, 73, 1, 25, 11216, 49,\n", 854 | " 163, 15985, 0, 0, 30, 8, 0, 1, 228,\n", 855 | " 223, 40, 35, 653, 0, 5, 1642, 29, 11216,\n", 856 | " 2751, 500, 98, 30, 3159, 2225, 2146, 371, 6285,\n", 857 | " 169, 27396, 1, 1191, 5432, 1080, 20055, 57, 562,\n", 858 | " 1, 22671, 40, 35, 169, 2567, 0, 42665, 7761,\n", 859 | " 110, 0, 0, 41281, 0, 110, 0, 35891, 110,\n", 860 | " 0, 28781, 57, 169, 1419, 1, 11670, 0, 19470,\n", 861 | " 1, 0, 0, 169, 35071, 40, 562, 35, 12398,\n", 862 | " 657, 4857])" 863 | ] 864 | }, 865 | "execution_count": 32, 866 | "metadata": {}, 867 | "output_type": "execute_result" 868 | } 869 | ], 870 | "source": [ 871 | "# 可见padding之后前面的tokens全变成0,文本在最后面\n", 872 | "train_pad[33]" 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | 
"execution_count": 33, 878 | "metadata": {}, 879 | "outputs": [], 880 | "source": [ 881 | "# 准备target向量,前2000样本为1,后2000为0\n", 882 | "train_target = np.array(train_target)" 883 | ] 884 | }, 885 | { 886 | "cell_type": "code", 887 | "execution_count": 34, 888 | "metadata": {}, 889 | "outputs": [], 890 | "source": [ 891 | "# 进行训练和测试样本的分割\n", 892 | "from sklearn.model_selection import train_test_split" 893 | ] 894 | }, 895 | { 896 | "cell_type": "code", 897 | "execution_count": 50, 898 | "metadata": {}, 899 | "outputs": [], 900 | "source": [ 901 | "# 90%的样本用来训练,剩余10%用来测试\n", 902 | "X_train, X_test, y_train, y_test = train_test_split(train_pad,\n", 903 | " train_target,\n", 904 | " test_size=0.1,\n", 905 | " random_state=12)" 906 | ] 907 | }, 908 | { 909 | "cell_type": "code", 910 | "execution_count": 36, 911 | "metadata": {}, 912 | "outputs": [ 913 | { 914 | "name": "stdout", 915 | "output_type": "stream", 916 | "text": [ 917 | " 房间很大还有海景阳台走出酒店就是沙滩非常不错唯一遗憾的就是不能刷 不方便\n", 918 | "class: 1\n" 919 | ] 920 | } 921 | ], 922 | "source": [ 923 | "# 查看训练样本,确认无误\n", 924 | "print(reverse_tokens(X_train[35]))\n", 925 | "print('class: ',y_train[35])" 926 | ] 927 | }, 928 | { 929 | "cell_type": "markdown", 930 | "metadata": {}, 931 | "source": [ 932 | "现在我们用keras搭建LSTM模型,模型的第一层是Embedding层,只有当我们把tokens索引转换为词向量矩阵之后,才可以用神经网络对文本进行处理。\n", 933 | "keras提供了Embedding接口,避免了繁琐的稀疏矩阵操作。 \n", 934 | "在Embedding层我们输入的矩阵为:$$(batchsize, maxtokens)$$\n", 935 | "输出矩阵为: $$(batchsize, maxtokens, embeddingdim)$$" 936 | ] 937 | }, 938 | { 939 | "cell_type": "code", 940 | "execution_count": 37, 941 | "metadata": {}, 942 | "outputs": [], 943 | "source": [ 944 | "# 用LSTM对样本进行分类\n", 945 | "model = Sequential()" 946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": 38, 951 | "metadata": {}, 952 | "outputs": [], 953 | "source": [ 954 | "# 模型第一层为embedding\n", 955 | "model.add(Embedding(num_words,\n", 956 | " embedding_dim,\n", 957 | " weights=[embedding_matrix],\n", 958 | " 
input_length=max_tokens,\n", 959 | " trainable=False))" 960 | ] 961 | }, 962 | { 963 | "cell_type": "code", 964 | "execution_count": 39, 965 | "metadata": {}, 966 | "outputs": [], 967 | "source": [ 968 | "# 在2019年6月10日修改了一些大坑的bug, 可能是数据的顺序变了, \n", 969 | "# 结果模型训练的效果没有去年最早的时候效果好了, \n", 970 | "# 有兴趣的同学可以调整一下模型参数, 看看会不会有更好的结果\n", 971 | "model.add(Bidirectional(LSTM(units=64, return_sequences=True)))\n", 972 | "model.add(LSTM(units=16, return_sequences=False))" 973 | ] 974 | }, 975 | { 976 | "cell_type": "markdown", 977 | "metadata": {}, 978 | "source": [ 979 | "**构建模型** \n", 980 | "我在这个教程中尝试了几种神经网络结构,因为训练样本比较少,所以我们可以尽情尝试,训练过程等待时间并不长: \n", 981 | "**GRU:**如果使用GRU的话,测试样本可以达到87%的准确率,但我测试自己的文本内容时发现,GRU最后一层激活函数的输出都在0.5左右,说明模型的判断不是很明确,信心比较低,而且经过测试发现模型对于否定句的判断有时会失误,我们期望对于负面样本输出接近0,正面样本接近1而不是都徘徊于0.5之间。 \n", 982 | "**BiLSTM:**测试了LSTM和BiLSTM,发现BiLSTM的表现最好,LSTM的表现略好于GRU,这可能是因为BiLSTM对于比较长的句子结构有更好的记忆,有兴趣的朋友可以深入研究一下。 \n", 983 | "Embedding之后第一层我们用BiLSTM返回sequences,然后第二层16个单元的LSTM不返回sequences,只返回最终结果,最后是一个全连接层,用sigmoid激活函数输出结果。" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": 40, 989 | "metadata": {}, 990 | "outputs": [], 991 | "source": [ 992 | "# GRU的代码\n", 993 | "# model.add(GRU(units=32, return_sequences=True))\n", 994 | "# model.add(GRU(units=16, return_sequences=True))\n", 995 | "# model.add(GRU(units=4, return_sequences=False))" 996 | ] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": 41, 1001 | "metadata": {}, 1002 | "outputs": [], 1003 | "source": [ 1004 | "model.add(Dense(1, activation='sigmoid'))\n", 1005 | "# 我们使用adam以0.001的learning rate进行优化\n", 1006 | "optimizer = Adam(lr=1e-3)" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "code", 1011 | "execution_count": 42, 1012 | "metadata": {}, 1013 | "outputs": [], 1014 | "source": [ 1015 | "model.compile(loss='binary_crossentropy',\n", 1016 | " optimizer=optimizer,\n", 1017 | " metrics=['accuracy'])" 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "code", 1022 |
"execution_count": 43, 1023 | "metadata": {}, 1024 | "outputs": [ 1025 | { 1026 | "name": "stdout", 1027 | "output_type": "stream", 1028 | "text": [ 1029 | "Model: \"sequential\"\n", 1030 | "_________________________________________________________________\n", 1031 | "Layer (type) Output Shape Param # \n", 1032 | "=================================================================\n", 1033 | "embedding (Embedding) (None, 236, 300) 15000000 \n", 1034 | "_________________________________________________________________\n", 1035 | "bidirectional (Bidirectional (None, 236, 128) 186880 \n", 1036 | "_________________________________________________________________\n", 1037 | "lstm_1 (LSTM) (None, 16) 9280 \n", 1038 | "_________________________________________________________________\n", 1039 | "dense (Dense) (None, 1) 17 \n", 1040 | "=================================================================\n", 1041 | "Total params: 15,196,177\n", 1042 | "Trainable params: 196,177\n", 1043 | "Non-trainable params: 15,000,000\n", 1044 | "_________________________________________________________________\n" 1045 | ] 1046 | } 1047 | ], 1048 | "source": [ 1049 | "# 我们来看一下模型的结构,一共90k左右可训练的变量\n", 1050 | "model.summary()" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": 44, 1056 | "metadata": {}, 1057 | "outputs": [], 1058 | "source": [ 1059 | "# 建立一个权重的存储点\n", 1060 | "path_checkpoint = 'sentiment_checkpoint.keras'\n", 1061 | "checkpoint = ModelCheckpoint(filepath=path_checkpoint, monitor='val_loss',\n", 1062 | " verbose=1, save_weights_only=True,\n", 1063 | " save_best_only=True)" 1064 | ] 1065 | }, 1066 | { 1067 | "cell_type": "code", 1068 | "execution_count": 45, 1069 | "metadata": {}, 1070 | "outputs": [ 1071 | { 1072 | "name": "stdout", 1073 | "output_type": "stream", 1074 | "text": [ 1075 | "Shapes (300, 256) and (300, 128) are incompatible\n" 1076 | ] 1077 | } 1078 | ], 1079 | "source": [ 1080 | "# 尝试加载已训练模型\n", 1081 | "try:\n", 1082 | " 
model.load_weights(path_checkpoint)\n", 1083 | "except Exception as e:\n", 1084 | " print(e)" 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": 46, 1090 | "metadata": {}, 1091 | "outputs": [], 1092 | "source": [ 1093 | "# 定义early stopping, 如果5个epoch内validation loss没有改善则停止训练\n", 1094 | "earlystopping = EarlyStopping(monitor='val_loss', patience=5, verbose=1)" 1095 | ] 1096 | }, 1097 | { 1098 | "cell_type": "code", 1099 | "execution_count": 47, 1100 | "metadata": {}, 1101 | "outputs": [], 1102 | "source": [ 1103 | "# 自动降低learning rate\n", 1104 | "lr_reduction = ReduceLROnPlateau(monitor='val_loss',\n", 1105 | " factor=0.1, min_lr=1e-8, patience=0,\n", 1106 | " verbose=1)" 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "execution_count": 48, 1112 | "metadata": {}, 1113 | "outputs": [], 1114 | "source": [ 1115 | "# 定义callback函数\n", 1116 | "callbacks = [\n", 1117 | " earlystopping, \n", 1118 | " checkpoint,\n", 1119 | " lr_reduction\n", 1120 | "]" 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": 49, 1126 | "metadata": { 1127 | "scrolled": false 1128 | }, 1129 | "outputs": [ 1130 | { 1131 | "name": "stdout", 1132 | "output_type": "stream", 1133 | "text": [ 1134 | "Train on 3240 samples, validate on 360 samples\n", 1135 | "Epoch 1/20\n", 1136 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.6847 - accuracy: 0.5694\n", 1137 | "Epoch 00001: val_loss improved from inf to 0.65662, saving model to sentiment_checkpoint.keras\n", 1138 | "3240/3240 [==============================] - 34s 10ms/sample - loss: 0.6846 - accuracy: 0.5698 - val_loss: 0.6566 - val_accuracy: 0.6639\n", 1139 | "Epoch 2/20\n", 1140 | "3200/3240 [============================>.]
- ETA: 0s - loss: 0.6266 - accuracy: 0.6562\n", 1141 | "Epoch 00002: val_loss improved from 0.65662 to 0.56397, saving model to sentiment_checkpoint.keras\n", 1142 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.6266 - accuracy: 0.6556 - val_loss: 0.5640 - val_accuracy: 0.7139\n", 1143 | "Epoch 3/20\n", 1144 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.5093 - accuracy: 0.7591\n", 1145 | "Epoch 00003: val_loss improved from 0.56397 to 0.51803, saving model to sentiment_checkpoint.keras\n", 1146 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.5100 - accuracy: 0.7583 - val_loss: 0.5180 - val_accuracy: 0.7556\n", 1147 | "Epoch 4/20\n", 1148 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4914 - accuracy: 0.7734\n", 1149 | "Epoch 00004: val_loss improved from 0.51803 to 0.43727, saving model to sentiment_checkpoint.keras\n", 1150 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4904 - accuracy: 0.7744 - val_loss: 0.4373 - val_accuracy: 0.8250\n", 1151 | "Epoch 5/20\n", 1152 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4549 - accuracy: 0.8006\n", 1153 | "Epoch 00005: val_loss did not improve from 0.43727\n", 1154 | "\n", 1155 | "Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.\n", 1156 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.4564 - accuracy: 0.7994 - val_loss: 0.4508 - val_accuracy: 0.8000\n", 1157 | "Epoch 6/20\n", 1158 | "3200/3240 [============================>.] 
- ETA: 0s - loss: 0.4261 - accuracy: 0.8206\n", 1159 | "Epoch 00006: val_loss did not improve from 0.43727\n", 1160 | "\n", 1161 | "Epoch 00006: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.\n", 1162 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.4260 - accuracy: 0.8210 - val_loss: 0.4374 - val_accuracy: 0.8139\n", 1163 | "Epoch 7/20\n", 1164 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4158 - accuracy: 0.8256\n", 1165 | "Epoch 00007: val_loss improved from 0.43727 to 0.43676, saving model to sentiment_checkpoint.keras\n", 1166 | "3240/3240 [==============================] - 31s 9ms/sample - loss: 0.4160 - accuracy: 0.8256 - val_loss: 0.4368 - val_accuracy: 0.8139\n", 1167 | "Epoch 8/20\n", 1168 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4163 - accuracy: 0.8222\n", 1169 | "Epoch 00008: val_loss improved from 0.43676 to 0.43648, saving model to sentiment_checkpoint.keras\n", 1170 | "3240/3240 [==============================] - 31s 9ms/sample - loss: 0.4142 - accuracy: 0.8241 - val_loss: 0.4365 - val_accuracy: 0.8139\n", 1171 | "Epoch 9/20\n", 1172 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4125 - accuracy: 0.8241\n", 1173 | "Epoch 00009: val_loss improved from 0.43648 to 0.43615, saving model to sentiment_checkpoint.keras\n", 1174 | "3240/3240 [==============================] - 31s 9ms/sample - loss: 0.4131 - accuracy: 0.8228 - val_loss: 0.4361 - val_accuracy: 0.8139\n", 1175 | "Epoch 10/20\n", 1176 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4126 - accuracy: 0.8241\n", 1177 | "Epoch 00010: val_loss improved from 0.43615 to 0.43576, saving model to sentiment_checkpoint.keras\n", 1178 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4120 - accuracy: 0.8247 - val_loss: 0.4358 - val_accuracy: 0.8167\n", 1179 | "Epoch 11/20\n", 1180 | "3200/3240 [============================>.] 
- ETA: 0s - loss: 0.4110 - accuracy: 0.8253\n", 1181 | "Epoch 00011: val_loss improved from 0.43576 to 0.43573, saving model to sentiment_checkpoint.keras\n", 1182 | "\n", 1183 | "Epoch 00011: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.\n", 1184 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4109 - accuracy: 0.8253 - val_loss: 0.4357 - val_accuracy: 0.8167\n", 1185 | "Epoch 12/20\n", 1186 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4112 - accuracy: 0.8241\n", 1187 | "Epoch 00012: val_loss improved from 0.43573 to 0.43573, saving model to sentiment_checkpoint.keras\n", 1188 | "\n", 1189 | "Epoch 00012: ReduceLROnPlateau reducing learning rate to 1.0000001111620805e-07.\n", 1190 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1191 | "Epoch 13/20\n", 1192 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4109 - accuracy: 0.8238\n", 1193 | "Epoch 00013: val_loss improved from 0.43573 to 0.43572, saving model to sentiment_checkpoint.keras\n", 1194 | "\n", 1195 | "Epoch 00013: ReduceLROnPlateau reducing learning rate to 1.000000082740371e-08.\n", 1196 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1197 | "Epoch 14/20\n", 1198 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4098 - accuracy: 0.8250\n", 1199 | "Epoch 00014: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n", 1200 | "\n", 1201 | "Epoch 00014: ReduceLROnPlateau reducing learning rate to 1e-08.\n", 1202 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1203 | "Epoch 15/20\n", 1204 | "3200/3240 [============================>.] 
- ETA: 0s - loss: 0.4090 - accuracy: 0.8253\n", 1205 | "Epoch 00015: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n", 1206 | "3240/3240 [==============================] - 30s 9ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1207 | "Epoch 16/20\n", 1208 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4103 - accuracy: 0.8247\n", 1209 | "Epoch 00016: val_loss did not improve from 0.43572\n", 1210 | "3240/3240 [==============================] - 31s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1211 | "Epoch 17/20\n", 1212 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4108 - accuracy: 0.8244\n", 1213 | "Epoch 00017: val_loss did not improve from 0.43572\n", 1214 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1215 | "Epoch 18/20\n", 1216 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4107 - accuracy: 0.8247\n", 1217 | "Epoch 00018: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n", 1218 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1219 | "Epoch 19/20\n", 1220 | "3200/3240 [============================>.] - ETA: 0s - loss: 0.4080 - accuracy: 0.8263\n", 1221 | "Epoch 00019: val_loss improved from 0.43572 to 0.43572, saving model to sentiment_checkpoint.keras\n", 1222 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n", 1223 | "Epoch 20/20\n", 1224 | "3200/3240 [============================>.] 
- ETA: 0s - loss: 0.4115 - accuracy: 0.8234\n", 1225 | "Epoch 00020: val_loss did not improve from 0.43572\n", 1226 | "3240/3240 [==============================] - 32s 10ms/sample - loss: 0.4102 - accuracy: 0.8247 - val_loss: 0.4357 - val_accuracy: 0.8194\n" 1227 | ] 1228 | }, 1229 | { 1230 | "data": { 1231 | "text/plain": [ 1232 | "" 1233 | ] 1234 | }, 1235 | "execution_count": 49, 1236 | "metadata": {}, 1237 | "output_type": "execute_result" 1238 | } 1239 | ], 1240 | "source": [ 1241 | "# 开始训练\n", 1242 | "model.fit(X_train, y_train,\n", 1243 | " validation_split=0.1, \n", 1244 | " epochs=20,\n", 1245 | " batch_size=128,\n", 1246 | " callbacks=callbacks)" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "markdown", 1251 | "metadata": {}, 1252 | "source": [ 1253 | "**结论** \n", 1254 | "我们首先对测试样本进行预测,得到了还算满意的准确度。 \n", 1255 | "之后我们定义一个预测函数,来预测输入的文本的极性,可见模型对于否定句和一些简单的逻辑结构都可以进行准确的判断。" 1256 | ] 1257 | }, 1258 | { 1259 | "cell_type": "code", 1260 | "execution_count": 51, 1261 | "metadata": {}, 1262 | "outputs": [ 1263 | { 1264 | "name": "stdout", 1265 | "output_type": "stream", 1266 | "text": [ 1267 | "400/400 [==============================] - 1s 3ms/sample - loss: 0.4799 - accuracy: 0.7675\n", 1268 | "Accuracy:76.75%\n" 1269 | ] 1270 | } 1271 | ], 1272 | "source": [ 1273 | "result = model.evaluate(X_test, y_test)\n", 1274 | "print('Accuracy:{0:.2%}'.format(result[1]))" 1275 | ] 1276 | }, 1277 | { 1278 | "cell_type": "code", 1279 | "execution_count": 52, 1280 | "metadata": {}, 1281 | "outputs": [], 1282 | "source": [ 1283 | "def predict_sentiment(text):\n", 1284 | " print(text)\n", 1285 | " # 去标点\n", 1286 | " text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\",text)\n", 1287 | " # 分词\n", 1288 | " cut = jieba.cut(text)\n", 1289 | " cut_list = [ i for i in cut ]\n", 1290 | " # tokenize\n", 1291 | " for i, word in enumerate(cut_list):\n", 1292 | " try:\n", 1293 | " cut_list[i] = cn_model.vocab[word].index\n", 1294 | " if cut_list[i] >= num_words:\n", 1295 | " cut_list[i] = 0\n", 1296 | " except KeyError:\n", 1297 | " cut_list[i] = 0\n", 1298 | " # padding\n", 1299 | " tokens_pad = pad_sequences([cut_list], maxlen=max_tokens,\n", 1300 | " padding='pre', truncating='pre')\n", 1301 | " # 预测\n", 1302 | " result = model.predict(x=tokens_pad)\n", 1303 | " coef = result[0][0]\n", 1304 | " if coef >= 0.5:\n", 1305 | " print('是一例正面评价','output=%.2f'%coef)\n", 1306 | " else:\n", 1307 | " print('是一例负面评价','output=%.2f'%coef)" 1308 | ] 1309 | }, 1310 | { 1311 | "cell_type": "code", 1312 | "execution_count": 53, 1313 | "metadata": {}, 1314 | "outputs": [ 1315 | { 1316 | "name": "stdout", 1317 | "output_type": "stream", 1318 | "text": [ 1319 | "酒店设施不是新的,服务态度很不好\n", 1320 | "是一例正面评价 output=0.63\n", 1321 | "酒店卫生条件非常不好\n", 1322 | "是一例负面评价 output=0.47\n", 1323 | "床铺非常舒适\n", 1324 | "是一例正面评价 output=0.79\n", 1325 | "房间很凉,不给开暖气\n", 1326 | "是一例正面评价 output=0.53\n", 1327 | "房间很凉爽,空调冷气很足\n", 1328 | "是一例正面评价 output=0.79\n", 1329 | "酒店环境不好,住宿体验很不好\n", 1330 | "是一例负面评价 output=0.48\n", 1331 | "房间隔音不到位\n", 1332 | "是一例负面评价 output=0.31\n", 1333 | "晚上回来发现没有打扫卫生\n", 1334 | "是一例正面评价 output=0.50\n", 1335 | "因为过节所以要我临时加钱,比团购的价格贵\n", 1336 | "是一例负面评价 output=0.47\n" 1337 | ] 1338 | } 1339 | ], 1340 | "source": [ 1341 | "test_list = [\n", 1342 | " '酒店设施不是新的,服务态度很不好',\n", 1343 | " '酒店卫生条件非常不好',\n", 1344 | " '床铺非常舒适',\n", 1345 | " '房间很凉,不给开暖气',\n", 1346 | " '房间很凉爽,空调冷气很足',\n", 1347 | " '酒店环境不好,住宿体验很不好',\n", 1348 | " '房间隔音不到位' ,\n", 1349 | " '晚上回来发现没有打扫卫生',\n", 1350 | " '因为过节所以要我临时加钱,比团购的价格贵'\n", 1351 | "]\n", 1352 | "for text in test_list:\n", 1353 | " predict_sentiment(text)" 1354 | ] 1355 | }, 1356 | { 1357 | "cell_type": "markdown", 1358 | "metadata": {}, 1359 | "source": [ 1360 | "**错误分类的文本** \n", 1361 | "经过查看,发现错误分类的文本的含义大多比较含糊,就算人类也不容易判断极性,如index为101的这个句子,好像没有一点满意的成分,但这个例子在训练样本中被标记为正面评价,而我们的模型做出的负面评价的预测似乎是合理的。" 1362 | ] 1363 | }, 1364 | { 1365 | "cell_type": "code", 1366 | "execution_count": 54, 1367 | "metadata": {}, 1368 |
"outputs": [], 1369 | "source": [ 1370 | "y_pred = model.predict(X_test)\n", 1371 | "y_pred = y_pred.T[0]\n", 1372 | "y_pred = [1 if p>= 0.5 else 0 for p in y_pred]\n", 1373 | "y_pred = np.array(y_pred)" 1374 | ] 1375 | }, 1376 | { 1377 | "cell_type": "code", 1378 | "execution_count": 55, 1379 | "metadata": {}, 1380 | "outputs": [], 1381 | "source": [ 1382 | "y_actual = np.array(y_test)" 1383 | ] 1384 | }, 1385 | { 1386 | "cell_type": "code", 1387 | "execution_count": 56, 1388 | "metadata": {}, 1389 | "outputs": [], 1390 | "source": [ 1391 | "# 找出错误分类的索引\n", 1392 | "misclassified = np.where( y_pred != y_actual )[0]" 1393 | ] 1394 | }, 1395 | { 1396 | "cell_type": "code", 1397 | "execution_count": 57, 1398 | "metadata": { 1399 | "scrolled": true 1400 | }, 1401 | "outputs": [ 1402 | { 1403 | "name": "stdout", 1404 | "output_type": "stream", 1405 | "text": [ 1406 | "400\n" 1407 | ] 1408 | } 1409 | ], 1410 | "source": [ 1411 | "# 输出所有错误分类的索引\n", 1412 | "len(misclassified)\n", 1413 | "print(len(X_test))" 1414 | ] 1415 | }, 1416 | { 1417 | "cell_type": "code", 1418 | "execution_count": 58, 1419 | "metadata": {}, 1420 | "outputs": [ 1421 | { 1422 | "name": "stdout", 1423 | "output_type": "stream", 1424 | "text": [ 1425 | " 由于2007年 有一些新问题可能还没来得及解决我因为工作需要经常要住那里所以慎重的提出以下 :1 后 的 淋浴喷头的位置都太高我换了房间还是一样很不好用2 后的一些管理和服务还很不到位尤其是前台入住和 时代效率太低每次 都超过10分钟好像不符合 宾馆的要求\n", 1426 | "预测的分类 0\n", 1427 | "实际的分类 1\n" 1428 | ] 1429 | } 1430 | ], 1431 | "source": [ 1432 | "# 我们来找出错误分类的样本看看\n", 1433 | "idx=101\n", 1434 | "print(reverse_tokens(X_test[idx]))\n", 1435 | "print('预测的分类', y_pred[idx])\n", 1436 | "print('实际的分类', y_actual[idx])" 1437 | ] 1438 | }, 1439 | { 1440 | "cell_type": "code", 1441 | "execution_count": 59, 1442 | "metadata": {}, 1443 | "outputs": [ 1444 | { 1445 | "name": "stdout", 1446 | "output_type": "stream", 1447 | "text": [ 1448 | " 还是很 设施也不错但是 和以前 比急剧下滑了 和客房 的服务极差幸好我不是很在乎\n", 1449 | "预测的分类 1\n", 1450 | "实际的分类 1\n" 1451 | ] 1452 | } 1453 | ], 1454 | "source": [ 1455 | "idx=1\n", 
1456 | "print(reverse_tokens(X_test[idx]))\n", 1457 | "print('预测的分类', y_pred[idx])\n", 1458 | "print('实际的分类', y_actual[idx])" 1459 | ] 1460 | }, 1461 | { 1462 | "cell_type": "code", 1463 | "execution_count": null, 1464 | "metadata": {}, 1465 | "outputs": [], 1466 | "source": [] 1467 | }, 1468 | { 1469 | "cell_type": "code", 1470 | "execution_count": null, 1471 | "metadata": {}, 1472 | "outputs": [], 1473 | "source": [] 1474 | }, 1475 | { 1476 | "cell_type": "code", 1477 | "execution_count": null, 1478 | "metadata": {}, 1479 | "outputs": [], 1480 | "source": [] 1481 | } 1482 | ], 1483 | "metadata": { 1484 | "kernelspec": { 1485 | "display_name": "Python 3", 1486 | "language": "python", 1487 | "name": "python3" 1488 | }, 1489 | "language_info": { 1490 | "codemirror_mode": { 1491 | "name": "ipython", 1492 | "version": 3 1493 | }, 1494 | "file_extension": ".py", 1495 | "mimetype": "text/x-python", 1496 | "name": "python", 1497 | "nbconvert_exporter": "python", 1498 | "pygments_lexer": "ipython3", 1499 | "version": "3.6.8" 1500 | } 1501 | }, 1502 | "nbformat": 4, 1503 | "nbformat_minor": 2 1504 | } 1505 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # chinese_sentiment 2 | **用Tensorflow进行中文自然语言处理分类实践** 3 | 1. 词向量下载地址: 4 | 链接: https://pan.baidu.com/s/1GerioMpwj1zmju9NkkrsFg 5 | 提取码: x6v3 6 | 请下载之后在项目根目录建立"embeddings"文件夹, 将下载的文件放入(不用解压), 即可运行代码. 7 | 2. 很多同学遇到乱码等bug, 很抱歉没能及时回复, 现已重新处理了语料和代码, 已经没有了乱码的问题. 8 | 3. 修改了bug后, 可能是数据的顺序变了, 结果模型训练的效果相比去年差了一些, 有兴趣的同学可以调整一下模型参数, 看看会不会有更好的结果. 9 | 4. 代码写的比较早, 有些地方可能有坑, 现在先不重写了, 因为LSTM实在是属于比较老的模型, 近期会发布transformer语言模型的教程, 请大家关注. 10 | 5. 注意, debug之后的代码在"2019新版debug之后--中文自然语言处理--情感分析.ipynb"里, 对应的语料文件是"negative_samples.txt", "positive_samples.txt"这两个. 11 | 6. 如果有问题请在视频评论区留言, 这样各位学习的同学可以互相帮助解决问题, 或者在项目里提issue, 尽量不要给我写邮件, 因为可能回复不及时. 
12 | 13 | 教学视频地址: 14 | youtube: 15 | https://www.youtube.com/watch?v=-mcrmLmNOXA&t=991s 16 | bilibili: 17 | https://www.bilibili.com/video/av30543613?from=search&seid=74343163897647645 18 | 老版本中pos和neg中的语料不全,请解压“语料.zip”覆盖 19 | -------------------------------------------------------------------------------- /flowchart.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/flowchart.jpg -------------------------------------------------------------------------------- /neg/neg.0.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.0.txt -------------------------------------------------------------------------------- /neg/neg.1.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1.txt -------------------------------------------------------------------------------- /neg/neg.10.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.10.txt -------------------------------------------------------------------------------- /neg/neg.1000.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1000.txt -------------------------------------------------------------------------------- /neg/neg.1001.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/neg/neg.1001.txt
--------------------------------------------------------------------------------
(The remaining /neg/neg.*.txt and /pos/pos.*.txt entries listed in the file tree follow the same raw-blob placeholder pattern and are omitted here.)
-------------------------------------------------------------------------------- /pos/pos.1026.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1026.txt -------------------------------------------------------------------------------- /pos/pos.1027.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1027.txt -------------------------------------------------------------------------------- /pos/pos.1028.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1028.txt -------------------------------------------------------------------------------- /pos/pos.1029.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1029.txt -------------------------------------------------------------------------------- /pos/pos.103.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.103.txt -------------------------------------------------------------------------------- /pos/pos.1030.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1030.txt -------------------------------------------------------------------------------- /pos/pos.1031.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1031.txt -------------------------------------------------------------------------------- /pos/pos.1032.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1032.txt -------------------------------------------------------------------------------- /pos/pos.1033.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1033.txt -------------------------------------------------------------------------------- /pos/pos.1034.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1034.txt -------------------------------------------------------------------------------- /pos/pos.1035.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1035.txt -------------------------------------------------------------------------------- /pos/pos.1036.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1036.txt -------------------------------------------------------------------------------- /pos/pos.1037.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1037.txt 
-------------------------------------------------------------------------------- /pos/pos.1038.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1038.txt -------------------------------------------------------------------------------- /pos/pos.1039.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1039.txt -------------------------------------------------------------------------------- /pos/pos.104.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.104.txt -------------------------------------------------------------------------------- /pos/pos.1040.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1040.txt -------------------------------------------------------------------------------- /pos/pos.1041.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1041.txt -------------------------------------------------------------------------------- /pos/pos.1042.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1042.txt -------------------------------------------------------------------------------- /pos/pos.1043.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1043.txt -------------------------------------------------------------------------------- /pos/pos.1044.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1044.txt -------------------------------------------------------------------------------- /pos/pos.1045.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1045.txt -------------------------------------------------------------------------------- /pos/pos.1046.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1046.txt -------------------------------------------------------------------------------- /pos/pos.1047.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1047.txt -------------------------------------------------------------------------------- /pos/pos.1048.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1048.txt -------------------------------------------------------------------------------- /pos/pos.1049.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1049.txt 
-------------------------------------------------------------------------------- /pos/pos.105.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.105.txt -------------------------------------------------------------------------------- /pos/pos.1050.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1050.txt -------------------------------------------------------------------------------- /pos/pos.1051.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1051.txt -------------------------------------------------------------------------------- /pos/pos.1052.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1052.txt -------------------------------------------------------------------------------- /pos/pos.1053.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1053.txt -------------------------------------------------------------------------------- /pos/pos.1054.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1054.txt -------------------------------------------------------------------------------- /pos/pos.1055.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1055.txt -------------------------------------------------------------------------------- /pos/pos.1056.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1056.txt -------------------------------------------------------------------------------- /pos/pos.1057.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1057.txt -------------------------------------------------------------------------------- /pos/pos.1058.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1058.txt -------------------------------------------------------------------------------- /pos/pos.1059.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1059.txt -------------------------------------------------------------------------------- /pos/pos.106.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.106.txt -------------------------------------------------------------------------------- /pos/pos.1060.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1060.txt 
-------------------------------------------------------------------------------- /pos/pos.1061.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1061.txt -------------------------------------------------------------------------------- /pos/pos.1062.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1062.txt -------------------------------------------------------------------------------- /pos/pos.1063.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1063.txt -------------------------------------------------------------------------------- /pos/pos.1064.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1064.txt -------------------------------------------------------------------------------- /pos/pos.1065.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1065.txt -------------------------------------------------------------------------------- /pos/pos.107.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.107.txt -------------------------------------------------------------------------------- /pos/pos.1073.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1073.txt -------------------------------------------------------------------------------- /pos/pos.1074.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1074.txt -------------------------------------------------------------------------------- /pos/pos.1075.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1075.txt -------------------------------------------------------------------------------- /pos/pos.1076.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1076.txt -------------------------------------------------------------------------------- /pos/pos.1077.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1077.txt -------------------------------------------------------------------------------- /pos/pos.1078.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1078.txt -------------------------------------------------------------------------------- /pos/pos.1079.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1079.txt 
-------------------------------------------------------------------------------- /pos/pos.108.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.108.txt -------------------------------------------------------------------------------- /pos/pos.1080.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1080.txt -------------------------------------------------------------------------------- /pos/pos.1081.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1081.txt -------------------------------------------------------------------------------- /pos/pos.1082.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1082.txt -------------------------------------------------------------------------------- /pos/pos.1083.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1083.txt -------------------------------------------------------------------------------- /pos/pos.1084.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1084.txt -------------------------------------------------------------------------------- /pos/pos.1085.txt: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1085.txt -------------------------------------------------------------------------------- /pos/pos.1086.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1086.txt -------------------------------------------------------------------------------- /pos/pos.1087.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1087.txt -------------------------------------------------------------------------------- /pos/pos.1088.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1088.txt -------------------------------------------------------------------------------- /pos/pos.1089.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1089.txt -------------------------------------------------------------------------------- /pos/pos.109.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.109.txt -------------------------------------------------------------------------------- /pos/pos.1090.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/pos.1090.txt 
/pos/pos.1091.txt … /pos/pos.1097.txt: as above, each of these files contains only the corresponding raw-file URL https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/pos/&lt;filename&gt; (entries condensed).

/中文自然语言处理--情感分析.ipynb:
--------------------------------------------------------------------------------

# Chinese Natural Language Processing with TensorFlow: Sentiment Analysis

$$f('真好喝')=1$$
$$f('太难喝了')=0$$

**Introduction**
Hello, I am Espresso, and this is the first tutorial I have made: a simple hands-on classification exercise in Chinese natural language processing.
Why make it? Although there are many NLP learning resources nowadays (far more in English), online material is chaotic: systematic Chinese-language material is scarce, the knowledge points are fragmented, and practical, worked examples are lacking; even where code exists, the absence of comments means it takes a long time to understand. In my own study it took me a full day of web searching before I had sorted out the steps and tools needed to process Chinese.
So I felt obliged to make an introductory tutorial that combines the scattered material into one practical case for learners. This tutorial emphasizes the practical side; for theory I recommend the deeplearning.ai courses. Where the code below touches on a topic, I recommend some learning material and attach links. If anything here infringes your rights, please e-mail: a66777@188.com.
Also, I have done no deep research in NLP myself, so experts are welcome to poke holes; I hope you will point out shortcomings and ways to improve.

**Required libraries**
numpy
jieba
gensim
tensorflow
matplotlib

```python
# First, load the required libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import re
import jieba  # jieba Chinese word segmentation
# gensim is used to load the pre-trained word vectors
from gensim.models import KeyedVectors
import warnings
warnings.filterwarnings("ignore")
```

Recorded stderr (harmless, Windows-only): `UserWarning: detected Windows; aliasing chunkize to chunkize_serial`

**Pre-trained word vectors**
This tutorial uses "chinese-word-vectors", open-sourced by researchers at the Institute of Chinese Information Processing, Beijing Normal University, and the DBIIR lab, Renmin University of China. GitHub link:
https://github.com/Embedding/Chinese-Word-Vectors
If you do not know what word2vec is, I recommend the following article:
https://zhuanlan.zhihu.com/p/26306795
Here we use the "chinese-word-vectors" Zhihu Word + Ngram vectors, which can be downloaded from the GitHub link above. We first load the pre-trained model and run some simple tests:

```python
# Load the pre-trained Chinese word embeddings with gensim
cn_model = KeyedVectors.load_word2vec_format('chinese_word_vectors/sgns.zhihu.bigram',
                                             binary=False)
```

**The word-vector model**
In this word-vector model, every word is an index corresponding to a vector of length 300. The LSTM neural network we are going to build cannot process Chinese text directly: the text must first be segmented into words, and the words converted into word vectors; the steps are shown in the flowchart and will be explained step by step alongside the code. If you do not know what RNNs, GRUs, and LSTMs are, I recommend the deeplearning.ai course; NetEase Open Courses carries a free version with Chinese subtitles, but I still recommend the Coursera original, which includes the quizzes and programming exercises.

```python
# As this shows, every word corresponds to a vector of length 300
embedding_dim = cn_model['山东大学'].shape[0]
print('词向量的长度为{}'.format(embedding_dim))
cn_model['山东大学']
```

Recorded output: `词向量的长度为300`, followed by the 300-dimensional float32 vector for '山东大学' (beginning `array([-2.603470e-01, 3.677500e-01, -2.379650e-01, ...], dtype=float32)`; full printout condensed).

Cosine Similarity for Vector Space Models by Christian S. Perone:
http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

```python
# Compute similarity
cn_model.similarity('橘子', '橙子')
```

Recorded output: `0.66128117`

```python
# dot('橘子'/|'橘子'|, '橙子'/|'橙子'|)
np.dot(cn_model['橘子'] / np.linalg.norm(cn_model['橘子']),
       cn_model['橙子'] / np.linalg.norm(cn_model['橙子']))
```

Recorded output: `0.66128117`, identical to `similarity`, which computes exactly this normalized dot product.

```python
# Find the most similar words by cosine similarity
cn_model.most_similar(positive=['大学'], topn=10)
```

Recorded output:
```
[('高中', 0.7247823476791382),
 ('本科', 0.6768535375595093),
 ('研究生', 0.6244412660598755),
 ('中学', 0.6088204979896545),
 ('大学本科', 0.595908522605896),
 ('初中', 0.5883588790893555),
 ('读研', 0.5778335332870483),
 ('职高', 0.5767995119094849),
 ('大学毕业', 0.5767451524734497),
 ('师范大学', 0.5708829760551453)]
```

```python
# Find the word that does not belong
test_words = '老师 会计师 程序员 律师 医生 老人'
test_words_result = cn_model.doesnt_match(test_words.split())
print('在 '+test_words+' 中:\n不是同一类别的词为: %s' % test_words_result)
```

Recorded output:
```
在 老师 会计师 程序员 律师 医生 老人 中:
不是同一类别的词为: 老人
```

```python
cn_model.most_similar(positive=['女人','出轨'], negative=['男人'], topn=1)
```

**Training corpus**
This tutorial uses Professor Tan Songbo's hotel-review corpus. Even this corpus was hard to find a download link for: one blog wanted points for the download and I did not know how to earn them; a link I finally found turned out to be dead; in the end, pasting the link into Thunder (迅雷) got it downloaded. I hope everyone will share resources more in the future.
The training samples are placed in two folders, pos and neg. Each folder holds 2000 txt files, and each file contains one review, giving 4000 training samples in total; a sample of this size counts as very tiny in NLP:
获得样本的索引,样本存放于两个文件夹中,\n", 333 | "# 分别为 正面评价'pos'文件夹 和 负面评价'neg'文件夹\n", 334 | "# 每个文件夹中有2000个txt文件,每个文件中是一例评价\n", 335 | "import os\n", 336 | "pos_txts = os.listdir('pos')\n", 337 | "neg_txts = os.listdir('neg')" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 39, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "name": "stdout", 347 | "output_type": "stream", 348 | "text": [ 349 | "样本总共: 4000\n" 350 | ] 351 | } 352 | ], 353 | "source": [ 354 | "print( '样本总共: '+ str(len(pos_txts) + len(neg_txts)) )" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 40, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "# 现在我们将所有的评价内容放置到一个list里\n", 364 | "\n", 365 | "train_texts_orig = [] # 存储所有评价,每例评价为一条string\n", 366 | "\n", 367 | "# 添加完所有样本之后,train_texts_orig为一个含有4000条文本的list\n", 368 | "# 其中前2000条文本为正面评价,后2000条为负面评价\n", 369 | "\n", 370 | "for i in range(len(pos_txts)):\n", 371 | " with open('pos/'+pos_txts[i], 'r', errors='ignore') as f:\n", 372 | " text = f.read().strip()\n", 373 | " train_texts_orig.append(text)\n", 374 | " f.close()\n", 375 | "for i in range(len(neg_txts)):\n", 376 | " with open('neg/'+neg_txts[i], 'r', errors='ignore') as f:\n", 377 | " text = f.read().strip()\n", 378 | " train_texts_orig.append(text)\n", 379 | " f.close()" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 41, 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "data": { 389 | "text/plain": [ 390 | "4000" 391 | ] 392 | }, 393 | "execution_count": 41, 394 | "metadata": {}, 395 | "output_type": "execute_result" 396 | } 397 | ], 398 | "source": [ 399 | "len(train_texts_orig)" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 42, 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "# 我们使用tensorflow的keras接口来建模\n", 409 | "from tensorflow.python.keras.models import Sequential\n", 410 | "from tensorflow.python.keras.layers import Dense, GRU, Embedding, 
LSTM, Bidirectional\n", 411 | "from tensorflow.python.keras.preprocessing.text import Tokenizer\n", 412 | "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n", 413 | "from tensorflow.python.keras.optimizers import RMSprop\n", 414 | "from tensorflow.python.keras.optimizers import Adam\n", 415 | "from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "**分词和tokenize** \n", 423 | "首先我们去掉每个样本的标点符号,然后用jieba分词,jieba分词返回一个生成器,没法直接进行tokenize,所以我们将分词结果转换成一个list,并将它索引化,这样每一例评价的文本变成一段索引数字,对应着预训练词向量模型中的词。" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 43, 429 | "metadata": {}, 430 | "outputs": [ 431 | { 432 | "name": "stderr", 433 | "output_type": "stream", 434 | "text": [ 435 | "Building prefix dict from the default dictionary ...\n", 436 | "Loading model from cache C:\\Users\\jinan\\AppData\\Local\\Temp\\jieba.cache\n", 437 | "Loading model cost 0.672 seconds.\n", 438 | "Prefix dict has been built succesfully.\n" 439 | ] 440 | } 441 | ], 442 | "source": [ 443 | "# 进行分词和tokenize\n", 444 | "# train_tokens是一个长长的list,其中含有4000个小list,对应每一条评价\n", 445 | "train_tokens = []\n", 446 | "for text in train_texts_orig:\n", 447 | " # 去掉标点\n", 448 | " text = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\",text)\n", 449 | " # 结巴分词\n", 450 | " cut = jieba.cut(text)\n", 451 | " # 结巴分词的输出结果为一个生成器\n", 452 | " # 把生成器转换为list\n", 453 | " cut_list = [ i for i in cut ]\n", 454 | " for i, word in enumerate(cut_list):\n", 455 | " try:\n", 456 | " # 将词转换为索引index\n", 457 | " cut_list[i] = cn_model.vocab[word].index\n", 458 | " except KeyError:\n", 459 | " # 如果词不在字典中,则输出0\n", 460 | " cut_list[i] = 0\n", 461 | " train_tokens.append(cut_list)" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "**索引长度标准化** \n", 469 | 
"因为每段评语的长度是不一样的,我们如果单纯取最长的一个评语,并把其他评填充成同样的长度,这样十分浪费计算资源,所以我们取一个折衷的长度。" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 44, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [ 478 | "# 获得所有tokens的长度\n", 479 | "num_tokens = [ len(tokens) for tokens in train_tokens ]\n", 480 | "num_tokens = np.array(num_tokens)" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": 45, 486 | "metadata": {}, 487 | "outputs": [ 488 | { 489 | "data": { 490 | "text/plain": [ 491 | "71.4495" 492 | ] 493 | }, 494 | "execution_count": 45, 495 | "metadata": {}, 496 | "output_type": "execute_result" 497 | } 498 | ], 499 | "source": [ 500 | "# 平均tokens的长度\n", 501 | "np.mean(num_tokens)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 46, 507 | "metadata": {}, 508 | "outputs": [ 509 | { 510 | "data": { 511 | "text/plain": [ 512 | "1540" 513 | ] 514 | }, 515 | "execution_count": 46, 516 | "metadata": {}, 517 | "output_type": "execute_result" 518 | } 519 | ], 520 | "source": [ 521 | "# 最长的评价tokens的长度\n", 522 | "np.max(num_tokens)" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 89, 528 | "metadata": {}, 529 | "outputs": [ 530 | { 531 | "data": { 532 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYsAAAEWCAYAAACXGLsWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAHXBJREFUeJzt3XmYXVWd7vHva0DGCGIKhAQo0CgiDagRaUFF4dogKNxWBgcMg532qqCAV4PYgl69QmOjOBuZIiKCiIKiNohw0QcFwxgEsXkYQgiSIFMYGgi+94+9ipwUVbVPDafOqTrv53nOU2evPaxfdlXO76y19l5btomIiBjK89odQEREdL4ki4iIqJVkERERtZIsIiKiVpJFRETUSrKIiIhaSRbRFEnflvRvY3SszSQ9KmlKWb5c0gfG4tjleL+UNHusjjeMej8v6X5Jfx2DY+0iafFYxDWKGCzppW2ot+3/9niuJItA0p2SnpC0XNJDkq6U9EFJz/592P6g7f/T5LF2G2ob24tsr2v7mTGI/ThJ3+93/D1szx/tsYcZx6bAUcDWtl88wPp8AA6iXUkphifJIvq83fZUYHPgeOCTwKljXYmk1cb6mB1ic+Bvtpe2O5CIVkiyiFXYftj2hcD+wGxJ2wBIOkPS58v7aZJ+XlohD0j6raTnSToT2Az4Welm+oSk3vLN8VBJi4DfNJQ1Jo6XSLpa0sOSLpC0QanrOd/I+1ovknYHPgXsX+q7oax/tlurxPVpSXdJWirpe5LWK+v64pgtaVHpQjpmsHMjab2y/7JyvE+X4+8GXAJsUuI4o99+6wC/bFj/qKRNJK0h6SuSlpTXVyStMUjdh0u6WdKMsryXpOsbWoLb9js/H5d0Yzmf50hac6jf3RB/En3HXEPSl8p5uq90S67V+DuSdFQ5x/dKOrhh3xdJ+pmkRyT9sXTX/a6su6JsdkM5L/s37Dfg8aI9kixiQLavBhYDbxhg9VFlXQ+wEdUHtm0fCCyiaqWsa/vfG/Z5E/AK4J8GqfL9wCHAJsAK4KtNxPgr4P8C55T6thtgs4PK683AlsC6wNf7bbMz8HJgV+Azkl4xSJVfA9Yrx3lTiflg278G9gCWlDgO6hfnY/3Wr2t7CXAMsCOwPbAdsAPw6f6VqhorOgh4k+3Fkl4NnAb8K/Ai4DvAhf0SzX7A7sAWwLZlfxjkdzfIv7fRCcDLSqwvBaYDn2lY/+JybqYDhwLfkPTCsu4bwGNlm9nl1Xdu3ljeblfOyzlNHC/aIMkihrIE2GCA8qeBjYHNbT9t+7eun2TsONuP2X5ikPVn2r6pfLD+G7CfygD4KL0XOMn27bYfBY4GDujXqvms7Sds3wDcQPXBvYoSy/7A0baX274T+A/gwFHG9jnbS20vAz7b73iSdBJVgn1z2QbgX4Dv2L7K9jNlfOZJqsTT56u2l9h+APgZ1Yc8jOB3J0mlziNsP2B7OVWSPqBhs6fLv+Vp278AHgVeXs7bO4FjbT9u+2agmfGkAY/XxH7RIkkWMZTpwAMDlJ8I3AZcLOl2SXObONbdw1h/F7A6MK2pKIe2STle47FXo/pW3afx6qXHqVof/U0Dnj/AsaaPcWybNCyvD8wBvmj74YbyzYGjSlfSQ5IeAjbtt+9g/6aR/O56gLWBaxrq+1Up7/M32ysGqLOH6nw3/n7r/haGOl60SZJFDEjSa6k+CH/Xf135Zn2U7S2BtwNHStq1b/Ugh6xreWza8H4zqm+W91N1X6zdENcUVv2QqjvuEqoP18ZjrwDuq9mvv/tLTP2PdU+T+w8U50CxLWlYfhDYCzhd0k4N5XcDX7C9fsNrbdtn1wYx9O9uMPcDTwCvbKhvPdvNfHgvozrfMxrKNh1k2+hgSRaxCkkvkLQX8EPg+7YXDrDNXpJeWronHgGeKS+oPoS3HEHV75O0taS1gc8B55VLa/8CrClpT0mrU/XpN/bN3wf0DjFIezZwhKQ
tJK3LyjGOFYNsP6ASy7nAFyRNlbQ5cCTw/aH3XCXOF/UNrjfE9mlJPZKmUY0B9L8M+HKq7qqfSHpdKf4u8EFJr1NlnXJ+ptYFUfO7G5Dtv5c6vyxpw3Kc6ZIGG39q3PcZ4HzgOElrS9qKaqyn0Uj/ZmIcJVlEn59JWk71rfUY4CRgsCtQZgK/pupH/j3wzfKhBvBFqg/AhyR9fBj1nwmcQdV9siZwOFRXZwEfAk6h+hb/GNUAbZ8flZ9/k3TtAMc9rRz7CuAO4L+Bw4YRV6PDSv23U7W4flCOX8v2n6mSw+3l3GwCfB5YANwILASuLWX9972E6ndxoaTX2F5ANYbwdarWx22sHMCuM9TvbiifLPX8QdIj5RjNjiF8hGqw+q9Uv4uzqcZY+hwHzC/nZb8mjxnjTHn4UUSMJ0knAC+2Pe532cfIpWURES0laStJ25Yusx2oLoX9SbvjiuGZrHfTRkTnmErV9bQJsJTqkuML2hpRDFu6oSIiola6oSIiotaE7oaaNm2ae3t72x1GRMSEcs0119xvu6d+y5UmdLLo7e1lwYIF7Q4jImJCkXRX/VarSjdURETUSrKIiIhaSRYREVErySIiImolWURERK0ki4iIqJVkERERtZIsIiKiVpJFRETUSrKIUemdexG9cy9qdxgR0WJJFhERUSvJIiIiak3oiQRj8urr2rrz+D2Hvc9w94uIemlZRERErSSLiIiolWQRERG1kiwiIqJWkkVERNRKsoiIiFpJFhERUSvJIiIiaiVZxISReagi2ifJIiIiarUsWUg6TdJSSTc1lJ0o6c+SbpT0E0nrN6w7WtJtkm6V9E+tiisiIoavlS2LM4Dd+5VdAmxje1vgL8DRAJK2Bg4AXln2+aakKS2MLSIihqFlycL2FcAD/coutr2iLP4BmFHe7w380PaTtu8AbgN2aFVsERExPO0cszgE+GV5Px24u2Hd4lL2HJLmSFogacGyZctaHGJERECbkoWkY4AVwFl9RQNs5oH2tT3P9izbs3p6eloVYkRENBj351lImg3sBexquy8hLAY2bdhsBrBkvGOLyWMkz8OIiMGNa8tC0u7AJ4F32H68YdWFwAGS1pC0BTATuHo8Y4uIiMG1rGUh6WxgF2CapMXAsVRXP60BXCIJ4A+2P2j7T5LOBW6m6p76sO1nWhVbREQMT8uShe13D1B86hDbfwH4Qqviic6RLqKIiSd3cEdERK0ki4iIqJVkERERtZIsIiKi1rjfZxExlExBHtGZkiyiqzQmo1yNFdG8dENFREStJIuIiKiVbqiY8DLOEdF6aVlEREStJIuIiKiVbqjoCMPpSsrcUhHjLy2LiIiolWQRERG1kiwiIqJWkkVERNRKsoiIiFpJFhERUSvJIiIiaiVZRERErSSLiIiolTu4oys0c4d47gyPGFxaFhERUatlLQtJpwF7AUttb1PKNgDOAXqBO4H9bD8oScDJwNuAx4GDbF/bqtiiPfp/u8/U4hETRytbFmcAu/crmwtcansmcGlZBtgDmFlec4BvtTCuiFX0zr0oiSuiRsuShe0rgAf6Fe8NzC/v5wP7NJR/z5U/AOtL2rhVsUVExPCM95jFRrbvBSg/Nyzl04G7G7ZbXMqeQ9IcSQskLVi2bFlLg42IiEqnDHBrgDIPtKHtebZn2Z7V09PT4rAiIgLGP1nc19e9VH4uLeWLgU0btpsBLBnn2CIiYhDjnSwuBGaX97OBCxrK36/KjsDDfd1VERHRfq28dPZsYBdgmqTFwLHA8cC5kg4FFgH7ls1/QXXZ7G1Ul84e3Kq4IiJi+FqWLGy/e5BVuw6wrYEPtyqWiIgYnUz3ERNW7o2IGD+1YxaSdpK0Tnn/PkknSdq89aFFRESnaGaA+1vA45K2Az4B3AV8r6VRxaSQO6MjJo9muqFW2LakvYGTbZ8qaXbtXjHhdcIsrKNNNklWEWOjmWSxXNLRwPuAN0qaAqze2rAiIqKTNNMNtT/wJHCo7b9
STcNxYkujioiIjlLbsigJ4qSG5UVkzCIioqs0czXUP0v6L0kPS3pE0nJJj4xHcBER0RmaGbP4d+Dttm9pdTAREdGZmhmzuC+JIiKiuzXTslgg6Rzgp1QD3QDYPr9lUUVEREdpJlm8gGpyv7c2lBlIsoiI6BLNXA2VGWAjIrpcM1dDvUzSpZJuKsvbSvp060OLiIhO0cwA93eBo4GnAWzfCBzQyqAiIqKzNJMs1rZ9db+yFa0IJiIiOlMzyeJ+SS+hGtRG0ruAPPI0ulJm0o1u1czVUB8G5gFbSboHuINqUsGISakTZtuN6DTNJIt7bO9WHoD0PNvLJW3Q6sAiIqJzNNMNdb6k1Ww/VhLFi4FLWh1YdI50vUREM8nip8B5kqZI6gUupro6KiIiukQzN+V9V9LzqZJGL/Cvtq9sdWAREdE5Bk0Wko5sXAQ2Ba4HdpS0o+2TBt6znqQjgA9QXWG1EDgY2Bj4IbABcC1woO2nRlpHRESMnaG6oaY2vNYFfgLc1lA2IpKmA4cDs2xvA0yhusnvBODLtmcCDwKHjrSOiIgYW4O2LGx/tnFZ0tSq2I+OUb1rSXoaWJvqvo23AO8p6+cDxwHfGoO6IiJilGrHLCRtA5xJ1T2EpPuB99v+00gqtH2PpC8Bi4AnqAbMrwEest13Z/hiqmd9DxTPHGAOwGabbTaSEKJGrnyKiP6auRpqHnCk7c1tbw4cRTVf1IhIeiGwN7AFsAmwDrDHAJt6oP1tz7M9y/asnp6ekYYRERHD0MxNeevYvqxvwfbl5Qa9kdoNuMP2MgBJ5wOvB9Yv93OsAGYAS0ZRR0SttKAimtdMsrhd0r9RdUVBNdXHHaOocxHVFVVrU3VD7QosAC4D3kV1RdRs4IJR1BEt0PjhmqkwIrpLM91QhwA9VE/GOx+YBhw00gptXwWcR3V57MISwzzgk8CRkm4DXgScOtI6IiJibDXTstjN9uGNBZL2BX400kptHwsc26/4dmCHkR4zIiJap5mWxUBTe2S6j4iILjLUHdx7AG8Dpkv6asOqF5CHH0WXyWB4dLuhuqGWUA08v4PqPog+y4EjWhlURCdIgohYaag7uG8AbpD0A9tPj2NMERHRYWrHLJIoIiKimQHuiIjocoMmC0lnlp8fHb9wIiKiEw3VsniNpM2BQyS9UNIGja/xCjAiItpvqKuhvg38CtiS6mooNaxzKY+IiC4waMvC9ldtvwI4zfaWtrdoeCVRRER0kWaewf2/JG0HvKEUXWH7xtaGFZ0u9yBEdJfaq6EkHQ6cBWxYXmdJOqzVgUVEROdoZiLBDwCvs/0YgKQTgN8DX2tlYBER0Tmauc9CwDMNy8+w6mB3RERMcs20LE4HrpL0k7K8D3nWREREV2lmgPskSZcDO1O1KA62fV2rA4vx0TdQnSffRcRQmmlZYPtaqifbRUREF8rcUBERUSvJIiIiag2ZLCRNkfTr8QomIiI605DJwvYzwOOS1huneCIiogM1M8D938BCSZcAj/UV2j68ZVFFTBCN057kirKYzJpJFheVV0REdKlm7rOYL2ktYDPbt45FpZLWB04BtqGa7vwQ4FbgHKAXuBPYz/aDY1FfRESMTjMTCb4duJ7q2RZI2l7ShaOs92TgV7a3ArYDbgHmApfanglcWpajRXrnXpSZYyOiac1cOnscsAPwEIDt64EtRlqhpBcAb6RMGWL7KdsPAXsD88tm86mmFYmIiA7QTLJYYfvhfmUeRZ1bAsuA0yVdJ+kUSesAG9m+F6D83HCgnSXNkbRA0oJly5aNIoyIiGhWM8niJknvAaZIminpa8CVo6hzNeDVwLdsv4rqCqumu5xsz7M9y/asnp6eUYQRERHNaiZZHAa8EngSOBt4BPjYKOpcDCy2fVVZPo8qedwnaWOA8nPpKOqIaKmM+US3aeZqqMeBY8pDj2x7+WgqtP1XSXdLenm5umpX4Obymg0cX35eMJp6onN0y4dqZvCNyaw2WUh6LXAaMLU
sPwwcYvuaUdR7GNXjWZ8P3A4cTNXKOVfSocAiYN9RHD8iIsZQMzflnQp8yPZvASTtTPVApG1HWmm5omrWAKt2HekxIyKidZpJFsv7EgWA7d9JGlVXVExe3dLlFNFtBk0Wkl5d3l4t6TtUg9sG9gcub31oERHRKYZqWfxHv+VjG96P5j6LmITSooiY3AZNFrbfPJ6BRERE52rmaqj1gfdTTfD37PaZojwions0M8D9C+APwELg760NJyIiOlEzyWJN20e2PJKIiOhYzUz3caakf5G0saQN+l4tjywiIjpGMy2Lp4ATgWNYeRWUqWaPjYiILtBMsjgSeKnt+1sdTEREdKZmuqH+BDze6kAiJovMSBuTUTMti2eA6yVdRjVNOZBLZyMiukkzyeKn5RUREV2qmedZzK/bJiIiJrdm7uC+gwHmgrKdq6EiIrpEM91Qjc+dWJPqoUS5zyIioos00w31t35FX5H0O+AzrQkpYnLof0VUHrcaE1kz3VCvblh8HlVLY2rLIoqIiI7TTDdU43MtVgB3Avu1JJqIiOhIzXRD5bkWERFdrpluqDWAd/Lc51l8rnVhRUREJ2mmG+oC4GHgGhru4I6I4Wkc8M5gd0w0zSSLGbZ3H+uKJU0BFgD32N5L0hbAD6kuy70WOND2U2Ndb0REDF8zEwleKekfWlD3R4FbGpZPAL5seybwIHBoC+qMiIgRaCZZ7AxcI+lWSTdKWijpxtFUKmkGsCdwSlkW8BbgvLLJfGCf0dQRERFjp5luqD1aUO9XgE+w8n6NFwEP2V5RlhcD01tQb0REjEAzl87eNZYVStoLWGr7Gkm79BUPVPUg+88B5gBsttlmYxlaREQMopluqLG2E/AOSXdSDWi/haqlsb6kvuQ1A1gy0M6259meZXtWT0/PeMQbEdH1xj1Z2D7a9gzbvcABwG9svxe4DHhX2Ww21SW7ERHRAdrRshjMJ4EjJd1GNYZxapvjiYiIopkB7paxfTlweXl/O7BDO+OJiIiBdVLLIiIiOlRbWxYx/vo/YyEiohlpWURERK0ki4iIqJVkERERtZIsIiKiVpJFRETUSrKIiIhaSRYREVErySIiImolWURERK0ki4iIqJVkEdFGvXMvyhQsMSEkWURERK1MJBjRBmlNxESTlkVEh0tXVXSCJIuIiKiVbqiIDtLYgrjz+D3bGEnEqtKyiIiIWkkWERFRK8kiIiJqJVlEREStJIuIiKg17ldDSdoU+B7wYuDvwDzbJ0vaADgH6AXuBPaz/eB4xxfRDrmPIjpdO1oWK4CjbL8C2BH4sKStgbnApbZnApeW5YiI6ADjnixs32v72vJ+OXALMB3YG5hfNpsP7DPesUVExMDaOmYhqRd4FXAVsJHte6FKKMCGg+wzR9ICSQuWLVs2XqFGtF2m/Yh2aluykLQu8GPgY7YfaXY/2/Nsz7I9q6enp3UBRkTEs9qSLCStTpUozrJ9fim+T9LGZf3GwNJ2xBYREc/VjquhBJwK3GL7pIZVFwKzgePLzwvGO7aITpIup+gk7ZhIcCfgQGChpOtL2aeoksS5kg4FFgH7tiG2iIgYwLgnC9u/AzTI6l3HM5aIiSgz00Y75A7uiIiolWQRMYHlctoYL3n40SSW7oqIGCtpWURERK0ki4hJIN1R0WpJFhERUSvJIiIiaiVZREwi6Y6KVkmyiIiIWkkWERFRK8kiIiJqJVlEREStJIuIiKiV6T4iJrFM+RJjJS2LiIiolWQRERG10g0VMQmN9Ma8vv3SZRX9pWURERG1kiwmkUz1EEPp//eRv5cYjnRDTRDpHoixkgQRI5GWRURE1EqyiIhBpasq+iRZRERErY4bs5C0O3AyMAU4xfbxbQ4pYlIbqOXQvyxjZtFRyULSFOAbwP8AFgN/lHSh7ZvHO5Z2TJMwnDozjUNMBMNJMklIna3TuqF2AG6zfbvtp4AfAnu3OaaIiK4n2+2O4VmS3gXsbvsDZflA4HW2P9KwzRxgTln
cBrhp3APtTNOA+9sdRIfIuVgp52KlnIuVXm576nB26KhuKEADlK2SzWzPA+YBSFpge9Z4BNbpci5WyrlYKedipZyLlSQtGO4+ndYNtRjYtGF5BrCkTbFERETRacnij8BMSVtIej5wAHBhm2OKiOh6HdUNZXuFpI8A/0l16exptv80xC7zxieyCSHnYqWci5VyLlbKuVhp2Oeiowa4IyKiM3VaN1RERHSgJIuIiKg1YZOFpN0l3SrpNklz2x1Pu0jaVNJlkm6R9CdJH213TO0kaYqk6yT9vN2xtJuk9SWdJ+nP5e/jH9sdU7tIOqL8/7hJ0tmS1mx3TONF0mmSlkq6qaFsA0mXSPqv8vOFdceZkMmiYVqQPYCtgXdL2rq9UbXNCuAo268AdgQ+3MXnAuCjwC3tDqJDnAz8yvZWwHZ06XmRNB04HJhlexuqi2cOaG9U4+oMYPd+ZXOBS23PBC4ty0OakMmCTAvyLNv32r62vF9O9YEwvb1RtYekGcCewCntjqXdJL0AeCNwKoDtp2w/1N6o2mo1YC1JqwFr00X3b9m+AnigX/HewPzyfj6wT91xJmqymA7c3bC8mC79gGwkqRd4FXBVeyNpm68AnwD+3u5AOsCWwDLg9NItd4qkddodVDvYvgf4ErAIuBd42PbF7Y2q7TayfS9UXziBDet2mKjJonZakG4jaV3gx8DHbD/S7njGm6S9gKW2r2l3LB1iNeDVwLdsvwp4jCa6Giaj0h+/N7AFsAmwjqT3tTeqiWeiJotMC9JA0upUieIs2+e3O5422Ql4h6Q7qbol3yLp++0Nqa0WA4tt97Uyz6NKHt1oN+AO28tsPw2cD7y+zTG1232SNgYoP5fW7TBRk0WmBSkkiapf+hbbJ7U7nnaxfbTtGbZ7qf4efmO7a7892v4rcLekl5eiXYFxfy5Mh1gE7Chp7fL/ZVe6dLC/wYXA7PJ+NnBB3Q4dNd1Hs0YwLchkthNwILBQ0vWl7FO2f9HGmKIzHAacVb5Q3Q4c3OZ42sL2VZLOA66lunrwOrpo6g9JZwO7ANMkLQaOBY4HzpV0KFUy3bf2OJnuIyIi6kzUbqiIiBhHSRYREVErySIiImolWURERK0ki4iIqJVkEROWpEdbcMztJb2tYfk4SR8fxfH2LTO+XtavvFfSe5rY/yBJXx9p/RFjJckiYlXbA2+r3ap5hwIfsv3mfuW9QG2yiOgUSRYxKUj635L+KOlGSZ8tZb3lW/13y7MMLpa0Vln32rLt7yWdWJ5z8Hzgc8D+kq6XtH85/NaSLpd0u6TDB6n/3ZIWluOcUMo+A+wMfFvSif12OR54Q6nnCElrSjq9HOM6Sf2TC5L2LPFOk9Qj6cfl3/xHSTuVbY4rzy9YJV5J60i6SNINJcb9+x8/Yki288prQr6AR8vPt1LdkSuqL0A/p5qeu5fqjt3ty3bnAu8r728CXl/eHw/cVN4fBHy9oY7jgCuBNYBpwN+A1fvFsQnVXbA9VLMi/AbYp6y7nOo5Cv1j3wX4ecPyUcDp5f1W5Xhr9sUD/E/gt8ALyzY/AHYu7zejmu5l0HiBdwLfbahvvXb//vKaWK8JOd1HRD9vLa/ryvK6wEyqD9w7bPdNg3IN0CtpfWCq7StL+Q+AvYY4/kW2nwSelLQU2Ihqor4+rwUut70MQNJZVMnqp8P4N+wMfA3A9p8l3QW8rKx7MzALeKtXzii8G1WLp2//F0iaOkS8C4EvlVbPz23/dhixRSRZxKQg4Iu2v7NKYfV8jycbip4B1mLgKe6H0v8Y/f/fDPd4AxnqGLdTPZ/iZcCCUvY84B9tP7HKQark8Zx4bf9F0muoxmO+KOli258bg7ijS2TMIiaD/wQOKc/0QNJ0SYM+zMX2g8BySTuWosZHbC4Hpj53ryFdBbypjCVMAd4N/L+affrXcwXw3hL/y6i6lm4t6+4C/hn4nqRXlrKLgY/07Sxp+6Eqk7QJ8Ljt71M9CKhbpyuPEUqyiAnP1VP
PfgD8XtJCqmc31H3gHwrMk/R7qm/1D5fyy6i6d65vdhDY1ZPGji773gBca7tuyucbgRVlwPkI4JvAlBL/OcBBpSupr45bqZLJjyS9hPJM6TJIfzPwwZr6/gG4usxMfAzw+Wb+bRF9MutsdCVJ69p+tLyfC2xs+6NtDiuiY2XMIrrVnpKOpvo/cBfVVUcRMYi0LCIiolbGLCIiolaSRURE1EqyiIiIWkkWERFRK8kiIiJq/X+K7ei/t/dDegAAAABJRU5ErkJggg==\n", 533 | "text/plain": [ 534 | "
" 535 | ] 536 | }, 537 | "metadata": {}, 538 | "output_type": "display_data" 539 | } 540 | ], 541 | "source": [ 542 | "plt.hist(np.log(num_tokens), bins = 100)\n", 543 | "plt.xlim((0,10))\n", 544 | "plt.ylabel('number of tokens')\n", 545 | "plt.xlabel('length of tokens')\n", 546 | "plt.title('Distribution of tokens length')\n", 547 | "plt.show()" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 48, 553 | "metadata": {}, 554 | "outputs": [ 555 | { 556 | "data": { 557 | "text/plain": [ 558 | "236" 559 | ] 560 | }, 561 | "execution_count": 48, 562 | "metadata": {}, 563 | "output_type": "execute_result" 564 | } 565 | ], 566 | "source": [ 567 | "# 取tokens平均值并加上两个tokens的标准差,\n", 568 | "# 假设tokens长度的分布为正态分布,则max_tokens这个值可以涵盖95%左右的样本\n", 569 | "max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)\n", 570 | "max_tokens = int(max_tokens)\n", 571 | "max_tokens" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": 49, 577 | "metadata": {}, 578 | "outputs": [ 579 | { 580 | "data": { 581 | "text/plain": [ 582 | "0.9565" 583 | ] 584 | }, 585 | "execution_count": 49, 586 | "metadata": {}, 587 | "output_type": "execute_result" 588 | } 589 | ], 590 | "source": [ 591 | "# 取tokens的长度为236时,大约95%的样本被涵盖\n", 592 | "# 我们对长度不足的进行padding,超长的进行修剪\n", 593 | "np.sum( num_tokens < max_tokens ) / len(num_tokens)" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "**反向tokenize** \n", 601 | "我们定义一个function,用来把索引转换成可阅读的文本,这对于debug很重要。" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": 50, 607 | "metadata": {}, 608 | "outputs": [], 609 | "source": [ 610 | "# 用来将tokens转换为文本\n", 611 | "def reverse_tokens(tokens):\n", 612 | " text = ''\n", 613 | " for i in tokens:\n", 614 | " if i != 0:\n", 615 | " text = text + cn_model.index2word[i]\n", 616 | " else:\n", 617 | " text = text + ' '\n", 618 | " return text" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | 
"execution_count": 51, 624 | "metadata": {}, 625 | "outputs": [], 626 | "source": [ 627 | "reverse = reverse_tokens(train_tokens[0])" 628 | ] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "metadata": {}, 633 | "source": [ 634 | "以下可见,训练样本的极性并不是那么精准,比如说下面的样本,对早餐并不满意,但被定义为正面评价,这会迷惑我们的模型,不过我们暂时不对训练样本进行任何修改。" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": 52, 640 | "metadata": {}, 641 | "outputs": [ 642 | { 643 | "data": { 644 | "text/plain": [ 645 | "'早餐太差无论去多少人那边也不加食品的酒店应该重视一下这个问题了房间本身很好'" 646 | ] 647 | }, 648 | "execution_count": 52, 649 | "metadata": {}, 650 | "output_type": "execute_result" 651 | } 652 | ], 653 | "source": [ 654 | "# 经过tokenize再恢复成文本\n", 655 | "# 可见标点符号都没有了\n", 656 | "reverse" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 53, 662 | "metadata": {}, 663 | "outputs": [ 664 | { 665 | "data": { 666 | "text/plain": [ 667 | "'早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。\\n\\n房间本身很好。'" 668 | ] 669 | }, 670 | "execution_count": 53, 671 | "metadata": {}, 672 | "output_type": "execute_result" 673 | } 674 | ], 675 | "source": [ 676 | "# 原始文本\n", 677 | "train_texts_orig[0]" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "**准备Embedding Matrix** \n", 685 | "现在我们来为模型准备embedding matrix(词向量矩阵),根据keras的要求,我们需要准备一个维度为$(numwords, embeddingdim)$的矩阵,num words代表我们使用的词汇的数量,emdedding dimension在我们现在使用的预训练词向量模型中是300,每一个词汇都用一个长度为300的向量表示。 \n", 686 | "注意我们只选择使用前50k个使用频率最高的词,在这个预训练词向量模型中,一共有260万词汇量,如果全部使用在分类问题上会很浪费计算资源,因为我们的训练样本很小,一共只有4k,如果我们有100k,200k甚至更多的训练样本时,在分类问题上可以考虑减少使用的词汇量。" 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": 90, 692 | "metadata": {}, 693 | "outputs": [ 694 | { 695 | "data": { 696 | "text/plain": [ 697 | "300" 698 | ] 699 | }, 700 | "execution_count": 90, 701 | "metadata": {}, 702 | "output_type": "execute_result" 703 | } 704 | ], 705 | "source": [ 706 | "embedding_dim" 707 | ] 708 | }, 709 | { 710 | "cell_type": 
"code", 711 | "execution_count": 55, 712 | "metadata": {}, 713 | "outputs": [], 714 | "source": [ 715 | "# 只使用前20000个词\n", 716 | "num_words = 50000\n", 717 | "# 初始化embedding_matrix,之后在keras上进行应用\n", 718 | "embedding_matrix = np.zeros((num_words, embedding_dim))\n", 719 | "# embedding_matrix为一个 [num_words,embedding_dim] 的矩阵\n", 720 | "# 维度为 50000 * 300\n", 721 | "for i in range(num_words):\n", 722 | " embedding_matrix[i,:] = cn_model[cn_model.index2word[i]]\n", 723 | "embedding_matrix = embedding_matrix.astype('float32')" 724 | ] 725 | }, 726 | { 727 | "cell_type": "code", 728 | "execution_count": 56, 729 | "metadata": {}, 730 | "outputs": [ 731 | { 732 | "data": { 733 | "text/plain": [ 734 | "300" 735 | ] 736 | }, 737 | "execution_count": 56, 738 | "metadata": {}, 739 | "output_type": "execute_result" 740 | } 741 | ], 742 | "source": [ 743 | "# 检查index是否对应,\n", 744 | "# 输出300意义为长度为300的embedding向量一一对应\n", 745 | "np.sum( cn_model[cn_model.index2word[333]] == embedding_matrix[333] )" 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": 57, 751 | "metadata": {}, 752 | "outputs": [ 753 | { 754 | "data": { 755 | "text/plain": [ 756 | "(50000, 300)" 757 | ] 758 | }, 759 | "execution_count": 57, 760 | "metadata": {}, 761 | "output_type": "execute_result" 762 | } 763 | ], 764 | "source": [ 765 | "# embedding_matrix的维度,\n", 766 | "# 这个维度为keras的要求,后续会在模型中用到\n", 767 | "embedding_matrix.shape" 768 | ] 769 | }, 770 | { 771 | "cell_type": "markdown", 772 | "metadata": {}, 773 | "source": [ 774 | "**padding(填充)和truncating(修剪)** \n", 775 | "我们把文本转换为tokens(索引)之后,每一串索引的长度并不相等,所以为了方便模型的训练我们需要把索引的长度标准化,上面我们选择了236这个可以涵盖95%训练样本的长度,接下来我们进行padding和truncating,我们一般采用'pre'的方法,这会在文本索引的前面填充0,因为根据一些研究资料中的实践,如果在文本索引后面填充0的话,会对模型造成一些不良影响。" 776 | ] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "execution_count": 58, 781 | "metadata": {}, 782 | "outputs": [], 783 | "source": [ 784 | "# 进行padding和truncating, 输入的train_tokens是一个list\n", 785 | "# 返回的train_pad是一个numpy array\n", 
786 | "train_pad = pad_sequences(train_tokens, maxlen=max_tokens,\n", 787 | " padding='pre', truncating='pre')" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": 59, 793 | "metadata": {}, 794 | "outputs": [], 795 | "source": [ 796 | "# 超出五万个词向量的词用0代替\n", 797 | "train_pad[ train_pad>=num_words ] = 0" 798 | ] 799 | }, 800 | { 801 | "cell_type": "code", 802 | "execution_count": 60, 803 | "metadata": {}, 804 | "outputs": [ 805 | { 806 | "data": { 807 | "text/plain": [ 808 | "array([ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 809 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 810 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 811 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 812 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 813 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 814 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 815 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 816 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 817 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 818 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 819 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 820 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 821 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 822 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 823 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 824 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 825 | " 290, 3053, 57, 169, 73, 1, 25, 11216, 49,\n", 826 | " 163, 15985, 0, 0, 30, 8, 0, 1, 228,\n", 827 | " 223, 40, 35, 653, 0, 5, 1642, 29, 11216,\n", 828 | " 2751, 500, 98, 30, 3159, 2225, 2146, 371, 6285,\n", 829 | " 169, 27396, 1, 1191, 5432, 1080, 20055, 57, 562,\n", 830 | " 1, 22671, 40, 35, 169, 2567, 0, 42665, 7761,\n", 831 | " 110, 0, 0, 41281, 0, 110, 0, 35891, 110,\n", 832 | " 0, 28781, 57, 169, 1419, 1, 11670, 0, 19470,\n", 833 | " 1, 0, 0, 169, 35071, 40, 562, 35, 12398,\n", 834 | " 657, 4857])" 835 | ] 836 | }, 837 | "execution_count": 60, 838 | "metadata": {}, 839 | "output_type": "execute_result" 840 | } 841 | ], 842 | "source": [ 843 | "# 可见padding之后前面的tokens全变成0,文本在最后面\n", 844 | "train_pad[33]" 845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "execution_count": 61, 850 | "metadata": {}, 851 | 
"outputs": [], 852 | "source": [ 853 | "# 准备target向量,前2000样本为1,后2000为0\n", 854 | "train_target = np.concatenate( (np.ones(2000),np.zeros(2000)) )" 855 | ] 856 | }, 857 | { 858 | "cell_type": "code", 859 | "execution_count": 62, 860 | "metadata": {}, 861 | "outputs": [], 862 | "source": [ 863 | "# 进行训练和测试样本的分割\n", 864 | "from sklearn.model_selection import train_test_split" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": 63, 870 | "metadata": {}, 871 | "outputs": [], 872 | "source": [ 873 | "# 90%的样本用来训练,剩余10%用来测试\n", 874 | "X_train, X_test, y_train, y_test = train_test_split(train_pad,\n", 875 | " train_target,\n", 876 | " test_size=0.1,\n", 877 | " random_state=12)" 878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "execution_count": 64, 883 | "metadata": {}, 884 | "outputs": [ 885 | { 886 | "name": "stdout", 887 | "output_type": "stream", 888 | "text": [ 889 | " 房间很大还有海景阳台走出酒店就是沙滩非常不错唯一遗憾的就是不能刷 不方便\n", 890 | "class: 1.0\n" 891 | ] 892 | } 893 | ], 894 | "source": [ 895 | "# 查看训练样本,确认无误\n", 896 | "print(reverse_tokens(X_train[35]))\n", 897 | "print('class: ',y_train[35])" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "metadata": {}, 903 | "source": [ 904 | "现在我们用keras搭建LSTM模型,模型的第一层是Embedding层,只有当我们把tokens索引转换为词向量矩阵之后,才可以用神经网络对文本进行处理。\n", 905 | "keras提供了Embedding接口,避免了繁琐的稀疏矩阵操作。 \n", 906 | "在Embedding层我们输入的矩阵为:$$(batchsize, maxtokens)$$\n", 907 | "输出矩阵为: $$(batchsize, maxtokens, embeddingdim)$$" 908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": 65, 913 | "metadata": {}, 914 | "outputs": [], 915 | "source": [ 916 | "# 用LSTM对样本进行分类\n", 917 | "model = Sequential()" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": 66, 923 | "metadata": {}, 924 | "outputs": [], 925 | "source": [ 926 | "# 模型第一层为embedding\n", 927 | "model.add(Embedding(num_words,\n", 928 | " embedding_dim,\n", 929 | " weights=[embedding_matrix],\n", 930 | " input_length=max_tokens,\n", 931 | " 
trainable=False))" 932 | ] 933 | }, 934 | { 935 | "cell_type": "code", 936 | "execution_count": 67, 937 | "metadata": {}, 938 | "outputs": [], 939 | "source": [ 940 | "model.add(Bidirectional(LSTM(units=32, return_sequences=True)))\n", 941 | "model.add(LSTM(units=16, return_sequences=False))" 942 | ] 943 | }, 944 | { 945 | "cell_type": "markdown", 946 | "metadata": {}, 947 | "source": [ 948 | "**构建模型** \n", 949 | "我在这个教程中尝试了几种神经网络结构,因为训练样本比较少,所以我们可以尽情尝试,训练过程等待时间并不长: \n", 950 | "**GRU:**如果使用GRU的话,测试样本可以达到87%的准确率,但我测试自己的文本内容时发现,GRU最后一层激活函数的输出都在0.5左右,说明模型的判断不是很明确,信心比较低,而且经过测试发现模型对于否定句的判断有时会失误,我们期望对于负面样本输出接近0,正面样本接近1而不是都徘徊于0.5之间。 \n", 951 | "**BiLSTM:**测试了LSTM和BiLSTM,发现BiLSTM的表现最好,LSTM的表现略好于GRU,这可能是因为BiLSTM对于比较长的句子结构有更好的记忆,有兴趣的朋友可以深入研究一下。 \n", 952 | "Embedding之后第,一层我们用BiLSTM返回sequences,然后第二层16个单元的LSTM不返回sequences,只返回最终结果,最后是一个全链接层,用sigmoid激活函数输出结果。" 953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": 68, 958 | "metadata": {}, 959 | "outputs": [], 960 | "source": [ 961 | "# GRU的代码\n", 962 | "# model.add(GRU(units=32, return_sequences=True))\n", 963 | "# model.add(GRU(units=16, return_sequences=True))\n", 964 | "# model.add(GRU(units=4, return_sequences=False))" 965 | ] 966 | }, 967 | { 968 | "cell_type": "code", 969 | "execution_count": 69, 970 | "metadata": {}, 971 | "outputs": [], 972 | "source": [ 973 | "model.add(Dense(1, activation='sigmoid'))\n", 974 | "# 我们使用adam以0.001的learning rate进行优化\n", 975 | "optimizer = Adam(lr=1e-3)" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": 70, 981 | "metadata": {}, 982 | "outputs": [], 983 | "source": [ 984 | "model.compile(loss='binary_crossentropy',\n", 985 | " optimizer=optimizer,\n", 986 | " metrics=['accuracy'])" 987 | ] 988 | }, 989 | { 990 | "cell_type": "code", 991 | "execution_count": 71, 992 | "metadata": {}, 993 | "outputs": [ 994 | { 995 | "name": "stdout", 996 | "output_type": "stream", 997 | "text": [ 998 | 
"_________________________________________________________________\n", 999 | "Layer (type) Output Shape Param # \n", 1000 | "=================================================================\n", 1001 | "embedding_1 (Embedding) (None, 236, 300) 15000000 \n", 1002 | "_________________________________________________________________\n", 1003 | "bidirectional_1 (Bidirection (None, 236, 64) 85248 \n", 1004 | "_________________________________________________________________\n", 1005 | "lstm_2 (LSTM) (None, 16) 5184 \n", 1006 | "_________________________________________________________________\n", 1007 | "dense_1 (Dense) (None, 1) 17 \n", 1008 | "=================================================================\n", 1009 | "Total params: 15,090,449\n", 1010 | "Trainable params: 90,449\n", 1011 | "Non-trainable params: 15,000,000\n", 1012 | "_________________________________________________________________\n" 1013 | ] 1014 | } 1015 | ], 1016 | "source": [ 1017 | "# 我们来看一下模型的结构,一共90k左右可训练的变量\n", 1018 | "model.summary()" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "code", 1023 | "execution_count": 72, 1024 | "metadata": {}, 1025 | "outputs": [], 1026 | "source": [ 1027 | "# 建立一个权重的存储点\n", 1028 | "path_checkpoint = 'sentiment_checkpoint.keras'\n", 1029 | "checkpoint = ModelCheckpoint(filepath=path_checkpoint, monitor='val_loss',\n", 1030 | " verbose=1, save_weights_only=True,\n", 1031 | " save_best_only=True)" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "code", 1036 | "execution_count": 73, 1037 | "metadata": {}, 1038 | "outputs": [], 1039 | "source": [ 1040 | "# 尝试加载已训练模型\n", 1041 | "try:\n", 1042 | " model.load_weights(path_checkpoint)\n", 1043 | "except Exception as e:\n", 1044 | " print(e)" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "code", 1049 | "execution_count": 74, 1050 | "metadata": {}, 1051 | "outputs": [], 1052 | "source": [ 1053 | "# 定义early stoping如果3个epoch内validation loss没有改善则停止训练\n", 1054 | "earlystopping = EarlyStopping(monitor='val_loss', 
patience=3, verbose=1)" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "code", 1059 | "execution_count": 75, 1060 | "metadata": {}, 1061 | "outputs": [], 1062 | "source": [ 1063 | "# reduce the learning rate automatically when validation loss plateaus\n", 1064 | "lr_reduction = ReduceLROnPlateau(monitor='val_loss',\n", 1065 | " factor=0.1, min_lr=1e-5, patience=0,\n", 1066 | " verbose=1)" 1067 | ] 1068 | }, 1069 | { 1070 | "cell_type": "code", 1071 | "execution_count": 76, 1072 | "metadata": {}, 1073 | "outputs": [], 1074 | "source": [ 1075 | "# assemble the callback list\n", 1076 | "callbacks = [\n", 1077 | " earlystopping, \n", 1078 | " checkpoint,\n", 1079 | " lr_reduction\n", 1080 | "]" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "code", 1085 | "execution_count": 77, 1086 | "metadata": { 1087 | "scrolled": false 1088 | }, 1089 | "outputs": [], 1090 | "source": [ 1091 | "# start training\n", 1092 | "model.fit(X_train, y_train,\n", 1093 | " validation_split=0.1, \n", 1094 | " epochs=20,\n", 1095 | " batch_size=128,\n", 1096 | " callbacks=callbacks)" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "markdown", 1101 | "metadata": {}, 1102 | "source": [ 1103 | "**Conclusion** \n", 1104 | "We first evaluate the model on the test samples and obtain a reasonably satisfying accuracy. \n", 1105 | "Then we define a prediction function to estimate the polarity of an input text; the model judges negation and some simple logical structures correctly." 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "code", 1110 | "execution_count": 78, 1111 | "metadata": {}, 1112 | "outputs": [ 1113 | { 1114 | "name": "stdout", 1115 | "output_type": "stream", 1116 | "text": [ 1117 | "400/400 [==============================] - 5s 12ms/step\n", 1118 | "Accuracy:87.50%\n" 1119 | ] 1120 | } 1121 | ], 1122 | "source": [ 1123 | "result = model.evaluate(X_test, y_test)\n", 1124 | "print('Accuracy:{0:.2%}'.format(result[1]))" 1125 | ] 1126 | }, 1127 | { 1128 | "cell_type": "code", 1129 | "execution_count": 79, 1130 | "metadata": {}, 1131 | "outputs": [], 1132 | "source": [ 1133 | "def predict_sentiment(text):\n", 1134 | " print(text)\n", 1135 | " # strip punctuation\n", 1136 | " text = 
re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\']+|[+——!,。?、~@#¥%……&*()]+\", \"\", text)\n", 1137 | " # word segmentation\n", 1138 | " cut = jieba.cut(text)\n", 1139 | " cut_list = [i for i in cut]\n", 1140 | " # tokenize: map each word to its embedding index, unknown words to 0\n", 1141 | " for i, word in enumerate(cut_list):\n", 1142 | " try:\n", 1143 | " cut_list[i] = cn_model.vocab[word].index\n", 1144 | " except KeyError:\n", 1145 | " cut_list[i] = 0\n", 1146 | " # padding\n", 1147 | " tokens_pad = pad_sequences([cut_list], maxlen=max_tokens,\n", 1148 | " padding='pre', truncating='pre')\n", 1149 | " # predict\n", 1150 | " result = model.predict(x=tokens_pad)\n", 1151 | " coef = result[0][0]\n", 1152 | " if coef >= 0.5:\n", 1153 | " print('是一例正面评价','output=%.2f'%coef)\n", 1154 | " else:\n", 1155 | " print('是一例负面评价','output=%.2f'%coef)" 1156 | ] 1157 | }, 1158 | { 1159 | "cell_type": "code", 1160 | "execution_count": 80, 1161 | "metadata": {}, 1162 | "outputs": [ 1163 | { 1164 | "name": "stdout", 1165 | "output_type": "stream", 1166 | "text": [ 1167 | "酒店设施不是新的,服务态度很不好\n", 1168 | "是一例负面评价 output=0.14\n", 1169 | "酒店卫生条件非常不好\n", 1170 | "是一例负面评价 output=0.09\n", 1171 | "床铺非常舒适\n", 1172 | "是一例正面评价 output=0.76\n", 1173 | "房间很凉,不给开暖气\n", 1174 | "是一例负面评价 output=0.17\n", 1175 | "房间很凉爽,空调冷气很足\n", 1176 | "是一例正面评价 output=0.66\n", 1177 | "酒店环境不好,住宿体验很不好\n", 1178 | "是一例负面评价 output=0.06\n", 1179 | "房间隔音不到位\n", 1180 | "是一例负面评价 output=0.17\n", 1181 | "晚上回来发现没有打扫卫生\n", 1182 | "是一例负面评价 output=0.25\n", 1183 | "因为过节所以要我临时加钱,比团购的价格贵\n", 1184 | "是一例负面评价 output=0.06\n" 1185 | ] 1186 | } 1187 | ], 1188 | "source": [ 1189 | "test_list = [\n", 1190 | " '酒店设施不是新的,服务态度很不好',\n", 1191 | " '酒店卫生条件非常不好',\n", 1192 | " '床铺非常舒适',\n", 1193 | " '房间很凉,不给开暖气',\n", 1194 | " '房间很凉爽,空调冷气很足',\n", 1195 | " '酒店环境不好,住宿体验很不好',\n", 1196 | " '房间隔音不到位',\n", 1197 | " '晚上回来发现没有打扫卫生',\n", 1198 | " '因为过节所以要我临时加钱,比团购的价格贵'\n", 1199 | "]\n", 1200 | "for text in test_list:\n", 1201 | " predict_sentiment(text)" 1202 | ] 1203 | }, 1204 | { 1205 | "cell_type": "markdown", 1206 | "metadata": {}, 1207 | 
"source": [ 1208 | "**Misclassified texts** \n", 1209 | "On inspection, most of the misclassified texts turn out to be rather ambiguous; even a human would have trouble judging their polarity. Take the sentence at index 101: it hardly contains a trace of satisfaction, yet this review was labeled as positive in the training data, so our model's negative prediction actually seems reasonable." 1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "code", 1214 | "execution_count": 81, 1215 | "metadata": {}, 1216 | "outputs": [], 1217 | "source": [ 1218 | "y_pred = model.predict(X_test)\n", 1219 | "y_pred = y_pred.T[0]\n", 1220 | "y_pred = [1 if p >= 0.5 else 0 for p in y_pred]\n", 1221 | "y_pred = np.array(y_pred)" 1222 | ] 1223 | }, 1224 | { 1225 | "cell_type": "code", 1226 | "execution_count": 82, 1227 | "metadata": {}, 1228 | "outputs": [], 1229 | "source": [ 1230 | "y_actual = np.array(y_test)" 1231 | ] 1232 | }, 1233 | { 1234 | "cell_type": "code", 1235 | "execution_count": 83, 1236 | "metadata": {}, 1237 | "outputs": [], 1238 | "source": [ 1239 | "# find the indices of the misclassified samples\n", 1240 | "misclassified = np.where( y_pred != y_actual )[0]" 1241 | ] 1242 | }, 1243 | { 1244 | "cell_type": "code", 1245 | "execution_count": 92, 1246 | "metadata": { 1247 | "scrolled": true 1248 | }, 1249 | "outputs": [ 1250 | { 1251 | "name": "stdout", 1252 | "output_type": "stream", 1253 | "text": [ 1254 | "400\n" 1255 | ] 1256 | } 1257 | ], 1258 | "source": [ 1259 | "# count the misclassifications; here we only print the total number of test samples\n", 1260 | "len(misclassified)\n", 1261 | "print(len(X_test))" 1262 | ] 1263 | }, 1264 | { 1265 | "cell_type": "code", 1266 | "execution_count": 85, 1267 | "metadata": {}, 1268 | "outputs": [ 1269 | { 1270 | "name": "stdout", 1271 | "output_type": "stream", 1272 | "text": [ 1273 | " 由于2007年 有一些新问题可能还没来得及解决我因为工作需要经常要住那里所以慎重的提出以下 :1 后 的 淋浴喷头的位置都太高我换了房间还是一样很不好用2 后的一些管理和服务还很不到位尤其是前台入住和 时代效率太低每次 都超过10分钟好像不符合 宾馆的要求\n", 1274 | "预测的分类 0\n", 1275 | "实际的分类 1.0\n" 1276 | ] 1277 | } 1278 | ], 1279 | "source": [ 1280 | "# let's look at one of the misclassified samples\n", 1281 | "idx=101\n", 1282 | "print(reverse_tokens(X_test[idx]))\n", 1283 | "print('预测的分类', y_pred[idx])\n", 1284 | "print('实际的分类', y_actual[idx])" 1285 | ] 1286 | }, 1287 | { 1288 | "cell_type": "code", 1289 | "execution_count": 86, 1290 | 
"metadata": {}, 1291 | "outputs": [ 1292 | { 1293 | "name": "stdout", 1294 | "output_type": "stream", 1295 | "text": [ 1296 | " 还是很 设施也不错但是 和以前 比急剧下滑了 和客房 的服务极差幸好我不是很在乎\n", 1297 | "预测的分类 0\n", 1298 | "实际的分类 1.0\n" 1299 | ] 1300 | } 1301 | ], 1302 | "source": [ 1303 | "idx=1\n", 1304 | "print(reverse_tokens(X_test[idx]))\n", 1305 | "print('预测的分类', y_pred[idx])\n", 1306 | "print('实际的分类', y_actual[idx])" 1307 | ] 1308 | }, 1309 | { 1310 | "cell_type": "code", 1311 | "execution_count": null, 1312 | "metadata": {}, 1313 | "outputs": [], 1314 | "source": [] 1315 | }, 1316 | { 1317 | "cell_type": "code", 1318 | "execution_count": null, 1319 | "metadata": {}, 1320 | "outputs": [], 1321 | "source": [] 1322 | }, 1323 | { 1324 | "cell_type": "code", 1325 | "execution_count": null, 1326 | "metadata": {}, 1327 | "outputs": [], 1328 | "source": [] 1329 | } 1330 | ], 1331 | "metadata": { 1332 | "kernelspec": { 1333 | "display_name": "Python 3", 1334 | "language": "python", 1335 | "name": "python3" 1336 | }, 1337 | "language_info": { 1338 | "codemirror_mode": { 1339 | "name": "ipython", 1340 | "version": 3 1341 | }, 1342 | "file_extension": ".py", 1343 | "mimetype": "text/x-python", 1344 | "name": "python", 1345 | "nbconvert_exporter": "python", 1346 | "pygments_lexer": "ipython3", 1347 | "version": "3.6.5" 1348 | } 1349 | }, 1350 | "nbformat": 4, 1351 | "nbformat_minor": 2 1352 | } 1353 | -------------------------------------------------------------------------------- /语料.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aespresso/chinese_sentiment/6c945dd8de9cd379ea3d4afcc4f7aff8ca1b78f6/语料.zip --------------------------------------------------------------------------------
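A note on the tokenize-and-pad step inside the notebook's `predict_sentiment` cell: each segmented word is mapped to its embedding index (unknown words become 0), and the sequence is pre-padded or pre-truncated to `max_tokens`. The dependency-free sketch below mirrors that logic with a toy vocabulary standing in for `cn_model.vocab` and a hand-rolled substitute for Keras `pad_sequences(..., padding='pre', truncating='pre')`; the names `to_indices` and `pad_pre` are illustrative, not from the notebook.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic keras pad_sequences with padding='pre', truncating='pre'."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                       # truncate from the front
    return [value] * (maxlen - len(seq)) + seq     # pad on the front

def to_indices(words, vocab):
    """Map segmented words to embedding indices; unknown words become 0."""
    return [vocab.get(w, 0) for w in words]

# toy vocabulary standing in for the pretrained cn_model word vectors
vocab = {'酒店': 3, '服务': 7, '很好': 12}
tokens = to_indices(['酒店', '服务', '非常', '很好'], vocab)
print(tokens)              # [3, 7, 0, 12]  ('非常' is out of vocabulary)
print(pad_pre(tokens, 6))  # [0, 0, 3, 7, 0, 12]
print(pad_pre(tokens, 3))  # [7, 0, 12]
```

Pre-padding (rather than post-padding) keeps the informative tokens at the end of the sequence, next to the final time steps whose state the LSTM hands on to the dense layer.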