├── .gitignore ├── LICENSE ├── README.md ├── images ├── 1-Word2vec_and_Dynamic_RNN-color.png ├── 10-CNN_Classifier-color.png ├── 2-Word2vec-color.png ├── 3-Dynamic_RNN-color.png ├── 4-Lookup_Table-color.png ├── 5-Loss_and_Accuracy-color.png ├── 6-Antispam_Service_Architecture-color.png ├── 7-gRPC-color.png ├── 8-From_Client_to_Server-color.png └── 9-NBOW_and_MLP_Classifier-color.png ├── network ├── README.md ├── cnn_classifier.py ├── mlp_classifier.py └── rnn_classifier.py ├── serving ├── README.md ├── packages │ ├── __init__.py │ └── text_regularization.py ├── serving_cnn.py ├── serving_mlp.py └── serving_rnn.py └── word2vec ├── README.md ├── data └── msglog.tar.gz ├── text2vec.py ├── text_features.py └── word2vec.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | 26 | # PyInstaller 27 | # Usually these files are written by a python script from a template 28 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 29 | *.manifest 30 | *.spec 31 | 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | 36 | # Unit test / coverage reports 37 | htmlcov/ 38 | .tox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *,cover 45 | .hypothesis/ 46 | 47 | # Translations 48 | *.mo 49 | *.pot 50 | 51 | # Django stuff: 52 | *.log 53 | local_settings.py 54 | 55 | # Flask stuff: 56 | instance/ 57 | .webassets-cache 58 | 59 | # Scrapy stuff: 60 | .scrapy 61 | 62 | # Sphinx documentation 63 | docs/_build/ 64 | 65 | # PyBuilder 66 | target/ 67 | 68 | # IPython Notebook 69 | .ipynb_checkpoints 70 | 71 | # pyenv 72 | .python-version 73 | 74 | # celery beat schedule file 75 | celerybeat-schedule 76 | 77 | # dotenv 78 | .env 79 | 80 | # virtualenv 81 | venv/ 82 | ENV/ 83 | 84 | # Spyder project settings 85 | .spyderproject 86 | 87 | # Rope project settings 88 | .ropeproject 89 | 90 | # custom 91 | .DS_Store 92 | checkpoint/ 93 | weights/ 94 | cnn_checkpoint/ 95 | rnn_checkpoint/ 96 | mlp_checkpoint/ 97 | output/ 98 | data/ 99 | network/model_h5 100 | network/saved_models 101 | network/logs 102 | word2vec/output 103 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Hong Chen 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## 产品级垃圾文本分类器 2 | 3 | ### 注意事项: 4 | 垃圾文本分类器所用到的tensorflow版本为2.2.0。 5 | 6 | 需要[**TensorLayer2.0+**](https://github.com/tensorlayer/tensorlayer)版本,建议从GitHub源码下载。 7 | 8 | --- 9 | ### 任务场景 10 | 11 | 文本反垃圾是网络社区应用非常常见的任务。因为各种利益关系,网络社区通常都难以避免地会涌入大量骚扰、色情、诈骗等垃圾信息,扰乱社区秩序,伤害用户体验。这些信息往往隐晦,多变,传统规则系统如正则表达式匹配关键词难以应对。通常情况下,文本反垃圾离不开用户行为分析,本章只针对文本内容部分进行讨论。 12 | 13 | 为了躲避平台监测,垃圾文本常常会使用火星文等方式对关键词进行隐藏。例如: 14 | 15 | ``` 16 | 渴望 兂 极限 激情 恠 燃烧 加 涐 嶶 信 lovexxxx521 17 | 亲爱 的 看 頭潒 约 18 | 私人 企鹅 ⓧⓧⓧ㊆㊆⑧⑧⑧ 给 你 爽 你 懂 的 19 | ``` 20 | 21 | 垃圾文本通常还会备有多个联系方式进行用户导流。识别异常联系方式是反垃圾的一项重要工作,但是传统的识别方法依赖大量策略,攻防压力大,也容易被突破。例如: 22 | 23 | ``` 24 | 自啪 试平 n 罗辽 婊研 危性 xxxx447 25 | 自啪 试平 n 罗辽 婊研 危性 xxxxx11118 26 | 自啪 试平 n 罗辽 婊研 危性 xxxx2323 27 | ``` 28 | 29 | 在这个实例中,我们将使用TensorLayer来训练一个垃圾文本分类器,并介绍如何通过TensorFlow Serving来提供高性能服务,实现产品化部署。这个分类器将解决以上几个难题,我们不再担心垃圾文本有多么隐晦,也不再关心它们用的哪国语言或有多少种联系方式。 30 | 31 | 第一步,[训练词向量](./word2vec),相关代码在word2vec文件夹,执行步骤见word2vec/README.md。 32 | 33 | 第二步,[训练分类器](./network),相关代码在network文件夹,执行步骤见network/README.md。 34 | 35 | 第三步,[与TensorFlow Serving交互](./serving),客户端代码在serving文件夹。 36 | 37 | ### 网络结构 38 | 39 | 文本分类必然要先解决文本表征问题。文本表征在自然语言处理任务中扮演着重要的角色。它的目标是将不定长文本(句子、段落、文章)映射成固定长度的向量。 40 | 文本向量的质量会直接影响下游模型的性能。神经网络模型的文本表征工作通常分为两步,首先将单词映射成词向量,然后将词向量组合起来。 41 | 有多种模型能够将词向量组合成文本向量,例如词袋模型(Neural Bag-of-Words,NBOW)、递归神经网络(Recurrent Neural Network,RNN)和卷积神经网络(Convolutional Neural Network,CNN)。这些模型接受由一组词向量组成的文本序列作为输入,然后将文本的语义信息表示成一个固定长度的向量。 42 | NBOW模型的优点是简单快速,配合多层全连接网络能实现不逊于RNN和CNN的分类效果,缺点是向量线性相加必然会丢失很多词与词相关信息,无法更精细地表达句子的语义。CNN在语言模型训练中也被广泛使用,这里卷积的作用变成了从句子中提取出局部的语义组合信息,多个卷积核则用来保证提取的语义组合的多样性。 43 | RNN常用于处理时间序列数据,它能够接受任意长度的输入,是自然语言处理最受欢迎的架构之一,在短文本分类中,相比NBOW和CNN的缺点是需要的计算时间更长。 44 | 45 | 实例中我们使用RNN来表征文本,将输入的文本序列通过一个RNN层映射成固定长度的向量,然后将文本向量输入到一个Softmax层进行分类。 46 | 本章结尾我们会再简单介绍由NBOW和多层感知机(Multilayer Perceptron,MLP)组成的分类器和CNN分类器。实际分类结果中,以上三种分类器的 47 | 准确率都能达到97%以上。如图1所示,相比之前训练的SVM分类器所达到的93%左右的准确率,基于神经网络的垃圾文本分类器表现出非常优秀的性能。 48 | 49 |
50 |  51 | <br>
52 | 图1 Word2vec与Dynamic RNN 53 |
54 | 55 | ### 词的向量表示 56 | 57 | 最简单的词表示方法是One-hot Representation,即把每个词表示为一个很长的向量,这个向量的维度是词表的大小,其中只有一个维度的值为1,其余都为0,这个维度就代表了当前的词。这种表示方法非常简洁,但是容易造成维数灾难,并且无法描述词与词之间的关系。还有一种表示方法是Distributed Representation,如Word2vec。这种方法把词表示成一种稠密、低维的实数向量。该向量可以表示一个词在一个`N`维空间中的位置,并且相似词在空间中的位置相近。由于训练的时候就利用了单词的上下文,因此Word2vec训练出来的词向量天然带有一些句法和语义特征。它的每一维表示词语的一个潜在特征,可以通过空间距离来描述词与词之间的相似性。 58 | 59 | 比较有代表性的Word2vec模型有CBOW模型和Skip-Gram模型。图2演示了Skip-Gram模型的训练过程。假设我们的窗口取1,通过滑动窗口我们得到`(fox, brown)`、`(fox, jumps)`等输入输出对,经过足够多次的迭代后,当我们再次输入`fox`时,`jumps`和`brown`的概率会明显高于其他词。在输入层与隐层之间的矩阵`W1`存储着每一个单词的词向量,从输入层到隐层之间的计算就是取出单词的词向量。因为训练的目标是相似词得到相似上下文,所以相似词在隐层的输出(即其词向量)在优化过程中会越来越接近。训练完成后我们把`W1`(词向量集合)保存起来用于后续的任务。 60 | 61 |
62 |  63 | <br>
64 | 图2 Word2vec训练过程 65 |
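下面用一段极简的示意代码说明Skip-Gram训练样本对的生成方式(窗口取1,对应上文`(fox, brown)`、`(fox, jumps)`的例子)。其中`skipgram_pairs`是为说明原理而虚构的辅助函数,正式的词向量训练请参见word2vec目录下的代码:

```python
def skipgram_pairs(tokens, window=1):
    """按窗口大小生成Skip-Gram的(中心词, 上下文词)训练对"""
    pairs = []
    for i, center in enumerate(tokens):
        # 取中心词左右window范围内的词作为上下文
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the quick brown fox jumps over".split(), window=1))
# 输出中包含 ('fox', 'brown') 和 ('fox', 'jumps')
```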
66 | 67 | ### Dynamic RNN分类器 68 | 69 | 传统神经网络如MLP受限于固定大小的输入,以及静态的输入输出关系,在动态系统建模任务中会遇到比较大的困难。传统神经网络假设所有输入都互相独立,其有向无环的神经网络的各层神经元不会互相作用,不好处理前后输入有关联的问题。但是现实生活中很多问题都是以动态系统的方式呈现的,一件事物的现状往往依托于它之前的状态。虽然也能通过将一长段时间分成多个同等长度的时间窗口来计算时间窗口内的相关内容,但是这个时间窗的依赖与变化都太多,大小并不好取。目前常用的一种RNN是LSTM,它与标准RNN的不同之处是隐层单元的计算函数更加复杂,使得RNN的记忆能力变得更强。 70 | 71 | 在训练RNN的时候我们会遇到另一个问题。不定长序列的长度有可能范围很广,Static RNN由于只构建一次Graph,训练前需要对所有输入进行Padding以确保整个迭代过程中每个Batch的长度一致,这样输入的长度就取决于训练集最长的一个序列,导致许多计算资源浪费在Padding部分。Dynamic RNN实现了Graph动态生成,因此不同Batch的长度可以不同,并且可以跳过Padding部分的计算。这样每一个Batch的数据在输入前只需Padding到该Batch最长序列的长度,并且根据序列实际长度中止计算,从而减少空间和计算量。 72 | 73 | 图3演示了Dynamic RNN分类器的训练过程,Sequence 1、2、3作为一个Batch输入到网络中,这个Batch最长的长度是6,因此左方RNN Graph展开后如右方所示是一个有着6个隐层的网络,每一层的输出会和下一个词一起作为输入进入到下一层。第1个序列的长度为6,因此我们取第6个输出作为这个序列的Embedding输入到Softmax层进行分类。第2个序列的长度为3,因此我们在计算到第3个输出时就停止计算,取第3个输出作为这个序列的Embedding输入到Softmax层进行后续的计算。依此类推,第3个序列取第5个输出作为Softmax层的输入,完成一次前向与后向传播。 74 | 75 |
76 |  77 | <br>
78 | 图3 Dynamic RNN训练过程 79 |
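下面的示意代码演示"按实际长度取第n个输出"的机制:对一个按batch内最长长度补零的批次,先统计每条序列的实际长度,再用`tf.gather_nd`取出最后一个有效时间步。这里直接用补零后的输入代替RNN各时间步的输出,长度6、3、5与图3中的例子对应,仅为说明原理:

```python
import numpy as np
import tensorflow as tf

# 3条序列,特征维度为2,按batch内最长长度6补零(padding为全零向量)
batch = tf.constant(np.array([
    [[1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1]],  # 实际长度6
    [[2, 2], [2, 2], [2, 2], [0, 0], [0, 0], [0, 0]],  # 实际长度3
    [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [0, 0]],  # 实际长度5
], dtype=np.float32))

# 实际长度 = 非全零时间步的个数(与retrieve_seq_length_op3的思路一致)
seq_len = tf.reduce_sum(
    tf.cast(tf.reduce_any(tf.not_equal(batch, 0.0), axis=-1), tf.int32), axis=1)

# 取每条序列最后一个有效时间步的输出作为该序列的表示
idx = tf.stack([tf.range(tf.shape(batch)[0]), seq_len - 1], axis=1)
last_valid = tf.gather_nd(batch, idx)
print(seq_len.numpy())     # [6 3 5]
print(last_valid.numpy())  # [[1. 1.] [2. 2.] [3. 3.]]
```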
80 | -------------------------------------------------------------------------------- /images/1-Word2vec_and_Dynamic_RNN-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/images/1-Word2vec_and_Dynamic_RNN-color.png -------------------------------------------------------------------------------- /images/10-CNN_Classifier-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/images/10-CNN_Classifier-color.png -------------------------------------------------------------------------------- /images/2-Word2vec-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/images/2-Word2vec-color.png -------------------------------------------------------------------------------- /images/3-Dynamic_RNN-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/images/3-Dynamic_RNN-color.png -------------------------------------------------------------------------------- /images/4-Lookup_Table-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/images/4-Lookup_Table-color.png -------------------------------------------------------------------------------- /images/5-Loss_and_Accuracy-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/images/5-Loss_and_Accuracy-color.png -------------------------------------------------------------------------------- /images/6-Antispam_Service_Architecture-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/images/6-Antispam_Service_Architecture-color.png -------------------------------------------------------------------------------- /images/7-gRPC-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/images/7-gRPC-color.png -------------------------------------------------------------------------------- /images/8-From_Client_to_Server-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/images/8-From_Client_to_Server-color.png -------------------------------------------------------------------------------- /images/9-NBOW_and_MLP_Classifier-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/images/9-NBOW_and_MLP_Classifier-color.png -------------------------------------------------------------------------------- /network/README.md: 
-------------------------------------------------------------------------------- 1 | ### 训练RNN分类器并导出Servable 2 | 3 | ``` 4 | python3 rnn_classifier.py 5 | ``` 6 | 7 | ### 训练MLP分类器并导出Servable 8 | 9 | ``` 10 | python3 mlp_classifier.py 11 | ``` 12 | 13 | ### 训练CNN分类器并导出Servable 14 | 15 | ``` 16 | python3 cnn_classifier.py 17 | ``` 18 | #### 模型构建 19 | ##### 训练分类器 20 | 我们使用Dynamic RNN实现不定长文本序列分类。首先加载数据,通过`sklearn`库的`train_test_split`方法将样本按照要求的比例切分成训练集和测试集。 21 | 22 | ```python 23 | def load_dataset(files, test_size=0.2): 24 | """加载样本并取test_size的比例做测试集 25 | Args: 26 | files: 样本文件目录集合 27 | 样本文件是包含了样本特征向量与标签的npy文件 28 | test_size: float 29 | 0.0到1.0之间,代表数据集中有多少比例抽做测试集 30 | Returns: 31 | X_train, y_train: 训练集特征列表和标签列表 32 | X_test, y_test: 测试集特征列表和标签列表 33 | """ 34 | x = [] 35 | y = [] 36 | for file in files: 37 | data = np.load(file, allow_pickle=True) 38 | if x == [] or y == []: 39 | x = data['x'] 40 | y = data['y'] 41 | else: 42 | x = np.append(x, data['x'], axis=0) 43 | y = np.append(y, data['y'], axis=0) 44 | 45 | x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size) 46 | return x_train, y_train, x_test, y_test 47 | ``` 48 | 49 | 为了防止过拟合,我们对Dynamic RNN的隐藏层做了Dropout操作。参数`recurrent_dropout`决定了隐藏层的舍弃比例。通过配置`recurrent_dropout`参数我们可以在训练的时候打开Dropout,在服务的时候关闭它。需要特别注意的是,我们在RNN层传入了`sequence length`来表示每个输入文本的长度。根据Tensorlayer官方文档,只有提供了`sequence length` RNN才是一个Dynamic RNN。计算`sequence length`的方程`retrieve_seq_length_op3`需要指定padding值,在该项目中是维度为200的全零矩阵。 50 | 51 | ```python 52 | def get_model(inputs_shape): 53 | """定义网络结Args: 54 | inputs_shape: 输入数据的shape 55 | recurrent_dropout: RNN隐藏层的舍弃比重 56 | Returns: 57 | model: 定义好的模型 58 | """ 59 | ni = tl.layers.Input(inputs_shape, name='input_layer') 60 | out = tl.layers.RNN(cell=tf.keras.layers.LSTMCell(units=64, recurrent_dropout=0.2), 61 | return_last_output=True, 62 | return_last_state=False, 63 | return_seq_2d=True)(ni, sequence_length=tl.layers.retrieve_seq_length_op3(ni, pad_val=masking_val)) 64 | nn = tl.layers.Dense(n_units=2, act=tf.nn.softmax, name="dense")(out) 65 | model = tl.models.Model(inputs=ni, outputs=nn, name='rnn') 66 | return model 67 | ``` 68 | 例子中每一次迭代,我们给网络输入128条文本序列。根据预测与标签的差异,网络不断优化权重,减少损失,逐步提高分类的准确率。 69 | 70 | ```python 71 | def train(model): 72 | # 开始训练 73 | learning_rate = 0.001 74 | n_epoch = 50 75 | batch_size = 128 76 | display_step = 10 77 | loss_vals = [] 78 | acc_vals = [] 79 | optimizer = tf.optimizers.Nadam(learning_rate=learning_rate) 80 | 81 | logging.info("batch_size: %d", batch_size) 82 | logging.info("Start training the network...") 83 | 84 | for epoch in range(n_epoch): 85 | step = 0 86 | total_step = math.ceil(len(x_train) / batch_size) 87 | 88 | # 利用训练集训练 89 | model.train() 90 | for batch_x, batch_y in tl.iterate.minibatches(x_train, y_train, batch_size, shuffle=True): 91 | 92 | start_time = time.time() 93 | # temp = copy.deepcopy(batch_x) 94 | max_seq_len = max([len(d) for d in batch_x]) 95 | batch_y = batch_y.astype(np.int32) 96 | for i, d in enumerate(batch_x): 97 | batch_x[i] += [tf.convert_to_tensor(np.zeros(200), dtype=tf.float32) for i in 98 | range(max_seq_len - len(d))] 99 | batch_x[i] = tf.convert_to_tensor(batch_x[i], dtype=tf.float32) 100 | batch_x = list(batch_x) 101 | batch_x = tf.convert_to_tensor(batch_x, dtype=tf.float32) 102 | # sequence_length = tl.layers.retrieve_seq_length_op3(batch_x, pad_val=masking_val) 103 | 104 | with tf.GradientTape() as tape: 105 | _y = model(batch_x) 106 | loss_val = tf.nn.sparse_softmax_cross_entropy_with_logits(batch_y, _y, name='train_loss') 107 | 
loss_val = tf.reduce_mean(loss_val) 108 | grad = tape.gradient(loss_val, model.trainable_weights) 109 | optimizer.apply_gradients(zip(grad, model.trainable_weights)) 110 | 111 | loss_vals.append(loss_val) 112 | acc_vals.append(accuracy(_y, batch_y)) 113 | 114 | if step + 1 == 1 or (step + 1) % display_step == 0: 115 | logging.info("Epoch {}/{},Step {}/{}, took {}".format(epoch + 1, n_epoch, step, total_step, 116 | time.time() - start_time)) 117 | loss = sum(loss_vals) / len(loss_vals) 118 | acc = sum(acc_vals) / len(acc_vals) 119 | del loss_vals[:] 120 | del acc_vals[:] 121 | logging.info( 122 | "Minibatch Loss= " + "{:.6f}".format(loss) + ", Training Accuracy= " + "{:.5f}".format(acc)) 123 | step += 1 124 | 125 | with train_summary_writer.as_default(): 126 | tf.summary.scalar('loss', loss_val.numpy(), step = epoch) 127 | tf.summary.scalar('accuracy', accuracy(_y, batch_y).numpy(), step = epoch) 128 | 129 | # 利用测试集评估 130 | model.eval() 131 | test_loss, test_acc, n_iter = 0, 0, 0 132 | for batch_x, batch_y in tl.iterate.minibatches(x_test, y_test, batch_size, shuffle=True): 133 | batch_y = batch_y.astype(np.int32) 134 | max_seq_len = max([len(d) for d in batch_x]) 135 | for i, d in enumerate(batch_x): 136 | # 依照每个batch中最大样本长度将剩余样本打上padding 137 | batch_x[i] += [tf.convert_to_tensor(np.zeros(200), dtype=tf.float32) for i in 138 | range(max_seq_len - len(d))] 139 | batch_x[i] = tf.convert_to_tensor(batch_x[i], dtype=tf.float32) 140 | # ValueError: setting an array element with a sequence. 141 | batch_x = list(batch_x) 142 | batch_x = tf.convert_to_tensor(batch_x, dtype=tf.float32) 143 | 144 | _y = model(batch_x) 145 | 146 | loss_val = tf.nn.sparse_softmax_cross_entropy_with_logits(batch_y, _y, name='test_loss') 147 | loss_val = tf.reduce_mean(loss_val) 148 | 149 | test_loss += loss_val 150 | test_acc += accuracy(_y, batch_y) 151 | n_iter += 1 152 | 153 | with test_summary_writer.as_default(): 154 | tf.summary.scalar('loss', loss_val.numpy(), step=epoch) 155 | tf.summary.scalar('accuracy', accuracy(_y, batch_y).numpy(), step=epoch) 156 | 157 | logging.info(" test loss: {}".format(test_loss / n_iter)) 158 | logging.info(" test acc: {}".format(test_acc / n_iter)) 159 | ``` 160 | 161 | 在模型训练到60个epoch之后,训练集与测试集的准确率都上升到了95%以上。 162 | 163 |
164 |  165 | <br>
166 | 图4 Dynamic RNN Loss and Accuracy 167 |
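图4中的曲线来自训练代码里`tf.summary.scalar`写入`logs/gradient_tape/`目录的日志,训练中或训练后都可以用TensorBoard查看(示意命令,假设在network目录下执行):

```
tensorboard --logdir logs/gradient_tape
```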
168 | 169 | 170 | 171 | ##### 模型导出 172 | 173 | TensorFlow提供了SavedModel这一格式专门用于保存可在多平台部署的文件,然而在TensorFlow2这一版本中,用于保存SavedModel的方法`tf.saved_model.save`仅支持对Trackable对象的模型导出。由于Trackable在Tensorflow2中是keras.model的父类,而TensorLayer构建的model不继承Trackable类,因此我们构建的model无法用`tf.saved_model.save`导出可部署文件。 174 | 175 | 在这里,我们的解决思路是先将TensorLayer模型保存为hdf5文件,再设计一套转译机制,将该hdf5文件转成tf.keras可以读取的形式,然后再由`tf.saved_model.save`方法进行模型导出。 176 | 177 | hdf5文件从TensorLayer到keras的转译,被分为weights和config两部分。 178 | 179 | ```python 180 | def translator_tl2_keras_h5(_tl_h5_path, _keras_h5_path): 181 | f_tl_ = h5py.File(_tl_h5_path, 'r+') 182 | f_k_ = h5py.File(_keras_h5_path, 'a') 183 | f_k_.clear() 184 | weights_translator(f_tl_, f_k_) 185 | config_translator(f_tl_, f_k_) 186 | f_tl_.close() 187 | f_k_.close() 188 | ``` 189 | 190 | weights_translator将训练过程中学习到的权重(例如bias和kernel)进行转译。 191 | 192 | ```python 193 | def weights_translator(f_tl, f_k): 194 | # todo: delete inputlayer 195 | if 'model_weights' not in f_k.keys(): 196 | f_k_model_weights = f_k.create_group('model_weights') 197 | else: 198 | f_k_model_weights = f_k['model_weights'] 199 | for key in f_tl.keys(): 200 | if key not in f_k_model_weights.keys(): 201 | f_k_model_weights.create_group(key) 202 | try: 203 | f_tl_para = f_tl[key][key] 204 | except KeyError: 205 | pass 206 | else: 207 | if key not in f_k_model_weights[key].keys(): 208 | f_k_model_weights[key].create_group(key) 209 | weight_names = [] 210 | f_k_para = f_k_model_weights[key][key] 211 | # todo:对RNN层的weights进行通用适配 212 | cell_name = '' 213 | if key == 'rnn_1': 214 | cell_name = 'lstm_cell' 215 | f_k_para.create_group(cell_name) 216 | f_k_para = f_k_para[cell_name] 217 | f_k_model_weights.create_group('masking') 218 | f_k_model_weights['masking'].attrs['weight_names'] = [] 219 | for k in f_tl_para: 220 | if k == 'biases:0' or k == 'bias:0': 221 | weight_name = 'bias:0' 222 | elif k == 'filters:0' or k == 'weights:0' or k == 'kernel:0': 223 | weight_name = 'kernel:0' 224 | elif k == 'recurrent_kernel:0': 225 | weight_name = 'recurrent_kernel:0' 226 | else: 227 | raise Exception("cant find the parameter '{}' in tensorlayer".format(k)) 228 | if weight_name in f_k_para: 229 | del f_k_para[weight_name] 230 | f_k_para.create_dataset(name=weight_name, data=f_tl_para[k][:], 231 | shape=f_tl_para[k].shape) 232 | 233 | weight_names = [] 234 | for weight_name in f_tl[key].attrs['weight_names']: 235 | weight_name = weight_name.decode('utf8') 236 | weight_name = weight_name.split('/') 237 | k = weight_name[-1] 238 | if k == 'biases:0' or k == 'bias:0': 239 | weight_name[-1] = 'bias:0' 240 | elif k == 'filters:0' or k == 'weights:0' or k == 'kernel:0': 241 | weight_name[-1] = 'kernel:0' 242 | elif k == 'recurrent_kernel:0': 243 | weight_name[-1] = 'recurrent_kernel:0' 244 | else: 245 | raise Exception("cant find the parameter '{}' in tensorlayer".format(k)) 246 | if key == 'rnn_1': 247 | weight_name.insert(-1, 'lstm_cell') 248 | weight_name = '/'.join(weight_name) 249 | weight_names.append(weight_name.encode('utf8')) 250 | f_k_model_weights[key].attrs['weight_names'] = weight_names 251 | 252 | f_k_model_weights.attrs['backend'] = keras.backend.backend().encode('utf8') 253 | f_k_model_weights.attrs['keras_version'] = str(keras.__version__).encode('utf8') 254 | 255 | f_k_model_weights.attrs['layer_names'] = [i for i in f_tl.attrs['layer_names']] 256 | ``` 257 | 258 | config_translator转译了模型的config信息,包括了模型的结构,和训练过程中的loss,metrics,optimizer等信息,同时传入了Masking层,保证dynamic RNN能够根据序列实际长度中止计算。 259 | 260 | ```PYTHON 261 | def 
config_translator(f_tl, f_k): 262 | tl_model_config = f_tl.attrs['model_config'].decode('utf8') 263 | tl_model_config = eval(tl_model_config) 264 | tl_model_architecture = tl_model_config['model_architecture'] 265 | 266 | k_layers = [] 267 | masking_layer = { 268 | 'class_name': 'Masking', 269 | 'config': { 270 | 'batch_input_shape': [None, None, 200], 271 | 'dtype': 'float32', 272 | 'mask_value': 0, # Masks a sequence to skip timesteps if values are equal to mask_value 273 | 'name': 'masking', 274 | 'trainable': True 275 | } 276 | } 277 | k_layers.append(masking_layer) 278 | for key, tl_layer in enumerate(tl_model_architecture): 279 | if key == 1: 280 | k_layer = layer_translator(tl_layer, is_first_layer=True) 281 | else: 282 | k_layer = layer_translator(tl_layer) 283 | if k_layer is not None: 284 | k_layers.append(k_layer) 285 | f_k.attrs['model_config'] = json.dumps({'class_name': 'Sequential', 286 | 'config': {'name': 'sequential', 'layers': k_layers}, 287 | 'build_input_shape': input_shape}, 288 | default=serialization.get_json_type).encode('utf8') 289 | f_k.attrs['backend'] = keras.backend.backend().encode('utf8') 290 | f_k.attrs['keras_version'] = str(keras.__version__).encode('utf8') 291 | 292 | # todo: translate the 'training_config' 293 | training_config = {'loss': {'class_name': 'SparseCategoricalCrossentropy', 294 | 'config': {'reduction': 'auto', 'name': 'sparse_categorical_crossentropy', 295 | 'from_logits': False}}, 296 | 'metrics': ['accuracy'], 'weighted_metrics': None, 'loss_weights': None, 'sample_weight_mode': None, 297 | 'optimizer_config': {'class_name': 'Adam', 298 | 'config': {'name': 'Adam', 'learning_rate': 0.01, 'decay': 0.0, 299 | 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False 300 | } 301 | } 302 | } 303 | 304 | f_k.attrs['training_config'] = json.dumps(training_config, default=serialization.get_json_type).encode('utf8') 305 | ``` 306 | 307 | TensorLayer和keras的模型在保存config时,都是以layer为单位分别保存,于是在translate时,我们按照每个层的类型进行逐层转译。 308 | 309 | ```PYTHON 310 | def layer_translator(tl_layer, is_first_layer=False): 311 | _input_shape = None 312 | global input_shape 313 | if is_first_layer: 314 | _input_shape = input_shape 315 | if tl_layer['class'] == '_InputLayer': 316 | input_shape = tl_layer['args']['shape'] 317 | elif tl_layer['class'] == 'Conv1d': 318 | return layer_conv1d_translator(tl_layer, _input_shape) 319 | elif tl_layer['class'] == 'MaxPool1d': 320 | return layer_maxpooling1d_translator(tl_layer, _input_shape) 321 | elif tl_layer['class'] == 'Flatten': 322 | return layer_flatten_translator(tl_layer, _input_shape) 323 | elif tl_layer['class'] == 'Dropout': 324 | return layer_dropout_translator(tl_layer, _input_shape) 325 | elif tl_layer['class'] == 'Dense': 326 | return layer_dense_translator(tl_layer, _input_shape) 327 | elif tl_layer['class'] == 'RNN': 328 | return layer_rnn_translator(tl_layer, _input_shape) 329 | return None 330 | ``` 331 | 332 | 以rnn层为例,我们设计了其config的转译方法。 333 | 334 | ```PYTHON 335 | def layer_rnn_translator(tl_layer, _input_shape=None): 336 | args = tl_layer['args'] 337 | name = args['name'] 338 | cell = args['cell'] 339 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'return_sequences': False, 340 | 'return_state': False, 'go_backwards': False, 'stateful': False, 'unroll': False, 'time_major': False, 341 | 'cell': cell 342 | } 343 | if _input_shape is not None: 344 | config['batch_input_shape'] = _input_shape 345 | result = {'class_name': 'RNN', 'config': config} 346 | return result 347 | ``` 348 | 349 | 
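转译完成后,可以用h5py简单检查生成的keras格式文件,确认模型结构和各层权重名称是否符合预期。下面是一段示意代码,假设转译输出为`model_h5/model_rnn_tl2k.hdf5`;不同h5py版本读出的属性可能是bytes或str:

```python
import json
import h5py

with h5py.File('./model_h5/model_rnn_tl2k.hdf5', 'r') as f:
    config = json.loads(f.attrs['model_config'])
    # 各层类型,应依次包含Masking、RNN、Dense等
    print([layer['class_name'] for layer in config['config']['layers']])
    # 各层权重组的名称
    print(list(f['model_weights'].attrs['layer_names']))
```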
按照main函数的顺序,分类器完成了训练和导出模型的全部步骤。 350 | 351 | ```python 352 | if __name__ == '__main__': 353 | 354 | masking_val = np.zeros(200) 355 | input_shape = None 356 | gradient_log_dir = 'logs/gradient_tape/' 357 | tensorboard = TensorBoard(log_dir = gradient_log_dir) 358 | 359 | # 定义log格式 360 | fmt = "%(asctime)s %(levelname)s %(message)s" 361 | logging.basicConfig(format=fmt, level=logging.INFO) 362 | 363 | # 加载数据 364 | x_train, y_train, x_test, y_test = load_dataset( 365 | ["../word2vec/output/sample_seq_pass.npz", 366 | "../word2vec/output/sample_seq_spam.npz"]) 367 | 368 | # 构建模型 369 | model = get_model(inputs_shape=[None, None, 200]) 370 | 371 | for index, layer in enumerate(model.config['model_architecture']): 372 | if layer['class'] == 'RNN': 373 | if 'cell' in layer['args']: 374 | model.config['model_architecture'][index]['args']['cell'] = '' 375 | 376 | current_time = datetime.datetime.now().strftime('%Y%m%d-%H%M%S') 377 | train_log_dir = gradient_log_dir + current_time + '/train' 378 | test_log_dir = gradient_log_dir + current_time + '/test' 379 | train_summary_writer = tf.summary.create_file_writer(train_log_dir) 380 | test_summary_writer = tf.summary.create_file_writer(test_log_dir) 381 | 382 | train(model) 383 | 384 | logging.info("Optimization Finished!") 385 | 386 | # h5保存和转译 387 | model_dir = './model_h5' 388 | if not os.path.exists(model_dir): 389 | os.mkdir(model_dir) 390 | tl_h5_path = model_dir + '/model_rnn_tl.hdf5' 391 | keras_h5_path = model_dir + '/model_rnn_tl2k.hdf5' 392 | tl.files.save_hdf5_graph(network=model, filepath=tl_h5_path, save_weights=True) 393 | translator_tl2_keras_h5(tl_h5_path, keras_h5_path) 394 | 395 | # 读取模型 396 | new_model = keras.models.load_model(keras_h5_path) 397 | x_test, y_test = format_convert(x_test, y_test) 398 | score = new_model.evaluate(x_test, y_test, batch_size=128) 399 | 400 | # 保存SavedModel可部署文件 401 | saved_model_version = 1 402 | saved_model_path = "./saved_models/rnn/" 403 | tf.saved_model.save(new_model, saved_model_path + str(saved_model_version)) 404 | ``` 405 | 406 | 最终我们将在`./saved_models/rnn`目录下看到导出模型的每个版本,实例中`model_version`被设置为1,因此创建了相应的子目录`./saved_models/rnn/1`。 407 | 408 | SavedModel目录具有以下结构。 409 | 410 | ``` 411 | assets/ 412 | variables/ 413 | variables.data-?????-of-????? 414 | variables.index 415 | saved_model.pb 416 | ``` 417 | 418 | 导出的模型在TensorFlow Serving中又被称为Servable,其中`saved_model.pb`保存了接口的数据交换格式,`variables`保存了模型的网络结构和参数,`assets`用来保存如词库等模型初始化所需的外部资源。本例没有用到外部资源,因此没有`assets`文件夹。 419 | 420 | #### 其他常用方法 421 | 422 | ##### MLP分类器 423 | 424 | 前文提到过,分类器还可以用NBOW+MLP(如图5所示)和CNN来实现。借助TensorLayer,我们可以很方便地重组网络。下面简单介绍这两种网络的结构及其实现。 425 | 426 | 由于词向量之间存在着线性平移的关系,如果相似词空间距离相近,那么在仅仅将文本中一个或几个词改成近义词的情况下,两个文本的词向量线性相加的结果也应该是非常接近的。 427 | 428 |
429 |  430 | <br>
431 | 图5 NBOW+MLP分类器 432 |
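NBOW所说的"词向量线性相加"本身非常简单,下面的示意代码用随机向量代替Word2vec训练出的200维词向量,演示如何把一条文本合成一个定长向量(词表和向量均为虚构,仅为说明):

```python
import numpy as np

# 虚构的词向量查询表:实际应来自word2vec训练结果,每个词对应一个200维向量
word_vectors = {w: np.random.randn(200) for w in ['渴望', '极限', '激情', '燃烧']}

# NBOW:把文本中所有词的词向量逐维相加,得到固定长度的文本向量
tokens = ['渴望', '极限', '激情', '燃烧']
text_vector = np.sum([word_vectors[w] for w in tokens], axis=0)
print(text_vector.shape)  # (200,)
```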
433 | 434 | 多层神经网络可以无限逼近任意函数,能够胜任复杂的非线性分类任务。下面的代码将Word2vec训练好的词向量线性相加,再通过三层全连接网络进行分类。 435 | ```python 436 | def get_model(input_shape, keep=0.5): 437 | """定义网络结构 438 | 439 | Args: 440 | inputs_shape: 输入数据的shape 441 | keep: 各层神经元激活比例 442 | keep=1.0: 关闭Dropout 443 | Returns: 444 | model: 定义好的网络结构 445 | """ 446 | ni = tl.layers.Input(input_shape, name='input_layer') 447 | nn = tl.layers.Dropout(keep=keep, name='drop1')(ni) 448 | nn = tl.layers.Dense(n_units=200, act=tf.nn.relu, name='relu1')(nn) 449 | nn = tl.layers.Dropout(keep=keep, name='drop2')(nn) 450 | nn = tl.layers.Dense(n_units=200, act=tf.nn.relu, name='relu2')(nn) 451 | nn = tl.layers.Dropout(keep=keep, name='drop3')(nn) 452 | nn = tl.layers.Dense(n_units=2, act=tf.nn.relu, name='output_layer')(nn) 453 | model = tl.models.Model(inputs=ni, outputs=nn, name='mlp') 454 | return model 455 | ``` 456 | ##### CNN分类器 457 | CNN卷积的过程捕捉了文本的局部相关性,在文本分类中也取得了不错的结果。图6演示了CNN分类过程。输入是一个由6维词向量组成的最大长度为11的文本,经过与4个3×6的卷积核进行卷积,得到4张9维的特征图。再对特征图每3块不重合区域进行最大池化,将结果合成一个12维的向量输入到全连接层。 458 | 459 |
460 |  461 | <br>
462 | 图6 CNN分类器 463 |
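图6中的各个维度可以用tf.keras快速核对:长度11、词向量维度6的输入经过4个宽度为3的卷积核得到4张9维特征图,再经过大小为3、步长为3的不重叠最大池化得到3×4,展平后正好是12维。以下示意代码仅用于核对维度,与下文的正式网络结构无关:

```python
import tensorflow as tf

x = tf.random.normal([1, 11, 6])                                # 1条样本:11个词,每个词6维
conv = tf.keras.layers.Conv1D(filters=4, kernel_size=3)(x)      # -> (1, 9, 4)
pool = tf.keras.layers.MaxPool1D(pool_size=3, strides=3)(conv)  # -> (1, 3, 4)
flat = tf.keras.layers.Flatten()(pool)                          # -> (1, 12)
print(conv.shape, pool.shape, flat.shape)
```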
464 | 465 | 下面代码中输入是一个由200维词向量组成的最大长度为20的文本(确定好文本的最大长度后,我们需要对输入进行截取或者填充)。卷积层参数[3, 200, 6]代表6个3×200的卷积核。这里使用1D CNN,是因为我们把文本序列看成一维数据,这意味着卷积的过程只会朝一个方向进行(同理,处理图片和小视频分别需要使用2D CNN和3D CNN)。卷积核宽度被设置为和词向量大小一致,确保了词向量作为最基本的元素不会被破坏。我们选取连续的3维作为池化区域,滑动步长取3,使得池化区域不重合,最后通过一个带Dropout的全连接层得到Softmax后的输出。 466 | ```python 467 | def get_model(inputs_shape, keep=0.5): 468 | """定义网络结构 469 | Args: 470 | inputs_shape: 输入数据的shape 471 | keep: 全连接层输入神经元激活比例 472 | keep=1.0: 关闭Dropout 473 | Returns: 474 | network: 定义好的网络结构 475 | """ 476 | ni = tl.layers.Input(inputs_shape, name='input_layer') 477 | nn = tl.layers.Conv1d(n_filter=6, filter_size=3, stride=2, in_channels=200, name='conv1d_1')(ni) 478 | # nn = tl.layers.Conv1dLayer(act=tf.nn.relu, shape=[3, 200, 6], name='cnn_layer1', padding='VALID')(ni) 479 | nn = tl.layers.MaxPool1d(filter_size=3, strides=3, name='pool_layer1')(nn) 480 | nn = tl.layers.Flatten(name='flatten_layer')(nn) 481 | nn = tl.layers.Dropout(keep=keep, name='drop1')(nn) 482 | nn = tl.layers.Dense(n_units=2, act=tf.nn.relu, name="output")(nn) 483 | 484 | model = tl.models.Model(inputs=ni, outputs=nn, name='cnn') 485 | return model 486 | ``` 487 | -------------------------------------------------------------------------------- /network/cnn_classifier.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import math 3 | import os 4 | import time 5 | import numpy as np 6 | import tensorflow as tf 7 | import tensorlayer as tl 8 | from sklearn.model_selection import train_test_split 9 | import h5py 10 | import json 11 | from tensorflow.python.util import serialization 12 | import tensorflow.keras as keras 13 | 14 | input_shape = None 15 | max_seq_len = 20 16 | 17 | 18 | def load_dataset(files, test_size=0.2): 19 | """加载样本并取test_size的比例做测试集 20 | Args: 21 | files: 样本文件目录集合 22 | 样本文件是包含了样本特征向量与标签的npy文件 23 | test_size: float 24 | 0.0到1.0之间,代表数据集中有多少比例抽做测试集 25 | Returns: 26 | X_train, y_train: 训练集特征列表和标签列表 27 | X_test, y_test: 测试集特征列表和标签列表 28 | """ 29 | x = [] 30 | y = [] 31 | for file in files: 32 | data = np.load(file, allow_pickle=True) 33 | if x == [] or y == []: 34 | x = data['x'] 35 | y = data['y'] 36 | else: 37 | x = np.append(x, data['x'], axis=0) 38 | y = np.append(y, data['y'], axis=0) 39 | 40 | x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size) 41 | return x_train, y_train, x_test, y_test 42 | 43 | 44 | def get_model(inputs_shape, keep=0.5): 45 | """定义网络结构 46 | 为了防止过拟合,我们在全连接层加了Dropout操作。参数keep决定了某一层神经元输出保留的比例。通过配置keep我们可以在训练的时候打开Dropout,在测试的时候关闭它。TensorFlow的tf.nn.dropout操作会自动根据keep调整激活的神经元的输出权重,使得我们无需在keep改变时手动调节输出权重。 47 | Args: 48 | inputs_shape: 输入数据的shape 49 | keep: 全连接层输入神经元激活比例 50 | keep=1.0: 关闭Dropout 51 | Returns: 52 | network: 定义好的网络结构 53 | """ 54 | ni = tl.layers.Input(inputs_shape, name='input_layer') 55 | nn = tl.layers.Conv1d(n_filter=6, filter_size=3, stride=2, in_channels=200, name='conv1d_1')(ni) 56 | # nn = tl.layers.Conv1dLayer(act=tf.nn.relu, shape=[3, 200, 6], name='cnn_layer1', padding='VALID')(ni) 57 | nn = tl.layers.MaxPool1d(filter_size=3, strides=3, name='pool_layer1')(nn) 58 | nn = tl.layers.Flatten(name='flatten_layer')(nn) 59 | nn = tl.layers.Dropout(keep=keep, name='drop1')(nn) 60 | nn = tl.layers.Dense(n_units=2, act=tf.nn.relu, name="output")(nn) 61 | 62 | model = tl.models.Model(inputs=ni, outputs=nn, name='cnn') 63 | return model 64 | 65 | 66 | def accuracy(y_pred, y_true): 67 | """ 68 | 计算预测精准度accuracy 69 | :param y_pred: 模型预测结果 70 | :param y_true: 真实结果 ground truth 71 | :return: 
精准度acc 72 | """ 73 | # Predicted class is the index of highest score in prediction vector (i.e. argmax). 74 | correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64)) 75 | return tf.reduce_mean(tf.cast(correct_prediction, tf.float32), axis=-1) 76 | 77 | 78 | def train(model): 79 | # 训练模型 80 | learning_rate = 0.005 81 | n_epoch = 100 82 | batch_size = 128 83 | display_step = 10 84 | loss_vals = [] 85 | acc_vals = [] 86 | optimizer = tf.optimizers.Adam(learning_rate=learning_rate) 87 | 88 | logging.info("batch_size: %d", batch_size) 89 | logging.info("Start training the network...") 90 | 91 | for epoch in range(n_epoch): 92 | step = 0 93 | total_step = math.ceil(len(x_train) / batch_size) 94 | 95 | # 利用训练集训练 96 | model.train() 97 | for batch_x, batch_y in tl.iterate.minibatches(x_train, y_train, batch_size, shuffle=True): 98 | 99 | start_time = time.time() 100 | batch_y = batch_y.astype(np.int32) 101 | for i, d in enumerate(batch_x): 102 | batch_x[i] = batch_x[i][:max_seq_len] 103 | batch_x[i] += [tf.convert_to_tensor(np.zeros(200), dtype=tf.float32) for i in 104 | range(max_seq_len - len(d))] 105 | batch_x[i] = tf.convert_to_tensor(batch_x[i], dtype=tf.float32) 106 | batch_x = list(batch_x) 107 | batch_x = tf.convert_to_tensor(batch_x, dtype=tf.float32) 108 | 109 | with tf.GradientTape() as tape: 110 | _y = model(batch_x) 111 | loss_val = tf.nn.sparse_softmax_cross_entropy_with_logits(batch_y, _y, name='train_loss') 112 | loss_val = tf.reduce_mean(loss_val) 113 | grad = tape.gradient(loss_val, model.trainable_weights) 114 | optimizer.apply_gradients(zip(grad, model.trainable_weights)) 115 | 116 | loss_vals.append(loss_val) 117 | acc_vals.append(accuracy(_y, batch_y)) 118 | 119 | if step + 1 == 1 or (step + 1) % display_step == 0: 120 | logging.info("Epoch {}/{},Step {}/{}, took {}".format(epoch + 1, n_epoch, step, total_step, 121 | time.time() - start_time)) 122 | loss = sum(loss_vals) / len(loss_vals) 123 | acc = sum(acc_vals) / len(acc_vals) 124 | del loss_vals[:] 125 | del acc_vals[:] 126 | logging.info("Minibatch Loss= " + "{:.6f}".format(loss) + ", Training Accuracy= " + "{:.5f}".format(acc)) 127 | step += 1 128 | 129 | # 利用测试集评估 130 | model.eval() 131 | test_loss, test_acc, n_iter = 0, 0, 0 132 | for batch_x, batch_y in tl.iterate.minibatches(x_test, y_test, batch_size, shuffle=True): 133 | batch_y = batch_y.astype(np.int32) 134 | for i, d in enumerate(batch_x): 135 | batch_x[i] = batch_x[i][:max_seq_len] 136 | batch_x[i] += [tf.convert_to_tensor(np.zeros(200), dtype=tf.float32) for i in 137 | range(max_seq_len - len(d))] 138 | batch_x[i] = tf.convert_to_tensor(batch_x[i], dtype=tf.float32) 139 | # ValueError: setting an array element with a sequence. 
140 | batch_x = list(batch_x) 141 | batch_x = tf.convert_to_tensor(batch_x, dtype=tf.float32) 142 | 143 | _y = model(batch_x) 144 | 145 | loss_val = tf.nn.sparse_softmax_cross_entropy_with_logits(batch_y, _y, name='test_loss') 146 | loss_val = tf.reduce_mean(loss_val) 147 | 148 | test_loss += loss_val 149 | test_acc += accuracy(_y, batch_y) 150 | n_iter += 1 151 | logging.info(" test loss: {}".format(test_loss / n_iter)) 152 | logging.info(" test acc: {}".format(test_acc / n_iter)) 153 | 154 | 155 | def layer_conv1d_translator(tl_layer, _input_shape=None): 156 | args = tl_layer['args'] 157 | name = args['name'] 158 | filters = args['n_filter'] 159 | kernel_size = [args['filter_size']] 160 | strides = [args['stride']] 161 | padding = args['padding'] 162 | data_format = args['data_format'] 163 | dilation_rate = [args['dilation_rate']] 164 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'filters': filters, 165 | 'kernel_size': kernel_size, 'strides': strides, 'padding': padding, 'data_format': data_format, 166 | 'dilation_rate': dilation_rate, 'activation': None, 'use_bias': True, 167 | 'kernel_initializer': {'class_name': 'GlorotUniform', 168 | 'config': {'seed': None} 169 | }, 170 | 'bias_initializer': {'class_name': 'Zeros', 171 | 'config': {} 172 | }, 173 | 'kernel_regularizer': None, 'bias_regularizer': None, 174 | 'activity_regularizer': None, 'kernel_constraint': None, 'bias_constraint': None} 175 | 176 | if _input_shape is not None: 177 | config['batch_input_shape'] = _input_shape 178 | result = {'class_name': 'Conv1D', 'config': config} 179 | return result 180 | 181 | 182 | def layer_maxpooling1d_translator(tl_layer, _input_shape=None): 183 | args = tl_layer['args'] 184 | name = args['name'] 185 | pool_size = [args['filter_size']] 186 | strides = [args['strides']] 187 | padding = args['padding'] 188 | data_format = args['data_format'] 189 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'strides': strides, 'pool_size': pool_size, 190 | 'padding': padding, 'data_format': data_format} 191 | if _input_shape is not None: 192 | config['batch_input_shape'] = _input_shape 193 | result = {'class_name': 'MaxPooling1D', 'config': config} 194 | return result 195 | 196 | 197 | def layer_flatten_translator(tl_layer, _input_shape=None): 198 | args = tl_layer['args'] 199 | name = args['name'] 200 | 201 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'data_format': 'channels_last'} 202 | if _input_shape is not None: 203 | config['batch_input_shape'] = _input_shape 204 | result = {'class_name': 'Flatten', 'config': config} 205 | return result 206 | 207 | 208 | def layer_dropout_translator(tl_layer, _input_shape=None): 209 | args = tl_layer['args'] 210 | name = args['name'] 211 | rate = 1-args['keep'] 212 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'rate': rate, 'noise_shape': None, 'seed': None} 213 | if _input_shape is not None: 214 | config['batch_input_shape'] = _input_shape 215 | result = {'class_name': 'Dropout', 'config': config} 216 | return result 217 | 218 | 219 | def layer_dense_translator(tl_layer, _input_shape=None): 220 | args = tl_layer['args'] 221 | name = args['name'] 222 | units = args['n_units'] 223 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'units': units, 'activation': 'relu', 'use_bias': True, 224 | 'kernel_initializer': {'class_name': 'GlorotUniform', 'config': {'seed': None}}, 225 | 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 226 | 'kernel_regularizer': None, 227 | 
'bias_regularizer': None, 228 | 'activity_regularizer': None, 229 | 'kernel_constraint': None, 230 | 'bias_constraint': None} 231 | 232 | if _input_shape is not None: 233 | config['batch_input_shape'] = _input_shape 234 | result = {'class_name': 'Dense', 'config': config} 235 | return result 236 | 237 | 238 | def layer_rnn_translator(tl_layer, _input_shape=None): 239 | args = tl_layer['args'] 240 | name = args['name'] 241 | cell = args['cell'] 242 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'return_sequences': False, 243 | 'return_state': False, 'go_backwards': False, 'stateful': False, 'unroll': False, 'time_major': False, 244 | 'cell': cell 245 | } 246 | if _input_shape is not None: 247 | config['batch_input_shape'] = _input_shape 248 | result = {'class_name': 'RNN', 'config': config} 249 | return result 250 | 251 | 252 | def layer_translator(tl_layer, is_first_layer=False): 253 | _input_shape = None 254 | global input_shape 255 | if is_first_layer: 256 | _input_shape = input_shape 257 | if tl_layer['class'] == '_InputLayer': 258 | input_shape = tl_layer['args']['shape'] 259 | elif tl_layer['class'] == 'Conv1d': 260 | return layer_conv1d_translator(tl_layer, _input_shape) 261 | elif tl_layer['class'] == 'MaxPool1d': 262 | return layer_maxpooling1d_translator(tl_layer, _input_shape) 263 | elif tl_layer['class'] == 'Flatten': 264 | return layer_flatten_translator(tl_layer, _input_shape) 265 | elif tl_layer['class'] == 'Dropout': 266 | return layer_dropout_translator(tl_layer, _input_shape) 267 | elif tl_layer['class'] == 'Dense': 268 | return layer_dense_translator(tl_layer, _input_shape) 269 | elif tl_layer['class'] == 'RNN': 270 | return layer_rnn_translator(tl_layer, _input_shape) 271 | return None 272 | 273 | 274 | def config_translator(f_tl, f_k): 275 | tl_model_config = f_tl.attrs['model_config'].decode('utf8') 276 | tl_model_config = eval(tl_model_config) 277 | tl_model_architecture = tl_model_config['model_architecture'] 278 | 279 | k_layers = [] 280 | for key, tl_layer in enumerate(tl_model_architecture): 281 | if key == 1: 282 | k_layer = layer_translator(tl_layer, is_first_layer=True) 283 | else: 284 | k_layer = layer_translator(tl_layer) 285 | if k_layer is not None: 286 | k_layers.append(k_layer) 287 | f_k.attrs['model_config'] = json.dumps({'class_name': 'Sequential', 288 | 'config': {'name': 'sequential', 'layers': k_layers}, 289 | 'build_input_shape': input_shape}, 290 | default=serialization.get_json_type).encode('utf8') 291 | f_k.attrs['backend'] = keras.backend.backend().encode('utf8') 292 | f_k.attrs['keras_version'] = str(keras.__version__).encode('utf8') 293 | 294 | # todo: translate the 'training_config' 295 | training_config = {'loss': {'class_name': 'SparseCategoricalCrossentropy', 296 | 'config': {'reduction': 'auto', 'name': 'sparse_categorical_crossentropy', 297 | 'from_logits': False}}, 298 | 'metrics': ['accuracy'], 'weighted_metrics': None, 'loss_weights': None, 'sample_weight_mode': None, 299 | 'optimizer_config': {'class_name': 'Adam', 300 | 'config': {'name': 'Adam', 'learning_rate': 0.01, 'decay': 0.0, 301 | 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False 302 | } 303 | } 304 | } 305 | 306 | f_k.attrs['training_config'] = json.dumps(training_config, default=serialization.get_json_type).encode('utf8') 307 | 308 | 309 | def weights_translator(f_tl, f_k): 310 | # todo: delete inputlayer 311 | if 'model_weights' not in f_k.keys(): 312 | f_k_model_weights = f_k.create_group('model_weights') 313 | else: 314 | 
f_k_model_weights = f_k['model_weights'] 315 | for key in f_tl.keys(): 316 | if key not in f_k_model_weights.keys(): 317 | f_k_model_weights.create_group(key) 318 | try: 319 | f_tl_para = f_tl[key][key] 320 | except KeyError: 321 | pass 322 | else: 323 | if key not in f_k_model_weights[key].keys(): 324 | f_k_model_weights[key].create_group(key) 325 | weight_names = [] 326 | f_k_para = f_k_model_weights[key][key] 327 | # todo:对RNN层的weights进行通用适配 328 | cell_name = '' 329 | if key == 'rnn_1': 330 | cell_name = 'lstm_cell' 331 | f_k_para.create_group(cell_name) 332 | f_k_para = f_k_para[cell_name] 333 | f_k_model_weights.create_group('masking') 334 | f_k_model_weights['masking'].attrs['weight_names'] = [] 335 | for k in f_tl_para: 336 | if k == 'biases:0' or k == 'bias:0': 337 | weight_name = 'bias:0' 338 | elif k == 'filters:0' or k == 'weights:0' or k == 'kernel:0': 339 | weight_name = 'kernel:0' 340 | elif k == 'recurrent_kernel:0': 341 | weight_name = 'recurrent_kernel:0' 342 | else: 343 | raise Exception("cant find the parameter '{}' in tensorlayer".format(k)) 344 | if weight_name in f_k_para: 345 | del f_k_para[weight_name] 346 | f_k_para.create_dataset(name=weight_name, data=f_tl_para[k][:], 347 | shape=f_tl_para[k].shape) 348 | # todo:对RNN层的weights进行通用适配 349 | if key == 'rnn_1': 350 | weight_names.append('{}/{}/{}'.format(key, cell_name, weight_name).encode('utf8')) 351 | else: 352 | weight_names.append('{}/{}'.format(key, weight_name).encode('utf8')) 353 | 354 | weight_names.reverse() # todo: 临时解决了参数顺序和keras不一致的问题 355 | f_k_model_weights[key].attrs['weight_names'] = weight_names 356 | f_k_model_weights.attrs['backend'] = keras.backend.backend().encode('utf8') 357 | f_k_model_weights.attrs['keras_version'] = str(keras.__version__).encode('utf8') 358 | f_k_model_weights.attrs['layer_names'] = [key.encode('utf8') for key in f_tl.keys()] 359 | 360 | 361 | def translator_tl2_keras_h5(_tl_h5_path, _keras_h5_path): 362 | f_tl_ = h5py.File(_tl_h5_path, 'r+') 363 | f_k_ = h5py.File(_keras_h5_path, 'a') 364 | f_k_.clear() 365 | weights_translator(f_tl_, f_k_) 366 | config_translator(f_tl_, f_k_) 367 | f_tl_.close() 368 | f_k_.close() 369 | 370 | 371 | def format_convert(x, y): 372 | y = y.astype(np.int32) 373 | for i, d in enumerate(x): 374 | x[i] = x[i][:max_seq_len] 375 | x[i] += [tf.convert_to_tensor(np.zeros(200), dtype=tf.float32) for i in range(max_seq_len - len(d))] 376 | x[i] = tf.convert_to_tensor(x[i], dtype=tf.float32) 377 | x = list(x) 378 | x = tf.convert_to_tensor(x, dtype=tf.float32) 379 | return x, y 380 | 381 | 382 | if __name__ == '__main__': 383 | 384 | # 定义log格式 385 | fmt = "%(asctime)s %(levelname)s %(message)s" 386 | logging.basicConfig(format=fmt, level=logging.INFO) 387 | 388 | # 加载数据 389 | x_train, y_train, x_test, y_test = load_dataset( 390 | ["../word2vec/output/sample_seq_pass.npz", 391 | "../word2vec/output/sample_seq_spam.npz"]) 392 | 393 | # 构建模型 394 | model = get_model(inputs_shape=[None, 20, 200]) 395 | 396 | # 开始训练 397 | train(model) 398 | logging.info("Optimization Finished!") 399 | 400 | # h5保存和转译 401 | model_dir = './model_h5' 402 | if not os.path.exists(model_dir): 403 | os.mkdir(model_dir) 404 | tl_h5_path = model_dir + '/model_cnn_tl.hdf5' 405 | keras_h5_path = model_dir + '/model_cnn_tl2k.hdf5' 406 | tl.files.save_hdf5_graph(network=model, filepath=tl_h5_path, save_weights=True) 407 | translator_tl2_keras_h5(tl_h5_path, keras_h5_path) 408 | 409 | # 读取模型 410 | new_model = keras.models.load_model(keras_h5_path) 411 | x_test, y_test = format_convert(x_test, 
y_test) 412 | score = new_model.evaluate(x_test, y_test, batch_size=128) 413 | 414 | # 保存SavedModel可部署文件 415 | saved_model_version = 1 416 | saved_model_path = "./saved_models/cnn/" 417 | tf.saved_model.save(new_model, saved_model_path + str(saved_model_version)) 418 | 419 | -------------------------------------------------------------------------------- /network/mlp_classifier.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python3 2 | """训练Neural Bag-of-Words (NBOW) + Multilayer Perceptron (MLP)分类器 3 | 4 | Bag of Tricks for Efficient Text Classification: https://arxiv.org/pdf/1607.01759.pdf 5 | Neural Bag-of-Ngrams: http://iir.ruc.edu.cn/~libf/papers/AAAI-17.pdf 6 | 7 | """ 8 | import logging 9 | import math 10 | import os 11 | import time 12 | import numpy as np 13 | import tensorflow as tf 14 | import tensorlayer as tl 15 | from sklearn.model_selection import train_test_split 16 | import h5py 17 | import json 18 | from tensorflow.python.util import serialization 19 | import tensorflow.keras as keras 20 | 21 | 22 | def load_dataset(files, test_size=0.2): 23 | """加载样本并取test_size的比例做测试集 24 | Args: 25 | files: 样本文件目录集合 26 | 样本文件是包含了样本特征向量与标签的npy文件 27 | test_size: float 28 | 0.0到1.0之间,代表数据集中有多少比例抽做测试集 29 | Returns: 30 | X_train, y_train: 训练集特征列表和标签列表 31 | X_test, y_test: 测试集特征列表和标签列表 32 | """ 33 | x = [] 34 | y = [] 35 | for file in files: 36 | data = np.load(file) 37 | if x == [] or y == []: 38 | x = data['x'] 39 | y = data['y'] 40 | else: 41 | x = np.append(x, data['x'], axis=0) 42 | y = np.append(y, data['y'], axis=0) 43 | 44 | x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size) 45 | return x_train, y_train, x_test, y_test 46 | 47 | 48 | def get_model(input_shape, keep=0.5): 49 | """定义网络结构 50 | 为了防止过拟合,我们在读取前一层输出之前都加了Dropout操作。参数keep决定了某一层神经元输出保留的比例。通过配置keep我们可以在训练的时候打开Dropout,在测试的时候关闭它。TensorFlow的tf.nn.dropout操作会自动根据keep调整激活的神经元的输出权重,使得我们无需在keep改变时手动调节输出权重。 51 | Args: 52 | inputs_shape: 输入数据的shape 53 | keep: 各层神经元激活比例 54 | keep=1.0: 关闭Dropout 55 | Returns: 56 | model: 定义好的网络结构 57 | """ 58 | ni = tl.layers.Input(input_shape, name='input_layer') 59 | nn = tl.layers.Dropout(keep=keep, name='drop1')(ni) 60 | nn = tl.layers.Dense(n_units=200, act=tf.nn.relu, name='relu1')(nn) 61 | nn = tl.layers.Dropout(keep=keep, name='drop2')(nn) 62 | nn = tl.layers.Dense(n_units=200, act=tf.nn.relu, name='relu2')(nn) 63 | nn = tl.layers.Dropout(keep=keep, name='drop3')(nn) 64 | nn = tl.layers.Dense(n_units=2, act=tf.nn.relu, name='output_layer')(nn) 65 | model = tl.models.Model(inputs=ni, outputs=nn, name='mlp') 66 | return model 67 | 68 | 69 | def accuracy(y_pred, y_true): 70 | """ 71 | 计算预测精准度accuracy 72 | :param y_pred: 模型预测结果 73 | :param y_true: 真实结果 ground truth 74 | :return: 精准度acc 75 | """ 76 | # Predicted class is the index of highest score in prediction vector (i.e. argmax). 
77 | correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64)) 78 | return tf.reduce_mean(tf.cast(correct_prediction, tf.float32), axis=-1) 79 | 80 | 81 | def train(model): 82 | # 训练网络 83 | learning_rate = 0.01 84 | n_epoch = 50 85 | batch_size = 128 86 | loss_vals = [] 87 | acc_vals = [] 88 | display_step = 10 89 | logging.info("batch_size: %d", batch_size) 90 | logging.info("Start training the network...") 91 | optimizer = tf.optimizers.Adam(learning_rate=learning_rate) 92 | model.train() 93 | 94 | for epoch in range(n_epoch): 95 | step = 0 96 | total_step = math.ceil(len(x_train) / batch_size) 97 | start_time = time.time() 98 | for X_train_a, y_train_a in tl.iterate.minibatches(x_train, y_train, batch_size, shuffle=True): 99 | X_train_a = tf.convert_to_tensor(X_train_a, dtype=tf.float32) 100 | 101 | with tf.GradientTape() as tape: 102 | _y = model(X_train_a) 103 | loss_val = tf.nn.sparse_softmax_cross_entropy_with_logits(y_train_a.astype(np.int32), _y, name='train_loss') 104 | loss_val = tf.reduce_mean(loss_val) 105 | grad = tape.gradient(loss_val, model.trainable_weights) 106 | optimizer.apply_gradients(zip(grad, model.trainable_weights)) 107 | 108 | loss_vals.append(loss_val) 109 | acc_vals.append(accuracy(_y, y_train_a)) 110 | if step + 1 == 1 or (step + 1) % display_step == 0: 111 | logging.info("Epoch {}/{},Step {}/{}, took {}".format(epoch + 1, n_epoch, step, total_step, 112 | time.time() - start_time)) 113 | loss = sum(loss_vals) / len(loss_vals) 114 | acc = sum(acc_vals) / len(acc_vals) 115 | del loss_vals[:] 116 | del acc_vals[:] 117 | logging.info("Minibatch Loss= " + "{:.6f}".format(loss) + ", Training Accuracy= " + "{:.5f}".format(acc)) 118 | step += 1 119 | 120 | model.eval() 121 | test_loss, test_acc, n_iter = 0, 0, 0 122 | for batch_x, batch_y in tl.iterate.minibatches(x_test, y_test, batch_size, shuffle=True): 123 | batch_y = batch_y.astype(np.int32) 124 | _y = model(batch_x) 125 | 126 | loss_val = tf.nn.sparse_softmax_cross_entropy_with_logits(batch_y, _y, name='test_loss') 127 | loss_val = tf.reduce_mean(loss_val) 128 | 129 | test_loss += loss_val 130 | test_acc += accuracy(_y, batch_y) 131 | n_iter += 1 132 | logging.info(" test loss: {}".format(test_loss / n_iter)) 133 | logging.info(" test acc: {}".format(test_acc / n_iter)) 134 | 135 | 136 | def layer_conv1d_translator(tl_layer, _input_shape=None): 137 | args = tl_layer['args'] 138 | name = args['name'] 139 | filters = args['n_filter'] 140 | kernel_size = [args['filter_size']] 141 | strides = [args['stride']] 142 | padding = args['padding'] 143 | data_format = args['data_format'] 144 | dilation_rate = [args['dilation_rate']] 145 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'filters': filters, 146 | 'kernel_size': kernel_size, 'strides': strides, 'padding': padding, 'data_format': data_format, 147 | 'dilation_rate': dilation_rate, 'activation': 'relu', 'use_bias': True, 148 | 'kernel_initializer': {'class_name': 'GlorotUniform', 149 | 'config': {'seed': None} 150 | }, 151 | 'bias_initializer': {'class_name': 'Zeros', 152 | 'config': {} 153 | }, 154 | 'kernel_regularizer': None, 'bias_regularizer': None, 155 | 'activity_regularizer': None, 'kernel_constraint': None, 'bias_constraint': None} 156 | 157 | if _input_shape is not None: 158 | config['batch_input_shape'] = _input_shape 159 | result = {'class_name': 'Conv1D', 'config': config} 160 | return result 161 | 162 | 163 | def layer_maxpooling1d_translator(tl_layer, _input_shape=None): 164 | args = tl_layer['args'] 165 | name 
= args['name'] 166 | pool_size = [args['filter_size']] 167 | strides = [args['strides']] 168 | padding = args['padding'] 169 | data_format = args['data_format'] 170 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'strides': strides, 'pool_size': pool_size, 171 | 'padding': padding, 'data_format': data_format} 172 | if _input_shape is not None: 173 | config['batch_input_shape'] = _input_shape 174 | result = {'class_name': 'MaxPooling1D', 'config': config} 175 | return result 176 | 177 | 178 | def layer_flatten_translator(tl_layer, _input_shape=None): 179 | args = tl_layer['args'] 180 | name = args['name'] 181 | 182 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'data_format': 'channels_last'} 183 | if _input_shape is not None: 184 | config['batch_input_shape'] = _input_shape 185 | result = {'class_name': 'Flatten', 'config': config} 186 | return result 187 | 188 | 189 | def layer_dropout_translator(tl_layer, _input_shape=None): 190 | args = tl_layer['args'] 191 | name = args['name'] 192 | rate = 1-args['keep'] 193 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'rate': rate, 'noise_shape': None, 'seed': None} 194 | if _input_shape is not None: 195 | config['batch_input_shape'] = _input_shape 196 | result = {'class_name': 'Dropout', 'config': config} 197 | return result 198 | 199 | 200 | def layer_dense_translator(tl_layer, _input_shape=None): 201 | args = tl_layer['args'] 202 | name = args['name'] 203 | units = args['n_units'] 204 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'units': units, 'activation': 'relu', 'use_bias': True, 205 | 'kernel_initializer': {'class_name': 'GlorotUniform', 'config': {'seed': None}}, 206 | 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 207 | 'kernel_regularizer': None, 208 | 'bias_regularizer': None, 209 | 'activity_regularizer': None, 210 | 'kernel_constraint': None, 211 | 'bias_constraint': None} 212 | 213 | if _input_shape is not None: 214 | config['batch_input_shape'] = _input_shape 215 | result = {'class_name': 'Dense', 'config': config} 216 | return result 217 | 218 | 219 | def layer_rnn_translator(tl_layer, _input_shape=None): 220 | args = tl_layer['args'] 221 | name = args['name'] 222 | cell = {'class_name': 'LSTMCell', 'config': {'name': 'lstm_cell', 'trainable': True, 'dtype': 'float32', 223 | 'units': 64, 'activation': 'tanh', 'recurrent_activation': 'sigmoid', 224 | 'use_bias': True, 225 | 'kernel_initializer': {'class_name': 'GlorotUniform', 226 | 'config': {'seed': None}}, 227 | 'recurrent_initializer': {'class_name': 'Orthogonal', 228 | 'config': {'gain': 1.0, 'seed': None}}, 229 | 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 230 | 'unit_forget_bias': True, 'kernel_regularizer': None, 231 | 'recurrent_regularizer': None, 'bias_regularizer': None, 232 | 'kernel_constraint': None, 'recurrent_constraint': None, 233 | 'bias_constraint': None, 'dropout': 0.0, 'recurrent_dropout': 0.2, 234 | 'implementation': 1}} 235 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'return_sequences': False, 236 | 'return_state': False, 'go_backwards': False, 'stateful': False, 'unroll': False, 'time_major': False, 237 | 'cell': cell 238 | } 239 | if _input_shape is not None: 240 | config['batch_input_shape'] = _input_shape 241 | result = {'class_name': 'RNN', 'config': config} 242 | return result 243 | 244 | 245 | def layer_translator(tl_layer, is_first_layer=False): 246 | _input_shape = None 247 | global input_shape 248 | if is_first_layer: 249 | 
_input_shape = input_shape 250 | if tl_layer['class'] == '_InputLayer': 251 | input_shape = tl_layer['args']['shape'] 252 | elif tl_layer['class'] == 'Conv1d': 253 | return layer_conv1d_translator(tl_layer, _input_shape) 254 | elif tl_layer['class'] == 'MaxPool1d': 255 | return layer_maxpooling1d_translator(tl_layer, _input_shape) 256 | elif tl_layer['class'] == 'Flatten': 257 | return layer_flatten_translator(tl_layer, _input_shape) 258 | elif tl_layer['class'] == 'Dropout': 259 | return layer_dropout_translator(tl_layer, _input_shape) 260 | elif tl_layer['class'] == 'Dense': 261 | return layer_dense_translator(tl_layer, _input_shape) 262 | elif tl_layer['class'] == 'RNN': 263 | return layer_rnn_translator(tl_layer, _input_shape) 264 | return None 265 | 266 | 267 | def config_translator(f_tl, f_k): 268 | tl_model_config = f_tl.attrs['model_config'].decode('utf8') 269 | tl_model_config = eval(tl_model_config) 270 | tl_model_architecture = tl_model_config['model_architecture'] 271 | 272 | k_layers = [] 273 | for key, tl_layer in enumerate(tl_model_architecture): 274 | if key == 1: 275 | k_layer = layer_translator(tl_layer, is_first_layer=True) 276 | else: 277 | k_layer = layer_translator(tl_layer) 278 | if k_layer is not None: 279 | k_layers.append(k_layer) 280 | f_k.attrs['model_config'] = json.dumps({'class_name': 'Sequential', 281 | 'config': {'name': 'sequential', 'layers': k_layers}, 282 | 'build_input_shape': input_shape}, 283 | default=serialization.get_json_type).encode('utf8') 284 | f_k.attrs['backend'] = keras.backend.backend().encode('utf8') 285 | f_k.attrs['keras_version'] = str(keras.__version__).encode('utf8') 286 | 287 | # todo: translate the 'training_config' 288 | training_config = {'loss': {'class_name': 'SparseCategoricalCrossentropy', 289 | 'config': {'reduction': 'auto', 'name': 'sparse_categorical_crossentropy', 290 | 'from_logits': False}}, 291 | 'metrics': ['accuracy'], 'weighted_metrics': None, 'loss_weights': None, 'sample_weight_mode': None, 292 | 'optimizer_config': {'class_name': 'Adam', 293 | 'config': {'name': 'Adam', 'learning_rate': 0.01, 'decay': 0.0, 294 | 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False 295 | } 296 | } 297 | } 298 | 299 | f_k.attrs['training_config'] = json.dumps(training_config, default=serialization.get_json_type).encode('utf8') 300 | 301 | 302 | def weights_translator(f_tl, f_k): 303 | # todo: delete inputlayer 304 | if 'model_weights' not in f_k.keys(): 305 | f_k_model_weights = f_k.create_group('model_weights') 306 | else: 307 | f_k_model_weights = f_k['model_weights'] 308 | for key in f_tl.keys(): 309 | if key not in f_k_model_weights.keys(): 310 | f_k_model_weights.create_group(key) 311 | try: 312 | f_tl_para = f_tl[key][key] 313 | except KeyError: 314 | pass 315 | else: 316 | if key not in f_k_model_weights[key].keys(): 317 | f_k_model_weights[key].create_group(key) 318 | weight_names = [] 319 | f_k_para = f_k_model_weights[key][key] 320 | # todo:对RNN层的weights进行通用适配 321 | cell_name = '' 322 | if key == 'rnn_1': 323 | cell_name = 'lstm_cell' 324 | f_k_para.create_group(cell_name) 325 | f_k_para = f_k_para[cell_name] 326 | f_k_model_weights.create_group('masking') 327 | f_k_model_weights['masking'].attrs['weight_names'] = [] 328 | for k in f_tl_para: 329 | if k == 'biases:0' or k == 'bias:0': 330 | weight_name = 'bias:0' 331 | elif k == 'filters:0' or k == 'weights:0' or k == 'kernel:0': 332 | weight_name = 'kernel:0' 333 | elif k == 'recurrent_kernel:0': 334 | weight_name = 'recurrent_kernel:0' 335 | else: 336 | 
raise Exception("cant find the parameter '{}' in tensorlayer".format(k)) 337 | if weight_name in f_k_para: 338 | del f_k_para[weight_name] 339 | f_k_para.create_dataset(name=weight_name, data=f_tl_para[k][:], 340 | shape=f_tl_para[k].shape) 341 | 342 | weight_names = [] 343 | for weight_name in f_tl[key].attrs['weight_names']: 344 | weight_name = weight_name.decode('utf8') 345 | weight_name = weight_name.split('/') 346 | k = weight_name[-1] 347 | if k == 'biases:0' or k == 'bias:0': 348 | weight_name[-1] = 'bias:0' 349 | elif k == 'filters:0' or k == 'weights:0' or k == 'kernel:0': 350 | weight_name[-1] = 'kernel:0' 351 | elif k == 'recurrent_kernel:0': 352 | weight_name[-1] = 'recurrent_kernel:0' 353 | else: 354 | raise Exception("cant find the parameter '{}' in tensorlayer".format(k)) 355 | if key == 'rnn_1': 356 | weight_name.insert(-1, 'lstm_cell') 357 | weight_name = '/'.join(weight_name) 358 | weight_names.append(weight_name.encode('utf8')) 359 | f_k_model_weights[key].attrs['weight_names'] = weight_names 360 | 361 | f_k_model_weights.attrs['backend'] = keras.backend.backend().encode('utf8') 362 | f_k_model_weights.attrs['keras_version'] = str(keras.__version__).encode('utf8') 363 | 364 | f_k_model_weights.attrs['layer_names'] = [i for i in f_tl.attrs['layer_names']] 365 | 366 | 367 | def translator_tl2_keras_h5(_tl_h5_path, _keras_h5_path): 368 | f_tl_ = h5py.File(_tl_h5_path, 'r+') 369 | f_k_ = h5py.File(_keras_h5_path, 'a') 370 | f_k_.clear() 371 | weights_translator(f_tl_, f_k_) 372 | config_translator(f_tl_, f_k_) 373 | f_tl_.close() 374 | f_k_.close() 375 | 376 | 377 | if __name__ == '__main__': 378 | # 定义log格式 379 | fmt = "%(asctime)s %(levelname)s %(message)s" 380 | logging.basicConfig(format=fmt, level=logging.INFO) 381 | 382 | # 加载数据 383 | x_train, y_train, x_test, y_test = load_dataset( 384 | ["../word2vec/output/sample_pass.npz", 385 | "../word2vec/output/sample_spam.npz"]) 386 | 387 | # 构建模型 388 | input_shape = [None, 200] 389 | model = get_model(input_shape) 390 | 391 | # 开始训练 392 | train(model) 393 | logging.info("Optimization Finished!") 394 | 395 | # h5保存和转译 396 | model_dir = './model_h5' 397 | if not os.path.exists(model_dir): 398 | os.mkdir(model_dir) 399 | tl_h5_path = model_dir + '/model_mlp_tl.hdf5' 400 | keras_h5_path = model_dir + '/model_mlp_tl2k.hdf5' 401 | tl.files.save_hdf5_graph(network=model, filepath=tl_h5_path, save_weights=True) 402 | translator_tl2_keras_h5(tl_h5_path, keras_h5_path) 403 | 404 | # 读取模型 405 | new_model = keras.models.load_model(keras_h5_path) 406 | score = new_model.evaluate(x_test, y_test, batch_size=128) 407 | 408 | # 保存SavedModel可部署文件 409 | saved_model_version = 1 410 | saved_model_path = "./saved_models/mlp/" 411 | tf.saved_model.save(new_model, saved_model_path + str(saved_model_version)) -------------------------------------------------------------------------------- /network/rnn_classifier.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import math 3 | import os 4 | import time 5 | import numpy as np 6 | import tensorflow as tf 7 | import tensorlayer as tl 8 | from sklearn.model_selection import train_test_split 9 | import h5py 10 | import json 11 | from tensorflow.python.util import serialization 12 | from tensorflow.keras.callbacks import TensorBoard 13 | import tensorflow.keras as keras 14 | import datetime 15 | 16 | 17 | def load_dataset(files, test_size=0.2): 18 | """加载样本并取test_size的比例做测试集 19 | Args: 20 | files: 样本文件目录集合 21 | 样本文件是包含了样本特征向量与标签的npy文件 22 | 
test_size: float 23 | 0.0到1.0之间,代表数据集中有多少比例抽做测试集 24 | Returns: 25 | X_train, y_train: 训练集特征列表和标签列表 26 | X_test, y_test: 测试集特征列表和标签列表 27 | """ 28 | x = [] 29 | y = [] 30 | for file in files: 31 | data = np.load(file, allow_pickle=True) 32 | if x == [] or y == []: 33 | x = data['x'] 34 | y = data['y'] 35 | else: 36 | x = np.append(x, data['x'], axis=0) 37 | y = np.append(y, data['y'], axis=0) 38 | 39 | x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size) 40 | return x_train, y_train, x_test, y_test 41 | 42 | 43 | def get_model(inputs_shape): 44 | """定义网络结Args: 45 | inputs_shape: 输入数据的shape 46 | recurrent_dropout: RNN隐藏层的舍弃比重 47 | Returns: 48 | model: 定义好的模型 49 | """ 50 | ni = tl.layers.Input(inputs_shape, name='input_layer') 51 | out = tl.layers.RNN(cell=tf.keras.layers.LSTMCell(units=64, recurrent_dropout=0.2), 52 | return_last_output=True, 53 | return_last_state=False, 54 | return_seq_2d=True)(ni, sequence_length=tl.layers.retrieve_seq_length_op3(ni, pad_val=masking_val)) 55 | nn = tl.layers.Dense(n_units=2, act=tf.nn.softmax, name="dense")(out) 56 | model = tl.models.Model(inputs=ni, outputs=nn, name='rnn') 57 | return model 58 | 59 | 60 | 61 | def accuracy(y_pred, y_true): 62 | """ 63 | 计算预测精准度accuracy 64 | :param y_pred: 模型预测结果 65 | :param y_true: 真实结果 ground truth 66 | :return: 精准度acc 67 | """ 68 | # Predicted class is the index of highest score in prediction vector (i.e. argmax). 69 | correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64)) 70 | return tf.reduce_mean(tf.cast(correct_prediction, tf.float32), axis=-1) 71 | 72 | 73 | def train(model): 74 | # 开始训练 75 | learning_rate = 0.001 76 | n_epoch = 50 77 | batch_size = 128 78 | display_step = 10 79 | loss_vals = [] 80 | acc_vals = [] 81 | optimizer = tf.optimizers.Nadam(learning_rate=learning_rate) 82 | 83 | logging.info("batch_size: %d", batch_size) 84 | logging.info("Start training the network...") 85 | 86 | for epoch in range(n_epoch): 87 | step = 0 88 | total_step = math.ceil(len(x_train) / batch_size) 89 | 90 | # 利用训练集训练 91 | model.train() 92 | for batch_x, batch_y in tl.iterate.minibatches(x_train, y_train, batch_size, shuffle=True): 93 | 94 | start_time = time.time() 95 | # temp = copy.deepcopy(batch_x) 96 | max_seq_len = max([len(d) for d in batch_x]) 97 | batch_y = batch_y.astype(np.int32) 98 | for i, d in enumerate(batch_x): 99 | batch_x[i] += [tf.convert_to_tensor(np.zeros(200), dtype=tf.float32) for i in 100 | range(max_seq_len - len(d))] 101 | batch_x[i] = tf.convert_to_tensor(batch_x[i], dtype=tf.float32) 102 | batch_x = list(batch_x) 103 | batch_x = tf.convert_to_tensor(batch_x, dtype=tf.float32) 104 | # sequence_length = tl.layers.retrieve_seq_length_op3(batch_x, pad_val=masking_val) 105 | 106 | with tf.GradientTape() as tape: 107 | _y = model(batch_x) 108 | loss_val = tf.nn.sparse_softmax_cross_entropy_with_logits(batch_y, _y, name='train_loss') 109 | loss_val = tf.reduce_mean(loss_val) 110 | grad = tape.gradient(loss_val, model.trainable_weights) 111 | optimizer.apply_gradients(zip(grad, model.trainable_weights)) 112 | 113 | loss_vals.append(loss_val) 114 | acc_vals.append(accuracy(_y, batch_y)) 115 | 116 | if step + 1 == 1 or (step + 1) % display_step == 0: 117 | logging.info("Epoch {}/{},Step {}/{}, took {}".format(epoch + 1, n_epoch, step, total_step, 118 | time.time() - start_time)) 119 | loss = sum(loss_vals) / len(loss_vals) 120 | acc = sum(acc_vals) / len(acc_vals) 121 | del loss_vals[:] 122 | del acc_vals[:] 123 | logging.info( 124 | "Minibatch Loss= " + 
"{:.6f}".format(loss) + ", Training Accuracy= " + "{:.5f}".format(acc)) 125 | step += 1 126 | 127 | with train_summary_writer.as_default(): 128 | tf.summary.scalar('loss', loss_val.numpy(), step = epoch) 129 | tf.summary.scalar('accuracy', accuracy(_y, batch_y).numpy(), step = epoch) 130 | 131 | # 利用测试集评估 132 | model.eval() 133 | test_loss, test_acc, n_iter = 0, 0, 0 134 | for batch_x, batch_y in tl.iterate.minibatches(x_test, y_test, batch_size, shuffle=True): 135 | batch_y = batch_y.astype(np.int32) 136 | max_seq_len = max([len(d) for d in batch_x]) 137 | for i, d in enumerate(batch_x): 138 | # 依照每个batch中最大样本长度将剩余样本打上padding 139 | batch_x[i] += [tf.convert_to_tensor(np.zeros(200), dtype=tf.float32) for i in 140 | range(max_seq_len - len(d))] 141 | batch_x[i] = tf.convert_to_tensor(batch_x[i], dtype=tf.float32) 142 | # ValueError: setting an array element with a sequence. 143 | batch_x = list(batch_x) 144 | batch_x = tf.convert_to_tensor(batch_x, dtype=tf.float32) 145 | 146 | _y = model(batch_x) 147 | 148 | loss_val = tf.nn.sparse_softmax_cross_entropy_with_logits(batch_y, _y, name='test_loss') 149 | loss_val = tf.reduce_mean(loss_val) 150 | 151 | test_loss += loss_val 152 | test_acc += accuracy(_y, batch_y) 153 | n_iter += 1 154 | 155 | with test_summary_writer.as_default(): 156 | tf.summary.scalar('loss', loss_val.numpy(), step=epoch) 157 | tf.summary.scalar('accuracy', accuracy(_y, batch_y).numpy(), step=epoch) 158 | 159 | logging.info(" test loss: {}".format(test_loss / n_iter)) 160 | logging.info(" test acc: {}".format(test_acc / n_iter)) 161 | 162 | 163 | def layer_conv1d_translator(tl_layer, _input_shape=None): 164 | args = tl_layer['args'] 165 | name = args['name'] 166 | filters = args['n_filter'] 167 | kernel_size = [args['filter_size']] 168 | strides = [args['stride']] 169 | padding = args['padding'] 170 | data_format = args['data_format'] 171 | dilation_rate = [args['dilation_rate']] 172 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'filters': filters, 173 | 'kernel_size': kernel_size, 'strides': strides, 'padding': padding, 'data_format': data_format, 174 | 'dilation_rate': dilation_rate, 'activation': 'relu', 'use_bias': True, 175 | 'kernel_initializer': {'class_name': 'GlorotUniform', 176 | 'config': {'seed': None} 177 | }, 178 | 'bias_initializer': {'class_name': 'Zeros', 179 | 'config': {} 180 | }, 181 | 'kernel_regularizer': None, 'bias_regularizer': None, 182 | 'activity_regularizer': None, 'kernel_constraint': None, 'bias_constraint': None} 183 | 184 | if _input_shape is not None: 185 | config['batch_input_shape'] = _input_shape 186 | result = {'class_name': 'Conv1D', 'config': config} 187 | return result 188 | 189 | 190 | def layer_maxpooling1d_translator(tl_layer, _input_shape=None): 191 | args = tl_layer['args'] 192 | name = args['name'] 193 | pool_size = [args['filter_size']] 194 | strides = [args['strides']] 195 | padding = args['padding'] 196 | data_format = args['data_format'] 197 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'strides': strides, 'pool_size': pool_size, 198 | 'padding': padding, 'data_format': data_format} 199 | if _input_shape is not None: 200 | config['batch_input_shape'] = _input_shape 201 | result = {'class_name': 'MaxPooling1D', 'config': config} 202 | return result 203 | 204 | 205 | def layer_flatten_translator(tl_layer, _input_shape=None): 206 | args = tl_layer['args'] 207 | name = args['name'] 208 | 209 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'data_format': 'channels_last'} 210 
| if _input_shape is not None: 211 | config['batch_input_shape'] = _input_shape 212 | result = {'class_name': 'Flatten', 'config': config} 213 | return result 214 | 215 | 216 | def layer_dropout_translator(tl_layer, _input_shape=None): 217 | args = tl_layer['args'] 218 | name = args['name'] 219 | rate = 1-args['keep'] 220 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'rate': rate, 'noise_shape': None, 'seed': None} 221 | if _input_shape is not None: 222 | config['batch_input_shape'] = _input_shape 223 | result = {'class_name': 'Dropout', 'config': config} 224 | return result 225 | 226 | 227 | def layer_dense_translator(tl_layer, _input_shape=None): 228 | args = tl_layer['args'] 229 | name = args['name'] 230 | units = args['n_units'] 231 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'units': units, 'activation': 'softmax', 'use_bias': True, 232 | 'kernel_initializer': {'class_name': 'GlorotUniform', 'config': {'seed': None}}, 233 | 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 234 | 'kernel_regularizer': None, 235 | 'bias_regularizer': None, 236 | 'activity_regularizer': None, 237 | 'kernel_constraint': None, 238 | 'bias_constraint': None} 239 | 240 | if _input_shape is not None: 241 | config['batch_input_shape'] = _input_shape 242 | result = {'class_name': 'Dense', 'config': config} 243 | return result 244 | 245 | 246 | def layer_rnn_translator(tl_layer, _input_shape=None): 247 | args = tl_layer['args'] 248 | name = args['name'] 249 | cell = {'class_name': 'LSTMCell', 'config': {'name': 'lstm_cell', 'trainable': True, 'dtype': 'float32', 250 | 'units': 64, 'activation': 'tanh', 'recurrent_activation': 'sigmoid', 251 | 'use_bias': True, 252 | 'kernel_initializer': {'class_name': 'GlorotUniform', 253 | 'config': {'seed': None}}, 254 | 'recurrent_initializer': {'class_name': 'Orthogonal', 255 | 'config': {'gain': 1.0, 'seed': None}}, 256 | 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 257 | 'unit_forget_bias': True, 'kernel_regularizer': None, 258 | 'recurrent_regularizer': None, 'bias_regularizer': None, 259 | 'kernel_constraint': None, 'recurrent_constraint': None, 260 | 'bias_constraint': None, 'dropout': 0.0, 'recurrent_dropout': 0.2, 261 | 'implementation': 1}} 262 | config = {'name': name, 'trainable': True, 'dtype': 'float32', 'return_sequences': False, 263 | 'return_state': False, 'go_backwards': False, 'stateful': False, 'unroll': False, 'time_major': False, 264 | 'cell': cell 265 | } 266 | if _input_shape is not None: 267 | config['batch_input_shape'] = _input_shape 268 | result = {'class_name': 'RNN', 'config': config} 269 | return result 270 | 271 | 272 | def layer_translator(tl_layer, is_first_layer=False): 273 | _input_shape = None 274 | global input_shape 275 | if is_first_layer: 276 | _input_shape = input_shape 277 | if tl_layer['class'] == '_InputLayer': 278 | input_shape = tl_layer['args']['shape'] 279 | elif tl_layer['class'] == 'Conv1d': 280 | return layer_conv1d_translator(tl_layer, _input_shape) 281 | elif tl_layer['class'] == 'MaxPool1d': 282 | return layer_maxpooling1d_translator(tl_layer, _input_shape) 283 | elif tl_layer['class'] == 'Flatten': 284 | return layer_flatten_translator(tl_layer, _input_shape) 285 | elif tl_layer['class'] == 'Dropout': 286 | return layer_dropout_translator(tl_layer, _input_shape) 287 | elif tl_layer['class'] == 'Dense': 288 | return layer_dense_translator(tl_layer, _input_shape) 289 | elif tl_layer['class'] == 'RNN': 290 | return layer_rnn_translator(tl_layer, _input_shape) 
291 | return None 292 | 293 | 294 | def config_translator(f_tl, f_k): 295 | tl_model_config = f_tl.attrs['model_config'].decode('utf8') 296 | tl_model_config = eval(tl_model_config) 297 | tl_model_architecture = tl_model_config['model_architecture'] 298 | 299 | k_layers = [] 300 | 301 | masking_layer = { 302 | 'class_name': 'Masking', 303 | 'config': { 304 | 'batch_input_shape': [None, None, 200], 305 | 'dtype': 'float32', 306 | 'mask_value': 0, #Masks a sequence to skip timesteps if values are equal to mask_value 307 | 'name': 'masking', 308 | 'trainable': True 309 | } 310 | } 311 | k_layers.append(masking_layer) 312 | 313 | for key, tl_layer in enumerate(tl_model_architecture): 314 | if key == 1: 315 | k_layer = layer_translator(tl_layer, is_first_layer=True) 316 | else: 317 | k_layer = layer_translator(tl_layer) 318 | if k_layer is not None: 319 | k_layers.append(k_layer) 320 | f_k.attrs['model_config'] = json.dumps({'class_name': 'Sequential', 321 | 'config': {'name': 'sequential', 'layers': k_layers}, 322 | 'build_input_shape': input_shape}, 323 | default=serialization.get_json_type).encode('utf8') 324 | f_k.attrs['backend'] = keras.backend.backend().encode('utf8') 325 | f_k.attrs['keras_version'] = str(keras.__version__).encode('utf8') 326 | 327 | # todo: translate the 'training_config' 328 | training_config = {'loss': {'class_name': 'SparseCategoricalCrossentropy', 329 | 'config': {'reduction': 'auto', 'name': 'sparse_categorical_crossentropy', 330 | 'from_logits': False}}, 331 | 'metrics': ['accuracy'], 'weighted_metrics': None, 'loss_weights': None, 'sample_weight_mode': None, 332 | 'optimizer_config': {'class_name': 'Adam', 333 | 'config': {'name': 'Adam', 'learning_rate': 0.01, 'decay': 0.0, 334 | 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False 335 | } 336 | } 337 | } 338 | 339 | f_k.attrs['training_config'] = json.dumps(training_config, default=serialization.get_json_type).encode('utf8') 340 | 341 | 342 | def weights_translator(f_tl, f_k): 343 | # todo: delete inputlayer 344 | if 'model_weights' not in f_k.keys(): 345 | f_k_model_weights = f_k.create_group('model_weights') 346 | else: 347 | f_k_model_weights = f_k['model_weights'] 348 | for key in f_tl.keys(): 349 | if key not in f_k_model_weights.keys(): 350 | f_k_model_weights.create_group(key) 351 | try: 352 | f_tl_para = f_tl[key][key] 353 | except KeyError: 354 | pass 355 | else: 356 | if key not in f_k_model_weights[key].keys(): 357 | f_k_model_weights[key].create_group(key) 358 | weight_names = [] 359 | f_k_para = f_k_model_weights[key][key] 360 | # todo:对RNN层的weights进行通用适配 361 | cell_name = '' 362 | if key == 'rnn_1': 363 | cell_name = 'lstm_cell' 364 | f_k_para.create_group(cell_name) 365 | f_k_para = f_k_para[cell_name] 366 | f_k_model_weights.create_group('masking') 367 | f_k_model_weights['masking'].attrs['weight_names'] = [] 368 | for k in f_tl_para: 369 | if k == 'biases:0' or k == 'bias:0': 370 | weight_name = 'bias:0' 371 | elif k == 'filters:0' or k == 'weights:0' or k == 'kernel:0': 372 | weight_name = 'kernel:0' 373 | elif k == 'recurrent_kernel:0': 374 | weight_name = 'recurrent_kernel:0' 375 | else: 376 | raise Exception("cant find the parameter '{}' in tensorlayer".format(k)) 377 | if weight_name in f_k_para: 378 | del f_k_para[weight_name] 379 | f_k_para.create_dataset(name=weight_name, data=f_tl_para[k][:], 380 | shape=f_tl_para[k].shape) 381 | 382 | weight_names = [] 383 | for weight_name in f_tl[key].attrs['weight_names']: 384 | weight_name = weight_name.decode('utf8') 385 | 
weight_name = weight_name.split('/') 386 | k = weight_name[-1] 387 | if k == 'biases:0' or k == 'bias:0': 388 | weight_name[-1] = 'bias:0' 389 | elif k == 'filters:0' or k == 'weights:0' or k == 'kernel:0': 390 | weight_name[-1] = 'kernel:0' 391 | elif k == 'recurrent_kernel:0': 392 | weight_name[-1] = 'recurrent_kernel:0' 393 | else: 394 | raise Exception("cant find the parameter '{}' in tensorlayer".format(k)) 395 | if key == 'rnn_1': 396 | weight_name.insert(-1, 'lstm_cell') 397 | weight_name = '/'.join(weight_name) 398 | weight_names.append(weight_name.encode('utf8')) 399 | f_k_model_weights[key].attrs['weight_names'] = weight_names 400 | 401 | f_k_model_weights.attrs['backend'] = keras.backend.backend().encode('utf8') 402 | f_k_model_weights.attrs['keras_version'] = str(keras.__version__).encode('utf8') 403 | 404 | f_k_model_weights.attrs['layer_names'] = [i for i in f_tl.attrs['layer_names']] 405 | 406 | 407 | def translator_tl2_keras_h5(_tl_h5_path, _keras_h5_path): 408 | f_tl_ = h5py.File(_tl_h5_path, 'r+') 409 | f_k_ = h5py.File(_keras_h5_path, 'a') 410 | f_k_.clear() 411 | weights_translator(f_tl_, f_k_) 412 | config_translator(f_tl_, f_k_) 413 | f_tl_.close() 414 | f_k_.close() 415 | 416 | 417 | def format_convert(x, y): 418 | y = y.astype(np.int32) 419 | max_seq_len = max([len(d) for d in x]) 420 | for i, d in enumerate(x): 421 | x[i] += [tf.convert_to_tensor(np.zeros(200), dtype=tf.float32) for i in range(max_seq_len - len(d))] 422 | x[i] = tf.convert_to_tensor(x[i], dtype=tf.float32) 423 | x = list(x) 424 | x = tf.convert_to_tensor(x, dtype=tf.float32) 425 | return x, y 426 | 427 | 428 | if __name__ == '__main__': 429 | 430 | masking_val = np.zeros(200) 431 | input_shape = None 432 | gradient_log_dir = 'logs/gradient_tape/' 433 | tensorboard = TensorBoard(log_dir = gradient_log_dir) 434 | 435 | # 定义log格式 436 | fmt = "%(asctime)s %(levelname)s %(message)s" 437 | logging.basicConfig(format=fmt, level=logging.INFO) 438 | 439 | # 加载数据 440 | x_train, y_train, x_test, y_test = load_dataset( 441 | ["../word2vec/output/sample_seq_pass.npz", 442 | "../word2vec/output/sample_seq_spam.npz"]) 443 | 444 | # 构建模型 445 | model = get_model(inputs_shape=[None, None, 200]) 446 | 447 | for index, layer in enumerate(model.config['model_architecture']): 448 | if layer['class'] == 'RNN': 449 | if 'cell' in layer['args']: 450 | model.config['model_architecture'][index]['args']['cell'] = '' 451 | 452 | current_time = datetime.datetime.now().strftime('%Y%m%d-%H%M%S') 453 | train_log_dir = gradient_log_dir + current_time + '/train' 454 | test_log_dir = gradient_log_dir + current_time + '/test' 455 | train_summary_writer = tf.summary.create_file_writer(train_log_dir) 456 | test_summary_writer = tf.summary.create_file_writer(test_log_dir) 457 | 458 | train(model) 459 | 460 | logging.info("Optimization Finished!") 461 | 462 | # h5保存和转译 463 | model_dir = './model_h5' 464 | if not os.path.exists(model_dir): 465 | os.mkdir(model_dir) 466 | tl_h5_path = model_dir + '/model_rnn_tl.hdf5' 467 | keras_h5_path = model_dir + '/model_rnn_tl2k.hdf5' 468 | tl.files.save_hdf5_graph(network=model, filepath=tl_h5_path, save_weights=True) 469 | translator_tl2_keras_h5(tl_h5_path, keras_h5_path) 470 | 471 | # 读取模型 472 | new_model = keras.models.load_model(keras_h5_path) 473 | x_test, y_test = format_convert(x_test, y_test) 474 | score = new_model.evaluate(x_test, y_test, batch_size=128) 475 | 476 | # 保存SavedModel可部署文件 477 | saved_model_version = 1 478 | saved_model_path = "./saved_models/rnn/" 479 | 
tf.saved_model.save(new_model, saved_model_path + str(saved_model_version)) 480 | 481 | 482 | -------------------------------------------------------------------------------- /serving/README.md: -------------------------------------------------------------------------------- 1 | 2 | ### TensorFlow Serving部署 3 | 4 | 反垃圾服务分为线上与线下两层。线上实时服务要求毫秒级判断文本是否属于垃圾文本,线下离线计算需要根据新进的样本不断更新模型,并及时推送到线上。 5 | 6 | 图8所示的分类器就是用TensorFlow Serving提供的服务。TensorFlow Serving是一个灵活、高性能的机器学习模型服务系统,专为生产环境而设计。它可以将训练好的机器学习模型轻松部署到线上,并且支持热更新。它使用gRPC作为接口框架接受外部调用,服务稳定,接口简单。这些优秀特性使我们能够专注于线下模型训练。 7 | 8 |
图8 反垃圾服务架构
13 | 14 | 为什么使用TensorFlow Serving而不是直接启动多个加载了模型的Python进程来提供线上服务?因为重复引入TensorFlow并加载模型的Python进程浪费资源并且运行效率不高。而且TensorFlow本身有一些限制导致并不是所有时候都能启动多个进程。TensorFlow默认会使用尽可能多的GPU并且占用所使用的GPU。因此如果有一个TensorFlow进程正在运行,可能导致其他TensorFlow进程无法启动。虽然可以指定程序使用特定的GPU,但是进程的数量也受到GPU数量的限制,总体来说不利于分布式部署。而TensorFlow Serving提供了一个高效的分布式解决方案。当新数据可用或改进模型时,加载并迭代模型是很常见的。TensorFlow Serving能够实现模型生命周期管理,它能自动检测并加载最新模型或回退到上一个模型,非常适用于高频迭代场景。 15 | 16 | *现在通过Docker使用Tensorflow Serving已经非常方便了,建议大家直接参考[TensorFlow Serving with Docker](https://www.tensorflow.org/tfx/serving/docker)这篇文档安装TensorFlow Serving。* 17 | 18 | ``` 19 | $ docker pull tensorflow/serving 20 | ``` 21 | 22 | 部署的方式非常简单,只需在启动TensorFlow Serving时加载Servable并定义`model_name`即可,这里的`model_name`将用于与客户端进行交互。运行TensorFlow Serving并从指定目录加载Servable的Docker命令为: 23 | 24 | ``` 25 | $ docker run -dt -p 8501:8501 -v /Users/code/task/text_antispam_tl2/network/saved_models:/models/saved_model -e MODEL_NAME=saved_model tensorflow/serving & 26 | ``` 27 | 28 | 冒号前是模型保存的路径(注意不包括版本号)。run命令的一些常用参数说明如下: 29 | 30 | ``` 31 | -a stdin: 指定标准输入输出内容类型,可选 STDIN/STDOUT/STDERR 三项; 32 | -d: 后台运行容器,并返回容器ID; 33 | -i: 以交互模式运行容器,通常与 -t 同时使用; 34 | -P: 随机端口映射,容器内部端口随机映射到主机的端口 35 | -p: 指定端口映射,格式为:主机(宿主)端口:容器端口 36 | -t: 为容器重新分配一个伪输入终端,通常与 -i 同时使用; 37 | --name="nginx-lb": 为容器指定一个名称; 38 | --dns 8.8.8.8: 指定容器使用的DNS服务器,默认和宿主一致; 39 | --dns-search example.com: 指定容器DNS搜索域名,默认和宿主一致; 40 | -h "mars": 指定容器的hostname; 41 | -e username="ritchie": 设置环境变量; 42 | --env-file=[]: 从指定文件读入环境变量; 43 | --cpuset="0-2" or --cpuset="0,1,2": 绑定容器到指定CPU运行; 44 | -m :设置容器使用内存最大值; 45 | --net="bridge": 指定容器的网络连接类型,支持 bridge/host/none/container: 四种类型; 46 | --link=[]: 添加链接到另一个容器; 47 | --expose=[]: 开放一个端口或一组端口; 48 | --volume , -v: 绑定一个卷 49 | -dt表示后台运行容器,并返回容器ID可以看到TensorFlow Serving成功加载了我们刚刚导出的模型。 50 | ``` 51 | 52 | 运行完可以看到运行的tfs容器ID。还可以进一步查看保存模型的Signature(用于验证模型是否成功保存): 53 | ``` 54 | saved_model_cli show --dir /Users/code/task/text_antispam_tl2/network/saved_models/1 --all 55 | ``` 56 | 57 | 输出信息类似以下形式: 58 | 59 | ``` 60 | signature_def['serving_default']: 61 | The given SavedModel SignatureDef contains the following input(s): 62 | inputs['conv1d_1_input'] tensor_info: 63 | dtype: DT_FLOAT 64 | shape: (-1, 20, 200) 65 | name: serving_default_conv1d_1_input:0 66 | The given SavedModel SignatureDef contains the following output(s): 67 | outputs['output'] tensor_info: 68 | dtype: DT_FLOAT 69 | shape: (-1, 2) 70 | name: StatefulPartitionedCall:0 71 | Method name is: tensorflow/serving/predict 72 | ``` 73 | 74 | ### 客户端调用 75 | 76 | TensorFlow Serving通过gRPC框架接收外部调用。gRPC是一种高性能、通用的远程过程调用(Remote Procedure Call,RPC)框架。RPC协议包含了编码协议和传输协议。gRPC的编码协议是Protocol Buffers(ProtoBuf),它是Google开发的一种二进制格式数据描述语言,支持众多开发语言和平台。与JSON、XML相比,ProtoBuf的优点是体积小、速度快,其序列化与反序列化代码都是通过代码生成器根据定义好的数据结构生成的,使用起来也很简单。gRPC的传输协议是HTTP/2,相比于HTTP/1.1,HTTP/2引入了头部压缩算法(HPACK)等新特性,并采用了二进制而非明文来打包、传输客户端——服务器间的数据,性能更好,功能更强。总而言之,gRPC提供了一种简单的方法来精确地定义服务,并自动为客户端生成可靠性很强的功能库,如图7所示。 77 | 78 | 在使用gRPC进行通信之前,我们需要完成两步操作:1)定义服务;2)生成服务端和客户端代码。定义服务这块工作TensorFlow Serving已经帮我们完成了。在[TensorFlow Serving](https://github.com/tensorflow/serving)项目中,我们可以在以下目录找到三个`.proto`文件:`model.proto`、`predict.proto`和`prediction_service.proto`。这三个`.proto`文件定义了一次预测请求的输入和输出。例如一次预测请求应该包含哪些元数据(如模型的名称和版本),以及输入、输出与Tensor如何转换。 79 | 80 |
图9 客户端与服务端使用gRPC进行通信
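除了后文客户端示例所用的REST接口(8501端口),也可以直接通过gRPC接口调用TensorFlow Serving。下面给出一个极简的gRPC客户端示意,其中有几处假设:已安装`grpcio`与`tensorflow-serving-api`;启动容器时额外映射了gRPC端口(例如在`docker run`中加上`-p 8500:8500`);输入、输出名称(`conv1d_1_input`、`output`)取自上文`saved_model_cli`打印的Signature,实际使用时应以自己导出模型的Signature为准。

```
# gRPC调用TensorFlow Serving的极简示意(前提假设见上文说明)
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# 连接gRPC端口(默认8500),构造PredictionService的stub
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# 构造预测请求:模型名需与启动容器时的MODEL_NAME一致
request = predict_pb2.PredictRequest()
request.model_spec.name = 'saved_model'
request.model_spec.signature_name = 'serving_default'

# 构造一个与模型输入同形状的样本,这里用全零向量仅作演示
sample = np.zeros((1, 20, 200), dtype=np.float32)
request.inputs['conv1d_1_input'].CopyFrom(tf.make_tensor_proto(sample))

# 发起请求,并把返回的TensorProto转成numpy数组
response = stub.Predict(request, timeout=10.0)
scores = tf.make_ndarray(response.outputs['output'])
print(np.argmax(scores, axis=-1))
```

上文提到的三个`.proto`文件在TensorFlow Serving项目中的具体位置,可以参考下面的目录结构。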
85 | 86 | ``` 87 | $ tree serving 88 | serving 89 | ├── tensorflow 90 | │ ├── ... 91 | ├── tensorflow_serving 92 | │ ├── apis 93 | │ │ ├── model.proto 94 | │ │ ├── predict.proto 95 | │ │ ├── prediction_service.proto 96 | │ │ ├── ... 97 | │ ├── ... 98 | ├── ... 99 | ``` 100 | 101 | 接下来写一个简单的客户端程序来调用部署好的模型。以RNN Classifier为例,`serving_rnn.py`负责构建一个Request用于与TensorFlow Serving交互。为了描述简洁,这里分词使用了结巴分词,词向量也是直接载入内存,实际生产环境中分词与词向量获取是一个单独的服务。特别需要注意的是,输入的签名和数据必须与之前导出的模型相匹配。 102 | 103 | ``` 104 | import json 105 | import tornado.ioloop 106 | import tornado.web 107 | import tensorflow as tf 108 | import jieba 109 | import tensorlayer as tl 110 | from packages import text_regularization as tr 111 | import numpy as np 112 | import requests 113 | 114 | 115 | print(" ".join(jieba.cut('分词初始化'))) 116 | wv = tl.files.load_npy_to_any(name='../word2vec/output/model_word2vec_200.npy') 117 | 118 | 119 | def text_tensor(text, wv): 120 | """获取文本向量 121 | Args: 122 | text: 待检测文本 123 | wv: 词向量模型 124 | Returns: 125 | [[[ 3.80905056 2.94315064 -0.20703495 -2.31589055 2.9627794 126 | ... 127 | 2.16935492 2.95426321 -4.71534014 -3.25034237 -11.28901672]]] 128 | """ 129 | text = tr.extractWords(text) 130 | words = jieba.cut(text.strip()) 131 | text_sequence = [] 132 | for word in words: 133 | try: 134 | text_sequence.append(wv[word]) 135 | except KeyError: 136 | text_sequence.append(wv['UNK']) 137 | text_sequence = np.asarray(text_sequence) 138 | sample = text_sequence.reshape(1, len(text_sequence), 200) 139 | return sample 140 | ``` 141 | 142 | 接下来定义请求处理类MainHandler负责接收和处理请求。生产环境中一般使用反向代理软件如Nginx实现负载均衡。这里我们演示直接监听80端口来提供HTTP服务。 143 | 144 | ``` 145 | class MainHandler(tornado.web.RequestHandler): 146 | """请求处理类 147 | """ 148 | 149 | def get(self): 150 | """处理GET请求 151 | """ 152 | text = self.get_argument("text") 153 | predict = self.classify(text) 154 | data = { 155 | 'text' : text, 156 | 'predict' : str(predict[0]) 157 | } 158 | self.write(json.dumps({'data': data})) 159 | 160 | def classify(self, text): 161 | """调用引擎检测文本 162 | Args: 163 | text: 待检测文本 164 | Returns: 165 | 垃圾返回[0],通过返回[1] 166 | """ 167 | sample = text_tensor(text, wv) 168 | sample = np.array(sample, dtype=np.float32) 169 | len = sample.shape[1] 170 | sample = sample.reshape(len, 200) 171 | 172 | sample = sample.reshape(1, len, 200) 173 | data = json.dumps({ 174 | # "signature_name": "call", 175 | "instances": sample.tolist() 176 | }) 177 | headers = {"content-type": "application/json"} 178 | json_response = requests.post( 179 | 'http://localhost:8501/v1/models/saved_model:predict', 180 | data=data, headers=headers) 181 | 182 | predictions = np.array(json.loads(json_response.text)['predictions']) 183 | result = np.argmax(predictions, axis=-1) 184 | return result 185 | 186 | def make_app(): 187 | """定义并返回Tornado Web Application 188 | """ 189 | return tornado.web.Application([ 190 | (r"/predict", MainHandler), 191 | ]) 192 | 193 | 194 | if __name__ == "__main__": 195 | app = make_app() 196 | app.listen(80) 197 | print("listen start") 198 | tornado.ioloop.IOLoop.current().start() 199 | ``` 200 | 201 | 在运行上述代码后,如果是在本地启动服务,访问`http://127.0.0.1/predict?text=加我微信xxxxx有福利`,可以看到如下结果。 202 | 203 | ``` 204 | { 205 | "data": { 206 | "text": "\u52a0\u6211\u5fae\u4fe1xxxxx\u6709\u798f\u5229", 207 | "predict": 0 208 | } 209 | } 210 | ``` 211 | 212 | 成功识别出垃圾消息。 213 | 214 | -------------------------------------------------------------------------------- /serving/packages/__init__.py: -------------------------------------------------------------------------------- 1 | from . 
import text_regularization 2 | -------------------------------------------------------------------------------- /serving/packages/text_regularization.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python3 2 | 3 | import re 4 | # import opencc 5 | 6 | 7 | def strtr(text, replace): 8 | for s, r in replace.items(): 9 | text = text.replace(s, r) 10 | return text 11 | 12 | 13 | # def simplify(text): 14 | # return opencc.convert(text) 15 | 16 | 17 | # def traditionalize(text): 18 | # """ https://pypi.python.org/pypi/OpenCC 19 | # """ 20 | # return opencc.convert(text, config='s2t.json') 21 | 22 | 23 | def sbcCase(text): 24 | """转换全角字符为半角字符 25 | SBC case to DBC case 26 | """ 27 | return strtr(text, { 28 | "0" : "0", "1" : "1", "2" : "2", "3" : "3", "4" : "4", 29 | "5" : "5", "6" : "6", "7" : "7", "8" : "8", "9" : "9", 30 | 'A' : 'A', 'B' : 'B', 'C' : 'C', 'D' : 'D', 'E' : 'E', 31 | 'F' : 'F', 'G' : 'G', 'H' : 'H', 'I' : 'I', 'J' : 'J', 32 | 'K' : 'K', 'L' : 'L', 'M' : 'M', 'N' : 'N', 'O' : 'O', 33 | 'P' : 'P', 'Q' : 'Q', 'R' : 'R', 'S' : 'S', 'T' : 'T', 34 | 'U' : 'U', 'V' : 'V', 'W' : 'W', 'X' : 'X', 'Y' : 'Y', 35 | 'Z' : 'Z', 'a' : 'a', 'b' : 'b', 'c' : 'c', 'd' : 'd', 36 | 'e' : 'e', 'f' : 'f', 'g' : 'g', 'h' : 'h', 'i' : 'i', 37 | 'j' : 'j', 'k' : 'k', 'l' : 'l', 'm' : 'm', 'n' : 'n', 38 | 'o' : 'o', 'p' : 'p', 'q' : 'q', 'r' : 'r', 's' : 's', 39 | 't' : 't', 'u' : 'u', 'v' : 'v', 'w' : 'w', 'x' : 'x', 40 | 'y' : 'y', 'z' : 'z', 41 | }) 42 | 43 | 44 | def circleCase(text): 45 | """转换①数字为半角数字 46 | ①②③④⑤⑥⑦⑧⑨⑩ case 全角 47 | ❹❸❽❼❽❼❾❽❼ 48 | ➀➁➂➃➄➅➆➇➈ 49 | """ 50 | return strtr(text, { 51 | "➀" : "1", "➁" : "2", "➂" : "3", "➃" : "4", "➄" : "5", 52 | "➅" : "6", "➆" : "7", "➇" : "8", "➈" : "9", "①" : "1", 53 | "②" : "2", "③" : "3", "④" : "4", "⑤" : "5", "⑥" : "6", 54 | "⑦" : "7", "⑧" : "8", "⑨" : "9", "❶" : "1", "❷" : "2", 55 | "❸" : "3", "❹" : "4", "❺" : "5", "❻" : "6", "❼" : "7", 56 | "❽" : "8", "❾" : "9", "➊" : "1", "➋" : "2", "➌" : "3", 57 | "➍" : "4", "➎" : "5", "➏" : "6", "➐" : "7", "➑" : "8", 58 | "➒" : "9", 'Ⓐ' : 'A', 'Ⓑ' : 'B', 'Ⓒ' : 'C', 'Ⓓ' : 'D', 59 | 'Ⓔ' : 'E', 'Ⓕ' : 'F', 'Ⓖ' : 'G', 'Ⓗ' : 'H', 'Ⓘ' : 'I', 60 | 'Ⓙ' : 'J', 'Ⓚ' : 'K', 'Ⓛ' : 'L', 'Ⓜ' : 'M', 'Ⓝ' : 'N', 61 | 'Ⓞ' : 'O', 'Ⓟ' : 'P', 'Ⓠ' : 'Q', 'Ⓡ' : 'R', 'Ⓢ' : 'S', 62 | 'Ⓣ' : 'T', 'Ⓤ' : 'U', 'Ⓥ' : 'V', 'Ⓦ' : 'W', 'Ⓧ' : 'X', 63 | 'Ⓨ' : 'Y', 'Ⓩ' : 'Z', 'ⓐ' : 'a', 'ⓑ' : 'b', 'ⓒ' : 'c', 64 | 'ⓓ' : 'd', 'ⓔ' : 'e', 'ⓕ' : 'f', 'ⓖ' : 'g', 'ⓗ' : 'h', 65 | 'ⓘ' : 'i', 'ⓙ' : 'j', 'ⓚ' : 'k', 'ⓛ' : 'l', 'ⓜ' : 'm', 66 | 'ⓝ' : 'n', 'ⓞ' : 'o', 'ⓟ' : 'p', 'ⓠ' : 'q', 'ⓡ' : 'r', 67 | 'ⓢ' : 's', 'ⓣ' : 't', 'ⓤ' : 'u', 'ⓥ' : 'v', 'ⓦ' : 'w', 68 | 'ⓧ' : 'x', 'ⓨ' : 'y', 'ⓩ' : 'z', 69 | "㊀" : "一", "㊁" : "二", "㊂" : "三", "㊃" : "四", 70 | "㊄" : "五", "㊅" : "六", "㊆" : "七", "㊇" : "八", 71 | "㊈" : "九", 72 | }) 73 | 74 | 75 | def bracketCase(text): 76 | """转换⑴数字为半角数字 77 | ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼case 全角 78 | """ 79 | return strtr(text, { 80 | "⑴" : "1", "⑵" : "2", "⑶" : "3", "⑷" : "4", "⑸" : "5", 81 | "⑹" : "6", "⑺" : "7", "⑻" : "8", "⑼" : "9", 82 | '🄐' : 'A', '🄑' : 'B', '🄒' : 'C', '🄓' : 'D', '🄔' : 'E', 83 | '🄕' : 'F', '🄖' : 'G', '🄗' : 'H', '🄘' : 'I', '🄙' : 'J', 84 | '🄚' : 'K', '🄛' : 'L', '🄜' : 'M', '🄝' : 'N', '🄞' : 'O', 85 | '🄟' : 'P', '🄠' : 'Q', '🄡' : 'R', '🄢' : 'S', '🄣' : 'T', 86 | '🄤' : 'U', '🄥' : 'V', '🄦' : 'W', '🄧' : 'X', '🄨' : 'Y', 87 | '🄩' : 'Z', '⒜' : 'a', '⒝' : 'b', '⒞' : 'c', '⒟' : 'd', 88 | '⒠' : 'e', '⒡' : 'f', '⒢' : 'g', '⒣' : 'h', '⒤' : 'i', 89 | '⒥' : 'j', '⒦' : 'k', '⒧' : 'l', '⒨' : 'm', '⒩' : 'n', 90 
| '⒪' : 'o', '⒫' : 'p', '⒬' : 'q', '⒭' : 'r', '⒮' : 's', 91 | '⒯' : 't', '⒰' : 'u', '⒱' : 'v', '⒲' : 'w', '⒳' : 'x', 92 | '⒴' : 'y', '⒵' : 'z', 93 | "㈠" : "一", "㈡" : "二", "㈢" : "三", "㈣" : "四", 94 | "㈤" : "五", "㈥" : "六", "㈦" : "七", "㈧" : "八", 95 | "㈨" : "九", 96 | }) 97 | 98 | 99 | def dotCase(text): 100 | """转换⑴数字为半角数字 101 | ⒈⒉⒊⒋⒌⒍⒎⒏⒐case 全角 102 | """ 103 | return strtr(text, { 104 | "⒈" : "1", 105 | "⒉" : "2", 106 | "⒊" : "3", 107 | "⒋" : "4", 108 | "⒌" : "5", 109 | "⒍" : "6", 110 | "⒎" : "7", 111 | "⒏" : "8", 112 | "⒐" : "9", 113 | }) 114 | 115 | 116 | def specialCase(text): 117 | """特殊字符比如希腊字符,西里尔字母 118 | """ 119 | return strtr(text, { 120 | # 希腊字母 121 | "Α" : "A", "Β" : "B", "Ε" : "E", "Ζ" : "Z", "Η" : "H", 122 | "Ι" : "I", "Κ" : "K", "Μ" : "M", "Ν" : "N", "Ο" : "O", 123 | "Ρ" : "P", "Τ" : "T", "Χ" : "X", "α" : "a", "β" : "b", 124 | "γ" : "y", "ι" : "l", "κ" : "k", "μ" : "u", "ν" : "v", 125 | "ο" : "o", "ρ" : "p", "τ" : "t", "χ" : "x", 126 | 127 | # 西里尔字母 (U+0400 - U+04FF) 128 | "Ѐ" : "E", "Ё" : "E", "Ѕ" : "S", "І" : "I", "Ї" : "I", 129 | "Ј" : "J", "Ќ" : "K", "А" : "A", "В" : "B", "Е" : "E", 130 | "З" : "3", "Ζ" : "Z", "И" : "N", "М" : "M", "Н" : "H", 131 | "О" : "O", "Р" : "P", "С" : "C", "Т" : "T", "У" : "y", 132 | "Х" : "X", "Ь" : "b", "Ъ" : "b", "а" : "a", "в" : "B", 133 | "е" : "e", "к" : "K", "м" : "M", "н" : "H", "о" : "O", 134 | "п" : "n", "р" : "P", "с" : "c", "т" : "T", "у" : "y", 135 | "х" : "x", "ш" : "w", "ь" : "b", "ѕ" : "s", "і" : "i", 136 | "ј" : "j", 137 | 138 | "À" : "A", "Á" : "A", "Â" : "A", "Ã" : "A", "Ä" : "A", "Å" : "A", "Ā" : "A", "Ă" : "A", "Ă" : "A", 139 | "Ç" : "C", "Ć" : "C", "Ĉ" : "C", "Ċ" : "C", 140 | "Ð" : "D", "Ď" : "D", "Đ" : "D", 141 | "È" : "E", "É" : "E", "Ê" : "E", "Ë" : "E", "Ē" : "E", "Ė" : "E", "Ę" : "E", "Ě" : "E", 142 | "Ĝ" : "G", "Ġ" : "G", "Ģ" : "G", 143 | "Ĥ" : "H", "Ħ" : "H", 144 | "Ì" : "I", "Í" : "I", "î" : "I", "ï" : "I", "į" : "I", 145 | "Ĵ" : "J", 146 | "Ķ" : "K", 147 | "Ļ" : "L", "Ł" : "L", 148 | "Ñ" : "N", "Ń" : "N", "Ņ" : "N", "Ň" : "N", 149 | "Ò" : "O", "Ó" : "O", "Ô" : "O", "Õ" : "O", "Ö" : "O", "Ő" : "O", 150 | "Ŕ" : "R", "Ř" : "R", 151 | "Ś" : "S", "Ŝ" : "S", "Ş" : "S", "Š" : "S", "Ș" : "S", 152 | "Ţ" : "T", "Ť" : "T", "Ț" : "T", 153 | "Ù" : "U", "Ú" : "U", "Û" : "U", "Ü" : "U", "Ū" : "U", "Ŭ" : "U", "Ů" : "U", "Ű" : "U", "Ų" : "U", 154 | "Ŵ" : "W", 155 | "Ý" : "Y", "Ŷ" : "Y", "Ÿ" : "Y", 156 | "Ź" : "Z", "Ż" : "Z", "Ž" : "Z", 157 | 158 | "à" : "a", "á" : "a", "â" : "a", "ã" : "a", "ä" : "a", "å" : "a", "ā" : "a", "ă" : "a", "ą" : "a", 159 | "ç" : "c", "ć" : "c", "ĉ" : "c", "ċ" : "c", 160 | "ď" : "d", "đ" : "d", 161 | "è" : "e", "é" : "e", "ê" : "e", "ë" : "e", "ē" : "e", "ė" : "e", "ę" : "e", "ě" : "e", "ə" : "e", 162 | "ĝ" : "g", "ġ" : "g", "ģ" : "g", 163 | "ĥ" : "h", "ħ" : "h", 164 | "ì" : "i", "í" : "i", "î" : "i", "ï" : "i", "ī" : "i", "į" : "i", 165 | "ĵ" : "j", 166 | "ķ" : "k", 167 | "ļ" : "l", 168 | "ñ" : "n", "ń" : "n", "ņ" : "n", "ň" : "n", 169 | "ò" : "o", "ó" : "o", "ô" : "o", "õ" : "o", "ö" : "o", "ő" : "o", "ŕ" : "r", "ř" : "r", 170 | "ś" : "s", "ŝ" : "s", "ş" : "s", "š" : "s", "ș" : "s", 171 | "ţ" : "t", "ť" : "t", "ț" : "t", 172 | "ù" : "u", "ú" : "u", "û" : "u", "ü" : "u", "ū" : "u", "ŭ" : "u", "ů" : "u", "ű" : "u", "ų" : "u", 173 | "ŵ" : "w", 174 | "ý" : "y", "ŷ" : "y", "ÿ" : "y", 175 | "ź" : "z", "ż" : "z", "ž" : "z", 176 | 177 | # 罗马数字Roman numerals 178 | "Ⅰ" : "I", "Ⅱ" : "II", "Ⅲ" : "III", "Ⅳ" : "IV", "Ⅴ" : "V", "Ⅵ" : "VI", "Ⅶ" : "VII", 179 | "Ⅷ" : "VIII", "Ⅸ" : "IX", "Ⅹ" : "X", "Ⅺ" : "XI", "Ⅻ" : "XII", "Ⅼ" : 
"L", "Ⅽ" : "C", 180 | "Ⅾ" : "D", "Ⅿ" : "M", 181 | "ⅰ" : "i", "ⅱ" : "ii", "ⅲ" : "iii", "ⅳ" : "iv", "ⅴ" : "v", "ⅵ" : "vi", "ⅶ" : "vii", 182 | "ⅷ" : "viii", "ⅸ" : "ix", "ⅹ" : "x", "ⅺ" : "xi", "ⅻ" : "xii", "ⅼ" : "l", "ⅽ" : "c", 183 | "ⅾ" : "d", "ⅿ" : "m", 184 | }) 185 | 186 | 187 | def extractWords(text): 188 | """抽出中文、英文、数字,忽略标点符号和特殊字符(表情) 189 | 一般用与反垃圾 190 | """ 191 | text = text.lower() # 词向量只计算小写 192 | # text = simplify(text) # 繁体转简体 193 | text = sbcCase(text) # 全角转半角 194 | text = circleCase(text) # 特殊字符 195 | text = bracketCase(text) # 特殊字符 196 | text = dotCase(text) # 特殊字符 197 | text = specialCase(text) # 特殊字符 198 | pattern = re.compile(r"[0-9A-Za-z\u4E00-\u9FFF]+") 199 | match = pattern.findall(text) 200 | if match: 201 | text = " ".join(match) 202 | else: 203 | text = " " 204 | 205 | return text 206 | -------------------------------------------------------------------------------- /serving/serving_cnn.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python3 2 | """与加载了RNN Classifier导出的Servable的TensorFlow Serving进行通信 3 | """ 4 | 5 | import json 6 | import tornado.ioloop 7 | import tornado.web 8 | import tensorflow as tf 9 | import jieba 10 | import tensorlayer as tl 11 | from packages import text_regularization as tr 12 | import numpy as np 13 | import requests 14 | 15 | 16 | print(" ".join(jieba.cut('分词初始化'))) 17 | wv = tl.files.load_npy_to_any(name='../word2vec/output/model_word2vec_200.npy') 18 | 19 | 20 | def text_tensor(text, wv): 21 | """获取文本向量 22 | Args: 23 | text: 待检测文本 24 | wv: 词向量模型 25 | Returns: 26 | [[[ 3.80905056 2.94315064 -0.20703495 -2.31589055 2.9627794 27 | ... 28 | 2.16935492 2.95426321 -4.71534014 -3.25034237 -11.28901672]]] 29 | """ 30 | text = tr.extractWords(text) 31 | words = jieba.cut(text.strip()) 32 | text_sequence = [] 33 | for word in words: 34 | try: 35 | text_sequence.append(wv[word]) 36 | except KeyError: 37 | text_sequence.append(wv['UNK']) 38 | text_sequence = np.asarray(text_sequence) 39 | sample = text_sequence.reshape(1, len(text_sequence), 200) 40 | return sample 41 | 42 | 43 | class MainHandler(tornado.web.RequestHandler): 44 | """请求处理类 45 | """ 46 | 47 | def get(self): 48 | """处理GET请求 49 | """ 50 | text = self.get_argument("text") 51 | predict = self.classify(text) 52 | data = { 53 | 'text' : text, 54 | 'predict' : str(predict[0]) 55 | } 56 | self.write(json.dumps({'data': data})) 57 | 58 | def classify(self, text): 59 | """调用引擎检测文本 60 | Args: 61 | text: 待检测文本 62 | Returns: 63 | 垃圾返回[0],通过返回[1] 64 | """ 65 | max_seq_len = 20 66 | sample = text_tensor(text, wv) 67 | sample = np.array(sample,dtype=np.float32) 68 | sample = sample[:, :max_seq_len] 69 | len = sample.shape[1] 70 | sample = sample.reshape(len, 200) 71 | temp = np.zeros(((max_seq_len - len), 200), dtype=np.float32) 72 | sample = np.vstack((sample, temp)) 73 | 74 | sample = sample.reshape(1, max_seq_len, 200) 75 | data = json.dumps({ 76 | # "signature_name": "call", 77 | "instances": sample.tolist() 78 | }) 79 | headers = {"content-type": "application/json"} 80 | json_response = requests.post( 81 | 'http://localhost:8501/v1/models/saved_model:predict', 82 | data=data, headers=headers) 83 | 84 | predictions = np.array(json.loads(json_response.text)['predictions']) 85 | result = np.argmax(predictions, axis=-1) 86 | return result 87 | 88 | def make_app(): 89 | """定义并返回Tornado Web Application 90 | """ 91 | return tornado.web.Application([ 92 | (r"/predict", MainHandler), 93 | ]) 94 | 95 | 96 | if __name__ == "__main__": 97 | app = 
make_app() 98 | app.listen(80) 99 | print("listen start") 100 | tornado.ioloop.IOLoop.current().start() 101 | -------------------------------------------------------------------------------- /serving/serving_mlp.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python3 2 | """与加载了RNN Classifier导出的Servable的TensorFlow Serving进行通信 3 | """ 4 | 5 | import json 6 | import tornado.ioloop 7 | import tornado.web 8 | import tensorflow as tf 9 | import jieba 10 | import tensorlayer as tl 11 | from packages import text_regularization as tr 12 | import numpy as np 13 | import requests 14 | 15 | 16 | print(" ".join(jieba.cut('分词初始化'))) 17 | wv = tl.files.load_npy_to_any(name='../word2vec/output/model_word2vec_200.npy') 18 | 19 | 20 | def text_tensor(text, wv): 21 | """获取文本向量 22 | Args: 23 | text: 待检测文本 24 | wv: 词向量模型 25 | Returns: 26 | [[[ 3.80905056 2.94315064 -0.20703495 -2.31589055 2.9627794 27 | ... 28 | 2.16935492 2.95426321 -4.71534014 -3.25034237 -11.28901672]]] 29 | """ 30 | text = tr.extractWords(text) 31 | words = jieba.cut(text.strip()) 32 | text_embedding = np.zeros(200) 33 | for word in words: 34 | try: 35 | text_embedding += wv[word] 36 | except KeyError: 37 | text_embedding += wv['UNK'] 38 | text_embedding = np.asarray(text_embedding) 39 | sample = text_embedding.reshape(1, 200) 40 | return sample 41 | 42 | 43 | class MainHandler(tornado.web.RequestHandler): 44 | """请求处理类 45 | """ 46 | 47 | def get(self): 48 | """处理GET请求 49 | """ 50 | text = self.get_argument("text") 51 | predict = self.classify(text) 52 | data = { 53 | 'text' : text, 54 | 'predict' : str(predict[0]) 55 | } 56 | self.write(json.dumps({'data': data})) 57 | 58 | def classify(self, text): 59 | """调用引擎检测文本 60 | Args: 61 | text: 待检测文本 62 | Returns: 63 | 垃圾返回[0],通过返回[1] 64 | """ 65 | sample = text_tensor(text, wv) 66 | sample = np.array(sample, dtype=np.float32) 67 | 68 | sample = sample.reshape(1, 200) 69 | data = json.dumps({ 70 | # "signature_name": "call", 71 | "instances": sample.tolist() 72 | }) 73 | headers = {"content-type": "application/json"} 74 | json_response = requests.post( 75 | 'http://localhost:8501/v1/models/saved_model:predict', 76 | data=data, headers=headers) 77 | 78 | predictions = np.array(json.loads(json_response.text)['predictions']) 79 | result = np.argmax(predictions, axis=-1) 80 | return result 81 | 82 | 83 | def make_app(): 84 | """定义并返回Tornado Web Application 85 | """ 86 | return tornado.web.Application([ 87 | (r"/predict", MainHandler), 88 | ]) 89 | 90 | 91 | if __name__ == "__main__": 92 | app = make_app() 93 | app.listen(80) 94 | print("listen start") 95 | tornado.ioloop.IOLoop.current().start() 96 | -------------------------------------------------------------------------------- /serving/serving_rnn.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python3 2 | """与加载了RNN Classifier导出的Servable的TensorFlow Serving进行通信 3 | """ 4 | 5 | import json 6 | import tornado.ioloop 7 | import tornado.web 8 | import tensorflow as tf 9 | import jieba 10 | import tensorlayer as tl 11 | from packages import text_regularization as tr 12 | import numpy as np 13 | import requests 14 | 15 | 16 | print(" ".join(jieba.cut('分词初始化'))) 17 | wv = tl.files.load_npy_to_any(name='../word2vec/output/model_word2vec_200.npy') 18 | 19 | 20 | def text_tensor(text, wv): 21 | """获取文本向量 22 | Args: 23 | text: 待检测文本 24 | wv: 词向量模型 25 | Returns: 26 | [[[ 3.80905056 2.94315064 -0.20703495 -2.31589055 2.9627794 27 | ... 
28 | 2.16935492 2.95426321 -4.71534014 -3.25034237 -11.28901672]]] 29 | """ 30 | text = tr.extractWords(text) 31 | words = jieba.cut(text.strip()) 32 | text_sequence = [] 33 | for word in words: 34 | try: 35 | text_sequence.append(wv[word]) 36 | except KeyError: 37 | text_sequence.append(wv['UNK']) 38 | text_sequence = np.asarray(text_sequence) 39 | sample = text_sequence.reshape(1, len(text_sequence), 200) 40 | return sample 41 | 42 | 43 | class MainHandler(tornado.web.RequestHandler): 44 | """请求处理类 45 | """ 46 | 47 | def get(self): 48 | """处理GET请求 49 | """ 50 | text = self.get_argument("text") 51 | predict = self.classify(text) 52 | data = { 53 | 'text' : text, 54 | 'predict' : str(predict[0]) 55 | } 56 | self.write(json.dumps({'data': data})) 57 | 58 | def classify(self, text): 59 | """调用引擎检测文本 60 | Args: 61 | text: 待检测文本 62 | Returns: 63 | 垃圾返回[0],通过返回[1] 64 | """ 65 | sample = text_tensor(text, wv) 66 | sample = np.array(sample, dtype=np.float32) 67 | len = sample.shape[1] 68 | sample = sample.reshape(len, 200) 69 | 70 | sample = sample.reshape(1, len, 200) 71 | data = json.dumps({ 72 | # "signature_name": "call", 73 | "instances": sample.tolist() 74 | }) 75 | headers = {"content-type": "application/json"} 76 | json_response = requests.post( 77 | 'http://localhost:8501/v1/models/saved_model:predict', 78 | data=data, headers=headers) 79 | 80 | predictions = np.array(json.loads(json_response.text)['predictions']) 81 | result = np.argmax(predictions, axis=-1) 82 | return result 83 | 84 | def make_app(): 85 | """定义并返回Tornado Web Application 86 | """ 87 | return tornado.web.Application([ 88 | (r"/predict", MainHandler), 89 | ]) 90 | 91 | 92 | if __name__ == "__main__": 93 | app = make_app() 94 | app.listen(80) 95 | print("listen start") 96 | tornado.ioloop.IOLoop.current().start() 97 | -------------------------------------------------------------------------------- /word2vec/README.md: -------------------------------------------------------------------------------- 1 | ### 训练词向量 2 | 3 | ``` 4 | python3 word2vec.py 5 | ``` 6 | 7 | ### 为MLP分类器准备训练集 8 | 9 | ``` 10 | python3 text2vec.py 11 | ``` 12 | 13 | ### 为CNN或RNN分类器准备训练集 14 | 15 | ``` 16 | python3 text_features.py 17 | ``` 18 | 19 | ### 训练网络 20 | 21 | 网络的训练分为两段:一段是词向量的训练,一段是分类器的训练。训练好的词向量在分类器的训练过程中不会再更新。 22 | 23 | #### 训练词向量 24 | 25 | 本例Word2vec训练集的大部分内容都是短文本,经过了基本的特殊字符处理和分词。关于分词,由于用户多样的聊天习惯,文本中会出现大量新词或者火星文,垃圾文本更有各种只可意会不可言传的词出现,因此好的分词器还有赖于新词发现,这是另外一个话题了。因为分词的实现不是本章的重点,所以例子中所有涉及分词的部分都会使用Python上流行的开源分词器结巴分词(Jieba)。作为一款优秀的分词器,它用来训练是完全不成问题的。 26 | 27 | 正样本示例: 28 | 29 | ``` 30 | 得 我 就 在 车里 咪 一会 31 | auv 不错 耶 32 | 不 忘 初心 方 得 始终 你 的 面相 是 有 志向 的 人 33 | ``` 34 | 35 | 负样本示例: 36 | 37 | ``` 38 | 帅哥哥 约 吗 v 信 xx88775 么 么 哒 你 懂 的 39 | 想 在 这里 有个 故事 xxxxx2587 40 | 不再 珈 矀 信 xxx885 无 需要 低线 得 唠嗑 41 | ``` 42 | 43 | 首先加载训练数据。例子中词向量的训练集和接下来分类器所用的训练集是一样的,但是实际场景中词向量的训练集一般比分类器的大很多。因为词向量的训练集是无须打标签的数据,这使得我们可以利用更大规模的文本数据信息,对于接下来分类器处理未被标识过的数据非常有帮助。例如“加我微信xxxxx有福利”的变种“加我溦信xxxxx有福利”,这里“微信”和“溦信”是相似的,经过训练,“微信”和“溦信”在空间中的距离也会比较接近。实例中,经过Word2vec训练之后,我们得到“微信”、“危性”、“溦信”、“微伈”这几个词在空间上是相近的。也就是说,如果“加我微信xxxxx有福利”被标记为负样本,那么“加我溦信xxxxx有福利”也很有可能被判定为垃圾文本。 44 | 45 | ``` 46 | import collections, logging, os, tarfile 47 | import tensorflow as tf 48 | import tensorlayer as tl 49 | 50 | def load_dataset(): 51 | """加载训练数据 52 | Args: 53 | files: 词向量训练数据集合 54 | 得 我 就 在 车里 咪 一会 55 | 终于 知道 哪里 感觉 不 对 了 56 | ... 57 | Returns: 58 | [得 我 就 在 车里 咪 一会 终于 知道 哪里 感觉 不 对 了...] 
59 | """ 60 | prj = "https://github.com/tensorlayer/text-antispam" 61 | if not os.path.exists('data/msglog'): 62 | tl.files.maybe_download_and_extract( 63 | 'msglog.tar.gz', 64 | 'data', 65 | prj + '/raw/master/word2vec/data/') 66 | tarfile.open('data/msglog.tar.gz', 'r').extractall('data') 67 | files = ['data/msglog/msgpass.log.seg', 'data/msglog/msgspam.log.seg'] 68 | words = [] 69 | for file in files: 70 | f = open(file) 71 | for line in f: 72 | for word in line.strip().split(' '): 73 | if word != ”: 74 | words.append(word) 75 | f.close() 76 | return words 77 | ``` 78 | 79 | 为了尽可能不丢失关键信息,我们希望所有词频不小于3的词都加入训练。同时词频小于3的词统一用`UNK`代替,这样只出现一两次的异常联系方式也能加入训练,提高模型的泛化能力。 80 | 81 | ``` 82 | def get_vocabulary_size(words, min_freq=3): 83 | """获取词频不小于min_freq的单词数量 84 | 小于min_freq的单词统一用UNK(unknown)表示 85 | Args: 86 | words: 训练词表 87 | [得 我 就 在 车里 咪 一会 终于 知道 哪里 感觉 不 对 了...] 88 | min_freq: 最低词频 89 | Return: 90 | size: 词频不小于min_freq的单词数量 91 | """ 92 | size = 1 # 为UNK预留 93 | counts = collections.Counter(words).most_common() 94 | for word, c in counts: 95 | if c >= min_freq: 96 | size += 1 97 | return size 98 | ``` 99 | 100 | 在训练过程中,我们不时地要将训练状态进行保存。`save_weights()和load_weights()`是TensorLayer自带的模型权重存储和读取方式,可以非常方便地保存当前模型的变量或者导入之前训练好的变量。 101 | 102 | ``` 103 | def save_weights(model, weights_file_path): 104 | """保存模型训练状态 105 | 将会产生以下文件: 106 | weights/model_word2vec_200.hdf5 107 | Args: 108 | weights_file_path: 储存训练状态的文件路径 109 | """ 110 | path = os.path.dirname(os.path.abspath(weights_file_path)) 111 | if os.path.isdir(path) == False: 112 | logging.warning('Path (%s) not exists, making directories...', path) 113 | os.makedirs(path) 114 | model.save_weights(filepath=weights_file_path) 115 | 116 | def load_weights(model, weights_file_path): 117 | """恢复模型训练状态 118 | 从weights_file_path中恢复所保存的训练状态 119 | Args: 120 | weights_file_path: 储存训练状态的文件路径 121 | """ 122 | if os.path.isfile(weights_file_path): 123 | model.load_weights(filepath=weights_file_path) 124 | ``` 125 | 126 | 我们还需要将词向量保存下来用于后续分类器的训练以及再往后的线上服务。如图2所示,词向量保存在隐层的`W1`矩阵中。如图3所示,输入一个One-hot Representation表示的词与隐层矩阵相乘,输出的就是这个词的词向量。我们将词与向量一一映射导出到一个`.npy`文件中。 127 | 128 |
图3 隐层矩阵存储着词向量
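通过下面的`save_embedding`函数把词向量导出为`.npy`文件之后,可以用一小段脚本粗略检验相近的词在向量空间中是否确实靠得更近,例如前文提到的“微信”与“溦信”。以下只是一个示意,假设`output/model_word2vec_200.npy`已经生成,且这两个词都出现在词表中(否则可以换成词表中的其他词):

```
# 示意:检验导出的词向量中相近词的余弦相似度
import numpy as np
import tensorlayer as tl

# 读取save_embedding导出的单词与向量映射表
wv = tl.files.load_npy_to_any(name='./output/model_word2vec_200.npy')

def cosine(a, b):
    """计算两个词向量的余弦相似度"""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 训练充分时,“微信”与“溦信”的相似度应明显高于随机挑选的两个词
print(cosine(wv['微信'], wv['溦信']))
```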
133 | 134 | ``` 135 | def save_embedding(dictionary, network, embedding_file_path): 136 | """保存词向量 137 | 将训练好的词向量保存到embedding_file_path.npy文件中 138 | Args: 139 | dictionary: 单词与单词ID映射表 140 | {'UNK': 0, '你': 1, '我': 2, ..., '小姐姐': 2546, ...} 141 | network: 默认TensorFlow Session所初始化的网络结构 142 | network = tl.layers.InputLayer(x, name='input_layer') 143 | ... 144 | embedding_file_path: 储存词向量的文件路径 145 | Returns: 146 | 单词与向量映射表以npy格式保存在embedding_file_path.npy文件中 147 | {'关注': [-0.91619176, -0.83772564, ..., 0.74918884], ...} 148 | """ 149 | words, ids = zip(*dictionary.items()) 150 | params = network.normalized_embeddings 151 | embeddings = tf.nn.embedding_lookup( 152 | params, tf.constant(ids, dtype=tf.int32)) 153 | wv = dict(zip(words, embeddings)) 154 | path = os.path.dirname(os.path.abspath(embedding_file_path)) 155 | if os.path.isdir(path) == False: 156 | logging.warning('(%s) not exists, making directories...', path) 157 | os.makedirs(path) 158 | tl.files.save_any_to_npy(save_dict=wv, name=embedding_file_path+'.npy') 159 | ``` 160 | 161 | 这里使用Skip-Gram模型进行训练。Word2vec的训练过程相当于解决一个多分类问题。我们希望学习一个函数`F(x,y)`来表示输入属于特定类别的概率。这意味着对于每个训练样例,我们都要对所有词计算其为给定单词上下文的概率并更新权重,这种穷举的训练方法计算量太大了。Negative Sampling方法通过选取少量的负采样进行权重更新,将每次训练需要计算的类别数量减少到`num_skips`加`num_sampled`之和,使得训练的时间复杂度一下子降低了许多。 162 | 163 | ``` 164 | def train(model_name): 165 | """训练词向量 166 | Args: 167 | corpus_file: 文件内容已经经过分词。 168 | 得 我 就 在 车里 咪 一会 169 | 终于 知道 哪里 感觉 不 对 了 170 | ... 171 | model_name: 模型名称,用于生成保存训练状态和词向量的文件名 172 | Returns: 173 | 输出训练状态以及训练后的词向量文件 174 | """ 175 | words = load_dataset() 176 | data_size = len(words) 177 | vocabulary_size = get_vocabulary_size(words, min_freq=3) 178 | batch_size = 500 # 一次Forword运算以及BP运算中所需要的训练样本数目 179 | embedding_size = 200 # 词向量维度 180 | skip_window = 5 # 上下文窗口,单词前后各取五个词 181 | num_skips = 10 # 从窗口中选取多少个预测对 182 | num_sampled = 64 # 负采样个数 183 | learning_rate = 0.025 # 学习率 184 | n_epoch = 50 # 所有样本重复训练50次 185 | num_steps = int((data_size/batch_size) * n_epoch) # 总迭代次数 186 | 187 | data, count, dictionary, reverse_dictionary = tl.nlp.build_words_dataset(words, vocabulary_size) 188 | train_inputs = tl.layers.Input([batch_size], dtype=tf.int32) 189 | train_labels = tl.layers.Input([batch_size, 1], dtype=tf.int32) 190 | 191 | emb_net = tl.layers.Word2vecEmbedding( 192 | vocabulary_size = vocabulary_size, 193 | embedding_size = embedding_size, 194 | num_sampled = num_sampled, 195 | activate_nce_loss = True, 196 | nce_loss_args = {}) 197 | 198 | emb, nce = emb_net([train_inputs, train_labels]) 199 | model = tl.models.Model(inputs=[train_inputs, train_labels], outputs=[emb, nce]) 200 | optimizer = tf.optimizers.Adam(learning_rate) 201 | 202 | # Start training 203 | model.train() 204 | weights_file_path = "weights/" + model_name + ".hdf5" 205 | load_weights(model, weights_file_path) 206 | 207 | loss_vals = [] 208 | step = data_index = 0 209 | print_freq = 200 210 | while (step < num_steps): 211 | batch_inputs, batch_labels, data_index = tl.nlp.generate_skip_gram_batch( 212 | data=data, batch_size=batch_size, num_skips=num_skips, 213 | skip_window=skip_window, data_index=data_index) 214 | 215 | with tf.GradientTape() as tape: 216 | _, loss_val = model([batch_inputs, batch_labels]) 217 | grad = tape.gradient(loss_val, model.trainable_weights) 218 | optimizer.apply_gradients(zip(grad, model.trainable_weights)) 219 | 220 | loss_vals.append(loss_val) 221 | if step % print_freq == 0: 222 | logging.info("(%d/%d) latest average loss: %f.", step, num_steps, sum(loss_vals)/len(loss_vals)) 223 | del loss_vals[:] 224 | 
save_weights(model, weights_file_path) 225 | embedding_file_path = "output/" + model_name 226 | save_embedding(dictionary, emb_net, embedding_file_path) 227 | step += 1 228 | 229 | if __name__ == '__main__': 230 | fmt = "%(asctime)s %(levelname)s %(message)s" 231 | logging.basicConfig(format=fmt, level=logging.INFO) 232 | 233 | train('model_word2vec_200') 234 | ``` 235 | 236 | #### 文本的表示 237 | 238 | 训练好词向量后,我们将每一行文本转化成词向量序列。有两种方式:将文本各单词向量求和成为一个新的特征向量(适用于MLP Classifier)分别将正负样本保存到`sample_pass.npz`和`sample_spam.npz`中; 或者将文本各单词向量序列化(适用于CNN Classifier和RNN Classifier),分别将正负样本保存到`sample_seq_pass.npz`和`sample_seq_spam.npz`中。 239 | 240 | 为MLP分类器准备训练集: 241 | 242 | ``` 243 | import numpy as np 244 | import tensorlayer as tl 245 | 246 | wv = tl.files.load_npy_to_any(name='./output/model_word2vec_200.npy') 247 | for label in ["pass", "spam"]: 248 | embeddings = [] 249 | inp = "data/msglog/msg" + label + ".log.seg" 250 | outp = "output/sample_" + label 251 | f = open(inp, encoding='utf-8') 252 | for line in f: 253 | words = line.strip().split(' ') 254 | text_embedding = np.zeros(200) 255 | for word in words: 256 | try: 257 | text_embedding += wv[word] 258 | except KeyError: 259 | text_embedding += wv['UNK'] 260 | embeddings.append(text_embedding) 261 | 262 | embeddings = np.asarray(embeddings, dtype=np.float32) 263 | if label == "spam": 264 | labels = np.zeros(embeddings.shape[0]) 265 | elif label == "pass": 266 | labels = np.ones(embeddings.shape[0]) 267 | 268 | np.savez(outp, x=embeddings, y=labels) 269 | f.close() 270 | ``` 271 | 272 | 为CNN或RNN分类器准备训练集: 273 | 274 | ``` 275 | import numpy as np 276 | import tensorlayer as tl 277 | 278 | wv = tl.files.load_npy_to_any(name='./output/model_word2vec_200.npy') 279 | for label in ["pass", "spam"]: 280 | samples = [] 281 | inp = "data/msglog/msg" + label + ".log.seg" 282 | outp = "output/sample_seq_" + label 283 | f = open(inp) 284 | for line in f: 285 | words = line.strip().split(' ') 286 | text_sequence = [] 287 | for word in words: 288 | try: 289 | text_sequence.append(wv[word]) 290 | except KeyError: 291 | text_sequence.append(wv['UNK']) 292 | samples.append(text_sequence) 293 | 294 | if label == "spam": 295 | labels = np.zeros(len(samples)) 296 | elif label == "pass": 297 | labels = np.ones(len(samples)) 298 | 299 | np.savez(outp, x=samples, y=labels) 300 | f.close() 301 | ``` 302 | -------------------------------------------------------------------------------- /word2vec/data/msglog.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorlayer/text-antispam/a713066d795bbb5b76af63494a10a9cec97ffe66/word2vec/data/msglog.tar.gz -------------------------------------------------------------------------------- /word2vec/text2vec.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python3 2 | """生成用于NBOW+MLP Classifier的训练集。 3 | 4 | 训练好词向量后,通过将词向量线性相加获得文本的向量。 5 | 输入分词后的样本,每一行逐词查找词向量并相加,从而得到文本的特征向量和标签。 6 | 分别将正负样本保存到sample_pass.npz和sample_spam.npz。 7 | 8 | """ 9 | 10 | import numpy as np 11 | import tensorlayer as tl 12 | import sys 13 | sys.path.append("../serving/packages") 14 | from text_regularization import extractWords 15 | 16 | wv = tl.files.load_npy_to_any(name='./output/model_word2vec_200.npy') 17 | for label in ["pass", "spam"]: 18 | embeddings = [] 19 | inp = "data/msglog/msg" + label + ".log.seg" 20 | outp = "output/sample_" + label 21 | f = open(inp, encoding='utf-8') 22 | for line in f: 23 | line = extractWords(line) 24 | words = line.strip().split(' ') 25 | text_embedding = np.zeros(200) 26 | for word in words: 27 | try: 28 | text_embedding += wv[word] 29 | except KeyError: 30 | text_embedding += wv['UNK'] 31 | embeddings.append(text_embedding) 32 | 33 | embeddings = np.asarray(embeddings, dtype=np.float32) 34 | if label == "spam": 35 | labels = np.zeros(embeddings.shape[0]) 36 | elif label == "pass": 37 | labels = np.ones(embeddings.shape[0]) 38 | 39 | np.savez(outp, x=embeddings, y=labels) 40 | f.close() 41 | -------------------------------------------------------------------------------- /word2vec/text_features.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python3 2 | """生成用于CNN Classifier或者RNN Classifier的训练集。 3 | 4 | 训练好词向量后,将每一行文本转成词向量序列。 5 | 分别将正负样本保存到sample_seq_pass.npz和sample_seq_spam.npz。 6 | 7 | """ 8 | 9 | import numpy as np 10 | import tensorlayer as tl 11 | import sys 12 | sys.path.append("../serving/packages") 13 | from text_regularization import extractWords 14 | 15 | wv = tl.files.load_npy_to_any(name='./output/model_word2vec_200.npy') 16 | for label in ["pass", "spam"]: 17 | samples = [] 18 | inp = "data/msglog/msg" + label + ".log.seg" 19 | outp = "output/sample_seq_" + label 20 | f = open(inp,encoding='utf-8') 21 | for line in f: 22 | line = extractWords(line) 23 | words = line.strip().split(' ') 24 | text_sequence = [] 25 | for word in words: 26 | try: 27 | text_sequence.append(wv[word]) 28 | except KeyError: 29 | text_sequence.append(wv['UNK']) 30 | samples.append(text_sequence) 31 | 32 | if label == "spam": 33 | labels = np.zeros(len(samples)) 34 | elif label == "pass": 35 | labels = np.ones(len(samples)) 36 | 37 | np.savez(outp, x=samples, y=labels) 38 | f.close() 39 | -------------------------------------------------------------------------------- /word2vec/word2vec.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python3 2 | """Word2vec训练词向量 3 | 4 | Efficient Estimation of Word Representations in Vector Space: https://arxiv.org/pdf/1301.3781.pdf 5 | Distributed Representations of Words and Phrases and their Compositionality: https://arxiv.org/pdf/1310.4546.pdf 6 | word2vec Parameter Learning Explained: https://arxiv.org/pdf/1411.2738.pdf 7 | 8 | """ 9 | 10 | import collections 11 | import logging 12 | import os 13 | import tarfile 14 | import tensorflow as tf 15 | import tensorlayer as tl 16 | 17 | 18 | def load_dataset(): 19 | """加载训练数据 20 | Args: 21 | files: 词向量训练数据集合 22 | 得 我 就 在 车里 咪 一会 23 | 终于 知道 哪里 感觉 不 对 了 24 | ... 25 | Returns: 26 | [得 我 就 在 车里 咪 一会 终于 知道 哪里 感觉 不 对 了...] 
/word2vec/word2vec.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python3
2 | """Train word vectors with word2vec.
3 | 
4 | Efficient Estimation of Word Representations in Vector Space: https://arxiv.org/pdf/1301.3781.pdf
5 | Distributed Representations of Words and Phrases and their Compositionality: https://arxiv.org/pdf/1310.4546.pdf
6 | word2vec Parameter Learning Explained: https://arxiv.org/pdf/1411.2738.pdf
7 | 
8 | """
9 | 
10 | import collections
11 | import logging
12 | import os
13 | import tarfile
14 | import tensorflow as tf
15 | import tensorlayer as tl
16 | 
17 | 
18 | def load_dataset():
19 |     """Load the training corpus.
20 |     The corpus files are already word-segmented, one message per line:
21 |         得 我 就 在 车里 咪 一会
22 |         终于 知道 哪里 感觉 不 对 了
23 |         ...
24 |     Returns:
25 |         A flat list of all words:
26 |         [得 我 就 在 车里 咪 一会 终于 知道 哪里 感觉 不 对 了...]
27 |     """
28 |     prj = "https://github.com/tensorlayer/text-antispam"
29 |     if not os.path.exists('data/msglog'):
30 |         tl.files.maybe_download_and_extract(
31 |             'msglog.tar.gz',
32 |             'data',
33 |             prj + '/raw/master/word2vec/data/')
34 |         tarfile.open('data/msglog.tar.gz', 'r').extractall('data')
35 |     files = ['data/msglog/msgpass.log.seg', 'data/msglog/msgspam.log.seg']
36 |     words = []
37 |     for file in files:
38 |         f = open(file, encoding='utf-8')
39 |         for line in f:
40 |             for word in line.strip().split(' '):
41 |                 if word != '':
42 |                     words.append(word)
43 |         f.close()
44 |     return words
45 | 
46 | 
47 | def get_vocabulary_size(words, min_freq=3):
48 |     """Count the words whose frequency is at least min_freq.
49 |     Words rarer than min_freq are all mapped to UNK (unknown).
50 |     Args:
51 |         words: training word list
52 |             [得 我 就 在 车里 咪 一会 终于 知道 哪里 感觉 不 对 了...]
53 |         min_freq: minimum word frequency
54 |     Returns:
55 |         size: number of words with frequency >= min_freq, plus one slot reserved for UNK
56 |     """
57 |     size = 1  # reserve one slot for UNK
58 |     counts = collections.Counter(words).most_common()
59 |     for word, c in counts:
60 |         if c >= min_freq:
61 |             size += 1
62 |     return size
63 | 
64 | 
65 | def save_weights(model, weights_file_path):
66 |     """Save the model's training state.
67 |     Produces a file such as:
68 |         weights/model_word2vec_200.hdf5
69 |     Args:
70 |         weights_file_path: path of the file that stores the training state
71 |     """
72 |     path = os.path.dirname(os.path.abspath(weights_file_path))
73 |     if not os.path.isdir(path):
74 |         logging.warning('Path (%s) does not exist, making directories...', path)
75 |         os.makedirs(path)
76 |     model.save_weights(filepath=weights_file_path)
77 | 
78 | 
79 | def load_weights(model, weights_file_path):
80 |     """Restore the model's training state.
81 |     Loads the saved training state from weights_file_path, if the file exists.
82 |     Args:
83 |         weights_file_path: path of the file that stores the training state
84 |     """
85 |     if os.path.isfile(weights_file_path):
86 |         model.load_weights(filepath=weights_file_path)
87 | 
88 | 
89 | def save_embedding(dictionary, network, embedding_file_path):
90 |     """Save the word vectors.
91 |     Writes the trained word vectors to embedding_file_path.npy.
92 |     Args:
93 |         dictionary: mapping from word to word ID
94 |             {'UNK': 0, '你': 1, '我': 2, ..., '别生气': 2545, '小姐姐': 2546, ...}
95 |         network: the embedding layer whose weights are exported, in this
96 |             script the tl.layers.Word2vecEmbedding instance (emb_net)
97 |             that is built in train()
98 |         embedding_file_path: path of the file that stores the word vectors
99 |     Returns:
100 |         The word-to-vector mapping is saved in npy format to embedding_file_path.npy:
101 |         {'关注': [-0.91619176, -0.83772564, ..., -1.90845013, 0.74918884], ...}
102 |     """
103 |     words, ids = zip(*dictionary.items())
104 |     params = network.normalized_embeddings
105 |     embeddings = tf.nn.embedding_lookup(params, tf.constant(ids, dtype=tf.int32))
106 |     embeddings = embeddings.numpy()  # convert eager tensors to numpy so they can be written to the .npy file
107 |     wv = dict(zip(words, embeddings))
108 |     path = os.path.dirname(os.path.abspath(embedding_file_path))
109 |     if not os.path.isdir(path):
110 |         logging.warning('Path (%s) does not exist, making directories...', path)
111 |         os.makedirs(path)
112 |     tl.files.save_any_to_npy(save_dict=wv, name=embedding_file_path + '.npy')
113 | 
114 | 
115 | def train(model_name):
116 |     """Train the word vectors.
117 |     Args:
118 |         model_name: model name, used to build the file names of the saved
119 |             training state and of the trained word-vector file.
120 |     The corpus is loaded by load_dataset(); each line is already word-segmented:
121 |         得 我 就 在 车里 咪 一会
122 |         终于 知道 哪里 感觉 不 对 了
123 |     Returns:
124 |         Writes the training-state file and the trained word-vector file.
125 |     """
126 |     words = load_dataset()
127 |     data_size = len(words)
128 |     vocabulary_size = get_vocabulary_size(words, min_freq=3)
129 |     batch_size = 500       # number of training samples per forward/backward pass
130 |     embedding_size = 200   # dimensionality of the word vectors
131 |     skip_window = 5        # context window: five words on each side of the target word
132 |     num_skips = 10         # number of (target, context) pairs drawn from each window
133 |     num_sampled = 64       # number of negative samples
134 |     learning_rate = 0.025  # learning rate
135 |     n_epoch = 50           # number of passes over the whole corpus
136 |     num_steps = int((data_size / batch_size) * n_epoch)  # total number of iterations
137 | 
138 |     data, count, dictionary, reverse_dictionary = tl.nlp.build_words_dataset(words, vocabulary_size)
139 |     train_inputs = tl.layers.Input([batch_size], dtype=tf.int32)
140 |     train_labels = tl.layers.Input([batch_size, 1], dtype=tf.int32)
141 | 
142 |     emb_net = tl.layers.Word2vecEmbedding(
143 |         vocabulary_size=vocabulary_size,
144 |         embedding_size=embedding_size,
145 |         num_sampled=num_sampled,
146 |         activate_nce_loss=True,
147 |         nce_loss_args={})
148 | 
149 |     emb, nce = emb_net([train_inputs, train_labels])
150 |     model = tl.models.Model(inputs=[train_inputs, train_labels], outputs=[emb, nce])
151 |     optimizer = tf.optimizers.Adam(learning_rate)
152 | 
153 |     # Start training
154 |     model.train()
155 |     weights_file_path = "weights/" + model_name + ".hdf5"
156 |     load_weights(model, weights_file_path)
157 | 
158 |     loss_vals = []
159 |     step = data_index = 0
160 |     print_freq = 200
161 |     while step < num_steps:
162 |         batch_inputs, batch_labels, data_index = tl.nlp.generate_skip_gram_batch(
163 |             data=data, batch_size=batch_size, num_skips=num_skips,
164 |             skip_window=skip_window, data_index=data_index)
165 | 
166 |         with tf.GradientTape() as tape:
167 |             _, loss_val = model([batch_inputs, batch_labels])
168 |         grad = tape.gradient(loss_val, model.trainable_weights)
169 |         optimizer.apply_gradients(zip(grad, model.trainable_weights))
170 | 
171 |         loss_vals.append(loss_val)
172 |         if step % print_freq == 0:
173 |             logging.info("(%d/%d) latest average loss: %f.", step, num_steps, sum(loss_vals) / len(loss_vals))
174 |             del loss_vals[:]
175 |             save_weights(model, weights_file_path)
176 |             embedding_file_path = "output/" + model_name
177 |             save_embedding(dictionary, emb_net, embedding_file_path)
178 |         step += 1
179 | 
180 | 
181 | if __name__ == '__main__':
182 |     fmt = "%(asctime)s %(levelname)s %(message)s"
183 |     logging.basicConfig(format=fmt, level=logging.INFO)
184 | 
185 |     train('model_word2vec_200')
186 | 
--------------------------------------------------------------------------------
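Once training has run, a quick way to check that the exported vectors are sensible is to load `output/model_word2vec_200.npy` (the file written by `save_embedding` above) and look at nearest neighbours under cosine similarity: related words should cluster together. This is only a sketch, not part of the repository; the `nearest()` helper and the query word are illustrative.

```
import numpy as np
import tensorlayer as tl

# Load the word -> vector mapping written by save_embedding().
wv = tl.files.load_npy_to_any(name='./output/model_word2vec_200.npy')
words = list(wv.keys())
vectors = np.asarray([np.asarray(wv[w], dtype=np.float32) for w in words])
# normalized_embeddings should already be unit length; normalize defensively anyway.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-8

def nearest(query, topn=5):
    """Return the topn words closest to `query` by cosine similarity."""
    q = vectors[words.index(query)]
    sims = vectors @ q
    order = np.argsort(-sims)
    return [(words[i], float(sims[i])) for i in order[1:topn + 1]]

print(nearest('微'))  # replace with any word that occurs in the vocabulary
```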