├── .DS_Store
├── .gitignore
├── README.md
├── assets
    └── dssm_rnn_loss.png
├── auto_ml.yml
├── config.py
├── configs
    ├── bert_classify.yml
    ├── config.yml
    ├── config_bert.yml
    └── search_space.json
├── data
    ├── readme.md
    └── vocab.txt
├── data_input.py
├── dssm.py
├── dssm_rnn.py
├── flask_server.py
├── model
    ├── base_model.py
    ├── bert
    │   ├── ReadMe.md
    │   ├── modeling.py
    │   ├── modeling_v1.py
    │   ├── optimization.py
    │   └── tokenization.py
    ├── bert_classifier.py
    └── siamese_network.py
├── multi_view_dssm_v3.py
├── requirement.txt
├── train.py
└── util.py


/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsaneLife/dssm/1d32e137654e03994f7ba6cfde52e1d47601027c/.DS_Store


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # data
 2 | data/*
 3 | !data/readme.md
 4 | !data/vocab.txt
 5 | Summaries/
 6 | 
 7 | Summaries/*
 8 | results/*
 9 | log
10 | tmp
11 | 
12 | # py
13 | test.py
14 | 
15 | # 通用
16 | .unotes/
17 | envi/
18 | __pycache__/
19 | .vscode/
20 | *.pyc
21 | .DS_Store
22 | *.DS_Store
23 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | [Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/publication/learning-deep-structured-semantic-models-for-web-search-using-clickthrough-data/)以及其后续文章
  2 | 
  3 | [A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](http://blog.csdn.net/shine19930820/article/details/78810984)的实现Demo。
  4 | 
  5 | # 注意：
  6 | **\*\*\*\*2020/11/15\*\*\*\***
  7 | 
  8 | 论文[li2020sentence](https://arxiv.org/abs/2011.05864)将normalizing flows和bert结合，在语义相似度任务上有奇效，接下来会继续进行验证。
  9 | 
 10 | **\*\*\*\*2020/10/27\*\*\*\***
 11 | 
 12 | 添加底层使用bert的siamese-bert实验，见[siamese\_network.py](https://github.com/InsaneLife/dssm/blob/master/model/siamese_network.py)中类 SiamenseBert，其他和下面一样.
 13 | 
 14 | 相比于[bert](https://github.com/InsaneLife/dssm/blob/master/model/bert_classifier.py) 直接将两句作为输入，双塔bert的优势在于：
 15 | - max sequence len会更短，训练所需要显存更小，速度会稍微快一些，对于硬件不是太好的伙伴比较友好。
 16 | - 可以训练后使用bert作为句子embedding的encoder，在一些线上匹配的时候，可以预先将需要对比的句子向量算出来，节省实时算力。
 17 | - 效果相比于直接用bert输入两句，测试集会差一个多点。
 18 | - bert可以使用[CLS]的输出或者average context embedding, 一般后者效果会更好。
 19 | ```shell
 20 | # bert_siamese双塔模型
 21 | python train.py --mode=train --method=bert_siamese
 22 | # 直接使用功能bert
 23 | python train.py --mode=train --method=bert
 24 | ```
 25 | 参考：[reimers2019sentence](https://arxiv.org/abs/1908.10084)
 26 | 
 27 | **\*\*\*\*2020/10/17\*\*\*\***
 28 | 
 29 | 由于之前数据集问题，会有不收敛问题，现更换数据集为LCQMC口语化描述的语义相似度数据集。模型也从多塔变成了双塔模型，见[siamese\_network.py](https://github.com/InsaneLife/dssm/blob/master/model/siamese_network.py), 训练入口：[train.py](https://github.com/InsaneLife/dssm/blob/master/train.py)
 30 | 
 31 | > 难以找到搜索点击的公开数据集，暂且用语义相似任务数据集，有点变味了，哈哈
 32 | > 目前看在此数据集上测试数据集的准确率是提升的，只有七十多，但是要达到论文的准确率，仍然还需要进行调参
 33 | 
 34 | 训练（默认使用功LCQMC数据集）：
 35 | 
 36 | ```shell
 37 | python train.py --mode=train
 38 | ```
 39 | 
 40 | 预测：
 41 | 
 42 | ```shell
 43 | python train.py --mode=train --file=$predict_file$
 44 | ```
 45 | 
 46 | 测试文件格式: q1\tq2, 例如：
 47 | 
 48 | ```
 49 | 今天天气怎么样	今天温度怎么样
 50 | ```
 51 | 
 52 | 
 53 | 
 54 | **\*\*\*\*2019/5/18\*\*\*\***
 55 | 
 56 | 由于之前代码api过时，已更新最新代码于：[dssm\_rnn.py](https://github.com/InsaneLife/dssm/blob/master/dssm_rnn.py)
 57 | 
 58 | 数据处理代码[data\_input.py](https://github.com/InsaneLife/dssm/blob/master/data_input.py) 和数据[data](https://github.com/InsaneLife/dssm/tree/master/data) 已经更新，由于使用了rnn，所以**输入非bag of words方式。**
 59 | 
 60 | ![img](https://ask.qcloudimg.com/http-save/yehe-1881084/7ficv1hhqf.png?imageView2/2/w/1620)
 61 | 
 62 | > 来源：Palangi, Hamid, et al. "Semantic modelling with long-short-term memory for information retrieval." arXiv preprint arXiv:1412.6629 2014.
 63 | > 
 64 | > 训练损失，在45个epoch时基本不下降：
 65 | > 
 66 | > ![dssm_rnn_loss](https://raw.githubusercontent.com/InsaneLife/dssm/master/assets/dssm_rnn_loss.png)
 67 | 
 68 | # 1\. 数据&环境
 69 | 
 70 | DSSM，对于输入数据是Query对，即Query短句和相应的展示，展示中分点击和未点击，分别为正负样，同时对于点击的先后顺序，也是有不同赋值，具体可参考论文。
 71 | 
 72 | 对于我的Query数据本人无权开放，还请自行寻找数据。
 73 | 环境：
 74 | 
 75 | 1. win, python3.5, tensorflow1.4.
 76 | 
 77 | # 2\. word hashing
 78 | 
 79 | 原文使用3-grams，对于中文，我使用了uni-gram，因为中文本身字有一定代表意义（也有论文拆笔画），对于每个gram都使用one-hot编码代替，最终可以大大降低短句维度。
 80 | 
 81 | # 3\. 结构
 82 | 
 83 | 结构图：
 84 | 
 85 | ![img](https://raw.githubusercontent.com/InsaneLife/MyPicture/master/dssm2.png)
 86 | 
 87 | 1. 把条目映射成低维向量。
 88 | 2. 计算查询和文档的cosine相似度。
 89 | 
 90 | ## 3.1 输入
 91 | 
 92 | 这里使用了TensorBoard可视化，所以定义了name\_scope:
 93 | 
 94 | ``` python
 95 | with tf.name_scope('input'):
 96 |     query_batch = tf.sparse_placeholder(tf.float32, shape=[None, TRIGRAM_D], name='QueryBatch')
 97 |     doc_positive_batch = tf.sparse_placeholder(tf.float32, shape=[None, TRIGRAM_D], name='DocBatch')
 98 |     doc_negative_batch = tf.sparse_placeholder(tf.float32, shape=[None, TRIGRAM_D], name='DocBatch')
 99 |     on_train = tf.placeholder(tf.bool)
100 | ```
101 | 
102 | ## 3.2 全连接层
103 | 
104 | 我使用三层的全连接层，对于每一层全连接层，除了神经元不一样，其他都一样，所以可以写一个函数复用。
105 | $$
106 | l\_n = W\_n x + b\_1
107 | $$
108 | 
109 | ``` python
110 | def add_layer(inputs, in_size, out_size, activation_function=None):
111 |     wlimit = np.sqrt(6.0 / (in_size + out_size))
112 |     Weights = tf.Variable(tf.random_uniform([in_size, out_size], -wlimit, wlimit))
113 |     biases = tf.Variable(tf.random_uniform([out_size], -wlimit, wlimit))
114 |     Wx_plus_b = tf.matmul(inputs, Weights) + biases
115 |     if activation_function is None:
116 |         outputs = Wx_plus_b
117 |     else:
118 |         outputs = activation_function(Wx_plus_b)
119 |     return outputs
120 | ```
121 | 
122 | 其中，对于权重和Bias，使用了按照论文的特定的初始化方式：
123 | 
124 | ``` python
125 | 	wlimit = np.sqrt(6.0 / (in_size + out_size))
126 |     Weights = tf.Variable(tf.random_uniform([in_size, out_size], -wlimit, wlimit))
127 |     biases = tf.Variable(tf.random_uniform([out_size], -wlimit, wlimit))
128 | ```
129 | 
130 | ### Batch Normalization
131 | 
132 | ``` python
133 | def batch_normalization(x, phase_train, out_size):
134 |     """
135 |     Batch normalization on convolutional maps.
136 |     Ref.: http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow
137 |     Args:
138 |         x:           Tensor, 4D BHWD input maps
139 |         out_size:       integer, depth of input maps
140 |         phase_train: boolean tf.Varialbe, true indicates training phase
141 |         scope:       string, variable scope
142 |     Return:
143 |         normed:      batch-normalized maps
144 |     """
145 |     with tf.variable_scope('bn'):
146 |         beta = tf.Variable(tf.constant(0.0, shape=[out_size]),
147 |                            name='beta', trainable=True)
148 |         gamma = tf.Variable(tf.constant(1.0, shape=[out_size]),
149 |                             name='gamma', trainable=True)
150 |         batch_mean, batch_var = tf.nn.moments(x, [0], name='moments')
151 |         ema = tf.train.ExponentialMovingAverage(decay=0.5)
152 | 
153 |         def mean_var_with_update():
154 |             ema_apply_op = ema.apply([batch_mean, batch_var])
155 |             with tf.control_dependencies([ema_apply_op]):
156 |                 return tf.identity(batch_mean), tf.identity(batch_var)
157 | 
158 |         mean, var = tf.cond(phase_train,
159 |                             mean_var_with_update,
160 |                             lambda: (ema.average(batch_mean), ema.average(batch_var)))
161 |         normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
162 |     return normed
163 | ```
164 | 
165 | ### 单层
166 | 
167 | ``` python
168 | with tf.name_scope('FC1'):
169 |     # 激活函数在BN之后，所以此处为None
170 |     query_l1 = add_layer(query_batch, TRIGRAM_D, L1_N, activation_function=None)
171 |     doc_positive_l1 = add_layer(doc_positive_batch, TRIGRAM_D, L1_N, activation_function=None)
172 |     doc_negative_l1 = add_layer(doc_negative_batch, TRIGRAM_D, L1_N, activation_function=None)
173 | 
174 | with tf.name_scope('BN1'):
175 |     query_l1 = batch_normalization(query_l1, on_train, L1_N)
176 |     doc_l1 = batch_normalization(tf.concat([doc_positive_l1, doc_negative_l1], axis=0), on_train, L1_N)
177 |     doc_positive_l1 = tf.slice(doc_l1, [0, 0], [query_BS, -1])
178 |     doc_negative_l1 = tf.slice(doc_l1, [query_BS, 0], [-1, -1])
179 |     query_l1_out = tf.nn.relu(query_l1)
180 |     doc_positive_l1_out = tf.nn.relu(doc_positive_l1)
181 |     doc_negative_l1_out = tf.nn.relu(doc_negative_l1)
182 | ······
183 | ```
184 | 
185 | 合并负样本
186 | 
187 | ``` python
188 | with tf.name_scope('Merge_Negative_Doc'):
189 |     # 合并负样本，tile可选择是否扩展负样本。
190 |     doc_y = tf.tile(doc_positive_y, [1, 1])
191 |     for i in range(NEG):
192 |         for j in range(query_BS):
193 |             # slice(input_, begin, size)切片API
194 |             doc_y = tf.concat([doc_y, tf.slice(doc_negative_y, [j * NEG + i, 0], [1, -1])], 0)
195 | ```
196 | 
197 | ## 3.3 计算cos相似度
198 | 
199 | ``` python
200 | with tf.name_scope('Cosine_Similarity'):
201 |     # Cosine similarity
202 |     # query_norm = sqrt(sum(each x^2))
203 |     query_norm = tf.tile(tf.sqrt(tf.reduce_sum(tf.square(query_y), 1, True)), [NEG + 1, 1])
204 |     # doc_norm = sqrt(sum(each x^2))
205 |     doc_norm = tf.sqrt(tf.reduce_sum(tf.square(doc_y), 1, True))
206 | 
207 |     prod = tf.reduce_sum(tf.multiply(tf.tile(query_y, [NEG + 1, 1]), doc_y), 1, True)
208 |     norm_prod = tf.multiply(query_norm, doc_norm)
209 | 
210 |     # cos_sim_raw = query * doc / (||query|| * ||doc||)
211 |     cos_sim_raw = tf.truediv(prod, norm_prod)
212 |     # gamma = 20
213 |     cos_sim = tf.transpose(tf.reshape(tf.transpose(cos_sim_raw), [NEG + 1, query_BS])) * 20
214 | ```
215 | 
216 | ## 3.4 定义损失函数
217 | 
218 | ``` python
219 | with tf.name_scope('Loss'):
220 |     # Train Loss
221 |     # 转化为softmax概率矩阵。
222 |     prob = tf.nn.softmax(cos_sim)
223 |     # 只取第一列，即正样本列概率。
224 |     hit_prob = tf.slice(prob, [0, 0], [-1, 1])
225 |     loss = -tf.reduce_sum(tf.log(hit_prob))
226 |     tf.summary.scalar('loss', loss)
227 | ```
228 | 
229 | ## 3.5选择优化方法
230 | 
231 | ``` python
232 | with tf.name_scope('Training'):
233 |     # Optimizer
234 |     train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(loss)
235 | ```
236 | 
237 | ## 3.6 开始训练
238 | 
239 | ``` python
240 | # 创建一个Saver对象，选择性保存变量或者模型。
241 | saver = tf.train.Saver()
242 | # with tf.Session(config=config) as sess:
243 | with tf.Session() as sess:
244 |     sess.run(tf.global_variables_initializer())
245 |     train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train', sess.graph)
246 |     start = time.time()
247 |     for step in range(FLAGS.max_steps):
248 |         batch_id = step % FLAGS.epoch_steps
249 |         sess.run(train_step, feed_dict=feed_dict(True, True, batch_id % FLAGS.pack_size, 0.5))
250 | ```
251 | 
252 | GitHub完整代码 [https://github.com/InsaneLife/dssm](https://github.com/InsaneLife/dssm)
253 | 
254 | Multi-view DSSM实现同理，可以参考GitHub：[multi\_view\_dssm\_v3](https://github.com/InsaneLife/dssm/blob/master/multi_view_dssm_v3.py)
255 | 
256 | CSDN原文：[http://blog.csdn.net/shine19930820/article/details/79042567](http://blog.csdn.net/shine19930820/article/details/79042567)
257 | 
258 | ## 自动调参
259 | 参数搜索空间：[search_space.json](./configs/search_space.json)
260 | 配置文件：[auto_ml.yml](auto_ml.yml)
261 | 启动命令
262 | ```shell
263 | nnictl create --config auto_ml.yml -p 8888
264 | ```
265 | > 由于没有gpu 😂，[auto_ml.yml](auto_ml.yml)设置中没有配置gpu，有gpu同学可自行配置。
266 | 
267 | 详细文档：https://nni.readthedocs.io/zh/latest/Overview.html
268 | 
269 | 
270 | # Reference
271 | - [li2020sentence](https://arxiv.org/abs/2011.05864)
272 | - [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)
273 | - nni 调参: https://nni.readthedocs.io/zh/latest/Overview.html


--------------------------------------------------------------------------------
/assets/dssm_rnn_loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsaneLife/dssm/1d32e137654e03994f7ba6cfde52e1d47601027c/assets/dssm_rnn_loss.png


--------------------------------------------------------------------------------
/auto_ml.yml:
--------------------------------------------------------------------------------
 1 | authorName: default
 2 | experimentName: example_dssm
 3 | trialConcurrency: 1
 4 | maxExecDuration: 10h
 5 | maxTrialNum: 8
 6 | #choice: local, remote, pai
 7 | trainingServicePlatform: local
 8 | searchSpacePath: ./configs/search_space.json
 9 | #choice: true, false
10 | useAnnotation: false
11 | tuner:
12 |   #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
13 |   #SMAC (SMAC should be installed through nnictl)
14 |   builtinTunerName: TPE
15 |   classArgs:
16 |     #choice: maximize, minimize
17 |     optimize_mode: maximize
18 | trial:
19 |   command: python3 train.py --method=bert --mode=train
20 |   codeDir: .
21 |   gpuNum: 0
22 | localConfig:
23 |   useActiveGpu: true
24 |   maxTrialNumPerGpu: 3
25 |   # gpuIndices: 4,0,5


--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | # encoding=utf-8
 3 | '''
 4 | Author: 	zhiyang.zzy 
 5 | Date: 		2019-09-25 21:59:54
 6 | Contact: 	zhiyangchou@gmail.com
 7 | FilePath: /dssm/config.py
 8 | Desc: 		
 9 | '''
10 | 
11 | 
12 | def load_vocab(file_path):
13 |     word_dict = {}
14 |     with open(file_path, encoding='utf8') as f:
15 |         for idx, word in enumerate(f.readlines()):
16 |             word = word.strip()
17 |             word_dict[word] = idx
18 |     return word_dict
19 | 
20 | 
21 | class Config(object):
22 |     def __init__(self):
23 |         self.vocab_map = load_vocab(self.vocab_path)
24 |         self.nwords = len(self.vocab_map)
25 | 
26 |     unk = '[UNK]'
27 |     pad = '[PAD]'
28 |     vocab_path = './data/vocab.txt'
29 |     # file_train = './data/oppo_round1_train_20180929.mini'
30 |     # file_train = './data/oppo_round1_train_20180929.txt'
31 |     # file_vali = './data/oppo_round1_vali_20180929.mini'
32 |     file_vali = './data/oppo_round1_vali_20180929.txt'
33 |     file_train = file_vali
34 |     max_seq_len = 40
35 |     hidden_size_rnn = 100
36 |     use_stack_rnn = False
37 |     learning_rate = 0.001
38 |     decay_step = 2000
39 |     lr_decay = 0.95
40 |     num_epoch = 300
41 |     epoch_no_imprv = 5
42 |     optimizer = "lazyadam"
43 |     summaries_dir = './results/Summaries/'
44 |     gpu = 0
45 |     word_dim = 100
46 |     batch_size = 64
47 |     keep_porb = 0.5
48 |     dropout = 1- keep_porb
49 | 
50 |     # checkpoint_dir
51 |     checkpoint_dir='./results/checkpoint'
52 | 
53 | 
54 | if __name__ == '__main__':
55 |     conf = Config()
56 |     print(len(conf.vocab_map))
57 |     pass
58 | 


--------------------------------------------------------------------------------
/configs/bert_classify.yml:
--------------------------------------------------------------------------------
 1 | unk: '[UNK]'
 2 | pad: '[PAD]'
 3 | vocab_path: './data/vocab.txt'
 4 | max_seq_len: 80
 5 | hidden_size_rnn: 200
 6 | use_stack_rnn: False
 7 | learning_rate: 0.00005
 8 | decay_step: 1800
 9 | lr_decay: 0.95
10 | num_epoch: 10
11 | epoch_no_imprv: 30
12 | optimizer: "adam"
13 | summaries_dir: './results/Summaries/'
14 | gpu: 0
15 | word_dim: 100
16 | batch_size: 64
17 | keep_porb: 0.5
18 | # checkpoint_dir
19 | checkpoint_dir: './results/checkpoint/bert_classifier/model'
20 | nwords: 21128
21 | sentence_embedding_type: cls
22 | 
23 | # bert
24 | # bert_dir: &bert_dir '/mnt/nlp/bert/chinese_L-12_H-768_A-12/'
25 | bert_dir: '/Volumes/HddData/ProjectData/NLP/bert/chinese_L-12_H-768_A-12/'
26 | bert_init_checkpoint: "bert_model.ckpt"
27 | bert_vocab: "vocab.txt"
28 | bert_config: "bert_config.json"


--------------------------------------------------------------------------------
/configs/config.yml:
--------------------------------------------------------------------------------
 1 | unk: '[UNK]'
 2 | pad: '[PAD]'
 3 | vocab_path: './data/vocab.txt'
 4 | max_seq_len: 40
 5 | hidden_size_rnn: 200
 6 | use_stack_rnn: False
 7 | learning_rate: 0.0005
 8 | decay_step: 1800
 9 | lr_decay: 0.95
10 | num_epoch: 300
11 | epoch_no_imprv: 10
12 | optimizer: "lazyadam"
13 | summaries_dir: './results/Summaries/'
14 | gpu: 0
15 | word_dim: 100
16 | batch_size: 128
17 | keep_porb: 0.5
18 | # checkpoint_dir
19 | checkpoint_dir: './results/checkpoint/bert/model'
20 | nwords: 21128
21 | 
22 | # bert
23 | # bert_dir: &bert_dir '/mnt/nlp/bert/chinese_L-12_H-768_A-12/'
24 | bert_dir: '/Volumes/HddData/ProjectData/NLP/bert/chinese_L-12_H-768_A-12/'
25 | bert_init_checkpoint: "bert_model.ckpt"
26 | bert_vocab: "vocab.txt"
27 | bert_config: "bert_config.json"


--------------------------------------------------------------------------------
/configs/config_bert.yml:
--------------------------------------------------------------------------------
 1 | unk: '[UNK]'
 2 | pad: '[PAD]'
 3 | vocab_path: './data/vocab.txt'
 4 | max_seq_len: 40
 5 | hidden_size_rnn: 200
 6 | use_stack_rnn: False
 7 | learning_rate: 0.00005
 8 | decay_step: 1800
 9 | lr_decay: 0.95
10 | num_epoch: 10
11 | epoch_no_imprv: 5
12 | optimizer: "adam"
13 | summaries_dir: './results/Summaries/'
14 | gpu: 0
15 | word_dim: 100
16 | batch_size: 256
17 | keep_porb: 0.5
18 | # checkpoint_dir
19 | checkpoint_dir: './results/checkpoint/bert/model'
20 | nwords: 21128
21 | use_avg_pooling: 1
22 | sentence_embedding_type: avg-last-2
23 | 
24 | # bert
25 | # bert_dir: &bert_dir '/mnt/nlp/bert/chinese_L-12_H-768_A-12/'
26 | bert_dir: '/Volumes/HddData/ProjectData/NLP/bert/chinese_L-12_H-768_A-12/'
27 | bert_init_checkpoint: "bert_model.ckpt"
28 | bert_vocab: "vocab.txt"
29 | bert_config: "bert_config.json"


--------------------------------------------------------------------------------
/configs/search_space.json:
--------------------------------------------------------------------------------
1 | {
2 |     "batch_size": {"_type":"choice", "_value": [32, 64, 128, 256]},
3 |     "learning_rate":{"_type":"quniform","_value":[0.00002, 0.00005, 0.00001]}
4 | }


--------------------------------------------------------------------------------
/data/readme.md:
--------------------------------------------------------------------------------
 1 | # OPPO手机搜索排序query-title语义匹配数据集
 2 | 数据来自天池大数据比赛，是OPPO手机搜索排序query-title语义匹配的问题。
 3 | 
 4 | 数据格式： 数据分4列，\t分隔。
 5 | 
 6 | | 字段             | 说明                                                         | 数据示例                                  |
 7 | | ---------------- | ------------------------------------------------------------ | ----------------------------------------- |
 8 | | prefix           | 用户输入（query前缀）                                        | 刘德                                      |
 9 | | query_prediction | 根据当前前缀，预测的用户完整需求查询词，最多10条；预测的查询词可能是前缀本身，数字为统计概率 | {“刘德华”: “0.5”, “刘德华的歌”: “0.3”, …} |
10 | | title            | 文章标题                                                     | 刘德华                                    |
11 | | tag              | 文章内容标签                                                 | 百科                                      |
12 | | label            | 是否点击                                                     | 0或1                                      |
13 | 
14 | 为了应用来训练DSSM demo，将prefix和title作为正样，prefix和query_prediction（除title以外）作为负样本。
15 | 
16 | 下载链接：链接: https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw 提取码: 7p3n
17 | 
18 | 本数据仅限用于个人实验，如数据版权问题，请联系[chou.young@qq.com](mailto:chou.young@qq.com) 下架。
19 | 
20 | 
21 | 
22 | 下载解压到data文件夹即可，注意修改config.py中配置：file_train, file_vali。
23 | 
24 | 
25 | # 其他数据集
26 | https://paddlehub.readthedocs.io/zh_CN/latest/reference/dataset.html
27 | 
28 | ## LCQMC
29 | import paddlehub as hub
30 | dataset = hub.dataset.LCQMC()
31 | 
32 | pass
33 | train:238766
34 | test:12500
35 | dev:8802


--------------------------------------------------------------------------------
/data_input.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # encoding=utf-8
  3 | from inspect import getblock
  4 | import json
  5 | import os
  6 | from os import read
  7 | from numpy.core.fromnumeric import mean
  8 | import numpy as np
  9 | import paddlehub as hub
 10 | import six
 11 | import math
 12 | import random
 13 | import sys
 14 | from util import read_file
 15 | from config import Config
 16 | # 配置文件
 17 | conf = Config()
 18 | 
 19 | 
 20 | class Vocabulary(object):
 21 |     def __init__(self, meta_file, max_len, allow_unk=0, unk="$UNK$", pad="$PAD$",):
 22 |         self.voc2id = {}
 23 |         self.id2voc = {}
 24 |         self.unk = unk
 25 |         self.pad = pad
 26 |         self.max_len = max_len
 27 |         self.allow_unk = allow_unk
 28 |         with open(meta_file, encoding='utf-8') as f:
 29 |             for i, line in enumerate(f):
 30 |                 line = convert_to_unicode(line.strip("\n"))
 31 |                 self.voc2id[line] = i
 32 |                 self.id2voc[i] = line
 33 |         self.size = len(self.voc2id)
 34 |         self.oov_num = self.size + 1
 35 | 
 36 |     def fit(self, words_list):
 37 |         """
 38 |         :param words_list: [[w11, w12, ...], [w21, w22, ...], ...]
 39 |         :return:
 40 |         """
 41 |         word_lst = []
 42 |         word_lst_append = word_lst.append
 43 |         for words in words_list:
 44 |             if not isinstance(words, list):
 45 |                 print(words)
 46 |                 continue
 47 |             for word in words:
 48 |                 word = convert_to_unicode(word)
 49 |                 word_lst_append(word)
 50 |         word_counts = Counter(word_lst)
 51 |         if self.max_num_word < 0:
 52 |             self.max_num_word = len(word_counts)
 53 |         sorted_voc = [w for w, c in word_counts.most_common(self.max_num_word)]
 54 |         self.max_num_word = len(sorted_voc)
 55 |         self.oov_index = self.max_num_word + 1
 56 |         self.voc2id = dict(zip(sorted_voc, range(1, self.max_num_word + 1)))
 57 |         return self
 58 | 
 59 |     def _transform2id(self, word):
 60 |         word = convert_to_unicode(word)
 61 |         if word in self.voc2id:
 62 |             return self.voc2id[word]
 63 |         elif self.allow_unk:
 64 |             return self.voc2id[self.unk]
 65 |         else:
 66 |             print(word)
 67 |             raise ValueError("word:{} Not in voc2id, please check".format(word))
 68 | 
 69 |     def _transform_seq2id(self, words, padding=0):
 70 |         out_ids = []
 71 |         words = convert_to_unicode(words)
 72 |         if self.max_len:
 73 |             words = words[:self.max_len]
 74 |         for w in words:
 75 |             out_ids.append(self._transform2id(w))
 76 |         if padding and self.max_len:
 77 |             while len(out_ids) < self.max_len:
 78 |                 out_ids.append(0)
 79 |         return out_ids
 80 |     
 81 |     def _transform_intent2ont_hot(self, words, padding=0):
 82 |         # 将多标签意图转为 one_hot
 83 |         out_ids = np.zeros(self.size, dtype=np.float32)
 84 |         words = convert_to_unicode(words)
 85 |         for w in words:
 86 |             out_ids[self._transform2id(w)] = 1.0
 87 |         return out_ids
 88 | 
 89 |     def _transform_seq2bert_id(self, words, padding=0):
 90 |         out_ids, seq_len = [], 0
 91 |         words = convert_to_unicode(words)
 92 |         if self.max_len:
 93 |             words = words[:self.max_len]
 94 |         seq_len = len(words)
 95 |         # 插入 [CLS], [SEP]
 96 |         out_ids.append(self._transform2id("[CLS]"))
 97 |         for w in words:
 98 |             out_ids.append(self._transform2id(w))
 99 |         mask_ids = [1 for _ in out_ids]
100 |         if padding and self.max_len:
101 |             while len(out_ids) < self.max_len + 1:
102 |                 out_ids.append(0)
103 |                 mask_ids.append(0)
104 |         seg_ids = [0 for _ in out_ids]
105 |         return out_ids, mask_ids, seg_ids, seq_len
106 | 
107 |     @staticmethod
108 |     def _truncate_seq_pair(tokens_a, tokens_b, max_length):
109 |         """Truncates a sequence pair in place to the maximum length."""
110 |         while True:
111 |             total_length = len(tokens_a) + len(tokens_b)
112 |             if total_length <= max_length:
113 |                 break
114 |             if len(tokens_a) > len(tokens_b):
115 |                 tokens_a.pop()
116 |             else:
117 |                 tokens_b.pop()
118 | 
119 |     def _transform_2seq2bert_id(self, seq1, seq2, padding=0):
120 |         out_ids, seg_ids, seq_len = [], [1], 0
121 |         seq1 = [x for x in convert_to_unicode(seq1)]
122 |         seq2 = [x for x in convert_to_unicode(seq2)]
123 |         # 截断
124 |         self._truncate_seq_pair(seq1, seq2, self.max_len - 2)
125 |         # 插入 [CLS], [SEP]
126 |         out_ids.append(self._transform2id("[CLS]"))
127 |         for w in seq1:
128 |             out_ids.append(self._transform2id(w))
129 |             seg_ids.append(0)
130 |         out_ids.append(self._transform2id("[SEP]"))
131 |         seg_ids.append(0)
132 |         for w in seq2:
133 |             out_ids.append(self._transform2id(w))
134 |             seg_ids.append(1)
135 |         mask_ids = [1 for _ in out_ids]
136 |         if padding and self.max_len:
137 |             while len(out_ids) < self.max_len + 1:
138 |                 out_ids.append(0)
139 |                 mask_ids.append(0)
140 |                 seg_ids.append(0)
141 |         return out_ids, mask_ids, seg_ids, seq_len
142 | 
143 |     def transform(self, seq_list, is_bert=0):
144 |         if is_bert:
145 |             return [self._transform_seq2bert_id(seq) for seq in seq_list]
146 |         else:
147 |             return [self._transform_seq2id(seq) for seq in seq_list]
148 | 
149 |     def __len__(self):
150 |         return len(self.voc2id)
151 | 
152 | def convert_to_unicode(text):
153 |     """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
154 |     if six.PY3:
155 |         if isinstance(text, str):
156 |             return text
157 |         elif isinstance(text, bytes):
158 |             return text.decode("utf-8", "ignore")
159 |         else:
160 |             raise ValueError("Unsupported string type: %s" % (type(text)))
161 |     elif six.PY2:
162 |         if isinstance(text, str):
163 |             return text.decode("utf-8", "ignore")
164 |         elif isinstance(text, unicode):
165 |             return text
166 |         else:
167 |             raise ValueError("Unsupported string type: %s" % (type(text)))
168 |     else:
169 |         raise ValueError("Not running on Python2 or Python 3?")
170 | 
171 | def gen_word_set(file_path, out_path='./data/words.txt'):
172 |     word_set = set()
173 |     with open(file_path, encoding='utf-8') as f:
174 |         for line in f.readlines():
175 |             spline = line.strip().split('\t')
176 |             if len(spline) < 4:
177 |                 continue
178 |             prefix, query_pred, title, tag, label = spline
179 |             if label == '0':
180 |                 continue
181 |             cur_arr = [prefix, title]
182 |             query_pred = json.loads(query_pred)
183 |             for w in prefix:
184 |                 word_set.add(w)
185 |             for each in query_pred:
186 |                 for w in each:
187 |                     word_set.add(w)
188 |     with open(word_set, 'w', encoding='utf-8') as o:
189 |         for w in word_set:
190 |             o.write(w + '\n')
191 |     pass
192 | 
193 | def convert_word2id(query, vocab_map):
194 |     ids = []
195 |     for w in query:
196 |         if w in vocab_map:
197 |             ids.append(vocab_map[w])
198 |         else:
199 |             ids.append(vocab_map[conf.unk])
200 |     while len(ids) < conf.max_seq_len:
201 |         ids.append(vocab_map[conf.pad])
202 |     return ids[:conf.max_seq_len]
203 | 
204 | 
205 | def convert_seq2bow(query, vocab_map):
206 |     bow_ids = np.zeros(conf.nwords)
207 |     for w in query:
208 |         if w in vocab_map:
209 |             bow_ids[vocab_map[w]] += 1
210 |         else:
211 |             bow_ids[vocab_map[conf.unk]] += 1
212 |     return bow_ids
213 | 
214 | 
215 | def get_data(file_path):
216 |     """
217 |     gen datasets, convert word into word ids.
218 |     :param file_path:
219 |     :return: [[query, pos sample, 4 neg sample]], shape = [n, 6]
220 |     """
221 |     data_map = {'query': [], 'query_len': [], 'doc_pos': [], 'doc_pos_len': [], 'doc_neg': [], 'doc_neg_len': []}
222 |     with open(file_path, encoding='utf8') as f:
223 |         for line in f.readlines():
224 |             spline = line.strip().split('\t')
225 |             if len(spline) < 4:
226 |                 continue
227 |             prefix, query_pred, title, tag, label = spline
228 |             if label == '0':
229 |                 continue
230 |             cur_arr, cur_len = [], []
231 |             query_pred = json.loads(query_pred)
232 |             # only 4 negative sample
233 |             for each in query_pred:
234 |                 if each == title:
235 |                     continue
236 |                 cur_arr.append(convert_word2id(each, conf.vocab_map))
237 |                 each_len = len(each) if len(each) < conf.max_seq_len else conf.max_seq_len
238 |                 cur_len.append(each_len)
239 |             if len(cur_arr) >= 4:
240 |                 data_map['query'].append(convert_word2id(prefix, conf.vocab_map))
241 |                 data_map['query_len'].append(len(prefix) if len(prefix) < conf.max_seq_len else conf.max_seq_len)
242 |                 data_map['doc_pos'].append(convert_word2id(title, conf.vocab_map))
243 |                 data_map['doc_pos_len'].append(len(title) if len(title) < conf.max_seq_len else conf.max_seq_len)
244 |                 data_map['doc_neg'].extend(cur_arr[:4])
245 |                 data_map['doc_neg_len'].extend(cur_len[:4])
246 |             pass
247 |     return data_map
248 | 
249 | 
250 | def get_data_siamese_rnn(file_path):
251 |     """
252 |     gen datasets, convert word into word ids.
253 |     :param file_path:
254 |     :return: [[query, pos sample, 4 neg sample]], shape = [n, 6]
255 |     """
256 |     data_arr = []
257 |     with open(file_path, encoding='utf8') as f:
258 |         for line in f.readlines():
259 |             spline = line.strip().split('\t')
260 |             if len(spline) < 4:
261 |                 continue
262 |             prefix, _, title, tag, label = spline
263 |             prefix_seq = convert_word2id(prefix, conf.vocab_map)
264 |             title_seq = convert_word2id(title, conf.vocab_map)
265 |             data_arr.append([prefix_seq, title_seq, int(label)])
266 |     return data_arr
267 | 
268 | 
269 | def get_data_bow(file_path):
270 |     """
271 |     gen datasets, convert word into word ids.
272 |     :param file_path:
273 |     :return: [[query, prefix, label]], shape = [n, 3]
274 |     """
275 |     data_arr = []
276 |     with open(file_path, encoding='utf8') as f:
277 |         for line in f.readlines():
278 |             spline = line.strip().split('\t')
279 |             if len(spline) < 4:
280 |                 continue
281 |             prefix, _, title, tag, label = spline
282 |             prefix_ids = convert_seq2bow(prefix, conf.vocab_map)
283 |             title_ids = convert_seq2bow(title, conf.vocab_map)
284 |             data_arr.append([prefix_ids, title_ids, int(label)])
285 |     return data_arr
286 | 
287 | def trans_lcqmc(dataset):
288 |     """
289 |     最大长度
290 |     """
291 |     out_arr, text_len =  [], []
292 |     for each in dataset:
293 |         t1, t2, label = each.text_a, each.text_b, int(each.label)
294 |         t1_ids = convert_word2id(t1, conf.vocab_map)
295 |         t1_len = conf.max_seq_len if len(t1) > conf.max_seq_len else len(t1)
296 |         t2_ids = convert_word2id(t2, conf.vocab_map)
297 |         t2_len = conf.max_seq_len if len(t2) > conf.max_seq_len else len(t2)
298 |         # t2_len = len(t2) 
299 |         out_arr.append([t1_ids, t1_len, t2_ids, t2_len, label])
300 |         # out_arr.append([t1_ids, t1_len, t2_ids, t2_len, label, t1, t2])
301 |         text_len.extend([len(t1), len(t2)])
302 |         pass
303 |     print("max len", max(text_len), "avg len", mean(text_len), "cover rate:", np.mean([x <= conf.max_seq_len for x in text_len]))
304 |     return out_arr
305 | 
306 | def get_lcqmc():
307 |     """
308 |     使用LCQMC数据集，并将其转为word_id
309 |     """
310 |     dataset = hub.dataset.LCQMC()
311 |     train_set = trans_lcqmc(dataset.train_examples)
312 |     dev_set = trans_lcqmc(dataset.dev_examples)
313 |     test_set = trans_lcqmc(dataset.test_examples)
314 |     return train_set, dev_set, test_set
315 |     # return test_set, test_set, test_set
316 | 
317 | def trans_lcqmc_bert(dataset:list, vocab:Vocabulary, is_merge=0):
318 |     """
319 |     最大长度
320 |     """
321 |     out_arr, text_len =  [], []
322 |     for each in dataset:
323 |         t1, t2, label = each.text_a, each.text_b, int(each.label)
324 |         if is_merge:
325 |             out_ids1, mask_ids1, seg_ids1, seq_len1 = vocab._transform_2seq2bert_id(t1, t2, padding=1)
326 |             out_arr.append([out_ids1, mask_ids1, seg_ids1, seq_len1, label])
327 |             text_len.extend([len(t1) + len(t2)])
328 |         else:
329 |             out_ids1, mask_ids1, seg_ids1, seq_len1 = vocab._transform_seq2bert_id(t1, padding=1)
330 |             out_ids2, mask_ids2, seg_ids2, seq_len2 = vocab._transform_seq2bert_id(t2, padding=1)
331 |             out_arr.append([out_ids1, mask_ids1, seg_ids1, seq_len1, out_ids2, mask_ids2, seg_ids2, seq_len2, label])
332 |             text_len.extend([len(t1), len(t2)])
333 |         pass
334 |     print("max len", max(text_len), "avg len", mean(text_len), "cover rate:", np.mean([x <= conf.max_seq_len for x in text_len]))
335 |     return out_arr
336 | 
337 | def get_lcqmc_bert(vocab:Vocabulary, is_merge=0):
338 |     """
339 |     使用LCQMC数据集，并将每个query其转为word_id，
340 |     """
341 |     dataset = hub.dataset.LCQMC()
342 |     train_set = trans_lcqmc_bert(dataset.train_examples, vocab, is_merge)
343 |     dev_set = trans_lcqmc_bert(dataset.dev_examples, vocab, is_merge)
344 |     test_set = trans_lcqmc_bert(dataset.test_examples, vocab, is_merge)
345 |     return train_set, dev_set, test_set
346 |     # test_set = test_set[:100]
347 |     # return test_set, test_set, test_set
348 | 
349 | def get_test(file_:str, vocab:Vocabulary):
350 |     test_arr = read_file(file_, '\t') # [[q1, q2],...]
351 |     out_arr = []
352 |     for line in test_arr:
353 |         if len(line) != 2:
354 |             print('wrong line size=', len(line))
355 |         t1, t2 = line   # [t1_ids, t1_len, t2_ids, t2_len, label]
356 |         t1_ids = vocab._transform_seq2id(t1, padding=1)
357 |         t1_len = vocab.max_len if len(t1) > vocab.max_len else len(t1)
358 |         t2_ids = vocab._transform_seq2id(t2, padding=1)
359 |         t2_len = vocab.max_len if len(t2) > vocab.max_len else len(t2)
360 |         out_arr.append([t1_ids, t1_len, t2_ids, t2_len])
361 |     return out_arr, test_arr
362 | 
363 | def get_test_bert(file_:str, vocab:Vocabulary, is_merge=0):
364 |     test_arr = read_file(file_, '\t') # [[q1, q2],...]
365 |     out_arr, _ = get_test_bert_by_arr(test_arr, vocab, is_merge)
366 |     return out_arr, test_arr
367 | 
368 | def get_test_bert_by_arr(test_arr:list, vocab:Vocabulary, is_merge=0):
369 |     # test_arr # [[q1, q2],...]
370 |     out_arr = []
371 |     for line in test_arr:
372 |         if len(line) != 2:
373 |             print('wrong line size=', len(line))
374 |         t1, t2 = line   # [t1_ids, t1_len, t2_ids, t2_len, label]
375 |         if is_merge:
376 |             out_ids1, mask_ids1, seg_ids1, seq_len1 = vocab._transform_2seq2bert_id(t1, t2, padding=1)
377 |             out_arr.append([out_ids1, mask_ids1, seg_ids1, seq_len1])
378 |         else:
379 |             out_ids1, mask_ids1, seg_ids1, seq_len1 = vocab._transform_seq2bert_id(t1, padding=1)
380 |             out_ids2, mask_ids2, seg_ids2, seq_len2 = vocab._transform_seq2bert_id(t2, padding=1)
381 |             out_arr.append([out_ids1, mask_ids1, seg_ids1, seq_len1, out_ids2, mask_ids2, seg_ids2, seq_len2])
382 |     return out_arr, test_arr
383 | 
384 | def get_test_bert_single(file_:str, vocab:Vocabulary, is_merge=0):
385 |     test_arr = read_file(file_) # [q1,...]
386 |     out_arr = []
387 |     for line in test_arr:
388 |         t1 = line   # [t1_ids, t1_len, t2_ids, t2_len, label]
389 |         out_ids1, mask_ids1, seg_ids1, seq_len1 = vocab._transform_seq2bert_id(t1, padding=1)
390 |         out_arr.append([out_ids1, mask_ids1, seg_ids1, seq_len1])
391 |     return out_arr, test_arr
392 | 
393 | def get_batch(dataset, batch_size=None, is_test=0):
394 |     # tf Dataset太难用，不如自己实现
395 |     # https://stackoverflow.com/questions/50539342/getting-batches-in-tensorflow
396 |     # dataset：每个元素是一个特征，[[x1, x2, x3,...], ...], 如果是测试集，可能就没有标签
397 |     if not batch_size:
398 |         batch_size = 32
399 |     if not is_test:
400 |         random.shuffle(dataset)
401 |     steps = int(math.ceil(float(len(dataset)) / batch_size))
402 |     for i in range(steps):
403 |         idx = i * batch_size
404 |         cur_set = dataset[idx: idx + batch_size]
405 |         cur_set = zip(*cur_set)
406 |         yield cur_set
407 | 
408 | 
409 | if __name__ == '__main__':
410 |     # prefix, query_prediction, title, tag, label
411 |     # query_prediction 为json格式。
412 |     file_train = './data/oppo_round1_train_20180929.txt'
413 |     file_vali = './data/oppo_round1_vali_20180929.txt'
414 |     # data_train = get_data(file_train)
415 |     # data_train = get_data(file_vali)
416 |     # print(len(data_train['query']), len(data_train['doc_pos']), len(data_train['doc_neg']))
417 |     dataset = get_lcqmc()
418 |     print(dataset[1][:3])
419 |     for each in get_batch(dataset[1][:3], batch_size=2):
420 |         t1_ids, t1_len, t2_ids, t2_len, label = each
421 |         print(each)
422 |     pass
423 | 


--------------------------------------------------------------------------------
/dssm.py:
--------------------------------------------------------------------------------
  1 | # coding=utf8
  2 | """
  3 | python=3.5
  4 | TensorFlow=1.2.1
  5 | """
  6 | 
  7 | import time
  8 | import numpy as np
  9 | import tensorflow as tf
 10 | import data_input
 11 | from config import Config
 12 | import random
 13 | 
 14 | random.seed(9102)
 15 | 
 16 | start = time.time()
 17 | # 是否加BN层
 18 | norm, epsilon = False, 0.001
 19 | 
 20 | # negative sample
 21 | # query batch size
 22 | query_BS = 100
 23 | # batch size
 24 | L1_N = 400
 25 | L2_N = 120
 26 | 
 27 | # 读取数据
 28 | conf = Config()
 29 | data_train = data_input.get_data_bow(conf.file_train)
 30 | data_vali = data_input.get_data_bow(conf.file_vali)
 31 | # print(len(data_train['query']), query_BS, len(data_train['query']) / query_BS)
 32 | train_epoch_steps = int(len(data_train) / query_BS) - 1
 33 | vali_epoch_steps = int(len(data_vali) / query_BS) - 1
 34 | 
 35 | 
 36 | def add_layer(inputs, in_size, out_size, activation_function=None):
 37 |     wlimit = np.sqrt(6.0 / (in_size + out_size))
 38 |     Weights = tf.Variable(tf.random_uniform([in_size, out_size], -wlimit, wlimit))
 39 |     biases = tf.Variable(tf.random_uniform([out_size], -wlimit, wlimit))
 40 |     Wx_plus_b = tf.matmul(inputs, Weights) + biases
 41 |     if activation_function is None:
 42 |         outputs = Wx_plus_b
 43 |     else:
 44 |         outputs = activation_function(Wx_plus_b)
 45 |     return outputs
 46 | 
 47 | 
 48 | def mean_var_with_update(ema, fc_mean, fc_var):
 49 |     ema_apply_op = ema.apply([fc_mean, fc_var])
 50 |     with tf.control_dependencies([ema_apply_op]):
 51 |         return tf.identity(fc_mean), tf.identity(fc_var)
 52 | 
 53 | 
 54 | def batch_normalization(x, phase_train, out_size):
 55 |     """
 56 |     Batch normalization on convolutional maps.
 57 |     Ref.: http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow
 58 |     Args:
 59 |         x:           Tensor, 4D BHWD input maps
 60 |         out_size:       integer, depth of input maps
 61 |         phase_train: boolean tf.Varialbe, true indicates training phase
 62 |         scope:       string, variable scope
 63 |     Return:
 64 |         normed:      batch-normalized maps
 65 |     """
 66 |     with tf.variable_scope('bn'):
 67 |         beta = tf.Variable(tf.constant(0.0, shape=[out_size]),
 68 |                            name='beta', trainable=True)
 69 |         gamma = tf.Variable(tf.constant(1.0, shape=[out_size]),
 70 |                             name='gamma', trainable=True)
 71 |         batch_mean, batch_var = tf.nn.moments(x, [0], name='moments')
 72 |         ema = tf.train.ExponentialMovingAverage(decay=0.5)
 73 | 
 74 |         def mean_var_with_update():
 75 |             ema_apply_op = ema.apply([batch_mean, batch_var])
 76 |             with tf.control_dependencies([ema_apply_op]):
 77 |                 return tf.identity(batch_mean), tf.identity(batch_var)
 78 | 
 79 |         mean, var = tf.cond(phase_train,
 80 |                             mean_var_with_update,
 81 |                             lambda: (ema.average(batch_mean), ema.average(batch_var)))
 82 |         normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
 83 |     return normed
 84 | 
 85 | 
 86 | def variable_summaries(var, name):
 87 |     """Attach a lot of summaries to a Tensor."""
 88 |     with tf.name_scope('summaries'):
 89 |         mean = tf.reduce_mean(var)
 90 |         tf.summary.scalar('mean/' + name, mean)
 91 |         with tf.name_scope('stddev'):
 92 |             stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean)))
 93 |         tf.summary.scalar('sttdev/' + name, stddev)
 94 |         tf.summary.scalar('max/' + name, tf.reduce_max(var))
 95 |         tf.summary.scalar('min/' + name, tf.reduce_min(var))
 96 |         tf.summary.histogram(name, var)
 97 | 
 98 | 
 99 | def contrastive_loss(y, d, batch_size):
100 |     tmp = y * tf.square(d)
101 |     # tmp= tf.mul(y,tf.square(d))
102 |     tmp2 = (1 - y) * tf.square(tf.maximum((1 - d), 0))
103 |     reg = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(1e-4), tf.trainable_variables())
104 |     return tf.reduce_sum(tmp + tmp2) / batch_size / 2 + reg
105 | 
106 | 
107 | def get_cosine_score(query_arr, doc_arr):
108 |     # query_norm = sqrt(sum(each x^2))
109 |     pooled_len_1 = tf.sqrt(tf.reduce_sum(tf.square(query_arr), 1))
110 |     pooled_len_2 = tf.sqrt(tf.reduce_sum(tf.square(doc_arr), 1))
111 |     pooled_mul_12 = tf.reduce_sum(tf.multiply(query_arr, doc_arr), 1)
112 |     cos_scores = tf.div(pooled_mul_12, pooled_len_1 * pooled_len_2 + 1e-8, name="cos_scores")
113 |     return cos_scores
114 | 
115 | 
116 | with tf.name_scope('input'):
117 |     # 预测时只用输入query即可，将其embedding为向量。
118 |     query_batch = tf.placeholder(tf.float32, shape=[None, None], name='query_batch')
119 |     doc_batch = tf.placeholder(tf.float32, shape=[None, None], name='doc_batch')
120 |     doc_label_batch = tf.placeholder(tf.float32, shape=[None], name='doc_label_batch')
121 |     on_train = tf.placeholder(tf.bool)
122 |     keep_prob = tf.placeholder(tf.float32, name='drop_out_prob')
123 | 
124 | with tf.name_scope('FC1'):
125 |     # 全连接网络
126 |     query_l1 = add_layer(query_batch, conf.nwords, L1_N, activation_function=None)
127 |     doc_l1 = add_layer(doc_batch, conf.nwords, L1_N, activation_function=None)
128 | 
129 | with tf.name_scope('BN1'):
130 |     query_l1 = batch_normalization(query_l1, on_train, L1_N)
131 |     doc_l1 = batch_normalization(doc_l1, on_train, L1_N)
132 |     query_l1 = tf.nn.relu(query_l1)
133 |     doc_l1 = tf.nn.relu(doc_l1)
134 | 
135 | with tf.name_scope('Drop_out'):
136 |     query_l1 = tf.nn.dropout(query_l1, keep_prob)
137 |     doc_l1 = tf.nn.dropout(doc_l1, keep_prob)
138 | 
139 | with tf.name_scope('FC2'):
140 |     query_l2 = add_layer(query_l1, L1_N, L2_N, activation_function=None)
141 |     doc_l2 = add_layer(doc_l1, L1_N, L2_N, activation_function=None)
142 | 
143 | with tf.name_scope('BN2'):
144 |     query_l2 = batch_normalization(query_l2, on_train, L2_N)
145 |     doc_l2 = batch_normalization(doc_l2, on_train, L2_N)
146 |     query_l2 = tf.nn.relu(query_l2)
147 |     doc_l2 = tf.nn.relu(doc_l2)
148 | 
149 |     query_pred = tf.nn.relu(query_l2)
150 |     doc_pred = tf.nn.relu(doc_l2)
151 | 
152 |     # query_pred = tf.contrib.slim.batch_norm(query_l2, activation_fn=tf.nn.relu)
153 | 
154 | with tf.name_scope('Cosine_Similarity'):
155 |     # Cosine similarity
156 |     cos_sim = get_cosine_score(query_pred, doc_pred)
157 |     cos_sim_prob = tf.clip_by_value(cos_sim, 1e-8, 1.0)
158 | 
159 | with tf.name_scope('Loss'):
160 |     # Train Loss
161 |     cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=doc_label_batch, logits=cos_sim)
162 |     losses = tf.reduce_sum(cross_entropy)
163 |     tf.summary.scalar('loss', losses)
164 |     pass
165 | 
166 | with tf.name_scope('Training'):
167 |     # Optimizer
168 |     train_step = tf.train.AdamOptimizer(conf.learning_rate).minimize(losses)
169 |     pass
170 | 
171 | # with tf.name_scope('Accuracy'):
172 | #     correct_prediction = tf.equal(tf.argmax(prob, 1), 0)
173 | #     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
174 | #     tf.summary.scalar('accuracy', accuracy)
175 | 
176 | merged = tf.summary.merge_all()
177 | 
178 | with tf.name_scope('Test'):
179 |     average_loss = tf.placeholder(tf.float32)
180 |     loss_summary = tf.summary.scalar('average_loss', average_loss)
181 | 
182 | with tf.name_scope('Train'):
183 |     train_average_loss = tf.placeholder(tf.float32)
184 |     train_loss_summary = tf.summary.scalar('train_average_loss', train_average_loss)
185 | 
186 | 
187 | def pull_all(query_in, doc_positive_in, doc_negative_in):
188 |     query_in = query_in.tocoo()
189 |     doc_positive_in = doc_positive_in.tocoo()
190 |     doc_negative_in = doc_negative_in.tocoo()
191 |     query_in = tf.SparseTensorValue(
192 |         np.transpose([np.array(query_in.row, dtype=np.int64), np.array(query_in.col, dtype=np.int64)]),
193 |         np.array(query_in.data, dtype=np.float),
194 |         np.array(query_in.shape, dtype=np.int64))
195 |     doc_positive_in = tf.SparseTensorValue(
196 |         np.transpose([np.array(doc_positive_in.row, dtype=np.int64), np.array(doc_positive_in.col, dtype=np.int64)]),
197 |         np.array(doc_positive_in.data, dtype=np.float),
198 |         np.array(doc_positive_in.shape, dtype=np.int64))
199 |     doc_negative_in = tf.SparseTensorValue(
200 |         np.transpose([np.array(doc_negative_in.row, dtype=np.int64), np.array(doc_negative_in.col, dtype=np.int64)]),
201 |         np.array(doc_negative_in.data, dtype=np.float),
202 |         np.array(doc_negative_in.shape, dtype=np.int64))
203 | 
204 |     return query_in, doc_positive_in, doc_negative_in
205 | 
206 | 
207 | def pull_batch(data_map, batch_id):
208 |     query, title, label, dsize = range(4)
209 |     cur_data = data_map[batch_id * query_BS:(batch_id + 1) * query_BS]
210 |     query_in = [x[0] for x in cur_data]
211 |     doc_in = [x[1] for x in cur_data]
212 |     label = [x[2] for x in cur_data]
213 | 
214 |     # query_in, doc_positive_in, doc_negative_in = pull_all(query_in, doc_positive_in, doc_negative_in)
215 |     return query_in, doc_in, label
216 | 
217 | 
218 | def feed_dict(on_training, data_set, batch_id, drop_prob):
219 |     query_in, doc_in, label = pull_batch(data_set, batch_id)
220 |     query_in, doc_in, label = np.array(query_in), np.array(doc_in), np.array(label)
221 |     return {query_batch: query_in, doc_batch: doc_in, doc_label_batch: label,
222 |             on_train: on_training, keep_prob: drop_prob}
223 | 
224 | 
225 | # config = tf.ConfigProto()  # log_device_placement=True)
226 | # config.gpu_options.allow_growth = True
227 | # if not config.gpu:
228 | # config = tf.ConfigProto(device_count= {'GPU' : 0})
229 | 
230 | # 创建一个Saver对象，选择性保存变量或者模型。
231 | saver = tf.train.Saver()
232 | # with tf.Session(config=config) as sess:
233 | with tf.Session() as sess:
234 |     sess.run(tf.global_variables_initializer())
235 |     train_writer = tf.summary.FileWriter(conf.summaries_dir + '/train', sess.graph)
236 | 
237 |     start = time.time()
238 |     for epoch in range(conf.num_epoch):
239 |         random.shuffle(data_train)
240 |         for batch_id in range(train_epoch_steps):
241 |             # print(batch_id)
242 |             sess.run(train_step, feed_dict=feed_dict(True, data_train, batch_id, 0.5))
243 |             pass
244 |         end = time.time()
245 |         # train loss
246 |         epoch_loss = 0
247 |         for i in range(train_epoch_steps):
248 |             loss_v = sess.run(losses, feed_dict=feed_dict(False, data_train, i, 1))
249 |             epoch_loss += loss_v
250 | 
251 |         epoch_loss /= (train_epoch_steps)
252 |         train_loss = sess.run(train_loss_summary, feed_dict={train_average_loss: epoch_loss})
253 |         train_writer.add_summary(train_loss, epoch + 1)
254 |         print("\nEpoch #%d | Train Loss: %-4.3f | PureTrainTime: %-3.3fs" %
255 |               (epoch, epoch_loss, end - start))
256 | 
257 |         # test loss
258 |         start = time.time()
259 |         epoch_loss = 0
260 |         for i in range(vali_epoch_steps):
261 |             loss_v = sess.run(losses, feed_dict=feed_dict(False, data_vali, i, 1))
262 |             epoch_loss += loss_v
263 |         epoch_loss /= (vali_epoch_steps)
264 |         test_loss = sess.run(loss_summary, feed_dict={average_loss: epoch_loss})
265 |         train_writer.add_summary(test_loss, epoch + 1)
266 |         # test_writer.add_summary(test_loss, step + 1)
267 |         print("Epoch #%d | Test  Loss: %-4.3f | Calc_LossTime: %-3.3fs" %
268 |               (epoch, epoch_loss, start - end))
269 | 
270 |     # 保存模型
271 |     save_path = saver.save(sess, "model/model_1.ckpt")
272 |     print("Model saved in file: ", save_path)
273 | 


--------------------------------------------------------------------------------
/dssm_rnn.py:
--------------------------------------------------------------------------------
  1 | # coding=utf8
  2 | """
  3 | python=3.5
  4 | TensorFlow=1.2.1
  5 | """
  6 | 
  7 | import time
  8 | import numpy as np
  9 | import tensorflow as tf
 10 | import data_input
 11 | from config import Config
 12 | import random
 13 | 
 14 | random.seed(9102)
 15 | 
 16 | start = time.time()
 17 | # 是否加BN层
 18 | norm, epsilon = False, 0.001
 19 | 
 20 | # TRIGRAM_D = 21128
 21 | TRIGRAM_D = 100
 22 | # negative sample
 23 | NEG = 4
 24 | # query batch size
 25 | query_BS = 100
 26 | # batch size
 27 | BS = query_BS * NEG
 28 | 
 29 | # 读取数据
 30 | conf = Config()
 31 | data_train = data_input.get_data(conf.file_train)
 32 | data_vali = data_input.get_data(conf.file_vali)
 33 | # print(len(data_train['query']), query_BS, len(data_train['query']) / query_BS)
 34 | train_epoch_steps = int(len(data_train['query']) / query_BS) - 1
 35 | vali_epoch_steps = int(len(data_vali['query']) / query_BS) - 1
 36 | 
 37 | 
 38 | def variable_summaries(var, name):
 39 |     """Attach a lot of summaries to a Tensor."""
 40 |     with tf.name_scope('summaries'):
 41 |         mean = tf.reduce_mean(var)
 42 |         tf.summary.scalar('mean/' + name, mean)
 43 |         with tf.name_scope('stddev'):
 44 |             stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean)))
 45 |         tf.summary.scalar('sttdev/' + name, stddev)
 46 |         tf.summary.scalar('max/' + name, tf.reduce_max(var))
 47 |         tf.summary.scalar('min/' + name, tf.reduce_min(var))
 48 |         tf.summary.histogram(name, var)
 49 | 
 50 | 
 51 | with tf.name_scope('input'):
 52 |     # 预测时只用输入query即可，将其embedding为向量。
 53 |     query_batch = tf.placeholder(tf.int32, shape=[None, None], name='query_batch')
 54 |     doc_pos_batch = tf.placeholder(tf.int32, shape=[None, None], name='doc_positive_batch')
 55 |     doc_neg_batch = tf.placeholder(tf.int32, shape=[None, None], name='doc_negative_batch')
 56 |     query_seq_length = tf.placeholder(tf.int32, shape=[None], name='query_sequence_length')
 57 |     pos_seq_length = tf.placeholder(tf.int32, shape=[None], name='pos_seq_length')
 58 |     neg_seq_length = tf.placeholder(tf.int32, shape=[None], name='neg_sequence_length')
 59 |     on_train = tf.placeholder(tf.bool)
 60 |     drop_out_prob = tf.placeholder(tf.float32, name='drop_out_prob')
 61 | 
 62 | with tf.name_scope('word_embeddings_layer'):
 63 |     # 这里可以加载预训练词向量
 64 |     _word_embedding = tf.get_variable(name="word_embedding_arr", dtype=tf.float32,
 65 |                                       shape=[conf.nwords, TRIGRAM_D])
 66 |     query_embed = tf.nn.embedding_lookup(_word_embedding, query_batch, name='query_batch_embed')
 67 |     doc_pos_embed = tf.nn.embedding_lookup(_word_embedding, doc_pos_batch, name='doc_positive_embed')
 68 |     doc_neg_embed = tf.nn.embedding_lookup(_word_embedding, doc_neg_batch, name='doc_negative_embed')
 69 | 
 70 | with tf.name_scope('RNN'):
 71 |     # Abandon bag of words, use GRU, you can use stacked gru
 72 |     # query_l1 = add_layer(query_batch, TRIGRAM_D, L1_N, activation_function=None)  # tf.nn.relu()
 73 |     # doc_positive_l1 = add_layer(doc_positive_batch, TRIGRAM_D, L1_N, activation_function=None)
 74 |     # doc_negative_l1 = add_layer(doc_negative_batch, TRIGRAM_D, L1_N, activation_function=None)
 75 |     if conf.use_stack_rnn:
 76 |         cell_fw = tf.contrib.rnn.GRUCell(conf.hidden_size_rnn, reuse=tf.AUTO_REUSE)
 77 |         stacked_gru_fw = tf.contrib.rnn.MultiRNNCell([cell_fw], state_is_tuple=True)
 78 |         cell_bw = tf.contrib.rnn.GRUCell(conf.hidden_size_rnn, reuse=tf.AUTO_REUSE)
 79 |         stacked_gru_bw = tf.contrib.rnn.MultiRNNCell([cell_fw], state_is_tuple=True)
 80 |         (output_fw, output_bw), (_, _) = tf.nn.bidirectional_dynamic_rnn(stacked_gru_fw, stacked_gru_bw)
 81 |         # not ready, to be continue ...
 82 |     else:
 83 |         cell_fw = tf.contrib.rnn.GRUCell(conf.hidden_size_rnn, reuse=tf.AUTO_REUSE)
 84 |         cell_bw = tf.contrib.rnn.GRUCell(conf.hidden_size_rnn, reuse=tf.AUTO_REUSE)
 85 |         # query
 86 |         (_, _), (query_output_fw, query_output_bw) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, query_embed,
 87 |                                                                                      sequence_length=query_seq_length,
 88 |                                                                                      dtype=tf.float32)
 89 |         query_rnn_output = tf.concat([query_output_fw, query_output_bw], axis=-1)
 90 |         query_rnn_output = tf.nn.dropout(query_rnn_output, drop_out_prob)
 91 |         # doc_pos
 92 |         (_, _), (doc_pos_output_fw, doc_pos_output_bw) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,
 93 |                                                                                          doc_pos_embed,
 94 |                                                                                          sequence_length=pos_seq_length,
 95 |                                                                                          dtype=tf.float32)
 96 |         doc_pos_rnn_output = tf.concat([doc_pos_output_fw, doc_pos_output_bw], axis=-1)
 97 |         doc_pos_rnn_output = tf.nn.dropout(doc_pos_rnn_output, drop_out_prob)
 98 |         # doc_neg
 99 |         (_, _), (doc_neg_output_fw, doc_neg_output_bw) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,
100 |                                                                                          doc_neg_embed,
101 |                                                                                          sequence_length=neg_seq_length,
102 |                                                                                          dtype=tf.float32)
103 |         doc_neg_rnn_output = tf.concat([doc_neg_output_fw, doc_neg_output_bw], axis=-1)
104 |         doc_neg_rnn_output = tf.nn.dropout(doc_neg_rnn_output, drop_out_prob)
105 | 
106 | with tf.name_scope('Merge_Negative_Doc'):
107 |     # 合并负样本，tile可选择是否扩展负样本。
108 |     # doc_y = tf.tile(doc_positive_y, [1, 1])
109 |     doc_y = tf.tile(doc_pos_rnn_output, [1, 1])
110 | 
111 |     for i in range(NEG):
112 |         for j in range(query_BS):
113 |             # slice(input_, begin, size)切片API
114 |             # doc_y = tf.concat([doc_y, tf.slice(doc_negative_y, [j * NEG + i, 0], [1, -1])], 0)
115 |             doc_y = tf.concat([doc_y, tf.slice(doc_neg_rnn_output, [j * NEG + i, 0], [1, -1])], 0)
116 | 
117 | with tf.name_scope('Cosine_Similarity'):
118 |     # Cosine similarity
119 |     # query_norm = sqrt(sum(each x^2))
120 |     query_norm = tf.tile(tf.sqrt(tf.reduce_sum(tf.square(query_rnn_output), 1, True)), [NEG + 1, 1])
121 |     # doc_norm = sqrt(sum(each x^2))
122 |     doc_norm = tf.sqrt(tf.reduce_sum(tf.square(doc_y), 1, True))
123 | 
124 |     prod = tf.reduce_sum(tf.multiply(tf.tile(query_rnn_output, [NEG + 1, 1]), doc_y), 1, True)
125 |     norm_prod = tf.multiply(query_norm, doc_norm)
126 | 
127 |     # cos_sim_raw = query * doc / (||query|| * ||doc||)
128 |     cos_sim_raw = tf.truediv(prod, norm_prod)
129 |     # gamma = 20
130 |     cos_sim = tf.transpose(tf.reshape(tf.transpose(cos_sim_raw), [NEG + 1, query_BS])) * 20
131 | 
132 | with tf.name_scope('Loss'):
133 |     # Train Loss
134 |     # 转化为softmax概率矩阵。
135 |     prob = tf.nn.softmax(cos_sim)
136 |     # 只取第一列，即正样本列概率。
137 |     hit_prob = tf.slice(prob, [0, 0], [-1, 1])
138 |     loss = -tf.reduce_sum(tf.log(hit_prob))
139 |     tf.summary.scalar('loss', loss)
140 | 
141 | with tf.name_scope('Training'):
142 |     # Optimizer
143 |     train_step = tf.train.AdamOptimizer(conf.learning_rate).minimize(loss)
144 | 
145 | # with tf.name_scope('Accuracy'):
146 | #     correct_prediction = tf.equal(tf.argmax(prob, 1), 0)
147 | #     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
148 | #     tf.summary.scalar('accuracy', accuracy)
149 | 
150 | merged = tf.summary.merge_all()
151 | 
152 | with tf.name_scope('Test'):
153 |     average_loss = tf.placeholder(tf.float32)
154 |     loss_summary = tf.summary.scalar('average_loss', average_loss)
155 | 
156 | with tf.name_scope('Train'):
157 |     train_average_loss = tf.placeholder(tf.float32)
158 |     train_loss_summary = tf.summary.scalar('train_average_loss', train_average_loss)
159 | 
160 | 
161 | def pull_batch(data_map, batch_id):
162 |     query_in = data_map['query'][batch_id * query_BS:(batch_id + 1) * query_BS]
163 |     query_len = data_map['query_len'][batch_id * query_BS:(batch_id + 1) * query_BS]
164 |     doc_positive_in = data_map['doc_pos'][batch_id * query_BS:(batch_id + 1) * query_BS]
165 |     doc_positive_len = data_map['doc_pos_len'][batch_id * query_BS:(batch_id + 1) * query_BS]
166 |     doc_negative_in = data_map['doc_neg'][batch_id * query_BS * NEG:(batch_id + 1) * query_BS * NEG]
167 |     doc_negative_len = data_map['doc_neg_len'][batch_id * query_BS * NEG:(batch_id + 1) * query_BS * NEG]
168 | 
169 |     # query_in, doc_positive_in, doc_negative_in = pull_all(query_in, doc_positive_in, doc_negative_in)
170 |     return query_in, doc_positive_in, doc_negative_in, query_len, doc_positive_len, doc_negative_len
171 | 
172 | 
173 | def feed_dict(on_training, data_set, batch_id, drop_prob):
174 |     query_in, doc_positive_in, doc_negative_in, query_seq_len, pos_seq_len, neg_seq_len = pull_batch(data_set,
175 |                                                                                                      batch_id)
176 |     query_len = len(query_in)
177 |     query_seq_len = [conf.max_seq_len] * query_len
178 |     pos_seq_len = [conf.max_seq_len] * query_len
179 |     neg_seq_len = [conf.max_seq_len] * query_len * NEG
180 |     return {query_batch: query_in, doc_pos_batch: doc_positive_in, doc_neg_batch: doc_negative_in,
181 |             on_train: on_training, drop_out_prob: drop_prob, query_seq_length: query_seq_len,
182 |             neg_seq_length: neg_seq_len, pos_seq_length: pos_seq_len}
183 | 
184 | 
185 | # config = tf.ConfigProto()  # log_device_placement=True)
186 | # config.gpu_options.allow_growth = True
187 | # if not config.gpu:
188 | # config = tf.ConfigProto(device_count= {'GPU' : 0})
189 | 
190 | # 创建一个Saver对象，选择性保存变量或者模型。
191 | saver = tf.train.Saver()
192 | # with tf.Session(config=config) as sess:
193 | with tf.Session() as sess:
194 |     sess.run(tf.global_variables_initializer())
195 |     train_writer = tf.summary.FileWriter(conf.summaries_dir + '/train', sess.graph)
196 | 
197 |     start = time.time()
198 |     for epoch in range(conf.num_epoch):
199 |         batch_ids = [i for i in range(train_epoch_steps)]
200 |         random.shuffle(batch_ids)
201 |         for batch_id in batch_ids:
202 |             # print(batch_id)
203 |             sess.run(train_step, feed_dict=feed_dict(True, data_train, batch_id, 0.5))
204 |         end = time.time()
205 |         # train loss
206 |         epoch_loss = 0
207 |         for i in range(train_epoch_steps):
208 |             loss_v = sess.run(loss, feed_dict=feed_dict(False, data_train, i, 1))
209 |             epoch_loss += loss_v
210 | 
211 |         epoch_loss /= (train_epoch_steps)
212 |         train_loss = sess.run(train_loss_summary, feed_dict={train_average_loss: epoch_loss})
213 |         train_writer.add_summary(train_loss, epoch + 1)
214 |         print("\nEpoch #%d | Train Loss: %-4.3f | PureTrainTime: %-3.3fs" %
215 |               (epoch, epoch_loss, end - start))
216 | 
217 |         # test loss
218 |         start = time.time()
219 |         epoch_loss = 0
220 |         for i in range(vali_epoch_steps):
221 |             loss_v = sess.run(loss, feed_dict=feed_dict(False, data_vali, i, 1))
222 |             epoch_loss += loss_v
223 |         epoch_loss /= (vali_epoch_steps)
224 |         test_loss = sess.run(loss_summary, feed_dict={average_loss: epoch_loss})
225 |         train_writer.add_summary(test_loss, epoch + 1)
226 |         # test_writer.add_summary(test_loss, step + 1)
227 |         print("Epoch #%d | Test  Loss: %-4.3f | Calc_LossTime: %-3.3fs" %
228 |               (epoch, epoch_loss, start - end))
229 | 
230 |     # 保存模型
231 |     save_path = saver.save(sess, "model/model_1.ckpt")
232 |     print("Model saved in file: ", save_path)
233 | 


--------------------------------------------------------------------------------
/flask_server.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | #encoding=utf-8
 3 | '''
 4 | @Time    :   2020/11/02 00:06:44
 5 | @Author  :   Zhiyang.zzy 
 6 | @Contact :   zhiyangchou@gmail.com
 7 | @Desc    :   
 8 | '''
 9 | 
10 | # here put the import lib
11 | from model.bert_classifier import BertClassifier
12 | import os
13 | import time
14 | from numpy.lib.arraypad import pad
15 | from tensorflow.python.ops.gen_io_ops import write_file
16 | import yaml
17 | import logging
18 | import argparse
19 | logging.basicConfig(level=logging.INFO)
20 | import data_input
21 | from config import Config
22 | from model.siamese_network import SiamenseRNN, SiamenseBert
23 | from data_input import Vocabulary, get_test
24 | from util import write_file
25 | from flask import Flask
26 | app = Flask(__name__)
27 | 
28 | @app.route('/hello/<q1>/<q2>')
29 | def hello_world(q1, q2):
30 |     # print('Hello World! %s, %s' % (q1, q2))
31 |     test_arr, query_arr  = data_input.get_test_bert_by_arr([[q1, q2]], vocab, is_merge=1)
32 |     # print("test_arr:", test_arr)
33 |     test_label, test_prob = model.predict(test_arr)
34 |     # print("test label", test_label)
35 |     return 'Hello World! {}:{}'.format(q1 + "-" + q2, test_prob[0])
36 | 
37 | if __name__ == '__main__':
38 |     # 读取配置
39 |     # conf = Config()
40 |     cfg_path = "./configs/bert_classify.yml"
41 |     cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
42 |     # vocab: 将 seq转为id，
43 |     vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
44 |     # 读取数据
45 |     # test_arr, query_arr  = data_input.get_test_bert(file_, vocab, is_merge=1)
46 |     # print("test size:{}".format(len(test_arr)))
47 |     model = BertClassifier(cfg)
48 |     model.restore_session(cfg["checkpoint_dir"])
49 |     app.run()
50 |     # 输入url测试，例如：http://127.0.0.1:5000/hello/今天天气/明天天气


--------------------------------------------------------------------------------
/model/base_model.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # encoding=utf-8
  3 | '''
  4 | Author: 	zhiyang.zzy 
  5 | Date: 		2020-10-25 11:07:55
  6 | Contact: 	zhiyangchou@gmail.com
  7 | FilePath: /dssm/base_model.py
  8 | Desc: 		基础模型，包含基本功能
  9 | '''
 10 | # here put the import lib
 11 | 
 12 | 
 13 | # here put the import lib
 14 | import numpy as np
 15 | import os
 16 | import tensorflow as tf
 17 | import nni
 18 | # from tensorflow.python.ops import rnn_cell_impl as core_rnn_cell
 19 | import logging
 20 | from collections import defaultdict
 21 | from .bert import modeling_v1 as modeling, tokenization, optimization
 22 | # logging.basicConfig(level=logging.DEBUG)
 23 | 
 24 | 
 25 | class TriplteLoss(object):
 26 |     # https://blog.csdn.net/u013082989/article/details/83537370
 27 |     @staticmethod
 28 |     def _pairwise_distance(embeddings, squared=False):
 29 |         '''
 30 |         计算两两embedding的距离
 31 |         ------------------------------------------
 32 |         Args：
 33 |             embedding: 特征向量， 大小（batch_size, vector_size）
 34 |             squared:   是否距离的平方，即欧式距离
 35 | 
 36 |         Returns：
 37 |             distances: 两两embeddings的距离矩阵，大小 （batch_size, batch_size）
 38 |         '''
 39 |         # 矩阵相乘,得到（batch_size, batch_size），因为计算欧式距离|a-b|^2 = a^2 -2ab + b^2,
 40 |         # 其中 ab 可以用矩阵乘表示
 41 |         dot_product = tf.matmul(embeddings, tf.transpose(embeddings))
 42 |         # dot_product对角线部分就是 每个embedding的平方
 43 |         square_norm = tf.diag_part(dot_product)
 44 |         # |a-b|^2 = a^2 - 2ab + b^2
 45 |         # tf.expand_dims(square_norm, axis=1)是（batch_size, 1）大小的矩阵，减去 （batch_size, batch_size）大小的矩阵，相当于每一列操作
 46 |         distances = tf.expand_dims(
 47 |             square_norm, axis=1) - 2.0 * dot_product + tf.expand_dims(square_norm, axis=0)
 48 |         distances = tf.maximum(distances, 0.0)   # 小于0的距离置为0
 49 |         if not squared:          # 如果不平方，就开根号，但是注意有0元素，所以0的位置加上 1e*-16
 50 |             distances = distances + mask * 1e-16
 51 |             distances = tf.sqrt(distances)
 52 |             distances = distances * (1.0 - mask)    # 0的部分仍然置为0
 53 |         return distances
 54 |     @staticmethod
 55 |     def _get_triplet_mask(labels):
 56 |         '''
 57 |         得到一个3D的mask [a, p, n], 对应triplet（a, p, n）是valid的位置是True
 58 |         ----------------------------------
 59 |         Args:
 60 |             labels: 对应训练数据的labels, shape = (batch_size,)
 61 |         
 62 |         Returns:
 63 |             mask: 3D,shape = (batch_size, batch_size, batch_size)
 64 |         
 65 |         '''
 66 | 
 67 |         # 初始化一个二维矩阵，坐标(i, j)不相等置为1，得到indices_not_equal
 68 |         indices_equal = tf.cast(tf.eye(tf.shape(labels)[0]), tf.bool)
 69 |         indices_not_equal = tf.logical_not(indices_equal)
 70 |         # 因为最后得到一个3D的mask矩阵(i, j, k)，增加一个维度，则 i_not_equal_j 在第三个维度增加一个即，(batch_size, batch_size, 1), 其他同理
 71 |         i_not_equal_j = tf.expand_dims(indices_not_equal, 2)
 72 |         i_not_equal_k = tf.expand_dims(indices_not_equal, 1)
 73 |         j_not_equal_k = tf.expand_dims(indices_not_equal, 0)
 74 |         # 想得到i!=j!=k, 三个不等取and即可, 最后可以得到当下标（i, j, k）不相等时才取True
 75 |         distinct_indices = tf.logical_and(tf.logical_and(
 76 |             i_not_equal_j, i_not_equal_k), j_not_equal_k)
 77 | 
 78 |         # 同样根据labels得到对应i=j, i!=k
 79 |         label_equal = tf.equal(tf.expand_dims(labels, 0),
 80 |                             tf.expand_dims(labels, 1))
 81 |         i_equal_j = tf.expand_dims(label_equal, 2)
 82 |         i_equal_k = tf.expand_dims(label_equal, 1)
 83 |         valid_labels = tf.logical_and(i_equal_j, tf.logical_not(i_equal_k))
 84 |         # mask即为满足上面两个约束，所以两个3D取and
 85 |         mask = tf.logical_and(distinct_indices, valid_labels)
 86 |         return mask
 87 |     @staticmethod
 88 |     def batch_all_triplet_loss(labels, embeddings, margin, squared=False):
 89 |         '''
 90 |         triplet loss of a batch
 91 |         -------------------------------
 92 |         Args:
 93 |             labels:     标签数据，shape = （batch_size,）
 94 |             embeddings: 提取的特征向量， shape = (batch_size, vector_size)
 95 |             margin:     margin大小， scalar
 96 |             
 97 |         Returns:
 98 |             triplet_loss: scalar, 一个batch的损失值
 99 |             fraction_postive_triplets : valid的triplets占的比例
100 |         '''
101 |         # 得到每两两embeddings的距离，然后增加一个维度，一维需要得到（batch_size, batch_size, batch_size）大小的3D矩阵
102 |         # 然后再点乘上valid 的 mask即可
103 |         pairwise_dis = _pairwise_distance(embeddings, squared=squared)
104 |         anchor_positive_dist = tf.expand_dims(pairwise_dis, 2)
105 |         assert anchor_positive_dist.shape[2] == 1, "{}".format(
106 |             anchor_positive_dist.shape)
107 |         anchor_negative_dist = tf.expand_dims(pairwise_dis, 1)
108 |         assert anchor_negative_dist.shape[1] == 1, "{}".format(
109 |             anchor_negative_dist.shape)
110 |         triplet_loss = anchor_positive_dist - anchor_negative_dist + margin
111 | 
112 |         mask = _get_triplet_mask(labels)
113 |         mask = tf.to_float(mask)
114 |         triplet_loss = tf.multiply(mask, triplet_loss)
115 |         triplet_loss = tf.maximum(triplet_loss, 0.0)
116 | 
117 |         # 计算valid的triplet的个数，然后对所有的triplet loss求平均
118 |         valid_triplets = tf.to_float(tf.greater(triplet_loss, 1e-16))
119 |         num_positive_triplets = tf.reduce_sum(valid_triplets)
120 |         num_valid_triplets = tf.reduce_sum(mask)
121 |         fraction_postive_triplets = num_positive_triplets / \
122 |             (num_valid_triplets + 1e-16)
123 | 
124 |         triplet_loss = tf.reduce_sum(triplet_loss) / \
125 |             (num_positive_triplets + 1e-16)
126 |         return triplet_loss, fraction_postive_triplets
127 | 
128 | 
129 | class BaseModel(object):
130 |     def __init__(self, cfg, is_training=1):
131 |         # config来自于yml文件。
132 |         self.cfg = cfg
133 |         # 通过cfg 解析出多少个 word, intent, action, 等
134 |         # if not is_training: dropout=0
135 |         self.is_training = is_training
136 |         if not is_training:
137 |             self.cfg['dropout'] = 0
138 |         self.build()
139 | 
140 |     def __del__(self):
141 |         # self.sess.close()
142 |         pass
143 | 
144 |     def _init_session(self):
145 |         # https://zhuanlan.zhihu.com/p/78998468
146 |         config = tf.ConfigProto()
147 |         config.gpu_options.allow_growth = True
148 |         self.sess = tf.Session(config=config)
149 |         self.sess.run(tf.global_variables_initializer())
150 |         self.sess.run(tf.tables_initializer())
151 |         # saver = tf.train.Saver(max_to_keep=None)
152 |         self.saver = tf.train.Saver()
153 | 
154 |     def restore_session(self, dir_model):
155 |         print("Reloading the latest trained model...")
156 |         self.saver.restore(self.sess, dir_model)
157 | 
158 |     def _add_summary(self):
159 |         self.merged = tf.summary.merge_all()
160 |         if not os.path.exists(self.cfg['summaries_dir']):
161 |             os.makedirs(self.cfg['summaries_dir'])
162 |         self.file_writer = tf.summary.FileWriter(
163 |             self.cfg['summaries_dir'], self.sess.graph)
164 | 
165 |     def save_session(self):
166 |         if not os.path.exists(self.cfg['checkpoint_dir']):
167 |             os.makedirs(self.cfg['checkpoint_dir'])
168 |         self.saver.save(self.sess, self.cfg['checkpoint_dir'])
169 | 
170 |     def init_from_pre_dir(self, pre_dir):
171 |         tvars = tf.trainable_variables()
172 |         (assignment, init_variable_names) = modeling.get_assignment_map_from_checkpoint(
173 |             tvars, pre_dir)
174 |         tf.train.init_from_checkpoint(pre_dir, assignment)
175 | 
176 |     @staticmethod
177 |     def get_params_count():
178 |         params_count = np.sum([np.prod(v.get_shape().as_list())
179 |                                for v in tf.trainable_variables()])
180 |         print("params_count", params_count)
181 |         return params_count
182 | 
183 |     ####################   基本功能： fit, evaluate, predict  #####################
184 |     def fit(self, train, dev, test=None):
185 |         '''
186 |         @description: 模型训练
187 |         @param {type} 
188 |         @return: 
189 |         '''
190 |         best_score, nepoch_no_imprv = -1, 0
191 |         for epoch in range(self.cfg["num_epoch"]):
192 |             print("Epoch {:} out of {:}".format(
193 |                 epoch + 1, self.cfg["num_epoch"]))
194 |             score = self.run_epoch(epoch, train, dev)
195 |             if score > best_score:
196 |                 nepoch_no_imprv = 0
197 |                 self.save_session()
198 |                 best_score = score
199 |                 print("- new best score!")
200 |                 if test:
201 |                     test_acc = self.eval(test)
202 |                     # self.print_eval_result(test_result)
203 |                     print("test sf acc:{}".format(test_acc))
204 |             else:
205 |                 nepoch_no_imprv += 1
206 |                 if nepoch_no_imprv >= self.cfg["epoch_no_imprv"]:
207 |                     print(
208 |                         "- early stopping {} epoches without improvement".format(nepoch_no_imprv))
209 |                     nni.report_final_result(best_score)
210 |                     break
211 |             pass
212 |         pass
213 | 
214 |     def eval(self, test):
215 |         '''
216 |         @description: 测试集评测
217 |         @param {type} 
218 |         @return: 
219 |         '''
220 |         pass
221 | 
222 |     def predict(self):
223 |         '''
224 |         @description: 无标注数据评测
225 |         @param {type} 
226 |         @return: 
227 |         '''
228 |         pass
229 | 
230 |     ####################   模型模块  #####################
231 |     def _state_lstm(self, input_emb, input_length, initial_state, hidden_size, variable_scope="StateLSTM"):
232 |         with tf.variable_scope(variable_scope):
233 |             cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True)
234 |             cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True)
235 |             initial_state = tf.nn.rnn_cell.LSTMStateTuple(
236 |                 initial_state, initial_state)
237 |             _output = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, input_emb,
238 |                                                       sequence_length=input_length,
239 |                                                       dtype=tf.float32,
240 |                                                       initial_state_fw=initial_state,
241 |                                                       initial_state_bw=initial_state)
242 |             (output_fw, output_bw), ((_, state_fw), (_, state_bw)) = _output
243 |             output = tf.concat([output_fw, output_bw], axis=-1)
244 |             state = tf.concat([state_fw, state_bw], axis=-1)
245 | 
246 |         return output, state
247 | 
248 |     def _concat_lstm(self, input_emb, input_length, extra_emb, hidden_size, variable_scope="ConcatLSTM"):
249 |         """
250 |         input_emb: [batch_size, nstep, hidden_size]
251 |         extra_emb: [batch_size, hidden_size]
252 |         """
253 |         with tf.variable_scope(variable_scope):
254 |             nstep = input_emb.shape[1].value
255 |             # [batch_size, nstep, hidden_size]
256 |             expand_extra_emb = tf.tile(tf.expand_dims(
257 |                 extra_emb, axis=1), multiples=[1, nstep, 1])
258 |             input_emb = tf.concat([input_emb, expand_extra_emb], axis=-1)
259 | 
260 |             cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True)
261 |             cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True)
262 |             _output = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, input_emb,
263 |                                                       sequence_length=input_length,
264 |                                                       dtype=tf.float32)
265 |             (output_fw, output_bw), ((_, state_fw), (_, state_bw)) = _output
266 |             output = tf.concat([output_fw, output_bw], axis=-1)
267 |             state = tf.concat([state_fw, state_bw], axis=-1)
268 | 
269 |         return output, state
270 | 
271 |     def _train_op(self):
272 |         lr_m = self.cfg['optimizer'].lower()
273 |         with tf.variable_scope("train_op"):
274 |             optimizer = self._get_optimizer(lr_m)
275 |             grads_and_vars = optimizer.compute_gradients(self.loss)
276 |             for grad, var in grads_and_vars:
277 |                 # grad = tf.Print(grad, [grad], "{} grad: ".format(var.name))
278 |                 if grad is not None:
279 |                     tf.summary.histogram(var.op.name + "/gradients", grad)
280 |             if self.cfg['clip'] > 0:
281 |                 grads, variables = zip(*grads_and_vars)
282 |                 grads, gnorm = tf.clip_by_global_norm(grads, self.cfg['clip'])
283 |                 self.train_op = optimizer.apply_gradients(zip(grads, variables),
284 |                                                           global_step=tf.train.get_global_step())
285 |             else:
286 |                 self.train_op = optimizer.minimize(
287 |                     self.loss, global_step=tf.train.get_global_step())
288 | 
289 |     ####################   基础模块  #####################
290 |     def _add_word_embedding_matrix(self,):
291 |         # 如果有预训练矩阵，从其中导入
292 |         self.embedding_file = self.cfg['meta_dir'] + \
293 |             self.cfg.get('embedding_trimmed', None)
294 |         if self.embedding_file and self.cfg["use_pretrained"]:
295 |             embedding_matrix = np.load(self.embedding_file)["embeddings"]
296 |             self.embedding_matrix = tf.Variable(
297 |                 embedding_matrix, name='embedding_matrix', dtype=tf.float32)
298 |             pass
299 |         else:
300 |             self.embedding_matrix = tf.get_variable(name="embedding_matrix",
301 |                                                     dtype=tf.float32,
302 |                                                     shape=[self.cfg["word_num"], self.cfg["embedding_dim"]])
303 | 
304 |     def add_bert_layer(self, use_bert_pre=1):
305 |         self.bert_config = modeling.BertConfig.from_json_file(
306 |             self.cfg["bert_dir"] + self.cfg["bert_config"])
307 |         bert_model = modeling.BertModel(
308 |             config=self.bert_config,
309 |             is_training=self.is_train_place,
310 |             input_ids=self.query_ids,
311 |             input_mask=self.mask_ids,
312 |             token_type_ids=self.seg_ids,
313 |             use_one_hot_embeddings=False)
314 | 
315 |         if use_bert_pre:
316 |             tvars = tf.trainable_variables()
317 |             bert_init_dir = self.cfg["bert_dir"] + \
318 |                 self.cfg["bert_init_checkpoint"]
319 |             (assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars,
320 |                                                                                                    bert_init_dir)
321 |             tf.train.init_from_checkpoint(bert_init_dir, assignment)
322 | 
323 |         bert_output_seq_ori = bert_model.get_sequence_output()
324 |         bert_output_shape = tf.shape(bert_output_seq_ori)
325 |         self.bert_output_seq_ori = bert_output_seq_ori
326 |         # bs, seq, 768
327 |         bert_output_seq = tf.strided_slice(
328 |             bert_output_seq_ori, [0, 1, 0], bert_output_shape, [1, 1, 1])
329 |         nsteps = tf.shape(bert_output_seq)[1]
330 |         self.bert_output_seq = tf.reshape(
331 |             bert_output_seq, [-1, nsteps, self.bert_config.hidden_size])
332 |         self.cls_output = bert_model.get_pooled_output()
333 |         self.embedding_table = bert_model.embedding_table
334 |         # mask onehot
335 |         bert_mask_shape = tf.shape(self.mask_ids)
336 |         self.seq_mask_ids = tf.strided_slice(
337 |             self.mask_ids, [0, 1], bert_mask_shape, [1, 1])
338 |         self.word_mask_ids = tf.expand_dims(
339 |             tf.cast(self.seq_mask_ids, tf.float32), -1)
340 | 
341 |     def share_bert_layer(self, is_train_place, query_ids, mask_ids, seg_ids, use_bert_pre=1):
342 |         self.bert_config = modeling.BertConfig.from_json_file(
343 |             self.cfg["bert_dir"] + self.cfg["bert_config"])
344 |         bert_model = modeling.BertModel(
345 |             config=self.bert_config,
346 |             is_training=is_train_place,
347 |             input_ids=query_ids,
348 |             input_mask=mask_ids,
349 |             token_type_ids=seg_ids,
350 |             use_one_hot_embeddings=False,
351 |             scope="bert")
352 |         if use_bert_pre:
353 |             tvars = tf.trainable_variables()
354 |             bert_init_dir = self.cfg["bert_dir"] + \
355 |                 self.cfg["bert_init_checkpoint"]
356 |             (assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars,
357 |                                                                                                    bert_init_dir)
358 |             tf.train.init_from_checkpoint(bert_init_dir, assignment)
359 |         bert_output_seq = bert_model.get_sequence_output()
360 | 
361 |         # 默认使用cls输出
362 |         pooled = bert_model.get_pooled_output()
363 |         embedding_table = bert_model.embedding_table
364 |         input_mask_ = tf.cast(tf.expand_dims(mask_ids, axis=-1), dtype=tf.float32)
365 |         if self.cfg['sentence_embedding_type'] == "avg":
366 |             # 最后一层avg pooling
367 |             pooled = tf.reduce_sum(bert_output_seq * input_mask_, axis=1) / tf.reduce_sum(input_mask_, axis=1)
368 |         elif self.cfg['sentence_embedding_type'].startswith("avg-last-last-"):
369 |             # 使用最后的第n层 avg pooling
370 |             n_last = int(self.cfg['sentence_embedding_type'][-1])
371 |             sequence = bert_model.all_encoder_layers[-n_last] # [batch_size, seq_length, hidden_size]
372 |             pooled = tf.reduce_sum(sequence * input_mask_, axis=1) / tf.reduce_sum(input_mask_, axis=1)
373 |         elif self.cfg['sentence_embedding_type'].startswith("avg-last-"):
374 |             # 使用最后的n层 avg pooling
375 |             pooled = 0
376 |             n_last = int(self.cfg['sentence_embedding_type'][-1])
377 |             for i in range(n_last):
378 |                 sequence = bert_model.all_encoder_layers[-i]
379 |                 pooled += tf.reduce_sum(sequence * input_mask_, axis=1) / tf.reduce_sum(input_mask_, axis=1)
380 |             pooled /= float(n_last)
381 |         elif self.cfg['sentence_embedding_type'].startswith("avg-last-concat-"):
382 |             pooled = []
383 |             n_last = int(self.cfg['sentence_embedding_type'][-1])
384 |             for i in range(n_last):
385 |                 sequence = bert_model.all_encoder_layers[-i]
386 |                 pooled += [tf.reduce_sum(sequence * input_mask_, axis=1) / tf.reduce_sum(input_mask_, axis=1)]
387 |             pooled = tf.concat(pooled, axis=-1)
388 |         return pooled, bert_output_seq, embedding_table
389 | 
390 |     def _dropout(self, input_emb, ratio=None):
391 |         if not self.is_training:
392 |             return input_emb
393 |         if ratio:
394 |             return tf.layers.dropout(input_emb, ratio)
395 |         else:
396 |             return tf.layers.dropout(input_emb, self.cfg['dropout'])
397 | 
398 |     def _bigru(self, input_emb, input_length, hidden_size, variable_scope="BiGRU"):
399 |         with tf.variable_scope(variable_scope):
400 |             cell_fw = tf.nn.rnn_cell.GRUCell(hidden_size)
401 |             cell_bw = tf.nn.rnn_cell.GRUCell(hidden_size)
402 |             outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, input_emb,
403 |                                                               input_length, dtype=tf.float32)
404 |         return tf.concat(outputs, axis=-1), tf.concat(states, axis=-1)
405 | 
406 |     def _bilstm(self, input_emb, input_length, hidden_size, variable_scope="BilSTM"):
407 |         with tf.variable_scope(variable_scope):
408 |             cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True)
409 |             cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True)
410 |             _output = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, input_emb,
411 |                                                       input_length, dtype=tf.float32)
412 |             (output_fw, output_bw), ((_, state_fw), (_, state_bw)) = _output
413 | 
414 |         return tf.concat([output_fw, output_bw], axis=-1), tf.concat([state_fw, state_bw], axis=-1)
415 | 
416 |     def _iterable_dilated_cnn(self, embeddings):
417 |         """
418 |         :param embeddings: [batch_size, steps, embedding_dim]
419 |         :return:
420 |         """
421 |         embedding_dim = embeddings.get_shape()[-1]
422 |         with tf.variable_scope("id_cnn"):
423 |             cnn_input = tf.expand_dims(embeddings, 1)
424 |             initial_layer_filter_shape = [
425 |                 1, self.cfg.filter_width, embedding_dim, self.cfg.filter_num]
426 |             initial_layer_w = tf.get_variable("initial_layer_w", shape=initial_layer_filter_shape,
427 |                                               initializer=tf.contrib.layers.xavier_initializer())
428 |             initial_layer_b = tf.get_variable("initial_layer_b",
429 |                                               initializer=tf.constant(0.01, shape=[self.cfg.filter_num]))
430 |             initial_layer_output = tf.nn.conv2d(cnn_input, initial_layer_w, strides=[1, 1, 1, 1],
431 |                                                 padding="SAME", name="initial_layer")
432 |             initial_layer_output = tf.nn.relu(tf.nn.bias_add(
433 |                 initial_layer_output, initial_layer_b), name="relu")
434 | 
435 |             atrous_input = initial_layer_output
436 |             atrous_layers_output = []
437 |             atrous_layers_output_dim = 0
438 |             for block in range(self.cfg.repeat_times):
439 |                 for i in range(len(self.cfg.idcnn_layers)):
440 |                     layer_name = "conv_{}".format(i)
441 |                     dilation = self.cfg.idcnn_layers[i]
442 |                     with tf.variable_scope("atrous_conv_{}".format(i), reuse=tf.AUTO_REUSE):
443 |                         filter_shape = [1, self.cfg.filter_width,
444 |                                         self.cfg.filter_num, self.cfg.filter_num]
445 |                         conv_w = tf.get_variable("{}_w".format(layer_name), shape=filter_shape,
446 |                                                  initializer=tf.contrib.layers.xavier_initializer())
447 |                         conv_b = tf.get_variable("{}_b".format(
448 |                             layer_name), shape=[self.cfg.filter_num])
449 |                         conv_output = tf.nn.convolution(atrous_input, conv_w, dilation_rate=[1, dilation],
450 |                                                         padding="SAME", name=layer_name)
451 |                         conv_output = tf.nn.relu(
452 |                             tf.nn.bias_add(conv_output, conv_b))
453 |                         if i == len(self.cfg.idcnn_layers) - 1:
454 |                             atrous_layers_output.append(conv_output)
455 |                             atrous_layers_output_dim += self.cfg.filter_num
456 |                         atrous_input = conv_output
457 |             output = tf.concat(axis=3, values=atrous_layers_output)
458 |             return tf.squeeze(output, [1])
459 | 
460 |     def add_train_op(self, learning_method, learning_rate, loss, clip=-1):
461 |         learning_rate = tf.train.exponential_decay(learning_rate=learning_rate,
462 |                                                    global_step=tf.train.get_or_create_global_step(),
463 |                                                    decay_steps=self.cfg['decay_step'],
464 |                                                    decay_rate=self.cfg['lr_decay'])
465 |         update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
466 |         _lr_m = learning_method.lower()
467 |         with tf.variable_scope("train_step"):
468 |             if _lr_m == "adam":
469 |                 optimizer = tf.train.AdamOptimizer(learning_rate)
470 |             elif _lr_m == 'lazyadam':
471 |                 optimizer = tf.contrib.opt.LazyAdamOptimizer(
472 |                     learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8)
473 |             elif _lr_m == "adagrad":
474 |                 optimizer = tf.train.AdagradOptimizer(learning_rate)
475 |             elif _lr_m == "sgd":
476 |                 optimizer = tf.train.GradientDescentOptimizer(learning_rate)
477 |             elif _lr_m == "rmsprop":
478 |                 optimizer = tf.train.RMSPropOptimizer(learning_rate)
479 |             else:
480 |                 raise NotImplementedError("Unknown method {}".format(_lr_m))
481 |             with tf.control_dependencies(update_ops):
482 |                 if clip > 0:
483 |                     grads, variables = zip(*optimizer.compute_gradients(loss))
484 |                     grads, gnorm = tf.clip_by_global_norm(grads, clip)
485 |                     self.train_op = optimizer.apply_gradients(zip(grads, variables),
486 |                                                               global_step=tf.train.get_global_step())
487 |                 else:
488 |                     # 梯度截断
489 |                     # params = tf.trainable_variables()
490 |                     # all_gradients = tf.gradients(loss, all_variables, stop_gradients=stop_tensors)
491 |                     self.train_op = optimizer.minimize(
492 |                         loss, global_step=tf.train.get_global_step())
493 | 
494 |             return self.train_op
495 | 
496 |     @staticmethod
497 |     def label_smoothing(inp, ls_epsilon):
498 |         """
499 |         From the paper: "... employed label smoothing of epsilon = 0.1. This hurts perplexity,
500 |         as the model learns to be more unsure, but improves accuracy and BLEU score."
501 |         Args:
502 |             inp (tf.tensor): one-hot encoding vectors, [batch, seq_len, vocab_size]
503 |         """
504 |         vocab_size = inp.shape.as_list()[-1]
505 |         smoothed = (1.0 - ls_epsilon) * inp + (ls_epsilon / vocab_size)
506 |         return smoothed
507 | 
508 | 
509 | if __name__ == "__main__":
510 |     model = BaseModel("s")
511 |     pass
512 | 


--------------------------------------------------------------------------------
/model/bert/ReadMe.md:
--------------------------------------------------------------------------------
1 | # download
2 | - bert-base chinese: https://github.com/google-research/bert
3 | - roberta_zh: https://github.com/brightmart/roberta_zh
4 | - albert_zh: https://github.com/brightmart/albert_zh
5 | - xlnet_zh: https://github.com/brightmart/xlnet_zh
6 | 
7 | 来自于：[https://github.com/google-research/bert](https://github.com/google-research/bert)
8 | 


--------------------------------------------------------------------------------
/model/bert/modeling.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """The main BERT model and related functions."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import collections
 22 | import copy
 23 | import json
 24 | import math
 25 | import re
 26 | import six
 27 | import tensorflow as tf
 28 | 
 29 | 
 30 | class BertConfig(object):
 31 |   """Configuration for `BertModel`."""
 32 | 
 33 |   def __init__(self,
 34 |                vocab_size,
 35 |                hidden_size=768,
 36 |                num_hidden_layers=12,
 37 |                num_attention_heads=12,
 38 |                intermediate_size=3072,
 39 |                hidden_act="gelu",
 40 |                hidden_dropout_prob=0.1,
 41 |                attention_probs_dropout_prob=0.1,
 42 |                max_position_embeddings=512,
 43 |                type_vocab_size=16,
 44 |                initializer_range=0.02):
 45 |     """Constructs BertConfig.
 46 |     Args:
 47 |       vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
 48 |       hidden_size: Size of the encoder layers and the pooler layer.
 49 |       num_hidden_layers: Number of hidden layers in the Transformer encoder.
 50 |       num_attention_heads: Number of attention heads for each attention layer in
 51 |         the Transformer encoder.
 52 |       intermediate_size: The size of the "intermediate" (i.e., feed-forward)
 53 |         layer in the Transformer encoder.
 54 |       hidden_act: The non-linear activation function (function or string) in the
 55 |         encoder and pooler.
 56 |       hidden_dropout_prob: The dropout probability for all fully connected
 57 |         layers in the embeddings, encoder, and pooler.
 58 |       attention_probs_dropout_prob: The dropout ratio for the attention
 59 |         probabilities.
 60 |       max_position_embeddings: The maximum sequence length that this model might
 61 |         ever be used with. Typically set this to something large just in case
 62 |         (e.g., 512 or 1024 or 2048).
 63 |       type_vocab_size: The vocabulary size of the `token_type_ids` passed into
 64 |         `BertModel`.
 65 |       initializer_range: The stdev of the truncated_normal_initializer for
 66 |         initializing all weight matrices.
 67 |     """
 68 |     self.vocab_size = vocab_size
 69 |     self.hidden_size = hidden_size
 70 |     self.num_hidden_layers = num_hidden_layers
 71 |     self.num_attention_heads = num_attention_heads
 72 |     self.hidden_act = hidden_act
 73 |     self.intermediate_size = intermediate_size
 74 |     self.hidden_dropout_prob = hidden_dropout_prob
 75 |     self.attention_probs_dropout_prob = attention_probs_dropout_prob
 76 |     self.max_position_embeddings = max_position_embeddings
 77 |     self.type_vocab_size = type_vocab_size
 78 |     self.initializer_range = initializer_range
 79 | 
 80 |   @classmethod
 81 |   def from_dict(cls, json_object):
 82 |     """Constructs a `BertConfig` from a Python dictionary of parameters."""
 83 |     config = BertConfig(vocab_size=None)
 84 |     for (key, value) in six.iteritems(json_object):
 85 |       config.__dict__[key] = value
 86 |     return config
 87 | 
 88 |   @classmethod
 89 |   def from_json_file(cls, json_file):
 90 |     """Constructs a `BertConfig` from a json file of parameters."""
 91 |     with tf.gfile.GFile(json_file, "r") as reader:
 92 |       text = reader.read()
 93 |     return cls.from_dict(json.loads(text))
 94 | 
 95 |   def to_dict(self):
 96 |     """Serializes this instance to a Python dictionary."""
 97 |     output = copy.deepcopy(self.__dict__)
 98 |     return output
 99 | 
100 |   def to_json_string(self):
101 |     """Serializes this instance to a JSON string."""
102 |     return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
103 | 
104 | 
105 | class BertModel(object):
106 |   """BERT model ("Bidirectional Embedding Representations from a Transformer").
107 |   Example usage:
108 |   ```python
109 |   # Already been converted into WordPiece token ids
110 |   input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
111 |   input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
112 |   token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])
113 |   config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
114 |     num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
115 |   model = modeling.BertModel(config=config, is_training=True,
116 |     input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)
117 |   label_embeddings = tf.get_variable(...)
118 |   pooled_output = model.get_pooled_output()
119 |   logits = tf.matmul(pooled_output, label_embeddings)
120 |   ...
121 |   ```
122 |   """
123 | 
124 |   def __init__(self,
125 |                config,
126 |                is_training,
127 |                input_ids,
128 |                input_mask=None,
129 |                token_type_ids=None,
130 |                use_one_hot_embeddings=True,
131 |                scope=None):
132 |     """Constructor for BertModel.
133 |     Args:
134 |       config: `BertConfig` instance.
135 |       is_training: bool. rue for training model, false for eval model. Controls
136 |         whether dropout will be applied.
137 |       input_ids: int32 Tensor of shape [batch_size, seq_length].
138 |       input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
139 |       token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
140 |       use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
141 |         embeddings or tf.embedding_lookup() for the word embeddings. On the TPU,
142 |         it is must faster if this is True, on the CPU or GPU, it is faster if
143 |         this is False.
144 |       scope: (optional) variable scope. Defaults to "bert".
145 |     Raises:
146 |       ValueError: The config is invalid or one of the input tensor shapes
147 |         is invalid.
148 |     """
149 |     config = copy.deepcopy(config)
150 |     if not is_training:
151 |       config.hidden_dropout_prob = 0.0
152 |       config.attention_probs_dropout_prob = 0.0
153 | 
154 |     input_shape = get_shape_list(input_ids, expected_rank=2)
155 |     batch_size = input_shape[0]
156 |     seq_length = input_shape[1]
157 | 
158 |     if input_mask is None:
159 |       input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
160 | 
161 |     if token_type_ids is None:
162 |       token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
163 | 
164 |     with tf.variable_scope(scope, default_name="bert"):
165 |       with tf.variable_scope("embeddings"):
166 |         # Perform embedding lookup on the word ids.
167 |         (self.embedding_output, self.embedding_table) = embedding_lookup(
168 |             input_ids=input_ids,
169 |             vocab_size=config.vocab_size,
170 |             embedding_size=config.hidden_size,
171 |             initializer_range=config.initializer_range,
172 |             word_embedding_name="word_embeddings",
173 |             use_one_hot_embeddings=use_one_hot_embeddings)
174 | 
175 |         # Add positional embeddings and token type embeddings, then layer
176 |         # normalize and perform dropout.
177 |         self.embedding_output = embedding_postprocessor(
178 |             input_tensor=self.embedding_output,
179 |             use_token_type=True,
180 |             token_type_ids=token_type_ids,
181 |             token_type_vocab_size=config.type_vocab_size,
182 |             token_type_embedding_name="token_type_embeddings",
183 |             use_position_embeddings=True,
184 |             position_embedding_name="position_embeddings",
185 |             initializer_range=config.initializer_range,
186 |             max_position_embeddings=config.max_position_embeddings,
187 |             dropout_prob=config.hidden_dropout_prob)
188 | 
189 |       with tf.variable_scope("encoder"):
190 |         # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
191 |         # mask of shape [batch_size, seq_length, seq_length] which is used
192 |         # for the attention scores.
193 |         attention_mask = create_attention_mask_from_input_mask(
194 |             input_ids, input_mask)
195 | 
196 |         # Run the stacked transformer.
197 |         # `sequence_output` shape = [batch_size, seq_length, hidden_size].
198 |         self.all_encoder_layers = transformer_model(
199 |             input_tensor=self.embedding_output,
200 |             attention_mask=attention_mask,
201 |             hidden_size=config.hidden_size,
202 |             num_hidden_layers=config.num_hidden_layers,
203 |             num_attention_heads=config.num_attention_heads,
204 |             intermediate_size=config.intermediate_size,
205 |             intermediate_act_fn=get_activation(config.hidden_act),
206 |             hidden_dropout_prob=config.hidden_dropout_prob,
207 |             attention_probs_dropout_prob=config.attention_probs_dropout_prob,
208 |             initializer_range=config.initializer_range,
209 |             do_return_all_layers=True)
210 | 
211 |       self.sequence_output = self.all_encoder_layers[-1]
212 |       # The "pooler" converts the encoded sequence tensor of shape
213 |       # [batch_size, seq_length, hidden_size] to a tensor of shape
214 |       # [batch_size, hidden_size]. This is necessary for segment-level
215 |       # (or segment-pair-level) classification tasks where we need a fixed
216 |       # dimensional representation of the segment.
217 |       with tf.variable_scope("pooler"):
218 |         # We "pool" the model by simply taking the hidden state corresponding
219 |         # to the first token. We assume that this has been pre-trained
220 |         first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
221 |         self.pooled_output = tf.layers.dense(
222 |             first_token_tensor,
223 |             config.hidden_size,
224 |             activation=tf.tanh,
225 |             kernel_initializer=create_initializer(config.initializer_range))
226 | 
227 |   def get_pooled_output(self):
228 |     return self.pooled_output
229 | 
230 |   def get_sequence_output(self):
231 |     """Gets final hidden layer of encoder.
232 |     Returns:
233 |       float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
234 |       to the final hidden of the transformer encoder.
235 |     """
236 |     return self.sequence_output
237 | 
238 |   def get_all_encoder_layers(self):
239 |     return self.all_encoder_layers
240 | 
241 |   def get_embedding_output(self):
242 |     """Gets output of the embedding lookup (i.e., input to the transformer).
243 |     Returns:
244 |       float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
245 |       to the output of the embedding layer, after summing the word
246 |       embeddings with the positional embeddings and the token type embeddings,
247 |       then performing layer normalization. This is the input to the transformer.
248 |     """
249 |     return self.embedding_output
250 | 
251 |   def get_embedding_table(self):
252 |     return self.embedding_table
253 | 
254 | 
255 | def gelu(input_tensor):
256 |   """Gaussian Error Linear Unit.
257 |   This is a smoother version of the RELU.
258 |   Original paper: https://arxiv.org/abs/1606.08415
259 |   Args:
260 |     input_tensor: float Tensor to perform activation.
261 |   Returns:
262 |     `input_tensor` with the GELU activation applied.
263 |   """
264 |   cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0)))
265 |   return input_tensor * cdf
266 | 
267 | 
268 | def get_activation(activation_string):
269 |   """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.
270 |   Args:
271 |     activation_string: String name of the activation function.
272 |   Returns:
273 |     A Python function corresponding to the activation function. If
274 |     `activation_string` is None, empty, or "linear", this will return None.
275 |     If `activation_string` is not a string, it will return `activation_string`.
276 |   Raises:
277 |     ValueError: The `activation_string` does not correspond to a known
278 |       activation.
279 |   """
280 | 
281 |   # We assume that anything that"s not a string is already an activation
282 |   # function, so we just return it.
283 |   if not isinstance(activation_string, six.string_types):
284 |     return activation_string
285 | 
286 |   if not activation_string:
287 |     return None
288 | 
289 |   act = activation_string.lower()
290 |   if act == "linear":
291 |     return None
292 |   elif act == "relu":
293 |     return tf.nn.relu
294 |   elif act == "gelu":
295 |     return gelu
296 |   elif act == "tanh":
297 |     return tf.tanh
298 |   else:
299 |     raise ValueError("Unsupported activation: %s" % act)
300 | 
301 | 
302 | def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
303 |   """Compute the union of the current variables and checkpoint variables."""
304 |   assignment_map = {}
305 |   initialized_variable_names = {}
306 | 
307 |   name_to_variable = collections.OrderedDict()
308 |   for var in tvars:
309 |     name = var.name
310 |     m = re.match("^(.*):\\d+$", name)
311 |     if m is not None:
312 |       name = m.group(1)
313 |     name_to_variable[name] = var
314 | 
315 |   init_vars = tf.train.list_variables(init_checkpoint)
316 | 
317 |   assignment_map = collections.OrderedDict()
318 |   for x in init_vars:
319 |     (name, var) = (x[0], x[1])
320 |     if name not in name_to_variable:
321 |       continue
322 |     assignment_map[name] = name
323 |     initialized_variable_names[name] = 1
324 |     initialized_variable_names[name + ":0"] = 1
325 | 
326 |   return (assignment_map, initialized_variable_names)
327 | 
328 | 
329 | def dropout(input_tensor, dropout_prob):
330 |   """Perform dropout.
331 |   Args:
332 |     input_tensor: float Tensor.
333 |     dropout_prob: Python float. The probability of dropping out a value (NOT of
334 |       *keeping* a dimension as in `tf.nn.dropout`).
335 |   Returns:
336 |     A version of `input_tensor` with dropout applied.
337 |   """
338 |   if dropout_prob is None or dropout_prob == 0.0:
339 |     return input_tensor
340 | 
341 |   output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
342 |   return output
343 | 
344 | 
345 | def layer_norm(input_tensor, name=None):
346 |   """Run layer normalization on the last dimension of the tensor."""
347 |   return tf.contrib.layers.layer_norm(
348 |       inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)
349 | 
350 | 
351 | def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
352 |   """Runs layer normalization followed by dropout."""
353 |   output_tensor = layer_norm(input_tensor, name)
354 |   output_tensor = dropout(output_tensor, dropout_prob)
355 |   return output_tensor
356 | 
357 | 
358 | def create_initializer(initializer_range=0.02):
359 |   """Creates a `truncated_normal_initializer` with the given range."""
360 |   return tf.truncated_normal_initializer(stddev=initializer_range)
361 | 
362 | 
363 | def embedding_lookup(input_ids,
364 |                      vocab_size,
365 |                      embedding_size=128,
366 |                      initializer_range=0.02,
367 |                      word_embedding_name="word_embeddings",
368 |                      use_one_hot_embeddings=False):
369 |   """Looks up words embeddings for id tensor.
370 |   Args:
371 |     input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
372 |       ids.
373 |     vocab_size: int. Size of the embedding vocabulary.
374 |     embedding_size: int. Width of the word embeddings.
375 |     initializer_range: float. Embedding initialization range.
376 |     word_embedding_name: string. Name of the embedding table.
377 |     use_one_hot_embeddings: bool. If True, use one-hot method for word
378 |       embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
379 |       for TPUs.
380 |   Returns:
381 |     float Tensor of shape [batch_size, seq_length, embedding_size].
382 |   """
383 |   # This function assumes that the input is of shape [batch_size, seq_length,
384 |   # num_inputs].
385 |   #
386 |   # If the input is a 2D tensor of shape [batch_size, seq_length], we
387 |   # reshape to [batch_size, seq_length, 1].
388 |   if input_ids.shape.ndims == 2:
389 |     input_ids = tf.expand_dims(input_ids, axis=[-1])
390 | 
391 |   embedding_table = tf.get_variable(
392 |       name=word_embedding_name,
393 |       shape=[vocab_size, embedding_size],
394 |       initializer=create_initializer(initializer_range))
395 | 
396 |   if use_one_hot_embeddings:
397 |     flat_input_ids = tf.reshape(input_ids, [-1])
398 |     one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
399 |     output = tf.matmul(one_hot_input_ids, embedding_table)
400 |   else:
401 |     output = tf.nn.embedding_lookup(embedding_table, input_ids)
402 | 
403 |   input_shape = get_shape_list(input_ids)
404 | 
405 |   output = tf.reshape(output,
406 |                       input_shape[0:-1] + [input_shape[-1] * embedding_size])
407 |   return (output, embedding_table)
408 | 
409 | 
410 | def embedding_postprocessor(input_tensor,
411 |                             use_token_type=False,
412 |                             token_type_ids=None,
413 |                             token_type_vocab_size=16,
414 |                             token_type_embedding_name="token_type_embeddings",
415 |                             use_position_embeddings=True,
416 |                             position_embedding_name="position_embeddings",
417 |                             initializer_range=0.02,
418 |                             max_position_embeddings=512,
419 |                             dropout_prob=0.1):
420 |   """Performs various post-processing on a word embedding tensor.
421 |   Args:
422 |     input_tensor: float Tensor of shape [batch_size, seq_length,
423 |       embedding_size].
424 |     use_token_type: bool. Whether to add embeddings for `token_type_ids`.
425 |     token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
426 |       Must be specified if `use_token_type` is True.
427 |     token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
428 |     token_type_embedding_name: string. The name of the embedding table variable
429 |       for token type ids.
430 |     use_position_embeddings: bool. Whether to add position embeddings for the
431 |       position of each token in the sequence.
432 |     position_embedding_name: string. The name of the embedding table variable
433 |       for positional embeddings.
434 |     initializer_range: float. Range of the weight initialization.
435 |     max_position_embeddings: int. Maximum sequence length that might ever be
436 |       used with this model. This can be longer than the sequence length of
437 |       input_tensor, but cannot be shorter.
438 |     dropout_prob: float. Dropout probability applied to the final output tensor.
439 |   Returns:
440 |     float tensor with same shape as `input_tensor`.
441 |   Raises:
442 |     ValueError: One of the tensor shapes or input values is invalid.
443 |   """
444 |   input_shape = get_shape_list(input_tensor, expected_rank=3)
445 |   batch_size = input_shape[0]
446 |   seq_length = input_shape[1]
447 |   width = input_shape[2]
448 | 
449 |   output = input_tensor
450 | 
451 |   if use_token_type:
452 |     if token_type_ids is None:
453 |       raise ValueError("`token_type_ids` must be specified if"
454 |                        "`use_token_type` is True.")
455 |     token_type_table = tf.get_variable(
456 |         name=token_type_embedding_name,
457 |         shape=[token_type_vocab_size, width],
458 |         initializer=create_initializer(initializer_range))
459 |     # This vocab will be small so we always do one-hot here, since it is always
460 |     # faster for a small vocabulary.
461 |     flat_token_type_ids = tf.reshape(token_type_ids, [-1])
462 |     one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
463 |     token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
464 |     token_type_embeddings = tf.reshape(token_type_embeddings,
465 |                                        [batch_size, seq_length, width])
466 |     output += token_type_embeddings
467 | 
468 |   if use_position_embeddings:
469 |     assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
470 |     with tf.control_dependencies([assert_op]):
471 |       full_position_embeddings = tf.get_variable(
472 |           name=position_embedding_name,
473 |           shape=[max_position_embeddings, width],
474 |           initializer=create_initializer(initializer_range))
475 |       # Since the position embedding table is a learned variable, we create it
476 |       # using a (long) sequence length `max_position_embeddings`. The actual
477 |       # sequence length might be shorter than this, for faster training of
478 |       # tasks that do not have long sequences.
479 |       #
480 |       # So `full_position_embeddings` is effectively an embedding table
481 |       # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
482 |       # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
483 |       # perform a slice.
484 |       position_embeddings = tf.slice(full_position_embeddings, [0, 0],
485 |                                      [seq_length, -1])
486 |       num_dims = len(output.shape.as_list())
487 | 
488 |       # Only the last two dimensions are relevant (`seq_length` and `width`), so
489 |       # we broadcast among the first dimensions, which is typically just
490 |       # the batch size.
491 |       position_broadcast_shape = []
492 |       for _ in range(num_dims - 2):
493 |         position_broadcast_shape.append(1)
494 |       position_broadcast_shape.extend([seq_length, width])
495 |       position_embeddings = tf.reshape(position_embeddings,
496 |                                        position_broadcast_shape)
497 |       output += position_embeddings
498 | 
499 |   output = layer_norm_and_dropout(output, dropout_prob)
500 |   return output
501 | 
502 | 
503 | def create_attention_mask_from_input_mask(from_tensor, to_mask):
504 |   """Create 3D attention mask from a 2D tensor mask.
505 |   Args:
506 |     from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
507 |     to_mask: int32 Tensor of shape [batch_size, to_seq_length].
508 |   Returns:
509 |     float Tensor of shape [batch_size, from_seq_length, to_seq_length].
510 |   """
511 |   from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
512 |   batch_size = from_shape[0]
513 |   from_seq_length = from_shape[1]
514 | 
515 |   to_shape = get_shape_list(to_mask, expected_rank=2)
516 |   to_seq_length = to_shape[1]
517 | 
518 |   to_mask = tf.cast(
519 |       tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)
520 | 
521 |   # We don't assume that `from_tensor` is a mask (although it could be). We
522 |   # don't actually care if we attend *from* padding tokens (only *to* padding)
523 |   # tokens so we create a tensor of all ones.
524 |   #
525 |   # `broadcast_ones` = [batch_size, from_seq_length, 1]
526 |   broadcast_ones = tf.ones(
527 |       shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
528 | 
529 |   # Here we broadcast along two dimensions to create the mask.
530 |   mask = broadcast_ones * to_mask
531 | 
532 |   return mask
533 | 
534 | 
535 | def attention_layer(from_tensor,
536 |                     to_tensor,
537 |                     attention_mask=None,
538 |                     num_attention_heads=1,
539 |                     size_per_head=512,
540 |                     query_act=None,
541 |                     key_act=None,
542 |                     value_act=None,
543 |                     attention_probs_dropout_prob=0.0,
544 |                     initializer_range=0.02,
545 |                     do_return_2d_tensor=False,
546 |                     batch_size=None,
547 |                     from_seq_length=None,
548 |                     to_seq_length=None):
549 |   """Performs multi-headed attention from `from_tensor` to `to_tensor`.
550 |   This is an implementation of multi-headed attention based on "Attention
551 |   is all you Need". If `from_tensor` and `to_tensor` are the same, then
552 |   this is self-attention. Each timestep in `from_tensor` attends to the
553 |   corresponding sequence in `to_tensor`, and returns a fixed-with vector.
554 |   This function first projects `from_tensor` into a "query" tensor and
555 |   `to_tensor` into "key" and "value" tensors. These are (effectively) a list
556 |   of tensors of length `num_attention_heads`, where each tensor is of shape
557 |   [batch_size, seq_length, size_per_head].
558 |   Then, the query and key tensors are dot-producted and scaled. These are
559 |   softmaxed to obtain attention probabilities. The value tensors are then
560 |   interpolated by these probabilities, then concatenated back to a single
561 |   tensor and returned.
562 |   In practice, the multi-headed attention are done with transposes and
563 |   reshapes rather than actual separate tensors.
564 |   Args:
565 |     from_tensor: float Tensor of shape [batch_size, from_seq_length,
566 |       from_width].
567 |     to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
568 |     attention_mask: (optional) int32 Tensor of shape [batch_size,
569 |       from_seq_length, to_seq_length]. The values should be 1 or 0. The
570 |       attention scores will effectively be set to -infinity for any positions in
571 |       the mask that are 0, and will be unchanged for positions that are 1.
572 |     num_attention_heads: int. Number of attention heads.
573 |     size_per_head: int. Size of each attention head.
574 |     query_act: (optional) Activation function for the query transform.
575 |     key_act: (optional) Activation function for the key transform.
576 |     value_act: (optional) Activation function for the value transform.
577 |     attention_probs_dropout_prob: (optional) float. Dropout probability of the
578 |       attention probabilities.
579 |     initializer_range: float. Range of the weight initializer.
580 |     do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
581 |       * from_seq_length, num_attention_heads * size_per_head]. If False, the
582 |       output will be of shape [batch_size, from_seq_length, num_attention_heads
583 |       * size_per_head].
584 |     batch_size: (Optional) int. If the input is 2D, this might be the batch size
585 |       of the 3D version of the `from_tensor` and `to_tensor`.
586 |     from_seq_length: (Optional) If the input is 2D, this might be the seq length
587 |       of the 3D version of the `from_tensor`.
588 |     to_seq_length: (Optional) If the input is 2D, this might be the seq length
589 |       of the 3D version of the `to_tensor`.
590 |   Returns:
591 |     float Tensor of shape [batch_size, from_seq_length,
592 |       num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
593 |       true, this will be of shape [batch_size * from_seq_length,
594 |       num_attention_heads * size_per_head]).
595 |   Raises:
596 |     ValueError: Any of the arguments or tensor shapes are invalid.
597 |   """
598 | 
599 |   def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
600 |                            seq_length, width):
601 |     output_tensor = tf.reshape(
602 |         input_tensor, [batch_size, seq_length, num_attention_heads, width])
603 | 
604 |     output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
605 |     return output_tensor
606 | 
607 |   from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
608 |   to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
609 | 
610 |   if len(from_shape) != len(to_shape):
611 |     raise ValueError(
612 |         "The rank of `from_tensor` must match the rank of `to_tensor`.")
613 | 
614 |   if len(from_shape) == 3:
615 |     batch_size = from_shape[0]
616 |     from_seq_length = from_shape[1]
617 |     to_seq_length = to_shape[1]
618 |   elif len(from_shape) == 2:
619 |     if (batch_size is None or from_seq_length is None or to_seq_length is None):
620 |       raise ValueError(
621 |           "When passing in rank 2 tensors to attention_layer, the values "
622 |           "for `batch_size`, `from_seq_length`, and `to_seq_length` "
623 |           "must all be specified.")
624 | 
625 |   # Scalar dimensions referenced here:
626 |   #   B = batch size (number of sequences)
627 |   #   F = `from_tensor` sequence length
628 |   #   T = `to_tensor` sequence length
629 |   #   N = `num_attention_heads`
630 |   #   H = `size_per_head`
631 | 
632 |   from_tensor_2d = reshape_to_matrix(from_tensor)
633 |   to_tensor_2d = reshape_to_matrix(to_tensor)
634 | 
635 |   # `query_layer` = [B*F, N*H]
636 |   query_layer = tf.layers.dense(
637 |       from_tensor_2d,
638 |       num_attention_heads * size_per_head,
639 |       activation=query_act,
640 |       name="query",
641 |       kernel_initializer=create_initializer(initializer_range))
642 | 
643 |   # `key_layer` = [B*T, N*H]
644 |   key_layer = tf.layers.dense(
645 |       to_tensor_2d,
646 |       num_attention_heads * size_per_head,
647 |       activation=key_act,
648 |       name="key",
649 |       kernel_initializer=create_initializer(initializer_range))
650 | 
651 |   # `value_layer` = [B*T, N*H]
652 |   value_layer = tf.layers.dense(
653 |       to_tensor_2d,
654 |       num_attention_heads * size_per_head,
655 |       activation=value_act,
656 |       name="value",
657 |       kernel_initializer=create_initializer(initializer_range))
658 | 
659 |   # `query_layer` = [B, N, F, H]
660 |   query_layer = transpose_for_scores(query_layer, batch_size,
661 |                                      num_attention_heads, from_seq_length,
662 |                                      size_per_head)
663 | 
664 |   # `key_layer` = [B, N, T, H]
665 |   key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
666 |                                    to_seq_length, size_per_head)
667 | 
668 |   # Take the dot product between "query" and "key" to get the raw
669 |   # attention scores.
670 |   # `attention_scores` = [B, N, F, T]
671 |   attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
672 |   attention_scores = tf.multiply(attention_scores,
673 |                                  1.0 / math.sqrt(float(size_per_head)))
674 | 
675 |   if attention_mask is not None:
676 |     # `attention_mask` = [B, 1, F, T]
677 |     attention_mask = tf.expand_dims(attention_mask, axis=[1])
678 | 
679 |     # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
680 |     # masked positions, this operation will create a tensor which is 0.0 for
681 |     # positions we want to attend and -10000.0 for masked positions.
682 |     adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
683 | 
684 |     # Since we are adding it to the raw scores before the softmax, this is
685 |     # effectively the same as removing these entirely.
686 |     attention_scores += adder
687 | 
688 |   # Normalize the attention scores to probabilities.
689 |   # `attention_probs` = [B, N, F, T]
690 |   attention_probs = tf.nn.softmax(attention_scores)
691 | 
692 |   # This is actually dropping out entire tokens to attend to, which might
693 |   # seem a bit unusual, but is taken from the original Transformer paper.
694 |   attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
695 | 
696 |   # `value_layer` = [B, T, N, H]
697 |   value_layer = tf.reshape(
698 |       value_layer,
699 |       [batch_size, to_seq_length, num_attention_heads, size_per_head])
700 | 
701 |   # `value_layer` = [B, N, T, H]
702 |   value_layer = tf.transpose(value_layer, [0, 2, 1, 3])
703 | 
704 |   # `context_layer` = [B, N, F, H]
705 |   context_layer = tf.matmul(attention_probs, value_layer)
706 | 
707 |   # `context_layer` = [B, F, N, H]
708 |   context_layer = tf.transpose(context_layer, [0, 2, 1, 3])
709 | 
710 |   if do_return_2d_tensor:
711 |     # `context_layer` = [B*F, N*V]
712 |     context_layer = tf.reshape(
713 |         context_layer,
714 |         [batch_size * from_seq_length, num_attention_heads * size_per_head])
715 |   else:
716 |     # `context_layer` = [B, F, N*V]
717 |     context_layer = tf.reshape(
718 |         context_layer,
719 |         [batch_size, from_seq_length, num_attention_heads * size_per_head])
720 | 
721 |   return context_layer
722 | 
723 | 
724 | def transformer_model(input_tensor,
725 |                       attention_mask=None,
726 |                       hidden_size=768,
727 |                       num_hidden_layers=12,
728 |                       num_attention_heads=12,
729 |                       intermediate_size=3072,
730 |                       intermediate_act_fn=gelu,
731 |                       hidden_dropout_prob=0.1,
732 |                       attention_probs_dropout_prob=0.1,
733 |                       initializer_range=0.02,
734 |                       do_return_all_layers=False):
735 |   """Multi-headed, multi-layer Transformer from "Attention is All You Need".
736 |   This is almost an exact implementation of the original Transformer encoder.
737 |   See the original paper:
738 |   https://arxiv.org/abs/1706.03762
739 |   Also see:
740 |   https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
741 |   Args:
742 |     input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
743 |     attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
744 |       seq_length], with 1 for positions that can be attended to and 0 in
745 |       positions that should not be.
746 |     hidden_size: int. Hidden size of the Transformer.
747 |     num_hidden_layers: int. Number of layers (blocks) in the Transformer.
748 |     num_attention_heads: int. Number of attention heads in the Transformer.
749 |     intermediate_size: int. The size of the "intermediate" (a.k.a., feed
750 |       forward) layer.
751 |     intermediate_act_fn: function. The non-linear activation function to apply
752 |       to the output of the intermediate/feed-forward layer.
753 |     hidden_dropout_prob: float. Dropout probability for the hidden layers.
754 |     attention_probs_dropout_prob: float. Dropout probability of the attention
755 |       probabilities.
756 |     initializer_range: float. Range of the initializer (stddev of truncated
757 |       normal).
758 |     do_return_all_layers: Whether to also return all layers or just the final
759 |       layer.
760 |   Returns:
761 |     float Tensor of shape [batch_size, seq_length, hidden_size], the final
762 |     hidden layer of the Transformer.
763 |   Raises:
764 |     ValueError: A Tensor shape or parameter is invalid.
765 |   """
766 |   if hidden_size % num_attention_heads != 0:
767 |     raise ValueError(
768 |         "The hidden size (%d) is not a multiple of the number of attention "
769 |         "heads (%d)" % (hidden_size, num_attention_heads))
770 | 
771 |   attention_head_size = int(hidden_size / num_attention_heads)
772 |   input_shape = get_shape_list(input_tensor, expected_rank=3)
773 |   batch_size = input_shape[0]
774 |   seq_length = input_shape[1]
775 |   input_width = input_shape[2]
776 | 
777 |   # The Transformer performs sum residuals on all layers so the input needs
778 |   # to be the same as the hidden size.
779 |   if input_width != hidden_size:
780 |     raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
781 |                      (input_width, hidden_size))
782 | 
783 |   # We keep the representation as a 2D tensor to avoid re-shaping it back and
784 |   # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
785 |   # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
786 |   # help the optimizer.
787 |   prev_output = reshape_to_matrix(input_tensor)
788 | 
789 |   all_layer_outputs = []
790 |   for layer_idx in range(num_hidden_layers):
791 |     with tf.variable_scope("layer_%d" % layer_idx):
792 |       layer_input = prev_output
793 | 
794 |       with tf.variable_scope("attention"):
795 |         attention_heads = []
796 |         with tf.variable_scope("self"):
797 |           attention_head = attention_layer(
798 |               from_tensor=layer_input,
799 |               to_tensor=layer_input,
800 |               attention_mask=attention_mask,
801 |               num_attention_heads=num_attention_heads,
802 |               size_per_head=attention_head_size,
803 |               attention_probs_dropout_prob=attention_probs_dropout_prob,
804 |               initializer_range=initializer_range,
805 |               do_return_2d_tensor=True,
806 |               batch_size=batch_size,
807 |               from_seq_length=seq_length,
808 |               to_seq_length=seq_length)
809 |           attention_heads.append(attention_head)
810 | 
811 |         attention_output = None
812 |         if len(attention_heads) == 1:
813 |           attention_output = attention_heads[0]
814 |         else:
815 |           # In the case where we have other sequences, we just concatenate
816 |           # them to the self-attention head before the projection.
817 |           attention_output = tf.concat(attention_heads, axis=-1)
818 | 
819 |         # Run a linear projection of `hidden_size` then add a residual
820 |         # with `layer_input`.
821 |         with tf.variable_scope("output"):
822 |           attention_output = tf.layers.dense(
823 |               attention_output,
824 |               hidden_size,
825 |               kernel_initializer=create_initializer(initializer_range))
826 |           attention_output = dropout(attention_output, hidden_dropout_prob)
827 |           attention_output = layer_norm(attention_output + layer_input)
828 | 
829 |       # The activation is only applied to the "intermediate" hidden layer.
830 |       with tf.variable_scope("intermediate"):
831 |         intermediate_output = tf.layers.dense(
832 |             attention_output,
833 |             intermediate_size,
834 |             activation=intermediate_act_fn,
835 |             kernel_initializer=create_initializer(initializer_range))
836 | 
837 |       # Down-project back to `hidden_size` then add the residual.
838 |       with tf.variable_scope("output"):
839 |         layer_output = tf.layers.dense(
840 |             intermediate_output,
841 |             hidden_size,
842 |             kernel_initializer=create_initializer(initializer_range))
843 |         layer_output = dropout(layer_output, hidden_dropout_prob)
844 |         layer_output = layer_norm(layer_output + attention_output)
845 |         prev_output = layer_output
846 |         all_layer_outputs.append(layer_output)
847 | 
848 |   if do_return_all_layers:
849 |     final_outputs = []
850 |     for layer_output in all_layer_outputs:
851 |       final_output = reshape_from_matrix(layer_output, input_shape)
852 |       final_outputs.append(final_output)
853 |     return final_outputs
854 |   else:
855 |     final_output = reshape_from_matrix(prev_output, input_shape)
856 |     return final_output
857 | 
858 | 
859 | def get_shape_list(tensor, expected_rank=None, name=None):
860 |   """Returns a list of the shape of tensor, preferring static dimensions.
861 |   Args:
862 |     tensor: A tf.Tensor object to find the shape of.
863 |     expected_rank: (optional) int. The expected rank of `tensor`. If this is
864 |       specified and the `tensor` has a different rank, and exception will be
865 |       thrown.
866 |     name: Optional name of the tensor for the error message.
867 |   Returns:
868 |     A list of dimensions of the shape of tensor. All static dimensions will
869 |     be returned as python integers, and dynamic dimensions will be returned
870 |     as tf.Tensor scalars.
871 |   """
872 |   if name is None:
873 |     name = tensor.name
874 | 
875 |   if expected_rank is not None:
876 |     assert_rank(tensor, expected_rank, name)
877 | 
878 |   shape = tensor.shape.as_list()
879 | 
880 |   non_static_indexes = []
881 |   for (index, dim) in enumerate(shape):
882 |     if dim is None:
883 |       non_static_indexes.append(index)
884 | 
885 |   if not non_static_indexes:
886 |     return shape
887 | 
888 |   dyn_shape = tf.shape(tensor)
889 |   for index in non_static_indexes:
890 |     shape[index] = dyn_shape[index]
891 |   return shape
892 | 
893 | 
894 | def reshape_to_matrix(input_tensor):
895 |   """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
896 |   ndims = input_tensor.shape.ndims
897 |   if ndims < 2:
898 |     raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
899 |                      (input_tensor.shape))
900 |   if ndims == 2:
901 |     return input_tensor
902 | 
903 |   width = input_tensor.shape[-1]
904 |   output_tensor = tf.reshape(input_tensor, [-1, width])
905 |   return output_tensor
906 | 
907 | 
908 | def reshape_from_matrix(output_tensor, orig_shape_list):
909 |   """Reshapes a rank 2 tensor back to its original rank >= 2 tensor."""
910 |   if len(orig_shape_list) == 2:
911 |     return output_tensor
912 | 
913 |   output_shape = get_shape_list(output_tensor)
914 | 
915 |   orig_dims = orig_shape_list[0:-1]
916 |   width = output_shape[-1]
917 | 
918 |   return tf.reshape(output_tensor, orig_dims + [width])
919 | 
920 | 
921 | def assert_rank(tensor, expected_rank, name=None):
922 |   """Raises an exception if the tensor rank is not of the expected rank.
923 |   Args:
924 |     tensor: A tf.Tensor to check the rank of.
925 |     expected_rank: Python integer or list of integers, expected rank.
926 |     name: Optional name of the tensor for the error message.
927 |   Raises:
928 |     ValueError: If the expected shape doesn't match the actual shape.
929 |   """
930 |   if name is None:
931 |     name = tensor.name
932 | 
933 |   expected_rank_dict = {}
934 |   if isinstance(expected_rank, six.integer_types):
935 |     expected_rank_dict[expected_rank] = True
936 |   else:
937 |     for x in expected_rank:
938 |       expected_rank_dict[x] = True
939 | 
940 |   actual_rank = tensor.shape.ndims
941 |   if actual_rank not in expected_rank_dict:
942 |     scope_name = tf.get_variable_scope().name
943 |     raise ValueError(
944 |         "For the tensor `%s` in scope `%s`, the actual rank "
945 |         "`%d` (shape = %s) is not equal to the expected rank `%s`" %
946 |         (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank)))


--------------------------------------------------------------------------------
/model/bert/optimization.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """Functions and classes related to optimization (weight updates)."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import re
 22 | import tensorflow as tf
 23 | 
 24 | 
 25 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu):
 26 |     """Creates an optimizer training op."""
 27 |     global_step = tf.train.get_or_create_global_step()
 28 | 
 29 |     learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
 30 | 
 31 |     # Implements linear decay of the learning rate.
 32 |     learning_rate = tf.train.polynomial_decay(
 33 |         learning_rate,
 34 |         global_step,
 35 |         num_train_steps,
 36 |         end_learning_rate=0.0,
 37 |         power=1.0,
 38 |         cycle=False)
 39 | 
 40 |     # Implements linear warmup. I.e., if global_step < num_warmup_steps, the
 41 |     # learning rate will be `global_step/num_warmup_steps * init_lr`.
 42 |     if num_warmup_steps:
 43 |         global_steps_int = tf.cast(global_step, tf.int32)
 44 |         warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
 45 | 
 46 |         global_steps_float = tf.cast(global_steps_int, tf.float32)
 47 |         warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
 48 | 
 49 |         warmup_percent_done = global_steps_float / warmup_steps_float
 50 |         warmup_learning_rate = init_lr * warmup_percent_done
 51 | 
 52 |         is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
 53 |         learning_rate = (
 54 |                 (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
 55 | 
 56 |     # It is recommended that you use this optimizer for fine tuning, since this
 57 |     # is how the model was trained (note that the Adam m/v variables are NOT
 58 |     # loaded from init_checkpoint.)
 59 |     optimizer = AdamWeightDecayOptimizer(
 60 |         learning_rate=learning_rate,
 61 |         weight_decay_rate=0.01,
 62 |         beta_1=0.9,
 63 |         beta_2=0.999,
 64 |         epsilon=1e-6,
 65 |         exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
 66 | 
 67 |     if use_tpu:
 68 |         optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
 69 | 
 70 |     tvars = tf.trainable_variables()
 71 |     grads = tf.gradients(loss, tvars)
 72 | 
 73 |     # This is how the model was pre-trained.
 74 |     (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
 75 | 
 76 |     train_op = optimizer.apply_gradients(
 77 |         zip(grads, tvars), global_step=global_step)
 78 | 
 79 |     new_global_step = global_step + 1
 80 |     train_op = tf.group(train_op, global_step.assign(new_global_step))
 81 |     return train_op
 82 | 
 83 | 
 84 | class AdamWeightDecayOptimizer(tf.train.Optimizer):
 85 |     """A basic Adam optimizer that includes "correct" L2 weight decay."""
 86 | 
 87 |     def __init__(self,
 88 |                  learning_rate,
 89 |                  weight_decay_rate=0.0,
 90 |                  beta_1=0.9,
 91 |                  beta_2=0.999,
 92 |                  epsilon=1e-6,
 93 |                  exclude_from_weight_decay=None,
 94 |                  name="AdamWeightDecayOptimizer"):
 95 |         """Constructs a AdamWeightDecayOptimizer."""
 96 |         super(AdamWeightDecayOptimizer, self).__init__(False, name)
 97 | 
 98 |         self.learning_rate = learning_rate
 99 |         self.weight_decay_rate = weight_decay_rate
100 |         self.beta_1 = beta_1
101 |         self.beta_2 = beta_2
102 |         self.epsilon = epsilon
103 |         self.exclude_from_weight_decay = exclude_from_weight_decay
104 | 
105 |     def apply_gradients(self, grads_and_vars, global_step=None, name=None):
106 |         """See base class."""
107 |         assignments = []
108 |         for (grad, param) in grads_and_vars:
109 |             if grad is None or param is None:
110 |                 continue
111 | 
112 |             param_name = self._get_variable_name(param.name)
113 | 
114 |             m = tf.get_variable(
115 |                 name=param_name + "/adam_m",
116 |                 shape=param.shape.as_list(),
117 |                 dtype=tf.float32,
118 |                 trainable=False,
119 |                 initializer=tf.zeros_initializer())
120 |             v = tf.get_variable(
121 |                 name=param_name + "/adam_v",
122 |                 shape=param.shape.as_list(),
123 |                 dtype=tf.float32,
124 |                 trainable=False,
125 |                 initializer=tf.zeros_initializer())
126 | 
127 |             # Standard Adam update.
128 |             next_m = (
129 |                     tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
130 |             next_v = (
131 |                     tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
132 |                                                               tf.square(grad)))
133 | 
134 |             update = next_m / (tf.sqrt(next_v) + self.epsilon)
135 | 
136 |             # Just adding the square of the weights to the loss function is *not*
137 |             # the correct way of using L2 regularization/weight decay with Adam,
138 |             # since that will interact with the m and v parameters in strange ways.
139 |             #
140 |             # Instead we want ot decay the weights in a manner that doesn't interact
141 |             # with the m/v parameters. This is equivalent to adding the square
142 |             # of the weights to the loss with plain (non-momentum) SGD.
143 |             if self._do_use_weight_decay(param_name):
144 |                 update += self.weight_decay_rate * param
145 | 
146 |             update_with_lr = self.learning_rate * update
147 | 
148 |             next_param = param - update_with_lr
149 | 
150 |             assignments.extend(
151 |                 [param.assign(next_param),
152 |                  m.assign(next_m),
153 |                  v.assign(next_v)])
154 |         return tf.group(*assignments, name=name)
155 | 
156 |     def _do_use_weight_decay(self, param_name):
157 |         """Whether to use L2 weight decay for `param_name`."""
158 |         if not self.weight_decay_rate:
159 |             return False
160 |         if self.exclude_from_weight_decay:
161 |             for r in self.exclude_from_weight_decay:
162 |                 if re.search(r, param_name) is not None:
163 |                     return False
164 |         return True
165 | 
166 |     def _get_variable_name(self, param_name):
167 |         """Get the variable name from the tensor name."""
168 |         m = re.match("^(.*):\\d+$", param_name)
169 |         if m is not None:
170 |             param_name = m.group(1)
171 |         return param_name
172 | 


--------------------------------------------------------------------------------
/model/bert/tokenization.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """Tokenization classes."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import collections
 22 | import unicodedata
 23 | import six
 24 | import tensorflow as tf
 25 | 
 26 | 
 27 | def convert_to_unicode(text):
 28 |     """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
 29 |     if six.PY3:
 30 |         if isinstance(text, str):
 31 |             return text
 32 |         elif isinstance(text, bytes):
 33 |             return text.decode("utf-8", "ignore")
 34 |         else:
 35 |             raise ValueError("Unsupported string type: %s" % (type(text)))
 36 |     elif six.PY2:
 37 |         if isinstance(text, str):
 38 |             return text.decode("utf-8", "ignore")
 39 |         elif isinstance(text, unicode):
 40 |             return text
 41 |         else:
 42 |             raise ValueError("Unsupported string type: %s" % (type(text)))
 43 |     else:
 44 |         raise ValueError("Not running on Python2 or Python 3?")
 45 | 
 46 | 
 47 | def printable_text(text):
 48 |     """Returns text encoded in a way suitable for print or `tf.logging`."""
 49 | 
 50 |     # These functions want `str` for both Python2 and Python3, but in one case
 51 |     # it's a Unicode string and in the other it's a byte string.
 52 |     if six.PY3:
 53 |         if isinstance(text, str):
 54 |             return text
 55 |         elif isinstance(text, bytes):
 56 |             return text.decode("utf-8", "ignore")
 57 |         else:
 58 |             raise ValueError("Unsupported string type: %s" % (type(text)))
 59 |     elif six.PY2:
 60 |         if isinstance(text, str):
 61 |             return text
 62 |         elif isinstance(text, unicode):
 63 |             return text.encode("utf-8")
 64 |         else:
 65 |             raise ValueError("Unsupported string type: %s" % (type(text)))
 66 |     else:
 67 |         raise ValueError("Not running on Python2 or Python 3?")
 68 | 
 69 | 
 70 | def load_vocab(vocab_file):
 71 |     """Loads a vocabulary file into a dictionary."""
 72 |     vocab = collections.OrderedDict()
 73 |     index = 0
 74 |     with tf.gfile.GFile(vocab_file, "r") as reader:
 75 |         while True:
 76 |             token = convert_to_unicode(reader.readline())
 77 |             if not token:
 78 |                 break
 79 |             token = token.strip()
 80 |             vocab[token] = index
 81 |             index += 1
 82 |     return vocab
 83 | 
 84 | 
 85 | def convert_tokens_to_ids(vocab, tokens, unk_token="[UNK]"):
 86 |     """Converts a sequence of tokens into ids using the vocab."""
 87 |     ids = []
 88 |     for token in tokens:
 89 |         if token in vocab:
 90 |             ids.append(vocab[token])
 91 |         else:
 92 |             ids.append(vocab[unk_token])
 93 |     return ids
 94 | 
 95 | 
 96 | def whitespace_tokenize(text):
 97 |     """Runs basic whitespace cleaning and splitting on a peice of text."""
 98 |     text = text.strip()
 99 |     if not text:
100 |         return []
101 |     tokens = text.split()
102 |     return tokens
103 | 
104 | 
105 | class FullTokenizer(object):
106 |     """Runs end-to-end tokenziation."""
107 | 
108 |     def __init__(self, vocab_file, do_lower_case=True):
109 |         self.vocab = load_vocab(vocab_file)
110 |         self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
111 |         self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
112 | 
113 |     def tokenize(self, text):
114 |         split_tokens = []
115 |         for token in self.basic_tokenizer.tokenize(text):
116 |             for sub_token in self.wordpiece_tokenizer.tokenize(token):
117 |                 split_tokens.append(sub_token)
118 | 
119 |         return split_tokens
120 | 
121 |     def convert_tokens_to_ids(self, tokens):
122 |         return convert_tokens_to_ids(self.vocab, tokens)
123 | 
124 | 
125 | class CharTokenizer(object):
126 |     """Runs end-to-end tokenziation."""
127 | 
128 |     def __init__(self, vocab_file, do_lower_case=True):
129 |         self.vocab = load_vocab(vocab_file)
130 |         self.id2vocab = {v:k for k, v in self.vocab.items()}
131 |         self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
132 |         self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
133 | 
134 |     def tokenize(self, text):
135 |         split_tokens = []
136 |         for token in self.basic_tokenizer.tokenize(text):
137 |             for sub_token in token:
138 |                 split_tokens.append(sub_token)
139 | 
140 |         return split_tokens
141 | 
142 |     def convert_tokens_to_ids(self, tokens):
143 |         return convert_tokens_to_ids(self.vocab, tokens)
144 | 
145 | 
146 | class BasicTokenizer(object):
147 |     """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
148 | 
149 |     def __init__(self, do_lower_case=True):
150 |         """Constructs a BasicTokenizer.
151 | 
152 |         Args:
153 |           do_lower_case: Whether to lower case the input.
154 |         """
155 |         self.do_lower_case = do_lower_case
156 | 
157 |     def tokenize(self, text):
158 |         """Tokenizes a piece of text."""
159 |         text = convert_to_unicode(text)
160 |         text = self._clean_text(text)
161 | 
162 |         # This was added on November 1st, 2018 for the multilingual and Chinese
163 |         # models. This is also applied to the English models now, but it doesn't
164 |         # matter since the English models were not trained on any Chinese data
165 |         # and generally don't have any Chinese data in them (there are Chinese
166 |         # characters in the vocabulary because Wikipedia does have some Chinese
167 |         # words in the English Wikipedia.).
168 |         text = self._tokenize_chinese_chars(text)
169 | 
170 |         orig_tokens = whitespace_tokenize(text)
171 |         split_tokens = []
172 |         for token in orig_tokens:
173 |             if self.do_lower_case:
174 |                 token = token.lower()
175 |                 token = self._run_strip_accents(token)
176 |             split_tokens.extend(self._run_split_on_punc(token))
177 | 
178 |         output_tokens = whitespace_tokenize(" ".join(split_tokens))
179 |         return output_tokens
180 | 
181 |     def _run_strip_accents(self, text):
182 |         """Strips accents from a piece of text."""
183 |         text = unicodedata.normalize("NFD", text)
184 |         output = []
185 |         for char in text:
186 |             cat = unicodedata.category(char)
187 |             if cat == "Mn":
188 |                 continue
189 |             output.append(char)
190 |         return "".join(output)
191 | 
192 |     def _run_split_on_punc(self, text):
193 |         """Splits punctuation on a piece of text."""
194 |         chars = list(text)
195 |         i = 0
196 |         start_new_word = True
197 |         output = []
198 |         while i < len(chars):
199 |             char = chars[i]
200 |             if _is_punctuation(char):
201 |                 output.append([char])
202 |                 start_new_word = True
203 |             else:
204 |                 if start_new_word:
205 |                     output.append([])
206 |                 start_new_word = False
207 |                 output[-1].append(char)
208 |             i += 1
209 | 
210 |         return ["".join(x) for x in output]
211 | 
212 |     def _tokenize_chinese_chars(self, text):
213 |         """Adds whitespace around any CJK character."""
214 |         output = []
215 |         for char in text:
216 |             cp = ord(char)
217 |             if self._is_chinese_char(cp):
218 |                 output.append(" ")
219 |                 output.append(char)
220 |                 output.append(" ")
221 |             else:
222 |                 output.append(char)
223 |         return "".join(output)
224 | 
225 |     def _is_chinese_char(self, cp):
226 |         """Checks whether CP is the codepoint of a CJK character."""
227 |         # This defines a "chinese character" as anything in the CJK Unicode block:
228 |         #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
229 |         #
230 |         # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
231 |         # despite its name. The modern Korean Hangul alphabet is a different block,
232 |         # as is Japanese Hiragana and Katakana. Those alphabets are used to write
233 |         # space-separated words, so they are not treated specially and handled
234 |         # like the all of the other languages.
235 |         if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
236 |                 (cp >= 0x3400 and cp <= 0x4DBF) or  #
237 |                 (cp >= 0x20000 and cp <= 0x2A6DF) or  #
238 |                 (cp >= 0x2A700 and cp <= 0x2B73F) or  #
239 |                 (cp >= 0x2B740 and cp <= 0x2B81F) or  #
240 |                 (cp >= 0x2B820 and cp <= 0x2CEAF) or
241 |                 (cp >= 0xF900 and cp <= 0xFAFF) or  #
242 |                 (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
243 |             return True
244 | 
245 |         return False
246 | 
247 |     def _clean_text(self, text):
248 |         """Performs invalid character removal and whitespace cleanup on text."""
249 |         output = []
250 |         for char in text:
251 |             cp = ord(char)
252 |             if cp == 0 or cp == 0xfffd or _is_control(char):
253 |                 continue
254 |             if _is_whitespace(char):
255 |                 output.append(" ")
256 |             else:
257 |                 output.append(char)
258 |         return "".join(output)
259 | 
260 | 
261 | class WordpieceTokenizer(object):
262 |     """Runs WordPiece tokenziation."""
263 | 
264 |     def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
265 |         self.vocab = vocab
266 |         self.unk_token = unk_token
267 |         self.max_input_chars_per_word = max_input_chars_per_word
268 | 
269 |     def tokenize(self, text):
270 |         """Tokenizes a piece of text into its word pieces.
271 | 
272 |         This uses a greedy longest-match-first algorithm to perform tokenization
273 |         using the given vocabulary.
274 | 
275 |         For example:
276 |           input = "unaffable"
277 |           output = ["un", "##aff", "##able"]
278 | 
279 |         Args:
280 |           text: A single token or whitespace separated tokens. This should have
281 |             already been passed through `BasicTokenizer.
282 | 
283 |         Returns:
284 |           A list of wordpiece tokens.
285 |         """
286 | 
287 |         text = convert_to_unicode(text)
288 | 
289 |         output_tokens = []
290 |         for token in whitespace_tokenize(text):
291 |             chars = list(token)
292 |             if len(chars) > self.max_input_chars_per_word:
293 |                 output_tokens.append(self.unk_token)
294 |                 continue
295 | 
296 |             is_bad = False
297 |             start = 0
298 |             sub_tokens = []
299 |             while start < len(chars):
300 |                 end = len(chars)
301 |                 cur_substr = None
302 |                 while start < end:
303 |                     substr = "".join(chars[start:end])
304 |                     if start > 0:
305 |                         substr = "##" + substr
306 |                     if substr in self.vocab:
307 |                         cur_substr = substr
308 |                         break
309 |                     end -= 1
310 |                 if cur_substr is None:
311 |                     is_bad = True
312 |                     break
313 |                 sub_tokens.append(cur_substr)
314 |                 start = end
315 | 
316 |             if is_bad:
317 |                 output_tokens.append(self.unk_token)
318 |             else:
319 |                 output_tokens.extend(sub_tokens)
320 |         return output_tokens
321 | 
322 | 
323 | def _is_whitespace(char):
324 |     """Checks whether `chars` is a whitespace character."""
325 |     # \t, \n, and \r are technically contorl characters but we treat them
326 |     # as whitespace since they are generally considered as such.
327 |     if char == " " or char == "\t" or char == "\n" or char == "\r":
328 |         return True
329 |     cat = unicodedata.category(char)
330 |     if cat == "Zs":
331 |         return True
332 |     return False
333 | 
334 | 
335 | def _is_control(char):
336 |     """Checks whether `chars` is a control character."""
337 |     # These are technically control characters but we count them as whitespace
338 |     # characters.
339 |     if char == "\t" or char == "\n" or char == "\r":
340 |         return False
341 |     cat = unicodedata.category(char)
342 |     if cat.startswith("C"):
343 |         return True
344 |     return False
345 | 
346 | 
347 | def _is_punctuation(char):
348 |     """Checks whether `chars` is a punctuation character."""
349 |     cp = ord(char)
350 |     # We treat all non-letter/number ASCII as punctuation.
351 |     # Characters such as "^", "$", and "`" are not in the Unicode
352 |     # Punctuation class but we treat them as punctuation anyways, for
353 |     # consistency.
354 |     if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
355 |             (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
356 |         return True
357 |     cat = unicodedata.category(char)
358 |     if cat.startswith("P"):
359 |         return True
360 |     return False
361 | 


--------------------------------------------------------------------------------
/model/bert_classifier.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # encoding=utf-8
  3 | '''
  4 | @Time    :   2020/10/17 11:38:00
  5 | @Author  :   zhiyang.zzy
  6 | @Contact :   zhiyangchou@gmail.com
  7 | @Desc    :   使用bert做分类。
  8 | 1. 对于sentence pair，直接将两个句子输入，然后用sep分割输入，然后使用cls的输出作为类别预测的输入。
  9 | '''
 10 | 
 11 | # here put the import lib
 12 | import time
 13 | import numpy as np
 14 | import tensorflow as tf
 15 | import random
 16 | import paddlehub as hub
 17 | from sklearn.metrics import accuracy_score
 18 | import math
 19 | from keras.layers import Dense, Subtract, Lambda
 20 | import keras.backend as K
 21 | from keras.regularizers import l2
 22 | import nni
 23 | 
 24 | import data_input
 25 | from config import Config
 26 | from .base_model import BaseModel
 27 | 
 28 | random.seed(9102)
 29 | 
 30 | 
 31 | def cosine_similarity(a, b):
 32 |     c = tf.sqrt(tf.reduce_sum(tf.multiply(a, a), axis=1))
 33 |     d = tf.sqrt(tf.reduce_sum(tf.multiply(b, b), axis=1))
 34 |     e = tf.reduce_sum(tf.multiply(a, b), axis=1)
 35 |     f = tf.multiply(c, d)
 36 |     r = tf.divide(e, f)
 37 |     return r
 38 | 
 39 | 
 40 | def variable_summaries(var, name):
 41 |     """Attach a lot of summaries to a Tensor."""
 42 |     with tf.name_scope('summaries'):
 43 |         mean = tf.reduce_mean(var)
 44 |         tf.summary.scalar('mean/' + name, mean)
 45 |         with tf.name_scope('stddev'):
 46 |             stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean)))
 47 |         tf.summary.scalar('sttdev/' + name, stddev)
 48 |         tf.summary.scalar('max/' + name, tf.reduce_max(var))
 49 |         tf.summary.scalar('min/' + name, tf.reduce_min(var))
 50 |         tf.summary.histogram(name, var)
 51 | 
 52 | class BertClassifier(BaseModel):
 53 |     def __init__(self, cfg, is_training=1):
 54 |         super(BertClassifier, self).__init__(cfg, is_training)
 55 |     pass
 56 | 
 57 |     def add_placeholder(self):
 58 |         # 预测时只用输入query即可，将其embedding为向量。
 59 |         self.q_ids = tf.placeholder(
 60 |             tf.int32, shape=[None, None], name='query_batch')
 61 |         self.q_mask_ids = tf.placeholder(
 62 |             tf.int32, shape=[None, None], name='q_mask_ids')
 63 |         self.q_seg_ids = tf.placeholder(
 64 |             tf.int32, shape=[None, None], name='q_seg_ids')
 65 |         self.q_seq_length = tf.placeholder(
 66 |             tf.int32, shape=[None], name='query_sequence_length')
 67 |         self.is_train_place = tf.placeholder(
 68 |             dtype=tf.bool, name='is_train_place')
 69 |         # label
 70 |         self.sim_labels = tf.placeholder(
 71 |             tf.float32, shape=[None], name="sim_labels")
 72 | 
 73 |     def forward(self):
 74 |         # 获取cls的输出
 75 |         q_emb, _, self.q_e = self.share_bert_layer(
 76 |             self.is_train_place, self.q_ids, self.q_mask_ids, self.q_seg_ids, use_bert_pre=1)
 77 |         predict_prob = Dense(units=1, activation='sigmoid')(q_emb)
 78 |         self.predict_prob = tf.reshape(predict_prob, [-1])
 79 |         self.predict_idx = tf.cast(tf.greater_equal(predict_prob, 0.5), tf.int32)
 80 |         with tf.name_scope('Loss'):
 81 |             # Train Loss
 82 |             loss = tf.losses.log_loss(self.sim_labels, self.predict_prob)
 83 |             self.loss = tf.reduce_mean(loss)
 84 |             tf.summary.scalar('loss', self.loss)
 85 | 
 86 |     def build(self):
 87 |         self.add_placeholder()
 88 |         self.forward()
 89 |         self.add_train_op(self.cfg['optimizer'],
 90 |                           self.cfg['learning_rate'], self.loss)
 91 |         self._init_session()
 92 |         self._add_summary()
 93 |         pass
 94 | 
 95 |     def feed_batch(self, out_ids1, m_ids1, seg_ids1, seq_len1, label=None, is_test=0):
 96 |         is_train = 0 if is_test else 1
 97 |         fd = {
 98 |             self.q_ids: out_ids1, self.q_mask_ids: m_ids1,
 99 |             self.q_seg_ids: seg_ids1,
100 |             self.q_seq_length: seq_len1,
101 |             self.is_train_place: is_train}
102 |         if label:
103 |             fd[self.sim_labels] = label
104 |         return fd
105 | 
106 |     def run_epoch(self, epoch, d_train, d_val):
107 |         steps = int(math.ceil(float(len(d_train)) / self.cfg['batch_size']))
108 |         progbar = tf.keras.utils.Progbar(steps)
109 |         # 每个 epoch 分batch训练
110 |         batch_iter = data_input.get_batch(
111 |             d_train, batch_size=self.cfg['batch_size'])
112 |         for i, (out_ids1, m_ids1, seg_ids1, seq_len1, label) in enumerate(batch_iter):
113 |             fd = self.feed_batch(out_ids1, m_ids1, seg_ids1, seq_len1, label)
114 |             # a = self.sess.run([self.is_train_place, self.q_e], feed_dict=fd)
115 |             _, cur_loss = self.sess.run(
116 |                 [self.train_op, self.loss], feed_dict=fd)
117 |             progbar.update(i + 1, [("loss", cur_loss)])
118 |         # 训练完一个epoch之后，使用验证集评估，然后预测， 然后评估准确率
119 |         dev_acc = self.eval(d_val)
120 |         nni.report_intermediate_result(dev_acc)
121 |         print("dev set acc:", dev_acc)
122 |         return dev_acc
123 | 
124 |     def eval(self, test_data):
125 |         pbar = data_input.get_batch(
126 |             test_data, batch_size=self.cfg['batch_size'], is_test=1)
127 |         val_label, val_pred = [], []
128 |         for (out_ids1, m_ids1, seg_ids1, seq_len1, label) in pbar:
129 |             val_label.extend(label)
130 |             fd = self.feed_batch(out_ids1, m_ids1, seg_ids1, seq_len1, is_test=1)
131 |             pred_labels, pred_prob = self.sess.run(
132 |                 [self.predict_idx, self.predict_prob], feed_dict=fd)
133 |             val_pred.extend(pred_labels)
134 |         test_acc = accuracy_score(val_label, val_pred)
135 |         return test_acc
136 | 
137 |     def predict(self, test_data):
138 |         pbar = data_input.get_batch(
139 |             test_data, batch_size=self.cfg['batch_size'], is_test=1)
140 |         val_pred, val_prob = [], []
141 |         for (t1_ids, t1_len, t2_ids, t2_len) in pbar:
142 |             fd = self.feed_batch(t1_ids, t1_len, t2_ids, t2_len, is_test=1)
143 |             pred_labels, pred_prob = self.sess.run(
144 |                 [self.predict_idx, self.predict_prob], feed_dict=fd)
145 |             val_pred.extend(pred_labels)
146 |             val_prob.extend(pred_prob)
147 |         return val_pred, val_prob
148 | 
149 | 
150 | if __name__ == "__main__":
151 |     start = time.time()
152 |     # 读取配置
153 |     conf = Config()
154 |     # 读取数据
155 |     dataset = hub.dataset.LCQMC()
156 |     data_train, data_val, data_test = data_input.get_lcqmc()
157 |     # data_train = data_train[:10000]
158 |     print("train size:{},val size:{}, test size:{}".format(
159 |         len(data_train), len(data_val), len(data_test)))
160 |     model = SiamenseRNN(conf)
161 |     model.fit(data_train, data_val, data_test)
162 |     pass
163 | 


--------------------------------------------------------------------------------
/model/siamese_network.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # encoding=utf-8
  3 | '''
  4 | @Time    :   2020/10/17 11:38:00
  5 | @Author  :   zhiyang.zzy
  6 | @Contact :   zhiyangchou@gmail.com
  7 | @Desc    :   siamense network, 使用曼哈顿距离、cos相似度进行实验。
  8 | 1. 使用预训练词向量。2. 使用lcqmc数据集实验。3. 添加预测。
  9 | todo: add triplet loss
 10 | '''
 11 | 
 12 | # here put the import lib
 13 | from os import name
 14 | import time
 15 | import numpy as np
 16 | import tensorflow as tf
 17 | import random
 18 | import paddlehub as hub
 19 | from sklearn.metrics import accuracy_score
 20 | import math
 21 | from keras.layers import Dense, Subtract, Lambda
 22 | import keras.backend as K
 23 | from keras.regularizers import l2
 24 | 
 25 | import data_input
 26 | from config import Config
 27 | from .base_model import BaseModel
 28 | 
 29 | random.seed(9102)
 30 | 
 31 | 
 32 | def cosine_similarity(a, b):
 33 |     c = tf.sqrt(tf.reduce_sum(tf.multiply(a, a), axis=1))
 34 |     d = tf.sqrt(tf.reduce_sum(tf.multiply(b, b), axis=1))
 35 |     e = tf.reduce_sum(tf.multiply(a, b), axis=1)
 36 |     f = tf.multiply(c, d)
 37 |     r = tf.divide(e, f)
 38 |     return r
 39 | 
 40 | def siamese_loss(out1,out2,y,Q=5):
 41 |     # 使用欧式距离，概率使用e^{-x}
 42 |     Q = tf.constant(Q, name="Q",dtype=tf.float32)
 43 |     E_w = tf.sqrt(tf.reduce_sum(tf.square(out1-out2),1))   
 44 |     pos = tf.multiply(tf.multiply(y,2/Q),tf.square(E_w))
 45 |     neg = tf.multiply(tf.multiply(1-y,2*Q),tf.exp(-2.77/Q*E_w))                
 46 |     loss = pos + neg
 47 |     loss = tf.reduce_mean(loss)
 48 |     prob = tf.exp(-E_w)
 49 |     return loss, prob
 50 | 
 51 | def variable_summaries(var, name):
 52 |     """Attach a lot of summaries to a Tensor."""
 53 |     with tf.name_scope('summaries'):
 54 |         mean = tf.reduce_mean(var)
 55 |         tf.summary.scalar('mean/' + name, mean)
 56 |         with tf.name_scope('stddev'):
 57 |             stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean)))
 58 |         tf.summary.scalar('sttdev/' + name, stddev)
 59 |         tf.summary.scalar('max/' + name, tf.reduce_max(var))
 60 |         tf.summary.scalar('min/' + name, tf.reduce_min(var))
 61 |         tf.summary.histogram(name, var)
 62 | 
 63 | 
 64 | class SiamenseRNN(BaseModel):
 65 |     def __init__(self, cfg, is_training=1):
 66 |         # config来自于yml, 或者config.py 文件。
 67 |         self.cfg = cfg
 68 |         # if not is_training: dropout=0
 69 |         self.is_training = is_training
 70 |         if not is_training:
 71 |             self.cfg['dropout'] = 0
 72 |         self.build()
 73 |         pass
 74 |     pass
 75 | 
 76 |     def share_encoder(self, query_batch, query_seq_length, keep_prob_place):
 77 |         with tf.variable_scope('word_embeddings_layer', reuse=tf.AUTO_REUSE):
 78 |             # 这里可以加载预训练词向量
 79 |             _word_embedding = tf.get_variable(name="word_embedding_arr", dtype=tf.float32,
 80 |                                               shape=[self.cfg['nwords'], self.cfg['word_dim']])
 81 |             query_embed = tf.nn.embedding_lookup(
 82 |                 _word_embedding, query_batch, name='query_batch_embed')
 83 |         with tf.variable_scope('RNN', reuse=tf.AUTO_REUSE):
 84 |             # Abandon bag of words, use GRU, you can use stacked gru
 85 |             cell_fw = tf.contrib.rnn.GRUCell(
 86 |                 self.cfg['hidden_size_rnn'], reuse=tf.AUTO_REUSE)   # , reuse=tf.AUTO_REUSE
 87 |             cell_bw = tf.contrib.rnn.GRUCell(
 88 |                 self.cfg['hidden_size_rnn'], reuse=tf.AUTO_REUSE)
 89 |             # query
 90 |             (_, _), (query_output_fw, query_output_bw) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, query_embed,
 91 |                                                                                          sequence_length=query_seq_length,
 92 |                                                                                          dtype=tf.float32)
 93 |             query_rnn_output = tf.concat(
 94 |                 [query_output_fw, query_output_bw], axis=-1)
 95 |             query_rnn_output = tf.nn.dropout(query_rnn_output, keep_prob_place)
 96 |             # TODO： 使用mean pooling， 或者self attention 来代替最后一个states
 97 |         return query_rnn_output
 98 | 
 99 |     def cos_sim(self, query_rnn_output, doc_rnn_output):
100 |         with tf.name_scope('Cosine_Similarity'):
101 |             # Cosine similarity
102 |             # query_norm = sqrt(sum(each x^2))
103 |             query_norm = tf.sqrt(tf.reduce_sum(tf.square(query_rnn_output), 1))
104 |             # doc_norm = sqrt(sum(each x^2))
105 |             doc_norm = tf.sqrt(tf.reduce_sum(tf.square(doc_rnn_output), 1))
106 | 
107 |             # 内积
108 |             prod = tf.reduce_sum(tf.multiply(
109 |                 query_rnn_output, doc_rnn_output), axis=1)
110 |             # 模相乘
111 |             mul = tf.multiply(query_norm, doc_norm)
112 |             # cos_sim_raw = query * doc / (||query|| * ||doc||)
113 |             # cos_sim_raw = tf.truediv(prod, tf.multiply(query_norm, doc_norm))
114 |             cos_sim_raw = tf.divide(prod, mul)
115 |             predict_prob = tf.sigmoid(cos_sim_raw)
116 |             predict_idx = tf.cast(tf.greater_equal(
117 |                 predict_prob, 0.5), tf.int32)
118 |         return predict_prob, predict_idx
119 | 
120 |     def l1_distance(self, query_rnn_output, doc_rnn_output):
121 |         l1_distance_layer = Lambda(
122 |             lambda tensors: K.abs(tensors[0] - tensors[1]))
123 |         l1_distance = l1_distance_layer([query_rnn_output, doc_rnn_output])
124 |         l1_distance = tf.concat([l1_distance, query_rnn_output, doc_rnn_output], axis=-1)
125 |         predict_prob = Dense(units=1, activation='sigmoid')(l1_distance)
126 |         # bs * 1
127 |         predict_prob = tf.reshape(predict_prob, [-1])
128 |         predict_idx = tf.cast(tf.greater_equal(predict_prob, 0.5), tf.int32)
129 |         return predict_prob, predict_idx
130 | 
131 |     def forward(self):
132 |         # 共享的encode来编码query
133 |         query_rnn_output = self.share_encoder(
134 |             self.query_batch, self.query_seq_length, self.keep_prob_place)
135 |         self.query_rnn_output = query_rnn_output
136 |         self.q_emb = query_rnn_output
137 |         doc_rnn_output = self.share_encoder(
138 |             self.doc_batch, self.doc_seq_length, self.keep_prob_place)
139 |         # 计算cos相似度：
140 |         # self.predict_prob, self.predict_idx = self.cos_sim(query_rnn_output, doc_rnn_output)
141 |         # 使用原文曼哈顿距离
142 |         self.predict_prob, self.predict_idx = self.l1_distance(
143 |             query_rnn_output, doc_rnn_output)
144 | 
145 |         with tf.name_scope('Loss'):
146 |             # Train Loss
147 |             # cross_entropy = -tf.reduce_mean(self.sim_labels * tf.log(tf.clip_by_value(self.predict_prob,1e-10,1.0))+(1-self.sim_labels) * tf.log(tf.clip_by_value(1-self.predict_prob,1e-10,1.0)))
148 |             loss = tf.losses.log_loss(self.sim_labels, self.predict_prob)
149 |             self.loss = tf.reduce_mean(loss)
150 |             tf.summary.scalar('loss', self.loss)
151 |         # with tf.name_scope('Accuracy'):
152 |         #     correct_prediction = tf.equal(tf.argmax(prob, 1), 0)
153 |         #     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
154 |         #     tf.summary.scalar('accuracy', accuracy)
155 | 
156 |     def add_placeholder(self):
157 |         with tf.name_scope('input'):
158 |             # 预测时只用输入query即可，将其embedding为向量。
159 |             self.query_batch = tf.placeholder(
160 |                 tf.int32, shape=[None, None], name='query_batch')
161 |             self.doc_batch = tf.placeholder(
162 |                 tf.int32, shape=[None, None], name='doc_batch')
163 |             self.query_seq_length = tf.placeholder(
164 |                 tf.int32, shape=[None], name='query_sequence_length')
165 |             self.doc_seq_length = tf.placeholder(
166 |                 tf.int32, shape=[None], name='doc_seq_length')
167 |             # label
168 |             self.sim_labels = tf.placeholder(
169 |                 tf.float32, shape=[None], name="sim_labels")
170 |             self.keep_prob_place = tf.placeholder(tf.float32, name='keep_prob')
171 | 
172 |     def build(self):
173 |         self.add_placeholder()
174 |         self.forward()
175 |         self.add_train_op(self.cfg['optimizer'],
176 |                           self.cfg['learning_rate'], self.loss)
177 |         self._init_session()
178 |         self._add_summary()
179 |         pass
180 | 
181 |     def feed_batch(self, t1_ids, t1_len, t2_ids, t2_len, label=None, is_test=0):
182 |         keep_porb = 1 if is_test else self.cfg['keep_porb']
183 |         fd = {
184 |             self.query_batch: t1_ids, self.doc_batch: t2_ids, self.query_seq_length: t1_len,
185 |             self.doc_seq_length: t2_len, self.keep_prob_place: keep_porb}
186 |         if label:
187 |             fd[self.sim_labels] = label
188 |         return fd
189 | 
190 |     def eval(self, test_data):
191 |         pbar = data_input.get_batch(
192 |             test_data, batch_size=self.cfg['batch_size'], is_test=1)
193 |         val_label, val_pred = [], []
194 |         for (t1_ids, t1_len, t2_ids, t2_len, label) in pbar:
195 |             val_label.extend(label)
196 |             fd = self.feed_batch(t1_ids, t1_len, t2_ids, t2_len, is_test=1)
197 |             pred_labels, pred_prob = self.sess.run(
198 |                 [self.predict_idx, self.predict_prob], feed_dict=fd)
199 |             val_pred.extend(pred_labels)
200 |         test_acc = accuracy_score(val_label, val_pred)
201 |         return test_acc
202 | 
203 |     def predict(self, test_data):
204 |         pbar = data_input.get_batch(
205 |             test_data, batch_size=self.cfg['batch_size'], is_test=1)
206 |         val_pred, val_prob = [], []
207 |         for (t1_ids, t1_len, t2_ids, t2_len) in pbar:
208 |             fd = self.feed_batch(t1_ids, t1_len, t2_ids, t2_len, is_test=1)
209 |             pred_labels, pred_prob = self.sess.run(
210 |                 [self.predict_idx, self.predict_prob], feed_dict=fd)
211 |             val_pred.extend(pred_labels)
212 |             val_prob.extend(pred_prob)
213 |         return val_pred, val_prob
214 | 
215 |     def run_epoch(self, epoch, data_train, data_val):
216 |         steps = int(math.ceil(float(len(data_train)) / self.cfg['batch_size']))
217 |         progbar = tf.keras.utils.Progbar(steps)
218 |         # 每个 epoch 分batch训练
219 |         batch_iter = data_input.get_batch(
220 |             data_train, batch_size=self.cfg['batch_size'])
221 |         for i, (t1_ids, t1_len, t2_ids, t2_len, label) in enumerate(batch_iter):
222 |             fd = self.feed_batch(t1_ids, t1_len, t2_ids, t2_len, label)
223 |             # a = sess.run([query_norm, doc_norm, prod, cos_sim_raw], feed_dict=fd)
224 |             _, cur_loss = self.sess.run(
225 |                 [self.train_op, self.loss], feed_dict=fd)
226 |             progbar.update(i + 1, [("loss", cur_loss)])
227 |         # 训练完一个epoch之后，使用验证集评估，然后预测， 然后评估准确率
228 |         dev_acc = self.eval(data_val)
229 |         print("dev set acc:", dev_acc)
230 |         return dev_acc
231 | 
232 | 
233 | class SiamenseBert(SiamenseRNN):
234 |     def __init__(self, cfg, is_training=1):
235 |         super(SiamenseBert, self).__init__(cfg, is_training)
236 |     pass
237 | 
238 |     def add_placeholder(self):
239 |         # 预测时只用输入query即可，将其embedding为向量。
240 |         self.q_ids = tf.placeholder(
241 |             tf.int32, shape=[None, None], name='query_batch')
242 |         self.q_mask_ids = tf.placeholder(
243 |             tf.int32, shape=[None, None], name='q_mask_ids')
244 |         self.q_seg_ids = tf.placeholder(
245 |             tf.int32, shape=[None, None], name='q_seg_ids')
246 |         self.q_seq_length = tf.placeholder(
247 |             tf.int32, shape=[None], name='query_sequence_length')
248 | 
249 |         self.d_ids = tf.placeholder(
250 |             tf.int32, shape=[None, None], name='doc_batch')
251 |         self.d_mask_ids = tf.placeholder(
252 |             tf.int32, shape=[None, None], name='d_mask_ids')
253 |         self.d_seg_ids = tf.placeholder(
254 |             tf.int32, shape=[None, None], name='d_seg_ids')
255 |         self.d_seq_length = tf.placeholder(
256 |             tf.int32, shape=[None], name='doc_seq_length')
257 |         self.is_train_place = tf.placeholder(
258 |             dtype=tf.bool, name='is_train_place')
259 |         # label
260 |         self.sim_labels = tf.placeholder(
261 |             tf.float32, shape=[None], name="sim_labels")
262 |         self.keep_prob_place = tf.placeholder(tf.float32, name='keep_prob')
263 |     def siamese_loss(self, out1, out2, y, Q=5.0):
264 |         Q = tf.constant(Q, dtype=tf.float32)
265 |         E_w = tf.sqrt(tf.reduce_sum(tf.square(out1-out2),1))   
266 |         pos = tf.multiply(tf.multiply(y,2/Q),tf.square(E_w))
267 |         neg = tf.multiply(tf.multiply(1-y,2*Q),tf.exp(-2.77/Q*E_w))                
268 |         loss = pos + neg                 
269 |         loss = tf.reduce_mean(loss)              
270 |         return loss
271 |     def contrastive_loss(self, model1, model2, y, margin=0.5):
272 |         with tf.name_scope("contrastive-loss"):
273 |             distance = tf.sqrt(tf.reduce_sum(tf.pow(model1 - model2, 2), 1, keepdims=True))
274 |             similarity = y * tf.square(distance)                                           # keep the similar label (1) close to each other
275 |             dissimilarity = (1 - y) * tf.square(tf.maximum((margin - distance), 0))        # give penalty to dissimilar label if the distance is bigger than margin
276 |             return tf.reduce_mean(dissimilarity + similarity) / 2
277 |     def forward(self):
278 |         # 获取cls的输出
279 |         q_emb, _, self.q_e = self.share_bert_layer(
280 |             self.is_train_place, self.q_ids, self.q_mask_ids, self.q_seg_ids, use_bert_pre=1)
281 |         d_emb, _, self.d_e = self.share_bert_layer(
282 |             self.is_train_place, self.d_ids, self.d_mask_ids, self.d_seg_ids, use_bert_pre=1)
283 |         self.q_emb = q_emb
284 |         # 计算cos相似度：
285 |         # self.predict_prob, self.predict_idx = self.cos_sim(q_emb, d_emb)
286 |         # 使用原文曼哈顿距离
287 |         self.predict_prob, self.predict_idx = self.l1_distance(q_emb, d_emb)
288 |         with tf.name_scope('Loss'):
289 |             # Train Loss
290 |             # cross_entropy = -tf.reduce_mean(self.sim_labels * tf.log(tf.clip_by_value(self.predict_prob,1e-10,1.0))+(1-self.sim_labels) * tf.log(tf.clip_by_value(1-self.predict_prob,1e-10,1.0)))
291 |             loss = tf.losses.log_loss(self.sim_labels, self.predict_prob)
292 |             self.loss = tf.reduce_mean(loss)
293 |             tf.summary.scalar('loss', self.loss)
294 | 
295 |     def build(self):
296 |         self.add_placeholder()
297 |         self.forward()
298 |         self.add_train_op(self.cfg['optimizer'],
299 |                           self.cfg['learning_rate'], self.loss)
300 |         self._init_session()
301 |         self._add_summary()
302 |         pass
303 | 
304 |     def feed_batch(self, out_ids1, m_ids1, seg_ids1, seq_len1, out_ids2, m_ids2, seg_ids2, seq_len2, label=None, is_test=0):
305 |         keep_porb = 1 if is_test else self.cfg['keep_porb']
306 |         is_train = 0 if is_test else 1
307 |         fd = {
308 |             self.q_ids: out_ids1, self.q_mask_ids: m_ids1,
309 |             self.q_seg_ids: seg_ids1,
310 |             self.q_seq_length: seq_len1,
311 |             self.d_ids: out_ids2,
312 |             self.d_mask_ids: m_ids2,
313 |             self.d_seg_ids: seg_ids2,
314 |             self.d_seq_length: seq_len2,
315 |             self.keep_prob_place: keep_porb,
316 |             self.is_train_place: is_train}
317 |         if label:
318 |             fd[self.sim_labels] = label
319 |         return fd
320 | 
321 |     def run_epoch(self, epoch, d_train, d_val):
322 |         steps = int(math.ceil(float(len(d_train)) / self.cfg['batch_size']))
323 |         progbar = tf.keras.utils.Progbar(steps)
324 |         # 每个 epoch 分batch训练
325 |         batch_iter = data_input.get_batch(
326 |             d_train, batch_size=self.cfg['batch_size'])
327 |         for i, (out_ids1, m_ids1, seg_ids1, seq_len1, out_ids2, m_ids2, seg_ids2, seq_len2, label) in enumerate(batch_iter):
328 |             fd = self.feed_batch(out_ids1, m_ids1, seg_ids1, seq_len1,
329 |                                  out_ids2, m_ids2, seg_ids2, seq_len2, label)
330 |             # a = self.sess.run([self.q_emb1, self.q_e, self.d_e], feed_dict=fd)
331 |             _, cur_loss = self.sess.run(
332 |                 [self.train_op, self.loss], feed_dict=fd)
333 |             progbar.update(i + 1, [("loss", cur_loss)])
334 |         # 训练完一个epoch之后，使用验证集评估，然后预测， 然后评估准确率
335 |         dev_acc = self.eval(d_val)
336 |         print("dev set acc:", dev_acc)
337 |         return dev_acc
338 | 
339 |     def eval(self, test_data):
340 |         pbar = data_input.get_batch(
341 |             test_data, batch_size=self.cfg['batch_size'], is_test=1)
342 |         val_label, val_pred = [], []
343 |         for (out_ids1, m_ids1, seg_ids1, seq_len1, out_ids2, m_ids2, seg_ids2, seq_len2, label) in pbar:
344 |             val_label.extend(label)
345 |             fd = self.feed_batch(out_ids1, m_ids1, seg_ids1, seq_len1, out_ids2, m_ids2, seg_ids2, seq_len2, is_test=1)
346 |             pred_labels, pred_prob = self.sess.run(
347 |                 [self.predict_idx, self.predict_prob], feed_dict=fd)
348 |             val_pred.extend(pred_labels)
349 |         test_acc = accuracy_score(val_label, val_pred)
350 |         return test_acc
351 | 
352 |     def predict(self, test_data):
353 |         pbar = data_input.get_batch(
354 |             test_data, batch_size=self.cfg['batch_size'], is_test=1)
355 |         val_pred, val_prob = [], []
356 |         for (t1_ids, t1_len, t2_ids, t2_len) in pbar:
357 |             fd = self.feed_batch(t1_ids, t1_len, t2_ids, t2_len, is_test=1)
358 |             pred_labels, pred_prob = self.sess.run(
359 |                 [self.predict_idx, self.predict_prob], feed_dict=fd)
360 |             val_pred.extend(pred_labels)
361 |             val_prob.extend(pred_prob)
362 |         return val_pred, val_prob
363 |         
364 |     def predict_embedding(self, test_data):
365 |         pbar = data_input.get_batch(
366 |             test_data, batch_size=self.cfg['batch_size'], is_test=1)
367 |         val_embed = []
368 |         for (out_ids1, m_ids1, seg_ids1, seq_len1) in pbar:
369 |             fd = {
370 |                 self.q_ids: out_ids1, self.q_mask_ids: m_ids1,
371 |                 self.q_seg_ids: seg_ids1,
372 |                 self.q_seq_length: seq_len1,
373 |                 self.keep_prob_place: 1,
374 |                 self.is_train_place: 0
375 |             }
376 |             pred_embedding = self.sess.run(self.q_emb, feed_dict=fd)
377 |             val_embed.extend(pred_embedding)
378 |         return val_embed
379 | 
380 | 
381 | if __name__ == "__main__":
382 |     start = time.time()
383 |     # 读取配置
384 |     conf = Config()
385 |     # 读取数据
386 |     dataset = hub.dataset.LCQMC()
387 |     data_train, data_val, data_test = data_input.get_lcqmc()
388 |     # data_train = data_train[:10000]
389 |     print("train size:{},val size:{}, test size:{}".format(
390 |         len(data_train), len(data_val), len(data_test)))
391 |     model = SiamenseRNN(conf)
392 |     model.fit(data_train, data_val, data_test)
393 |     pass
394 | 


--------------------------------------------------------------------------------
/multi_view_dssm_v3.py:
--------------------------------------------------------------------------------
  1 | # coding=utf8
  2 | """
  3 | python=3.5
  4 | TensorFlow=1.2.1
  5 | """
  6 | 
  7 | import pandas as pd
  8 | from scipy import sparse
  9 | import collections
 10 | import random
 11 | import time
 12 | import numpy as np
 13 | import tensorflow as tf
 14 | import multi_view_data_input
 15 | 
 16 | flags = tf.app.flags
 17 | FLAGS = flags.FLAGS
 18 | 
 19 | flags.DEFINE_string('summaries_dir', 'Summaries', 'Summaries directory')
 20 | flags.DEFINE_float('learning_rate', 0.05, 'Initial learning rate.')
 21 | flags.DEFINE_integer('max_steps', 800000, 'Number of steps to run trainer.')
 22 | flags.DEFINE_integer('epoch_steps', 200, "Number of steps in one epoch.")
 23 | flags.DEFINE_integer('test_pack_size', 3185, "Number of steps in one epoch.")
 24 | flags.DEFINE_bool('gpu', 0, "Enable GPU or not")
 25 | 
 26 | start = time.time()
 27 | # user feature维度
 28 | user_dimension = 17309
 29 | # 负样本个数
 30 | NEG = 4
 31 | # positive batch size
 32 | user_BS = 100
 33 | # batch size
 34 | # BS = user_BS * (NEG + 1)
 35 | # 第1层网络的单元数目
 36 | L1_N = 400
 37 | # 第1层网络的单元数目
 38 | L2_N = 120
 39 | 
 40 | # 读取数据
 41 | # train_size, test_size = 1000000, 100000
 42 | # data_sets = multi_view_data_input.load_data()
 43 | data_sets = multi_view_data_input.get_data()
 44 | user_dimension = data_sets.TRIGRAM_D
 45 | # view1维度
 46 | view1_dimension = data_sets.app_number
 47 | view2_dimension = data_sets.music_number
 48 | view3_dimension = data_sets.novel_number
 49 | # view1 训练集大小
 50 | view1_size = data_sets.app_his.shape[0]
 51 | view2_size = data_sets.music_his.shape[0]
 52 | view3_size = data_sets.novel_his.shape[0]
 53 | total_size = view1_size + view2_size + view3_size
 54 | # view1 测试集大小
 55 | view1_size_test = data_sets.app_his_test.shape[0]
 56 | view2_size_test = data_sets.music_his_test.shape[0]
 57 | view3_size_test = data_sets.novel_his_test.shape[0]
 58 | # 测试集package size
 59 | flags.test_pack_size = int((view1_size_test + view2_size_test + view3_size_test) / user_BS)
 60 | 
 61 | 
 62 | def batch_normalization(x, phase_train, out_size):
 63 |     """
 64 |     Batch normalization on convolutional maps.
 65 |     Ref.: http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow
 66 |     Args:
 67 |         x:           Tensor, 4D BHWD input maps
 68 |         out_size:       integer, depth of input maps
 69 |         phase_train: boolean tf.Varialbe, true indicates training phase
 70 |         scope:       string, variable scope
 71 |     Return:
 72 |         normed:      batch-normalized maps
 73 |     """
 74 |     with tf.variable_scope('bn'):
 75 |         beta = tf.Variable(tf.constant(0.0, shape=[out_size]),
 76 |                            name='beta', trainable=True)
 77 |         gamma = tf.Variable(tf.constant(1.0, shape=[out_size]),
 78 |                             name='gamma', trainable=True)
 79 |         batch_mean, batch_var = tf.nn.moments(x, [0], name='moments')
 80 |         ema = tf.train.ExponentialMovingAverage(decay=0.5)
 81 | 
 82 |         def mean_var_with_update():
 83 |             ema_apply_op = ema.apply([batch_mean, batch_var])
 84 |             with tf.control_dependencies([ema_apply_op]):
 85 |                 return tf.identity(batch_mean), tf.identity(batch_var)
 86 | 
 87 |         mean, var = tf.cond(phase_train,
 88 |                             mean_var_with_update,
 89 |                             lambda: (ema.average(batch_mean), ema.average(batch_var)))
 90 |         normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
 91 |     return normed
 92 | 
 93 | 
 94 | def variable_summaries(var, name):
 95 |     """Attach a lot of summaries to a Tensor."""
 96 |     with tf.name_scope('summaries'):
 97 |         mean = tf.reduce_mean(var)
 98 |         tf.summary.scalar('mean/' + name, mean)
 99 |         with tf.name_scope('stddev'):
100 |             stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean)))
101 |         tf.summary.scalar('sttdev/' + name, stddev)
102 |         tf.summary.scalar('max/' + name, tf.reduce_max(var))
103 |         tf.summary.scalar('min/' + name, tf.reduce_min(var))
104 |         tf.summary.histogram(name, var)
105 | 
106 | 
107 | with tf.name_scope('input'):
108 |     user_batch = tf.sparse_placeholder(tf.float32, shape=[None, user_dimension], name='user_batch')
109 |     view1_batch = tf.sparse_placeholder(tf.float32, shape=[None, view1_dimension], name='view1_batch')
110 |     view2_batch = tf.sparse_placeholder(tf.float32, shape=[None, view2_dimension], name='view2_batch')
111 |     view3_batch = tf.sparse_placeholder(tf.float32, shape=[None, view3_dimension], name='view3_batch')
112 |     active_view = tf.placeholder(tf.int32, name='active_view_number')
113 |     on_train = tf.placeholder(tf.bool)
114 | 
115 | with tf.name_scope('User_View'):
116 |     with tf.name_scope('User_FC1'):
117 |         user_fc1_par_range = np.sqrt(6.0 / (user_dimension + L1_N))
118 |         user_weight1 = tf.Variable(tf.random_uniform([user_dimension, L1_N], -user_fc1_par_range, user_fc1_par_range))
119 |         user_bias1 = tf.Variable(tf.random_uniform([L1_N], -user_fc1_par_range, user_fc1_par_range))
120 |         # variable_summaries(user_weight1, 'L1_weights')
121 |         # variable_summaries(user_bias1, 'L1_biases')
122 | 
123 |         user_l1 = tf.sparse_tensor_dense_matmul(user_batch, user_weight1) + user_bias1
124 |         user_l1_out = tf.nn.relu(user_l1)
125 | 
126 |     with tf.name_scope('User_FC2'):
127 |         user_fc2_par_range = np.sqrt(6.0 / (L1_N + L2_N))
128 |         user_weight2 = tf.Variable(tf.random_uniform([L1_N, L2_N], -user_fc2_par_range, user_fc2_par_range))
129 |         user_bias2 = tf.Variable(tf.random_uniform([L2_N], -user_fc2_par_range, user_fc2_par_range))
130 |         # variable_summaries(user_weight2, 'L2_weights')
131 |         # variable_summaries(user_bias2, 'L2_biases')
132 | 
133 |         user_l2 = tf.matmul(user_l1_out, user_weight2) + user_bias2
134 |         user_y = tf.nn.relu(user_l2)
135 | 
136 | with tf.name_scope('Item_view1'):
137 |     with tf.name_scope('Item_FC1'):
138 |         view1_fc1_par_range = np.sqrt(6.0 / (view1_dimension + L1_N))
139 |         view1_weight1 = tf.Variable(tf.random_uniform([view1_dimension, L1_N], -view1_fc1_par_range, view1_fc1_par_range))
140 |         view1_bias1 = tf.Variable(tf.random_uniform([L1_N], -view1_fc1_par_range, view1_fc1_par_range))
141 |         # variable_summaries(item_weight1, 'L1_weights')
142 |         # variable_summaries(item_bias1, 'L1_biases')
143 |         view1_positive_l1 = tf.sparse_tensor_dense_matmul(view1_batch, view1_weight1) + view1_bias1
144 |         view1_positive_l1_out = tf.nn.relu(view1_positive_l1)
145 | 
146 |     with tf.name_scope('Item_FC2'):
147 |         view1_fc2_par_range = np.sqrt(6.0 / (L1_N + L2_N))
148 |         view1_weight2 = tf.Variable(tf.random_uniform([L1_N, L2_N], -view1_fc2_par_range, view1_fc2_par_range))
149 |         view1_bias2 = tf.Variable(tf.random_uniform([L2_N], -view1_fc2_par_range, view1_fc2_par_range))
150 |         # variable_summaries(item_weight2, 'L2_weights')
151 |         # variable_summaries(item_bias2, 'L2_biases')
152 | 
153 |         view1_positive_l2 = tf.matmul(view1_positive_l1_out, view1_weight2) + view1_bias2
154 |         view1_positive_y = tf.nn.relu(view1_positive_l2)
155 | 
156 | with tf.name_scope('Item_view2'):
157 |     with tf.name_scope('Item_FC1'):
158 |         view2_fc1_par_range = np.sqrt(6.0 / (view2_dimension + L1_N))
159 |         view2_weight1 = tf.Variable(tf.random_uniform([view2_dimension, L1_N], -view2_fc1_par_range, view2_fc1_par_range))
160 |         view2_bias1 = tf.Variable(tf.random_uniform([L1_N], -view2_fc1_par_range, view2_fc1_par_range))
161 |         # variable_summaries(item_weight1, 'L1_weights')
162 |         # variable_summaries(item_bias1, 'L1_biases')
163 |         view2_positive_l1 = tf.sparse_tensor_dense_matmul(view2_batch, view2_weight1) + view2_bias1
164 |         view2_positive_l1_out = tf.nn.relu(view2_positive_l1)
165 | 
166 |     with tf.name_scope('Item_FC2'):
167 |         view2_fc2_par_range = np.sqrt(6.0 / (L1_N + L2_N))
168 |         view2_weight2 = tf.Variable(tf.random_uniform([L1_N, L2_N], -view2_fc2_par_range, view2_fc2_par_range))
169 |         view2_bias2 = tf.Variable(tf.random_uniform([L2_N], -view2_fc2_par_range, view2_fc2_par_range))
170 |         # variable_summaries(item_weight2, 'L2_weights')
171 |         # variable_summaries(item_bias2, 'L2_biases')
172 | 
173 |         view2_positive_l2 = tf.matmul(view2_positive_l1_out, view2_weight2) + view2_bias2
174 |         view2_positive_y = tf.nn.relu(view2_positive_l2)
175 | 
176 | with tf.name_scope('Item_view3'):
177 |     with tf.name_scope('Item_FC1'):
178 |         view3_fc1_par_range = np.sqrt(6.0 / (view3_dimension + L1_N))
179 |         view3_weight1 = tf.Variable(tf.random_uniform([view3_dimension, L1_N], -view3_fc1_par_range, view3_fc1_par_range))
180 |         view3_bias1 = tf.Variable(tf.random_uniform([L1_N], -view3_fc1_par_range, view3_fc1_par_range))
181 |         # variable_summaries(item_weight1, 'L1_weights')
182 |         # variable_summaries(item_bias1, 'L1_biases')
183 |         view3_positive_l1 = tf.sparse_tensor_dense_matmul(view3_batch, view3_weight1) + view3_bias1
184 |         view3_positive_l1_out = tf.nn.relu(view3_positive_l1)
185 | 
186 |     with tf.name_scope('Item_FC2'):
187 |         view3_fc2_par_range = np.sqrt(6.0 / (L1_N + L2_N))
188 |         view3_weight2 = tf.Variable(tf.random_uniform([L1_N, L2_N], -view3_fc2_par_range, view3_fc2_par_range))
189 |         view3_bias2 = tf.Variable(tf.random_uniform([L2_N], -view3_fc2_par_range, view3_fc2_par_range))
190 |         # variable_summaries(item_weight2, 'L2_weights')
191 |         # variable_summaries(item_bias2, 'L2_biases')
192 | 
193 |         view3_positive_l2 = tf.matmul(view3_positive_l1_out, view3_weight2) + view3_bias2
194 |         view3_positive_y = tf.nn.relu(view3_positive_l2)
195 | 
196 | with tf.name_scope('Make_Negative_Item'):
197 |     # 合并负样本，tile可选择是否扩展负样本。
198 |     # 判断激活哪一个view。
199 |     if active_view == 1:
200 |         item_y = tf.tile(view1_positive_y, [1, 1])
201 |     elif active_view == 2:
202 |         item_y = tf.tile(view2_positive_y, [1, 1])
203 |     else:
204 |         item_y = tf.tile(view3_positive_y, [1, 1])
205 | 
206 |     item_y_temp = tf.tile(item_y, [1, 1])
207 |     # batch内随机负采样。
208 |     for i in range(NEG):
209 |         rand = int((random.random() + i) * user_BS / NEG)
210 |         item_y = tf.concat([item_y,
211 |                             tf.slice(item_y_temp, [rand, 0], [user_BS - rand, -1]),
212 |                             tf.slice(item_y_temp, [0, 0], [rand, -1])], 0)
213 | 
214 | with tf.name_scope('Cosine_Similarity'):
215 |     # Cosine similarity
216 |     # query_norm = sqrt(sum(each x^2))
217 |     query_norm = tf.tile(tf.sqrt(tf.reduce_sum(tf.square(user_y), 1, True)), [NEG + 1, 1])
218 |     # doc_norm = sqrt(sum(each x^2))
219 |     doc_norm = tf.sqrt(tf.reduce_sum(tf.square(item_y), 1, True))
220 |     # query * doc
221 |     prod = tf.reduce_sum(tf.multiply(tf.tile(user_y, [NEG + 1, 1]), item_y), 1, True)
222 |     # ||query|| * ||doc||
223 |     norm_prod = tf.multiply(query_norm, doc_norm)
224 |     # cos_sim_raw = query * doc / (||query|| * ||doc||)
225 |     cos_sim_raw = tf.truediv(prod, norm_prod)
226 |     # gamma = 20
227 |     # shape = [user_BS, NEG + 1]，第一列是正样本cos相似度。
228 |     cos_sim = tf.transpose(tf.reshape(tf.transpose(cos_sim_raw), [NEG + 1, user_BS])) * 20
229 | 
230 | with tf.name_scope('Loss'):
231 |     # Train Loss
232 |     # 转化为softmax概率矩阵。
233 |     prob = tf.nn.softmax(cos_sim)
234 |     # 只取第一列，即正样本列概率。
235 |     hit_prob = tf.slice(prob, [0, 0], [-1, 1])
236 |     loss = -tf.reduce_sum(tf.log(hit_prob))
237 |     tf.summary.scalar('loss', loss)
238 | 
239 | with tf.name_scope('Training'):
240 |     # Optimizer
241 |     train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(loss)
242 | 
243 | # with tf.name_scope('Accuracy'):
244 | #     correct_prediction = tf.equal(tf.argmax(prob, 1), 0)
245 | #     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
246 | #     tf.summary.scalar('accuracy', accuracy)
247 | 
248 | merged = tf.summary.merge_all()
249 | 
250 | with tf.name_scope('Test'):
251 |     average_loss = tf.placeholder(tf.float32)
252 |     loss_summary = tf.summary.scalar('average_loss', average_loss)
253 | 
254 | with tf.name_scope('Train'):
255 |     train_average_loss = tf.placeholder(tf.float32)
256 |     train_loss_summary = tf.summary.scalar('train_average_loss', train_average_loss)
257 | 
258 | 
259 | def convert_to_sparse_tensor(data_in):
260 |     data_in = data_in.tocoo()
261 |     data_in = tf.SparseTensorValue(
262 |         np.transpose([np.array(data_in.row, dtype=np.int64), np.array(data_in.col, dtype=np.int64)]),
263 |         np.array(data_in.data, dtype=np.float),
264 |         np.array(data_in.shape, dtype=np.int64))
265 |     return data_in
266 | 
267 | 
268 | def pull_batch(user_data, item_positive, batch_id):
269 |     batch_id = int(batch_id)
270 |     user_in = user_data[batch_id * user_BS:(batch_id + 1) * user_BS, :]
271 |     item_positive_in = item_positive[batch_id * user_BS:(batch_id + 1) * user_BS, :]
272 |     user_in, item_positive_in = convert_to_sparse_tensor(user_in), convert_to_sparse_tensor(item_positive_in)
273 |     return user_in, item_positive_in
274 | 
275 | 
276 | def feed_dict(on_verify, Train, batch_id):
277 |     view1_batch_in = convert_to_sparse_tensor(sparse.csr_matrix(([], ([], [])), shape=(user_BS, view1_dimension)))
278 |     view2_batch_in = convert_to_sparse_tensor(sparse.csr_matrix(([], ([], [])), shape=(user_BS, view2_dimension)))
279 |     view3_batch_in = convert_to_sparse_tensor(sparse.csr_matrix(([], ([], [])), shape=(user_BS, view3_dimension)))
280 |     active_view_in = 1
281 |     if Train:
282 |         if batch_id <= view1_size / user_BS:
283 |             batch_id = batch_id if batch_id < view1_size / user_BS - 1 else batch_id - 1
284 |             active_view_in = 1
285 |             user_batch_in, view1_batch_in = pull_batch(data_sets.app_search, data_sets.app_his, batch_id)
286 |         elif view1_size / user_BS < batch_id <= (view1_size + view2_size) / user_BS:
287 |             batch_id -= view1_size / user_BS
288 |             batch_id = batch_id if batch_id < view2_size / user_BS - 1 else batch_id - 1
289 |             active_view_in = 2
290 |             user_batch_in, view2_batch_in = pull_batch(data_sets.music_search, data_sets.music_his, batch_id)
291 |         else:
292 |             batch_id -= view1_size / user_BS + view2_size / user_BS
293 |             batch_id = batch_id if batch_id < view3_size / user_BS - 1 else batch_id - 1
294 |             active_view_in = 3
295 |             user_batch_in, view3_batch_in = pull_batch(data_sets.novel_search, data_sets.novel_his, batch_id)
296 | 
297 | 
298 |     else:
299 |         if batch_id <= view1_size_test / user_BS:
300 |             batch_id = batch_id if batch_id < view1_size_test / user_BS - 1 else batch_id - 1
301 |             active_view_in = 1
302 |             user_batch_in, view1_batch_in = pull_batch(data_sets.app_search_test, data_sets.app_his_test, batch_id)
303 |         elif view1_size_test / user_BS < batch_id <= (view1_size_test + view2_size_test) / user_BS:
304 |             batch_id -= view1_size_test / user_BS
305 |             batch_id = batch_id if batch_id < view2_size_test / user_BS - 1 else batch_id - 1
306 |             active_view_in = 2
307 |             user_batch_in, view2_batch_in = pull_batch(data_sets.music_search_test, data_sets.music_his_test, batch_id)
308 |         else:
309 |             batch_id -= view1_size_test / user_BS + view2_size_test / user_BS
310 |             batch_id = batch_id if batch_id < view3_size_test / user_BS - 1 else batch_id - 1
311 |             active_view_in = 3
312 |             user_batch_in, view3_batch_in = pull_batch(data_sets.novel_search_test, data_sets.novel_his_test, batch_id)
313 | 
314 |     return {user_batch: user_batch_in,
315 |             view1_batch: view1_batch_in,
316 |             view2_batch: view2_batch_in,
317 |             view3_batch: view3_batch_in,
318 |             active_view: active_view_in,
319 |             on_train: on_verify}
320 | 
321 | 
322 | # config = tf.ConfigProto()  # log_device_placement=True)
323 | # config.gpu_options.allow_growth = True
324 | # if not FLAGS.gpu:
325 | # config = tf.ConfigProto(device_count= {'GPU' : 0})
326 | 
327 | # 创建一个Saver对象，选择性保存变量或者模型。
328 | saver = tf.train.Saver()
329 | # with tf.Session(config=config) as sess:
330 | with tf.Session() as sess:
331 |     sess.run(tf.global_variables_initializer())
332 |     train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train', sess.graph)
333 |     # test_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/test', sess.graph)
334 | 
335 |     start = time.time()
336 |     for step in range(FLAGS.max_steps):
337 |         batch_id = int(random.random() * (total_size / user_BS - 1))
338 |         # print(batch_id)
339 |         sess.run(train_step, feed_dict=feed_dict(True, True, batch_id))
340 | 
341 |         if step % FLAGS.epoch_steps == 0:
342 |             # train loss
343 |             loss_v = sess.run(loss, feed_dict=feed_dict(False, True, batch_id))
344 | 
345 |             loss_v /= user_BS
346 |             train_loss = sess.run(train_loss_summary, feed_dict={train_average_loss: loss_v})
347 |             train_writer.add_summary(train_loss, step + 1)
348 |             end = time.time()
349 |             print("\nEpoch #%-5d | Train Loss: %-4.3f | PureTrainTime: %-3.3fs" %
350 |                   (step / FLAGS.epoch_steps, loss_v, end - start))
351 | 
352 |             # test loss
353 |             epoch_loss = 0
354 |             for i in range(FLAGS.test_pack_size):
355 |                 loss_v = sess.run(loss, feed_dict=feed_dict(False, False, i))
356 |                 epoch_loss += loss_v
357 |             epoch_loss /= (FLAGS.test_pack_size * user_BS)
358 |             test_loss = sess.run(loss_summary, feed_dict={average_loss: epoch_loss})
359 |             train_writer.add_summary(test_loss, step + 1)
360 |             start = time.time()
361 |             print("Epoch #%-5d | Test  Loss: %-4.3f | Calc_LossTime: %-3.3fs" %
362 |                   (step / FLAGS.epoch_steps, epoch_loss, start - end))
363 | 
364 |     # 保存模型
365 |     save_path = saver.save(sess, "model/model_1.ckpt")
366 |     print("Model saved in file: ", save_path)
367 | 


--------------------------------------------------------------------------------
/requirement.txt:
--------------------------------------------------------------------------------
1 | paddlepaddle==1.7.1
2 | paddlehub==1.8.2
3 | TensorFlow==1.12
4 | sklearn
5 | pyyaml
6 | keras==2.2.5
7 | argparse


--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | #encoding=utf-8
  3 | '''
  4 | @Time    :   2020/10/25 22:28:30
  5 | @Author  :   zhiyang.zzy 
  6 | @Contact :   zhiyangchou@gmail.com
  7 | @Desc    :   训练相似度模型
  8 | 1. siamese network，分别使用 cosine、曼哈顿距离
  9 | 2. triplet loss
 10 | '''
 11 | 
 12 | # here put the import lib
 13 | from model.bert_classifier import BertClassifier
 14 | import os
 15 | import time
 16 | from numpy.lib.arraypad import pad
 17 | import nni
 18 | from tensorflow.python.ops.gen_io_ops import write_file
 19 | import yaml
 20 | import logging
 21 | import argparse
 22 | logging.basicConfig(level=logging.INFO)
 23 | import data_input
 24 | from config import Config
 25 | from model.siamese_network import SiamenseRNN, SiamenseBert
 26 | from data_input import Vocabulary, get_test
 27 | from util import write_file
 28 | 
 29 | def train_siamese():
 30 |     # 读取配置
 31 |     # conf = Config()
 32 |     cfg_path = "./configs/config.yml"
 33 |     cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
 34 |     # 读取数据
 35 |     data_train, data_val, data_test = data_input.get_lcqmc()
 36 |     # data_train = data_train[:100]
 37 |     print("train size:{},val size:{}, test size:{}".format(
 38 |         len(data_train), len(data_val), len(data_test)))
 39 |     model = SiamenseRNN(cfg)
 40 |     model.fit(data_train, data_val, data_test)
 41 |     pass
 42 | 
 43 | def predict_siamese(file_='./results/'):
 44 |     # 加载配置
 45 |     cfg_path = "./configs/config.yml"
 46 |     cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
 47 |     # 将 seq转为id，
 48 |     vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
 49 |     test_arr, query_arr = get_test(file_, vocab)
 50 |     # 加载模型
 51 |     model = SiamenseRNN(cfg)
 52 |     model.restore_session(cfg["checkpoint_dir"])
 53 |     test_label, test_prob = model.predict(test_arr)
 54 |     out_arr = [x + [test_label[i]] + [test_prob[i]] for i, x in enumerate(query_arr)]
 55 |     write_file(out_arr, file_ + '.siamese.predict', )
 56 |     pass
 57 | 
 58 | def train_siamese_bert():
 59 |     # 读取配置
 60 |     # conf = Config()
 61 |     cfg_path = "./configs/config_bert.yml"
 62 |     cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
 63 |     # 自动调参的参数，每次会更新一组搜索空间中的参数
 64 |     tuner_params= nni.get_next_parameter()
 65 |     cfg.update(tuner_params)
 66 |     # vocab: 将 seq转为id，
 67 |     vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
 68 |     # 读取数据
 69 |     data_train, data_val, data_test = data_input.get_lcqmc_bert(vocab)
 70 |     # data_train = data_train[:100]
 71 |     print("train size:{},val size:{}, test size:{}".format(
 72 |         len(data_train), len(data_val), len(data_test)))
 73 |     model = SiamenseBert(cfg)
 74 |     model.fit(data_train, data_val, data_test)
 75 |     pass
 76 | 
 77 | def predict_siamese_bert(file_="./results/input/test"):
 78 |     # 读取配置
 79 |     # conf = Config()
 80 |     cfg_path = "./configs/config_bert.yml"
 81 |     cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
 82 |     os.environ["CUDA_VISIBLE_DEVICES"] = "4"
 83 |     # vocab: 将 seq转为id，
 84 |     vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
 85 |     # 读取数据
 86 |     test_arr, query_arr  = data_input.get_test_bert(file_, vocab)
 87 |     print("test size:{}".format(len(test_arr)))
 88 |     model = SiamenseBert(cfg)
 89 |     model.restore_session(cfg["checkpoint_dir"])
 90 |     test_label, test_prob = model.predict(test_arr)
 91 |     out_arr = [x + [test_label[i]] + [test_prob[i]] for i, x in enumerate(query_arr)]
 92 |     write_file(out_arr, file_ + '.siamese.bert.predict', )
 93 |     pass
 94 | 
 95 | def train_bert():
 96 |     # 读取配置
 97 |     # conf = Config()
 98 |     cfg_path = "./configs/bert_classify.yml"
 99 |     cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
100 |     # 自动调参的参数，每次会更新一组搜索空间中的参数
101 |     tuner_params= nni.get_next_parameter()
102 |     cfg.update(tuner_params)
103 |     # vocab: 将 seq转为id，
104 |     vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
105 |     # 读取数据
106 |     data_train, data_val, data_test = data_input.get_lcqmc_bert(vocab, is_merge=1)
107 |     # data_train = data_train[:100]
108 |     print("train size:{},val size:{}, test size:{}".format(
109 |         len(data_train), len(data_val), len(data_test)))
110 |     model = BertClassifier(cfg)
111 |     model.fit(data_train, data_val, data_test)
112 |     pass
113 | 
114 | def predict_bert(file_="./results/input/test"):
115 |     # 读取配置
116 |     # conf = Config()
117 |     cfg_path = "./configs/bert_classify.yml"
118 |     cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
119 |     # vocab: 将 seq转为id，
120 |     vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
121 |     # 读取数据
122 |     test_arr, query_arr  = data_input.get_test_bert(file_, vocab, is_merge=1)
123 |     print("test size:{}".format(len(test_arr)))
124 |     model = BertClassifier(cfg)
125 |     model.restore_session(cfg["checkpoint_dir"])
126 |     test_label, test_prob = model.predict(test_arr)
127 |     out_arr = [x + [test_label[i]] + [test_prob[i]] for i, x in enumerate(query_arr)]
128 |     write_file(out_arr, file_ + '.bert.predict', )
129 |     pass
130 | 
131 | def siamese_bert_sentence_embedding(file_="./results/input/test.single"):
132 |     # 输入一行是一个query，输出是此query对应的向量
133 |     # 读取配置
134 |     cfg_path = "./configs/config_bert.yml"
135 |     cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
136 |     cfg['batch_size'] = 64
137 |     os.environ["CUDA_VISIBLE_DEVICES"] = "7"
138 |     # vocab: 将 seq转为id，
139 |     vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
140 |     # 读取数据
141 |     test_arr, query_arr = data_input.get_test_bert_single(file_, vocab)
142 |     print("test size:{}".format(len(test_arr)))
143 |     model = SiamenseBert(cfg)
144 |     model.restore_session(cfg["checkpoint_dir"])
145 |     test_label = model.predict_embedding(test_arr)
146 |     test_label = [",".join([str(y) for y in x]) for x in test_label]
147 |     out_arr = [[x, test_label[i]] for i, x in enumerate(query_arr)]
148 |     print("write to file...")
149 |     write_file(out_arr, file_ + '.siamese.bert.embedding', )
150 |     pass
151 | 
152 | if __name__ == "__main__":
153 |     os.environ["CUDA_VISIBLE_DEVICES"] = "4"
154 |     ap = argparse.ArgumentParser()
155 |     ap.add_argument("--method", default="bert", type=str, help="train/predict")
156 |     ap.add_argument("--mode", default="train", type=str, help="train/predict")
157 |     ap.add_argument("--file", default="./results/input/test", type=str, help="train/predict")
158 |     args = ap.parse_args()
159 |     if args.mode == 'train' and args.method == 'rnn':
160 |         train_siamese()
161 |     elif args.mode == 'predict' and args.method == 'rnn':
162 |         predict_siamese(args.file)
163 |     elif args.mode == 'train' and args.method == 'bert_siamese':
164 |         train_siamese_bert()
165 |     elif args.mode == 'predict' and args.method == 'bert_siamese':
166 |         predict_siamese_bert(args.file)
167 |     elif args.mode == 'train' and args.method == 'bert':
168 |         train_bert()
169 |     elif args.mode == 'predict' and args.method == 'bert':
170 |         predict_bert(args.file)
171 |     elif args.mode == 'predict' and args.method == 'bert_siamese_embedding':
172 |         # 此处输出句子的 embedding，如果想要使用向量召回
173 |         # 建议训练模型的时候，损失函数使用功能和faiss一致的距离度量，例如faiss中使用是l2，那么损失函数用l2
174 |         # faiss距离用cos，损失函数用cosin，或者损失中有一项是cosin相似度损失
175 |         siamese_bert_sentence_embedding(args.file)
176 | 


--------------------------------------------------------------------------------
/util.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | #encoding=utf-8
 3 | '''
 4 | @Time    :   2020/10/13 20:33:50
 5 | @Author  :   zhiyang.zzy 
 6 | @Contact :   zhiyangchou@gmail.com
 7 | @Desc    :   
 8 | '''
 9 | 
10 | # here put the import lib
11 | import os 
12 | import six
13 | import time
14 | 
15 | def convert_to_unicode(text):
16 |     """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
17 |     if six.PY3:
18 |         if isinstance(text, str):
19 |             return text
20 |         elif isinstance(text, bytes):
21 |             return text.decode("utf-8", "ignore")
22 |         else:
23 |             raise ValueError("Unsupported string type: %s" % (type(text)))
24 |     elif six.PY2:
25 |         if isinstance(text, str):
26 |             return text.decode("utf-8", "ignore")
27 |         elif isinstance(text, unicode):
28 |             return text
29 |         else:
30 |             raise ValueError("Unsupported string type: %s" % (type(text)))
31 |     else:
32 |         raise ValueError("Not running on Python2 or Python 3?")
33 | 
34 | def read_file(file_:str, splitter:str=None):
35 |     out_arr = []
36 |     with open(file_, encoding="utf-8") as f: 
37 |         out_arr = [x.strip("\n") for x in f.readlines()]
38 |         if splitter:
39 |             out_arr = [x.split(splitter) for x in out_arr]
40 |     return out_arr
41 | 
42 | def write_file(out_arr:list, file_:str, splitter='\t'):
43 |     with open(file_, 'w', encoding='utf-8') as out:
44 |         for line in out_arr:
45 |             out.write(splitter.join([str(x) for x in line]) + '\n')
46 | 
47 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
48 |     """Truncates a sequence pair in place to the maximum length."""
49 |     while True:
50 |         total_length = len(tokens_a) + len(tokens_b)
51 |         if total_length <= max_length:
52 |             break
53 |         if len(tokens_a) > len(tokens_b):
54 |             tokens_a.pop()
55 |         else:
56 |             tokens_b.pop()
57 | if __name__ == "__main__":
58 |     pass


--------------------------------------------------------------------------------