├── .DS_Store ├── .gitignore ├── README.md ├── assets └── dssm_rnn_loss.png ├── auto_ml.yml ├── config.py ├── configs ├── bert_classify.yml ├── config.yml ├── config_bert.yml └── search_space.json ├── data ├── readme.md └── vocab.txt ├── data_input.py ├── dssm.py ├── dssm_rnn.py ├── flask_server.py ├── model ├── base_model.py ├── bert │ ├── ReadMe.md │ ├── modeling.py │ ├── modeling_v1.py │ ├── optimization.py │ └── tokenization.py ├── bert_classifier.py └── siamese_network.py ├── multi_view_dssm_v3.py ├── requirement.txt ├── train.py └── util.py /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/InsaneLife/dssm/1d32e137654e03994f7ba6cfde52e1d47601027c/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # data 2 | data/* 3 | !data/readme.md 4 | !data/vocab.txt 5 | Summaries/ 6 | 7 | Summaries/* 8 | results/* 9 | log 10 | tmp 11 | 12 | # py 13 | test.py 14 | 15 | # 通用 16 | .unotes/ 17 | envi/ 18 | __pycache__/ 19 | .vscode/ 20 | *.pyc 21 | .DS_Store 22 | *.DS_Store 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/publication/learning-deep-structured-semantic-models-for-web-search-using-clickthrough-data/)以及其后续文章 2 | 3 | [A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](http://blog.csdn.net/shine19930820/article/details/78810984)的实现Demo。 4 | 5 | # 注意: 6 | **\*\*\*\*2020/11/15\*\*\*\*** 7 | 8 | 论文[li2020sentence](https://arxiv.org/abs/2011.05864)将normalizing flows和bert结合,在语义相似度任务上有奇效,接下来会继续进行验证。 9 | 10 | **\*\*\*\*2020/10/27\*\*\*\*** 11 | 12 | 添加底层使用bert的siamese-bert实验,见[siamese\_network.py](https://github.com/InsaneLife/dssm/blob/master/model/siamese_network.py)中类 SiamenseBert,其他和下面一样. 
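In the twin-tower setup each sentence is run through the same weight-shared encoder on its own, and the two resulting sentence vectors are compared, e.g. with cosine similarity. Below is a minimal sketch of that comparison step (illustrative only: it assumes mean-pooled token vectors from some encoder, and the helper names are not the repo's `SiamenseBert` API):

```python
# Illustrative sketch, not the repo's API: score two independently encoded sentences.
import numpy as np

def mean_pool(token_vecs, mask):
    # token_vecs: (seq_len, dim); mask: (seq_len,) with 1 for real tokens, 0 for padding
    mask = mask[:, None].astype(np.float32)
    return (token_vecs * mask).sum(axis=0) / np.maximum(mask.sum(), 1.0)

def cosine(u, v, eps=1e-8):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

# Pretend these token vectors came from the same shared encoder tower.
rng = np.random.RandomState(0)
sent_a = mean_pool(rng.randn(6, 768).astype(np.float32), np.array([1, 1, 1, 1, 0, 0]))
sent_b = mean_pool(rng.randn(6, 768).astype(np.float32), np.array([1, 1, 1, 0, 0, 0]))
print("pair score:", cosine(sent_a, sent_b))
```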
13 | 14 | 相比于[bert](https://github.com/InsaneLife/dssm/blob/master/model/bert_classifier.py) 直接将两句作为输入,双塔bert的优势在于: 15 | - max sequence len会更短,训练所需要显存更小,速度会稍微快一些,对于硬件不是太好的伙伴比较友好。 16 | - 可以训练后使用bert作为句子embedding的encoder,在一些线上匹配的时候,可以预先将需要对比的句子向量算出来,节省实时算力。 17 | - 效果相比于直接用bert输入两句,测试集会差一个多点。 18 | - bert可以使用[CLS]的输出或者average context embedding, 一般后者效果会更好。 19 | ```shell 20 | # bert_siamese双塔模型 21 | python train.py --mode=train --method=bert_siamese 22 | # 直接使用功能bert 23 | python train.py --mode=train --method=bert 24 | ``` 25 | 参考:[reimers2019sentence](https://arxiv.org/abs/1908.10084) 26 | 27 | **\*\*\*\*2020/10/17\*\*\*\*** 28 | 29 | 由于之前数据集问题,会有不收敛问题,现更换数据集为LCQMC口语化描述的语义相似度数据集。模型也从多塔变成了双塔模型,见[siamese\_network.py](https://github.com/InsaneLife/dssm/blob/master/model/siamese_network.py), 训练入口:[train.py](https://github.com/InsaneLife/dssm/blob/master/train.py) 30 | 31 | > 难以找到搜索点击的公开数据集,暂且用语义相似任务数据集,有点变味了,哈哈 32 | > 目前看在此数据集上测试数据集的准确率是提升的,只有七十多,但是要达到论文的准确率,仍然还需要进行调参 33 | 34 | 训练(默认使用功LCQMC数据集): 35 | 36 | ```shell 37 | python train.py --mode=train 38 | ``` 39 | 40 | 预测: 41 | 42 | ```shell 43 | python train.py --mode=train --file=$predict_file$ 44 | ``` 45 | 46 | 测试文件格式: q1\tq2, 例如: 47 | 48 | ``` 49 | 今天天气怎么样 今天温度怎么样 50 | ``` 51 | 52 | 53 | 54 | **\*\*\*\*2019/5/18\*\*\*\*** 55 | 56 | 由于之前代码api过时,已更新最新代码于:[dssm\_rnn.py](https://github.com/InsaneLife/dssm/blob/master/dssm_rnn.py) 57 | 58 | 数据处理代码[data\_input.py](https://github.com/InsaneLife/dssm/blob/master/data_input.py) 和数据[data](https://github.com/InsaneLife/dssm/tree/master/data) 已经更新,由于使用了rnn,所以**输入非bag of words方式。** 59 | 60 | ![img](https://ask.qcloudimg.com/http-save/yehe-1881084/7ficv1hhqf.png?imageView2/2/w/1620) 61 | 62 | > 来源:Palangi, Hamid, et al. "Semantic modelling with long-short-term memory for information retrieval." arXiv preprint arXiv:1412.6629 2014. 63 | > 64 | > 训练损失,在45个epoch时基本不下降: 65 | > 66 | > ![dssm_rnn_loss](https://raw.githubusercontent.com/InsaneLife/dssm/master/assets/dssm_rnn_loss.png) 67 | 68 | # 1\. 数据&环境 69 | 70 | DSSM,对于输入数据是Query对,即Query短句和相应的展示,展示中分点击和未点击,分别为正负样,同时对于点击的先后顺序,也是有不同赋值,具体可参考论文。 71 | 72 | 对于我的Query数据本人无权开放,还请自行寻找数据。 73 | 环境: 74 | 75 | 1. win, python3.5, tensorflow1.4. 76 | 77 | # 2\. word hashing 78 | 79 | 原文使用3-grams,对于中文,我使用了uni-gram,因为中文本身字有一定代表意义(也有论文拆笔画),对于每个gram都使用one-hot编码代替,最终可以大大降低短句维度。 80 | 81 | # 3\. 结构 82 | 83 | 结构图: 84 | 85 | ![img](https://raw.githubusercontent.com/InsaneLife/MyPicture/master/dssm2.png) 86 | 87 | 1. 把条目映射成低维向量。 88 | 2. 
计算查询和文档的cosine相似度。 89 | 90 | ## 3.1 输入 91 | 92 | 这里使用了TensorBoard可视化,所以定义了name\_scope: 93 | 94 | ``` python 95 | with tf.name_scope('input'): 96 | query_batch = tf.sparse_placeholder(tf.float32, shape=[None, TRIGRAM_D], name='QueryBatch') 97 | doc_positive_batch = tf.sparse_placeholder(tf.float32, shape=[None, TRIGRAM_D], name='DocBatch') 98 | doc_negative_batch = tf.sparse_placeholder(tf.float32, shape=[None, TRIGRAM_D], name='DocBatch') 99 | on_train = tf.placeholder(tf.bool) 100 | ``` 101 | 102 | ## 3.2 全连接层 103 | 104 | 我使用三层的全连接层,对于每一层全连接层,除了神经元不一样,其他都一样,所以可以写一个函数复用。 105 | $$ 106 | l\_n = W\_n x + b\_1 107 | $$ 108 | 109 | ``` python 110 | def add_layer(inputs, in_size, out_size, activation_function=None): 111 | wlimit = np.sqrt(6.0 / (in_size + out_size)) 112 | Weights = tf.Variable(tf.random_uniform([in_size, out_size], -wlimit, wlimit)) 113 | biases = tf.Variable(tf.random_uniform([out_size], -wlimit, wlimit)) 114 | Wx_plus_b = tf.matmul(inputs, Weights) + biases 115 | if activation_function is None: 116 | outputs = Wx_plus_b 117 | else: 118 | outputs = activation_function(Wx_plus_b) 119 | return outputs 120 | ``` 121 | 122 | 其中,对于权重和Bias,使用了按照论文的特定的初始化方式: 123 | 124 | ``` python 125 | wlimit = np.sqrt(6.0 / (in_size + out_size)) 126 | Weights = tf.Variable(tf.random_uniform([in_size, out_size], -wlimit, wlimit)) 127 | biases = tf.Variable(tf.random_uniform([out_size], -wlimit, wlimit)) 128 | ``` 129 | 130 | ### Batch Normalization 131 | 132 | ``` python 133 | def batch_normalization(x, phase_train, out_size): 134 | """ 135 | Batch normalization on convolutional maps. 136 | Ref.: http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow 137 | Args: 138 | x: Tensor, 4D BHWD input maps 139 | out_size: integer, depth of input maps 140 | phase_train: boolean tf.Varialbe, true indicates training phase 141 | scope: string, variable scope 142 | Return: 143 | normed: batch-normalized maps 144 | """ 145 | with tf.variable_scope('bn'): 146 | beta = tf.Variable(tf.constant(0.0, shape=[out_size]), 147 | name='beta', trainable=True) 148 | gamma = tf.Variable(tf.constant(1.0, shape=[out_size]), 149 | name='gamma', trainable=True) 150 | batch_mean, batch_var = tf.nn.moments(x, [0], name='moments') 151 | ema = tf.train.ExponentialMovingAverage(decay=0.5) 152 | 153 | def mean_var_with_update(): 154 | ema_apply_op = ema.apply([batch_mean, batch_var]) 155 | with tf.control_dependencies([ema_apply_op]): 156 | return tf.identity(batch_mean), tf.identity(batch_var) 157 | 158 | mean, var = tf.cond(phase_train, 159 | mean_var_with_update, 160 | lambda: (ema.average(batch_mean), ema.average(batch_var))) 161 | normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3) 162 | return normed 163 | ``` 164 | 165 | ### 单层 166 | 167 | ``` python 168 | with tf.name_scope('FC1'): 169 | # 激活函数在BN之后,所以此处为None 170 | query_l1 = add_layer(query_batch, TRIGRAM_D, L1_N, activation_function=None) 171 | doc_positive_l1 = add_layer(doc_positive_batch, TRIGRAM_D, L1_N, activation_function=None) 172 | doc_negative_l1 = add_layer(doc_negative_batch, TRIGRAM_D, L1_N, activation_function=None) 173 | 174 | with tf.name_scope('BN1'): 175 | query_l1 = batch_normalization(query_l1, on_train, L1_N) 176 | doc_l1 = batch_normalization(tf.concat([doc_positive_l1, doc_negative_l1], axis=0), on_train, L1_N) 177 | doc_positive_l1 = tf.slice(doc_l1, [0, 0], [query_BS, -1]) 178 | doc_negative_l1 = tf.slice(doc_l1, [query_BS, 0], [-1, -1]) 179 | query_l1_out = tf.nn.relu(query_l1) 180 | 
doc_positive_l1_out = tf.nn.relu(doc_positive_l1) 181 | doc_negative_l1_out = tf.nn.relu(doc_negative_l1) 182 | ······ 183 | ``` 184 | 185 | 合并负样本 186 | 187 | ``` python 188 | with tf.name_scope('Merge_Negative_Doc'): 189 | # 合并负样本,tile可选择是否扩展负样本。 190 | doc_y = tf.tile(doc_positive_y, [1, 1]) 191 | for i in range(NEG): 192 | for j in range(query_BS): 193 | # slice(input_, begin, size)切片API 194 | doc_y = tf.concat([doc_y, tf.slice(doc_negative_y, [j * NEG + i, 0], [1, -1])], 0) 195 | ``` 196 | 197 | ## 3.3 计算cos相似度 198 | 199 | ``` python 200 | with tf.name_scope('Cosine_Similarity'): 201 | # Cosine similarity 202 | # query_norm = sqrt(sum(each x^2)) 203 | query_norm = tf.tile(tf.sqrt(tf.reduce_sum(tf.square(query_y), 1, True)), [NEG + 1, 1]) 204 | # doc_norm = sqrt(sum(each x^2)) 205 | doc_norm = tf.sqrt(tf.reduce_sum(tf.square(doc_y), 1, True)) 206 | 207 | prod = tf.reduce_sum(tf.multiply(tf.tile(query_y, [NEG + 1, 1]), doc_y), 1, True) 208 | norm_prod = tf.multiply(query_norm, doc_norm) 209 | 210 | # cos_sim_raw = query * doc / (||query|| * ||doc||) 211 | cos_sim_raw = tf.truediv(prod, norm_prod) 212 | # gamma = 20 213 | cos_sim = tf.transpose(tf.reshape(tf.transpose(cos_sim_raw), [NEG + 1, query_BS])) * 20 214 | ``` 215 | 216 | ## 3.4 定义损失函数 217 | 218 | ``` python 219 | with tf.name_scope('Loss'): 220 | # Train Loss 221 | # 转化为softmax概率矩阵。 222 | prob = tf.nn.softmax(cos_sim) 223 | # 只取第一列,即正样本列概率。 224 | hit_prob = tf.slice(prob, [0, 0], [-1, 1]) 225 | loss = -tf.reduce_sum(tf.log(hit_prob)) 226 | tf.summary.scalar('loss', loss) 227 | ``` 228 | 229 | ## 3.5选择优化方法 230 | 231 | ``` python 232 | with tf.name_scope('Training'): 233 | # Optimizer 234 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(loss) 235 | ``` 236 | 237 | ## 3.6 开始训练 238 | 239 | ``` python 240 | # 创建一个Saver对象,选择性保存变量或者模型。 241 | saver = tf.train.Saver() 242 | # with tf.Session(config=config) as sess: 243 | with tf.Session() as sess: 244 | sess.run(tf.global_variables_initializer()) 245 | train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train', sess.graph) 246 | start = time.time() 247 | for step in range(FLAGS.max_steps): 248 | batch_id = step % FLAGS.epoch_steps 249 | sess.run(train_step, feed_dict=feed_dict(True, True, batch_id % FLAGS.pack_size, 0.5)) 250 | ``` 251 | 252 | GitHub完整代码 [https://github.com/InsaneLife/dssm](https://github.com/InsaneLife/dssm) 253 | 254 | Multi-view DSSM实现同理,可以参考GitHub:[multi\_view\_dssm\_v3](https://github.com/InsaneLife/dssm/blob/master/multi_view_dssm_v3.py) 255 | 256 | CSDN原文:[http://blog.csdn.net/shine19930820/article/details/79042567](http://blog.csdn.net/shine19930820/article/details/79042567) 257 | 258 | ## 自动调参 259 | 参数搜索空间:[search_space.json](./configs/search_space.json) 260 | 配置文件:[auto_ml.yml](auto_ml.yml) 261 | 启动命令 262 | ```shell 263 | nnictl create --config auto_ml.yml -p 8888 264 | ``` 265 | > 由于没有gpu 😂,[auto_ml.yml](auto_ml.yml)设置中没有配置gpu,有gpu同学可自行配置。 266 | 267 | 详细文档:https://nni.readthedocs.io/zh/latest/Overview.html 268 | 269 | 270 | # Reference 271 | - [li2020sentence](https://arxiv.org/abs/2011.05864) 272 | - [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) 273 | - nni 调参: https://nni.readthedocs.io/zh/latest/Overview.html -------------------------------------------------------------------------------- /assets/dssm_rnn_loss.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/InsaneLife/dssm/1d32e137654e03994f7ba6cfde52e1d47601027c/assets/dssm_rnn_loss.png -------------------------------------------------------------------------------- /auto_ml.yml: -------------------------------------------------------------------------------- 1 | authorName: default 2 | experimentName: example_dssm 3 | trialConcurrency: 1 4 | maxExecDuration: 10h 5 | maxTrialNum: 8 6 | #choice: local, remote, pai 7 | trainingServicePlatform: local 8 | searchSpacePath: ./configs/search_space.json 9 | #choice: true, false 10 | useAnnotation: false 11 | tuner: 12 | #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner 13 | #SMAC (SMAC should be installed through nnictl) 14 | builtinTunerName: TPE 15 | classArgs: 16 | #choice: maximize, minimize 17 | optimize_mode: maximize 18 | trial: 19 | command: python3 train.py --method=bert --mode=train 20 | codeDir: . 21 | gpuNum: 0 22 | localConfig: 23 | useActiveGpu: true 24 | maxTrialNumPerGpu: 3 25 | # gpuIndices: 4,0,5 -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding=utf-8 3 | ''' 4 | Author: zhiyang.zzy 5 | Date: 2019-09-25 21:59:54 6 | Contact: zhiyangchou@gmail.com 7 | FilePath: /dssm/config.py 8 | Desc: 9 | ''' 10 | 11 | 12 | def load_vocab(file_path): 13 | word_dict = {} 14 | with open(file_path, encoding='utf8') as f: 15 | for idx, word in enumerate(f.readlines()): 16 | word = word.strip() 17 | word_dict[word] = idx 18 | return word_dict 19 | 20 | 21 | class Config(object): 22 | def __init__(self): 23 | self.vocab_map = load_vocab(self.vocab_path) 24 | self.nwords = len(self.vocab_map) 25 | 26 | unk = '[UNK]' 27 | pad = '[PAD]' 28 | vocab_path = './data/vocab.txt' 29 | # file_train = './data/oppo_round1_train_20180929.mini' 30 | # file_train = './data/oppo_round1_train_20180929.txt' 31 | # file_vali = './data/oppo_round1_vali_20180929.mini' 32 | file_vali = './data/oppo_round1_vali_20180929.txt' 33 | file_train = file_vali 34 | max_seq_len = 40 35 | hidden_size_rnn = 100 36 | use_stack_rnn = False 37 | learning_rate = 0.001 38 | decay_step = 2000 39 | lr_decay = 0.95 40 | num_epoch = 300 41 | epoch_no_imprv = 5 42 | optimizer = "lazyadam" 43 | summaries_dir = './results/Summaries/' 44 | gpu = 0 45 | word_dim = 100 46 | batch_size = 64 47 | keep_porb = 0.5 48 | dropout = 1- keep_porb 49 | 50 | # checkpoint_dir 51 | checkpoint_dir='./results/checkpoint' 52 | 53 | 54 | if __name__ == '__main__': 55 | conf = Config() 56 | print(len(conf.vocab_map)) 57 | pass 58 | -------------------------------------------------------------------------------- /configs/bert_classify.yml: -------------------------------------------------------------------------------- 1 | unk: '[UNK]' 2 | pad: '[PAD]' 3 | vocab_path: './data/vocab.txt' 4 | max_seq_len: 80 5 | hidden_size_rnn: 200 6 | use_stack_rnn: False 7 | learning_rate: 0.00005 8 | decay_step: 1800 9 | lr_decay: 0.95 10 | num_epoch: 10 11 | epoch_no_imprv: 30 12 | optimizer: "adam" 13 | summaries_dir: './results/Summaries/' 14 | gpu: 0 15 | word_dim: 100 16 | batch_size: 64 17 | keep_porb: 0.5 18 | # checkpoint_dir 19 | checkpoint_dir: './results/checkpoint/bert_classifier/model' 20 | nwords: 21128 21 | sentence_embedding_type: cls 22 | 23 | # bert 24 | # bert_dir: &bert_dir '/mnt/nlp/bert/chinese_L-12_H-768_A-12/' 25 | bert_dir: '/Volumes/HddData/ProjectData/NLP/bert/chinese_L-12_H-768_A-12/' 26 | 
bert_init_checkpoint: "bert_model.ckpt" 27 | bert_vocab: "vocab.txt" 28 | bert_config: "bert_config.json" -------------------------------------------------------------------------------- /configs/config.yml: -------------------------------------------------------------------------------- 1 | unk: '[UNK]' 2 | pad: '[PAD]' 3 | vocab_path: './data/vocab.txt' 4 | max_seq_len: 40 5 | hidden_size_rnn: 200 6 | use_stack_rnn: False 7 | learning_rate: 0.0005 8 | decay_step: 1800 9 | lr_decay: 0.95 10 | num_epoch: 300 11 | epoch_no_imprv: 10 12 | optimizer: "lazyadam" 13 | summaries_dir: './results/Summaries/' 14 | gpu: 0 15 | word_dim: 100 16 | batch_size: 128 17 | keep_porb: 0.5 18 | # checkpoint_dir 19 | checkpoint_dir: './results/checkpoint/bert/model' 20 | nwords: 21128 21 | 22 | # bert 23 | # bert_dir: &bert_dir '/mnt/nlp/bert/chinese_L-12_H-768_A-12/' 24 | bert_dir: '/Volumes/HddData/ProjectData/NLP/bert/chinese_L-12_H-768_A-12/' 25 | bert_init_checkpoint: "bert_model.ckpt" 26 | bert_vocab: "vocab.txt" 27 | bert_config: "bert_config.json" -------------------------------------------------------------------------------- /configs/config_bert.yml: -------------------------------------------------------------------------------- 1 | unk: '[UNK]' 2 | pad: '[PAD]' 3 | vocab_path: './data/vocab.txt' 4 | max_seq_len: 40 5 | hidden_size_rnn: 200 6 | use_stack_rnn: False 7 | learning_rate: 0.00005 8 | decay_step: 1800 9 | lr_decay: 0.95 10 | num_epoch: 10 11 | epoch_no_imprv: 5 12 | optimizer: "adam" 13 | summaries_dir: './results/Summaries/' 14 | gpu: 0 15 | word_dim: 100 16 | batch_size: 256 17 | keep_porb: 0.5 18 | # checkpoint_dir 19 | checkpoint_dir: './results/checkpoint/bert/model' 20 | nwords: 21128 21 | use_avg_pooling: 1 22 | sentence_embedding_type: avg-last-2 23 | 24 | # bert 25 | # bert_dir: &bert_dir '/mnt/nlp/bert/chinese_L-12_H-768_A-12/' 26 | bert_dir: '/Volumes/HddData/ProjectData/NLP/bert/chinese_L-12_H-768_A-12/' 27 | bert_init_checkpoint: "bert_model.ckpt" 28 | bert_vocab: "vocab.txt" 29 | bert_config: "bert_config.json" -------------------------------------------------------------------------------- /configs/search_space.json: -------------------------------------------------------------------------------- 1 | { 2 | "batch_size": {"_type":"choice", "_value": [32, 64, 128, 256]}, 3 | "learning_rate":{"_type":"quniform","_value":[0.00002, 0.00005, 0.00001]} 4 | } -------------------------------------------------------------------------------- /data/readme.md: -------------------------------------------------------------------------------- 1 | # OPPO手机搜索排序query-title语义匹配数据集 2 | 数据来自天池大数据比赛,是OPPO手机搜索排序query-title语义匹配的问题。 3 | 4 | 数据格式: 数据分4列,\t分隔。 5 | 6 | | 字段 | 说明 | 数据示例 | 7 | | ---------------- | ------------------------------------------------------------ | ----------------------------------------- | 8 | | prefix | 用户输入(query前缀) | 刘德 | 9 | | query_prediction | 根据当前前缀,预测的用户完整需求查询词,最多10条;预测的查询词可能是前缀本身,数字为统计概率 | {“刘德华”: “0.5”, “刘德华的歌”: “0.3”, …} | 10 | | title | 文章标题 | 刘德华 | 11 | | tag | 文章内容标签 | 百科 | 12 | | label | 是否点击 | 0或1 | 13 | 14 | 为了应用来训练DSSM demo,将prefix和title作为正样,prefix和query_prediction(除title以外)作为负样本。 15 | 16 | 下载链接:链接: https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw 提取码: 7p3n 17 | 18 | 本数据仅限用于个人实验,如数据版权问题,请联系[chou.young@qq.com](mailto:chou.young@qq.com) 下架。 19 | 20 | 21 | 22 | 下载解压到data文件夹即可,注意修改config.py中配置:file_train, file_vali。 23 | 24 | 25 | # 其他数据集 26 | https://paddlehub.readthedocs.io/zh_CN/latest/reference/dataset.html 27 | 28 | ## LCQMC 29 | import paddlehub as hub 30 
| dataset = hub.dataset.LCQMC() 31 | 32 | pass 33 | train:238766 34 | test:12500 35 | dev:8802 -------------------------------------------------------------------------------- /data_input.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding=utf-8 3 | from inspect import getblock 4 | import json 5 | import os 6 | from os import read 7 | from numpy.core.fromnumeric import mean 8 | import numpy as np 9 | import paddlehub as hub 10 | import six 11 | import math 12 | import random 13 | import sys 14 | from util import read_file 15 | from config import Config 16 | # 配置文件 17 | conf = Config() 18 | 19 | 20 | class Vocabulary(object): 21 | def __init__(self, meta_file, max_len, allow_unk=0, unk="$UNK$", pad="$PAD$",): 22 | self.voc2id = {} 23 | self.id2voc = {} 24 | self.unk = unk 25 | self.pad = pad 26 | self.max_len = max_len 27 | self.allow_unk = allow_unk 28 | with open(meta_file, encoding='utf-8') as f: 29 | for i, line in enumerate(f): 30 | line = convert_to_unicode(line.strip("\n")) 31 | self.voc2id[line] = i 32 | self.id2voc[i] = line 33 | self.size = len(self.voc2id) 34 | self.oov_num = self.size + 1 35 | 36 | def fit(self, words_list): 37 | """ 38 | :param words_list: [[w11, w12, ...], [w21, w22, ...], ...] 39 | :return: 40 | """ 41 | word_lst = [] 42 | word_lst_append = word_lst.append 43 | for words in words_list: 44 | if not isinstance(words, list): 45 | print(words) 46 | continue 47 | for word in words: 48 | word = convert_to_unicode(word) 49 | word_lst_append(word) 50 | word_counts = Counter(word_lst) 51 | if self.max_num_word < 0: 52 | self.max_num_word = len(word_counts) 53 | sorted_voc = [w for w, c in word_counts.most_common(self.max_num_word)] 54 | self.max_num_word = len(sorted_voc) 55 | self.oov_index = self.max_num_word + 1 56 | self.voc2id = dict(zip(sorted_voc, range(1, self.max_num_word + 1))) 57 | return self 58 | 59 | def _transform2id(self, word): 60 | word = convert_to_unicode(word) 61 | if word in self.voc2id: 62 | return self.voc2id[word] 63 | elif self.allow_unk: 64 | return self.voc2id[self.unk] 65 | else: 66 | print(word) 67 | raise ValueError("word:{} Not in voc2id, please check".format(word)) 68 | 69 | def _transform_seq2id(self, words, padding=0): 70 | out_ids = [] 71 | words = convert_to_unicode(words) 72 | if self.max_len: 73 | words = words[:self.max_len] 74 | for w in words: 75 | out_ids.append(self._transform2id(w)) 76 | if padding and self.max_len: 77 | while len(out_ids) < self.max_len: 78 | out_ids.append(0) 79 | return out_ids 80 | 81 | def _transform_intent2ont_hot(self, words, padding=0): 82 | # 将多标签意图转为 one_hot 83 | out_ids = np.zeros(self.size, dtype=np.float32) 84 | words = convert_to_unicode(words) 85 | for w in words: 86 | out_ids[self._transform2id(w)] = 1.0 87 | return out_ids 88 | 89 | def _transform_seq2bert_id(self, words, padding=0): 90 | out_ids, seq_len = [], 0 91 | words = convert_to_unicode(words) 92 | if self.max_len: 93 | words = words[:self.max_len] 94 | seq_len = len(words) 95 | # 插入 [CLS], [SEP] 96 | out_ids.append(self._transform2id("[CLS]")) 97 | for w in words: 98 | out_ids.append(self._transform2id(w)) 99 | mask_ids = [1 for _ in out_ids] 100 | if padding and self.max_len: 101 | while len(out_ids) < self.max_len + 1: 102 | out_ids.append(0) 103 | mask_ids.append(0) 104 | seg_ids = [0 for _ in out_ids] 105 | return out_ids, mask_ids, seg_ids, seq_len 106 | 107 | @staticmethod 108 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 109 | """Truncates 
a sequence pair in place to the maximum length.""" 110 | while True: 111 | total_length = len(tokens_a) + len(tokens_b) 112 | if total_length <= max_length: 113 | break 114 | if len(tokens_a) > len(tokens_b): 115 | tokens_a.pop() 116 | else: 117 | tokens_b.pop() 118 | 119 | def _transform_2seq2bert_id(self, seq1, seq2, padding=0): 120 | out_ids, seg_ids, seq_len = [], [1], 0 121 | seq1 = [x for x in convert_to_unicode(seq1)] 122 | seq2 = [x for x in convert_to_unicode(seq2)] 123 | # 截断 124 | self._truncate_seq_pair(seq1, seq2, self.max_len - 2) 125 | # 插入 [CLS], [SEP] 126 | out_ids.append(self._transform2id("[CLS]")) 127 | for w in seq1: 128 | out_ids.append(self._transform2id(w)) 129 | seg_ids.append(0) 130 | out_ids.append(self._transform2id("[SEP]")) 131 | seg_ids.append(0) 132 | for w in seq2: 133 | out_ids.append(self._transform2id(w)) 134 | seg_ids.append(1) 135 | mask_ids = [1 for _ in out_ids] 136 | if padding and self.max_len: 137 | while len(out_ids) < self.max_len + 1: 138 | out_ids.append(0) 139 | mask_ids.append(0) 140 | seg_ids.append(0) 141 | return out_ids, mask_ids, seg_ids, seq_len 142 | 143 | def transform(self, seq_list, is_bert=0): 144 | if is_bert: 145 | return [self._transform_seq2bert_id(seq) for seq in seq_list] 146 | else: 147 | return [self._transform_seq2id(seq) for seq in seq_list] 148 | 149 | def __len__(self): 150 | return len(self.voc2id) 151 | 152 | def convert_to_unicode(text): 153 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" 154 | if six.PY3: 155 | if isinstance(text, str): 156 | return text 157 | elif isinstance(text, bytes): 158 | return text.decode("utf-8", "ignore") 159 | else: 160 | raise ValueError("Unsupported string type: %s" % (type(text))) 161 | elif six.PY2: 162 | if isinstance(text, str): 163 | return text.decode("utf-8", "ignore") 164 | elif isinstance(text, unicode): 165 | return text 166 | else: 167 | raise ValueError("Unsupported string type: %s" % (type(text))) 168 | else: 169 | raise ValueError("Not running on Python2 or Python 3?") 170 | 171 | def gen_word_set(file_path, out_path='./data/words.txt'): 172 | word_set = set() 173 | with open(file_path, encoding='utf-8') as f: 174 | for line in f.readlines(): 175 | spline = line.strip().split('\t') 176 | if len(spline) < 4: 177 | continue 178 | prefix, query_pred, title, tag, label = spline 179 | if label == '0': 180 | continue 181 | cur_arr = [prefix, title] 182 | query_pred = json.loads(query_pred) 183 | for w in prefix: 184 | word_set.add(w) 185 | for each in query_pred: 186 | for w in each: 187 | word_set.add(w) 188 | with open(word_set, 'w', encoding='utf-8') as o: 189 | for w in word_set: 190 | o.write(w + '\n') 191 | pass 192 | 193 | def convert_word2id(query, vocab_map): 194 | ids = [] 195 | for w in query: 196 | if w in vocab_map: 197 | ids.append(vocab_map[w]) 198 | else: 199 | ids.append(vocab_map[conf.unk]) 200 | while len(ids) < conf.max_seq_len: 201 | ids.append(vocab_map[conf.pad]) 202 | return ids[:conf.max_seq_len] 203 | 204 | 205 | def convert_seq2bow(query, vocab_map): 206 | bow_ids = np.zeros(conf.nwords) 207 | for w in query: 208 | if w in vocab_map: 209 | bow_ids[vocab_map[w]] += 1 210 | else: 211 | bow_ids[vocab_map[conf.unk]] += 1 212 | return bow_ids 213 | 214 | 215 | def get_data(file_path): 216 | """ 217 | gen datasets, convert word into word ids. 
218 | :param file_path: 219 | :return: [[query, pos sample, 4 neg sample]], shape = [n, 6] 220 | """ 221 | data_map = {'query': [], 'query_len': [], 'doc_pos': [], 'doc_pos_len': [], 'doc_neg': [], 'doc_neg_len': []} 222 | with open(file_path, encoding='utf8') as f: 223 | for line in f.readlines(): 224 | spline = line.strip().split('\t') 225 | if len(spline) < 4: 226 | continue 227 | prefix, query_pred, title, tag, label = spline 228 | if label == '0': 229 | continue 230 | cur_arr, cur_len = [], [] 231 | query_pred = json.loads(query_pred) 232 | # only 4 negative sample 233 | for each in query_pred: 234 | if each == title: 235 | continue 236 | cur_arr.append(convert_word2id(each, conf.vocab_map)) 237 | each_len = len(each) if len(each) < conf.max_seq_len else conf.max_seq_len 238 | cur_len.append(each_len) 239 | if len(cur_arr) >= 4: 240 | data_map['query'].append(convert_word2id(prefix, conf.vocab_map)) 241 | data_map['query_len'].append(len(prefix) if len(prefix) < conf.max_seq_len else conf.max_seq_len) 242 | data_map['doc_pos'].append(convert_word2id(title, conf.vocab_map)) 243 | data_map['doc_pos_len'].append(len(title) if len(title) < conf.max_seq_len else conf.max_seq_len) 244 | data_map['doc_neg'].extend(cur_arr[:4]) 245 | data_map['doc_neg_len'].extend(cur_len[:4]) 246 | pass 247 | return data_map 248 | 249 | 250 | def get_data_siamese_rnn(file_path): 251 | """ 252 | gen datasets, convert word into word ids. 253 | :param file_path: 254 | :return: [[query, pos sample, 4 neg sample]], shape = [n, 6] 255 | """ 256 | data_arr = [] 257 | with open(file_path, encoding='utf8') as f: 258 | for line in f.readlines(): 259 | spline = line.strip().split('\t') 260 | if len(spline) < 4: 261 | continue 262 | prefix, _, title, tag, label = spline 263 | prefix_seq = convert_word2id(prefix, conf.vocab_map) 264 | title_seq = convert_word2id(title, conf.vocab_map) 265 | data_arr.append([prefix_seq, title_seq, int(label)]) 266 | return data_arr 267 | 268 | 269 | def get_data_bow(file_path): 270 | """ 271 | gen datasets, convert word into word ids. 
272 | :param file_path: 273 | :return: [[query, prefix, label]], shape = [n, 3] 274 | """ 275 | data_arr = [] 276 | with open(file_path, encoding='utf8') as f: 277 | for line in f.readlines(): 278 | spline = line.strip().split('\t') 279 | if len(spline) < 4: 280 | continue 281 | prefix, _, title, tag, label = spline 282 | prefix_ids = convert_seq2bow(prefix, conf.vocab_map) 283 | title_ids = convert_seq2bow(title, conf.vocab_map) 284 | data_arr.append([prefix_ids, title_ids, int(label)]) 285 | return data_arr 286 | 287 | def trans_lcqmc(dataset): 288 | """ 289 | 最大长度 290 | """ 291 | out_arr, text_len = [], [] 292 | for each in dataset: 293 | t1, t2, label = each.text_a, each.text_b, int(each.label) 294 | t1_ids = convert_word2id(t1, conf.vocab_map) 295 | t1_len = conf.max_seq_len if len(t1) > conf.max_seq_len else len(t1) 296 | t2_ids = convert_word2id(t2, conf.vocab_map) 297 | t2_len = conf.max_seq_len if len(t2) > conf.max_seq_len else len(t2) 298 | # t2_len = len(t2) 299 | out_arr.append([t1_ids, t1_len, t2_ids, t2_len, label]) 300 | # out_arr.append([t1_ids, t1_len, t2_ids, t2_len, label, t1, t2]) 301 | text_len.extend([len(t1), len(t2)]) 302 | pass 303 | print("max len", max(text_len), "avg len", mean(text_len), "cover rate:", np.mean([x <= conf.max_seq_len for x in text_len])) 304 | return out_arr 305 | 306 | def get_lcqmc(): 307 | """ 308 | 使用LCQMC数据集,并将其转为word_id 309 | """ 310 | dataset = hub.dataset.LCQMC() 311 | train_set = trans_lcqmc(dataset.train_examples) 312 | dev_set = trans_lcqmc(dataset.dev_examples) 313 | test_set = trans_lcqmc(dataset.test_examples) 314 | return train_set, dev_set, test_set 315 | # return test_set, test_set, test_set 316 | 317 | def trans_lcqmc_bert(dataset:list, vocab:Vocabulary, is_merge=0): 318 | """ 319 | 最大长度 320 | """ 321 | out_arr, text_len = [], [] 322 | for each in dataset: 323 | t1, t2, label = each.text_a, each.text_b, int(each.label) 324 | if is_merge: 325 | out_ids1, mask_ids1, seg_ids1, seq_len1 = vocab._transform_2seq2bert_id(t1, t2, padding=1) 326 | out_arr.append([out_ids1, mask_ids1, seg_ids1, seq_len1, label]) 327 | text_len.extend([len(t1) + len(t2)]) 328 | else: 329 | out_ids1, mask_ids1, seg_ids1, seq_len1 = vocab._transform_seq2bert_id(t1, padding=1) 330 | out_ids2, mask_ids2, seg_ids2, seq_len2 = vocab._transform_seq2bert_id(t2, padding=1) 331 | out_arr.append([out_ids1, mask_ids1, seg_ids1, seq_len1, out_ids2, mask_ids2, seg_ids2, seq_len2, label]) 332 | text_len.extend([len(t1), len(t2)]) 333 | pass 334 | print("max len", max(text_len), "avg len", mean(text_len), "cover rate:", np.mean([x <= conf.max_seq_len for x in text_len])) 335 | return out_arr 336 | 337 | def get_lcqmc_bert(vocab:Vocabulary, is_merge=0): 338 | """ 339 | 使用LCQMC数据集,并将每个query其转为word_id, 340 | """ 341 | dataset = hub.dataset.LCQMC() 342 | train_set = trans_lcqmc_bert(dataset.train_examples, vocab, is_merge) 343 | dev_set = trans_lcqmc_bert(dataset.dev_examples, vocab, is_merge) 344 | test_set = trans_lcqmc_bert(dataset.test_examples, vocab, is_merge) 345 | return train_set, dev_set, test_set 346 | # test_set = test_set[:100] 347 | # return test_set, test_set, test_set 348 | 349 | def get_test(file_:str, vocab:Vocabulary): 350 | test_arr = read_file(file_, '\t') # [[q1, q2],...] 
351 | out_arr = [] 352 | for line in test_arr: 353 | if len(line) != 2: 354 | print('wrong line size=', len(line)) 355 | t1, t2 = line # [t1_ids, t1_len, t2_ids, t2_len, label] 356 | t1_ids = vocab._transform_seq2id(t1, padding=1) 357 | t1_len = vocab.max_len if len(t1) > vocab.max_len else len(t1) 358 | t2_ids = vocab._transform_seq2id(t2, padding=1) 359 | t2_len = vocab.max_len if len(t2) > vocab.max_len else len(t2) 360 | out_arr.append([t1_ids, t1_len, t2_ids, t2_len]) 361 | return out_arr, test_arr 362 | 363 | def get_test_bert(file_:str, vocab:Vocabulary, is_merge=0): 364 | test_arr = read_file(file_, '\t') # [[q1, q2],...] 365 | out_arr, _ = get_test_bert_by_arr(test_arr, vocab, is_merge) 366 | return out_arr, test_arr 367 | 368 | def get_test_bert_by_arr(test_arr:list, vocab:Vocabulary, is_merge=0): 369 | # test_arr # [[q1, q2],...] 370 | out_arr = [] 371 | for line in test_arr: 372 | if len(line) != 2: 373 | print('wrong line size=', len(line)) 374 | t1, t2 = line # [t1_ids, t1_len, t2_ids, t2_len, label] 375 | if is_merge: 376 | out_ids1, mask_ids1, seg_ids1, seq_len1 = vocab._transform_2seq2bert_id(t1, t2, padding=1) 377 | out_arr.append([out_ids1, mask_ids1, seg_ids1, seq_len1]) 378 | else: 379 | out_ids1, mask_ids1, seg_ids1, seq_len1 = vocab._transform_seq2bert_id(t1, padding=1) 380 | out_ids2, mask_ids2, seg_ids2, seq_len2 = vocab._transform_seq2bert_id(t2, padding=1) 381 | out_arr.append([out_ids1, mask_ids1, seg_ids1, seq_len1, out_ids2, mask_ids2, seg_ids2, seq_len2]) 382 | return out_arr, test_arr 383 | 384 | def get_test_bert_single(file_:str, vocab:Vocabulary, is_merge=0): 385 | test_arr = read_file(file_) # [q1,...] 386 | out_arr = [] 387 | for line in test_arr: 388 | t1 = line # [t1_ids, t1_len, t2_ids, t2_len, label] 389 | out_ids1, mask_ids1, seg_ids1, seq_len1 = vocab._transform_seq2bert_id(t1, padding=1) 390 | out_arr.append([out_ids1, mask_ids1, seg_ids1, seq_len1]) 391 | return out_arr, test_arr 392 | 393 | def get_batch(dataset, batch_size=None, is_test=0): 394 | # tf Dataset太难用,不如自己实现 395 | # https://stackoverflow.com/questions/50539342/getting-batches-in-tensorflow 396 | # dataset:每个元素是一个特征,[[x1, x2, x3,...], ...], 如果是测试集,可能就没有标签 397 | if not batch_size: 398 | batch_size = 32 399 | if not is_test: 400 | random.shuffle(dataset) 401 | steps = int(math.ceil(float(len(dataset)) / batch_size)) 402 | for i in range(steps): 403 | idx = i * batch_size 404 | cur_set = dataset[idx: idx + batch_size] 405 | cur_set = zip(*cur_set) 406 | yield cur_set 407 | 408 | 409 | if __name__ == '__main__': 410 | # prefix, query_prediction, title, tag, label 411 | # query_prediction 为json格式。 412 | file_train = './data/oppo_round1_train_20180929.txt' 413 | file_vali = './data/oppo_round1_vali_20180929.txt' 414 | # data_train = get_data(file_train) 415 | # data_train = get_data(file_vali) 416 | # print(len(data_train['query']), len(data_train['doc_pos']), len(data_train['doc_neg'])) 417 | dataset = get_lcqmc() 418 | print(dataset[1][:3]) 419 | for each in get_batch(dataset[1][:3], batch_size=2): 420 | t1_ids, t1_len, t2_ids, t2_len, label = each 421 | print(each) 422 | pass 423 | -------------------------------------------------------------------------------- /dssm.py: -------------------------------------------------------------------------------- 1 | # coding=utf8 2 | """ 3 | python=3.5 4 | TensorFlow=1.2.1 5 | """ 6 | 7 | import time 8 | import numpy as np 9 | import tensorflow as tf 10 | import data_input 11 | from config import Config 12 | import random 13 | 14 | 
random.seed(9102) 15 | 16 | start = time.time() 17 | # 是否加BN层 18 | norm, epsilon = False, 0.001 19 | 20 | # negative sample 21 | # query batch size 22 | query_BS = 100 23 | # batch size 24 | L1_N = 400 25 | L2_N = 120 26 | 27 | # 读取数据 28 | conf = Config() 29 | data_train = data_input.get_data_bow(conf.file_train) 30 | data_vali = data_input.get_data_bow(conf.file_vali) 31 | # print(len(data_train['query']), query_BS, len(data_train['query']) / query_BS) 32 | train_epoch_steps = int(len(data_train) / query_BS) - 1 33 | vali_epoch_steps = int(len(data_vali) / query_BS) - 1 34 | 35 | 36 | def add_layer(inputs, in_size, out_size, activation_function=None): 37 | wlimit = np.sqrt(6.0 / (in_size + out_size)) 38 | Weights = tf.Variable(tf.random_uniform([in_size, out_size], -wlimit, wlimit)) 39 | biases = tf.Variable(tf.random_uniform([out_size], -wlimit, wlimit)) 40 | Wx_plus_b = tf.matmul(inputs, Weights) + biases 41 | if activation_function is None: 42 | outputs = Wx_plus_b 43 | else: 44 | outputs = activation_function(Wx_plus_b) 45 | return outputs 46 | 47 | 48 | def mean_var_with_update(ema, fc_mean, fc_var): 49 | ema_apply_op = ema.apply([fc_mean, fc_var]) 50 | with tf.control_dependencies([ema_apply_op]): 51 | return tf.identity(fc_mean), tf.identity(fc_var) 52 | 53 | 54 | def batch_normalization(x, phase_train, out_size): 55 | """ 56 | Batch normalization on convolutional maps. 57 | Ref.: http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow 58 | Args: 59 | x: Tensor, 4D BHWD input maps 60 | out_size: integer, depth of input maps 61 | phase_train: boolean tf.Varialbe, true indicates training phase 62 | scope: string, variable scope 63 | Return: 64 | normed: batch-normalized maps 65 | """ 66 | with tf.variable_scope('bn'): 67 | beta = tf.Variable(tf.constant(0.0, shape=[out_size]), 68 | name='beta', trainable=True) 69 | gamma = tf.Variable(tf.constant(1.0, shape=[out_size]), 70 | name='gamma', trainable=True) 71 | batch_mean, batch_var = tf.nn.moments(x, [0], name='moments') 72 | ema = tf.train.ExponentialMovingAverage(decay=0.5) 73 | 74 | def mean_var_with_update(): 75 | ema_apply_op = ema.apply([batch_mean, batch_var]) 76 | with tf.control_dependencies([ema_apply_op]): 77 | return tf.identity(batch_mean), tf.identity(batch_var) 78 | 79 | mean, var = tf.cond(phase_train, 80 | mean_var_with_update, 81 | lambda: (ema.average(batch_mean), ema.average(batch_var))) 82 | normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3) 83 | return normed 84 | 85 | 86 | def variable_summaries(var, name): 87 | """Attach a lot of summaries to a Tensor.""" 88 | with tf.name_scope('summaries'): 89 | mean = tf.reduce_mean(var) 90 | tf.summary.scalar('mean/' + name, mean) 91 | with tf.name_scope('stddev'): 92 | stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean))) 93 | tf.summary.scalar('sttdev/' + name, stddev) 94 | tf.summary.scalar('max/' + name, tf.reduce_max(var)) 95 | tf.summary.scalar('min/' + name, tf.reduce_min(var)) 96 | tf.summary.histogram(name, var) 97 | 98 | 99 | def contrastive_loss(y, d, batch_size): 100 | tmp = y * tf.square(d) 101 | # tmp= tf.mul(y,tf.square(d)) 102 | tmp2 = (1 - y) * tf.square(tf.maximum((1 - d), 0)) 103 | reg = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(1e-4), tf.trainable_variables()) 104 | return tf.reduce_sum(tmp + tmp2) / batch_size / 2 + reg 105 | 106 | 107 | def get_cosine_score(query_arr, doc_arr): 108 | # query_norm = sqrt(sum(each x^2)) 109 | pooled_len_1 = 
tf.sqrt(tf.reduce_sum(tf.square(query_arr), 1)) 110 | pooled_len_2 = tf.sqrt(tf.reduce_sum(tf.square(doc_arr), 1)) 111 | pooled_mul_12 = tf.reduce_sum(tf.multiply(query_arr, doc_arr), 1) 112 | cos_scores = tf.div(pooled_mul_12, pooled_len_1 * pooled_len_2 + 1e-8, name="cos_scores") 113 | return cos_scores 114 | 115 | 116 | with tf.name_scope('input'): 117 | # 预测时只用输入query即可,将其embedding为向量。 118 | query_batch = tf.placeholder(tf.float32, shape=[None, None], name='query_batch') 119 | doc_batch = tf.placeholder(tf.float32, shape=[None, None], name='doc_batch') 120 | doc_label_batch = tf.placeholder(tf.float32, shape=[None], name='doc_label_batch') 121 | on_train = tf.placeholder(tf.bool) 122 | keep_prob = tf.placeholder(tf.float32, name='drop_out_prob') 123 | 124 | with tf.name_scope('FC1'): 125 | # 全连接网络 126 | query_l1 = add_layer(query_batch, conf.nwords, L1_N, activation_function=None) 127 | doc_l1 = add_layer(doc_batch, conf.nwords, L1_N, activation_function=None) 128 | 129 | with tf.name_scope('BN1'): 130 | query_l1 = batch_normalization(query_l1, on_train, L1_N) 131 | doc_l1 = batch_normalization(doc_l1, on_train, L1_N) 132 | query_l1 = tf.nn.relu(query_l1) 133 | doc_l1 = tf.nn.relu(doc_l1) 134 | 135 | with tf.name_scope('Drop_out'): 136 | query_l1 = tf.nn.dropout(query_l1, keep_prob) 137 | doc_l1 = tf.nn.dropout(doc_l1, keep_prob) 138 | 139 | with tf.name_scope('FC2'): 140 | query_l2 = add_layer(query_l1, L1_N, L2_N, activation_function=None) 141 | doc_l2 = add_layer(doc_l1, L1_N, L2_N, activation_function=None) 142 | 143 | with tf.name_scope('BN2'): 144 | query_l2 = batch_normalization(query_l2, on_train, L2_N) 145 | doc_l2 = batch_normalization(doc_l2, on_train, L2_N) 146 | query_l2 = tf.nn.relu(query_l2) 147 | doc_l2 = tf.nn.relu(doc_l2) 148 | 149 | query_pred = tf.nn.relu(query_l2) 150 | doc_pred = tf.nn.relu(doc_l2) 151 | 152 | # query_pred = tf.contrib.slim.batch_norm(query_l2, activation_fn=tf.nn.relu) 153 | 154 | with tf.name_scope('Cosine_Similarity'): 155 | # Cosine similarity 156 | cos_sim = get_cosine_score(query_pred, doc_pred) 157 | cos_sim_prob = tf.clip_by_value(cos_sim, 1e-8, 1.0) 158 | 159 | with tf.name_scope('Loss'): 160 | # Train Loss 161 | cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=doc_label_batch, logits=cos_sim) 162 | losses = tf.reduce_sum(cross_entropy) 163 | tf.summary.scalar('loss', losses) 164 | pass 165 | 166 | with tf.name_scope('Training'): 167 | # Optimizer 168 | train_step = tf.train.AdamOptimizer(conf.learning_rate).minimize(losses) 169 | pass 170 | 171 | # with tf.name_scope('Accuracy'): 172 | # correct_prediction = tf.equal(tf.argmax(prob, 1), 0) 173 | # accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 174 | # tf.summary.scalar('accuracy', accuracy) 175 | 176 | merged = tf.summary.merge_all() 177 | 178 | with tf.name_scope('Test'): 179 | average_loss = tf.placeholder(tf.float32) 180 | loss_summary = tf.summary.scalar('average_loss', average_loss) 181 | 182 | with tf.name_scope('Train'): 183 | train_average_loss = tf.placeholder(tf.float32) 184 | train_loss_summary = tf.summary.scalar('train_average_loss', train_average_loss) 185 | 186 | 187 | def pull_all(query_in, doc_positive_in, doc_negative_in): 188 | query_in = query_in.tocoo() 189 | doc_positive_in = doc_positive_in.tocoo() 190 | doc_negative_in = doc_negative_in.tocoo() 191 | query_in = tf.SparseTensorValue( 192 | np.transpose([np.array(query_in.row, dtype=np.int64), np.array(query_in.col, dtype=np.int64)]), 193 | np.array(query_in.data, dtype=np.float), 
194 | np.array(query_in.shape, dtype=np.int64)) 195 | doc_positive_in = tf.SparseTensorValue( 196 | np.transpose([np.array(doc_positive_in.row, dtype=np.int64), np.array(doc_positive_in.col, dtype=np.int64)]), 197 | np.array(doc_positive_in.data, dtype=np.float), 198 | np.array(doc_positive_in.shape, dtype=np.int64)) 199 | doc_negative_in = tf.SparseTensorValue( 200 | np.transpose([np.array(doc_negative_in.row, dtype=np.int64), np.array(doc_negative_in.col, dtype=np.int64)]), 201 | np.array(doc_negative_in.data, dtype=np.float), 202 | np.array(doc_negative_in.shape, dtype=np.int64)) 203 | 204 | return query_in, doc_positive_in, doc_negative_in 205 | 206 | 207 | def pull_batch(data_map, batch_id): 208 | query, title, label, dsize = range(4) 209 | cur_data = data_map[batch_id * query_BS:(batch_id + 1) * query_BS] 210 | query_in = [x[0] for x in cur_data] 211 | doc_in = [x[1] for x in cur_data] 212 | label = [x[2] for x in cur_data] 213 | 214 | # query_in, doc_positive_in, doc_negative_in = pull_all(query_in, doc_positive_in, doc_negative_in) 215 | return query_in, doc_in, label 216 | 217 | 218 | def feed_dict(on_training, data_set, batch_id, drop_prob): 219 | query_in, doc_in, label = pull_batch(data_set, batch_id) 220 | query_in, doc_in, label = np.array(query_in), np.array(doc_in), np.array(label) 221 | return {query_batch: query_in, doc_batch: doc_in, doc_label_batch: label, 222 | on_train: on_training, keep_prob: drop_prob} 223 | 224 | 225 | # config = tf.ConfigProto() # log_device_placement=True) 226 | # config.gpu_options.allow_growth = True 227 | # if not config.gpu: 228 | # config = tf.ConfigProto(device_count= {'GPU' : 0}) 229 | 230 | # 创建一个Saver对象,选择性保存变量或者模型。 231 | saver = tf.train.Saver() 232 | # with tf.Session(config=config) as sess: 233 | with tf.Session() as sess: 234 | sess.run(tf.global_variables_initializer()) 235 | train_writer = tf.summary.FileWriter(conf.summaries_dir + '/train', sess.graph) 236 | 237 | start = time.time() 238 | for epoch in range(conf.num_epoch): 239 | random.shuffle(data_train) 240 | for batch_id in range(train_epoch_steps): 241 | # print(batch_id) 242 | sess.run(train_step, feed_dict=feed_dict(True, data_train, batch_id, 0.5)) 243 | pass 244 | end = time.time() 245 | # train loss 246 | epoch_loss = 0 247 | for i in range(train_epoch_steps): 248 | loss_v = sess.run(losses, feed_dict=feed_dict(False, data_train, i, 1)) 249 | epoch_loss += loss_v 250 | 251 | epoch_loss /= (train_epoch_steps) 252 | train_loss = sess.run(train_loss_summary, feed_dict={train_average_loss: epoch_loss}) 253 | train_writer.add_summary(train_loss, epoch + 1) 254 | print("\nEpoch #%d | Train Loss: %-4.3f | PureTrainTime: %-3.3fs" % 255 | (epoch, epoch_loss, end - start)) 256 | 257 | # test loss 258 | start = time.time() 259 | epoch_loss = 0 260 | for i in range(vali_epoch_steps): 261 | loss_v = sess.run(losses, feed_dict=feed_dict(False, data_vali, i, 1)) 262 | epoch_loss += loss_v 263 | epoch_loss /= (vali_epoch_steps) 264 | test_loss = sess.run(loss_summary, feed_dict={average_loss: epoch_loss}) 265 | train_writer.add_summary(test_loss, epoch + 1) 266 | # test_writer.add_summary(test_loss, step + 1) 267 | print("Epoch #%d | Test Loss: %-4.3f | Calc_LossTime: %-3.3fs" % 268 | (epoch, epoch_loss, start - end)) 269 | 270 | # 保存模型 271 | save_path = saver.save(sess, "model/model_1.ckpt") 272 | print("Model saved in file: ", save_path) 273 | -------------------------------------------------------------------------------- /dssm_rnn.py: 
-------------------------------------------------------------------------------- 1 | # coding=utf8 2 | """ 3 | python=3.5 4 | TensorFlow=1.2.1 5 | """ 6 | 7 | import time 8 | import numpy as np 9 | import tensorflow as tf 10 | import data_input 11 | from config import Config 12 | import random 13 | 14 | random.seed(9102) 15 | 16 | start = time.time() 17 | # 是否加BN层 18 | norm, epsilon = False, 0.001 19 | 20 | # TRIGRAM_D = 21128 21 | TRIGRAM_D = 100 22 | # negative sample 23 | NEG = 4 24 | # query batch size 25 | query_BS = 100 26 | # batch size 27 | BS = query_BS * NEG 28 | 29 | # 读取数据 30 | conf = Config() 31 | data_train = data_input.get_data(conf.file_train) 32 | data_vali = data_input.get_data(conf.file_vali) 33 | # print(len(data_train['query']), query_BS, len(data_train['query']) / query_BS) 34 | train_epoch_steps = int(len(data_train['query']) / query_BS) - 1 35 | vali_epoch_steps = int(len(data_vali['query']) / query_BS) - 1 36 | 37 | 38 | def variable_summaries(var, name): 39 | """Attach a lot of summaries to a Tensor.""" 40 | with tf.name_scope('summaries'): 41 | mean = tf.reduce_mean(var) 42 | tf.summary.scalar('mean/' + name, mean) 43 | with tf.name_scope('stddev'): 44 | stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean))) 45 | tf.summary.scalar('sttdev/' + name, stddev) 46 | tf.summary.scalar('max/' + name, tf.reduce_max(var)) 47 | tf.summary.scalar('min/' + name, tf.reduce_min(var)) 48 | tf.summary.histogram(name, var) 49 | 50 | 51 | with tf.name_scope('input'): 52 | # 预测时只用输入query即可,将其embedding为向量。 53 | query_batch = tf.placeholder(tf.int32, shape=[None, None], name='query_batch') 54 | doc_pos_batch = tf.placeholder(tf.int32, shape=[None, None], name='doc_positive_batch') 55 | doc_neg_batch = tf.placeholder(tf.int32, shape=[None, None], name='doc_negative_batch') 56 | query_seq_length = tf.placeholder(tf.int32, shape=[None], name='query_sequence_length') 57 | pos_seq_length = tf.placeholder(tf.int32, shape=[None], name='pos_seq_length') 58 | neg_seq_length = tf.placeholder(tf.int32, shape=[None], name='neg_sequence_length') 59 | on_train = tf.placeholder(tf.bool) 60 | drop_out_prob = tf.placeholder(tf.float32, name='drop_out_prob') 61 | 62 | with tf.name_scope('word_embeddings_layer'): 63 | # 这里可以加载预训练词向量 64 | _word_embedding = tf.get_variable(name="word_embedding_arr", dtype=tf.float32, 65 | shape=[conf.nwords, TRIGRAM_D]) 66 | query_embed = tf.nn.embedding_lookup(_word_embedding, query_batch, name='query_batch_embed') 67 | doc_pos_embed = tf.nn.embedding_lookup(_word_embedding, doc_pos_batch, name='doc_positive_embed') 68 | doc_neg_embed = tf.nn.embedding_lookup(_word_embedding, doc_neg_batch, name='doc_negative_embed') 69 | 70 | with tf.name_scope('RNN'): 71 | # Abandon bag of words, use GRU, you can use stacked gru 72 | # query_l1 = add_layer(query_batch, TRIGRAM_D, L1_N, activation_function=None) # tf.nn.relu() 73 | # doc_positive_l1 = add_layer(doc_positive_batch, TRIGRAM_D, L1_N, activation_function=None) 74 | # doc_negative_l1 = add_layer(doc_negative_batch, TRIGRAM_D, L1_N, activation_function=None) 75 | if conf.use_stack_rnn: 76 | cell_fw = tf.contrib.rnn.GRUCell(conf.hidden_size_rnn, reuse=tf.AUTO_REUSE) 77 | stacked_gru_fw = tf.contrib.rnn.MultiRNNCell([cell_fw], state_is_tuple=True) 78 | cell_bw = tf.contrib.rnn.GRUCell(conf.hidden_size_rnn, reuse=tf.AUTO_REUSE) 79 | stacked_gru_bw = tf.contrib.rnn.MultiRNNCell([cell_fw], state_is_tuple=True) 80 | (output_fw, output_bw), (_, _) = tf.nn.bidirectional_dynamic_rnn(stacked_gru_fw, stacked_gru_bw) 81 | # not ready, 
to be continue ... 82 | else: 83 | cell_fw = tf.contrib.rnn.GRUCell(conf.hidden_size_rnn, reuse=tf.AUTO_REUSE) 84 | cell_bw = tf.contrib.rnn.GRUCell(conf.hidden_size_rnn, reuse=tf.AUTO_REUSE) 85 | # query 86 | (_, _), (query_output_fw, query_output_bw) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, query_embed, 87 | sequence_length=query_seq_length, 88 | dtype=tf.float32) 89 | query_rnn_output = tf.concat([query_output_fw, query_output_bw], axis=-1) 90 | query_rnn_output = tf.nn.dropout(query_rnn_output, drop_out_prob) 91 | # doc_pos 92 | (_, _), (doc_pos_output_fw, doc_pos_output_bw) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, 93 | doc_pos_embed, 94 | sequence_length=pos_seq_length, 95 | dtype=tf.float32) 96 | doc_pos_rnn_output = tf.concat([doc_pos_output_fw, doc_pos_output_bw], axis=-1) 97 | doc_pos_rnn_output = tf.nn.dropout(doc_pos_rnn_output, drop_out_prob) 98 | # doc_neg 99 | (_, _), (doc_neg_output_fw, doc_neg_output_bw) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, 100 | doc_neg_embed, 101 | sequence_length=neg_seq_length, 102 | dtype=tf.float32) 103 | doc_neg_rnn_output = tf.concat([doc_neg_output_fw, doc_neg_output_bw], axis=-1) 104 | doc_neg_rnn_output = tf.nn.dropout(doc_neg_rnn_output, drop_out_prob) 105 | 106 | with tf.name_scope('Merge_Negative_Doc'): 107 | # 合并负样本,tile可选择是否扩展负样本。 108 | # doc_y = tf.tile(doc_positive_y, [1, 1]) 109 | doc_y = tf.tile(doc_pos_rnn_output, [1, 1]) 110 | 111 | for i in range(NEG): 112 | for j in range(query_BS): 113 | # slice(input_, begin, size)切片API 114 | # doc_y = tf.concat([doc_y, tf.slice(doc_negative_y, [j * NEG + i, 0], [1, -1])], 0) 115 | doc_y = tf.concat([doc_y, tf.slice(doc_neg_rnn_output, [j * NEG + i, 0], [1, -1])], 0) 116 | 117 | with tf.name_scope('Cosine_Similarity'): 118 | # Cosine similarity 119 | # query_norm = sqrt(sum(each x^2)) 120 | query_norm = tf.tile(tf.sqrt(tf.reduce_sum(tf.square(query_rnn_output), 1, True)), [NEG + 1, 1]) 121 | # doc_norm = sqrt(sum(each x^2)) 122 | doc_norm = tf.sqrt(tf.reduce_sum(tf.square(doc_y), 1, True)) 123 | 124 | prod = tf.reduce_sum(tf.multiply(tf.tile(query_rnn_output, [NEG + 1, 1]), doc_y), 1, True) 125 | norm_prod = tf.multiply(query_norm, doc_norm) 126 | 127 | # cos_sim_raw = query * doc / (||query|| * ||doc||) 128 | cos_sim_raw = tf.truediv(prod, norm_prod) 129 | # gamma = 20 130 | cos_sim = tf.transpose(tf.reshape(tf.transpose(cos_sim_raw), [NEG + 1, query_BS])) * 20 131 | 132 | with tf.name_scope('Loss'): 133 | # Train Loss 134 | # 转化为softmax概率矩阵。 135 | prob = tf.nn.softmax(cos_sim) 136 | # 只取第一列,即正样本列概率。 137 | hit_prob = tf.slice(prob, [0, 0], [-1, 1]) 138 | loss = -tf.reduce_sum(tf.log(hit_prob)) 139 | tf.summary.scalar('loss', loss) 140 | 141 | with tf.name_scope('Training'): 142 | # Optimizer 143 | train_step = tf.train.AdamOptimizer(conf.learning_rate).minimize(loss) 144 | 145 | # with tf.name_scope('Accuracy'): 146 | # correct_prediction = tf.equal(tf.argmax(prob, 1), 0) 147 | # accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 148 | # tf.summary.scalar('accuracy', accuracy) 149 | 150 | merged = tf.summary.merge_all() 151 | 152 | with tf.name_scope('Test'): 153 | average_loss = tf.placeholder(tf.float32) 154 | loss_summary = tf.summary.scalar('average_loss', average_loss) 155 | 156 | with tf.name_scope('Train'): 157 | train_average_loss = tf.placeholder(tf.float32) 158 | train_loss_summary = tf.summary.scalar('train_average_loss', train_average_loss) 159 | 160 | 161 | def pull_batch(data_map, batch_id): 162 | query_in = 
data_map['query'][batch_id * query_BS:(batch_id + 1) * query_BS] 163 | query_len = data_map['query_len'][batch_id * query_BS:(batch_id + 1) * query_BS] 164 | doc_positive_in = data_map['doc_pos'][batch_id * query_BS:(batch_id + 1) * query_BS] 165 | doc_positive_len = data_map['doc_pos_len'][batch_id * query_BS:(batch_id + 1) * query_BS] 166 | doc_negative_in = data_map['doc_neg'][batch_id * query_BS * NEG:(batch_id + 1) * query_BS * NEG] 167 | doc_negative_len = data_map['doc_neg_len'][batch_id * query_BS * NEG:(batch_id + 1) * query_BS * NEG] 168 | 169 | # query_in, doc_positive_in, doc_negative_in = pull_all(query_in, doc_positive_in, doc_negative_in) 170 | return query_in, doc_positive_in, doc_negative_in, query_len, doc_positive_len, doc_negative_len 171 | 172 | 173 | def feed_dict(on_training, data_set, batch_id, drop_prob): 174 | query_in, doc_positive_in, doc_negative_in, query_seq_len, pos_seq_len, neg_seq_len = pull_batch(data_set, 175 | batch_id) 176 | query_len = len(query_in) 177 | query_seq_len = [conf.max_seq_len] * query_len 178 | pos_seq_len = [conf.max_seq_len] * query_len 179 | neg_seq_len = [conf.max_seq_len] * query_len * NEG 180 | return {query_batch: query_in, doc_pos_batch: doc_positive_in, doc_neg_batch: doc_negative_in, 181 | on_train: on_training, drop_out_prob: drop_prob, query_seq_length: query_seq_len, 182 | neg_seq_length: neg_seq_len, pos_seq_length: pos_seq_len} 183 | 184 | 185 | # config = tf.ConfigProto() # log_device_placement=True) 186 | # config.gpu_options.allow_growth = True 187 | # if not config.gpu: 188 | # config = tf.ConfigProto(device_count= {'GPU' : 0}) 189 | 190 | # 创建一个Saver对象,选择性保存变量或者模型。 191 | saver = tf.train.Saver() 192 | # with tf.Session(config=config) as sess: 193 | with tf.Session() as sess: 194 | sess.run(tf.global_variables_initializer()) 195 | train_writer = tf.summary.FileWriter(conf.summaries_dir + '/train', sess.graph) 196 | 197 | start = time.time() 198 | for epoch in range(conf.num_epoch): 199 | batch_ids = [i for i in range(train_epoch_steps)] 200 | random.shuffle(batch_ids) 201 | for batch_id in batch_ids: 202 | # print(batch_id) 203 | sess.run(train_step, feed_dict=feed_dict(True, data_train, batch_id, 0.5)) 204 | end = time.time() 205 | # train loss 206 | epoch_loss = 0 207 | for i in range(train_epoch_steps): 208 | loss_v = sess.run(loss, feed_dict=feed_dict(False, data_train, i, 1)) 209 | epoch_loss += loss_v 210 | 211 | epoch_loss /= (train_epoch_steps) 212 | train_loss = sess.run(train_loss_summary, feed_dict={train_average_loss: epoch_loss}) 213 | train_writer.add_summary(train_loss, epoch + 1) 214 | print("\nEpoch #%d | Train Loss: %-4.3f | PureTrainTime: %-3.3fs" % 215 | (epoch, epoch_loss, end - start)) 216 | 217 | # test loss 218 | start = time.time() 219 | epoch_loss = 0 220 | for i in range(vali_epoch_steps): 221 | loss_v = sess.run(loss, feed_dict=feed_dict(False, data_vali, i, 1)) 222 | epoch_loss += loss_v 223 | epoch_loss /= (vali_epoch_steps) 224 | test_loss = sess.run(loss_summary, feed_dict={average_loss: epoch_loss}) 225 | train_writer.add_summary(test_loss, epoch + 1) 226 | # test_writer.add_summary(test_loss, step + 1) 227 | print("Epoch #%d | Test Loss: %-4.3f | Calc_LossTime: %-3.3fs" % 228 | (epoch, epoch_loss, start - end)) 229 | 230 | # 保存模型 231 | save_path = saver.save(sess, "model/model_1.ckpt") 232 | print("Model saved in file: ", save_path) 233 | -------------------------------------------------------------------------------- /flask_server.py: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #encoding=utf-8 3 | ''' 4 | @Time : 2020/11/02 00:06:44 5 | @Author : Zhiyang.zzy 6 | @Contact : zhiyangchou@gmail.com 7 | @Desc : 8 | ''' 9 | 10 | # here put the import lib 11 | from model.bert_classifier import BertClassifier 12 | import os 13 | import time 14 | from numpy.lib.arraypad import pad 15 | from tensorflow.python.ops.gen_io_ops import write_file 16 | import yaml 17 | import logging 18 | import argparse 19 | logging.basicConfig(level=logging.INFO) 20 | import data_input 21 | from config import Config 22 | from model.siamese_network import SiamenseRNN, SiamenseBert 23 | from data_input import Vocabulary, get_test 24 | from util import write_file 25 | from flask import Flask 26 | app = Flask(__name__) 27 | 28 | @app.route('/hello//') 29 | def hello_world(q1, q2): 30 | # print('Hello World! %s, %s' % (q1, q2)) 31 | test_arr, query_arr = data_input.get_test_bert_by_arr([[q1, q2]], vocab, is_merge=1) 32 | # print("test_arr:", test_arr) 33 | test_label, test_prob = model.predict(test_arr) 34 | # print("test label", test_label) 35 | return 'Hello World! {}:{}'.format(q1 + "-" + q2, test_prob[0]) 36 | 37 | if __name__ == '__main__': 38 | # 读取配置 39 | # conf = Config() 40 | cfg_path = "./configs/bert_classify.yml" 41 | cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader) 42 | # vocab: 将 seq转为id, 43 | vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]') 44 | # 读取数据 45 | # test_arr, query_arr = data_input.get_test_bert(file_, vocab, is_merge=1) 46 | # print("test size:{}".format(len(test_arr))) 47 | model = BertClassifier(cfg) 48 | model.restore_session(cfg["checkpoint_dir"]) 49 | app.run() 50 | # 输入url测试,例如:http://127.0.0.1:5000/hello/今天天气/明天天气 -------------------------------------------------------------------------------- /model/base_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding=utf-8 3 | ''' 4 | Author: zhiyang.zzy 5 | Date: 2020-10-25 11:07:55 6 | Contact: zhiyangchou@gmail.com 7 | FilePath: /dssm/base_model.py 8 | Desc: 基础模型,包含基本功能 9 | ''' 10 | # here put the import lib 11 | 12 | 13 | # here put the import lib 14 | import numpy as np 15 | import os 16 | import tensorflow as tf 17 | import nni 18 | # from tensorflow.python.ops import rnn_cell_impl as core_rnn_cell 19 | import logging 20 | from collections import defaultdict 21 | from .bert import modeling_v1 as modeling, tokenization, optimization 22 | # logging.basicConfig(level=logging.DEBUG) 23 | 24 | 25 | class TriplteLoss(object): 26 | # https://blog.csdn.net/u013082989/article/details/83537370 27 | @staticmethod 28 | def _pairwise_distance(embeddings, squared=False): 29 | ''' 30 | 计算两两embedding的距离 31 | ------------------------------------------ 32 | Args: 33 | embedding: 特征向量, 大小(batch_size, vector_size) 34 | squared: 是否距离的平方,即欧式距离 35 | 36 | Returns: 37 | distances: 两两embeddings的距离矩阵,大小 (batch_size, batch_size) 38 | ''' 39 | # 矩阵相乘,得到(batch_size, batch_size),因为计算欧式距离|a-b|^2 = a^2 -2ab + b^2, 40 | # 其中 ab 可以用矩阵乘表示 41 | dot_product = tf.matmul(embeddings, tf.transpose(embeddings)) 42 | # dot_product对角线部分就是 每个embedding的平方 43 | square_norm = tf.diag_part(dot_product) 44 | # |a-b|^2 = a^2 - 2ab + b^2 45 | # tf.expand_dims(square_norm, axis=1)是(batch_size, 1)大小的矩阵,减去 (batch_size, batch_size)大小的矩阵,相当于每一列操作 46 | distances = tf.expand_dims( 47 | square_norm, 
axis=1) - 2.0 * dot_product + tf.expand_dims(square_norm, axis=0) 48 | distances = tf.maximum(distances, 0.0) # 小于0的距离置为0 49 | if not squared: # 如果不平方,就开根号,但是注意有0元素,所以0的位置加上 1e*-16 50 | distances = distances + mask * 1e-16 51 | distances = tf.sqrt(distances) 52 | distances = distances * (1.0 - mask) # 0的部分仍然置为0 53 | return distances 54 | @staticmethod 55 | def _get_triplet_mask(labels): 56 | ''' 57 | 得到一个3D的mask [a, p, n], 对应triplet(a, p, n)是valid的位置是True 58 | ---------------------------------- 59 | Args: 60 | labels: 对应训练数据的labels, shape = (batch_size,) 61 | 62 | Returns: 63 | mask: 3D,shape = (batch_size, batch_size, batch_size) 64 | 65 | ''' 66 | 67 | # 初始化一个二维矩阵,坐标(i, j)不相等置为1,得到indices_not_equal 68 | indices_equal = tf.cast(tf.eye(tf.shape(labels)[0]), tf.bool) 69 | indices_not_equal = tf.logical_not(indices_equal) 70 | # 因为最后得到一个3D的mask矩阵(i, j, k),增加一个维度,则 i_not_equal_j 在第三个维度增加一个即,(batch_size, batch_size, 1), 其他同理 71 | i_not_equal_j = tf.expand_dims(indices_not_equal, 2) 72 | i_not_equal_k = tf.expand_dims(indices_not_equal, 1) 73 | j_not_equal_k = tf.expand_dims(indices_not_equal, 0) 74 | # 想得到i!=j!=k, 三个不等取and即可, 最后可以得到当下标(i, j, k)不相等时才取True 75 | distinct_indices = tf.logical_and(tf.logical_and( 76 | i_not_equal_j, i_not_equal_k), j_not_equal_k) 77 | 78 | # 同样根据labels得到对应i=j, i!=k 79 | label_equal = tf.equal(tf.expand_dims(labels, 0), 80 | tf.expand_dims(labels, 1)) 81 | i_equal_j = tf.expand_dims(label_equal, 2) 82 | i_equal_k = tf.expand_dims(label_equal, 1) 83 | valid_labels = tf.logical_and(i_equal_j, tf.logical_not(i_equal_k)) 84 | # mask即为满足上面两个约束,所以两个3D取and 85 | mask = tf.logical_and(distinct_indices, valid_labels) 86 | return mask 87 | @staticmethod 88 | def batch_all_triplet_loss(labels, embeddings, margin, squared=False): 89 | ''' 90 | triplet loss of a batch 91 | ------------------------------- 92 | Args: 93 | labels: 标签数据,shape = (batch_size,) 94 | embeddings: 提取的特征向量, shape = (batch_size, vector_size) 95 | margin: margin大小, scalar 96 | 97 | Returns: 98 | triplet_loss: scalar, 一个batch的损失值 99 | fraction_postive_triplets : valid的triplets占的比例 100 | ''' 101 | # 得到每两两embeddings的距离,然后增加一个维度,一维需要得到(batch_size, batch_size, batch_size)大小的3D矩阵 102 | # 然后再点乘上valid 的 mask即可 103 | pairwise_dis = _pairwise_distance(embeddings, squared=squared) 104 | anchor_positive_dist = tf.expand_dims(pairwise_dis, 2) 105 | assert anchor_positive_dist.shape[2] == 1, "{}".format( 106 | anchor_positive_dist.shape) 107 | anchor_negative_dist = tf.expand_dims(pairwise_dis, 1) 108 | assert anchor_negative_dist.shape[1] == 1, "{}".format( 109 | anchor_negative_dist.shape) 110 | triplet_loss = anchor_positive_dist - anchor_negative_dist + margin 111 | 112 | mask = _get_triplet_mask(labels) 113 | mask = tf.to_float(mask) 114 | triplet_loss = tf.multiply(mask, triplet_loss) 115 | triplet_loss = tf.maximum(triplet_loss, 0.0) 116 | 117 | # 计算valid的triplet的个数,然后对所有的triplet loss求平均 118 | valid_triplets = tf.to_float(tf.greater(triplet_loss, 1e-16)) 119 | num_positive_triplets = tf.reduce_sum(valid_triplets) 120 | num_valid_triplets = tf.reduce_sum(mask) 121 | fraction_postive_triplets = num_positive_triplets / \ 122 | (num_valid_triplets + 1e-16) 123 | 124 | triplet_loss = tf.reduce_sum(triplet_loss) / \ 125 | (num_positive_triplets + 1e-16) 126 | return triplet_loss, fraction_postive_triplets 127 | 128 | 129 | class BaseModel(object): 130 | def __init__(self, cfg, is_training=1): 131 | # config来自于yml文件。 132 | self.cfg = cfg 133 | # 通过cfg 解析出多少个 word, intent, action, 等 134 | # if not is_training: dropout=0 135 
| self.is_training = is_training 136 | if not is_training: 137 | self.cfg['dropout'] = 0 138 | self.build() 139 | 140 | def __del__(self): 141 | # self.sess.close() 142 | pass 143 | 144 | def _init_session(self): 145 | # https://zhuanlan.zhihu.com/p/78998468 146 | config = tf.ConfigProto() 147 | config.gpu_options.allow_growth = True 148 | self.sess = tf.Session(config=config) 149 | self.sess.run(tf.global_variables_initializer()) 150 | self.sess.run(tf.tables_initializer()) 151 | # saver = tf.train.Saver(max_to_keep=None) 152 | self.saver = tf.train.Saver() 153 | 154 | def restore_session(self, dir_model): 155 | print("Reloading the latest trained model...") 156 | self.saver.restore(self.sess, dir_model) 157 | 158 | def _add_summary(self): 159 | self.merged = tf.summary.merge_all() 160 | if not os.path.exists(self.cfg['summaries_dir']): 161 | os.makedirs(self.cfg['summaries_dir']) 162 | self.file_writer = tf.summary.FileWriter( 163 | self.cfg['summaries_dir'], self.sess.graph) 164 | 165 | def save_session(self): 166 | if not os.path.exists(self.cfg['checkpoint_dir']): 167 | os.makedirs(self.cfg['checkpoint_dir']) 168 | self.saver.save(self.sess, self.cfg['checkpoint_dir']) 169 | 170 | def init_from_pre_dir(self, pre_dir): 171 | tvars = tf.trainable_variables() 172 | (assignment, init_variable_names) = modeling.get_assignment_map_from_checkpoint( 173 | tvars, pre_dir) 174 | tf.train.init_from_checkpoint(pre_dir, assignment) 175 | 176 | @staticmethod 177 | def get_params_count(): 178 | params_count = np.sum([np.prod(v.get_shape().as_list()) 179 | for v in tf.trainable_variables()]) 180 | print("params_count", params_count) 181 | return params_count 182 | 183 | #################### 基本功能: fit, evaluate, predict ##################### 184 | def fit(self, train, dev, test=None): 185 | ''' 186 | @description: 模型训练 187 | @param {type} 188 | @return: 189 | ''' 190 | best_score, nepoch_no_imprv = -1, 0 191 | for epoch in range(self.cfg["num_epoch"]): 192 | print("Epoch {:} out of {:}".format( 193 | epoch + 1, self.cfg["num_epoch"])) 194 | score = self.run_epoch(epoch, train, dev) 195 | if score > best_score: 196 | nepoch_no_imprv = 0 197 | self.save_session() 198 | best_score = score 199 | print("- new best score!") 200 | if test: 201 | test_acc = self.eval(test) 202 | # self.print_eval_result(test_result) 203 | print("test sf acc:{}".format(test_acc)) 204 | else: 205 | nepoch_no_imprv += 1 206 | if nepoch_no_imprv >= self.cfg["epoch_no_imprv"]: 207 | print( 208 | "- early stopping {} epoches without improvement".format(nepoch_no_imprv)) 209 | nni.report_final_result(best_score) 210 | break 211 | pass 212 | pass 213 | 214 | def eval(self, test): 215 | ''' 216 | @description: 测试集评测 217 | @param {type} 218 | @return: 219 | ''' 220 | pass 221 | 222 | def predict(self): 223 | ''' 224 | @description: 无标注数据评测 225 | @param {type} 226 | @return: 227 | ''' 228 | pass 229 | 230 | #################### 模型模块 ##################### 231 | def _state_lstm(self, input_emb, input_length, initial_state, hidden_size, variable_scope="StateLSTM"): 232 | with tf.variable_scope(variable_scope): 233 | cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True) 234 | cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True) 235 | initial_state = tf.nn.rnn_cell.LSTMStateTuple( 236 | initial_state, initial_state) 237 | _output = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, input_emb, 238 | sequence_length=input_length, 239 | dtype=tf.float32, 240 | initial_state_fw=initial_state, 241 | 
initial_state_bw=initial_state) 242 | (output_fw, output_bw), ((_, state_fw), (_, state_bw)) = _output 243 | output = tf.concat([output_fw, output_bw], axis=-1) 244 | state = tf.concat([state_fw, state_bw], axis=-1) 245 | 246 | return output, state 247 | 248 | def _concat_lstm(self, input_emb, input_length, extra_emb, hidden_size, variable_scope="ConcatLSTM"): 249 | """ 250 | input_emb: [batch_size, nstep, hidden_size] 251 | extra_emb: [batch_size, hidden_size] 252 | """ 253 | with tf.variable_scope(variable_scope): 254 | nstep = input_emb.shape[1].value 255 | # [batch_size, nstep, hidden_size] 256 | expand_extra_emb = tf.tile(tf.expand_dims( 257 | extra_emb, axis=1), multiples=[1, nstep, 1]) 258 | input_emb = tf.concat([input_emb, expand_extra_emb], axis=-1) 259 | 260 | cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True) 261 | cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True) 262 | _output = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, input_emb, 263 | sequence_length=input_length, 264 | dtype=tf.float32) 265 | (output_fw, output_bw), ((_, state_fw), (_, state_bw)) = _output 266 | output = tf.concat([output_fw, output_bw], axis=-1) 267 | state = tf.concat([state_fw, state_bw], axis=-1) 268 | 269 | return output, state 270 | 271 | def _train_op(self): 272 | lr_m = self.cfg['optimizer'].lower() 273 | with tf.variable_scope("train_op"): 274 | optimizer = self._get_optimizer(lr_m) 275 | grads_and_vars = optimizer.compute_gradients(self.loss) 276 | for grad, var in grads_and_vars: 277 | # grad = tf.Print(grad, [grad], "{} grad: ".format(var.name)) 278 | if grad is not None: 279 | tf.summary.histogram(var.op.name + "/gradients", grad) 280 | if self.cfg['clip'] > 0: 281 | grads, variables = zip(*grads_and_vars) 282 | grads, gnorm = tf.clip_by_global_norm(grads, self.cfg['clip']) 283 | self.train_op = optimizer.apply_gradients(zip(grads, variables), 284 | global_step=tf.train.get_global_step()) 285 | else: 286 | self.train_op = optimizer.minimize( 287 | self.loss, global_step=tf.train.get_global_step()) 288 | 289 | #################### 基础模块 ##################### 290 | def _add_word_embedding_matrix(self,): 291 | # 如果有预训练矩阵,从其中导入 292 | self.embedding_file = self.cfg['meta_dir'] + \ 293 | self.cfg.get('embedding_trimmed', None) 294 | if self.embedding_file and self.cfg["use_pretrained"]: 295 | embedding_matrix = np.load(self.embedding_file)["embeddings"] 296 | self.embedding_matrix = tf.Variable( 297 | embedding_matrix, name='embedding_matrix', dtype=tf.float32) 298 | pass 299 | else: 300 | self.embedding_matrix = tf.get_variable(name="embedding_matrix", 301 | dtype=tf.float32, 302 | shape=[self.cfg["word_num"], self.cfg["embedding_dim"]]) 303 | 304 | def add_bert_layer(self, use_bert_pre=1): 305 | self.bert_config = modeling.BertConfig.from_json_file( 306 | self.cfg["bert_dir"] + self.cfg["bert_config"]) 307 | bert_model = modeling.BertModel( 308 | config=self.bert_config, 309 | is_training=self.is_train_place, 310 | input_ids=self.query_ids, 311 | input_mask=self.mask_ids, 312 | token_type_ids=self.seg_ids, 313 | use_one_hot_embeddings=False) 314 | 315 | if use_bert_pre: 316 | tvars = tf.trainable_variables() 317 | bert_init_dir = self.cfg["bert_dir"] + \ 318 | self.cfg["bert_init_checkpoint"] 319 | (assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, 320 | bert_init_dir) 321 | tf.train.init_from_checkpoint(bert_init_dir, assignment) 322 | 323 | bert_output_seq_ori = bert_model.get_sequence_output() 324 | 
bert_output_shape = tf.shape(bert_output_seq_ori) 325 | self.bert_output_seq_ori = bert_output_seq_ori 326 | # bs, seq, 768 327 | bert_output_seq = tf.strided_slice( 328 | bert_output_seq_ori, [0, 1, 0], bert_output_shape, [1, 1, 1]) 329 | nsteps = tf.shape(bert_output_seq)[1] 330 | self.bert_output_seq = tf.reshape( 331 | bert_output_seq, [-1, nsteps, self.bert_config.hidden_size]) 332 | self.cls_output = bert_model.get_pooled_output() 333 | self.embedding_table = bert_model.embedding_table 334 | # mask onehot 335 | bert_mask_shape = tf.shape(self.mask_ids) 336 | self.seq_mask_ids = tf.strided_slice( 337 | self.mask_ids, [0, 1], bert_mask_shape, [1, 1]) 338 | self.word_mask_ids = tf.expand_dims( 339 | tf.cast(self.seq_mask_ids, tf.float32), -1) 340 | 341 | def share_bert_layer(self, is_train_place, query_ids, mask_ids, seg_ids, use_bert_pre=1): 342 | self.bert_config = modeling.BertConfig.from_json_file( 343 | self.cfg["bert_dir"] + self.cfg["bert_config"]) 344 | bert_model = modeling.BertModel( 345 | config=self.bert_config, 346 | is_training=is_train_place, 347 | input_ids=query_ids, 348 | input_mask=mask_ids, 349 | token_type_ids=seg_ids, 350 | use_one_hot_embeddings=False, 351 | scope="bert") 352 | if use_bert_pre: 353 | tvars = tf.trainable_variables() 354 | bert_init_dir = self.cfg["bert_dir"] + \ 355 | self.cfg["bert_init_checkpoint"] 356 | (assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, 357 | bert_init_dir) 358 | tf.train.init_from_checkpoint(bert_init_dir, assignment) 359 | bert_output_seq = bert_model.get_sequence_output() 360 | 361 | # 默认使用cls输出 362 | pooled = bert_model.get_pooled_output() 363 | embedding_table = bert_model.embedding_table 364 | input_mask_ = tf.cast(tf.expand_dims(mask_ids, axis=-1), dtype=tf.float32) 365 | if self.cfg['sentence_embedding_type'] == "avg": 366 | # 最后一层avg pooling 367 | pooled = tf.reduce_sum(bert_output_seq * input_mask_, axis=1) / tf.reduce_sum(input_mask_, axis=1) 368 | elif self.cfg['sentence_embedding_type'].startswith("avg-last-last-"): 369 | # 使用最后的第n层 avg pooling 370 | n_last = int(self.cfg['sentence_embedding_type'][-1]) 371 | sequence = bert_model.all_encoder_layers[-n_last] # [batch_size, seq_length, hidden_size] 372 | pooled = tf.reduce_sum(sequence * input_mask_, axis=1) / tf.reduce_sum(input_mask_, axis=1) 373 | elif self.cfg['sentence_embedding_type'].startswith("avg-last-"): 374 | # 使用最后的n层 avg pooling 375 | pooled = 0 376 | n_last = int(self.cfg['sentence_embedding_type'][-1]) 377 | for i in range(n_last): 378 | sequence = bert_model.all_encoder_layers[-i] 379 | pooled += tf.reduce_sum(sequence * input_mask_, axis=1) / tf.reduce_sum(input_mask_, axis=1) 380 | pooled /= float(n_last) 381 | elif self.cfg['sentence_embedding_type'].startswith("avg-last-concat-"): 382 | pooled = [] 383 | n_last = int(self.cfg['sentence_embedding_type'][-1]) 384 | for i in range(n_last): 385 | sequence = bert_model.all_encoder_layers[-i] 386 | pooled += [tf.reduce_sum(sequence * input_mask_, axis=1) / tf.reduce_sum(input_mask_, axis=1)] 387 | pooled = tf.concat(pooled, axis=-1) 388 | return pooled, bert_output_seq, embedding_table 389 | 390 | def _dropout(self, input_emb, ratio=None): 391 | if not self.is_training: 392 | return input_emb 393 | if ratio: 394 | return tf.layers.dropout(input_emb, ratio) 395 | else: 396 | return tf.layers.dropout(input_emb, self.cfg['dropout']) 397 | 398 | def _bigru(self, input_emb, input_length, hidden_size, variable_scope="BiGRU"): 399 | with 
tf.variable_scope(variable_scope): 400 | cell_fw = tf.nn.rnn_cell.GRUCell(hidden_size) 401 | cell_bw = tf.nn.rnn_cell.GRUCell(hidden_size) 402 | outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, input_emb, 403 | input_length, dtype=tf.float32) 404 | return tf.concat(outputs, axis=-1), tf.concat(states, axis=-1) 405 | 406 | def _bilstm(self, input_emb, input_length, hidden_size, variable_scope="BilSTM"): 407 | with tf.variable_scope(variable_scope): 408 | cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True) 409 | cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True) 410 | _output = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, input_emb, 411 | input_length, dtype=tf.float32) 412 | (output_fw, output_bw), ((_, state_fw), (_, state_bw)) = _output 413 | 414 | return tf.concat([output_fw, output_bw], axis=-1), tf.concat([state_fw, state_bw], axis=-1) 415 | 416 | def _iterable_dilated_cnn(self, embeddings): 417 | """ 418 | :param embeddings: [batch_size, steps, embedding_dim] 419 | :return: 420 | """ 421 | embedding_dim = embeddings.get_shape()[-1] 422 | with tf.variable_scope("id_cnn"): 423 | cnn_input = tf.expand_dims(embeddings, 1) 424 | initial_layer_filter_shape = [ 425 | 1, self.cfg.filter_width, embedding_dim, self.cfg.filter_num] 426 | initial_layer_w = tf.get_variable("initial_layer_w", shape=initial_layer_filter_shape, 427 | initializer=tf.contrib.layers.xavier_initializer()) 428 | initial_layer_b = tf.get_variable("initial_layer_b", 429 | initializer=tf.constant(0.01, shape=[self.cfg.filter_num])) 430 | initial_layer_output = tf.nn.conv2d(cnn_input, initial_layer_w, strides=[1, 1, 1, 1], 431 | padding="SAME", name="initial_layer") 432 | initial_layer_output = tf.nn.relu(tf.nn.bias_add( 433 | initial_layer_output, initial_layer_b), name="relu") 434 | 435 | atrous_input = initial_layer_output 436 | atrous_layers_output = [] 437 | atrous_layers_output_dim = 0 438 | for block in range(self.cfg.repeat_times): 439 | for i in range(len(self.cfg.idcnn_layers)): 440 | layer_name = "conv_{}".format(i) 441 | dilation = self.cfg.idcnn_layers[i] 442 | with tf.variable_scope("atrous_conv_{}".format(i), reuse=tf.AUTO_REUSE): 443 | filter_shape = [1, self.cfg.filter_width, 444 | self.cfg.filter_num, self.cfg.filter_num] 445 | conv_w = tf.get_variable("{}_w".format(layer_name), shape=filter_shape, 446 | initializer=tf.contrib.layers.xavier_initializer()) 447 | conv_b = tf.get_variable("{}_b".format( 448 | layer_name), shape=[self.cfg.filter_num]) 449 | conv_output = tf.nn.convolution(atrous_input, conv_w, dilation_rate=[1, dilation], 450 | padding="SAME", name=layer_name) 451 | conv_output = tf.nn.relu( 452 | tf.nn.bias_add(conv_output, conv_b)) 453 | if i == len(self.cfg.idcnn_layers) - 1: 454 | atrous_layers_output.append(conv_output) 455 | atrous_layers_output_dim += self.cfg.filter_num 456 | atrous_input = conv_output 457 | output = tf.concat(axis=3, values=atrous_layers_output) 458 | return tf.squeeze(output, [1]) 459 | 460 | def add_train_op(self, learning_method, learning_rate, loss, clip=-1): 461 | learning_rate = tf.train.exponential_decay(learning_rate=learning_rate, 462 | global_step=tf.train.get_or_create_global_step(), 463 | decay_steps=self.cfg['decay_step'], 464 | decay_rate=self.cfg['lr_decay']) 465 | update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) 466 | _lr_m = learning_method.lower() 467 | with tf.variable_scope("train_step"): 468 | if _lr_m == "adam": 469 | optimizer = tf.train.AdamOptimizer(learning_rate) 470 | elif _lr_m 
== 'lazyadam': 471 | optimizer = tf.contrib.opt.LazyAdamOptimizer( 472 | learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8) 473 | elif _lr_m == "adagrad": 474 | optimizer = tf.train.AdagradOptimizer(learning_rate) 475 | elif _lr_m == "sgd": 476 | optimizer = tf.train.GradientDescentOptimizer(learning_rate) 477 | elif _lr_m == "rmsprop": 478 | optimizer = tf.train.RMSPropOptimizer(learning_rate) 479 | else: 480 | raise NotImplementedError("Unknown method {}".format(_lr_m)) 481 | with tf.control_dependencies(update_ops): 482 | if clip > 0: 483 | grads, variables = zip(*optimizer.compute_gradients(loss)) 484 | grads, gnorm = tf.clip_by_global_norm(grads, clip) 485 | self.train_op = optimizer.apply_gradients(zip(grads, variables), 486 | global_step=tf.train.get_global_step()) 487 | else: 488 | # 梯度截断 489 | # params = tf.trainable_variables() 490 | # all_gradients = tf.gradients(loss, all_variables, stop_gradients=stop_tensors) 491 | self.train_op = optimizer.minimize( 492 | loss, global_step=tf.train.get_global_step()) 493 | 494 | return self.train_op 495 | 496 | @staticmethod 497 | def label_smoothing(inp, ls_epsilon): 498 | """ 499 | From the paper: "... employed label smoothing of epsilon = 0.1. This hurts perplexity, 500 | as the model learns to be more unsure, but improves accuracy and BLEU score." 501 | Args: 502 | inp (tf.tensor): one-hot encoding vectors, [batch, seq_len, vocab_size] 503 | """ 504 | vocab_size = inp.shape.as_list()[-1] 505 | smoothed = (1.0 - ls_epsilon) * inp + (ls_epsilon / vocab_size) 506 | return smoothed 507 | 508 | 509 | if __name__ == "__main__": 510 | model = BaseModel("s") 511 | pass 512 | -------------------------------------------------------------------------------- /model/bert/ReadMe.md: -------------------------------------------------------------------------------- 1 | # download 2 | - bert-base chinese: https://github.com/google-research/bert 3 | - roberta_zh: https://github.com/brightmart/roberta_zh 4 | - albert_zh: https://github.com/brightmart/albert_zh 5 | - xlnet_zh: https://github.com/brightmart/xlnet_zh 6 | 7 | 来自于:[https://github.com/google-research/bert](https://github.com/google-research/bert) 8 | -------------------------------------------------------------------------------- /model/bert/modeling.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
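Before the vendored BERT reference code, one short aside: the `"avg"` branch of `share_bert_layer` in base_model.py above builds the sentence embedding as a padding-aware mean over token vectors. A minimal NumPy sketch of just that computation, with illustrative shapes and values that are not taken from the repo:

```python
# Padding-aware mean pooling, mirroring the sentence_embedding_type == "avg"
# branch of share_bert_layer (NumPy stand-in for the TF ops; shapes illustrative).
import numpy as np

seq_output = np.random.rand(2, 4, 8)                     # [batch, seq_len, hidden]
input_mask = np.array([[1, 1, 1, 0],
                       [1, 1, 0, 0]], dtype=np.float32)  # 1 = real token, 0 = [PAD]
mask_ = input_mask[:, :, None]                           # [batch, seq_len, 1]
pooled = (seq_output * mask_).sum(axis=1) / mask_.sum(axis=1)  # [batch, hidden]
```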
15 | """The main BERT model and related functions.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import copy 23 | import json 24 | import math 25 | import re 26 | import six 27 | import tensorflow as tf 28 | 29 | 30 | class BertConfig(object): 31 | """Configuration for `BertModel`.""" 32 | 33 | def __init__(self, 34 | vocab_size, 35 | hidden_size=768, 36 | num_hidden_layers=12, 37 | num_attention_heads=12, 38 | intermediate_size=3072, 39 | hidden_act="gelu", 40 | hidden_dropout_prob=0.1, 41 | attention_probs_dropout_prob=0.1, 42 | max_position_embeddings=512, 43 | type_vocab_size=16, 44 | initializer_range=0.02): 45 | """Constructs BertConfig. 46 | Args: 47 | vocab_size: Vocabulary size of `inputs_ids` in `BertModel`. 48 | hidden_size: Size of the encoder layers and the pooler layer. 49 | num_hidden_layers: Number of hidden layers in the Transformer encoder. 50 | num_attention_heads: Number of attention heads for each attention layer in 51 | the Transformer encoder. 52 | intermediate_size: The size of the "intermediate" (i.e., feed-forward) 53 | layer in the Transformer encoder. 54 | hidden_act: The non-linear activation function (function or string) in the 55 | encoder and pooler. 56 | hidden_dropout_prob: The dropout probability for all fully connected 57 | layers in the embeddings, encoder, and pooler. 58 | attention_probs_dropout_prob: The dropout ratio for the attention 59 | probabilities. 60 | max_position_embeddings: The maximum sequence length that this model might 61 | ever be used with. Typically set this to something large just in case 62 | (e.g., 512 or 1024 or 2048). 63 | type_vocab_size: The vocabulary size of the `token_type_ids` passed into 64 | `BertModel`. 65 | initializer_range: The stdev of the truncated_normal_initializer for 66 | initializing all weight matrices. 67 | """ 68 | self.vocab_size = vocab_size 69 | self.hidden_size = hidden_size 70 | self.num_hidden_layers = num_hidden_layers 71 | self.num_attention_heads = num_attention_heads 72 | self.hidden_act = hidden_act 73 | self.intermediate_size = intermediate_size 74 | self.hidden_dropout_prob = hidden_dropout_prob 75 | self.attention_probs_dropout_prob = attention_probs_dropout_prob 76 | self.max_position_embeddings = max_position_embeddings 77 | self.type_vocab_size = type_vocab_size 78 | self.initializer_range = initializer_range 79 | 80 | @classmethod 81 | def from_dict(cls, json_object): 82 | """Constructs a `BertConfig` from a Python dictionary of parameters.""" 83 | config = BertConfig(vocab_size=None) 84 | for (key, value) in six.iteritems(json_object): 85 | config.__dict__[key] = value 86 | return config 87 | 88 | @classmethod 89 | def from_json_file(cls, json_file): 90 | """Constructs a `BertConfig` from a json file of parameters.""" 91 | with tf.gfile.GFile(json_file, "r") as reader: 92 | text = reader.read() 93 | return cls.from_dict(json.loads(text)) 94 | 95 | def to_dict(self): 96 | """Serializes this instance to a Python dictionary.""" 97 | output = copy.deepcopy(self.__dict__) 98 | return output 99 | 100 | def to_json_string(self): 101 | """Serializes this instance to a JSON string.""" 102 | return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n" 103 | 104 | 105 | class BertModel(object): 106 | """BERT model ("Bidirectional Embedding Representations from a Transformer"). 
107 | Example usage: 108 | ```python 109 | # Already been converted into WordPiece token ids 110 | input_ids = tf.constant([[31, 51, 99], [15, 5, 0]]) 111 | input_mask = tf.constant([[1, 1, 1], [1, 1, 0]]) 112 | token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]]) 113 | config = modeling.BertConfig(vocab_size=32000, hidden_size=512, 114 | num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024) 115 | model = modeling.BertModel(config=config, is_training=True, 116 | input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids) 117 | label_embeddings = tf.get_variable(...) 118 | pooled_output = model.get_pooled_output() 119 | logits = tf.matmul(pooled_output, label_embeddings) 120 | ... 121 | ``` 122 | """ 123 | 124 | def __init__(self, 125 | config, 126 | is_training, 127 | input_ids, 128 | input_mask=None, 129 | token_type_ids=None, 130 | use_one_hot_embeddings=True, 131 | scope=None): 132 | """Constructor for BertModel. 133 | Args: 134 | config: `BertConfig` instance. 135 | is_training: bool. rue for training model, false for eval model. Controls 136 | whether dropout will be applied. 137 | input_ids: int32 Tensor of shape [batch_size, seq_length]. 138 | input_mask: (optional) int32 Tensor of shape [batch_size, seq_length]. 139 | token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length]. 140 | use_one_hot_embeddings: (optional) bool. Whether to use one-hot word 141 | embeddings or tf.embedding_lookup() for the word embeddings. On the TPU, 142 | it is must faster if this is True, on the CPU or GPU, it is faster if 143 | this is False. 144 | scope: (optional) variable scope. Defaults to "bert". 145 | Raises: 146 | ValueError: The config is invalid or one of the input tensor shapes 147 | is invalid. 148 | """ 149 | config = copy.deepcopy(config) 150 | if not is_training: 151 | config.hidden_dropout_prob = 0.0 152 | config.attention_probs_dropout_prob = 0.0 153 | 154 | input_shape = get_shape_list(input_ids, expected_rank=2) 155 | batch_size = input_shape[0] 156 | seq_length = input_shape[1] 157 | 158 | if input_mask is None: 159 | input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32) 160 | 161 | if token_type_ids is None: 162 | token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32) 163 | 164 | with tf.variable_scope(scope, default_name="bert"): 165 | with tf.variable_scope("embeddings"): 166 | # Perform embedding lookup on the word ids. 167 | (self.embedding_output, self.embedding_table) = embedding_lookup( 168 | input_ids=input_ids, 169 | vocab_size=config.vocab_size, 170 | embedding_size=config.hidden_size, 171 | initializer_range=config.initializer_range, 172 | word_embedding_name="word_embeddings", 173 | use_one_hot_embeddings=use_one_hot_embeddings) 174 | 175 | # Add positional embeddings and token type embeddings, then layer 176 | # normalize and perform dropout. 
177 | self.embedding_output = embedding_postprocessor( 178 | input_tensor=self.embedding_output, 179 | use_token_type=True, 180 | token_type_ids=token_type_ids, 181 | token_type_vocab_size=config.type_vocab_size, 182 | token_type_embedding_name="token_type_embeddings", 183 | use_position_embeddings=True, 184 | position_embedding_name="position_embeddings", 185 | initializer_range=config.initializer_range, 186 | max_position_embeddings=config.max_position_embeddings, 187 | dropout_prob=config.hidden_dropout_prob) 188 | 189 | with tf.variable_scope("encoder"): 190 | # This converts a 2D mask of shape [batch_size, seq_length] to a 3D 191 | # mask of shape [batch_size, seq_length, seq_length] which is used 192 | # for the attention scores. 193 | attention_mask = create_attention_mask_from_input_mask( 194 | input_ids, input_mask) 195 | 196 | # Run the stacked transformer. 197 | # `sequence_output` shape = [batch_size, seq_length, hidden_size]. 198 | self.all_encoder_layers = transformer_model( 199 | input_tensor=self.embedding_output, 200 | attention_mask=attention_mask, 201 | hidden_size=config.hidden_size, 202 | num_hidden_layers=config.num_hidden_layers, 203 | num_attention_heads=config.num_attention_heads, 204 | intermediate_size=config.intermediate_size, 205 | intermediate_act_fn=get_activation(config.hidden_act), 206 | hidden_dropout_prob=config.hidden_dropout_prob, 207 | attention_probs_dropout_prob=config.attention_probs_dropout_prob, 208 | initializer_range=config.initializer_range, 209 | do_return_all_layers=True) 210 | 211 | self.sequence_output = self.all_encoder_layers[-1] 212 | # The "pooler" converts the encoded sequence tensor of shape 213 | # [batch_size, seq_length, hidden_size] to a tensor of shape 214 | # [batch_size, hidden_size]. This is necessary for segment-level 215 | # (or segment-pair-level) classification tasks where we need a fixed 216 | # dimensional representation of the segment. 217 | with tf.variable_scope("pooler"): 218 | # We "pool" the model by simply taking the hidden state corresponding 219 | # to the first token. We assume that this has been pre-trained 220 | first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1) 221 | self.pooled_output = tf.layers.dense( 222 | first_token_tensor, 223 | config.hidden_size, 224 | activation=tf.tanh, 225 | kernel_initializer=create_initializer(config.initializer_range)) 226 | 227 | def get_pooled_output(self): 228 | return self.pooled_output 229 | 230 | def get_sequence_output(self): 231 | """Gets final hidden layer of encoder. 232 | Returns: 233 | float Tensor of shape [batch_size, seq_length, hidden_size] corresponding 234 | to the final hidden of the transformer encoder. 235 | """ 236 | return self.sequence_output 237 | 238 | def get_all_encoder_layers(self): 239 | return self.all_encoder_layers 240 | 241 | def get_embedding_output(self): 242 | """Gets output of the embedding lookup (i.e., input to the transformer). 243 | Returns: 244 | float Tensor of shape [batch_size, seq_length, hidden_size] corresponding 245 | to the output of the embedding layer, after summing the word 246 | embeddings with the positional embeddings and the token type embeddings, 247 | then performing layer normalization. This is the input to the transformer. 248 | """ 249 | return self.embedding_output 250 | 251 | def get_embedding_table(self): 252 | return self.embedding_table 253 | 254 | 255 | def gelu(input_tensor): 256 | """Gaussian Error Linear Unit. 257 | This is a smoother version of the RELU. 
258 | Original paper: https://arxiv.org/abs/1606.08415 259 | Args: 260 | input_tensor: float Tensor to perform activation. 261 | Returns: 262 | `input_tensor` with the GELU activation applied. 263 | """ 264 | cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0))) 265 | return input_tensor * cdf 266 | 267 | 268 | def get_activation(activation_string): 269 | """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`. 270 | Args: 271 | activation_string: String name of the activation function. 272 | Returns: 273 | A Python function corresponding to the activation function. If 274 | `activation_string` is None, empty, or "linear", this will return None. 275 | If `activation_string` is not a string, it will return `activation_string`. 276 | Raises: 277 | ValueError: The `activation_string` does not correspond to a known 278 | activation. 279 | """ 280 | 281 | # We assume that anything that"s not a string is already an activation 282 | # function, so we just return it. 283 | if not isinstance(activation_string, six.string_types): 284 | return activation_string 285 | 286 | if not activation_string: 287 | return None 288 | 289 | act = activation_string.lower() 290 | if act == "linear": 291 | return None 292 | elif act == "relu": 293 | return tf.nn.relu 294 | elif act == "gelu": 295 | return gelu 296 | elif act == "tanh": 297 | return tf.tanh 298 | else: 299 | raise ValueError("Unsupported activation: %s" % act) 300 | 301 | 302 | def get_assignment_map_from_checkpoint(tvars, init_checkpoint): 303 | """Compute the union of the current variables and checkpoint variables.""" 304 | assignment_map = {} 305 | initialized_variable_names = {} 306 | 307 | name_to_variable = collections.OrderedDict() 308 | for var in tvars: 309 | name = var.name 310 | m = re.match("^(.*):\\d+$", name) 311 | if m is not None: 312 | name = m.group(1) 313 | name_to_variable[name] = var 314 | 315 | init_vars = tf.train.list_variables(init_checkpoint) 316 | 317 | assignment_map = collections.OrderedDict() 318 | for x in init_vars: 319 | (name, var) = (x[0], x[1]) 320 | if name not in name_to_variable: 321 | continue 322 | assignment_map[name] = name 323 | initialized_variable_names[name] = 1 324 | initialized_variable_names[name + ":0"] = 1 325 | 326 | return (assignment_map, initialized_variable_names) 327 | 328 | 329 | def dropout(input_tensor, dropout_prob): 330 | """Perform dropout. 331 | Args: 332 | input_tensor: float Tensor. 333 | dropout_prob: Python float. The probability of dropping out a value (NOT of 334 | *keeping* a dimension as in `tf.nn.dropout`). 335 | Returns: 336 | A version of `input_tensor` with dropout applied. 
337 | """ 338 | if dropout_prob is None or dropout_prob == 0.0: 339 | return input_tensor 340 | 341 | output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob) 342 | return output 343 | 344 | 345 | def layer_norm(input_tensor, name=None): 346 | """Run layer normalization on the last dimension of the tensor.""" 347 | return tf.contrib.layers.layer_norm( 348 | inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name) 349 | 350 | 351 | def layer_norm_and_dropout(input_tensor, dropout_prob, name=None): 352 | """Runs layer normalization followed by dropout.""" 353 | output_tensor = layer_norm(input_tensor, name) 354 | output_tensor = dropout(output_tensor, dropout_prob) 355 | return output_tensor 356 | 357 | 358 | def create_initializer(initializer_range=0.02): 359 | """Creates a `truncated_normal_initializer` with the given range.""" 360 | return tf.truncated_normal_initializer(stddev=initializer_range) 361 | 362 | 363 | def embedding_lookup(input_ids, 364 | vocab_size, 365 | embedding_size=128, 366 | initializer_range=0.02, 367 | word_embedding_name="word_embeddings", 368 | use_one_hot_embeddings=False): 369 | """Looks up words embeddings for id tensor. 370 | Args: 371 | input_ids: int32 Tensor of shape [batch_size, seq_length] containing word 372 | ids. 373 | vocab_size: int. Size of the embedding vocabulary. 374 | embedding_size: int. Width of the word embeddings. 375 | initializer_range: float. Embedding initialization range. 376 | word_embedding_name: string. Name of the embedding table. 377 | use_one_hot_embeddings: bool. If True, use one-hot method for word 378 | embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better 379 | for TPUs. 380 | Returns: 381 | float Tensor of shape [batch_size, seq_length, embedding_size]. 382 | """ 383 | # This function assumes that the input is of shape [batch_size, seq_length, 384 | # num_inputs]. 385 | # 386 | # If the input is a 2D tensor of shape [batch_size, seq_length], we 387 | # reshape to [batch_size, seq_length, 1]. 388 | if input_ids.shape.ndims == 2: 389 | input_ids = tf.expand_dims(input_ids, axis=[-1]) 390 | 391 | embedding_table = tf.get_variable( 392 | name=word_embedding_name, 393 | shape=[vocab_size, embedding_size], 394 | initializer=create_initializer(initializer_range)) 395 | 396 | if use_one_hot_embeddings: 397 | flat_input_ids = tf.reshape(input_ids, [-1]) 398 | one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size) 399 | output = tf.matmul(one_hot_input_ids, embedding_table) 400 | else: 401 | output = tf.nn.embedding_lookup(embedding_table, input_ids) 402 | 403 | input_shape = get_shape_list(input_ids) 404 | 405 | output = tf.reshape(output, 406 | input_shape[0:-1] + [input_shape[-1] * embedding_size]) 407 | return (output, embedding_table) 408 | 409 | 410 | def embedding_postprocessor(input_tensor, 411 | use_token_type=False, 412 | token_type_ids=None, 413 | token_type_vocab_size=16, 414 | token_type_embedding_name="token_type_embeddings", 415 | use_position_embeddings=True, 416 | position_embedding_name="position_embeddings", 417 | initializer_range=0.02, 418 | max_position_embeddings=512, 419 | dropout_prob=0.1): 420 | """Performs various post-processing on a word embedding tensor. 421 | Args: 422 | input_tensor: float Tensor of shape [batch_size, seq_length, 423 | embedding_size]. 424 | use_token_type: bool. Whether to add embeddings for `token_type_ids`. 425 | token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length]. 426 | Must be specified if `use_token_type` is True. 
427 | token_type_vocab_size: int. The vocabulary size of `token_type_ids`. 428 | token_type_embedding_name: string. The name of the embedding table variable 429 | for token type ids. 430 | use_position_embeddings: bool. Whether to add position embeddings for the 431 | position of each token in the sequence. 432 | position_embedding_name: string. The name of the embedding table variable 433 | for positional embeddings. 434 | initializer_range: float. Range of the weight initialization. 435 | max_position_embeddings: int. Maximum sequence length that might ever be 436 | used with this model. This can be longer than the sequence length of 437 | input_tensor, but cannot be shorter. 438 | dropout_prob: float. Dropout probability applied to the final output tensor. 439 | Returns: 440 | float tensor with same shape as `input_tensor`. 441 | Raises: 442 | ValueError: One of the tensor shapes or input values is invalid. 443 | """ 444 | input_shape = get_shape_list(input_tensor, expected_rank=3) 445 | batch_size = input_shape[0] 446 | seq_length = input_shape[1] 447 | width = input_shape[2] 448 | 449 | output = input_tensor 450 | 451 | if use_token_type: 452 | if token_type_ids is None: 453 | raise ValueError("`token_type_ids` must be specified if" 454 | "`use_token_type` is True.") 455 | token_type_table = tf.get_variable( 456 | name=token_type_embedding_name, 457 | shape=[token_type_vocab_size, width], 458 | initializer=create_initializer(initializer_range)) 459 | # This vocab will be small so we always do one-hot here, since it is always 460 | # faster for a small vocabulary. 461 | flat_token_type_ids = tf.reshape(token_type_ids, [-1]) 462 | one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size) 463 | token_type_embeddings = tf.matmul(one_hot_ids, token_type_table) 464 | token_type_embeddings = tf.reshape(token_type_embeddings, 465 | [batch_size, seq_length, width]) 466 | output += token_type_embeddings 467 | 468 | if use_position_embeddings: 469 | assert_op = tf.assert_less_equal(seq_length, max_position_embeddings) 470 | with tf.control_dependencies([assert_op]): 471 | full_position_embeddings = tf.get_variable( 472 | name=position_embedding_name, 473 | shape=[max_position_embeddings, width], 474 | initializer=create_initializer(initializer_range)) 475 | # Since the position embedding table is a learned variable, we create it 476 | # using a (long) sequence length `max_position_embeddings`. The actual 477 | # sequence length might be shorter than this, for faster training of 478 | # tasks that do not have long sequences. 479 | # 480 | # So `full_position_embeddings` is effectively an embedding table 481 | # for position [0, 1, 2, ..., max_position_embeddings-1], and the current 482 | # sequence has positions [0, 1, 2, ... seq_length-1], so we can just 483 | # perform a slice. 484 | position_embeddings = tf.slice(full_position_embeddings, [0, 0], 485 | [seq_length, -1]) 486 | num_dims = len(output.shape.as_list()) 487 | 488 | # Only the last two dimensions are relevant (`seq_length` and `width`), so 489 | # we broadcast among the first dimensions, which is typically just 490 | # the batch size. 
491 | position_broadcast_shape = [] 492 | for _ in range(num_dims - 2): 493 | position_broadcast_shape.append(1) 494 | position_broadcast_shape.extend([seq_length, width]) 495 | position_embeddings = tf.reshape(position_embeddings, 496 | position_broadcast_shape) 497 | output += position_embeddings 498 | 499 | output = layer_norm_and_dropout(output, dropout_prob) 500 | return output 501 | 502 | 503 | def create_attention_mask_from_input_mask(from_tensor, to_mask): 504 | """Create 3D attention mask from a 2D tensor mask. 505 | Args: 506 | from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...]. 507 | to_mask: int32 Tensor of shape [batch_size, to_seq_length]. 508 | Returns: 509 | float Tensor of shape [batch_size, from_seq_length, to_seq_length]. 510 | """ 511 | from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) 512 | batch_size = from_shape[0] 513 | from_seq_length = from_shape[1] 514 | 515 | to_shape = get_shape_list(to_mask, expected_rank=2) 516 | to_seq_length = to_shape[1] 517 | 518 | to_mask = tf.cast( 519 | tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32) 520 | 521 | # We don't assume that `from_tensor` is a mask (although it could be). We 522 | # don't actually care if we attend *from* padding tokens (only *to* padding) 523 | # tokens so we create a tensor of all ones. 524 | # 525 | # `broadcast_ones` = [batch_size, from_seq_length, 1] 526 | broadcast_ones = tf.ones( 527 | shape=[batch_size, from_seq_length, 1], dtype=tf.float32) 528 | 529 | # Here we broadcast along two dimensions to create the mask. 530 | mask = broadcast_ones * to_mask 531 | 532 | return mask 533 | 534 | 535 | def attention_layer(from_tensor, 536 | to_tensor, 537 | attention_mask=None, 538 | num_attention_heads=1, 539 | size_per_head=512, 540 | query_act=None, 541 | key_act=None, 542 | value_act=None, 543 | attention_probs_dropout_prob=0.0, 544 | initializer_range=0.02, 545 | do_return_2d_tensor=False, 546 | batch_size=None, 547 | from_seq_length=None, 548 | to_seq_length=None): 549 | """Performs multi-headed attention from `from_tensor` to `to_tensor`. 550 | This is an implementation of multi-headed attention based on "Attention 551 | is all you Need". If `from_tensor` and `to_tensor` are the same, then 552 | this is self-attention. Each timestep in `from_tensor` attends to the 553 | corresponding sequence in `to_tensor`, and returns a fixed-with vector. 554 | This function first projects `from_tensor` into a "query" tensor and 555 | `to_tensor` into "key" and "value" tensors. These are (effectively) a list 556 | of tensors of length `num_attention_heads`, where each tensor is of shape 557 | [batch_size, seq_length, size_per_head]. 558 | Then, the query and key tensors are dot-producted and scaled. These are 559 | softmaxed to obtain attention probabilities. The value tensors are then 560 | interpolated by these probabilities, then concatenated back to a single 561 | tensor and returned. 562 | In practice, the multi-headed attention are done with transposes and 563 | reshapes rather than actual separate tensors. 564 | Args: 565 | from_tensor: float Tensor of shape [batch_size, from_seq_length, 566 | from_width]. 567 | to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width]. 568 | attention_mask: (optional) int32 Tensor of shape [batch_size, 569 | from_seq_length, to_seq_length]. The values should be 1 or 0. 
The 570 | attention scores will effectively be set to -infinity for any positions in 571 | the mask that are 0, and will be unchanged for positions that are 1. 572 | num_attention_heads: int. Number of attention heads. 573 | size_per_head: int. Size of each attention head. 574 | query_act: (optional) Activation function for the query transform. 575 | key_act: (optional) Activation function for the key transform. 576 | value_act: (optional) Activation function for the value transform. 577 | attention_probs_dropout_prob: (optional) float. Dropout probability of the 578 | attention probabilities. 579 | initializer_range: float. Range of the weight initializer. 580 | do_return_2d_tensor: bool. If True, the output will be of shape [batch_size 581 | * from_seq_length, num_attention_heads * size_per_head]. If False, the 582 | output will be of shape [batch_size, from_seq_length, num_attention_heads 583 | * size_per_head]. 584 | batch_size: (Optional) int. If the input is 2D, this might be the batch size 585 | of the 3D version of the `from_tensor` and `to_tensor`. 586 | from_seq_length: (Optional) If the input is 2D, this might be the seq length 587 | of the 3D version of the `from_tensor`. 588 | to_seq_length: (Optional) If the input is 2D, this might be the seq length 589 | of the 3D version of the `to_tensor`. 590 | Returns: 591 | float Tensor of shape [batch_size, from_seq_length, 592 | num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is 593 | true, this will be of shape [batch_size * from_seq_length, 594 | num_attention_heads * size_per_head]). 595 | Raises: 596 | ValueError: Any of the arguments or tensor shapes are invalid. 597 | """ 598 | 599 | def transpose_for_scores(input_tensor, batch_size, num_attention_heads, 600 | seq_length, width): 601 | output_tensor = tf.reshape( 602 | input_tensor, [batch_size, seq_length, num_attention_heads, width]) 603 | 604 | output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3]) 605 | return output_tensor 606 | 607 | from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) 608 | to_shape = get_shape_list(to_tensor, expected_rank=[2, 3]) 609 | 610 | if len(from_shape) != len(to_shape): 611 | raise ValueError( 612 | "The rank of `from_tensor` must match the rank of `to_tensor`.") 613 | 614 | if len(from_shape) == 3: 615 | batch_size = from_shape[0] 616 | from_seq_length = from_shape[1] 617 | to_seq_length = to_shape[1] 618 | elif len(from_shape) == 2: 619 | if (batch_size is None or from_seq_length is None or to_seq_length is None): 620 | raise ValueError( 621 | "When passing in rank 2 tensors to attention_layer, the values " 622 | "for `batch_size`, `from_seq_length`, and `to_seq_length` " 623 | "must all be specified.") 624 | 625 | # Scalar dimensions referenced here: 626 | # B = batch size (number of sequences) 627 | # F = `from_tensor` sequence length 628 | # T = `to_tensor` sequence length 629 | # N = `num_attention_heads` 630 | # H = `size_per_head` 631 | 632 | from_tensor_2d = reshape_to_matrix(from_tensor) 633 | to_tensor_2d = reshape_to_matrix(to_tensor) 634 | 635 | # `query_layer` = [B*F, N*H] 636 | query_layer = tf.layers.dense( 637 | from_tensor_2d, 638 | num_attention_heads * size_per_head, 639 | activation=query_act, 640 | name="query", 641 | kernel_initializer=create_initializer(initializer_range)) 642 | 643 | # `key_layer` = [B*T, N*H] 644 | key_layer = tf.layers.dense( 645 | to_tensor_2d, 646 | num_attention_heads * size_per_head, 647 | activation=key_act, 648 | name="key", 649 | 
kernel_initializer=create_initializer(initializer_range)) 650 | 651 | # `value_layer` = [B*T, N*H] 652 | value_layer = tf.layers.dense( 653 | to_tensor_2d, 654 | num_attention_heads * size_per_head, 655 | activation=value_act, 656 | name="value", 657 | kernel_initializer=create_initializer(initializer_range)) 658 | 659 | # `query_layer` = [B, N, F, H] 660 | query_layer = transpose_for_scores(query_layer, batch_size, 661 | num_attention_heads, from_seq_length, 662 | size_per_head) 663 | 664 | # `key_layer` = [B, N, T, H] 665 | key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads, 666 | to_seq_length, size_per_head) 667 | 668 | # Take the dot product between "query" and "key" to get the raw 669 | # attention scores. 670 | # `attention_scores` = [B, N, F, T] 671 | attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True) 672 | attention_scores = tf.multiply(attention_scores, 673 | 1.0 / math.sqrt(float(size_per_head))) 674 | 675 | if attention_mask is not None: 676 | # `attention_mask` = [B, 1, F, T] 677 | attention_mask = tf.expand_dims(attention_mask, axis=[1]) 678 | 679 | # Since attention_mask is 1.0 for positions we want to attend and 0.0 for 680 | # masked positions, this operation will create a tensor which is 0.0 for 681 | # positions we want to attend and -10000.0 for masked positions. 682 | adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0 683 | 684 | # Since we are adding it to the raw scores before the softmax, this is 685 | # effectively the same as removing these entirely. 686 | attention_scores += adder 687 | 688 | # Normalize the attention scores to probabilities. 689 | # `attention_probs` = [B, N, F, T] 690 | attention_probs = tf.nn.softmax(attention_scores) 691 | 692 | # This is actually dropping out entire tokens to attend to, which might 693 | # seem a bit unusual, but is taken from the original Transformer paper. 694 | attention_probs = dropout(attention_probs, attention_probs_dropout_prob) 695 | 696 | # `value_layer` = [B, T, N, H] 697 | value_layer = tf.reshape( 698 | value_layer, 699 | [batch_size, to_seq_length, num_attention_heads, size_per_head]) 700 | 701 | # `value_layer` = [B, N, T, H] 702 | value_layer = tf.transpose(value_layer, [0, 2, 1, 3]) 703 | 704 | # `context_layer` = [B, N, F, H] 705 | context_layer = tf.matmul(attention_probs, value_layer) 706 | 707 | # `context_layer` = [B, F, N, H] 708 | context_layer = tf.transpose(context_layer, [0, 2, 1, 3]) 709 | 710 | if do_return_2d_tensor: 711 | # `context_layer` = [B*F, N*V] 712 | context_layer = tf.reshape( 713 | context_layer, 714 | [batch_size * from_seq_length, num_attention_heads * size_per_head]) 715 | else: 716 | # `context_layer` = [B, F, N*V] 717 | context_layer = tf.reshape( 718 | context_layer, 719 | [batch_size, from_seq_length, num_attention_heads * size_per_head]) 720 | 721 | return context_layer 722 | 723 | 724 | def transformer_model(input_tensor, 725 | attention_mask=None, 726 | hidden_size=768, 727 | num_hidden_layers=12, 728 | num_attention_heads=12, 729 | intermediate_size=3072, 730 | intermediate_act_fn=gelu, 731 | hidden_dropout_prob=0.1, 732 | attention_probs_dropout_prob=0.1, 733 | initializer_range=0.02, 734 | do_return_all_layers=False): 735 | """Multi-headed, multi-layer Transformer from "Attention is All You Need". 736 | This is almost an exact implementation of the original Transformer encoder. 
737 | See the original paper: 738 | https://arxiv.org/abs/1706.03762 739 | Also see: 740 | https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py 741 | Args: 742 | input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size]. 743 | attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length, 744 | seq_length], with 1 for positions that can be attended to and 0 in 745 | positions that should not be. 746 | hidden_size: int. Hidden size of the Transformer. 747 | num_hidden_layers: int. Number of layers (blocks) in the Transformer. 748 | num_attention_heads: int. Number of attention heads in the Transformer. 749 | intermediate_size: int. The size of the "intermediate" (a.k.a., feed 750 | forward) layer. 751 | intermediate_act_fn: function. The non-linear activation function to apply 752 | to the output of the intermediate/feed-forward layer. 753 | hidden_dropout_prob: float. Dropout probability for the hidden layers. 754 | attention_probs_dropout_prob: float. Dropout probability of the attention 755 | probabilities. 756 | initializer_range: float. Range of the initializer (stddev of truncated 757 | normal). 758 | do_return_all_layers: Whether to also return all layers or just the final 759 | layer. 760 | Returns: 761 | float Tensor of shape [batch_size, seq_length, hidden_size], the final 762 | hidden layer of the Transformer. 763 | Raises: 764 | ValueError: A Tensor shape or parameter is invalid. 765 | """ 766 | if hidden_size % num_attention_heads != 0: 767 | raise ValueError( 768 | "The hidden size (%d) is not a multiple of the number of attention " 769 | "heads (%d)" % (hidden_size, num_attention_heads)) 770 | 771 | attention_head_size = int(hidden_size / num_attention_heads) 772 | input_shape = get_shape_list(input_tensor, expected_rank=3) 773 | batch_size = input_shape[0] 774 | seq_length = input_shape[1] 775 | input_width = input_shape[2] 776 | 777 | # The Transformer performs sum residuals on all layers so the input needs 778 | # to be the same as the hidden size. 779 | if input_width != hidden_size: 780 | raise ValueError("The width of the input tensor (%d) != hidden size (%d)" % 781 | (input_width, hidden_size)) 782 | 783 | # We keep the representation as a 2D tensor to avoid re-shaping it back and 784 | # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on 785 | # the GPU/CPU but may not be free on the TPU, so we want to minimize them to 786 | # help the optimizer. 
787 | prev_output = reshape_to_matrix(input_tensor) 788 | 789 | all_layer_outputs = [] 790 | for layer_idx in range(num_hidden_layers): 791 | with tf.variable_scope("layer_%d" % layer_idx): 792 | layer_input = prev_output 793 | 794 | with tf.variable_scope("attention"): 795 | attention_heads = [] 796 | with tf.variable_scope("self"): 797 | attention_head = attention_layer( 798 | from_tensor=layer_input, 799 | to_tensor=layer_input, 800 | attention_mask=attention_mask, 801 | num_attention_heads=num_attention_heads, 802 | size_per_head=attention_head_size, 803 | attention_probs_dropout_prob=attention_probs_dropout_prob, 804 | initializer_range=initializer_range, 805 | do_return_2d_tensor=True, 806 | batch_size=batch_size, 807 | from_seq_length=seq_length, 808 | to_seq_length=seq_length) 809 | attention_heads.append(attention_head) 810 | 811 | attention_output = None 812 | if len(attention_heads) == 1: 813 | attention_output = attention_heads[0] 814 | else: 815 | # In the case where we have other sequences, we just concatenate 816 | # them to the self-attention head before the projection. 817 | attention_output = tf.concat(attention_heads, axis=-1) 818 | 819 | # Run a linear projection of `hidden_size` then add a residual 820 | # with `layer_input`. 821 | with tf.variable_scope("output"): 822 | attention_output = tf.layers.dense( 823 | attention_output, 824 | hidden_size, 825 | kernel_initializer=create_initializer(initializer_range)) 826 | attention_output = dropout(attention_output, hidden_dropout_prob) 827 | attention_output = layer_norm(attention_output + layer_input) 828 | 829 | # The activation is only applied to the "intermediate" hidden layer. 830 | with tf.variable_scope("intermediate"): 831 | intermediate_output = tf.layers.dense( 832 | attention_output, 833 | intermediate_size, 834 | activation=intermediate_act_fn, 835 | kernel_initializer=create_initializer(initializer_range)) 836 | 837 | # Down-project back to `hidden_size` then add the residual. 838 | with tf.variable_scope("output"): 839 | layer_output = tf.layers.dense( 840 | intermediate_output, 841 | hidden_size, 842 | kernel_initializer=create_initializer(initializer_range)) 843 | layer_output = dropout(layer_output, hidden_dropout_prob) 844 | layer_output = layer_norm(layer_output + attention_output) 845 | prev_output = layer_output 846 | all_layer_outputs.append(layer_output) 847 | 848 | if do_return_all_layers: 849 | final_outputs = [] 850 | for layer_output in all_layer_outputs: 851 | final_output = reshape_from_matrix(layer_output, input_shape) 852 | final_outputs.append(final_output) 853 | return final_outputs 854 | else: 855 | final_output = reshape_from_matrix(prev_output, input_shape) 856 | return final_output 857 | 858 | 859 | def get_shape_list(tensor, expected_rank=None, name=None): 860 | """Returns a list of the shape of tensor, preferring static dimensions. 861 | Args: 862 | tensor: A tf.Tensor object to find the shape of. 863 | expected_rank: (optional) int. The expected rank of `tensor`. If this is 864 | specified and the `tensor` has a different rank, and exception will be 865 | thrown. 866 | name: Optional name of the tensor for the error message. 867 | Returns: 868 | A list of dimensions of the shape of tensor. All static dimensions will 869 | be returned as python integers, and dynamic dimensions will be returned 870 | as tf.Tensor scalars. 
871 | """ 872 | if name is None: 873 | name = tensor.name 874 | 875 | if expected_rank is not None: 876 | assert_rank(tensor, expected_rank, name) 877 | 878 | shape = tensor.shape.as_list() 879 | 880 | non_static_indexes = [] 881 | for (index, dim) in enumerate(shape): 882 | if dim is None: 883 | non_static_indexes.append(index) 884 | 885 | if not non_static_indexes: 886 | return shape 887 | 888 | dyn_shape = tf.shape(tensor) 889 | for index in non_static_indexes: 890 | shape[index] = dyn_shape[index] 891 | return shape 892 | 893 | 894 | def reshape_to_matrix(input_tensor): 895 | """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix).""" 896 | ndims = input_tensor.shape.ndims 897 | if ndims < 2: 898 | raise ValueError("Input tensor must have at least rank 2. Shape = %s" % 899 | (input_tensor.shape)) 900 | if ndims == 2: 901 | return input_tensor 902 | 903 | width = input_tensor.shape[-1] 904 | output_tensor = tf.reshape(input_tensor, [-1, width]) 905 | return output_tensor 906 | 907 | 908 | def reshape_from_matrix(output_tensor, orig_shape_list): 909 | """Reshapes a rank 2 tensor back to its original rank >= 2 tensor.""" 910 | if len(orig_shape_list) == 2: 911 | return output_tensor 912 | 913 | output_shape = get_shape_list(output_tensor) 914 | 915 | orig_dims = orig_shape_list[0:-1] 916 | width = output_shape[-1] 917 | 918 | return tf.reshape(output_tensor, orig_dims + [width]) 919 | 920 | 921 | def assert_rank(tensor, expected_rank, name=None): 922 | """Raises an exception if the tensor rank is not of the expected rank. 923 | Args: 924 | tensor: A tf.Tensor to check the rank of. 925 | expected_rank: Python integer or list of integers, expected rank. 926 | name: Optional name of the tensor for the error message. 927 | Raises: 928 | ValueError: If the expected shape doesn't match the actual shape. 929 | """ 930 | if name is None: 931 | name = tensor.name 932 | 933 | expected_rank_dict = {} 934 | if isinstance(expected_rank, six.integer_types): 935 | expected_rank_dict[expected_rank] = True 936 | else: 937 | for x in expected_rank: 938 | expected_rank_dict[x] = True 939 | 940 | actual_rank = tensor.shape.ndims 941 | if actual_rank not in expected_rank_dict: 942 | scope_name = tf.get_variable_scope().name 943 | raise ValueError( 944 | "For the tensor `%s` in scope `%s`, the actual rank " 945 | "`%d` (shape = %s) is not equal to the expected rank `%s`" % 946 | (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank))) -------------------------------------------------------------------------------- /model/bert/optimization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | """Functions and classes related to optimization (weight updates).""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import re 22 | import tensorflow as tf 23 | 24 | 25 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu): 26 | """Creates an optimizer training op.""" 27 | global_step = tf.train.get_or_create_global_step() 28 | 29 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32) 30 | 31 | # Implements linear decay of the learning rate. 32 | learning_rate = tf.train.polynomial_decay( 33 | learning_rate, 34 | global_step, 35 | num_train_steps, 36 | end_learning_rate=0.0, 37 | power=1.0, 38 | cycle=False) 39 | 40 | # Implements linear warmup. I.e., if global_step < num_warmup_steps, the 41 | # learning rate will be `global_step/num_warmup_steps * init_lr`. 42 | if num_warmup_steps: 43 | global_steps_int = tf.cast(global_step, tf.int32) 44 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32) 45 | 46 | global_steps_float = tf.cast(global_steps_int, tf.float32) 47 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32) 48 | 49 | warmup_percent_done = global_steps_float / warmup_steps_float 50 | warmup_learning_rate = init_lr * warmup_percent_done 51 | 52 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32) 53 | learning_rate = ( 54 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate) 55 | 56 | # It is recommended that you use this optimizer for fine tuning, since this 57 | # is how the model was trained (note that the Adam m/v variables are NOT 58 | # loaded from init_checkpoint.) 59 | optimizer = AdamWeightDecayOptimizer( 60 | learning_rate=learning_rate, 61 | weight_decay_rate=0.01, 62 | beta_1=0.9, 63 | beta_2=0.999, 64 | epsilon=1e-6, 65 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"]) 66 | 67 | if use_tpu: 68 | optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer) 69 | 70 | tvars = tf.trainable_variables() 71 | grads = tf.gradients(loss, tvars) 72 | 73 | # This is how the model was pre-trained. 
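  # Illustrative note on the clipping step below: tf.clip_by_global_norm treats
  # all gradients as one concatenated vector, computes
  # global_norm = sqrt(sum_i ||g_i||^2), and if that exceeds clip_norm it
  # rescales every gradient by clip_norm / global_norm, so the update direction
  # is preserved while its overall magnitude is capped.  Hypothetical numbers:
  # two gradients with norms 3 and 4 give global_norm 5, so with clip_norm=1.0
  # each gradient is multiplied by 0.2.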
74 | (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) 75 | 76 | train_op = optimizer.apply_gradients( 77 | zip(grads, tvars), global_step=global_step) 78 | 79 | new_global_step = global_step + 1 80 | train_op = tf.group(train_op, global_step.assign(new_global_step)) 81 | return train_op 82 | 83 | 84 | class AdamWeightDecayOptimizer(tf.train.Optimizer): 85 | """A basic Adam optimizer that includes "correct" L2 weight decay.""" 86 | 87 | def __init__(self, 88 | learning_rate, 89 | weight_decay_rate=0.0, 90 | beta_1=0.9, 91 | beta_2=0.999, 92 | epsilon=1e-6, 93 | exclude_from_weight_decay=None, 94 | name="AdamWeightDecayOptimizer"): 95 | """Constructs a AdamWeightDecayOptimizer.""" 96 | super(AdamWeightDecayOptimizer, self).__init__(False, name) 97 | 98 | self.learning_rate = learning_rate 99 | self.weight_decay_rate = weight_decay_rate 100 | self.beta_1 = beta_1 101 | self.beta_2 = beta_2 102 | self.epsilon = epsilon 103 | self.exclude_from_weight_decay = exclude_from_weight_decay 104 | 105 | def apply_gradients(self, grads_and_vars, global_step=None, name=None): 106 | """See base class.""" 107 | assignments = [] 108 | for (grad, param) in grads_and_vars: 109 | if grad is None or param is None: 110 | continue 111 | 112 | param_name = self._get_variable_name(param.name) 113 | 114 | m = tf.get_variable( 115 | name=param_name + "/adam_m", 116 | shape=param.shape.as_list(), 117 | dtype=tf.float32, 118 | trainable=False, 119 | initializer=tf.zeros_initializer()) 120 | v = tf.get_variable( 121 | name=param_name + "/adam_v", 122 | shape=param.shape.as_list(), 123 | dtype=tf.float32, 124 | trainable=False, 125 | initializer=tf.zeros_initializer()) 126 | 127 | # Standard Adam update. 128 | next_m = ( 129 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad)) 130 | next_v = ( 131 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2, 132 | tf.square(grad))) 133 | 134 | update = next_m / (tf.sqrt(next_v) + self.epsilon) 135 | 136 | # Just adding the square of the weights to the loss function is *not* 137 | # the correct way of using L2 regularization/weight decay with Adam, 138 | # since that will interact with the m and v parameters in strange ways. 139 | # 140 | # Instead we want ot decay the weights in a manner that doesn't interact 141 | # with the m/v parameters. This is equivalent to adding the square 142 | # of the weights to the loss with plain (non-momentum) SGD. 
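      # Sketch of the resulting update (same quantities as above, written out):
      #   w <- w - learning_rate * ( m_t / (sqrt(v_t) + epsilon)
      #                              + weight_decay_rate * w )
      # The decay term is added to the Adam step rather than to the gradient,
      # i.e. decoupled weight decay in the style of AdamW (Loshchilov & Hutter),
      # and it is skipped for LayerNorm and bias parameters via
      # exclude_from_weight_decay.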
143 | if self._do_use_weight_decay(param_name): 144 | update += self.weight_decay_rate * param 145 | 146 | update_with_lr = self.learning_rate * update 147 | 148 | next_param = param - update_with_lr 149 | 150 | assignments.extend( 151 | [param.assign(next_param), 152 | m.assign(next_m), 153 | v.assign(next_v)]) 154 | return tf.group(*assignments, name=name) 155 | 156 | def _do_use_weight_decay(self, param_name): 157 | """Whether to use L2 weight decay for `param_name`.""" 158 | if not self.weight_decay_rate: 159 | return False 160 | if self.exclude_from_weight_decay: 161 | for r in self.exclude_from_weight_decay: 162 | if re.search(r, param_name) is not None: 163 | return False 164 | return True 165 | 166 | def _get_variable_name(self, param_name): 167 | """Get the variable name from the tensor name.""" 168 | m = re.match("^(.*):\\d+$", param_name) 169 | if m is not None: 170 | param_name = m.group(1) 171 | return param_name 172 | -------------------------------------------------------------------------------- /model/bert/tokenization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """Tokenization classes.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import unicodedata 23 | import six 24 | import tensorflow as tf 25 | 26 | 27 | def convert_to_unicode(text): 28 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" 29 | if six.PY3: 30 | if isinstance(text, str): 31 | return text 32 | elif isinstance(text, bytes): 33 | return text.decode("utf-8", "ignore") 34 | else: 35 | raise ValueError("Unsupported string type: %s" % (type(text))) 36 | elif six.PY2: 37 | if isinstance(text, str): 38 | return text.decode("utf-8", "ignore") 39 | elif isinstance(text, unicode): 40 | return text 41 | else: 42 | raise ValueError("Unsupported string type: %s" % (type(text))) 43 | else: 44 | raise ValueError("Not running on Python2 or Python 3?") 45 | 46 | 47 | def printable_text(text): 48 | """Returns text encoded in a way suitable for print or `tf.logging`.""" 49 | 50 | # These functions want `str` for both Python2 and Python3, but in one case 51 | # it's a Unicode string and in the other it's a byte string. 
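  # Illustrative behaviour (hypothetical inputs): on Python 3,
  # printable_text(b"\xe4\xb8\xad") returns the str "中", while a str such as
  # "中" passes through unchanged; on Python 2 a unicode object is instead
  # encoded back to a utf-8 byte string.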
52 | if six.PY3: 53 | if isinstance(text, str): 54 | return text 55 | elif isinstance(text, bytes): 56 | return text.decode("utf-8", "ignore") 57 | else: 58 | raise ValueError("Unsupported string type: %s" % (type(text))) 59 | elif six.PY2: 60 | if isinstance(text, str): 61 | return text 62 | elif isinstance(text, unicode): 63 | return text.encode("utf-8") 64 | else: 65 | raise ValueError("Unsupported string type: %s" % (type(text))) 66 | else: 67 | raise ValueError("Not running on Python2 or Python 3?") 68 | 69 | 70 | def load_vocab(vocab_file): 71 | """Loads a vocabulary file into a dictionary.""" 72 | vocab = collections.OrderedDict() 73 | index = 0 74 | with tf.gfile.GFile(vocab_file, "r") as reader: 75 | while True: 76 | token = convert_to_unicode(reader.readline()) 77 | if not token: 78 | break 79 | token = token.strip() 80 | vocab[token] = index 81 | index += 1 82 | return vocab 83 | 84 | 85 | def convert_tokens_to_ids(vocab, tokens, unk_token="[UNK]"): 86 | """Converts a sequence of tokens into ids using the vocab.""" 87 | ids = [] 88 | for token in tokens: 89 | if token in vocab: 90 | ids.append(vocab[token]) 91 | else: 92 | ids.append(vocab[unk_token]) 93 | return ids 94 | 95 | 96 | def whitespace_tokenize(text): 97 | """Runs basic whitespace cleaning and splitting on a peice of text.""" 98 | text = text.strip() 99 | if not text: 100 | return [] 101 | tokens = text.split() 102 | return tokens 103 | 104 | 105 | class FullTokenizer(object): 106 | """Runs end-to-end tokenziation.""" 107 | 108 | def __init__(self, vocab_file, do_lower_case=True): 109 | self.vocab = load_vocab(vocab_file) 110 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) 111 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) 112 | 113 | def tokenize(self, text): 114 | split_tokens = [] 115 | for token in self.basic_tokenizer.tokenize(text): 116 | for sub_token in self.wordpiece_tokenizer.tokenize(token): 117 | split_tokens.append(sub_token) 118 | 119 | return split_tokens 120 | 121 | def convert_tokens_to_ids(self, tokens): 122 | return convert_tokens_to_ids(self.vocab, tokens) 123 | 124 | 125 | class CharTokenizer(object): 126 | """Runs end-to-end tokenziation.""" 127 | 128 | def __init__(self, vocab_file, do_lower_case=True): 129 | self.vocab = load_vocab(vocab_file) 130 | self.id2vocab = {v:k for k, v in self.vocab.items()} 131 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) 132 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) 133 | 134 | def tokenize(self, text): 135 | split_tokens = [] 136 | for token in self.basic_tokenizer.tokenize(text): 137 | for sub_token in token: 138 | split_tokens.append(sub_token) 139 | 140 | return split_tokens 141 | 142 | def convert_tokens_to_ids(self, tokens): 143 | return convert_tokens_to_ids(self.vocab, tokens) 144 | 145 | 146 | class BasicTokenizer(object): 147 | """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" 148 | 149 | def __init__(self, do_lower_case=True): 150 | """Constructs a BasicTokenizer. 151 | 152 | Args: 153 | do_lower_case: Whether to lower case the input. 154 | """ 155 | self.do_lower_case = do_lower_case 156 | 157 | def tokenize(self, text): 158 | """Tokenizes a piece of text.""" 159 | text = convert_to_unicode(text) 160 | text = self._clean_text(text) 161 | 162 | # This was added on November 1st, 2018 for the multilingual and Chinese 163 | # models. 
This is also applied to the English models now, but it doesn't 164 | # matter since the English models were not trained on any Chinese data 165 | # and generally don't have any Chinese data in them (there are Chinese 166 | # characters in the vocabulary because Wikipedia does have some Chinese 167 | # words in the English Wikipedia.). 168 | text = self._tokenize_chinese_chars(text) 169 | 170 | orig_tokens = whitespace_tokenize(text) 171 | split_tokens = [] 172 | for token in orig_tokens: 173 | if self.do_lower_case: 174 | token = token.lower() 175 | token = self._run_strip_accents(token) 176 | split_tokens.extend(self._run_split_on_punc(token)) 177 | 178 | output_tokens = whitespace_tokenize(" ".join(split_tokens)) 179 | return output_tokens 180 | 181 | def _run_strip_accents(self, text): 182 | """Strips accents from a piece of text.""" 183 | text = unicodedata.normalize("NFD", text) 184 | output = [] 185 | for char in text: 186 | cat = unicodedata.category(char) 187 | if cat == "Mn": 188 | continue 189 | output.append(char) 190 | return "".join(output) 191 | 192 | def _run_split_on_punc(self, text): 193 | """Splits punctuation on a piece of text.""" 194 | chars = list(text) 195 | i = 0 196 | start_new_word = True 197 | output = [] 198 | while i < len(chars): 199 | char = chars[i] 200 | if _is_punctuation(char): 201 | output.append([char]) 202 | start_new_word = True 203 | else: 204 | if start_new_word: 205 | output.append([]) 206 | start_new_word = False 207 | output[-1].append(char) 208 | i += 1 209 | 210 | return ["".join(x) for x in output] 211 | 212 | def _tokenize_chinese_chars(self, text): 213 | """Adds whitespace around any CJK character.""" 214 | output = [] 215 | for char in text: 216 | cp = ord(char) 217 | if self._is_chinese_char(cp): 218 | output.append(" ") 219 | output.append(char) 220 | output.append(" ") 221 | else: 222 | output.append(char) 223 | return "".join(output) 224 | 225 | def _is_chinese_char(self, cp): 226 | """Checks whether CP is the codepoint of a CJK character.""" 227 | # This defines a "chinese character" as anything in the CJK Unicode block: 228 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) 229 | # 230 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters, 231 | # despite its name. The modern Korean Hangul alphabet is a different block, 232 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write 233 | # space-separated words, so they are not treated specially and handled 234 | # like the all of the other languages. 
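    # Illustrative examples (not exhaustive): ord(u"中") == 0x4E2D falls in the
    # 0x4E00-0x9FFF block below, so _tokenize_chinese_chars pads it with
    # spaces; Katakana such as ord(u"カ") == 0x30AB matches none of the ranges
    # and is left as-is.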
235 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or # 236 | (cp >= 0x3400 and cp <= 0x4DBF) or # 237 | (cp >= 0x20000 and cp <= 0x2A6DF) or # 238 | (cp >= 0x2A700 and cp <= 0x2B73F) or # 239 | (cp >= 0x2B740 and cp <= 0x2B81F) or # 240 | (cp >= 0x2B820 and cp <= 0x2CEAF) or 241 | (cp >= 0xF900 and cp <= 0xFAFF) or # 242 | (cp >= 0x2F800 and cp <= 0x2FA1F)): # 243 | return True 244 | 245 | return False 246 | 247 | def _clean_text(self, text): 248 | """Performs invalid character removal and whitespace cleanup on text.""" 249 | output = [] 250 | for char in text: 251 | cp = ord(char) 252 | if cp == 0 or cp == 0xfffd or _is_control(char): 253 | continue 254 | if _is_whitespace(char): 255 | output.append(" ") 256 | else: 257 | output.append(char) 258 | return "".join(output) 259 | 260 | 261 | class WordpieceTokenizer(object): 262 | """Runs WordPiece tokenziation.""" 263 | 264 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): 265 | self.vocab = vocab 266 | self.unk_token = unk_token 267 | self.max_input_chars_per_word = max_input_chars_per_word 268 | 269 | def tokenize(self, text): 270 | """Tokenizes a piece of text into its word pieces. 271 | 272 | This uses a greedy longest-match-first algorithm to perform tokenization 273 | using the given vocabulary. 274 | 275 | For example: 276 | input = "unaffable" 277 | output = ["un", "##aff", "##able"] 278 | 279 | Args: 280 | text: A single token or whitespace separated tokens. This should have 281 | already been passed through `BasicTokenizer. 282 | 283 | Returns: 284 | A list of wordpiece tokens. 285 | """ 286 | 287 | text = convert_to_unicode(text) 288 | 289 | output_tokens = [] 290 | for token in whitespace_tokenize(text): 291 | chars = list(token) 292 | if len(chars) > self.max_input_chars_per_word: 293 | output_tokens.append(self.unk_token) 294 | continue 295 | 296 | is_bad = False 297 | start = 0 298 | sub_tokens = [] 299 | while start < len(chars): 300 | end = len(chars) 301 | cur_substr = None 302 | while start < end: 303 | substr = "".join(chars[start:end]) 304 | if start > 0: 305 | substr = "##" + substr 306 | if substr in self.vocab: 307 | cur_substr = substr 308 | break 309 | end -= 1 310 | if cur_substr is None: 311 | is_bad = True 312 | break 313 | sub_tokens.append(cur_substr) 314 | start = end 315 | 316 | if is_bad: 317 | output_tokens.append(self.unk_token) 318 | else: 319 | output_tokens.extend(sub_tokens) 320 | return output_tokens 321 | 322 | 323 | def _is_whitespace(char): 324 | """Checks whether `chars` is a whitespace character.""" 325 | # \t, \n, and \r are technically contorl characters but we treat them 326 | # as whitespace since they are generally considered as such. 327 | if char == " " or char == "\t" or char == "\n" or char == "\r": 328 | return True 329 | cat = unicodedata.category(char) 330 | if cat == "Zs": 331 | return True 332 | return False 333 | 334 | 335 | def _is_control(char): 336 | """Checks whether `chars` is a control character.""" 337 | # These are technically control characters but we count them as whitespace 338 | # characters. 339 | if char == "\t" or char == "\n" or char == "\r": 340 | return False 341 | cat = unicodedata.category(char) 342 | if cat.startswith("C"): 343 | return True 344 | return False 345 | 346 | 347 | def _is_punctuation(char): 348 | """Checks whether `chars` is a punctuation character.""" 349 | cp = ord(char) 350 | # We treat all non-letter/number ASCII as punctuation. 
351 | # Characters such as "^", "$", and "`" are not in the Unicode 352 | # Punctuation class but we treat them as punctuation anyways, for 353 | # consistency. 354 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or 355 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): 356 | return True 357 | cat = unicodedata.category(char) 358 | if cat.startswith("P"): 359 | return True 360 | return False 361 | -------------------------------------------------------------------------------- /model/bert_classifier.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding=utf-8 3 | ''' 4 | @Time : 2020/10/17 11:38:00 5 | @Author : zhiyang.zzy 6 | @Contact : zhiyangchou@gmail.com 7 | @Desc : 使用bert做分类。 8 | 1. 对于sentence pair,直接将两个句子输入,然后用sep分割输入,然后使用cls的输出作为类别预测的输入。 9 | ''' 10 | 11 | # here put the import lib 12 | import time 13 | import numpy as np 14 | import tensorflow as tf 15 | import random 16 | import paddlehub as hub 17 | from sklearn.metrics import accuracy_score 18 | import math 19 | from keras.layers import Dense, Subtract, Lambda 20 | import keras.backend as K 21 | from keras.regularizers import l2 22 | import nni 23 | 24 | import data_input 25 | from config import Config 26 | from .base_model import BaseModel 27 | 28 | random.seed(9102) 29 | 30 | 31 | def cosine_similarity(a, b): 32 | c = tf.sqrt(tf.reduce_sum(tf.multiply(a, a), axis=1)) 33 | d = tf.sqrt(tf.reduce_sum(tf.multiply(b, b), axis=1)) 34 | e = tf.reduce_sum(tf.multiply(a, b), axis=1) 35 | f = tf.multiply(c, d) 36 | r = tf.divide(e, f) 37 | return r 38 | 39 | 40 | def variable_summaries(var, name): 41 | """Attach a lot of summaries to a Tensor.""" 42 | with tf.name_scope('summaries'): 43 | mean = tf.reduce_mean(var) 44 | tf.summary.scalar('mean/' + name, mean) 45 | with tf.name_scope('stddev'): 46 | stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean))) 47 | tf.summary.scalar('sttdev/' + name, stddev) 48 | tf.summary.scalar('max/' + name, tf.reduce_max(var)) 49 | tf.summary.scalar('min/' + name, tf.reduce_min(var)) 50 | tf.summary.histogram(name, var) 51 | 52 | class BertClassifier(BaseModel): 53 | def __init__(self, cfg, is_training=1): 54 | super(BertClassifier, self).__init__(cfg, is_training) 55 | pass 56 | 57 | def add_placeholder(self): 58 | # 预测时只用输入query即可,将其embedding为向量。 59 | self.q_ids = tf.placeholder( 60 | tf.int32, shape=[None, None], name='query_batch') 61 | self.q_mask_ids = tf.placeholder( 62 | tf.int32, shape=[None, None], name='q_mask_ids') 63 | self.q_seg_ids = tf.placeholder( 64 | tf.int32, shape=[None, None], name='q_seg_ids') 65 | self.q_seq_length = tf.placeholder( 66 | tf.int32, shape=[None], name='query_sequence_length') 67 | self.is_train_place = tf.placeholder( 68 | dtype=tf.bool, name='is_train_place') 69 | # label 70 | self.sim_labels = tf.placeholder( 71 | tf.float32, shape=[None], name="sim_labels") 72 | 73 | def forward(self): 74 | # 获取cls的输出 75 | q_emb, _, self.q_e = self.share_bert_layer( 76 | self.is_train_place, self.q_ids, self.q_mask_ids, self.q_seg_ids, use_bert_pre=1) 77 | predict_prob = Dense(units=1, activation='sigmoid')(q_emb) 78 | self.predict_prob = tf.reshape(predict_prob, [-1]) 79 | self.predict_idx = tf.cast(tf.greater_equal(predict_prob, 0.5), tf.int32) 80 | with tf.name_scope('Loss'): 81 | # Train Loss 82 | loss = tf.losses.log_loss(self.sim_labels, self.predict_prob) 83 | self.loss = tf.reduce_mean(loss) 84 | tf.summary.scalar('loss', self.loss) 85 | 86 | def build(self): 87 | self.add_placeholder() 
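        # Descriptive note: forward() below takes the [CLS]-level output that
        # share_bert_layer returns (q_emb, per the comment in forward) and
        # feeds it through a single sigmoid Dense unit, so the sentence-pair
        # classifier is effectively BERT plus logistic regression trained with
        # tf.losses.log_loss.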
88 | self.forward() 89 | self.add_train_op(self.cfg['optimizer'], 90 | self.cfg['learning_rate'], self.loss) 91 | self._init_session() 92 | self._add_summary() 93 | pass 94 | 95 | def feed_batch(self, out_ids1, m_ids1, seg_ids1, seq_len1, label=None, is_test=0): 96 | is_train = 0 if is_test else 1 97 | fd = { 98 | self.q_ids: out_ids1, self.q_mask_ids: m_ids1, 99 | self.q_seg_ids: seg_ids1, 100 | self.q_seq_length: seq_len1, 101 | self.is_train_place: is_train} 102 | if label: 103 | fd[self.sim_labels] = label 104 | return fd 105 | 106 | def run_epoch(self, epoch, d_train, d_val): 107 | steps = int(math.ceil(float(len(d_train)) / self.cfg['batch_size'])) 108 | progbar = tf.keras.utils.Progbar(steps) 109 | # 每个 epoch 分batch训练 110 | batch_iter = data_input.get_batch( 111 | d_train, batch_size=self.cfg['batch_size']) 112 | for i, (out_ids1, m_ids1, seg_ids1, seq_len1, label) in enumerate(batch_iter): 113 | fd = self.feed_batch(out_ids1, m_ids1, seg_ids1, seq_len1, label) 114 | # a = self.sess.run([self.is_train_place, self.q_e], feed_dict=fd) 115 | _, cur_loss = self.sess.run( 116 | [self.train_op, self.loss], feed_dict=fd) 117 | progbar.update(i + 1, [("loss", cur_loss)]) 118 | # 训练完一个epoch之后,使用验证集评估,然后预测, 然后评估准确率 119 | dev_acc = self.eval(d_val) 120 | nni.report_intermediate_result(dev_acc) 121 | print("dev set acc:", dev_acc) 122 | return dev_acc 123 | 124 | def eval(self, test_data): 125 | pbar = data_input.get_batch( 126 | test_data, batch_size=self.cfg['batch_size'], is_test=1) 127 | val_label, val_pred = [], [] 128 | for (out_ids1, m_ids1, seg_ids1, seq_len1, label) in pbar: 129 | val_label.extend(label) 130 | fd = self.feed_batch(out_ids1, m_ids1, seg_ids1, seq_len1, is_test=1) 131 | pred_labels, pred_prob = self.sess.run( 132 | [self.predict_idx, self.predict_prob], feed_dict=fd) 133 | val_pred.extend(pred_labels) 134 | test_acc = accuracy_score(val_label, val_pred) 135 | return test_acc 136 | 137 | def predict(self, test_data): 138 | pbar = data_input.get_batch( 139 | test_data, batch_size=self.cfg['batch_size'], is_test=1) 140 | val_pred, val_prob = [], [] 141 | for (t1_ids, t1_len, t2_ids, t2_len) in pbar: 142 | fd = self.feed_batch(t1_ids, t1_len, t2_ids, t2_len, is_test=1) 143 | pred_labels, pred_prob = self.sess.run( 144 | [self.predict_idx, self.predict_prob], feed_dict=fd) 145 | val_pred.extend(pred_labels) 146 | val_prob.extend(pred_prob) 147 | return val_pred, val_prob 148 | 149 | 150 | if __name__ == "__main__": 151 | start = time.time() 152 | # 读取配置 153 | conf = Config() 154 | # 读取数据 155 | dataset = hub.dataset.LCQMC() 156 | data_train, data_val, data_test = data_input.get_lcqmc() 157 | # data_train = data_train[:10000] 158 | print("train size:{},val size:{}, test size:{}".format( 159 | len(data_train), len(data_val), len(data_test))) 160 | model = SiamenseRNN(conf) 161 | model.fit(data_train, data_val, data_test) 162 | pass 163 | -------------------------------------------------------------------------------- /model/siamese_network.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding=utf-8 3 | ''' 4 | @Time : 2020/10/17 11:38:00 5 | @Author : zhiyang.zzy 6 | @Contact : zhiyangchou@gmail.com 7 | @Desc : siamense network, 使用曼哈顿距离、cos相似度进行实验。 8 | 1. 使用预训练词向量。2. 使用lcqmc数据集实验。3. 
添加预测。 9 | todo: add triplet loss 10 | ''' 11 | 12 | # here put the import lib 13 | from os import name 14 | import time 15 | import numpy as np 16 | import tensorflow as tf 17 | import random 18 | import paddlehub as hub 19 | from sklearn.metrics import accuracy_score 20 | import math 21 | from keras.layers import Dense, Subtract, Lambda 22 | import keras.backend as K 23 | from keras.regularizers import l2 24 | 25 | import data_input 26 | from config import Config 27 | from .base_model import BaseModel 28 | 29 | random.seed(9102) 30 | 31 | 32 | def cosine_similarity(a, b): 33 | c = tf.sqrt(tf.reduce_sum(tf.multiply(a, a), axis=1)) 34 | d = tf.sqrt(tf.reduce_sum(tf.multiply(b, b), axis=1)) 35 | e = tf.reduce_sum(tf.multiply(a, b), axis=1) 36 | f = tf.multiply(c, d) 37 | r = tf.divide(e, f) 38 | return r 39 | 40 | def siamese_loss(out1,out2,y,Q=5): 41 | # 使用欧式距离,概率使用e^{-x} 42 | Q = tf.constant(Q, name="Q",dtype=tf.float32) 43 | E_w = tf.sqrt(tf.reduce_sum(tf.square(out1-out2),1)) 44 | pos = tf.multiply(tf.multiply(y,2/Q),tf.square(E_w)) 45 | neg = tf.multiply(tf.multiply(1-y,2*Q),tf.exp(-2.77/Q*E_w)) 46 | loss = pos + neg 47 | loss = tf.reduce_mean(loss) 48 | prob = tf.exp(-E_w) 49 | return loss, prob 50 | 51 | def variable_summaries(var, name): 52 | """Attach a lot of summaries to a Tensor.""" 53 | with tf.name_scope('summaries'): 54 | mean = tf.reduce_mean(var) 55 | tf.summary.scalar('mean/' + name, mean) 56 | with tf.name_scope('stddev'): 57 | stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean))) 58 | tf.summary.scalar('sttdev/' + name, stddev) 59 | tf.summary.scalar('max/' + name, tf.reduce_max(var)) 60 | tf.summary.scalar('min/' + name, tf.reduce_min(var)) 61 | tf.summary.histogram(name, var) 62 | 63 | 64 | class SiamenseRNN(BaseModel): 65 | def __init__(self, cfg, is_training=1): 66 | # config来自于yml, 或者config.py 文件。 67 | self.cfg = cfg 68 | # if not is_training: dropout=0 69 | self.is_training = is_training 70 | if not is_training: 71 | self.cfg['dropout'] = 0 72 | self.build() 73 | pass 74 | pass 75 | 76 | def share_encoder(self, query_batch, query_seq_length, keep_prob_place): 77 | with tf.variable_scope('word_embeddings_layer', reuse=tf.AUTO_REUSE): 78 | # 这里可以加载预训练词向量 79 | _word_embedding = tf.get_variable(name="word_embedding_arr", dtype=tf.float32, 80 | shape=[self.cfg['nwords'], self.cfg['word_dim']]) 81 | query_embed = tf.nn.embedding_lookup( 82 | _word_embedding, query_batch, name='query_batch_embed') 83 | with tf.variable_scope('RNN', reuse=tf.AUTO_REUSE): 84 | # Abandon bag of words, use GRU, you can use stacked gru 85 | cell_fw = tf.contrib.rnn.GRUCell( 86 | self.cfg['hidden_size_rnn'], reuse=tf.AUTO_REUSE) # , reuse=tf.AUTO_REUSE 87 | cell_bw = tf.contrib.rnn.GRUCell( 88 | self.cfg['hidden_size_rnn'], reuse=tf.AUTO_REUSE) 89 | # query 90 | (_, _), (query_output_fw, query_output_bw) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, query_embed, 91 | sequence_length=query_seq_length, 92 | dtype=tf.float32) 93 | query_rnn_output = tf.concat( 94 | [query_output_fw, query_output_bw], axis=-1) 95 | query_rnn_output = tf.nn.dropout(query_rnn_output, keep_prob_place) 96 | # TODO: 使用mean pooling, 或者self attention 来代替最后一个states 97 | return query_rnn_output 98 | 99 | def cos_sim(self, query_rnn_output, doc_rnn_output): 100 | with tf.name_scope('Cosine_Similarity'): 101 | # Cosine similarity 102 | # query_norm = sqrt(sum(each x^2)) 103 | query_norm = tf.sqrt(tf.reduce_sum(tf.square(query_rnn_output), 1)) 104 | # doc_norm = sqrt(sum(each x^2)) 105 | doc_norm = 
tf.sqrt(tf.reduce_sum(tf.square(doc_rnn_output), 1)) 106 | 107 | # 内积 108 | prod = tf.reduce_sum(tf.multiply( 109 | query_rnn_output, doc_rnn_output), axis=1) 110 | # 模相乘 111 | mul = tf.multiply(query_norm, doc_norm) 112 | # cos_sim_raw = query * doc / (||query|| * ||doc||) 113 | # cos_sim_raw = tf.truediv(prod, tf.multiply(query_norm, doc_norm)) 114 | cos_sim_raw = tf.divide(prod, mul) 115 | predict_prob = tf.sigmoid(cos_sim_raw) 116 | predict_idx = tf.cast(tf.greater_equal( 117 | predict_prob, 0.5), tf.int32) 118 | return predict_prob, predict_idx 119 | 120 | def l1_distance(self, query_rnn_output, doc_rnn_output): 121 | l1_distance_layer = Lambda( 122 | lambda tensors: K.abs(tensors[0] - tensors[1])) 123 | l1_distance = l1_distance_layer([query_rnn_output, doc_rnn_output]) 124 | l1_distance = tf.concat([l1_distance, query_rnn_output, doc_rnn_output], axis=-1) 125 | predict_prob = Dense(units=1, activation='sigmoid')(l1_distance) 126 | # bs * 1 127 | predict_prob = tf.reshape(predict_prob, [-1]) 128 | predict_idx = tf.cast(tf.greater_equal(predict_prob, 0.5), tf.int32) 129 | return predict_prob, predict_idx 130 | 131 | def forward(self): 132 | # 共享的encode来编码query 133 | query_rnn_output = self.share_encoder( 134 | self.query_batch, self.query_seq_length, self.keep_prob_place) 135 | self.query_rnn_output = query_rnn_output 136 | self.q_emb = query_rnn_output 137 | doc_rnn_output = self.share_encoder( 138 | self.doc_batch, self.doc_seq_length, self.keep_prob_place) 139 | # 计算cos相似度: 140 | # self.predict_prob, self.predict_idx = self.cos_sim(query_rnn_output, doc_rnn_output) 141 | # 使用原文曼哈顿距离 142 | self.predict_prob, self.predict_idx = self.l1_distance( 143 | query_rnn_output, doc_rnn_output) 144 | 145 | with tf.name_scope('Loss'): 146 | # Train Loss 147 | # cross_entropy = -tf.reduce_mean(self.sim_labels * tf.log(tf.clip_by_value(self.predict_prob,1e-10,1.0))+(1-self.sim_labels) * tf.log(tf.clip_by_value(1-self.predict_prob,1e-10,1.0))) 148 | loss = tf.losses.log_loss(self.sim_labels, self.predict_prob) 149 | self.loss = tf.reduce_mean(loss) 150 | tf.summary.scalar('loss', self.loss) 151 | # with tf.name_scope('Accuracy'): 152 | # correct_prediction = tf.equal(tf.argmax(prob, 1), 0) 153 | # accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 154 | # tf.summary.scalar('accuracy', accuracy) 155 | 156 | def add_placeholder(self): 157 | with tf.name_scope('input'): 158 | # 预测时只用输入query即可,将其embedding为向量。 159 | self.query_batch = tf.placeholder( 160 | tf.int32, shape=[None, None], name='query_batch') 161 | self.doc_batch = tf.placeholder( 162 | tf.int32, shape=[None, None], name='doc_batch') 163 | self.query_seq_length = tf.placeholder( 164 | tf.int32, shape=[None], name='query_sequence_length') 165 | self.doc_seq_length = tf.placeholder( 166 | tf.int32, shape=[None], name='doc_seq_length') 167 | # label 168 | self.sim_labels = tf.placeholder( 169 | tf.float32, shape=[None], name="sim_labels") 170 | self.keep_prob_place = tf.placeholder(tf.float32, name='keep_prob') 171 | 172 | def build(self): 173 | self.add_placeholder() 174 | self.forward() 175 | self.add_train_op(self.cfg['optimizer'], 176 | self.cfg['learning_rate'], self.loss) 177 | self._init_session() 178 | self._add_summary() 179 | pass 180 | 181 | def feed_batch(self, t1_ids, t1_len, t2_ids, t2_len, label=None, is_test=0): 182 | keep_porb = 1 if is_test else self.cfg['keep_porb'] 183 | fd = { 184 | self.query_batch: t1_ids, self.doc_batch: t2_ids, self.query_seq_length: t1_len, 185 | self.doc_seq_length: t2_len, 
self.keep_prob_place: keep_porb} 186 | if label: 187 | fd[self.sim_labels] = label 188 | return fd 189 | 190 | def eval(self, test_data): 191 | pbar = data_input.get_batch( 192 | test_data, batch_size=self.cfg['batch_size'], is_test=1) 193 | val_label, val_pred = [], [] 194 | for (t1_ids, t1_len, t2_ids, t2_len, label) in pbar: 195 | val_label.extend(label) 196 | fd = self.feed_batch(t1_ids, t1_len, t2_ids, t2_len, is_test=1) 197 | pred_labels, pred_prob = self.sess.run( 198 | [self.predict_idx, self.predict_prob], feed_dict=fd) 199 | val_pred.extend(pred_labels) 200 | test_acc = accuracy_score(val_label, val_pred) 201 | return test_acc 202 | 203 | def predict(self, test_data): 204 | pbar = data_input.get_batch( 205 | test_data, batch_size=self.cfg['batch_size'], is_test=1) 206 | val_pred, val_prob = [], [] 207 | for (t1_ids, t1_len, t2_ids, t2_len) in pbar: 208 | fd = self.feed_batch(t1_ids, t1_len, t2_ids, t2_len, is_test=1) 209 | pred_labels, pred_prob = self.sess.run( 210 | [self.predict_idx, self.predict_prob], feed_dict=fd) 211 | val_pred.extend(pred_labels) 212 | val_prob.extend(pred_prob) 213 | return val_pred, val_prob 214 | 215 | def run_epoch(self, epoch, data_train, data_val): 216 | steps = int(math.ceil(float(len(data_train)) / self.cfg['batch_size'])) 217 | progbar = tf.keras.utils.Progbar(steps) 218 | # 每个 epoch 分batch训练 219 | batch_iter = data_input.get_batch( 220 | data_train, batch_size=self.cfg['batch_size']) 221 | for i, (t1_ids, t1_len, t2_ids, t2_len, label) in enumerate(batch_iter): 222 | fd = self.feed_batch(t1_ids, t1_len, t2_ids, t2_len, label) 223 | # a = sess.run([query_norm, doc_norm, prod, cos_sim_raw], feed_dict=fd) 224 | _, cur_loss = self.sess.run( 225 | [self.train_op, self.loss], feed_dict=fd) 226 | progbar.update(i + 1, [("loss", cur_loss)]) 227 | # 训练完一个epoch之后,使用验证集评估,然后预测, 然后评估准确率 228 | dev_acc = self.eval(data_val) 229 | print("dev set acc:", dev_acc) 230 | return dev_acc 231 | 232 | 233 | class SiamenseBert(SiamenseRNN): 234 | def __init__(self, cfg, is_training=1): 235 | super(SiamenseBert, self).__init__(cfg, is_training) 236 | pass 237 | 238 | def add_placeholder(self): 239 | # 预测时只用输入query即可,将其embedding为向量。 240 | self.q_ids = tf.placeholder( 241 | tf.int32, shape=[None, None], name='query_batch') 242 | self.q_mask_ids = tf.placeholder( 243 | tf.int32, shape=[None, None], name='q_mask_ids') 244 | self.q_seg_ids = tf.placeholder( 245 | tf.int32, shape=[None, None], name='q_seg_ids') 246 | self.q_seq_length = tf.placeholder( 247 | tf.int32, shape=[None], name='query_sequence_length') 248 | 249 | self.d_ids = tf.placeholder( 250 | tf.int32, shape=[None, None], name='doc_batch') 251 | self.d_mask_ids = tf.placeholder( 252 | tf.int32, shape=[None, None], name='d_mask_ids') 253 | self.d_seg_ids = tf.placeholder( 254 | tf.int32, shape=[None, None], name='d_seg_ids') 255 | self.d_seq_length = tf.placeholder( 256 | tf.int32, shape=[None], name='doc_seq_length') 257 | self.is_train_place = tf.placeholder( 258 | dtype=tf.bool, name='is_train_place') 259 | # label 260 | self.sim_labels = tf.placeholder( 261 | tf.float32, shape=[None], name="sim_labels") 262 | self.keep_prob_place = tf.placeholder(tf.float32, name='keep_prob') 263 | def siamese_loss(self, out1, out2, y, Q=5.0): 264 | Q = tf.constant(Q, dtype=tf.float32) 265 | E_w = tf.sqrt(tf.reduce_sum(tf.square(out1-out2),1)) 266 | pos = tf.multiply(tf.multiply(y,2/Q),tf.square(E_w)) 267 | neg = tf.multiply(tf.multiply(1-y,2*Q),tf.exp(-2.77/Q*E_w)) 268 | loss = pos + neg 269 | loss = tf.reduce_mean(loss) 
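        # Sketch of what the terms above compute: E_w is the Euclidean distance
        # between the two tower outputs; similar pairs (y=1) contribute
        # (2/Q) * E_w^2, pulling E_w toward 0, while dissimilar pairs (y=0)
        # contribute 2*Q * exp(-2.77/Q * E_w), which only shrinks once E_w is
        # large -- the same pairwise energy-loss form as Chopra et al. (2005).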
270 | return loss 271 | def contrastive_loss(self, model1, model2, y, margin=0.5): 272 | with tf.name_scope("contrastive-loss"): 273 | distance = tf.sqrt(tf.reduce_sum(tf.pow(model1 - model2, 2), 1, keepdims=True)) 274 | similarity = y * tf.square(distance) # keep the similar label (1) close to each other 275 | dissimilarity = (1 - y) * tf.square(tf.maximum((margin - distance), 0)) # give penalty to dissimilar label if the distance is bigger than margin 276 | return tf.reduce_mean(dissimilarity + similarity) / 2 277 | def forward(self): 278 | # 获取cls的输出 279 | q_emb, _, self.q_e = self.share_bert_layer( 280 | self.is_train_place, self.q_ids, self.q_mask_ids, self.q_seg_ids, use_bert_pre=1) 281 | d_emb, _, self.d_e = self.share_bert_layer( 282 | self.is_train_place, self.d_ids, self.d_mask_ids, self.d_seg_ids, use_bert_pre=1) 283 | self.q_emb = q_emb 284 | # 计算cos相似度: 285 | # self.predict_prob, self.predict_idx = self.cos_sim(q_emb, d_emb) 286 | # 使用原文曼哈顿距离 287 | self.predict_prob, self.predict_idx = self.l1_distance(q_emb, d_emb) 288 | with tf.name_scope('Loss'): 289 | # Train Loss 290 | # cross_entropy = -tf.reduce_mean(self.sim_labels * tf.log(tf.clip_by_value(self.predict_prob,1e-10,1.0))+(1-self.sim_labels) * tf.log(tf.clip_by_value(1-self.predict_prob,1e-10,1.0))) 291 | loss = tf.losses.log_loss(self.sim_labels, self.predict_prob) 292 | self.loss = tf.reduce_mean(loss) 293 | tf.summary.scalar('loss', self.loss) 294 | 295 | def build(self): 296 | self.add_placeholder() 297 | self.forward() 298 | self.add_train_op(self.cfg['optimizer'], 299 | self.cfg['learning_rate'], self.loss) 300 | self._init_session() 301 | self._add_summary() 302 | pass 303 | 304 | def feed_batch(self, out_ids1, m_ids1, seg_ids1, seq_len1, out_ids2, m_ids2, seg_ids2, seq_len2, label=None, is_test=0): 305 | keep_porb = 1 if is_test else self.cfg['keep_porb'] 306 | is_train = 0 if is_test else 1 307 | fd = { 308 | self.q_ids: out_ids1, self.q_mask_ids: m_ids1, 309 | self.q_seg_ids: seg_ids1, 310 | self.q_seq_length: seq_len1, 311 | self.d_ids: out_ids2, 312 | self.d_mask_ids: m_ids2, 313 | self.d_seg_ids: seg_ids2, 314 | self.d_seq_length: seq_len2, 315 | self.keep_prob_place: keep_porb, 316 | self.is_train_place: is_train} 317 | if label: 318 | fd[self.sim_labels] = label 319 | return fd 320 | 321 | def run_epoch(self, epoch, d_train, d_val): 322 | steps = int(math.ceil(float(len(d_train)) / self.cfg['batch_size'])) 323 | progbar = tf.keras.utils.Progbar(steps) 324 | # 每个 epoch 分batch训练 325 | batch_iter = data_input.get_batch( 326 | d_train, batch_size=self.cfg['batch_size']) 327 | for i, (out_ids1, m_ids1, seg_ids1, seq_len1, out_ids2, m_ids2, seg_ids2, seq_len2, label) in enumerate(batch_iter): 328 | fd = self.feed_batch(out_ids1, m_ids1, seg_ids1, seq_len1, 329 | out_ids2, m_ids2, seg_ids2, seq_len2, label) 330 | # a = self.sess.run([self.q_emb1, self.q_e, self.d_e], feed_dict=fd) 331 | _, cur_loss = self.sess.run( 332 | [self.train_op, self.loss], feed_dict=fd) 333 | progbar.update(i + 1, [("loss", cur_loss)]) 334 | # 训练完一个epoch之后,使用验证集评估,然后预测, 然后评估准确率 335 | dev_acc = self.eval(d_val) 336 | print("dev set acc:", dev_acc) 337 | return dev_acc 338 | 339 | def eval(self, test_data): 340 | pbar = data_input.get_batch( 341 | test_data, batch_size=self.cfg['batch_size'], is_test=1) 342 | val_label, val_pred = [], [] 343 | for (out_ids1, m_ids1, seg_ids1, seq_len1, out_ids2, m_ids2, seg_ids2, seq_len2, label) in pbar: 344 | val_label.extend(label) 345 | fd = self.feed_batch(out_ids1, m_ids1, seg_ids1, seq_len1, 
out_ids2, m_ids2, seg_ids2, seq_len2, is_test=1) 346 | pred_labels, pred_prob = self.sess.run( 347 | [self.predict_idx, self.predict_prob], feed_dict=fd) 348 | val_pred.extend(pred_labels) 349 | test_acc = accuracy_score(val_label, val_pred) 350 | return test_acc 351 | 352 | def predict(self, test_data): 353 | pbar = data_input.get_batch( 354 | test_data, batch_size=self.cfg['batch_size'], is_test=1) 355 | val_pred, val_prob = [], [] 356 | for (t1_ids, t1_len, t2_ids, t2_len) in pbar: 357 | fd = self.feed_batch(t1_ids, t1_len, t2_ids, t2_len, is_test=1) 358 | pred_labels, pred_prob = self.sess.run( 359 | [self.predict_idx, self.predict_prob], feed_dict=fd) 360 | val_pred.extend(pred_labels) 361 | val_prob.extend(pred_prob) 362 | return val_pred, val_prob 363 | 364 | def predict_embedding(self, test_data): 365 | pbar = data_input.get_batch( 366 | test_data, batch_size=self.cfg['batch_size'], is_test=1) 367 | val_embed = [] 368 | for (out_ids1, m_ids1, seg_ids1, seq_len1) in pbar: 369 | fd = { 370 | self.q_ids: out_ids1, self.q_mask_ids: m_ids1, 371 | self.q_seg_ids: seg_ids1, 372 | self.q_seq_length: seq_len1, 373 | self.keep_prob_place: 1, 374 | self.is_train_place: 0 375 | } 376 | pred_embedding = self.sess.run(self.q_emb, feed_dict=fd) 377 | val_embed.extend(pred_embedding) 378 | return val_embed 379 | 380 | 381 | if __name__ == "__main__": 382 | start = time.time() 383 | # 读取配置 384 | conf = Config() 385 | # 读取数据 386 | dataset = hub.dataset.LCQMC() 387 | data_train, data_val, data_test = data_input.get_lcqmc() 388 | # data_train = data_train[:10000] 389 | print("train size:{},val size:{}, test size:{}".format( 390 | len(data_train), len(data_val), len(data_test))) 391 | model = SiamenseRNN(conf) 392 | model.fit(data_train, data_val, data_test) 393 | pass 394 | -------------------------------------------------------------------------------- /multi_view_dssm_v3.py: -------------------------------------------------------------------------------- 1 | # coding=utf8 2 | """ 3 | python=3.5 4 | TensorFlow=1.2.1 5 | """ 6 | 7 | import pandas as pd 8 | from scipy import sparse 9 | import collections 10 | import random 11 | import time 12 | import numpy as np 13 | import tensorflow as tf 14 | import multi_view_data_input 15 | 16 | flags = tf.app.flags 17 | FLAGS = flags.FLAGS 18 | 19 | flags.DEFINE_string('summaries_dir', 'Summaries', 'Summaries directory') 20 | flags.DEFINE_float('learning_rate', 0.05, 'Initial learning rate.') 21 | flags.DEFINE_integer('max_steps', 800000, 'Number of steps to run trainer.') 22 | flags.DEFINE_integer('epoch_steps', 200, "Number of steps in one epoch.") 23 | flags.DEFINE_integer('test_pack_size', 3185, "Number of steps in one epoch.") 24 | flags.DEFINE_bool('gpu', 0, "Enable GPU or not") 25 | 26 | start = time.time() 27 | # user feature维度 28 | user_dimension = 17309 29 | # 负样本个数 30 | NEG = 4 31 | # positive batch size 32 | user_BS = 100 33 | # batch size 34 | # BS = user_BS * (NEG + 1) 35 | # 第1层网络的单元数目 36 | L1_N = 400 37 | # 第1层网络的单元数目 38 | L2_N = 120 39 | 40 | # 读取数据 41 | # train_size, test_size = 1000000, 100000 42 | # data_sets = multi_view_data_input.load_data() 43 | data_sets = multi_view_data_input.get_data() 44 | user_dimension = data_sets.TRIGRAM_D 45 | # view1维度 46 | view1_dimension = data_sets.app_number 47 | view2_dimension = data_sets.music_number 48 | view3_dimension = data_sets.novel_number 49 | # view1 训练集大小 50 | view1_size = data_sets.app_his.shape[0] 51 | view2_size = data_sets.music_his.shape[0] 52 | view3_size = data_sets.novel_his.shape[0] 53 | 
total_size = view1_size + view2_size + view3_size 54 | # view1 测试集大小 55 | view1_size_test = data_sets.app_his_test.shape[0] 56 | view2_size_test = data_sets.music_his_test.shape[0] 57 | view3_size_test = data_sets.novel_his_test.shape[0] 58 | # 测试集package size 59 | flags.test_pack_size = int((view1_size_test + view2_size_test + view3_size_test) / user_BS) 60 | 61 | 62 | def batch_normalization(x, phase_train, out_size): 63 | """ 64 | Batch normalization on convolutional maps. 65 | Ref.: http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow 66 | Args: 67 | x: Tensor, 4D BHWD input maps 68 | out_size: integer, depth of input maps 69 | phase_train: boolean tf.Varialbe, true indicates training phase 70 | scope: string, variable scope 71 | Return: 72 | normed: batch-normalized maps 73 | """ 74 | with tf.variable_scope('bn'): 75 | beta = tf.Variable(tf.constant(0.0, shape=[out_size]), 76 | name='beta', trainable=True) 77 | gamma = tf.Variable(tf.constant(1.0, shape=[out_size]), 78 | name='gamma', trainable=True) 79 | batch_mean, batch_var = tf.nn.moments(x, [0], name='moments') 80 | ema = tf.train.ExponentialMovingAverage(decay=0.5) 81 | 82 | def mean_var_with_update(): 83 | ema_apply_op = ema.apply([batch_mean, batch_var]) 84 | with tf.control_dependencies([ema_apply_op]): 85 | return tf.identity(batch_mean), tf.identity(batch_var) 86 | 87 | mean, var = tf.cond(phase_train, 88 | mean_var_with_update, 89 | lambda: (ema.average(batch_mean), ema.average(batch_var))) 90 | normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3) 91 | return normed 92 | 93 | 94 | def variable_summaries(var, name): 95 | """Attach a lot of summaries to a Tensor.""" 96 | with tf.name_scope('summaries'): 97 | mean = tf.reduce_mean(var) 98 | tf.summary.scalar('mean/' + name, mean) 99 | with tf.name_scope('stddev'): 100 | stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean))) 101 | tf.summary.scalar('sttdev/' + name, stddev) 102 | tf.summary.scalar('max/' + name, tf.reduce_max(var)) 103 | tf.summary.scalar('min/' + name, tf.reduce_min(var)) 104 | tf.summary.histogram(name, var) 105 | 106 | 107 | with tf.name_scope('input'): 108 | user_batch = tf.sparse_placeholder(tf.float32, shape=[None, user_dimension], name='user_batch') 109 | view1_batch = tf.sparse_placeholder(tf.float32, shape=[None, view1_dimension], name='view1_batch') 110 | view2_batch = tf.sparse_placeholder(tf.float32, shape=[None, view2_dimension], name='view2_batch') 111 | view3_batch = tf.sparse_placeholder(tf.float32, shape=[None, view3_dimension], name='view3_batch') 112 | active_view = tf.placeholder(tf.int32, name='active_view_number') 113 | on_train = tf.placeholder(tf.bool) 114 | 115 | with tf.name_scope('User_View'): 116 | with tf.name_scope('User_FC1'): 117 | user_fc1_par_range = np.sqrt(6.0 / (user_dimension + L1_N)) 118 | user_weight1 = tf.Variable(tf.random_uniform([user_dimension, L1_N], -user_fc1_par_range, user_fc1_par_range)) 119 | user_bias1 = tf.Variable(tf.random_uniform([L1_N], -user_fc1_par_range, user_fc1_par_range)) 120 | # variable_summaries(user_weight1, 'L1_weights') 121 | # variable_summaries(user_bias1, 'L1_biases') 122 | 123 | user_l1 = tf.sparse_tensor_dense_matmul(user_batch, user_weight1) + user_bias1 124 | user_l1_out = tf.nn.relu(user_l1) 125 | 126 | with tf.name_scope('User_FC2'): 127 | user_fc2_par_range = np.sqrt(6.0 / (L1_N + L2_N)) 128 | user_weight2 = tf.Variable(tf.random_uniform([L1_N, L2_N], -user_fc2_par_range, user_fc2_par_range)) 129 | user_bias2 = 
tf.Variable(tf.random_uniform([L2_N], -user_fc2_par_range, user_fc2_par_range)) 130 | # variable_summaries(user_weight2, 'L2_weights') 131 | # variable_summaries(user_bias2, 'L2_biases') 132 | 133 | user_l2 = tf.matmul(user_l1_out, user_weight2) + user_bias2 134 | user_y = tf.nn.relu(user_l2) 135 | 136 | with tf.name_scope('Item_view1'): 137 | with tf.name_scope('Item_FC1'): 138 | view1_fc1_par_range = np.sqrt(6.0 / (view1_dimension + L1_N)) 139 | view1_weight1 = tf.Variable(tf.random_uniform([view1_dimension, L1_N], -view1_fc1_par_range, view1_fc1_par_range)) 140 | view1_bias1 = tf.Variable(tf.random_uniform([L1_N], -view1_fc1_par_range, view1_fc1_par_range)) 141 | # variable_summaries(item_weight1, 'L1_weights') 142 | # variable_summaries(item_bias1, 'L1_biases') 143 | view1_positive_l1 = tf.sparse_tensor_dense_matmul(view1_batch, view1_weight1) + view1_bias1 144 | view1_positive_l1_out = tf.nn.relu(view1_positive_l1) 145 | 146 | with tf.name_scope('Item_FC2'): 147 | view1_fc2_par_range = np.sqrt(6.0 / (L1_N + L2_N)) 148 | view1_weight2 = tf.Variable(tf.random_uniform([L1_N, L2_N], -view1_fc2_par_range, view1_fc2_par_range)) 149 | view1_bias2 = tf.Variable(tf.random_uniform([L2_N], -view1_fc2_par_range, view1_fc2_par_range)) 150 | # variable_summaries(item_weight2, 'L2_weights') 151 | # variable_summaries(item_bias2, 'L2_biases') 152 | 153 | view1_positive_l2 = tf.matmul(view1_positive_l1_out, view1_weight2) + view1_bias2 154 | view1_positive_y = tf.nn.relu(view1_positive_l2) 155 | 156 | with tf.name_scope('Item_view2'): 157 | with tf.name_scope('Item_FC1'): 158 | view2_fc1_par_range = np.sqrt(6.0 / (view2_dimension + L1_N)) 159 | view2_weight1 = tf.Variable(tf.random_uniform([view2_dimension, L1_N], -view2_fc1_par_range, view2_fc1_par_range)) 160 | view2_bias1 = tf.Variable(tf.random_uniform([L1_N], -view2_fc1_par_range, view2_fc1_par_range)) 161 | # variable_summaries(item_weight1, 'L1_weights') 162 | # variable_summaries(item_bias1, 'L1_biases') 163 | view2_positive_l1 = tf.sparse_tensor_dense_matmul(view2_batch, view2_weight1) + view2_bias1 164 | view2_positive_l1_out = tf.nn.relu(view2_positive_l1) 165 | 166 | with tf.name_scope('Item_FC2'): 167 | view2_fc2_par_range = np.sqrt(6.0 / (L1_N + L2_N)) 168 | view2_weight2 = tf.Variable(tf.random_uniform([L1_N, L2_N], -view2_fc2_par_range, view2_fc2_par_range)) 169 | view2_bias2 = tf.Variable(tf.random_uniform([L2_N], -view2_fc2_par_range, view2_fc2_par_range)) 170 | # variable_summaries(item_weight2, 'L2_weights') 171 | # variable_summaries(item_bias2, 'L2_biases') 172 | 173 | view2_positive_l2 = tf.matmul(view2_positive_l1_out, view2_weight2) + view2_bias2 174 | view2_positive_y = tf.nn.relu(view2_positive_l2) 175 | 176 | with tf.name_scope('Item_view3'): 177 | with tf.name_scope('Item_FC1'): 178 | view3_fc1_par_range = np.sqrt(6.0 / (view3_dimension + L1_N)) 179 | view3_weight1 = tf.Variable(tf.random_uniform([view3_dimension, L1_N], -view3_fc1_par_range, view3_fc1_par_range)) 180 | view3_bias1 = tf.Variable(tf.random_uniform([L1_N], -view3_fc1_par_range, view3_fc1_par_range)) 181 | # variable_summaries(item_weight1, 'L1_weights') 182 | # variable_summaries(item_bias1, 'L1_biases') 183 | view3_positive_l1 = tf.sparse_tensor_dense_matmul(view3_batch, view3_weight1) + view3_bias1 184 | view3_positive_l1_out = tf.nn.relu(view3_positive_l1) 185 | 186 | with tf.name_scope('Item_FC2'): 187 | view3_fc2_par_range = np.sqrt(6.0 / (L1_N + L2_N)) 188 | view3_weight2 = tf.Variable(tf.random_uniform([L1_N, L2_N], -view3_fc2_par_range, 
view3_fc2_par_range)) 189 | view3_bias2 = tf.Variable(tf.random_uniform([L2_N], -view3_fc2_par_range, view3_fc2_par_range)) 190 | # variable_summaries(item_weight2, 'L2_weights') 191 | # variable_summaries(item_bias2, 'L2_biases') 192 | 193 | view3_positive_l2 = tf.matmul(view3_positive_l1_out, view3_weight2) + view3_bias2 194 | view3_positive_y = tf.nn.relu(view3_positive_l2) 195 | 196 | with tf.name_scope('Make_Negative_Item'): 197 | # 合并负样本,tile可选择是否扩展负样本。 198 | # 判断激活哪一个view。 199 | if active_view == 1: 200 | item_y = tf.tile(view1_positive_y, [1, 1]) 201 | elif active_view == 2: 202 | item_y = tf.tile(view2_positive_y, [1, 1]) 203 | else: 204 | item_y = tf.tile(view3_positive_y, [1, 1]) 205 | 206 | item_y_temp = tf.tile(item_y, [1, 1]) 207 | # batch内随机负采样。 208 | for i in range(NEG): 209 | rand = int((random.random() + i) * user_BS / NEG) 210 | item_y = tf.concat([item_y, 211 | tf.slice(item_y_temp, [rand, 0], [user_BS - rand, -1]), 212 | tf.slice(item_y_temp, [0, 0], [rand, -1])], 0) 213 | 214 | with tf.name_scope('Cosine_Similarity'): 215 | # Cosine similarity 216 | # query_norm = sqrt(sum(each x^2)) 217 | query_norm = tf.tile(tf.sqrt(tf.reduce_sum(tf.square(user_y), 1, True)), [NEG + 1, 1]) 218 | # doc_norm = sqrt(sum(each x^2)) 219 | doc_norm = tf.sqrt(tf.reduce_sum(tf.square(item_y), 1, True)) 220 | # query * doc 221 | prod = tf.reduce_sum(tf.multiply(tf.tile(user_y, [NEG + 1, 1]), item_y), 1, True) 222 | # ||query|| * ||doc|| 223 | norm_prod = tf.multiply(query_norm, doc_norm) 224 | # cos_sim_raw = query * doc / (||query|| * ||doc||) 225 | cos_sim_raw = tf.truediv(prod, norm_prod) 226 | # gamma = 20 227 | # shape = [user_BS, NEG + 1],第一列是正样本cos相似度。 228 | cos_sim = tf.transpose(tf.reshape(tf.transpose(cos_sim_raw), [NEG + 1, user_BS])) * 20 229 | 230 | with tf.name_scope('Loss'): 231 | # Train Loss 232 | # 转化为softmax概率矩阵。 233 | prob = tf.nn.softmax(cos_sim) 234 | # 只取第一列,即正样本列概率。 235 | hit_prob = tf.slice(prob, [0, 0], [-1, 1]) 236 | loss = -tf.reduce_sum(tf.log(hit_prob)) 237 | tf.summary.scalar('loss', loss) 238 | 239 | with tf.name_scope('Training'): 240 | # Optimizer 241 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(loss) 242 | 243 | # with tf.name_scope('Accuracy'): 244 | # correct_prediction = tf.equal(tf.argmax(prob, 1), 0) 245 | # accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 246 | # tf.summary.scalar('accuracy', accuracy) 247 | 248 | merged = tf.summary.merge_all() 249 | 250 | with tf.name_scope('Test'): 251 | average_loss = tf.placeholder(tf.float32) 252 | loss_summary = tf.summary.scalar('average_loss', average_loss) 253 | 254 | with tf.name_scope('Train'): 255 | train_average_loss = tf.placeholder(tf.float32) 256 | train_loss_summary = tf.summary.scalar('train_average_loss', train_average_loss) 257 | 258 | 259 | def convert_to_sparse_tensor(data_in): 260 | data_in = data_in.tocoo() 261 | data_in = tf.SparseTensorValue( 262 | np.transpose([np.array(data_in.row, dtype=np.int64), np.array(data_in.col, dtype=np.int64)]), 263 | np.array(data_in.data, dtype=np.float), 264 | np.array(data_in.shape, dtype=np.int64)) 265 | return data_in 266 | 267 | 268 | def pull_batch(user_data, item_positive, batch_id): 269 | batch_id = int(batch_id) 270 | user_in = user_data[batch_id * user_BS:(batch_id + 1) * user_BS, :] 271 | item_positive_in = item_positive[batch_id * user_BS:(batch_id + 1) * user_BS, :] 272 | user_in, item_positive_in = convert_to_sparse_tensor(user_in), convert_to_sparse_tensor(item_positive_in) 273 | return user_in, 
item_positive_in 274 | 275 | 276 | def feed_dict(on_verify, Train, batch_id): 277 | view1_batch_in = convert_to_sparse_tensor(sparse.csr_matrix(([], ([], [])), shape=(user_BS, view1_dimension))) 278 | view2_batch_in = convert_to_sparse_tensor(sparse.csr_matrix(([], ([], [])), shape=(user_BS, view2_dimension))) 279 | view3_batch_in = convert_to_sparse_tensor(sparse.csr_matrix(([], ([], [])), shape=(user_BS, view3_dimension))) 280 | active_view_in = 1 281 | if Train: 282 | if batch_id <= view1_size / user_BS: 283 | batch_id = batch_id if batch_id < view1_size / user_BS - 1 else batch_id - 1 284 | active_view_in = 1 285 | user_batch_in, view1_batch_in = pull_batch(data_sets.app_search, data_sets.app_his, batch_id) 286 | elif view1_size / user_BS < batch_id <= (view1_size + view2_size) / user_BS: 287 | batch_id -= view1_size / user_BS 288 | batch_id = batch_id if batch_id < view2_size / user_BS - 1 else batch_id - 1 289 | active_view_in = 2 290 | user_batch_in, view2_batch_in = pull_batch(data_sets.music_search, data_sets.music_his, batch_id) 291 | else: 292 | batch_id -= view1_size / user_BS + view2_size / user_BS 293 | batch_id = batch_id if batch_id < view3_size / user_BS - 1 else batch_id - 1 294 | active_view_in = 3 295 | user_batch_in, view3_batch_in = pull_batch(data_sets.novel_search, data_sets.novel_his, batch_id) 296 | 297 | 298 | else: 299 | if batch_id <= view1_size_test / user_BS: 300 | batch_id = batch_id if batch_id < view1_size_test / user_BS - 1 else batch_id - 1 301 | active_view_in = 1 302 | user_batch_in, view1_batch_in = pull_batch(data_sets.app_search_test, data_sets.app_his_test, batch_id) 303 | elif view1_size_test / user_BS < batch_id <= (view1_size_test + view2_size_test) / user_BS: 304 | batch_id -= view1_size_test / user_BS 305 | batch_id = batch_id if batch_id < view2_size_test / user_BS - 1 else batch_id - 1 306 | active_view_in = 2 307 | user_batch_in, view2_batch_in = pull_batch(data_sets.music_search_test, data_sets.music_his_test, batch_id) 308 | else: 309 | batch_id -= view1_size_test / user_BS + view2_size_test / user_BS 310 | batch_id = batch_id if batch_id < view3_size_test / user_BS - 1 else batch_id - 1 311 | active_view_in = 3 312 | user_batch_in, view3_batch_in = pull_batch(data_sets.novel_search_test, data_sets.novel_his_test, batch_id) 313 | 314 | return {user_batch: user_batch_in, 315 | view1_batch: view1_batch_in, 316 | view2_batch: view2_batch_in, 317 | view3_batch: view3_batch_in, 318 | active_view: active_view_in, 319 | on_train: on_verify} 320 | 321 | 322 | # config = tf.ConfigProto() # log_device_placement=True) 323 | # config.gpu_options.allow_growth = True 324 | # if not FLAGS.gpu: 325 | # config = tf.ConfigProto(device_count= {'GPU' : 0}) 326 | 327 | # 创建一个Saver对象,选择性保存变量或者模型。 328 | saver = tf.train.Saver() 329 | # with tf.Session(config=config) as sess: 330 | with tf.Session() as sess: 331 | sess.run(tf.global_variables_initializer()) 332 | train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train', sess.graph) 333 | # test_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/test', sess.graph) 334 | 335 | start = time.time() 336 | for step in range(FLAGS.max_steps): 337 | batch_id = int(random.random() * (total_size / user_BS - 1)) 338 | # print(batch_id) 339 | sess.run(train_step, feed_dict=feed_dict(True, True, batch_id)) 340 | 341 | if step % FLAGS.epoch_steps == 0: 342 | # train loss 343 | loss_v = sess.run(loss, feed_dict=feed_dict(False, True, batch_id)) 344 | 345 | loss_v /= user_BS 346 | train_loss = 
sess.run(train_loss_summary, feed_dict={train_average_loss: loss_v}) 347 | train_writer.add_summary(train_loss, step + 1) 348 | end = time.time() 349 | print("\nEpoch #%-5d | Train Loss: %-4.3f | PureTrainTime: %-3.3fs" % 350 | (step / FLAGS.epoch_steps, loss_v, end - start)) 351 | 352 | # test loss 353 | epoch_loss = 0 354 | for i in range(FLAGS.test_pack_size): 355 | loss_v = sess.run(loss, feed_dict=feed_dict(False, False, i)) 356 | epoch_loss += loss_v 357 | epoch_loss /= (FLAGS.test_pack_size * user_BS) 358 | test_loss = sess.run(loss_summary, feed_dict={average_loss: epoch_loss}) 359 | train_writer.add_summary(test_loss, step + 1) 360 | start = time.time() 361 | print("Epoch #%-5d | Test Loss: %-4.3f | Calc_LossTime: %-3.3fs" % 362 | (step / FLAGS.epoch_steps, epoch_loss, start - end)) 363 | 364 | # 保存模型 365 | save_path = saver.save(sess, "model/model_1.ckpt") 366 | print("Model saved in file: ", save_path) 367 | -------------------------------------------------------------------------------- /requirement.txt: -------------------------------------------------------------------------------- 1 | paddlepaddle==1.7.1 2 | paddlehub==1.8.2 3 | TensorFlow==1.12 4 | sklearn 5 | pyyaml 6 | keras==2.2.5 7 | argparse -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #encoding=utf-8 3 | ''' 4 | @Time : 2020/10/25 22:28:30 5 | @Author : zhiyang.zzy 6 | @Contact : zhiyangchou@gmail.com 7 | @Desc : 训练相似度模型 8 | 1. siamese network,分别使用 cosine、曼哈顿距离 9 | 2. triplet loss 10 | ''' 11 | 12 | # here put the import lib 13 | from model.bert_classifier import BertClassifier 14 | import os 15 | import time 16 | from numpy.lib.arraypad import pad 17 | import nni 18 | from tensorflow.python.ops.gen_io_ops import write_file 19 | import yaml 20 | import logging 21 | import argparse 22 | logging.basicConfig(level=logging.INFO) 23 | import data_input 24 | from config import Config 25 | from model.siamese_network import SiamenseRNN, SiamenseBert 26 | from data_input import Vocabulary, get_test 27 | from util import write_file 28 | 29 | def train_siamese(): 30 | # 读取配置 31 | # conf = Config() 32 | cfg_path = "./configs/config.yml" 33 | cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader) 34 | # 读取数据 35 | data_train, data_val, data_test = data_input.get_lcqmc() 36 | # data_train = data_train[:100] 37 | print("train size:{},val size:{}, test size:{}".format( 38 | len(data_train), len(data_val), len(data_test))) 39 | model = SiamenseRNN(cfg) 40 | model.fit(data_train, data_val, data_test) 41 | pass 42 | 43 | def predict_siamese(file_='./results/'): 44 | # 加载配置 45 | cfg_path = "./configs/config.yml" 46 | cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader) 47 | # 将 seq转为id, 48 | vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]') 49 | test_arr, query_arr = get_test(file_, vocab) 50 | # 加载模型 51 | model = SiamenseRNN(cfg) 52 | model.restore_session(cfg["checkpoint_dir"]) 53 | test_label, test_prob = model.predict(test_arr) 54 | out_arr = [x + [test_label[i]] + [test_prob[i]] for i, x in enumerate(query_arr)] 55 | write_file(out_arr, file_ + '.siamese.predict', ) 56 | pass 57 | 58 | def train_siamese_bert(): 59 | # 读取配置 60 | # conf = Config() 61 | cfg_path = "./configs/config_bert.yml" 62 | cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader) 63 | # 
--------------------------------------------------------------------------------
/requirement.txt:
--------------------------------------------------------------------------------
paddlepaddle==1.7.1
paddlehub==1.8.2
tensorflow==1.12
scikit-learn
pyyaml
keras==2.2.5
argparse
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
#encoding=utf-8
'''
@Time    : 2020/10/25 22:28:30
@Author  : zhiyang.zzy
@Contact : zhiyangchou@gmail.com
@Desc    : Train the text-similarity models.
           1. Siamese network, with either cosine or Manhattan distance.
           2. Triplet loss.
'''

# here put the import lib
import os
import time
import logging
import argparse

import nni
import yaml

logging.basicConfig(level=logging.INFO)
import data_input
from config import Config
from data_input import Vocabulary, get_test
from model.bert_classifier import BertClassifier
from model.siamese_network import SiamenseRNN, SiamenseBert
from util import write_file


def train_siamese():
    # Load the configuration.
    # conf = Config()
    cfg_path = "./configs/config.yml"
    cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
    # Load the data.
    data_train, data_val, data_test = data_input.get_lcqmc()
    # data_train = data_train[:100]
    print("train size:{}, val size:{}, test size:{}".format(
        len(data_train), len(data_val), len(data_test)))
    model = SiamenseRNN(cfg)
    model.fit(data_train, data_val, data_test)


def predict_siamese(file_='./results/'):
    # Load the configuration.
    cfg_path = "./configs/config.yml"
    cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
    # Vocabulary: convert token sequences into ids.
    vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
    test_arr, query_arr = get_test(file_, vocab)
    # Restore the trained model.
    model = SiamenseRNN(cfg)
    model.restore_session(cfg["checkpoint_dir"])
    test_label, test_prob = model.predict(test_arr)
    out_arr = [x + [test_label[i]] + [test_prob[i]] for i, x in enumerate(query_arr)]
    write_file(out_arr, file_ + '.siamese.predict')


def train_siamese_bert():
    # Load the configuration.
    # conf = Config()
    cfg_path = "./configs/config_bert.yml"
    cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
    # NNI auto-tuning: each trial fetches a new set of hyper-parameters from the search space.
    tuner_params = nni.get_next_parameter()
    cfg.update(tuner_params)
    # Vocabulary: convert token sequences into ids.
    vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
    # Load the data.
    data_train, data_val, data_test = data_input.get_lcqmc_bert(vocab)
    # data_train = data_train[:100]
    print("train size:{}, val size:{}, test size:{}".format(
        len(data_train), len(data_val), len(data_test)))
    model = SiamenseBert(cfg)
    model.fit(data_train, data_val, data_test)


def predict_siamese_bert(file_="./results/input/test"):
    # Load the configuration.
    # conf = Config()
    cfg_path = "./configs/config_bert.yml"
    cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
    os.environ["CUDA_VISIBLE_DEVICES"] = "4"
    # Vocabulary: convert token sequences into ids.
    vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
    # Load the data.
    test_arr, query_arr = data_input.get_test_bert(file_, vocab)
    print("test size:{}".format(len(test_arr)))
    model = SiamenseBert(cfg)
    model.restore_session(cfg["checkpoint_dir"])
    test_label, test_prob = model.predict(test_arr)
    out_arr = [x + [test_label[i]] + [test_prob[i]] for i, x in enumerate(query_arr)]
    write_file(out_arr, file_ + '.siamese.bert.predict')


def train_bert():
    # Load the configuration.
    # conf = Config()
    cfg_path = "./configs/bert_classify.yml"
    cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
    # NNI auto-tuning: each trial fetches a new set of hyper-parameters from the search space.
    tuner_params = nni.get_next_parameter()
    cfg.update(tuner_params)
    # Vocabulary: convert token sequences into ids.
    vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
    # Load the data; is_merge=1 feeds both sentences as one BERT input.
    data_train, data_val, data_test = data_input.get_lcqmc_bert(vocab, is_merge=1)
    # data_train = data_train[:100]
    print("train size:{}, val size:{}, test size:{}".format(
        len(data_train), len(data_val), len(data_test)))
    model = BertClassifier(cfg)
    model.fit(data_train, data_val, data_test)


def predict_bert(file_="./results/input/test"):
    # Load the configuration.
    # conf = Config()
    cfg_path = "./configs/bert_classify.yml"
    cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
    # Vocabulary: convert token sequences into ids.
    vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
    # Load the data.
    test_arr, query_arr = data_input.get_test_bert(file_, vocab, is_merge=1)
    print("test size:{}".format(len(test_arr)))
    model = BertClassifier(cfg)
    model.restore_session(cfg["checkpoint_dir"])
    test_label, test_prob = model.predict(test_arr)
    out_arr = [x + [test_label[i]] + [test_prob[i]] for i, x in enumerate(query_arr)]
    write_file(out_arr, file_ + '.bert.predict')


def siamese_bert_sentence_embedding(file_="./results/input/test.single"):
    # Each input line is a single query; the output is the embedding vector of that query.
    # Load the configuration.
    cfg_path = "./configs/config_bert.yml"
    cfg = yaml.load(open(cfg_path, encoding='utf-8'), Loader=yaml.FullLoader)
    cfg['batch_size'] = 64
    os.environ["CUDA_VISIBLE_DEVICES"] = "7"
    # Vocabulary: convert token sequences into ids.
    vocab = Vocabulary(meta_file='./data/vocab.txt', max_len=cfg['max_seq_len'], allow_unk=1, unk='[UNK]', pad='[PAD]')
    # Load the data.
    test_arr, query_arr = data_input.get_test_bert_single(file_, vocab)
    print("test size:{}".format(len(test_arr)))
    model = SiamenseBert(cfg)
    model.restore_session(cfg["checkpoint_dir"])
    test_label = model.predict_embedding(test_arr)
    test_label = [",".join([str(y) for y in x]) for x in test_label]
    out_arr = [[x, test_label[i]] for i, x in enumerate(query_arr)]
    print("write to file...")
    write_file(out_arr, file_ + '.siamese.bert.embedding')


if __name__ == "__main__":
    os.environ["CUDA_VISIBLE_DEVICES"] = "4"
    ap = argparse.ArgumentParser()
    ap.add_argument("--method", default="bert", type=str, help="rnn / bert / bert_siamese / bert_siamese_embedding")
    ap.add_argument("--mode", default="train", type=str, help="train / predict")
    ap.add_argument("--file", default="./results/input/test", type=str, help="prediction input file, one tab-separated query pair per line")
    args = ap.parse_args()
    if args.mode == 'train' and args.method == 'rnn':
        train_siamese()
    elif args.mode == 'predict' and args.method == 'rnn':
        predict_siamese(args.file)
    elif args.mode == 'train' and args.method == 'bert_siamese':
        train_siamese_bert()
    elif args.mode == 'predict' and args.method == 'bert_siamese':
        predict_siamese_bert(args.file)
    elif args.mode == 'train' and args.method == 'bert':
        train_bert()
    elif args.mode == 'predict' and args.method == 'bert':
        predict_bert(args.file)
    elif args.mode == 'predict' and args.method == 'bert_siamese_embedding':
        # Outputs sentence embeddings. If they are used for vector recall, train with a loss whose
        # distance metric matches the faiss index: if faiss uses L2, use an L2 loss; if faiss uses
        # cosine, use a cosine loss (or include a cosine-similarity term in the loss).
        siamese_bert_sentence_embedding(args.file)
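The `bert_siamese_embedding` branch writes one `query\tv1,v2,...` line per input query. If those vectors feed a recall index, the index metric should match the training loss, as the comment above notes. A minimal sketch of cosine recall with faiss (faiss is not in requirement.txt and must be installed separately; the input path assumes the default `--file` value):

```python
import numpy as np
import faiss  # assumed installed; normalizing + inner product == cosine similarity


def load_embeddings(path):
    """Parse the '<query>\t<v1,v2,...>' lines written by siamese_bert_sentence_embedding."""
    queries, vecs = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            query, vec = line.rstrip("\n").split("\t")
            queries.append(query)
            vecs.append([float(x) for x in vec.split(",")])
    return queries, np.asarray(vecs, dtype="float32")


queries, xb = load_embeddings("./results/input/test.single.siamese.bert.embedding")
faiss.normalize_L2(xb)                 # unit-normalize so inner product is cosine
index = faiss.IndexFlatIP(xb.shape[1])
index.add(xb)
scores, ids = index.search(xb[:1], 5)  # top-5 neighbours of the first query
print(queries[0], [(queries[i], float(s)) for i, s in zip(ids[0], scores[0])])
```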
--------------------------------------------------------------------------------
/util.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
#encoding=utf-8
'''
@Time    : 2020/10/13 20:33:50
@Author  : zhiyang.zzy
@Contact : zhiyangchou@gmail.com
@Desc    : Small I/O and text helpers shared by the training scripts.
'''

# here put the import lib
import os
import six
import time


def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):
            return text
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python 2 or Python 3?")


def read_file(file_: str, splitter: str = None):
    """Read a text file into a list of lines; optionally split each line on `splitter`."""
    out_arr = []
    with open(file_, encoding="utf-8") as f:
        out_arr = [x.strip("\n") for x in f.readlines()]
    if splitter:
        out_arr = [x.split(splitter) for x in out_arr]
    return out_arr


def write_file(out_arr: list, file_: str, splitter='\t'):
    """Write rows to a text file, joining each row's fields with `splitter`."""
    with open(file_, 'w', encoding='utf-8') as out:
        for line in out_arr:
            out.write(splitter.join([str(x) for x in line]) + '\n')
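
# Usage sketch for the two helpers above (the file name is arbitrary; note that
# read_file returns strings, so numeric labels need an explicit cast):
#   write_file([["今天天气怎么样", "今天温度怎么样", 1]], "pairs.tsv")
#   read_file("pairs.tsv", "\t")  # -> [["今天天气怎么样", "今天温度怎么样", "1"]]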

def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length.

    Removes one token at a time from whichever sequence is currently longer, so both
    sequences are shortened as evenly as possible."""
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


if __name__ == "__main__":
    pass
--------------------------------------------------------------------------------
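A quick sketch of how `_truncate_seq_pair` trims a pair of LCQMC-style queries to a shared token budget:

```python
from util import _truncate_seq_pair

# Both queries are 7 characters long; with a 10-token budget the function pops
# characters alternately from whichever sequence is currently longer.
tokens_a = list("今天天气怎么样")
tokens_b = list("今天温度怎么样")
_truncate_seq_pair(tokens_a, tokens_b, max_length=10)
print("".join(tokens_a), "".join(tokens_b))  # -> 今天天气怎 今天温度怎 (5 + 5 tokens)
```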