├── .DS_Store
├── .gitignore
├── ATEC2018.md
├── LICENSE
├── README.md
├── dict.txt
├── eddy-20180701-4k.tar.gz
├── eval.py
├── input_helpers.py
├── preliminary_contest
│   ├── atec_nlp_sim_train.csv
│   ├── dict.txt
│   ├── eval.py
│   ├── input_helpers.py
│   ├── models
│   │   ├── model-4000.data-00000-of-00001
│   │   ├── model-4000.index
│   │   └── model-4000.meta
│   ├── preprocess.py
│   ├── run.sh
│   └── vocab
│       └── vocab
├── preprocess.py
├── siamese_network_semantic.py
├── test.py
├── train.py
├── train_data
│   └── atec_nlp_sim_train.csv
└── validation.txt0

--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/.DS_Store

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Created by .ignore support plugin (hsz.mobi)
2 |
3 | .idea/
4 | *.pyc
5 | runs/

--------------------------------------------------------------------------------
/ATEC2018.md:
--------------------------------------------------------------------------------
1 | # 1 Task description
2 |
3 | Question similarity computation: given two sentences written by a user to customer service, use an algorithm to decide whether they express the same meaning.
4 |
5 | Examples:
6 |
7 | (a) "花呗如何还款" -- "花呗怎么还款": synonymous
8 | (b) "花呗如何还款" -- "我怎么还我的花被呢": synonymous
9 | (c) "花呗分期后逾期了如何还款" -- "花呗分期后逾期了哪里还款": not synonymous
10 | Example (a) can be judged synonymous with fairly simple methods. Example (b) involves a typo, synonyms and a change of word order; the two sentences do not look alike at first glance, so judging them correctly is challenging. In example (c) the two sentences are very similar, and the single small difference between "如何" (how) and "哪里" (where) makes their meanings differ.
11 |
12 | # 2 Data
13 |
14 | All contest data comes from real application scenarios of Ant Financial's Financial Brain. The contest has a preliminary round and a final round:
15 |
16 | Preliminary round
17 |
18 | We provide 100,000 labelled sentence pairs (released in batches) as downloadable training data, containing both synonymous and non-synonymous pairs. Each line of the dataset is one sample, in the format:
19 |
20 | line number\tsentence 1\tsentence 2\tlabel, for example: 1	花呗如何还款	花呗怎么还款	1
21 |
22 | The line number gives the row of the question pair within the training set;
23 | sentence 1 and sentence 2 are the two sentences of the question pair;
24 | the label marks the pair as synonymous (1) or not synonymous (0).
25 | The evaluation set contains 10,000 pairs. To keep the contest fair and to prevent leaderboard manipulation, it is not released; participants submit evaluation code and models, which are run to produce predictions and rankings. Its format is:
26 |
27 | line number\tsentence 1\tsentence 2
28 |
29 | ## Preliminary round
30 | The evaluation set sits at a fixed path on the evaluation system, and the official platform invokes the evaluation tool submitted by the participant.
31 |
32 | ## Final round
33 |
34 | The training set grows to a massive scale. Data in this stage is not downloadable; it is provided as tables on Ant Financial's 数巢 platform. As in the preliminary round, the dataset has four fields: line number, sentence 1, sentence 2 and label.
35 |
36 | The evaluation set is again 10,000 pairs, also provided as a table on the 数巢 platform, with three fields: line number, sentence 1 and sentence 2.
37 |
38 | # 3 Evaluation and metrics
39 |
40 | ## Preliminary round
41 | Participants train and tune their models locally, package the evaluation code and model, and submit the archive to the official evaluation system, which runs the predictions and updates the ranking. The evaluation system is a standard Linux environment with Python 2.7, Java 8, TensorFlow 1.5 and jieba 0.39 installed. After the submitted archive is unpacked, its top-level directory must contain a run.sh script that takes the evaluation file as input and writes the results as output (only 0 or 1), one "line number\tprediction" per line, invoked as:
42 |
43 | bash run.sh INPUT_PATH OUTPUT_PATH
44 |
45 | If the prediction file is empty or has the wrong number of lines, the score is 0.
46 |
47 |
48 |
49 | ## Final round
50 | Model training, tuning and prediction all happen on Ant Financial's machine-learning platform, so participants only need to provide a UDF that takes the two sentences of a question pair as input and outputs the similarity prediction (0 or 1). As before, an empty output terminates the evaluation with a score of 0.
51 |
52 |
53 |
54 | The task is scored with accuracy and F1-score by comparing predictions against the true labels. The relevant counts are defined as follows:
55 |
56 | True Positive (TP): the number of correct "synonymous" decisions;
57 |
58 | likewise, False Positive (FP): the number of incorrect "synonymous" decisions;
59 |
60 | True Negative (TN): the number of correct "not synonymous" decisions;
61 |
62 | False Negative (FN): the number of incorrect "not synonymous" decisions.
63 |
64 | From these we compute precision, recall, accuracy and F1-score:
65 |
66 | precision = TP / (TP + FP)
67 |
68 | recall = TP / (TP + FN)
69 |
70 | accuracy = (TP + TN) / (TP + FP + TN + FN)
71 |
72 | F1-score = 2 * precision * recall / (precision + recall)
73 |
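The four counts above fully determine the reported metrics. As a quick illustration (not part of the contest kit; the function and variable names below are ours), a small Python sketch that recomputes precision, recall, accuracy and F1 from 0/1 predictions and gold labels:

```python
# Illustrative only: recompute the contest metrics from 0/1 predictions and labels.
def contest_metrics(preds, labels):
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / float(len(labels)) if labels else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, accuracy, f1

# Two correct "synonymous" calls, one false positive, one true negative:
print(contest_metrics([1, 1, 1, 0], [1, 1, 0, 0]))  # precision 0.67, recall 1.0, accuracy 0.75, F1 0.8
```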
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2016 Dhwaj Raj
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
23 |

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Description
2 | Chinese sentence similarity computation based on a Siamese LSTM.
3 |
4 | # Environment
5 | * Ubuntu 16.04 (64-bit)
6 | * Anaconda 2-4.4.0 (Python 2.7)
7 |
8 | Package versions (older releases):
9 | * TensorFlow 1.5.1
10 | * numpy 1.14.3
11 | * gensim 3.4.0
12 | * (nltk 3.2.3)
13 | * jieba 0.39
14 | * a pre-trained Chinese word2vec model
15 |
16 | Reference link:
17 |
18 | # Usage
19 |
20 | ### Train the model
21 |     python train.py
22 |
23 | ### Evaluate the model
24 |     python eval.py
25 | ## Papers
26 | * [《Learning Text Similarity with Siamese Recurrent Networks》](http://www.aclweb.org/anthology/W16-16#page=162)
27 | * [《Siamese Recurrent Architectures for Learning Sentence Similarity》](http://www.mit.edu/~jonasm/info/MuellerThyagarajan_AAAI16.pdf)
28 |
29 | # Code reference
30 |
31 | * [dhwajraj/deep-siamese-text-similarity](https://github.com/dhwajraj/deep-siamese-text-similarity)
32 |
33 | Commit: a61f07f6bef76665f8ba2df12f34b25380016613
34 |
35 | # ATEC2018 task description
36 | Related link:
37 |

--------------------------------------------------------------------------------
/dict.txt:
--------------------------------------------------------------------------------
1 | 花呗
2 | 借呗
3 | 蚂蚁花呗
4 | 蚂蚁借呗
5 | 从新
6 | 支付宝
7 | 淘宝

--------------------------------------------------------------------------------
/eddy-20180701-4k.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/eddy-20180701-4k.tar.gz

--------------------------------------------------------------------------------
/eval.py:
--------------------------------------------------------------------------------
1 | #!
/usr/bin/env python 2 | # coding=utf-8 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | from input_helpers import InputHelper 9 | import sys 10 | 11 | # Parameters 12 | # ================================================== 13 | 14 | # Eval Parameters 15 | # 批大小 16 | BATCH_SIZE = 64 17 | # 验证集文件 18 | EVAL_FILEPATH = 'validation.txt0' 19 | # 词表(在训练过程中已生成) 20 | VOCAB_FILEPATH = 'runs/1528462228/checkpoints/vocab' 21 | # 模型文件 22 | MODEL = 'runs/1528462228/checkpoints/model-10000' 23 | 24 | # 语句最多长度(包含多少个词) 25 | MAX_DOCUMENT_LENGTH = 30 26 | 27 | # Misc Parameters 28 | ALLOW_SOFT_PLACEMENT = True 29 | LOG_DEVICE_PLACEMENT = False 30 | 31 | inpH = InputHelper() 32 | 33 | x1_test, x2_test, y_test = inpH.getTestDataSet(EVAL_FILEPATH, VOCAB_FILEPATH, MAX_DOCUMENT_LENGTH) 34 | 35 | # for index ,value in enumerate(x1_test): 36 | # print (index, x1_test[index], x2_test[index], y_test[index]) 37 | # sys.exit(0) 38 | 39 | print("\nEvaluating...\n") 40 | 41 | # Evaluation 42 | # ================================================== 43 | checkpoint_file = MODEL 44 | print checkpoint_file 45 | graph = tf.Graph() 46 | with graph.as_default(): 47 | session_conf = tf.ConfigProto( 48 | allow_soft_placement=ALLOW_SOFT_PLACEMENT, 49 | log_device_placement=LOG_DEVICE_PLACEMENT) 50 | sess = tf.Session(config=session_conf) 51 | with sess.as_default(): 52 | # Load the saved meta graph and restore variables 53 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 54 | sess.run(tf.initialize_all_variables()) 55 | saver.restore(sess, checkpoint_file) 56 | 57 | # Get the placeholders from the graph by name 58 | input_x1 = graph.get_operation_by_name("input_x1").outputs[0] 59 | input_x2 = graph.get_operation_by_name("input_x2").outputs[0] 60 | input_y = graph.get_operation_by_name("input_y").outputs[0] 61 | 62 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 63 | # Tensors we want to evaluate 64 | predictions = graph.get_operation_by_name("output/distance").outputs[0] 65 | 66 | accuracy = graph.get_operation_by_name("accuracy/accuracy").outputs[0] 67 | 68 | sim = graph.get_operation_by_name("accuracy/temp_sim").outputs[0] 69 | 70 | # emb = graph.get_operation_by_name("embedding/W").outputs[0] 71 | # embedded_chars = tf.nn.embedding_lookup(emb,input_x) 72 | # Generate batches for one epoch 73 | batches = inpH.batch_iter(list(zip(x1_test, x2_test, y_test)), 2 * BATCH_SIZE, 1, shuffle=False) 74 | # Collect the predictions here 75 | all_predictions = [] 76 | all_d = [] 77 | for db in batches: 78 | x1_dev_b, x2_dev_b, y_dev_b = zip(*db) 79 | batch_predictions, batch_acc, batch_sim = sess.run([predictions, accuracy, sim], 80 | {input_x1: x1_dev_b, input_x2: x2_dev_b, 81 | input_y: y_dev_b, dropout_keep_prob: 1.0}) 82 | all_predictions = np.concatenate([all_predictions, batch_predictions]) 83 | print(batch_predictions) 84 | all_d = np.concatenate([all_d, batch_sim]) 85 | print("DEV acc {}".format(batch_acc)) 86 | for ex in all_predictions: 87 | print ex 88 | correct_predictions = float(np.mean(all_d == y_test)) 89 | print("Accuracy: {:g}".format(correct_predictions)) 90 | -------------------------------------------------------------------------------- /input_helpers.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | import numpy as np 3 | import re 4 | import itertools 5 | from collections import Counter 6 | import numpy as np 7 | import time 8 | import gc 9 | from 
tensorflow.contrib import learn 10 | # from gensim.models.word2vec import Word2Vec 11 | import gensim 12 | import gzip 13 | from random import random 14 | from preprocess import MyVocabularyProcessor 15 | import sys 16 | import jieba 17 | 18 | reload(sys) 19 | sys.setdefaultencoding("utf-8") 20 | 21 | 22 | class InputHelper(object): 23 | pre_emb = dict() 24 | vocab_processor = None 25 | 26 | def loadW2V(self, emb_path, type="bin"): 27 | print("Loading W2V data...") 28 | num_keys = 0 29 | if type == "textgz": 30 | # this seems faster than gensim non-binary load 31 | for line in gzip.open(emb_path): 32 | l = line.strip().split() 33 | st = l[0].lower() 34 | self.pre_emb[st] = np.asarray(l[1:]) 35 | num_keys = len(self.pre_emb) 36 | if type == "text": 37 | # this seems faster than gensim non-binary load 38 | for line in open(emb_path): 39 | l = line.strip().split() 40 | st = l[0].lower() 41 | self.pre_emb[st] = np.asarray(l[1:]) 42 | num_keys = len(self.pre_emb) 43 | else: 44 | # self.pre_emb = Word2Vec.load_word2vec_format(emb_path,binary=True) 45 | self.pre_emb = gensim.models.KeyedVectors.load_word2vec_format(emb_path, binary=True) # eddy 46 | self.pre_emb.init_sims(replace=True) 47 | num_keys = len(self.pre_emb.vocab) 48 | print("loaded word2vec len ", num_keys) 49 | gc.collect() 50 | 51 | def deletePreEmb(self): 52 | self.pre_emb = dict() 53 | gc.collect() 54 | 55 | def getTsvData(self, filepath): 56 | print("Loading training data from " + filepath) 57 | x1 = [] 58 | x2 = [] 59 | y = [] 60 | num_p = 0 61 | num_n = 0 62 | # positive samples from file 63 | for line in open(filepath): 64 | # print(line) 65 | l = line.strip().split("\t") 66 | 67 | # print(l[0]) 68 | # print(l[1]) 69 | # print(l[2]) 70 | if len(l) >= 4: 71 | x1.append(l[1]) 72 | x2.append(l[2]) 73 | y.append(int(l[3])) 74 | 75 | flag = int(l[3]) 76 | if flag > 0: 77 | num_p += 1 78 | else: 79 | num_n += 1 80 | 81 | tmp_x1 = [] 82 | tmp_x2 = [] 83 | tmp_y = [] 84 | 85 | # # 欠采样处理 86 | # for idx, item in enumerate(y): 87 | # if item[1] == 1: 88 | # tmp_x1.append(x1[idx]) 89 | # tmp_x2.append(x2[idx]) 90 | # tmp_y.append(y[idx]) 91 | # elif num_p >= 0: 92 | # tmp_x1.append(x1[idx]) 93 | # tmp_x2.append(x2[idx]) 94 | # tmp_y.append(y[idx]) 95 | # num_p -= 1 96 | # x1 = tmp_x1 97 | # x2 = tmp_x2 98 | # y = tmp_y 99 | 100 | # 过采样处理 101 | add_p_num = num_n - num_p 102 | while add_p_num > 0: 103 | for idx, item in enumerate(y): 104 | if item == 1: 105 | tmp_x1.append(x1[idx]) 106 | tmp_x2.append(x2[idx]) 107 | tmp_y.append(y[idx]) 108 | add_p_num -= 1 109 | if add_p_num <= 0: 110 | break 111 | 112 | print('len(x1)={}, len(x2)={}, len(y)={}'.format(len(x1), len(x2), len(y))) 113 | 114 | x1 += tmp_x1 115 | x2 += tmp_x2 116 | y += tmp_y 117 | 118 | print('len(x1)={}, len(x2)={}, len(y)={}'.format(len(x1), len(x2), len(y))) 119 | 120 | # num_p=0 121 | # for item in y: 122 | # if item[1]==1: 123 | # num_p+=1 124 | # 125 | # print('num_p= {}'.format(num_p)) 126 | # exit(0) 127 | 128 | # print ('num_p= {}'.format(num_p)) 129 | # print('num_n= {}'.format(num_n)) 130 | # exit(0) 131 | 132 | return np.asarray(x1), np.asarray(x2), np.asarray(y) 133 | 134 | def getTsvTestData(self, filepath): 135 | print("Loading testing/labelled data from " + filepath) 136 | x1 = [] 137 | x2 = [] 138 | y = [] 139 | # positive samples from file 140 | for line in open(filepath): 141 | l = line.strip().split("\t") 142 | if len(l) < 3: 143 | continue 144 | x1.append(l[1]) 145 | x2.append(l[2]) 146 | y.append(int(l[0])) 147 | return np.asarray(x1), np.asarray(x2), 
np.asarray(y) 148 | 149 | def batch_iter(self, data, batch_size, num_epochs, shuffle=True): 150 | """ 151 | Generates a batch iterator for a dataset. 152 | """ 153 | data = np.asarray(data) 154 | # print(data) 155 | # print(data.shape) 156 | data_size = len(data) 157 | num_batches_per_epoch = int(len(data) / batch_size) + 1 158 | for epoch in range(num_epochs): 159 | # Shuffle the data at each epoch 160 | if shuffle: 161 | shuffle_indices = np.random.permutation(np.arange(data_size)) 162 | shuffled_data = data[shuffle_indices] 163 | else: 164 | shuffled_data = data 165 | for batch_num in range(num_batches_per_epoch): 166 | start_index = batch_num * batch_size 167 | end_index = min((batch_num + 1) * batch_size, data_size) 168 | yield shuffled_data[start_index:end_index] 169 | 170 | def dumpValidation(self, x1_text, x2_text, y, shuffled_index, dev_idx, i): 171 | print("dumping validation " + str(i)) 172 | x1_shuffled = x1_text[shuffled_index] 173 | x2_shuffled = x2_text[shuffled_index] 174 | y_shuffled = y[shuffled_index] 175 | x1_dev = x1_shuffled[dev_idx:] 176 | x2_dev = x2_shuffled[dev_idx:] 177 | y_dev = y_shuffled[dev_idx:] 178 | del x1_shuffled 179 | del y_shuffled 180 | with open('validation.txt' + str(i), 'w') as f: 181 | for text1, text2, label in zip(x1_dev, x2_dev, y_dev): 182 | f.write(str(label) + "\t" + text1 + "\t" + text2 + "\n") 183 | f.close() 184 | del x1_dev 185 | del y_dev 186 | 187 | # Data Preparatopn 188 | # ================================================== 189 | 190 | def getDataSets(self, training_paths, max_document_length, percent_dev, batch_size): 191 | x1_text, x2_text, y = self.getTsvData(training_paths) 192 | # print('x1_text= {}'.format(x1_text)) 193 | # print('x2_text= {}'.format(x2_text)) 194 | # print ('y= {}'.format(y)) 195 | 196 | # Build vocabulary 197 | print("Building vocabulary") 198 | vocab_processor = MyVocabularyProcessor(max_document_length, min_frequency=0) 199 | vocab_processor.fit_transform(np.concatenate((x2_text, x1_text), axis=0)) 200 | print("Length of loaded vocabulary ={}".format(len(vocab_processor.vocabulary_))) 201 | 202 | sum_no_of_batches = 0 203 | x1 = np.asarray(list(vocab_processor.transform(x1_text))) 204 | x2 = np.asarray(list(vocab_processor.transform(x2_text))) 205 | # Randomly shuffle data 206 | np.random.seed(131) 207 | shuffle_indices = np.random.permutation(np.arange(len(y))) 208 | x1_shuffled = x1[shuffle_indices] 209 | x2_shuffled = x2[shuffle_indices] 210 | y_shuffled = y[shuffle_indices] 211 | dev_idx = -1 * len(y_shuffled) * percent_dev // 100 212 | print('dev_idx= {}'.format(dev_idx)) 213 | 214 | del x1 215 | del x2 216 | # Split train/test set 217 | self.dumpValidation(x1_text, x2_text, y, shuffle_indices, dev_idx, 0) 218 | # TODO: This is very crude, should use cross-validation 219 | x1_train, x1_dev = x1_shuffled[:dev_idx], x1_shuffled[dev_idx:] 220 | x2_train, x2_dev = x2_shuffled[:dev_idx], x2_shuffled[dev_idx:] 221 | y_train, y_dev = y_shuffled[:dev_idx], y_shuffled[dev_idx:] 222 | print("Train/Dev split for {}: {:d}/{:d}".format(training_paths, len(y_train), len(y_dev))) 223 | sum_no_of_batches = sum_no_of_batches + (len(y_train) // batch_size) 224 | train_set = (x1_train, x2_train, y_train) 225 | dev_set = (x1_dev, x2_dev, y_dev) 226 | gc.collect() 227 | return train_set, dev_set, vocab_processor, sum_no_of_batches 228 | 229 | def getTestDataSet(self, data_path, vocab_path, max_document_length): 230 | x1_temp, x2_temp, y = self.getTsvTestData(data_path) 231 | 232 | # Build vocabulary 233 | vocab_processor = 
MyVocabularyProcessor(max_document_length, min_frequency=0) 234 | vocab_processor = vocab_processor.restore(vocab_path) 235 | print len(vocab_processor.vocabulary_) 236 | 237 | x1 = np.asarray(list(vocab_processor.transform(x1_temp))) 238 | x2 = np.asarray(list(vocab_processor.transform(x2_temp))) 239 | # Randomly shuffle data 240 | del vocab_processor 241 | gc.collect() 242 | return x1, x2, y 243 | -------------------------------------------------------------------------------- /preliminary_contest/dict.txt: -------------------------------------------------------------------------------- 1 | 花呗 2 | 借呗 3 | 蚂蚁花呗 4 | 蚂蚁借呗 -------------------------------------------------------------------------------- /preliminary_contest/eval.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # coding=utf-8 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | from tensorflow.contrib import learn 9 | from input_helpers import InputHelper 10 | import sys 11 | 12 | # Parameters 13 | # ================================================== 14 | EVAL_FILE = sys.argv[1] # 待评估文件 15 | OUTPUT_FILE = sys.argv[2] # 评估后输出文件 16 | 17 | print (EVAL_FILE) 18 | print (OUTPUT_FILE) 19 | 20 | # Eval Parameters 21 | BATCH_SIZE = 64 # 批大小 22 | VOCAB_FILE = './vocab/vocab' # 训练使使用的词表 23 | MODEL = './models/model-4000' # 加载训练模型 24 | ALLOW_SOFT_PLACEMENT = True 25 | LOG_DEVICE_PLACEMENT = False 26 | 27 | # 语句最多长度(包含多少个词) 28 | MAX_DOCUMENT_LENGTH = 40 29 | 30 | # load data and map id-transform based on training time vocabulary 31 | inpH = InputHelper() 32 | x1_test, x2_test = inpH.getTestDataSet(EVAL_FILE, VOCAB_FILE, MAX_DOCUMENT_LENGTH) 33 | 34 | # for index, _ in enumerate(x1_test): 35 | # print(index, x1_test[index], x2_test[index]) 36 | 37 | print("\nEvaluating...\n") 38 | 39 | # Evaluation 40 | # ================================================== 41 | checkpoint_file = MODEL 42 | print checkpoint_file 43 | graph = tf.Graph() 44 | with graph.as_default(): 45 | session_conf = tf.ConfigProto( 46 | allow_soft_placement=ALLOW_SOFT_PLACEMENT, 47 | log_device_placement=LOG_DEVICE_PLACEMENT) 48 | sess = tf.Session(config=session_conf) 49 | with sess.as_default(): 50 | # Load the saved meta graph and restore variables 51 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 52 | sess.run(tf.initialize_all_variables()) 53 | saver.restore(sess, checkpoint_file) 54 | 55 | # Get the placeholders from the graph by name 56 | input_x1 = graph.get_operation_by_name("input_x1").outputs[0] 57 | input_x2 = graph.get_operation_by_name("input_x2").outputs[0] 58 | # input_y = graph.get_operation_by_name("input_y").outputs[0] 59 | 60 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 61 | # Tensors we want to evaluate 62 | predictions = graph.get_operation_by_name("output/distance").outputs[0] 63 | 64 | # accuracy = graph.get_operation_by_name("accuracy/accuracy").outputs[0] 65 | 66 | sim = graph.get_operation_by_name("accuracy/temp_sim").outputs[0] 67 | 68 | # emb = graph.get_operation_by_name("embedding/W").outputs[0] 69 | # embedded_chars = tf.nn.embedding_lookup(emb,input_x) 70 | # Generate batches for one epoch 71 | batches = inpH.batch_iter(list(zip(x1_test, x2_test)), 2 * BATCH_SIZE, 1, shuffle=False) 72 | # Collect the predictions here 73 | all_predictions = [] 74 | all_d = [] 75 | 76 | for db in batches: 77 | # print('db') 78 | # print(db) 79 | # 80 | x1_dev_b, x2_dev_b = zip(*db) 
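            # sess.run() below feeds one batch through the restored graph: `predictions` comes from
            # the output/distance tensor (normalized distance between the two sentence encodings) and
            # `sim` from accuracy/temp_sim, the thresholded 0/1 similarity decision.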
81 | batch_predictions, batch_sim = sess.run([predictions, sim], 82 | {input_x1: x1_dev_b, input_x2: x2_dev_b, dropout_keep_prob: 1.0}) 83 | all_predictions = np.concatenate([all_predictions, batch_predictions]) 84 | # print(batch_predictions) 85 | print(batch_sim) 86 | print(type(batch_sim)) 87 | print(len(batch_sim)) 88 | all_d = np.concatenate([all_d, batch_sim]) 89 | # print("DEV acc {}".format(batch_acc)) 90 | for ex in all_predictions: 91 | print ex 92 | 93 | f_output = open(OUTPUT_FILE, 'a') 94 | index = 1 95 | predic_value = 0 96 | for item in all_d: 97 | # 专门写反 98 | if item > 0: 99 | predic_value = 1 100 | else: 101 | predic_value = 0 102 | f_output.write('{}\t{}\n'.format(index, predic_value)) 103 | index += 1 104 | 105 | # correct_predictions = float(np.mean(all_d == y_test)) 106 | # print("Accuracy: {:g}".format(correct_predictions)) 107 | 108 | print ('eval finished!') 109 | -------------------------------------------------------------------------------- /preliminary_contest/input_helpers.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | import numpy as np 3 | import re 4 | import itertools 5 | from collections import Counter 6 | import numpy as np 7 | import time 8 | import gc 9 | from tensorflow.contrib import learn 10 | # from gensim.models.word2vec import Word2Vec 11 | import gensim 12 | import gzip 13 | from random import random 14 | from preprocess import MyVocabularyProcessor 15 | import sys 16 | import jieba 17 | 18 | reload(sys) 19 | sys.setdefaultencoding("utf-8") 20 | 21 | 22 | class InputHelper(object): 23 | pre_emb = dict() 24 | vocab_processor = None 25 | 26 | def getTsvTestData(self, filepath): 27 | print("Loading testing/labelled data from " + filepath) 28 | x1 = [] 29 | x2 = [] 30 | for line in open(filepath): 31 | l = line.strip().split("\t") 32 | x1.append(l[1]) 33 | x2.append(l[2]) 34 | return np.asarray(x1), np.asarray(x2) 35 | 36 | def batch_iter(self, data, batch_size, num_epochs, shuffle=True): 37 | """ 38 | Generates a batch iterator for a dataset. 
39 | """ 40 | data = np.asarray(data) 41 | print(data) 42 | print(data.shape) 43 | data_size = len(data) 44 | num_batches_per_epoch = int(len(data) / batch_size) + 1 45 | for epoch in range(num_epochs): 46 | # Shuffle the data at each epoch 47 | if shuffle: 48 | shuffle_indices = np.random.permutation(np.arange(data_size)) 49 | shuffled_data = data[shuffle_indices] 50 | else: 51 | shuffled_data = data 52 | for batch_num in range(num_batches_per_epoch): 53 | start_index = batch_num * batch_size 54 | end_index = min((batch_num + 1) * batch_size, data_size) 55 | yield shuffled_data[start_index:end_index] 56 | 57 | # Data Preparatopn 58 | # ================================================== 59 | 60 | def getTestDataSet(self, data_path, vocab_path, max_document_length): 61 | x1_temp, x2_temp = self.getTsvTestData(data_path) 62 | 63 | # Build vocabulary 64 | vocab_processor = MyVocabularyProcessor(max_document_length, min_frequency=0) 65 | vocab_processor = vocab_processor.restore(vocab_path) 66 | print ('len(vocab_processor.vocabulary_)', len(vocab_processor.vocabulary_)) 67 | # sys.exit(0) 68 | 69 | x1 = np.asarray(list(vocab_processor.transform(x1_temp))) 70 | x2 = np.asarray(list(vocab_processor.transform(x2_temp))) 71 | # Randomly shuffle data 72 | del vocab_processor 73 | gc.collect() 74 | return x1, x2 75 | -------------------------------------------------------------------------------- /preliminary_contest/models/model-4000.data-00000-of-00001: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/models/model-4000.data-00000-of-00001 -------------------------------------------------------------------------------- /preliminary_contest/models/model-4000.index: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/models/model-4000.index -------------------------------------------------------------------------------- /preliminary_contest/models/model-4000.meta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/models/model-4000.meta -------------------------------------------------------------------------------- /preliminary_contest/preprocess.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | from __future__ import absolute_import 3 | from __future__ import division 4 | from __future__ import print_function 5 | 6 | import re 7 | import numpy as np 8 | import six 9 | from tensorflow.contrib import learn 10 | from tensorflow.python.platform import gfile 11 | from tensorflow.contrib import learn # pylint: disable=g-bad-import-order 12 | import jieba 13 | 14 | 15 | def tokenizer_word(iterator): 16 | jieba.load_userdict('./dict.txt') 17 | for sentence in iterator: 18 | yield list(jieba.lcut(sentence)) 19 | 20 | 21 | class MyVocabularyProcessor(learn.preprocessing.VocabularyProcessor): 22 | def __init__(self, 23 | max_document_length, 24 | min_frequency=0, 25 | vocabulary=None): 26 | 27 | tokenizer_fn = tokenizer_word 28 | self.sup = super(MyVocabularyProcessor, self) 29 | self.sup.__init__(max_document_length, min_frequency, vocabulary, tokenizer_fn) 30 | 31 | 
def transform(self, raw_documents): 32 | """Transform documents to word-id matrix. 33 | Convert words to ids with vocabulary fitted with fit or the one 34 | provided in the constructor. 35 | Args: 36 | raw_documents: An iterable which yield either str or unicode. 37 | Yields: 38 | x: iterable, [n_samples, max_document_length]. Word-id matrix. 39 | """ 40 | # print('len(raw_documents)= {}'.format(len(raw_documents))) 41 | # print('raw_documents= {}'.format(raw_documents)) 42 | 43 | # for index,value in enumerate(raw_documents): 44 | # print(index, value) 45 | 46 | for tokens in self._tokenizer(raw_documents): 47 | word_ids = np.zeros(self.max_document_length, np.int64) 48 | for idx, token in enumerate(tokens): 49 | if idx >= self.max_document_length: 50 | break 51 | word_ids[idx] = self.vocabulary_.get(token) 52 | yield word_ids 53 | -------------------------------------------------------------------------------- /preliminary_contest/run.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | python eval.py $1 $2 -------------------------------------------------------------------------------- /preliminary_contest/vocab/vocab: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/vocab/vocab -------------------------------------------------------------------------------- /preprocess.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | from __future__ import absolute_import 3 | from __future__ import division 4 | from __future__ import print_function 5 | 6 | import re 7 | import numpy as np 8 | import six 9 | from tensorflow.contrib import learn 10 | from tensorflow.python.platform import gfile 11 | from tensorflow.contrib import learn # pylint: disable=g-bad-import-order 12 | import jieba 13 | 14 | 15 | def tokenizer_word(iterator): 16 | jieba.load_userdict('./dict.txt') 17 | for sentence in iterator: 18 | sentence = sentence.decode("utf8") 19 | sentence = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。:??、~@#¥%……&*()]+".decode("utf8"), "".decode("utf8"), 20 | sentence) 21 | yield list(jieba.lcut(sentence)) 22 | 23 | 24 | class MyVocabularyProcessor(learn.preprocessing.VocabularyProcessor): 25 | def __init__(self, 26 | max_document_length, 27 | min_frequency=0, 28 | vocabulary=None): 29 | 30 | tokenizer_fn = tokenizer_word 31 | self.sup = super(MyVocabularyProcessor, self) 32 | self.sup.__init__(max_document_length, min_frequency, vocabulary, tokenizer_fn) 33 | 34 | def transform(self, raw_documents): 35 | """Transform documents to word-id matrix. 36 | Convert words to ids with vocabulary fitted with fit or the one 37 | provided in the constructor. 38 | Args: 39 | raw_documents: An iterable which yield either str or unicode. 40 | Yields: 41 | x: iterable, [n_samples, max_document_length]. Word-id matrix. 
42 | """ 43 | # print('len(raw_documents)= {}'.format(len(raw_documents))) 44 | # print('raw_documents= {}'.format(raw_documents)) 45 | 46 | # for index,value in enumerate(raw_documents): 47 | # print(index, value) 48 | 49 | for tokens in self._tokenizer(raw_documents): 50 | word_ids = np.zeros(self.max_document_length, np.int64) 51 | for idx, token in enumerate(tokens): 52 | if idx >= self.max_document_length: 53 | break 54 | word_ids[idx] = self.vocabulary_.get(token) 55 | yield word_ids 56 | -------------------------------------------------------------------------------- /siamese_network_semantic.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | import tensorflow as tf 3 | import tensorflow.contrib.slim as slim 4 | import numpy as np 5 | 6 | 7 | class SiameseLSTMw2v(object): 8 | """ 9 | A LSTM based deep Siamese network for text similarity. 10 | Uses an word embedding layer (looks up in pre-trained w2v), followed by a biLSTM and Energy Loss layer. 11 | """ 12 | 13 | def stackedRNN(self, x, dropout, scope, embedding_size, sequence_length, hidden_units): 14 | n_hidden = hidden_units 15 | n_layers = 3 16 | # n_layers = 6 17 | # Prepare data shape to match `static_rnn` function requirements 18 | x = tf.unstack(tf.transpose(x, perm=[1, 0, 2])) 19 | # print(x) 20 | # Define lstm cells with tensorflow 21 | # Forward direction cell 22 | 23 | with tf.name_scope("fw" + scope), tf.variable_scope("fw" + scope): 24 | stacked_rnn_fw = [] 25 | for _ in range(n_layers): 26 | fw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0, state_is_tuple=True) 27 | lstm_fw_cell = tf.contrib.rnn.DropoutWrapper(fw_cell, output_keep_prob=dropout) 28 | stacked_rnn_fw.append(lstm_fw_cell) 29 | lstm_fw_cell_m = tf.nn.rnn_cell.MultiRNNCell(cells=stacked_rnn_fw, state_is_tuple=True) 30 | 31 | outputs, _ = tf.nn.static_rnn(lstm_fw_cell_m, x, dtype=tf.float32) 32 | return outputs[-1] 33 | 34 | def contrastive_loss(self, y, d, batch_size): 35 | tmp = y * tf.square(d) 36 | # tmp= tf.mul(y,tf.square(d)) 37 | tmp2 = (1 - y) * tf.square(tf.maximum((1 - d), 0)) 38 | reg = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(1e-4), tf.trainable_variables()) 39 | return tf.reduce_sum(tmp + tmp2) / batch_size / 2+reg 40 | 41 | def __init__( 42 | self, sequence_length, vocab_size, embedding_size, hidden_units, l2_reg_lambda, batch_size, 43 | trainableEmbeddings): 44 | # Placeholders for input, output and dropout 45 | self.input_x1 = tf.placeholder(tf.int32, [None, sequence_length], name="input_x1") 46 | self.input_x2 = tf.placeholder(tf.int32, [None, sequence_length], name="input_x2") 47 | self.input_y = tf.placeholder(tf.float32, [None], name="input_y") 48 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 49 | 50 | # Keeping track of l2 regularization loss (optional) 51 | l2_loss = tf.constant(0.0, name="l2_loss") 52 | 53 | # Embedding layer 54 | with tf.name_scope("embedding"): 55 | self.W = tf.Variable( 56 | tf.constant(0.0, shape=[vocab_size, embedding_size]), 57 | trainable=trainableEmbeddings, name="W") 58 | self.embedded_words1 = tf.nn.embedding_lookup(self.W, self.input_x1) 59 | self.embedded_words2 = tf.nn.embedding_lookup(self.W, self.input_x2) 60 | # print self.embedded_words1 61 | # Create a convolution + maxpool layer for each filter size 62 | with tf.name_scope("output"): 63 | self.out1 = self.stackedRNN(self.embedded_words1, self.dropout_keep_prob, "side1", embedding_size, 64 | sequence_length, 
hidden_units) 65 | self.out2 = self.stackedRNN(self.embedded_words2, self.dropout_keep_prob, "side2", embedding_size, 66 | sequence_length, hidden_units) 67 | self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.out1, self.out2)), 1, keep_dims=True)) 68 | self.distance = tf.div(self.distance, 69 | tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.out1), 1, keep_dims=True)), 70 | tf.sqrt(tf.reduce_sum(tf.square(self.out2), 1, keep_dims=True)))) 71 | self.distance = tf.reshape(self.distance, [-1], name="distance") 72 | with tf.name_scope("loss"): 73 | self.loss = self.contrastive_loss(self.input_y, self.distance, batch_size) 74 | #### Accuracy computation is outside of this class. 75 | with tf.name_scope("accuracy"): 76 | self.temp_sim = tf.subtract(tf.ones_like(self.distance), tf.rint(self.distance), 77 | name="temp_sim") # auto threshold 0.5 78 | correct_predictions = tf.equal(self.temp_sim, self.input_y) 79 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy") 80 | 81 | with tf.name_scope('f1'): 82 | ones_like_actuals = tf.ones_like(self.input_y) 83 | zeros_like_actuals = tf.zeros_like(self.input_y) 84 | ones_like_predictions = tf.ones_like(self.temp_sim) 85 | zeros_like_predictions = tf.zeros_like(self.temp_sim) 86 | 87 | tp = tf.reduce_sum( 88 | tf.cast( 89 | tf.logical_and( 90 | tf.equal(self.input_y, ones_like_actuals), 91 | tf.equal(self.temp_sim, ones_like_predictions) 92 | ), 93 | 'float' 94 | ) 95 | ) 96 | 97 | tn = tf.reduce_sum( 98 | tf.cast( 99 | tf.logical_and( 100 | tf.equal(self.input_y, zeros_like_actuals), 101 | tf.equal(self.temp_sim, zeros_like_predictions) 102 | ), 103 | 'float' 104 | ) 105 | ) 106 | 107 | fp = tf.reduce_sum( 108 | tf.cast( 109 | tf.logical_and( 110 | tf.equal(self.input_y, zeros_like_actuals), 111 | tf.equal(self.temp_sim, ones_like_predictions) 112 | ), 113 | 'float' 114 | ) 115 | ) 116 | 117 | fn = tf.reduce_sum( 118 | tf.cast( 119 | tf.logical_and( 120 | tf.equal(self.input_y, ones_like_actuals), 121 | tf.equal(self.temp_sim, zeros_like_predictions) 122 | ), 123 | 'float' 124 | ) 125 | ) 126 | 127 | precision = tp / (tp + fp) 128 | recall = tp / (tp + fn) 129 | 130 | self.f1 = 2 * precision * recall / (precision + recall) 131 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # coding=utf-8 3 | 4 | import re 5 | 6 | line = "想做/ 兼_职/学生_/ 的 、加,我Q: 1 5. 8 0. !!?? 8 6 。0. 2。 3 有,惊,喜,哦" 7 | line = line.decode("utf8") 8 | string = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。:??、~@#¥%……&*()]+".decode("utf8"), "".decode("utf8"), line) 9 | print(string) 10 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | # coding=utf-8 3 | import tensorflow as tf 4 | import numpy as np 5 | import re 6 | import os 7 | import time 8 | import datetime 9 | import gc 10 | from input_helpers import InputHelper 11 | from siamese_network_semantic import SiameseLSTMw2v 12 | import gzip 13 | from random import random 14 | import sys 15 | 16 | # a=['你好','上帝','下地'] 17 | # b=[u'你好',u'上帝',u'下地'] 18 | # print(a) 19 | # print(b) 20 | # sys.exit(0) 21 | 22 | # Parameters 23 | # word2vec模型(采用已训练好的中文模型) 24 | WORD2VEC_MODEL = '../word2vecmodel/news_12g_baidubaike_20g_novel_90g_embedding_64.bin' 25 | #  模型格式为bin 26 | WORD2VEC_FORMAT = 'bin' 27 | # word2vec词嵌入维数(64/128可选) 28 | EMBEDDING_DIM = 64 29 | # dropout比例设置 30 | # DROPOUT_KEEP_PROB = '0.3'#训练集的拟合能力不够 31 | DROPOUT_KEEP_PROB = '0.8' 32 | # DROPOUT_KEEP_PROB = '0.6' 33 | # DROPOUT_KEEP_PROB = '0.7' 34 | # DROPOUT_KEEP_PROB = '0.8' 35 | # DROPOUT_KEEP_PROB = '1.0'(7th-June) 36 | # DROPOUT_KEEP_PROB = '0.8' 37 | # DROPOUT_KEEP_PROB = '0.4'#训练集的拟合能力不够 38 | # L2正规化系数(目前暂未生效) 39 | L2_REG_LAMBDA = 0.0 40 | # 原始训练文件 41 | TRAINING_FILES_RAW = './train_data/atec_nlp_sim_train.csv' 42 | # 隐藏层单元数 43 | # HIDDEN_UNITS = 64(7th-June) 44 | HIDDEN_UNITS = 128 45 | 46 | # Training parameters 47 | # 批大小 48 | # BATCH_SIZE = 64 49 | # BATCH_SIZE = 1024(7th-June) 50 | BATCH_SIZE = 1024 # 92229=102477-10248 51 | # epoch数目 52 | # NUM_EPOCHS = 300 53 | # NUM_EPOCHS = 3000 54 | NUM_EPOCHS = 100000 55 | # 模型评估周期(每隔多少步) 56 | # EVALUATE_EVERY = 10(7th-June) 57 | EVALUATE_EVERY = 100 58 | # EVALUATE_EVERY = 10 59 | # 模型保存周期(每隔多少步) 60 | # CHECKOUTPOINT_EVERY = 1000 61 | # CHECKOUTPOINT_EVERY = 10000 62 | # CHECKOUTPOINT_EVERY = 1000(7th-Jnue) 63 | CHECKOUTPOINT_EVERY = 1000 64 | # 语句最多长度(包含多少个词) 65 | # MAX_DOCUMENT_LENGTH = 12 66 | # MAX_DOCUMENT_LENGTH = 8 67 | # MAX_DOCUMENT_LENGTH = 20(7th-June) 68 | MAX_DOCUMENT_LENGTH = 40 69 | # 验证集比例 70 | DEV_PERCENT = 10 71 | 72 | # Misc Parameters 73 | ALLOW_SOFT_PLACEMENT = True 74 | LOG_DEVICE_PLACEMENT = False 75 | 76 | print ('训练开始......................') 77 | start_time = datetime.datetime.now() 78 | 79 | inpH = InputHelper() 80 | # 将原始的训练文件转化为分词后的训练文件 81 | # inpH.train_file_preprocess(TRAINING_FILES_RAW, TRAINING_FILES_FORMAT) 82 | # sys.exit(0) 83 | 84 | 85 | train_set, dev_set, vocab_processor, sum_no_of_batches = inpH.getDataSets(TRAINING_FILES_RAW, MAX_DOCUMENT_LENGTH, 86 | DEV_PERCENT, 87 | BATCH_SIZE) 88 | 89 | # dev_batches = inpH.batch_iter(list(zip(dev_set[0], dev_set[1], dev_set[2])), BATCH_SIZE, 1) 90 | # for index,dev_batch in enumerate(dev_batches): 91 | # print(index, dev_batch) 92 | # sys.exit(0) 93 | 94 | # for index, value in enumerate(dev_set[2]): 95 | # print(index, dev_set[0][index], dev_set[1][index], dev_set[2][index]) 96 | # sys.exit(0) 97 | 98 | # for index, w in enumerate(vocab_processor.vocabulary_._mapping): 99 | # print('vocab-{}:{}'.format(index, w)) 100 | # sys.exit(0) 101 | 102 | with tf.Graph().as_default(): 103 | session_conf = tf.ConfigProto( 104 | allow_soft_placement=ALLOW_SOFT_PLACEMENT, 105 | log_device_placement=LOG_DEVICE_PLACEMENT) 106 | sess = tf.Session(config=session_conf) 107 | 108 | with sess.as_default(): 109 | siameseModel = SiameseLSTMw2v( 110 | sequence_length=MAX_DOCUMENT_LENGTH, 111 | vocab_size=len(vocab_processor.vocabulary_), 112 | embedding_size=EMBEDDING_DIM, 113 | hidden_units=HIDDEN_UNITS, 114 | l2_reg_lambda=L2_REG_LAMBDA, 115 | batch_size=BATCH_SIZE, 116 | trainableEmbeddings=False 117 | ) 118 | # Define Training procedure 119 | global_step = tf.Variable(0, 
name="global_step", trainable=False) 120 | optimizer = tf.train.AdamOptimizer(1e-3) 121 | 122 | grads_and_vars = optimizer.compute_gradients(siameseModel.loss) 123 | tr_op_set = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 124 | print("defined training_ops") 125 | # Keep track of gradient values and sparsity (optional) 126 | grad_summaries = [] 127 | for g, v in grads_and_vars: 128 | if g is not None: 129 | grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g) 130 | sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 131 | grad_summaries.append(grad_hist_summary) 132 | grad_summaries.append(sparsity_summary) 133 | grad_summaries_merged = tf.summary.merge(grad_summaries) 134 | print("defined gradient summaries") 135 | # Output directory for models and summaries 136 | timestamp = str(int(time.time())) 137 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp)) 138 | print("Writing to {}\n".format(out_dir)) 139 | 140 | # Summaries for loss and accuracy 141 | loss_summary = tf.summary.scalar("loss", siameseModel.loss) 142 | acc_summary = tf.summary.scalar("accuracy", siameseModel.accuracy) 143 | f1_summary = tf.summary.scalar('f1', siameseModel.f1) 144 | 145 | # Train Summaries 146 | train_summary_op = tf.summary.merge([loss_summary, acc_summary, f1_summary, grad_summaries_merged]) 147 | train_summary_dir = os.path.join(out_dir, "summaries", "train") 148 | train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph) 149 | 150 | # Dev summaries 151 | dev_summary_op = tf.summary.merge([loss_summary, acc_summary, f1_summary]) 152 | dev_summary_dir = os.path.join(out_dir, "summaries", "dev") 153 | dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph) 154 | 155 | # Checkpoint directory. 
Tensorflow assumes this directory already exists so we need to create it 156 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 157 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 158 | if not os.path.exists(checkpoint_dir): 159 | os.makedirs(checkpoint_dir) 160 | saver = tf.train.Saver(tf.global_variables(), max_to_keep=100) 161 | 162 | # Write vocabulary 163 | vocab_processor.save(os.path.join(checkpoint_dir, "vocab")) 164 | 165 | # Initialize all variables 166 | sess.run(tf.global_variables_initializer()) 167 | 168 | print("init all variables") 169 | graph_def = tf.get_default_graph().as_graph_def() 170 | graphpb_txt = str(graph_def) 171 | with open(os.path.join(checkpoint_dir, "graphpb.txt"), 'w') as f: 172 | f.write(graphpb_txt) 173 | 174 | # 加载word2vec 175 | inpH.loadW2V(WORD2VEC_MODEL, WORD2VEC_FORMAT) 176 | # initial matrix with random uniform 177 | # initW = np.random.uniform(-0.25, 0.25, (len(vocab_processor.vocabulary_), EMBEDDING_DIM)) 178 | initW = np.random.uniform(0, 0, (len(vocab_processor.vocabulary_), EMBEDDING_DIM)) 179 | # print(initW) 180 | # sys.exit(0) 181 | 182 | # load any vectors from the word2vec 183 | print("initializing initW with pre-trained word2vec embeddings") 184 | for index, w in enumerate(vocab_processor.vocabulary_._mapping): 185 | # print('vocab-{}:{}'.format(index, w)) 186 | 187 | arr = [] 188 | if w in inpH.pre_emb: 189 | arr = inpH.pre_emb[w] 190 | # print('=====arr-{},{}'.format(index, arr)) 191 | idx = vocab_processor.vocabulary_.get(w) 192 | initW[idx] = np.asarray(arr).astype(np.float32) 193 | 194 | # 不使用词向量 195 | # arr=[] 196 | # idx = vocab_processor.vocabulary_.get(w) 197 | # arr.append(idx) 198 | # initW[idx] = np.asarray(arr).astype(np.float32) 199 | 200 | print("Done assigning intiW. 
len=" + str(len(initW))) 201 | # exit(0) 202 | 203 | # for idx, value in enumerate(initW): 204 | # print(idx, value) 205 | # sys.exit(0) 206 | 207 | inpH.deletePreEmb() 208 | gc.collect() 209 | sess.run(siameseModel.W.assign(initW)) 210 | 211 | 212 | def train_step(x1_batch, x2_batch, y_batch): 213 | """ 214 | A single training step 215 | """ 216 | # for index, sentence in enumerate(x1_batch): 217 | # word_list1=[] 218 | # word_list2=[] 219 | # y=y_batch[index] 220 | # for idx in x1_batch[index]: 221 | # word_list1.append(vocab_processor.vocabulary_.reverse(idx)) 222 | # for idx in x2_batch[index]: 223 | # word_list2.append(vocab_processor.vocabulary_.reverse(idx)) 224 | # 225 | # # print(''.join(word_list1),'\t',''.join(word_list2),'\t',y) 226 | # print('==========={}=============='.format(index)) 227 | # print(''.join(word_list1)) 228 | # print (''.join(word_list2)) 229 | # print(y) 230 | # sys.exit(0) 231 | 232 | feed_dict = { 233 | siameseModel.input_x1: x1_batch, 234 | siameseModel.input_x2: x2_batch, 235 | siameseModel.input_y: y_batch, 236 | siameseModel.dropout_keep_prob: DROPOUT_KEEP_PROB, 237 | } 238 | _, step, loss, accuracy, f1, dist, sim, summaries = sess.run( 239 | [tr_op_set, global_step, siameseModel.loss, siameseModel.accuracy, siameseModel.f1, siameseModel.distance, 240 | siameseModel.temp_sim, train_summary_op], feed_dict) 241 | time_str = datetime.datetime.now().isoformat() 242 | print("TRAIN {}: step {}, loss {:g}, acc {:g}, f1 {:g}".format(time_str, step, loss, accuracy, f1)) 243 | train_summary_writer.add_summary(summaries, step) 244 | print(y_batch, dist, sim) 245 | 246 | 247 | def dev_step(x1_batch, x2_batch, y_batch): 248 | """ 249 | A single training step 250 | """ 251 | # for index, sentence in enumerate(x1_batch): 252 | # word_list1=[] 253 | # word_list2=[] 254 | # y=y_batch[index] 255 | # for idx in x1_batch[index]: 256 | # word_list1.append(vocab_processor.vocabulary_.reverse(idx)) 257 | # for idx in x2_batch[index]: 258 | # word_list2.append(vocab_processor.vocabulary_.reverse(idx)) 259 | # 260 | # # print(''.join(word_list1),'\t',''.join(word_list2),'\t',y) 261 | # print('==========={}=============='.format(index)) 262 | # print(''.join(word_list1)) 263 | # print (''.join(word_list2)) 264 | # print(y) 265 | # sys.exit(0) 266 | 267 | feed_dict = { 268 | siameseModel.input_x1: x2_batch, 269 | siameseModel.input_x2: x1_batch, 270 | siameseModel.input_y: y_batch, 271 | siameseModel.dropout_keep_prob: 1.0, 272 | } 273 | step, loss, accuracy, f1, sim, summaries = sess.run( 274 | [global_step, siameseModel.loss, siameseModel.accuracy, siameseModel.f1, siameseModel.temp_sim, 275 | dev_summary_op], feed_dict) 276 | time_str = datetime.datetime.now().isoformat() 277 | print("DEV {}: step {}, loss {:g}, acc {:g}, f1 {:g}".format(time_str, step, loss, accuracy, f1)) 278 | dev_summary_writer.add_summary(summaries, step) 279 | print (y_batch, sim) 280 | return accuracy 281 | 282 | 283 | ################## 284 | # sys.exit(0) 285 | 286 | # Generate batches 287 | batches = inpH.batch_iter( 288 | list(zip(train_set[0], train_set[1], train_set[2])), BATCH_SIZE, NUM_EPOCHS) 289 | 290 | ptr = 0 291 | max_validation_acc = 0.0 292 | for nn in xrange(sum_no_of_batches * NUM_EPOCHS): 293 | batch = batches.next() 294 | if len(batch) < 1: 295 | continue 296 | x1_batch, x2_batch, y_batch = zip(*batch) 297 | if len(y_batch) < 1: 298 | continue 299 | train_step(x1_batch, x2_batch, y_batch) 300 | current_step = tf.train.global_step(sess, global_step) 301 | sum_acc = 0.0 302 | cnt = 0 
303 | if current_step % EVALUATE_EVERY == 0: 304 | print("\nEvaluation:") 305 | dev_batches = inpH.batch_iter(list(zip(dev_set[0], dev_set[1], dev_set[2])), BATCH_SIZE, 1) 306 | for db in dev_batches: 307 | if len(db) < 1: 308 | continue 309 | x1_dev_b, x2_dev_b, y_dev_b = zip(*db) 310 | if len(y_dev_b) < 1: 311 | continue 312 | acc = dev_step(x1_dev_b, x2_dev_b, y_dev_b) 313 | sum_acc = sum_acc + acc 314 | cnt += 1 315 | 316 | sum_acc /= cnt 317 | print("sum_acc= {}".format(sum_acc)) 318 | if current_step % CHECKOUTPOINT_EVERY == 0: 319 | if sum_acc >= max_validation_acc: 320 | max_validation_acc = sum_acc 321 | 322 | # 临时逻辑 323 | saver.save(sess, checkpoint_prefix, global_step=current_step) 324 | tf.train.write_graph(sess.graph.as_graph_def(), checkpoint_prefix, "graph" + str(nn) + ".pb", 325 | as_text=False) 326 | print("Saved model {} with sum_accuracy={} checkpoint to {}\n".format(nn, max_validation_acc, 327 | checkpoint_prefix)) 328 | 329 | print('max_validation_acc(each batch)= {}'.format(max_validation_acc)) 330 | 331 | end_time = datetime.datetime.now() 332 | train_duration = end_time - start_time 333 | print('训练开始时间: {}'.format(start_time)) 334 | print('训练结束时间: {}'.format(end_time)) 335 | print('训练结束, 训练总耗时: {}'.format(train_duration)) 336 | --------------------------------------------------------------------------------
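A note on the scoring math used throughout the model code: siamese_network_semantic.py builds the similarity score and the contrastive loss inside the TensorFlow graph, but the arithmetic is compact enough to restate. The sketch below is ours, for illustration only — it mirrors the graph ops on made-up arrays standing in for the stacked-LSTM outputs out1/out2, and omits the L2 regularization term that contrastive_loss adds in the real model:

```python
import numpy as np

def siamese_distance(out1, out2):
    """Normalized Euclidean distance, as in SiameseLSTMw2v: ||o1 - o2|| / (||o1|| + ||o2||)."""
    num = np.sqrt(np.sum(np.square(out1 - out2), axis=1))
    den = np.sqrt(np.sum(np.square(out1), axis=1)) + np.sqrt(np.sum(np.square(out2), axis=1))
    return num / den

def contrastive_loss(y, d):
    """y * d^2 + (1 - y) * max(1 - d, 0)^2, summed and halved over the batch (regularizer omitted)."""
    return np.sum(y * d ** 2 + (1 - y) * np.maximum(1 - d, 0) ** 2) / (2.0 * len(y))

out1 = np.random.rand(4, 128)   # stand-in for the final LSTM state of sentence 1 (HIDDEN_UNITS = 128)
out2 = np.random.rand(4, 128)   # stand-in for the final LSTM state of sentence 2
y = np.array([1.0, 0.0, 1.0, 0.0])

d = siamese_distance(out1, out2)
sim = 1.0 - np.rint(d)          # temp_sim: distances below 0.5 are called "synonymous" (1)
print(d, sim, contrastive_loss(y, d))
```

Because the distance is divided by the sum of the two output norms it always lies in [0, 1], which is why rounding it at 0.5 (temp_sim = 1 - rint(distance)) yields the 0/1 label that eval.py writes to the output file.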