├── .DS_Store
├── .gitignore
├── ATEC2018.md
├── LICENSE
├── README.md
├── dict.txt
├── eddy-20180701-4k.tar.gz
├── eval.py
├── input_helpers.py
├── preliminary_contest
│   ├── atec_nlp_sim_train.csv
│   ├── dict.txt
│   ├── eval.py
│   ├── input_helpers.py
│   ├── models
│   │   ├── model-4000.data-00000-of-00001
│   │   ├── model-4000.index
│   │   └── model-4000.meta
│   ├── preprocess.py
│   ├── run.sh
│   └── vocab
│       └── vocab
├── preprocess.py
├── siamese_network_semantic.py
├── test.py
├── train.py
├── train_data
│   └── atec_nlp_sim_train.csv
└── validation.txt0
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/.DS_Store
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Created by .ignore support plugin (hsz.mobi)
2 |
3 | .idea/
4 | *.pyc
5 | runs/
--------------------------------------------------------------------------------
/ATEC2018.md:
--------------------------------------------------------------------------------
1 | # 1 Task Description
2 |
3 | Question similarity: given two sentences from user utterances in a customer-service setting, decide with an algorithm whether they express the same meaning.
4 |
5 | Examples:
6 |
7 | a. "花呗如何还款" -- "花呗怎么还款": synonymous pair
8 | b. "花呗如何还款" -- "我怎么还我的花被呢": synonymous pair
9 | c. "花呗分期后逾期了如何还款" -- "花呗分期后逾期了哪里还款": non-synonymous pair
10 | Example a can be judged synonymous with fairly simple methods. Example b contains a typo, synonyms, and a change of word order, so the two sentences do not look alike at first glance and judging them correctly is harder. In example c the two sentences are very similar, yet a single small difference ("如何" vs. "哪里") changes the meaning.
11 |
12 | # 2 Data
13 |
14 | All data in this competition comes from real application scenarios of Ant Financial's "financial brain". The competition has two stages, a preliminary round and a final round:
15 |
16 | Preliminary round
17 |
18 | We provide 100,000 labeled sentence pairs (released in batches) as training data, including both synonymous and non-synonymous pairs; this set can be downloaded. Each line of the data set is one sample, in the following format:
19 |
20 | line_number\tsentence1\tsentence2\tlabel, for example: 1 花呗如何还款 花呗怎么还款 1
21 |
22 | The line number says which line of the training set the pair occupies;
23 | sentence1 and sentence2 are the two sentences of the question pair;
24 | the label marks the pair as synonymous (1) or non-synonymous (0).
25 | The evaluation set contains 10,000 pairs. To keep the competition fair and prevent leaderboard manipulation, it is not released; contestants submit evaluation code and models, which are run to produce predictions and a ranking. Its format is:
26 |
27 | line_number\tsentence1\tsentence2
28 |
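As an illustration of these two layouts, here is a minimal sketch; the helper below is hypothetical and not part of the official evaluation kit:

```python
# Hypothetical helper, shown only to illustrate the tab-separated layouts above.
def parse_line(line):
    fields = line.rstrip("\n").split("\t")
    line_no, sent1, sent2 = fields[0], fields[1], fields[2]
    label = int(fields[3]) if len(fields) > 3 else None  # evaluation lines carry no label
    return line_no, sent1, sent2, label
```
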
29 | ## Preliminary round
30 | The evaluation set is placed at a fixed path inside the evaluation system; the official platform calls the evaluation tool submitted by the contestant on it.
31 |
32 | ## Final round
33 |
34 | In this stage the training set grows to a massive scale. The data cannot be downloaded; it is provided as data tables on Ant Financial's Shuchao (数巢) platform. As in the preliminary round, the data set has four fields: line number, sentence1, sentence2, and label.
35 |
36 | The evaluation set is still 10,000 pairs, also provided as a data table on the Shuchao platform, with three fields: line number, sentence1, and sentence2.
37 |
38 | # 3 Evaluation and Metrics
39 |
40 | ## Preliminary round
41 | Contestants train and tune their models locally, package the evaluation code and model, and submit the package to the official evaluation system, which runs the prediction and updates the ranking. The evaluation system is a standard Linux environment with Python 2.7, Java 8, TensorFlow 1.5, and jieba 0.39 installed. After the submitted archive is unpacked, its top-level directory must contain a script run.sh that takes the evaluation file as input and writes the predictions as output (only 0s and 1s); each output line has the format "line_number\tprediction". It is invoked as:
42 |
43 | bash run.sh INPUT_PATH OUTPUT_PATH
44 |
45 | If the prediction output is empty or has the wrong number of lines, the score is 0.
46 |
47 |
48 |
49 | ## Final round
50 | In the final round, model training, tuning, and prediction all happen on Ant Financial's machine learning platform, so contestants only need to provide a UDF for evaluation: it takes the two sentences of a question pair as input and outputs the similarity prediction (0 or 1). As before, empty output terminates the evaluation and the score is 0.
51 |
52 |
53 |
54 | The competition is scored with accuracy and F1-score, computed by comparing the predictions against the true labels. A few definitions first:
55 |
56 | True Positive (TP): a "synonymous" judgment that is correct; the TP value is the number of correct synonymous judgments;
57 |
58 | likewise, False Positive (FP) is the number of incorrect synonymous judgments;
59 |
60 | True Negative (TN) is the number of correct non-synonymous judgments;
61 |
62 | False Negative (FN) is the number of incorrect non-synonymous judgments.
63 |
64 | From these we can compute precision, recall, accuracy, and F1-score:
65 |
66 | precision = TP / (TP + FP)
67 |
68 | recall = TP / (TP + FN)
69 |
70 | accuracy = (TP + TN) / (TP + FP + TN + FN)
71 |
72 | F1-score = 2 * precision * recall / (precision + recall)
73 |
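As a minimal sketch of the formulas above (illustrative only; the function is not part of the competition kit):

```python
# Illustration of the metric definitions above, from raw counts.
def metrics(tp, fp, tn, fn):
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    accuracy = (tp + tn) / float(tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1
```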
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2016 Dhwaj Raj
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
23 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Description
2 | Chinese sentence similarity computation based on a Siamese LSTM
3 |
4 | # Environment setup
5 | * Ubuntu: 16.04 (64-bit)
6 | * Anaconda: 2-4.4.0 (Python 2.7)
7 |
8 | Specific versions to download:
9 | * TensorFlow: 1.5.1
10 | * numpy: 1.14.3
11 | * gensim: 3.4.0
12 | * (nltk: 3.2.3)
13 | * jieba: 0.39
14 | * a pre-trained Chinese word2vec model
15 |
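A quick sanity check (a minimal sketch, assuming the packages above are installed) that the environment matches these versions:

```python
# Print installed versions; compare them with the list above.
import tensorflow as tf
import numpy, gensim, jieba
print(tf.__version__, numpy.__version__, gensim.__version__, jieba.__version__)
```
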
16 | Reference links:
17 |
18 | # Usage
19 |
20 | ### Model training
21 | # python train.py
22 |
23 | ### Model evaluation
24 | # python eval.py
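
Note that eval.py reads its vocabulary and model from a hard-coded runs/<timestamp>/checkpoints directory created by train.py. A minimal sketch (assuming at least one training run exists) for locating the newest checkpoint:

```python
# Sketch: find the newest run produced by train.py and its latest checkpoint.
import glob, os
import tensorflow as tf

run_dir = max(glob.glob("runs/*"), key=os.path.getmtime)  # newest runs/<timestamp>
ckpt = tf.train.latest_checkpoint(os.path.join(run_dir, "checkpoints"))
vocab = os.path.join(run_dir, "checkpoints", "vocab")
print(ckpt, vocab)  # paste these into MODEL / VOCAB_FILEPATH in eval.py
```
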
25 | ## Reference papers
26 | * [《Learning Text Similarity with Siamese Recurrent Networks》](http://www.aclweb.org/anthology/W16-16#page=162)
27 | * [《Siamese Recurrent Architectures for Learning Sentence Similarity》](http://www.mit.edu/~jonasm/info/MuellerThyagarajan_AAAI16.pdf)
28 |
29 | # Code reference
30 |
31 | * [dhwajraj/deep-siamese-text-similarity](https://github.com/dhwajraj/deep-siamese-text-similarity)
32 |
33 | Version: a61f07f6bef76665f8ba2df12f34b25380016613
34 |
35 | # ATEC2018 task description
36 | Related links:
37 |
38 |
--------------------------------------------------------------------------------
/dict.txt:
--------------------------------------------------------------------------------
1 | 花呗
2 | 借呗
3 | 蚂蚁花呗
4 | 蚂蚁借呗
5 | 从新
6 | 支付宝
7 | 淘宝
8 |
--------------------------------------------------------------------------------
/eddy-20180701-4k.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/eddy-20180701-4k.tar.gz
--------------------------------------------------------------------------------
/eval.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | # coding=utf-8
3 | import tensorflow as tf
4 | import numpy as np
5 | import os
6 | import time
7 | import datetime
8 | from input_helpers import InputHelper
9 | import sys
10 |
11 | # Parameters
12 | # ==================================================
13 |
14 | # Eval Parameters
15 | # batch size
16 | BATCH_SIZE = 64
17 | # validation set file
18 | EVAL_FILEPATH = 'validation.txt0'
19 | # vocabulary file (generated during training)
20 | VOCAB_FILEPATH = 'runs/1528462228/checkpoints/vocab'
21 | # model checkpoint
22 | MODEL = 'runs/1528462228/checkpoints/model-10000'
23 |
24 | # maximum sentence length (number of words)
25 | MAX_DOCUMENT_LENGTH = 30
26 |
27 | # Misc Parameters
28 | ALLOW_SOFT_PLACEMENT = True
29 | LOG_DEVICE_PLACEMENT = False
30 |
31 | inpH = InputHelper()
32 |
33 | x1_test, x2_test, y_test = inpH.getTestDataSet(EVAL_FILEPATH, VOCAB_FILEPATH, MAX_DOCUMENT_LENGTH)
34 |
35 | # for index ,value in enumerate(x1_test):
36 | # print (index, x1_test[index], x2_test[index], y_test[index])
37 | # sys.exit(0)
38 |
39 | print("\nEvaluating...\n")
40 |
41 | # Evaluation
42 | # ==================================================
43 | checkpoint_file = MODEL
44 | print checkpoint_file
45 | graph = tf.Graph()
46 | with graph.as_default():
47 | session_conf = tf.ConfigProto(
48 | allow_soft_placement=ALLOW_SOFT_PLACEMENT,
49 | log_device_placement=LOG_DEVICE_PLACEMENT)
50 | sess = tf.Session(config=session_conf)
51 | with sess.as_default():
52 | # Load the saved meta graph and restore variables
53 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
54 | sess.run(tf.initialize_all_variables())
55 | saver.restore(sess, checkpoint_file)
56 |
57 | # Get the placeholders from the graph by name
58 | input_x1 = graph.get_operation_by_name("input_x1").outputs[0]
59 | input_x2 = graph.get_operation_by_name("input_x2").outputs[0]
60 | input_y = graph.get_operation_by_name("input_y").outputs[0]
61 |
62 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
63 | # Tensors we want to evaluate
64 | predictions = graph.get_operation_by_name("output/distance").outputs[0]
65 |
66 | accuracy = graph.get_operation_by_name("accuracy/accuracy").outputs[0]
67 |
68 | sim = graph.get_operation_by_name("accuracy/temp_sim").outputs[0]
69 |
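# Note: "output/distance" is the normalized distance in [0, 1] between the two sentence
# encodings, and "accuracy/temp_sim" equals 1 - round(distance), i.e. 1 when the pair is
# predicted synonymous (see siamese_network_semantic.py).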
70 | # emb = graph.get_operation_by_name("embedding/W").outputs[0]
71 | # embedded_chars = tf.nn.embedding_lookup(emb,input_x)
72 | # Generate batches for one epoch
73 | batches = inpH.batch_iter(list(zip(x1_test, x2_test, y_test)), 2 * BATCH_SIZE, 1, shuffle=False)
74 | # Collect the predictions here
75 | all_predictions = []
76 | all_d = []
77 | for db in batches:
78 | x1_dev_b, x2_dev_b, y_dev_b = zip(*db)
79 | batch_predictions, batch_acc, batch_sim = sess.run([predictions, accuracy, sim],
80 | {input_x1: x1_dev_b, input_x2: x2_dev_b,
81 | input_y: y_dev_b, dropout_keep_prob: 1.0})
82 | all_predictions = np.concatenate([all_predictions, batch_predictions])
83 | print(batch_predictions)
84 | all_d = np.concatenate([all_d, batch_sim])
85 | print("DEV acc {}".format(batch_acc))
86 | for ex in all_predictions:
87 | print ex
88 | correct_predictions = float(np.mean(all_d == y_test))
89 | print("Accuracy: {:g}".format(correct_predictions))
90 |
--------------------------------------------------------------------------------
/input_helpers.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | import numpy as np
3 | import re
4 | import itertools
5 | from collections import Counter
6 | import numpy as np
7 | import time
8 | import gc
9 | from tensorflow.contrib import learn
10 | # from gensim.models.word2vec import Word2Vec
11 | import gensim
12 | import gzip
13 | from random import random
14 | from preprocess import MyVocabularyProcessor
15 | import sys
16 | import jieba
17 |
18 | reload(sys)
19 | sys.setdefaultencoding("utf-8")
20 |
21 |
22 | class InputHelper(object):
23 | pre_emb = dict()
24 | vocab_processor = None
25 |
26 | def loadW2V(self, emb_path, type="bin"):
27 | print("Loading W2V data...")
28 | num_keys = 0
29 | if type == "textgz":
30 | # this seems faster than gensim non-binary load
31 | for line in gzip.open(emb_path):
32 | l = line.strip().split()
33 | st = l[0].lower()
34 | self.pre_emb[st] = np.asarray(l[1:])
35 | num_keys = len(self.pre_emb)
36 | if type == "text":
37 | # this seems faster than gensim non-binary load
38 | for line in open(emb_path):
39 | l = line.strip().split()
40 | st = l[0].lower()
41 | self.pre_emb[st] = np.asarray(l[1:])
42 | num_keys = len(self.pre_emb)
43 | else:
44 | # self.pre_emb = Word2Vec.load_word2vec_format(emb_path,binary=True)
45 | self.pre_emb = gensim.models.KeyedVectors.load_word2vec_format(emb_path, binary=True) # eddy
46 | self.pre_emb.init_sims(replace=True)
47 | num_keys = len(self.pre_emb.vocab)
48 | print("loaded word2vec len ", num_keys)
49 | gc.collect()
50 |
51 | def deletePreEmb(self):
52 | self.pre_emb = dict()
53 | gc.collect()
54 |
55 | def getTsvData(self, filepath):
56 | print("Loading training data from " + filepath)
57 | x1 = []
58 | x2 = []
59 | y = []
60 | num_p = 0
61 | num_n = 0
62 | # positive samples from file
63 | for line in open(filepath):
64 | # print(line)
65 | l = line.strip().split("\t")
66 |
67 | # print(l[0])
68 | # print(l[1])
69 | # print(l[2])
70 | if len(l) >= 4:
71 | x1.append(l[1])
72 | x2.append(l[2])
73 | y.append(int(l[3]))
74 |
75 | flag = int(l[3])
76 | if flag > 0:
77 | num_p += 1
78 | else:
79 | num_n += 1
80 |
81 | tmp_x1 = []
82 | tmp_x2 = []
83 | tmp_y = []
84 |
85 | # # Under-sampling (disabled)
86 | # for idx, item in enumerate(y):
87 | # if item[1] == 1:
88 | # tmp_x1.append(x1[idx])
89 | # tmp_x2.append(x2[idx])
90 | # tmp_y.append(y[idx])
91 | # elif num_p >= 0:
92 | # tmp_x1.append(x1[idx])
93 | # tmp_x2.append(x2[idx])
94 | # tmp_y.append(y[idx])
95 | # num_p -= 1
96 | # x1 = tmp_x1
97 | # x2 = tmp_x2
98 | # y = tmp_y
99 |
100 | # Over-sampling: duplicate positive pairs until the classes are balanced
101 | add_p_num = num_n - num_p
102 | while add_p_num > 0:
103 | for idx, item in enumerate(y):
104 | if item == 1:
105 | tmp_x1.append(x1[idx])
106 | tmp_x2.append(x2[idx])
107 | tmp_y.append(y[idx])
108 | add_p_num -= 1
109 | if add_p_num <= 0:
110 | break
111 |
112 | print('len(x1)={}, len(x2)={}, len(y)={}'.format(len(x1), len(x2), len(y)))
113 |
114 | x1 += tmp_x1
115 | x2 += tmp_x2
116 | y += tmp_y
117 |
118 | print('len(x1)={}, len(x2)={}, len(y)={}'.format(len(x1), len(x2), len(y)))
119 |
120 | # num_p=0
121 | # for item in y:
122 | # if item[1]==1:
123 | # num_p+=1
124 | #
125 | # print('num_p= {}'.format(num_p))
126 | # exit(0)
127 |
128 | # print ('num_p= {}'.format(num_p))
129 | # print('num_n= {}'.format(num_n))
130 | # exit(0)
131 |
132 | return np.asarray(x1), np.asarray(x2), np.asarray(y)
133 |
134 | def getTsvTestData(self, filepath):
135 | print("Loading testing/labelled data from " + filepath)
136 | x1 = []
137 | x2 = []
138 | y = []
139 | # positive samples from file
140 | for line in open(filepath):
141 | l = line.strip().split("\t")
142 | if len(l) < 3:
143 | continue
144 | x1.append(l[1])
145 | x2.append(l[2])
146 | y.append(int(l[0]))
147 | return np.asarray(x1), np.asarray(x2), np.asarray(y)
148 |
149 | def batch_iter(self, data, batch_size, num_epochs, shuffle=True):
150 | """
151 | Generates a batch iterator for a dataset.
152 | """
153 | data = np.asarray(data)
154 | # print(data)
155 | # print(data.shape)
156 | data_size = len(data)
157 | num_batches_per_epoch = int(len(data) / batch_size) + 1
158 | for epoch in range(num_epochs):
159 | # Shuffle the data at each epoch
160 | if shuffle:
161 | shuffle_indices = np.random.permutation(np.arange(data_size))
162 | shuffled_data = data[shuffle_indices]
163 | else:
164 | shuffled_data = data
165 | for batch_num in range(num_batches_per_epoch):
166 | start_index = batch_num * batch_size
167 | end_index = min((batch_num + 1) * batch_size, data_size)
168 | yield shuffled_data[start_index:end_index]
169 |
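# Note: when data_size is an exact multiple of batch_size, the final slice yielded above is
# empty; callers such as train.py skip those empty batches with `if len(batch) < 1: continue`.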
170 | def dumpValidation(self, x1_text, x2_text, y, shuffled_index, dev_idx, i):
171 | print("dumping validation " + str(i))
172 | x1_shuffled = x1_text[shuffled_index]
173 | x2_shuffled = x2_text[shuffled_index]
174 | y_shuffled = y[shuffled_index]
175 | x1_dev = x1_shuffled[dev_idx:]
176 | x2_dev = x2_shuffled[dev_idx:]
177 | y_dev = y_shuffled[dev_idx:]
178 | del x1_shuffled
179 | del y_shuffled
180 | with open('validation.txt' + str(i), 'w') as f:
181 | for text1, text2, label in zip(x1_dev, x2_dev, y_dev):
182 | f.write(str(label) + "\t" + text1 + "\t" + text2 + "\n")
183 | f.close()
184 | del x1_dev
185 | del y_dev
186 |
187 | # Data Preparation
188 | # ==================================================
189 |
190 | def getDataSets(self, training_paths, max_document_length, percent_dev, batch_size):
191 | x1_text, x2_text, y = self.getTsvData(training_paths)
192 | # print('x1_text= {}'.format(x1_text))
193 | # print('x2_text= {}'.format(x2_text))
194 | # print ('y= {}'.format(y))
195 |
196 | # Build vocabulary
197 | print("Building vocabulary")
198 | vocab_processor = MyVocabularyProcessor(max_document_length, min_frequency=0)
199 | vocab_processor.fit_transform(np.concatenate((x2_text, x1_text), axis=0))
200 | print("Length of loaded vocabulary ={}".format(len(vocab_processor.vocabulary_)))
201 |
202 | sum_no_of_batches = 0
203 | x1 = np.asarray(list(vocab_processor.transform(x1_text)))
204 | x2 = np.asarray(list(vocab_processor.transform(x2_text)))
205 | # Randomly shuffle data
206 | np.random.seed(131)
207 | shuffle_indices = np.random.permutation(np.arange(len(y)))
208 | x1_shuffled = x1[shuffle_indices]
209 | x2_shuffled = x2[shuffle_indices]
210 | y_shuffled = y[shuffle_indices]
211 | dev_idx = -1 * len(y_shuffled) * percent_dev // 100
212 | print('dev_idx= {}'.format(dev_idx))
213 |
214 | del x1
215 | del x2
216 | # Split train/test set
217 | self.dumpValidation(x1_text, x2_text, y, shuffle_indices, dev_idx, 0)
218 | # TODO: This is very crude, should use cross-validation
219 | x1_train, x1_dev = x1_shuffled[:dev_idx], x1_shuffled[dev_idx:]
220 | x2_train, x2_dev = x2_shuffled[:dev_idx], x2_shuffled[dev_idx:]
221 | y_train, y_dev = y_shuffled[:dev_idx], y_shuffled[dev_idx:]
222 | print("Train/Dev split for {}: {:d}/{:d}".format(training_paths, len(y_train), len(y_dev)))
223 | sum_no_of_batches = sum_no_of_batches + (len(y_train) // batch_size)
224 | train_set = (x1_train, x2_train, y_train)
225 | dev_set = (x1_dev, x2_dev, y_dev)
226 | gc.collect()
227 | return train_set, dev_set, vocab_processor, sum_no_of_batches
228 |
229 | def getTestDataSet(self, data_path, vocab_path, max_document_length):
230 | x1_temp, x2_temp, y = self.getTsvTestData(data_path)
231 |
232 | # Build vocabulary
233 | vocab_processor = MyVocabularyProcessor(max_document_length, min_frequency=0)
234 | vocab_processor = vocab_processor.restore(vocab_path)
235 | print len(vocab_processor.vocabulary_)
236 |
237 | x1 = np.asarray(list(vocab_processor.transform(x1_temp)))
238 | x2 = np.asarray(list(vocab_processor.transform(x2_temp)))
239 | # Randomly shuffle data
240 | del vocab_processor
241 | gc.collect()
242 | return x1, x2, y
243 |
--------------------------------------------------------------------------------
/preliminary_contest/dict.txt:
--------------------------------------------------------------------------------
1 | 花呗
2 | 借呗
3 | 蚂蚁花呗
4 | 蚂蚁借呗
--------------------------------------------------------------------------------
/preliminary_contest/eval.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | # coding=utf-8
3 | import tensorflow as tf
4 | import numpy as np
5 | import os
6 | import time
7 | import datetime
8 | from tensorflow.contrib import learn
9 | from input_helpers import InputHelper
10 | import sys
11 |
12 | # Parameters
13 | # ==================================================
14 | EVAL_FILE = sys.argv[1]  # file to evaluate
15 | OUTPUT_FILE = sys.argv[2]  # output file for predictions
16 |
17 | print (EVAL_FILE)
18 | print (OUTPUT_FILE)
19 |
20 | # Eval Parameters
21 | BATCH_SIZE = 64  # batch size
22 | VOCAB_FILE = './vocab/vocab'  # vocabulary used during training
23 | MODEL = './models/model-4000'  # trained model to load
24 | ALLOW_SOFT_PLACEMENT = True
25 | LOG_DEVICE_PLACEMENT = False
26 |
27 | # maximum sentence length (number of words)
28 | MAX_DOCUMENT_LENGTH = 40
29 |
30 | # load data and map id-transform based on training time vocabulary
31 | inpH = InputHelper()
32 | x1_test, x2_test = inpH.getTestDataSet(EVAL_FILE, VOCAB_FILE, MAX_DOCUMENT_LENGTH)
33 |
34 | # for index, _ in enumerate(x1_test):
35 | # print(index, x1_test[index], x2_test[index])
36 |
37 | print("\nEvaluating...\n")
38 |
39 | # Evaluation
40 | # ==================================================
41 | checkpoint_file = MODEL
42 | print checkpoint_file
43 | graph = tf.Graph()
44 | with graph.as_default():
45 | session_conf = tf.ConfigProto(
46 | allow_soft_placement=ALLOW_SOFT_PLACEMENT,
47 | log_device_placement=LOG_DEVICE_PLACEMENT)
48 | sess = tf.Session(config=session_conf)
49 | with sess.as_default():
50 | # Load the saved meta graph and restore variables
51 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
52 | sess.run(tf.initialize_all_variables())
53 | saver.restore(sess, checkpoint_file)
54 |
55 | # Get the placeholders from the graph by name
56 | input_x1 = graph.get_operation_by_name("input_x1").outputs[0]
57 | input_x2 = graph.get_operation_by_name("input_x2").outputs[0]
58 | # input_y = graph.get_operation_by_name("input_y").outputs[0]
59 |
60 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
61 | # Tensors we want to evaluate
62 | predictions = graph.get_operation_by_name("output/distance").outputs[0]
63 |
64 | # accuracy = graph.get_operation_by_name("accuracy/accuracy").outputs[0]
65 |
66 | sim = graph.get_operation_by_name("accuracy/temp_sim").outputs[0]
67 |
68 | # emb = graph.get_operation_by_name("embedding/W").outputs[0]
69 | # embedded_chars = tf.nn.embedding_lookup(emb,input_x)
70 | # Generate batches for one epoch
71 | batches = inpH.batch_iter(list(zip(x1_test, x2_test)), 2 * BATCH_SIZE, 1, shuffle=False)
72 | # Collect the predictions here
73 | all_predictions = []
74 | all_d = []
75 |
76 | for db in batches:
77 | # print('db')
78 | # print(db)
79 | #
80 | x1_dev_b, x2_dev_b = zip(*db)
81 | batch_predictions, batch_sim = sess.run([predictions, sim],
82 | {input_x1: x1_dev_b, input_x2: x2_dev_b, dropout_keep_prob: 1.0})
83 | all_predictions = np.concatenate([all_predictions, batch_predictions])
84 | # print(batch_predictions)
85 | print(batch_sim)
86 | print(type(batch_sim))
87 | print(len(batch_sim))
88 | all_d = np.concatenate([all_d, batch_sim])
89 | # print("DEV acc {}".format(batch_acc))
90 | for ex in all_predictions:
91 | print ex
92 |
93 | f_output = open(OUTPUT_FILE, 'a')
94 | index = 1
95 | predic_value = 0
96 | for item in all_d:
97 | # deliberately written the other way round
98 | if item > 0:
99 | predic_value = 1
100 | else:
101 | predic_value = 0
102 | f_output.write('{}\t{}\n'.format(index, predic_value))
103 | index += 1
104 |
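# Note: each line written above follows the submission format "line_number\tprediction"
# expected by run.sh and the ATEC2018 evaluation system, with line numbers counted from 1.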
105 | # correct_predictions = float(np.mean(all_d == y_test))
106 | # print("Accuracy: {:g}".format(correct_predictions))
107 |
108 | print ('eval finished!')
109 |
--------------------------------------------------------------------------------
/preliminary_contest/input_helpers.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | import numpy as np
3 | import re
4 | import itertools
5 | from collections import Counter
6 | import numpy as np
7 | import time
8 | import gc
9 | from tensorflow.contrib import learn
10 | # from gensim.models.word2vec import Word2Vec
11 | import gensim
12 | import gzip
13 | from random import random
14 | from preprocess import MyVocabularyProcessor
15 | import sys
16 | import jieba
17 |
18 | reload(sys)
19 | sys.setdefaultencoding("utf-8")
20 |
21 |
22 | class InputHelper(object):
23 | pre_emb = dict()
24 | vocab_processor = None
25 |
26 | def getTsvTestData(self, filepath):
27 | print("Loading testing/labelled data from " + filepath)
28 | x1 = []
29 | x2 = []
30 | for line in open(filepath):
31 | l = line.strip().split("\t")
32 | x1.append(l[1])
33 | x2.append(l[2])
34 | return np.asarray(x1), np.asarray(x2)
35 |
36 | def batch_iter(self, data, batch_size, num_epochs, shuffle=True):
37 | """
38 | Generates a batch iterator for a dataset.
39 | """
40 | data = np.asarray(data)
41 | print(data)
42 | print(data.shape)
43 | data_size = len(data)
44 | num_batches_per_epoch = int(len(data) / batch_size) + 1
45 | for epoch in range(num_epochs):
46 | # Shuffle the data at each epoch
47 | if shuffle:
48 | shuffle_indices = np.random.permutation(np.arange(data_size))
49 | shuffled_data = data[shuffle_indices]
50 | else:
51 | shuffled_data = data
52 | for batch_num in range(num_batches_per_epoch):
53 | start_index = batch_num * batch_size
54 | end_index = min((batch_num + 1) * batch_size, data_size)
55 | yield shuffled_data[start_index:end_index]
56 |
57 | # Data Preparation
58 | # ==================================================
59 |
60 | def getTestDataSet(self, data_path, vocab_path, max_document_length):
61 | x1_temp, x2_temp = self.getTsvTestData(data_path)
62 |
63 | # Build vocabulary
64 | vocab_processor = MyVocabularyProcessor(max_document_length, min_frequency=0)
65 | vocab_processor = vocab_processor.restore(vocab_path)
66 | print ('len(vocab_processor.vocabulary_)', len(vocab_processor.vocabulary_))
67 | # sys.exit(0)
68 |
69 | x1 = np.asarray(list(vocab_processor.transform(x1_temp)))
70 | x2 = np.asarray(list(vocab_processor.transform(x2_temp)))
71 | # Randomly shuffle data
72 | del vocab_processor
73 | gc.collect()
74 | return x1, x2
75 |
--------------------------------------------------------------------------------
/preliminary_contest/models/model-4000.data-00000-of-00001:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/models/model-4000.data-00000-of-00001
--------------------------------------------------------------------------------
/preliminary_contest/models/model-4000.index:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/models/model-4000.index
--------------------------------------------------------------------------------
/preliminary_contest/models/model-4000.meta:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/models/model-4000.meta
--------------------------------------------------------------------------------
/preliminary_contest/preprocess.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | from __future__ import absolute_import
3 | from __future__ import division
4 | from __future__ import print_function
5 |
6 | import re
7 | import numpy as np
8 | import six
9 | from tensorflow.contrib import learn
10 | from tensorflow.python.platform import gfile
11 | from tensorflow.contrib import learn # pylint: disable=g-bad-import-order
12 | import jieba
13 |
14 |
15 | def tokenizer_word(iterator):
16 | jieba.load_userdict('./dict.txt')
17 | for sentence in iterator:
18 | yield list(jieba.lcut(sentence))
19 |
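# Note: unlike the top-level preprocess.py, this tokenizer does not decode the input or strip
# punctuation before segmentation; it only loads the jieba user dictionary and segments each sentence.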
20 |
21 | class MyVocabularyProcessor(learn.preprocessing.VocabularyProcessor):
22 | def __init__(self,
23 | max_document_length,
24 | min_frequency=0,
25 | vocabulary=None):
26 |
27 | tokenizer_fn = tokenizer_word
28 | self.sup = super(MyVocabularyProcessor, self)
29 | self.sup.__init__(max_document_length, min_frequency, vocabulary, tokenizer_fn)
30 |
31 | def transform(self, raw_documents):
32 | """Transform documents to word-id matrix.
33 | Convert words to ids with vocabulary fitted with fit or the one
34 | provided in the constructor.
35 | Args:
36 | raw_documents: An iterable which yield either str or unicode.
37 | Yields:
38 | x: iterable, [n_samples, max_document_length]. Word-id matrix.
39 | """
40 | # print('len(raw_documents)= {}'.format(len(raw_documents)))
41 | # print('raw_documents= {}'.format(raw_documents))
42 |
43 | # for index,value in enumerate(raw_documents):
44 | # print(index, value)
45 |
46 | for tokens in self._tokenizer(raw_documents):
47 | word_ids = np.zeros(self.max_document_length, np.int64)
48 | for idx, token in enumerate(tokens):
49 | if idx >= self.max_document_length:
50 | break
51 | word_ids[idx] = self.vocabulary_.get(token)
52 | yield word_ids
53 |
--------------------------------------------------------------------------------
/preliminary_contest/run.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | python eval.py $1 $2
--------------------------------------------------------------------------------
/preliminary_contest/vocab/vocab:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/vocab/vocab
--------------------------------------------------------------------------------
/preprocess.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | from __future__ import absolute_import
3 | from __future__ import division
4 | from __future__ import print_function
5 |
6 | import re
7 | import numpy as np
8 | import six
9 | from tensorflow.contrib import learn
10 | from tensorflow.python.platform import gfile
11 | from tensorflow.contrib import learn # pylint: disable=g-bad-import-order
12 | import jieba
13 |
14 |
15 | def tokenizer_word(iterator):
16 | jieba.load_userdict('./dict.txt')
17 | for sentence in iterator:
18 | sentence = sentence.decode("utf8")
19 | sentence = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。:??、~@#¥%……&*()]+".decode("utf8"), "".decode("utf8"),
20 | sentence)
21 | yield list(jieba.lcut(sentence))
22 |
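# Note: tokenizer_word strips whitespace and common ASCII/Chinese punctuation before segmenting
# with jieba; dict.txt supplies domain terms (e.g. 花呗, 借呗) so they are kept as single tokens.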
23 |
24 | class MyVocabularyProcessor(learn.preprocessing.VocabularyProcessor):
25 | def __init__(self,
26 | max_document_length,
27 | min_frequency=0,
28 | vocabulary=None):
29 |
30 | tokenizer_fn = tokenizer_word
31 | self.sup = super(MyVocabularyProcessor, self)
32 | self.sup.__init__(max_document_length, min_frequency, vocabulary, tokenizer_fn)
33 |
34 | def transform(self, raw_documents):
35 | """Transform documents to word-id matrix.
36 | Convert words to ids with vocabulary fitted with fit or the one
37 | provided in the constructor.
38 | Args:
39 | raw_documents: An iterable which yield either str or unicode.
40 | Yields:
41 | x: iterable, [n_samples, max_document_length]. Word-id matrix.
42 | """
43 | # print('len(raw_documents)= {}'.format(len(raw_documents)))
44 | # print('raw_documents= {}'.format(raw_documents))
45 |
46 | # for index,value in enumerate(raw_documents):
47 | # print(index, value)
48 |
49 | for tokens in self._tokenizer(raw_documents):
50 | word_ids = np.zeros(self.max_document_length, np.int64)
51 | for idx, token in enumerate(tokens):
52 | if idx >= self.max_document_length:
53 | break
54 | word_ids[idx] = self.vocabulary_.get(token)
55 | yield word_ids
56 |
--------------------------------------------------------------------------------
/siamese_network_semantic.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | import tensorflow as tf
3 | import tensorflow.contrib.slim as slim
4 | import numpy as np
5 |
6 |
7 | class SiameseLSTMw2v(object):
8 | """
9 | A LSTM based deep Siamese network for text similarity.
10 | Uses a word embedding layer (looked up in pre-trained word2vec), followed by a stacked LSTM encoder and a contrastive (energy) loss layer.
11 | """
12 |
13 | def stackedRNN(self, x, dropout, scope, embedding_size, sequence_length, hidden_units):
14 | n_hidden = hidden_units
15 | n_layers = 3
16 | # n_layers = 6
17 | # Prepare data shape to match `static_rnn` function requirements
18 | x = tf.unstack(tf.transpose(x, perm=[1, 0, 2]))
19 | # print(x)
20 | # Define lstm cells with tensorflow
21 | # Forward direction cell
22 |
23 | with tf.name_scope("fw" + scope), tf.variable_scope("fw" + scope):
24 | stacked_rnn_fw = []
25 | for _ in range(n_layers):
26 | fw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0, state_is_tuple=True)
27 | lstm_fw_cell = tf.contrib.rnn.DropoutWrapper(fw_cell, output_keep_prob=dropout)
28 | stacked_rnn_fw.append(lstm_fw_cell)
29 | lstm_fw_cell_m = tf.nn.rnn_cell.MultiRNNCell(cells=stacked_rnn_fw, state_is_tuple=True)
30 |
31 | outputs, _ = tf.nn.static_rnn(lstm_fw_cell_m, x, dtype=tf.float32)
32 | return outputs[-1]
33 |
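# Note on contrastive_loss below: for synonymous pairs (y = 1) it penalizes the squared
# distance d^2 directly; for non-synonymous pairs (y = 0) it penalizes max(1 - d, 0)^2,
# i.e. distances that fall inside a margin of 1, plus an L2 term over all trainable variables.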
34 | def contrastive_loss(self, y, d, batch_size):
35 | tmp = y * tf.square(d)
36 | # tmp= tf.mul(y,tf.square(d))
37 | tmp2 = (1 - y) * tf.square(tf.maximum((1 - d), 0))
38 | reg = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(1e-4), tf.trainable_variables())
39 | return tf.reduce_sum(tmp + tmp2) / batch_size / 2 + reg
40 |
41 | def __init__(
42 | self, sequence_length, vocab_size, embedding_size, hidden_units, l2_reg_lambda, batch_size,
43 | trainableEmbeddings):
44 | # Placeholders for input, output and dropout
45 | self.input_x1 = tf.placeholder(tf.int32, [None, sequence_length], name="input_x1")
46 | self.input_x2 = tf.placeholder(tf.int32, [None, sequence_length], name="input_x2")
47 | self.input_y = tf.placeholder(tf.float32, [None], name="input_y")
48 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
49 |
50 | # Keeping track of l2 regularization loss (optional)
51 | l2_loss = tf.constant(0.0, name="l2_loss")
52 |
53 | # Embedding layer
54 | with tf.name_scope("embedding"):
55 | self.W = tf.Variable(
56 | tf.constant(0.0, shape=[vocab_size, embedding_size]),
57 | trainable=trainableEmbeddings, name="W")
58 | self.embedded_words1 = tf.nn.embedding_lookup(self.W, self.input_x1)
59 | self.embedded_words2 = tf.nn.embedding_lookup(self.W, self.input_x2)
60 | # print self.embedded_words1
61 | # Encode both sentences with the shared stacked LSTM and compute their normalized distance
62 | with tf.name_scope("output"):
63 | self.out1 = self.stackedRNN(self.embedded_words1, self.dropout_keep_prob, "side1", embedding_size,
64 | sequence_length, hidden_units)
65 | self.out2 = self.stackedRNN(self.embedded_words2, self.dropout_keep_prob, "side2", embedding_size,
66 | sequence_length, hidden_units)
67 | self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.out1, self.out2)), 1, keep_dims=True))
68 | self.distance = tf.div(self.distance,
69 | tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.out1), 1, keep_dims=True)),
70 | tf.sqrt(tf.reduce_sum(tf.square(self.out2), 1, keep_dims=True))))
71 | self.distance = tf.reshape(self.distance, [-1], name="distance")
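# Note: self.distance is the Euclidean distance between the two sentence encodings divided by
# the sum of their norms, so it always lies in [0, 1]; 0 means the encodings coincide,
# values near 1 mean they are far apart.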
72 | with tf.name_scope("loss"):
73 | self.loss = self.contrastive_loss(self.input_y, self.distance, batch_size)
74 | #### Accuracy computation is outside of this class.
75 | with tf.name_scope("accuracy"):
76 | self.temp_sim = tf.subtract(tf.ones_like(self.distance), tf.rint(self.distance),
77 | name="temp_sim") # auto threshold 0.5
78 | correct_predictions = tf.equal(self.temp_sim, self.input_y)
79 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
80 |
81 | with tf.name_scope('f1'):
82 | ones_like_actuals = tf.ones_like(self.input_y)
83 | zeros_like_actuals = tf.zeros_like(self.input_y)
84 | ones_like_predictions = tf.ones_like(self.temp_sim)
85 | zeros_like_predictions = tf.zeros_like(self.temp_sim)
86 |
87 | tp = tf.reduce_sum(
88 | tf.cast(
89 | tf.logical_and(
90 | tf.equal(self.input_y, ones_like_actuals),
91 | tf.equal(self.temp_sim, ones_like_predictions)
92 | ),
93 | 'float'
94 | )
95 | )
96 |
97 | tn = tf.reduce_sum(
98 | tf.cast(
99 | tf.logical_and(
100 | tf.equal(self.input_y, zeros_like_actuals),
101 | tf.equal(self.temp_sim, zeros_like_predictions)
102 | ),
103 | 'float'
104 | )
105 | )
106 |
107 | fp = tf.reduce_sum(
108 | tf.cast(
109 | tf.logical_and(
110 | tf.equal(self.input_y, zeros_like_actuals),
111 | tf.equal(self.temp_sim, ones_like_predictions)
112 | ),
113 | 'float'
114 | )
115 | )
116 |
117 | fn = tf.reduce_sum(
118 | tf.cast(
119 | tf.logical_and(
120 | tf.equal(self.input_y, ones_like_actuals),
121 | tf.equal(self.temp_sim, zeros_like_predictions)
122 | ),
123 | 'float'
124 | )
125 | )
126 |
127 | precision = tp / (tp + fp)
128 | recall = tp / (tp + fn)
129 |
130 | self.f1 = 2 * precision * recall / (precision + recall)
131 |
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | # coding=utf-8
3 |
4 | import re
5 |
6 | line = "想做/ 兼_职/学生_/ 的 、加,我Q: 1 5. 8 0. !!?? 8 6 。0. 2。 3 有,惊,喜,哦"
7 | line = line.decode("utf8")
8 | string = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。:??、~@#¥%……&*()]+".decode("utf8"), "".decode("utf8"), line)
9 | print(string)
10 |
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | # coding=utf-8
3 | import tensorflow as tf
4 | import numpy as np
5 | import re
6 | import os
7 | import time
8 | import datetime
9 | import gc
10 | from input_helpers import InputHelper
11 | from siamese_network_semantic import SiameseLSTMw2v
12 | import gzip
13 | from random import random
14 | import sys
15 |
16 | # a=['你好','上帝','下地']
17 | # b=[u'你好',u'上帝',u'下地']
18 | # print(a)
19 | # print(b)
20 | # sys.exit(0)
21 |
22 | # Parameters
23 | # word2vec model (a pre-trained Chinese model)
24 | WORD2VEC_MODEL = '../word2vecmodel/news_12g_baidubaike_20g_novel_90g_embedding_64.bin'
25 | # the model file is in binary (bin) format
26 | WORD2VEC_FORMAT = 'bin'
27 | # word2vec embedding dimension (64 or 128)
28 | EMBEDDING_DIM = 64
29 | # dropout keep probability
30 | # DROPOUT_KEEP_PROB = '0.3'  # underfits the training set
31 | DROPOUT_KEEP_PROB = '0.8'
32 | # DROPOUT_KEEP_PROB = '0.6'
33 | # DROPOUT_KEEP_PROB = '0.7'
34 | # DROPOUT_KEEP_PROB = '0.8'
35 | # DROPOUT_KEEP_PROB = '1.0'(7th-June)
36 | # DROPOUT_KEEP_PROB = '0.8'
37 | # DROPOUT_KEEP_PROB = '0.4'  # underfits the training set
38 | # L2 regularization coefficient (currently not in effect)
39 | L2_REG_LAMBDA = 0.0
40 | # raw training file
41 | TRAINING_FILES_RAW = './train_data/atec_nlp_sim_train.csv'
42 | # number of hidden units
43 | # HIDDEN_UNITS = 64(7th-June)
44 | HIDDEN_UNITS = 128
45 |
46 | # Training parameters
47 | # batch size
48 | # BATCH_SIZE = 64
49 | # BATCH_SIZE = 1024(7th-June)
50 | BATCH_SIZE = 1024 # 92229=102477-10248
51 | # number of epochs
52 | # NUM_EPOCHS = 300
53 | # NUM_EPOCHS = 3000
54 | NUM_EPOCHS = 100000
55 | # evaluation interval (in steps)
56 | # EVALUATE_EVERY = 10(7th-June)
57 | EVALUATE_EVERY = 100
58 | # EVALUATE_EVERY = 10
59 | # checkpoint interval (in steps)
60 | # CHECKOUTPOINT_EVERY = 1000
61 | # CHECKOUTPOINT_EVERY = 10000
62 | # CHECKOUTPOINT_EVERY = 1000(7th-June)
63 | CHECKOUTPOINT_EVERY = 1000
64 | # maximum sentence length (number of words)
65 | # MAX_DOCUMENT_LENGTH = 12
66 | # MAX_DOCUMENT_LENGTH = 8
67 | # MAX_DOCUMENT_LENGTH = 20(7th-June)
68 | MAX_DOCUMENT_LENGTH = 40
69 | # dev set percentage
70 | DEV_PERCENT = 10
71 |
72 | # Misc Parameters
73 | ALLOW_SOFT_PLACEMENT = True
74 | LOG_DEVICE_PLACEMENT = False
75 |
76 | print ('Training started......................')
77 | start_time = datetime.datetime.now()
78 |
79 | inpH = InputHelper()
80 | # convert the raw training file into a word-segmented training file (disabled)
81 | # inpH.train_file_preprocess(TRAINING_FILES_RAW, TRAINING_FILES_FORMAT)
82 | # sys.exit(0)
83 |
84 |
85 | train_set, dev_set, vocab_processor, sum_no_of_batches = inpH.getDataSets(TRAINING_FILES_RAW, MAX_DOCUMENT_LENGTH,
86 | DEV_PERCENT,
87 | BATCH_SIZE)
88 |
89 | # dev_batches = inpH.batch_iter(list(zip(dev_set[0], dev_set[1], dev_set[2])), BATCH_SIZE, 1)
90 | # for index,dev_batch in enumerate(dev_batches):
91 | # print(index, dev_batch)
92 | # sys.exit(0)
93 |
94 | # for index, value in enumerate(dev_set[2]):
95 | # print(index, dev_set[0][index], dev_set[1][index], dev_set[2][index])
96 | # sys.exit(0)
97 |
98 | # for index, w in enumerate(vocab_processor.vocabulary_._mapping):
99 | # print('vocab-{}:{}'.format(index, w))
100 | # sys.exit(0)
101 |
102 | with tf.Graph().as_default():
103 | session_conf = tf.ConfigProto(
104 | allow_soft_placement=ALLOW_SOFT_PLACEMENT,
105 | log_device_placement=LOG_DEVICE_PLACEMENT)
106 | sess = tf.Session(config=session_conf)
107 |
108 | with sess.as_default():
109 | siameseModel = SiameseLSTMw2v(
110 | sequence_length=MAX_DOCUMENT_LENGTH,
111 | vocab_size=len(vocab_processor.vocabulary_),
112 | embedding_size=EMBEDDING_DIM,
113 | hidden_units=HIDDEN_UNITS,
114 | l2_reg_lambda=L2_REG_LAMBDA,
115 | batch_size=BATCH_SIZE,
116 | trainableEmbeddings=False
117 | )
118 | # Define Training procedure
119 | global_step = tf.Variable(0, name="global_step", trainable=False)
120 | optimizer = tf.train.AdamOptimizer(1e-3)
121 |
122 | grads_and_vars = optimizer.compute_gradients(siameseModel.loss)
123 | tr_op_set = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
124 | print("defined training_ops")
125 | # Keep track of gradient values and sparsity (optional)
126 | grad_summaries = []
127 | for g, v in grads_and_vars:
128 | if g is not None:
129 | grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
130 | sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
131 | grad_summaries.append(grad_hist_summary)
132 | grad_summaries.append(sparsity_summary)
133 | grad_summaries_merged = tf.summary.merge(grad_summaries)
134 | print("defined gradient summaries")
135 | # Output directory for models and summaries
136 | timestamp = str(int(time.time()))
137 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
138 | print("Writing to {}\n".format(out_dir))
139 |
140 | # Summaries for loss and accuracy
141 | loss_summary = tf.summary.scalar("loss", siameseModel.loss)
142 | acc_summary = tf.summary.scalar("accuracy", siameseModel.accuracy)
143 | f1_summary = tf.summary.scalar('f1', siameseModel.f1)
144 |
145 | # Train Summaries
146 | train_summary_op = tf.summary.merge([loss_summary, acc_summary, f1_summary, grad_summaries_merged])
147 | train_summary_dir = os.path.join(out_dir, "summaries", "train")
148 | train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
149 |
150 | # Dev summaries
151 | dev_summary_op = tf.summary.merge([loss_summary, acc_summary, f1_summary])
152 | dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
153 | dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)
154 |
155 | # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it
156 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
157 | checkpoint_prefix = os.path.join(checkpoint_dir, "model")
158 | if not os.path.exists(checkpoint_dir):
159 | os.makedirs(checkpoint_dir)
160 | saver = tf.train.Saver(tf.global_variables(), max_to_keep=100)
161 |
162 | # Write vocabulary
163 | vocab_processor.save(os.path.join(checkpoint_dir, "vocab"))
164 |
165 | # Initialize all variables
166 | sess.run(tf.global_variables_initializer())
167 |
168 | print("init all variables")
169 | graph_def = tf.get_default_graph().as_graph_def()
170 | graphpb_txt = str(graph_def)
171 | with open(os.path.join(checkpoint_dir, "graphpb.txt"), 'w') as f:
172 | f.write(graphpb_txt)
173 |
174 | # load the pre-trained word2vec model
175 | inpH.loadW2V(WORD2VEC_MODEL, WORD2VEC_FORMAT)
176 | # initial matrix with random uniform
177 | # initW = np.random.uniform(-0.25, 0.25, (len(vocab_processor.vocabulary_), EMBEDDING_DIM))
178 | initW = np.random.uniform(0, 0, (len(vocab_processor.vocabulary_), EMBEDDING_DIM))
179 | # print(initW)
180 | # sys.exit(0)
181 |
182 | # load any vectors from the word2vec
183 | print("initializing initW with pre-trained word2vec embeddings")
184 | for index, w in enumerate(vocab_processor.vocabulary_._mapping):
185 | # print('vocab-{}:{}'.format(index, w))
186 |
187 | arr = []
188 | if w in inpH.pre_emb:
189 | arr = inpH.pre_emb[w]
190 | # print('=====arr-{},{}'.format(index, arr))
191 | idx = vocab_processor.vocabulary_.get(w)
192 | initW[idx] = np.asarray(arr).astype(np.float32)
193 |
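# Note: vocabulary words missing from the pre-trained word2vec model keep the all-zero rows
# created in initW above, and they stay fixed because the model is built with trainableEmbeddings=False.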
194 | # alternative (disabled): do not use pre-trained word vectors
195 | # arr=[]
196 | # idx = vocab_processor.vocabulary_.get(w)
197 | # arr.append(idx)
198 | # initW[idx] = np.asarray(arr).astype(np.float32)
199 |
200 | print("Done assigning intiW. len=" + str(len(initW)))
201 | # exit(0)
202 |
203 | # for idx, value in enumerate(initW):
204 | # print(idx, value)
205 | # sys.exit(0)
206 |
207 | inpH.deletePreEmb()
208 | gc.collect()
209 | sess.run(siameseModel.W.assign(initW))
210 |
211 |
212 | def train_step(x1_batch, x2_batch, y_batch):
213 | """
214 | A single training step
215 | """
216 | # for index, sentence in enumerate(x1_batch):
217 | # word_list1=[]
218 | # word_list2=[]
219 | # y=y_batch[index]
220 | # for idx in x1_batch[index]:
221 | # word_list1.append(vocab_processor.vocabulary_.reverse(idx))
222 | # for idx in x2_batch[index]:
223 | # word_list2.append(vocab_processor.vocabulary_.reverse(idx))
224 | #
225 | # # print(''.join(word_list1),'\t',''.join(word_list2),'\t',y)
226 | # print('==========={}=============='.format(index))
227 | # print(''.join(word_list1))
228 | # print (''.join(word_list2))
229 | # print(y)
230 | # sys.exit(0)
231 |
232 | feed_dict = {
233 | siameseModel.input_x1: x1_batch,
234 | siameseModel.input_x2: x2_batch,
235 | siameseModel.input_y: y_batch,
236 | siameseModel.dropout_keep_prob: DROPOUT_KEEP_PROB,
237 | }
238 | _, step, loss, accuracy, f1, dist, sim, summaries = sess.run(
239 | [tr_op_set, global_step, siameseModel.loss, siameseModel.accuracy, siameseModel.f1, siameseModel.distance,
240 | siameseModel.temp_sim, train_summary_op], feed_dict)
241 | time_str = datetime.datetime.now().isoformat()
242 | print("TRAIN {}: step {}, loss {:g}, acc {:g}, f1 {:g}".format(time_str, step, loss, accuracy, f1))
243 | train_summary_writer.add_summary(summaries, step)
244 | print(y_batch, dist, sim)
245 |
246 |
247 | def dev_step(x1_batch, x2_batch, y_batch):
248 | """
249 | A single evaluation step on the dev set
250 | """
251 | # for index, sentence in enumerate(x1_batch):
252 | # word_list1=[]
253 | # word_list2=[]
254 | # y=y_batch[index]
255 | # for idx in x1_batch[index]:
256 | # word_list1.append(vocab_processor.vocabulary_.reverse(idx))
257 | # for idx in x2_batch[index]:
258 | # word_list2.append(vocab_processor.vocabulary_.reverse(idx))
259 | #
260 | # # print(''.join(word_list1),'\t',''.join(word_list2),'\t',y)
261 | # print('==========={}=============='.format(index))
262 | # print(''.join(word_list1))
263 | # print (''.join(word_list2))
264 | # print(y)
265 | # sys.exit(0)
266 |
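# Note: unlike train_step, the feed below passes x2_batch to input_x1 and x1_batch to
# input_x2, so dev pairs are scored with the two sentences swapped.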
267 | feed_dict = {
268 | siameseModel.input_x1: x2_batch,
269 | siameseModel.input_x2: x1_batch,
270 | siameseModel.input_y: y_batch,
271 | siameseModel.dropout_keep_prob: 1.0,
272 | }
273 | step, loss, accuracy, f1, sim, summaries = sess.run(
274 | [global_step, siameseModel.loss, siameseModel.accuracy, siameseModel.f1, siameseModel.temp_sim,
275 | dev_summary_op], feed_dict)
276 | time_str = datetime.datetime.now().isoformat()
277 | print("DEV {}: step {}, loss {:g}, acc {:g}, f1 {:g}".format(time_str, step, loss, accuracy, f1))
278 | dev_summary_writer.add_summary(summaries, step)
279 | print (y_batch, sim)
280 | return accuracy
281 |
282 |
283 | ##################
284 | # sys.exit(0)
285 |
286 | # Generate batches
287 | batches = inpH.batch_iter(
288 | list(zip(train_set[0], train_set[1], train_set[2])), BATCH_SIZE, NUM_EPOCHS)
289 |
290 | ptr = 0
291 | max_validation_acc = 0.0
292 | for nn in xrange(sum_no_of_batches * NUM_EPOCHS):
293 | batch = batches.next()
294 | if len(batch) < 1:
295 | continue
296 | x1_batch, x2_batch, y_batch = zip(*batch)
297 | if len(y_batch) < 1:
298 | continue
299 | train_step(x1_batch, x2_batch, y_batch)
300 | current_step = tf.train.global_step(sess, global_step)
301 | sum_acc = 0.0
302 | cnt = 0
303 | if current_step % EVALUATE_EVERY == 0:
304 | print("\nEvaluation:")
305 | dev_batches = inpH.batch_iter(list(zip(dev_set[0], dev_set[1], dev_set[2])), BATCH_SIZE, 1)
306 | for db in dev_batches:
307 | if len(db) < 1:
308 | continue
309 | x1_dev_b, x2_dev_b, y_dev_b = zip(*db)
310 | if len(y_dev_b) < 1:
311 | continue
312 | acc = dev_step(x1_dev_b, x2_dev_b, y_dev_b)
313 | sum_acc = sum_acc + acc
314 | cnt += 1
315 |
316 | sum_acc /= cnt
317 | print("sum_acc= {}".format(sum_acc))
318 | if current_step % CHECKOUTPOINT_EVERY == 0:
319 | if sum_acc >= max_validation_acc:
320 | max_validation_acc = sum_acc
321 |
322 | # temporary logic
323 | saver.save(sess, checkpoint_prefix, global_step=current_step)
324 | tf.train.write_graph(sess.graph.as_graph_def(), checkpoint_prefix, "graph" + str(nn) + ".pb",
325 | as_text=False)
326 | print("Saved model {} with sum_accuracy={} checkpoint to {}\n".format(nn, max_validation_acc,
327 | checkpoint_prefix))
328 |
329 | print('max_validation_acc(each batch)= {}'.format(max_validation_acc))
330 |
331 | end_time = datetime.datetime.now()
332 | train_duration = end_time - start_time
333 | print('Training start time: {}'.format(start_time))
334 | print('Training end time: {}'.format(end_time))
335 | print('Training finished, total training time: {}'.format(train_duration))
336 |
--------------------------------------------------------------------------------