├── .DS_Store
├── .gitignore
├── ATEC2018.md
├── LICENSE
├── README.md
├── dict.txt
├── eddy-20180701-4k.tar.gz
├── eval.py
├── input_helpers.py
├── preliminary_contest
│   ├── atec_nlp_sim_train.csv
│   ├── dict.txt
│   ├── eval.py
│   ├── input_helpers.py
│   ├── models
│   │   ├── model-4000.data-00000-of-00001
│   │   ├── model-4000.index
│   │   └── model-4000.meta
│   ├── preprocess.py
│   ├── run.sh
│   └── vocab
│       └── vocab
├── preprocess.py
├── siamese_network_semantic.py
├── test.py
├── train.py
├── train_data
│   └── atec_nlp_sim_train.csv
└── validation.txt0

--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/.DS_Store

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Created by .ignore support plugin (hsz.mobi)
2 |
3 | .idea/
4 | *.pyc
5 | runs/

--------------------------------------------------------------------------------
/ATEC2018.md:
--------------------------------------------------------------------------------
1 | # 1 Task description
2 |
3 | Question similarity computation: given two sentences written by a user to customer service, use an algorithm to decide whether they express the same meaning.
4 |
5 | Examples:
6 |
7 | (a) "花呗如何还款" -- "花呗怎么还款": synonymous
8 | (b) "花呗如何还款" -- "我怎么还我的花被呢": synonymous
9 | (c) "花呗分期后逾期了如何还款" -- "花呗分期后逾期了哪里还款": not synonymous
10 | Example (a) can be judged synonymous with fairly simple methods. Example (b) involves a typo, synonyms and a change of word order; the two sentences do not look alike at first glance, so judging them correctly is challenging. In example (c) the two sentences are very similar, and the single small difference between "如何" (how) and "哪里" (where) makes their meanings differ.
11 |
12 | # 2 Data
13 |
14 | All contest data comes from real application scenarios of Ant Financial's Financial Brain. The contest has a preliminary round and a final round:
15 |
16 | Preliminary round
17 |
18 | We provide 100,000 labelled sentence pairs (released in batches) as downloadable training data, containing both synonymous and non-synonymous pairs. Each line of the dataset is one sample, in the format:
19 |
20 | line number\tsentence 1\tsentence 2\tlabel, for example: 1	花呗如何还款	花呗怎么还款	1
21 |
22 | The line number gives the row of the question pair within the training set;
23 | sentence 1 and sentence 2 are the two sentences of the question pair;
24 | the label marks the pair as synonymous (1) or not synonymous (0).
25 | The evaluation set contains 10,000 pairs. To keep the contest fair and to prevent leaderboard manipulation, it is not released; participants submit evaluation code and models, which are run to produce predictions and rankings. Its format is:
26 |
27 | line number\tsentence 1\tsentence 2
28 |
29 | ## Preliminary round
30 | The evaluation set sits at a fixed path on the evaluation system, and the official platform invokes the evaluation tool submitted by the participant.
31 |
32 | ## Final round
33 |
34 | The training set grows to a massive scale. Data in this stage is not downloadable; it is provided as tables on Ant Financial's 数巢 platform. As in the preliminary round, the dataset has four fields: line number, sentence 1, sentence 2 and label.
35 |
36 | The evaluation set is again 10,000 pairs, also provided as a table on the 数巢 platform, with three fields: line number, sentence 1 and sentence 2.
37 |
38 | # 3 Evaluation and metrics
39 |
40 | ## Preliminary round
41 | Participants train and tune their models locally, package the evaluation code and model, and submit the archive to the official evaluation system, which runs the predictions and updates the ranking. The evaluation system is a standard Linux environment with Python 2.7, Java 8, TensorFlow 1.5 and jieba 0.39 installed. After the submitted archive is unpacked, its top-level directory must contain a run.sh script that takes the evaluation file as input and writes the results as output (only 0 or 1), one "line number\tprediction" per line, invoked as:
42 |
43 | bash run.sh INPUT_PATH OUTPUT_PATH
44 |
45 | If the prediction file is empty or has the wrong number of lines, the score is 0.
46 |
47 |
48 |
49 | ## Final round
50 | Model training, tuning and prediction all happen on Ant Financial's machine-learning platform, so participants only need to provide a UDF that takes the two sentences of a question pair as input and outputs the similarity prediction (0 or 1). As before, an empty output terminates the evaluation with a score of 0.
51 |
52 |
53 |
54 | The task is scored with accuracy and F1-score by comparing predictions against the true labels. The relevant counts are defined as follows:
55 |
56 | True Positive (TP): the number of correct "synonymous" decisions;
57 |
58 | likewise, False Positive (FP): the number of incorrect "synonymous" decisions;
59 |
60 | True Negative (TN): the number of correct "not synonymous" decisions;
61 |
62 | False Negative (FN): the number of incorrect "not synonymous" decisions.
63 |
64 | From these we compute precision, recall, accuracy and F1-score:
65 |
66 | precision = TP / (TP + FP)
67 |
68 | recall = TP / (TP + FN)
69 |
70 | accuracy = (TP + TN) / (TP + FP + TN + FN)
71 |
72 | F1-score = 2 * precision * recall / (precision + recall)
73 |
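The four counts above fully determine the reported metrics. As a quick illustration (not part of the contest kit; the function and variable names below are ours), a small Python sketch that recomputes precision, recall, accuracy and F1 from 0/1 predictions and gold labels:

```python
# Illustrative only: recompute the contest metrics from 0/1 predictions and labels.
def contest_metrics(preds, labels):
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / float(len(labels)) if labels else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, accuracy, f1

# Two correct "synonymous" calls, one false positive, one true negative:
print(contest_metrics([1, 1, 1, 0], [1, 1, 0, 0]))  # precision 0.67, recall 1.0, accuracy 0.75, F1 0.8
```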
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2016 Dhwaj Raj
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
23 |

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Description
2 | Chinese sentence similarity computation based on a Siamese LSTM.
3 |
4 | # Environment
5 | * Ubuntu 16.04 (64-bit)
6 | * Anaconda 2-4.4.0 (Python 2.7)
7 |
8 | Package versions (older releases):
9 | * TensorFlow 1.5.1
10 | * numpy 1.14.3
11 | * gensim 3.4.0
12 | * (nltk 3.2.3)
13 | * jieba 0.39
14 | * a pre-trained Chinese word2vec model
15 |
16 | Reference link:
17 |
18 | # Usage
19 |
20 | ### Train the model
21 |     python train.py
22 |
23 | ### Evaluate the model
24 |     python eval.py
25 | ## Papers
26 | * [《Learning Text Similarity with Siamese Recurrent Networks》](http://www.aclweb.org/anthology/W16-16#page=162)
27 | * [《Siamese Recurrent Architectures for Learning Sentence Similarity》](http://www.mit.edu/~jonasm/info/MuellerThyagarajan_AAAI16.pdf)
28 |
29 | # Code reference
30 |
31 | * [dhwajraj/deep-siamese-text-similarity](https://github.com/dhwajraj/deep-siamese-text-similarity)
32 |
33 | Commit: a61f07f6bef76665f8ba2df12f34b25380016613
34 |
35 | # ATEC2018 task description
36 | Related link:
37 |

--------------------------------------------------------------------------------
/dict.txt:
--------------------------------------------------------------------------------
1 | 花呗
2 | 借呗
3 | 蚂蚁花呗
4 | 蚂蚁借呗
5 | 从新
6 | 支付宝
7 | 淘宝

--------------------------------------------------------------------------------
/eddy-20180701-4k.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/eddy-20180701-4k.tar.gz

--------------------------------------------------------------------------------
/eval.py:
--------------------------------------------------------------------------------
1 | #!
/usr/bin/env python 2 | # coding=utf-8 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | from input_helpers import InputHelper 9 | import sys 10 | 11 | # Parameters 12 | # ================================================== 13 | 14 | # Eval Parameters 15 | # 批大小 16 | BATCH_SIZE = 64 17 | # 验证集文件 18 | EVAL_FILEPATH = 'validation.txt0' 19 | # 词表(在训练过程中已生成) 20 | VOCAB_FILEPATH = 'runs/1528462228/checkpoints/vocab' 21 | # 模型文件 22 | MODEL = 'runs/1528462228/checkpoints/model-10000' 23 | 24 | # 语句最多长度(包含多少个词) 25 | MAX_DOCUMENT_LENGTH = 30 26 | 27 | # Misc Parameters 28 | ALLOW_SOFT_PLACEMENT = True 29 | LOG_DEVICE_PLACEMENT = False 30 | 31 | inpH = InputHelper() 32 | 33 | x1_test, x2_test, y_test = inpH.getTestDataSet(EVAL_FILEPATH, VOCAB_FILEPATH, MAX_DOCUMENT_LENGTH) 34 | 35 | # for index ,value in enumerate(x1_test): 36 | # print (index, x1_test[index], x2_test[index], y_test[index]) 37 | # sys.exit(0) 38 | 39 | print("\nEvaluating...\n") 40 | 41 | # Evaluation 42 | # ================================================== 43 | checkpoint_file = MODEL 44 | print checkpoint_file 45 | graph = tf.Graph() 46 | with graph.as_default(): 47 | session_conf = tf.ConfigProto( 48 | allow_soft_placement=ALLOW_SOFT_PLACEMENT, 49 | log_device_placement=LOG_DEVICE_PLACEMENT) 50 | sess = tf.Session(config=session_conf) 51 | with sess.as_default(): 52 | # Load the saved meta graph and restore variables 53 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 54 | sess.run(tf.initialize_all_variables()) 55 | saver.restore(sess, checkpoint_file) 56 | 57 | # Get the placeholders from the graph by name 58 | input_x1 = graph.get_operation_by_name("input_x1").outputs[0] 59 | input_x2 = graph.get_operation_by_name("input_x2").outputs[0] 60 | input_y = graph.get_operation_by_name("input_y").outputs[0] 61 | 62 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 63 | # Tensors we want to evaluate 64 | predictions = graph.get_operation_by_name("output/distance").outputs[0] 65 | 66 | accuracy = graph.get_operation_by_name("accuracy/accuracy").outputs[0] 67 | 68 | sim = graph.get_operation_by_name("accuracy/temp_sim").outputs[0] 69 | 70 | # emb = graph.get_operation_by_name("embedding/W").outputs[0] 71 | # embedded_chars = tf.nn.embedding_lookup(emb,input_x) 72 | # Generate batches for one epoch 73 | batches = inpH.batch_iter(list(zip(x1_test, x2_test, y_test)), 2 * BATCH_SIZE, 1, shuffle=False) 74 | # Collect the predictions here 75 | all_predictions = [] 76 | all_d = [] 77 | for db in batches: 78 | x1_dev_b, x2_dev_b, y_dev_b = zip(*db) 79 | batch_predictions, batch_acc, batch_sim = sess.run([predictions, accuracy, sim], 80 | {input_x1: x1_dev_b, input_x2: x2_dev_b, 81 | input_y: y_dev_b, dropout_keep_prob: 1.0}) 82 | all_predictions = np.concatenate([all_predictions, batch_predictions]) 83 | print(batch_predictions) 84 | all_d = np.concatenate([all_d, batch_sim]) 85 | print("DEV acc {}".format(batch_acc)) 86 | for ex in all_predictions: 87 | print ex 88 | correct_predictions = float(np.mean(all_d == y_test)) 89 | print("Accuracy: {:g}".format(correct_predictions)) 90 | -------------------------------------------------------------------------------- /input_helpers.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | import numpy as np 3 | import re 4 | import itertools 5 | from collections import Counter 6 | import numpy as np 7 | import time 8 | import gc 9 | from 
tensorflow.contrib import learn 10 | # from gensim.models.word2vec import Word2Vec 11 | import gensim 12 | import gzip 13 | from random import random 14 | from preprocess import MyVocabularyProcessor 15 | import sys 16 | import jieba 17 | 18 | reload(sys) 19 | sys.setdefaultencoding("utf-8") 20 | 21 | 22 | class InputHelper(object): 23 | pre_emb = dict() 24 | vocab_processor = None 25 | 26 | def loadW2V(self, emb_path, type="bin"): 27 | print("Loading W2V data...") 28 | num_keys = 0 29 | if type == "textgz": 30 | # this seems faster than gensim non-binary load 31 | for line in gzip.open(emb_path): 32 | l = line.strip().split() 33 | st = l[0].lower() 34 | self.pre_emb[st] = np.asarray(l[1:]) 35 | num_keys = len(self.pre_emb) 36 | if type == "text": 37 | # this seems faster than gensim non-binary load 38 | for line in open(emb_path): 39 | l = line.strip().split() 40 | st = l[0].lower() 41 | self.pre_emb[st] = np.asarray(l[1:]) 42 | num_keys = len(self.pre_emb) 43 | else: 44 | # self.pre_emb = Word2Vec.load_word2vec_format(emb_path,binary=True) 45 | self.pre_emb = gensim.models.KeyedVectors.load_word2vec_format(emb_path, binary=True) # eddy 46 | self.pre_emb.init_sims(replace=True) 47 | num_keys = len(self.pre_emb.vocab) 48 | print("loaded word2vec len ", num_keys) 49 | gc.collect() 50 | 51 | def deletePreEmb(self): 52 | self.pre_emb = dict() 53 | gc.collect() 54 | 55 | def getTsvData(self, filepath): 56 | print("Loading training data from " + filepath) 57 | x1 = [] 58 | x2 = [] 59 | y = [] 60 | num_p = 0 61 | num_n = 0 62 | # positive samples from file 63 | for line in open(filepath): 64 | # print(line) 65 | l = line.strip().split("\t") 66 | 67 | # print(l[0]) 68 | # print(l[1]) 69 | # print(l[2]) 70 | if len(l) >= 4: 71 | x1.append(l[1]) 72 | x2.append(l[2]) 73 | y.append(int(l[3])) 74 | 75 | flag = int(l[3]) 76 | if flag > 0: 77 | num_p += 1 78 | else: 79 | num_n += 1 80 | 81 | tmp_x1 = [] 82 | tmp_x2 = [] 83 | tmp_y = [] 84 | 85 | # # 欠采样处理 86 | # for idx, item in enumerate(y): 87 | # if item[1] == 1: 88 | # tmp_x1.append(x1[idx]) 89 | # tmp_x2.append(x2[idx]) 90 | # tmp_y.append(y[idx]) 91 | # elif num_p >= 0: 92 | # tmp_x1.append(x1[idx]) 93 | # tmp_x2.append(x2[idx]) 94 | # tmp_y.append(y[idx]) 95 | # num_p -= 1 96 | # x1 = tmp_x1 97 | # x2 = tmp_x2 98 | # y = tmp_y 99 | 100 | # 过采样处理 101 | add_p_num = num_n - num_p 102 | while add_p_num > 0: 103 | for idx, item in enumerate(y): 104 | if item == 1: 105 | tmp_x1.append(x1[idx]) 106 | tmp_x2.append(x2[idx]) 107 | tmp_y.append(y[idx]) 108 | add_p_num -= 1 109 | if add_p_num <= 0: 110 | break 111 | 112 | print('len(x1)={}, len(x2)={}, len(y)={}'.format(len(x1), len(x2), len(y))) 113 | 114 | x1 += tmp_x1 115 | x2 += tmp_x2 116 | y += tmp_y 117 | 118 | print('len(x1)={}, len(x2)={}, len(y)={}'.format(len(x1), len(x2), len(y))) 119 | 120 | # num_p=0 121 | # for item in y: 122 | # if item[1]==1: 123 | # num_p+=1 124 | # 125 | # print('num_p= {}'.format(num_p)) 126 | # exit(0) 127 | 128 | # print ('num_p= {}'.format(num_p)) 129 | # print('num_n= {}'.format(num_n)) 130 | # exit(0) 131 | 132 | return np.asarray(x1), np.asarray(x2), np.asarray(y) 133 | 134 | def getTsvTestData(self, filepath): 135 | print("Loading testing/labelled data from " + filepath) 136 | x1 = [] 137 | x2 = [] 138 | y = [] 139 | # positive samples from file 140 | for line in open(filepath): 141 | l = line.strip().split("\t") 142 | if len(l) < 3: 143 | continue 144 | x1.append(l[1]) 145 | x2.append(l[2]) 146 | y.append(int(l[0])) 147 | return np.asarray(x1), np.asarray(x2), 
np.asarray(y) 148 | 149 | def batch_iter(self, data, batch_size, num_epochs, shuffle=True): 150 | """ 151 | Generates a batch iterator for a dataset. 152 | """ 153 | data = np.asarray(data) 154 | # print(data) 155 | # print(data.shape) 156 | data_size = len(data) 157 | num_batches_per_epoch = int(len(data) / batch_size) + 1 158 | for epoch in range(num_epochs): 159 | # Shuffle the data at each epoch 160 | if shuffle: 161 | shuffle_indices = np.random.permutation(np.arange(data_size)) 162 | shuffled_data = data[shuffle_indices] 163 | else: 164 | shuffled_data = data 165 | for batch_num in range(num_batches_per_epoch): 166 | start_index = batch_num * batch_size 167 | end_index = min((batch_num + 1) * batch_size, data_size) 168 | yield shuffled_data[start_index:end_index] 169 | 170 | def dumpValidation(self, x1_text, x2_text, y, shuffled_index, dev_idx, i): 171 | print("dumping validation " + str(i)) 172 | x1_shuffled = x1_text[shuffled_index] 173 | x2_shuffled = x2_text[shuffled_index] 174 | y_shuffled = y[shuffled_index] 175 | x1_dev = x1_shuffled[dev_idx:] 176 | x2_dev = x2_shuffled[dev_idx:] 177 | y_dev = y_shuffled[dev_idx:] 178 | del x1_shuffled 179 | del y_shuffled 180 | with open('validation.txt' + str(i), 'w') as f: 181 | for text1, text2, label in zip(x1_dev, x2_dev, y_dev): 182 | f.write(str(label) + "\t" + text1 + "\t" + text2 + "\n") 183 | f.close() 184 | del x1_dev 185 | del y_dev 186 | 187 | # Data Preparatopn 188 | # ================================================== 189 | 190 | def getDataSets(self, training_paths, max_document_length, percent_dev, batch_size): 191 | x1_text, x2_text, y = self.getTsvData(training_paths) 192 | # print('x1_text= {}'.format(x1_text)) 193 | # print('x2_text= {}'.format(x2_text)) 194 | # print ('y= {}'.format(y)) 195 | 196 | # Build vocabulary 197 | print("Building vocabulary") 198 | vocab_processor = MyVocabularyProcessor(max_document_length, min_frequency=0) 199 | vocab_processor.fit_transform(np.concatenate((x2_text, x1_text), axis=0)) 200 | print("Length of loaded vocabulary ={}".format(len(vocab_processor.vocabulary_))) 201 | 202 | sum_no_of_batches = 0 203 | x1 = np.asarray(list(vocab_processor.transform(x1_text))) 204 | x2 = np.asarray(list(vocab_processor.transform(x2_text))) 205 | # Randomly shuffle data 206 | np.random.seed(131) 207 | shuffle_indices = np.random.permutation(np.arange(len(y))) 208 | x1_shuffled = x1[shuffle_indices] 209 | x2_shuffled = x2[shuffle_indices] 210 | y_shuffled = y[shuffle_indices] 211 | dev_idx = -1 * len(y_shuffled) * percent_dev // 100 212 | print('dev_idx= {}'.format(dev_idx)) 213 | 214 | del x1 215 | del x2 216 | # Split train/test set 217 | self.dumpValidation(x1_text, x2_text, y, shuffle_indices, dev_idx, 0) 218 | # TODO: This is very crude, should use cross-validation 219 | x1_train, x1_dev = x1_shuffled[:dev_idx], x1_shuffled[dev_idx:] 220 | x2_train, x2_dev = x2_shuffled[:dev_idx], x2_shuffled[dev_idx:] 221 | y_train, y_dev = y_shuffled[:dev_idx], y_shuffled[dev_idx:] 222 | print("Train/Dev split for {}: {:d}/{:d}".format(training_paths, len(y_train), len(y_dev))) 223 | sum_no_of_batches = sum_no_of_batches + (len(y_train) // batch_size) 224 | train_set = (x1_train, x2_train, y_train) 225 | dev_set = (x1_dev, x2_dev, y_dev) 226 | gc.collect() 227 | return train_set, dev_set, vocab_processor, sum_no_of_batches 228 | 229 | def getTestDataSet(self, data_path, vocab_path, max_document_length): 230 | x1_temp, x2_temp, y = self.getTsvTestData(data_path) 231 | 232 | # Build vocabulary 233 | vocab_processor = 
MyVocabularyProcessor(max_document_length, min_frequency=0) 234 | vocab_processor = vocab_processor.restore(vocab_path) 235 | print len(vocab_processor.vocabulary_) 236 | 237 | x1 = np.asarray(list(vocab_processor.transform(x1_temp))) 238 | x2 = np.asarray(list(vocab_processor.transform(x2_temp))) 239 | # Randomly shuffle data 240 | del vocab_processor 241 | gc.collect() 242 | return x1, x2, y 243 | -------------------------------------------------------------------------------- /preliminary_contest/dict.txt: -------------------------------------------------------------------------------- 1 | 花呗 2 | 借呗 3 | 蚂蚁花呗 4 | 蚂蚁借呗 -------------------------------------------------------------------------------- /preliminary_contest/eval.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # coding=utf-8 3 | import tensorflow as tf 4 | import numpy as np 5 | import os 6 | import time 7 | import datetime 8 | from tensorflow.contrib import learn 9 | from input_helpers import InputHelper 10 | import sys 11 | 12 | # Parameters 13 | # ================================================== 14 | EVAL_FILE = sys.argv[1] # 待评估文件 15 | OUTPUT_FILE = sys.argv[2] # 评估后输出文件 16 | 17 | print (EVAL_FILE) 18 | print (OUTPUT_FILE) 19 | 20 | # Eval Parameters 21 | BATCH_SIZE = 64 # 批大小 22 | VOCAB_FILE = './vocab/vocab' # 训练使使用的词表 23 | MODEL = './models/model-4000' # 加载训练模型 24 | ALLOW_SOFT_PLACEMENT = True 25 | LOG_DEVICE_PLACEMENT = False 26 | 27 | # 语句最多长度(包含多少个词) 28 | MAX_DOCUMENT_LENGTH = 40 29 | 30 | # load data and map id-transform based on training time vocabulary 31 | inpH = InputHelper() 32 | x1_test, x2_test = inpH.getTestDataSet(EVAL_FILE, VOCAB_FILE, MAX_DOCUMENT_LENGTH) 33 | 34 | # for index, _ in enumerate(x1_test): 35 | # print(index, x1_test[index], x2_test[index]) 36 | 37 | print("\nEvaluating...\n") 38 | 39 | # Evaluation 40 | # ================================================== 41 | checkpoint_file = MODEL 42 | print checkpoint_file 43 | graph = tf.Graph() 44 | with graph.as_default(): 45 | session_conf = tf.ConfigProto( 46 | allow_soft_placement=ALLOW_SOFT_PLACEMENT, 47 | log_device_placement=LOG_DEVICE_PLACEMENT) 48 | sess = tf.Session(config=session_conf) 49 | with sess.as_default(): 50 | # Load the saved meta graph and restore variables 51 | saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file)) 52 | sess.run(tf.initialize_all_variables()) 53 | saver.restore(sess, checkpoint_file) 54 | 55 | # Get the placeholders from the graph by name 56 | input_x1 = graph.get_operation_by_name("input_x1").outputs[0] 57 | input_x2 = graph.get_operation_by_name("input_x2").outputs[0] 58 | # input_y = graph.get_operation_by_name("input_y").outputs[0] 59 | 60 | dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0] 61 | # Tensors we want to evaluate 62 | predictions = graph.get_operation_by_name("output/distance").outputs[0] 63 | 64 | # accuracy = graph.get_operation_by_name("accuracy/accuracy").outputs[0] 65 | 66 | sim = graph.get_operation_by_name("accuracy/temp_sim").outputs[0] 67 | 68 | # emb = graph.get_operation_by_name("embedding/W").outputs[0] 69 | # embedded_chars = tf.nn.embedding_lookup(emb,input_x) 70 | # Generate batches for one epoch 71 | batches = inpH.batch_iter(list(zip(x1_test, x2_test)), 2 * BATCH_SIZE, 1, shuffle=False) 72 | # Collect the predictions here 73 | all_predictions = [] 74 | all_d = [] 75 | 76 | for db in batches: 77 | # print('db') 78 | # print(db) 79 | # 80 | x1_dev_b, x2_dev_b = zip(*db) 
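            # sess.run() below feeds one batch through the restored graph: `predictions` comes from
            # the output/distance tensor (normalized distance between the two sentence encodings) and
            # `sim` from accuracy/temp_sim, the thresholded 0/1 similarity decision.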
81 | batch_predictions, batch_sim = sess.run([predictions, sim], 82 | {input_x1: x1_dev_b, input_x2: x2_dev_b, dropout_keep_prob: 1.0}) 83 | all_predictions = np.concatenate([all_predictions, batch_predictions]) 84 | # print(batch_predictions) 85 | print(batch_sim) 86 | print(type(batch_sim)) 87 | print(len(batch_sim)) 88 | all_d = np.concatenate([all_d, batch_sim]) 89 | # print("DEV acc {}".format(batch_acc)) 90 | for ex in all_predictions: 91 | print ex 92 | 93 | f_output = open(OUTPUT_FILE, 'a') 94 | index = 1 95 | predic_value = 0 96 | for item in all_d: 97 | # 专门写反 98 | if item > 0: 99 | predic_value = 1 100 | else: 101 | predic_value = 0 102 | f_output.write('{}\t{}\n'.format(index, predic_value)) 103 | index += 1 104 | 105 | # correct_predictions = float(np.mean(all_d == y_test)) 106 | # print("Accuracy: {:g}".format(correct_predictions)) 107 | 108 | print ('eval finished!') 109 | -------------------------------------------------------------------------------- /preliminary_contest/input_helpers.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | import numpy as np 3 | import re 4 | import itertools 5 | from collections import Counter 6 | import numpy as np 7 | import time 8 | import gc 9 | from tensorflow.contrib import learn 10 | # from gensim.models.word2vec import Word2Vec 11 | import gensim 12 | import gzip 13 | from random import random 14 | from preprocess import MyVocabularyProcessor 15 | import sys 16 | import jieba 17 | 18 | reload(sys) 19 | sys.setdefaultencoding("utf-8") 20 | 21 | 22 | class InputHelper(object): 23 | pre_emb = dict() 24 | vocab_processor = None 25 | 26 | def getTsvTestData(self, filepath): 27 | print("Loading testing/labelled data from " + filepath) 28 | x1 = [] 29 | x2 = [] 30 | for line in open(filepath): 31 | l = line.strip().split("\t") 32 | x1.append(l[1]) 33 | x2.append(l[2]) 34 | return np.asarray(x1), np.asarray(x2) 35 | 36 | def batch_iter(self, data, batch_size, num_epochs, shuffle=True): 37 | """ 38 | Generates a batch iterator for a dataset. 
39 | """ 40 | data = np.asarray(data) 41 | print(data) 42 | print(data.shape) 43 | data_size = len(data) 44 | num_batches_per_epoch = int(len(data) / batch_size) + 1 45 | for epoch in range(num_epochs): 46 | # Shuffle the data at each epoch 47 | if shuffle: 48 | shuffle_indices = np.random.permutation(np.arange(data_size)) 49 | shuffled_data = data[shuffle_indices] 50 | else: 51 | shuffled_data = data 52 | for batch_num in range(num_batches_per_epoch): 53 | start_index = batch_num * batch_size 54 | end_index = min((batch_num + 1) * batch_size, data_size) 55 | yield shuffled_data[start_index:end_index] 56 | 57 | # Data Preparatopn 58 | # ================================================== 59 | 60 | def getTestDataSet(self, data_path, vocab_path, max_document_length): 61 | x1_temp, x2_temp = self.getTsvTestData(data_path) 62 | 63 | # Build vocabulary 64 | vocab_processor = MyVocabularyProcessor(max_document_length, min_frequency=0) 65 | vocab_processor = vocab_processor.restore(vocab_path) 66 | print ('len(vocab_processor.vocabulary_)', len(vocab_processor.vocabulary_)) 67 | # sys.exit(0) 68 | 69 | x1 = np.asarray(list(vocab_processor.transform(x1_temp))) 70 | x2 = np.asarray(list(vocab_processor.transform(x2_temp))) 71 | # Randomly shuffle data 72 | del vocab_processor 73 | gc.collect() 74 | return x1, x2 75 | -------------------------------------------------------------------------------- /preliminary_contest/models/model-4000.data-00000-of-00001: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/models/model-4000.data-00000-of-00001 -------------------------------------------------------------------------------- /preliminary_contest/models/model-4000.index: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/models/model-4000.index -------------------------------------------------------------------------------- /preliminary_contest/models/model-4000.meta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/models/model-4000.meta -------------------------------------------------------------------------------- /preliminary_contest/preprocess.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | from __future__ import absolute_import 3 | from __future__ import division 4 | from __future__ import print_function 5 | 6 | import re 7 | import numpy as np 8 | import six 9 | from tensorflow.contrib import learn 10 | from tensorflow.python.platform import gfile 11 | from tensorflow.contrib import learn # pylint: disable=g-bad-import-order 12 | import jieba 13 | 14 | 15 | def tokenizer_word(iterator): 16 | jieba.load_userdict('./dict.txt') 17 | for sentence in iterator: 18 | yield list(jieba.lcut(sentence)) 19 | 20 | 21 | class MyVocabularyProcessor(learn.preprocessing.VocabularyProcessor): 22 | def __init__(self, 23 | max_document_length, 24 | min_frequency=0, 25 | vocabulary=None): 26 | 27 | tokenizer_fn = tokenizer_word 28 | self.sup = super(MyVocabularyProcessor, self) 29 | self.sup.__init__(max_document_length, min_frequency, vocabulary, tokenizer_fn) 30 | 31 | 
def transform(self, raw_documents): 32 | """Transform documents to word-id matrix. 33 | Convert words to ids with vocabulary fitted with fit or the one 34 | provided in the constructor. 35 | Args: 36 | raw_documents: An iterable which yield either str or unicode. 37 | Yields: 38 | x: iterable, [n_samples, max_document_length]. Word-id matrix. 39 | """ 40 | # print('len(raw_documents)= {}'.format(len(raw_documents))) 41 | # print('raw_documents= {}'.format(raw_documents)) 42 | 43 | # for index,value in enumerate(raw_documents): 44 | # print(index, value) 45 | 46 | for tokens in self._tokenizer(raw_documents): 47 | word_ids = np.zeros(self.max_document_length, np.int64) 48 | for idx, token in enumerate(tokens): 49 | if idx >= self.max_document_length: 50 | break 51 | word_ids[idx] = self.vocabulary_.get(token) 52 | yield word_ids 53 | -------------------------------------------------------------------------------- /preliminary_contest/run.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | python eval.py $1 $2 -------------------------------------------------------------------------------- /preliminary_contest/vocab/vocab: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ATEC2018/deep-siamese-text-similarity/ef6df89a69683bf31c9370eb8728e4d6deabfcec/preliminary_contest/vocab/vocab -------------------------------------------------------------------------------- /preprocess.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | from __future__ import absolute_import 3 | from __future__ import division 4 | from __future__ import print_function 5 | 6 | import re 7 | import numpy as np 8 | import six 9 | from tensorflow.contrib import learn 10 | from tensorflow.python.platform import gfile 11 | from tensorflow.contrib import learn # pylint: disable=g-bad-import-order 12 | import jieba 13 | 14 | 15 | def tokenizer_word(iterator): 16 | jieba.load_userdict('./dict.txt') 17 | for sentence in iterator: 18 | sentence = sentence.decode("utf8") 19 | sentence = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。:??、~@#¥%……&*()]+".decode("utf8"), "".decode("utf8"), 20 | sentence) 21 | yield list(jieba.lcut(sentence)) 22 | 23 | 24 | class MyVocabularyProcessor(learn.preprocessing.VocabularyProcessor): 25 | def __init__(self, 26 | max_document_length, 27 | min_frequency=0, 28 | vocabulary=None): 29 | 30 | tokenizer_fn = tokenizer_word 31 | self.sup = super(MyVocabularyProcessor, self) 32 | self.sup.__init__(max_document_length, min_frequency, vocabulary, tokenizer_fn) 33 | 34 | def transform(self, raw_documents): 35 | """Transform documents to word-id matrix. 36 | Convert words to ids with vocabulary fitted with fit or the one 37 | provided in the constructor. 38 | Args: 39 | raw_documents: An iterable which yield either str or unicode. 40 | Yields: 41 | x: iterable, [n_samples, max_document_length]. Word-id matrix. 
42 | """ 43 | # print('len(raw_documents)= {}'.format(len(raw_documents))) 44 | # print('raw_documents= {}'.format(raw_documents)) 45 | 46 | # for index,value in enumerate(raw_documents): 47 | # print(index, value) 48 | 49 | for tokens in self._tokenizer(raw_documents): 50 | word_ids = np.zeros(self.max_document_length, np.int64) 51 | for idx, token in enumerate(tokens): 52 | if idx >= self.max_document_length: 53 | break 54 | word_ids[idx] = self.vocabulary_.get(token) 55 | yield word_ids 56 | -------------------------------------------------------------------------------- /siamese_network_semantic.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | import tensorflow as tf 3 | import tensorflow.contrib.slim as slim 4 | import numpy as np 5 | 6 | 7 | class SiameseLSTMw2v(object): 8 | """ 9 | A LSTM based deep Siamese network for text similarity. 10 | Uses an word embedding layer (looks up in pre-trained w2v), followed by a biLSTM and Energy Loss layer. 11 | """ 12 | 13 | def stackedRNN(self, x, dropout, scope, embedding_size, sequence_length, hidden_units): 14 | n_hidden = hidden_units 15 | n_layers = 3 16 | # n_layers = 6 17 | # Prepare data shape to match `static_rnn` function requirements 18 | x = tf.unstack(tf.transpose(x, perm=[1, 0, 2])) 19 | # print(x) 20 | # Define lstm cells with tensorflow 21 | # Forward direction cell 22 | 23 | with tf.name_scope("fw" + scope), tf.variable_scope("fw" + scope): 24 | stacked_rnn_fw = [] 25 | for _ in range(n_layers): 26 | fw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0, state_is_tuple=True) 27 | lstm_fw_cell = tf.contrib.rnn.DropoutWrapper(fw_cell, output_keep_prob=dropout) 28 | stacked_rnn_fw.append(lstm_fw_cell) 29 | lstm_fw_cell_m = tf.nn.rnn_cell.MultiRNNCell(cells=stacked_rnn_fw, state_is_tuple=True) 30 | 31 | outputs, _ = tf.nn.static_rnn(lstm_fw_cell_m, x, dtype=tf.float32) 32 | return outputs[-1] 33 | 34 | def contrastive_loss(self, y, d, batch_size): 35 | tmp = y * tf.square(d) 36 | # tmp= tf.mul(y,tf.square(d)) 37 | tmp2 = (1 - y) * tf.square(tf.maximum((1 - d), 0)) 38 | reg = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(1e-4), tf.trainable_variables()) 39 | return tf.reduce_sum(tmp + tmp2) / batch_size / 2+reg 40 | 41 | def __init__( 42 | self, sequence_length, vocab_size, embedding_size, hidden_units, l2_reg_lambda, batch_size, 43 | trainableEmbeddings): 44 | # Placeholders for input, output and dropout 45 | self.input_x1 = tf.placeholder(tf.int32, [None, sequence_length], name="input_x1") 46 | self.input_x2 = tf.placeholder(tf.int32, [None, sequence_length], name="input_x2") 47 | self.input_y = tf.placeholder(tf.float32, [None], name="input_y") 48 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 49 | 50 | # Keeping track of l2 regularization loss (optional) 51 | l2_loss = tf.constant(0.0, name="l2_loss") 52 | 53 | # Embedding layer 54 | with tf.name_scope("embedding"): 55 | self.W = tf.Variable( 56 | tf.constant(0.0, shape=[vocab_size, embedding_size]), 57 | trainable=trainableEmbeddings, name="W") 58 | self.embedded_words1 = tf.nn.embedding_lookup(self.W, self.input_x1) 59 | self.embedded_words2 = tf.nn.embedding_lookup(self.W, self.input_x2) 60 | # print self.embedded_words1 61 | # Create a convolution + maxpool layer for each filter size 62 | with tf.name_scope("output"): 63 | self.out1 = self.stackedRNN(self.embedded_words1, self.dropout_keep_prob, "side1", embedding_size, 64 | sequence_length, 
hidden_units) 65 | self.out2 = self.stackedRNN(self.embedded_words2, self.dropout_keep_prob, "side2", embedding_size, 66 | sequence_length, hidden_units) 67 | self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.out1, self.out2)), 1, keep_dims=True)) 68 | self.distance = tf.div(self.distance, 69 | tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.out1), 1, keep_dims=True)), 70 | tf.sqrt(tf.reduce_sum(tf.square(self.out2), 1, keep_dims=True)))) 71 | self.distance = tf.reshape(self.distance, [-1], name="distance") 72 | with tf.name_scope("loss"): 73 | self.loss = self.contrastive_loss(self.input_y, self.distance, batch_size) 74 | #### Accuracy computation is outside of this class. 75 | with tf.name_scope("accuracy"): 76 | self.temp_sim = tf.subtract(tf.ones_like(self.distance), tf.rint(self.distance), 77 | name="temp_sim") # auto threshold 0.5 78 | correct_predictions = tf.equal(self.temp_sim, self.input_y) 79 | self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy") 80 | 81 | with tf.name_scope('f1'): 82 | ones_like_actuals = tf.ones_like(self.input_y) 83 | zeros_like_actuals = tf.zeros_like(self.input_y) 84 | ones_like_predictions = tf.ones_like(self.temp_sim) 85 | zeros_like_predictions = tf.zeros_like(self.temp_sim) 86 | 87 | tp = tf.reduce_sum( 88 | tf.cast( 89 | tf.logical_and( 90 | tf.equal(self.input_y, ones_like_actuals), 91 | tf.equal(self.temp_sim, ones_like_predictions) 92 | ), 93 | 'float' 94 | ) 95 | ) 96 | 97 | tn = tf.reduce_sum( 98 | tf.cast( 99 | tf.logical_and( 100 | tf.equal(self.input_y, zeros_like_actuals), 101 | tf.equal(self.temp_sim, zeros_like_predictions) 102 | ), 103 | 'float' 104 | ) 105 | ) 106 | 107 | fp = tf.reduce_sum( 108 | tf.cast( 109 | tf.logical_and( 110 | tf.equal(self.input_y, zeros_like_actuals), 111 | tf.equal(self.temp_sim, ones_like_predictions) 112 | ), 113 | 'float' 114 | ) 115 | ) 116 | 117 | fn = tf.reduce_sum( 118 | tf.cast( 119 | tf.logical_and( 120 | tf.equal(self.input_y, ones_like_actuals), 121 | tf.equal(self.temp_sim, zeros_like_predictions) 122 | ), 123 | 'float' 124 | ) 125 | ) 126 | 127 | precision = tp / (tp + fp) 128 | recall = tp / (tp + fn) 129 | 130 | self.f1 = 2 * precision * recall / (precision + recall) 131 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # coding=utf-8 3 | 4 | import re 5 | 6 | line = "想做/ 兼_职/学生_/ 的 、加,我Q: 1 5. 8 0. !!?? 8 6 。0. 2。 3 有,惊,喜,哦" 7 | line = line.decode("utf8") 8 | string = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。:??、~@#¥%……&*()]+".decode("utf8"), "".decode("utf8"), line) 9 | print(string) 10 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | # coding=utf-8 3 | import tensorflow as tf 4 | import numpy as np 5 | import re 6 | import os 7 | import time 8 | import datetime 9 | import gc 10 | from input_helpers import InputHelper 11 | from siamese_network_semantic import SiameseLSTMw2v 12 | import gzip 13 | from random import random 14 | import sys 15 | 16 | # a=['你好','上帝','下地'] 17 | # b=[u'你好',u'上帝',u'下地'] 18 | # print(a) 19 | # print(b) 20 | # sys.exit(0) 21 | 22 | # Parameters 23 | # word2vec模型(采用已训练好的中文模型) 24 | WORD2VEC_MODEL = '../word2vecmodel/news_12g_baidubaike_20g_novel_90g_embedding_64.bin' 25 | #  模型格式为bin 26 | WORD2VEC_FORMAT = 'bin' 27 | # word2vec词嵌入维数(64/128可选) 28 | EMBEDDING_DIM = 64 29 | # dropout比例设置 30 | # DROPOUT_KEEP_PROB = '0.3'#训练集的拟合能力不够 31 | DROPOUT_KEEP_PROB = '0.8' 32 | # DROPOUT_KEEP_PROB = '0.6' 33 | # DROPOUT_KEEP_PROB = '0.7' 34 | # DROPOUT_KEEP_PROB = '0.8' 35 | # DROPOUT_KEEP_PROB = '1.0'(7th-June) 36 | # DROPOUT_KEEP_PROB = '0.8' 37 | # DROPOUT_KEEP_PROB = '0.4'#训练集的拟合能力不够 38 | # L2正规化系数(目前暂未生效) 39 | L2_REG_LAMBDA = 0.0 40 | # 原始训练文件 41 | TRAINING_FILES_RAW = './train_data/atec_nlp_sim_train.csv' 42 | # 隐藏层单元数 43 | # HIDDEN_UNITS = 64(7th-June) 44 | HIDDEN_UNITS = 128 45 | 46 | # Training parameters 47 | # 批大小 48 | # BATCH_SIZE = 64 49 | # BATCH_SIZE = 1024(7th-June) 50 | BATCH_SIZE = 1024 # 92229=102477-10248 51 | # epoch数目 52 | # NUM_EPOCHS = 300 53 | # NUM_EPOCHS = 3000 54 | NUM_EPOCHS = 100000 55 | # 模型评估周期(每隔多少步) 56 | # EVALUATE_EVERY = 10(7th-June) 57 | EVALUATE_EVERY = 100 58 | # EVALUATE_EVERY = 10 59 | # 模型保存周期(每隔多少步) 60 | # CHECKOUTPOINT_EVERY = 1000 61 | # CHECKOUTPOINT_EVERY = 10000 62 | # CHECKOUTPOINT_EVERY = 1000(7th-Jnue) 63 | CHECKOUTPOINT_EVERY = 1000 64 | # 语句最多长度(包含多少个词) 65 | # MAX_DOCUMENT_LENGTH = 12 66 | # MAX_DOCUMENT_LENGTH = 8 67 | # MAX_DOCUMENT_LENGTH = 20(7th-June) 68 | MAX_DOCUMENT_LENGTH = 40 69 | # 验证集比例 70 | DEV_PERCENT = 10 71 | 72 | # Misc Parameters 73 | ALLOW_SOFT_PLACEMENT = True 74 | LOG_DEVICE_PLACEMENT = False 75 | 76 | print ('训练开始......................') 77 | start_time = datetime.datetime.now() 78 | 79 | inpH = InputHelper() 80 | # 将原始的训练文件转化为分词后的训练文件 81 | # inpH.train_file_preprocess(TRAINING_FILES_RAW, TRAINING_FILES_FORMAT) 82 | # sys.exit(0) 83 | 84 | 85 | train_set, dev_set, vocab_processor, sum_no_of_batches = inpH.getDataSets(TRAINING_FILES_RAW, MAX_DOCUMENT_LENGTH, 86 | DEV_PERCENT, 87 | BATCH_SIZE) 88 | 89 | # dev_batches = inpH.batch_iter(list(zip(dev_set[0], dev_set[1], dev_set[2])), BATCH_SIZE, 1) 90 | # for index,dev_batch in enumerate(dev_batches): 91 | # print(index, dev_batch) 92 | # sys.exit(0) 93 | 94 | # for index, value in enumerate(dev_set[2]): 95 | # print(index, dev_set[0][index], dev_set[1][index], dev_set[2][index]) 96 | # sys.exit(0) 97 | 98 | # for index, w in enumerate(vocab_processor.vocabulary_._mapping): 99 | # print('vocab-{}:{}'.format(index, w)) 100 | # sys.exit(0) 101 | 102 | with tf.Graph().as_default(): 103 | session_conf = tf.ConfigProto( 104 | allow_soft_placement=ALLOW_SOFT_PLACEMENT, 105 | log_device_placement=LOG_DEVICE_PLACEMENT) 106 | sess = tf.Session(config=session_conf) 107 | 108 | with sess.as_default(): 109 | siameseModel = SiameseLSTMw2v( 110 | sequence_length=MAX_DOCUMENT_LENGTH, 111 | vocab_size=len(vocab_processor.vocabulary_), 112 | embedding_size=EMBEDDING_DIM, 113 | hidden_units=HIDDEN_UNITS, 114 | l2_reg_lambda=L2_REG_LAMBDA, 115 | batch_size=BATCH_SIZE, 116 | trainableEmbeddings=False 117 | ) 118 | # Define Training procedure 119 | global_step = tf.Variable(0, 
name="global_step", trainable=False) 120 | optimizer = tf.train.AdamOptimizer(1e-3) 121 | 122 | grads_and_vars = optimizer.compute_gradients(siameseModel.loss) 123 | tr_op_set = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 124 | print("defined training_ops") 125 | # Keep track of gradient values and sparsity (optional) 126 | grad_summaries = [] 127 | for g, v in grads_and_vars: 128 | if g is not None: 129 | grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g) 130 | sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 131 | grad_summaries.append(grad_hist_summary) 132 | grad_summaries.append(sparsity_summary) 133 | grad_summaries_merged = tf.summary.merge(grad_summaries) 134 | print("defined gradient summaries") 135 | # Output directory for models and summaries 136 | timestamp = str(int(time.time())) 137 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp)) 138 | print("Writing to {}\n".format(out_dir)) 139 | 140 | # Summaries for loss and accuracy 141 | loss_summary = tf.summary.scalar("loss", siameseModel.loss) 142 | acc_summary = tf.summary.scalar("accuracy", siameseModel.accuracy) 143 | f1_summary = tf.summary.scalar('f1', siameseModel.f1) 144 | 145 | # Train Summaries 146 | train_summary_op = tf.summary.merge([loss_summary, acc_summary, f1_summary, grad_summaries_merged]) 147 | train_summary_dir = os.path.join(out_dir, "summaries", "train") 148 | train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph) 149 | 150 | # Dev summaries 151 | dev_summary_op = tf.summary.merge([loss_summary, acc_summary, f1_summary]) 152 | dev_summary_dir = os.path.join(out_dir, "summaries", "dev") 153 | dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph) 154 | 155 | # Checkpoint directory. 
Tensorflow assumes this directory already exists so we need to create it 156 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 157 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 158 | if not os.path.exists(checkpoint_dir): 159 | os.makedirs(checkpoint_dir) 160 | saver = tf.train.Saver(tf.global_variables(), max_to_keep=100) 161 | 162 | # Write vocabulary 163 | vocab_processor.save(os.path.join(checkpoint_dir, "vocab")) 164 | 165 | # Initialize all variables 166 | sess.run(tf.global_variables_initializer()) 167 | 168 | print("init all variables") 169 | graph_def = tf.get_default_graph().as_graph_def() 170 | graphpb_txt = str(graph_def) 171 | with open(os.path.join(checkpoint_dir, "graphpb.txt"), 'w') as f: 172 | f.write(graphpb_txt) 173 | 174 | # 加载word2vec 175 | inpH.loadW2V(WORD2VEC_MODEL, WORD2VEC_FORMAT) 176 | # initial matrix with random uniform 177 | # initW = np.random.uniform(-0.25, 0.25, (len(vocab_processor.vocabulary_), EMBEDDING_DIM)) 178 | initW = np.random.uniform(0, 0, (len(vocab_processor.vocabulary_), EMBEDDING_DIM)) 179 | # print(initW) 180 | # sys.exit(0) 181 | 182 | # load any vectors from the word2vec 183 | print("initializing initW with pre-trained word2vec embeddings") 184 | for index, w in enumerate(vocab_processor.vocabulary_._mapping): 185 | # print('vocab-{}:{}'.format(index, w)) 186 | 187 | arr = [] 188 | if w in inpH.pre_emb: 189 | arr = inpH.pre_emb[w] 190 | # print('=====arr-{},{}'.format(index, arr)) 191 | idx = vocab_processor.vocabulary_.get(w) 192 | initW[idx] = np.asarray(arr).astype(np.float32) 193 | 194 | # 不使用词向量 195 | # arr=[] 196 | # idx = vocab_processor.vocabulary_.get(w) 197 | # arr.append(idx) 198 | # initW[idx] = np.asarray(arr).astype(np.float32) 199 | 200 | print("Done assigning intiW. 
len=" + str(len(initW))) 201 | # exit(0) 202 | 203 | # for idx, value in enumerate(initW): 204 | # print(idx, value) 205 | # sys.exit(0) 206 | 207 | inpH.deletePreEmb() 208 | gc.collect() 209 | sess.run(siameseModel.W.assign(initW)) 210 | 211 | 212 | def train_step(x1_batch, x2_batch, y_batch): 213 | """ 214 | A single training step 215 | """ 216 | # for index, sentence in enumerate(x1_batch): 217 | # word_list1=[] 218 | # word_list2=[] 219 | # y=y_batch[index] 220 | # for idx in x1_batch[index]: 221 | # word_list1.append(vocab_processor.vocabulary_.reverse(idx)) 222 | # for idx in x2_batch[index]: 223 | # word_list2.append(vocab_processor.vocabulary_.reverse(idx)) 224 | # 225 | # # print(''.join(word_list1),'\t',''.join(word_list2),'\t',y) 226 | # print('==========={}=============='.format(index)) 227 | # print(''.join(word_list1)) 228 | # print (''.join(word_list2)) 229 | # print(y) 230 | # sys.exit(0) 231 | 232 | feed_dict = { 233 | siameseModel.input_x1: x1_batch, 234 | siameseModel.input_x2: x2_batch, 235 | siameseModel.input_y: y_batch, 236 | siameseModel.dropout_keep_prob: DROPOUT_KEEP_PROB, 237 | } 238 | _, step, loss, accuracy, f1, dist, sim, summaries = sess.run( 239 | [tr_op_set, global_step, siameseModel.loss, siameseModel.accuracy, siameseModel.f1, siameseModel.distance, 240 | siameseModel.temp_sim, train_summary_op], feed_dict) 241 | time_str = datetime.datetime.now().isoformat() 242 | print("TRAIN {}: step {}, loss {:g}, acc {:g}, f1 {:g}".format(time_str, step, loss, accuracy, f1)) 243 | train_summary_writer.add_summary(summaries, step) 244 | print(y_batch, dist, sim) 245 | 246 | 247 | def dev_step(x1_batch, x2_batch, y_batch): 248 | """ 249 | A single training step 250 | """ 251 | # for index, sentence in enumerate(x1_batch): 252 | # word_list1=[] 253 | # word_list2=[] 254 | # y=y_batch[index] 255 | # for idx in x1_batch[index]: 256 | # word_list1.append(vocab_processor.vocabulary_.reverse(idx)) 257 | # for idx in x2_batch[index]: 258 | # word_list2.append(vocab_processor.vocabulary_.reverse(idx)) 259 | # 260 | # # print(''.join(word_list1),'\t',''.join(word_list2),'\t',y) 261 | # print('==========={}=============='.format(index)) 262 | # print(''.join(word_list1)) 263 | # print (''.join(word_list2)) 264 | # print(y) 265 | # sys.exit(0) 266 | 267 | feed_dict = { 268 | siameseModel.input_x1: x2_batch, 269 | siameseModel.input_x2: x1_batch, 270 | siameseModel.input_y: y_batch, 271 | siameseModel.dropout_keep_prob: 1.0, 272 | } 273 | step, loss, accuracy, f1, sim, summaries = sess.run( 274 | [global_step, siameseModel.loss, siameseModel.accuracy, siameseModel.f1, siameseModel.temp_sim, 275 | dev_summary_op], feed_dict) 276 | time_str = datetime.datetime.now().isoformat() 277 | print("DEV {}: step {}, loss {:g}, acc {:g}, f1 {:g}".format(time_str, step, loss, accuracy, f1)) 278 | dev_summary_writer.add_summary(summaries, step) 279 | print (y_batch, sim) 280 | return accuracy 281 | 282 | 283 | ################## 284 | # sys.exit(0) 285 | 286 | # Generate batches 287 | batches = inpH.batch_iter( 288 | list(zip(train_set[0], train_set[1], train_set[2])), BATCH_SIZE, NUM_EPOCHS) 289 | 290 | ptr = 0 291 | max_validation_acc = 0.0 292 | for nn in xrange(sum_no_of_batches * NUM_EPOCHS): 293 | batch = batches.next() 294 | if len(batch) < 1: 295 | continue 296 | x1_batch, x2_batch, y_batch = zip(*batch) 297 | if len(y_batch) < 1: 298 | continue 299 | train_step(x1_batch, x2_batch, y_batch) 300 | current_step = tf.train.global_step(sess, global_step) 301 | sum_acc = 0.0 302 | cnt = 0 
303 | if current_step % EVALUATE_EVERY == 0: 304 | print("\nEvaluation:") 305 | dev_batches = inpH.batch_iter(list(zip(dev_set[0], dev_set[1], dev_set[2])), BATCH_SIZE, 1) 306 | for db in dev_batches: 307 | if len(db) < 1: 308 | continue 309 | x1_dev_b, x2_dev_b, y_dev_b = zip(*db) 310 | if len(y_dev_b) < 1: 311 | continue 312 | acc = dev_step(x1_dev_b, x2_dev_b, y_dev_b) 313 | sum_acc = sum_acc + acc 314 | cnt += 1 315 | 316 | sum_acc /= cnt 317 | print("sum_acc= {}".format(sum_acc)) 318 | if current_step % CHECKOUTPOINT_EVERY == 0: 319 | if sum_acc >= max_validation_acc: 320 | max_validation_acc = sum_acc 321 | 322 | # 临时逻辑 323 | saver.save(sess, checkpoint_prefix, global_step=current_step) 324 | tf.train.write_graph(sess.graph.as_graph_def(), checkpoint_prefix, "graph" + str(nn) + ".pb", 325 | as_text=False) 326 | print("Saved model {} with sum_accuracy={} checkpoint to {}\n".format(nn, max_validation_acc, 327 | checkpoint_prefix)) 328 | 329 | print('max_validation_acc(each batch)= {}'.format(max_validation_acc)) 330 | 331 | end_time = datetime.datetime.now() 332 | train_duration = end_time - start_time 333 | print('训练开始时间: {}'.format(start_time)) 334 | print('训练结束时间: {}'.format(end_time)) 335 | print('训练结束, 训练总耗时: {}'.format(train_duration)) 336 | --------------------------------------------------------------------------------
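A note on the scoring math used throughout the model code: siamese_network_semantic.py builds the similarity score and the contrastive loss inside the TensorFlow graph, but the arithmetic is compact enough to restate. The sketch below is ours, for illustration only — it mirrors the graph ops on made-up arrays standing in for the stacked-LSTM outputs out1/out2, and omits the L2 regularization term that contrastive_loss adds in the real model:

```python
import numpy as np

def siamese_distance(out1, out2):
    """Normalized Euclidean distance, as in SiameseLSTMw2v: ||o1 - o2|| / (||o1|| + ||o2||)."""
    num = np.sqrt(np.sum(np.square(out1 - out2), axis=1))
    den = np.sqrt(np.sum(np.square(out1), axis=1)) + np.sqrt(np.sum(np.square(out2), axis=1))
    return num / den

def contrastive_loss(y, d):
    """y * d^2 + (1 - y) * max(1 - d, 0)^2, summed and halved over the batch (regularizer omitted)."""
    return np.sum(y * d ** 2 + (1 - y) * np.maximum(1 - d, 0) ** 2) / (2.0 * len(y))

out1 = np.random.rand(4, 128)   # stand-in for the final LSTM state of sentence 1 (HIDDEN_UNITS = 128)
out2 = np.random.rand(4, 128)   # stand-in for the final LSTM state of sentence 2
y = np.array([1.0, 0.0, 1.0, 0.0])

d = siamese_distance(out1, out2)
sim = 1.0 - np.rint(d)          # temp_sim: distances below 0.5 are called "synonymous" (1)
print(d, sim, contrastive_loss(y, d))
```

Because the distance is divided by the sum of the two output norms it always lies in [0, 1], which is why rounding it at 0.5 (temp_sim = 1 - rint(distance)) yields the 0/1 label that eval.py writes to the output file.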