├── .gitignore ├── LICENSE ├── README.md ├── data ├── UserDict.txt ├── atec_nlp_sim_train.csv ├── atec_nlp_sim_train1.csv ├── atec_nlp_sim_train2.csv ├── atec_token.csv ├── char_vec ├── pred.csv ├── w2v.txt └── word_vec ├── pytorch ├── __init__.py ├── dataset.py ├── model.py ├── siamese_network.py ├── text_rcnn.py ├── train.py └── train2.py ├── requirements.txt ├── tf ├── __init__.py ├── bad_cases.py ├── dataset.py ├── encoder.py ├── pred.py ├── siamese_net.py └── train.py └── utils ├── __init__.py ├── data_stats.py ├── feature_engineering.py ├── langconv.py ├── train_test_split.py └── zh_wiki.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | demo/ 3 | model/ 4 | thought.md 5 | experiment.md 6 | *.pyc 7 | rsync.sh -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Dhwaj Raj 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ATEC NLP sentence pair similarity competition 2 | 3 | https://dc.cloud.alipay.com/index#/topic/intro?id=3 4 | 5 | 1. Task description 6 | 7 | Question similarity computation: given two sentences from user queries to customer service, use an algorithm to decide whether they express the same meaning. 8 | 9 | Examples: 10 | 11 | “花呗如何还款” -- “花呗怎么还款”: synonymous 12 | “花呗如何还款” -- “我怎么还我的花被呢”: synonymous 13 | “花呗分期后逾期了如何还款” -- “花呗分期后逾期了哪里还款”: not synonymous 14 | Example a can be judged synonymous with fairly simple methods. Example b contains typos, synonyms and word-order changes, so the two sentences do not look alike at first glance and are hard to judge correctly. In example c the two sentences are very similar and differ only in one small word, “如何” (how) versus “哪里” (where), yet the meanings are different. 15 | 16 | 2. Data 17 | 18 | All data come from real application scenarios of Ant Financial's "financial brain". The competition has a preliminary round and a final round: 19 | 20 | Preliminary round 21 | 22 | We provide 100,000 labeled sentence pairs (released in batches) as downloadable training data, containing both synonymous and non-synonymous pairs. Each line of the dataset is one sample, in the following format: 23 | 24 | line_id\tsentence1\tsentence2\tlabel, for example: 1 花呗如何还款 花呗怎么还款 1 25 | 26 | line_id is the row number of the question pair in the training set; 27 | sentence1 and sentence2 are the two sentences of the question pair; 28 | label marks the pair as synonymous (1) or non-synonymous (0). 29 | The evaluation set contains 10,000 pairs. To keep the competition fair and prevent leaderboard probing, this set is not released; participants submit evaluation code and models, which produce predictions and the corresponding ranking. Its format is: 30 | 31 | line_id\tsentence1\tsentence2 32 | 33 | In the preliminary round the evaluation set is placed under a specific path of the evaluation system, and the official platform runs the submitted evaluation tool on it. 34 | 35 | Final round 36 | 37 | The training set is scaled up to a massive size. Data in this stage cannot be downloaded; they are provided as data tables on Ant Financial's Shuchao platform. As in the preliminary round, the dataset has four fields: line_id, sentence1, sentence2 and label. 38 | 39 | The evaluation set is again 10,000 pairs, also provided as a data table on the Shuchao platform, with three fields: line_id, sentence1 and sentence2. 40 | 41 | 3. Evaluation and metrics 42 | 43 | In the preliminary round, participants train and tune their models locally, package the evaluation code and model, and submit them to the official evaluation system for prediction and ranking updates. The evaluation system is a standard Linux environment with 8 GB memory, 4 CPU cores and no network access. Installed software: python 2.7, java 8, tensorflow 1.5, jieba 0.39, pytorch 0.4.0, keras 2.1.6, gensim 3.4.0, pandas 0.22.0, sklearn 0.19.1, xgboost 0.71, lightgbm 2.1.1. After the submitted archive is unpacked, its top-level directory must contain a script run.sh that takes the evaluation file as input and writes the predictions (only 0 or 1) as output, one “line_id\tprediction” per line. The command times out after 30 minutes and is invoked as: 44 | 45 | bash run.sh INPUT_PATH OUTPUT_PATH 46 | 47 | If the prediction output is empty or has the wrong number of lines, the score is set to 0. 48 | 49 | 50 | 51 | In the final round, model training, tuning and prediction are all done on Ant Financial's machine-learning platform, so only a UDF needs to be provided for evaluation: it takes the two sentences of a question pair as input and outputs the similarity prediction (0 or 1). As before, empty output terminates the evaluation and the score is 0. 52 | 53 | 54 | 55 | Submissions are scored by F1-score; ties are broken by accuracy. Predictions are compared against the ground-truth labels, with the following definitions: 56 | 57 | True Positive (TP): the number of correct synonymous judgements; 58 | 59 | likewise, False Positive (FP): the number of incorrect synonymous judgements; 60 | 61 | True Negative (TN): the number of correct non-synonymous judgements; 62 | 63 | False Negative (FN): the number of incorrect non-synonymous judgements. 64 | 65 | From these we compute precision, recall, accuracy and F1-score: 66 | 67 | precision = TP / (TP + FP) 68 | 69 | recall = TP / (TP + FN) 70 | 71 | accuracy = (TP + TN) / (TP + FP + TN + FN) 72 | 73 | F1-score = 2 * precision * recall / (precision + recall) 74 | -------------------------------------------------------------------------------- /data/UserDict.txt: -------------------------------------------------------------------------------- 1 | 花呗 57419 2 | 借呗 23730 3 | 支付宝 3275 4 | 淘宝网 13 5 | 淘宝 1467 6 | -------------------------------------------------------------------------------- /pytorch/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/18
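To make the data format and the scoring rule above concrete, here is a minimal, self-contained sketch in plain Python. It is an illustration only, not a file in this repository; `load_pairs` and `score` are hypothetical helper names.

```python
# -*- coding: utf-8 -*-
"""Illustration only: parse the line_id / sentence1 / sentence2 / label format
(tab-separated) and compute the competition metrics."""


def load_pairs(path):
    """Read one training sample per line."""
    pairs = []
    for line in open(path):
        line_id, s1, s2, label = line.rstrip('\n').split('\t')
        pairs.append((line_id, s1, s2, int(label)))
    return pairs


def score(y_true, y_pred):
    """Precision, recall, accuracy and F1 from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    accuracy = float(tp + tn) / len(y_true) if y_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, accuracy, f1


if __name__ == '__main__':
    # e.g. pairs = load_pairs('data/atec_nlp_sim_train.csv')
    print(score([1, 0, 1, 0], [1, 1, 1, 0]))  # (0.666..., 1.0, 0.75, 0.8)
```

The same four counts drive the tensor-based `metrics` function in pytorch/train.py below.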
-------------------------------------------------------------------------------- /pytorch/dataset.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | import os 6 | import re 7 | 8 | import jieba 9 | import numpy as np 10 | import torch 11 | from torch.utils.data.dataset import Dataset 12 | 13 | 14 | class Dictionary(object): 15 | def __init__(self, infile, char_vocab_file=None, word_vocab_file=None, char_level=True): 16 | self.infile = infile 17 | self.char_level = char_level 18 | self.word2idx = {} 19 | self.idx2word = [] 20 | vocab_file = char_vocab_file if char_level else word_vocab_file 21 | 22 | if not vocab_file or not os.path.exists(vocab_file): 23 | print('Vocabulary file not found. Building vocabulary...') 24 | self.build_vocab() 25 | else: 26 | self.idx2word = open(vocab_file).read().decode('utf-8').strip().split('\n') 27 | self.word2idx = dict(zip(self.idx2word, range(len(self.idx2word)))) 28 | 29 | @staticmethod 30 | def _clean_text(text): 31 | """Text filter for Chinese corpus, only keep CN character.""" 32 | re_non_ch = re.compile(ur'[^\u4e00-\u9fa5]+') 33 | text = text.decode('utf-8').strip(' ') 34 | text = re_non_ch.sub('', text) 35 | return text 36 | 37 | def add_word(self, word): 38 | if word not in self.word2idx: 39 | self.idx2word.append(word) 40 | self.word2idx[word] = len(self.idx2word) - 1 41 | return self.word2idx[word] 42 | 43 | def build_vocab(self): 44 | self.add_word('') # pad index: 0 45 | for line in open(self.infile, 'r'): 46 | _, s1, s2, label = line.strip().split('\t') 47 | s1, s2 = map(self._clean_text, [s1, s2]) 48 | if not self.char_level: 49 | s1 = list(jieba.cut(s1)) 50 | s2 = list(jieba.cut(s2)) 51 | for token in s1+s2: 52 | # build vocabulary 53 | self.add_word(token) 54 | self.add_word('UNK') # unk index: len(word2idx)-1 55 | 56 | def __len__(self): 57 | return len(self.idx2word) 58 | 59 | 60 | class MyDataset(Dataset): 61 | 62 | def __init__(self, data_file, sequence_length, word2idx, char_level=True): 63 | self.word2idx = word2idx 64 | self.seq_len = sequence_length 65 | 66 | x1, x2, y = [], [], [] 67 | for line in open(data_file, 'r'): 68 | _, s1, s2, label = line.strip().split('\t') 69 | s1, s2 = map(self._clean_text, [s1, s2]) 70 | if not char_level: 71 | s1 = list(jieba.cut(s1)) 72 | s2 = list(jieba.cut(s2)) 73 | x1.append(s1) 74 | x2.append(s2) 75 | y.append(1 if label == '1' else 0) 76 | self.x1 = x1 77 | self.x2 = x2 78 | self.y = y 79 | 80 | @staticmethod 81 | def _clean_text(text): 82 | """Text filter for Chinese corpus, only keep CN character.""" 83 | re_non_ch = re.compile(ur'[^\u4e00-\u9fa5]+') 84 | text = text.decode('utf-8').strip(' ') 85 | text = re_non_ch.sub('', text) 86 | return text 87 | 88 | def __getitem__(self, index): 89 | s1, s2 = self.x1[index], self.x2[index] 90 | s1_id = torch.LongTensor(np.zeros(self.seq_len, dtype=np.int64)) 91 | s2_id = torch.LongTensor(np.zeros(self.seq_len, dtype=np.int64)) 92 | label = torch.LongTensor([self.y[index]]) 93 | for idx, w1 in enumerate(s1[:self.seq_len]): 94 | s1_id[idx] = self.word2idx.get(w1, self.word2idx["UNK"]) 95 | for idx, w2 in enumerate(s2[:self.seq_len]): 96 | s2_id[idx] = self.word2idx.get(w2, self.word2idx["UNK"]) 97 | 98 | return s1_id, s2_id, label 99 | 100 | def __len__(self): 101 | return len(self.y) 102 | 103 | 104 | if __name__ == '__main__': 105 | dic = Dictionary('../data/atec_nlp_sim_train.csv', '../data/cha.vocab', '../data/word.vocab') 106 | dataset = MyDataset('../data/atec_nlp_sim_train.csv', 15, dic.word2idx) 107 | x1, x2, y = dataset[3] 108 | print(x1) 109 | print(y) -------------------------------------------------------------------------------- /pytorch/model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | from __future__ import print_function 6 | 7 | import os 8 | 9 | import torch 10 | from torch.autograd import Variable 11 | import torch.nn as nn 12 | 13 | 14 | class BiLSTM(nn.Module): 15 | 16 | def __init__(self, config):
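# `config` is a plain dict; train2.py (further below) assembles it with keys 'ntoken', 'ninp',
# 'nhid', 'nlayers', 'dropout', 'pooling', 'dictionary' and 'word-vector', plus
# 'attention-unit', 'attention-hops', 'nfc' and 'class-number' for SelfAttentiveEncoder and Classifier.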
17 | super(BiLSTM, self).__init__() 18 | self.drop = nn.Dropout(config['dropout']) 19 | self.encoder = nn.Embedding(config['ntoken'], config['ninp']) 20 | self.bilstm = nn.LSTM(config['ninp'], config['nhid'], config['nlayers'], dropout=config['dropout'], 21 | bidirectional=True) 22 | self.nlayers = config['nlayers'] 23 | self.nhid = config['nhid'] 24 | self.pooling = config['pooling'] 25 | self.dictionary = config['dictionary'] 26 | # self.init_weights() 27 | self.encoder.weight.data[self.dictionary.word2idx['']] = 0 28 | if os.path.exists(config['word-vector']): 29 | print('Loading word vectors from', config['word-vector']) 30 | vectors = torch.load(config['word-vector']) 31 | assert vectors[2] >= config['ninp'] 32 | vocab = vectors[0] 33 | vectors = vectors[1] 34 | loaded_cnt = 0 35 | for word in self.dictionary.word2idx: 36 | if word not in vocab: 37 | continue 38 | real_id = self.dictionary.word2idx[word] 39 | loaded_id = vocab[word] 40 | self.encoder.weight.data[real_id] = vectors[loaded_id][:config['ninp']] 41 | loaded_cnt += 1 42 | print('%d words from external word vectors loaded.' % loaded_cnt) 43 | 44 | # note: init_range constraints the value of initial weights 45 | def init_weights(self, init_range=0.1): 46 | self.encoder.weight.data.uniform_(-init_range, init_range) 47 | 48 | def forward(self, inp, hidden): 49 | emb = self.drop(self.encoder(inp)) 50 | outp = self.bilstm(emb, hidden)[0] 51 | if self.pooling == 'mean': 52 | outp = torch.mean(outp, 0).squeeze() 53 | elif self.pooling == 'max': 54 | outp = torch.max(outp, 0)[0].squeeze() 55 | elif self.pooling == 'all' or self.pooling == 'all-word': 56 | outp = torch.transpose(outp, 0, 1).contiguous() 57 | return outp, emb 58 | 59 | def init_hidden(self, bsz): 60 | weight = next(self.parameters()).data 61 | return (Variable(weight.new(self.nlayers * 2, bsz, self.nhid).zero_()), 62 | Variable(weight.new(self.nlayers * 2, bsz, self.nhid).zero_())) 63 | 64 | 65 | class SelfAttentiveEncoder(nn.Module): 66 | 67 | def __init__(self, config): 68 | super(SelfAttentiveEncoder, self).__init__() 69 | self.bilstm = BiLSTM(config) 70 | self.drop = nn.Dropout(config['dropout']) 71 | self.ws1 = nn.Linear(config['nhid'] * 2, config['attention-unit'], bias=False) 72 | self.ws2 = nn.Linear(config['attention-unit'], config['attention-hops'], bias=False) 73 | self.tanh = nn.Tanh() 74 | self.softmax = nn.Softmax() 75 | self.dictionary = config['dictionary'] 76 | # self.init_weights() 77 | self.attention_hops = config['attention-hops'] 78 | 79 | def init_weights(self, init_range=0.1): 80 | self.ws1.weight.data.uniform_(-init_range, init_range) 81 | self.ws2.weight.data.uniform_(-init_range, init_range) 82 | 83 | def forward(self, inp, hidden): 84 | outp = self.bilstm.forward(inp, hidden)[0] 85 | size = outp.size() # [bsz, len, nhid] 86 | compressed_embeddings = outp.view(-1, size[2]) # [bsz*len, nhid*2] 87 | transformed_inp = torch.transpose(inp, 0, 1).contiguous() # [bsz, len] 88 | transformed_inp = transformed_inp.view(size[0], 1, size[1]) # [bsz, 1, len] 89 | concatenated_inp = [transformed_inp for i in range(self.attention_hops)] 90 | concatenated_inp = torch.cat(concatenated_inp, 1) # [bsz, hop, len] 91 | 92 | hbar = self.tanh(self.ws1(self.drop(compressed_embeddings))) # [bsz*len, attention-unit] 93 | alphas = self.ws2(hbar).view(size[0], size[1], -1) # [bsz, len, hop] 94 | alphas = torch.transpose(alphas, 1, 2).contiguous() # [bsz, hop, len] 95 | penalized_alphas = alphas + ( 96 | -10000 * (concatenated_inp == 
self.dictionary.word2idx['']).float()) 97 | # [bsz, hop, len] + [bsz, hop, len] 98 | alphas = self.softmax(penalized_alphas.view(-1, size[1])) # [bsz*hop, len] 99 | alphas = alphas.view(size[0], self.attention_hops, size[1]) # [bsz, hop, len] 100 | # Performs a batch matrix-matrix product of matrices 101 | return torch.bmm(alphas, outp), alphas 102 | 103 | def init_hidden(self, bsz): 104 | return self.bilstm.init_hidden(bsz) 105 | 106 | 107 | class Classifier(nn.Module): 108 | 109 | def __init__(self, config): 110 | super(Classifier, self).__init__() 111 | if config['pooling'] == 'mean' or config['pooling'] == 'max': 112 | self.encoder = BiLSTM(config) 113 | self.fc = nn.Linear(config['nhid'] * 2, config['nfc']) 114 | elif config['pooling'] == 'all': 115 | self.encoder = SelfAttentiveEncoder(config) 116 | self.fc = nn.Linear(config['nhid'] * 2 * config['attention-hops'], config['nfc']) 117 | else: 118 | raise Exception('Error when initializing Classifier') 119 | self.drop = nn.Dropout(config['dropout']) 120 | self.tanh = nn.Tanh() 121 | self.pred = nn.Linear(config['nfc'], config['class-number']) 122 | self.dictionary = config['dictionary'] 123 | # self.init_weights() 124 | 125 | def init_weights(self, init_range=0.1): 126 | self.fc.weight.data.uniform_(-init_range, init_range) 127 | self.fc.bias.data.fill_(0) 128 | self.pred.weight.data.uniform_(-init_range, init_range) 129 | self.pred.bias.data.fill_(0) 130 | 131 | def forward(self, inp, hidden): 132 | outp, attention = self.encoder.forward(inp, hidden) 133 | outp = outp.view(outp.size(0), -1) 134 | fc = self.tanh(self.fc(self.drop(outp))) 135 | pred = self.pred(self.drop(fc)) 136 | if type(self.encoder) == BiLSTM: 137 | attention = None 138 | return pred, attention 139 | 140 | def init_hidden(self, bsz): 141 | return self.encoder.init_hidden(bsz) 142 | 143 | def encode(self, inp, hidden): 144 | return self.encoder.forward(inp, hidden)[0] -------------------------------------------------------------------------------- /pytorch/siamese_network.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | from numpy.linalg import norm 6 | 7 | import torch 8 | import torch.nn as nn 9 | import torch.nn.functional as F 10 | 11 | 12 | class EmbeddingCNN(nn.Module): 13 | def __init__(self, args): 14 | super(EmbeddingCNN, self).__init__() 15 | self.args = args 16 | 17 | self.embed = nn.Embedding(args.vocab_size, args.embedding_dim) 18 | self.convs1 = nn.ModuleList([nn.Conv2d(1, args.num_kernels, (ks, args.embedding_dim)) for ks in args.kernel_sizes]) 19 | 20 | def forward(self, x): 21 | x = self.embed(x) # (batch_size, sequence_length, embedding_dim) 22 | if self.args.word_embedding_type == 'static': 23 | x = torch.tensor(x) 24 | x = x.unsqueeze(1) # (batch_size, 1, sequence_length, embedding_dim) 25 | # # input size (N,Cin,H,W) output size (N,Cout,Hout,1) 26 | x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1] 27 | x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x] 28 | output = torch.cat(x, 1) # (batch_size, len(kernel_sizes)*kernel_num) 29 | return output 30 | 31 | 32 | class EmbeddingRNN(nn.Module): 33 | 34 | def __init__(self, args): 35 | super(EmbeddingRNN, self).__init__() 36 | self.hidden_units = args.hidden_units 37 | self.batch_size = args.batch_size 38 | 39 | self.embeds = nn.Embedding(args.vocab_size, args.embedding_dim) 40 | self.lstm = nn.LSTM(input_size=args.embedding_dim, 
hidden_size=args.hidden_units, num_layers=args.num_layers, 41 | batch_first=True, bidirectional=True) 42 | # self.hidden = self.init_hidden() 43 | # 44 | # def init_hidden(self): 45 | # h0 = Variable(torch.zeros(self.batch_size, num_layers, self.hidden_units)) 46 | # c0 = Variable(torch.zeros(self.batch_size, num_layers, self.hidden_units)) 47 | # return h0, c0 48 | 49 | def forward(self, sentence): 50 | embeds = self.embeds(sentence) 51 | # print(embeds) # [torch.FloatTensor of size batch_zise*seq_len*embedding_dim] 52 | # x = embeds.view(len(sentence), self.batch_size, -1) 53 | # Inputs: input, (h_0, c_0) Outputs: output, (h_n, c_n) (batch, seq_len, hidden_size * num_directions) 54 | # If (h_0, c_0) is not provided, both h_0 and c_0 default to zero. 55 | lstm_out, _ = self.lstm(embeds) 56 | # print(lstm_out) 57 | output = lstm_out[:, -1, :] 58 | return output 59 | 60 | 61 | class SiameseNet(nn.Module): 62 | def __init__(self, embedding_net): 63 | super(SiameseNet, self).__init__() 64 | self.embedding_net = embedding_net 65 | 66 | def forward(self, x1, x2): 67 | out1 = self.embedding_net(x1) 68 | out2 = self.embedding_net(x2) 69 | 70 | # out1_norm = torch.sqrt(torch.sum(torch.pow(out1, 2), dim=1)) 71 | # out2_norm = torch.sqrt(torch.sum(torch.pow(out2, 2), dim=1)) 72 | # cosine = (out1*out2).sum(1) / (out1_norm*out2_norm) 73 | sim = F.cosine_similarity(out1, out2, dim=1) 74 | # pdist = F.pairwise_distance(out1, out2, p=2, eps=1e-06, keepdim=False) 75 | 76 | return out1, out2, sim 77 | 78 | 79 | class ContrastiveLoss(torch.nn.Module): 80 | 81 | def __init__(self, margin=0.0): 82 | super(ContrastiveLoss, self).__init__() 83 | self.margin = margin 84 | 85 | def forward(self, y, y_): # y, y_ must be same type float (*) 86 | loss = y * torch.pow(1-y_, 2) + (1 - y) * torch.pow(y_-self.margin, 2) 87 | loss = torch.sum(loss) / 2.0 / len(y) #y.size()[0] 88 | return loss 89 | 90 | 91 | -------------------------------------------------------------------------------- /pytorch/text_rcnn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | import torch 6 | import torch.nn as nn 7 | import torch.nn.functional as F 8 | from torch.autograd import Variable 9 | 10 | 11 | class TextCNN(nn.Module): 12 | def __init__(self, args): 13 | super(TextCNN, self).__init__() 14 | self.args = args 15 | 16 | self.embed = nn.Embedding(args.sequence_length, args.embed_dim) 17 | self.convs1 = nn.ModuleList([nn.Conv2d(1, args.kernel_num, (ks, args.embed_dim)) for ks in args.kernel_sizes]) 18 | self.dropout = nn.Dropout(args.dropout) 19 | self.fc1 = nn.Linear(len(args.kernel_sizes) * args.kernel_num, args.class_num) 20 | 21 | def forward(self, x): 22 | x = self.embed(x) # (batch_size, sequence_length, embedding_dim) 23 | if self.args.static: 24 | x = torch.tensor(x) 25 | x = x.unsqueeze(1) # (batch_size, 1, sequence_length, embedding_dim) 26 | # # input size (N,Cin,H,W) output size (N,Cout,Hout,1) 27 | x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1] 28 | x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x] 29 | x = torch.cat(x, 1) # (batch_size, len(kernel_sizes)*kernel_num) 30 | x = self.dropout(x) 31 | logit = self.fc1(x) 32 | return logit 33 | 34 | 35 | class TextRNN(nn.Module): 36 | 37 | def __init__(self, args): 38 | super(TextRNN, self).__init__() 39 | self.hidden_dim = args.hidden_dim 40 | self.batch_size = args.batch_size 41 | 42 | self.embeds = nn.Embedding(args.vocab_size, 
args.embedding_dim) 43 | self.lstm = nn.LSTM(input_size=args.embedding_dim, hidden_size=args.hidden_dim, num_layers=args.num_layers, 44 | batch_first=True, bidirectional=True) 45 | self.hidden2label = nn.Linear(args.hidden_dim, args.num_classes) 46 | self.hidden = self.init_hidden() 47 | 48 | def init_hidden(self): 49 | h0 = Variable(torch.zeros(1, self.batch_size, self.hidden_dim)) 50 | c0 = Variable(torch.zeros(1, self.batch_size, self.hidden_dim)) 51 | return h0, c0 52 | 53 | def forward(self, sentence): 54 | embeds = self.embeds(sentence) 55 | # x = embeds.view(len(sentence), self.batch_size, -1) 56 | lstm_out, self.hidden = self.lstm(embeds, self.hidden) 57 | y = self.hidden2label(lstm_out[-1]) 58 | return y 59 | 60 | 61 | class TextRCNN(nn.Module): 62 | 63 | def __init__(self, args): 64 | super(TextRCNN, self).__init__() 65 | self.interaction = args.interaction 66 | self.model_type = args.model_type 67 | self.cnn = TextCNN(args) 68 | self.rnn = TextRNN(args) 69 | 70 | def forward(self, x): 71 | if self.model_type == 'cnn': 72 | out = self.cnn.forward(x) 73 | elif self.model_type == 'rnn': 74 | out = self.rnn.forward(x) 75 | elif self.model_type == 'rcnn': 76 | out = self.cnn.forward(x) + self.rnn.forward(x) 77 | 78 | if self.interaction == 'multiply': 79 | pass 80 | 81 | return out 82 | 83 | -------------------------------------------------------------------------------- /pytorch/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | import os 6 | import sys 7 | 8 | import argparse 9 | 10 | import torch 11 | import torch.autograd as autograd 12 | from torch.autograd import Variable 13 | import torch.nn.functional as F 14 | import torch.nn as nn 15 | from torch.utils.data import DataLoader 16 | 17 | from dataset import Dictionary, MyDataset 18 | from siamese_network import EmbeddingCNN, EmbeddingRNN, SiameseNet, ContrastiveLoss 19 | 20 | 21 | def get_args(): 22 | parser = argparse.ArgumentParser(description='Siamese text classifier') 23 | parser.add_argument('--dictionary', type=str, default='', 24 | help='path to save the dictionary, for faster corpus loading') 25 | parser.add_argument('--word_vector', type=str, default='', 26 | help='path for pre-trained word vectors (e.g. 
GloVe)') 27 | parser.add_argument('--word_embedding_type', type=str, default='rand', 28 | help='word embedding type {`rand`, `static`, `non-static`}') 29 | parser.add_argument('--train_data', type=str, default='../data/train.csv', 30 | help='training data path') 31 | parser.add_argument('--val_data', type=str, default='../data/test.csv', 32 | help='validation data path') 33 | parser.add_argument('--test_data', type=str, default='', 34 | help='test data path') 35 | parser.add_argument('--char_model', type=bool, default=True, 36 | help='whether to use character level model') 37 | # RNN 38 | parser.add_argument('--sequence_length', type=int, default=20, 39 | help='max sequence length') 40 | parser.add_argument('--embedding_dim', type=int, default=64, 41 | help='size of word embeddings') 42 | parser.add_argument('--hidden_units', type=int, default=200, 43 | help='number of hidden units per layer') 44 | parser.add_argument('--num_layers', type=int, default=2, 45 | help='number of layers in BiLSTM') 46 | # CNN 47 | parser.add_argument('--kernel_sizes', type=int, nargs='+', default=[2,3,4,5], 48 | help='kernel sizes in CNN') 49 | parser.add_argument('--num_kernels', type=int, default=100, 50 | help='number of kernels in CNN') 51 | # parser.add_argument('--attention-unit', type=int, default=350, 52 | # help='number of attention unit') 53 | # parser.add_argument('--attention-hops', type=int, default=1, 54 | # help='number of attention hops, for multi-hop attention model') 55 | parser.add_argument('--dropout', type=float, default=0.1, 56 | help='dropout applied to layers (0 = no dropout)') 57 | parser.add_argument('--clip', type=float, default=0.5, 58 | help='clip to prevent the too large grad in LSTM') 59 | # parser.add_argument('--nfc', type=int, default=512, 60 | # help='hidden (fully connected) layer size for classifier MLP') 61 | # train 62 | parser.add_argument('--lr', type=float, default=.001, 63 | help='initial learning rate') 64 | parser.add_argument('--epochs', type=int, default=50, 65 | help='upper epoch limit') 66 | parser.add_argument('--batch_size', type=int, default=64, 67 | help='batch size for training') 68 | parser.add_argument('--cuda', action='store_true', 69 | help='use CUDA') 70 | parser.add_argument('--log_interval', type=int, default=100, metavar='N', 71 | help='train log interval') 72 | parser.add_argument('--test_interval', type=int, default=100, metavar='N', 73 | help='eval interval') 74 | parser.add_argument('--save_interval', type=int, default=1000, metavar='N', 75 | help='save interval') 76 | parser.add_argument('--save_dir', type=str, default='model_torch', 77 | help='path to save the final model') 78 | parser.add_argument('--optimizer', type=str, default='Adam', 79 | help='type of optimizer') 80 | parser.add_argument('--seed', type=int, default=123, 81 | help='random seed') 82 | # parser.add_argument('--penalization-coeff', type=float, default=1, 83 | # help='the penalization coefficient') 84 | return parser.parse_args() 85 | 86 | 87 | def metrics(y, y_pred): 88 | # 8-bit integer (unsigned) 89 | # y, y_pred = torch.ByteTensor(y), torch.ByteTensor(y_pred) 90 | TP = ((y_pred == 1) & (y == 1)).sum().float() 91 | TN = ((y_pred == 0) & (y == 0)).sum().float() 92 | FN = ((y_pred == 0) & (y == 1)).sum().float() 93 | FP = ((y_pred == 1) & (y == 0)).sum().float() 94 | p = TP / (TP + FP).clamp(min=1e-8) 95 | r = TP / (TP + FN).clamp(min=1e-8) 96 | F1 = 2 * r * p / (r + p).clamp(min=1e-8) 97 | acc = (TP + TN) / (TP + TN + FP + FN).clamp(min=1e-8) 98 | return acc, p, r, F1 99 | 100 | 101
| def train(train_iter, dev_iter, model, args): 102 | if args.cuda: 103 | model.cuda() 104 | # for param in model.parameters(): 105 | # print(param) 106 | 107 | def adjust_learning_rate(optimizer, learning_rate, epoch): 108 | lr = learning_rate * (0.1 ** (epoch // 10)) 109 | for param_group in optimizer.param_groups: 110 | param_group['lr'] = lr 111 | return optimizer 112 | 113 | if args.optimizer == 'Adam': 114 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, betas=[0.9, 0.999], eps=1e-8, weight_decay=0) 115 | elif args.optimizer == 'SGD': 116 | optimizer = torch.optim.SGD(model.parameters(), lr=args.lr, momentum=0.9, weight_decay=0.01) 117 | else: 118 | raise Exception('For other optimizers, please add it yourself. supported ones are: SGD and Adam.') 119 | 120 | F1_best = 0.0 121 | last_improved_step = 0 122 | model.train() 123 | steps = 0 124 | for epoch in range(1, args.epochs+1): 125 | for batch in train_iter: 126 | optimizer = adjust_learning_rate(optimizer, args.lr, epoch) 127 | x1, x2, y = batch 128 | y = torch.squeeze(y, 1).float() # [[1], [1], [0]...] to [1, 1, 0, ...] 129 | 130 | # if args.cuda: 131 | # x1, x2, y = Variable(x1).cuda(), Variable(x2).cuda(), Variable(y).cuda() 132 | # else: 133 | # x1, x2, y = Variable(x1), Variable(x2), Variable(y) 134 | optimizer.zero_grad() 135 | _, _, score = model(x1, x2) 136 | 137 | # print('out1', out1.dtype) 138 | # print('target vector', y.dtype) 139 | 140 | # loss_function = nn.CrossEntropyLoss() 141 | # loss = loss_function(output, Variable(train_labels)) 142 | # criterion = nn.CosineEmbeddingLoss(margin=0, size_average=True, reduce=False) 143 | # loss = criterion(out1, out2, (2 * y - 1)) # cast y to {1, -1} and float type 144 | # criterion = ContrastiveLoss() 145 | # loss = criterion(y, sim) 146 | 147 | # loss = F.cross_entropy(sim, y) 148 | loss = F.binary_cross_entropy_with_logits(score, y) 149 | loss.backward() 150 | optimizer.step() 151 | steps += 1 152 | 153 | if steps % args.log_interval == 0: 154 | # _, pred = torch.max(sim.data, 1) 155 | print('model sim and label tuples:') 156 | for i, j in zip(score, y): 157 | print(i.item(), j.item()) 158 | 159 | pred = score.data >= 0.5 160 | acc, p, r, f1 = metrics(y, pred) 161 | print('TRAIN[steps={}] loss={:.6f} acc={:.3f} P={:.3f} R={:.3f} F1={:.6f}'.format( 162 | steps, loss.item(), acc, p, r, f1)) 163 | if steps % args.test_interval == 0: 164 | loss, acc, p, r, f1 = eval(dev_iter, model) 165 | 166 | if f1 > F1_best: 167 | F1_best = f1 168 | last_improved_step = steps 169 | if F1_best > 0.5: 170 | save_prefix = os.path.join(args.save_dir, 'snapshot') 171 | save_path = '{}_steps{}.pt'.format(save_prefix, steps) 172 | torch.save(model, save_path) 173 | improved_token = '*' 174 | else: 175 | improved_token = '' 176 | print('DEV[steps={}] loss={:.6f} acc={:.3f} P={:.3f} R={:.3f} F1={:.6f} {}'.format( 177 | steps, loss, acc, p, r, f1, improved_token)) 178 | 179 | if steps % args.save_interval == 0: 180 | if not os.path.isdir(args.save_dir): 181 | os.makedirs(args.save_dir) 182 | save_prefix = os.path.join(args.save_dir, 'snapshot') 183 | save_path = '{}_steps{}.pt'.format(save_prefix, steps) 184 | torch.save(model, save_path) 185 | 186 | if steps - last_improved_step > 2000: # 2000 steps 187 | print("No improvement for a long time, early-stopping at best F1={}".format(F1_best)) 188 | break 189 | 190 | 191 | def eval(data_iter, model): 192 | loss_tot, y_list, y_pred_list = 0, [], [] 193 | model.eval() 194 | for x1, x2, y in data_iter: 195 | # if args.cuda: 196 | # x1, x2, y = 
Variable(x1).cuda(), Variable(x2).cuda(), Variable(y).cuda() 197 | # else: 198 | # x1, x2, y = Variable(x1), Variable(x2), Variable(y) 199 | out1, out2, sim = model(x1, x2) 200 | # loss = F.cross_entropy(output, y, size_average=False) 201 | criterion = nn.CosineEmbeddingLoss() 202 | loss = criterion(out1, out2, (2*y-1).float()) 203 | loss_tot += loss.item() # 0-dim scaler 204 | y_pred = sim.data >= 0.5 205 | y_pred_list.append(y_pred) 206 | y_list.append(y) 207 | y_pred = torch.cat(y_pred_list, 0) 208 | y = torch.cat(y_list, 0) 209 | acc, p, r, f1 = metrics(y, y_pred) 210 | size = len(data_iter.dataset) 211 | loss_avg = loss_tot / float(size) 212 | model.train() 213 | return loss_avg, acc, p, r, f1 214 | 215 | 216 | def predict(text, model, text_field, label_feild, cuda_flag): 217 | assert isinstance(text, str) 218 | model.eval() 219 | # text = text_field.tokenize(text) 220 | text = text_field.preprocess(text) 221 | text = [[text_field.vocab.stoi[x] for x in text]] 222 | x = text_field.tensor_type(text) 223 | x = autograd.Variable(x, volatile=True) 224 | if cuda_flag: 225 | x = x.cuda() 226 | print(x) 227 | output = model(x) 228 | _, predicted = torch.max(output, 1) 229 | return label_feild.vocab.itos[predicted.data[0][0]+1] 230 | 231 | 232 | if __name__ == '__main__': 233 | # parse the arguments 234 | args = get_args() 235 | print(args) 236 | 237 | # Set the random seed manually for reproducibility. 238 | torch.manual_seed(args.seed) 239 | if torch.cuda.is_available(): 240 | if not args.cuda: 241 | print("WARNING: You have a CUDA device, so you should probably run with --cuda") 242 | else: 243 | torch.cuda.manual_seed(args.seed) 244 | 245 | # Load Dictionary 246 | assert os.path.exists(args.train_data) 247 | assert os.path.exists(args.val_data) 248 | print('Begin to load the dictionary.') 249 | dictionary = Dictionary('../data/atec_nlp_sim_train.csv') 250 | 251 | args.vocab_size = len(dictionary) 252 | 253 | best_val_loss = None 254 | best_f1 = None 255 | n_token = len(dictionary) 256 | 257 | embedding_net = EmbeddingCNN(args) 258 | print("embedding_net: {}".format(embedding_net)) 259 | model = SiameseNet(embedding_net) 260 | print(model) 261 | 262 | print('Begin to load data.') 263 | train_data = MyDataset(args.train_data, args.sequence_length, dictionary.word2idx, args.char_model) 264 | val_data = MyDataset(args.val_data, args.sequence_length, dictionary.word2idx, args.char_model) 265 | train_loader = DataLoader(train_data, batch_size=args.batch_size, shuffle=True, num_workers=16) 266 | val_loader = DataLoader(val_data, batch_size=1, shuffle=False) 267 | try: 268 | for epoch in range(args.epochs): 269 | train(train_loader, val_loader, model, args) 270 | except KeyboardInterrupt: 271 | print('-' * 89) 272 | print('Exit from training early.') 273 | 274 | -------------------------------------------------------------------------------- /pytorch/train2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | from __future__ import print_function 6 | 7 | import json 8 | import time 9 | import random 10 | import os 11 | import argparse 12 | 13 | import torch 14 | import torch.nn as nn 15 | import torch.optim as optim 16 | from torch.autograd import Variable 17 | 18 | from model import * 19 | from dataset import * 20 | 21 | 22 | def get_args(): 23 | parser = argparse.ArgumentParser() 24 | parser.add_argument('--emsize', type=int, default=300, 25 | help='size of word 
embeddings') 26 | parser.add_argument('--nhid', type=int, default=300, 27 | help='number of hidden units per layer') 28 | parser.add_argument('--nlayers', type=int, default=2, 29 | help='number of layers in BiLSTM') 30 | parser.add_argument('--attention-unit', type=int, default=350, 31 | help='number of attention unit') 32 | parser.add_argument('--attention-hops', type=int, default=1, 33 | help='number of attention hops, for multi-hop attention model') 34 | parser.add_argument('--dropout', type=float, default=0.5, 35 | help='dropout applied to layers (0 = no dropout)') 36 | parser.add_argument('--clip', type=float, default=0.5, 37 | help='clip to prevent the too large grad in LSTM') 38 | parser.add_argument('--nfc', type=int, default=512, 39 | help='hidden (fully connected) layer size for classifier MLP') 40 | parser.add_argument('--lr', type=float, default=.001, 41 | help='initial learning rate') 42 | parser.add_argument('--epochs', type=int, default=40, 43 | help='upper epoch limit') 44 | parser.add_argument('--seed', type=int, default=1111, 45 | help='random seed') 46 | parser.add_argument('--cuda', action='store_true', 47 | help='use CUDA') 48 | parser.add_argument('--log-interval', type=int, default=200, metavar='N', 49 | help='report interval') 50 | parser.add_argument('--save', type=str, default='', 51 | help='path to save the final model') 52 | parser.add_argument('--dictionary', type=str, default='', 53 | help='path to save the dictionary, for faster corpus loading') 54 | parser.add_argument('--word-vector', type=str, default='', 55 | help='path for pre-trained word vectors (e.g. GloVe), should be a PyTorch model.') 56 | parser.add_argument('--train-data', type=str, default='', 57 | help='location of the training data, should be a json file') 58 | parser.add_argument('--val-data', type=str, default='', 59 | help='location of the development data, should be a json file') 60 | parser.add_argument('--test-data', type=str, default='', 61 | help='location of the test data, should be a json file') 62 | parser.add_argument('--batch-size', type=int, default=32, 63 | help='batch size for training') 64 | parser.add_argument('--class-number', type=int, default=2, 65 | help='number of classes') 66 | parser.add_argument('--optimizer', type=str, default='Adam', 67 | help='type of optimizer') 68 | parser.add_argument('--penalization-coeff', type=float, default=1, 69 | help='the penalization coefficient') 70 | return parser.parse_args() 71 | 72 | 73 | def Frobenius(mat): 74 | size = mat.size() 75 | if len(size) == 3: # batched matrix 76 | ret = (torch.sum(torch.sum((mat ** 2), 1), 2).squeeze() + 1e-10) ** 0.5 77 | return torch.sum(ret) / size[0] 78 | else: 79 | raise Exception('matrix for computing Frobenius norm should be with 3 dims') 80 | 81 | 82 | def evaluate(): 83 | """evaluate the model while training""" 84 | model.eval() # turn on the eval() switch to disable dropout 85 | total_loss = 0 86 | total_correct = 0 87 | for batch, i in enumerate(range(0, len(data_val), args.batch_size)): 88 | data, targets = package(data_val[i:min(len(data_val), i+args.batch_size)], volatile=True) 89 | if args.cuda: 90 | data = data.cuda() 91 | targets = targets.cuda() 92 | hidden = model.init_hidden(data.size(1)) 93 | output, attention = model.forward(data, hidden) 94 | output_flat = output.view(data.size(1), -1) 95 | total_loss += criterion(output_flat, targets).data 96 | prediction = torch.max(output_flat, 1)[1] 97 | total_correct += torch.sum((prediction == targets).float()) 98 | return total_loss[0] / 
(len(data_val) // args.batch_size), total_correct.data[0] / len(data_val) 99 | 100 | 101 | def train(epoch_number): 102 | global best_val_loss, best_acc 103 | model.train() 104 | total_loss = 0 105 | total_pure_loss = 0 # without the penalization term 106 | start_time = time.time() 107 | for batch, i in enumerate(range(0, len(data_train), args.batch_size)): 108 | data, targets = package(data_train[i:i+args.batch_size], volatile=False) 109 | if args.cuda: 110 | data = data.cuda() 111 | targets = targets.cuda() 112 | hidden = model.init_hidden(data.size(1)) 113 | output, attention = model.forward(data, hidden) 114 | loss = criterion(output.view(data.size(1), -1), targets) 115 | total_pure_loss += loss.data 116 | 117 | if attention: # add penalization term 118 | attentionT = torch.transpose(attention, 1, 2).contiguous() 119 | extra_loss = Frobenius(torch.bmm(attention, attentionT) - I[:attention.size(0)]) 120 | loss += args.penalization_coeff * extra_loss 121 | optimizer.zero_grad() 122 | loss.backward() 123 | 124 | nn.utils.clip_grad_norm(model.parameters(), args.clip) 125 | optimizer.step() 126 | 127 | total_loss += loss.data 128 | 129 | if batch % args.log_interval == 0 and batch > 0: 130 | elapsed = time.time() - start_time 131 | print('| epoch {:3d} | {:5d}/{:5d} batches | ms/batch {:5.2f} | loss {:5.4f} | pure loss {:5.4f}'.format( 132 | epoch_number, batch, len(data_train) // args.batch_size, 133 | elapsed * 1000 / args.log_interval, total_loss[0] / args.log_interval, 134 | total_pure_loss[0] / args.log_interval)) 135 | total_loss = 0 136 | total_pure_loss = 0 137 | start_time = time.time() 138 | 139 | # for item in model.parameters(): 140 | # print item.size(), torch.sum(item.data ** 2), torch.sum(item.grad ** 2).data[0] 141 | # print model.encoder.ws2.weight.grad.data 142 | # exit() 143 | evaluate_start_time = time.time() 144 | val_loss, acc = evaluate() 145 | print('-' * 89) 146 | fmt = '| evaluation | time: {:5.2f}s | valid loss (pure) {:5.4f} | Acc {:8.4f}' 147 | print(fmt.format((time.time() - evaluate_start_time), val_loss, acc)) 148 | print('-' * 89) 149 | # Save the model, if the validation loss is the best we've seen so far. 150 | if not best_val_loss or val_loss < best_val_loss: 151 | with open(args.save, 'wb') as f: 152 | torch.save(model, f) 153 | f.close() 154 | best_val_loss = val_loss 155 | else: # if loss doesn't go down, divide the learning rate by 5. 156 | for param_group in optimizer.param_groups: 157 | param_group['lr'] = param_group['lr'] * 0.2 158 | if not best_acc or acc > best_acc: 159 | with open(args.save[:-3]+'.best_acc.pt', 'wb') as f: 160 | torch.save(model, f) 161 | f.close() 162 | best_acc = acc 163 | with open(args.save[:-3]+'.epoch-{:02d}.pt'.format(epoch_number), 'wb') as f: 164 | torch.save(model, f) 165 | f.close() 166 | 167 | 168 | if __name__ == '__main__': 169 | # parse the arguments 170 | args = get_args() 171 | 172 | # Set the random seed manually for reproducibility. 
173 | torch.manual_seed(args.seed) 174 | if torch.cuda.is_available(): 175 | if not args.cuda: 176 | print("WARNING: You have a CUDA device, so you should probably run with --cuda") 177 | else: 178 | torch.cuda.manual_seed(args.seed) 179 | random.seed(args.seed) 180 | 181 | # Load Dictionary 182 | assert os.path.exists(args.train_data) 183 | assert os.path.exists(args.val_data) 184 | print('Begin to load the dictionary.') 185 | dictionary = Dictionary(path=args.dictionary) 186 | 187 | best_val_loss = None 188 | best_acc = None 189 | 190 | n_token = len(dictionary) 191 | model = Classifier({ 192 | 'dropout': args.dropout, 193 | 'ntoken': n_token, 194 | 'nlayers': args.nlayers, 195 | 'nhid': args.nhid, 196 | 'ninp': args.emsize, 197 | 'pooling': 'all', 198 | 'attention-unit': args.attention_unit, 199 | 'attention-hops': args.attention_hops, 200 | 'nfc': args.nfc, 201 | 'dictionary': dictionary, 202 | 'word-vector': args.word_vector, 203 | 'class-number': args.class_number 204 | }) 205 | if args.cuda: 206 | model = model.cuda() 207 | 208 | print(args) 209 | I = Variable(torch.zeros(args.batch_size, args.attention_hops, args.attention_hops)) 210 | for i in range(args.batch_size): 211 | for j in range(args.attention_hops): 212 | I.data[i][j][j] = 1 213 | if args.cuda: 214 | I = I.cuda() 215 | 216 | criterion = nn.CrossEntropyLoss() 217 | if args.optimizer == 'Adam': 218 | optimizer = optim.Adam(model.parameters(), lr=args.lr, betas=[0.9, 0.999], eps=1e-8, weight_decay=0) 219 | elif args.optimizer == 'SGD': 220 | optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=0.9, weight_decay=0.01) 221 | else: 222 | raise Exception('For other optimizers, please add it yourself. ' 223 | 'supported ones are: SGD and Adam.') 224 | print('Begin to load data.') 225 | data_train = open(args.train_data).readlines() 226 | data_val = open(args.val_data).readlines() 227 | try: 228 | for epoch in range(args.epochs): 229 | train(epoch) 230 | except KeyboardInterrupt: 231 | print('-' * 89) 232 | print('Exit from training early.') 233 | data_val = open(args.test_data).readlines() 234 | evaluate_start_time = time.time() 235 | test_loss, acc = evaluate() 236 | print('-' * 89) 237 | fmt = '| test | time: {:5.2f}s | test loss (pure) {:5.4f} | Acc {:8.4f}' 238 | print(fmt.format((time.time() - evaluate_start_time), test_loss, acc)) 239 | print('-' * 89) 240 | exit(0) -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow >=1.5 2 | torch 3 | numpy 4 | jieba 5 | gensim 6 | -------------------------------------------------------------------------------- /tf/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/18 -------------------------------------------------------------------------------- /tf/bad_cases.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/10 5 | # !/usr/bin/env python 6 | import os 7 | 8 | import tensorflow as tf 9 | 10 | from dataset import Dataset 11 | from demo.train import FLAGS 12 | 13 | 14 | def bad_cases(): 15 | print("\nPredicting...\n") 16 | graph = tf.Graph() 17 | with graph.as_default(): # with tf.Graph().as_default() as g: 18 | sess = tf.Session() 19 | with sess.as_default(): 20 | # Load the 
saved meta graph and restore variables 21 | # saver = tf.train.Saver(tf.global_variables()) 22 | meta_file = os.path.abspath(os.path.join(FLAGS.model_dir, 'checkpoints/model-1000.meta')) 23 | new_saver = tf.train.import_meta_graph(meta_file) 24 | new_saver.restore(sess, tf.train.latest_checkpoint(os.path.join(FLAGS.model_dir, 'checkpoints'))) 25 | # graph = tf.get_default_graph() 26 | 27 | # Get the placeholders from the graph by name 28 | # input_x1 = graph.get_operation_by_name("input_x1").outputs[0] 29 | input_x1 = graph.get_tensor_by_name("input_x1:0") # Tensor("input_x1:0", shape=(?, 15), dtype=int32) 30 | input_x2 = graph.get_tensor_by_name("input_x2:0") 31 | dropout_keep_prob = graph.get_tensor_by_name("dropout_keep_prob:0") 32 | # Tensors we want to evaluate 33 | sim = graph.get_tensor_by_name("metrics/sim:0") 34 | y_pred = graph.get_tensor_by_name("metrics/y_pred:0") 35 | 36 | dev_sample = {} 37 | for line in open(FLAGS.data_file): 38 | line = line.strip().split('\t') 39 | dev_sample[line[0]] = line[1] 40 | 41 | # Generate batches for one epoch 42 | dataset = Dataset(data_file="data/pred.csv") 43 | x1, x2, y = dataset.process_data(sequence_length=FLAGS.max_document_length, is_training=False) 44 | with open("result/fp_file", 'w') as f_fp, open("result/fn_file", 'w') as f_fn: 45 | for lineno, (x1_online, x2_online, y_online) in enumerate(zip(x1, x2, y)): 46 | sim_val, y_pred_val = sess.run( 47 | [sim, y_pred], {input_x1: [x1_online], input_x2: [x2_online], dropout_keep_prob: 1.0}) 48 | if y_pred_val[0] == 1 and y_online == 0: # low precision 49 | f_fp.write(dev_sample[str(lineno + 1)] + str(sim_val[0]) + '\n') 50 | elif y_pred_val[0] == 0 and y_online == 1: # low recall 51 | f_fn.write(dev_sample[str(lineno + 1)] + str(sim_val[0]) + '\n') 52 | 53 | if __name__ == '__main__': 54 | # Set to INFO for tracking training, default is WARN.
ERROR for least messages 55 | tf.logging.set_verbosity(tf.logging.WARN) 56 | bad_cases() 57 | -------------------------------------------------------------------------------- /tf/dataset.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/6 5 | """This module provide an elegant data process class.""" 6 | from __future__ import unicode_literals 7 | 8 | import logging 9 | import multiprocessing 10 | import os 11 | import re 12 | import sys 13 | import time 14 | from collections import Counter 15 | 16 | import jieba 17 | import jieba.analyse 18 | import numpy as np 19 | import tensorflow as tf 20 | from gensim.models import Word2Vec 21 | 22 | sys.path.insert(0, '../') 23 | from utils.langconv import Converter 24 | reload(sys) 25 | sys.setdefaultencoding('utf-8') 26 | 27 | # jieba.enable_parallel(4) # This is a bug, make add_word no use 28 | jieba.load_userdict('../data/UserDict.txt') 29 | stopwords = ['的', '了'] 30 | 31 | 32 | class Dataset(object): 33 | """Custom dataset class to deal with input text data.""" 34 | def __init__(self, 35 | data_file='../data/atec_nlp_sim_train.csv', 36 | npy_char_data_file='../data/train_char.npy', 37 | npy_word_data_file='../data/train_word.npy', 38 | char_vocab_file='../data/vocab.char', 39 | word_vocab_file='../data/vocab.word', 40 | char2vec_file='../data/char_vec', 41 | word2vec_file='../data/word_vec', 42 | char_level=True, 43 | embedding_dim=128, 44 | is_training=True, 45 | ): 46 | self.data_file = data_file 47 | self.npy_char_data_file = npy_char_data_file 48 | self.npy_word_data_file = npy_word_data_file 49 | self.char_vocab_file = char_vocab_file 50 | self.word_vocab_file = word_vocab_file 51 | self.word2vec_file = word2vec_file 52 | self.char2vec_file = char2vec_file 53 | self.char_level = char_level 54 | self.embedding_dim = embedding_dim 55 | self.is_training = is_training 56 | if self.char_level: 57 | print('Using character level model.') 58 | else: 59 | print('Using word level model.') 60 | self.w2v_file = self.char2vec_file if self.char_level else self.word2vec_file 61 | self.vocab_file = self.char_vocab_file if self.char_level else self.word_vocab_file 62 | self.npy_file = self.npy_char_data_file if self.char_level else self.npy_word_data_file 63 | 64 | @staticmethod 65 | def _clean_text(text): 66 | """Text filter for Chinese corpus, keep CN character and remove stopwords.""" 67 | re_non_ch = re.compile(ur'[^\u4e00-\u9fa5]+') 68 | text = text.strip(' ') 69 | text = re_non_ch.sub('', text) 70 | for w in stopwords: 71 | text = re.sub(w, '', text) 72 | return text 73 | 74 | @staticmethod 75 | def _tradition2simple(text): 76 | """Tradition Chinese corpus to simplify Chinese.""" 77 | text = Converter('zh-hans').convert(text) 78 | return text 79 | 80 | def _load_data(self, data_file): 81 | """Load origin train data and do text pre-processing (converting and cleaning) 82 | Returns: 83 | A generator 84 | if self.is_training: 85 | train sentence pairs and labels (s1, s2, y). 86 | else: 87 | train sentence pairs and None (s1, s2, None). 
88 | """ 89 | for line in open(data_file): 90 | line = line.strip().decode('utf-8').split('\t') 91 | s1, s2 = map(self._clean_text, map(self._tradition2simple, line[1:3])) 92 | if not self.char_level: 93 | s1 = list(jieba.cut(s1)) 94 | s2 = list(jieba.cut(s2)) 95 | if self.is_training: 96 | y = int(line[-1]) # 1 or [1] 97 | yield s1, s2, y 98 | else: 99 | yield s1, s2, None # for consistent 100 | 101 | def _save_token_data(self): 102 | data_iter = self._load_data(self.data_file) 103 | with open('../data/atec_token.csv', 'w') as f: 104 | for s1, s2, _ in data_iter: 105 | f.write(' '.join(s1) + '|' + ' '.join(s2) + '\n') 106 | 107 | def _build_vocab(self, max_vocab_size=100000, min_count=2): 108 | """Build vocabulary list.""" 109 | data_iter = self._load_data(self.data_file) 110 | token = [] 111 | for s1, s2, _ in data_iter: 112 | if self.char_level: 113 | for words in s1+s2: 114 | for char in words: 115 | token.append(char) 116 | else: 117 | token.extend(s1+s2) 118 | print("Number of tokens: {}".format(len(token))) 119 | counter = Counter(token) 120 | word_count = counter.most_common(max_vocab_size - 1) # sort by word freq. 121 | vocab = ['UNK'] # for oov words 122 | vocab += [w[0] for w in word_count if w[1] >= min_count] 123 | vocab.append('') # add word '' for padding 124 | print("Vocabulary size: {}".format(len(vocab))) 125 | with open(self.vocab_file, 'w') as fo: 126 | fo.write('\n'.join(vocab)) 127 | 128 | def read_vocab(self): 129 | """Read vocabulary list 130 | Returns: 131 | tuple (id2word, word2id). 132 | """ 133 | if not os.path.exists(self.vocab_file): 134 | print('Vocabulary file not found. Building vocabulary...') 135 | self._build_vocab() 136 | else: 137 | print("Reading vocabulary file from {}".format(self.vocab_file)) 138 | id2word = open(self.vocab_file).read().split('\n') # list 139 | word2id = dict(zip(id2word, range(len(id2word)))) # dict 140 | return id2word, word2id 141 | 142 | def _word2vec(self, window=5, min_count=2): 143 | """Train and save word vectors""" 144 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 145 | s_time = time.time() 146 | s1, s2, _ = zip(*list(self._load_data(self.data_file))) 147 | sentences = s1 + s2 148 | size = self.embedding_dim 149 | # trim unneeded model memory = use(much) less RAM 150 | # model.init_sims(replace=True) 151 | model = Word2Vec(sentences, sg=1, size=size, window=window, min_count=min_count, 152 | negative=3, sample=0.001, hs=1, workers=multiprocessing.cpu_count(), iter=20) 153 | # model.save(output_model_file) 154 | model.wv.save_word2vec_format(self.w2v_file, binary=False) 155 | print("Word2vec training time: %d s" % (time.time() - s_time)) 156 | 157 | def load_word2vec(self): 158 | """mapping the words to word vectors. 159 | Returns: 160 | tuple (words, vectors) 161 | """ 162 | if not os.path.exists(self.w2v_file): 163 | print('Word vectors file not found. 
Training word vectors...') 164 | self._word2vec() 165 | words, vecs = [], [] 166 | fr = open(self.w2v_file) 167 | word_dim = int(fr.readline().strip().split(' ')[1]) # first line 168 | print("Pre-trained word vectors dim: {}".format(word_dim)) 169 | if word_dim != self.embedding_dim: 170 | print("Inconsistent word embedding dim, retrain word vectors...") 171 | self._word2vec() 172 | return self.load_word2vec() 173 | else: 174 | words.append("UNK") 175 | vecs.append([0] * word_dim) 176 | words.append("") 177 | vecs.append([0] * word_dim) 178 | for line in fr: 179 | line = line.decode('utf-8').strip().split(' ') 180 | words.append(line[0]) 181 | vecs.append(line[1:]) 182 | print("Loaded pre-trained word vectors.") 183 | return words, vecs 184 | 185 | def process_data(self, data_file, sequence_length=20): 186 | """Process text data file to word-id matrix representation. 187 | Args: 188 | data_file: process data file. 189 | sequence_length: int, max sequence length. (default 20) 190 | Returns: 191 | 2-D List. 192 | if self.is_training: 193 | each element of list is [s1_pad, s2_pad, y] 194 | else: 195 | each element of list is [s1_pad, s2_pad] 196 | """ 197 | if data_file == self.data_file and os.path.exists(self.npy_file): # only for all train data 198 | dataset = np.load(self.npy_file) 199 | # check sequence length same or not 200 | if len(dataset[0][0]) == sequence_length: 201 | print("Loaded saved npy word-id matrix train file.") 202 | return dataset 203 | else: 204 | print("Found inconsistent sequence length with npy file.") 205 | 206 | _, word2id = self.read_vocab() 207 | data_iter = self._load_data(data_file) 208 | dataset = [] 209 | print('Converting word-index matrix...') 210 | for s1, s2, y in data_iter: 211 | # oov words id is 0, token is either a single char or word. 212 | s1_id = [word2id.get(token, 0) for token in s1] 213 | s2_id = [word2id.get(token, 0) for token in s2] 214 | # "pre" or "post" important, "pre" much better, why ? 215 | s1_pad = tf.keras.preprocessing.sequence.pad_sequences( 216 | [s1_id], maxlen=sequence_length, padding='post', truncating='post', value=len(word2id)-1) 217 | s2_pad = tf.keras.preprocessing.sequence.pad_sequences( 218 | [s2_id], maxlen=sequence_length, padding='post', truncating='post', value=len(word2id)-1) 219 | # y = tf.keras.utils.to_categorical(y) # turn label into onehot 220 | if self.is_training: 221 | dataset.append([s1_pad[0], s2_pad[0], y]) 222 | else: 223 | dataset.append([s1_pad[0], s2_pad[0]]) 224 | print("Saving npy...") 225 | dataset = np.asarray(dataset) 226 | np.save(self.npy_file, dataset) 227 | # np.savez(save_file, x1=x1, x2=x2, y=y) # save multiple arrays as zip file. 228 | # np.savetxt(save_file, np.concatenate([x1, x2, y], axis=1), fmt="%d") # or use np.hstack() 229 | return dataset 230 | 231 | @staticmethod 232 | def train_test_split(dataset, test_size=0.2, random_seed=123): 233 | """Split train data into train and test sets. 234 | Args: 235 | dataset: 2-D list, each element is a sample list [x1, x2, y, len(s1), len(s2)] 236 | test_size: float, int. (default 0.2) 237 | If float, should be between 0.0 and 1.0 and represent the proportion of test set. 238 | If int, represents the absolute number of test samples. 239 | random_seed: int or None. (default 123) 240 | If None, do not use random seed. 
241 | Returns 242 | A tuple (trainset, testset) 243 | """ 244 | dataset = np.asarray(dataset) 245 | num_samples = len(dataset) 246 | test_size = int(num_samples * test_size) if isinstance(test_size, float) else test_size 247 | print('Total number of samples: {}'.format(num_samples)) 248 | print('Test data size: {}'.format(test_size)) 249 | if random_seed: 250 | np.random.seed(random_seed) 251 | shuffle_indices = np.random.permutation(np.arange(num_samples)) 252 | dataset_shuffled = dataset[shuffle_indices] 253 | trainset = dataset_shuffled[test_size:] 254 | testset = dataset_shuffled[:test_size] 255 | print('Train eval data split done.') 256 | return trainset, testset 257 | 258 | @staticmethod 259 | def batch_iter(dataset, batch_size, num_epochs, shuffle=True): 260 | """Generates a batch iterator for a dataset. 261 | Args: 262 | dataset: 2-D list, each element is a sample list [x1, x2, y] 263 | Returns: 264 | list of batch samples [x1, x2, y]. 265 | use zip(*return) to generate x1_batch, x2_batch, y_batch 266 | """ 267 | dataset = np.asarray(dataset) 268 | data_size = len(dataset) 269 | num_batches_per_epoch = int((len(dataset)-1)/batch_size) + 1 270 | for epoch in range(num_epochs): 271 | # Shuffle the data at each epoch 272 | if shuffle: 273 | shuffle_indices = np.random.permutation(np.arange(data_size)) 274 | shuffled_data = dataset[shuffle_indices] 275 | else: 276 | shuffled_data = dataset 277 | for batch_num in range(num_batches_per_epoch): 278 | start_index = batch_num * batch_size 279 | end_index = min((batch_num + 1) * batch_size, data_size) 280 | yield shuffled_data[start_index:end_index] 281 | 282 | 283 | if __name__ == '__main__': 284 | d_char = Dataset(char_level=True) 285 | d_word = Dataset(char_level=False, embedding_dim=128) 286 | # s1, s2, y = d_word._load_data('../data/train.csv').next() 287 | 288 | # d_word._build_vocab() 289 | d_word._save_token_data() 290 | # id2w, w2id = d_word.read_vocab() 291 | # dataset = Dataset().process_data('../data/atec_nlp_sim_train.csv') 292 | # data = Dataset().batch_iter(dataset, 5, 1, shuffle=False).next() 293 | # print(data) 294 | # d_word.load_word2vec() 295 | 296 | 297 | 298 | 299 | 300 | -------------------------------------------------------------------------------- /tf/encoder.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/6/8 5 | """This module contains two kinds of encoders: CNNEncoder and RNNEncoder.""" 6 | import tensorflow as tf 7 | 8 | 9 | class CNNEncoder(object): 10 | 11 | def __init__(self, sequence_length, embedding_dim, filter_sizes, num_filters): 12 | self._sequence_length = sequence_length 13 | self._embedding_dim = embedding_dim 14 | self._filter_sizes = filter_sizes 15 | self._num_filters = num_filters 16 | 17 | def forward(self, x, scope="CNN"): 18 | with tf.variable_scope(scope, reuse=tf.AUTO_REUSE): 19 | # Create a convolution + maxpool layer for each filter size 20 | x = tf.expand_dims(x, -1) # shape(batch_size, seq_len, dim, 1) 21 | pooled_outputs = [] 22 | for i, filter_size in enumerate(self._filter_sizes): 23 | with tf.variable_scope("conv-maxpool-%s" % filter_size, reuse=None): 24 | # Convolution Layer 25 | filter_shape = [filter_size, self._embedding_dim, 1, self._num_filters] 26 | W = tf.get_variable("W", filter_shape, initializer=tf.truncated_normal_initializer(stddev=0.1)) 27 | b = tf.get_variable("bias", [self._num_filters], initializer=tf.constant_initializer(0.1)) 28 | conv = 
tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding="VALID", name="conv") 29 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 30 | # Maxpooling over the outputs 31 | pooled = tf.nn.max_pool(h, ksize=[1, self._sequence_length - filter_size + 1, 1, 1], 32 | strides=[1, 1, 1, 1], padding='VALID', name="pool") 33 | pooled_outputs.append(pooled) 34 | # Combine all the pooled features 35 | num_filters_total = self._num_filters * len(self._filter_sizes) 36 | h_pool = tf.concat(pooled_outputs, 3) 37 | h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total]) # very sparse ! 38 | 39 | # very important, very sensitive to dropout rate 0.7 good! 40 | with tf.name_scope("dropout"): 41 | h_drop = tf.nn.dropout(h_pool_flat, 0.7) 42 | 43 | # very important, necessary 44 | with tf.name_scope("output"): 45 | W = tf.get_variable("W", shape=[num_filters_total, 128], 46 | initializer=tf.contrib.layers.xavier_initializer()) 47 | b = tf.Variable(tf.constant(0.1, shape=[128]), name="b") 48 | outputs = tf.nn.xw_plus_b(h_drop, W, b, name="outputs") 49 | return outputs 50 | 51 | 52 | class RNNEncoder(object): 53 | 54 | def __init__(self, rnn_cell, hidden_units, num_layers, dropout_keep_prob, use_dynamic, use_attention): 55 | self._rnn_cell = rnn_cell 56 | self._hidden_units = hidden_units 57 | self._num_layers = num_layers 58 | self._dropout_keep_prob = dropout_keep_prob 59 | self._use_dynamic = use_dynamic 60 | self._use_attention = use_attention 61 | 62 | def forward(self, x, sequence_length=None, scope="RNN"): 63 | rnn = tf.nn.rnn_cell 64 | with tf.variable_scope(scope, reuse=tf.AUTO_REUSE): # initializer=tf.orthogonal_initializer(), 65 | # scope.reuse_variables() # or tf.get_variable_scope().reuse_variables() 66 | # current_batch_of_words does not correspond to a "sentence" of words 67 | # but [t_steps, batch_size, num_features] 68 | # Unpacks the given dimension of a rank-`R` tensor into rank-`(R-1)` tensors. 69 | # sequence_length list tensors of shape (batch_size, embedding_dim) 70 | if not self._use_dynamic: 71 | x = tf.unstack(tf.transpose(x, perm=[1, 0, 2])) # `static_rnn` input 72 | if self._rnn_cell.lower() == 'lstm': 73 | rnn_cell = rnn.LSTMCell 74 | elif self._rnn_cell.lower() == 'gru': 75 | rnn_cell = rnn.GRUCell 76 | elif self._rnn_cell.lower() == 'rnn': 77 | rnn_cell = rnn.BasicRNNCell 78 | else: 79 | raise ValueError("Invalid rnn_cell type.") 80 | 81 | with tf.variable_scope("fw"): 82 | # state(c, h), tf.nn.rnn_cell.BasicLSTMCell does not support gradient clipping, use tf.nn.rnn_cell.LSTMCell. 
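# Clarification of the comment above: the clipping that tf.nn.rnn_cell.LSTMCell
# supports (and BasicLSTMCell lacks) is the cell-state clip exposed through its
# `cell_clip` constructor argument, e.g. (illustrative value only, not used in this repo):
#     fw_cell = rnn.LSTMCell(self._hidden_units, cell_clip=5.0)
# Gradient clipping proper operates on the gradients with tf.clip_by_global_norm
# (see tf/train.py) and works with either cell type.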
83 | # fw_cells = [rnn_cell(hidden_units) for _ in range(num_layers)] 84 | fw_cells = [] 85 | for _ in range(self._num_layers): 86 | fw_cell = rnn_cell(self._hidden_units) 87 | fw_cell = rnn.DropoutWrapper(fw_cell, output_keep_prob=self._dropout_keep_prob, 88 | variational_recurrent=False, dtype=tf.float32) 89 | fw_cells.append(fw_cell) 90 | fw_cells = rnn.MultiRNNCell(cells=fw_cells, state_is_tuple=True) 91 | with tf.variable_scope("bw"): 92 | bw_cells = [] 93 | for _ in range(self._num_layers): 94 | bw_cell = rnn_cell(self._hidden_units) 95 | bw_cell = rnn.DropoutWrapper(bw_cell, output_keep_prob=self._dropout_keep_prob, 96 | variational_recurrent=False, dtype=tf.float32) 97 | bw_cells.append(bw_cell) 98 | bw_cells = rnn.MultiRNNCell(cells=bw_cells, state_is_tuple=True) 99 | 100 | if self._use_dynamic: 101 | # [batch_size, max_time, cell_fw.output_size] 102 | outputs, output_states = tf.nn.bidirectional_dynamic_rnn( 103 | fw_cells, bw_cells, x, sequence_length=sequence_length, dtype=tf.float32) 104 | outputs = tf.concat(outputs, 2) 105 | if self._rnn_cell.lower() == 'lstm': 106 | out = tf.concat([output_states[-1][0].h, output_states[-1][1].h], 1) 107 | else: 108 | out = tf.concat([output_states[-1][0], output_states[-1][1]], 1) 109 | # outputs = outputs[:, -1, :] # take last hidden states (batch_size, 2*hidden_units) 110 | # outputs = self._last_relevant(outputs, sequence_length) 111 | else: 112 | # `static_rnn` Returns: A tuple (outputs, output_state_fw, output_state_bw) 113 | # outputs is a list of timestep outputs, depth-concatenated forward and backward outputs. 114 | outputs, state_fw, state_bw = tf.nn.static_bidirectional_rnn( 115 | fw_cells, bw_cells, x, dtype=tf.float32, sequence_length=sequence_length) 116 | outputs = tf.transpose(tf.stack(outputs), perm=[1, 0, 2]) 117 | if self._rnn_cell.lower() == 'lstm': 118 | out = tf.concat([state_fw[-1].h, state_bw[-1].h], 1) # good 119 | else: 120 | out = tf.concat([state_fw[-1], state_bw[-1]], 1) 121 | # outputs = tf.reduce_mean(outputs, 0) # average [batch_size, hidden_units] (mean pooling) 122 | # outputs = tf.reduce_max(outputs, axis=0) # max pooling, bad result. 
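# Another option (not used in this repo) is length-masked mean pooling: the commented
# reduce_mean/reduce_max variants above pool over padded positions as well, while a
# mask restricts pooling to real timesteps. Sketch, assuming `outputs` has shape
# (batch_size, max_time, 2*hidden_units) and `sequence_length` holds the true lengths:
#     mask = tf.sequence_mask(sequence_length, tf.shape(outputs)[1], dtype=tf.float32)
#     outputs_sum = tf.reduce_sum(outputs * tf.expand_dims(mask, -1), axis=1)
#     out = outputs_sum / tf.expand_dims(tf.cast(sequence_length, tf.float32) + 1e-8, -1)  # epsilon avoids /0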
123 | # outputs = outputs[-1] # take last hidden state [batch_size, hidden_units] 124 | # outputs = tf.transpose(tf.stack(outputs), [1, 0, 2]) # shape(batch_size, seq_len, hidden_units) 125 | # outputs = self._last_relevant(outputs, sequence_length) 126 | if self._use_attention: 127 | d_a = 300 128 | r = 2 129 | self.H = outputs 130 | batch_size = tf.shape(x)[0] 131 | initializer = tf.contrib.layers.xavier_initializer() 132 | with tf.variable_scope("attention"): # TODO: Nan in summary histogram for: RNN/attention/W_s2_0/grad/hist 133 | # shape(W_s1) = d_a * 2u 134 | self.W_s1 = tf.get_variable('W_s1', shape=[d_a, 2 * self._hidden_units], initializer=initializer) 135 | # shape(W_s2) = r * d_a 136 | self.W_s2 = tf.get_variable('W_s2', shape=[r, d_a], initializer=initializer) 137 | # shape (d_a, 2u) --> shape(batch_size, d_a, 2u) 138 | self.W_s1 = tf.tile(tf.expand_dims(self.W_s1, 0), [batch_size, 1, 1]) 139 | self.W_s2 = tf.tile(tf.expand_dims(self.W_s2, 0), [batch_size, 1, 1]) 140 | # attention matrix A = softmax(W_s2*tanh(W_s1*H^T) shape(A) = batch_siz * r * n 141 | self.H_T = tf.transpose(self.H, perm=[0, 2, 1], name="H_T") 142 | self.A = tf.nn.softmax( 143 | tf.matmul(self.W_s2, tf.tanh(tf.matmul(self.W_s1, self.H_T)), name="A")) 144 | # sentences embedding matrix M = AH shape(M) = (batch_size, r, 2u) 145 | self.M = tf.matmul(self.A, self.H, name="M") 146 | out = tf.reshape(self.M, [batch_size, -1]) 147 | 148 | with tf.variable_scope("penalization"): 149 | # penalization term: Frobenius norm square of matrix AA^T-I, ie. P = |AA^T-I|_F^2 150 | A_T = tf.transpose(self.A, perm=[0, 2, 1], name="A_T") 151 | I = tf.eye(r, r, batch_shape=[batch_size], name="I") 152 | self.P = tf.square(tf.norm(tf.matmul(self.A, A_T) - I, axis=[-2, -1], ord='fro'), name="P") 153 | return out 154 | 155 | @staticmethod 156 | def _last_relevant(outputs, sequence_length): 157 | """Deprecated""" 158 | batch_size = tf.shape(outputs)[0] 159 | max_length = outputs.get_shape()[1] 160 | output_size = outputs.get_shape()[2] 161 | index = tf.range(0, batch_size) * max_length + (sequence_length - 1) 162 | flat = tf.reshape(outputs, [-1, output_size]) 163 | last_timesteps = tf.gather(flat, index) # very slow 164 | # mask = tf.sign(index) 165 | # last_timesteps = tf.boolean_mask(flat, mask) 166 | # # Creating a vector of 0s and 1s that will specify what timesteps to choose. 167 | # partitions = tf.reduce_sum(tf.one_hot(index, tf.shape(flat)[0], dtype='int32'), 0) 168 | # # Selecting the elements we want to choose. 
169 | # _, last_timesteps = tf.dynamic_partition(flat, partitions, 2) # (batch_size, n_dim) 170 | # https://stackoverflow.com/questions/35892412/tensorflow-dense-gradient-explanation 171 | return last_timesteps 172 | 173 | if __name__ == '__main__': 174 | x1 = tf.placeholder(tf.int32, [None, 20], name="input_x1") 175 | x2 = tf.placeholder(tf.int32, [None, 20], name="input_x2") 176 | cnn_encoder = CNNEncoder( 177 | sequence_length=20, 178 | embedding_dim=128, 179 | filter_sizes=[3,4,5], 180 | num_filters=100, 181 | ) 182 | rnn_encoder = RNNEncoder( 183 | rnn_cell='lstm', 184 | hidden_units=100, 185 | num_layers=2, 186 | dropout_keep_prob=0.7, 187 | use_dynamic=False, 188 | use_attention=False, 189 | ) 190 | 191 | 192 | 193 | -------------------------------------------------------------------------------- /tf/pred.py: -------------------------------------------------------------------------------- 1 | # !/usr/bin/env python 2 | import os 3 | import sys 4 | 5 | import tensorflow as tf 6 | 7 | from dataset import Dataset 8 | from demo.train import FLAGS 9 | 10 | FLAGS.model_dir = '/home/hongquan/atec_nlp/model/rnn/' 11 | FLAGS.max_document_length = 20 12 | 13 | 14 | def main(input_file, output_file): 15 | print("\nPredicting...\n") 16 | graph = tf.Graph() 17 | with graph.as_default(): # with tf.Graph().as_default() as g: 18 | sess = tf.Session() 19 | with sess.as_default(): 20 | # Load the saved meta graph and restore variables 21 | # saver = tf.train.Saver(tf.global_variables()) 22 | meta_file = os.path.abspath(os.path.join(FLAGS.model_dir, 'checkpoints/model-3400.meta')) 23 | new_saver = tf.train.import_meta_graph(meta_file) 24 | new_saver.restore(sess, tf.train.latest_checkpoint(os.path.join(FLAGS.model_dir, 'checkpoints'))) 25 | # graph = tf.get_default_graph() 26 | 27 | # Get the placeholders from the graph by name 28 | # input_x1 = graph.get_operation_by_name("input_x1").outputs[0] 29 | input_x1 = graph.get_tensor_by_name("input_x1:0") # Tensor("input_x1:0", shape=(?, 15), dtype=int32) 30 | input_x2 = graph.get_tensor_by_name("input_x2:0") 31 | dropout_keep_prob = graph.get_tensor_by_name("dropout_keep_prob:0") 32 | # Tensors we want to evaluate 33 | y_pred = graph.get_tensor_by_name("metrics/y_pred:0") 34 | # vars = tf.get_collection('vars') 35 | # for var in vars: 36 | # print(var) 37 | 38 | e = graph.get_tensor_by_name("cosine:0") 39 | 40 | # Generate batches for one epoch 41 | dataset = Dataset(data_file=input_file, is_training=False) 42 | data = dataset.process_data(data_file=input_file, sequence_length=FLAGS.max_document_length) 43 | batches = dataset.batch_iter(data, FLAGS.batch_size, 1, shuffle=False) 44 | with open(output_file, 'w') as fo: 45 | lineno = 1 46 | for batch in batches: 47 | x1_batch, x2_batch, _, _ = zip(*batch) 48 | y_pred_ = sess.run([y_pred], {input_x1: x1_batch, input_x2: x2_batch, dropout_keep_prob: 1.0}) 49 | for pred in y_pred_[0]: 50 | fo.write('{}\t{}\n'.format(lineno, pred)) 51 | lineno += 1 52 | 53 | if __name__ == '__main__': 54 | # Set to INFO for tracking training, default is WARN. 
ERROR for least messages 55 | tf.logging.set_verbosity(tf.logging.WARN) 56 | main(sys.argv[1], sys.argv[2]) 57 | -------------------------------------------------------------------------------- /tf/siamese_net.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/6/10 5 | """Siamese Similarity Network regard this task as a sentence similarity problem; 6 | Siamese Classification Network regard this task as a text classification problem. 7 | 8 | References: 9 | Learning Text Similarity with Siamese Recurrent Networks, 2016 10 | Siamese Recurrent Architectures for Learning Sentence Similarity, 2016 11 | """ 12 | import tensorflow as tf 13 | 14 | 15 | class SiameseNets(object): 16 | """Siamese base nets, input embedding and encoder layer output. """ 17 | def __init__(self, input_x1, input_x2, word_embedding_type, vocab_size, embedding_size, 18 | encoder_type, cnn_encoder, rnn_encoder, dense_layer, weight_sharing): 19 | """ 20 | Args: 21 | cnn_encoder: instance of CNNEncoder 22 | rnn_encoder: instance of RNNEncoder 23 | """ 24 | # input word level dropout, data augmentation, invariance to small input change 25 | # self.shape = tf.shape(self.input_x1) 26 | # self.mask1 = tf.cast(tf.random_uniform(self.shape) > 0.1, tf.int32) 27 | # self.mask2 = tf.cast(tf.random_uniform(self.shape) > 0.1, tf.int32) 28 | # self.input_x1 = self.input_x1 * self.mask1 29 | # self.input_x2 = self.input_x2 * self.mask2 30 | self._encoder_type = encoder_type 31 | self._rnn_encoder = rnn_encoder 32 | seqlen1 = tf.cast(tf.reduce_sum(tf.sign(input_x1), 1), tf.int32) 33 | seqlen2 = tf.cast(tf.reduce_sum(tf.sign(input_x2), 1), tf.int32) 34 | assert word_embedding_type in {'rand', 'static', 'non-static'}, 'Invalid word embedding type' 35 | with tf.variable_scope("embedding"): 36 | if word_embedding_type == "rand": 37 | self.W = tf.Variable( 38 | tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), 39 | trainable=True, name="W") # tf.truncated_normal() 40 | else: 41 | trainable = False if word_embedding_type == "static" else True 42 | self.W = tf.Variable( 43 | tf.constant(0.0, shape=[vocab_size, embedding_size]), 44 | trainable=trainable, name="W") 45 | # embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_size]) 46 | # self.embedding_init = self.W.assign(embedding_placeholder) 47 | self.embedded_1 = tf.nn.embedding_lookup(self.W, input_x1) 48 | self.embedded_2 = tf.nn.embedding_lookup(self.W, input_x2) 49 | # Input embedding dropout. very sensitive to the dropout rate ! 
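# Note on the two dropout calls below: the keep probability is hard-coded to 0.7,
# so this embedding dropout also fires at prediction time. Wiring it to the
# dropout_keep_prob placeholder created in tf/train.py would disable it during
# inference; a sketch (assumes the placeholder were passed into this constructor,
# which it currently is not):
#     self.embedded_1 = tf.nn.dropout(self.embedded_1, dropout_keep_prob)
#     self.embedded_2 = tf.nn.dropout(self.embedded_2, dropout_keep_prob)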
50 | self.embedded_1 = tf.nn.dropout(self.embedded_1, 0.7) 51 | self.embedded_2 = tf.nn.dropout(self.embedded_2, 0.7) 52 | if weight_sharing: 53 | cnn_scope1, cnn_scope2, rnn_scope1, rnn_scope2 = "CNN", "CNN", "RNN", "RNN" 54 | else: 55 | cnn_scope1, cnn_scope2, rnn_scope1, rnn_scope2 = "CNN1", "CNN2", "RNN1", "RNN2" 56 | if encoder_type.lower() == 'cnn': 57 | self.out1 = cnn_encoder.forward(self.embedded_1, cnn_scope1) 58 | self.out2 = cnn_encoder.forward(self.embedded_2, cnn_scope2) 59 | elif encoder_type.lower() == 'rnn': 60 | self.out1 = rnn_encoder.forward(self.embedded_1, seqlen1, rnn_scope1) 61 | self.out2 = rnn_encoder.forward(self.embedded_2, seqlen2, rnn_scope2) 62 | elif encoder_type.lower() == 'rcnn': 63 | cnn_out1 = cnn_encoder.forward(self.embedded_1, cnn_scope1) 64 | cnn_out2 = cnn_encoder.forward(self.embedded_2, cnn_scope2) 65 | rnn_out1 = rnn_encoder.forward(self.embedded_1, seqlen1, rnn_scope1) 66 | rnn_out2 = rnn_encoder.forward(self.embedded_2, seqlen2, rnn_scope2) 67 | self.out1 = tf.concat([cnn_out1, rnn_out1], axis=1) 68 | self.out2 = tf.concat([cnn_out2, rnn_out2], axis=1) 69 | else: 70 | raise ValueError("Invalid encoder type.") 71 | 72 | if dense_layer: 73 | with tf.variable_scope("fc"): 74 | out_dim = self.out1.get_shape().as_list()[1] 75 | W1 = tf.get_variable("W1", shape=[out_dim, 128], initializer=tf.contrib.layers.xavier_initializer()) 76 | b1 = tf.Variable(tf.constant(0.1, shape=[128]), name="b1") 77 | W2 = tf.get_variable("W2", shape=[out_dim, 128], initializer=tf.contrib.layers.xavier_initializer()) 78 | b2 = tf.Variable(tf.constant(0.1, shape=[128]), name="b2") 79 | self.out1 = tf.nn.xw_plus_b(self.out1, W1, b1, name="out1") 80 | self.out2 = tf.nn.xw_plus_b(self.out2, W2, b2, name="out2") 81 | 82 | @property 83 | def variables(self): 84 | # for v in tf.trainable_variables(): 85 | # print(v) 86 | return tf.global_variables() 87 | 88 | 89 | class SiameseSimilarityNets(SiameseNets): 90 | """A siamese based deep network for text similarity. 
91 | Use a character/word level embedding layer, followed by a {`BiLSTM`, `CNN`, `combine`} encoder layer, 92 | then use euclidean distance/cosine/manhattan distance to measure similarity""" 93 | def __init__(self, input_x1, input_x2, input_y, 94 | word_embedding_type, vocab_size, embedding_size, 95 | encoder_type, cnn_encoder, rnn_encoder, dense_layer, 96 | l2_reg_lambda, pred_threshold, energy_func, loss_func='contrasive', 97 | margin=0.0, contrasive_loss_pos_weight=1.0, weight_sharing=True): 98 | self.input_y = input_y 99 | self._l2_reg_lambda = l2_reg_lambda 100 | self._pred_threshold = pred_threshold 101 | self._energy_func = energy_func 102 | self._loss_func = loss_func 103 | self._margin = margin 104 | self._contrastive_loss_pos_weight = contrasive_loss_pos_weight 105 | super(SiameseSimilarityNets, self).__init__( 106 | input_x1, input_x2, word_embedding_type, vocab_size, embedding_size, 107 | encoder_type, cnn_encoder, rnn_encoder, dense_layer, weight_sharing) 108 | 109 | def forward(self): 110 | # out1_norm = tf.sqrt(tf.reduce_sum(tf.square(self.out1), 1)) 111 | # out2_norm = tf.sqrt(tf.reduce_sum(tf.square(self.out2), 1)) 112 | # self.distance = tf.sqrt(tf.reduce_sum(tf.square(self.out1 - self.out2), 1, keepdims=False)) 113 | distance = tf.norm(self.out1-self.out2, ord='euclidean', axis=1, keepdims=False, name='euc-distance') 114 | distance = tf.div(distance, tf.add(tf.norm(self.out1, 2, axis=1), tf.norm(self.out2, 2, axis=1))) 115 | self.sim_euc = tf.subtract(1.0, distance, name="euc") 116 | 117 | # self.sim = tf.reduce_sum(tf.multiply(self.out1, self.out2), 1) / tf.multiply(out1_norm, out2_norm) 118 | out1_norm = tf.nn.l2_normalize(self.out1, 1) # output = x / sqrt(max(sum(x**2), epsilon)) 119 | out2_norm = tf.nn.l2_normalize(self.out2, 1) 120 | self.sim_cos = tf.reduce_sum(tf.multiply(out1_norm, out2_norm), axis=1, name="cosine") 121 | # sim = exp(-||x1-x2||) range (0, 1] 122 | # self.sim_ma = tf.exp(-tf.reduce_sum(tf.abs(self.out1 - self.out2), 1), name="manhattan") 123 | self.sim_ma = tf.exp(-tf.norm(self.out1-self.out2, 1, 1), name="manhattan") 124 | 125 | if self._energy_func == 'euclidean': 126 | self.sim = self.sim_euc 127 | elif self._energy_func == 'cosine': 128 | self.sim = self.sim_cos 129 | elif self._energy_func == 'exp_manhattan': 130 | self.sim = self.sim_ma 131 | elif self._energy_func == 'combine': 132 | w = tf.Variable(1, dtype=tf.float32) 133 | self.sim = w * self.sim_euc + (1 - w) * self.sim_cos 134 | else: 135 | raise ValueError("Invalid energy function name.") 136 | self.y_pred = tf.cast(tf.greater(self.sim, self._pred_threshold), dtype=tf.float32, name="y_pred") 137 | 138 | with tf.name_scope("loss"): 139 | if self._loss_func == 'contrasive': 140 | self.loss = self.contrastive_loss(self.input_y, self.sim) 141 | elif self._loss_func == 'cross_entrophy': 142 | self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.input_y, logits=self.sim)) 143 | # add l2 reg except bias anb BN variables. 144 | self.l2 = self._l2_reg_lambda * tf.reduce_sum( 145 | [tf.nn.l2_loss(v) for v in tf.trainable_variables() if not ("noreg" in v.name or "bias" in v.name)]) 146 | self.loss += self.l2 147 | if self._encoder_type != 'cnn' and self._rnn_encoder._use_attention: 148 | self.loss += tf.reduce_mean(self._rnn_encoder.P) 149 | 150 | # Accuracy computation is outside of this class. 
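# For reference, the contrastive loss selected above (defined in contrastive_loss()
# below) computes, for similarity e, label y, pos_weight w and margin m:
#     loss = mean( y * w * (1 - e)^2 + (1 - y) * max(e - m, 0)^2 )
# A NumPy check on hypothetical toy values (not from the data):
#     y = np.array([1., 0.]); e = np.array([0.9, 0.3]); w, m = 1.0, 0.0
#     l1 = w * (1 - e) ** 2                 # [0.01, 0.49]
#     l0 = np.maximum(e - m, 0) ** 2        # [0.81, 0.09]
#     np.mean(y * l1 + (1 - y) * l0)        # (0.01 + 0.09) / 2 = 0.05
# so similar pairs are pushed toward e = 1 and dissimilar pairs toward e <= m.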
151 | # self.accuracy = tf.reduce_mean(tf.cast(tf.equal(self.y_pred, self.input_y), tf.float32), name="accuracy") 152 | TP = tf.count_nonzero(self.input_y * self.y_pred, dtype=tf.float32) 153 | TN = tf.count_nonzero((self.input_y - 1) * (self.y_pred - 1), dtype=tf.float32) 154 | FP = tf.count_nonzero(self.y_pred * (self.input_y - 1), dtype=tf.float32) 155 | FN = tf.count_nonzero((self.y_pred - 1) * self.input_y, dtype=tf.float32) 156 | # tf.div like python2 division, tf.divide like python3 157 | self.acc = tf.divide(TP + TN, TP + TN + FP + FN, name="accuracy") 158 | self.precision = tf.divide(TP, TP + FP, name="precision") 159 | self.recall = tf.divide(TP, TP + FN, name="recall") 160 | self.cm = tf.confusion_matrix(self.input_y, self.y_pred, name="confusion_matrix") 161 | # tf.assert_equal(self.acc, self.acc_) 162 | # https://github.com/tensorflow/tensorflow/issues/15115, be careful! 163 | # _, self.acc = tf.metrics.accuracy(self.input_y, self.y_pred) 164 | # _, self.precision = tf.metrics.precision(self.input_y, self.y_pred, name='precision') 165 | # _, self.recall = tf.metrics.recall(self.input_y, self.y_pred, name='recall') 166 | self.f1 = tf.divide(2 * self.precision * self.recall, self.precision + self.recall, name="F1_score") 167 | 168 | def contrastive_loss(self, y, e): 169 | # margin and pos_weight can directly influence P and R metrics. 170 | l_1 = self._contrastive_loss_pos_weight * tf.pow(1-e, 2) 171 | l_0 = tf.square(tf.maximum(e-self._margin, 0)) 172 | loss = tf.reduce_mean(y * l_1 + (1 - y) * l_0) 173 | return loss 174 | 175 | 176 | class SiameseClassificationNets(SiameseNets): 177 | """A Siamese based deep network for text similarity. 178 | Uses character/word level embedding layer, followed by a {`BiLSTM`, `CNN`, `combine`} encoder layer, 179 | then use multiply/concat interaction to feed for classification layers. 
180 | """ 181 | def __init__(self, input_x1, input_x2, input_y, 182 | word_embedding_type, vocab_size, embedding_size, 183 | encoder_type, cnn_encoder, rnn_encoder, dense_layer, 184 | l2_reg_lambda, interaction="multiply", weight_sharing=True): 185 | self.input_y = input_y 186 | self._l2_reg_lambda = l2_reg_lambda 187 | self._interaction = interaction 188 | super(SiameseClassificationNets, self).__init__( 189 | input_x1, input_x2, word_embedding_type, vocab_size, embedding_size, 190 | encoder_type, cnn_encoder, rnn_encoder, dense_layer, weight_sharing) 191 | 192 | def forward(self): 193 | if self._interaction == 'concat': 194 | self.out = tf.concat([self.out1, self.out2], axis=1, name="out") 195 | elif self._interaction == 'multiply': 196 | self.out = tf.multiply(self.out1, self.out2, name="out") 197 | fc = tf.layers.dense(self.out, 128, name='fc1', activation=tf.nn.relu) 198 | # self.scores = tf.layers.dense(self.fc, 1, activation=tf.nn.sigmoid) 199 | self.logits = tf.layers.dense(fc, 2, name='fc2') 200 | # self.y_pred = tf.round(tf.nn.sigmoid(self.logits), name="predictions") # pred class 201 | self.y_pred = tf.cast(tf.argmax(tf.nn.sigmoid(self.logits), 1, name="predictions"), tf.float32) 202 | 203 | with tf.name_scope("loss"): 204 | # [batch_size, num_classes] 205 | y = tf.one_hot(tf.cast(self.input_y, tf.int32), 2) 206 | cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=self.logits, labels=y) 207 | self.loss = tf.reduce_mean(cross_entropy) 208 | # self.loss = tf.losses.sigmoid_cross_entropy(logits=self.logits, multi_class_labels=y) 209 | 210 | # y = self.input_y 211 | # y_ = self.scores 212 | # self.loss = -tf.reduce_mean(pos_weight * y * tf.log(tf.clip_by_value(y_, 1e-10, 1.0)) 213 | # + (1-y) * tf.log(tf.clip_by_value(1-y_, 1e-10, 1.0))) 214 | # add l2 reg except bias anb BN variables. 215 | self.l2 = self._l2_reg_lambda * tf.reduce_sum( 216 | [tf.nn.l2_loss(v) for v in tf.trainable_variables() if not ("noreg" in v.name or "bias" in v.name)]) 217 | self.loss += self.l2 218 | 219 | # Accuracy computation is outside of this class. 
220 | with tf.name_scope("metrics"): 221 | TP = tf.count_nonzero(self.input_y * self.y_pred, dtype=tf.float32) 222 | TN = tf.count_nonzero((self.input_y - 1) * (self.y_pred - 1), dtype=tf.float32) 223 | FP = tf.count_nonzero(self.y_pred * (self.input_y - 1), dtype=tf.float32) 224 | FN = tf.count_nonzero((self.y_pred - 1) * self.input_y, dtype=tf.float32) 225 | # tf.div like python2 division, tf.divide like python3 226 | self.cm = tf.confusion_matrix(self.input_y, self.y_pred, name="confusion_matrix") 227 | self.acc = tf.divide(TP + TN, TP + TN + FP + FN, name="accuracy") 228 | self.precision = tf.divide(TP, TP + FP, name="precision") 229 | self.recall = tf.divide(TP, TP + FN, name="recall") 230 | self.f1 = tf.divide(2 * self.precision * self.recall, self.precision + self.recall, name="F1_score") 231 | 232 | 233 | if __name__ == '__main__': 234 | from encoder import CNNEncoder, RNNEncoder 235 | x1 = tf.placeholder(tf.int32, [None, 20], name="input_x1") 236 | x2 = tf.placeholder(tf.int32, [None, 20], name="input_x2") 237 | y = tf.placeholder(tf.float32, [None], name="input_y") 238 | cnn_encoder = CNNEncoder( 239 | sequence_length=20, 240 | embedding_dim=128, 241 | filter_sizes=[3, 4, 5], 242 | num_filters=100, 243 | ) 244 | rnn_encoder = RNNEncoder( 245 | rnn_cell='lstm', 246 | hidden_units=100, 247 | num_layers=2, 248 | dropout_keep_prob=0.7, 249 | use_dynamic=False, 250 | use_attention=False, 251 | ) 252 | model1 = SiameseSimilarityNets( 253 | input_x1=x1, 254 | input_x2=x2, 255 | input_y=y, 256 | word_embedding_type='rand', 257 | vocab_size=10000, 258 | embedding_size=128, 259 | encoder_type='cnn', 260 | cnn_encoder=cnn_encoder, 261 | rnn_encoder=rnn_encoder, 262 | dense_layer=False, 263 | l2_reg_lambda=0, 264 | pred_threshold=0.5, 265 | energy_func='cosine', 266 | loss_func='contrasive', 267 | margin=0.0, 268 | contrasive_loss_pos_weight=1.0, 269 | weight_sharing=True) 270 | model2 = SiameseClassificationNets( 271 | input_x1=x1, 272 | input_x2=x2, 273 | input_y=y, 274 | word_embedding_type='rand', 275 | vocab_size=10000, 276 | embedding_size=128, 277 | encoder_type='cnn', 278 | cnn_encoder=cnn_encoder, 279 | rnn_encoder=rnn_encoder, 280 | dense_layer=False, 281 | l2_reg_lambda=0, 282 | interaction='multiply', 283 | weight_sharing=True 284 | ) -------------------------------------------------------------------------------- /tf/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/6/11 5 | # !/usr/bin/env python 6 | # coding: utf-8 7 | from __future__ import unicode_literals 8 | import os 9 | import datetime 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | from dataset import Dataset 15 | from encoder import CNNEncoder, RNNEncoder 16 | from siamese_net import SiameseSimilarityNets, SiameseClassificationNets 17 | 18 | # Data loading params 19 | tf.flags.DEFINE_string("data_file", "../data/atec_nlp_sim_train1.csv", "Training data file path.") 20 | tf.flags.DEFINE_float("val_percentage", .1, "Percentage of the training data to use for validation. (default: 0.2)") 21 | tf.flags.DEFINE_integer("random_seed", 123, "Random seed to split train and test. (default: None)") 22 | tf.flags.DEFINE_integer("max_document_length", 30, "Max document length of each train pair. (default: 15)") 23 | tf.flags.DEFINE_boolean("char_model", False, "Character based syntactic model. if false, word based semantic model. 
(default: True)") 24 | tf.flags.DEFINE_integer("embedding_dim", 128, "Dimensionality of character/word embedding (default: 300)") 25 | 26 | # Model Hyperparameters 27 | tf.flags.DEFINE_string("model_class", "similarity", "Model class, one of {`similarity`, `classification`}") 28 | tf.flags.DEFINE_string("model_type", "rcnn", "Model type, one of {`cnn`, `rnn`, `rcnn`} (default: rnn)") 29 | tf.flags.DEFINE_string("word_embedding_type", "non-static", "One of `rand`, `static`, `non-static`, random init(rand) vs pretrained word2vec(static) vs pretrained word2vec + training(non-static)") 30 | # If include CNN 31 | tf.flags.DEFINE_string("filter_sizes", "2,3,4,5", "Comma-separated filter sizes (default: '3,4,5')") 32 | tf.flags.DEFINE_integer("num_filters", 100, "Number of filters per filter size (default: 128)") 33 | # If include RNN 34 | tf.flags.DEFINE_string("rnn_cell", "gru", "Rnn cell type, lstm or gru or rnn(default: lstm)") 35 | tf.flags.DEFINE_integer("hidden_units", 100, "Number of hidden units (default: 50)") 36 | tf.flags.DEFINE_integer("num_layers", 2, "Number of rnn layers (default: 3)") 37 | tf.flags.DEFINE_float("clip_norm", 5, "Gradient clipping norm value set None to not use (default: 5)") 38 | tf.flags.DEFINE_boolean("use_dynamic", True, "Whether use dynamic rnn or not (default: False)") 39 | tf.flags.DEFINE_boolean("use_attention", False, "Whether use self attention or not (default: False)") 40 | # Common 41 | tf.flags.DEFINE_boolean("weight_sharing", True, "Sharing CNN or RNN encoder weights. (default: True") 42 | tf.flags.DEFINE_boolean("dense_layer", False, "Whether to add a fully connected layer before calculate energy function. (default: False)") 43 | tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 1.0)") 44 | tf.flags.DEFINE_string("energy_function", "cosine", "Similarity energy function, one of {`euclidean`, `cosine`, `exp_manhattan`, `combine`} (default: euclidean)") 45 | tf.flags.DEFINE_string("loss_function", "contrasive", "Loss function one of `cross_entrophy`, `contrasive`, (default: contrasive loss)") 46 | tf.flags.DEFINE_float("pred_threshold", 0.5, "Threshold for classify.(default: 0.5)") 47 | tf.flags.DEFINE_float("l2_reg_lambda", 0, "L2 regularizaion lambda (default: 0.0)") 48 | # Only for contrasive loss 49 | tf.flags.DEFINE_float("scale_pos_weight", 2, "Scale loss function for imbalance data, set it around neg_samples / pos_samples ") 50 | tf.flags.DEFINE_float("margin", 0.0, "Margin for contrasive loss (default: 0.0)") 51 | 52 | # Training parameters 53 | tf.flags.DEFINE_string("model_dir", "../model", "Model directory (default: ../model)") 54 | tf.flags.DEFINE_integer("batch_size", 128, "Batch Size (default: 64)") 55 | tf.flags.DEFINE_float("lr", 1e-2, "Initial learning rate (default: 1e-3)") 56 | tf.flags.DEFINE_float("weight_decay_rate", 0.5, "Exponential weight decay rate (default: 0.9) ") 57 | tf.flags.DEFINE_integer("num_epochs", 50, "Number of training epochs (default: 100)") 58 | tf.flags.DEFINE_integer("log_every_steps", 100, "Print log info after this many steps (default: 100)") 59 | tf.flags.DEFINE_integer("evaluate_every_steps", 100, "Evaluate model on dev set after this many steps (default: 100)") 60 | # tf.flags.DEFINE_integer("checkpoint_every_steps", 1000, "Save model after this many steps (default: 100)") 61 | tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)") 62 | 63 | FLAGS = tf.flags.FLAGS 64 | # supress tensorflow logging other than errors 65 | 
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 66 | 67 | 68 | def train(): 69 | print("Using TensorFlow Version %s" % tf.__version__) 70 | assert "1.5" <= tf.__version__, "Need TensorFlow 1.5 or Later." 71 | print("\nParameters:") 72 | for attr in FLAGS: 73 | value = FLAGS[attr].value 74 | print("{}={}".format(attr.upper(), value)) 75 | print("") 76 | if not FLAGS.data_file: 77 | exit("Train data file is empty. Set --data_file argument.") 78 | 79 | dataset = Dataset(data_file=FLAGS.data_file, char_level=FLAGS.char_model, embedding_dim=FLAGS.embedding_dim) 80 | vocab, word2id = dataset.read_vocab() 81 | print("Vocabulary Size: {:d}".format(len(vocab))) 82 | # Generate batches 83 | data = dataset.process_data(data_file=FLAGS.data_file, sequence_length=FLAGS.max_document_length) # (x1, x2, y) 84 | train_data, eval_data = dataset.train_test_split(data, test_size=FLAGS.val_percentage, random_seed=FLAGS.random_seed) 85 | train_batches = dataset.batch_iter(train_data, FLAGS.batch_size, FLAGS.num_epochs, shuffle=True) 86 | 87 | with tf.Graph().as_default(): 88 | tf.set_random_seed(FLAGS.random_seed) 89 | session_conf = tf.ConfigProto( 90 | allow_soft_placement=True, 91 | log_device_placement=False) 92 | sess = tf.Session(config=session_conf) 93 | 94 | input_x1 = tf.placeholder(tf.int32, [None, FLAGS.max_document_length], name="input_x1") 95 | input_x2 = tf.placeholder(tf.int32, [None, FLAGS.max_document_length], name="input_x2") 96 | input_y = tf.placeholder(tf.float32, [None], name="input_y") 97 | dropout_keep_prob = tf.placeholder(tf.float32, name="input_y") 98 | cnn_encoder = CNNEncoder( 99 | sequence_length=FLAGS.max_document_length, 100 | embedding_dim=FLAGS.embedding_dim, 101 | filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))), 102 | num_filters=FLAGS.num_filters, 103 | ) 104 | rnn_encoder = RNNEncoder( 105 | rnn_cell=FLAGS.rnn_cell, 106 | hidden_units=FLAGS.hidden_units, 107 | num_layers=FLAGS.num_layers, 108 | dropout_keep_prob=dropout_keep_prob, 109 | use_dynamic=FLAGS.use_dynamic, 110 | use_attention=FLAGS.use_attention, 111 | ) 112 | 113 | with sess.as_default(): 114 | if FLAGS.model_class == 'similarity': 115 | model = SiameseSimilarityNets( 116 | input_x1=input_x1, 117 | input_x2=input_x2, 118 | input_y=input_y, 119 | encoder_type=FLAGS.model_type, 120 | cnn_encoder=cnn_encoder, 121 | rnn_encoder=rnn_encoder, 122 | vocab_size=len(vocab), 123 | embedding_size=FLAGS.embedding_dim, 124 | word_embedding_type=FLAGS.word_embedding_type, 125 | dense_layer=FLAGS.dense_layer, 126 | pred_threshold=FLAGS.pred_threshold, 127 | l2_reg_lambda=FLAGS.l2_reg_lambda, 128 | energy_func=FLAGS.energy_function, 129 | loss_func=FLAGS.loss_function, 130 | margin=FLAGS.margin, 131 | contrasive_loss_pos_weight=FLAGS.scale_pos_weight, 132 | weight_sharing=FLAGS.weight_sharing 133 | ) 134 | print("Initialized SiameseSimilarityNets model.") 135 | elif FLAGS.model_class == 'classification': 136 | model = SiameseClassificationNets( 137 | input_x1=input_x1, 138 | input_x2=input_x2, 139 | input_y=input_y, 140 | word_embedding_type=FLAGS.word_embedding_type, 141 | vocab_size=len(vocab), 142 | embedding_size=FLAGS.embedding_dim, 143 | encoder_type=FLAGS.model_type, 144 | cnn_encoder=cnn_encoder, 145 | rnn_encoder=rnn_encoder, 146 | dense_layer=FLAGS.dense_layer, 147 | l2_reg_lambda=FLAGS.l2_reg_lambda, 148 | interaction='multiply', 149 | weight_sharing=FLAGS.weight_sharing 150 | ) 151 | print("Initialized SiameseClassificationNets model.") 152 | else: 153 | raise ValueError("Invalid model class. 
Expected one of {`similarity`, `classification`} ") 154 | model.forward() 155 | 156 | # Define Training procedure 157 | global_step = tf.Variable(0, name="global_step", trainable=False) 158 | learning_rate = tf.train.exponential_decay(FLAGS.lr, global_step, decay_steps=int(40000/FLAGS.batch_size), 159 | decay_rate=FLAGS.weight_decay_rate, staircase=True) 160 | optimizer = tf.train.AdamOptimizer(learning_rate) 161 | # optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9) 162 | # optimizer = tf.train.GradientDescentOptimizer(learning_rate) 163 | # optimizer = tf.train.RMSPropOptimizer(learning_rate) 164 | # optimizer = tf.train.AdadeltaOptimizer(learning_rate, epsilon=1e-6) 165 | 166 | # for i, (g, v) in enumerate(grads_and_vars): 167 | # if g is not None: 168 | # grads_and_vars[i] = (tf.clip_by_global_norm(g, 5), v) # clip gradients 169 | # train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 170 | if FLAGS.clip_norm: # improve loss, but small weight cause small score, need to turn threshold for better f1. 171 | variables = tf.trainable_variables() 172 | grads, _ = tf.clip_by_global_norm(tf.gradients(model.loss, variables), FLAGS.clip_norm) 173 | train_op = optimizer.apply_gradients(zip(grads, variables), global_step=global_step) 174 | grads_and_vars = zip(grads, variables) 175 | else: 176 | grads_and_vars = optimizer.compute_gradients(model.loss) 177 | train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 178 | # Keep track of gradient values and sparsity (optional) 179 | grad_summaries = [] 180 | for g, v in grads_and_vars: 181 | if g is not None: 182 | grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g) 183 | sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 184 | grad_summaries.append(grad_hist_summary) 185 | grad_summaries.append(sparsity_summary) 186 | grad_summaries_merged = tf.summary.merge(grad_summaries) 187 | print("Defined gradient summaries.") 188 | 189 | # Summaries for loss and accuracy 190 | loss_summary = tf.summary.scalar("loss", model.loss) 191 | f1_summary = tf.summary.scalar("F1-score", model.f1) 192 | 193 | # Train Summaries 194 | train_summary_op = tf.summary.merge([loss_summary, f1_summary, grad_summaries_merged]) 195 | train_summary_dir = os.path.join(FLAGS.model_dir, "summaries", "train") 196 | train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph) 197 | 198 | # Dev summaries 199 | dev_summary_op = tf.summary.merge([loss_summary, f1_summary]) 200 | dev_summary_dir = os.path.join(FLAGS.model_dir, "summaries", "dev") 201 | dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph) 202 | 203 | # Checkpoint directory. 
Tensorflow assumes this directory already exists so we need to create it 204 | checkpoint_dir = os.path.abspath(os.path.join(FLAGS.model_dir, "checkpoints")) 205 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 206 | if not os.path.exists(checkpoint_dir): 207 | os.makedirs(checkpoint_dir) 208 | graph_def = tf.get_default_graph().as_graph_def() 209 | with open(os.path.join(checkpoint_dir, "graphpb.txt"), 'w') as f: 210 | f.write(str(graph_def)) 211 | saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints) 212 | # Initialize all variables 213 | sess.run(tf.global_variables_initializer()) 214 | sess.run(tf.local_variables_initializer()) 215 | 216 | if FLAGS.word_embedding_type != 'rand': 217 | # initial matrix with random uniform 218 | # embedding_init = np.random.uniform(-0.25, 0.25, (len(vocab), FLAGS.embedding_dim)) 219 | embedding_init = np.zeros(shape=(len(vocab), FLAGS.embedding_dim)) 220 | # load vectors from the word2vec 221 | print("Initializing word embedding with pre-trained word2vec.") 222 | words, vectors = dataset.load_word2vec() 223 | for idx, w in enumerate(vocab): 224 | vec = vectors[words.index(w)] 225 | embedding_init[idx] = np.asarray(vec).astype(np.float32) 226 | sess.run(model.W.assign(embedding_init)) 227 | 228 | print("Starting training...") 229 | F1_best = 0.0 230 | last_improved_step = 0 231 | for batch in train_batches: 232 | x1_batch, x2_batch, y_batch = zip(*batch) 233 | feed_dict = { 234 | input_x1: x1_batch, 235 | input_x2: x2_batch, 236 | input_y: y_batch, 237 | dropout_keep_prob: FLAGS.dropout_keep_prob 238 | } 239 | _, step, loss, cm, acc, precision, recall, f1, summaries = sess.run( 240 | [train_op, global_step, model.loss, model.cm, model.acc, model.precision, model.recall, model.f1, train_summary_op], feed_dict) 241 | time_str = datetime.datetime.now().isoformat() 242 | if step % FLAGS.log_every_steps == 0: 243 | train_summary_writer.add_summary(summaries, step) 244 | print("{} step {} TRAIN loss={:g} acc={:.3f} P={:.3f} R={:.3f} F1={:.6f}".format( 245 | time_str, step, loss, acc, precision, recall, f1)) 246 | if step % FLAGS.evaluate_every_steps == 0: 247 | # eval 248 | x1_batch, x2_batch, y_batch = zip(*eval_data) 249 | feed_dict = { 250 | input_x1: x1_batch, 251 | input_x2: x2_batch, 252 | input_y: y_batch, 253 | dropout_keep_prob: 1 254 | } 255 | #### debug for similarity model 256 | # x1, out1, out2, sim_euc, sim_cos, sim_ma, sim = sess.run( 257 | # [model.embedded_1, model.out1, model.out2, model.sim_euc, model.sim_cos, model.sim_ma, model.sim], feed_dict) 258 | # print(x1) 259 | # sim_euc = [round(s, 2) for s in sim_euc[:30]] 260 | # sim_cos = [round(s, 2) for s in sim_cos[:30]] 261 | # sim_ma = [round(s, 2) for s in sim_ma[:30]] 262 | # sim = [round(s, 2) for s in sim[:30]] 263 | # # print(out1) 264 | # out1 = [round(s, 3) for s in out1[0]] 265 | # out2 = [round(s, 3) for s in out2[0]] 266 | # print(zip(out1, out2)) 267 | # for w in zip(y_batch[:30], sim, sim_euc, sim_cos, sim_ma): 268 | # print(w) 269 | 270 | ##### debug for classification model 271 | # out1, out2, out, logits = sess.run( 272 | # [model.out1, model.out2, model.out, model.logits], feed_dict) 273 | # out1 = [round(s, 3) for s in out1[0]] 274 | # out2 = [round(s, 3) for s in out2[0]] 275 | # out = [round(s, 3) for s in out[0]] 276 | # print(zip(out1, out2)) 277 | # print(out) 278 | # print(logits) 279 | 280 | loss, cm, acc, precision, recall, f1, summaries = sess.run( 281 | [model.loss, model.cm, model.acc, model.precision, model.recall, model.f1, 
dev_summary_op], feed_dict) 282 | dev_summary_writer.add_summary(summaries, step) 283 | if f1 > F1_best: 284 | F1_best = f1 285 | last_improved_step = step 286 | if F1_best > 0.5: 287 | path = saver.save(sess, checkpoint_prefix, global_step=step) 288 | print("Saved model with F1={} checkpoint to {}\n".format(F1_best, path)) 289 | improved_token = '*' 290 | else: 291 | improved_token = '' 292 | print("{} step {} DEV loss={:g} acc={:.3f} cm{} P={:.3f} R={:.3f} F1={:.6f} {}".format( 293 | time_str, step, loss, acc, cm, precision, recall, f1, improved_token)) 294 | # if step % FLAGS.checkpoint_every_steps == 0: 295 | # if F1 >= F1_best: 296 | # F1_best = F1 297 | # path = saver.save(sess, checkpoint_prefix, global_step=step) 298 | # print("Saved model with F1={} checkpoint to {}\n".format(F1_best, path)) 299 | if step - last_improved_step > 4000: # 2000 steps 300 | print("No improvement for a long time, early-stopping at best F1={}".format(F1_best)) 301 | break 302 | 303 | if __name__ == '__main__': 304 | train() 305 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/6/8 -------------------------------------------------------------------------------- /utils/data_stats.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/6 5 | """This scripts calculate some statistical information about the original text data file. 6 | 7 | Main results: 8 | * positive sample percentage: train1->21.73% train2->16.06% total->18.23% 9 | * length from 5 to 97, 10 | * frequency distribution: (5, 0.0071), (6, 0.0292), (7, 0.0392), (8, 0.0752), (9, 0.0877), (10, 0.1072), (11, 0.1069), 11 | (12, 0.0994), (13, 0.0833), (14, 0.0682), (15, 0.0554), (16, 0.043), (17, 0.0338), (18, 0.0285), 12 | (19, 0.0229), (20, 0.0181), (21, 0.0153), (22, 0.0119), (23, 0.01), (24, 0.0088) ... 13 | * there is no significant frequency difference between positive samples and negative samples. 14 | * pair length diff: (0, 0.0902), (1, 0.1782), (2, 0.1557), (3, 0.1285), (4, 0.1004), (5, 0.0818), (6, 0.0587) ... 15 | * the positive pairs length diff at {0, 1, 2} is slightly higher than negative pairs. 
16 | """ 17 | from collections import Counter 18 | 19 | 20 | def positive_sample_percentage(filename): 21 | tot, pos = 0, 0 22 | for line in open(filename): 23 | line = line.strip().split('\t') 24 | if line[-1] == '1': 25 | pos += 1 26 | tot += 1 27 | print pos / float(tot) 28 | 29 | 30 | def sentence_length_distribution(filename): 31 | tot, pos = 0, 0 32 | pos_seq_len = [] 33 | neg_seq_len = [] 34 | tot_seq_len = [] 35 | for line in open(filename): 36 | line = line.strip().split('\t') 37 | s1 = line[1].decode('utf-8') 38 | s2 = line[2].decode('utf-8') 39 | tot_seq_len.extend([len(s1), len(s2)]) 40 | tot += 2 41 | if line[-1] == '1': 42 | pos_seq_len.extend([len(s1), len(s2)]) 43 | pos += 2 44 | else: 45 | neg_seq_len.extend([len(s1), len(s2)]) 46 | tot_counter = Counter(tot_seq_len) 47 | pos_counter = Counter(pos_seq_len) 48 | neg_counter = Counter(neg_seq_len) 49 | tot_freq = sorted(map(lambda x: (x[0], round(x[1]/float(tot), 4)), tot_counter.items())) 50 | pos_freq = sorted(map(lambda x: (x[0], round(x[1]/float(pos), 4)), pos_counter.items())) 51 | neg_freq = sorted(map(lambda x: (x[0], round(x[1]/float(tot-pos), 4)), neg_counter.items())) 52 | print('Total sample length distribution: {}'.format(tot_freq)) 53 | print('Positive sample length distribution: {}'.format(pos_freq)) 54 | print('Negetive sample length distribution: {}'.format(neg_freq)) 55 | 56 | 57 | def pair_length_diff_distribution(filename): 58 | tot, pos = 0, 0 59 | tot_diff = [] 60 | pos_diff = [] 61 | neg_diff = [] 62 | for line in open(filename): 63 | line = line.strip().split('\t') 64 | s1 = line[1].decode('utf-8') 65 | s2 = line[2].decode('utf-8') 66 | len_diff = abs(len(s1) - len(s2)) 67 | tot_diff.append(len_diff) 68 | tot += 1 69 | if line[-1] == '1': 70 | pos_diff.append(len_diff) 71 | pos += 1 72 | else: 73 | neg_diff.append(len_diff) 74 | tot_counter = Counter(tot_diff) 75 | pos_counter = Counter(pos_diff) 76 | neg_counter = Counter(neg_diff) 77 | tot_freq = sorted(map(lambda x: (x[0], round(x[1] / float(tot), 4)), tot_counter.items())) 78 | pos_freq = sorted(map(lambda x: (x[0], round(x[1] / float(pos), 4)), pos_counter.items())) 79 | neg_freq = sorted(map(lambda x: (x[0], round(x[1] / float(tot - pos), 4)), neg_counter.items())) 80 | print('Total pair length diff distribution: {}'.format(tot_freq)) 81 | print('Positive pair length diff distribution: {}'.format(pos_freq)) 82 | print('Negetive pair length diff distribution: {}'.format(neg_freq)) 83 | 84 | if __name__ == '__main__': 85 | filename = '../data/atec_nlp_sim_train.csv' 86 | positive_sample_percentage(filename) 87 | # sentence_length_distribution(filename) 88 | # pair_length_diff_distribution(filename) 89 | 90 | -------------------------------------------------------------------------------- /utils/feature_engineering.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/21 5 | from __future__ import unicode_literals 6 | from __future__ import division 7 | 8 | 9 | def len_diff(s1, s2): 10 | return abs(len(s1) - len(s2)) 11 | 12 | 13 | def len_diff_ratio(s1, s2): 14 | return 2 * abs(len(s1) - len(s2)) / (len(s1) + len(s2)) 15 | 16 | 17 | def shingle_similarity(s1, s2, size=1): 18 | """Shingle similarity of two sentences.""" 19 | def get_shingles(text, size): 20 | shingles = set() 21 | for i in range(0, len(text) - size + 1): 22 | shingles.add(text[i:i + size]) 23 | return shingles 24 | 25 | def jaccard(set1, set2): 26 | x = 
len(set1.intersection(set2)) 27 | y = len(set1.union(set2)) 28 | return x, y 29 | 30 | x, y = jaccard(get_shingles(s1, size), get_shingles(s2, size)) 31 | return x / float(y) if (y > 0 and x > 2) else 0.0 32 | 33 | 34 | def common_words(s1, s2): 35 | s1_common_cnt = len([w for w in s1 if w in s2]) 36 | s2_common_cnt = len([w for w in s2 if w in s1]) 37 | return (s1_common_cnt + s2_common_cnt) / (len(s1) + len(s2)) 38 | 39 | 40 | def tf_idf(): 41 | pass 42 | 43 | 44 | def wmd(): 45 | pass 46 | 47 | 48 | if __name__ == '__main__': 49 | s1 = '怎么更改花呗手机号码' 50 | s2 = '我的花呗是以前的手机号码,怎么更改成现在的支付宝的号码手机号' 51 | print(len_diff(s1, s2)) 52 | print(shingle_similarity(s1, s2)) 53 | print(shingle_similarity(s1, s2, 2)) 54 | print(shingle_similarity(s1, s2, 3)) 55 | 56 | -------------------------------------------------------------------------------- /utils/langconv.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | from copy import deepcopy 5 | 6 | try: 7 | import psyco 8 | psyco.full() 9 | except: 10 | pass 11 | 12 | try: 13 | from utils.zh_wiki import zh2Hant, zh2Hans 14 | except ImportError: 15 | from zhtools.zh_wiki import zh2Hant, zh2Hans 16 | 17 | import sys 18 | py3k = sys.version_info >= (3, 0, 0) 19 | 20 | if py3k: 21 | UEMPTY = '' 22 | else: 23 | _zh2Hant, _zh2Hans = {}, {} 24 | for old, new in ((zh2Hant, _zh2Hant), (zh2Hans, _zh2Hans)): 25 | for k, v in old.items(): 26 | new[k.decode('utf8')] = v.decode('utf8') 27 | zh2Hant = _zh2Hant 28 | zh2Hans = _zh2Hans 29 | UEMPTY = ''.decode('utf8') 30 | 31 | # states 32 | (START, END, FAIL, WAIT_TAIL) = list(range(4)) 33 | # conditions 34 | (TAIL, ERROR, MATCHED_SWITCH, UNMATCHED_SWITCH, CONNECTOR) = list(range(5)) 35 | 36 | MAPS = {} 37 | 38 | class Node(object): 39 | def __init__(self, from_word, to_word=None, is_tail=True, 40 | have_child=False): 41 | self.from_word = from_word 42 | if to_word is None: 43 | self.to_word = from_word 44 | self.data = (is_tail, have_child, from_word) 45 | self.is_original = True 46 | else: 47 | self.to_word = to_word or from_word 48 | self.data = (is_tail, have_child, to_word) 49 | self.is_original = False 50 | self.is_tail = is_tail 51 | self.have_child = have_child 52 | 53 | def is_original_long_word(self): 54 | return self.is_original and len(self.from_word)>1 55 | 56 | def is_follow(self, chars): 57 | return chars != self.from_word[:-1] 58 | 59 | def __str__(self): 60 | return '' % (repr(self.from_word), 61 | repr(self.to_word), self.is_tail, self.have_child) 62 | 63 | __repr__ = __str__ 64 | 65 | class ConvertMap(object): 66 | def __init__(self, name, mapping=None): 67 | self.name = name 68 | self._map = {} 69 | if mapping: 70 | self.set_convert_map(mapping) 71 | 72 | def set_convert_map(self, mapping): 73 | convert_map = {} 74 | have_child = {} 75 | max_key_length = 0 76 | for key in sorted(mapping.keys()): 77 | if len(key)>1: 78 | for i in range(1, len(key)): 79 | parent_key = key[:i] 80 | have_child[parent_key] = True 81 | have_child[key] = False 82 | max_key_length = max(max_key_length, len(key)) 83 | for key in sorted(have_child.keys()): 84 | convert_map[key] = (key in mapping, have_child[key], 85 | mapping.get(key, UEMPTY)) 86 | self._map = convert_map 87 | self.max_key_length = max_key_length 88 | 89 | def __getitem__(self, k): 90 | try: 91 | is_tail, have_child, to_word = self._map[k] 92 | return Node(k, to_word, is_tail, have_child) 93 | except: 94 | return Node(k) 95 | 96 | def __contains__(self, k): 97 | return 
k in self._map 98 | 99 | def __len__(self): 100 | return len(self._map) 101 | 102 | class StatesMachineException(Exception): pass 103 | 104 | class StatesMachine(object): 105 | def __init__(self): 106 | self.state = START 107 | self.final = UEMPTY 108 | self.len = 0 109 | self.pool = UEMPTY 110 | 111 | def clone(self, pool): 112 | new = deepcopy(self) 113 | new.state = WAIT_TAIL 114 | new.pool = pool 115 | return new 116 | 117 | def feed(self, char, map): 118 | node = map[self.pool+char] 119 | 120 | if node.have_child: 121 | if node.is_tail: 122 | if node.is_original: 123 | cond = UNMATCHED_SWITCH 124 | else: 125 | cond = MATCHED_SWITCH 126 | else: 127 | cond = CONNECTOR 128 | else: 129 | if node.is_tail: 130 | cond = TAIL 131 | else: 132 | cond = ERROR 133 | 134 | new = None 135 | if cond == ERROR: 136 | self.state = FAIL 137 | elif cond == TAIL: 138 | if self.state == WAIT_TAIL and node.is_original_long_word(): 139 | self.state = FAIL 140 | else: 141 | self.final += node.to_word 142 | self.len += 1 143 | self.pool = UEMPTY 144 | self.state = END 145 | elif self.state == START or self.state == WAIT_TAIL: 146 | if cond == MATCHED_SWITCH: 147 | new = self.clone(node.from_word) 148 | self.final += node.to_word 149 | self.len += 1 150 | self.state = END 151 | self.pool = UEMPTY 152 | elif cond == UNMATCHED_SWITCH or cond == CONNECTOR: 153 | if self.state == START: 154 | new = self.clone(node.from_word) 155 | self.final += node.to_word 156 | self.len += 1 157 | self.state = END 158 | else: 159 | if node.is_follow(self.pool): 160 | self.state = FAIL 161 | else: 162 | self.pool = node.from_word 163 | elif self.state == END: 164 | # END is a new START 165 | self.state = START 166 | new = self.feed(char, map) 167 | elif self.state == FAIL: 168 | raise StatesMachineException('Translate States Machine ' 169 | 'have error with input data %s' % node) 170 | return new 171 | 172 | def __len__(self): 173 | return self.len + 1 174 | 175 | def __str__(self): 176 | return '' % ( 177 | id(self), self.pool, self.state, self.final) 178 | __repr__ = __str__ 179 | 180 | class Converter(object): 181 | def __init__(self, to_encoding): 182 | self.to_encoding = to_encoding 183 | self.map = MAPS[to_encoding] 184 | self.start() 185 | 186 | def feed(self, char): 187 | branches = [] 188 | for fsm in self.machines: 189 | new = fsm.feed(char, self.map) 190 | if new: 191 | branches.append(new) 192 | if branches: 193 | self.machines.extend(branches) 194 | self.machines = [fsm for fsm in self.machines if fsm.state != FAIL] 195 | all_ok = True 196 | for fsm in self.machines: 197 | if fsm.state != END: 198 | all_ok = False 199 | if all_ok: 200 | self._clean() 201 | return self.get_result() 202 | 203 | def _clean(self): 204 | if len(self.machines): 205 | self.machines.sort(key=lambda x: len(x)) 206 | # self.machines.sort(cmp=lambda x,y: cmp(len(x), len(y))) 207 | self.final += self.machines[0].final 208 | self.machines = [StatesMachine()] 209 | 210 | def start(self): 211 | self.machines = [StatesMachine()] 212 | self.final = UEMPTY 213 | 214 | def end(self): 215 | self.machines = [fsm for fsm in self.machines 216 | if fsm.state == FAIL or fsm.state == END] 217 | self._clean() 218 | 219 | def convert(self, string): 220 | self.start() 221 | for char in string: 222 | self.feed(char) 223 | self.end() 224 | return self.get_result() 225 | 226 | def get_result(self): 227 | return self.final 228 | 229 | 230 | def registery(name, mapping): 231 | global MAPS 232 | MAPS[name] = ConvertMap(name, mapping) 233 | 234 | registery('zh-hant', 
zh2Hant) 235 | registery('zh-hans', zh2Hans) 236 | del zh2Hant, zh2Hans 237 | 238 | 239 | def run(): 240 | import sys 241 | from optparse import OptionParser 242 | parser = OptionParser() 243 | parser.add_option('-e', type='string', dest='encoding', 244 | help='encoding') 245 | parser.add_option('-f', type='string', dest='file_in', 246 | help='input file (- for stdin)') 247 | parser.add_option('-t', type='string', dest='file_out', 248 | help='output file') 249 | (options, args) = parser.parse_args() 250 | if not options.encoding: 251 | parser.error('encoding must be set') 252 | if options.file_in: 253 | if options.file_in == '-': 254 | file_in = sys.stdin 255 | else: 256 | file_in = open(options.file_in) 257 | else: 258 | file_in = sys.stdin 259 | if options.file_out: 260 | if options.file_out == '-': 261 | file_out = sys.stdout 262 | else: 263 | file_out = open(options.file_out, 'wb') 264 | else: 265 | file_out = sys.stdout 266 | 267 | c = Converter(options.encoding) 268 | for line in file_in: 269 | # print >> file_out, c.convert(line.rstrip('\n').decode( 270 | file_out.write(c.convert(line.rstrip('\n').decode( 271 | 'utf8')).encode('utf8')) 272 | 273 | 274 | if __name__ == '__main__': 275 | run() 276 | 277 | -------------------------------------------------------------------------------- /utils/train_test_split.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/16 5 | import random 6 | 7 | 8 | def train_test_split(infile, test_rate=0.2): 9 | with open('data/train.csv', 'w') as f_train, \ 10 | open('data/test.csv', 'w') as f_test: 11 | for line in open(infile): 12 | if random.random() > test_rate: 13 | f_train.write(line) 14 | else: 15 | f_test.write(line) 16 | 17 | 18 | if __name__ == '__main__': 19 | train_test_split('data/atec_nlp_sim_train.csv') 20 | 21 | --------------------------------------------------------------------------------
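utils/train_test_split.py assigns each input line to the test file independently with probability test_rate, so the realized test fraction only approximates 20% and changes between runs (the split is unseeded). A minimal seeded variant for reproducible splits, shown as a sketch rather than part of the repository (it keeps the same hard-coded data/train.csv and data/test.csv output paths):

#!/usr/bin/env python
# coding: utf-8
import random


def seeded_train_test_split(infile, test_rate=0.2, seed=123):
    """Line-level random split; a fixed seed makes the split reproducible."""
    rng = random.Random(seed)
    with open('data/train.csv', 'w') as f_train, \
            open('data/test.csv', 'w') as f_test:
        for line in open(infile):
            # Each line goes to the test file with probability `test_rate`.
            (f_test if rng.random() < test_rate else f_train).write(line)


if __name__ == '__main__':
    seeded_train_test_split('data/atec_nlp_sim_train.csv')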