├── .gitignore ├── LICENSE ├── README.md ├── data ├── UserDict.txt ├── atec_nlp_sim_train.csv ├── atec_nlp_sim_train1.csv ├── atec_nlp_sim_train2.csv ├── atec_token.csv ├── char_vec ├── pred.csv ├── w2v.txt └── word_vec ├── pytorch ├── __init__.py ├── dataset.py ├── model.py ├── siamese_network.py ├── text_rcnn.py ├── train.py └── train2.py ├── requirements.txt ├── tf ├── __init__.py ├── bad_cases.py ├── dataset.py ├── encoder.py ├── pred.py ├── siamese_net.py └── train.py └── utils ├── __init__.py ├── data_stats.py ├── feature_engineering.py ├── langconv.py ├── train_test_split.py └── zh_wiki.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | demo/ 3 | model/ 4 | thought.md 5 | experiment.md 6 | *.pyc 7 | rsync.sh -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Dhwaj Raj 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ATEC NLP sentence pair similarity competition 2 | 3 | https://dc.cloud.alipay.com/index#/topic/intro?id=3 4 | 5 | 1. Task description 6 | 7 | Question similarity computation: given two sentences from user queries to customer service, use an algorithm to decide whether they express the same meaning. 8 | 9 | Examples: 10 | 11 | “花呗如何还款” -- “花呗怎么还款”: synonymous 12 | “花呗如何还款” -- “我怎么还我的花被呢”: synonymous 13 | “花呗分期后逾期了如何还款” -- “花呗分期后逾期了哪里还款”: not synonymous 14 | Example a can be judged synonymous with fairly simple methods. Example b contains typos, synonyms and word-order changes, so the two sentences do not look alike at first glance and are hard to judge correctly. In example c the two sentences are very similar and differ only in one small word, “如何” (how) versus “哪里” (where), yet the meanings are different. 15 | 16 | 2. Data 17 | 18 | All data come from real application scenarios of Ant Financial's "financial brain". The competition has a preliminary round and a final round: 19 | 20 | Preliminary round 21 | 22 | We provide 100,000 labeled sentence pairs (released in batches) as downloadable training data, containing both synonymous and non-synonymous pairs. Each line of the dataset is one sample, in the following format: 23 | 24 | line_id\tsentence1\tsentence2\tlabel, for example: 1 花呗如何还款 花呗怎么还款 1 25 | 26 | line_id is the row number of the question pair in the training set; 27 | sentence1 and sentence2 are the two sentences of the question pair; 28 | label marks the pair as synonymous (1) or non-synonymous (0). 29 | The evaluation set contains 10,000 pairs. To keep the competition fair and prevent leaderboard probing, this set is not released; participants submit evaluation code and models, which produce predictions and the corresponding ranking. Its format is: 30 | 31 | line_id\tsentence1\tsentence2 32 | 33 | In the preliminary round the evaluation set is placed under a specific path of the evaluation system, and the official platform runs the submitted evaluation tool on it. 34 | 35 | Final round 36 | 37 | The training set is scaled up to a massive size. Data in this stage cannot be downloaded; they are provided as data tables on Ant Financial's Shuchao platform. As in the preliminary round, the dataset has four fields: line_id, sentence1, sentence2 and label. 38 | 39 | The evaluation set is again 10,000 pairs, also provided as a data table on the Shuchao platform, with three fields: line_id, sentence1 and sentence2. 40 | 41 | 3. Evaluation and metrics 42 | 43 | In the preliminary round, participants train and tune their models locally, package the evaluation code and model, and submit them to the official evaluation system for prediction and ranking updates. The evaluation system is a standard Linux environment with 8 GB memory, 4 CPU cores and no network access. Installed software: python 2.7, java 8, tensorflow 1.5, jieba 0.39, pytorch 0.4.0, keras 2.1.6, gensim 3.4.0, pandas 0.22.0, sklearn 0.19.1, xgboost 0.71, lightgbm 2.1.1. After the submitted archive is unpacked, its top-level directory must contain a script run.sh that takes the evaluation file as input and writes the predictions (only 0 or 1) as output, one “line_id\tprediction” per line. The command times out after 30 minutes and is invoked as: 44 | 45 | bash run.sh INPUT_PATH OUTPUT_PATH 46 | 47 | If the prediction output is empty or has the wrong number of lines, the score is set to 0. 48 | 49 | 50 | 51 | In the final round, model training, tuning and prediction are all done on Ant Financial's machine-learning platform, so only a UDF needs to be provided for evaluation: it takes the two sentences of a question pair as input and outputs the similarity prediction (0 or 1). As before, empty output terminates the evaluation and the score is 0. 52 | 53 | 54 | 55 | Submissions are scored by F1-score; ties are broken by accuracy. Predictions are compared against the ground-truth labels, with the following definitions: 56 | 57 | True Positive (TP): the number of correct synonymous judgements; 58 | 59 | likewise, False Positive (FP): the number of incorrect synonymous judgements; 60 | 61 | True Negative (TN): the number of correct non-synonymous judgements; 62 | 63 | False Negative (FN): the number of incorrect non-synonymous judgements. 64 | 65 | From these we compute precision, recall, accuracy and F1-score: 66 | 67 | precision = TP / (TP + FP) 68 | 69 | recall = TP / (TP + FN) 70 | 71 | accuracy = (TP + TN) / (TP + FP + TN + FN) 72 | 73 | F1-score = 2 * precision * recall / (precision + recall) 74 | -------------------------------------------------------------------------------- /data/UserDict.txt: -------------------------------------------------------------------------------- 1 | 花呗 57419 2 | 借呗 23730 3 | 支付宝 3275 4 | 淘宝网 13 5 | 淘宝 1467 6 | -------------------------------------------------------------------------------- /pytorch/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/18
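To make the data format and the scoring rule above concrete, here is a minimal, self-contained sketch in plain Python. It is an illustration only, not a file in this repository; `load_pairs` and `score` are hypothetical helper names.

```python
# -*- coding: utf-8 -*-
"""Illustration only: parse the line_id / sentence1 / sentence2 / label format
(tab-separated) and compute the competition metrics."""


def load_pairs(path):
    """Read one training sample per line."""
    pairs = []
    for line in open(path):
        line_id, s1, s2, label = line.rstrip('\n').split('\t')
        pairs.append((line_id, s1, s2, int(label)))
    return pairs


def score(y_true, y_pred):
    """Precision, recall, accuracy and F1 from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    accuracy = float(tp + tn) / len(y_true) if y_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, accuracy, f1


if __name__ == '__main__':
    # e.g. pairs = load_pairs('data/atec_nlp_sim_train.csv')
    print(score([1, 0, 1, 0], [1, 1, 1, 0]))  # (0.666..., 1.0, 0.75, 0.8)
```

The same four counts drive the tensor-based `metrics` function in pytorch/train.py below.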
-------------------------------------------------------------------------------- /pytorch/dataset.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | import os 6 | import re 7 | 8 | import jieba 9 | import numpy as np 10 | import torch 11 | from torch.utils.data.dataset import Dataset 12 | 13 | 14 | class Dictionary(object): 15 | def __init__(self, infile, char_vocab_file=None, word_vocab_file=None, char_level=True): 16 | self.infile = infile 17 | self.char_level = char_level 18 | self.word2idx = {} 19 | self.idx2word = [] 20 | vocab_file = char_vocab_file if char_level else word_vocab_file 21 | 22 | if not vocab_file or not os.path.exists(vocab_file): 23 | print('Vocabulary file not found. Building vocabulary...') 24 | self.build_vocab() 25 | else: 26 | self.idx2word = open(vocab_file).read().decode('utf-8').strip().split('\n') 27 | self.word2idx = dict(zip(self.idx2word, range(len(self.idx2word)))) 28 | 29 | @staticmethod 30 | def _clean_text(text): 31 | """Text filter for Chinese corpus, only keep CN character.""" 32 | re_non_ch = re.compile(ur'[^\u4e00-\u9fa5]+') 33 | text = text.decode('utf-8').strip(' ') 34 | text = re_non_ch.sub('', text) 35 | return text 36 | 37 | def add_word(self, word): 38 | if word not in self.word2idx: 39 | self.idx2word.append(word) 40 | self.word2idx[word] = len(self.idx2word) - 1 41 | return self.word2idx[word] 42 | 43 | def build_vocab(self): 44 | self.add_word('') # pad index: 0 45 | for line in open(self.infile, 'r'): 46 | _, s1, s2, label = line.strip().split('\t') 47 | s1, s2 = map(self._clean_text, [s1, s2]) 48 | if not self.char_level: 49 | s1 = list(jieba.cut(s1)) 50 | s2 = list(jieba.cut(s2)) 51 | for token in s1+s2: 52 | # build vocabulary 53 | self.add_word(token) 54 | self.add_word('UNK') # unk index: len(word2idx)-1 55 | 56 | def __len__(self): 57 | return len(self.idx2word) 58 | 59 | 60 | class MyDataset(Dataset): 61 | 62 | def __init__(self, data_file, sequence_length, word2idx, char_level=True): 63 | self.word2idx = word2idx 64 | self.seq_len = sequence_length 65 | 66 | x1, x2, y = [], [], [] 67 | for line in open(data_file, 'r'): 68 | _, s1, s2, label = line.strip().split('\t') 69 | s1, s2 = map(self._clean_text, [s1, s2]) 70 | if not char_level: 71 | s1 = list(jieba.cut(s1)) 72 | s2 = list(jieba.cut(s2)) 73 | x1.append(s1) 74 | x2.append(s2) 75 | y.append(1 if label == '1' else 0) 76 | self.x1 = x1 77 | self.x2 = x2 78 | self.y = y 79 | 80 | @staticmethod 81 | def _clean_text(text): 82 | """Text filter for Chinese corpus, only keep CN character.""" 83 | re_non_ch = re.compile(ur'[^\u4e00-\u9fa5]+') 84 | text = text.decode('utf-8').strip(' ') 85 | text = re_non_ch.sub('', text) 86 | return text 87 | 88 | def __getitem__(self, index): 89 | s1, s2 = self.x1[index], self.x2[index] 90 | s1_id = torch.LongTensor(np.zeros(self.seq_len, dtype=np.int64)) 91 | s2_id = torch.LongTensor(np.zeros(self.seq_len, dtype=np.int64)) 92 | label = torch.LongTensor([self.y[index]]) 93 | for idx, w1 in enumerate(s1[:self.seq_len]): 94 | s1_id[idx] = self.word2idx.get(w1, self.word2idx["UNK"]) 95 | for idx, w2 in enumerate(s2[:self.seq_len]): 96 | s2_id[idx] = self.word2idx.get(w2, self.word2idx["UNK"]) 97 | 98 | return s1_id, s2_id, label 99 | 100 | def __len__(self): 101 | return len(self.y) 102 | 103 | 104 | if __name__ == '__main__': 105 | dic = Dictionary('../data/atec_nlp_sim_train.csv', '../data/cha.vocab', '../data/word.vocab') 106 | dataset = MyDataset('../data/atec_nlp_sim_train.csv', 15, dic.word2idx) 107 | x1, x2, y = dataset[3] 108 | print(x1) 109 | print(y) -------------------------------------------------------------------------------- /pytorch/model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | from __future__ import print_function 6 | 7 | import os 8 | 9 | import torch 10 | from torch.autograd import Variable 11 | import torch.nn as nn 12 | 13 | 14 | class BiLSTM(nn.Module): 15 | 16 | def __init__(self, config):
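# `config` is a plain dict; train2.py (further below) assembles it with keys 'ntoken', 'ninp',
# 'nhid', 'nlayers', 'dropout', 'pooling', 'dictionary' and 'word-vector', plus
# 'attention-unit', 'attention-hops', 'nfc' and 'class-number' for SelfAttentiveEncoder and Classifier.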
17 | super(BiLSTM, self).__init__() 18 | self.drop = nn.Dropout(config['dropout']) 19 | self.encoder = nn.Embedding(config['ntoken'], config['ninp']) 20 | self.bilstm = nn.LSTM(config['ninp'], config['nhid'], config['nlayers'], dropout=config['dropout'], 21 | bidirectional=True) 22 | self.nlayers = config['nlayers'] 23 | self.nhid = config['nhid'] 24 | self.pooling = config['pooling'] 25 | self.dictionary = config['dictionary'] 26 | # self.init_weights() 27 | self.encoder.weight.data[self.dictionary.word2idx['']] = 0 28 | if os.path.exists(config['word-vector']): 29 | print('Loading word vectors from', config['word-vector']) 30 | vectors = torch.load(config['word-vector']) 31 | assert vectors[2] >= config['ninp'] 32 | vocab = vectors[0] 33 | vectors = vectors[1] 34 | loaded_cnt = 0 35 | for word in self.dictionary.word2idx: 36 | if word not in vocab: 37 | continue 38 | real_id = self.dictionary.word2idx[word] 39 | loaded_id = vocab[word] 40 | self.encoder.weight.data[real_id] = vectors[loaded_id][:config['ninp']] 41 | loaded_cnt += 1 42 | print('%d words from external word vectors loaded.' % loaded_cnt) 43 | 44 | # note: init_range constraints the value of initial weights 45 | def init_weights(self, init_range=0.1): 46 | self.encoder.weight.data.uniform_(-init_range, init_range) 47 | 48 | def forward(self, inp, hidden): 49 | emb = self.drop(self.encoder(inp)) 50 | outp = self.bilstm(emb, hidden)[0] 51 | if self.pooling == 'mean': 52 | outp = torch.mean(outp, 0).squeeze() 53 | elif self.pooling == 'max': 54 | outp = torch.max(outp, 0)[0].squeeze() 55 | elif self.pooling == 'all' or self.pooling == 'all-word': 56 | outp = torch.transpose(outp, 0, 1).contiguous() 57 | return outp, emb 58 | 59 | def init_hidden(self, bsz): 60 | weight = next(self.parameters()).data 61 | return (Variable(weight.new(self.nlayers * 2, bsz, self.nhid).zero_()), 62 | Variable(weight.new(self.nlayers * 2, bsz, self.nhid).zero_())) 63 | 64 | 65 | class SelfAttentiveEncoder(nn.Module): 66 | 67 | def __init__(self, config): 68 | super(SelfAttentiveEncoder, self).__init__() 69 | self.bilstm = BiLSTM(config) 70 | self.drop = nn.Dropout(config['dropout']) 71 | self.ws1 = nn.Linear(config['nhid'] * 2, config['attention-unit'], bias=False) 72 | self.ws2 = nn.Linear(config['attention-unit'], config['attention-hops'], bias=False) 73 | self.tanh = nn.Tanh() 74 | self.softmax = nn.Softmax() 75 | self.dictionary = config['dictionary'] 76 | # self.init_weights() 77 | self.attention_hops = config['attention-hops'] 78 | 79 | def init_weights(self, init_range=0.1): 80 | self.ws1.weight.data.uniform_(-init_range, init_range) 81 | self.ws2.weight.data.uniform_(-init_range, init_range) 82 | 83 | def forward(self, inp, hidden): 84 | outp = self.bilstm.forward(inp, hidden)[0] 85 | size = outp.size() # [bsz, len, nhid] 86 | compressed_embeddings = outp.view(-1, size[2]) # [bsz*len, nhid*2] 87 | transformed_inp = torch.transpose(inp, 0, 1).contiguous() # [bsz, len] 88 | transformed_inp = transformed_inp.view(size[0], 1, size[1]) # [bsz, 1, len] 89 | concatenated_inp = [transformed_inp for i in range(self.attention_hops)] 90 | concatenated_inp = torch.cat(concatenated_inp, 1) # [bsz, hop, len] 91 | 92 | hbar = self.tanh(self.ws1(self.drop(compressed_embeddings))) # [bsz*len, attention-unit] 93 | alphas = self.ws2(hbar).view(size[0], size[1], -1) # [bsz, len, hop] 94 | alphas = torch.transpose(alphas, 1, 2).contiguous() # [bsz, hop, len] 95 | penalized_alphas = alphas + ( 96 | -10000 * (concatenated_inp == 
self.dictionary.word2idx['']).float()) 97 | # [bsz, hop, len] + [bsz, hop, len] 98 | alphas = self.softmax(penalized_alphas.view(-1, size[1])) # [bsz*hop, len] 99 | alphas = alphas.view(size[0], self.attention_hops, size[1]) # [bsz, hop, len] 100 | # Performs a batch matrix-matrix product of matrices 101 | return torch.bmm(alphas, outp), alphas 102 | 103 | def init_hidden(self, bsz): 104 | return self.bilstm.init_hidden(bsz) 105 | 106 | 107 | class Classifier(nn.Module): 108 | 109 | def __init__(self, config): 110 | super(Classifier, self).__init__() 111 | if config['pooling'] == 'mean' or config['pooling'] == 'max': 112 | self.encoder = BiLSTM(config) 113 | self.fc = nn.Linear(config['nhid'] * 2, config['nfc']) 114 | elif config['pooling'] == 'all': 115 | self.encoder = SelfAttentiveEncoder(config) 116 | self.fc = nn.Linear(config['nhid'] * 2 * config['attention-hops'], config['nfc']) 117 | else: 118 | raise Exception('Error when initializing Classifier') 119 | self.drop = nn.Dropout(config['dropout']) 120 | self.tanh = nn.Tanh() 121 | self.pred = nn.Linear(config['nfc'], config['class-number']) 122 | self.dictionary = config['dictionary'] 123 | # self.init_weights() 124 | 125 | def init_weights(self, init_range=0.1): 126 | self.fc.weight.data.uniform_(-init_range, init_range) 127 | self.fc.bias.data.fill_(0) 128 | self.pred.weight.data.uniform_(-init_range, init_range) 129 | self.pred.bias.data.fill_(0) 130 | 131 | def forward(self, inp, hidden): 132 | outp, attention = self.encoder.forward(inp, hidden) 133 | outp = outp.view(outp.size(0), -1) 134 | fc = self.tanh(self.fc(self.drop(outp))) 135 | pred = self.pred(self.drop(fc)) 136 | if type(self.encoder) == BiLSTM: 137 | attention = None 138 | return pred, attention 139 | 140 | def init_hidden(self, bsz): 141 | return self.encoder.init_hidden(bsz) 142 | 143 | def encode(self, inp, hidden): 144 | return self.encoder.forward(inp, hidden)[0] -------------------------------------------------------------------------------- /pytorch/siamese_network.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | from numpy.linalg import norm 6 | 7 | import torch 8 | import torch.nn as nn 9 | import torch.nn.functional as F 10 | 11 | 12 | class EmbeddingCNN(nn.Module): 13 | def __init__(self, args): 14 | super(EmbeddingCNN, self).__init__() 15 | self.args = args 16 | 17 | self.embed = nn.Embedding(args.vocab_size, args.embedding_dim) 18 | self.convs1 = nn.ModuleList([nn.Conv2d(1, args.num_kernels, (ks, args.embedding_dim)) for ks in args.kernel_sizes]) 19 | 20 | def forward(self, x): 21 | x = self.embed(x) # (batch_size, sequence_length, embedding_dim) 22 | if self.args.word_embedding_type == 'static': 23 | x = torch.tensor(x) 24 | x = x.unsqueeze(1) # (batch_size, 1, sequence_length, embedding_dim) 25 | # # input size (N,Cin,H,W) output size (N,Cout,Hout,1) 26 | x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1] 27 | x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x] 28 | output = torch.cat(x, 1) # (batch_size, len(kernel_sizes)*kernel_num) 29 | return output 30 | 31 | 32 | class EmbeddingRNN(nn.Module): 33 | 34 | def __init__(self, args): 35 | super(EmbeddingRNN, self).__init__() 36 | self.hidden_units = args.hidden_units 37 | self.batch_size = args.batch_size 38 | 39 | self.embeds = nn.Embedding(args.vocab_size, args.embedding_dim) 40 | self.lstm = nn.LSTM(input_size=args.embedding_dim, 
hidden_size=args.hidden_units, num_layers=args.num_layers, 41 | batch_first=True, bidirectional=True) 42 | # self.hidden = self.init_hidden() 43 | # 44 | # def init_hidden(self): 45 | # h0 = Variable(torch.zeros(self.batch_size, num_layers, self.hidden_units)) 46 | # c0 = Variable(torch.zeros(self.batch_size, num_layers, self.hidden_units)) 47 | # return h0, c0 48 | 49 | def forward(self, sentence): 50 | embeds = self.embeds(sentence) 51 | # print(embeds) # [torch.FloatTensor of size batch_zise*seq_len*embedding_dim] 52 | # x = embeds.view(len(sentence), self.batch_size, -1) 53 | # Inputs: input, (h_0, c_0) Outputs: output, (h_n, c_n) (batch, seq_len, hidden_size * num_directions) 54 | # If (h_0, c_0) is not provided, both h_0 and c_0 default to zero. 55 | lstm_out, _ = self.lstm(embeds) 56 | # print(lstm_out) 57 | output = lstm_out[:, -1, :] 58 | return output 59 | 60 | 61 | class SiameseNet(nn.Module): 62 | def __init__(self, embedding_net): 63 | super(SiameseNet, self).__init__() 64 | self.embedding_net = embedding_net 65 | 66 | def forward(self, x1, x2): 67 | out1 = self.embedding_net(x1) 68 | out2 = self.embedding_net(x2) 69 | 70 | # out1_norm = torch.sqrt(torch.sum(torch.pow(out1, 2), dim=1)) 71 | # out2_norm = torch.sqrt(torch.sum(torch.pow(out2, 2), dim=1)) 72 | # cosine = (out1*out2).sum(1) / (out1_norm*out2_norm) 73 | sim = F.cosine_similarity(out1, out2, dim=1) 74 | # pdist = F.pairwise_distance(out1, out2, p=2, eps=1e-06, keepdim=False) 75 | 76 | return out1, out2, sim 77 | 78 | 79 | class ContrastiveLoss(torch.nn.Module): 80 | 81 | def __init__(self, margin=0.0): 82 | super(ContrastiveLoss, self).__init__() 83 | self.margin = margin 84 | 85 | def forward(self, y, y_): # y, y_ must be same type float (*) 86 | loss = y * torch.pow(1-y_, 2) + (1 - y) * torch.pow(y_-self.margin, 2) 87 | loss = torch.sum(loss) / 2.0 / len(y) #y.size()[0] 88 | return loss 89 | 90 | 91 | -------------------------------------------------------------------------------- /pytorch/text_rcnn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | import torch 6 | import torch.nn as nn 7 | import torch.nn.functional as F 8 | from torch.autograd import Variable 9 | 10 | 11 | class TextCNN(nn.Module): 12 | def __init__(self, args): 13 | super(TextCNN, self).__init__() 14 | self.args = args 15 | 16 | self.embed = nn.Embedding(args.sequence_length, args.embed_dim) 17 | self.convs1 = nn.ModuleList([nn.Conv2d(1, args.kernel_num, (ks, args.embed_dim)) for ks in args.kernel_sizes]) 18 | self.dropout = nn.Dropout(args.dropout) 19 | self.fc1 = nn.Linear(len(args.kernel_sizes) * args.kernel_num, args.class_num) 20 | 21 | def forward(self, x): 22 | x = self.embed(x) # (batch_size, sequence_length, embedding_dim) 23 | if self.args.static: 24 | x = torch.tensor(x) 25 | x = x.unsqueeze(1) # (batch_size, 1, sequence_length, embedding_dim) 26 | # # input size (N,Cin,H,W) output size (N,Cout,Hout,1) 27 | x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1] 28 | x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x] 29 | x = torch.cat(x, 1) # (batch_size, len(kernel_sizes)*kernel_num) 30 | x = self.dropout(x) 31 | logit = self.fc1(x) 32 | return logit 33 | 34 | 35 | class TextRNN(nn.Module): 36 | 37 | def __init__(self, args): 38 | super(TextRNN, self).__init__() 39 | self.hidden_dim = args.hidden_dim 40 | self.batch_size = args.batch_size 41 | 42 | self.embeds = nn.Embedding(args.vocab_size, 
args.embedding_dim) 43 | self.lstm = nn.LSTM(input_size=args.embedding_dim, hidden_size=args.hidden_dim, num_layers=args.num_layers, 44 | batch_first=True, bidirectional=True) 45 | self.hidden2label = nn.Linear(args.hidden_dim, args.num_classes) 46 | self.hidden = self.init_hidden() 47 | 48 | def init_hidden(self): 49 | h0 = Variable(torch.zeros(1, self.batch_size, self.hidden_dim)) 50 | c0 = Variable(torch.zeros(1, self.batch_size, self.hidden_dim)) 51 | return h0, c0 52 | 53 | def forward(self, sentence): 54 | embeds = self.embeds(sentence) 55 | # x = embeds.view(len(sentence), self.batch_size, -1) 56 | lstm_out, self.hidden = self.lstm(embeds, self.hidden) 57 | y = self.hidden2label(lstm_out[-1]) 58 | return y 59 | 60 | 61 | class TextRCNN(nn.Module): 62 | 63 | def __init__(self, args): 64 | super(TextRCNN, self).__init__() 65 | self.interaction = args.interaction 66 | self.model_type = args.model_type 67 | self.cnn = TextCNN(args) 68 | self.rnn = TextRNN(args) 69 | 70 | def forward(self, x): 71 | if self.model_type == 'cnn': 72 | out = self.cnn.forward(x) 73 | elif self.model_type == 'rnn': 74 | out = self.rnn.forward(x) 75 | elif self.model_type == 'rcnn': 76 | out = self.cnn.forward(x) + self.rnn.forward(x) 77 | 78 | if self.interaction == 'multiply': 79 | pass 80 | 81 | return out 82 | 83 | -------------------------------------------------------------------------------- /pytorch/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | import os 6 | import sys 7 | 8 | import argparse 9 | 10 | import torch 11 | import torch.autograd as autograd 12 | from torch.autograd import Variable 13 | import torch.nn.functional as F 14 | import torch.nn as nn 15 | from torch.utils.data import DataLoader 16 | 17 | from dataset import Dictionary, MyDataset 18 | from siamese_network import EmbeddingCNN, EmbeddingRNN, SiameseNet, ContrastiveLoss 19 | 20 | 21 | def get_args(): 22 | parser = argparse.ArgumentParser(description='Siamese text classifier') 23 | parser.add_argument('--dictionary', type=str, default='', 24 | help='path to save the dictionary, for faster corpus loading') 25 | parser.add_argument('--word_vector', type=str, default='', 26 | help='path for pre-trained word vectors (e.g. 
GloVe)') 27 | parser.add_argument('--word_embedding_type', type=str, default='rand', 28 | help='word embedding type {`rand`, `static`, `non-static`}') 29 | parser.add_argument('--train_data', type=str, default='../data/train.csv', 30 | help='training data path') 31 | parser.add_argument('--val_data', type=str, default='../data/test.csv', 32 | help='validation data path') 33 | parser.add_argument('--test_data', type=str, default='', 34 | help='test data path') 35 | parser.add_argument('--char_model', type=bool, default=True, 36 | help='whether to use character level model') 37 | # RNN 38 | parser.add_argument('--sequence_length', type=int, default=20, 39 | help='max sequence length') 40 | parser.add_argument('--embedding_dim', type=int, default=64, 41 | help='size of word embeddings') 42 | parser.add_argument('--hidden_units', type=int, default=200, 43 | help='number of hidden units per layer') 44 | parser.add_argument('--num_layers', type=int, default=2, 45 | help='number of layers in BiLSTM') 46 | # CNN 47 | parser.add_argument('--kernel_sizes', type=int, nargs='+', default=[2,3,4,5], 48 | help='kernel sizes in CNN') 49 | parser.add_argument('--num_kernels', type=int, default=100, 50 | help='number of kernels in CNN') 51 | # parser.add_argument('--attention-unit', type=int, default=350, 52 | # help='number of attention unit') 53 | # parser.add_argument('--attention-hops', type=int, default=1, 54 | # help='number of attention hops, for multi-hop attention model') 55 | parser.add_argument('--dropout', type=float, default=0.1, 56 | help='dropout applied to layers (0 = no dropout)') 57 | parser.add_argument('--clip', type=float, default=0.5, 58 | help='clip to prevent the too large grad in LSTM') 59 | # parser.add_argument('--nfc', type=int, default=512, 60 | # help='hidden (fully connected) layer size for classifier MLP') 61 | # train 62 | parser.add_argument('--lr', type=float, default=.001, 63 | help='initial learning rate') 64 | parser.add_argument('--epochs', type=int, default=50, 65 | help='upper epoch limit') 66 | parser.add_argument('--batch_size', type=int, default=64, 67 | help='batch size for training') 68 | parser.add_argument('--cuda', action='store_true', 69 | help='use CUDA') 70 | parser.add_argument('--log_interval', type=int, default=100, metavar='N', 71 | help='train log interval') 72 | parser.add_argument('--test_interval', type=int, default=100, metavar='N', 73 | help='eval interval') 74 | parser.add_argument('--save_interval', type=int, default=1000, metavar='N', 75 | help='save interval') 76 | parser.add_argument('--save_dir', type=str, default='model_torch', 77 | help='path to save the final model') 78 | parser.add_argument('--optimizer', type=str, default='Adam', 79 | help='type of optimizer') 80 | parser.add_argument('--seed', type=int, default=123, 81 | help='random seed') 82 | # parser.add_argument('--penalization-coeff', type=float, default=1, 83 | # help='the penalization coefficient') 84 | return parser.parse_args() 85 | 86 | 87 | def metrics(y, y_pred): 88 | # 8-bit integer (unsigned) 89 | # y, y_pred = torch.ByteTensor(y), torch.ByteTensor(y_pred) 90 | TP = ((y_pred == 1) & (y == 1)).sum().float() 91 | TN = ((y_pred == 0) & (y == 0)).sum().float() 92 | FN = ((y_pred == 0) & (y == 1)).sum().float() 93 | FP = ((y_pred == 1) & (y == 0)).sum().float() 94 | p = TP / (TP + FP).clamp(min=1e-8) 95 | r = TP / (TP + FN).clamp(min=1e-8) 96 | F1 = 2 * r * p / (r + p).clamp(min=1e-8) 97 | acc = (TP + TN) / (TP + TN + FP + FN).clamp(min=1e-8) 98 | return acc, p, r, F1 99 | 100 | 101
| def train(train_iter, dev_iter, model, args): 102 | if args.cuda: 103 | model.cuda() 104 | # for param in model.parameters(): 105 | # print(param) 106 | 107 | def adjust_learning_rate(optimizer, learning_rate, epoch): 108 | lr = learning_rate * (0.1 ** (epoch // 10)) 109 | for param_group in optimizer.param_groups: 110 | param_group['lr'] = lr 111 | return optimizer 112 | 113 | if args.optimizer == 'Adam': 114 | optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, betas=[0.9, 0.999], eps=1e-8, weight_decay=0) 115 | elif args.optimizer == 'SGD': 116 | optimizer = torch.optim.SGD(model.parameters(), lr=args.lr, momentum=0.9, weight_decay=0.01) 117 | else: 118 | raise Exception('For other optimizers, please add it yourself. supported ones are: SGD and Adam.') 119 | 120 | F1_best = 0.0 121 | last_improved_step = 0 122 | model.train() 123 | steps = 0 124 | for epoch in range(1, args.epochs+1): 125 | for batch in train_iter: 126 | optimizer = adjust_learning_rate(optimizer, args.lr, epoch) 127 | x1, x2, y = batch 128 | y = torch.squeeze(y, 1).float() # [[1], [1], [0]...] to [1, 1, 0, ...] 129 | 130 | # if args.cuda: 131 | # x1, x2, y = Variable(x1).cuda(), Variable(x2).cuda(), Variable(y).cuda() 132 | # else: 133 | # x1, x2, y = Variable(x1), Variable(x2), Variable(y) 134 | optimizer.zero_grad() 135 | _, _, score = model(x1, x2) 136 | 137 | # print('out1', out1.dtype) 138 | # print('target vector', y.dtype) 139 | 140 | # loss_function = nn.CrossEntropyLoss() 141 | # loss = loss_function(output, Variable(train_labels)) 142 | # criterion = nn.CosineEmbeddingLoss(margin=0, size_average=True, reduce=False) 143 | # loss = criterion(out1, out2, (2 * y - 1)) # cast y to {1, -1} and float type 144 | # criterion = ContrastiveLoss() 145 | # loss = criterion(y, sim) 146 | 147 | # loss = F.cross_entropy(sim, y) 148 | loss = F.binary_cross_entropy_with_logits(score, y) 149 | loss.backward() 150 | optimizer.step() 151 | steps += 1 152 | 153 | if steps % args.log_interval == 0: 154 | # _, pred = torch.max(sim.data, 1) 155 | print('model sim and label tuples:') 156 | for i, j in zip(score, y): 157 | print(i.item(), j.item()) 158 | 159 | pred = score.data >= 0.5 160 | acc, p, r, f1 = metrics(y, pred) 161 | print('TRAIN[steps={}] loss={:.6f} acc={:.3f} P={:.3f} R={:.3f} F1={:.6f}'.format( 162 | steps, loss.item(), acc, p, r, f1)) 163 | if steps % args.test_interval == 0: 164 | loss, acc, p, r, f1 = eval(dev_iter, model) 165 | 166 | if f1 > F1_best: 167 | F1_best = f1 168 | last_improved_step = steps 169 | if F1_best > 0.5: 170 | save_prefix = os.path.join(args.save_dir, 'snapshot') 171 | save_path = '{}_steps{}.pt'.format(save_prefix, steps) 172 | torch.save(model, save_path) 173 | improved_token = '*' 174 | else: 175 | improved_token = '' 176 | print('DEV[steps={}] loss={:.6f} acc={:.3f} P={:.3f} R={:.3f} F1={:.6f} {}'.format( 177 | steps, loss, acc, p, r, f1, improved_token)) 178 | 179 | if steps % args.save_interval == 0: 180 | if not os.path.isdir(args.save_dir): 181 | os.makedirs(args.save_dir) 182 | save_prefix = os.path.join(args.save_dir, 'snapshot') 183 | save_path = '{}_steps{}.pt'.format(save_prefix, steps) 184 | torch.save(model, save_path) 185 | 186 | if steps - last_improved_step > 2000: # 2000 steps 187 | print("No improvement for a long time, early-stopping at best F1={}".format(F1_best)) 188 | break 189 | 190 | 191 | def eval(data_iter, model): 192 | loss_tot, y_list, y_pred_list = 0, [], [] 193 | model.eval() 194 | for x1, x2, y in data_iter: 195 | # if args.cuda: 196 | # x1, x2, y = 
Variable(x1).cuda(), Variable(x2).cuda(), Variable(y).cuda() 197 | # else: 198 | # x1, x2, y = Variable(x1), Variable(x2), Variable(y) 199 | out1, out2, sim = model(x1, x2) 200 | # loss = F.cross_entropy(output, y, size_average=False) 201 | criterion = nn.CosineEmbeddingLoss() 202 | loss = criterion(out1, out2, (2*y-1).float()) 203 | loss_tot += loss.item() # 0-dim scaler 204 | y_pred = sim.data >= 0.5 205 | y_pred_list.append(y_pred) 206 | y_list.append(y) 207 | y_pred = torch.cat(y_pred_list, 0) 208 | y = torch.cat(y_list, 0) 209 | acc, p, r, f1 = metrics(y, y_pred) 210 | size = len(data_iter.dataset) 211 | loss_avg = loss_tot / float(size) 212 | model.train() 213 | return loss_avg, acc, p, r, f1 214 | 215 | 216 | def predict(text, model, text_field, label_feild, cuda_flag): 217 | assert isinstance(text, str) 218 | model.eval() 219 | # text = text_field.tokenize(text) 220 | text = text_field.preprocess(text) 221 | text = [[text_field.vocab.stoi[x] for x in text]] 222 | x = text_field.tensor_type(text) 223 | x = autograd.Variable(x, volatile=True) 224 | if cuda_flag: 225 | x = x.cuda() 226 | print(x) 227 | output = model(x) 228 | _, predicted = torch.max(output, 1) 229 | return label_feild.vocab.itos[predicted.data[0][0]+1] 230 | 231 | 232 | if __name__ == '__main__': 233 | # parse the arguments 234 | args = get_args() 235 | print(args) 236 | 237 | # Set the random seed manually for reproducibility. 238 | torch.manual_seed(args.seed) 239 | if torch.cuda.is_available(): 240 | if not args.cuda: 241 | print("WARNING: You have a CUDA device, so you should probably run with --cuda") 242 | else: 243 | torch.cuda.manual_seed(args.seed) 244 | 245 | # Load Dictionary 246 | assert os.path.exists(args.train_data) 247 | assert os.path.exists(args.val_data) 248 | print('Begin to load the dictionary.') 249 | dictionary = Dictionary('../data/atec_nlp_sim_train.csv') 250 | 251 | args.vocab_size = len(dictionary) 252 | 253 | best_val_loss = None 254 | best_f1 = None 255 | n_token = len(dictionary) 256 | 257 | embedding_net = EmbeddingCNN(args) 258 | print("embedding_net: {}".format(embedding_net)) 259 | model = SiameseNet(embedding_net) 260 | print(model) 261 | 262 | print('Begin to load data.') 263 | train_data = MyDataset(args.train_data, args.sequence_length, dictionary.word2idx, args.char_model) 264 | val_data = MyDataset(args.val_data, args.sequence_length, dictionary.word2idx, args.char_model) 265 | train_loader = DataLoader(train_data, batch_size=args.batch_size, shuffle=True, num_workers=16) 266 | val_loader = DataLoader(val_data, batch_size=1, shuffle=False) 267 | try: 268 | for epoch in range(args.epochs): 269 | train(train_loader, val_loader, model, args) 270 | except KeyboardInterrupt: 271 | print('-' * 89) 272 | print('Exit from training early.') 273 | 274 | -------------------------------------------------------------------------------- /pytorch/train2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/15 5 | from __future__ import print_function 6 | 7 | import json 8 | import time 9 | import random 10 | import os 11 | import argparse 12 | 13 | import torch 14 | import torch.nn as nn 15 | import torch.optim as optim 16 | from torch.autograd import Variable 17 | 18 | from model import * 19 | from dataset import * 20 | 21 | 22 | def get_args(): 23 | parser = argparse.ArgumentParser() 24 | parser.add_argument('--emsize', type=int, default=300, 25 | help='size of word 
embeddings') 26 | parser.add_argument('--nhid', type=int, default=300, 27 | help='number of hidden units per layer') 28 | parser.add_argument('--nlayers', type=int, default=2, 29 | help='number of layers in BiLSTM') 30 | parser.add_argument('--attention-unit', type=int, default=350, 31 | help='number of attention unit') 32 | parser.add_argument('--attention-hops', type=int, default=1, 33 | help='number of attention hops, for multi-hop attention model') 34 | parser.add_argument('--dropout', type=float, default=0.5, 35 | help='dropout applied to layers (0 = no dropout)') 36 | parser.add_argument('--clip', type=float, default=0.5, 37 | help='clip to prevent the too large grad in LSTM') 38 | parser.add_argument('--nfc', type=int, default=512, 39 | help='hidden (fully connected) layer size for classifier MLP') 40 | parser.add_argument('--lr', type=float, default=.001, 41 | help='initial learning rate') 42 | parser.add_argument('--epochs', type=int, default=40, 43 | help='upper epoch limit') 44 | parser.add_argument('--seed', type=int, default=1111, 45 | help='random seed') 46 | parser.add_argument('--cuda', action='store_true', 47 | help='use CUDA') 48 | parser.add_argument('--log-interval', type=int, default=200, metavar='N', 49 | help='report interval') 50 | parser.add_argument('--save', type=str, default='', 51 | help='path to save the final model') 52 | parser.add_argument('--dictionary', type=str, default='', 53 | help='path to save the dictionary, for faster corpus loading') 54 | parser.add_argument('--word-vector', type=str, default='', 55 | help='path for pre-trained word vectors (e.g. GloVe), should be a PyTorch model.') 56 | parser.add_argument('--train-data', type=str, default='', 57 | help='location of the training data, should be a json file') 58 | parser.add_argument('--val-data', type=str, default='', 59 | help='location of the development data, should be a json file') 60 | parser.add_argument('--test-data', type=str, default='', 61 | help='location of the test data, should be a json file') 62 | parser.add_argument('--batch-size', type=int, default=32, 63 | help='batch size for training') 64 | parser.add_argument('--class-number', type=int, default=2, 65 | help='number of classes') 66 | parser.add_argument('--optimizer', type=str, default='Adam', 67 | help='type of optimizer') 68 | parser.add_argument('--penalization-coeff', type=float, default=1, 69 | help='the penalization coefficient') 70 | return parser.parse_args() 71 | 72 | 73 | def Frobenius(mat): 74 | size = mat.size() 75 | if len(size) == 3: # batched matrix 76 | ret = (torch.sum(torch.sum((mat ** 2), 1), 2).squeeze() + 1e-10) ** 0.5 77 | return torch.sum(ret) / size[0] 78 | else: 79 | raise Exception('matrix for computing Frobenius norm should be with 3 dims') 80 | 81 | 82 | def evaluate(): 83 | """evaluate the model while training""" 84 | model.eval() # turn on the eval() switch to disable dropout 85 | total_loss = 0 86 | total_correct = 0 87 | for batch, i in enumerate(range(0, len(data_val), args.batch_size)): 88 | data, targets = package(data_val[i:min(len(data_val), i+args.batch_size)], volatile=True) 89 | if args.cuda: 90 | data = data.cuda() 91 | targets = targets.cuda() 92 | hidden = model.init_hidden(data.size(1)) 93 | output, attention = model.forward(data, hidden) 94 | output_flat = output.view(data.size(1), -1) 95 | total_loss += criterion(output_flat, targets).data 96 | prediction = torch.max(output_flat, 1)[1] 97 | total_correct += torch.sum((prediction == targets).float()) 98 | return total_loss[0] / 
(len(data_val) // args.batch_size), total_correct.data[0] / len(data_val) 99 | 100 | 101 | def train(epoch_number): 102 | global best_val_loss, best_acc 103 | model.train() 104 | total_loss = 0 105 | total_pure_loss = 0 # without the penalization term 106 | start_time = time.time() 107 | for batch, i in enumerate(range(0, len(data_train), args.batch_size)): 108 | data, targets = package(data_train[i:i+args.batch_size], volatile=False) 109 | if args.cuda: 110 | data = data.cuda() 111 | targets = targets.cuda() 112 | hidden = model.init_hidden(data.size(1)) 113 | output, attention = model.forward(data, hidden) 114 | loss = criterion(output.view(data.size(1), -1), targets) 115 | total_pure_loss += loss.data 116 | 117 | if attention: # add penalization term 118 | attentionT = torch.transpose(attention, 1, 2).contiguous() 119 | extra_loss = Frobenius(torch.bmm(attention, attentionT) - I[:attention.size(0)]) 120 | loss += args.penalization_coeff * extra_loss 121 | optimizer.zero_grad() 122 | loss.backward() 123 | 124 | nn.utils.clip_grad_norm(model.parameters(), args.clip) 125 | optimizer.step() 126 | 127 | total_loss += loss.data 128 | 129 | if batch % args.log_interval == 0 and batch > 0: 130 | elapsed = time.time() - start_time 131 | print('| epoch {:3d} | {:5d}/{:5d} batches | ms/batch {:5.2f} | loss {:5.4f} | pure loss {:5.4f}'.format( 132 | epoch_number, batch, len(data_train) // args.batch_size, 133 | elapsed * 1000 / args.log_interval, total_loss[0] / args.log_interval, 134 | total_pure_loss[0] / args.log_interval)) 135 | total_loss = 0 136 | total_pure_loss = 0 137 | start_time = time.time() 138 | 139 | # for item in model.parameters(): 140 | # print item.size(), torch.sum(item.data ** 2), torch.sum(item.grad ** 2).data[0] 141 | # print model.encoder.ws2.weight.grad.data 142 | # exit() 143 | evaluate_start_time = time.time() 144 | val_loss, acc = evaluate() 145 | print('-' * 89) 146 | fmt = '| evaluation | time: {:5.2f}s | valid loss (pure) {:5.4f} | Acc {:8.4f}' 147 | print(fmt.format((time.time() - evaluate_start_time), val_loss, acc)) 148 | print('-' * 89) 149 | # Save the model, if the validation loss is the best we've seen so far. 150 | if not best_val_loss or val_loss < best_val_loss: 151 | with open(args.save, 'wb') as f: 152 | torch.save(model, f) 153 | f.close() 154 | best_val_loss = val_loss 155 | else: # if loss doesn't go down, divide the learning rate by 5. 156 | for param_group in optimizer.param_groups: 157 | param_group['lr'] = param_group['lr'] * 0.2 158 | if not best_acc or acc > best_acc: 159 | with open(args.save[:-3]+'.best_acc.pt', 'wb') as f: 160 | torch.save(model, f) 161 | f.close() 162 | best_acc = acc 163 | with open(args.save[:-3]+'.epoch-{:02d}.pt'.format(epoch_number), 'wb') as f: 164 | torch.save(model, f) 165 | f.close() 166 | 167 | 168 | if __name__ == '__main__': 169 | # parse the arguments 170 | args = get_args() 171 | 172 | # Set the random seed manually for reproducibility. 
173 | torch.manual_seed(args.seed) 174 | if torch.cuda.is_available(): 175 | if not args.cuda: 176 | print("WARNING: You have a CUDA device, so you should probably run with --cuda") 177 | else: 178 | torch.cuda.manual_seed(args.seed) 179 | random.seed(args.seed) 180 | 181 | # Load Dictionary 182 | assert os.path.exists(args.train_data) 183 | assert os.path.exists(args.val_data) 184 | print('Begin to load the dictionary.') 185 | dictionary = Dictionary(path=args.dictionary) 186 | 187 | best_val_loss = None 188 | best_acc = None 189 | 190 | n_token = len(dictionary) 191 | model = Classifier({ 192 | 'dropout': args.dropout, 193 | 'ntoken': n_token, 194 | 'nlayers': args.nlayers, 195 | 'nhid': args.nhid, 196 | 'ninp': args.emsize, 197 | 'pooling': 'all', 198 | 'attention-unit': args.attention_unit, 199 | 'attention-hops': args.attention_hops, 200 | 'nfc': args.nfc, 201 | 'dictionary': dictionary, 202 | 'word-vector': args.word_vector, 203 | 'class-number': args.class_number 204 | }) 205 | if args.cuda: 206 | model = model.cuda() 207 | 208 | print(args) 209 | I = Variable(torch.zeros(args.batch_size, args.attention_hops, args.attention_hops)) 210 | for i in range(args.batch_size): 211 | for j in range(args.attention_hops): 212 | I.data[i][j][j] = 1 213 | if args.cuda: 214 | I = I.cuda() 215 | 216 | criterion = nn.CrossEntropyLoss() 217 | if args.optimizer == 'Adam': 218 | optimizer = optim.Adam(model.parameters(), lr=args.lr, betas=[0.9, 0.999], eps=1e-8, weight_decay=0) 219 | elif args.optimizer == 'SGD': 220 | optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=0.9, weight_decay=0.01) 221 | else: 222 | raise Exception('For other optimizers, please add it yourself. ' 223 | 'supported ones are: SGD and Adam.') 224 | print('Begin to load data.') 225 | data_train = open(args.train_data).readlines() 226 | data_val = open(args.val_data).readlines() 227 | try: 228 | for epoch in range(args.epochs): 229 | train(epoch) 230 | except KeyboardInterrupt: 231 | print('-' * 89) 232 | print('Exit from training early.') 233 | data_val = open(args.test_data).readlines() 234 | evaluate_start_time = time.time() 235 | test_loss, acc = evaluate() 236 | print('-' * 89) 237 | fmt = '| test | time: {:5.2f}s | test loss (pure) {:5.4f} | Acc {:8.4f}' 238 | print(fmt.format((time.time() - evaluate_start_time), test_loss, acc)) 239 | print('-' * 89) 240 | exit(0) -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow >=1.5 2 | torch 3 | numpy 4 | jieba 5 | gensim 6 | -------------------------------------------------------------------------------- /tf/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/18 -------------------------------------------------------------------------------- /tf/bad_cases.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/10 5 | # !/usr/bin/env python 6 | import os 7 | 8 | import tensorflow as tf 9 | 10 | from dataset import Dataset 11 | from demo.train import FLAGS 12 | 13 | 14 | def bad_cases(): 15 | print("\nPredicting...\n") 16 | graph = tf.Graph() 17 | with graph.as_default(): # with tf.Graph().as_default() as g: 18 | sess = tf.Session() 19 | with sess.as_default(): 20 | # Load the 
saved meta graph and restore variables 21 | # saver = tf.train.Saver(tf.global_variables()) 22 | meta_file = os.path.abspath(os.path.join(FLAGS.model_dir, 'checkpoints/model-1000.meta')) 23 | new_saver = tf.train.import_meta_graph(meta_file) 24 | new_saver.restore(sess, tf.train.latest_checkpoint(os.path.join(FLAGS.model_dir, 'checkpoints'))) 25 | # graph = tf.get_default_graph() 26 | 27 | # Get the placeholders from the graph by name 28 | # input_x1 = graph.get_operation_by_name("input_x1").outputs[0] 29 | input_x1 = graph.get_tensor_by_name("input_x1:0") # Tensor("input_x1:0", shape=(?, 15), dtype=int32) 30 | input_x2 = graph.get_tensor_by_name("input_x2:0") 31 | dropout_keep_prob = graph.get_tensor_by_name("dropout_keep_prob:0") 32 | # Tensors we want to evaluate 33 | sim = graph.get_tensor_by_name("metrics/sim:0") 34 | y_pred = graph.get_tensor_by_name("metrics/y_pred:0") 35 | 36 | dev_sample = {} 37 | for line in open(FLAGS.data_file): 38 | line = line.strip().split('\t') 39 | dev_sample[line[0]] = line[1] 40 | 41 | # Generate batches for one epoch 42 | dataset = Dataset(data_file="data/pred.csv") 43 | x1, x2, y = dataset.process_data(sequence_length=FLAGS.max_document_length, is_training=False) 44 | with open("result/fp_file", 'w') as f_fp, open("result/fn_file", 'w') as f_fn: 45 | for lineno, (x1_online, x2_online, y_online) in enumerate(zip(x1, x2, y)): 46 | sim_val, y_pred_val = sess.run( 47 | [sim, y_pred], {input_x1: [x1_online], input_x2: [x2_online], dropout_keep_prob: 1.0}) 48 | if y_pred_val[0] == 1 and y_online == 0: # low precision 49 | f_fp.write(dev_sample[str(lineno + 1)] + str(sim_val[0]) + '\n') 50 | elif y_pred_val[0] == 0 and y_online == 1: # low recall 51 | f_fn.write(dev_sample[str(lineno + 1)] + str(sim_val[0]) + '\n') 52 | 53 | if __name__ == '__main__': 54 | # Set to INFO for tracking training, default is WARN.
ERROR for least messages 55 | tf.logging.set_verbosity(tf.logging.WARN) 56 | bad_cases() 57 | -------------------------------------------------------------------------------- /tf/dataset.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/6 5 | """This module provide an elegant data process class.""" 6 | from __future__ import unicode_literals 7 | 8 | import logging 9 | import multiprocessing 10 | import os 11 | import re 12 | import sys 13 | import time 14 | from collections import Counter 15 | 16 | import jieba 17 | import jieba.analyse 18 | import numpy as np 19 | import tensorflow as tf 20 | from gensim.models import Word2Vec 21 | 22 | sys.path.insert(0, '../') 23 | from utils.langconv import Converter 24 | reload(sys) 25 | sys.setdefaultencoding('utf-8') 26 | 27 | # jieba.enable_parallel(4) # This is a bug, make add_word no use 28 | jieba.load_userdict('../data/UserDict.txt') 29 | stopwords = ['的', '了'] 30 | 31 | 32 | class Dataset(object): 33 | """Custom dataset class to deal with input text data.""" 34 | def __init__(self, 35 | data_file='../data/atec_nlp_sim_train.csv', 36 | npy_char_data_file='../data/train_char.npy', 37 | npy_word_data_file='../data/train_word.npy', 38 | char_vocab_file='../data/vocab.char', 39 | word_vocab_file='../data/vocab.word', 40 | char2vec_file='../data/char_vec', 41 | word2vec_file='../data/word_vec', 42 | char_level=True, 43 | embedding_dim=128, 44 | is_training=True, 45 | ): 46 | self.data_file = data_file 47 | self.npy_char_data_file = npy_char_data_file 48 | self.npy_word_data_file = npy_word_data_file 49 | self.char_vocab_file = char_vocab_file 50 | self.word_vocab_file = word_vocab_file 51 | self.word2vec_file = word2vec_file 52 | self.char2vec_file = char2vec_file 53 | self.char_level = char_level 54 | self.embedding_dim = embedding_dim 55 | self.is_training = is_training 56 | if self.char_level: 57 | print('Using character level model.') 58 | else: 59 | print('Using word level model.') 60 | self.w2v_file = self.char2vec_file if self.char_level else self.word2vec_file 61 | self.vocab_file = self.char_vocab_file if self.char_level else self.word_vocab_file 62 | self.npy_file = self.npy_char_data_file if self.char_level else self.npy_word_data_file 63 | 64 | @staticmethod 65 | def _clean_text(text): 66 | """Text filter for Chinese corpus, keep CN character and remove stopwords.""" 67 | re_non_ch = re.compile(ur'[^\u4e00-\u9fa5]+') 68 | text = text.strip(' ') 69 | text = re_non_ch.sub('', text) 70 | for w in stopwords: 71 | text = re.sub(w, '', text) 72 | return text 73 | 74 | @staticmethod 75 | def _tradition2simple(text): 76 | """Tradition Chinese corpus to simplify Chinese.""" 77 | text = Converter('zh-hans').convert(text) 78 | return text 79 | 80 | def _load_data(self, data_file): 81 | """Load origin train data and do text pre-processing (converting and cleaning) 82 | Returns: 83 | A generator 84 | if self.is_training: 85 | train sentence pairs and labels (s1, s2, y). 86 | else: 87 | train sentence pairs and None (s1, s2, None). 
88 | """ 89 | for line in open(data_file): 90 | line = line.strip().decode('utf-8').split('\t') 91 | s1, s2 = map(self._clean_text, map(self._tradition2simple, line[1:3])) 92 | if not self.char_level: 93 | s1 = list(jieba.cut(s1)) 94 | s2 = list(jieba.cut(s2)) 95 | if self.is_training: 96 | y = int(line[-1]) # 1 or [1] 97 | yield s1, s2, y 98 | else: 99 | yield s1, s2, None # for consistent 100 | 101 | def _save_token_data(self): 102 | data_iter = self._load_data(self.data_file) 103 | with open('../data/atec_token.csv', 'w') as f: 104 | for s1, s2, _ in data_iter: 105 | f.write(' '.join(s1) + '|' + ' '.join(s2) + '\n') 106 | 107 | def _build_vocab(self, max_vocab_size=100000, min_count=2): 108 | """Build vocabulary list.""" 109 | data_iter = self._load_data(self.data_file) 110 | token = [] 111 | for s1, s2, _ in data_iter: 112 | if self.char_level: 113 | for words in s1+s2: 114 | for char in words: 115 | token.append(char) 116 | else: 117 | token.extend(s1+s2) 118 | print("Number of tokens: {}".format(len(token))) 119 | counter = Counter(token) 120 | word_count = counter.most_common(max_vocab_size - 1) # sort by word freq. 121 | vocab = ['UNK'] # for oov words 122 | vocab += [w[0] for w in word_count if w[1] >= min_count] 123 | vocab.append('') # add word '' for padding 124 | print("Vocabulary size: {}".format(len(vocab))) 125 | with open(self.vocab_file, 'w') as fo: 126 | fo.write('\n'.join(vocab)) 127 | 128 | def read_vocab(self): 129 | """Read vocabulary list 130 | Returns: 131 | tuple (id2word, word2id). 132 | """ 133 | if not os.path.exists(self.vocab_file): 134 | print('Vocabulary file not found. Building vocabulary...') 135 | self._build_vocab() 136 | else: 137 | print("Reading vocabulary file from {}".format(self.vocab_file)) 138 | id2word = open(self.vocab_file).read().split('\n') # list 139 | word2id = dict(zip(id2word, range(len(id2word)))) # dict 140 | return id2word, word2id 141 | 142 | def _word2vec(self, window=5, min_count=2): 143 | """Train and save word vectors""" 144 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 145 | s_time = time.time() 146 | s1, s2, _ = zip(*list(self._load_data(self.data_file))) 147 | sentences = s1 + s2 148 | size = self.embedding_dim 149 | # trim unneeded model memory = use(much) less RAM 150 | # model.init_sims(replace=True) 151 | model = Word2Vec(sentences, sg=1, size=size, window=window, min_count=min_count, 152 | negative=3, sample=0.001, hs=1, workers=multiprocessing.cpu_count(), iter=20) 153 | # model.save(output_model_file) 154 | model.wv.save_word2vec_format(self.w2v_file, binary=False) 155 | print("Word2vec training time: %d s" % (time.time() - s_time)) 156 | 157 | def load_word2vec(self): 158 | """mapping the words to word vectors. 159 | Returns: 160 | tuple (words, vectors) 161 | """ 162 | if not os.path.exists(self.w2v_file): 163 | print('Word vectors file not found. 
Training word vectors...') 164 | self._word2vec() 165 | words, vecs = [], [] 166 | fr = open(self.w2v_file) 167 | word_dim = int(fr.readline().strip().split(' ')[1]) # first line 168 | print("Pre-trained word vectors dim: {}".format(word_dim)) 169 | if word_dim != self.embedding_dim: 170 | print("Inconsistent word embedding dim, retrain word vectors...") 171 | self._word2vec() 172 | return self.load_word2vec() 173 | else: 174 | words.append("UNK") 175 | vecs.append([0] * word_dim) 176 | words.append("") 177 | vecs.append([0] * word_dim) 178 | for line in fr: 179 | line = line.decode('utf-8').strip().split(' ') 180 | words.append(line[0]) 181 | vecs.append(line[1:]) 182 | print("Loaded pre-trained word vectors.") 183 | return words, vecs 184 | 185 | def process_data(self, data_file, sequence_length=20): 186 | """Process text data file to word-id matrix representation. 187 | Args: 188 | data_file: process data file. 189 | sequence_length: int, max sequence length. (default 20) 190 | Returns: 191 | 2-D List. 192 | if self.is_training: 193 | each element of list is [s1_pad, s2_pad, y] 194 | else: 195 | each element of list is [s1_pad, s2_pad] 196 | """ 197 | if data_file == self.data_file and os.path.exists(self.npy_file): # only for all train data 198 | dataset = np.load(self.npy_file) 199 | # check sequence length same or not 200 | if len(dataset[0][0]) == sequence_length: 201 | print("Loaded saved npy word-id matrix train file.") 202 | return dataset 203 | else: 204 | print("Found inconsistent sequence length with npy file.") 205 | 206 | _, word2id = self.read_vocab() 207 | data_iter = self._load_data(data_file) 208 | dataset = [] 209 | print('Converting word-index matrix...') 210 | for s1, s2, y in data_iter: 211 | # oov words id is 0, token is either a single char or word. 212 | s1_id = [word2id.get(token, 0) for token in s1] 213 | s2_id = [word2id.get(token, 0) for token in s2] 214 | # "pre" or "post" important, "pre" much better, why ? 215 | s1_pad = tf.keras.preprocessing.sequence.pad_sequences( 216 | [s1_id], maxlen=sequence_length, padding='post', truncating='post', value=len(word2id)-1) 217 | s2_pad = tf.keras.preprocessing.sequence.pad_sequences( 218 | [s2_id], maxlen=sequence_length, padding='post', truncating='post', value=len(word2id)-1) 219 | # y = tf.keras.utils.to_categorical(y) # turn label into onehot 220 | if self.is_training: 221 | dataset.append([s1_pad[0], s2_pad[0], y]) 222 | else: 223 | dataset.append([s1_pad[0], s2_pad[0]]) 224 | print("Saving npy...") 225 | dataset = np.asarray(dataset) 226 | np.save(self.npy_file, dataset) 227 | # np.savez(save_file, x1=x1, x2=x2, y=y) # save multiple arrays as zip file. 228 | # np.savetxt(save_file, np.concatenate([x1, x2, y], axis=1), fmt="%d") # or use np.hstack() 229 | return dataset 230 | 231 | @staticmethod 232 | def train_test_split(dataset, test_size=0.2, random_seed=123): 233 | """Split train data into train and test sets. 234 | Args: 235 | dataset: 2-D list, each element is a sample list [x1, x2, y, len(s1), len(s2)] 236 | test_size: float, int. (default 0.2) 237 | If float, should be between 0.0 and 1.0 and represent the proportion of test set. 238 | If int, represents the absolute number of test samples. 239 | random_seed: int or None. (default 123) 240 | If None, do not use random seed. 
241 | Returns 242 | A tuple (trainset, testset) 243 | """ 244 | dataset = np.asarray(dataset) 245 | num_samples = len(dataset) 246 | test_size = int(num_samples * test_size) if isinstance(test_size, float) else test_size 247 | print('Total number of samples: {}'.format(num_samples)) 248 | print('Test data size: {}'.format(test_size)) 249 | if random_seed: 250 | np.random.seed(random_seed) 251 | shuffle_indices = np.random.permutation(np.arange(num_samples)) 252 | dataset_shuffled = dataset[shuffle_indices] 253 | trainset = dataset_shuffled[test_size:] 254 | testset = dataset_shuffled[:test_size] 255 | print('Train eval data split done.') 256 | return trainset, testset 257 | 258 | @staticmethod 259 | def batch_iter(dataset, batch_size, num_epochs, shuffle=True): 260 | """Generates a batch iterator for a dataset. 261 | Args: 262 | dataset: 2-D list, each element is a sample list [x1, x2, y] 263 | Returns: 264 | list of batch samples [x1, x2, y]. 265 | use zip(*return) to generate x1_batch, x2_batch, y_batch 266 | """ 267 | dataset = np.asarray(dataset) 268 | data_size = len(dataset) 269 | num_batches_per_epoch = int((len(dataset)-1)/batch_size) + 1 270 | for epoch in range(num_epochs): 271 | # Shuffle the data at each epoch 272 | if shuffle: 273 | shuffle_indices = np.random.permutation(np.arange(data_size)) 274 | shuffled_data = dataset[shuffle_indices] 275 | else: 276 | shuffled_data = dataset 277 | for batch_num in range(num_batches_per_epoch): 278 | start_index = batch_num * batch_size 279 | end_index = min((batch_num + 1) * batch_size, data_size) 280 | yield shuffled_data[start_index:end_index] 281 | 282 | 283 | if __name__ == '__main__': 284 | d_char = Dataset(char_level=True) 285 | d_word = Dataset(char_level=False, embedding_dim=128) 286 | # s1, s2, y = d_word._load_data('../data/train.csv').next() 287 | 288 | # d_word._build_vocab() 289 | d_word._save_token_data() 290 | # id2w, w2id = d_word.read_vocab() 291 | # dataset = Dataset().process_data('../data/atec_nlp_sim_train.csv') 292 | # data = Dataset().batch_iter(dataset, 5, 1, shuffle=False).next() 293 | # print(data) 294 | # d_word.load_word2vec() 295 | 296 | 297 | 298 | 299 | 300 | -------------------------------------------------------------------------------- /tf/encoder.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/6/8 5 | """This module contains two kinds of encoders: CNNEncoder and RNNEncoder.""" 6 | import tensorflow as tf 7 | 8 | 9 | class CNNEncoder(object): 10 | 11 | def __init__(self, sequence_length, embedding_dim, filter_sizes, num_filters): 12 | self._sequence_length = sequence_length 13 | self._embedding_dim = embedding_dim 14 | self._filter_sizes = filter_sizes 15 | self._num_filters = num_filters 16 | 17 | def forward(self, x, scope="CNN"): 18 | with tf.variable_scope(scope, reuse=tf.AUTO_REUSE): 19 | # Create a convolution + maxpool layer for each filter size 20 | x = tf.expand_dims(x, -1) # shape(batch_size, seq_len, dim, 1) 21 | pooled_outputs = [] 22 | for i, filter_size in enumerate(self._filter_sizes): 23 | with tf.variable_scope("conv-maxpool-%s" % filter_size, reuse=None): 24 | # Convolution Layer 25 | filter_shape = [filter_size, self._embedding_dim, 1, self._num_filters] 26 | W = tf.get_variable("W", filter_shape, initializer=tf.truncated_normal_initializer(stddev=0.1)) 27 | b = tf.get_variable("bias", [self._num_filters], initializer=tf.constant_initializer(0.1)) 28 | conv = 
tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding="VALID", name="conv") 29 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 30 | # Maxpooling over the outputs 31 | pooled = tf.nn.max_pool(h, ksize=[1, self._sequence_length - filter_size + 1, 1, 1], 32 | strides=[1, 1, 1, 1], padding='VALID', name="pool") 33 | pooled_outputs.append(pooled) 34 | # Combine all the pooled features 35 | num_filters_total = self._num_filters * len(self._filter_sizes) 36 | h_pool = tf.concat(pooled_outputs, 3) 37 | h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total]) # very sparse ! 38 | 39 | # very important, very sensitive to dropout rate 0.7 good! 40 | with tf.name_scope("dropout"): 41 | h_drop = tf.nn.dropout(h_pool_flat, 0.7) 42 | 43 | # very important, necessary 44 | with tf.name_scope("output"): 45 | W = tf.get_variable("W", shape=[num_filters_total, 128], 46 | initializer=tf.contrib.layers.xavier_initializer()) 47 | b = tf.Variable(tf.constant(0.1, shape=[128]), name="b") 48 | outputs = tf.nn.xw_plus_b(h_drop, W, b, name="outputs") 49 | return outputs 50 | 51 | 52 | class RNNEncoder(object): 53 | 54 | def __init__(self, rnn_cell, hidden_units, num_layers, dropout_keep_prob, use_dynamic, use_attention): 55 | self._rnn_cell = rnn_cell 56 | self._hidden_units = hidden_units 57 | self._num_layers = num_layers 58 | self._dropout_keep_prob = dropout_keep_prob 59 | self._use_dynamic = use_dynamic 60 | self._use_attention = use_attention 61 | 62 | def forward(self, x, sequence_length=None, scope="RNN"): 63 | rnn = tf.nn.rnn_cell 64 | with tf.variable_scope(scope, reuse=tf.AUTO_REUSE): # initializer=tf.orthogonal_initializer(), 65 | # scope.reuse_variables() # or tf.get_variable_scope().reuse_variables() 66 | # current_batch_of_words does not correspond to a "sentence" of words 67 | # but [t_steps, batch_size, num_features] 68 | # Unpacks the given dimension of a rank-`R` tensor into rank-`(R-1)` tensors. 69 | # sequence_length list tensors of shape (batch_size, embedding_dim) 70 | if not self._use_dynamic: 71 | x = tf.unstack(tf.transpose(x, perm=[1, 0, 2])) # `static_rnn` input 72 | if self._rnn_cell.lower() == 'lstm': 73 | rnn_cell = rnn.LSTMCell 74 | elif self._rnn_cell.lower() == 'gru': 75 | rnn_cell = rnn.GRUCell 76 | elif self._rnn_cell.lower() == 'rnn': 77 | rnn_cell = rnn.BasicRNNCell 78 | else: 79 | raise ValueError("Invalid rnn_cell type.") 80 | 81 | with tf.variable_scope("fw"): 82 | # state(c, h), tf.nn.rnn_cell.BasicLSTMCell does not support gradient clipping, use tf.nn.rnn_cell.LSTMCell. 
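# Clarification of the comment above: the clipping that tf.nn.rnn_cell.LSTMCell
# supports (and BasicLSTMCell lacks) is the cell-state clip exposed through its
# `cell_clip` constructor argument, e.g. (illustrative value only, not used in this repo):
#     fw_cell = rnn.LSTMCell(self._hidden_units, cell_clip=5.0)
# Gradient clipping proper operates on the gradients with tf.clip_by_global_norm
# (see tf/train.py) and works with either cell type.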
83 | # fw_cells = [rnn_cell(hidden_units) for _ in range(num_layers)] 84 | fw_cells = [] 85 | for _ in range(self._num_layers): 86 | fw_cell = rnn_cell(self._hidden_units) 87 | fw_cell = rnn.DropoutWrapper(fw_cell, output_keep_prob=self._dropout_keep_prob, 88 | variational_recurrent=False, dtype=tf.float32) 89 | fw_cells.append(fw_cell) 90 | fw_cells = rnn.MultiRNNCell(cells=fw_cells, state_is_tuple=True) 91 | with tf.variable_scope("bw"): 92 | bw_cells = [] 93 | for _ in range(self._num_layers): 94 | bw_cell = rnn_cell(self._hidden_units) 95 | bw_cell = rnn.DropoutWrapper(bw_cell, output_keep_prob=self._dropout_keep_prob, 96 | variational_recurrent=False, dtype=tf.float32) 97 | bw_cells.append(bw_cell) 98 | bw_cells = rnn.MultiRNNCell(cells=bw_cells, state_is_tuple=True) 99 | 100 | if self._use_dynamic: 101 | # [batch_size, max_time, cell_fw.output_size] 102 | outputs, output_states = tf.nn.bidirectional_dynamic_rnn( 103 | fw_cells, bw_cells, x, sequence_length=sequence_length, dtype=tf.float32) 104 | outputs = tf.concat(outputs, 2) 105 | if self._rnn_cell.lower() == 'lstm': 106 | out = tf.concat([output_states[-1][0].h, output_states[-1][1].h], 1) 107 | else: 108 | out = tf.concat([output_states[-1][0], output_states[-1][1]], 1) 109 | # outputs = outputs[:, -1, :] # take last hidden states (batch_size, 2*hidden_units) 110 | # outputs = self._last_relevant(outputs, sequence_length) 111 | else: 112 | # `static_rnn` Returns: A tuple (outputs, output_state_fw, output_state_bw) 113 | # outputs is a list of timestep outputs, depth-concatenated forward and backward outputs. 114 | outputs, state_fw, state_bw = tf.nn.static_bidirectional_rnn( 115 | fw_cells, bw_cells, x, dtype=tf.float32, sequence_length=sequence_length) 116 | outputs = tf.transpose(tf.stack(outputs), perm=[1, 0, 2]) 117 | if self._rnn_cell.lower() == 'lstm': 118 | out = tf.concat([state_fw[-1].h, state_bw[-1].h], 1) # good 119 | else: 120 | out = tf.concat([state_fw[-1], state_bw[-1]], 1) 121 | # outputs = tf.reduce_mean(outputs, 0) # average [batch_size, hidden_units] (mean pooling) 122 | # outputs = tf.reduce_max(outputs, axis=0) # max pooling, bad result. 
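# Another option (not used in this repo) is length-masked mean pooling: the commented
# reduce_mean/reduce_max variants above pool over padded positions as well, while a
# mask restricts pooling to real timesteps. Sketch, assuming `outputs` has shape
# (batch_size, max_time, 2*hidden_units) and `sequence_length` holds the true lengths:
#     mask = tf.sequence_mask(sequence_length, tf.shape(outputs)[1], dtype=tf.float32)
#     outputs_sum = tf.reduce_sum(outputs * tf.expand_dims(mask, -1), axis=1)
#     out = outputs_sum / tf.expand_dims(tf.cast(sequence_length, tf.float32) + 1e-8, -1)  # epsilon avoids /0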
123 | # outputs = outputs[-1] # take last hidden state [batch_size, hidden_units] 124 | # outputs = tf.transpose(tf.stack(outputs), [1, 0, 2]) # shape(batch_size, seq_len, hidden_units) 125 | # outputs = self._last_relevant(outputs, sequence_length) 126 | if self._use_attention: 127 | d_a = 300 128 | r = 2 129 | self.H = outputs 130 | batch_size = tf.shape(x)[0] 131 | initializer = tf.contrib.layers.xavier_initializer() 132 | with tf.variable_scope("attention"): # TODO: Nan in summary histogram for: RNN/attention/W_s2_0/grad/hist 133 | # shape(W_s1) = d_a * 2u 134 | self.W_s1 = tf.get_variable('W_s1', shape=[d_a, 2 * self._hidden_units], initializer=initializer) 135 | # shape(W_s2) = r * d_a 136 | self.W_s2 = tf.get_variable('W_s2', shape=[r, d_a], initializer=initializer) 137 | # shape (d_a, 2u) --> shape(batch_size, d_a, 2u) 138 | self.W_s1 = tf.tile(tf.expand_dims(self.W_s1, 0), [batch_size, 1, 1]) 139 | self.W_s2 = tf.tile(tf.expand_dims(self.W_s2, 0), [batch_size, 1, 1]) 140 | # attention matrix A = softmax(W_s2*tanh(W_s1*H^T) shape(A) = batch_siz * r * n 141 | self.H_T = tf.transpose(self.H, perm=[0, 2, 1], name="H_T") 142 | self.A = tf.nn.softmax( 143 | tf.matmul(self.W_s2, tf.tanh(tf.matmul(self.W_s1, self.H_T)), name="A")) 144 | # sentences embedding matrix M = AH shape(M) = (batch_size, r, 2u) 145 | self.M = tf.matmul(self.A, self.H, name="M") 146 | out = tf.reshape(self.M, [batch_size, -1]) 147 | 148 | with tf.variable_scope("penalization"): 149 | # penalization term: Frobenius norm square of matrix AA^T-I, ie. P = |AA^T-I|_F^2 150 | A_T = tf.transpose(self.A, perm=[0, 2, 1], name="A_T") 151 | I = tf.eye(r, r, batch_shape=[batch_size], name="I") 152 | self.P = tf.square(tf.norm(tf.matmul(self.A, A_T) - I, axis=[-2, -1], ord='fro'), name="P") 153 | return out 154 | 155 | @staticmethod 156 | def _last_relevant(outputs, sequence_length): 157 | """Deprecated""" 158 | batch_size = tf.shape(outputs)[0] 159 | max_length = outputs.get_shape()[1] 160 | output_size = outputs.get_shape()[2] 161 | index = tf.range(0, batch_size) * max_length + (sequence_length - 1) 162 | flat = tf.reshape(outputs, [-1, output_size]) 163 | last_timesteps = tf.gather(flat, index) # very slow 164 | # mask = tf.sign(index) 165 | # last_timesteps = tf.boolean_mask(flat, mask) 166 | # # Creating a vector of 0s and 1s that will specify what timesteps to choose. 167 | # partitions = tf.reduce_sum(tf.one_hot(index, tf.shape(flat)[0], dtype='int32'), 0) 168 | # # Selecting the elements we want to choose. 
169 | # _, last_timesteps = tf.dynamic_partition(flat, partitions, 2) # (batch_size, n_dim) 170 | # https://stackoverflow.com/questions/35892412/tensorflow-dense-gradient-explanation 171 | return last_timesteps 172 | 173 | if __name__ == '__main__': 174 | x1 = tf.placeholder(tf.int32, [None, 20], name="input_x1") 175 | x2 = tf.placeholder(tf.int32, [None, 20], name="input_x2") 176 | cnn_encoder = CNNEncoder( 177 | sequence_length=20, 178 | embedding_dim=128, 179 | filter_sizes=[3,4,5], 180 | num_filters=100, 181 | ) 182 | rnn_encoder = RNNEncoder( 183 | rnn_cell='lstm', 184 | hidden_units=100, 185 | num_layers=2, 186 | dropout_keep_prob=0.7, 187 | use_dynamic=False, 188 | use_attention=False, 189 | ) 190 | 191 | 192 | 193 | -------------------------------------------------------------------------------- /tf/pred.py: -------------------------------------------------------------------------------- 1 | # !/usr/bin/env python 2 | import os 3 | import sys 4 | 5 | import tensorflow as tf 6 | 7 | from dataset import Dataset 8 | from demo.train import FLAGS 9 | 10 | FLAGS.model_dir = '/home/hongquan/atec_nlp/model/rnn/' 11 | FLAGS.max_document_length = 20 12 | 13 | 14 | def main(input_file, output_file): 15 | print("\nPredicting...\n") 16 | graph = tf.Graph() 17 | with graph.as_default(): # with tf.Graph().as_default() as g: 18 | sess = tf.Session() 19 | with sess.as_default(): 20 | # Load the saved meta graph and restore variables 21 | # saver = tf.train.Saver(tf.global_variables()) 22 | meta_file = os.path.abspath(os.path.join(FLAGS.model_dir, 'checkpoints/model-3400.meta')) 23 | new_saver = tf.train.import_meta_graph(meta_file) 24 | new_saver.restore(sess, tf.train.latest_checkpoint(os.path.join(FLAGS.model_dir, 'checkpoints'))) 25 | # graph = tf.get_default_graph() 26 | 27 | # Get the placeholders from the graph by name 28 | # input_x1 = graph.get_operation_by_name("input_x1").outputs[0] 29 | input_x1 = graph.get_tensor_by_name("input_x1:0") # Tensor("input_x1:0", shape=(?, 15), dtype=int32) 30 | input_x2 = graph.get_tensor_by_name("input_x2:0") 31 | dropout_keep_prob = graph.get_tensor_by_name("dropout_keep_prob:0") 32 | # Tensors we want to evaluate 33 | y_pred = graph.get_tensor_by_name("metrics/y_pred:0") 34 | # vars = tf.get_collection('vars') 35 | # for var in vars: 36 | # print(var) 37 | 38 | e = graph.get_tensor_by_name("cosine:0") 39 | 40 | # Generate batches for one epoch 41 | dataset = Dataset(data_file=input_file, is_training=False) 42 | data = dataset.process_data(data_file=input_file, sequence_length=FLAGS.max_document_length) 43 | batches = dataset.batch_iter(data, FLAGS.batch_size, 1, shuffle=False) 44 | with open(output_file, 'w') as fo: 45 | lineno = 1 46 | for batch in batches: 47 | x1_batch, x2_batch, _, _ = zip(*batch) 48 | y_pred_ = sess.run([y_pred], {input_x1: x1_batch, input_x2: x2_batch, dropout_keep_prob: 1.0}) 49 | for pred in y_pred_[0]: 50 | fo.write('{}\t{}\n'.format(lineno, pred)) 51 | lineno += 1 52 | 53 | if __name__ == '__main__': 54 | # Set to INFO for tracking training, default is WARN. 
ERROR for least messages 55 | tf.logging.set_verbosity(tf.logging.WARN) 56 | main(sys.argv[1], sys.argv[2]) 57 | -------------------------------------------------------------------------------- /tf/siamese_net.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/6/10 5 | """Siamese Similarity Network regard this task as a sentence similarity problem; 6 | Siamese Classification Network regard this task as a text classification problem. 7 | 8 | References: 9 | Learning Text Similarity with Siamese Recurrent Networks, 2016 10 | Siamese Recurrent Architectures for Learning Sentence Similarity, 2016 11 | """ 12 | import tensorflow as tf 13 | 14 | 15 | class SiameseNets(object): 16 | """Siamese base nets, input embedding and encoder layer output. """ 17 | def __init__(self, input_x1, input_x2, word_embedding_type, vocab_size, embedding_size, 18 | encoder_type, cnn_encoder, rnn_encoder, dense_layer, weight_sharing): 19 | """ 20 | Args: 21 | cnn_encoder: instance of CNNEncoder 22 | rnn_encoder: instance of RNNEncoder 23 | """ 24 | # input word level dropout, data augmentation, invariance to small input change 25 | # self.shape = tf.shape(self.input_x1) 26 | # self.mask1 = tf.cast(tf.random_uniform(self.shape) > 0.1, tf.int32) 27 | # self.mask2 = tf.cast(tf.random_uniform(self.shape) > 0.1, tf.int32) 28 | # self.input_x1 = self.input_x1 * self.mask1 29 | # self.input_x2 = self.input_x2 * self.mask2 30 | self._encoder_type = encoder_type 31 | self._rnn_encoder = rnn_encoder 32 | seqlen1 = tf.cast(tf.reduce_sum(tf.sign(input_x1), 1), tf.int32) 33 | seqlen2 = tf.cast(tf.reduce_sum(tf.sign(input_x2), 1), tf.int32) 34 | assert word_embedding_type in {'rand', 'static', 'non-static'}, 'Invalid word embedding type' 35 | with tf.variable_scope("embedding"): 36 | if word_embedding_type == "rand": 37 | self.W = tf.Variable( 38 | tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), 39 | trainable=True, name="W") # tf.truncated_normal() 40 | else: 41 | trainable = False if word_embedding_type == "static" else True 42 | self.W = tf.Variable( 43 | tf.constant(0.0, shape=[vocab_size, embedding_size]), 44 | trainable=trainable, name="W") 45 | # embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_size]) 46 | # self.embedding_init = self.W.assign(embedding_placeholder) 47 | self.embedded_1 = tf.nn.embedding_lookup(self.W, input_x1) 48 | self.embedded_2 = tf.nn.embedding_lookup(self.W, input_x2) 49 | # Input embedding dropout. very sensitive to the dropout rate ! 
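# Note on the two dropout calls below: the keep probability is hard-coded to 0.7,
# so this embedding dropout also fires at prediction time. Wiring it to the
# dropout_keep_prob placeholder created in tf/train.py would disable it during
# inference; a sketch (assumes the placeholder were passed into this constructor,
# which it currently is not):
#     self.embedded_1 = tf.nn.dropout(self.embedded_1, dropout_keep_prob)
#     self.embedded_2 = tf.nn.dropout(self.embedded_2, dropout_keep_prob)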
50 | self.embedded_1 = tf.nn.dropout(self.embedded_1, 0.7) 51 | self.embedded_2 = tf.nn.dropout(self.embedded_2, 0.7) 52 | if weight_sharing: 53 | cnn_scope1, cnn_scope2, rnn_scope1, rnn_scope2 = "CNN", "CNN", "RNN", "RNN" 54 | else: 55 | cnn_scope1, cnn_scope2, rnn_scope1, rnn_scope2 = "CNN1", "CNN2", "RNN1", "RNN2" 56 | if encoder_type.lower() == 'cnn': 57 | self.out1 = cnn_encoder.forward(self.embedded_1, cnn_scope1) 58 | self.out2 = cnn_encoder.forward(self.embedded_2, cnn_scope2) 59 | elif encoder_type.lower() == 'rnn': 60 | self.out1 = rnn_encoder.forward(self.embedded_1, seqlen1, rnn_scope1) 61 | self.out2 = rnn_encoder.forward(self.embedded_2, seqlen2, rnn_scope2) 62 | elif encoder_type.lower() == 'rcnn': 63 | cnn_out1 = cnn_encoder.forward(self.embedded_1, cnn_scope1) 64 | cnn_out2 = cnn_encoder.forward(self.embedded_2, cnn_scope2) 65 | rnn_out1 = rnn_encoder.forward(self.embedded_1, seqlen1, rnn_scope1) 66 | rnn_out2 = rnn_encoder.forward(self.embedded_2, seqlen2, rnn_scope2) 67 | self.out1 = tf.concat([cnn_out1, rnn_out1], axis=1) 68 | self.out2 = tf.concat([cnn_out2, rnn_out2], axis=1) 69 | else: 70 | raise ValueError("Invalid encoder type.") 71 | 72 | if dense_layer: 73 | with tf.variable_scope("fc"): 74 | out_dim = self.out1.get_shape().as_list()[1] 75 | W1 = tf.get_variable("W1", shape=[out_dim, 128], initializer=tf.contrib.layers.xavier_initializer()) 76 | b1 = tf.Variable(tf.constant(0.1, shape=[128]), name="b1") 77 | W2 = tf.get_variable("W2", shape=[out_dim, 128], initializer=tf.contrib.layers.xavier_initializer()) 78 | b2 = tf.Variable(tf.constant(0.1, shape=[128]), name="b2") 79 | self.out1 = tf.nn.xw_plus_b(self.out1, W1, b1, name="out1") 80 | self.out2 = tf.nn.xw_plus_b(self.out2, W2, b2, name="out2") 81 | 82 | @property 83 | def variables(self): 84 | # for v in tf.trainable_variables(): 85 | # print(v) 86 | return tf.global_variables() 87 | 88 | 89 | class SiameseSimilarityNets(SiameseNets): 90 | """A siamese based deep network for text similarity. 
91 | Use a character/word level embedding layer, followed by a {`BiLSTM`, `CNN`, `combine`} encoder layer, 92 | then use euclidean distance/cosine/manhattan distance to measure similarity""" 93 | def __init__(self, input_x1, input_x2, input_y, 94 | word_embedding_type, vocab_size, embedding_size, 95 | encoder_type, cnn_encoder, rnn_encoder, dense_layer, 96 | l2_reg_lambda, pred_threshold, energy_func, loss_func='contrasive', 97 | margin=0.0, contrasive_loss_pos_weight=1.0, weight_sharing=True): 98 | self.input_y = input_y 99 | self._l2_reg_lambda = l2_reg_lambda 100 | self._pred_threshold = pred_threshold 101 | self._energy_func = energy_func 102 | self._loss_func = loss_func 103 | self._margin = margin 104 | self._contrastive_loss_pos_weight = contrasive_loss_pos_weight 105 | super(SiameseSimilarityNets, self).__init__( 106 | input_x1, input_x2, word_embedding_type, vocab_size, embedding_size, 107 | encoder_type, cnn_encoder, rnn_encoder, dense_layer, weight_sharing) 108 | 109 | def forward(self): 110 | # out1_norm = tf.sqrt(tf.reduce_sum(tf.square(self.out1), 1)) 111 | # out2_norm = tf.sqrt(tf.reduce_sum(tf.square(self.out2), 1)) 112 | # self.distance = tf.sqrt(tf.reduce_sum(tf.square(self.out1 - self.out2), 1, keepdims=False)) 113 | distance = tf.norm(self.out1-self.out2, ord='euclidean', axis=1, keepdims=False, name='euc-distance') 114 | distance = tf.div(distance, tf.add(tf.norm(self.out1, 2, axis=1), tf.norm(self.out2, 2, axis=1))) 115 | self.sim_euc = tf.subtract(1.0, distance, name="euc") 116 | 117 | # self.sim = tf.reduce_sum(tf.multiply(self.out1, self.out2), 1) / tf.multiply(out1_norm, out2_norm) 118 | out1_norm = tf.nn.l2_normalize(self.out1, 1) # output = x / sqrt(max(sum(x**2), epsilon)) 119 | out2_norm = tf.nn.l2_normalize(self.out2, 1) 120 | self.sim_cos = tf.reduce_sum(tf.multiply(out1_norm, out2_norm), axis=1, name="cosine") 121 | # sim = exp(-||x1-x2||) range (0, 1] 122 | # self.sim_ma = tf.exp(-tf.reduce_sum(tf.abs(self.out1 - self.out2), 1), name="manhattan") 123 | self.sim_ma = tf.exp(-tf.norm(self.out1-self.out2, 1, 1), name="manhattan") 124 | 125 | if self._energy_func == 'euclidean': 126 | self.sim = self.sim_euc 127 | elif self._energy_func == 'cosine': 128 | self.sim = self.sim_cos 129 | elif self._energy_func == 'exp_manhattan': 130 | self.sim = self.sim_ma 131 | elif self._energy_func == 'combine': 132 | w = tf.Variable(1, dtype=tf.float32) 133 | self.sim = w * self.sim_euc + (1 - w) * self.sim_cos 134 | else: 135 | raise ValueError("Invalid energy function name.") 136 | self.y_pred = tf.cast(tf.greater(self.sim, self._pred_threshold), dtype=tf.float32, name="y_pred") 137 | 138 | with tf.name_scope("loss"): 139 | if self._loss_func == 'contrasive': 140 | self.loss = self.contrastive_loss(self.input_y, self.sim) 141 | elif self._loss_func == 'cross_entrophy': 142 | self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.input_y, logits=self.sim)) 143 | # add l2 reg except bias anb BN variables. 144 | self.l2 = self._l2_reg_lambda * tf.reduce_sum( 145 | [tf.nn.l2_loss(v) for v in tf.trainable_variables() if not ("noreg" in v.name or "bias" in v.name)]) 146 | self.loss += self.l2 147 | if self._encoder_type != 'cnn' and self._rnn_encoder._use_attention: 148 | self.loss += tf.reduce_mean(self._rnn_encoder.P) 149 | 150 | # Accuracy computation is outside of this class. 
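# For reference, the contrastive loss selected above (defined in contrastive_loss()
# below) computes, for similarity e, label y, pos_weight w and margin m:
#     loss = mean( y * w * (1 - e)^2 + (1 - y) * max(e - m, 0)^2 )
# A NumPy check on hypothetical toy values (not from the data):
#     y = np.array([1., 0.]); e = np.array([0.9, 0.3]); w, m = 1.0, 0.0
#     l1 = w * (1 - e) ** 2                 # [0.01, 0.49]
#     l0 = np.maximum(e - m, 0) ** 2        # [0.81, 0.09]
#     np.mean(y * l1 + (1 - y) * l0)        # (0.01 + 0.09) / 2 = 0.05
# so similar pairs are pushed toward e = 1 and dissimilar pairs toward e <= m.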
151 | # self.accuracy = tf.reduce_mean(tf.cast(tf.equal(self.y_pred, self.input_y), tf.float32), name="accuracy") 152 | TP = tf.count_nonzero(self.input_y * self.y_pred, dtype=tf.float32) 153 | TN = tf.count_nonzero((self.input_y - 1) * (self.y_pred - 1), dtype=tf.float32) 154 | FP = tf.count_nonzero(self.y_pred * (self.input_y - 1), dtype=tf.float32) 155 | FN = tf.count_nonzero((self.y_pred - 1) * self.input_y, dtype=tf.float32) 156 | # tf.div like python2 division, tf.divide like python3 157 | self.acc = tf.divide(TP + TN, TP + TN + FP + FN, name="accuracy") 158 | self.precision = tf.divide(TP, TP + FP, name="precision") 159 | self.recall = tf.divide(TP, TP + FN, name="recall") 160 | self.cm = tf.confusion_matrix(self.input_y, self.y_pred, name="confusion_matrix") 161 | # tf.assert_equal(self.acc, self.acc_) 162 | # https://github.com/tensorflow/tensorflow/issues/15115, be careful! 163 | # _, self.acc = tf.metrics.accuracy(self.input_y, self.y_pred) 164 | # _, self.precision = tf.metrics.precision(self.input_y, self.y_pred, name='precision') 165 | # _, self.recall = tf.metrics.recall(self.input_y, self.y_pred, name='recall') 166 | self.f1 = tf.divide(2 * self.precision * self.recall, self.precision + self.recall, name="F1_score") 167 | 168 | def contrastive_loss(self, y, e): 169 | # margin and pos_weight can directly influence P and R metrics. 170 | l_1 = self._contrastive_loss_pos_weight * tf.pow(1-e, 2) 171 | l_0 = tf.square(tf.maximum(e-self._margin, 0)) 172 | loss = tf.reduce_mean(y * l_1 + (1 - y) * l_0) 173 | return loss 174 | 175 | 176 | class SiameseClassificationNets(SiameseNets): 177 | """A Siamese based deep network for text similarity. 178 | Uses character/word level embedding layer, followed by a {`BiLSTM`, `CNN`, `combine`} encoder layer, 179 | then use multiply/concat interaction to feed for classification layers. 
180 | """ 181 | def __init__(self, input_x1, input_x2, input_y, 182 | word_embedding_type, vocab_size, embedding_size, 183 | encoder_type, cnn_encoder, rnn_encoder, dense_layer, 184 | l2_reg_lambda, interaction="multiply", weight_sharing=True): 185 | self.input_y = input_y 186 | self._l2_reg_lambda = l2_reg_lambda 187 | self._interaction = interaction 188 | super(SiameseClassificationNets, self).__init__( 189 | input_x1, input_x2, word_embedding_type, vocab_size, embedding_size, 190 | encoder_type, cnn_encoder, rnn_encoder, dense_layer, weight_sharing) 191 | 192 | def forward(self): 193 | if self._interaction == 'concat': 194 | self.out = tf.concat([self.out1, self.out2], axis=1, name="out") 195 | elif self._interaction == 'multiply': 196 | self.out = tf.multiply(self.out1, self.out2, name="out") 197 | fc = tf.layers.dense(self.out, 128, name='fc1', activation=tf.nn.relu) 198 | # self.scores = tf.layers.dense(self.fc, 1, activation=tf.nn.sigmoid) 199 | self.logits = tf.layers.dense(fc, 2, name='fc2') 200 | # self.y_pred = tf.round(tf.nn.sigmoid(self.logits), name="predictions") # pred class 201 | self.y_pred = tf.cast(tf.argmax(tf.nn.sigmoid(self.logits), 1, name="predictions"), tf.float32) 202 | 203 | with tf.name_scope("loss"): 204 | # [batch_size, num_classes] 205 | y = tf.one_hot(tf.cast(self.input_y, tf.int32), 2) 206 | cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=self.logits, labels=y) 207 | self.loss = tf.reduce_mean(cross_entropy) 208 | # self.loss = tf.losses.sigmoid_cross_entropy(logits=self.logits, multi_class_labels=y) 209 | 210 | # y = self.input_y 211 | # y_ = self.scores 212 | # self.loss = -tf.reduce_mean(pos_weight * y * tf.log(tf.clip_by_value(y_, 1e-10, 1.0)) 213 | # + (1-y) * tf.log(tf.clip_by_value(1-y_, 1e-10, 1.0))) 214 | # add l2 reg except bias anb BN variables. 215 | self.l2 = self._l2_reg_lambda * tf.reduce_sum( 216 | [tf.nn.l2_loss(v) for v in tf.trainable_variables() if not ("noreg" in v.name or "bias" in v.name)]) 217 | self.loss += self.l2 218 | 219 | # Accuracy computation is outside of this class. 
220 | with tf.name_scope("metrics"): 221 | TP = tf.count_nonzero(self.input_y * self.y_pred, dtype=tf.float32) 222 | TN = tf.count_nonzero((self.input_y - 1) * (self.y_pred - 1), dtype=tf.float32) 223 | FP = tf.count_nonzero(self.y_pred * (self.input_y - 1), dtype=tf.float32) 224 | FN = tf.count_nonzero((self.y_pred - 1) * self.input_y, dtype=tf.float32) 225 | # tf.div like python2 division, tf.divide like python3 226 | self.cm = tf.confusion_matrix(self.input_y, self.y_pred, name="confusion_matrix") 227 | self.acc = tf.divide(TP + TN, TP + TN + FP + FN, name="accuracy") 228 | self.precision = tf.divide(TP, TP + FP, name="precision") 229 | self.recall = tf.divide(TP, TP + FN, name="recall") 230 | self.f1 = tf.divide(2 * self.precision * self.recall, self.precision + self.recall, name="F1_score") 231 | 232 | 233 | if __name__ == '__main__': 234 | from encoder import CNNEncoder, RNNEncoder 235 | x1 = tf.placeholder(tf.int32, [None, 20], name="input_x1") 236 | x2 = tf.placeholder(tf.int32, [None, 20], name="input_x2") 237 | y = tf.placeholder(tf.float32, [None], name="input_y") 238 | cnn_encoder = CNNEncoder( 239 | sequence_length=20, 240 | embedding_dim=128, 241 | filter_sizes=[3, 4, 5], 242 | num_filters=100, 243 | ) 244 | rnn_encoder = RNNEncoder( 245 | rnn_cell='lstm', 246 | hidden_units=100, 247 | num_layers=2, 248 | dropout_keep_prob=0.7, 249 | use_dynamic=False, 250 | use_attention=False, 251 | ) 252 | model1 = SiameseSimilarityNets( 253 | input_x1=x1, 254 | input_x2=x2, 255 | input_y=y, 256 | word_embedding_type='rand', 257 | vocab_size=10000, 258 | embedding_size=128, 259 | encoder_type='cnn', 260 | cnn_encoder=cnn_encoder, 261 | rnn_encoder=rnn_encoder, 262 | dense_layer=False, 263 | l2_reg_lambda=0, 264 | pred_threshold=0.5, 265 | energy_func='cosine', 266 | loss_func='contrasive', 267 | margin=0.0, 268 | contrasive_loss_pos_weight=1.0, 269 | weight_sharing=True) 270 | model2 = SiameseClassificationNets( 271 | input_x1=x1, 272 | input_x2=x2, 273 | input_y=y, 274 | word_embedding_type='rand', 275 | vocab_size=10000, 276 | embedding_size=128, 277 | encoder_type='cnn', 278 | cnn_encoder=cnn_encoder, 279 | rnn_encoder=rnn_encoder, 280 | dense_layer=False, 281 | l2_reg_lambda=0, 282 | interaction='multiply', 283 | weight_sharing=True 284 | ) -------------------------------------------------------------------------------- /tf/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/6/11 5 | # !/usr/bin/env python 6 | # coding: utf-8 7 | from __future__ import unicode_literals 8 | import os 9 | import datetime 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | from dataset import Dataset 15 | from encoder import CNNEncoder, RNNEncoder 16 | from siamese_net import SiameseSimilarityNets, SiameseClassificationNets 17 | 18 | # Data loading params 19 | tf.flags.DEFINE_string("data_file", "../data/atec_nlp_sim_train1.csv", "Training data file path.") 20 | tf.flags.DEFINE_float("val_percentage", .1, "Percentage of the training data to use for validation. (default: 0.2)") 21 | tf.flags.DEFINE_integer("random_seed", 123, "Random seed to split train and test. (default: None)") 22 | tf.flags.DEFINE_integer("max_document_length", 30, "Max document length of each train pair. (default: 15)") 23 | tf.flags.DEFINE_boolean("char_model", False, "Character based syntactic model. if false, word based semantic model. 
(default: True)") 24 | tf.flags.DEFINE_integer("embedding_dim", 128, "Dimensionality of character/word embedding (default: 300)") 25 | 26 | # Model Hyperparameters 27 | tf.flags.DEFINE_string("model_class", "similarity", "Model class, one of {`similarity`, `classification`}") 28 | tf.flags.DEFINE_string("model_type", "rcnn", "Model type, one of {`cnn`, `rnn`, `rcnn`} (default: rnn)") 29 | tf.flags.DEFINE_string("word_embedding_type", "non-static", "One of `rand`, `static`, `non-static`, random init(rand) vs pretrained word2vec(static) vs pretrained word2vec + training(non-static)") 30 | # If include CNN 31 | tf.flags.DEFINE_string("filter_sizes", "2,3,4,5", "Comma-separated filter sizes (default: '3,4,5')") 32 | tf.flags.DEFINE_integer("num_filters", 100, "Number of filters per filter size (default: 128)") 33 | # If include RNN 34 | tf.flags.DEFINE_string("rnn_cell", "gru", "Rnn cell type, lstm or gru or rnn(default: lstm)") 35 | tf.flags.DEFINE_integer("hidden_units", 100, "Number of hidden units (default: 50)") 36 | tf.flags.DEFINE_integer("num_layers", 2, "Number of rnn layers (default: 3)") 37 | tf.flags.DEFINE_float("clip_norm", 5, "Gradient clipping norm value set None to not use (default: 5)") 38 | tf.flags.DEFINE_boolean("use_dynamic", True, "Whether use dynamic rnn or not (default: False)") 39 | tf.flags.DEFINE_boolean("use_attention", False, "Whether use self attention or not (default: False)") 40 | # Common 41 | tf.flags.DEFINE_boolean("weight_sharing", True, "Sharing CNN or RNN encoder weights. (default: True") 42 | tf.flags.DEFINE_boolean("dense_layer", False, "Whether to add a fully connected layer before calculate energy function. (default: False)") 43 | tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 1.0)") 44 | tf.flags.DEFINE_string("energy_function", "cosine", "Similarity energy function, one of {`euclidean`, `cosine`, `exp_manhattan`, `combine`} (default: euclidean)") 45 | tf.flags.DEFINE_string("loss_function", "contrasive", "Loss function one of `cross_entrophy`, `contrasive`, (default: contrasive loss)") 46 | tf.flags.DEFINE_float("pred_threshold", 0.5, "Threshold for classify.(default: 0.5)") 47 | tf.flags.DEFINE_float("l2_reg_lambda", 0, "L2 regularizaion lambda (default: 0.0)") 48 | # Only for contrasive loss 49 | tf.flags.DEFINE_float("scale_pos_weight", 2, "Scale loss function for imbalance data, set it around neg_samples / pos_samples ") 50 | tf.flags.DEFINE_float("margin", 0.0, "Margin for contrasive loss (default: 0.0)") 51 | 52 | # Training parameters 53 | tf.flags.DEFINE_string("model_dir", "../model", "Model directory (default: ../model)") 54 | tf.flags.DEFINE_integer("batch_size", 128, "Batch Size (default: 64)") 55 | tf.flags.DEFINE_float("lr", 1e-2, "Initial learning rate (default: 1e-3)") 56 | tf.flags.DEFINE_float("weight_decay_rate", 0.5, "Exponential weight decay rate (default: 0.9) ") 57 | tf.flags.DEFINE_integer("num_epochs", 50, "Number of training epochs (default: 100)") 58 | tf.flags.DEFINE_integer("log_every_steps", 100, "Print log info after this many steps (default: 100)") 59 | tf.flags.DEFINE_integer("evaluate_every_steps", 100, "Evaluate model on dev set after this many steps (default: 100)") 60 | # tf.flags.DEFINE_integer("checkpoint_every_steps", 1000, "Save model after this many steps (default: 100)") 61 | tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)") 62 | 63 | FLAGS = tf.flags.FLAGS 64 | # supress tensorflow logging other than errors 65 | 
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 66 | 67 | 68 | def train(): 69 | print("Using TensorFlow Version %s" % tf.__version__) 70 | assert "1.5" <= tf.__version__, "Need TensorFlow 1.5 or Later." 71 | print("\nParameters:") 72 | for attr in FLAGS: 73 | value = FLAGS[attr].value 74 | print("{}={}".format(attr.upper(), value)) 75 | print("") 76 | if not FLAGS.data_file: 77 | exit("Train data file is empty. Set --data_file argument.") 78 | 79 | dataset = Dataset(data_file=FLAGS.data_file, char_level=FLAGS.char_model, embedding_dim=FLAGS.embedding_dim) 80 | vocab, word2id = dataset.read_vocab() 81 | print("Vocabulary Size: {:d}".format(len(vocab))) 82 | # Generate batches 83 | data = dataset.process_data(data_file=FLAGS.data_file, sequence_length=FLAGS.max_document_length) # (x1, x2, y) 84 | train_data, eval_data = dataset.train_test_split(data, test_size=FLAGS.val_percentage, random_seed=FLAGS.random_seed) 85 | train_batches = dataset.batch_iter(train_data, FLAGS.batch_size, FLAGS.num_epochs, shuffle=True) 86 | 87 | with tf.Graph().as_default(): 88 | tf.set_random_seed(FLAGS.random_seed) 89 | session_conf = tf.ConfigProto( 90 | allow_soft_placement=True, 91 | log_device_placement=False) 92 | sess = tf.Session(config=session_conf) 93 | 94 | input_x1 = tf.placeholder(tf.int32, [None, FLAGS.max_document_length], name="input_x1") 95 | input_x2 = tf.placeholder(tf.int32, [None, FLAGS.max_document_length], name="input_x2") 96 | input_y = tf.placeholder(tf.float32, [None], name="input_y") 97 | dropout_keep_prob = tf.placeholder(tf.float32, name="input_y") 98 | cnn_encoder = CNNEncoder( 99 | sequence_length=FLAGS.max_document_length, 100 | embedding_dim=FLAGS.embedding_dim, 101 | filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))), 102 | num_filters=FLAGS.num_filters, 103 | ) 104 | rnn_encoder = RNNEncoder( 105 | rnn_cell=FLAGS.rnn_cell, 106 | hidden_units=FLAGS.hidden_units, 107 | num_layers=FLAGS.num_layers, 108 | dropout_keep_prob=dropout_keep_prob, 109 | use_dynamic=FLAGS.use_dynamic, 110 | use_attention=FLAGS.use_attention, 111 | ) 112 | 113 | with sess.as_default(): 114 | if FLAGS.model_class == 'similarity': 115 | model = SiameseSimilarityNets( 116 | input_x1=input_x1, 117 | input_x2=input_x2, 118 | input_y=input_y, 119 | encoder_type=FLAGS.model_type, 120 | cnn_encoder=cnn_encoder, 121 | rnn_encoder=rnn_encoder, 122 | vocab_size=len(vocab), 123 | embedding_size=FLAGS.embedding_dim, 124 | word_embedding_type=FLAGS.word_embedding_type, 125 | dense_layer=FLAGS.dense_layer, 126 | pred_threshold=FLAGS.pred_threshold, 127 | l2_reg_lambda=FLAGS.l2_reg_lambda, 128 | energy_func=FLAGS.energy_function, 129 | loss_func=FLAGS.loss_function, 130 | margin=FLAGS.margin, 131 | contrasive_loss_pos_weight=FLAGS.scale_pos_weight, 132 | weight_sharing=FLAGS.weight_sharing 133 | ) 134 | print("Initialized SiameseSimilarityNets model.") 135 | elif FLAGS.model_class == 'classification': 136 | model = SiameseClassificationNets( 137 | input_x1=input_x1, 138 | input_x2=input_x2, 139 | input_y=input_y, 140 | word_embedding_type=FLAGS.word_embedding_type, 141 | vocab_size=len(vocab), 142 | embedding_size=FLAGS.embedding_dim, 143 | encoder_type=FLAGS.model_type, 144 | cnn_encoder=cnn_encoder, 145 | rnn_encoder=rnn_encoder, 146 | dense_layer=FLAGS.dense_layer, 147 | l2_reg_lambda=FLAGS.l2_reg_lambda, 148 | interaction='multiply', 149 | weight_sharing=FLAGS.weight_sharing 150 | ) 151 | print("Initialized SiameseClassificationNets model.") 152 | else: 153 | raise ValueError("Invalid model class. 
Expected one of {`similarity`, `classification`} ") 154 | model.forward() 155 | 156 | # Define Training procedure 157 | global_step = tf.Variable(0, name="global_step", trainable=False) 158 | learning_rate = tf.train.exponential_decay(FLAGS.lr, global_step, decay_steps=int(40000/FLAGS.batch_size), 159 | decay_rate=FLAGS.weight_decay_rate, staircase=True) 160 | optimizer = tf.train.AdamOptimizer(learning_rate) 161 | # optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9) 162 | # optimizer = tf.train.GradientDescentOptimizer(learning_rate) 163 | # optimizer = tf.train.RMSPropOptimizer(learning_rate) 164 | # optimizer = tf.train.AdadeltaOptimizer(learning_rate, epsilon=1e-6) 165 | 166 | # for i, (g, v) in enumerate(grads_and_vars): 167 | # if g is not None: 168 | # grads_and_vars[i] = (tf.clip_by_global_norm(g, 5), v) # clip gradients 169 | # train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 170 | if FLAGS.clip_norm: # improve loss, but small weight cause small score, need to turn threshold for better f1. 171 | variables = tf.trainable_variables() 172 | grads, _ = tf.clip_by_global_norm(tf.gradients(model.loss, variables), FLAGS.clip_norm) 173 | train_op = optimizer.apply_gradients(zip(grads, variables), global_step=global_step) 174 | grads_and_vars = zip(grads, variables) 175 | else: 176 | grads_and_vars = optimizer.compute_gradients(model.loss) 177 | train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 178 | # Keep track of gradient values and sparsity (optional) 179 | grad_summaries = [] 180 | for g, v in grads_and_vars: 181 | if g is not None: 182 | grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g) 183 | sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g)) 184 | grad_summaries.append(grad_hist_summary) 185 | grad_summaries.append(sparsity_summary) 186 | grad_summaries_merged = tf.summary.merge(grad_summaries) 187 | print("Defined gradient summaries.") 188 | 189 | # Summaries for loss and accuracy 190 | loss_summary = tf.summary.scalar("loss", model.loss) 191 | f1_summary = tf.summary.scalar("F1-score", model.f1) 192 | 193 | # Train Summaries 194 | train_summary_op = tf.summary.merge([loss_summary, f1_summary, grad_summaries_merged]) 195 | train_summary_dir = os.path.join(FLAGS.model_dir, "summaries", "train") 196 | train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph) 197 | 198 | # Dev summaries 199 | dev_summary_op = tf.summary.merge([loss_summary, f1_summary]) 200 | dev_summary_dir = os.path.join(FLAGS.model_dir, "summaries", "dev") 201 | dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph) 202 | 203 | # Checkpoint directory. 
Tensorflow assumes this directory already exists so we need to create it 204 | checkpoint_dir = os.path.abspath(os.path.join(FLAGS.model_dir, "checkpoints")) 205 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 206 | if not os.path.exists(checkpoint_dir): 207 | os.makedirs(checkpoint_dir) 208 | graph_def = tf.get_default_graph().as_graph_def() 209 | with open(os.path.join(checkpoint_dir, "graphpb.txt"), 'w') as f: 210 | f.write(str(graph_def)) 211 | saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints) 212 | # Initialize all variables 213 | sess.run(tf.global_variables_initializer()) 214 | sess.run(tf.local_variables_initializer()) 215 | 216 | if FLAGS.word_embedding_type != 'rand': 217 | # initial matrix with random uniform 218 | # embedding_init = np.random.uniform(-0.25, 0.25, (len(vocab), FLAGS.embedding_dim)) 219 | embedding_init = np.zeros(shape=(len(vocab), FLAGS.embedding_dim)) 220 | # load vectors from the word2vec 221 | print("Initializing word embedding with pre-trained word2vec.") 222 | words, vectors = dataset.load_word2vec() 223 | for idx, w in enumerate(vocab): 224 | vec = vectors[words.index(w)] 225 | embedding_init[idx] = np.asarray(vec).astype(np.float32) 226 | sess.run(model.W.assign(embedding_init)) 227 | 228 | print("Starting training...") 229 | F1_best = 0.0 230 | last_improved_step = 0 231 | for batch in train_batches: 232 | x1_batch, x2_batch, y_batch = zip(*batch) 233 | feed_dict = { 234 | input_x1: x1_batch, 235 | input_x2: x2_batch, 236 | input_y: y_batch, 237 | dropout_keep_prob: FLAGS.dropout_keep_prob 238 | } 239 | _, step, loss, cm, acc, precision, recall, f1, summaries = sess.run( 240 | [train_op, global_step, model.loss, model.cm, model.acc, model.precision, model.recall, model.f1, train_summary_op], feed_dict) 241 | time_str = datetime.datetime.now().isoformat() 242 | if step % FLAGS.log_every_steps == 0: 243 | train_summary_writer.add_summary(summaries, step) 244 | print("{} step {} TRAIN loss={:g} acc={:.3f} P={:.3f} R={:.3f} F1={:.6f}".format( 245 | time_str, step, loss, acc, precision, recall, f1)) 246 | if step % FLAGS.evaluate_every_steps == 0: 247 | # eval 248 | x1_batch, x2_batch, y_batch = zip(*eval_data) 249 | feed_dict = { 250 | input_x1: x1_batch, 251 | input_x2: x2_batch, 252 | input_y: y_batch, 253 | dropout_keep_prob: 1 254 | } 255 | #### debug for similarity model 256 | # x1, out1, out2, sim_euc, sim_cos, sim_ma, sim = sess.run( 257 | # [model.embedded_1, model.out1, model.out2, model.sim_euc, model.sim_cos, model.sim_ma, model.sim], feed_dict) 258 | # print(x1) 259 | # sim_euc = [round(s, 2) for s in sim_euc[:30]] 260 | # sim_cos = [round(s, 2) for s in sim_cos[:30]] 261 | # sim_ma = [round(s, 2) for s in sim_ma[:30]] 262 | # sim = [round(s, 2) for s in sim[:30]] 263 | # # print(out1) 264 | # out1 = [round(s, 3) for s in out1[0]] 265 | # out2 = [round(s, 3) for s in out2[0]] 266 | # print(zip(out1, out2)) 267 | # for w in zip(y_batch[:30], sim, sim_euc, sim_cos, sim_ma): 268 | # print(w) 269 | 270 | ##### debug for classification model 271 | # out1, out2, out, logits = sess.run( 272 | # [model.out1, model.out2, model.out, model.logits], feed_dict) 273 | # out1 = [round(s, 3) for s in out1[0]] 274 | # out2 = [round(s, 3) for s in out2[0]] 275 | # out = [round(s, 3) for s in out[0]] 276 | # print(zip(out1, out2)) 277 | # print(out) 278 | # print(logits) 279 | 280 | loss, cm, acc, precision, recall, f1, summaries = sess.run( 281 | [model.loss, model.cm, model.acc, model.precision, model.recall, model.f1, 
dev_summary_op], feed_dict) 282 | dev_summary_writer.add_summary(summaries, step) 283 | if f1 > F1_best: 284 | F1_best = f1 285 | last_improved_step = step 286 | if F1_best > 0.5: 287 | path = saver.save(sess, checkpoint_prefix, global_step=step) 288 | print("Saved model with F1={} checkpoint to {}\n".format(F1_best, path)) 289 | improved_token = '*' 290 | else: 291 | improved_token = '' 292 | print("{} step {} DEV loss={:g} acc={:.3f} cm{} P={:.3f} R={:.3f} F1={:.6f} {}".format( 293 | time_str, step, loss, acc, cm, precision, recall, f1, improved_token)) 294 | # if step % FLAGS.checkpoint_every_steps == 0: 295 | # if F1 >= F1_best: 296 | # F1_best = F1 297 | # path = saver.save(sess, checkpoint_prefix, global_step=step) 298 | # print("Saved model with F1={} checkpoint to {}\n".format(F1_best, path)) 299 | if step - last_improved_step > 4000: # 2000 steps 300 | print("No improvement for a long time, early-stopping at best F1={}".format(F1_best)) 301 | break 302 | 303 | if __name__ == '__main__': 304 | train() 305 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/6/8 -------------------------------------------------------------------------------- /utils/data_stats.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/6 5 | """This scripts calculate some statistical information about the original text data file. 6 | 7 | Main results: 8 | * positive sample percentage: train1->21.73% train2->16.06% total->18.23% 9 | * length from 5 to 97, 10 | * frequency distribution: (5, 0.0071), (6, 0.0292), (7, 0.0392), (8, 0.0752), (9, 0.0877), (10, 0.1072), (11, 0.1069), 11 | (12, 0.0994), (13, 0.0833), (14, 0.0682), (15, 0.0554), (16, 0.043), (17, 0.0338), (18, 0.0285), 12 | (19, 0.0229), (20, 0.0181), (21, 0.0153), (22, 0.0119), (23, 0.01), (24, 0.0088) ... 13 | * there is no significant frequency difference between positive samples and negative samples. 14 | * pair length diff: (0, 0.0902), (1, 0.1782), (2, 0.1557), (3, 0.1285), (4, 0.1004), (5, 0.0818), (6, 0.0587) ... 15 | * the positive pairs length diff at {0, 1, 2} is slightly higher than negative pairs. 
16 | """ 17 | from collections import Counter 18 | 19 | 20 | def positive_sample_percentage(filename): 21 | tot, pos = 0, 0 22 | for line in open(filename): 23 | line = line.strip().split('\t') 24 | if line[-1] == '1': 25 | pos += 1 26 | tot += 1 27 | print pos / float(tot) 28 | 29 | 30 | def sentence_length_distribution(filename): 31 | tot, pos = 0, 0 32 | pos_seq_len = [] 33 | neg_seq_len = [] 34 | tot_seq_len = [] 35 | for line in open(filename): 36 | line = line.strip().split('\t') 37 | s1 = line[1].decode('utf-8') 38 | s2 = line[2].decode('utf-8') 39 | tot_seq_len.extend([len(s1), len(s2)]) 40 | tot += 2 41 | if line[-1] == '1': 42 | pos_seq_len.extend([len(s1), len(s2)]) 43 | pos += 2 44 | else: 45 | neg_seq_len.extend([len(s1), len(s2)]) 46 | tot_counter = Counter(tot_seq_len) 47 | pos_counter = Counter(pos_seq_len) 48 | neg_counter = Counter(neg_seq_len) 49 | tot_freq = sorted(map(lambda x: (x[0], round(x[1]/float(tot), 4)), tot_counter.items())) 50 | pos_freq = sorted(map(lambda x: (x[0], round(x[1]/float(pos), 4)), pos_counter.items())) 51 | neg_freq = sorted(map(lambda x: (x[0], round(x[1]/float(tot-pos), 4)), neg_counter.items())) 52 | print('Total sample length distribution: {}'.format(tot_freq)) 53 | print('Positive sample length distribution: {}'.format(pos_freq)) 54 | print('Negetive sample length distribution: {}'.format(neg_freq)) 55 | 56 | 57 | def pair_length_diff_distribution(filename): 58 | tot, pos = 0, 0 59 | tot_diff = [] 60 | pos_diff = [] 61 | neg_diff = [] 62 | for line in open(filename): 63 | line = line.strip().split('\t') 64 | s1 = line[1].decode('utf-8') 65 | s2 = line[2].decode('utf-8') 66 | len_diff = abs(len(s1) - len(s2)) 67 | tot_diff.append(len_diff) 68 | tot += 1 69 | if line[-1] == '1': 70 | pos_diff.append(len_diff) 71 | pos += 1 72 | else: 73 | neg_diff.append(len_diff) 74 | tot_counter = Counter(tot_diff) 75 | pos_counter = Counter(pos_diff) 76 | neg_counter = Counter(neg_diff) 77 | tot_freq = sorted(map(lambda x: (x[0], round(x[1] / float(tot), 4)), tot_counter.items())) 78 | pos_freq = sorted(map(lambda x: (x[0], round(x[1] / float(pos), 4)), pos_counter.items())) 79 | neg_freq = sorted(map(lambda x: (x[0], round(x[1] / float(tot - pos), 4)), neg_counter.items())) 80 | print('Total pair length diff distribution: {}'.format(tot_freq)) 81 | print('Positive pair length diff distribution: {}'.format(pos_freq)) 82 | print('Negetive pair length diff distribution: {}'.format(neg_freq)) 83 | 84 | if __name__ == '__main__': 85 | filename = '../data/atec_nlp_sim_train.csv' 86 | positive_sample_percentage(filename) 87 | # sentence_length_distribution(filename) 88 | # pair_length_diff_distribution(filename) 89 | 90 | -------------------------------------------------------------------------------- /utils/feature_engineering.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/21 5 | from __future__ import unicode_literals 6 | from __future__ import division 7 | 8 | 9 | def len_diff(s1, s2): 10 | return abs(len(s1) - len(s2)) 11 | 12 | 13 | def len_diff_ratio(s1, s2): 14 | return 2 * abs(len(s1) - len(s2)) / (len(s1) + len(s2)) 15 | 16 | 17 | def shingle_similarity(s1, s2, size=1): 18 | """Shingle similarity of two sentences.""" 19 | def get_shingles(text, size): 20 | shingles = set() 21 | for i in range(0, len(text) - size + 1): 22 | shingles.add(text[i:i + size]) 23 | return shingles 24 | 25 | def jaccard(set1, set2): 26 | x = 
len(set1.intersection(set2)) 27 | y = len(set1.union(set2)) 28 | return x, y 29 | 30 | x, y = jaccard(get_shingles(s1, size), get_shingles(s2, size)) 31 | return x / float(y) if (y > 0 and x > 2) else 0.0 32 | 33 | 34 | def common_words(s1, s2): 35 | s1_common_cnt = len([w for w in s1 if w in s2]) 36 | s2_common_cnt = len([w for w in s2 if w in s1]) 37 | return (s1_common_cnt + s2_common_cnt) / (len(s1) + len(s2)) 38 | 39 | 40 | def tf_idf(): 41 | pass 42 | 43 | 44 | def wmd(): 45 | pass 46 | 47 | 48 | if __name__ == '__main__': 49 | s1 = '怎么更改花呗手机号码' 50 | s2 = '我的花呗是以前的手机号码,怎么更改成现在的支付宝的号码手机号' 51 | print(len_diff(s1, s2)) 52 | print(shingle_similarity(s1, s2)) 53 | print(shingle_similarity(s1, s2, 2)) 54 | print(shingle_similarity(s1, s2, 3)) 55 | 56 | -------------------------------------------------------------------------------- /utils/langconv.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | from copy import deepcopy 5 | 6 | try: 7 | import psyco 8 | psyco.full() 9 | except: 10 | pass 11 | 12 | try: 13 | from utils.zh_wiki import zh2Hant, zh2Hans 14 | except ImportError: 15 | from zhtools.zh_wiki import zh2Hant, zh2Hans 16 | 17 | import sys 18 | py3k = sys.version_info >= (3, 0, 0) 19 | 20 | if py3k: 21 | UEMPTY = '' 22 | else: 23 | _zh2Hant, _zh2Hans = {}, {} 24 | for old, new in ((zh2Hant, _zh2Hant), (zh2Hans, _zh2Hans)): 25 | for k, v in old.items(): 26 | new[k.decode('utf8')] = v.decode('utf8') 27 | zh2Hant = _zh2Hant 28 | zh2Hans = _zh2Hans 29 | UEMPTY = ''.decode('utf8') 30 | 31 | # states 32 | (START, END, FAIL, WAIT_TAIL) = list(range(4)) 33 | # conditions 34 | (TAIL, ERROR, MATCHED_SWITCH, UNMATCHED_SWITCH, CONNECTOR) = list(range(5)) 35 | 36 | MAPS = {} 37 | 38 | class Node(object): 39 | def __init__(self, from_word, to_word=None, is_tail=True, 40 | have_child=False): 41 | self.from_word = from_word 42 | if to_word is None: 43 | self.to_word = from_word 44 | self.data = (is_tail, have_child, from_word) 45 | self.is_original = True 46 | else: 47 | self.to_word = to_word or from_word 48 | self.data = (is_tail, have_child, to_word) 49 | self.is_original = False 50 | self.is_tail = is_tail 51 | self.have_child = have_child 52 | 53 | def is_original_long_word(self): 54 | return self.is_original and len(self.from_word)>1 55 | 56 | def is_follow(self, chars): 57 | return chars != self.from_word[:-1] 58 | 59 | def __str__(self): 60 | return '' % (repr(self.from_word), 61 | repr(self.to_word), self.is_tail, self.have_child) 62 | 63 | __repr__ = __str__ 64 | 65 | class ConvertMap(object): 66 | def __init__(self, name, mapping=None): 67 | self.name = name 68 | self._map = {} 69 | if mapping: 70 | self.set_convert_map(mapping) 71 | 72 | def set_convert_map(self, mapping): 73 | convert_map = {} 74 | have_child = {} 75 | max_key_length = 0 76 | for key in sorted(mapping.keys()): 77 | if len(key)>1: 78 | for i in range(1, len(key)): 79 | parent_key = key[:i] 80 | have_child[parent_key] = True 81 | have_child[key] = False 82 | max_key_length = max(max_key_length, len(key)) 83 | for key in sorted(have_child.keys()): 84 | convert_map[key] = (key in mapping, have_child[key], 85 | mapping.get(key, UEMPTY)) 86 | self._map = convert_map 87 | self.max_key_length = max_key_length 88 | 89 | def __getitem__(self, k): 90 | try: 91 | is_tail, have_child, to_word = self._map[k] 92 | return Node(k, to_word, is_tail, have_child) 93 | except: 94 | return Node(k) 95 | 96 | def __contains__(self, k): 97 | return 
k in self._map 98 | 99 | def __len__(self): 100 | return len(self._map) 101 | 102 | class StatesMachineException(Exception): pass 103 | 104 | class StatesMachine(object): 105 | def __init__(self): 106 | self.state = START 107 | self.final = UEMPTY 108 | self.len = 0 109 | self.pool = UEMPTY 110 | 111 | def clone(self, pool): 112 | new = deepcopy(self) 113 | new.state = WAIT_TAIL 114 | new.pool = pool 115 | return new 116 | 117 | def feed(self, char, map): 118 | node = map[self.pool+char] 119 | 120 | if node.have_child: 121 | if node.is_tail: 122 | if node.is_original: 123 | cond = UNMATCHED_SWITCH 124 | else: 125 | cond = MATCHED_SWITCH 126 | else: 127 | cond = CONNECTOR 128 | else: 129 | if node.is_tail: 130 | cond = TAIL 131 | else: 132 | cond = ERROR 133 | 134 | new = None 135 | if cond == ERROR: 136 | self.state = FAIL 137 | elif cond == TAIL: 138 | if self.state == WAIT_TAIL and node.is_original_long_word(): 139 | self.state = FAIL 140 | else: 141 | self.final += node.to_word 142 | self.len += 1 143 | self.pool = UEMPTY 144 | self.state = END 145 | elif self.state == START or self.state == WAIT_TAIL: 146 | if cond == MATCHED_SWITCH: 147 | new = self.clone(node.from_word) 148 | self.final += node.to_word 149 | self.len += 1 150 | self.state = END 151 | self.pool = UEMPTY 152 | elif cond == UNMATCHED_SWITCH or cond == CONNECTOR: 153 | if self.state == START: 154 | new = self.clone(node.from_word) 155 | self.final += node.to_word 156 | self.len += 1 157 | self.state = END 158 | else: 159 | if node.is_follow(self.pool): 160 | self.state = FAIL 161 | else: 162 | self.pool = node.from_word 163 | elif self.state == END: 164 | # END is a new START 165 | self.state = START 166 | new = self.feed(char, map) 167 | elif self.state == FAIL: 168 | raise StatesMachineException('Translate States Machine ' 169 | 'have error with input data %s' % node) 170 | return new 171 | 172 | def __len__(self): 173 | return self.len + 1 174 | 175 | def __str__(self): 176 | return '' % ( 177 | id(self), self.pool, self.state, self.final) 178 | __repr__ = __str__ 179 | 180 | class Converter(object): 181 | def __init__(self, to_encoding): 182 | self.to_encoding = to_encoding 183 | self.map = MAPS[to_encoding] 184 | self.start() 185 | 186 | def feed(self, char): 187 | branches = [] 188 | for fsm in self.machines: 189 | new = fsm.feed(char, self.map) 190 | if new: 191 | branches.append(new) 192 | if branches: 193 | self.machines.extend(branches) 194 | self.machines = [fsm for fsm in self.machines if fsm.state != FAIL] 195 | all_ok = True 196 | for fsm in self.machines: 197 | if fsm.state != END: 198 | all_ok = False 199 | if all_ok: 200 | self._clean() 201 | return self.get_result() 202 | 203 | def _clean(self): 204 | if len(self.machines): 205 | self.machines.sort(key=lambda x: len(x)) 206 | # self.machines.sort(cmp=lambda x,y: cmp(len(x), len(y))) 207 | self.final += self.machines[0].final 208 | self.machines = [StatesMachine()] 209 | 210 | def start(self): 211 | self.machines = [StatesMachine()] 212 | self.final = UEMPTY 213 | 214 | def end(self): 215 | self.machines = [fsm for fsm in self.machines 216 | if fsm.state == FAIL or fsm.state == END] 217 | self._clean() 218 | 219 | def convert(self, string): 220 | self.start() 221 | for char in string: 222 | self.feed(char) 223 | self.end() 224 | return self.get_result() 225 | 226 | def get_result(self): 227 | return self.final 228 | 229 | 230 | def registery(name, mapping): 231 | global MAPS 232 | MAPS[name] = ConvertMap(name, mapping) 233 | 234 | registery('zh-hant', 
zh2Hant) 235 | registery('zh-hans', zh2Hans) 236 | del zh2Hant, zh2Hans 237 | 238 | 239 | def run(): 240 | import sys 241 | from optparse import OptionParser 242 | parser = OptionParser() 243 | parser.add_option('-e', type='string', dest='encoding', 244 | help='encoding') 245 | parser.add_option('-f', type='string', dest='file_in', 246 | help='input file (- for stdin)') 247 | parser.add_option('-t', type='string', dest='file_out', 248 | help='output file') 249 | (options, args) = parser.parse_args() 250 | if not options.encoding: 251 | parser.error('encoding must be set') 252 | if options.file_in: 253 | if options.file_in == '-': 254 | file_in = sys.stdin 255 | else: 256 | file_in = open(options.file_in) 257 | else: 258 | file_in = sys.stdin 259 | if options.file_out: 260 | if options.file_out == '-': 261 | file_out = sys.stdout 262 | else: 263 | file_out = open(options.file_out, 'wb') 264 | else: 265 | file_out = sys.stdout 266 | 267 | c = Converter(options.encoding) 268 | for line in file_in: 269 | # print >> file_out, c.convert(line.rstrip('\n').decode( 270 | file_out.write(c.convert(line.rstrip('\n').decode( 271 | 'utf8')).encode('utf8')) 272 | 273 | 274 | if __name__ == '__main__': 275 | run() 276 | 277 | -------------------------------------------------------------------------------- /utils/train_test_split.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | # @Author: lapis-hong 4 | # @Date : 2018/5/16 5 | import random 6 | 7 | 8 | def train_test_split(infile, test_rate=0.2): 9 | with open('data/train.csv', 'w') as f_train, \ 10 | open('data/test.csv', 'w') as f_test: 11 | for line in open(infile): 12 | if random.random() > test_rate: 13 | f_train.write(line) 14 | else: 15 | f_test.write(line) 16 | 17 | 18 | if __name__ == '__main__': 19 | train_test_split('data/atec_nlp_sim_train.csv') 20 | 21 | --------------------------------------------------------------------------------
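utils/train_test_split.py assigns each input line to the test file independently with probability test_rate, so the realized test fraction only approximates 20% and changes between runs (the split is unseeded). A minimal seeded variant for reproducible splits, shown as a sketch rather than part of the repository (it keeps the same hard-coded data/train.csv and data/test.csv output paths):

#!/usr/bin/env python
# coding: utf-8
import random


def seeded_train_test_split(infile, test_rate=0.2, seed=123):
    """Line-level random split; a fixed seed makes the split reproducible."""
    rng = random.Random(seed)
    with open('data/train.csv', 'w') as f_train, \
            open('data/test.csv', 'w') as f_test:
        for line in open(infile):
            # Each line goes to the test file with probability `test_rate`.
            (f_test if rng.random() < test_rate else f_train).write(line)


if __name__ == '__main__':
    seeded_train_test_split('data/atec_nlp_sim_train.csv')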