├── .gitignore ├── README.md ├── ATEC NLP比赛感受.md ├── utils ├── plot_model.py ├── extract_wiki.py ├── test_cv_stacking.py └── train_embedding.py ├── pai_old.py └── pai_train.py /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .vscode/ 3 | .ipynb_checkpoints/ 4 | __pycache__/ 5 | data/ 6 | docs/ 7 | GitHubs/ 8 | logs/ 9 | PAI/ 10 | pai_model/ 11 | resources/ 12 | submits/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ATEC2018 NLP赛题 复赛f1 = 0.7327 2 | 3 | 由于PAI平台限制,所有代码都放在一个文件里面,`pai_train.py`是获得本次比赛成绩的文件,实验共使用了4个模型,分别是自定义Siamese网络、ESIM网络、Decomposable Attention和DSSM网络。其中Siamese、ESIM和Decomposable Attention有char level和word level两个版本,DSSM网络只有char和word的合并版本。最佳记录由多个模型进行blending融合预测,遗憾没有尝试一下10fold交叉训练模型,前排貌似都用了,而且这里每个模型都只用了2个小时来训练。 4 | 5 | 模型性能比较,字符级的esim模型在这个任务中表现最佳。 6 | 7 | | model name | 模型输出与标签相关性r | 最优f1评分 | 取得最优f1评分的阈值 | 8 | | ------------ | --------------------- | ------------------ | -------------------- | 9 | | siamese char | 0.553536380131115 | 0.6971525551574581 | 0.258 | 10 | | siamese word | 0.5308273808879237 | 0.6873517065157875 | 0.242 | 11 | | esim char | 0.5853469280801447 | 0.7116622491480499 | 0.233 | 12 | | esim word | 0.5783574742744366 | 0.7100964753080524 | 0.263 | 13 | | decom char | 0.5288425401105513 | 0.6825720620842572 | 0.249 | 14 | | decom word | 0.4943718720970039 | 0.6677430929314676 | 0.212 | 15 | | dssm both | 0.5638034287814917 | 0.6980098067493511 | 0.263 | 16 | 17 | 18 | 训练感受: 19 | 1. batchsize不要太大,虽然每个epoch更快完成, 但每个epoch权重更新次数变少了,收敛更慢 20 | 2. 使用[循环学习率](https://arxiv.org/abs/1506.01186)可以收敛到更好的极值点,更容易跳出局部极值,如在一个epoch中,使学习率从小变大,又逐渐变小 21 | 3. 利用[SWA](https://arxiv.org/abs/1803.05407)这种简单的模型融合方法可以获得泛化能力更好的性能,本地提升明显,但线上没有改善。 22 | 23 | 24 | `pai_transform.py`和`pai_old.py`是两次不成功的尝试: 25 | `pai_transform.py`试图参考fastai的ULMFiT方法,通过训练语言模型作为embedding输入,并针对当前分类任务更改网络结构以适应当前训练过程。 26 | `pai_old.py`试图参考quora分享,使用文本特征工程进行分类。 27 | 28 | 29 | > 模型来源siamese参考:https://blog.csdn.net/huowa9077/article/details/81082795 30 | > ESIM网络、Decomposable Attention来自Kaggle分享:https://www.kaggle.com/lamdang/dl-models 31 | > DSSM网络来自bird大神分享:https://openclub.alipay.com/read.php?tid=7480&fid=96 32 | > 感谢以上! -------------------------------------------------------------------------------- /ATEC NLP比赛感受.md: -------------------------------------------------------------------------------- 1 | # ATEC NLP比赛总结 2 | 3 | 今年5月份报名了蚂蚁金服的比赛,有金融大脑和风险大脑两个赛题,金融大脑主要解决智能客服遇到的自然语言处理问题,对于两个语句,判断是否是同一个意思,帮助构建客服的专用问答库,比赛的评判标准是f1分数,这对于正负样本不平衡问题比准确率更好,风险大脑则是通过用户登录和交易信息判断此次交易是否存在风险,在网络安全形势严峻的今天,其重要意义不言而喻。 4 | 5 | 2个月时间的投入,还是有一些收获: 6 | 7 | - 对keras的使用更加熟练,尤其是Callback的使用, 8 | 9 | - 阅读了pytorch的文档,初学使用pytorch,其特点为: 10 | 1. 强化版的numpy,前向运算和操作与numpy非常相似,而且可以直接利用GPU的运算能力。 11 | 2. 与TensorFlow不同的是,pytorch无需编译图,每次backward都会根据当前运算过程构造新的图,然后销毁,在程序中甚至可以通过条件语句直接改变图的运行流程,启用或停止相关节点。 12 | 3. pytorch对于最新研究成果的跟踪实现比keras快得多,拥有更丰富的神经网络层,更多优化器等。 13 | 14 | - 学习了基于pytorch的fastai框架,框架的作者Jeremy Howard是Kaggle高手,fastai框架吸收了一些Keras中便于使用的特性,整个框架源码约4000余行,短小精悍,使用方便 15 | 16 | - 跟着fastai的源码实践了统一语言模型精调(ULMFiT)方法,在文本相似度任务上并未取得好结果,ULMFiT方法特点如下: 17 | 1. 训练一个语言模型,模型架构为Embedding + 三层双向LSTM(+dropout),数据集一般为wiki,受限于数据加载和预处理方式,目前的源码仅能处理不超过500M的语料。 18 | 2. 在当前任务语料上finetune语言模型。 19 | 3. 
根据当前任务设计分类器模块,其出入为语言模型最后一个LSTM层的输出,从最后一层开始,逐层unfreeze,进行分类器模型精调。 20 | 21 | - 学习了batchsize参数对训练的影响,更大的batchsize意味着更准确的梯度方向,可以更快完成每个epoch,同时也意味着每个epoch的更新次数更少,需要更多的epoch才能使模型收敛,一味增大batchsize反而会延长训练时间。 22 | 23 | - 学会使用循环学习率变化的训练技巧,Circular Learning Rate通过循环改变学习率,从小到大,从大到小,不断循环,使模型更容易跳出局部最优,做出更多的尝试,该方法确实调高了模型训练的结果。 24 | 25 | - 学会使用SWA(stochastic weights averaging)模型融合方法,即将训练过程中的模型权重进行平均达到模型融合的目的,该方法的代价极小,仅需要保存另一份模型权重在内存或GPU显存中,在每个epoch后(或其他间隔)更新一次该权重,在训练结束时便可获得一个普通模型和一个SWA模型,该方法提高了5/7模型在初赛数据上的泛化能力,但并未提高任何模型对于5倍的复赛数据的泛化能力,这可能与模型训练不够充分有关。 26 | 27 | - 学习了一些模型融合方法,包括求多模型平均、投票、Stacking和Blending模型融合方法,其中Quora比赛中的一个stacking方案值得借鉴,他们将训练数据分成5份,三份训练,1份验证,1份测试,轮番5次,直到每份数据都参与1次验证和1次测试,这比传统的stacking更好的利用了数据。 28 | 29 | - 学习了语句对任务建模的两类基本模型,分别是向量表征模型和表征交互模型,向量表征模型利用孪生网络(Siamese Network)将两个语句编码成两个独立的向量,然后计算向量间的相似度,比如Siamese Net,DSSM;表征交互模型通过构造一个相关性交互矩阵,将两个语句的信息进行糅合处理,比如Decomposable Attention,ESIM。 30 | 31 | - 除了ESIM外,其他人用了两个新模型DRCN和DIIN (DRCN是SNLI排行榜最佳模型) 32 | 33 | - 本次比赛未尝试的方法: 34 | - 利用句子的拼音作为辅助输入,通过拼音embedding加强模型, 35 | - 将字、词和拼音等混入一个模型中,增强单个模型的能力 36 | - DRCN模型 37 | - **10折验证训练模型(大伙都在用)** 38 | 39 | 40 | > 其他两个队的开源代码 41 | > [复赛0.7368_红鲤鱼绿鲤鱼与驴](https://github.com/raven4752/huabei) 42 | > [复赛0.7352_World2vec](https://github.com/amxineohp/atec_2018_nlp) 43 | 44 | > 经验分享 45 | > [逼格learning](https://openclub.alipay.com/read.php?tid=9074&fid=96) 46 | 47 | > 语句对任务模型排行榜[SNLI项目](https://nlp.stanford.edu/projects/snli) 48 | 49 | -------------------------------------------------------------------------------- /utils/plot_model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.autograd import Variable 3 | import torch.nn as nn 4 | from graphviz import Digraph 5 | 6 | 7 | class CNN(nn.Module): 8 | def __init__(self): 9 | super(CNN, self).__init__() 10 | self.conv1 = nn.Sequential( 11 | nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2), 12 | nn.ReLU(), 13 | nn.MaxPool2d(kernel_size=2) 14 | ) 15 | self.conv2 = nn.Sequential( 16 | nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride=1, padding=2), 17 | nn.ReLU(), 18 | nn.MaxPool2d(kernel_size=2) 19 | ) 20 | self.out = nn.Linear(32*7*7, 10) 21 | 22 | def forward(self, x): 23 | x = self.conv1(x) 24 | x = self.conv2(x) 25 | x = x.view(x.size(0), -1) # (batch, 32*7*7) 26 | out = self.out(x) 27 | return out 28 | 29 | 30 | def make_dot(var, params=None): 31 | """ Produces Graphviz representation of PyTorch autograd graph 32 | Blue nodes are the Variables that require grad, orange are Tensors 33 | saved for backward in torch.autograd.Function 34 | Args: 35 | var: output Variable 36 | params: dict of (name, Variable) to add names to node that 37 | require grad (TODO: make optional) 38 | """ 39 | if params is not None: 40 | assert isinstance(params.values()[0], Variable) 41 | param_map = {id(v): k for k, v in params.items()} 42 | 43 | node_attr = dict(style='filled', 44 | shape='box', 45 | align='left', 46 | fontsize='12', 47 | ranksep='0.1', 48 | height='0.2') 49 | dot = Digraph(node_attr=node_attr, graph_attr=dict(size="12,12")) 50 | seen = set() 51 | 52 | def size_to_str(size): 53 | return '('+(', ').join(['%d' % v for v in size])+')' 54 | 55 | def add_nodes(var): 56 | if var not in seen: 57 | if torch.is_tensor(var): 58 | dot.node(str(id(var)), size_to_str(var.size()), fillcolor='orange') 59 | elif hasattr(var, 'variable'): 60 | u = var.variable 61 | name = param_map[id(u)] if params is not None else '' 62 | node_name = '%s\n %s' % (name, size_to_str(u.size())) 63 | 
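                # This branch handles leaf Variables (named parameters): they are drawn as
                # light-blue boxes labelled with the parameter name and its tensor shape.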
dot.node(str(id(var)), node_name, fillcolor='lightblue') 64 | else: 65 | dot.node(str(id(var)), str(type(var).__name__)) 66 | seen.add(var) 67 | if hasattr(var, 'next_functions'): 68 | for u in var.next_functions: 69 | if u[0] is not None: 70 | dot.edge(str(id(u[0])), str(id(var))) 71 | add_nodes(u[0]) 72 | if hasattr(var, 'saved_tensors'): 73 | for t in var.saved_tensors: 74 | dot.edge(str(id(t)), str(id(var))) 75 | add_nodes(t) 76 | add_nodes(var.grad_fn) 77 | return dot 78 | -------------------------------------------------------------------------------- /utils/extract_wiki.py: -------------------------------------------------------------------------------- 1 | import codecs 2 | import re 3 | 4 | import bz2file 5 | import jieba_fast as jieba 6 | from gensim.corpora.wikicorpus import extract_pages, filter_wiki 7 | # from gensim.corpora import WikiCorpus 8 | from tqdm import tqdm 9 | 10 | 11 | def get_wiki(): 12 | from opencc import OpenCC 13 | # 参考这篇博客注释 14 | # https://kexue.fm/archives/4176 15 | opencc1 = OpenCC("t2s") 16 | resub1 = re.compile(':*{\|[\s\S]*?\|}') 17 | resub2 = re.compile('[\s\S]*?') 18 | resub3 = re.compile('(.){{([^{}\n]*?\|[^{}\n]*?)}}') 19 | resub4 = re.compile('\* *\n|\'{2,}') 20 | resub5 = re.compile('\n+') 21 | resub6 = re.compile('\n[:;]|\n +') 22 | resub7 = re.compile('\n==') 23 | 24 | refind1 = re.compile('^[a-zA-Z]+:') 25 | refind2 = re.compile('^#') 26 | 27 | p1 = re.compile(r'-\{.*?(zh-hans|zh-cn):([^;]*?)(;.*?)?\}-') 28 | p2 = re.compile(r'[(\(][,;。?!\s]*[)\)]') 29 | p3 = re.compile(r'[「『]') 30 | p4 = re.compile(r'[」』]') 31 | 32 | def wiki_replace(s): 33 | s = filter_wiki(s) 34 | s = resub1.sub('', s) 35 | s = resub2.sub('', s) 36 | s = resub3.sub('\\1[[\\2]]', s) 37 | s = resub4.sub('', s) 38 | s = resub5.sub('\n', s) 39 | s = resub6.sub('\n', s) 40 | s = resub7.sub('\n\n==', s) 41 | s = p1.sub(r'\2', s) 42 | s = p2.sub(r'', s) 43 | s = p3.sub(r'“', s) 44 | s = p4.sub(r'”', s) 45 | return opencc1.convert(s).strip() 46 | 47 | wiki = extract_pages(bz2file.open('zhwiki-latest-pages-articles.xml.bz2')) 48 | 49 | # wiki=WikiCorpus('zhwiki-latest-pages-articles.xml.bz2',lemmatize=False,dictionary={}) 50 | 51 | with codecs.open('wiki.txt', 'w', encoding='utf-8') as f: 52 | i = 0 53 | filelist = [] 54 | for d in tqdm(wiki): 55 | 56 | print(d[0]) 57 | print(d[1]) 58 | 59 | i+=1 60 | 61 | if i == 5:break 62 | 63 | continue 64 | if not refind1.findall(d[0]) and d[0] and not refind2.findall(d[1]): 65 | filelist.append(d[0]+"\n"+d[1]) 66 | line = d[1] 67 | 68 | i += 1 69 | if i % 100 == 0: 70 | s = wiki_replace("\n\n".join(filelist)) 71 | f.write(s) 72 | filelist = [] 73 | 74 | def get_cut_std_wiki(): 75 | with open("cut_std_wiki.txt","w",encoding="utf8") as output: 76 | with open("std_wiki.txt","r",encoding="utf8") as file: 77 | for line in tqdm(file): 78 | output.write(" ".join(list(jieba.cut(line)))) 79 | 80 | def get_wiki2(): 81 | reobj1 = re.compile(r"[ `~!@#$%^&*\(\)-_=+\[\]\{\}\\\|;:\'\",<.>/?a-zA-Z\d]+") 82 | reobj2 = re.compile(r"\n+") 83 | reobj3 = re.compile("(())|(“”)|(「」)|(《》)|(“”)|(‘’)|(【】)|[,。?——!]{2,}") 84 | reuseful = re.compile('^[a-zA-Z]+:') 85 | redirect = re.compile(r"^#") 86 | def wiki_replace(s): 87 | s = filter_wiki(s) 88 | s = reobj1.sub("", s) # 为上传阿里云剔除竖线(|)符号 89 | s = reobj2.sub("#",s) 90 | s = reobj3.sub("",s) 91 | return s 92 | 93 | wiki = extract_pages(bz2file.open('zhwiki-latest-pages-articles.xml.bz2')) 94 | with codecs.open('wiki-tw.csv', 'w', encoding='utf-8') as f: 95 | i = 0 96 | filelist = [] 97 | for d in tqdm(wiki): 98 | if not 
reuseful.findall(d[0]) and not redirect.findall(d[1]): 99 | i+=1 100 | filelist.append(reobj1.sub("",d[0])+"|"+wiki_replace(d[1])+"\n") 101 | if i % 1000 == 0: 102 | s = ("".join(filelist)) 103 | f.write(s) 104 | filelist = [] 105 | if filelist: 106 | s = ("".join(filelist)) 107 | f.write(s) 108 | 109 | def wiki_error(): 110 | for no,line in enumerate(open("wiki_1.csv",'r', encoding="utf8")): 111 | pair = line.split("|") 112 | if len(pair)>2: 113 | print(no,pair[0],pair[1]) 114 | 115 | if __name__ == '__main__': 116 | # get_wiki2() # 繁体转简体 + 特殊符号处理 117 | wiki_error() -------------------------------------------------------------------------------- /utils/test_cv_stacking.py: -------------------------------------------------------------------------------- 1 | 2 | from datetime import datetime 3 | import numpy as np 4 | import matplotlib.pyplot as plt 5 | 6 | from sklearn import linear_model 7 | from sklearn import datasets 8 | from sklearn.svm import l1_min_c 9 | 10 | iris = datasets.load_iris() 11 | X = iris.data 12 | y = iris.target 13 | 14 | X = X[y != 2] 15 | y = y[y != 2] 16 | 17 | X -= np.mean(X, 0) 18 | cs = l1_min_c(X, y, loss='log') * np.logspace(0, 3) 19 | 20 | 21 | print("Computing regularization path ...") 22 | start = datetime.now() 23 | clf = linear_model.LogisticRegressionCV(penalty='l2', tol=1e-6) 24 | clf.fit(X, y) 25 | print("This took ", datetime.now() - start) 26 | 27 | 28 | # ============================================================= 29 | from sklearn import datasets 30 | 31 | iris = datasets.load_iris() 32 | X, y = iris.data[:, 1:3], iris.target 33 | 34 | from sklearn import model_selection 35 | from sklearn.linear_model import LogisticRegression 36 | from sklearn.neighbors import KNeighborsClassifier 37 | from sklearn.naive_bayes import GaussianNB 38 | from sklearn.ensemble import RandomForestClassifier 39 | from mlxtend.classifier import StackingClassifier 40 | from sklearn.model_selection import GridSearchCV 41 | import numpy as np 42 | 43 | clf1 = KNeighborsClassifier(n_neighbors=1) 44 | clf2 = RandomForestClassifier(random_state=1) 45 | clf3 = GaussianNB() 46 | lr = LogisticRegression() 47 | 48 | print('3-fold cross validation:\n') 49 | 50 | stack = 2 51 | if stack == 1: 52 | sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 53 | meta_classifier=lr) 54 | for clf, label in zip([clf1, clf2, clf3, sclf], 55 | ['KNN', 56 | 'Random Forest', 57 | 'Naive Bayes', 58 | 'StackingClassifier']): 59 | 60 | scores = model_selection.cross_val_score(clf, X, y, 61 | cv=3, scoring='accuracy') 62 | 63 | print("Accuracy: %0.2f (+/- %0.2f) [%s]" 64 | % (scores.mean(), scores.std(), label)) 65 | 66 | elif stack == 2: 67 | sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 68 | use_probas=True, 69 | average_probas=False, 70 | meta_classifier=lr) 71 | for clf, label in zip([clf1, clf2, clf3, sclf], 72 | ['KNN', 73 | 'Random Forest', 74 | 'Naive Bayes', 75 | 'StackingClassifier']): 76 | 77 | scores = model_selection.cross_val_score(clf, X, y, 78 | cv=3, scoring='accuracy') 79 | 80 | print("Accuracy: %0.2f (+/- %0.2f) [%s]" 81 | % (scores.mean(), scores.std(), label)) 82 | elif stack == 3: 83 | sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 84 | meta_classifier=lr) 85 | 86 | params = {'kneighborsclassifier__n_neighbors': [1, 5], 87 | 'randomforestclassifier__n_estimators': [10, 50], 88 | 'meta-logisticregression__C': [0.1, 10.0]} 89 | 90 | grid = GridSearchCV(estimator=sclf, 91 | param_grid=params, 92 | cv=5, 93 | refit=True) 94 | grid.fit(X, y) 95 | 96 | 
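    # Keys into grid.cv_results_ read out below: mean and std of the CV test
    # score plus the parameter combination for each grid-search candidate.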
cv_keys = ('mean_test_score', 'std_test_score', 'params') 97 | 98 | for r, _ in enumerate(grid.cv_results_['mean_test_score']): 99 | print("%0.3f +/- %0.2f %r" 100 | % (grid.cv_results_[cv_keys[0]][r], 101 | grid.cv_results_[cv_keys[1]][r] / 2.0, 102 | grid.cv_results_[cv_keys[2]][r])) 103 | 104 | print('Best parameters: %s' % grid.best_params_) 105 | print('Accuracy: %.2f' % grid.best_score_) 106 | 107 | 108 | import matplotlib.pyplot as plt 109 | from mlxtend.plotting import plot_decision_regions 110 | import matplotlib.gridspec as gridspec 111 | import itertools 112 | 113 | gs = gridspec.GridSpec(2, 2) 114 | 115 | fig = plt.figure(figsize=(10,8)) 116 | 117 | for clf, lab, grd in zip([clf1, clf2, clf3, sclf], 118 | ['KNN', 119 | 'Random Forest', 120 | 'Naive Bayes', 121 | 'StackingClassifier'], 122 | itertools.product([0, 1], repeat=2)): 123 | 124 | clf.fit(X, y) 125 | ax = plt.subplot(gs[grd[0], grd[1]]) 126 | fig = plot_decision_regions(X=X, y=y, clf=clf) 127 | plt.title(lab) 128 | plt.show() -------------------------------------------------------------------------------- /utils/train_embedding.py: -------------------------------------------------------------------------------- 1 | #/usr/bin/env python 2 | #coding=utf-8 3 | 4 | import os 5 | import re 6 | import multiprocessing 7 | 8 | import gensim 9 | from gensim.models.word2vec import LineSentence 10 | import jieba_fast as jieba 11 | import numpy as np 12 | import pandas as pd 13 | import fasttext 14 | 15 | 16 | os.environ["TF_CPP_MIN_LOG_LEVEL"]='3' 17 | model_dir = "pai_model/" 18 | 19 | new_words = "支付宝 付款码 二维码 收钱码 转账 退款 退钱 余额宝 运费险 还钱 还款 花呗 借呗 蚂蚁花呗 蚂蚁借呗 蚂蚁森林 小黄车 飞猪 微客 宝卡 芝麻信用 亲密付 淘票票 饿了么 摩拜 滴滴 滴滴出行".split(" ") 20 | for word in new_words: 21 | jieba.add_word(word) 22 | 23 | class MyChars(object): 24 | def __init__(self): 25 | pass 26 | 27 | def __iter__(self): 28 | with open(model_dir + "atec_nlp_sim_train.csv","r", encoding="utf8") as atec: 29 | for line in atec: 30 | lineno, s1, s2, label=line.strip().split("\t") 31 | yield list(s1) + list(s2) 32 | 33 | with open("resources/wiki_corpus/wiki.csv",'r',encoding="utf8") as wiki: 34 | for line in wiki: 35 | title, doc = line.strip().split("|") 36 | for sentense in doc.split("#"): 37 | if len(sentense)>0: 38 | yield [char for char in sentense if char and 0x4E00<= ord(char[0]) <= 0x9FA5] 39 | 40 | 41 | class MyWords(object): 42 | def __init__(self): 43 | pass 44 | 45 | def __iter__(self): 46 | with open(model_dir + "atec_nlp_sim_train.csv","r", encoding="utf8") as atec: 47 | for line in atec: 48 | lineno, s1, s2, label=line.strip().split("\t") 49 | yield list(jieba.cut(s1)) + list(jieba.cut(s2)) 50 | 51 | with open("resources/wiki_corpus/wiki.csv",'r',encoding="utf8") as wiki: 52 | for line in wiki: 53 | title, doc = line.strip().split("|") 54 | for sentense in doc.split("#"): 55 | if len(sentense)>0: 56 | yield [word for word in list(jieba.cut(sentense)) if word and 0x4E00<= ord(word[0]) <= 0x9FA5] 57 | 58 | 59 | def gen_data(): 60 | with open(model_dir + "train_char.txt","w",encoding="utf8") as file: 61 | mychars = MyChars() 62 | for cs in mychars: 63 | file.write(" ".join(cs)+"\n") 64 | 65 | with open(model_dir + "train_word.txt","w",encoding="utf8") as file: 66 | mywords = MyWords() 67 | for ws in mywords: 68 | file.write(" ".join(ws)+"\n") 69 | 70 | def train_embedding_gensim(): 71 | dim=256 72 | embedding_size = dim 73 | model = gensim.models.Word2Vec(LineSentence(model_dir + 'train_char.txt'), 74 | size=embedding_size, 75 | window=5, 76 | min_count=10, 77 | 
workers=multiprocessing.cpu_count()) 78 | 79 | model.save(model_dir + "char2vec_gensim"+str(embedding_size)) 80 | # model.wv.save_word2vec_format("model/char2vec_org"+str(embedding_size),"model/chars"+str(embedding_size),binary=False) 81 | 82 | dim=256 83 | embedding_size = dim 84 | model = gensim.models.Word2Vec(LineSentence(model_dir + 'train_word.txt'), 85 | size=embedding_size, 86 | window=5, 87 | min_count=10, 88 | workers=multiprocessing.cpu_count()) 89 | 90 | model.save(model_dir + "word2vec_gensim"+str(embedding_size)) 91 | # model.wv.save_word2vec_format("model/word2vec_org"+str(embedding_size),"model/vocabulary"+str(embedding_size),binary=False) 92 | 93 | 94 | def train_embedding_fasttext(): 95 | 96 | # Skipgram model 97 | model = fasttext.skipgram(model_dir + 'train_char.txt', model_dir + 'char2vec_fastskip256', word_ngrams=2, ws=5, min_count=10, dim=256) 98 | del(model) 99 | 100 | # CBOW model 101 | model = fasttext.cbow(model_dir + 'train_char.txt', model_dir + 'char2vec_fastcbow256', word_ngrams=2, ws=5, min_count=10, dim=256) 102 | del(model) 103 | 104 | # Skipgram model 105 | model = fasttext.skipgram(model_dir + 'train_word.txt', model_dir + 'word2vec_fastskip256', word_ngrams=2, ws=5, min_count=10, dim=256) 106 | del(model) 107 | 108 | # CBOW model 109 | model = fasttext.cbow(model_dir + 'train_word.txt', model_dir + 'word2vec_fastcbow256', word_ngrams=2, ws=5, min_count=10, dim=256) 110 | del(model) 111 | 112 | # gen_data() 113 | train_embedding_gensim() 114 | # train_embedding_word() -------------------------------------------------------------------------------- /pai_old.py: -------------------------------------------------------------------------------- 1 | #/usr/bin/env python 2 | #coding=utf-8 3 | #=================================================================================== 4 | # 传统方法 5 | #=================================================================================== 6 | import numpy as np 7 | import pandas as pd 8 | import re 9 | import math 10 | import time 11 | from sklearn.feature_extraction.text import TfidfVectorizer 12 | from sklearn.linear_model import LogisticRegression, LogisticRegressionCV 13 | from sklearn.metrics import f1_score 14 | from sklearn.model_selection import train_test_split, KFold 15 | import gensim 16 | try: 17 | import jieba_fast as jieba 18 | except Exception as e: 19 | import jieba 20 | 21 | try: 22 | print(model_dir) 23 | test_size = 0.025 24 | online=True 25 | except: 26 | model_dir = "pai_model/" 27 | test_size = 0.05 28 | online=False 29 | 30 | new_words = "支付宝 付款码 二维码 收钱码 转账 退款 退钱 余额宝 运费险 还钱 还款 花呗 借呗 蚂蚁花呗 蚂蚁借呗 蚂蚁森林 小黄车 飞猪 微客 宝卡 芝麻信用 亲密付 淘票票 饿了么 摩拜 滴滴 滴滴出行".split(" ") 31 | for word in new_words: 32 | jieba.add_word(word) 33 | 34 | star = re.compile("\*+") 35 | if False: 36 | stops = 
["、","。","〈","〉","《","》","一","一切","一则","一方面","一旦","一来","一样","一般","七","万一","三","上下","不仅","不但","不光","不单","不只","不如","不怕","不惟","不成","不拘","不比","不然","不特","不独","不管","不论","不过","不问","与","与其","与否","与此同时","且","两者","个","临","为","为了","为什么","为何","为着","乃","乃至","么","之","之一","之所以","之类","乌乎","乎","乘","九","也","也好","也罢","了","二","于","于是","于是乎","云云","五","人家","什么","什么样","从","从而","他","他人","他们","以","以便","以免","以及","以至","以至于","以致","们","任","任何","任凭","似的","但","但是","何","何况","何处","何时","作为","你","你们","使得","例如","依","依照","俺","俺们","倘","倘使","倘或","倘然","倘若","借","假使","假如","假若","像","八","六","兮","关于","其","其一","其中","其二","其他","其余","其它","其次","具体地说","具体说来","再者","再说","冒","冲","况且","几","几时","凭","凭借","则","别","别的","别说","到","前后","前者","加之","即","即令","即使","即便","即或","即若","又","及","及其","及至","反之","反过来","反过来说","另","另一方面","另外","只是","只有","只要","只限","叫","叮咚","可","可以","可是","可见","各","各个","各位","各种","各自","同","同时","向","向着","吓","吗","否则","吧","吧哒","吱","呀","呃","呕","呗","呜","呜呼","呢","呵","呸","呼哧","咋","和","咚","咦","咱","咱们","咳","哇","哈","哈哈","哉","哎","哎呀","哎哟","哗","哟","哦","哩","哪","哪个","哪些","哪儿","哪天","哪年","哪怕","哪样","哪边","哪里","哼","哼唷","唉","啊","啐","啥","啦","啪达","喂","喏","喔唷","嗡嗡","嗬","嗯","嗳","嘎","嘎登","嘘","嘛","嘻","嘿","四","因","因为","因此","因而","固然","在","在下","地","多","多少","她","她们","如","如上所述","如何","如其","如果","如此","如若","宁","宁可","宁愿","宁肯","它","它们","对","对于","将","尔后","尚且","就","就是","就是说","尽","尽管","岂但","己","并","并且","开外","开始","归","当","当着","彼","彼此","往","待","得","怎","怎么","怎么办","怎么样","怎样","总之","总的来看","总的来说","总的说来","总而言之","恰恰相反","您","慢说","我","我们","或","或是","或者","所","所以","打","把","抑或","拿","按","按照","换句话说","换言之","据","接着","故","故此","旁人","无宁","无论","既","既是","既然","时候","是","是的","替","有","有些","有关","有的","望","朝","朝着","本","本着","来","来着","极了","果然","果真","某","某个","某些","根据","正如","此","此外","此间","毋宁","每","每当","比","比如","比方","沿","沿着","漫说","焉","然则","然后","然而","照","照着","甚么","甚而","甚至","用","由","由于","由此可见","的","的话","相对而言","省得","着","着呢","矣","离","第","等","等等","管","紧接着","纵","纵令","纵使","纵然","经","经过","结果","给","继而","综上所述","罢了","者","而","而且","而况","而外","而已","而是","而言","能","腾","自","自个儿","自从","自各儿","自家","自己","自身","至","至于","若","若是","若非","莫若","虽","虽则","虽然","虽说","被","要","要不","要不是","要不然","要么","要是","让","论","设使","设若","该","诸位","谁","谁知","赶","起","起见","趁","趁着","越是","跟","较","较之","边","过","还是","还有","这","这个","这么","这么些","这么样","这么点儿","这些","这会儿","这儿","这就是说","这时","这样","这边","这里","进而","连","连同","通过","遵照","那","那个","那么","那么些","那么样","那些","那会儿","那儿","那时","那样","那边","那里","鄙人","鉴于","阿","除","除了","除此之外","除非","随","随着","零","非但","非徒","靠","顺","顺着","首先","︿","!","#","$","%","&","(",")","*","+",",","0","1","2","3","4","5","6","7","8","9",":",";","<",">","?","@","[","]","{","|","}","~","¥"] 37 | stops = set(stops) 38 | else: 39 | stops = set() 40 | 41 | train_file = model_dir+"atec_nlp_sim_train.csv" 42 | df1 = pd.read_csv(train_file,sep="\t", header=None, names =["id","sent1","sent2","label"], encoding="utf8") 43 | # if len(df1) >= 102477: df1 = df1[:1000] 44 | 45 | # 文本清理,预处理(分词) 46 | clean_path = model_dir+"atec_clean.csv" 47 | def pre_process(df, train_mode=True): 48 | x = lambda s: list(jieba.cut(star.sub("X",s))) 49 | df["words1"] = df["sent1"].apply(x) 50 | df["words2"] = df["sent2"].apply(x) 51 | if train_mode: df.to_csv(clean_path, sep="\t", index=False, encoding="utf8") 52 | return df 53 | 54 | # 特征提取 55 | feature_path = model_dir+"atec_feature.pkl" 56 | feature_cfg = ["Not", "Length", "WordMatchShare", "TFIDFWordMatchShare", 57 | # "PowerfulWordDoubleSide", "PowerfulWordDoubleSideRate", "PowerfulWordOneSide", "PowerfulWordOneSideRate", 58 | "TFIDF", "NgramJaccardCoef", "NgramDiceDistance", "NgramDistance", "WordEmbeddingAveDis", "WordEmbeddingTFIDFAveDis"] 59 | def 
feature_extract(df, train_mode=True): 60 | 61 | if "Not" in feature_cfg: 62 | def extract_row(row): 63 | not_cnt1 = row["words1"].count('不') 64 | not_cnt2 = row["words2"].count('不') 65 | 66 | fs = [] 67 | fs.append(not_cnt1) 68 | fs.append(not_cnt2) 69 | if not_cnt1 > 0 and not_cnt2 > 0: 70 | fs.append(1.) 71 | else: 72 | fs.append(0.) 73 | if (not_cnt1 > 0) or (not_cnt2 > 0): 74 | fs.append(1.) 75 | else: 76 | fs.append(0.) 77 | if not_cnt2 <= 0 < not_cnt1 or not_cnt1 <= 0 < not_cnt2: 78 | fs.append(1.) 79 | else: 80 | fs.append(0.) 81 | 82 | return fs 83 | 84 | df["Not"] = df.apply(extract_row, axis=1) 85 | print("done Not") 86 | 87 | if "Length" in feature_cfg: 88 | def extract_row(row): 89 | len_q1, len_q2 = len(row["sent1"]), len(row["sent2"]) 90 | return [len_q1, 91 | len_q2, 92 | len(row["words1"]), 93 | len(row["words2"]), 94 | abs(len_q1 - len_q2), 95 | 1.0 * min(len_q1, len_q2) / max(len_q1, len_q2)] 96 | 97 | df["Length"] = df.apply(extract_row, axis=1) 98 | print("done Length") 99 | 100 | if "WordMatchShare" in feature_cfg: 101 | def extract_row(row): 102 | q1words = {} 103 | q2words = {} 104 | for word in row["words1"]: 105 | if word not in stops: 106 | q1words[word] = q1words.get(word, 0) + 1 107 | for word in row["words2"]: 108 | if word not in stops: 109 | q2words[word] = q2words.get(word, 0) + 1 110 | n_shared_word_in_q1 = sum([q1words[w] for w in q1words if w in q2words]) 111 | n_shared_word_in_q2 = sum([q2words[w] for w in q2words if w in q1words]) 112 | n_tol = sum(q1words.values()) + sum(q2words.values()) 113 | if 1e-6 > n_tol: 114 | return [0.] 115 | else: 116 | return [1.0 * (n_shared_word_in_q1 + n_shared_word_in_q2) / n_tol] 117 | 118 | df["WordMatchShare"] = df.apply(extract_row, axis=1) 119 | print("done WordMatchShare") 120 | 121 | if "TFIDFWordMatchShare" in feature_cfg: 122 | idf_path = model_dir + "idf_weights.pkl" 123 | def init_idf(): # init idf weights 124 | idf = {} 125 | q_set = set() 126 | for index, row in df.iterrows(): 127 | q1 = str(row['sent1']) 128 | q2 = str(row['sent2']) 129 | if q1 not in q_set: 130 | q_set.add(q1) 131 | for word in row["words1"]: 132 | idf[word] = idf.get(word, 0) + 1 133 | if q2 not in q_set: 134 | q_set.add(q2) 135 | for word in row["words2"]: 136 | idf[word] = idf.get(word, 0) + 1 137 | num_docs = len(df) 138 | for word in idf: 139 | idf[word] = math.log(num_docs / (idf[word] + 1.)) / math.log(2.) 140 | print("idf calculation done, len(idf)=%d" % len(idf)) 141 | pd.to_pickle(idf, idf_path) 142 | return idf 143 | 144 | if train_mode: idf = init_idf() 145 | else: idf = pd.read_pickle(idf_path) 146 | 147 | def extract_row(row): 148 | q1words = {} 149 | q2words = {} 150 | for word in row["words1"]: 151 | q1words[word] = q1words.get(word, 0) + 1 152 | for word in row["words2"]: 153 | q2words[word] = q2words.get(word, 0) + 1 154 | sum_shared_word_in_q1 = sum([q1words[w] * idf.get(w, 0) for w in q1words if w in q2words]) 155 | sum_shared_word_in_q2 = sum([q2words[w] * idf.get(w, 0) for w in q2words if w in q1words]) 156 | sum_tol = sum(q1words[w] * idf.get(w, 0) for w in q1words) + sum( 157 | q2words[w] * idf.get(w, 0) for w in q2words) 158 | if 1e-6 > sum_tol: 159 | return [0.] 160 | else: 161 | return [1.0 * (sum_shared_word_in_q1 + sum_shared_word_in_q2) / sum_tol] 162 | 163 | df["TFIDFWordMatchShare"] = df.apply(extract_row, axis=1) 164 | print("done TFIDFWordMatchShare") 165 | 166 | powerful_words_path = model_dir + "powerful_words.pkl" 167 | def generate_powerful_word(): 168 | """ 169 | 计算数据中词语的影响力,格式如下: 170 | 词语 --> [0. 
出现语句对数量,1. 出现语句对比例,2. 正确语句对比例,3. 单侧语句对比例,4. 单侧语句对正确比例,5. 双侧语句对比例,6. 双侧语句对正确比例] 171 | """ 172 | words_power = {} 173 | for index, row in df.iterrows(): 174 | label = int(row['label']) 175 | q1_words = row['words1'] 176 | q2_words = row['words2'] 177 | all_words = set(q1_words + q2_words) 178 | q1_words = set(q1_words) 179 | q2_words = set(q2_words) 180 | for word in all_words: 181 | if word not in words_power: 182 | words_power[word] = [0. for i in range(7)] 183 | # 计算出现语句对数量 184 | words_power[word][0] += 1. 185 | words_power[word][1] += 1. 186 | 187 | if ((word in q1_words) and (word not in q2_words)) or ((word not in q1_words) and (word in q2_words)): 188 | # 计算单侧语句数量 189 | words_power[word][3] += 1. 190 | if 0 == label: 191 | # 计算正确语句对数量 192 | words_power[word][2] += 1. 193 | # 计算单侧语句正确比例 194 | words_power[word][4] += 1. 195 | if (word in q1_words) and (word in q2_words): 196 | # 计算双侧语句数量 197 | words_power[word][5] += 1. 198 | if 1 == label: 199 | # 计算正确语句对数量 200 | words_power[word][2] += 1. 201 | # 计算双侧语句正确比例 202 | words_power[word][6] += 1. 203 | for word in words_power: 204 | # 计算出现语句对比例 205 | words_power[word][1] /= len(df) 206 | # 计算正确语句对比例 207 | words_power[word][2] /= words_power[word][0] 208 | # 计算单侧语句对正确比例 209 | if words_power[word][3] > 1e-6: 210 | words_power[word][4] /= words_power[word][3] 211 | # 计算单侧语句对比例 212 | words_power[word][3] /= words_power[word][0] 213 | # 计算双侧语句对正确比例 214 | if words_power[word][5] > 1e-6: 215 | words_power[word][6] /= words_power[word][5] 216 | # 计算双侧语句对比例 217 | words_power[word][5] /= words_power[word][0] 218 | sorted_words_power = sorted(words_power.items(), key=lambda d: d[1][0], reverse=True) 219 | print("power words calculation done, len(words_power)=%d" % len(sorted_words_power)) 220 | pd.to_pickle(sorted_words_power, powerful_words_path) 221 | return sorted_words_power 222 | 223 | if train_mode: pword = generate_powerful_word() 224 | else: pword = pd.load_pickle(powerful_words_path) 225 | 226 | # thresh_num, thresh_rate = 500, 0.9 227 | thresh_num, thresh_rate = 7, 0.3 228 | 229 | pword_filtered = filter(lambda x: x[1][0] * x[1][5] >= thresh_num, pword) 230 | pword_sort = sorted(pword_filtered, key=lambda d: d[1][6], reverse=True) 231 | pword_dside = set(map(lambda x: x[0], filter(lambda x: x[1][6] >= thresh_rate, pword_sort))) 232 | print('Double side power words(%d): %s' % (len(pword_dside), str(pword_dside))) 233 | 234 | def extract_row(row): 235 | tags = [] 236 | q1_words = row["words1"] 237 | q2_words = row["words2"] 238 | for word in pword_dside: 239 | if (word in q1_words) and (word in q2_words): 240 | tags.append(1.0) 241 | else: 242 | tags.append(0.0) 243 | return tags 244 | 245 | if "PowerfulWordDoubleSide" in feature_cfg: 246 | df["PowerfulWordDoubleSide"] = df.apply(extract_row, axis=1) 247 | print("done PowerfulWordDoubleSide") 248 | 249 | pword_dict = dict(pword) 250 | def extract_row(row): 251 | num_least = 300 252 | rate = [1.0] 253 | q1_words = set(row["words1"]) 254 | q2_words = set(row["words2"]) 255 | share_words = list(q1_words.intersection(q2_words)) 256 | for word in share_words: 257 | if word not in pword_dict: 258 | continue 259 | if pword_dict[word][0] * pword_dict[word][5] < num_least: 260 | continue 261 | rate[0] *= (1.0 - pword_dict[word][6]) 262 | rate = [1 - num for num in rate] 263 | return rate 264 | 265 | if "PowerfulWordDoubleSideRate" in feature_cfg: 266 | df["PowerfulWordDoubleSideRate"] = df.apply(extract_row, axis=1) 267 | print("done PowerfulWordDoubleSideRate") 268 | 269 | 270 | thresh_num, thresh_rate 
= 20, 0.8 271 | 272 | pword_filtered = filter(lambda x: x[1][0] * x[1][3] >= thresh_num, pword) 273 | pword_oside = set(map(lambda x: x[0], filter(lambda x: x[1][4] >= thresh_rate, pword_filtered))) 274 | print('One side power words(%d): %s' % (len(pword_oside), str(pword_oside))) 275 | def extract_row(row): 276 | tags = [] 277 | q1_words = set(row["words1"]) 278 | q2_words = set(row["words2"]) 279 | for word in pword_oside: 280 | if (word in q1_words) and (word not in q2_words): 281 | tags.append(1.0) 282 | elif (word not in q1_words) and (word in q2_words): 283 | tags.append(1.0) 284 | else: 285 | tags.append(0.0) 286 | return tags 287 | 288 | if "PowerfulWordOneSide" in feature_cfg: 289 | df["PowerfulWordOneSide"] = df.apply(extract_row, axis=1) 290 | print("done PowerfulWordOneSide") 291 | 292 | def extract_row(row): 293 | num_least = 300 294 | rate = [1.0] 295 | q1_words = set(row["words1"]) 296 | q2_words = set(row["words2"]) 297 | q1_diff = list(q1_words.difference(q2_words)) 298 | q2_diff = list(q2_words.difference(q1_words)) 299 | all_diff = set(q1_diff + q2_diff) 300 | for word in all_diff: 301 | if word not in pword_dict: 302 | continue 303 | if pword_dict[word][0] * pword_dict[word][3] < num_least: 304 | continue 305 | rate[0] *= (1.0 - pword_dict[word][4]) 306 | rate = [1 - num for num in rate] 307 | return rate 308 | 309 | if "PowerfulWordOneSideRate" in feature_cfg: 310 | df["PowerfulWordOneSideRate"] = df.apply(extract_row, axis=1) 311 | print("done PowerfulWordOneSideRate") 312 | 313 | if "TFIDF" in feature_cfg: 314 | tfidf_path = model_dir + "tfidf_transformer.pkl" 315 | def init_tfidf(): 316 | tfidf = TfidfVectorizer(stop_words=list(stops), ngram_range=(1, 1), token_pattern=r"\w+") 317 | tfidf_txt = pd.Series(df['words1'].apply(lambda x: " ".join(x)).tolist() + 318 | df['words2'].apply(lambda x: " ".join(x)).tolist()) 319 | tfidf.fit_transform(tfidf_txt) 320 | print("init tfidf done ") 321 | # print(tfidf.vocabulary_) 322 | pd.to_pickle(tfidf, tfidf_path) 323 | return tfidf 324 | 325 | if train_mode: tfidf = init_tfidf() 326 | else: tfidf = pd.read_pickle(tfidf_path) 327 | 328 | def extract_row(row): 329 | q1 = " ".join(row['words1']) 330 | q2 = " ".join(row['words2']) 331 | a1 = tfidf.transform([q1]).data 332 | a2 = tfidf.transform([q2]).data 333 | fs = [np.sum(a1),np.sum(a2),np.mean(a1),np.mean(a2),len(a1),len(a2)] 334 | return fs 335 | 336 | df["TFIDF"] = df.apply(extract_row, axis=1) 337 | print("done TFIDF") 338 | 339 | if "NgramJaccardCoef" in feature_cfg: 340 | def extract_row(row): 341 | q1_words = row['words1'] 342 | q2_words = row['words2'] 343 | fs = list() 344 | for n in range(1, 4): 345 | q1_ngrams = NgramUtil.ngrams(q1_words, n) 346 | q2_ngrams = NgramUtil.ngrams(q2_words, n) 347 | A = set(q1_ngrams) 348 | B = set(q2_ngrams) 349 | x = len(A.intersection(B)) 350 | y = len(A.union(B)) 351 | val = 0.0 if y==0 else x/y 352 | fs.append(val) 353 | return fs 354 | 355 | df["NgramJaccardCoef"] = df.apply(extract_row, axis=1) 356 | print("done NgramJaccardCoef") 357 | 358 | if "NgramDiceDistance" in feature_cfg: 359 | def extract_row(row): 360 | q1_words = row['words1'] 361 | q2_words = row['words2'] 362 | fs = list() 363 | for n in range(1, 4): 364 | q1_ngrams = NgramUtil.ngrams(q1_words, n) 365 | q2_ngrams = NgramUtil.ngrams(q2_words, n) 366 | A = set(q1_ngrams) 367 | B = set(q2_ngrams) 368 | x = 2. 
* len(A.intersection(B)) 369 | y = len(A) + len(B) 370 | val = 0.0 if y==0 else x/y 371 | fs.append(val) 372 | return fs 373 | 374 | df["NgramDiceDistance"] = df.apply(extract_row, axis=1) 375 | print("done NgramDiceDistance") 376 | 377 | if "NgramDistance" in feature_cfg: 378 | def extract_row(row): 379 | q1_words = row['words1'] 380 | q2_words = row['words2'] 381 | fs = list() 382 | aggregation_modes_outer = [np.mean,np.max,np.min,np.median] 383 | aggregation_modes_inner = [np.mean,np.std,np.max,np.min,np.median] 384 | for n_ngram in range(1, 4): 385 | q1_ngrams = NgramUtil.ngrams(q1_words, n_ngram) 386 | q2_ngrams = NgramUtil.ngrams(q2_words, n_ngram) 387 | val_list = list() 388 | for w1 in q1_ngrams: 389 | _val_list = list() 390 | for w2 in q2_ngrams: 391 | s = 1. - SequenceMatcher(None, w1, w2, False).quick_ratio() # ratio() 392 | _val_list.append(s) 393 | if len(_val_list) == 0: 394 | _val_list = [MISSING_VALUE_NUMERIC] 395 | val_list.append(_val_list) 396 | if len(val_list) == 0: 397 | val_list = [[MISSING_VALUE_NUMERIC]] 398 | data = np.array(val_list) 399 | fs.extend([mode_outer(mode_inner(data,axis=1)) for mode_inner in aggregation_modes_inner for mode_outer in aggregation_modes_outer]) 400 | return fs 401 | 402 | df["NgramDistance"] = df.apply(extract_row, axis=1) 403 | print("done NgramDistance") 404 | 405 | we_len = 300 if online else 256 406 | word_embedding_model = gensim.models.Word2Vec.load(model_dir + "word2vec_gensim%s"%we_len) 407 | word2index = {v:k for k,v in enumerate(word_embedding_model.wv.index2word)} 408 | if "WordEmbeddingAveDis" in feature_cfg: 409 | def extract_row(row): 410 | q1_words = row['words1'] 411 | q2_words = row['words2'] 412 | 413 | q1_vec = np.array(we_len * [0.]) 414 | q2_vec = np.array(we_len * [0.]) 415 | 416 | for word in q1_words: 417 | if word in word2index: 418 | q1_vec += word_embedding_model[word] 419 | for word in q2_words: 420 | if word in word2index: 421 | q2_vec += word_embedding_model[word] 422 | 423 | cos_sim = 0. 424 | q1_vec = np.mat(q1_vec) 425 | q2_vec = np.mat(q2_vec) 426 | factor = np.linalg.norm(q1_vec) * np.linalg.norm(q2_vec) 427 | if 1e-6 < factor: 428 | cos_sim = float(q1_vec * q2_vec.T) / factor 429 | 430 | return [cos_sim] 431 | 432 | df["WordEmbeddingAveDis"] = df.apply(extract_row, axis=1) 433 | 434 | if "WordEmbeddingTFIDFAveDis" in feature_cfg: 435 | idf = pd.read_pickle(idf_path) 436 | def extract_row(row): 437 | q1_words = row['words1'] 438 | q2_words = row['words2'] 439 | 440 | q1_vec = np.array(we_len * [0.]) 441 | q2_vec = np.array(we_len * [0.]) 442 | q1_words_cnt = {} 443 | q2_words_cnt = {} 444 | for word in q1_words: 445 | q1_words_cnt[word] = q1_words_cnt.get(word, 0.) + 1. 446 | for word in q2_words: 447 | q2_words_cnt[word] = q2_words_cnt.get(word, 0.) + 1. 448 | 449 | for word in q1_words_cnt: 450 | if word in word2index: 451 | q1_vec += idf.get(word, 0.) * q1_words_cnt[word] * word_embedding_model[word] 452 | for word in q2_words_cnt: 453 | if word in word2index: 454 | q2_vec += idf.get(word, 0.) * q2_words_cnt[word] * word_embedding_model[word] 455 | 456 | cos_sim = 0. 
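            # Cosine similarity between the two idf-weighted bag-of-embedding vectors;
            # `factor` guards against zero-norm vectors (e.g. all words out of vocabulary),
            # in which case cos_sim is left at 0.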
457 | q1_vec = np.mat(q1_vec) 458 | q2_vec = np.mat(q2_vec) 459 | factor = np.linalg.norm(q1_vec) * np.linalg.norm(q2_vec) 460 | if 1e-6 < factor: 461 | cos_sim = float(q1_vec * q2_vec.T) / factor 462 | 463 | return [cos_sim] 464 | 465 | df["WordEmbeddingTFIDFAveDis"] = df.apply(extract_row, axis=1) 466 | 467 | 468 | def merge_feature(row): 469 | fs = [] 470 | for feature in feature_cfg: 471 | fs += row[feature] 472 | return fs 473 | 474 | df["feature"] = df.apply(merge_feature, axis=1) 475 | x, y = np.array(df["feature"].tolist()), np.array(df["label"].astype(int)) 476 | if train_mode: pd.to_pickle((x,y),feature_path) 477 | return (x,y) 478 | 479 | 480 | from difflib import SequenceMatcher 481 | MISSING_VALUE_NUMERIC = -1 482 | 483 | class NgramUtil(object): 484 | 485 | def __init__(self): 486 | pass 487 | 488 | @staticmethod 489 | def unigrams(words): 490 | """ 491 | Input: a list of words, e.g., ["I", "am", "Denny"] 492 | Output: a list of unigram 493 | """ 494 | assert type(words) == list 495 | return words 496 | 497 | @staticmethod 498 | def bigrams(words, join_string, skip=0): 499 | """ 500 | Input: a list of words, e.g., ["I", "am", "Denny"] 501 | Output: a list of bigram, e.g., ["I_am", "am_Denny"] 502 | """ 503 | assert type(words) == list 504 | L = len(words) 505 | if L > 1: 506 | lst = [] 507 | for i in range(L - 1): 508 | for k in range(1, skip + 2): 509 | if i + k < L: 510 | lst.append(join_string.join([words[i], words[i + k]])) 511 | else: 512 | # set it as unigram 513 | lst = NgramUtil.unigrams(words) 514 | return lst 515 | 516 | @staticmethod 517 | def trigrams(words, join_string, skip=0): 518 | """ 519 | Input: a list of words, e.g., ["I", "am", "Denny"] 520 | Output: a list of trigram, e.g., ["I_am_Denny"] 521 | """ 522 | assert type(words) == list 523 | L = len(words) 524 | if L > 2: 525 | lst = [] 526 | for i in range(L - 2): 527 | for k1 in range(1, skip + 2): 528 | for k2 in range(1, skip + 2): 529 | if i + k1 < L and i + k1 + k2 < L: 530 | lst.append(join_string.join([words[i], words[i + k1], words[i + k1 + k2]])) 531 | else: 532 | # set it as bigram 533 | lst = NgramUtil.bigrams(words, join_string, skip) 534 | return lst 535 | 536 | @staticmethod 537 | def fourgrams(words, join_string): 538 | """ 539 | Input: a list of words, e.g., ["I", "am", "Denny", "boy"] 540 | Output: a list of trigram, e.g., ["I_am_Denny_boy"] 541 | """ 542 | assert type(words) == list 543 | L = len(words) 544 | if L > 3: 545 | lst = [] 546 | for i in xrange(L - 3): 547 | lst.append(join_string.join([words[i], words[i + 1], words[i + 2], words[i + 3]])) 548 | else: 549 | # set it as trigram 550 | lst = NgramUtil.trigrams(words, join_string) 551 | return lst 552 | 553 | @staticmethod 554 | def ngrams(words, ngram, join_string=" "): 555 | """ 556 | wrapper for ngram 557 | """ 558 | if ngram == 1: 559 | return NgramUtil.unigrams(words) 560 | elif ngram == 2: 561 | return NgramUtil.bigrams(words, join_string) 562 | elif ngram == 3: 563 | return NgramUtil.trigrams(words, join_string) 564 | elif ngram == 4: 565 | return NgramUtil.fourgrams(words, join_string) 566 | elif ngram == 12: 567 | unigram = NgramUtil.unigrams(words) 568 | bigram = [x for x in NgramUtil.bigrams(words, join_string) if len(x.split(join_string)) == 2] 569 | return unigram + bigram 570 | elif ngram == 123: 571 | unigram = NgramUtil.unigrams(words) 572 | bigram = [x for x in NgramUtil.bigrams(words, join_string) if len(x.split(join_string)) == 2] 573 | trigram = [x for x in NgramUtil.trigrams(words, join_string) if 
len(x.split(join_string)) == 3] 574 | return unigram + bigram + trigram 575 | 576 | 577 | def r_f1_thresh(y_pred,y_true,step=1000): 578 | e = np.zeros((len(y_true),2)) 579 | e[:,0] = y_pred.reshape(-1) 580 | e[:,1] = y_true 581 | f = pd.DataFrame(e) 582 | thrs = np.linspace(0,1,step+1) 583 | x = np.array([f1_score(y_pred=f.loc[:,0]>thr, y_true=f.loc[:,1]) for thr in thrs]) 584 | f1_, thresh = max(x),thrs[x.argmax()] 585 | return f.corr()[0][1], f1_, thresh 586 | 587 | random_state = 42 588 | def train_old_classifier(data=None,train_mode=True): 589 | if data is None: 590 | x,y = pd.read_pickle(feature_path) 591 | else:x,y = data 592 | 593 | trn_x, val_x, trn_y, val_y = train_test_split(x,y, test_size=test_size, random_state=random_state) 594 | 595 | 596 | classifier = ["lrcv","lgbm"][1] 597 | if classifier == "lgbm": 598 | print("lightgbm") 599 | params = { 600 | 'task': 'train', 601 | 'boosting_type': 'gbdt', 602 | 'objective': 'binary', 603 | 'metric': {'l2', 'auc'}, 604 | 'num_leaves': 31, 605 | 'learning_rate': 0.05, 606 | 'feature_fraction': 0.9, 607 | 'bagging_fraction': 0.8, 608 | 'bagging_freq': 5 609 | } 610 | 611 | import lightgbm as lgb 612 | lgb_train = lgb.Dataset(trn_x, trn_y) 613 | lgb_eval = lgb.Dataset(val_x, val_y, reference=lgb_train) 614 | model_gbm = lgb.train(params, lgb_train, num_boost_round=200, 615 | valid_sets=lgb_eval, early_stopping_rounds=10) 616 | 617 | val_y_pred = model_gbm.predict(val_x, num_iteration=model_gbm.best_iteration) 618 | print(r_f1_thresh(val_y_pred, val_y)) 619 | 620 | classifier = ["lrcv","lgbm"][0] 621 | if classifier == "lrcv": 622 | print("LogisticRegression") 623 | clf = LogisticRegression() 624 | clf.fit(trn_x, trn_y) 625 | val_y_pred = clf.predict_proba(val_x)[:,1] 626 | print(r_f1_thresh(val_y_pred, val_y)) 627 | 628 | df = None 629 | df = pre_process(df1) 630 | data = feature_extract(df) 631 | train_old_classifier(data) -------------------------------------------------------------------------------- /pai_train.py: -------------------------------------------------------------------------------- 1 | #/usr/bin/env python 2 | #coding=utf-8 3 | 4 | indexes = [] 5 | 6 | import time 7 | start_time = time.time() 8 | import multiprocessing 9 | import os 10 | import re 11 | import json 12 | import gensim 13 | import jieba 14 | import keras 15 | import keras.backend as K 16 | import numpy as np 17 | import pandas as pd 18 | from itertools import combinations 19 | from keras.activations import softmax 20 | from keras.callbacks import EarlyStopping, ModelCheckpoint,LambdaCallback, Callback, ReduceLROnPlateau, LearningRateScheduler 21 | from keras.layers import * 22 | from keras.models import Model 23 | from keras.optimizers import SGD, Adadelta, Adam, Nadam, RMSprop 24 | from keras.regularizers import L1L2, l2 25 | from keras.preprocessing.sequence import pad_sequences 26 | from keras.engine.topology import Layer 27 | from keras import initializers, regularizers, constraints 28 | 29 | from sklearn.linear_model import LogisticRegression, LogisticRegressionCV 30 | from sklearn.metrics import f1_score 31 | from sklearn.model_selection import train_test_split, KFold 32 | from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier 33 | 34 | from gensim.models.word2vec import LineSentence 35 | from gensim.models.fasttext import FastText 36 | import copy 37 | 38 | os.environ["TF_CPP_MIN_LOG_LEVEL"]='3' 39 | 40 | ##################################################################### 41 | # 数据加载预处理阶段 42 | 
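# NOTE: `model_dir` used throughout this file (train_file, cached pickles, embedding
# paths) is assumed to be injected by the PAI runtime; pai_old.py falls back to
# model_dir = "pai_model/" when it is not already defined.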
##################################################################### 43 | new_words = "支付宝 付款码 二维码 收钱码 转账 退款 退钱 余额宝 运费险 还钱 还款 花呗 借呗 蚂蚁花呗 蚂蚁借呗 蚂蚁森林 小黄车 飞猪 微客 宝卡 芝麻信用 亲密付 淘票票 饿了么 摩拜 滴滴 滴滴出行".split(" ") 44 | for word in new_words: 45 | jieba.add_word(word) 46 | 47 | star = re.compile("\*+") 48 | 49 | test_size = 0.025 50 | random_state = 42 51 | fast_mode, fast_rate = False,0.01 # 快速调试,其评分不作为参考 52 | train_file = model_dir+"atec_nlp_sim_train.csv" 53 | def load_data(dtype = "both", input_length=[20,24], w2v_length=300): 54 | 55 | def __load_data(dtype = "word", input_length=20, w2v_length=300): 56 | 57 | filename = model_dir+"%s_%d_%d"%(dtype, input_length, w2v_length) 58 | if os.path.exists(filename): 59 | return pd.read_pickle(filename) 60 | 61 | data_l_n = [] 62 | data_r_n = [] 63 | y = [] 64 | for line in open(train_file,"r", encoding="utf8"): 65 | lineno, s1, s2, label=line.strip().split("\t") 66 | if dtype == "word": 67 | data_l_n.append([word2index[word] for word in list(jieba.cut(star.sub("1",s1))) if word in word2index]) 68 | data_r_n.append([word2index[word] for word in list(jieba.cut(star.sub("1",s2))) if word in word2index]) 69 | if dtype == "char": 70 | data_l_n.append([char2index[char] for char in s1 if char in char2index]) 71 | data_r_n.append([char2index[char] for char in s2 if char in char2index]) 72 | 73 | y.append(int(label)) 74 | 75 | # 对齐语料中句子的长度 76 | data_l_n = pad_sequences(data_l_n, maxlen=input_length) 77 | data_r_n = pad_sequences(data_r_n, maxlen=input_length) 78 | y = np.array(y) 79 | 80 | pd.to_pickle((data_l_n, data_r_n, y), filename) 81 | 82 | return (data_l_n, data_r_n, y) 83 | 84 | if dtype == "both": 85 | ret_array = [] 86 | for dtype,input_length in zip(['word', 'char'],input_length): 87 | data_l_n,data_r_n,y = __load_data(dtype, input_length, w2v_length) 88 | ret_array.append(np.asarray(data_l_n)) 89 | ret_array.append(np.asarray(data_r_n)) 90 | ret_array.append(y) 91 | return ret_array 92 | else: 93 | return __load_data(dtype, input_length, w2v_length) 94 | 95 | def input_data(sent1, sent2, dtype = "both", input_length=[20,24]): 96 | def __input_data(sent1, sent2, dtype = "word", input_length=20): 97 | data_l_n = [] 98 | data_r_n = [] 99 | for s1, s2 in zip(sent1, sent2): 100 | if dtype == "word": 101 | data_l_n.append([word2index[word] for word in list(jieba.cut(star.sub("1",s1))) if word in word2index]) 102 | data_r_n.append([word2index[word] for word in list(jieba.cut(star.sub("1",s2))) if word in word2index]) 103 | if dtype == "char": 104 | data_l_n.append([char2index[char] for char in s1 if char in char2index]) 105 | data_r_n.append([char2index[char] for char in s2 if char in char2index]) 106 | 107 | # 对齐语料中句子的长度 108 | data_l_n = pad_sequences(data_l_n, maxlen=input_length) 109 | data_r_n = pad_sequences(data_r_n, maxlen=input_length) 110 | 111 | return [data_l_n, data_r_n] 112 | 113 | if dtype == "both": 114 | ret_array = [] 115 | for dtype,input_length in zip(['word', 'char'],input_length): 116 | data_l_n,data_r_n = __input_data(sent1, sent2, dtype, input_length) 117 | ret_array.append(data_l_n) 118 | ret_array.append(data_r_n) 119 | return ret_array 120 | else: 121 | return __input_data(sent1, sent2, dtype, input_length) 122 | 123 | 124 | ########################################################################### 125 | # 训练验证集划分 126 | ########################################################################### 127 | def split_data(data,mode="train", test_size=test_size, random_state=random_state): 128 | # mode == "train": 划分成用于训练的四元组 129 | # mode == 
"orig": 划分成两组数据 130 | train = [] 131 | test = [] 132 | for data_i in data: 133 | if fast_mode: 134 | data_i, _ = train_test_split(data_i,test_size=1-fast_rate,random_state=random_state ) 135 | train_data, test_data = train_test_split(data_i,test_size=test_size,random_state=random_state ) 136 | train.append(np.asarray(train_data)) 137 | test.append(np.asarray(test_data)) 138 | 139 | if mode == "orig": 140 | return train, test 141 | 142 | train_x, train_y, test_x, test_y = train[:-1], train[-1], test[:-1], test[-1] 143 | return train_x, train_y, test_x, test_y 144 | 145 | 146 | ##################################################################### 147 | # 模型定义 148 | ##################################################################### 149 | 150 | w2v_length = 300 151 | ebed_type = "gensim" 152 | # ebed_type = "fastcbow" 153 | 154 | if ebed_type == "gensim": 155 | char_embedding_model = gensim.models.Word2Vec.load(model_dir + "char2vec_gensim%s"%w2v_length) 156 | char2index = {v:k for k,v in enumerate(char_embedding_model.wv.index2word)} 157 | word_embedding_model = gensim.models.Word2Vec.load(model_dir + "word2vec_gensim%s"%w2v_length) 158 | word2index = {v:k for k,v in enumerate(word_embedding_model.wv.index2word)} 159 | 160 | elif ebed_type == "fastskip" or ebed_type == "fastcbow": 161 | char_fastcbow = FastText.load(model_dir + "char2vec_%s%d"%(ebed_type, w2v_length)) 162 | char_embedding_matrix = char_fastcbow.wv.vectors 163 | char2index = {v:k for k,v in enumerate(char_fastcbow.wv.index2word)} 164 | word_fastcbow = FastText.load(model_dir + "word2vec_%s%d"%(ebed_type, w2v_length)) 165 | word_embedding_matrix = word_fastcbow.wv.vectors 166 | word2index = {v:k for k,v in enumerate(word_fastcbow.wv.index2word)} 167 | 168 | print("loaded w2v done!", len(char2index), len(word2index)) 169 | 170 | MAX_LEN = 30 171 | MAX_EPOCH = 90 172 | train_batch_size = 64 173 | test_batch_size = 500 174 | earlystop_patience, plateau_patience = 8,2 # patience 175 | cfgs = [ 176 | ("siamese", "char", 24, ebed_type, w2v_length, [100, 80, 64, 64], 102-5, earlystop_patience), # 69s 177 | ("siamese", "word", 20, ebed_type, w2v_length, [100, 80, 64, 64], 120-4, earlystop_patience), # 59s 178 | ("esim", "char", 24, ebed_type, w2v_length, [], 18, earlystop_patience), # 389s 179 | ("esim", "word", 20, ebed_type, w2v_length, [], 21, earlystop_patience), # 335s 180 | ("decom", "char", 24, ebed_type, w2v_length, [], 87-2, earlystop_patience), # 84s 181 | ("decom", "word", 20, ebed_type, w2v_length, [], 104-4, earlystop_patience), # 71s 182 | ("dssm", "both", [20,24], ebed_type, w2v_length, [], 124-8, earlystop_patience), # 55s 183 | ] 184 | 185 | 186 | def get_embedding_layers(dtype, input_length, w2v_length, with_weight=True): 187 | def __get_embedding_layers(dtype, input_length, w2v_length, with_weight=True): 188 | 189 | if dtype == 'word': 190 | embedding_length = len(word2index) 191 | elif dtype == 'char': 192 | embedding_length = len(char2index) 193 | 194 | if with_weight: 195 | if ebed_type == "gensim": 196 | if dtype == 'word': 197 | embedding = word_embedding_model.wv.get_keras_embedding(train_embeddings=True) 198 | else: 199 | embedding = char_embedding_model.wv.get_keras_embedding(train_embeddings=True) 200 | 201 | elif ebed_type == "fastskip" or ebed_type == "fastcbow": 202 | if dtype == 'word': 203 | embedding = Embedding(embedding_length, w2v_length, input_length=input_length, weights=[word_embedding_matrix], trainable=True) 204 | else: 205 | embedding = Embedding(embedding_length, w2v_length, 
input_length=input_length, weights=[char_embedding_matrix], trainable=True) 206 | else: 207 | embedding = Embedding(embedding_length, w2v_length, input_length=input_length, trainable=True) 208 | 209 | return embedding 210 | 211 | if dtype == "both": 212 | embedding = [] 213 | for dtype,input_length in zip(['word', 'char'],input_length): 214 | embedding.append(__get_embedding_layers(dtype, input_length, w2v_length, with_weight)) 215 | return embedding 216 | else: 217 | return __get_embedding_layers(dtype, input_length, w2v_length, with_weight) 218 | 219 | def create_pretrained_embedding(pretrained_weights_path, trainable=False, **kwargs): 220 | "Create embedding layer from a pretrained weights array" 221 | pretrained_weights = np.load(pretrained_weights_path) 222 | in_dim, out_dim = pretrained_weights.shape 223 | embedding = Embedding(in_dim, out_dim, weights=[pretrained_weights], trainable=False, **kwargs) 224 | return embedding 225 | 226 | 227 | def unchanged_shape(input_shape): 228 | "Function for Lambda layer" 229 | return input_shape 230 | 231 | 232 | def substract(input_1, input_2): 233 | "Substract element-wise" 234 | neg_input_2 = Lambda(lambda x: -x, output_shape=unchanged_shape)(input_2) 235 | out_ = Add()([input_1, neg_input_2]) 236 | return out_ 237 | 238 | 239 | def submult(input_1, input_2): 240 | "Get multiplication and subtraction then concatenate results" 241 | mult = Multiply()([input_1, input_2]) 242 | sub = substract(input_1, input_2) 243 | out_= Concatenate()([sub, mult]) 244 | return out_ 245 | 246 | 247 | def apply_multiple(input_, layers): 248 | "Apply layers to input then concatenate result" 249 | if not len(layers) > 1: 250 | raise ValueError('Layers list should contain more than 1 layer') 251 | else: 252 | agg_ = [] 253 | for layer in layers: 254 | agg_.append(layer(input_)) 255 | out_ = Concatenate()(agg_) 256 | return out_ 257 | 258 | 259 | def time_distributed(input_, layers): 260 | "Apply a list of layers in TimeDistributed mode" 261 | out_ = [] 262 | node_ = input_ 263 | for layer_ in layers: 264 | node_ = TimeDistributed(layer_)(node_) 265 | out_ = node_ 266 | return out_ 267 | 268 | 269 | def soft_attention_alignment(input_1, input_2): 270 | "Align text representation with neural soft attention" 271 | attention = Dot(axes=-1)([input_1, input_2]) 272 | w_att_1 = Lambda(lambda x: softmax(x, axis=1), 273 | output_shape=unchanged_shape)(attention) 274 | w_att_2 = Permute((2,1))(Lambda(lambda x: softmax(x, axis=2), 275 | output_shape=unchanged_shape)(attention)) 276 | in1_aligned = Dot(axes=1)([w_att_1, input_1]) 277 | in2_aligned = Dot(axes=1)([w_att_2, input_2]) 278 | return in1_aligned, in2_aligned 279 | 280 | def decomposable_attention(pretrained_embedding='../data/fasttext_matrix.npy', 281 | projection_dim=300, projection_hidden=0, projection_dropout=0.2, 282 | compare_dim=500, compare_dropout=0.2, 283 | dense_dim=300, dense_dropout=0.2, 284 | lr=1e-3, activation='elu', maxlen=MAX_LEN): 285 | # Based on: https://arxiv.org/abs/1606.01933 286 | 287 | q1 = Input(name='q1',shape=(maxlen,)) 288 | q2 = Input(name='q2',shape=(maxlen,)) 289 | 290 | # Embedding 291 | # embedding = create_pretrained_embedding(pretrained_embedding, 292 | # mask_zero=False) 293 | embedding = pretrained_embedding 294 | q1_embed = embedding(q1) 295 | q2_embed = embedding(q2) 296 | 297 | # Projection 298 | projection_layers = [] 299 | if projection_hidden > 0: 300 | projection_layers.extend([ 301 | Dense(projection_hidden, activation=activation), 302 | Dropout(rate=projection_dropout), 
303 | ]) 304 | projection_layers.extend([ 305 | Dense(projection_dim, activation=None), 306 | Dropout(rate=projection_dropout), 307 | ]) 308 | q1_encoded = time_distributed(q1_embed, projection_layers) 309 | q2_encoded = time_distributed(q2_embed, projection_layers) 310 | 311 | # Attention 312 | q1_aligned, q2_aligned = soft_attention_alignment(q1_encoded, q2_encoded) 313 | 314 | # Compare 315 | q1_combined = Concatenate()([q1_encoded, q2_aligned, submult(q1_encoded, q2_aligned)]) 316 | q2_combined = Concatenate()([q2_encoded, q1_aligned, submult(q2_encoded, q1_aligned)]) 317 | compare_layers = [ 318 | Dense(compare_dim, activation=activation), 319 | Dropout(compare_dropout), 320 | Dense(compare_dim, activation=activation), 321 | Dropout(compare_dropout), 322 | ] 323 | q1_compare = time_distributed(q1_combined, compare_layers) 324 | q2_compare = time_distributed(q2_combined, compare_layers) 325 | 326 | # Aggregate 327 | q1_rep = apply_multiple(q1_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()]) 328 | q2_rep = apply_multiple(q2_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()]) 329 | 330 | # Classifier 331 | merged = Concatenate()([q1_rep, q2_rep]) 332 | dense = BatchNormalization()(merged) 333 | dense = Dense(dense_dim, activation=activation)(dense) 334 | dense = Dropout(dense_dropout)(dense) 335 | dense = BatchNormalization()(dense) 336 | dense = Dense(dense_dim, activation=activation)(dense) 337 | dense = Dropout(dense_dropout)(dense) 338 | out_ = Dense(1, activation='sigmoid')(dense) 339 | 340 | model = Model(inputs=[q1, q2], outputs=out_) 341 | return model 342 | 343 | 344 | def esim(pretrained_embedding='../data/fasttext_matrix.npy', 345 | maxlen=MAX_LEN, 346 | lstm_dim=300, 347 | dense_dim=300, 348 | dense_dropout=0.5): 349 | 350 | # Based on arXiv:1609.06038 351 | q1 = Input(name='q1',shape=(maxlen,)) 352 | q2 = Input(name='q2',shape=(maxlen,)) 353 | 354 | # Embedding 355 | # embedding = create_pretrained_embedding(pretrained_embedding, mask_zero=False) 356 | embedding = pretrained_embedding 357 | bn = BatchNormalization(axis=2) 358 | q1_embed = bn(embedding(q1)) 359 | q2_embed = bn(embedding(q2)) 360 | 361 | # Encode 362 | encode = Bidirectional(CuDNNLSTM(lstm_dim, return_sequences=True)) 363 | q1_encoded = encode(q1_embed) 364 | q2_encoded = encode(q2_embed) 365 | 366 | # Attention 367 | q1_aligned, q2_aligned = soft_attention_alignment(q1_encoded, q2_encoded) 368 | 369 | # Compose 370 | q1_combined = Concatenate()([q1_encoded, q2_aligned, submult(q1_encoded, q2_aligned)]) 371 | q2_combined = Concatenate()([q2_encoded, q1_aligned, submult(q2_encoded, q1_aligned)]) 372 | 373 | compose = Bidirectional(CuDNNLSTM(lstm_dim, return_sequences=True)) 374 | q1_compare = compose(q1_combined) 375 | q2_compare = compose(q2_combined) 376 | 377 | # Aggregate 378 | q1_rep = apply_multiple(q1_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()]) 379 | q2_rep = apply_multiple(q2_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()]) 380 | 381 | # Classifier 382 | merged = Concatenate()([q1_rep, q2_rep]) 383 | 384 | dense = BatchNormalization()(merged) 385 | dense = Dense(dense_dim, activation='elu')(dense) 386 | dense = BatchNormalization()(dense) 387 | dense = Dropout(dense_dropout)(dense) 388 | dense = Dense(dense_dim, activation='elu')(dense) 389 | dense = BatchNormalization()(dense) 390 | dense = Dropout(dense_dropout)(dense) 391 | out_ = Dense(1, activation='sigmoid')(dense) 392 | 393 | model = Model(inputs=[q1, q2], outputs=out_) 394 | return model 395 | 396 | def custom_loss(y_true, y_pred): 397 | 
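    # Contrastive-style loss for the siamese similarity score y_pred in (0, 1]:
    #   y_true = 1 (same intent):      penalize 0.25 * (1 - y_pred)^2
    #   y_true = 0 (different intent): penalize K.maximum(y_pred, 0)^2
    # Note: `margin` is declared below but not referenced in the returned expression.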
margin = 1 398 | return K.mean(0.25 * y_true * K.square(1 - y_pred) + 399 | (1 - y_true) * K.square(K.maximum(y_pred, 0))) 400 | 401 | def siamese(pretrained_embedding=None, 402 | input_length=MAX_LEN, 403 | w2v_length=300, 404 | n_hidden=[64, 64, 64]): 405 | #输入层 406 | left_input = Input(shape=(input_length,), dtype='int32') 407 | right_input = Input(shape=(input_length,), dtype='int32') 408 | 409 | #对句子embedding 410 | encoded_left = pretrained_embedding(left_input) 411 | encoded_right = pretrained_embedding(right_input) 412 | 413 | #两个LSTM共享参数 414 | # # v1 一层lstm 415 | # shared_lstm = CuDNNLSTM(n_hidden) 416 | 417 | # # v2 带drop和正则化的多层lstm 418 | ipt = Input(shape=(input_length, w2v_length)) 419 | dropout_rate = 0.5 420 | x = Dropout(dropout_rate, )(ipt) 421 | for i,hidden_length in enumerate(n_hidden): 422 | # x = Bidirectional(CuDNNLSTM(hidden_length, return_sequences=(i!=len(n_hidden)-1), kernel_regularizer=L1L2(l1=0.01, l2=0.01)))(x) 423 | x = Bidirectional(CuDNNLSTM(hidden_length, return_sequences=True, kernel_regularizer=L1L2(l1=0.01, l2=0.01)))(x) 424 | 425 | # v3 卷积网络特征层 426 | x = Conv1D(64, kernel_size = 2, strides = 1, padding = "valid", kernel_initializer = "he_uniform")(x) 427 | x_p1 = GlobalAveragePooling1D()(x) 428 | x_p2 = GlobalMaxPooling1D()(x) 429 | x = Concatenate()([x_p1, x_p2]) 430 | shared_lstm = Model(inputs=ipt, outputs=x) 431 | 432 | left_output = shared_lstm(encoded_left) 433 | right_output = shared_lstm(encoded_right) 434 | 435 | 436 | # 距离函数 exponent_neg_manhattan_distance 437 | malstm_distance = Lambda(lambda x: K.exp(-K.sum(K.abs(x[0] - x[1]), axis=1, keepdims=True)), 438 | output_shape=lambda x: (x[0][0], 1))([left_output, right_output]) 439 | 440 | model = Model([left_input, right_input], [malstm_distance]) 441 | 442 | return model 443 | 444 | class Attention(Layer): 445 | def __init__(self, step_dim, 446 | W_regularizer=None, b_regularizer=None, 447 | W_constraint=None, b_constraint=None, 448 | bias=True, **kwargs): 449 | """ 450 | Keras Layer that implements an Attention mechanism for temporal data. 451 | Supports Masking. 452 | Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756] 453 | # Input shape 454 | 3D tensor with shape: `(samples, steps, features)`. 455 | # Output shape 456 | 2D tensor with shape: `(samples, features)`. 457 | :param kwargs: 458 | Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True. 459 | The feature dimension is inferred from the RNN output shape; step_dim (the number of timesteps) must be passed in explicitly.
460 | Example: 461 | model.add(LSTM(64, return_sequences=True)) 462 | model.add(Attention(step_dim=seq_len)) # seq_len = number of timesteps of the LSTM output 463 | """ 464 | self.supports_masking = True 465 | #self.init = initializations.get('glorot_uniform') 466 | self.init = initializers.get('glorot_uniform') 467 | 468 | self.W_regularizer = regularizers.get(W_regularizer) 469 | self.b_regularizer = regularizers.get(b_regularizer) 470 | 471 | self.W_constraint = constraints.get(W_constraint) 472 | self.b_constraint = constraints.get(b_constraint) 473 | 474 | self.bias = bias 475 | self.step_dim = step_dim 476 | self.features_dim = 0 477 | super(Attention, self).__init__(**kwargs) 478 | 479 | def build(self, input_shape): 480 | assert len(input_shape) == 3 481 | 482 | self.W = self.add_weight(shape=(input_shape[-1],), 483 | initializer=self.init, 484 | name='%s_W'%self.name, 485 | regularizer=self.W_regularizer, 486 | constraint=self.W_constraint) 487 | self.features_dim = input_shape[-1] 488 | 489 | if self.bias: 490 | self.b = self.add_weight(shape=(input_shape[1],), 491 | initializer='zero', 492 | name='%s_b'%self.name, 493 | regularizer=self.b_regularizer, 494 | constraint=self.b_constraint) 495 | else: 496 | self.b = None 497 | 498 | self.built = True 499 | 500 | def compute_mask(self, input, input_mask=None): 501 | # do not pass the mask to the next layers 502 | return None 503 | 504 | def call(self, x, mask=None): 505 | # eij = K.dot(x, self.W) TF backend doesn't support it 506 | 507 | # features_dim = self.W.shape[0] 508 | # step_dim = x._keras_shape[1] 509 | 510 | features_dim = self.features_dim 511 | step_dim = self.step_dim 512 | eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim)) 513 | 514 | if self.bias: 515 | eij += self.b 516 | 517 | eij = K.tanh(eij) 518 | a = K.exp(eij) 519 | # apply mask after the exp.
will be re-normalized next 520 | if mask is not None: 521 | # Cast the mask to floatX to avoid float64 upcasting in theano 522 | a *= K.cast(mask, K.floatx()) 523 | 524 | # in some cases especially in the early stages of training the sum may be almost zero 525 | a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx()) 526 | 527 | a = K.expand_dims(a) 528 | weighted_input = x * a 529 | #print weigthted_input.shape 530 | return K.sum(weighted_input, axis=1) 531 | 532 | def compute_output_shape(self, input_shape): 533 | #return input_shape[0], input_shape[-1] 534 | return input_shape[0], self.features_dim 535 | 536 | 537 | def DSSM(pretrained_embedding, input_length, lstmsize=90): 538 | word_embedding, char_embedding = pretrained_embedding 539 | wordlen, charlen = input_length 540 | 541 | input1 = Input(shape=(wordlen,)) 542 | input2 = Input(shape=(wordlen,)) 543 | lstm0 = CuDNNLSTM(lstmsize,return_sequences = True) 544 | lstm1 = Bidirectional(CuDNNLSTM(lstmsize)) 545 | lstm2 = CuDNNLSTM(lstmsize) 546 | att1 = Attention(wordlen) 547 | den = Dense(64,activation = 'tanh') 548 | 549 | # att1 = Lambda(lambda x: K.max(x,axis = 1)) 550 | 551 | v1 = word_embedding(input1) 552 | v2 = word_embedding(input2) 553 | v11 = lstm1(v1) 554 | v22 = lstm1(v2) 555 | v1ls = lstm2(lstm0(v1)) 556 | v2ls = lstm2(lstm0(v2)) 557 | v1 = Concatenate(axis=1)([att1(v1),v11]) 558 | v2 = Concatenate(axis=1)([att1(v2),v22]) 559 | 560 | input1c = Input(shape=(charlen,)) 561 | input2c = Input(shape=(charlen,)) 562 | lstm1c = Bidirectional(CuDNNLSTM(lstmsize)) 563 | att1c = Attention(charlen) 564 | v1c = char_embedding(input1c) 565 | v2c = char_embedding(input2c) 566 | v11c = lstm1c(v1c) 567 | v22c = lstm1c(v2c) 568 | v1c = Concatenate(axis=1)([att1c(v1c),v11c]) 569 | v2c = Concatenate(axis=1)([att1c(v2c),v22c]) 570 | 571 | 572 | mul = Multiply()([v1,v2]) 573 | sub = Lambda(lambda x: K.abs(x))(Subtract()([v1,v2])) 574 | maximum = Maximum()([Multiply()([v1,v1]),Multiply()([v2,v2])]) 575 | mulc = Multiply()([v1c,v2c]) 576 | subc = Lambda(lambda x: K.abs(x))(Subtract()([v1c,v2c])) 577 | maximumc = Maximum()([Multiply()([v1c,v1c]),Multiply()([v2c,v2c])]) 578 | sub2 = Lambda(lambda x: K.abs(x))(Subtract()([v1ls,v2ls])) 579 | matchlist = Concatenate(axis=1)([mul,sub,mulc,subc,maximum,maximumc,sub2]) 580 | matchlist = Dropout(0.05)(matchlist) 581 | 582 | matchlist = Concatenate(axis=1)([Dense(32,activation = 'relu')(matchlist),Dense(48,activation = 'sigmoid')(matchlist)]) 583 | res = Dense(1, activation = 'sigmoid')(matchlist) 584 | 585 | 586 | model = Model(inputs=[input1, input2, input1c, input2c], outputs=res) 587 | return model 588 | 589 | """ 590 | From the paper: 591 | Averaging Weights Leads to Wider Optima and Better Generalization 592 | Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson 593 | https://arxiv.org/abs/1803.05407 594 | 2018 595 | 596 | Author's implementation: https://github.com/timgaripov/swa 597 | """ 598 | class SWA(Callback): 599 | def __init__(self, model, swa_model, swa_start): 600 | super().__init__() 601 | self.model,self.swa_model,self.swa_start=model,swa_model,swa_start 602 | 603 | def on_train_begin(self, logs=None): 604 | self.epoch = 0 605 | self.swa_n = 0 606 | 607 | def on_epoch_end(self, epoch, logs=None): 608 | if (self.epoch + 1) >= self.swa_start: 609 | self.update_average_model() 610 | self.swa_n += 1 611 | 612 | self.epoch += 1 613 | 614 | def update_average_model(self): 615 | # update running average of parameters 616 | alpha = 
1./(self.swa_n + 1) 617 | for layer,swa_layer in zip(self.model.layers, self.swa_model.layers): 618 | weights = [] 619 | for w1,w2 in zip(swa_layer.get_weights(), layer.get_weights()): 620 | weights.append( (1-alpha)*w1 + alpha*w2) 621 | swa_layer.set_weights(weights) 622 | 623 | class LR_Updater(Callback): 624 | ''' 625 | Abstract class where all Learning Rate updaters inherit from. (e.g., CircularLR) 626 | Calculates and updates new learning rate and momentum at the end of each batch. 627 | Have to be extended. 628 | ''' 629 | def __init__(self, init_lrs): 630 | self.init_lrs = init_lrs 631 | 632 | def on_train_begin(self, logs=None): 633 | self.update_lr() 634 | 635 | def on_batch_end(self, batch, logs=None): 636 | self.update_lr() 637 | 638 | def update_lr(self): 639 | # cur_lrs = K.get_value(self.model.optimizer.lr) 640 | new_lrs = self.calc_lr(self.init_lrs) 641 | K.set_value(self.model.optimizer.lr, new_lrs) 642 | 643 | def calc_lr(self, init_lrs): raise NotImplementedError 644 | 645 | 646 | class CircularLR(LR_Updater): 647 | ''' 648 | A learning rate updater that implements the CircularLearningRate (CLR) scheme. 649 | Learning rate is increased then decreased linearly. 650 | ''' 651 | def __init__(self, init_lrs, nb, div=4, cut_div=8, on_cycle_end=None): 652 | self.nb,self.div,self.cut_div,self.on_cycle_end = nb,div,cut_div,on_cycle_end 653 | super().__init__(init_lrs) 654 | 655 | def on_train_begin(self, logs=None): 656 | self.cycle_iter,self.cycle_count=0,0 657 | super().on_train_begin() 658 | 659 | def calc_lr(self, init_lrs): 660 | cut_pt = self.nb//self.cut_div 661 | if self.cycle_iter>cut_pt: 662 | pct = 1 - (self.cycle_iter - cut_pt)/(self.nb - cut_pt) 663 | else: pct = self.cycle_iter/cut_pt 664 | res = init_lrs * (1 + pct*(self.div-1)) / self.div 665 | self.cycle_iter += 1 666 | if self.cycle_iter==self.nb: 667 | self.cycle_iter = 0 668 | if self.on_cycle_end: self.on_cycle_end(self, self.cycle_count) 669 | self.cycle_count += 1 670 | return res 671 | 672 | class TimerStop(Callback): 673 | """docstring for TimerStop""" 674 | def __init__(self, start_time, total_seconds): 675 | super(TimerStop, self).__init__() 676 | self.start_time = start_time 677 | self.total_seconds = total_seconds 678 | self.epoch_seconds = [] 679 | 680 | def on_epoch_begin(self, epoch, logs=None): 681 | self.epoch_start = time.time() 682 | 683 | def on_epoch_end(self, epoch, logs=None): 684 | self.epoch_seconds.append(time.time() - self.epoch_start) 685 | 686 | mean_epoch_seconds = sum(self.epoch_seconds)/len(self.epoch_seconds) 687 | if time.time() + mean_epoch_seconds > self.start_time + self.total_seconds: 688 | self.model.stop_training = True 689 | 690 | def on_train_end(self, logs=None): 691 | print('timer stopping') 692 | 693 | 694 | def get_model(cfg,model_weights=None): 695 | print("======= CONFIG: ", cfg) 696 | 697 | model_type,dtype,input_length,ebed_type,w2v_length,n_hidden,n_epoch,patience = cfg 698 | embedding = get_embedding_layers(dtype, input_length, w2v_length, with_weight=True) 699 | 700 | if model_type == "esim": 701 | model = esim(pretrained_embedding=embedding, 702 | maxlen=input_length, 703 | lstm_dim=300, 704 | dense_dim=300, 705 | dense_dropout=0.5) 706 | elif model_type == "decom": 707 | model = decomposable_attention(pretrained_embedding=embedding, 708 | projection_dim=300, projection_hidden=0, projection_dropout=0.2, 709 | compare_dim=500, compare_dropout=0.2, 710 | dense_dim=300, dense_dropout=0.2, 711 | lr=1e-3, activation='elu', maxlen=input_length) 712 | elif model_type 
== "siamese": 713 | model = siamese(pretrained_embedding=embedding, input_length=input_length, w2v_length=w2v_length, n_hidden=n_hidden) 714 | elif model_type == "dssm": 715 | model = DSSM(pretrained_embedding=embedding,input_length=input_length, lstmsize=90) 716 | 717 | if model_weights is not None: 718 | model.load_weights(model_weights) 719 | 720 | # keras.utils.plot_model(model, to_file=model_dir+model_type+"_"+dtype+'.png', show_shapes=True, show_layer_names=True, rankdir='TB') 721 | return model 722 | 723 | ##################################################################### 724 | # 评估指标和最佳阈值 725 | ##################################################################### 726 | 727 | def r_f1_thresh(y_pred,y_true,step=1000): 728 | e = np.zeros((len(y_true),2)) 729 | e[:,0] = y_pred.reshape(-1) 730 | e[:,1] = y_true 731 | f = pd.DataFrame(e) 732 | thrs = np.linspace(0,1,step+1) 733 | x = np.array([f1_score(y_pred=f.loc[:,0]>thr, y_true=f.loc[:,1]) for thr in thrs]) 734 | f1_, thresh = max(x),thrs[x.argmax()] 735 | return f.corr()[0][1], f1_, thresh 736 | 737 | ##################################################################### 738 | # 模型训练和保存 739 | ##################################################################### 740 | configs_path = model_dir+"all_configs.json" 741 | def save_config(filepath, cfg): 742 | configs = {} 743 | if os.path.exists(configs_path): configs = json.loads(open(configs_path,"r",encoding="utf8").read()) 744 | configs[filepath] = cfg 745 | open(configs_path,"w",encoding="utf8").write(json.dumps(configs, indent=2, ensure_ascii=False)) 746 | 747 | def train_model(model, swa_model, cfg): 748 | model_type,dtype,input_length,ebed_type,w2v_length,n_hidden,n_epoch,patience = cfg 749 | 750 | data = load_data(dtype, input_length, w2v_length) 751 | train_x, train_y, test_x, test_y = split_data(data) 752 | filepath=model_dir+model_type+"_"+dtype+time.strftime("_%m-%d %H-%M-%S")+".h5" # 每次运行的模型都进行保存,不覆盖之前的结果 753 | checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=0, save_best_only=True,save_weights_only=True, mode='auto') 754 | earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=patience, verbose=0, mode='auto') 755 | reduce_lr = ReduceLROnPlateau(monitor='val_loss', verbose=0, factor=0.5,patience=2, min_lr=1e-6) 756 | swa_cbk = SWA(model, swa_model, swa_start=1) 757 | 758 | init_lrs = 0.001 759 | clr_div,cut_div = 10, 8 760 | batch_num = (train_x[0].shape[0]-1) // train_batch_size + 1 761 | cycle_len = 1 762 | total_iterators = batch_num*cycle_len 763 | print("total iters per cycle(epoch):",total_iterators) 764 | circular_lr = CircularLR(init_lrs, total_iterators, on_cycle_end=None, div=clr_div, cut_div=cut_div) 765 | callbacks = [checkpoint, earlystop, swa_cbk, circular_lr] 766 | callbacks.append(TimerStop(start_time=start_time, total_seconds=7100)) 767 | 768 | def fit(n_epoch=n_epoch): 769 | history = model.fit(x=train_x, y=train_y, 770 | class_weight={0:1/np.mean(train_y),1:1/(1-np.mean(train_y))}, 771 | validation_data=((test_x, test_y)), 772 | batch_size=train_batch_size, 773 | callbacks=callbacks, 774 | epochs=n_epoch,verbose=2) 775 | return history 776 | 777 | loss,metrics = 'binary_crossentropy',['binary_crossentropy',"accuracy"] 778 | 779 | model.compile(optimizer=Adam(lr=init_lrs, beta_1=0.8), loss=loss, metrics=metrics) 780 | fit() 781 | 782 | filepath_swa = model_dir + filepath.split("/")[-1].split(".")[0]+"-swa.h5" 783 | swa_cbk.swa_model.save_weights(filepath_swa) 784 | 785 | # 保存配置,方便多模型集成 786 | save_config(filepath, cfg) 787 | 
save_config(filepath_swa, cfg) 788 | 789 | def train_all_models(index): 790 | cfg = cfgs[index] 791 | K.clear_session() 792 | model = get_model(cfg,None) 793 | swa_model = get_model(cfg,None) 794 | train_model(model, swa_model, cfg) 795 | 796 | 797 | ##################################################################### 798 | # 模型评估、模型融合、模型测试 799 | ##################################################################### 800 | 801 | evaluate_path = model_dir + "y_pred.pkl" 802 | def evaluate_models(): 803 | train_y_preds, test_y_preds = [], [] 804 | all_cfgs = json.loads(open(configs_path,'r',encoding="utf8").read()) 805 | num_clfs = len(all_cfgs) 806 | 807 | for weight, cfg in all_cfgs.items(): 808 | K.clear_session() 809 | model_type,dtype,input_length,ebed_type,w2v_length,n_hidden,n_epoch,patience = cfg 810 | data = load_data(dtype, input_length, w2v_length) 811 | train_x, train_y, test_x, test_y = split_data(data) 812 | model = get_model(cfg,weight) 813 | train_y_preds.append(model.predict(train_x, batch_size=test_batch_size).reshape(-1)) 814 | test_y_preds.append(model.predict(test_x, batch_size=test_batch_size).reshape(-1)) 815 | 816 | train_y_preds,test_y_preds = np.array(train_y_preds),np.array(test_y_preds) 817 | pd.to_pickle([train_y_preds,train_y,test_y_preds,test_y],evaluate_path) 818 | 819 | 820 | blending_path = model_dir + "blending_gdbm.pkl" 821 | def train_blending(): 822 | """ 根据配置文件和验证集的值计算融合模型 """ 823 | train_y_preds,train_y,valid_y_preds,valid_y = pd.read_pickle(evaluate_path) 824 | train_y_preds = train_y_preds.T 825 | valid_y_preds = valid_y_preds.T 826 | 827 | '''融合使用的模型''' 828 | clf = LogisticRegression() 829 | clf.fit(valid_y_preds, valid_y) 830 | 831 | train_y_preds_blend = clf.predict_proba(train_y_preds)[:,1] 832 | r,f1,train_thresh = r_f1_thresh(train_y_preds_blend, train_y) 833 | 834 | valid_y_preds_blend = clf.predict_proba(valid_y_preds)[:,1] 835 | r,f1,valid_thresh = r_f1_thresh(valid_y_preds_blend, valid_y) 836 | pd.to_pickle(((train_thresh+valid_thresh)/2,clf), blending_path) 837 | 838 | 839 | def result(): 840 | global df1 841 | all_cfgs = json.loads(open(configs_path,'r',encoding="utf8").read()) 842 | num_clfs = len(all_cfgs) 843 | test_y_preds = [] 844 | X = {} 845 | for cfg in all_cfgs.values(): 846 | model_type,dtype,input_length,ebed_type,w2v_length,n_hidden,n_epoch,patience = cfg 847 | key_ = f"{dtype}_{input_length}" 848 | if key_ not in X: X[key_] = input_data(df1["sent1"],df1["sent2"], dtype = dtype, input_length=input_length) 849 | 850 | for weight, cfg in all_cfgs.items(): 851 | K.clear_session() 852 | model_type,dtype,input_length,ebed_type,w2v_length,n_hidden,n_epoch,patience = cfg 853 | key_ = f"{dtype}_{input_length}" 854 | model = get_model(cfg, weight) 855 | test_y_preds.append(model.predict(X[key_], batch_size=test_batch_size).reshape(-1)) 856 | 857 | test_y_preds = np.array(test_y_preds).T 858 | thresh,clf = pd.read_pickle(blending_path) 859 | result = clf.predict_proba(test_y_preds)[:,1].reshape(-1)>thresh 860 | 861 | df_output = pd.concat([df1["id"],pd.Series(result,name="label",dtype=np.int32)],axis=1) 862 | 863 | topai(1,df_output) 864 | 865 | 866 | 867 | 868 | # 文档第二步,训练多个不同的模型,index取值为0-6 869 | if False: 870 | train_all_models(index=0) 871 | 872 | # 文档第三步,训练blending模型 873 | if False: 874 | evaluate_models() 875 | train_blending() 876 | 877 | # 文档第四步,测试blending模型 878 | if False: 879 | result() 880 | --------------------------------------------------------------------------------
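
上面 `soft_attention_alignment` 是 Decomposable Attention 和 ESIM 共用的软对齐步骤:先计算两个句子编码的相关性矩阵,再分别沿两个方向做 softmax,把一个句子的信息加权汇总到另一个句子的每个位置上。下面用 numpy 给出这一计算的最小示意(假设性示例:省略 batch 维度,`len1`、`len2`、`d` 等变量名仅作演示):

```python
# 最小示意(假设性示例):省略 batch 维度,len1/len2/d 等变量仅作演示。
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

len1, len2, d = 4, 5, 8
q1 = np.random.randn(len1, d)        # 句子1 的编码序列
q2 = np.random.randn(len2, d)        # 句子2 的编码序列

attention = q1 @ q2.T                # (len1, len2) 相关性矩阵
q1_aligned_to_q2 = softmax(attention, axis=0).T @ q1   # (len2, d):为句子2的每个位置加权汇总句子1
q2_aligned_to_q1 = softmax(attention, axis=1) @ q2     # (len1, d):为句子1的每个位置加权汇总句子2
print(q1_aligned_to_q2.shape, q2_aligned_to_q1.shape)
```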
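
`SWA` 回调的核心只是对权重做运行平均:第 n 次更新时取 alpha = 1/(n+1),结果始终等于已参与平均的各次权重的算术平均。下面是一个脱离 Keras 的最小示意(假设性示例,`swa_update`、`epoch_weights` 均为演示用名称):

```python
# 最小示意(假设性示例):swa_update、epoch_weights 为演示用名称,
# 更新式与 SWA 回调中 (1-alpha)*w1 + alpha*w2, alpha = 1/(swa_n+1) 等价。
import numpy as np

def swa_update(swa_weights, new_weights, n_averaged):
    alpha = 1.0 / (n_averaged + 1)
    return [(1 - alpha) * w_swa + alpha * w_new
            for w_swa, w_new in zip(swa_weights, new_weights)]

epoch_weights = [[np.array([1.0])], [np.array([2.0])], [np.array([6.0])]]  # 模拟三个 epoch 结束时的权重
swa_w, n = epoch_weights[0], 1       # 第一个 epoch 的权重直接作为初值
for w in epoch_weights[1:]:
    swa_w = swa_update(swa_w, w, n)
    n += 1
print(swa_w[0])                      # [3.],即 (1 + 2 + 6) / 3
```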
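
`CircularLR.calc_lr` 产生的是一条三角形学习率曲线:前 `nb//cut_div` 个 iteration 从 `init_lrs/div` 线性升到 `init_lrs`,其余 iteration 再线性降回去。下面用相同公式单独算出这条曲线,便于理解(假设性示例,`clr_schedule` 为演示用函数名,参数取 `train_model` 中的 `clr_div=10, cut_div=8`):

```python
# 最小示意(假设性示例):clr_schedule 为演示用函数名,公式与 CircularLR.calc_lr 相同。
import numpy as np

def clr_schedule(init_lr, nb, div=10, cut_div=8):
    cut_pt = nb // cut_div
    lrs = []
    for it in range(nb):
        if it > cut_pt:
            pct = 1 - (it - cut_pt) / (nb - cut_pt)   # 回落段
        else:
            pct = it / cut_pt                         # 升温段
        lrs.append(init_lr * (1 + pct * (div - 1)) / div)
    return np.array(lrs)

lrs = clr_schedule(init_lr=0.001, nb=1000, div=10, cut_div=8)
print(lrs.min(), lrs.max())   # 学习率在约 init_lr/div 与 init_lr 之间先升后降
```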
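
`r_f1_thresh` 的作用是在 [0,1] 上等距扫描阈值,返回预测与标签的相关系数、最优 f1 及对应阈值。下面是阈值扫描部分的最小示意(假设性示例,标签与预测均为随机构造):

```python
# 最小示意(假设性示例):标签与预测为随机构造,只演示阈值扫描部分。
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=1000)                          # 模拟标签
y_pred = np.clip(y_true * 0.3 + rng.rand(1000) * 0.7, 0, 1)    # 模拟模型输出概率

thrs = np.linspace(0, 1, 1001)
f1s = np.array([f1_score(y_true, y_pred > t) for t in thrs])
best_f1, best_thr = f1s.max(), thrs[f1s.argmax()]
corr = np.corrcoef(y_pred, y_true)[0, 1]                       # 对应 r_f1_thresh 返回的相关系数
print(corr, best_f1, best_thr)
```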
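
`train_blending` 与 `result` 的融合方式是:把每个基模型的预测概率作为一列特征,用 LogisticRegression 在验证集预测上做二次学习,再用扫描得到的阈值把融合后的概率二值化。下面是这一流程的最小示意(假设性示例,数据为随机构造,`n_models` 等变量仅作演示):

```python
# 最小示意(假设性示例):数据为随机构造,只演示 blending 的整体流程。
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.RandomState(42)
n_models, n_valid, n_test = 7, 2000, 500
valid_y = rng.randint(0, 2, n_valid)
# 每列是一个基模型在验证集 / 测试集上的预测概率
valid_preds = np.clip(valid_y[:, None] * 0.4 + rng.rand(n_valid, n_models) * 0.6, 0, 1)
test_preds = rng.rand(n_test, n_models)

clf = LogisticRegression()           # 与 train_blending 相同的二级模型
clf.fit(valid_preds, valid_y)

blend_valid = clf.predict_proba(valid_preds)[:, 1]
thrs = np.linspace(0, 1, 1001)
thresh = thrs[np.argmax([f1_score(valid_y, blend_valid > t) for t in thrs])]

labels = (clf.predict_proba(test_preds)[:, 1] > thresh).astype(np.int32)
print(thresh, labels[:10])
```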