├── .gitignore ├── README.md ├── ATEC NLP比赛感受.md ├── utils ├── plot_model.py ├── extract_wiki.py ├── test_cv_stacking.py └── train_embedding.py ├── pai_old.py └── pai_train.py /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .vscode/ 3 | .ipynb_checkpoints/ 4 | __pycache__/ 5 | data/ 6 | docs/ 7 | GitHubs/ 8 | logs/ 9 | PAI/ 10 | pai_model/ 11 | resources/ 12 | submits/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ATEC2018 NLP赛题 复赛f1 = 0.7327 2 | 3 | 由于PAI平台限制,所有代码都放在一个文件里面,`pai_train.py`是获得本次比赛成绩的文件,实验共使用了4个模型,分别是自定义Siamese网络、ESIM网络、Decomposable Attention和DSSM网络。其中Siamese、ESIM和Decomposable Attention有char level和word level两个版本,DSSM网络只有char和word的合并版本。最佳记录由多个模型进行blending融合预测,遗憾没有尝试一下10fold交叉训练模型,前排貌似都用了,而且这里每个模型都只用了2个小时来训练。 4 | 5 | 模型性能比较,字符级的esim模型在这个任务中表现最佳。 6 | 7 | | model name | 模型输出与标签相关性r | 最优f1评分 | 取得最优f1评分的阈值 | 8 | | ------------ | --------------------- | ------------------ | -------------------- | 9 | | siamese char | 0.553536380131115 | 0.6971525551574581 | 0.258 | 10 | | siamese word | 0.5308273808879237 | 0.6873517065157875 | 0.242 | 11 | | esim char | 0.5853469280801447 | 0.7116622491480499 | 0.233 | 12 | | esim word | 0.5783574742744366 | 0.7100964753080524 | 0.263 | 13 | | decom char | 0.5288425401105513 | 0.6825720620842572 | 0.249 | 14 | | decom word | 0.4943718720970039 | 0.6677430929314676 | 0.212 | 15 | | dssm both | 0.5638034287814917 | 0.6980098067493511 | 0.263 | 16 | 17 | 18 | 训练感受: 19 | 1. batchsize不要太大,虽然每个epoch更快完成, 但每个epoch权重更新次数变少了,收敛更慢 20 | 2. 使用[循环学习率](https://arxiv.org/abs/1506.01186)可以收敛到更好的极值点,更容易跳出局部极值,如在一个epoch中,使学习率从小变大,又逐渐变小 21 | 3. 利用[SWA](https://arxiv.org/abs/1803.05407)这种简单的模型融合方法可以获得泛化能力更好的性能,本地提升明显,但线上没有改善。 22 | 23 | 24 | `pai_transform.py`和`pai_old.py`是两次不成功的尝试: 25 | `pai_transform.py`试图参考fastai的ULMFiT方法,通过训练语言模型作为embedding输入,并针对当前分类任务更改网络结构以适应当前训练过程。 26 | `pai_old.py`试图参考quora分享,使用文本特征工程进行分类。 27 | 28 | 29 | > 模型来源siamese参考:https://blog.csdn.net/huowa9077/article/details/81082795 30 | > ESIM网络、Decomposable Attention来自Kaggle分享:https://www.kaggle.com/lamdang/dl-models 31 | > DSSM网络来自bird大神分享:https://openclub.alipay.com/read.php?tid=7480&fid=96 32 | > 感谢以上! -------------------------------------------------------------------------------- /ATEC NLP比赛感受.md: -------------------------------------------------------------------------------- 1 | # ATEC NLP比赛总结 2 | 3 | 今年5月份报名了蚂蚁金服的比赛,有金融大脑和风险大脑两个赛题,金融大脑主要解决智能客服遇到的自然语言处理问题,对于两个语句,判断是否是同一个意思,帮助构建客服的专用问答库,比赛的评判标准是f1分数,这对于正负样本不平衡问题比准确率更好,风险大脑则是通过用户登录和交易信息判断此次交易是否存在风险,在网络安全形势严峻的今天,其重要意义不言而喻。 4 | 5 | 2个月时间的投入,还是有一些收获: 6 | 7 | - 对keras的使用更加熟练,尤其是Callback的使用, 8 | 9 | - 阅读了pytorch的文档,初学使用pytorch,其特点为: 10 | 1. 强化版的numpy,前向运算和操作与numpy非常相似,而且可以直接利用GPU的运算能力。 11 | 2. 与TensorFlow不同的是,pytorch无需编译图,每次backward都会根据当前运算过程构造新的图,然后销毁,在程序中甚至可以通过条件语句直接改变图的运行流程,启用或停止相关节点。 12 | 3. pytorch对于最新研究成果的跟踪实现比keras快得多,拥有更丰富的神经网络层,更多优化器等。 13 | 14 | - 学习了基于pytorch的fastai框架,框架的作者Jeremy Howard是Kaggle高手,fastai框架吸收了一些Keras中便于使用的特性,整个框架源码约4000余行,短小精悍,使用方便 15 | 16 | - 跟着fastai的源码实践了统一语言模型精调(ULMFiT)方法,在文本相似度任务上并未取得好结果,ULMFiT方法特点如下: 17 | 1. 训练一个语言模型,模型架构为Embedding + 三层双向LSTM(+dropout),数据集一般为wiki,受限于数据加载和预处理方式,目前的源码仅能处理不超过500M的语料。 18 | 2. 在当前任务语料上finetune语言模型。 19 | 3. 
根据当前任务设计分类器模块,其出入为语言模型最后一个LSTM层的输出,从最后一层开始,逐层unfreeze,进行分类器模型精调。 20 | 21 | - 学习了batchsize参数对训练的影响,更大的batchsize意味着更准确的梯度方向,可以更快完成每个epoch,同时也意味着每个epoch的更新次数更少,需要更多的epoch才能使模型收敛,一味增大batchsize反而会延长训练时间。 22 | 23 | - 学会使用循环学习率变化的训练技巧,Circular Learning Rate通过循环改变学习率,从小到大,从大到小,不断循环,使模型更容易跳出局部最优,做出更多的尝试,该方法确实调高了模型训练的结果。 24 | 25 | - 学会使用SWA(stochastic weights averaging)模型融合方法,即将训练过程中的模型权重进行平均达到模型融合的目的,该方法的代价极小,仅需要保存另一份模型权重在内存或GPU显存中,在每个epoch后(或其他间隔)更新一次该权重,在训练结束时便可获得一个普通模型和一个SWA模型,该方法提高了5/7模型在初赛数据上的泛化能力,但并未提高任何模型对于5倍的复赛数据的泛化能力,这可能与模型训练不够充分有关。 26 | 27 | - 学习了一些模型融合方法,包括求多模型平均、投票、Stacking和Blending模型融合方法,其中Quora比赛中的一个stacking方案值得借鉴,他们将训练数据分成5份,三份训练,1份验证,1份测试,轮番5次,直到每份数据都参与1次验证和1次测试,这比传统的stacking更好的利用了数据。 28 | 29 | - 学习了语句对任务建模的两类基本模型,分别是向量表征模型和表征交互模型,向量表征模型利用孪生网络(Siamese Network)将两个语句编码成两个独立的向量,然后计算向量间的相似度,比如Siamese Net,DSSM;表征交互模型通过构造一个相关性交互矩阵,将两个语句的信息进行糅合处理,比如Decomposable Attention,ESIM。 30 | 31 | - 除了ESIM外,其他人用了两个新模型DRCN和DIIN (DRCN是SNLI排行榜最佳模型) 32 | 33 | - 本次比赛未尝试的方法: 34 | - 利用句子的拼音作为辅助输入,通过拼音embedding加强模型, 35 | - 将字、词和拼音等混入一个模型中,增强单个模型的能力 36 | - DRCN模型 37 | - **10折验证训练模型(大伙都在用)** 38 | 39 | 40 | > 其他两个队的开源代码 41 | > [复赛0.7368_红鲤鱼绿鲤鱼与驴](https://github.com/raven4752/huabei) 42 | > [复赛0.7352_World2vec](https://github.com/amxineohp/atec_2018_nlp) 43 | 44 | > 经验分享 45 | > [逼格learning](https://openclub.alipay.com/read.php?tid=9074&fid=96) 46 | 47 | > 语句对任务模型排行榜[SNLI项目](https://nlp.stanford.edu/projects/snli) 48 | 49 | -------------------------------------------------------------------------------- /utils/plot_model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.autograd import Variable 3 | import torch.nn as nn 4 | from graphviz import Digraph 5 | 6 | 7 | class CNN(nn.Module): 8 | def __init__(self): 9 | super(CNN, self).__init__() 10 | self.conv1 = nn.Sequential( 11 | nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2), 12 | nn.ReLU(), 13 | nn.MaxPool2d(kernel_size=2) 14 | ) 15 | self.conv2 = nn.Sequential( 16 | nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride=1, padding=2), 17 | nn.ReLU(), 18 | nn.MaxPool2d(kernel_size=2) 19 | ) 20 | self.out = nn.Linear(32*7*7, 10) 21 | 22 | def forward(self, x): 23 | x = self.conv1(x) 24 | x = self.conv2(x) 25 | x = x.view(x.size(0), -1) # (batch, 32*7*7) 26 | out = self.out(x) 27 | return out 28 | 29 | 30 | def make_dot(var, params=None): 31 | """ Produces Graphviz representation of PyTorch autograd graph 32 | Blue nodes are the Variables that require grad, orange are Tensors 33 | saved for backward in torch.autograd.Function 34 | Args: 35 | var: output Variable 36 | params: dict of (name, Variable) to add names to node that 37 | require grad (TODO: make optional) 38 | """ 39 | if params is not None: 40 | assert isinstance(params.values()[0], Variable) 41 | param_map = {id(v): k for k, v in params.items()} 42 | 43 | node_attr = dict(style='filled', 44 | shape='box', 45 | align='left', 46 | fontsize='12', 47 | ranksep='0.1', 48 | height='0.2') 49 | dot = Digraph(node_attr=node_attr, graph_attr=dict(size="12,12")) 50 | seen = set() 51 | 52 | def size_to_str(size): 53 | return '('+(', ').join(['%d' % v for v in size])+')' 54 | 55 | def add_nodes(var): 56 | if var not in seen: 57 | if torch.is_tensor(var): 58 | dot.node(str(id(var)), size_to_str(var.size()), fillcolor='orange') 59 | elif hasattr(var, 'variable'): 60 | u = var.variable 61 | name = param_map[id(u)] if params is not None else '' 62 | node_name = '%s\n %s' % (name, size_to_str(u.size())) 63 | 
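                # This branch handles leaf Variables (named parameters): they are drawn as
                # light-blue boxes labelled with the parameter name and its tensor shape.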
dot.node(str(id(var)), node_name, fillcolor='lightblue') 64 | else: 65 | dot.node(str(id(var)), str(type(var).__name__)) 66 | seen.add(var) 67 | if hasattr(var, 'next_functions'): 68 | for u in var.next_functions: 69 | if u[0] is not None: 70 | dot.edge(str(id(u[0])), str(id(var))) 71 | add_nodes(u[0]) 72 | if hasattr(var, 'saved_tensors'): 73 | for t in var.saved_tensors: 74 | dot.edge(str(id(t)), str(id(var))) 75 | add_nodes(t) 76 | add_nodes(var.grad_fn) 77 | return dot 78 | -------------------------------------------------------------------------------- /utils/extract_wiki.py: -------------------------------------------------------------------------------- 1 | import codecs 2 | import re 3 | 4 | import bz2file 5 | import jieba_fast as jieba 6 | from gensim.corpora.wikicorpus import extract_pages, filter_wiki 7 | # from gensim.corpora import WikiCorpus 8 | from tqdm import tqdm 9 | 10 | 11 | def get_wiki(): 12 | from opencc import OpenCC 13 | # 参考这篇博客注释 14 | # https://kexue.fm/archives/4176 15 | opencc1 = OpenCC("t2s") 16 | resub1 = re.compile(':*{\|[\s\S]*?\|}') 17 | resub2 = re.compile('[\s\S]*?') 18 | resub3 = re.compile('(.){{([^{}\n]*?\|[^{}\n]*?)}}') 19 | resub4 = re.compile('\* *\n|\'{2,}') 20 | resub5 = re.compile('\n+') 21 | resub6 = re.compile('\n[:;]|\n +') 22 | resub7 = re.compile('\n==') 23 | 24 | refind1 = re.compile('^[a-zA-Z]+:') 25 | refind2 = re.compile('^#') 26 | 27 | p1 = re.compile(r'-\{.*?(zh-hans|zh-cn):([^;]*?)(;.*?)?\}-') 28 | p2 = re.compile(r'[(\(][,;。?!\s]*[)\)]') 29 | p3 = re.compile(r'[「『]') 30 | p4 = re.compile(r'[」』]') 31 | 32 | def wiki_replace(s): 33 | s = filter_wiki(s) 34 | s = resub1.sub('', s) 35 | s = resub2.sub('', s) 36 | s = resub3.sub('\\1[[\\2]]', s) 37 | s = resub4.sub('', s) 38 | s = resub5.sub('\n', s) 39 | s = resub6.sub('\n', s) 40 | s = resub7.sub('\n\n==', s) 41 | s = p1.sub(r'\2', s) 42 | s = p2.sub(r'', s) 43 | s = p3.sub(r'“', s) 44 | s = p4.sub(r'”', s) 45 | return opencc1.convert(s).strip() 46 | 47 | wiki = extract_pages(bz2file.open('zhwiki-latest-pages-articles.xml.bz2')) 48 | 49 | # wiki=WikiCorpus('zhwiki-latest-pages-articles.xml.bz2',lemmatize=False,dictionary={}) 50 | 51 | with codecs.open('wiki.txt', 'w', encoding='utf-8') as f: 52 | i = 0 53 | filelist = [] 54 | for d in tqdm(wiki): 55 | 56 | print(d[0]) 57 | print(d[1]) 58 | 59 | i+=1 60 | 61 | if i == 5:break 62 | 63 | continue 64 | if not refind1.findall(d[0]) and d[0] and not refind2.findall(d[1]): 65 | filelist.append(d[0]+"\n"+d[1]) 66 | line = d[1] 67 | 68 | i += 1 69 | if i % 100 == 0: 70 | s = wiki_replace("\n\n".join(filelist)) 71 | f.write(s) 72 | filelist = [] 73 | 74 | def get_cut_std_wiki(): 75 | with open("cut_std_wiki.txt","w",encoding="utf8") as output: 76 | with open("std_wiki.txt","r",encoding="utf8") as file: 77 | for line in tqdm(file): 78 | output.write(" ".join(list(jieba.cut(line)))) 79 | 80 | def get_wiki2(): 81 | reobj1 = re.compile(r"[ `~!@#$%^&*\(\)-_=+\[\]\{\}\\\|;:\'\",<.>/?a-zA-Z\d]+") 82 | reobj2 = re.compile(r"\n+") 83 | reobj3 = re.compile("(())|(“”)|(「」)|(《》)|(“”)|(‘’)|(【】)|[,。?——!]{2,}") 84 | reuseful = re.compile('^[a-zA-Z]+:') 85 | redirect = re.compile(r"^#") 86 | def wiki_replace(s): 87 | s = filter_wiki(s) 88 | s = reobj1.sub("", s) # 为上传阿里云剔除竖线(|)符号 89 | s = reobj2.sub("#",s) 90 | s = reobj3.sub("",s) 91 | return s 92 | 93 | wiki = extract_pages(bz2file.open('zhwiki-latest-pages-articles.xml.bz2')) 94 | with codecs.open('wiki-tw.csv', 'w', encoding='utf-8') as f: 95 | i = 0 96 | filelist = [] 97 | for d in tqdm(wiki): 98 | if not 
reuseful.findall(d[0]) and not redirect.findall(d[1]): 99 | i+=1 100 | filelist.append(reobj1.sub("",d[0])+"|"+wiki_replace(d[1])+"\n") 101 | if i % 1000 == 0: 102 | s = ("".join(filelist)) 103 | f.write(s) 104 | filelist = [] 105 | if filelist: 106 | s = ("".join(filelist)) 107 | f.write(s) 108 | 109 | def wiki_error(): 110 | for no,line in enumerate(open("wiki_1.csv",'r', encoding="utf8")): 111 | pair = line.split("|") 112 | if len(pair)>2: 113 | print(no,pair[0],pair[1]) 114 | 115 | if __name__ == '__main__': 116 | # get_wiki2() # 繁体转简体 + 特殊符号处理 117 | wiki_error() -------------------------------------------------------------------------------- /utils/test_cv_stacking.py: -------------------------------------------------------------------------------- 1 | 2 | from datetime import datetime 3 | import numpy as np 4 | import matplotlib.pyplot as plt 5 | 6 | from sklearn import linear_model 7 | from sklearn import datasets 8 | from sklearn.svm import l1_min_c 9 | 10 | iris = datasets.load_iris() 11 | X = iris.data 12 | y = iris.target 13 | 14 | X = X[y != 2] 15 | y = y[y != 2] 16 | 17 | X -= np.mean(X, 0) 18 | cs = l1_min_c(X, y, loss='log') * np.logspace(0, 3) 19 | 20 | 21 | print("Computing regularization path ...") 22 | start = datetime.now() 23 | clf = linear_model.LogisticRegressionCV(penalty='l2', tol=1e-6) 24 | clf.fit(X, y) 25 | print("This took ", datetime.now() - start) 26 | 27 | 28 | # ============================================================= 29 | from sklearn import datasets 30 | 31 | iris = datasets.load_iris() 32 | X, y = iris.data[:, 1:3], iris.target 33 | 34 | from sklearn import model_selection 35 | from sklearn.linear_model import LogisticRegression 36 | from sklearn.neighbors import KNeighborsClassifier 37 | from sklearn.naive_bayes import GaussianNB 38 | from sklearn.ensemble import RandomForestClassifier 39 | from mlxtend.classifier import StackingClassifier 40 | from sklearn.model_selection import GridSearchCV 41 | import numpy as np 42 | 43 | clf1 = KNeighborsClassifier(n_neighbors=1) 44 | clf2 = RandomForestClassifier(random_state=1) 45 | clf3 = GaussianNB() 46 | lr = LogisticRegression() 47 | 48 | print('3-fold cross validation:\n') 49 | 50 | stack = 2 51 | if stack == 1: 52 | sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 53 | meta_classifier=lr) 54 | for clf, label in zip([clf1, clf2, clf3, sclf], 55 | ['KNN', 56 | 'Random Forest', 57 | 'Naive Bayes', 58 | 'StackingClassifier']): 59 | 60 | scores = model_selection.cross_val_score(clf, X, y, 61 | cv=3, scoring='accuracy') 62 | 63 | print("Accuracy: %0.2f (+/- %0.2f) [%s]" 64 | % (scores.mean(), scores.std(), label)) 65 | 66 | elif stack == 2: 67 | sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 68 | use_probas=True, 69 | average_probas=False, 70 | meta_classifier=lr) 71 | for clf, label in zip([clf1, clf2, clf3, sclf], 72 | ['KNN', 73 | 'Random Forest', 74 | 'Naive Bayes', 75 | 'StackingClassifier']): 76 | 77 | scores = model_selection.cross_val_score(clf, X, y, 78 | cv=3, scoring='accuracy') 79 | 80 | print("Accuracy: %0.2f (+/- %0.2f) [%s]" 81 | % (scores.mean(), scores.std(), label)) 82 | elif stack == 3: 83 | sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 84 | meta_classifier=lr) 85 | 86 | params = {'kneighborsclassifier__n_neighbors': [1, 5], 87 | 'randomforestclassifier__n_estimators': [10, 50], 88 | 'meta-logisticregression__C': [0.1, 10.0]} 89 | 90 | grid = GridSearchCV(estimator=sclf, 91 | param_grid=params, 92 | cv=5, 93 | refit=True) 94 | grid.fit(X, y) 95 | 96 | 
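    # Keys into grid.cv_results_ read out below: mean and std of the CV test
    # score plus the parameter combination for each grid-search candidate.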
cv_keys = ('mean_test_score', 'std_test_score', 'params') 97 | 98 | for r, _ in enumerate(grid.cv_results_['mean_test_score']): 99 | print("%0.3f +/- %0.2f %r" 100 | % (grid.cv_results_[cv_keys[0]][r], 101 | grid.cv_results_[cv_keys[1]][r] / 2.0, 102 | grid.cv_results_[cv_keys[2]][r])) 103 | 104 | print('Best parameters: %s' % grid.best_params_) 105 | print('Accuracy: %.2f' % grid.best_score_) 106 | 107 | 108 | import matplotlib.pyplot as plt 109 | from mlxtend.plotting import plot_decision_regions 110 | import matplotlib.gridspec as gridspec 111 | import itertools 112 | 113 | gs = gridspec.GridSpec(2, 2) 114 | 115 | fig = plt.figure(figsize=(10,8)) 116 | 117 | for clf, lab, grd in zip([clf1, clf2, clf3, sclf], 118 | ['KNN', 119 | 'Random Forest', 120 | 'Naive Bayes', 121 | 'StackingClassifier'], 122 | itertools.product([0, 1], repeat=2)): 123 | 124 | clf.fit(X, y) 125 | ax = plt.subplot(gs[grd[0], grd[1]]) 126 | fig = plot_decision_regions(X=X, y=y, clf=clf) 127 | plt.title(lab) 128 | plt.show() -------------------------------------------------------------------------------- /utils/train_embedding.py: -------------------------------------------------------------------------------- 1 | #/usr/bin/env python 2 | #coding=utf-8 3 | 4 | import os 5 | import re 6 | import multiprocessing 7 | 8 | import gensim 9 | from gensim.models.word2vec import LineSentence 10 | import jieba_fast as jieba 11 | import numpy as np 12 | import pandas as pd 13 | import fasttext 14 | 15 | 16 | os.environ["TF_CPP_MIN_LOG_LEVEL"]='3' 17 | model_dir = "pai_model/" 18 | 19 | new_words = "支付宝 付款码 二维码 收钱码 转账 退款 退钱 余额宝 运费险 还钱 还款 花呗 借呗 蚂蚁花呗 蚂蚁借呗 蚂蚁森林 小黄车 飞猪 微客 宝卡 芝麻信用 亲密付 淘票票 饿了么 摩拜 滴滴 滴滴出行".split(" ") 20 | for word in new_words: 21 | jieba.add_word(word) 22 | 23 | class MyChars(object): 24 | def __init__(self): 25 | pass 26 | 27 | def __iter__(self): 28 | with open(model_dir + "atec_nlp_sim_train.csv","r", encoding="utf8") as atec: 29 | for line in atec: 30 | lineno, s1, s2, label=line.strip().split("\t") 31 | yield list(s1) + list(s2) 32 | 33 | with open("resources/wiki_corpus/wiki.csv",'r',encoding="utf8") as wiki: 34 | for line in wiki: 35 | title, doc = line.strip().split("|") 36 | for sentense in doc.split("#"): 37 | if len(sentense)>0: 38 | yield [char for char in sentense if char and 0x4E00<= ord(char[0]) <= 0x9FA5] 39 | 40 | 41 | class MyWords(object): 42 | def __init__(self): 43 | pass 44 | 45 | def __iter__(self): 46 | with open(model_dir + "atec_nlp_sim_train.csv","r", encoding="utf8") as atec: 47 | for line in atec: 48 | lineno, s1, s2, label=line.strip().split("\t") 49 | yield list(jieba.cut(s1)) + list(jieba.cut(s2)) 50 | 51 | with open("resources/wiki_corpus/wiki.csv",'r',encoding="utf8") as wiki: 52 | for line in wiki: 53 | title, doc = line.strip().split("|") 54 | for sentense in doc.split("#"): 55 | if len(sentense)>0: 56 | yield [word for word in list(jieba.cut(sentense)) if word and 0x4E00<= ord(word[0]) <= 0x9FA5] 57 | 58 | 59 | def gen_data(): 60 | with open(model_dir + "train_char.txt","w",encoding="utf8") as file: 61 | mychars = MyChars() 62 | for cs in mychars: 63 | file.write(" ".join(cs)+"\n") 64 | 65 | with open(model_dir + "train_word.txt","w",encoding="utf8") as file: 66 | mywords = MyWords() 67 | for ws in mywords: 68 | file.write(" ".join(ws)+"\n") 69 | 70 | def train_embedding_gensim(): 71 | dim=256 72 | embedding_size = dim 73 | model = gensim.models.Word2Vec(LineSentence(model_dir + 'train_char.txt'), 74 | size=embedding_size, 75 | window=5, 76 | min_count=10, 77 | 
workers=multiprocessing.cpu_count()) 78 | 79 | model.save(model_dir + "char2vec_gensim"+str(embedding_size)) 80 | # model.wv.save_word2vec_format("model/char2vec_org"+str(embedding_size),"model/chars"+str(embedding_size),binary=False) 81 | 82 | dim=256 83 | embedding_size = dim 84 | model = gensim.models.Word2Vec(LineSentence(model_dir + 'train_word.txt'), 85 | size=embedding_size, 86 | window=5, 87 | min_count=10, 88 | workers=multiprocessing.cpu_count()) 89 | 90 | model.save(model_dir + "word2vec_gensim"+str(embedding_size)) 91 | # model.wv.save_word2vec_format("model/word2vec_org"+str(embedding_size),"model/vocabulary"+str(embedding_size),binary=False) 92 | 93 | 94 | def train_embedding_fasttext(): 95 | 96 | # Skipgram model 97 | model = fasttext.skipgram(model_dir + 'train_char.txt', model_dir + 'char2vec_fastskip256', word_ngrams=2, ws=5, min_count=10, dim=256) 98 | del(model) 99 | 100 | # CBOW model 101 | model = fasttext.cbow(model_dir + 'train_char.txt', model_dir + 'char2vec_fastcbow256', word_ngrams=2, ws=5, min_count=10, dim=256) 102 | del(model) 103 | 104 | # Skipgram model 105 | model = fasttext.skipgram(model_dir + 'train_word.txt', model_dir + 'word2vec_fastskip256', word_ngrams=2, ws=5, min_count=10, dim=256) 106 | del(model) 107 | 108 | # CBOW model 109 | model = fasttext.cbow(model_dir + 'train_word.txt', model_dir + 'word2vec_fastcbow256', word_ngrams=2, ws=5, min_count=10, dim=256) 110 | del(model) 111 | 112 | # gen_data() 113 | train_embedding_gensim() 114 | # train_embedding_word() -------------------------------------------------------------------------------- /pai_old.py: -------------------------------------------------------------------------------- 1 | #/usr/bin/env python 2 | #coding=utf-8 3 | #=================================================================================== 4 | # 传统方法 5 | #=================================================================================== 6 | import numpy as np 7 | import pandas as pd 8 | import re 9 | import math 10 | import time 11 | from sklearn.feature_extraction.text import TfidfVectorizer 12 | from sklearn.linear_model import LogisticRegression, LogisticRegressionCV 13 | from sklearn.metrics import f1_score 14 | from sklearn.model_selection import train_test_split, KFold 15 | import gensim 16 | try: 17 | import jieba_fast as jieba 18 | except Exception as e: 19 | import jieba 20 | 21 | try: 22 | print(model_dir) 23 | test_size = 0.025 24 | online=True 25 | except: 26 | model_dir = "pai_model/" 27 | test_size = 0.05 28 | online=False 29 | 30 | new_words = "支付宝 付款码 二维码 收钱码 转账 退款 退钱 余额宝 运费险 还钱 还款 花呗 借呗 蚂蚁花呗 蚂蚁借呗 蚂蚁森林 小黄车 飞猪 微客 宝卡 芝麻信用 亲密付 淘票票 饿了么 摩拜 滴滴 滴滴出行".split(" ") 31 | for word in new_words: 32 | jieba.add_word(word) 33 | 34 | star = re.compile("\*+") 35 | if False: 36 | stops = 
["、","。","〈","〉","《","》","一","一切","一则","一方面","一旦","一来","一样","一般","七","万一","三","上下","不仅","不但","不光","不单","不只","不如","不怕","不惟","不成","不拘","不比","不然","不特","不独","不管","不论","不过","不问","与","与其","与否","与此同时","且","两者","个","临","为","为了","为什么","为何","为着","乃","乃至","么","之","之一","之所以","之类","乌乎","乎","乘","九","也","也好","也罢","了","二","于","于是","于是乎","云云","五","人家","什么","什么样","从","从而","他","他人","他们","以","以便","以免","以及","以至","以至于","以致","们","任","任何","任凭","似的","但","但是","何","何况","何处","何时","作为","你","你们","使得","例如","依","依照","俺","俺们","倘","倘使","倘或","倘然","倘若","借","假使","假如","假若","像","八","六","兮","关于","其","其一","其中","其二","其他","其余","其它","其次","具体地说","具体说来","再者","再说","冒","冲","况且","几","几时","凭","凭借","则","别","别的","别说","到","前后","前者","加之","即","即令","即使","即便","即或","即若","又","及","及其","及至","反之","反过来","反过来说","另","另一方面","另外","只是","只有","只要","只限","叫","叮咚","可","可以","可是","可见","各","各个","各位","各种","各自","同","同时","向","向着","吓","吗","否则","吧","吧哒","吱","呀","呃","呕","呗","呜","呜呼","呢","呵","呸","呼哧","咋","和","咚","咦","咱","咱们","咳","哇","哈","哈哈","哉","哎","哎呀","哎哟","哗","哟","哦","哩","哪","哪个","哪些","哪儿","哪天","哪年","哪怕","哪样","哪边","哪里","哼","哼唷","唉","啊","啐","啥","啦","啪达","喂","喏","喔唷","嗡嗡","嗬","嗯","嗳","嘎","嘎登","嘘","嘛","嘻","嘿","四","因","因为","因此","因而","固然","在","在下","地","多","多少","她","她们","如","如上所述","如何","如其","如果","如此","如若","宁","宁可","宁愿","宁肯","它","它们","对","对于","将","尔后","尚且","就","就是","就是说","尽","尽管","岂但","己","并","并且","开外","开始","归","当","当着","彼","彼此","往","待","得","怎","怎么","怎么办","怎么样","怎样","总之","总的来看","总的来说","总的说来","总而言之","恰恰相反","您","慢说","我","我们","或","或是","或者","所","所以","打","把","抑或","拿","按","按照","换句话说","换言之","据","接着","故","故此","旁人","无宁","无论","既","既是","既然","时候","是","是的","替","有","有些","有关","有的","望","朝","朝着","本","本着","来","来着","极了","果然","果真","某","某个","某些","根据","正如","此","此外","此间","毋宁","每","每当","比","比如","比方","沿","沿着","漫说","焉","然则","然后","然而","照","照着","甚么","甚而","甚至","用","由","由于","由此可见","的","的话","相对而言","省得","着","着呢","矣","离","第","等","等等","管","紧接着","纵","纵令","纵使","纵然","经","经过","结果","给","继而","综上所述","罢了","者","而","而且","而况","而外","而已","而是","而言","能","腾","自","自个儿","自从","自各儿","自家","自己","自身","至","至于","若","若是","若非","莫若","虽","虽则","虽然","虽说","被","要","要不","要不是","要不然","要么","要是","让","论","设使","设若","该","诸位","谁","谁知","赶","起","起见","趁","趁着","越是","跟","较","较之","边","过","还是","还有","这","这个","这么","这么些","这么样","这么点儿","这些","这会儿","这儿","这就是说","这时","这样","这边","这里","进而","连","连同","通过","遵照","那","那个","那么","那么些","那么样","那些","那会儿","那儿","那时","那样","那边","那里","鄙人","鉴于","阿","除","除了","除此之外","除非","随","随着","零","非但","非徒","靠","顺","顺着","首先","︿","!","#","$","%","&","(",")","*","+",",","0","1","2","3","4","5","6","7","8","9",":",";","<",">","?","@","[","]","{","|","}","~","¥"] 37 | stops = set(stops) 38 | else: 39 | stops = set() 40 | 41 | train_file = model_dir+"atec_nlp_sim_train.csv" 42 | df1 = pd.read_csv(train_file,sep="\t", header=None, names =["id","sent1","sent2","label"], encoding="utf8") 43 | # if len(df1) >= 102477: df1 = df1[:1000] 44 | 45 | # 文本清理,预处理(分词) 46 | clean_path = model_dir+"atec_clean.csv" 47 | def pre_process(df, train_mode=True): 48 | x = lambda s: list(jieba.cut(star.sub("X",s))) 49 | df["words1"] = df["sent1"].apply(x) 50 | df["words2"] = df["sent2"].apply(x) 51 | if train_mode: df.to_csv(clean_path, sep="\t", index=False, encoding="utf8") 52 | return df 53 | 54 | # 特征提取 55 | feature_path = model_dir+"atec_feature.pkl" 56 | feature_cfg = ["Not", "Length", "WordMatchShare", "TFIDFWordMatchShare", 57 | # "PowerfulWordDoubleSide", "PowerfulWordDoubleSideRate", "PowerfulWordOneSide", "PowerfulWordOneSideRate", 58 | "TFIDF", "NgramJaccardCoef", "NgramDiceDistance", "NgramDistance", "WordEmbeddingAveDis", "WordEmbeddingTFIDFAveDis"] 59 | def 
feature_extract(df, train_mode=True): 60 | 61 | if "Not" in feature_cfg: 62 | def extract_row(row): 63 | not_cnt1 = row["words1"].count('不') 64 | not_cnt2 = row["words2"].count('不') 65 | 66 | fs = [] 67 | fs.append(not_cnt1) 68 | fs.append(not_cnt2) 69 | if not_cnt1 > 0 and not_cnt2 > 0: 70 | fs.append(1.) 71 | else: 72 | fs.append(0.) 73 | if (not_cnt1 > 0) or (not_cnt2 > 0): 74 | fs.append(1.) 75 | else: 76 | fs.append(0.) 77 | if not_cnt2 <= 0 < not_cnt1 or not_cnt1 <= 0 < not_cnt2: 78 | fs.append(1.) 79 | else: 80 | fs.append(0.) 81 | 82 | return fs 83 | 84 | df["Not"] = df.apply(extract_row, axis=1) 85 | print("done Not") 86 | 87 | if "Length" in feature_cfg: 88 | def extract_row(row): 89 | len_q1, len_q2 = len(row["sent1"]), len(row["sent2"]) 90 | return [len_q1, 91 | len_q2, 92 | len(row["words1"]), 93 | len(row["words2"]), 94 | abs(len_q1 - len_q2), 95 | 1.0 * min(len_q1, len_q2) / max(len_q1, len_q2)] 96 | 97 | df["Length"] = df.apply(extract_row, axis=1) 98 | print("done Length") 99 | 100 | if "WordMatchShare" in feature_cfg: 101 | def extract_row(row): 102 | q1words = {} 103 | q2words = {} 104 | for word in row["words1"]: 105 | if word not in stops: 106 | q1words[word] = q1words.get(word, 0) + 1 107 | for word in row["words2"]: 108 | if word not in stops: 109 | q2words[word] = q2words.get(word, 0) + 1 110 | n_shared_word_in_q1 = sum([q1words[w] for w in q1words if w in q2words]) 111 | n_shared_word_in_q2 = sum([q2words[w] for w in q2words if w in q1words]) 112 | n_tol = sum(q1words.values()) + sum(q2words.values()) 113 | if 1e-6 > n_tol: 114 | return [0.] 115 | else: 116 | return [1.0 * (n_shared_word_in_q1 + n_shared_word_in_q2) / n_tol] 117 | 118 | df["WordMatchShare"] = df.apply(extract_row, axis=1) 119 | print("done WordMatchShare") 120 | 121 | if "TFIDFWordMatchShare" in feature_cfg: 122 | idf_path = model_dir + "idf_weights.pkl" 123 | def init_idf(): # init idf weights 124 | idf = {} 125 | q_set = set() 126 | for index, row in df.iterrows(): 127 | q1 = str(row['sent1']) 128 | q2 = str(row['sent2']) 129 | if q1 not in q_set: 130 | q_set.add(q1) 131 | for word in row["words1"]: 132 | idf[word] = idf.get(word, 0) + 1 133 | if q2 not in q_set: 134 | q_set.add(q2) 135 | for word in row["words2"]: 136 | idf[word] = idf.get(word, 0) + 1 137 | num_docs = len(df) 138 | for word in idf: 139 | idf[word] = math.log(num_docs / (idf[word] + 1.)) / math.log(2.) 140 | print("idf calculation done, len(idf)=%d" % len(idf)) 141 | pd.to_pickle(idf, idf_path) 142 | return idf 143 | 144 | if train_mode: idf = init_idf() 145 | else: idf = pd.read_pickle(idf_path) 146 | 147 | def extract_row(row): 148 | q1words = {} 149 | q2words = {} 150 | for word in row["words1"]: 151 | q1words[word] = q1words.get(word, 0) + 1 152 | for word in row["words2"]: 153 | q2words[word] = q2words.get(word, 0) + 1 154 | sum_shared_word_in_q1 = sum([q1words[w] * idf.get(w, 0) for w in q1words if w in q2words]) 155 | sum_shared_word_in_q2 = sum([q2words[w] * idf.get(w, 0) for w in q2words if w in q1words]) 156 | sum_tol = sum(q1words[w] * idf.get(w, 0) for w in q1words) + sum( 157 | q2words[w] * idf.get(w, 0) for w in q2words) 158 | if 1e-6 > sum_tol: 159 | return [0.] 160 | else: 161 | return [1.0 * (sum_shared_word_in_q1 + sum_shared_word_in_q2) / sum_tol] 162 | 163 | df["TFIDFWordMatchShare"] = df.apply(extract_row, axis=1) 164 | print("done TFIDFWordMatchShare") 165 | 166 | powerful_words_path = model_dir + "powerful_words.pkl" 167 | def generate_powerful_word(): 168 | """ 169 | 计算数据中词语的影响力,格式如下: 170 | 词语 --> [0. 
出现语句对数量,1. 出现语句对比例,2. 正确语句对比例,3. 单侧语句对比例,4. 单侧语句对正确比例,5. 双侧语句对比例,6. 双侧语句对正确比例] 171 | """ 172 | words_power = {} 173 | for index, row in df.iterrows(): 174 | label = int(row['label']) 175 | q1_words = row['words1'] 176 | q2_words = row['words2'] 177 | all_words = set(q1_words + q2_words) 178 | q1_words = set(q1_words) 179 | q2_words = set(q2_words) 180 | for word in all_words: 181 | if word not in words_power: 182 | words_power[word] = [0. for i in range(7)] 183 | # 计算出现语句对数量 184 | words_power[word][0] += 1. 185 | words_power[word][1] += 1. 186 | 187 | if ((word in q1_words) and (word not in q2_words)) or ((word not in q1_words) and (word in q2_words)): 188 | # 计算单侧语句数量 189 | words_power[word][3] += 1. 190 | if 0 == label: 191 | # 计算正确语句对数量 192 | words_power[word][2] += 1. 193 | # 计算单侧语句正确比例 194 | words_power[word][4] += 1. 195 | if (word in q1_words) and (word in q2_words): 196 | # 计算双侧语句数量 197 | words_power[word][5] += 1. 198 | if 1 == label: 199 | # 计算正确语句对数量 200 | words_power[word][2] += 1. 201 | # 计算双侧语句正确比例 202 | words_power[word][6] += 1. 203 | for word in words_power: 204 | # 计算出现语句对比例 205 | words_power[word][1] /= len(df) 206 | # 计算正确语句对比例 207 | words_power[word][2] /= words_power[word][0] 208 | # 计算单侧语句对正确比例 209 | if words_power[word][3] > 1e-6: 210 | words_power[word][4] /= words_power[word][3] 211 | # 计算单侧语句对比例 212 | words_power[word][3] /= words_power[word][0] 213 | # 计算双侧语句对正确比例 214 | if words_power[word][5] > 1e-6: 215 | words_power[word][6] /= words_power[word][5] 216 | # 计算双侧语句对比例 217 | words_power[word][5] /= words_power[word][0] 218 | sorted_words_power = sorted(words_power.items(), key=lambda d: d[1][0], reverse=True) 219 | print("power words calculation done, len(words_power)=%d" % len(sorted_words_power)) 220 | pd.to_pickle(sorted_words_power, powerful_words_path) 221 | return sorted_words_power 222 | 223 | if train_mode: pword = generate_powerful_word() 224 | else: pword = pd.load_pickle(powerful_words_path) 225 | 226 | # thresh_num, thresh_rate = 500, 0.9 227 | thresh_num, thresh_rate = 7, 0.3 228 | 229 | pword_filtered = filter(lambda x: x[1][0] * x[1][5] >= thresh_num, pword) 230 | pword_sort = sorted(pword_filtered, key=lambda d: d[1][6], reverse=True) 231 | pword_dside = set(map(lambda x: x[0], filter(lambda x: x[1][6] >= thresh_rate, pword_sort))) 232 | print('Double side power words(%d): %s' % (len(pword_dside), str(pword_dside))) 233 | 234 | def extract_row(row): 235 | tags = [] 236 | q1_words = row["words1"] 237 | q2_words = row["words2"] 238 | for word in pword_dside: 239 | if (word in q1_words) and (word in q2_words): 240 | tags.append(1.0) 241 | else: 242 | tags.append(0.0) 243 | return tags 244 | 245 | if "PowerfulWordDoubleSide" in feature_cfg: 246 | df["PowerfulWordDoubleSide"] = df.apply(extract_row, axis=1) 247 | print("done PowerfulWordDoubleSide") 248 | 249 | pword_dict = dict(pword) 250 | def extract_row(row): 251 | num_least = 300 252 | rate = [1.0] 253 | q1_words = set(row["words1"]) 254 | q2_words = set(row["words2"]) 255 | share_words = list(q1_words.intersection(q2_words)) 256 | for word in share_words: 257 | if word not in pword_dict: 258 | continue 259 | if pword_dict[word][0] * pword_dict[word][5] < num_least: 260 | continue 261 | rate[0] *= (1.0 - pword_dict[word][6]) 262 | rate = [1 - num for num in rate] 263 | return rate 264 | 265 | if "PowerfulWordDoubleSideRate" in feature_cfg: 266 | df["PowerfulWordDoubleSideRate"] = df.apply(extract_row, axis=1) 267 | print("done PowerfulWordDoubleSideRate") 268 | 269 | 270 | thresh_num, thresh_rate 
= 20, 0.8 271 | 272 | pword_filtered = filter(lambda x: x[1][0] * x[1][3] >= thresh_num, pword) 273 | pword_oside = set(map(lambda x: x[0], filter(lambda x: x[1][4] >= thresh_rate, pword_filtered))) 274 | print('One side power words(%d): %s' % (len(pword_oside), str(pword_oside))) 275 | def extract_row(row): 276 | tags = [] 277 | q1_words = set(row["words1"]) 278 | q2_words = set(row["words2"]) 279 | for word in pword_oside: 280 | if (word in q1_words) and (word not in q2_words): 281 | tags.append(1.0) 282 | elif (word not in q1_words) and (word in q2_words): 283 | tags.append(1.0) 284 | else: 285 | tags.append(0.0) 286 | return tags 287 | 288 | if "PowerfulWordOneSide" in feature_cfg: 289 | df["PowerfulWordOneSide"] = df.apply(extract_row, axis=1) 290 | print("done PowerfulWordOneSide") 291 | 292 | def extract_row(row): 293 | num_least = 300 294 | rate = [1.0] 295 | q1_words = set(row["words1"]) 296 | q2_words = set(row["words2"]) 297 | q1_diff = list(q1_words.difference(q2_words)) 298 | q2_diff = list(q2_words.difference(q1_words)) 299 | all_diff = set(q1_diff + q2_diff) 300 | for word in all_diff: 301 | if word not in pword_dict: 302 | continue 303 | if pword_dict[word][0] * pword_dict[word][3] < num_least: 304 | continue 305 | rate[0] *= (1.0 - pword_dict[word][4]) 306 | rate = [1 - num for num in rate] 307 | return rate 308 | 309 | if "PowerfulWordOneSideRate" in feature_cfg: 310 | df["PowerfulWordOneSideRate"] = df.apply(extract_row, axis=1) 311 | print("done PowerfulWordOneSideRate") 312 | 313 | if "TFIDF" in feature_cfg: 314 | tfidf_path = model_dir + "tfidf_transformer.pkl" 315 | def init_tfidf(): 316 | tfidf = TfidfVectorizer(stop_words=list(stops), ngram_range=(1, 1), token_pattern=r"\w+") 317 | tfidf_txt = pd.Series(df['words1'].apply(lambda x: " ".join(x)).tolist() + 318 | df['words2'].apply(lambda x: " ".join(x)).tolist()) 319 | tfidf.fit_transform(tfidf_txt) 320 | print("init tfidf done ") 321 | # print(tfidf.vocabulary_) 322 | pd.to_pickle(tfidf, tfidf_path) 323 | return tfidf 324 | 325 | if train_mode: tfidf = init_tfidf() 326 | else: tfidf = pd.read_pickle(tfidf_path) 327 | 328 | def extract_row(row): 329 | q1 = " ".join(row['words1']) 330 | q2 = " ".join(row['words2']) 331 | a1 = tfidf.transform([q1]).data 332 | a2 = tfidf.transform([q2]).data 333 | fs = [np.sum(a1),np.sum(a2),np.mean(a1),np.mean(a2),len(a1),len(a2)] 334 | return fs 335 | 336 | df["TFIDF"] = df.apply(extract_row, axis=1) 337 | print("done TFIDF") 338 | 339 | if "NgramJaccardCoef" in feature_cfg: 340 | def extract_row(row): 341 | q1_words = row['words1'] 342 | q2_words = row['words2'] 343 | fs = list() 344 | for n in range(1, 4): 345 | q1_ngrams = NgramUtil.ngrams(q1_words, n) 346 | q2_ngrams = NgramUtil.ngrams(q2_words, n) 347 | A = set(q1_ngrams) 348 | B = set(q2_ngrams) 349 | x = len(A.intersection(B)) 350 | y = len(A.union(B)) 351 | val = 0.0 if y==0 else x/y 352 | fs.append(val) 353 | return fs 354 | 355 | df["NgramJaccardCoef"] = df.apply(extract_row, axis=1) 356 | print("done NgramJaccardCoef") 357 | 358 | if "NgramDiceDistance" in feature_cfg: 359 | def extract_row(row): 360 | q1_words = row['words1'] 361 | q2_words = row['words2'] 362 | fs = list() 363 | for n in range(1, 4): 364 | q1_ngrams = NgramUtil.ngrams(q1_words, n) 365 | q2_ngrams = NgramUtil.ngrams(q2_words, n) 366 | A = set(q1_ngrams) 367 | B = set(q2_ngrams) 368 | x = 2. 
* len(A.intersection(B)) 369 | y = len(A) + len(B) 370 | val = 0.0 if y==0 else x/y 371 | fs.append(val) 372 | return fs 373 | 374 | df["NgramDiceDistance"] = df.apply(extract_row, axis=1) 375 | print("done NgramDiceDistance") 376 | 377 | if "NgramDistance" in feature_cfg: 378 | def extract_row(row): 379 | q1_words = row['words1'] 380 | q2_words = row['words2'] 381 | fs = list() 382 | aggregation_modes_outer = [np.mean,np.max,np.min,np.median] 383 | aggregation_modes_inner = [np.mean,np.std,np.max,np.min,np.median] 384 | for n_ngram in range(1, 4): 385 | q1_ngrams = NgramUtil.ngrams(q1_words, n_ngram) 386 | q2_ngrams = NgramUtil.ngrams(q2_words, n_ngram) 387 | val_list = list() 388 | for w1 in q1_ngrams: 389 | _val_list = list() 390 | for w2 in q2_ngrams: 391 | s = 1. - SequenceMatcher(None, w1, w2, False).quick_ratio() # ratio() 392 | _val_list.append(s) 393 | if len(_val_list) == 0: 394 | _val_list = [MISSING_VALUE_NUMERIC] 395 | val_list.append(_val_list) 396 | if len(val_list) == 0: 397 | val_list = [[MISSING_VALUE_NUMERIC]] 398 | data = np.array(val_list) 399 | fs.extend([mode_outer(mode_inner(data,axis=1)) for mode_inner in aggregation_modes_inner for mode_outer in aggregation_modes_outer]) 400 | return fs 401 | 402 | df["NgramDistance"] = df.apply(extract_row, axis=1) 403 | print("done NgramDistance") 404 | 405 | we_len = 300 if online else 256 406 | word_embedding_model = gensim.models.Word2Vec.load(model_dir + "word2vec_gensim%s"%we_len) 407 | word2index = {v:k for k,v in enumerate(word_embedding_model.wv.index2word)} 408 | if "WordEmbeddingAveDis" in feature_cfg: 409 | def extract_row(row): 410 | q1_words = row['words1'] 411 | q2_words = row['words2'] 412 | 413 | q1_vec = np.array(we_len * [0.]) 414 | q2_vec = np.array(we_len * [0.]) 415 | 416 | for word in q1_words: 417 | if word in word2index: 418 | q1_vec += word_embedding_model[word] 419 | for word in q2_words: 420 | if word in word2index: 421 | q2_vec += word_embedding_model[word] 422 | 423 | cos_sim = 0. 424 | q1_vec = np.mat(q1_vec) 425 | q2_vec = np.mat(q2_vec) 426 | factor = np.linalg.norm(q1_vec) * np.linalg.norm(q2_vec) 427 | if 1e-6 < factor: 428 | cos_sim = float(q1_vec * q2_vec.T) / factor 429 | 430 | return [cos_sim] 431 | 432 | df["WordEmbeddingAveDis"] = df.apply(extract_row, axis=1) 433 | 434 | if "WordEmbeddingTFIDFAveDis" in feature_cfg: 435 | idf = pd.read_pickle(idf_path) 436 | def extract_row(row): 437 | q1_words = row['words1'] 438 | q2_words = row['words2'] 439 | 440 | q1_vec = np.array(we_len * [0.]) 441 | q2_vec = np.array(we_len * [0.]) 442 | q1_words_cnt = {} 443 | q2_words_cnt = {} 444 | for word in q1_words: 445 | q1_words_cnt[word] = q1_words_cnt.get(word, 0.) + 1. 446 | for word in q2_words: 447 | q2_words_cnt[word] = q2_words_cnt.get(word, 0.) + 1. 448 | 449 | for word in q1_words_cnt: 450 | if word in word2index: 451 | q1_vec += idf.get(word, 0.) * q1_words_cnt[word] * word_embedding_model[word] 452 | for word in q2_words_cnt: 453 | if word in word2index: 454 | q2_vec += idf.get(word, 0.) * q2_words_cnt[word] * word_embedding_model[word] 455 | 456 | cos_sim = 0. 
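            # Cosine similarity between the two idf-weighted bag-of-embedding vectors;
            # `factor` guards against zero-norm vectors (e.g. all words out of vocabulary),
            # in which case cos_sim is left at 0.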
457 | q1_vec = np.mat(q1_vec) 458 | q2_vec = np.mat(q2_vec) 459 | factor = np.linalg.norm(q1_vec) * np.linalg.norm(q2_vec) 460 | if 1e-6 < factor: 461 | cos_sim = float(q1_vec * q2_vec.T) / factor 462 | 463 | return [cos_sim] 464 | 465 | df["WordEmbeddingTFIDFAveDis"] = df.apply(extract_row, axis=1) 466 | 467 | 468 | def merge_feature(row): 469 | fs = [] 470 | for feature in feature_cfg: 471 | fs += row[feature] 472 | return fs 473 | 474 | df["feature"] = df.apply(merge_feature, axis=1) 475 | x, y = np.array(df["feature"].tolist()), np.array(df["label"].astype(int)) 476 | if train_mode: pd.to_pickle((x,y),feature_path) 477 | return (x,y) 478 | 479 | 480 | from difflib import SequenceMatcher 481 | MISSING_VALUE_NUMERIC = -1 482 | 483 | class NgramUtil(object): 484 | 485 | def __init__(self): 486 | pass 487 | 488 | @staticmethod 489 | def unigrams(words): 490 | """ 491 | Input: a list of words, e.g., ["I", "am", "Denny"] 492 | Output: a list of unigram 493 | """ 494 | assert type(words) == list 495 | return words 496 | 497 | @staticmethod 498 | def bigrams(words, join_string, skip=0): 499 | """ 500 | Input: a list of words, e.g., ["I", "am", "Denny"] 501 | Output: a list of bigram, e.g., ["I_am", "am_Denny"] 502 | """ 503 | assert type(words) == list 504 | L = len(words) 505 | if L > 1: 506 | lst = [] 507 | for i in range(L - 1): 508 | for k in range(1, skip + 2): 509 | if i + k < L: 510 | lst.append(join_string.join([words[i], words[i + k]])) 511 | else: 512 | # set it as unigram 513 | lst = NgramUtil.unigrams(words) 514 | return lst 515 | 516 | @staticmethod 517 | def trigrams(words, join_string, skip=0): 518 | """ 519 | Input: a list of words, e.g., ["I", "am", "Denny"] 520 | Output: a list of trigram, e.g., ["I_am_Denny"] 521 | """ 522 | assert type(words) == list 523 | L = len(words) 524 | if L > 2: 525 | lst = [] 526 | for i in range(L - 2): 527 | for k1 in range(1, skip + 2): 528 | for k2 in range(1, skip + 2): 529 | if i + k1 < L and i + k1 + k2 < L: 530 | lst.append(join_string.join([words[i], words[i + k1], words[i + k1 + k2]])) 531 | else: 532 | # set it as bigram 533 | lst = NgramUtil.bigrams(words, join_string, skip) 534 | return lst 535 | 536 | @staticmethod 537 | def fourgrams(words, join_string): 538 | """ 539 | Input: a list of words, e.g., ["I", "am", "Denny", "boy"] 540 | Output: a list of trigram, e.g., ["I_am_Denny_boy"] 541 | """ 542 | assert type(words) == list 543 | L = len(words) 544 | if L > 3: 545 | lst = [] 546 | for i in xrange(L - 3): 547 | lst.append(join_string.join([words[i], words[i + 1], words[i + 2], words[i + 3]])) 548 | else: 549 | # set it as trigram 550 | lst = NgramUtil.trigrams(words, join_string) 551 | return lst 552 | 553 | @staticmethod 554 | def ngrams(words, ngram, join_string=" "): 555 | """ 556 | wrapper for ngram 557 | """ 558 | if ngram == 1: 559 | return NgramUtil.unigrams(words) 560 | elif ngram == 2: 561 | return NgramUtil.bigrams(words, join_string) 562 | elif ngram == 3: 563 | return NgramUtil.trigrams(words, join_string) 564 | elif ngram == 4: 565 | return NgramUtil.fourgrams(words, join_string) 566 | elif ngram == 12: 567 | unigram = NgramUtil.unigrams(words) 568 | bigram = [x for x in NgramUtil.bigrams(words, join_string) if len(x.split(join_string)) == 2] 569 | return unigram + bigram 570 | elif ngram == 123: 571 | unigram = NgramUtil.unigrams(words) 572 | bigram = [x for x in NgramUtil.bigrams(words, join_string) if len(x.split(join_string)) == 2] 573 | trigram = [x for x in NgramUtil.trigrams(words, join_string) if 
len(x.split(join_string)) == 3] 574 | return unigram + bigram + trigram 575 | 576 | 577 | def r_f1_thresh(y_pred,y_true,step=1000): 578 | e = np.zeros((len(y_true),2)) 579 | e[:,0] = y_pred.reshape(-1) 580 | e[:,1] = y_true 581 | f = pd.DataFrame(e) 582 | thrs = np.linspace(0,1,step+1) 583 | x = np.array([f1_score(y_pred=f.loc[:,0]>thr, y_true=f.loc[:,1]) for thr in thrs]) 584 | f1_, thresh = max(x),thrs[x.argmax()] 585 | return f.corr()[0][1], f1_, thresh 586 | 587 | random_state = 42 588 | def train_old_classifier(data=None,train_mode=True): 589 | if data is None: 590 | x,y = pd.read_pickle(feature_path) 591 | else:x,y = data 592 | 593 | trn_x, val_x, trn_y, val_y = train_test_split(x,y, test_size=test_size, random_state=random_state) 594 | 595 | 596 | classifier = ["lrcv","lgbm"][1] 597 | if classifier == "lgbm": 598 | print("lightgbm") 599 | params = { 600 | 'task': 'train', 601 | 'boosting_type': 'gbdt', 602 | 'objective': 'binary', 603 | 'metric': {'l2', 'auc'}, 604 | 'num_leaves': 31, 605 | 'learning_rate': 0.05, 606 | 'feature_fraction': 0.9, 607 | 'bagging_fraction': 0.8, 608 | 'bagging_freq': 5 609 | } 610 | 611 | import lightgbm as lgb 612 | lgb_train = lgb.Dataset(trn_x, trn_y) 613 | lgb_eval = lgb.Dataset(val_x, val_y, reference=lgb_train) 614 | model_gbm = lgb.train(params, lgb_train, num_boost_round=200, 615 | valid_sets=lgb_eval, early_stopping_rounds=10) 616 | 617 | val_y_pred = model_gbm.predict(val_x, num_iteration=model_gbm.best_iteration) 618 | print(r_f1_thresh(val_y_pred, val_y)) 619 | 620 | classifier = ["lrcv","lgbm"][0] 621 | if classifier == "lrcv": 622 | print("LogisticRegression") 623 | clf = LogisticRegression() 624 | clf.fit(trn_x, trn_y) 625 | val_y_pred = clf.predict_proba(val_x)[:,1] 626 | print(r_f1_thresh(val_y_pred, val_y)) 627 | 628 | df = None 629 | df = pre_process(df1) 630 | data = feature_extract(df) 631 | train_old_classifier(data) -------------------------------------------------------------------------------- /pai_train.py: -------------------------------------------------------------------------------- 1 | #/usr/bin/env python 2 | #coding=utf-8 3 | 4 | indexes = [] 5 | 6 | import time 7 | start_time = time.time() 8 | import multiprocessing 9 | import os 10 | import re 11 | import json 12 | import gensim 13 | import jieba 14 | import keras 15 | import keras.backend as K 16 | import numpy as np 17 | import pandas as pd 18 | from itertools import combinations 19 | from keras.activations import softmax 20 | from keras.callbacks import EarlyStopping, ModelCheckpoint,LambdaCallback, Callback, ReduceLROnPlateau, LearningRateScheduler 21 | from keras.layers import * 22 | from keras.models import Model 23 | from keras.optimizers import SGD, Adadelta, Adam, Nadam, RMSprop 24 | from keras.regularizers import L1L2, l2 25 | from keras.preprocessing.sequence import pad_sequences 26 | from keras.engine.topology import Layer 27 | from keras import initializers, regularizers, constraints 28 | 29 | from sklearn.linear_model import LogisticRegression, LogisticRegressionCV 30 | from sklearn.metrics import f1_score 31 | from sklearn.model_selection import train_test_split, KFold 32 | from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier 33 | 34 | from gensim.models.word2vec import LineSentence 35 | from gensim.models.fasttext import FastText 36 | import copy 37 | 38 | os.environ["TF_CPP_MIN_LOG_LEVEL"]='3' 39 | 40 | ##################################################################### 41 | # 数据加载预处理阶段 42 | 
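# NOTE: `model_dir` used throughout this file (train_file, cached pickles, embedding
# paths) is assumed to be injected by the PAI runtime; pai_old.py falls back to
# model_dir = "pai_model/" when it is not already defined.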
##################################################################### 43 | new_words = "支付宝 付款码 二维码 收钱码 转账 退款 退钱 余额宝 运费险 还钱 还款 花呗 借呗 蚂蚁花呗 蚂蚁借呗 蚂蚁森林 小黄车 飞猪 微客 宝卡 芝麻信用 亲密付 淘票票 饿了么 摩拜 滴滴 滴滴出行".split(" ") 44 | for word in new_words: 45 | jieba.add_word(word) 46 | 47 | star = re.compile("\*+") 48 | 49 | test_size = 0.025 50 | random_state = 42 51 | fast_mode, fast_rate = False,0.01 # 快速调试,其评分不作为参考 52 | train_file = model_dir+"atec_nlp_sim_train.csv" 53 | def load_data(dtype = "both", input_length=[20,24], w2v_length=300): 54 | 55 | def __load_data(dtype = "word", input_length=20, w2v_length=300): 56 | 57 | filename = model_dir+"%s_%d_%d"%(dtype, input_length, w2v_length) 58 | if os.path.exists(filename): 59 | return pd.read_pickle(filename) 60 | 61 | data_l_n = [] 62 | data_r_n = [] 63 | y = [] 64 | for line in open(train_file,"r", encoding="utf8"): 65 | lineno, s1, s2, label=line.strip().split("\t") 66 | if dtype == "word": 67 | data_l_n.append([word2index[word] for word in list(jieba.cut(star.sub("1",s1))) if word in word2index]) 68 | data_r_n.append([word2index[word] for word in list(jieba.cut(star.sub("1",s2))) if word in word2index]) 69 | if dtype == "char": 70 | data_l_n.append([char2index[char] for char in s1 if char in char2index]) 71 | data_r_n.append([char2index[char] for char in s2 if char in char2index]) 72 | 73 | y.append(int(label)) 74 | 75 | # 对齐语料中句子的长度 76 | data_l_n = pad_sequences(data_l_n, maxlen=input_length) 77 | data_r_n = pad_sequences(data_r_n, maxlen=input_length) 78 | y = np.array(y) 79 | 80 | pd.to_pickle((data_l_n, data_r_n, y), filename) 81 | 82 | return (data_l_n, data_r_n, y) 83 | 84 | if dtype == "both": 85 | ret_array = [] 86 | for dtype,input_length in zip(['word', 'char'],input_length): 87 | data_l_n,data_r_n,y = __load_data(dtype, input_length, w2v_length) 88 | ret_array.append(np.asarray(data_l_n)) 89 | ret_array.append(np.asarray(data_r_n)) 90 | ret_array.append(y) 91 | return ret_array 92 | else: 93 | return __load_data(dtype, input_length, w2v_length) 94 | 95 | def input_data(sent1, sent2, dtype = "both", input_length=[20,24]): 96 | def __input_data(sent1, sent2, dtype = "word", input_length=20): 97 | data_l_n = [] 98 | data_r_n = [] 99 | for s1, s2 in zip(sent1, sent2): 100 | if dtype == "word": 101 | data_l_n.append([word2index[word] for word in list(jieba.cut(star.sub("1",s1))) if word in word2index]) 102 | data_r_n.append([word2index[word] for word in list(jieba.cut(star.sub("1",s2))) if word in word2index]) 103 | if dtype == "char": 104 | data_l_n.append([char2index[char] for char in s1 if char in char2index]) 105 | data_r_n.append([char2index[char] for char in s2 if char in char2index]) 106 | 107 | # 对齐语料中句子的长度 108 | data_l_n = pad_sequences(data_l_n, maxlen=input_length) 109 | data_r_n = pad_sequences(data_r_n, maxlen=input_length) 110 | 111 | return [data_l_n, data_r_n] 112 | 113 | if dtype == "both": 114 | ret_array = [] 115 | for dtype,input_length in zip(['word', 'char'],input_length): 116 | data_l_n,data_r_n = __input_data(sent1, sent2, dtype, input_length) 117 | ret_array.append(data_l_n) 118 | ret_array.append(data_r_n) 119 | return ret_array 120 | else: 121 | return __input_data(sent1, sent2, dtype, input_length) 122 | 123 | 124 | ########################################################################### 125 | # 训练验证集划分 126 | ########################################################################### 127 | def split_data(data,mode="train", test_size=test_size, random_state=random_state): 128 | # mode == "train": 划分成用于训练的四元组 129 | # mode == 
"orig": 划分成两组数据 130 | train = [] 131 | test = [] 132 | for data_i in data: 133 | if fast_mode: 134 | data_i, _ = train_test_split(data_i,test_size=1-fast_rate,random_state=random_state ) 135 | train_data, test_data = train_test_split(data_i,test_size=test_size,random_state=random_state ) 136 | train.append(np.asarray(train_data)) 137 | test.append(np.asarray(test_data)) 138 | 139 | if mode == "orig": 140 | return train, test 141 | 142 | train_x, train_y, test_x, test_y = train[:-1], train[-1], test[:-1], test[-1] 143 | return train_x, train_y, test_x, test_y 144 | 145 | 146 | ##################################################################### 147 | # 模型定义 148 | ##################################################################### 149 | 150 | w2v_length = 300 151 | ebed_type = "gensim" 152 | # ebed_type = "fastcbow" 153 | 154 | if ebed_type == "gensim": 155 | char_embedding_model = gensim.models.Word2Vec.load(model_dir + "char2vec_gensim%s"%w2v_length) 156 | char2index = {v:k for k,v in enumerate(char_embedding_model.wv.index2word)} 157 | word_embedding_model = gensim.models.Word2Vec.load(model_dir + "word2vec_gensim%s"%w2v_length) 158 | word2index = {v:k for k,v in enumerate(word_embedding_model.wv.index2word)} 159 | 160 | elif ebed_type == "fastskip" or ebed_type == "fastcbow": 161 | char_fastcbow = FastText.load(model_dir + "char2vec_%s%d"%(ebed_type, w2v_length)) 162 | char_embedding_matrix = char_fastcbow.wv.vectors 163 | char2index = {v:k for k,v in enumerate(char_fastcbow.wv.index2word)} 164 | word_fastcbow = FastText.load(model_dir + "word2vec_%s%d"%(ebed_type, w2v_length)) 165 | word_embedding_matrix = word_fastcbow.wv.vectors 166 | word2index = {v:k for k,v in enumerate(word_fastcbow.wv.index2word)} 167 | 168 | print("loaded w2v done!", len(char2index), len(word2index)) 169 | 170 | MAX_LEN = 30 171 | MAX_EPOCH = 90 172 | train_batch_size = 64 173 | test_batch_size = 500 174 | earlystop_patience, plateau_patience = 8,2 # patience 175 | cfgs = [ 176 | ("siamese", "char", 24, ebed_type, w2v_length, [100, 80, 64, 64], 102-5, earlystop_patience), # 69s 177 | ("siamese", "word", 20, ebed_type, w2v_length, [100, 80, 64, 64], 120-4, earlystop_patience), # 59s 178 | ("esim", "char", 24, ebed_type, w2v_length, [], 18, earlystop_patience), # 389s 179 | ("esim", "word", 20, ebed_type, w2v_length, [], 21, earlystop_patience), # 335s 180 | ("decom", "char", 24, ebed_type, w2v_length, [], 87-2, earlystop_patience), # 84s 181 | ("decom", "word", 20, ebed_type, w2v_length, [], 104-4, earlystop_patience), # 71s 182 | ("dssm", "both", [20,24], ebed_type, w2v_length, [], 124-8, earlystop_patience), # 55s 183 | ] 184 | 185 | 186 | def get_embedding_layers(dtype, input_length, w2v_length, with_weight=True): 187 | def __get_embedding_layers(dtype, input_length, w2v_length, with_weight=True): 188 | 189 | if dtype == 'word': 190 | embedding_length = len(word2index) 191 | elif dtype == 'char': 192 | embedding_length = len(char2index) 193 | 194 | if with_weight: 195 | if ebed_type == "gensim": 196 | if dtype == 'word': 197 | embedding = word_embedding_model.wv.get_keras_embedding(train_embeddings=True) 198 | else: 199 | embedding = char_embedding_model.wv.get_keras_embedding(train_embeddings=True) 200 | 201 | elif ebed_type == "fastskip" or ebed_type == "fastcbow": 202 | if dtype == 'word': 203 | embedding = Embedding(embedding_length, w2v_length, input_length=input_length, weights=[word_embedding_matrix], trainable=True) 204 | else: 205 | embedding = Embedding(embedding_length, w2v_length, 
input_length=input_length, weights=[char_embedding_matrix], trainable=True) 206 | else: 207 | embedding = Embedding(embedding_length, w2v_length, input_length=input_length, trainable=True) 208 | 209 | return embedding 210 | 211 | if dtype == "both": 212 | embedding = [] 213 | for dtype,input_length in zip(['word', 'char'],input_length): 214 | embedding.append(__get_embedding_layers(dtype, input_length, w2v_length, with_weight)) 215 | return embedding 216 | else: 217 | return __get_embedding_layers(dtype, input_length, w2v_length, with_weight) 218 | 219 | def create_pretrained_embedding(pretrained_weights_path, trainable=False, **kwargs): 220 | "Create embedding layer from a pretrained weights array" 221 | pretrained_weights = np.load(pretrained_weights_path) 222 | in_dim, out_dim = pretrained_weights.shape 223 | embedding = Embedding(in_dim, out_dim, weights=[pretrained_weights], trainable=False, **kwargs) 224 | return embedding 225 | 226 | 227 | def unchanged_shape(input_shape): 228 | "Function for Lambda layer" 229 | return input_shape 230 | 231 | 232 | def substract(input_1, input_2): 233 | "Substract element-wise" 234 | neg_input_2 = Lambda(lambda x: -x, output_shape=unchanged_shape)(input_2) 235 | out_ = Add()([input_1, neg_input_2]) 236 | return out_ 237 | 238 | 239 | def submult(input_1, input_2): 240 | "Get multiplication and subtraction then concatenate results" 241 | mult = Multiply()([input_1, input_2]) 242 | sub = substract(input_1, input_2) 243 | out_= Concatenate()([sub, mult]) 244 | return out_ 245 | 246 | 247 | def apply_multiple(input_, layers): 248 | "Apply layers to input then concatenate result" 249 | if not len(layers) > 1: 250 | raise ValueError('Layers list should contain more than 1 layer') 251 | else: 252 | agg_ = [] 253 | for layer in layers: 254 | agg_.append(layer(input_)) 255 | out_ = Concatenate()(agg_) 256 | return out_ 257 | 258 | 259 | def time_distributed(input_, layers): 260 | "Apply a list of layers in TimeDistributed mode" 261 | out_ = [] 262 | node_ = input_ 263 | for layer_ in layers: 264 | node_ = TimeDistributed(layer_)(node_) 265 | out_ = node_ 266 | return out_ 267 | 268 | 269 | def soft_attention_alignment(input_1, input_2): 270 | "Align text representation with neural soft attention" 271 | attention = Dot(axes=-1)([input_1, input_2]) 272 | w_att_1 = Lambda(lambda x: softmax(x, axis=1), 273 | output_shape=unchanged_shape)(attention) 274 | w_att_2 = Permute((2,1))(Lambda(lambda x: softmax(x, axis=2), 275 | output_shape=unchanged_shape)(attention)) 276 | in1_aligned = Dot(axes=1)([w_att_1, input_1]) 277 | in2_aligned = Dot(axes=1)([w_att_2, input_2]) 278 | return in1_aligned, in2_aligned 279 | 280 | def decomposable_attention(pretrained_embedding='../data/fasttext_matrix.npy', 281 | projection_dim=300, projection_hidden=0, projection_dropout=0.2, 282 | compare_dim=500, compare_dropout=0.2, 283 | dense_dim=300, dense_dropout=0.2, 284 | lr=1e-3, activation='elu', maxlen=MAX_LEN): 285 | # Based on: https://arxiv.org/abs/1606.01933 286 | 287 | q1 = Input(name='q1',shape=(maxlen,)) 288 | q2 = Input(name='q2',shape=(maxlen,)) 289 | 290 | # Embedding 291 | # embedding = create_pretrained_embedding(pretrained_embedding, 292 | # mask_zero=False) 293 | embedding = pretrained_embedding 294 | q1_embed = embedding(q1) 295 | q2_embed = embedding(q2) 296 | 297 | # Projection 298 | projection_layers = [] 299 | if projection_hidden > 0: 300 | projection_layers.extend([ 301 | Dense(projection_hidden, activation=activation), 302 | Dropout(rate=projection_dropout), 
303 | ]) 304 | projection_layers.extend([ 305 | Dense(projection_dim, activation=None), 306 | Dropout(rate=projection_dropout), 307 | ]) 308 | q1_encoded = time_distributed(q1_embed, projection_layers) 309 | q2_encoded = time_distributed(q2_embed, projection_layers) 310 | 311 | # Attention 312 | q1_aligned, q2_aligned = soft_attention_alignment(q1_encoded, q2_encoded) 313 | 314 | # Compare 315 | q1_combined = Concatenate()([q1_encoded, q2_aligned, submult(q1_encoded, q2_aligned)]) 316 | q2_combined = Concatenate()([q2_encoded, q1_aligned, submult(q2_encoded, q1_aligned)]) 317 | compare_layers = [ 318 | Dense(compare_dim, activation=activation), 319 | Dropout(compare_dropout), 320 | Dense(compare_dim, activation=activation), 321 | Dropout(compare_dropout), 322 | ] 323 | q1_compare = time_distributed(q1_combined, compare_layers) 324 | q2_compare = time_distributed(q2_combined, compare_layers) 325 | 326 | # Aggregate 327 | q1_rep = apply_multiple(q1_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()]) 328 | q2_rep = apply_multiple(q2_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()]) 329 | 330 | # Classifier 331 | merged = Concatenate()([q1_rep, q2_rep]) 332 | dense = BatchNormalization()(merged) 333 | dense = Dense(dense_dim, activation=activation)(dense) 334 | dense = Dropout(dense_dropout)(dense) 335 | dense = BatchNormalization()(dense) 336 | dense = Dense(dense_dim, activation=activation)(dense) 337 | dense = Dropout(dense_dropout)(dense) 338 | out_ = Dense(1, activation='sigmoid')(dense) 339 | 340 | model = Model(inputs=[q1, q2], outputs=out_) 341 | return model 342 | 343 | 344 | def esim(pretrained_embedding='../data/fasttext_matrix.npy', 345 | maxlen=MAX_LEN, 346 | lstm_dim=300, 347 | dense_dim=300, 348 | dense_dropout=0.5): 349 | 350 | # Based on arXiv:1609.06038 351 | q1 = Input(name='q1',shape=(maxlen,)) 352 | q2 = Input(name='q2',shape=(maxlen,)) 353 | 354 | # Embedding 355 | # embedding = create_pretrained_embedding(pretrained_embedding, mask_zero=False) 356 | embedding = pretrained_embedding 357 | bn = BatchNormalization(axis=2) 358 | q1_embed = bn(embedding(q1)) 359 | q2_embed = bn(embedding(q2)) 360 | 361 | # Encode 362 | encode = Bidirectional(CuDNNLSTM(lstm_dim, return_sequences=True)) 363 | q1_encoded = encode(q1_embed) 364 | q2_encoded = encode(q2_embed) 365 | 366 | # Attention 367 | q1_aligned, q2_aligned = soft_attention_alignment(q1_encoded, q2_encoded) 368 | 369 | # Compose 370 | q1_combined = Concatenate()([q1_encoded, q2_aligned, submult(q1_encoded, q2_aligned)]) 371 | q2_combined = Concatenate()([q2_encoded, q1_aligned, submult(q2_encoded, q1_aligned)]) 372 | 373 | compose = Bidirectional(CuDNNLSTM(lstm_dim, return_sequences=True)) 374 | q1_compare = compose(q1_combined) 375 | q2_compare = compose(q2_combined) 376 | 377 | # Aggregate 378 | q1_rep = apply_multiple(q1_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()]) 379 | q2_rep = apply_multiple(q2_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()]) 380 | 381 | # Classifier 382 | merged = Concatenate()([q1_rep, q2_rep]) 383 | 384 | dense = BatchNormalization()(merged) 385 | dense = Dense(dense_dim, activation='elu')(dense) 386 | dense = BatchNormalization()(dense) 387 | dense = Dropout(dense_dropout)(dense) 388 | dense = Dense(dense_dim, activation='elu')(dense) 389 | dense = BatchNormalization()(dense) 390 | dense = Dropout(dense_dropout)(dense) 391 | out_ = Dense(1, activation='sigmoid')(dense) 392 | 393 | model = Model(inputs=[q1, q2], outputs=out_) 394 | return model 395 | 396 | def custom_loss(y_true, y_pred): 397 | 
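    # Contrastive-style loss for the siamese similarity score y_pred in (0, 1]:
    #   y_true = 1 (same intent):      penalize 0.25 * (1 - y_pred)^2
    #   y_true = 0 (different intent): penalize K.maximum(y_pred, 0)^2
    # Note: `margin` is declared below but not referenced in the returned expression.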
margin = 1 398 | return K.mean(0.25 * y_true * K.square(1 - y_pred) + 399 | (1 - y_true) * K.square(K.maximum(y_pred, 0))) 400 | 401 | def siamese(pretrained_embedding=None, 402 | input_length=MAX_LEN, 403 | w2v_length=300, 404 | n_hidden=[64, 64, 64]): 405 | #输入层 406 | left_input = Input(shape=(input_length,), dtype='int32') 407 | right_input = Input(shape=(input_length,), dtype='int32') 408 | 409 | #对句子embedding 410 | encoded_left = pretrained_embedding(left_input) 411 | encoded_right = pretrained_embedding(right_input) 412 | 413 | #两个LSTM共享参数 414 | # # v1 一层lstm 415 | # shared_lstm = CuDNNLSTM(n_hidden) 416 | 417 | # # v2 带drop和正则化的多层lstm 418 | ipt = Input(shape=(input_length, w2v_length)) 419 | dropout_rate = 0.5 420 | x = Dropout(dropout_rate, )(ipt) 421 | for i,hidden_length in enumerate(n_hidden): 422 | # x = Bidirectional(CuDNNLSTM(hidden_length, return_sequences=(i!=len(n_hidden)-1), kernel_regularizer=L1L2(l1=0.01, l2=0.01)))(x) 423 | x = Bidirectional(CuDNNLSTM(hidden_length, return_sequences=True, kernel_regularizer=L1L2(l1=0.01, l2=0.01)))(x) 424 | 425 | # v3 卷积网络特征层 426 | x = Conv1D(64, kernel_size = 2, strides = 1, padding = "valid", kernel_initializer = "he_uniform")(x) 427 | x_p1 = GlobalAveragePooling1D()(x) 428 | x_p2 = GlobalMaxPooling1D()(x) 429 | x = Concatenate()([x_p1, x_p2]) 430 | shared_lstm = Model(inputs=ipt, outputs=x) 431 | 432 | left_output = shared_lstm(encoded_left) 433 | right_output = shared_lstm(encoded_right) 434 | 435 | 436 | # 距离函数 exponent_neg_manhattan_distance 437 | malstm_distance = Lambda(lambda x: K.exp(-K.sum(K.abs(x[0] - x[1]), axis=1, keepdims=True)), 438 | output_shape=lambda x: (x[0][0], 1))([left_output, right_output]) 439 | 440 | model = Model([left_input, right_input], [malstm_distance]) 441 | 442 | return model 443 | 444 | class Attention(Layer): 445 | def __init__(self, step_dim, 446 | W_regularizer=None, b_regularizer=None, 447 | W_constraint=None, b_constraint=None, 448 | bias=True, **kwargs): 449 | """ 450 | Keras Layer that implements an Attention mechanism for temporal data. 451 | Supports Masking. 452 | Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756] 453 | # Input shape 454 | 3D tensor with shape: `(samples, steps, features)`. 455 | # Output shape 456 | 2D tensor with shape: `(samples, features)`. 457 | :param kwargs: 458 | Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True. 459 | The feature dimension is inferred from the RNN output shape; step_dim (the number of timesteps) must be passed in explicitly.
460 | Example: 461 | model.add(LSTM(64, return_sequences=True)) 462 | model.add(Attention(step_dim=seq_len)) # seq_len = number of timesteps of the LSTM output 463 | """ 464 | self.supports_masking = True 465 | #self.init = initializations.get('glorot_uniform') 466 | self.init = initializers.get('glorot_uniform') 467 | 468 | self.W_regularizer = regularizers.get(W_regularizer) 469 | self.b_regularizer = regularizers.get(b_regularizer) 470 | 471 | self.W_constraint = constraints.get(W_constraint) 472 | self.b_constraint = constraints.get(b_constraint) 473 | 474 | self.bias = bias 475 | self.step_dim = step_dim 476 | self.features_dim = 0 477 | super(Attention, self).__init__(**kwargs) 478 | 479 | def build(self, input_shape): 480 | assert len(input_shape) == 3 481 | 482 | self.W = self.add_weight(shape=(input_shape[-1],), 483 | initializer=self.init, 484 | name='%s_W'%self.name, 485 | regularizer=self.W_regularizer, 486 | constraint=self.W_constraint) 487 | self.features_dim = input_shape[-1] 488 | 489 | if self.bias: 490 | self.b = self.add_weight(shape=(input_shape[1],), 491 | initializer='zero', 492 | name='%s_b'%self.name, 493 | regularizer=self.b_regularizer, 494 | constraint=self.b_constraint) 495 | else: 496 | self.b = None 497 | 498 | self.built = True 499 | 500 | def compute_mask(self, input, input_mask=None): 501 | # do not pass the mask to the next layers 502 | return None 503 | 504 | def call(self, x, mask=None): 505 | # eij = K.dot(x, self.W) TF backend doesn't support it 506 | 507 | # features_dim = self.W.shape[0] 508 | # step_dim = x._keras_shape[1] 509 | 510 | features_dim = self.features_dim 511 | step_dim = self.step_dim 512 | eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim)) 513 | 514 | if self.bias: 515 | eij += self.b 516 | 517 | eij = K.tanh(eij) 518 | a = K.exp(eij) 519 | # apply mask after the exp.
will be re-normalized next 520 | if mask is not None: 521 | # Cast the mask to floatX to avoid float64 upcasting in theano 522 | a *= K.cast(mask, K.floatx()) 523 | 524 | # in some cases especially in the early stages of training the sum may be almost zero 525 | a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx()) 526 | 527 | a = K.expand_dims(a) 528 | weighted_input = x * a 529 | #print weigthted_input.shape 530 | return K.sum(weighted_input, axis=1) 531 | 532 | def compute_output_shape(self, input_shape): 533 | #return input_shape[0], input_shape[-1] 534 | return input_shape[0], self.features_dim 535 | 536 | 537 | def DSSM(pretrained_embedding, input_length, lstmsize=90): 538 | word_embedding, char_embedding = pretrained_embedding 539 | wordlen, charlen = input_length 540 | 541 | input1 = Input(shape=(wordlen,)) 542 | input2 = Input(shape=(wordlen,)) 543 | lstm0 = CuDNNLSTM(lstmsize,return_sequences = True) 544 | lstm1 = Bidirectional(CuDNNLSTM(lstmsize)) 545 | lstm2 = CuDNNLSTM(lstmsize) 546 | att1 = Attention(wordlen) 547 | den = Dense(64,activation = 'tanh') 548 | 549 | # att1 = Lambda(lambda x: K.max(x,axis = 1)) 550 | 551 | v1 = word_embedding(input1) 552 | v2 = word_embedding(input2) 553 | v11 = lstm1(v1) 554 | v22 = lstm1(v2) 555 | v1ls = lstm2(lstm0(v1)) 556 | v2ls = lstm2(lstm0(v2)) 557 | v1 = Concatenate(axis=1)([att1(v1),v11]) 558 | v2 = Concatenate(axis=1)([att1(v2),v22]) 559 | 560 | input1c = Input(shape=(charlen,)) 561 | input2c = Input(shape=(charlen,)) 562 | lstm1c = Bidirectional(CuDNNLSTM(lstmsize)) 563 | att1c = Attention(charlen) 564 | v1c = char_embedding(input1c) 565 | v2c = char_embedding(input2c) 566 | v11c = lstm1c(v1c) 567 | v22c = lstm1c(v2c) 568 | v1c = Concatenate(axis=1)([att1c(v1c),v11c]) 569 | v2c = Concatenate(axis=1)([att1c(v2c),v22c]) 570 | 571 | 572 | mul = Multiply()([v1,v2]) 573 | sub = Lambda(lambda x: K.abs(x))(Subtract()([v1,v2])) 574 | maximum = Maximum()([Multiply()([v1,v1]),Multiply()([v2,v2])]) 575 | mulc = Multiply()([v1c,v2c]) 576 | subc = Lambda(lambda x: K.abs(x))(Subtract()([v1c,v2c])) 577 | maximumc = Maximum()([Multiply()([v1c,v1c]),Multiply()([v2c,v2c])]) 578 | sub2 = Lambda(lambda x: K.abs(x))(Subtract()([v1ls,v2ls])) 579 | matchlist = Concatenate(axis=1)([mul,sub,mulc,subc,maximum,maximumc,sub2]) 580 | matchlist = Dropout(0.05)(matchlist) 581 | 582 | matchlist = Concatenate(axis=1)([Dense(32,activation = 'relu')(matchlist),Dense(48,activation = 'sigmoid')(matchlist)]) 583 | res = Dense(1, activation = 'sigmoid')(matchlist) 584 | 585 | 586 | model = Model(inputs=[input1, input2, input1c, input2c], outputs=res) 587 | return model 588 | 589 | """ 590 | From the paper: 591 | Averaging Weights Leads to Wider Optima and Better Generalization 592 | Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson 593 | https://arxiv.org/abs/1803.05407 594 | 2018 595 | 596 | Author's implementation: https://github.com/timgaripov/swa 597 | """ 598 | class SWA(Callback): 599 | def __init__(self, model, swa_model, swa_start): 600 | super().__init__() 601 | self.model,self.swa_model,self.swa_start=model,swa_model,swa_start 602 | 603 | def on_train_begin(self, logs=None): 604 | self.epoch = 0 605 | self.swa_n = 0 606 | 607 | def on_epoch_end(self, epoch, logs=None): 608 | if (self.epoch + 1) >= self.swa_start: 609 | self.update_average_model() 610 | self.swa_n += 1 611 | 612 | self.epoch += 1 613 | 614 | def update_average_model(self): 615 | # update running average of parameters 616 | alpha = 
1./(self.swa_n + 1) 617 | for layer,swa_layer in zip(self.model.layers, self.swa_model.layers): 618 | weights = [] 619 | for w1,w2 in zip(swa_layer.get_weights(), layer.get_weights()): 620 | weights.append( (1-alpha)*w1 + alpha*w2) 621 | swa_layer.set_weights(weights) 622 | 623 | class LR_Updater(Callback): 624 | ''' 625 | Abstract class where all Learning Rate updaters inherit from. (e.g., CircularLR) 626 | Calculates and updates new learning rate and momentum at the end of each batch. 627 | Have to be extended. 628 | ''' 629 | def __init__(self, init_lrs): 630 | self.init_lrs = init_lrs 631 | 632 | def on_train_begin(self, logs=None): 633 | self.update_lr() 634 | 635 | def on_batch_end(self, batch, logs=None): 636 | self.update_lr() 637 | 638 | def update_lr(self): 639 | # cur_lrs = K.get_value(self.model.optimizer.lr) 640 | new_lrs = self.calc_lr(self.init_lrs) 641 | K.set_value(self.model.optimizer.lr, new_lrs) 642 | 643 | def calc_lr(self, init_lrs): raise NotImplementedError 644 | 645 | 646 | class CircularLR(LR_Updater): 647 | ''' 648 | A learning rate updater that implements the CircularLearningRate (CLR) scheme. 649 | Learning rate is increased then decreased linearly. 650 | ''' 651 | def __init__(self, init_lrs, nb, div=4, cut_div=8, on_cycle_end=None): 652 | self.nb,self.div,self.cut_div,self.on_cycle_end = nb,div,cut_div,on_cycle_end 653 | super().__init__(init_lrs) 654 | 655 | def on_train_begin(self, logs=None): 656 | self.cycle_iter,self.cycle_count=0,0 657 | super().on_train_begin() 658 | 659 | def calc_lr(self, init_lrs): 660 | cut_pt = self.nb//self.cut_div 661 | if self.cycle_iter>cut_pt: 662 | pct = 1 - (self.cycle_iter - cut_pt)/(self.nb - cut_pt) 663 | else: pct = self.cycle_iter/cut_pt 664 | res = init_lrs * (1 + pct*(self.div-1)) / self.div 665 | self.cycle_iter += 1 666 | if self.cycle_iter==self.nb: 667 | self.cycle_iter = 0 668 | if self.on_cycle_end: self.on_cycle_end(self, self.cycle_count) 669 | self.cycle_count += 1 670 | return res 671 | 672 | class TimerStop(Callback): 673 | """docstring for TimerStop""" 674 | def __init__(self, start_time, total_seconds): 675 | super(TimerStop, self).__init__() 676 | self.start_time = start_time 677 | self.total_seconds = total_seconds 678 | self.epoch_seconds = [] 679 | 680 | def on_epoch_begin(self, epoch, logs=None): 681 | self.epoch_start = time.time() 682 | 683 | def on_epoch_end(self, epoch, logs=None): 684 | self.epoch_seconds.append(time.time() - self.epoch_start) 685 | 686 | mean_epoch_seconds = sum(self.epoch_seconds)/len(self.epoch_seconds) 687 | if time.time() + mean_epoch_seconds > self.start_time + self.total_seconds: 688 | self.model.stop_training = True 689 | 690 | def on_train_end(self, logs=None): 691 | print('timer stopping') 692 | 693 | 694 | def get_model(cfg,model_weights=None): 695 | print("======= CONFIG: ", cfg) 696 | 697 | model_type,dtype,input_length,ebed_type,w2v_length,n_hidden,n_epoch,patience = cfg 698 | embedding = get_embedding_layers(dtype, input_length, w2v_length, with_weight=True) 699 | 700 | if model_type == "esim": 701 | model = esim(pretrained_embedding=embedding, 702 | maxlen=input_length, 703 | lstm_dim=300, 704 | dense_dim=300, 705 | dense_dropout=0.5) 706 | elif model_type == "decom": 707 | model = decomposable_attention(pretrained_embedding=embedding, 708 | projection_dim=300, projection_hidden=0, projection_dropout=0.2, 709 | compare_dim=500, compare_dropout=0.2, 710 | dense_dim=300, dense_dropout=0.2, 711 | lr=1e-3, activation='elu', maxlen=input_length) 712 | elif model_type 
== "siamese": 713 | model = siamese(pretrained_embedding=embedding, input_length=input_length, w2v_length=w2v_length, n_hidden=n_hidden) 714 | elif model_type == "dssm": 715 | model = DSSM(pretrained_embedding=embedding,input_length=input_length, lstmsize=90) 716 | 717 | if model_weights is not None: 718 | model.load_weights(model_weights) 719 | 720 | # keras.utils.plot_model(model, to_file=model_dir+model_type+"_"+dtype+'.png', show_shapes=True, show_layer_names=True, rankdir='TB') 721 | return model 722 | 723 | ##################################################################### 724 | # 评估指标和最佳阈值 725 | ##################################################################### 726 | 727 | def r_f1_thresh(y_pred,y_true,step=1000): 728 | e = np.zeros((len(y_true),2)) 729 | e[:,0] = y_pred.reshape(-1) 730 | e[:,1] = y_true 731 | f = pd.DataFrame(e) 732 | thrs = np.linspace(0,1,step+1) 733 | x = np.array([f1_score(y_pred=f.loc[:,0]>thr, y_true=f.loc[:,1]) for thr in thrs]) 734 | f1_, thresh = max(x),thrs[x.argmax()] 735 | return f.corr()[0][1], f1_, thresh 736 | 737 | ##################################################################### 738 | # 模型训练和保存 739 | ##################################################################### 740 | configs_path = model_dir+"all_configs.json" 741 | def save_config(filepath, cfg): 742 | configs = {} 743 | if os.path.exists(configs_path): configs = json.loads(open(configs_path,"r",encoding="utf8").read()) 744 | configs[filepath] = cfg 745 | open(configs_path,"w",encoding="utf8").write(json.dumps(configs, indent=2, ensure_ascii=False)) 746 | 747 | def train_model(model, swa_model, cfg): 748 | model_type,dtype,input_length,ebed_type,w2v_length,n_hidden,n_epoch,patience = cfg 749 | 750 | data = load_data(dtype, input_length, w2v_length) 751 | train_x, train_y, test_x, test_y = split_data(data) 752 | filepath=model_dir+model_type+"_"+dtype+time.strftime("_%m-%d %H-%M-%S")+".h5" # 每次运行的模型都进行保存,不覆盖之前的结果 753 | checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=0, save_best_only=True,save_weights_only=True, mode='auto') 754 | earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=patience, verbose=0, mode='auto') 755 | reduce_lr = ReduceLROnPlateau(monitor='val_loss', verbose=0, factor=0.5,patience=2, min_lr=1e-6) 756 | swa_cbk = SWA(model, swa_model, swa_start=1) 757 | 758 | init_lrs = 0.001 759 | clr_div,cut_div = 10, 8 760 | batch_num = (train_x[0].shape[0]-1) // train_batch_size + 1 761 | cycle_len = 1 762 | total_iterators = batch_num*cycle_len 763 | print("total iters per cycle(epoch):",total_iterators) 764 | circular_lr = CircularLR(init_lrs, total_iterators, on_cycle_end=None, div=clr_div, cut_div=cut_div) 765 | callbacks = [checkpoint, earlystop, swa_cbk, circular_lr] 766 | callbacks.append(TimerStop(start_time=start_time, total_seconds=7100)) 767 | 768 | def fit(n_epoch=n_epoch): 769 | history = model.fit(x=train_x, y=train_y, 770 | class_weight={0:1/np.mean(train_y),1:1/(1-np.mean(train_y))}, 771 | validation_data=((test_x, test_y)), 772 | batch_size=train_batch_size, 773 | callbacks=callbacks, 774 | epochs=n_epoch,verbose=2) 775 | return history 776 | 777 | loss,metrics = 'binary_crossentropy',['binary_crossentropy',"accuracy"] 778 | 779 | model.compile(optimizer=Adam(lr=init_lrs, beta_1=0.8), loss=loss, metrics=metrics) 780 | fit() 781 | 782 | filepath_swa = model_dir + filepath.split("/")[-1].split(".")[0]+"-swa.h5" 783 | swa_cbk.swa_model.save_weights(filepath_swa) 784 | 785 | # 保存配置,方便多模型集成 786 | save_config(filepath, cfg) 787 | 
save_config(filepath_swa, cfg) 788 | 789 | def train_all_models(index): 790 | cfg = cfgs[index] 791 | K.clear_session() 792 | model = get_model(cfg,None) 793 | swa_model = get_model(cfg,None) 794 | train_model(model, swa_model, cfg) 795 | 796 | 797 | ##################################################################### 798 | # 模型评估、模型融合、模型测试 799 | ##################################################################### 800 | 801 | evaluate_path = model_dir + "y_pred.pkl" 802 | def evaluate_models(): 803 | train_y_preds, test_y_preds = [], [] 804 | all_cfgs = json.loads(open(configs_path,'r',encoding="utf8").read()) 805 | num_clfs = len(all_cfgs) 806 | 807 | for weight, cfg in all_cfgs.items(): 808 | K.clear_session() 809 | model_type,dtype,input_length,ebed_type,w2v_length,n_hidden,n_epoch,patience = cfg 810 | data = load_data(dtype, input_length, w2v_length) 811 | train_x, train_y, test_x, test_y = split_data(data) 812 | model = get_model(cfg,weight) 813 | train_y_preds.append(model.predict(train_x, batch_size=test_batch_size).reshape(-1)) 814 | test_y_preds.append(model.predict(test_x, batch_size=test_batch_size).reshape(-1)) 815 | 816 | train_y_preds,test_y_preds = np.array(train_y_preds),np.array(test_y_preds) 817 | pd.to_pickle([train_y_preds,train_y,test_y_preds,test_y],evaluate_path) 818 | 819 | 820 | blending_path = model_dir + "blending_gdbm.pkl" 821 | def train_blending(): 822 | """ 根据配置文件和验证集的值计算融合模型 """ 823 | train_y_preds,train_y,valid_y_preds,valid_y = pd.read_pickle(evaluate_path) 824 | train_y_preds = train_y_preds.T 825 | valid_y_preds = valid_y_preds.T 826 | 827 | '''融合使用的模型''' 828 | clf = LogisticRegression() 829 | clf.fit(valid_y_preds, valid_y) 830 | 831 | train_y_preds_blend = clf.predict_proba(train_y_preds)[:,1] 832 | r,f1,train_thresh = r_f1_thresh(train_y_preds_blend, train_y) 833 | 834 | valid_y_preds_blend = clf.predict_proba(valid_y_preds)[:,1] 835 | r,f1,valid_thresh = r_f1_thresh(valid_y_preds_blend, valid_y) 836 | pd.to_pickle(((train_thresh+valid_thresh)/2,clf), blending_path) 837 | 838 | 839 | def result(): 840 | global df1 841 | all_cfgs = json.loads(open(configs_path,'r',encoding="utf8").read()) 842 | num_clfs = len(all_cfgs) 843 | test_y_preds = [] 844 | X = {} 845 | for cfg in all_cfgs.values(): 846 | model_type,dtype,input_length,ebed_type,w2v_length,n_hidden,n_epoch,patience = cfg 847 | key_ = f"{dtype}_{input_length}" 848 | if key_ not in X: X[key_] = input_data(df1["sent1"],df1["sent2"], dtype = dtype, input_length=input_length) 849 | 850 | for weight, cfg in all_cfgs.items(): 851 | K.clear_session() 852 | model_type,dtype,input_length,ebed_type,w2v_length,n_hidden,n_epoch,patience = cfg 853 | key_ = f"{dtype}_{input_length}" 854 | model = get_model(cfg, weight) 855 | test_y_preds.append(model.predict(X[key_], batch_size=test_batch_size).reshape(-1)) 856 | 857 | test_y_preds = np.array(test_y_preds).T 858 | thresh,clf = pd.read_pickle(blending_path) 859 | result = clf.predict_proba(test_y_preds)[:,1].reshape(-1)>thresh 860 | 861 | df_output = pd.concat([df1["id"],pd.Series(result,name="label",dtype=np.int32)],axis=1) 862 | 863 | topai(1,df_output) 864 | 865 | 866 | 867 | 868 | # 文档第二步,训练多个不同的模型,index取值为0-6 869 | if False: 870 | train_all_models(index=0) 871 | 872 | # 文档第三步,训练blending模型 873 | if False: 874 | evaluate_models() 875 | train_blending() 876 | 877 | # 文档第四步,测试blending模型 878 | if False: 879 | result() 880 | --------------------------------------------------------------------------------
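
上面 `soft_attention_alignment` 是 Decomposable Attention 和 ESIM 共用的软对齐步骤:先计算两个句子编码的相关性矩阵,再分别沿两个方向做 softmax,把一个句子的信息加权汇总到另一个句子的每个位置上。下面用 numpy 给出这一计算的最小示意(假设性示例:省略 batch 维度,`len1`、`len2`、`d` 等变量名仅作演示):

```python
# 最小示意(假设性示例):省略 batch 维度,len1/len2/d 等变量仅作演示。
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

len1, len2, d = 4, 5, 8
q1 = np.random.randn(len1, d)        # 句子1 的编码序列
q2 = np.random.randn(len2, d)        # 句子2 的编码序列

attention = q1 @ q2.T                # (len1, len2) 相关性矩阵
q1_aligned_to_q2 = softmax(attention, axis=0).T @ q1   # (len2, d):为句子2的每个位置加权汇总句子1
q2_aligned_to_q1 = softmax(attention, axis=1) @ q2     # (len1, d):为句子1的每个位置加权汇总句子2
print(q1_aligned_to_q2.shape, q2_aligned_to_q1.shape)
```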
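
`SWA` 回调的核心只是对权重做运行平均:第 n 次更新时取 alpha = 1/(n+1),结果始终等于已参与平均的各次权重的算术平均。下面是一个脱离 Keras 的最小示意(假设性示例,`swa_update`、`epoch_weights` 均为演示用名称):

```python
# 最小示意(假设性示例):swa_update、epoch_weights 为演示用名称,
# 更新式与 SWA 回调中 (1-alpha)*w1 + alpha*w2, alpha = 1/(swa_n+1) 等价。
import numpy as np

def swa_update(swa_weights, new_weights, n_averaged):
    alpha = 1.0 / (n_averaged + 1)
    return [(1 - alpha) * w_swa + alpha * w_new
            for w_swa, w_new in zip(swa_weights, new_weights)]

epoch_weights = [[np.array([1.0])], [np.array([2.0])], [np.array([6.0])]]  # 模拟三个 epoch 结束时的权重
swa_w, n = epoch_weights[0], 1       # 第一个 epoch 的权重直接作为初值
for w in epoch_weights[1:]:
    swa_w = swa_update(swa_w, w, n)
    n += 1
print(swa_w[0])                      # [3.],即 (1 + 2 + 6) / 3
```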
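
`CircularLR.calc_lr` 产生的是一条三角形学习率曲线:前 `nb//cut_div` 个 iteration 从 `init_lrs/div` 线性升到 `init_lrs`,其余 iteration 再线性降回去。下面用相同公式单独算出这条曲线,便于理解(假设性示例,`clr_schedule` 为演示用函数名,参数取 `train_model` 中的 `clr_div=10, cut_div=8`):

```python
# 最小示意(假设性示例):clr_schedule 为演示用函数名,公式与 CircularLR.calc_lr 相同。
import numpy as np

def clr_schedule(init_lr, nb, div=10, cut_div=8):
    cut_pt = nb // cut_div
    lrs = []
    for it in range(nb):
        if it > cut_pt:
            pct = 1 - (it - cut_pt) / (nb - cut_pt)   # 回落段
        else:
            pct = it / cut_pt                         # 升温段
        lrs.append(init_lr * (1 + pct * (div - 1)) / div)
    return np.array(lrs)

lrs = clr_schedule(init_lr=0.001, nb=1000, div=10, cut_div=8)
print(lrs.min(), lrs.max())   # 学习率在约 init_lr/div 与 init_lr 之间先升后降
```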
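
`r_f1_thresh` 的作用是在 [0,1] 上等距扫描阈值,返回预测与标签的相关系数、最优 f1 及对应阈值。下面是阈值扫描部分的最小示意(假设性示例,标签与预测均为随机构造):

```python
# 最小示意(假设性示例):标签与预测为随机构造,只演示阈值扫描部分。
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=1000)                          # 模拟标签
y_pred = np.clip(y_true * 0.3 + rng.rand(1000) * 0.7, 0, 1)    # 模拟模型输出概率

thrs = np.linspace(0, 1, 1001)
f1s = np.array([f1_score(y_true, y_pred > t) for t in thrs])
best_f1, best_thr = f1s.max(), thrs[f1s.argmax()]
corr = np.corrcoef(y_pred, y_true)[0, 1]                       # 对应 r_f1_thresh 返回的相关系数
print(corr, best_f1, best_thr)
```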
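
`train_blending` 与 `result` 的融合方式是:把每个基模型的预测概率作为一列特征,用 LogisticRegression 在验证集预测上做二次学习,再用扫描得到的阈值把融合后的概率二值化。下面是这一流程的最小示意(假设性示例,数据为随机构造,`n_models` 等变量仅作演示):

```python
# 最小示意(假设性示例):数据为随机构造,只演示 blending 的整体流程。
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.RandomState(42)
n_models, n_valid, n_test = 7, 2000, 500
valid_y = rng.randint(0, 2, n_valid)
# 每列是一个基模型在验证集 / 测试集上的预测概率
valid_preds = np.clip(valid_y[:, None] * 0.4 + rng.rand(n_valid, n_models) * 0.6, 0, 1)
test_preds = rng.rand(n_test, n_models)

clf = LogisticRegression()           # 与 train_blending 相同的二级模型
clf.fit(valid_preds, valid_y)

blend_valid = clf.predict_proba(valid_preds)[:, 1]
thrs = np.linspace(0, 1, 1001)
thresh = thrs[np.argmax([f1_score(valid_y, blend_valid > t) for t in thrs])]

labels = (clf.predict_proba(test_preds)[:, 1] > thresh).astype(np.int32)
print(thresh, labels[:10])
```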